Issue Deploying 2.16.0 Helm Chart to AWS EKS

AG_AOSM · 20 October 2023 19:59

I’m running into an issue while trying to deploy a new instance of Speckle Server (2.16.0 with frontend 1) into an EKS cluster on AWS. I’ve deployed speckle-server successfully on AWS a few times, but wanted to start with a fresh cluster with this release.

The /graphql endpoint doesn’t seem to be responding. Any requests to /graphql are timing out, including the liveness/readiness probes on the pod. This means that the speckle-server pods, while running, aren’t Ready. I’ve tried to launch older releases on a fresh cluster and I’ve been running into the same problem. This makes me think it’s my cluster configuration, but I’m not sure how it could be causing this error since the probe requests are made from the speckle-server pod to itself and wouldn’t be impacted by any security group settings or ingress/dns/certificate issues.

Additional context:

I’ve deployed a nearly identical configuration locally and it works without any issues. The only real differences in the local deployment are that it doesn’t require TLS or PGSSL and that it uses letsencrypt-staging.

My Postgres and Redis resources are in the same VPC and share the VPC’s default security group. I’ve also tried adding them to the node group’s security groups in case the connection between Postgres and the server was timing out.

I’ve tried toggling the networkPolicy and serviceAccount properties for all of the containers just in case that’s related.

The landing page loads to the “Interoperability in seconds” screen without a login and throws the errors you’d expect:

WebSocket connection to 'wss://speckle.<my_domain>.app/graphql' failed
POST https://speck.<my_domain>.app/graphql 503 (Service Unavailable)

All other pods are Running except speckle-webhook-service and speckle-preview-service, which are stuck in Pending.

Here’s the specific error in the speckle-server pod’s log:

{"level":"info","time":"2023-10-19T19:05:55.474Z","req":{"id":"36601963-d742-4c3f-b1a1-ba8757fcdb44","method":"GET","path":"/graphql","headers":{"content-type":"application/json","host":"127.0.0.1:3000","connection":"close","x-request-id":"36601963-d742-4c3f-b1a1-ba8757fcdb44"}},"res":{"statusCode":null,"headers":{"x-request-id":"36601963-d742-4c3f-b1a1-ba8757fcdb44","access-control-allow-origin":"*"}},"responseTime":9704,"msg":"request aborted"}

Here’s the rest of the startup log in speckle-server, which looks completely normal to me:

{"level":"info","time":"2023-10-19T19:05:39.828Z","phase":"db-startup","msg":"Loaded knex conf for production"}
{"level":"info","time":"2023-10-19T19:05:42.739Z","component":"modules","msg":"💥 Init core module"}
{"level":"info","time":"2023-10-19T19:05:42.909Z","component":"modules","msg":"🔔 Initializing postgres notification listening..."}
{"level":"info","time":"2023-10-19T19:05:42.910Z","component":"modules","msg":"🔑 Init auth module"}
{"level":"info","time":"2023-10-19T19:05:42.972Z","component":"modules","msg":"💅 Init graphql api explorer module"}
{"level":"info","time":"2023-10-19T19:05:42.972Z","component":"modules","msg":"📧 Init emails module"}
{"level":"info","time":"2023-10-19T19:05:43.742Z","component":"modules","msg":"♻️ Init pwd reset module"}
{"level":"info","time":"2023-10-19T19:05:43.742Z","component":"modules","msg":"💌 Init invites module"}
{"level":"info","time":"2023-10-19T19:05:43.746Z","component":"modules","msg":"📸 Init object preview module"}
{"level":"info","time":"2023-10-19T19:05:43.747Z","component":"modules","msg":"📄 Init FileUploads module"}
{"level":"info","time":"2023-10-19T19:05:43.747Z","component":"modules","msg":"🗣 Init comments module"}
{"level":"info","time":"2023-10-19T19:05:43.747Z","component":"modules","msg":"📦 Init BlobStorage module"}
{"level":"info","time":"2023-10-19T19:05:43.818Z","component":"modules","msg":"📞 Init notifications module"}
{"level":"info","time":"2023-10-19T19:05:43.818Z","component":"modules","msg":"📞 Initializing notification queue consumption..."}
{"level":"info","time":"2023-10-19T19:05:43.833Z","component":"modules","msg":"🤺 Init activity module"}
{"level":"info","time":"2023-10-19T19:05:43.837Z","component":"modules","msg":"🔐 Init access request module"}
{"level":"info","time":"2023-10-19T19:05:43.837Z","component":"modules","msg":"🎣 Init webhooks module"}
{"level":"info","time":"2023-10-19T19:05:43.837Z","component":"modules","msg":"🔄️ Init cross-server-sync module"}
{"level":"info","time":"2023-10-19T19:05:43.837Z","component":"modules","msg":"🤖 Init automations module"}
{"level":"info","time":"2023-10-19T19:05:43.971Z","component":"cross-server-sync","msg":"⬇️  Ensuring base onboarding stream asynchronously..."}
{"level":"info","time":"2023-10-19T19:05:43.971Z","component":"cross-server-sync","msg":"Ensuring onboarding project is present..."}
{"level":"info","time":"2023-10-19T19:05:45.350Z","phase":"startup","msg":"🚀 My name is Speckle Server, and I'm running at 0.0.0.0:3000"}

iainsproat · 20 October 2023 21:18

Hi @AG_AOSM - welcome to Speckle’s community. If you haven’t already, feel free to introduce yourself to the community. We’d love to know more about what you are using Speckle for and where you are using it - it really helps us understand how we can guide Speckle’s future to better help the community. Hopefully we can debug your issues and get you going.

From the logs you posted, the speckle-server Pod has successfully connected to Postgres and Redis. I believe we can ignore those.

The 503 (Service Unavailable) message you see in the frontend is often returned by a load balancer when it is unable to connect to the server pods. This may be because they are reporting unready, so the load balancer has no ready pods to connect to. It seems as if the problem is with ingress to speckle-server.

To begin with, I would suggest disabling all network policy rules. This should help eliminate that as the source of your issues.

Are there errors being shown when you kubectl describe the speckle-server pods? Kubernetes should be logging the errors with the liveness & readiness probes.

Have you enabled the Helm test, and what are the logs that it shows?

It’s been a few years since I deployed to EKS, but hopefully there are others in the community that are more familiar and can help out with those specifics.

Iain

AG_AOSM · 24 October 2023 16:19

Hi Iain, thanks for responding! I’ve been trying to reconfigure but still no luck.

All network policy rules are disabled.

When I describe speckle-server, it reports that the readiness probe failed, but no other errors:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m                   default-scheduler  Successfully assigned speckle/speckle-server-58f46cdd8f-jlsdf to ip-10-0-4-158.ec2.internal
  Normal   Pulled     3m59s                kubelet            Container image "speckle/speckle-server:2" already present on machine
  Normal   Created    3m59s                kubelet            Created container main
  Normal   Started    3m59s                kubelet            Started container main
  Warning  Unhealthy  3s (x22 over 3m39s)  kubelet            Readiness probe failed: command "node -e try { require('node:http').request({headers: {'Content-Type': 'application/json'}, port:3000, hostname:'127.0.0.1', path:'/graphql?query={serverInfo{version}}', method: 'GET', timeout: 2000 }, (res) => { body = ''; res.on('data', (chunk) => {body += chunk;}); res.on('end', () => {process.exit(res.statusCode != 200 || body.toLowerCase().includes('error'));}); }).end(); } catch { process.exit(1); }" timed out

When I run helm test speckle-server, I get job failed: BackoffLimitExceeded and the pods have the graphql error you’d expect given the results of the readiness probe:

Using Speckle server 'https://<my_domain>'
Frontend accessible
Traceback (most recent call last):
  File "/app/./run_tests.py", line 61, in <module>
    assert isinstance(server_info, ServerInfo), "GraphQL ServerInfo query error"
AssertionError: GraphQL ServerInfo query error

I’ll look into the ingress configuration a little more, but so far (having looked at route tables and security groups) there’s no reason the ingress wouldn’t be able to reach the nodes or the server pod.

iainsproat · 24 October 2023 16:29

The readiness probe is effectively run from within the container and is attempting to reach localhost, i.e. 127.0.0.1 in the above command. This may be failing because your Kubernetes cluster is using IPv6 instead of IPv4. I’m not too familiar with EKS, but it seems to be a setting that is chosen when the cluster is created. Would you be able to try configuring it to use IPv4?
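
For readers following along, here is a readable sketch of what the one-line probe command in the kubectl output above appears to do: send a GET for the serverInfo query to 127.0.0.1:3000 and treat anything other than a 200 response without "error" in the body as a failure. The function names `isHealthy` and `probe` are made up for this illustration and are not part of the Speckle chart. One detail worth knowing is that the `timeout` option of Node's `http.request` only emits a `timeout` event and does not abort the request by itself, so this sketch destroys the request explicitly. That may be why the kubelet reports the original command as timed out rather than failed.

```javascript
const http = require('node:http');

// Mirrors the probe's exit condition: healthy only on HTTP 200 with no "error" in the body.
function isHealthy(statusCode, body) {
  return statusCode === 200 && !body.toLowerCase().includes('error');
}

// Hypothetical helper: resolves to true or false and never rejects.
function probe({ hostname = '127.0.0.1', port = 3000, timeoutMs = 2000 } = {}) {
  return new Promise((resolve) => {
    const req = http.request(
      {
        hostname,
        port,
        path: '/graphql?query={serverInfo{version}}',
        method: 'GET',
        headers: { 'Content-Type': 'application/json' },
        timeout: timeoutMs, // socket idle timeout; only triggers the 'timeout' event
      },
      (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve(isHealthy(res.statusCode, body)));
      }
    );
    // Abort explicitly on timeout, otherwise the request stays open.
    req.on('timeout', () => { req.destroy(); resolve(false); });
    // Connection refused, reset, or the destroy() above all end up here.
    req.on('error', () => resolve(false));
    req.end();
  });
}
```

Running `probe().then((ok) => process.exit(ok ? 0 : 1))` inside the container (for example through kubectl exec) would give an explicit pass or fail within about two seconds of inactivity instead of waiting for the kubelet to kill the command.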

Iain

AG_AOSM · 26 October 2023 02:40

I set up the cluster, and the VPC, to only use IPv4 (not “dual stack” or IPv6), so I don’t think that’s it.

For what it’s worth, I deployed a dummy pod containing a simple apollo/express server with a graphql endpoint. It has the same base image as speckle-server and the same deployment configuration, including the probes, and it runs without any issue. I can even send requests to it from the speckle-server pod. So there’s no issue with comms between pods or with the image/runtime.

I’m 99% sure it’s an issue with one of the dependencies, and I think I’ve found which one. I’m going to do a complete rebuild now to verify, and I will follow up with the solution for posterity once I confirm.

AG_AOSM · 26 October 2023 18:44

The issue had to do with encryption in ElastiCache (Redis). I originally had the “Transit encryption mode” set to Required. After switching it to Preferred, the error no longer appears and the deployment goes smoothly. I expected the server to throw an error while initializing the modules or configuring the Redis pubsub if there was a connection problem, but instead it just timed out whenever a graphql query was dispatched.
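
For anyone who wants to check this on their own cluster, the sketch below sends a Redis PING over either a plain TCP socket or a TLS socket and fails fast with an error instead of hanging. It uses only Node's built-in `net` and `tls` modules. The helper names `encodeCommand` and `pingRedis` are invented for this example, and the host you pass in is whatever your ElastiCache endpoint is. As far as I understand the two modes, Required accepts only TLS connections while Preferred accepts both, so comparing the two runs should show which transport the endpoint accepts.

```javascript
const net = require('node:net');
const tls = require('node:tls');

// Encode a command in the Redis wire format (an array of bulk strings).
function encodeCommand(args) {
  return `*${args.length}\r\n` +
    args.map((a) => `$${Buffer.byteLength(a)}\r\n${a}\r\n`).join('');
}

// Hypothetical helper: send PING and resolve with the first reply line,
// or reject on a socket error, an early close, or a timeout.
// `host` is assumed to be a DNS name (not an IP), since it is also used for SNI.
function pingRedis({ host, port = 6379, useTls = false, timeoutMs = 3000 }) {
  return new Promise((resolve, reject) => {
    const socket = useTls
      ? tls.connect({ host, port, servername: host })
      : net.connect({ host, port });
    // Turn silence into an explicit error rather than an indefinite wait.
    socket.setTimeout(timeoutMs, () =>
      socket.destroy(new Error(`no reply within ${timeoutMs} ms`)));
    socket.once(useTls ? 'secureConnect' : 'connect', () =>
      socket.write(encodeCommand(['PING'])));
    socket.once('data', (chunk) => {
      resolve(chunk.toString().split('\r\n')[0]);
      socket.end();
    });
    socket.on('error', reject);
    // If the server hangs up without answering, report that too.
    socket.once('close', () => reject(new Error('connection closed without a reply')));
  });
}
```

Calling `pingRedis({ host, useTls: false })` and `pingRedis({ host, useTls: true })` from a pod in the cluster and logging both outcomes should make the difference visible. A reply of `+PONG`, or a `-NOAUTH` error when authentication is enabled, means the transport itself works, while a rejection points at the encryption setting or at network access.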

I’ve written a step-by-step deployment guide for launching Speckle in AWS for my company to use. I’d like to post it if you think it’d be helpful for others.

Thanks @iainsproat !

iainsproat · 26 October 2023 19:12

Thank you for posting the resolution to this problem. We’d be very happy to read a step-by-step guide to Speckle on AWS - as I’m sure many others will be too! Feel free to post it in a new discussion on this forum if you wish.

As for the root cause of your issue, Speckle should be able to do better in providing an error message, and ideally it would also support the Required mode for encrypted TLS connections. We’ll add these suggested improvements to the backlog, or please make a contribution to our code if you feel up for the challenge!

Iain

AG_AOSM · 27 October 2023 14:11

Perfect, I’ll post the guide.

Also, I’d love to contribute that feature. I’ll start on it next week.