Recommendations/guidelines for cluster resource management

Hi Speckle Team/Community!

I put together a simple proof of concept for running Speckle on AKS that the community may find useful:

Curious if anyone has, or can point me to, any recommendations/guidelines for cluster resource management? Really just looking for basic rules of thumb/heuristics at this point to start thinking about things like:

  • per-node memory/compute requirements for a typical range of workloads/use cases
  • node/replica strategies for variable/spiky loads.
  • etc.

Cheers


Hey @parkerjspe

Nice job getting things up and running on AKS! We’re not the most prominent Azure users, but I know it can be a pain to deal with sometimes :slight_smile:

Around resource requirements: the default CPU and memory requests and limits in the Helm chart generously over-provision resources for the basic use cases, so that we provide a smooth out-of-the-box experience. To give you an example, we’re using the defaults for our internal staging server, which only breaks a sweat when we intentionally push it to the limit. Otherwise the team and the beta tester group use it happily for our daily workflow.
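If you do decide to trim the defaults later, overriding them is just a Helm values change. Here’s a minimal sketch; the key paths (`server.requests`, `server.limits`) and the numbers are illustrative assumptions on my part, not the chart’s actual defaults, so check the chart’s `values.yaml` for the real structure:

```yaml
# values.override.yaml -- sketch only; key paths and numbers are illustrative
# assumptions, verify against the Speckle helm chart's values.yaml.
server:
  requests:
    cpu: 500m      # CPU reserved for the server pod at scheduling time
    memory: 1Gi    # memory reserved for the server pod
  limits:
    cpu: "1"       # hard CPU ceiling before throttling kicks in
    memory: 2Gi    # hard memory ceiling before the pod gets OOM-killed
```

Applied with something along the lines of `helm upgrade <release> <chart> -f values.override.yaml` against your existing release.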

If you want to take things a step further, each service exposes a set of Prometheus metrics that, paired with cluster metrics from something like the kube-prometheus-stack, can give you insight into your instance’s load.
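As an example, with kube-prometheus-stack installed you can point its Prometheus at those endpoints with a ServiceMonitor. This is just a sketch: the namespace, selector label and port name are assumptions, so match them to whatever labels your Speckle release actually puts on its services:

```yaml
# Sketch of a ServiceMonitor scraping the Speckle services' metrics endpoints.
# Namespace, matchLabels and port name are assumptions -- adjust to your release.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: speckle-services
  namespace: speckle
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: speckle-server   # assumed label on the services
  endpoints:
    - port: metrics     # assumed name of the port serving /metrics
      interval: 30s
```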

The only spiky load scenario that you have to be aware of is heavy receive operations. If you anticipate massive model reads, I recommend scaling the server backend to 2 instances.
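As a sketch, that can be a one-line values override (again, the exact key is an assumption on my side; the chart may call it `server.replicas` or something similar):

```yaml
# Sketch: run two server backend replicas to absorb heavy receive traffic.
# The key name is assumed -- check the chart's values.yaml.
server:
  replicas: 2
```

If the load is very spiky, a HorizontalPodAutoscaler targeting the server deployment is another option, provided your cluster has metrics-server running.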

Hope this helps :slight_smile:


Thanks @gjedlicska, this is very helpful. Apart from hitting resource limits (which it sounds like you aren’t), have you observed any instability during load tests? Are you targeting specific types of loads?

Instability, as in unexpected, unexplainable crashes, is something we haven’t noticed.
If we were to notice any sort of instability, it would instantly become a P0 task.

We did run into a few performance regressions and some bug-induced crashes on the server.
But we monitor our internal dev environment the same way we do our prod environments, so these are usually ironed out while we’re dogfooding every pre-release.

Naturally we’re not perfect, and things can slip. If you notice anything, please shout loud and quick.

One thing we were aware of as a performance issue: multiple concurrent big receive operations would clog the database connection pool, and pending database queries would pile up. This resulted in some noticeable performance degradation on the frontend side. With the latest release, 2.6.3, we pushed some updates to address this:

  • the server database connection pool limits can now be customized via the Helm chart / env variables (see the sketch after this list)
  • we tuned some connection parameters to enable better connection cleanup
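As a rough sketch of what that override could look like (the env variable name and key path below are hypothetical placeholders; the 2.6.3 release notes and the chart’s values.yaml list the actual names and defaults):

```yaml
# Sketch only: the variable name and key path are hypothetical placeholders,
# check the 2.6.3 chart values.yaml / release notes for the real ones.
server:
  env:
    - name: POSTGRES_MAX_CONNECTIONS   # placeholder for the pool size setting
      value: "4"                       # illustrative connections per server instance
```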

But even with the old config we haven’t noticed crashes on the server. In the last release cycle, none of the k8s pods restarted due to a crash, and even if they had, we’re running enough redundancy that it wouldn’t cause user-facing service instability.
