One of the things I learned the hard way about K8s is that it is primarily designed to keep microservices running. If you have a long-running batch processing job, you will have a bad time. I joined the team after much of the AI Cloud infrastructure was already in place, and I was very new to K8s, so I had a steep learning curve ahead of me.
While I was learning the platform I started to encounter unexplained failure modes in some of the system's data pipelines. It was rare, but it was happening. The symptom was that a batch job meant to run exactly once appeared to start twice but complete only once. This was a real head-scratcher. If the job was idempotent it wasn't a big deal, but if it wasn't, the duplicate run corrupted the data.
It took a long time to figure out that the GKE cluster autoscaler was triggering the removal of the node hosting the batch job's pod. The autoscaler was doing its job, removing underutilized nodes, but the batch job pods were not ready to be stopped. A batch processing system needs its jobs to run to completion without interruption, whereas a microservice can be stopped and restarted at any time on any node without losing data. In fact, that is the whole point of microservices: they can scale up and down freely precisely because no single instance holds state that can be lost.
So, how do you run a batch processing job on K8s? The trick is adding a toleration to the pod spec so that the pod is not automatically shut down and rescheduled when the node carries the NoSchedule taint. This lets the pod keep running and prevents the autoscaler from removing the node until the job has finished.
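As a minimal sketch, here is roughly what that Job spec can look like. The job name and image are placeholders, and the toleration shown targets `ToBeDeletedByClusterAutoscaler`, the taint the cluster autoscaler places on a node it intends to drain. The `safe-to-evict` annotation is the knob the cluster autoscaler documentation exposes for exactly this situation, so it is included as well:

```yaml
# Sketch only: "etl-job" and the image are hypothetical placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job
spec:
  template:
    metadata:
      annotations:
        # Tells the cluster autoscaler not to scale down a node
        # while a pod carrying this annotation is running on it.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      tolerations:
        # Tolerate the taint the autoscaler applies to a node it
        # has marked for removal.
        - key: "ToBeDeletedByClusterAutoscaler"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: example/batch-worker:latest
```

With this in place, the node hosting the batch pod is no longer a scale-down candidate until the job completes on its own.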
Simple, but not obvious.