Running a Migration Making 800k API Calls Without Crashing Prod
Executing long-running scripts in production can be a time-consuming, inelegant process. At Drop, we previously kept a repository of scripts that we ran by manually accessing the Kubernetes cluster through the terminal.
This was tedious, but it worked alright for a while, until we found ourselves with an issue that was near-impossible to solve this way:
The main server ingests many requests per second, with 2/3 of them coming in from our third-party APIs, potentially causing latency spikes among other performance issues. These numbers would only continue to grow as Drop gained more users.
Currently, we have third-party APIs sending requests to the main server, which we then forward to the secondary server, which handles all of the request processing. An obvious improvement would be to set the third-party APIs to send requests directly to the secondary server, reducing the request load on the main server by 67%.
The problem is that the webhook endpoint our third-party APIs send requests to is configured on a per-user basis. In other words, we need to update the webhook endpoint one user at a time, making a massive number of requests to our third-party APIs.
If we used our usual method to perform this task, it would take 10 days running 24/7 to complete. As a result, we decided to switch to the Maintenance Tasks gem to streamline the process.
What Are Maintenance Tasks?
Shopify’s Maintenance Tasks gem provides a Rails engine for executing long-running tasks, with many extraordinarily useful features. It also provides a nice dashboard to manage and monitor the progress and status of tasks.
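To give a feel for the gem's shape, here is a minimal sketch of a task. The `MaintenanceTasks::Task` stub at the top only exists so the snippet runs outside a Rails app; in a real app the base class comes from the `maintenance_tasks` gem. The task name, collection, and API call are hypothetical stand-ins, not our actual migration code.

```ruby
# Stand-in for the gem's base class, so this sketch runs outside Rails.
module MaintenanceTasks
  class Task; end
end

module Maintenance
  # Hypothetical task name for illustration.
  class UpdateWebhookEndpointsTask < MaintenanceTasks::Task
    # The gem calls #collection once, then #process for each element,
    # checkpointing its position so a run can pause and resume.
    def collection
      (1..5).to_a # stand-in for something like User.where(webhook: :legacy)
    end

    def process(user_id)
      # One third-party API call per user would go here, e.g.:
      # VendorClient.update_webhook(user_id, NEW_ENDPOINT)
      user_id
    end
  end
end

task = Maintenance::UpdateWebhookEndpointsTask.new
task.collection.each { |id| task.process(id) }
```

The key design point is that the gem, not your script, owns the iteration loop, which is what makes the checkpointing and dashboard features below possible.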
Persistence Across Deployments
Using the previous method here, manually accessing the Kubernetes cluster and running the script, would have blocked us from deploying any new back-end code to production for 10 full days, undermining our CI/CD efforts and drastically slowing development at Drop.
By using the Maintenance Tasks gem, we were able to keep our tasks running across deployments without adding any additional logic. The following task ran for about 22 hours (during which we probably deployed to production quite a few times), but tasks can theoretically run indefinitely.
Working With CSV Files
With our previous method of running scripts, if we wanted a task to loop over a CSV file, we would have to add the CSV file to the code repository itself (and merge and deploy it to QA, canary, then production Kubernetes clusters every single time the CSV file changed).
The Maintenance Tasks gem allowed us to upload a CSV file in the web dashboard and have the task run for each item in the file. With the vast majority of our current scripts using CSV files, this will save a lot of time in the long run.
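A CSV-driven task looks roughly like the sketch below. The stub stands in for the gem's base class and its `csv_collection` macro; with the real gem, the file uploaded in the dashboard is parsed and each row is handed to `process`. The column names here are hypothetical.

```ruby
require "csv"

# Stand-ins so this sketch runs outside Rails; the real base class and
# csv_collection macro come from the maintenance_tasks gem.
module MaintenanceTasks
  class Task
    def self.csv_collection; end # no-op stand-in for the gem's macro
  end
end

module Maintenance
  class UpdateWebhookEndpointsFromCsvTask < MaintenanceTasks::Task
    csv_collection # tells the gem to iterate over the uploaded CSV's rows

    def process(row)
      # Each row behaves like a hash keyed by the CSV header.
      "updating webhook for user #{row['user_id']} to #{row['endpoint']}"
    end
  end
end

# Simulating what the gem does with an uploaded file:
csv = CSV.parse("user_id,endpoint\n42,https://example.com/hooks\n", headers: true)
task = Maintenance::UpdateWebhookEndpointsFromCsvTask.new
results = csv.map { |row| task.process(row) }
```

Because the file lives with the run rather than in the repository, changing the input is a dashboard upload instead of a merge-and-deploy cycle.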
Customizing Runtime Variables
Before we added error handling to our tasks, they would often crash every few hours. The Maintenance Tasks gem allows any user of the web dashboard to pass in custom arguments for each run of a given task. We added the arguments processing_window_start and processing_window_end to define a processing window, which is especially useful for testing the task on a small subset of items first, or for telling the task to continue processing from where it left off.
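Here is a sketch of how window arguments can scope a run. The `attribute` stub below stands in for the gem's ActiveModel-backed attribute macro, and the inclusive-ID-range semantics are our own convention for illustration, not part of the gem.

```ruby
# Stand-in for the gem's base class and attribute macro, so this sketch
# runs outside Rails.
module MaintenanceTasks
  class Task
    def self.attribute(name, _type = nil)
      attr_accessor name # simplified stand-in for the gem's typed attributes
    end
  end
end

module Maintenance
  class UpdateWebhookEndpointsTask < MaintenanceTasks::Task
    attribute :processing_window_start, :integer
    attribute :processing_window_end, :integer

    def collection
      all_ids = (1..100).to_a # stand-in for User.ids
      # Only process IDs inside the window, so a run can target a small
      # test subset or resume where a crashed run stopped.
      all_ids.select { |id| id.between?(processing_window_start, processing_window_end) }
    end

    def process(user_id); end
  end
end

# Values that would normally be typed into the dashboard form:
task = Maintenance::UpdateWebhookEndpointsTask.new
task.processing_window_start = 10
task.processing_window_end = 12
```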
Handling Errors
While running our tasks, we encountered numerous errors, including vendor-side issues with our third-party APIs, networking issues when restarting Kubernetes pods, and database contention.
Unfortunately, there was no built-in or default behaviour to retry the task or otherwise recover from the error. The gem does provide building blocks we could use to implement this ourselves, such as the after_error callback and the throttling mechanism, but overridable default error-handling behaviour would be a welcome addition.
To resolve these issues, we implemented our own custom error-handling logic: we rescued the errors, retried the process with linear backoff, and logged the error if it persisted. This let the maintenance task keep running, theoretically indefinitely, without temporary or rare issues halting its progress, while still keeping track of the errors that genuinely required developer intervention to fix.
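The retry strategy can be sketched in plain Ruby as a small wrapper around the per-user API call. The names, retry budget, and backoff step here are illustrative, not our production values.

```ruby
# Rescue, retry with linear backoff, and log only if the error persists,
# so transient vendor/network/database errors don't halt the task.
def with_linear_backoff(max_attempts: 3, step: 1, logger: ->(msg) { puts msg })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    if attempts < max_attempts
      sleep(step * attempts) # linear backoff: step, 2*step, 3*step, ...
      retry
    end
    # Persisted past all retries: log it and move on so the task keeps running.
    logger.call("giving up after #{attempts} attempts: #{e.message}")
    nil
  end
end

# Usage inside a task's #process would look something like:
# with_linear_backoff { VendorClient.update_webhook(user_id, NEW_ENDPOINT) }
```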
Speeding Up The Process
After implementing error handling, the task was able to continue running across deploys for 24+ hours. However, it would still take 10 days of 24/7 running to finish.
Unfortunately, the Maintenance Tasks gem does not support running multiple instances of the same task in parallel, so we duplicated the task to create TaskA, TaskB, and TaskC, which we could run side by side.
We divided the items that still needed to be processed into 3 equally sized ranges and assigned each range to a task using the processing_window_start and processing_window_end arguments. This allowed us to run 3 copies of the same task in parallel, finishing roughly 3x as fast (3-4 days instead of 10).
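Splitting the ID space is simple arithmetic; a sketch of how the three windows could be computed, with the last window absorbing any remainder (function name and boundaries are illustrative):

```ruby
# Split an inclusive ID range [first_id, last_id] into `parts` contiguous,
# roughly equal windows; each [start, stop] pair feeds one duplicated task's
# processing_window_start / processing_window_end arguments.
def split_into_windows(first_id, last_id, parts)
  base = (last_id - first_id + 1) / parts
  (0...parts).map do |i|
    start = first_id + i * base
    stop  = i == parts - 1 ? last_id : start + base - 1
    [start, stop]
  end
end
```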
Logging and Metrics Collection
We wanted to track the progress of our maintenance tasks, but there was no built-in method to collect and display custom logs and metrics live.
As a result, we used a logging provider for logs, along with DataDog and StatsD for metrics collection, allowing us to see in real time how many users had data being sent to legacy versus current webhook endpoints and to verify that our tasks were working correctly. We also created a unique_runtime_id argument to help differentiate between runs.
Above is a screenshot from DataDog showing the number of requests going to the main server (light blue) and the secondary server (dark blue), stacked on top of each other. Over time, you can see a clear shift from requests flowing through both servers (both light blue and dark blue) to going directly to the secondary server (only dark blue).
Impact
In the end, we reduced the traffic load coming into the main server by 67%, freeing up resources to handle more customer traffic on the app, and reduced request latency by 80%. We also created a cleaner separation of logic between the main server and the secondary server. Without the Maintenance Tasks gem, this process would have been far more complicated and inefficient.