Running a Migration Making 800k API Calls Without Crashing Prod

George Shao · Drop Engineering · Jun 29, 2022


Executing long-running scripts in production can be quite a time-consuming, inelegant task. At Drop, we previously had a repository of scripts that we would use by manually accessing the Kubernetes cluster through the terminal and running the script.

This was tedious and worked alright for a while, but soon we found ourselves with an issue that was near-impossible to solve with this method:

The main server ingests many requests per second, with 2/3 of them coming in from our third-party APIs, potentially causing latency spikes among other performance issues. These numbers would only continue to grow as Drop gained more users.

Currently, our third-party APIs send requests to the main server, which forwards them to the secondary server, where all of the request processing happens. An obvious improvement would be to have the third-party APIs send requests directly to the secondary server, reducing the request load on the main server by 67%.

Before/After General System Diagram

The problem is that the webhook endpoint we tell our third-party APIs to send requests to is configured on a per-user basis. In other words, we need to make a massive number of requests to our third-party APIs to update the webhook endpoint, one user at a time.

If we used our usual method to perform this task, it would take 10 days of running 24/7 to complete. As a result, we decided to switch to the Maintenance Tasks gem to streamline the process.

What Are Maintenance Tasks?

Shopify’s Maintenance Tasks gem provides a Rails engine for writing and executing long-running data migration tasks. It also provides a dashboard for managing tasks and monitoring their progress and status.

Shopify’s Maintenance Task Gem Dashboard
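As a rough illustration, a task is just a Ruby class: it declares a collection of items and a process method that handles one item at a time. The model, client, and constant names below are placeholders, not our actual migration code.

```ruby
# app/tasks/maintenance/update_webhook_endpoint_task.rb
module Maintenance
  class UpdateWebhookEndpointTask < MaintenanceTasks::Task
    def collection
      # The records to iterate over, one at a time.
      # User and the webhook_migrated flag are hypothetical.
      User.where(webhook_migrated: false)
    end

    def process(user)
      # Called once per element of the collection; progress is checkpointed
      # between elements, so an interrupted run can resume where it left off.
      ThirdPartyApiClient.update_webhook_url(user, SECONDARY_SERVER_WEBHOOK_URL)
    end
  end
end
```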

Persistence Across Deployments

Using our previous method of manually accessing the Kubernetes cluster and running the script would have blocked us from deploying any new back-end code to production for 10 full days, undermining our CI/CD efforts and drastically slowing development at Drop.

By using the Maintenance Tasks gem, we were able to keep our tasks running across deployments without adding any additional logic. The task below ran for about 22 hours (during which we deployed to production quite a few times), but tasks can theoretically run indefinitely.

Maintenance Task Dashboard Showing Completed 22 Hour Task

Working With CSV Files

With our previous method of running scripts, if we wanted a task to loop over a CSV file, we had to add the CSV file to the code repository itself and merge and deploy it to the QA, canary, and production Kubernetes clusters every time the CSV file changed.

The Maintenance Tasks gem lets us upload a CSV file in the web dashboard and runs the task’s process method once for each row in the file. Since the vast majority of our current scripts use CSV files, this will save a lot of time in the long run.

“Upload CSV” Button on Maintenance Task Dashboard
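For CSV-driven tasks, the gem provides a csv_collection helper instead of a collection method. A minimal sketch, with made-up column names and models:

```ruby
module Maintenance
  class UpdateWebhookEndpointFromCsvTask < MaintenanceTasks::Task
    # Tell the gem this task iterates over an uploaded CSV file.
    csv_collection

    def process(row)
      # row is a CSV::Row built from the uploaded file; the column names
      # ("user_id", "webhook_url") are illustrative, not our real schema.
      user = User.find(row["user_id"])
      ThirdPartyApiClient.update_webhook_url(user, row["webhook_url"])
    end
  end
end
```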

Customizing Runtime Variables

Before we added error handling to our tasks, they would often crash every few hours, so we needed a way to resume a run partway through. The Maintenance Tasks gem allows any user of the web dashboard to pass custom arguments to each run of a given task.

We added processing_window_start and processing_window_end arguments to define a processing window. This is especially useful for testing the task on a small subset of items first, or for telling the task to continue from where a previous run left off.

Maintenance Task Dashboard Arguments
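Task arguments are declared as Active Model attributes on the task class, and the dashboard renders an input field for each one. A hedged sketch of how the window arguments might be wired up, assuming the window is a range of user IDs (that assumption, and the model and client names, are ours for illustration):

```ruby
module Maintenance
  class UpdateWebhookEndpointTask < MaintenanceTasks::Task
    # Each attribute becomes an input field on the run form in the dashboard.
    attribute :processing_window_start, :integer
    attribute :processing_window_end, :integer
    validates :processing_window_start, :processing_window_end, presence: true

    def collection
      # Only process users whose IDs fall inside the requested window.
      User.where(id: processing_window_start..processing_window_end)
    end

    def process(user)
      ThirdPartyApiClient.update_webhook_url(user, SECONDARY_SERVER_WEBHOOK_URL)
    end
  end
end
```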

Handling Errors

While running our tasks, we encountered numerous errors, including vendor-side issues with our third-party APIs, networking issues when restarting Kubernetes pods, and database contention.

Unfortunately, there is no built-in or default behaviour to retry a failed item or otherwise recover from the error. There are some building blocks we could have used to assemble this ourselves, such as the after_error callback and the throttling mechanism, but default error handling behaviour that can be overridden would be a welcome addition.

To resolve these issues, we implemented our own custom error handling logic: we rescued the errors, retried the process with linear backoff, and logged the error if it persisted. This helped ensure the maintenance task could keep running more or less indefinitely, preventing transient issues from halting progress, while still keeping track of errors that required developer intervention to fix.
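A rough sketch of what that looks like inside process; the error classes, retry count, and backoff interval are illustrative rather than our exact values:

```ruby
MAX_ATTEMPTS = 5
BACKOFF_SECONDS = 30

def process(user)
  attempts = 0
  begin
    ThirdPartyApiClient.update_webhook_url(user, SECONDARY_SERVER_WEBHOOK_URL)
  rescue ThirdPartyApiClient::ApiError, Net::OpenTimeout, ActiveRecord::Deadlocked => e
    attempts += 1
    if attempts < MAX_ATTEMPTS
      # Linear backoff: wait 30s, 60s, 90s, ... before retrying the same item.
      sleep(BACKOFF_SECONDS * attempts)
      retry
    end
    # Persistent failures are logged for manual follow-up rather than
    # halting the whole run.
    Rails.logger.error("webhook update failed for user #{user.id}: #{e.class}: #{e.message}")
  end
end
```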

Speeding Up The Process

After implementing error handling, the task was able to continue running across deploys for 24+ hours. However, the problem remained that it would still take 10 days running 24/7 to finish the task.

Unfortunately, the Maintenance Tasks gem does not support running multiple instances of the same task in parallel, so we duplicated the task into TaskA, TaskB, and TaskC, which we could run side by side.

We divided the items that still needed to be run into 3 ranges of equal size and assigned each range to a task using the processing_window_start and processing_window_end arguments. This allowed us to run 3 of the same task in parallel, and we were able to finish 3x as fast (3-4 days instead of 10 days).
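The split itself is simple arithmetic. A hypothetical example (the ID boundaries are made up) of how three equal windows could be computed:

```ruby
# Hypothetical remaining range of user IDs still to be migrated.
start_id = 150_000
end_id   = 950_000

span = ((end_id - start_id + 1) / 3.0).ceil
windows = (0...3).map do |i|
  window_start = start_id + i * span
  window_end   = [window_start + span - 1, end_id].min
  [window_start, window_end]
end
# => [[150000, 416666], [416667, 683333], [683334, 950000]]
# Each pair is passed to one of TaskA/TaskB/TaskC as
# processing_window_start / processing_window_end.
```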

Logging and Metrics Collection

We wanted to track the progress of our maintenance tasks, but there was no built-in way to collect and display custom logs and metrics in real time.

As a result, we used a logging provider for logs and DataDog with StatsD for metrics. This let us track how many users still had data going to the legacy webhook endpoint versus the new one, and confirm in real time that our tasks were working correctly. We also added a unique_runtime_id argument to differentiate between runs.
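A hedged sketch of the kind of instrumentation we mean; the metric name, tags, and the update_webhook_endpoint helper are illustrative, not our production code:

```ruby
def process(user)
  destination = update_webhook_endpoint(user) # e.g. :legacy or :secondary

  # Count each update, tagged so DataDog can break the series down by
  # destination and by run (unique_runtime_id is the task argument above).
  StatsD.increment(
    "webhook_migration.endpoint_updated",
    tags: ["destination:#{destination}", "run:#{unique_runtime_id}"]
  )

  Rails.logger.info(
    "run=#{unique_runtime_id} user=#{user.id} destination=#{destination}"
  )
end
```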

Incoming Third Party API Request Destination Count (from DataDog)

Above is a screenshot from DataDog showing the number of requests going to the main server (light blue) and the secondary server (dark blue), stacked on top of each other. Over time, there is a clear shift from requests going through both servers (light blue and dark blue) to going directly to the secondary server (dark blue only).

Impact

In the end, we reduced the traffic load coming into the main server by 67%, freeing up resources to handle more customer traffic on the app, and reduced request latency by 80%. We also created a cleaner separation of logic between the main server and the secondary server. Without the Maintenance Tasks gem, this process would have been far more complicated and inefficient.
