Leveling Up Transactions @ Drop — INT to BIGINT!

Harsh Shah · Published in Drop Engineering · Oct 4, 2019 · 10 min read


The Importance of Transactions at Drop

Transaction data is core to Drop’s ability to reward our members and personalize their shopping experience. Using machine learning models and transaction data, we surface highly relevant brands that members would not otherwise be aware of.

Before diving deeper, let’s take a quick tour of what happens behind the scenes when our member, Jamie, is awarded points.

  • Jamie links a credit/debit card to their Drop account
  • Jamie makes a purchase at Walmart online/in-store
  • Drop securely fetches transactions and maps them to an offer activated by Jamie
  • Jamie is awarded Drop points for the purchase! 🎉

In this post, I’m going to share a scaling challenge that we faced as we experienced significant member growth and an explosive amount of transaction data.

How scale broke the system

We use PostgreSQL as our database to store transactions. The primary key for the transactions table was created with the SERIAL data type (i.e. an auto-incrementing column of integers). This data type served our needs until one fine day we exhausted the number of transaction IDs we could generate.

Reaching the maximum positive signed integer (2³¹ − 1) meant that all subsequent inserts failed, because PostgreSQL could no longer generate the next primary key value. New transactions could not be inserted, we could not award points to our members, and the production system was degraded.
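To make the failure mode concrete, here is a toy reproduction of the same ceiling (illustrative only, not Drop’s schema or data):

-- Toy example: a table with an integer (SERIAL) primary key, fast-forwarded to the limit
CREATE TABLE demo (id serial PRIMARY KEY, note text);
ALTER SEQUENCE demo_id_seq RESTART WITH 2147483647;
INSERT INTO demo (note) VALUES ('last id that fits');  -- succeeds, id = 2147483647
INSERT INTO demo (note) VALUES ('one too many');
-- the second insert fails: depending on the PostgreSQL version, either the sequence
-- reaches its maximum value or the generated value no longer fits the integer column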

Building a scalable solution

With a member-first philosophy, our primary concern was to award points to our members as soon as possible! We brainstormed ideas that would satisfy the following, in order of importance:

  1. Get members their points ASAP
  2. Guarantee data integrity
  3. Smaller and parallel iterations that could be verified at every stage
  4. Minimize application code change

“Quality tech doesn’t happen in silos”

We came up with a 5-step solution that would allow us to satisfy the constraints and minimize downtime. We’ll first go over the strategies behind the solution, then dig deeper into how we executed the steps.

Strategy

Stop-gap Fix

Given the urgency of the problem and the complexity of a proper fix, we used the negative integer space as a temporary stop-gap. Since an integer is a signed data type in PostgreSQL, we leveraged its signed nature to our advantage and allowed the primary key to take negative values starting from -1. In theory, this gave us room for an additional 2³¹ IDs, in the negative space down to -2³¹. Although negative IDs are not ideal and a bit weird, this allowed us to immediately restore the processing of transactions and resume awarding points to our members.

With the system restored and processing transactions again, the next step was a long-term solution: migrating over to the BIGINT column type.

Create a new column of type BIGINT

The range of a signed BIGINT (or BIGSERIAL) is -2⁶³ to 2⁶³ − 1, which is a little over 9.2 quintillion values in the positive space. By migrating the primary key column to BIGINT, we could grow our transactions data set by a factor of 2³², which would last us a long time. Thus, we added a column of type BIGINT to the transactions table, id_bigint. This column would be populated in tandem with the original ID column (which was now counting backward) until we switched over to the new column.

Start populating the new ID column

Once we added the new column, we updated our application code so that every new insert into the transactions table also wrote to the new BIGINT column, if that column was present. This conditional insert is important to call out: later, we would be replacing the old column with this new one, and we wanted to avoid having to synchronize code changes and deployments when that happened.

At the same time, we started back-filling historical data into the new column, a very slow and resource-intensive process. A naive “UPDATE transactions SET id_bigint = id” was not an option for us, since it would have to update more than 2³¹ records in one go and would have completely blown up the database in terms of CPU and IOPS. Updating in batches was the only option, as we wanted to control the impact on CPU and IOPS so as not to degrade other services. Based on our staging environment (which is identical to our production environment), we estimated this back-fill would take 4–5 days.

Prepare migration to the new ID column

Before migrating to the new column, we had to ensure that it had the same indices and foreign keys as the existing column. We created a new unique index on the new column, as it would serve as the primary key constraint. The unique index was created with PostgreSQL’s CREATE INDEX CONCURRENTLY option, which builds the index without acquiring locks that would otherwise block concurrent inserts, updates, or deletes. The caveat is that index creation takes longer to finish, but this allowed us to continue using the table while the index was being built.

This prep work sped up the migration to the new column and minimized the downtime.

Switch over to the new primary key column

At this point, we were fully ready to migrate to the new column. To do this, we dropped the existing primary key constraint on the transactions table, created a new primary key constraint using the unique index we created earlier, and renamed the old and new ID columns so that they replaced each other. Finally, we dropped the old sequence that was decrementing into the negative space and created a new sequence for the new column, starting from where we had left off in the positive space (2³¹).

Execution

With any engineering challenge, we need a step-by-step plan to execute safely and with minimal disruption. We started by writing all the SQL queries required to run as part of the migration. Then we ran the queries in our staging environment to get a sense of how long they would take to execute.

For long-running queries, or queries that needed table-level locks, we broke them down into smaller queries, then re-ran them in our staging environment to make sure we didn’t miss anything during our query plan optimization. We were prepared for the worst: what if rolling back went sideways as well? To mitigate failure effects, we decided to create a manual database snapshot after each step. We notified our internal stakeholders about the plan and the timeline, and our members were notified with an in-app banner about the limited functionality during the final steps of the migration.

Stop-gap Fix

To start using negative integers as IDs, we simply modified the sequence associated with the existing ID column. We specified a new minimum value for the sequence, the new value to restart the sequence from, and set the increment value as -1 to have it change to a descending sequence.

ALTER SEQUENCE transactions_id_seq INCREMENT BY -1 RESTART -1 MINVALUE -2147483648;

Create a new column of type BIGINT

This change was a very simple addition of a new column:

ALTER TABLE transactions ADD COLUMN id_bigint bigint;

Our rollback strategy here was simply to drop the new column.

ALTER TABLE transactions DROP COLUMN id_bigint;

Start populating the new ID column

Updating the application code to start inserting into the new column was relatively trivial. However, back-filling historical data into the column in a controlled and efficient way was challenging. For this, we used asynchronous workers in our background job processing framework, Sidekiq, updating 10,000 records per batch. We settled on a batch size of 10,000 based on our experience running it in our staging environment and making sure it didn’t overwhelm the database. It took almost 5 days to back-fill all the historical data.
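Conceptually, each batch boils down to an update over a bounded ID range. The following is a minimal sketch of one batch; the range boundaries are illustrative and this is not Drop’s exact Sidekiq job:

-- One illustrative batch: copy ids 1 through 10,000 into the new column
UPDATE transactions
SET id_bigint = id
WHERE id >= 1 AND id < 10001
  AND id_bigint IS NULL;
-- Subsequent batches cover the next ranges (10001–20000, and so on), with pauses
-- between batches to keep CPU and IOPS under control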

Prepare migration to the new ID column

We consciously split this step and the next step across two days to minimize downtime and degraded performance. We also ran these steps overnight, when our database had relatively little load.

Based on our estimates from running it in a staging environment, creating the unique index on the new ID column would complete in about 6 hours. In reality, this took a little over 10 hours. We had expected the production number to be higher, since our staging environment does not have the same consistent load as production. Since this was not a blocking operation, we weren’t too stressed about it as long as the unique index was created successfully.

CREATE UNIQUE INDEX CONCURRENTLY transactions_pkey_new ON transactions(id_bigint);

Rolling back this step simply meant dropping the index, in case index creation was not successful or the index became corrupted.

DROP INDEX CONCURRENTLY IF EXISTS transactions_pkey_new;

Next, we had to re-create the foreign key constraints against the new column. Creating foreign key constraints requires a SHARE UPDATE EXCLUSIVE lock. This lock prevents concurrent schema changes and VACUUM runs but does not impact inserts, updates, or deletes. This took us about 4 hours, but since inserts and updates were not impacted, there was no downtime.

BEGIN;
LOCK TABLE transactions IN SHARE UPDATE EXCLUSIVE MODE;
ALTER TABLE transactions ADD CONSTRAINT fk_rails_7b42c113e2_2 FOREIGN KEY (someother_id_bigint) REFERENCES transactions(id_bigint) NOT VALID;
ALTER TABLE transactions VALIDATE CONSTRAINT fk_rails_7b42c113e2_2;
COMMIT;

If for any reason this step didn’t go as planned, we would simply drop the constraint conditionally:

ALTER TABLE transactions DROP CONSTRAINT IF EXISTS fk_rails_7b42c113e2_2;

Replacing the old foreign key constraints with the new ones was relatively trivial and ran almost instantaneously:

BEGIN;
LOCK TABLE transactions IN ACCESS EXCLUSIVE MODE;
ALTER TABLE transactions RENAME CONSTRAINT fk_rails_7b42c113e2 TO fk_rails_7b42c113e2_defunct;
ALTER TABLE transactions RENAME CONSTRAINT fk_rails_7b42c113e2_2 TO fk_rails_7b42c113e2;
COMMIT;

Switch over to the new primary key column

This was the riskiest step of them all. Switching over to the new column required an ACCESS EXCLUSIVE lock, so inserts and updates would be blocked but reads would go through. Adding the primary key constraint took approximately 1.5 hours, most of which goes to verifying that the column contains no NULL values, since a primary key also enforces NOT NULL. All the other statements ran almost instantaneously. This was the only step where we had a degraded system: for 1.5 hours, no new transactions (and thus, no points) were being created.

BEGIN;
LOCK TABLE transactions IN ACCESS EXCLUSIVE MODE;
ALTER TABLE transactions DROP CONSTRAINT IF EXISTS transactions_pkey CASCADE;
ALTER TABLE transactions ADD CONSTRAINT transactions_pkey PRIMARY KEY USING INDEX transactions_pkey_new;
ALTER TABLE transactions RENAME COLUMN id TO id_defunct;
ALTER TABLE transactions RENAME COLUMN id_bigint TO id;
DROP SEQUENCE transactions_id_seq CASCADE;
CREATE SEQUENCE transactions_id_seq START WITH 2147483648;
ALTER TABLE transactions ALTER COLUMN id SET DEFAULT NEXTVAL('transactions_id_seq');
COMMIT;

The rollback procedure for this step was the most complicated of all the steps, as it involves undoing everything we had done, but in reverse order.

BEGIN;
LOCK TABLE transactions IN ACCESS EXCLUSIVE MODE;
ALTER TABLE transactions RENAME COLUMN id TO id_bigint;
ALTER TABLE transactions RENAME COLUMN id_defunct TO id;
ALTER TABLE transactions DROP CONSTRAINT IF EXISTS transactions_pkey;
CREATE UNIQUE INDEX transactions_pkey_id ON transactions (id);
ALTER TABLE transactions ADD CONSTRAINT transactions_pkey PRIMARY KEY USING INDEX transactions_pkey_id;
SELECT MIN(id) - 1 FROM transactions; -- gives <value from before>
DROP SEQUENCE transactions_id_seq CASCADE;
CREATE SEQUENCE transactions_id_seq INCREMENT BY -1 MINVALUE -2147483648 START WITH <value from before>;
ALTER TABLE transactions ALTER COLUMN id SET DEFAULT NEXTVAL('transactions_id_seq');
COMMIT;

Learnings

During this event, our error tracking tool, Sentry, was noisy with errors that weren’t actionable, and we could easily have missed this one. Fortunately, that didn’t happen. Since then, we’ve actively worked on cleaning up the majority of non-actionable errors in Sentry to ensure most errors are actionable.

This incident also showed us the need to invest in our transaction ingestion infrastructure. As a result, we’ve invested significantly in re-designing our transaction services for ingestion performance and isolating them from our other production workloads.

Sh*t happens, and as long as we take away actionable learnings and put measures in place so that we don’t repeat it, we treat this as a valuable learning experience. What we learned, albeit the hard way, is to be proactive rather than reactive. Storage is cheap, so new tables now get a BIGINT primary key column by default. And in this particular case, if we had had the necessary monitoring and alerting in place for high-growth tables like transactions, we could have avoided this mayhem.
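As a rough illustration of the kind of check we mean, a periodic query like the one below can flag integer-backed sequences that are approaching the 2³¹ − 1 ceiling. This is a minimal sketch, assuming PostgreSQL 10+ (which provides the pg_sequences view), not the alerting we actually run:

-- Headroom remaining before transactions_id_seq hits the signed 32-bit integer limit
SELECT schemaname, sequencename, last_value,
       2147483647 - last_value AS ids_remaining
FROM pg_sequences
WHERE sequencename = 'transactions_id_seq';
-- Alert when ids_remaining drops below a chosen threshold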

Conclusion

We believe in hiring problem solvers, not coders. Very few of us on the Drop engineering team had advanced database administration experience when this event happened, so it was a learning experience for many of us. Coming up with creative solutions to problems is what we strive for. For example, leveraging negative numbers as transaction IDs was not ideal, but it was a creative fix that ensured our members’ experience was not affected while we agreed on the long-term solution and fixed it the right way.

“Quality tech doesn’t happen in silos” is one of our Drop Tech values that we live and breathe by. It might seem like a trivial process to migrate one primary key column to another, but it was very involved, challenging us to think outside the box while under pressure to get production up and running ASAP. It wasn’t a one-person effort, so I would like to thank Darren Fung, Fahad Fayyaz, and Ahmed Abu Lawi for helping at various points to get it over the finish line.

Does the way we build and firefight resonate with you? Well, you’ll be happy to know that we’re hiring! Take a look at our open roles here.
