How we built an event-driven architecture at Drop

Eric M Payne · Drop Engineering · Jun 22, 2021

Drop aims to be the most personalized rewards platform in the world.

“Do things that don’t scale” is a common adage heard at early-stage startups, and building software is no exception to this rule. As a startup grows, though, unscalable software solutions need to be revisited and rearchitected, as software architecture that worked for thousands of users may not work for a user base in the millions! Here at Drop, we’re moving towards an event-driven architecture to support our scaling needs. In this blog post, I’ll cover the following topics:

  • what is an event-driven architecture?
  • how did Drop build an event-driven architecture?
  • what are Drop’s first use cases for an event-driven architecture?

What is an event-driven architecture?

Firstly, what is an event? An event refers to some change of state in a software system; computations that cause this state change are referred to as event producers, while computations that occur as a result of this state change are called event consumers. The goal of an event-driven architecture is to decouple our producer logic from our consumer logic.

Consider the following Ruby on Rails code snippet:

class Foo < ApplicationRecord
  after_create :call_bar_service

  private

  def call_bar_service
    BarService.perform_async(id)
  end
end

In this example:

  • our event is an INSERT to the foos table in a SQL database
  • our event producer is a call to .create on Foo
  • our event consumer is our BarService Sidekiq job, which performs asynchronously using the new Foo instance’s ID as a parameter

The implementation of Foo has knowledge of BarService, meaning our producer and consumer are tightly coupled. We want to avoid this type of design; the above example is simple, but this kind of tight coupling in a large codebase makes a software application difficult and expensive to maintain. An event-driven architecture should enable BarService to run without Foo having any knowledge of that service; in other words, these software components would be decoupled.

Kafka and the Kafka consumer

To build this decoupled event-driven architecture, we selected Apache Kafka to enable fault-tolerant, real-time streaming of events. Kafka is a software bus that stores event data, known as records, written by event producers; each record is eventually read by an event consumer. Records are divided into different topics based on their source.

A Kafka consumer is a program that executes in response to new data arriving in Kafka, and is configured to monitor the specific topics it will read event data from. Drop uses Racecar, a Kafka consumer framework written in Ruby, to consume these events and execute useful business logic upon consumption.

Debezium and data capture

If Racecar is the event consumer in this architecture, then what is the event producer? We use Debezium, an open-source change data capture (CDC) Kafka connector, which sits on top of our PostgreSQL database.

When an INSERT, UPDATE, or DELETE occurs on a database table we’re interested in, PostgreSQL captures the complete contents of this change with column-level granularity in its write-ahead log. Debezium then converts these changes into JSON payloads, which are then placed onto Kafka in a topic corresponding to the table name. This JSON payload specifies the type of write that occurred (INSERT vs. UPDATE vs. DELETE), as well as all of the values contained in the columns of the new row.
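
As a rough illustration, a change event for an INSERT might look like the simplified JSON below. This shape is deliberately simplified to match the examples later in this post; real Debezium change events carry additional schema metadata and before/after row images (see the footnotes), and the created_at column is hypothetical.

{
  "op": "c",
  "payload": {
    "id": 123,
    "created_at": "2021-06-22T12:00:00Z"
  }
}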

Putting it all together

We use Debezium to produce an event for every write to a table of interest in PostgreSQL; the event is then persisted to a topic on Kafka. Racecar, our Kafka consumer, reads this event data from the Kafka topics of interest and then runs application-specific logic based on the contents of each event.

Returning to our original example, we can redefine Foo to be significantly simpler:

class Foo < ApplicationRecord
  # We no longer need an after_create callback,
  # or any reference to BarService!
end

We configure Debezium to do the following (a rough sketch of this configuration follows the list):

  • listen to writes to the foos table in PostgreSQL
  • create corresponding event data on Kafka’s foos topic
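
Registering such a connector through the Kafka Connect REST API might look roughly like the JSON below; the connector name, hostnames, and credentials are placeholders, and a real configuration typically carries additional settings. (By default, Debezium names topics <server>.<schema>.<table>, eg. drop.public.foos; producing to a plain foos topic as described in this post would involve an additional topic-routing transform.)

{
  "name": "drop-postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "<secret>",
    "database.dbname": "drop_production",
    "database.server.name": "drop",
    "table.include.list": "public.foos"
  }
}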

We then configure Racecar to listen to the foos topic on Kafka, and define the consumer like so:¹

class FooEventConsumer < Racecar::Consumer
  subscribes_to 'foos'

  def process(message)
    event_data = JSON.parse(message.value)

    # 'c' for 'create', ie. row insertion
    if event_data['op'] == 'c'
      foo_id = event_data['payload']['id']
      BarService.perform_async(foo_id)
    end
  end
end

The steps in the data flow look like so:

  1. A new row is inserted into the foos table on PostgreSQL (ie. a call to Foo.create in our Rails application code was successful).
  2. Debezium consumes this data from PostgreSQL’s write-ahead log and generates a JSON payload from it.
  3. Debezium writes this JSON payload to the foos topic on Kafka.
  4. A Racecar consumer, which has been listening to the foos topic on Kafka, reads and parses the JSON payload (which contains complete information about the new row in the foos table).
  5. The Racecar service calls BarService with the ID of the newly created Foo.

Data flow in our event-driven architecture

Our first use case: Audience membership calculation

So, how do we actually use this event-driven architecture at Drop? Our first use case is calculating which Audiences our members belong to.

Drop aims to be the most personalized rewards platform in the world, meaning that there’s a high degree of personalization in determining what content we show to our members in the Drop app. An Audience is a domain model that enables this personalization, and determines what our members see when they open the Drop app. A user belongs to many Audiences, which can be:

  • based on demographics (eg. age, gender)
  • based on behaviour (eg. having previously purchased from one of our partner brands, having recently onboarded onto the Drop app)

In our previous implementation, we calculated which Audiences a user belonged to whenever they opened the Drop app on their mobile device. However, there were two problems with this approach:

  • Calculating a user’s Audience memberships is computationally intensive. It requires a number of database queries, which slows down app load time when a user opens Drop on their mobile device. To mitigate this slowdown, we cached users’ Audience memberships. However, there was a tradeoff in doing so: because of this caching, a user wouldn’t necessarily see the most up-to-date content relevant to them.
  • We didn’t persist this Audience membership data to PostgreSQL for analytics purposes. Any persisted values would be inherently out-of-date and inaccurate, as recalculation was dependent on a user opening the app on their phone. This means that, from an analytics perspective, our previous implementation limited our insight into which users belonged to which Audiences.

Drop’s previous implementation of Audience membership calculation

Our new event-driven architecture allows us to precompute these values in real time for an active user with the following steps:

  1. Configure Debezium to produce events when our Rails application writes to a table that may have changed a user’s Audience memberships. For example, our “device type” audience (ie. an Android vs. iOS user) may change upon a write to the devices PostgreSQL table (which stores data about what device type a user opened the Drop app with). So, a write to this devices table will cause Debezium to generate an event, which is written to Kafka.
  2. Run Racecar (our Kafka consumer) to process these events that may cause a change in a user’s Audience memberships. The Racecar consumer (sketched after the figure below) will:
    - read a user ID from the event
    - determine the Audience type from the topic (eg. “device type” from the devices topic)
    - enqueue an asynchronous Sidekiq job to recompute Audience memberships of that Audience type (eg. “device type”) for that user.
  3. In the asynchronous Sidekiq job from step 2, compute the user’s Audience memberships (ie. the intensive computation we used to do at app load time), then persist those values to PostgreSQL.
  4. Query PostgreSQL for these persisted values when the user opens Drop on their mobile device (rather than do an expensive recomputation).

Recalculating Audience memberships in our event-driven architecture
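
To make steps 2 and 3 concrete, a consumer-and-job pair along these lines might look like the sketch below. The class names (DeviceEventConsumer, RecomputeAudienceMembershipsJob, AudienceMembershipCalculator, AudienceMembership) and the payload shape are illustrative, not our actual production code.

class DeviceEventConsumer < Racecar::Consumer
  subscribes_to 'devices'

  def process(message)
    event_data = JSON.parse(message.value)
    user_id = event_data['payload']['user_id']

    # Recompute the "device type" Audience memberships for this user
    # asynchronously, so the consumer itself stays fast.
    RecomputeAudienceMembershipsJob.perform_async(user_id, 'device_type')
  end
end

class RecomputeAudienceMembershipsJob
  include Sidekiq::Worker

  def perform(user_id, audience_type)
    # The expensive queries we previously ran at app open time,
    # now run ahead of time.
    memberships = AudienceMembershipCalculator.new(user_id, audience_type).call

    # Persist the recomputed memberships to PostgreSQL so that opening
    # the app becomes a simple read.
    AudienceMembership.replace_for(user_id, audience_type, memberships)
  end
end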

Non-obvious implementation details

Some interesting challenges we faced while implementing the above architecture included:

  1. Debouncing: in practice, our Rails application does many writes in short succession for the same user and Audience type. For example, when Drop syncs credit card transaction data for a user, we don’t need to recalculate that user’s Audience memberships for every single credit card transaction (ie. each database write). We would get the correct result if we did, but it would be an inefficient use of our compute resources. To avoid wasteful duplicate reprocessing, we added debouncing logic to our Kafka consumer (a sketch of one approach follows this list).
  2. Transaction log disk usage: our initial configuration for our Kafka brokers didn’t provision enough disk space, leading to unexpected interactions between Debezium and Kafka, ultimately causing downtime in our Kafka brokers. A quick re-provisioning with more disk space for the Kafka brokers mitigated this issue.
  3. Adding new tables: Debezium’s configuration is administered through the Kafka Connect REST API. Typically, we avoid making manual REST requests by using a Terraform provider. However, due to some Debezium-specific quirks, a manual process of making REST requests is required when we want to change the set of tables we produce events for.
  4. Numerous edge cases in our Audiences logic: not all Audience membership changes are caused by database writes. For example, membership in an age-based Audience isn’t altered by any write to the database; it changes with the passage of time, requiring a periodic job to recompute age-based memberships for some users. There are many such edge cases for precomputing Audience memberships.
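
As an illustration of the debouncing idea from point 1, one simple approach is to process at most one recomputation per user and Audience type within a short window, for example with a Redis key that only the first event can set. The sketch below is illustrative only; the key scheme, window length, and use of Redis are assumptions rather than our production implementation.

require 'redis'

class AudienceRecomputeDebouncer
  WINDOW_SECONDS = 60

  def initialize(redis: Redis.new)
    @redis = redis
  end

  # Returns true for the first event seen for this (user, audience type)
  # pair within the window; later events in the window return false and
  # can be skipped by the consumer.
  def first_in_window?(user_id, audience_type)
    key = "audience_recompute:#{audience_type}:#{user_id}"
    @redis.set(key, '1', nx: true, ex: WINDOW_SECONDS)
  end
end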

Impact in production

To verify and quantify the correctness and impact of our new architecture, we used Scientist, a “Ruby library for carefully refactoring critical paths.” In the area of our codebase where we currently compute our Audience memberships at app open time, we define an experiment (sketched below). In this experiment, we:

  • define a control block (ie. our current, “known good” implementation for computing Audience memberships)
  • define a candidate block (ie. using our stored Audience memberships that were precomputed upon consuming a Debezium event)
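
For illustration, the experiment has roughly the following shape; the class and the method names cached_audience_memberships and precomputed_audience_memberships are placeholders rather than our actual code.

require 'scientist'

class AudienceMembershipFetcher
  include Scientist

  def audience_memberships_for(user)
    science 'audience-membership-precompute' do |experiment|
      # Control: the existing implementation, computed (and cached) at app open.
      experiment.use { cached_audience_memberships(user) }

      # Candidate: the memberships precomputed by our Debezium/Racecar pipeline.
      experiment.try { precomputed_audience_memberships(user) }
    end
  end
end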

Scientist will always return the value from the control block, but will report whether or not our candidate block returned the same value as our control block, as well as provide timing information. Using Scientist, we verified that our stored Audience memberships:

  • had an average accuracy rate of 99.98% when compared to the control value²
  • had an average latency that was 1.15% higher than the control value. The control path returns our stale cached values, meaning that returning our up-to-date precomputed Audience memberships is comparable in performance to returning stale cached values. In absolute terms, this adds under 1ms of latency to a computation that takes 20–30ms, which is well within this project’s success metrics.

So, our new implementation:

  • provides more up-to-date Audience membership data than our previous caching implementation (with no meaningful hit to performance and accuracy)
  • stores this data so that it can be queried for analytics purposes

Generalized use case: the Outbox pattern

Our next use case for Debezium at Drop is the outbox communication pattern. By writing an event row to an outbox table in PostgreSQL within the same transaction as the domain change itself, we atomically update our database and publish an event to Kafka via Debezium. A Kafka consumer can subscribe to the outbox topic on Kafka, which may contain messages created by other services. This enables interservice communication that decouples our publisher from our subscriber.
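
As a rough sketch of what a producer-side write could look like, consider the hypothetical service below; the OutboxEvent model, its columns, and the Order domain model are illustrative, not our actual schema.

class OrderCreator
  def call(params)
    ActiveRecord::Base.transaction do
      order = Order.create!(params)

      # Written in the same transaction as the domain change, so the outbox
      # row is committed if and only if the order is. Debezium then reads the
      # row from the write-ahead log and publishes it to the outbox topic.
      OutboxEvent.create!(
        aggregate_type: 'order',
        aggregate_id: order.id,
        event_type: 'order_created',
        payload: { id: order.id, total: order.total }
      )

      order
    end
  end
end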

Drop’s current pub-sub model is loosely based on the outbox pattern; however, it:

  • uses Sidekiq across service boundaries, which works for our relatively monolithic architecture, but won’t be scalable as we move to a more service-based architecture
  • uses publishers that are loosely coupled with their subscribers (rather than completely decoupled)

This new architecture allows us to decouple our services in a scalable manner.

Conclusion

Using Drop’s new event-based architecture, our first use case allows us to:

  • precompute a user’s Audience memberships before they open the Drop app, allowing users to see up-to-date and personalized offers in a performant manner
  • accurately query which users belong to which Audiences from an analytics perspective

It’s worth noting that Kafka can be used for more than just enabling an event-driven architecture; more generally, Kafka can act as a message bus for one or more services. Looking ahead, this will allow Drop to move towards a less monolithic, more service-based architecture as our company scales and grows.

Footnotes

[1]: The JSON format that this example could successfully parse doesn’t quite map to a real Debezium payload. For brevity and simplicity, this blog post skips over some of the complexities of Debezium (such as schema metadata and before/after values), and so this example is simpler than our actual production code.

[2]: Why not 100% accuracy? We compute our Audience memberships asynchronously in a Sidekiq queue, meaning that there’s some lag between when an action occurs that should update a user’s Audience memberships and when that computation actually occurs.

This may sound strange, given that one of the goals of this project was to remove the caching that caused our users to not have up-to-date Audience memberships. However, the delay caused by this caching was measured in hours, while the delay caused by asynchronous processing is measured in seconds.
