Quadrant: Time-Series Database to Visualize Live Galileo Traffic and KPIs
|Summary||In order to speed up the diagnosis of potential issues with experiments and rollouts in our Galileo experimentation platform, we built Quadrant – a flexible, cheap, and fast data store. It allows us to store, aggregate, and serve real-time time-series data to our web frontend and keep both latency and storage costs extremely low.|
|Team||Data & AI|
|Author(s)||Roman Atachiants, Ph.D
|About the Author(s)||
As we began scaling out Galileo, our dynamic configuration platform, to hundreds of users and thousands of various feature rollouts and experiments, it became apparent that providing timely feedback in our web portal is critical. We needed to be able to show our users how their rollouts and experiments are performing and how much traffic they have, ideally with minimal latency.
Such live traffic visualization not only gives users of Galileo confidence in their integrations or A/B tests but also allows them to debug certain situations (e.g. misconfigurations, network security policy issues) while integrating their microservice(s) with the Galileo platform.
We decided to tackle this problem in a rather unorthodox way by building a small yet high-performance time-series database of our own. The rationale behind building this was simple: we wanted to be able to aggregate a time-series consisting of 7 days worth of minute-level data points within a very short period of time, ideally under 1 millisecond.
We wanted to present data as fresh as possible whenever it made sense, for example, if a user of our platform selected 100 experiments to view, we would need to show time-series for all of them as quickly as possible. That means potentially loading, aggregating, and transferring time-series based on 1 million data points, during page view.
Aside from mere convenience for our users, fetching the data faster allows us to detect and resolve problematic experiments or mis-configurations early, potentially avoiding negatively impacting our customers or captains. For example, when rolling out a particular feature on our Super App, engineers are required to monitor the health of the system and whether the traffic is flowing correctly to the new rolled out configuration, the faster the data is propagated and shown to engineers, the faster remediation of any potential issues can happen.
In order to explain how we built the system let us first examine a high-level architecture of Quadrant, our time-series storage service that we have designed and exposed to our frontend React application.
As you can see from the image above, the service is relatively simple with a few main components:
- A local caching layer with TinyLFU cache admission policy boosts the effectiveness of our caching layer. Specifically, we have used dgraph/ristretto as the caching layer of choice for Quadrant.
- A persistent storage implementation, which in our case is a simple Redis cluster. This needs to be some kind of relatively fast key/value store and can be substituted with anything else. In fact, we have plans to support archival of the data into AWS S3 to make sure that we never lose any time-series data.
- Since we store and cache our data in key/value stores, an efficient time-series encoding scheme is necessary so we can load, decode and aggregate data as quickly as possible. In the next section, we will dive into the detailed implementation of this encoding scheme.
When a metric needs to be loaded, we first generate a series of keys that need to be fetched in parallel. In Quadrant, our keys are a combination of a metric name and a unix epoch day, allowing us to categorize a time-series into one of 3 categories:
- Hot data represents a time-series that is actively being updated with new, real-time data. For each time series, we have only a single key which we consider as hot – today’s data.
- Warm data represents a time-series that is not being modified but still very useful to our UI, while it’s configurable, we consider the last 30 days of data for any metric as warm.
- Cold data represents data that is archived and rarely used and can be safely archived into a slower object store such as S3 but still might be needed by our users.
As shown in the diagram above, we have a local caching layer in front of all of the stores, which significantly speeds up the subsequent retrieval of time-series data. However, we keep different cache time-to-live (TTL) for different categories of data, since the data which is currently being modified needs to change and the data that isn’t going to change is safe to cache for a long time and let the least recently used caching policy do its thing.
Anatomy of a Time Series
When designing Quadrant, we wanted to keep the entire system relatively simple and not be limited by a specific storage system (Redis in our case). For that, we consider our storage layers to be a simple key-value store with potentially some transactional support and we have designed a data structure format for a time-series that can efficiently store large sets of timestamps accompanied by their respective floating-point values.
Our encoding scheme for the said data structure is largely inspired by the Gorilla paper, with some notable differences that are described below.
Given that the time series data in our case is immutable, a single time series is sorted by time (unix epoch in seconds) and both timestamps and values are encoded using a delta encoding scheme and written as variable size integers.
As shown below, the actual bytes are laid out as two arrays:
- The array of deltas of time stamps, encoded as variable size integers.
- The array of deltas of floating point values, where delta is taken using a XOR operation and bits are reversed.
After this encoding is done, the entire byte array is then compressed using a general-purpose s2 compression (extension of Snappy) which we have empirically determined to produce better performance and smaller size.
Given that the values of our time-series are encoded as IEEE 754 floating-point numbers, we first reverse the bits which move the sign bit and most significant bits to the end, and then we apply a XOR comparison with the previous value in order to generate a delta encoding for floating-point numbers. The final result is then encoded to the underlying memory buffer as a variable-size integer value, omitting the leading zeroes.
While in the Gorilla paper authors also deal with the trailing zeroes, we decided not to introduce this engineering complexity in our encoding/decoding process and instead leave it to a general-purpose compression algorithm to deal with.
In the below example you can see the actual bits that are being produced for a hypothetical array of floating-point numbers [10, 20, 30]. Note that in this example we have kept floating points to 32-bit precision, but in reality, we deal with 64-bit precision numbers (doubles). As you can see, the resulting deltas (post XOR) have a sizable amount of both leading and trailing zeroes, which is easily compressible.
As mentioned previously, the final step of encoding a single time-series value is to apply a general-purpose compression algorithm on the resulting binary data. We chose a snappy variant called S2 as it provided a good trade-off between decompression speed and relatively small size.
For example, encoding a time-series of 10,000 incremental values with regular timestamp deltas (1-second interval) resulted in the overall binary size of the value of merely 932 bytes. This represents the compression rate of 99.4% as each 16-byte time/value pair is encoded into 0.1 byte on average.
On the other hand, a compressed time-series of 10,000 completely random values with regular timestamp deltas encodes it into ~50KB time-series, resulting in the compression rate of around 70% which is most likely to be the worst case for many systems (or ~5 bytes per time/value pair).
We benchmarked our encoding and decoding on an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz laptop with time-series containing random values and sizes from 100 to 10,000 elements per time-series. For our use case, we aimed to look at a single value of 10,000 elements given that if you capture minute-level, this is the order of magnitude of data that needs to be encoded/decoded for a single day.
For the encoding, which is shown below, we can see that it takes roughly around 125 microseconds to encode a time-series with 10,000 random values.
|100||660,360/sec||1.8 μs/op||1456 B/op||7 allocs/op|
|1000||97,830/sec||11.6 μs/op||12,501 B/op||7 allocs/op|
|10000||8,372/sec||122.8 μs/op||196,909 B/op||7 allocs/op|
Similarly, the decoding of a 10,000 element time-series (shown below) takes around 140 microseconds.
|100||677709/sec||1.7 μs/op||2,257 B/op||5 allocs/op|
|1000||81376/sec||14.7 μs/op||20,600 B/op||5 allocs/op|
|10000||8655/sec||137.3 μs/op||213,200 B/op||5 allocs/op|
Visualizing Live Traffic
Our initial use-case for Quadrant is to provide to our users an easy way to understand visualization of the real-time traffic for experiments and rollouts.
For example, when we first released Galileo in mid-2021, we encountered a couple of issues where experiments were not properly terminated. This was our bug, but the traffic visualization allowed us to quickly diagnose the fact that the experiment was still receiving data despite being in an “aborted” state. After fixing the issue in our platform, our users have started seeing the correct behavior in the platform and it was confirmed by the “flat line” following the experiment traffic once the experiment is aborted, as expected.
In order to achieve this, our ReactJS web frontend that powers the Galileo platform displays traffic for the experiment since it started and until now if the experiment is running, or until experiment end-time, if the experiment is completed, with a one-hour interval.
For experiments, we visualize the entire traffic for the duration of the experiment (between start and end time). This is shown first on the page that lists all of the experiments, in a paginated table, as shown on the image below. If a user opens a page with 100 experiments, we need to display all of the traffic data for each of the experiments, with each experiment potentially requiring the aggregation of several weeks of data.
Next, we also display a breakdown of the experiment traffic on a page that displays a particular variable (i.e. feature toggle). For example, in the image below you can see a page for a single variable that shows 2 currently running experiments with 1 control and 1 treatment each. Both time-series charts and traffic information is served by Quadrant, which aggregates data on the fly depending on what UI requires.
Similarly, we display traffic information aggregated on our pages that list variables as well as specific per-branch (if statement) traffic drill-down when users need to view how traffic flows for different branches in their configuration. For example, the figure below shows a rollout with two branches and both time-series represent how traffic flows between the two, in real-time.
While Quadrant supports batch retrieval that allows to load and aggregate multiple metrics in a single request, for practical reasons in the UI we fetch them one by one and render using Ant Design Charts library for ReactJS. Every request to quadrant takes about ~10ms which includes data transfer, load from storage, and aggregation over at least 1 week of data. In our case, rendering a full table with 100 rows and 6 columns takes less than half a second, which is acceptable and it loads, aggregates, transfers, and renders over a million data points.
In this article, we described how we built a special-purpose database for visualizing our live traffic for experiments and rollouts. Our custom encoding of time-series data helped us to keep our data size small, and the algorithm relatively simple to implement and maintain.
In return, this allows us to store large amounts of time series data and keep our infrastructure costs low, with more memory it can cache more data and keep the response time of the UI under 1 second for most cases.
The architecture of the system is also kept simple so it can be further evolved if the need arises and relatively easy to understand by engineers. Most importantly, we have been running Quadrant in production for many months now and the service performed very well, the UI is responsive and we even sometimes forget that Quadrant exists and powers all of this.