Galileo: Scalable Platform for Dynamic Configuration & Experimentation

Summary To unify feature toggling and experimentation, consolidate existing tools, and enable data-driven decision making at Careem, we have built a custom, state-of-the-art dynamic configuration and experimentation platform, internally named Galileo. This platform allows us to reconfigure Careem’s mobile, backend, and web systems dynamically, run A/B tests, and more. All of this is made possible by treating configurations as Lua code and dynamically injecting them into relatively simple SDKs while tracking the decisions made.
Team Data & AI
Author(s) Roman Atachiants, Ph.D
About the Author(s) Roman is a seasoned software engineer at Careem, working in the Data & AI team and helping to build a state-of-the-art experimentation and feature toggles platform as well as a machine learning platform.

Introduction

In January 2021, amid the ongoing pandemic, we at Careem set out to reimagine and reimplement our existing feature toggling and experimentation systems.

Back then, we had a number of tools in use across the company: a few internal tools for feature toggling, one internal tool for experimentation, and a couple of third-party tools for web/mobile feature toggling and experimentation. That was a lot of tools all doing essentially the same thing: configuring our systems dynamically without requiring a restart or re-deployment!

While our primary goal was to actually build a trustworthy experimentation platform (more on that in later blog posts), we needed to have a solid foundation and an ability to inject those experiments dynamically across Careem’s ecosystem of backend services and applications.

This article describes the technical challenges we were trying to solve and the solution we adopted: treating dynamic configuration as code and building a reliable and scalable SDK to support dynamic configuration, feature toggling, kill switches, and experimentation.

Problems with Existing Systems

At Careem, we used several tools for backend experimentation, along with third-party tools. These tools did not interoperate well with each other and did not integrate well with our internal Data Warehouse. Moreover, engineers had to work with multiple SDKs, which unnecessarily increased cognitive load.

The table below shows a non-exhaustive list of the mismatching or missing capabilities and things we needed to support.

Capability Existing Internal Tools 3rd Party Tools
Unified feature toggles and experimentation ✔️
Used for both mobile and backend
Supports online segmentation ✔️ ✔️
Supports offline segmentation
Integrated with our Data Warehouse ✔️
Fault-tolerant on the backend
Captures real-time traffic and KPIs ✔️
Can be easily extended by our team ✔️

Our existing experimentation tool adopted client-side experimentation, but its SDK contained the business logic for parsing JSON definitions and serving the experiments. The complexity within that SDK (a) made it difficult to create SDKs for other backend systems and (b) would grow substantially with every new feature, while services using the SDK would require a re-deployment every time a new feature or bug fix was introduced into the SDK. This became unmanageable and increased our engineering costs.

There was also a lack of internal SDKs for mobile and web. The previous approach was simply to expose the SDK as a service over HTTP, with no end-user SDKs to make engineers’ lives easier and to provide the facilities that every user would require, such as persistent storage of the configurations on the device, circuit breakers, etc.

Client-Side SDK for Dynamic Configuration

We strongly believe that the “client-side” experimentation that Careem previously adopted for existing systems is the right approach for delivering online experimentation in a scalable and fault-tolerant way. That being said, in order to make the SDK easier to maintain, it should (a) not contain any significant business logic and (b) track every decision made and store it in the data warehouse for analytics and diagnostics.

As shown in the diagram above, we first decouple the SDK from business logic by making it akin to a simple scripting runtime which simply executes Lua code loaded dynamically from a predefined S3 bucket. This moves the logic of parsing and interpreting metadata to the backend and reduces the rate of change of the SDK itself, reducing the need to re-deploy when the Galileo team releases new features. It also increases fault tolerance by allowing us to set memory/stack/CPU limits even when the SDK is embedded into a backend service.

Our SDKs also send telemetry on every decision made back to a central service, from which it flows into our data warehouse. This needs to be done in a “best-effort” way: if the service is unavailable, the SDK must continue to function seamlessly, without affecting uptime or causing any outages. To ensure that we never run out of memory while the service is unreachable, the SDK holds pending telemetry in a simple, fixed-size circular buffer.
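As a rough illustration of this best-effort buffering, here is a minimal Python sketch. It is not Galileo’s actual implementation: the `TelemetryBuffer` class, its method names, and the `send` callback are hypothetical names chosen for this example, and the sketch is single-threaded for simplicity.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class TelemetryBuffer:
    """Fixed-size circular buffer: once full, the oldest event is dropped."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards items from the left when capacity is exceeded,
        # so memory use is bounded no matter how long the service is down.
        self._events: Deque[Dict] = deque(maxlen=capacity)

    def record(self, event: Dict) -> None:
        """Store one decision; never blocks and never grows past capacity."""
        self._events.append(event)

    def flush(self, send: Callable[[List[Dict]], None]) -> bool:
        """Try to deliver all buffered events; keep them if delivery fails."""
        batch = list(self._events)
        if not batch:
            return True
        try:
            send(batch)  # hypothetical transport, e.g. an HTTP or gRPC call
        except Exception:
            return False  # service unavailable: swallow the error, retry later
        self._events.clear()
        return True

    def __len__(self) -> int:
        return len(self._events)
```

With a capacity of 3, recording five decisions keeps only the last three, and a failed `flush` leaves them in place for the next attempt instead of raising into the host service.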

Unifying Feature Toggles & Experimentation

In order to create a “one-stop-shop” for experimentation and feature toggling, we first needed to unify feature toggles and experimentation conceptually and then provide the users with an easy to use and understandable user interface where both tech-savvy and non-technical users can create rollouts and experiments (given an approval flow, of course).

Feature toggles are a simple yet powerful technique that ultimately allows users to modify a system’s behavior without changing code. When you think about it, whether you “toggle a feature on/off”, “progressively roll out” or “run an A/B test” on a toggle, it really is a separation between the value (logic) and the toggle itself. For the sake of brevity, we’re not going to delve deeper into feature toggles and we suggest reading through Martin Fowler’s write-up on the subject.

In order to unify the two, we created a new abstraction called a “Variable” which is simply an alias for a value, drawing parallels to programming language design. For example, we could have a variable named “banner_color” which would allow end-users to control the color of the banner on the app through release toggles and experiments.

The diagram above shows what the unification looks like conceptually. A variable is associated with a set of configurations (i.e. values), which are then automatically translated by our backend system into executable code and shipped to the edge.

It’s important to note that experiments will always take precedence over the rollouts, as experiments are constrained in time as opposed to rollouts which are more long-term in nature. This way, if a simple feature toggle is active (i.e. a variable has one or more rollout values associated with it) a user can start a temporary experiment which will kick in when started and be removed from the configuration when finished. This allows for a smooth transition between static feature toggles and experiments.

In the diagram above, you can see two parts:

  1. On the left, the metadata of a variable, an experiment, and two rollout blocks. This metadata will be created by end-users via a web portal (or APIs) provided by Galileo. They represent conceptual building blocks for experimentation where for a single variable one can have multiple rollouts to various target segments and experiments.
  2. On the right side, the pseudo-code represents the code that would be auto-generated for our “banner_color” variable. It’s essentially a series of “if” statements executed every time the SDK requests to retrieve a value of a variable.
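To make the right-hand side more concrete, here is a minimal Python sketch of what such generated code could look like for “banner_color”. It is illustrative only: Galileo generates Lua rather than Python, and the segment conditions, colors, salts, and the `bucket` helper are assumptions made up for this example rather than the platform’s real output.

```python
import hashlib
from typing import Dict


def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def banner_color(ctx: Dict[str, str]) -> str:
    # Experiment block: evaluated first, so it takes precedence over rollouts.
    # Hypothetical 50/50 A/B test on users in one country.
    if ctx.get("country") == "AE":
        return "green" if bucket(ctx["user_id"], "exp-banner-1") < 50 else "blue"

    # Rollout block 1: a plain segment-based toggle.
    if ctx.get("platform") == "ios":
        return "red"

    # Rollout block 2: a progressive rollout to 20% of a segment.
    if ctx.get("country") == "PK" and bucket(ctx["user_id"], "rollout-banner-2") < 20:
        return "orange"

    # No configuration matched: fall back to the variable's default value.
    return "white"
```

Because the experiment block comes first, a user who matches both the experiment and a rollout segment receives an experiment variant. Once the experiment finishes and its block is removed from the generated code, the same user falls through to the rollout blocks again.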

Web Portal for Dynamic Configuration & Experimentation

In addition to making all of these SDK and backend decisions, it was also extremely important to us to make sure that the user experience and tooling were as frictionless as possible. While most configuration systems cater primarily to engineers, we wanted to allow product managers, data scientists, or anyone else who wanted to reconfigure the system in a certain way to do so without a three-month onboarding or a degree in computer science.

In order to get there, we have built a simple web UI where users can create and view variables and configure well-defined schemas and validation, as demonstrated in the screenshot below.

On top of this, we have added a relatively simple rollout interface, allowing users to configure various conditions and presenting them in a way that makes the complex boolean logic understandable.

To keep this article to the point, we’ll describe our UI in a separate article. Keep watching this space!

Unifying Mobile, Web & Backend Experimentation

As Careem transitioned into a Super App, the demand for mobile experimentation gradually increased over time. Hence, we needed to provide best practices and standard tools for everyone to use.

Mobile experimentation can be a bit trickier since it is necessary to make sure the experiments are protected from unwanted eyes. Hence, it is unwise to ship the full configuration of an experiment or feature toggle to the edge. Most importantly, there are several considerations with the mobile devices that need to be addressed specifically:

  • When running experiments or rolling out features on the mobile device it is important to make sure that users do not perceive any major visual flickering which would potentially negatively affect the user experience and increase support tickets. For example, static text and colors should not suddenly change just because someone started an experiment.
  • Mobile networks are often unreliable and telemetry needs to be stored locally on the device when the network connection is not available.

To that end, we exposed what is essentially an “SDK-as-a-Service” with both HTTP and gRPC APIs and provided a Kotlin SDK which (a) calls the service and retrieves the values of all necessary variables when the application is started, (b) caches variable values for the case where the service is unavailable, and (c) stores and sends the telemetry so that experimental results can be analyzed and rollouts can be tracked.
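The fetch-at-startup and cache-fallback behavior can be sketched as follows. This is a simplified Python illustration rather than the Kotlin SDK itself: `VariableClient`, its `fetch` callback, and the JSON cache file are hypothetical stand-ins for the real service call and on-device storage.

```python
import json
from typing import Callable, Dict


class VariableClient:
    """Resolves all variables once at startup, falling back to a local cache."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],  # hypothetical call to the service
        cache_path: str,                      # hypothetical on-device cache file
        defaults: Dict[str, str],
    ) -> None:
        self._fetch = fetch
        self._cache_path = cache_path
        self._values = dict(defaults)

    def start(self) -> str:
        """Load values once; returns which source was used."""
        try:
            fresh = self._fetch()
        except Exception:
            return self._load_cache()  # service unavailable
        self._values.update(fresh)
        with open(self._cache_path, "w", encoding="utf-8") as f:
            json.dump(fresh, f)  # persist for the next offline start
        return "service"

    def _load_cache(self) -> str:
        try:
            with open(self._cache_path, encoding="utf-8") as f:
                self._values.update(json.load(f))
            return "cache"
        except (OSError, ValueError):
            return "defaults"  # no usable cache: keep built-in defaults

    def get(self, name: str) -> str:
        # Values stay fixed after start(), so the UI does not flicker mid-session.
        return self._values[name]
```

Resolving values only in `start()` is one possible way to address the flickering concern above: a newly started experiment takes effect on the next application launch rather than changing what the user currently sees.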

Conclusions and Future Work

Looking back, the decision to treat dynamic configurations as code instead of some custom domain-specific language paid off. Using general-purpose Lua code, we were able to add functionality such as geofencing, new experimentation strategies, and app version comparisons completely in Lua, without requiring users and services to update the SDK, or even restart for that matter.
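App version comparison is a good example of logic that can live in configuration code rather than in the SDK. The sketch below shows the idea in Python for readability; in Galileo this logic is written in Lua, and the helper names `parse_version` and `version_at_least` are hypothetical. It assumes purely numeric, dot-separated version strings.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '5.10.0' into (5, 10, 0) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


def version_at_least(current: str, minimum: str) -> bool:
    """True if `current` is the same as or newer than `minimum`."""
    a, b = parse_version(current), parse_version(minimum)
    # Pad the shorter version with zeros so '5.9' and '5.9.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

A rollout condition could then gate a feature on something like `version_at_least(app_version, "5.9")`. Comparing numeric components matters here: a plain string comparison would wrongly rank "5.10.0" below "5.9".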

Similarly, tracking allowed us to build deep analysis and provide real-time traffic information directly in our UI, as demonstrated in the (anonymized) screenshots below.

However, it is not all roses, and there are a few downsides to using Lua code specifically for dynamic configuration. While it is performant enough for most cases, we noticed that as the number of variables on our SDK-as-a-Service grows, our memory footprint increases as well, since each variable requires a separate script and a separate VM to run it. This can be mitigated by using LuaJIT, which is a significantly faster and leaner virtual machine, or by exploring WebAssembly as an alternative.