
The gap between the data we store and the data we need

Gilad Kleinman
December 11, 2023

When we developers work on successful products, a couple of things usually happen:

  • The amount of data our products rely on grows. As more users adopt the product, the amount of data we need to store increases.
  • The product becomes more complex. Over time, as management requests additional features and enhancements, the backend architecture grows more complex, resulting in more complex database queries.

Although these two trends may seem obvious, they often have significant implications, especially for the performance of the databases we use. When a new feature is requested or a major new customer is onboarded, a very common question is “Will the DB handle this?”

Why our queries became slow: The gap between the data we store and the data we seek

Put philosophically, whenever one of those things occurs (more data is stored, or queries become more complex), the gap between the data we store and the data we seek becomes larger.

The more data we have, and the more complex the queries we run, the more work (disk reads, calculations, correlations, etc.) our database needs to do to compute the result from the data we stored. The more work our database needs to perform, the wider the gap becomes, and the longer it takes our database to provide a result.

If, for example, we are building a salary management system with a dashboard that shows the total sum of salaries in our company, and the number of salaries were to suddenly double, our database would need to perform twice the amount of work each time someone opens the dashboard. This would result in our database either using twice the resources it did before or taking twice the time it used to. Both options are, obviously, not ideal.
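For illustration, the dashboard query behind such a system might be nothing more than a simple aggregation (the table and column names here are hypothetical):

```sql
-- Hypothetical dashboard query: the total payroll of the whole company.
-- The database has to aggregate over every salary row each time
-- someone opens the dashboard.
SELECT SUM(amount) AS total_salaries
FROM salaries;
```

Double the rows in the salaries table, and the database does roughly double the work to return the same single number.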

Even in the scenario where the number of salaries does not grow, it is very likely that over time we would want to add more, or more complex, queries to the dashboard (such as changes in salaries over time, distribution graphs, etc.), again resulting in even more work for our database to compute what we want to show in the dashboard from the underlying data.

How do developers overcome these issues?

Usually, when facing these performance issues, developers reach for all the traditional methods (indexing, optimizing the queries, etc.) that we were taught to use to "accelerate" queries. All of them may improve performance in the short term, but they are usually just "patches" that do not hold up as data volumes and query complexity continue to grow.

Methods such as adding indexes, increasing the computing power of the database, and rewriting queries usually do shorten query times. However, since performance is still tied to data volume and query complexity, queries continue to slow down as the gap widens. A query optimized to run 10 times faster quickly becomes slow again once the data grows to 10 times its original size.
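A classic first move, for example, is adding an index on a frequently filtered column (hypothetical names again). It speeds up the filtered drill-downs, but the cost of the aggregation itself still grows with the data:

```sql
-- A typical "patch": index the column used by a frequent filter, so a
-- per-department drill-down no longer scans the whole table.
CREATE INDEX idx_salaries_department ON salaries (department);

-- The filtered query gets faster, but its cost still grows with the
-- number of rows it has to aggregate.
SELECT SUM(amount) AS department_total
FROM salaries
WHERE department = 'Engineering';
```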

Alternative methods such as materialized views, caches, or denormalizations, which pre-calculate the data we seek instead of just "accelerating" queries, also seem very promising at first, but quickly prove difficult to implement and maintain once you actually try to use them. Because the results are pre-calculated, every time the underlying data changes you need to recalculate the entire result from scratch to reflect the new data, which either costs expensive compute power or leads to stale data if you don't constantly refresh the results.
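In PostgreSQL, for instance, a materialized view gives you pre-calculated results, but keeping it fresh means re-running the whole query (a minimal sketch using the same hypothetical salaries table):

```sql
-- Pre-calculate per-department totals once...
CREATE MATERIALIZED VIEW salaries_by_department AS
SELECT department, SUM(amount) AS total_salaries
FROM salaries
GROUP BY department;

-- ...but every time the underlying data changes, keeping the view fresh
-- means recomputing the entire result from scratch (or serving stale data).
REFRESH MATERIALIZED VIEW salaries_by_department;
```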

Introducing Epsio

Epsio solves this problem by never trying to reprocess the entire dataset whenever a result of a query is needed or whenever the underlying data changes. Instead, Epsio pre-calculates results of defined queries and incrementally updates them whenever the underlying data changes, without ever recalculating the entire dataset.

By pre-calculating complex queries and incrementally maintaining their results, we can achieve the huge benefits of denormalizations and caches, where results are instant, without the need to invalidate or recreate results whenever the underlying data changes. For us, this method of executing complex queries was the only way to effectively bridge the gap between the data we stored and the data we sought, and to “prepare” our database for the specific queries we were about to execute.

How does this actually work?

To demonstrate how Epsio actually works, imagine the salary management system we mentioned before: a table with the salaries of all employees in a company, and a dashboard running a query that calculates the sum of salaries by department. If our imaginary system were used in a very large company, the salaries table might easily contain millions of records, meaning the dashboard would probably take multiple seconds or even minutes to load unless we optimized its queries extensively.
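Concretely, such a setup might look something like this (the schema is illustrative):

```sql
-- Illustrative schema for the salaries table.
CREATE TABLE salaries (
    employee_id BIGINT  PRIMARY KEY,
    department  TEXT    NOT NULL,
    amount      NUMERIC NOT NULL
);

-- The dashboard query: with millions of rows, this aggregation has to
-- touch every row of the table on every page load.
SELECT department, SUM(amount) AS total_salaries
FROM salaries
GROUP BY department;
```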

To use Epsio, we simply define our query with it. First, Epsio scans the entire base table and saves the initial result of the query to a result table.
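Defining the query looks roughly like the following sketch; the exact call syntax is Epsio's own and is described in its docs, so treat this as an illustration:

```sql
-- Register the dashboard query with Epsio (illustrative call; see the docs
-- for the precise syntax). Epsio scans the base table once and writes the
-- initial result into a result table in the original database.
CALL epsio.create_view('department_totals',
    'SELECT department, SUM(amount) AS total_salaries
     FROM salaries
     GROUP BY department');

-- The dashboard then reads from the always-fresh result table instead:
SELECT * FROM department_totals;
```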

Then, when new data arrives or existing data is updated or deleted, Epsio performs the minimum calculation needed to update the previous result; in this case, just adding to or subtracting from the previous sum per department.
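Conceptually, when a single row changes, the work required is on the order of one small update rather than a full re-aggregation. The following is only an illustration of the idea, not Epsio's internal code:

```sql
-- A new employee joins Engineering with a salary of 10,000:
INSERT INTO salaries (employee_id, department, amount)
VALUES (12345, 'Engineering', 10000);

-- Conceptually, the only work needed to keep the result correct is:
--   new_total('Engineering') = old_total('Engineering') + 10,000
-- i.e. a single row in the result table is touched, no matter how many
-- millions of rows the salaries table holds.
```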

Because this calculation is very efficient (compared to recalculating the entire query), we can take a complex query that originally took seconds or minutes to run and provide instant, always up-to-date results, all while using less compute power and without spending a second on “query optimizations”.

Although it’s easy to imagine how the above query could be incrementally updated when new data arrives, it turns out (as you can read in our previous blog post) that the same concept and way of thinking can be applied to much more complex queries and SQL operators, including JOINs, ORDER BYs, DISTINCT, and many more. This means that almost any complex, slow SQL query that runs frequently in our backends could benefit from this huge performance boost if processed “incrementally.”
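For instance, a query like the following, which joins a hypothetical departments table and keeps only the largest payrolls, could also be maintained incrementally, even though re-running it naively would repeat the join, aggregation, and sort every time:

```sql
-- A more involved dashboard query: payroll totals joined with department
-- metadata, keeping only the ten largest departments by total salary.
SELECT d.name, d.location, SUM(s.amount) AS total_salaries
FROM salaries s
JOIN departments d ON d.name = s.department
GROUP BY d.name, d.location
ORDER BY total_salaries DESC
LIMIT 10;
```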

Why Epsio is (not) a new database

Although our initial hunch when thinking about the architecture of Epsio was to build a new database, we quickly understood that it was important for our new engine to be “Easy to integrate, Easy to trust, and Easy to use”. When thinking about these three “E”s, we realized that building a new database simply doesn’t address any of them and probably isn’t the right architecture for us.

Specifically, we wished for an architecture where:

  • Zero migrations are required for use.
  • The engine could be tested on a small use case without affecting anything unrelated to that use case.
  • No additional single point of failure should be added. If the engine fails, our database should still be “queryable”.

To answer all of the above, we decided to implement an architecture where Epsio sits “behind” the existing database, receives the change stream (CDC/replication stream) for the relevant data, and writes back to “result tables” in the original database:
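On the database side, this kind of integration builds on standard replication features. As a rough sketch, assuming a PostgreSQL source (the publication name is just an example, and the Epsio-side configuration is described in its docs):

```sql
-- Standard PostgreSQL logical replication setup; nothing here is
-- Epsio-specific, and the publication name is illustrative.
ALTER SYSTEM SET wal_level = 'logical';   -- takes effect after a restart

-- Publish changes for the tables Epsio should follow.
CREATE PUBLICATION epsio_pub FOR TABLE salaries;
```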

In this architecture, unlike a new database:

  • No migrations are required to use Epsio.
  • Developers can try Epsio on a small use case without it affecting the rest of the database, since Epsio is not in the hot path.
  • In the (very bad) scenario where the Epsio instance fails, other queries in the database are not affected, and even Epsio’s result tables continue to be accessible, although their results might be stale.

All of the above means we can continue working with the databases we already love and trust, while still enjoying the significant benefits of stream processing — or as the great Hannah Montana used to say: “You get the best of both worlds.”

So, what’s the catch?

Although we truly believe that the “incremental” way of thinking can really change the way we process and query our data for the better, at the end of the day, as with everything else in the world of databases (or software in general), everything is a tradeoff.

Some queries that change drastically between runs (ad-hoc queries), or that query data that changes drastically between runs, might be less efficient in Epsio. In the world of tradeoffs you can never solve all queries, but if we can still solve a huge portion of them, we should be happy with the big “incremental” (no pun intended) change it offers in letting developers focus more on building logic and less on optimizing queries.

Concluding thoughts

After spending way too much time optimizing complex queries and trying to scale them as our products scale, we hope that Epsio will serve as a new paradigm to better bridge our applications and our databases — and to ensure that our databases perform only the minimum required operations to meet our complex application needs.


If you enjoyed this blog post and wish to try Epsio’s blazingly fast (but actually just cheating) SQL engine, check out our docs here.
