
Why We Built a Streaming SQL Engine

Gilad Kleinman
December 11, 2023

Following our previous blog post “How we built a Streaming SQL Engine,” we are delighted to share some of the reasoning and thoughts that led us to build Epsio (as we always say — start with how, only later ask why).

Why we built Epsio

For most of our adult lives, my co-founder Maor and I have been designing and building software products, scaling them from mere ideas into fully fledged solutions serving thousands, and later tens of thousands, of users.

Two interesting (though obvious) things almost always happened as these projects matured:

  • The amount of data our products relied on grew. The more users we had, the more data we had to store.
  • Our products became more complex. Over time, management would request additional features and enhancements, resulting in a more complex backend architecture and more complex database queries.

Although these two trends may seem obvious, they always had severe implications, particularly for the performance of the databases we were using. Whenever a new feature was requested or a big new customer was onboarded, we remember asking ourselves: “will the DB handle this?”

Why our queries became slow: The gap between the data we store and the data we seek

Put philosophically, whenever one of those things happened (more data was stored, or queries became more complex), the gap between the data we stored and the data we sought grew wider.

The more data we had, and the more complex the queries we ran, the more work (disk reads, calculations, correlations, etc.) our database needed to do to compute the result from the data we stored. And the more work our database needed to do, the wider the gap became, and the longer it took the database to return a result.

If, for example, we were building a salary management system with a dashboard that shows the total sum of salaries in our company — and the number of salaries in the company were to suddenly double, our database would need to perform twice the amount of work each time someone opens the dashboard. This would result in our database either using twice the resources it did before or taking twice the time it used to. Both options are, obviously, not ideal.

Even in the scenario where the number of salaries doesn’t grow, it is very likely that over time we would want to add more, or more complex, queries to our dashboard (such as changes in salaries over time, distribution graphs, etc.), again resulting in even more work for our database to calculate what we want to show in the dashboard from the underlying data.

How have we dealt with these performance issues until now?

Usually, when facing these performance issues, we would try all the traditional methods (indexing, query optimization, etc.) that we were taught to “accelerate” our queries. All of them might improve performance in the short term, but usually these solutions were just “patches” that didn’t hold water as data volumes and query complexity continued to grow.

Methods such as adding indexes, increasing the database’s compute power, and rewriting queries would usually indeed shorten query times. However, since performance would still be correlated with data volume and query complexity, our queries would continue to slow down as the gap continued to widen. A query optimized to run 10 times faster would very quickly become slow again once the data grew 10 times larger.
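For example, a first instinct for the salary dashboard above might be an index like the one below (hypothetical table and column names). It speeds up per-department lookups, but an aggregation still has to read every matching row, so its cost keeps growing with the data:

    -- Hypothetical schema: a typical "patch" for slow salary queries.
    -- Filtering by department gets faster, but a query that sums millions of
    -- matching rows still does work proportional to the amount of data stored.
    CREATE INDEX idx_salaries_department ON salaries (department);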

Alternative methods such as materialized views, caches, or denormalizations that pre-calculated the data we seek instead of just “accelerating” queries would also seem very promising at the beginning, but would quickly prove difficult to implement and maintain when we actually tried using them. Given that the results were pre-calculated, every time the underlying data changed, we would need to recalculate the entire result from scratch to reflect the new data, which would cost us expensive compute power or lead to stale data if we didn’t constantly update the results.
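In Postgres, for instance, the classic pre-calculation route for the dashboard total looks roughly like this (hypothetical table and column names); note that the refresh recomputes the whole result, which is exactly the maintenance cost described above:

    -- Pre-calculate the heavy aggregation once (hypothetical schema).
    CREATE MATERIALIZED VIEW total_salaries AS
    SELECT SUM(salary) AS total_salary FROM salaries;

    -- Reads are now instant:
    SELECT total_salary FROM total_salaries;

    -- ...but whenever the underlying data changes, the entire view must be
    -- recomputed from scratch (or it keeps serving stale results until then).
    REFRESH MATERIALIZED VIEW total_salaries;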

Introducing Epsio

Epsio solves this problem by never reprocessing the entire dataset, neither when a query result is needed nor when the underlying data changes. Instead, Epsio pre-calculates the results of defined queries and incrementally updates them whenever the underlying data changes, without ever recalculating from scratch.

By pre-calculating complex queries and incrementally maintaining their results, we can achieve the huge benefits of denormalizations and caches, where results are instant, without the need to invalidate or recreate results whenever the underlying data changes. For us, this method of executing complex queries was the only way to effectively bridge the gap between the data we stored and the data we sought, and to “prepare” our database for the specific queries we were about to execute.

How does this actually work?

To demonstrate how Epsio actually works, imagine the salary management backend we mentioned before. The system contains a table with the salaries of all employees in a company, and a dashboard running a query that calculates the sum of salaries by department. If our imaginary system were used in a very large company, our salaries table might easily contain millions of records, meaning that our dashboard would probably take multiple seconds or minutes to load unless we optimized its queries extensively.
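Concretely, the dashboard query might look something like this (hypothetical table and column names):

    -- Without Epsio, this aggregation scans and groups millions of salary rows
    -- every single time the dashboard loads.
    SELECT department, SUM(salary) AS total_salary
    FROM salaries
    GROUP BY department;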

To use Epsio, we simply define our query in it. First, Epsio scans the entire base table and saves the initial result of the query to a results table:
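Purely as an illustration of what such a definition can look like (the view name here is made up, and the exact call syntax is described in Epsio’s docs, so treat this as a sketch rather than the real interface):

    -- Illustrative sketch only: assumed call shape and names.
    CALL epsio.create_view(
      'salary_sums_by_department',
      'SELECT department, SUM(salary) AS total_salary
       FROM salaries
       GROUP BY department'
    );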

Then, when new data arrives or existing data gets updated or deleted, Epsio performs the minimum calculation needed to update the previous result. In this case, that means just adding to or subtracting from the previous sum for the affected department:
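Conceptually (this is a simplification for illustration, not Epsio’s actual internals), a single new salary row only touches one department’s running sum in the results table:

    -- Conceptual sketch: a newly inserted salary of 7,500 in the R&D department
    -- requires updating a single result row, not re-aggregating the millions of
    -- rows that were already processed.
    UPDATE salary_sums_by_department
    SET total_salary = total_salary + 7500
    WHERE department = 'R&D';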

As the above calculation is very efficient (compared to recalculating the entire query), we are able to take a complex query that originally took seconds or minutes to run and provide instant, always up-to-date results, all while using less compute power and without spending a second on “query optimizations”.
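The dashboard itself then stops running the heavy aggregation and simply reads the small, always up-to-date results table (again using the hypothetical names from above):

    -- The dashboard now reads a handful of pre-maintained rows instead of
    -- aggregating millions of raw salary records.
    SELECT department, total_salary
    FROM salary_sums_by_department;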

Although it’s easy to imagine how the above query could be incrementally updated when new data arrives, the same concept and way of thinking (as you can read in our previous blog post) can be applied to much more complex queries and SQL operators such as JOINs, ORDER BYs, DISTINCT, and many more. This means that almost any complex, slow SQL query that runs frequently in our backends could benefit from this huge performance boost when processed “incrementally.”

Why Epsio is (not) a new database

Although our initial hunch when thinking about Epsio’s architecture was to build a new database, we quickly realized that it was important for our new engine to be “Easy to integrate, Easy to trust, and Easy to use”. Thinking about these three “E”s, we understood that building a new database simply didn’t address any of them, and probably wasn’t the right architecture for us.

Specifically, we wished for an architecture where:

  • Zero migrations are required for use.
  • The engine could be tested on a small use case without affecting anything unrelated to that use case.
  • No additional single point of failure should be added. If the engine fails, our database should still be “queryable”.

To answer all of the above, we decided to implement an architecture where Epsio sits “behind” the existing database, receives the change stream (CDC/replication stream) for the relevant data, and writes back to “result tables” in the original database:

In this architecture, unlike a new database:

  • No migrations are required to use Epsio.
  • Developers can try Epsio on a small use case without it affecting the rest of the database. Epsio is not in the hot path.
  • In the (very bad) scenario where the Epsio instance fails, other queries in the database are not affected, and even Epsio’s result tables continue to be accessible, although their results might be stale.

All of the above means we can continue working with the databases we already love and trust, while still enjoying the significant benefits of stream processing — or as the great Hannah Montana used to say: “You get the best of both worlds.”

So, what’s the catch?

Although we truly believe that the “incremental” way of thinking can change how we process and query our data for the better, at the end of the day, as with everything else in the world of databases (or software in general), everything is a tradeoff.

Queries that change drastically between runs (ad-hoc queries), or that read data which changes drastically between runs, might be less efficient in Epsio. In a world of tradeoffs you can never solve every query, but if we can solve a huge portion of them, we should be happy with the big “incremental” (no pun intended) change it offers: letting developers focus more on building logic and less on optimizing queries.

Concluding thoughts

After spending way too much time optimizing complex queries and trying to scale them as our products scaled, we hope that Epsio will serve as a new paradigm that better bridges our applications and our databases, ensuring that our databases perform only the minimum operations required to meet our complex application needs.


If you enjoyed this blog post and wish to try Epsio’s blazingly fast (but actually just cheating) SQL engine, check out our docs here.

