We recently published an article that gained a bit of traction about how we built a streaming engine, and though the concepts therein received much praise, many members of the community expressed confusion about how Epsio is different from:
- Snowflake Dynamic Tables
- BigQuery Incremental Views
- Oracle Fast Refresh
- MSSQL Indexed Views
- Database Triggers
I won’t be going through every similar product, but I’d like to break down the so-called magic bullet that is “incremental views”: what inherent tradeoffs exist, and where (and why) Epsio positions itself.
A briefer: Why is everyone so interested in incremental views?
Incremental views allow you to define a query once and have its results stay up to date without ever being recalculated, even as the underlying data changes. Imagine those costly queries that make your users wait multiple seconds and absolutely eat up your database’s resources (we’ve even been privy to dastardly situations where queries take upwards of 30 minutes). With incremental views, those same queries suddenly take milliseconds with almost no compute, all with one simple command: CREATE INCREMENTAL VIEW SELECT ... FROM ... .
Here’s a brief explanation from our other article about how this works:
A streaming SQL engine keeps query results up to date without ever having to recalculate them, even as the underlying data changes. To explain this, imagine a simple query, such as SELECT count(*) FROM humans. A normal SQL engine (such as Postgres’s or MySQL’s) would need to go over all the different humans every time you ran that query, which could be quite costly and lengthy given our ever-changing population count. With a streaming SQL engine, you would define that query once, and the engine would constantly keep the resulting count up to date as new humans were born and old ones died off, without ever recounting all the humans in the world. In comprehensive streaming engines this is possible with every sort of query, no matter the complexity.
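The contrast can be sketched in a few lines of Python. This is purely illustrative (neither engine's real internals, and none of these names are Epsio's API):

```python
# Illustrative sketch: batch recomputation vs. incremental maintenance
# of SELECT count(*) FROM humans.

humans = []  # the base table


def batch_count(table):
    # A batch engine rescans the full table every time you query.
    return sum(1 for _ in table)


class StreamingCount:
    # A streaming engine stores the running result and updates it from
    # the change stream, never rescanning the base data.
    def __init__(self):
        self.result = 0

    def on_insert(self, row):
        self.result += 1

    def on_delete(self, row):
        self.result -= 1


view = StreamingCount()
for person in ["ada", "bob", "eve"]:
    humans.append(person)   # a birth...
    view.on_insert(person)  # ...flows to the view as a change event

humans.remove("bob")
view.on_delete("bob")

# Both approaches agree; only the cost profile differs: batch_count is
# O(table size) per read, the streaming view is O(1) per change.
assert view.result == batch_count(humans) == 2
```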
This can be a huge improvement with very little overhead. Furthermore, in big-data environments, it has huge cost implications: every programmer who has worked with BigQuery or Snowflake shudders at the words “Full Table Scan”. Incremental views eliminate them; your costs correlate with your rate of change, not with the overall amount of data in your system.
So, given that incremental views are interesting, let’s talk about a few categories of them, and where Epsio sees itself in comparison.
First Category — The Databases/Warehouses Themselves: Batch Processors in Disguise
Oracle, MSSQL on the DBMS side, BigQuery and Snowflake on the warehouse side.
From a UX standpoint, it would have been excellent if you could just define incremental views in these engines instead of needing another solution. Each of the above databases/warehouses attempted, in its own way, to give you some ability to create incremental views without using another service.
The problem is that under the hood, these databases have planning and query engines that were built from the ground up to be batch processors. Whenever queried, these databases look at a snapshot of their complete set of data and calculate a result. They do not deal with a stream of changes (e.g. “Deal with this insert. Deal with this update. Deal with this delete.”), but with constant data (e.g. “Deal with this static dataset”). Their queries are heavily optimized to be “short-lived” (a 30-minute query is considered incredibly long) and to produce one single result set. A stream processor’s queries normally live forever, and continuously produce new results as changes come in. This affects every single part of how they implement both the planning and the execution of a query. The language between the query nodes is different: batch engines pass data between their nodes, and have no way to pass deletions or updates. The optimization tradeoffs are different: batch processors usually need much more compute and the ability to go over large amounts of base data quickly, while stream processors need little compute and lots of storage for intermediary representations of data. Even the flow of data is often different: in batch processors, the query nodes normally “pull” data from the nodes above them, while streaming processors normally work in “push” mode (which means different communication channels altogether).
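To make the “language between the query nodes” difference concrete, here’s a toy sketch (not any real engine’s internals) of a streaming SUM operator. It consumes signed deltas pushed from upstream, where an update is modeled as a retraction of the old row plus an insertion of the new one, something a batch operator has no vocabulary for:

```python
# Toy streaming operator: maintains SUM(amount) from a stream of
# (amount, diff) changes, where diff is +1 for insert, -1 for delete.
# An UPDATE is expressed as a retraction followed by an insertion.

class SumNode:
    def __init__(self, downstream):
        self.total = 0
        self.downstream = downstream  # "push" mode: we call downstream

    def push(self, amount, diff):
        self.total += amount * diff
        self.downstream(self.total)   # emit the new result immediately


results = []
node = SumNode(results.append)

node.push(100, +1)   # INSERT 100           -> total 100
node.push(40, +1)    # INSERT 40            -> total 140
node.push(40, -1)    # UPDATE 40 -> 55:     retract old value
node.push(55, +1)    #                      insert new value -> total 155
node.push(100, -1)   # DELETE 100           -> total 55

assert results == [100, 140, 100, 155, 55]
```

Note that each change is O(1) work, and results flow downward without any node ever asking to rescan base data.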
But the proof is in the pudding. Oracle, MSSQL, and BigQuery have endless constraints on what you can do in their incremental views; the easiest example is that none of them supports Outer Joins (Left/Right, etc.). And it makes sense; when reading through the three-page essay that is the list of constraints on these databases, you get the monumental feeling that they’ve created a mountain of patches on their existing execution engines to support this, with the expected effects on their user-facing capabilities. This is by no means a bashing of those databases; they’re absolutely incredible at what they do. Stream processing just isn’t what they do.
Snowflake, by the way, is a bit more interesting, with their dynamic tables released as a preview in June. They’re vaguer about what they actually support, and purposefully did not call them incremental views. If they do not support your query, they’ll just do a full refresh behind the scenes (and charge you as such). Playing with them, and talking with some companies that tried to use them, tells us that they’re valuable as a slightly smarter query refresher, but by no means a streaming solution.
Second Category — Stream Processors
Materialize, ReadySet, Flink, Epsio, etc.: all built from the ground up to be stream processors, all very efficient; Materialize and ReadySet were built by pioneers in the academic space.
While on the surface they may look similar, there are a multitude of tradeoffs and questions that arise as you start looking at each one in depth.
1. Does this stream processor give “correct” answers?
This may seem like an obvious question, but there are actually a variety of consistency models in databases. A consistency model is a sort of “contract” in which the database gives you certain guarantees about how writes will be taken into account in reads. The strongest level of consistency guarantees that every write will be taken into account in subsequent reads; all your average SQL databases (Oracle, MySQL, Postgres, etc.) give you this. Most stream processors do not; as a matter of fact, none of the above (Materialize, ReadySet, Flink, and Epsio) do. What they do guarantee is that at a certain point in time after writing (maybe a few milliseconds, maybe a few seconds) the inserted record will be taken into account.
For most use cases, eventual consistency of 50ms is something you can swallow. A much bigger issue is if the engine is not internally consistent. Internal consistency means that every answer given was true at some point in time. Some systems, like Flink, are not internally consistent, which is usually a no-go from the get-go for companies. Imagine an alert going out about an event that never happened, or a user opening your dashboard at just the wrong time and seeing that his balance is below zero. This is true of other streaming engines as well, and something I highly advise checking before picking a streaming engine.
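The below-zero-balance scenario is easy to reproduce in miniature. This is a toy illustration of the failure mode, not a claim about any specific engine’s implementation: a transfer of 50 between two accounts arrives as two change events, and a reader observes the view between them, seeing a total that was never true in the source database.

```python
# Toy illustration of an internal consistency violation: a single
# transfer is observed half-applied, exposing a state that never
# existed in the source database.

accounts = {"checking": 30, "savings": 40}


def total_balance():
    return sum(accounts.values())


# The transfer is two change events. An internally consistent engine
# would only expose results at points that actually existed upstream;
# here the reader slips in between the two events.
accounts["checking"] -= 50   # debit applied first (checking goes to -20)
observed = total_balance()   # reader sees 20: a state that never existed
accounts["savings"] += 50    # credit applied later

assert observed == 20            # the impossible state was observable
assert total_balance() == 70     # the final result is still correct
```

Eventual consistency means the last assertion always holds in the end; internal consistency is what prevents the middle one from ever being observable.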
2. How much cost/risk/time is inherent in trying to integrate them?
To answer this per stream processor, we’ll need to understand their mode of integration. Some stream processors are built to be more “standalone”, and some are meant to complement existing systems. If you already have a database/data warehouse, standalone stream processors mean replacing your current database. Materialize, for example, is an incredible streaming engine, highly geared towards realtime latency in big-data environments. They’re a good example of tending towards a standalone engine (and you can see how this has become more and more the case over time), looking to be a replacement for existing data warehouses (or ETL processes). A few others I’d throw into the standalone category are Arroyo (serverless stream processing) and Feldera (a data warehouse, built upon the uber-cool, bleeding-edge DBSP). They’re an excellent choice if you want to build something new (say, a new warehouse), but can carry too-heavy integration costs if you’re looking to optimize existing queries/materialized views, since you’ll probably need to move your data warehouse to use only them if you don’t want to end up managing two data warehouses. A quote I really like: “replacing your database/data warehouse is like trying to replace a car’s engine while driving”.
For stream processors that complement current systems, the question becomes: what risks does this new platform pose architecturally, and how do I integrate with it?
Here ReadySet is very interesting. They sit as a proxy between your server and your database in order to offer partial materialization. Sitting as a proxy has both upsides and downsides; ReadySet can intercept queries and automatically reroute them to their corresponding incremental views (that you define in their portal), thereby taking away the need to make any code changes. The downside is that adding a proxy to your database is adding another Single Point of Failure to your entire system. If ReadySet goes down, the connection from your backend to your database goes down with it. The irony is that although ReadySet is very easy to integrate from a code standpoint, it would take a courageous architect to add a proxy without extensive testing; it makes it more difficult to check a specific use case and grow from there.
So where is Epsio in all this?
Our aim is to give a seamless UX within your existing database, while minimizing risk & integration time. Our belief is that while the implementation requires a different solution than a batch processor, the user experience ought to be as though it were available within your very own database.
From an architectural standpoint, Epsio sits “behind” your database as another replica, and writes back to your existing database. This allows your backend to converse with your database without ever knowing about Epsio. If Epsio were to go down, your backend would be unaffected, and even Epsio’s own incremental views would return results, albeit “stale”. From a risk standpoint, this can be crucial for integrating a new technology; architects are often extremely risk-averse when it comes to production environments, and this architecture allows you to try out Epsio on one use case without risking others.
To give a “seamless” UX, you never interact with Epsio directly; instead, you call functions that Epsio creates within your existing database (such as “epsio.create_view” or “epsio.alter_view”). These functions are transaction-safe, which can be critical for the workflow of most companies, who want to create/alter views within migrations.
Lastly, there’s an inherent tradeoff between speed and cost that stream processors have to deal with. Epsio is built from the ground up to work well with storage, vastly reducing costs compared to in-memory engines. If you want sub-millisecond latency no matter the cost, Epsio may not be the best engine. But if you’re looking for high throughput at ~50ms latency, Epsio delivers.
At a more “meta” level, we believe incremental view theory has finally reached the point where the concern is no longer support for SQL operators, or making things that bit faster, but how well the engine ties into your current architecture, in both speed of integration and the cost of the actual product. This is thanks to the incredible work done by the folks who built Noria, Differential Dataflow, DBSP, and countless other streaming methodologies.
Epsio is built to bring the best of the world of theory while optimizing for seamless, low-time/low-risk integration and low-cost resources. Can’t wait to see you!