
On the Future of Stream Processing

June 17, 2025
Gilad Kleinman

Before starting Epsio a couple of years ago, I used to spend countless hours learning about creative and innovative ways companies were making data access and querying faster and more efficient. Looking at rising products in that space — whether it’s ClickHouse, DuckDB, or others — I was consistently amazed by how much magic and innovation these tools packed in while still abstracting away the complexity and keeping the interface extremely simple.

The streaming space, however, told a completely different story. While technical innovation was absolutely booming in this area, it somehow felt like the technology was becoming increasingly complex and costly to use as the underlying theory and tech evolved.

Streaming is a high-maintenance, complex bitch that offers real-time data and scalability, but it’s costly and a pain to manage; batch processing is your reliable, easy-to-handle workhorse, less resource-intensive but slower and less sexy — choose based on whether you want a race car or a minivan.

(Ill-Valuable6211 on Reddit, “Tradeoffs between batch processing and streaming?”)

A Death By A Thousand Components & Configurations

In a batch processing world, I could spin up a database and get started pretty quickly. In a stream processing architecture, to build a simple feature that counted rows in a PostgreSQL database, I “only” needed:
- Debezium for CDC
- Avro schema registry for event schemas
- Kafka for change queue
- Flink to run the aggregation
- JDBC Sink to write back the results to PostgreSQL

And, of course — digging deep into many (many) Flink / Kafka / Debezium configurations to make all the components work together, and learning many new concepts like watermarks, checkpoints, etc…
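To make that concrete, here is roughly what just the Flink piece of that “count rows” feature looks like, as a minimal PyFlink sketch. The table names, topic, and connection strings are hypothetical, the connector options assume a recent Flink version with the right connector jars on the classpath, and it uses the debezium-json format for brevity (with the Avro registry you’d swap in the Confluent Avro format) — and it still assumes Debezium, Kafka, and the registry are already deployed and wired together:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: Debezium change events for a (hypothetical) "orders" table, read from Kafka
t_env.execute_sql("""
    CREATE TABLE orders (
        id BIGINT,
        status STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'pg.public.orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

# Sink: write the running counts back to PostgreSQL over JDBC (credentials omitted)
t_env.execute_sql("""
    CREATE TABLE order_counts (
        status STRING,
        cnt BIGINT,
        PRIMARY KEY (status) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://postgres:5432/app',
        'table-name' = 'order_counts'
    )
""")

# The actual "feature": a continuously maintained count per status
t_env.execute_sql("""
    INSERT INTO order_counts
    SELECT status, COUNT(*) AS cnt
    FROM orders
    GROUP BY status
""")
```

The SQL itself is a one-liner — everything around it (and the three other systems it depends on) is the actual work.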

(Although from a different topic, a great portrait of how it feels to build streaming pipelines)

Why Is Streaming Inherently Hard?

Fundamentally, I do believe stream processing is inherently harder and more complex than batch processing, for two main reasons:

1 — Stream processors are highly coupled to other data products

By definition, a stream processor is meant to ingest data, transform it, and push it to downstream data products (databases / data stores) — not to be the final “endpoint” that other systems (backend / BI / end users) pull data from.

Unlike a database (or batch processor), which can usually be a full “end-to-end” data solution — ingesting, transforming, and serving data to the backend — a stream processor can almost never be implemented without an additional data product (a database / batch processor) that enables users to access, index, and search over its output. A stream processor without a database after it is like email without email inboxes — the data flows, but there’s no easy way to view or interact with anything after it’s sent.

This means that, unlike the batch processor (which doesn’t need any other component to serve as a full end-to-end data product), the stream processor depends on — and must couple and integrate itself with — numerous other data products for users to be able to benefit from its value.

If your backend reads from and writes to a specific PostgreSQL instance (on a specific PostgreSQL version), even the best stream processor in the world probably won’t be helpful unless it can properly ingest from and sink data to that specific instance and version.

Similarly — if your database has a complex partitioned schema that involves many ENUMs and custom columns — no stream processor in the world will be usable unless it can properly handle a new partition being added to your database, and can process and transform all of your custom-typed columns and ENUMs.

Jay Kreps' original vision for Kafka — https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

As living testimony to this — some of the hardest bugs we ever encountered while building Epsio were actually bugs in other data products we integrated with. Since we needed to rely so heavily on the specific format and behavior of each data store, the smallest of glitches in a source / destination database could easily translate into a glitch in Epsio.

2 — Stream processors are batch processors, plus more

Fundamentally, batch processing is just a “subset” of stream processing. A batch processor only works on the CURRENT data you have. A stream processor must both process the CURRENT data you have and “prepare” (build structures) for any FUTURE data that arrives.

A great example of this difference is how JOIN operations are handled (specifically, hash joins). In batch processing, since the system knows it will never receive additional data, it can optimize the JOIN by building a hash table only for the “smaller” side of the join — say, the left side — and then iterate over the current rows of the larger side (the right) in no particular order, looking up the corresponding rows in the other side’s hash table as it goes.

A stream processor, in contrast, needs to build not only a hash table for the smaller side, but ALSO a hash table for the larger side of the JOIN — to make sure it can quickly look up corresponding rows if any future change is made to the smaller side of the JOIN. It needs to build all the structures a batch processor needs for the same operation, PLUS MORE.

(Some stream processing systems — or similar systems like TimescaleDB’s continuous aggregates — mitigate this overhead by limiting support for processing updates to only one side of the join).
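To make the asymmetry concrete, here is a small Python sketch — pseudocode-level, not any particular engine, and ignoring deletes/retractions — of the two approaches. The batch join indexes only one side and then discards the other as it streams past; the streaming (symmetric) join must index and retain both sides, because either side may change later:

```python
from collections import defaultdict

def batch_hash_join(small_rows, large_rows, key):
    """Batch hash join: index only the smaller side, then stream the larger side over it."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)
    for row in large_rows:                      # iterated once, in any order, never stored
        for match in index[row[key]]:
            yield {**match, **row}

class StreamingHashJoin:
    """Streaming (symmetric) hash join: BOTH sides are indexed and kept around,
    since a future change on either side must be matched against the other."""
    def __init__(self, key):
        self.key = key
        self.left_index = defaultdict(list)
        self.right_index = defaultdict(list)

    def on_left(self, row):
        """A new left-side row arrived: remember it, emit any matches seen so far."""
        self.left_index[row[self.key]].append(row)
        return [{**row, **other} for other in self.right_index[row[self.key]]]

    def on_right(self, row):
        """A new right-side row arrived: remember it, emit any matches seen so far."""
        self.right_index[row[self.key]].append(row)
        return [{**other, **row} for other in self.left_index[row[self.key]]]
```

Handling UPDATEs and DELETEs would require retractions on top of this, which only adds to the “plus more”.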

What streaming JOINs look like

Even after building these additional structures, stream processors still must “take care” of edge cases batch processors don’t need to worry about. For example, stream processors often must guarantee that the order in which they receive data correlates with the order in which they output data. While a batch processor receives a single batch of data and outputs a single batch, a stream processor that constantly receives a stream of changes needs to make sure the output stream of changes matches the order of the input stream (imagine how hard that can be when trying to process that stream in parallel!). A row that got INSERTed and then DELETEd must never be DELETEd in downstream systems before it got INSERTed.
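One common way to keep that guarantee while still parallelizing — a sketch of the general technique, not any specific engine’s implementation — is to route every change by its primary key, so all changes to the same row land on the same worker in their original order (cross-row and transactional ordering need more machinery on top of this):

```python
from hashlib import sha1

NUM_WORKERS = 4  # hypothetical parallelism

def worker_for(change):
    """Route a change by its primary key: an INSERT and a later DELETE of the
    same row always go to the same worker, in the order they arrived."""
    digest = sha1(str(change["primary_key"]).encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

changes = [
    {"op": "INSERT", "primary_key": 42, "row": {"id": 42, "status": "new"}},
    {"op": "DELETE", "primary_key": 42, "row": {"id": 42, "status": "new"}},
]
for change in changes:
    print(worker_for(change), change["op"])    # both land on the same worker
```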

As evidence of a stream processor being “a batch processor, plus more”, you can frequently find stream processing systems that also have support for batch queries (e.g., Flink, Materialize, and Epsio soon ;)), while no batch processor supports streaming queries. To build a batch processor — simply take a stream processor and remove some components!

How Were These Difficulties Overcome Until Today?

Historically, the creator of Kafka (Jay Kreps) talked about tackling the difficulties of stream processing by breaking down these huge complexities into many small “bite-sized” components, thereby making each one easier to build: “By cutting back to a single query type or use case each system is able to bring its scope down into the set of things that are feasible to build”. Specifically, he portrayed a world where many “small” open source tools play together to form the complete “data infrastructure layer”:

Data infrastructure could be unbundled into a collection of services and application-facing system apis. You already see this happening to a certain extent in the Java stack:

- Zookeeper handles much of the system co-ordination (perhaps with a bit of help from higher-level abstractions like Helix or Curator).
- Mesos and YARN do process virtualization and resource management.
- Embedded libraries like Lucene and LevelDB do indexing.
- Netty, Jetty and higher-level wrappers like Finagle and rest.li handle remote communication.
- Avro, Protocol Buffers, Thrift, and umpteen zillion other libraries handle serialization.
- Kafka and Bookeeper provide a backing log.

If you stack these things in a pile and squint a bit, it starts to look a bit like a lego version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems. This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve. If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.

Even though this architecture definitely succeeded in improving the “specialization” of specific components and advanced the capabilities of the streaming ecosystem dramatically — it kind of feels like somewhere along the road, the end user who was supposed to be safeguarded from this implementation choice (“This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented”) ended up paying the price.

In today’s world, there is nearly no technical barrier to great stream processing. Whether it’s scale, complexity, or robustness — nearly any application can be built with the tools the market has to offer. The interesting question, though, is: at what cost?

Even setting aside the sheer number of components users must manage — Debezium, Kafka Connect, Kafka, Flink, Avro Schema Registry, and more — the internal knowledge required to operate these tools effectively is staggering. Whether it’s understanding how Flink parallelizes jobs (does an average user need to understand how PostgreSQL parallelizes queries?), thinking about the subtleties of checkpointing and watermarks, or figuring out how to propagate schema changes to transformations — the end user is definitely not abstracted away from the internal implementation choices.

A New Generation Of Stream Processors

Innovation is dirty. And perhaps this historical unbundling and complexity had to take place in order for us to reach where we are today and overcome all these difficulties. But similar to the evolution of batch processors — where the “dirty” and “complex” innovation of Oracle had to come first to allow the rise of the easy-to-use PostgreSQL — a new generation of stream processors has risen in the past couple of years that focuses its innovation on the end user, not just on technical capabilities.

Similar to the movement from Oracle to PostgreSQL — stream processors in this new generation are not necessarily less internally “complex” than the previous ones. They compound all the “internal” innovation the previous generation had, while somehow still abstracting away that innovation from the end user.

In this new generation — simplicity is the default and complexity a choice. End users can dive deep into the complexities and configurations of the stream processor if they wish to — but always have the option to use default configurations that “just work” for 99% of use cases.

For this new generation of stream processors to be built, a radical change in the underlying technology had to happen. The endless process of stitching and gluing together generic old components was broken — and a new, much more holistic paradigm had to be built.

New stream processing algorithms with much stronger end-to-end guarantees (differential dataflow and DBSP, with their internal consistency promises) are used, and top-tier replication practices inspired by great replication tools like PeerDB, Fivetran, etc. are incorporated.

In this new generation of stream processors, each component is designed specifically for its role in the larger system — and with strong “awareness” of all the external systems and integrations it must serve.

The concept of “database transactions,” for example, is enforced in all internal components in order to abstract away ordering and correctness issues from the user. While stream processors in the old generation (Debezium, Flink, etc.) by default lose the bundling of transactions when processing changes (meaning large operations performed atomically on source databases might not appear atomically on sink databases) — our new stream processor must (and does!) automatically translate a single transaction at the source database into a single “transaction” in the transformation layer, and into a single “transaction” in the sink database.
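As a (very) simplified sketch of what that guarantee means on the sink side — not Epsio’s actual internals — imagine the sink buffering changes by source transaction id and applying each group, along with its bookkeeping, as one atomic database transaction. The table names and connection string here are hypothetical:

```python
import psycopg2

def apply_transaction(conn, tx_id, changes):
    """Apply all changes that belonged to one source transaction as one sink
    transaction, so readers never observe a half-applied source commit."""
    with conn:                      # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            for change in changes:
                if change["op"] == "UPSERT":
                    cur.execute(
                        "INSERT INTO order_counts (status, cnt) VALUES (%s, %s) "
                        "ON CONFLICT (status) DO UPDATE SET cnt = EXCLUDED.cnt",
                        (change["status"], change["cnt"]),
                    )
                elif change["op"] == "DELETE":
                    cur.execute(
                        "DELETE FROM order_counts WHERE status = %s",
                        (change["status"],),
                    )
            # Record progress atomically with the data itself (hypothetical
            # bookkeeping table), so a restart never re-applies or skips a commit.
            cur.execute("UPDATE _replication_progress SET last_tx_id = %s", (tx_id,))

conn = psycopg2.connect("dbname=app user=app")   # hypothetical connection string
apply_transaction(conn, tx_id=1017, changes=[
    {"op": "UPSERT", "status": "shipped", "cnt": 128},
    {"op": "DELETE", "status": "cancelled", "cnt": None},
])
```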

The transformation layer is aware of the source database it transforms data from so it can build the most optimal query plans (based on schemas, table statistics, etc..), and the replication layer is aware of the transformation layer (so it can automatically start replicating new tables that new transformations require).

Parallelism, fault tolerance, and schema evolution are all abstracted away from the user — and a single, simple interface is built to control all of the stream processor’s mechanisms (ingesting data, transforming data, sinking data).

Moving Forward

Streaming is hard. But as the barrier to adoption in the market shifted from a theoretical one (is reliable stream processing at scale even possible / a real solution with real use cases?) to a usability one (how hard is it for me to use?) — the underlying technologies that power streaming had to change, and still need to.

Although probably always harder than batch processing, if we’re able to truly make stream processing (almost) as easy as batch processing — the implications for the data world would be staggering. Heavy queries would be easily pre-calculated (and kept up-to-date) using incremental materialized views, data movement between data stores would be a solved issue, dbt models / ETL processes could be made incremental using a streaming engine, and front ends would be much more reactive (e.g. https://skiplabs.io/). The cost of data stacks would drop dramatically, and engineers would be able to focus much more on what they’re supposed to — building application logic!

Similar to how Snowflake revolutionized the data warehousing world with its ease of use and scalability — I truly believe as a software engineer that an “easy to implement” yet robust stream processor can break the existing trade-off between batch processing and stream processing (easy to use vs. real-time/performant) — and unlock a new and exciting world in the data ecosystem!

Shameless plug

If you are interested in learning more about one of these "newer generation" stream processors — feel free to check out Epsio's docs at this link!
