Stream-Processing DSLs Family Index
type: language-family-index family: stream-processing languages_catalogued: 22 tags: [language-reference, family-index, stream-processing, kafka, flink, spark, real-time, cdc, materialized-views]
Stream-Processing DSLs — Family Index
Family overview
Stream-processing languages are the class of DSLs (and constrained-pattern APIs) that apply SQL or relational/dataflow semantics to unbounded data. The defining problems they all confront are not present in batch SQL: the windowing problem (how do you GROUP BY when the input never ends? — answered by tumbling, sliding, hopping, and session windows, codified in the Apache Beam model), the watermark / late-data problem (how late is “too late” for an out-of-order event, and what do you emit when corrections arrive — accumulating, discarding, or accumulating-and-retracting modes), and the delivery-semantics problem (at-most-once vs at-least-once vs exactly-once, the last requiring transactional sinks, idempotent producers, or end-to-end checkpoint barriers à la Flink/Chandy-Lamport).
The 2015–2022 arc was a slow convergence on streaming SQL as the user-facing layer, with the underlying execution graph (Flink’s JobGraph, Spark’s DAG, Beam’s pipeline) hidden. ksqlDB (2017, Confluent), Flink SQL (~2017+), Beam SQL (2017+), and Spark Structured Streaming (2.0, 2016) all eventually grew SQL surfaces over what had previously been Java/Scala builder APIs (KStream, DataStream, PCollection, DStream). The newer incremental view maintenance (IVM) wave — Materialize (2019+), RisingWave (2022+), Feldera (DBSP, 2023+) — treats streaming as “SQL views that automatically update on every input row,” which is theoretically the cleanest formulation but took until ~2020 to become production-ready because the underlying differential dataflow / DBSP math had to mature first.
In parallel, a Python streaming subgenre emerged because the JVM-only Flink/Spark/Kafka-Streams world was too heavy for many use cases: Faust (originally Robinhood, now community-maintained), Pathway (2023+), Bytewax (Python frontend, Rust/Timely-Dataflow backend), and Quix Streams (“Pythonic Kafka Streams”). And upstream of all of it sits CDC — Debezium being the canonical answer to “where do streams come from in an OLTP world?” by tailing PostgreSQL/MySQL/Oracle WAL/binlog and turning row-level changes into Kafka topics, with Single Message Transforms (SMTs) acting as a small embedded DSL.
In our deep library
None catalogued. Stream-processing DSLs do not have standalone deep-library notes; they all sit on top of host languages (Java, Scala, Python, SQL) that are catalogued.
Cross-reference:
- sql — covers ANSI SQL + the 5 major dialects but not the streaming SQL extensions (windowing, watermarks, MATCH_RECOGNIZE-on-streams, EMIT CHANGES). Streaming SQL is effectively a sixth dialect-family.
- scientific — overlap with data-orchestration tools and dataframe APIs.
- visual-dataflow — Apache NiFi is dual-classified (visual flow editor and a stream-processing system).
- music-audio — NAME COLLISION with Faust: the audio-synthesis Faust (Functional Audio Stream, GRAME/INRIA, 2002) is unrelated to the Python Kafka library Faust (Robinhood, 2017). Always disambiguate by context.
- python, scala, java — host languages for the API-style frameworks (Kafka Streams DSL, Beam SDK, Flink DataStream, Bytewax, Faust, Pathway, Quix).
- query — InfluxQL/Flux and similar time-series query layers are adjacent.
Tier 3 family table
| Language | First appeared | Origin | Execution model | Status (2026) | URL |
|---|---|---|---|---|---|
| ksqlDB / Kafka KSQL | 2017 | Confluent (Jay Kreps, Hojjat Jafarpour) | SQL over Kafka topics, push + pull queries, runs on a Kafka Streams runtime | Active, but Confluent has shifted strategic focus to Flink SQL via Confluent Cloud after the 2023 acquisition of Immerok | https://docs.ksqldb.io/ |
| Apache Flink SQL | ~2017 (Table API), SQL surface matured 2018+ | Apache Flink (orig. Stratosphere, TU Berlin, 2010) | Streaming dataflow on Flink runtime; unified batch+stream via Table API | Very active, the de facto JVM streaming SQL standard; Confluent + Alibaba (Ververica) anchor it | https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/ |
| Apache Beam SQL | 2017 | Google (Dataflow → Apache, 2016) | Compiles to Beam pipeline → runs on Flink, Spark, Dataflow, Samza, etc. (portability layer) | Active but lower velocity; Beam is more widely used as a Python/Java SDK than as SQL | https://beam.apache.org/documentation/dsls/sql/overview/ |
| Apache Spark SQL / Structured Streaming | 2014 (Spark SQL) / 2016 (Structured Streaming) | UC Berkeley AMPLab → Databricks | Micro-batch (default) and continuous-processing modes; SQL + Dataset/DataFrame API | Very active; the dominant Databricks runtime path for streaming workloads | https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html |
| Kafka Streams DSL | 2016 | Confluent / Apache Kafka project | Java/Scala builder API (KStream, KTable, GlobalKTable); not SQL but a constrained, fluent DSL-style API | Active, library-style alternative to running a Flink/Spark cluster | https://kafka.apache.org/documentation/streams/ |
| Apache Pulsar Functions | 2018 | Yahoo / StreamNative (Apache, 2018) | Lightweight serverless functions over Pulsar topics; Java/Python/Go | Active but niche outside Pulsar shops | https://pulsar.apache.org/docs/functions-overview/ |
| Apache Storm Trident | 2012 | BackType / Twitter (Nathan Marz) | Higher-level micro-batch DSL on top of Storm’s tuple-at-a-time core | Legacy / mostly retired; Storm itself is in Apache Attic territory, superseded by Flink | https://storm.apache.org/releases/2.6.0/Trident-API-Overview.html |
| Apache Samza | 2013 | Stateful stream processing on Kafka + YARN/Standalone; Java/Scala SDK and a SQL surface | Legacy, low activity since ~2020; LinkedIn moved most workloads to Flink | https://samza.apache.org/ | |
| Apache Heron | 2015 | Twitter (post-Storm) | Topology model similar to Storm, intended as Storm’s replacement | Largely retired; minimal commits in recent years | https://heron.apache.org/ |
| Materialize SQL | 2019 (founded), 2020 GA | Materialize Inc. (Frank McSherry, Arjun Narayan) | Incremental view maintenance via differential dataflow; standard PostgreSQL wire protocol | Active (Materialize Cloud) | https://materialize.com/docs/ |
| RisingWave SQL | 2021 founded, 2022 OSS | RisingWave Labs | Cloud-native streaming database; PostgreSQL-compatible SQL with materialized views as the streaming primitive | Active, fast-moving | https://docs.risingwave.com/ |
| ClickHouse SQL streaming | 2016 (ClickHouse OSS) | Yandex → ClickHouse Inc. | Materialized views over Kafka/Kinesis/RabbitMQ table engines; not a true stream processor but functions as one for many ingest+aggregate use cases | Very active (ClickHouse Cloud) | https://clickhouse.com/docs/integrations/kafka |
| Faust (Python streaming) | 2017 | Robinhood (Ask Solem of Celery) | Python Kafka Streams analogue; asyncio-based, agent abstraction | Maintenance mode (faust-streaming community fork after Robinhood archived the original); name collides with the audio Faust (music-audio) | https://faust-streaming.github.io/faust/ |
| Pathway | 2023 | Pathway Inc. | Python framework, Rust core; declarative dataflow over streaming + batch with consistent semantics | Active, growing in LLM-RAG-over-streams territory | https://pathway.com/developers/ |
| Bytewax | 2022 | Bytewax Inc. | Python frontend on Rust Timely Dataflow backend; dataflow API over Kafka and other connectors | Active | https://bytewax.io/ |
| Apache NiFi | 2014 (NSA → Apache) | NSA (originally “Niagarafiles”) | Visual drag-and-drop flow editor; processors connected by directed edges over flowfiles | Active, dual-classified (visual-dataflow) | https://nifi.apache.org/ |
| Google Cloud Dataflow SQL | 2019 | Google Cloud | Streaming SQL surface over Beam → Dataflow; integrated with BigQuery and Pub/Sub | Active, niche to GCP customers | https://cloud.google.com/dataflow/docs/reference/sql/overview |
| Apache Druid SQL | 2011 (Druid), SQL surface 2018+ | Metamarkets → Apache → Imply | Real-time OLAP database with streaming ingest from Kafka/Kinesis; SQL via Apache Calcite | Active | https://druid.apache.org/docs/latest/querying/sql.html |
| Apache Pinot SQL | 2014 (LinkedIn) → Apache 2018 | LinkedIn → StarTree | Real-time OLAP, streaming ingest from Kafka/Kinesis; SQL via Calcite; tight cousin of Druid | Active (StarTree Cloud) | https://docs.pinot.apache.org/users/user-guide-query |
| Timely Dataflow / Differential Dataflow | 2013 (Naiad paper, MSR) / 2015+ (DD) | Frank McSherry et al., Microsoft Research → open source | Rust libraries; Naiad lineage; the theoretical foundation under Materialize and (in part) Bytewax | Active (research/library; powers Materialize) | https://github.com/TimelyDataflow/timely-dataflow |
| Quix Streams | 2023 | Quix Analytics | ”Pythonic Kafka Streams”; pure-Python library, no JVM, no separate cluster | Active | https://quix.io/docs/quix-streams/introduction.html |
| Debezium (CDC + SMTs) | 2016 | Red Hat | Kafka Connect source connectors that tail WAL/binlog/redo from Postgres/MySQL/Oracle/SQL Server/MongoDB; Single Message Transforms act as a small embedded DSL | Very active, the de facto OSS CDC layer | https://debezium.io/documentation/ |
Notable threads
-
Streaming SQL as the convergent end-state. Every major streaming engine that started with a fluent Java/Scala builder API (Flink DataStream, Spark DStream, Kafka Streams, Beam SDK) eventually grew a SQL surface, because SQL is what data engineers and analysts already speak. The interesting design space is no longer whether to expose SQL but which streaming-specific extensions (EMIT CHANGES, MATCH_RECOGNIZE, GROUP BY TUMBLE/HOP/SESSION, watermark declarations, retract/upsert semantics) to add and how to standardise them. Flink SQL is currently the most complete and is the de facto reference.
-
Incremental view maintenance is the cleanest theoretical model. Materialize, RisingWave, and Feldera reframe streaming as “SQL views that update on every input.” This is mathematically elegant — the user writes ordinary
CREATE MATERIALIZED VIEWand the engine derives an incremental update plan via differential dataflow (Naiad, McSherry) or DBSP (Feldera). It took until ~2020 for production-readiness because (a) the math (Z-sets, lattice timestamps, generalised retractions) only stabilised in the 2010s research literature, and (b) you need a transactional storage layer + consistent snapshots to make it correct. -
The exactly-once-delivery debate. “Exactly once” is the most-marketed and most-misunderstood property in streaming. Realisations include: Kafka transactions (KIP-98, idempotent producer + transactional consumer offsets) for end-to-end EOS within Kafka; Flink’s distributed-snapshot checkpoints (Chandy–Lamport variant) for engine-internal exactly-once; Spark Structured Streaming’s idempotent sink contract; and the practical reality that end-to-end EOS requires the sink to be either transactional or idempotent. Most production systems still settle for at-least-once + idempotency at the sink.
-
The Python streaming wave. Faust (2017), then Pathway, Bytewax, and Quix (2022–2023) emerged because the JVM-only world (Flink, Spark, Kafka Streams) was operationally heavy — you needed a cluster, a JVM, schema registries, and Java fluency. The Python frameworks target the “I just want a Python script that processes a Kafka topic” use case; Bytewax and Pathway also push into LLM-era stream-RAG and incremental-embedding pipelines.
-
The SQL-vs-API split inside the IVM crowd. Materialize and RisingWave are PostgreSQL-wire-compatible SQL-only; Feldera exposes SQL but also a Rust/Java API; Bytewax is Python-API-only despite running on the same Timely Dataflow runtime that powers Materialize. The split tracks audience: data-team-first products go SQL-only, engineering-team-first products go API-first.
-
CDC (Debezium) as the upstream “where do streams come from” answer. A streaming engine without an event source is decorative. Debezium turns the world’s installed base of OLTP databases (Postgres, MySQL, Oracle, SQL Server, MongoDB, Cassandra) into Kafka topics by tailing the write-ahead log / binlog. Debezium SMTs (Single Message Transforms) are themselves a small embedded DSL — tombstone handling, field renaming, masking, routing — and the broader pattern (log-based CDC → Kafka → Flink SQL → materialized view) is the dominant 2025-era reference architecture for real-time analytics.
Citations
- Apache Flink SQL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/
- ksqlDB: https://docs.ksqldb.io/
- Confluent on Flink SQL strategy: https://www.confluent.io/product/flink/
- Apache Beam SQL: https://beam.apache.org/documentation/dsls/sql/overview/
- Apache Beam programming model (windows + watermarks): https://beam.apache.org/documentation/programming-guide/
- Apache Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
- Apache Kafka Streams: https://kafka.apache.org/documentation/streams/
- Apache Pulsar Functions: https://pulsar.apache.org/docs/functions-overview/
- Apache Storm Trident: https://storm.apache.org/releases/current/Trident-API-Overview.html
- Apache Samza: https://samza.apache.org/
- Apache Heron: https://heron.apache.org/
- Materialize: https://materialize.com/docs/
- Differential Dataflow paper (McSherry et al.): https://github.com/TimelyDataflow/differential-dataflow
- Naiad paper (SOSP 2013): https://www.microsoft.com/en-us/research/publication/naiad-a-timely-dataflow-system/
- RisingWave: https://docs.risingwave.com/
- Feldera (DBSP): https://www.feldera.com/
- ClickHouse Kafka engine: https://clickhouse.com/docs/integrations/kafka
- Faust (community fork): https://faust-streaming.github.io/faust/
- Pathway: https://pathway.com/developers/
- Bytewax: https://bytewax.io/
- Apache NiFi: https://nifi.apache.org/
- Google Cloud Dataflow SQL: https://cloud.google.com/dataflow/docs/reference/sql/overview
- Apache Druid SQL: https://druid.apache.org/docs/latest/querying/sql.html
- Apache Pinot SQL: https://docs.pinot.apache.org/users/user-guide-query
- Timely Dataflow: https://github.com/TimelyDataflow/timely-dataflow
- Quix Streams: https://quix.io/docs/quix-streams/introduction.html
- Debezium: https://debezium.io/documentation/
- Kafka KIP-98 (exactly-once): https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
- Apache Beam “Streaming 101/102” (Akidau): https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/