Stream-Processing DSLs Family Index

type: language-family-index
family: stream-processing
languages_catalogued: 22
tags: [language-reference, family-index, stream-processing, kafka, flink, spark, real-time, cdc, materialized-views]

Family overview

Stream-processing languages are the class of DSLs (and constrained-pattern APIs) that apply SQL or relational/dataflow semantics to unbounded data. They all confront three problems that batch SQL does not have. The windowing problem: how do you GROUP BY when the input never ends? The standard answers are tumbling, sliding, hopping, and session windows, as codified in the Apache Beam model. The watermark / late-data problem: how late is “too late” for an out-of-order event, and what do you emit when corrections arrive (accumulating, discarding, or accumulating-and-retracting modes)? The delivery-semantics problem: at-most-once vs at-least-once vs exactly-once, the last requiring transactional sinks, idempotent producers, or end-to-end checkpoint barriers à la Flink/Chandy-Lamport.
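The windowing and late-data problems can be made concrete with a minimal Python sketch. It assumes a watermark defined as "largest event time seen so far minus a fixed allowed lateness", which is one common heuristic and not the API of any engine named here. The names `WINDOW_SIZE`, `ALLOWED_LATENESS`, and `count_per_window` are illustrative only, and the sketch simply drops late events rather than emitting corrections.

```python
WINDOW_SIZE = 60        # seconds per tumbling window (illustrative value)
ALLOWED_LATENESS = 30   # how far the watermark trails the max event time seen


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def count_per_window(events):
    """Count events per (window, key); drop events that arrive too late.

    `events` is an iterable of (event_time, key) in arrival order, which may
    differ from event-time order. Returns (counts, dropped).
    """
    counts = {}
    dropped = []
    max_seen = None
    for event_time, key in events:
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LATENESS
        start = window_start(event_time)
        # The window [start, start + WINDOW_SIZE) is treated as closed once
        # the watermark has passed its end; later arrivals are "too late".
        if start + WINDOW_SIZE <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(start, key)] = counts.get((start, key), 0) + 1
    return counts, dropped


# Arrival order is not event-time order: 58 and 59 arrive after 62 and 130.
arrivals = [(5, "a"), (62, "a"), (58, "a"), (130, "a"), (59, "a")]
counts, dropped = count_per_window(arrivals)
print(counts)
print(dropped)
```

In this run the event at time 58 is out of order but still lands in its window, because the watermark has not yet passed the window's end. The event at time 59 arrives after the watermark has moved past 60 and is discarded. An accumulating-and-retracting engine would instead re-emit a corrected count for that window.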

The 2015–2022 arc was a slow convergence on streaming SQL as the user-facing layer, with the underlying execution graph (Flink’s JobGraph, Spark’s DAG, Beam’s pipeline) hidden. ksqlDB (2017, Confluent), Flink SQL (~2017+), Beam SQL (2017+), and Spark Structured Streaming (2.0, 2016) each put a SQL surface over what had previously been a Java/Scala builder API (KStream, DataStream, PCollection, and DStream respectively). The newer incremental view maintenance (IVM) wave, Materialize (2019+), RisingWave (2022+), and Feldera (DBSP, 2023+), treats streaming as “SQL views that automatically update on every input row.” This is theoretically the cleanest formulation, but it took until ~2020 to become production-ready because the underlying differential dataflow / DBSP math had to mature first.
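The "views that update on every input row" idea can be illustrated with a toy Python sketch of one incrementally maintained view, roughly `SELECT key, COUNT(*) ... GROUP BY key`. Changes arrive as weighted rows (+1 for an insert, -1 for a retraction), loosely in the spirit of the Z-set formulation. The class `CountByKeyView` and its methods are hypothetical names for illustration and are not how Materialize, RisingWave, or Feldera are implemented.

```python
class CountByKeyView:
    """Toy incrementally maintained COUNT(*) ... GROUP BY key."""

    def __init__(self):
        self._counts = {}

    def apply(self, delta):
        """Apply a batch of (key, weight) changes and return the output delta.

        weight +1 is an insert, -1 is a retraction. The result lists
        (key, old_count, new_count) only for keys whose count changed.
        """
        # Consolidate the batch first so +1/-1 pairs cancel out.
        net = {}
        for key, weight in delta:
            net[key] = net.get(key, 0) + weight

        changes = []
        for key, weight in net.items():
            if weight == 0:
                continue
            old = self._counts.get(key, 0)
            new = old + weight
            if new < 0:
                raise ValueError(f"retraction without matching insert: {key!r}")
            if new == 0:
                self._counts.pop(key, None)
            else:
                self._counts[key] = new
            changes.append((key, old, new))
        return changes

    def snapshot(self):
        """Return the current contents of the view."""
        return dict(self._counts)


view = CountByKeyView()
out1 = view.apply([("eu", +1), ("us", +1), ("eu", +1)])
out2 = view.apply([("eu", -1), ("us", -1)])
print(out1, out2, view.snapshot())
```

The work done per batch is proportional to the size of the change, not to the size of the input seen so far, which is the property that makes the model attractive. Real engines generalise this to joins, nested aggregates, and consistent timestamps, which this sketch does not attempt.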

In parallel, a Python streaming subgenre emerged because the JVM-only Flink/Spark/Kafka-Streams world was too heavy for many use cases: Faust (originally Robinhood, now community-maintained), Pathway (2023+), Bytewax (Python frontend, Rust/Timely-Dataflow backend), and Quix Streams (“Pythonic Kafka Streams”). Upstream of all of it sits change data capture (CDC). Debezium is the canonical answer to “where do streams come from in an OLTP world?”: it tails the PostgreSQL/MySQL/Oracle WAL/binlog and turns row-level changes into Kafka topics, with Single Message Transforms (SMTs) acting as a small embedded DSL.
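To show what "row-level changes as events" looks like to a consumer, here is a small Python sketch that replays change events into an in-memory replica. The `op`, `before`, and `after` field names follow Debezium's change-event envelope. Everything else is an assumption for illustration: the helper names `mask_fields` and `apply_change`, the sample rows, and the masking step, which only imitates what an SMT might do. Real SMTs are Java classes configured on a Kafka Connect connector, not Python functions.

```python
def mask_fields(event, fields, replacement="***"):
    """SMT-like step: return a copy of the event with given fields masked."""
    masked = dict(event)
    for side in ("before", "after"):
        row = event.get(side)
        if row is not None:
            masked[side] = {
                name: (replacement if name in fields else value)
                for name, value in row.items()
            }
    return masked


def apply_change(table, event, key="id"):
    """Apply one change event to a dict replica keyed by the primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):       # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                 # delete
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return table


# Hypothetical change stream for a `users` table.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "email": "a@example.com", "plan": "free"}},
    {"op": "u",
     "before": {"id": 1, "email": "a@example.com", "plan": "free"},
     "after": {"id": 1, "email": "a@example.com", "plan": "pro"}},
    {"op": "c", "before": None,
     "after": {"id": 2, "email": "b@example.com", "plan": "free"}},
    {"op": "d",
     "before": {"id": 2, "email": "b@example.com", "plan": "free"},
     "after": None},
]

replica = {}
for event in events:
    apply_change(replica, mask_fields(event, {"email"}))
print(replica)
```

After the four events the replica holds only user 1, on the `pro` plan and with the email masked, which is the state a downstream materialized view would be expected to converge to.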

In our deep library

None catalogued. Stream-processing DSLs do not have standalone deep-library notes; they all sit on top of host languages (Java, Scala, Python, SQL) that are catalogued.

Cross-reference:

sql — covers ANSI SQL + the 5 major dialects but not the streaming SQL extensions (windowing, watermarks, MATCH_RECOGNIZE-on-streams, EMIT CHANGES). Streaming SQL is effectively a sixth dialect-family.
scientific — overlap with data-orchestration tools and dataframe APIs.
visual-dataflow — Apache NiFi is dual-classified (visual flow editor and a stream-processing system).
music-audio — NAME COLLISION with Faust: the audio-synthesis Faust (Functional Audio Stream, GRAME/INRIA, 2002) is unrelated to the Python Kafka library Faust (Robinhood, 2017). Always disambiguate by context.
python, scala, java — host languages for the API-style frameworks (Kafka Streams DSL, Beam SDK, Flink DataStream, Bytewax, Faust, Pathway, Quix).
query — InfluxQL/Flux and similar time-series query layers are adjacent.

Tier 3 family table

Language	First appeared	Origin	Execution model	Status (2026)	URL
ksqlDB / Kafka KSQL	2017	Confluent (Jay Kreps, Hojjat Jafarpour)	SQL over Kafka topics, push + pull queries, runs on a Kafka Streams runtime	Active, but Confluent has shifted strategic focus to Flink SQL via Confluent Cloud after the 2023 acquisition of Immerok	https://docs.ksqldb.io/
Apache Flink SQL	~2017 (Table API), SQL surface matured 2018+	Apache Flink (orig. Stratosphere, TU Berlin, 2010)	Streaming dataflow on Flink runtime; unified batch+stream via Table API	Very active, the de facto JVM streaming SQL standard; Confluent + Alibaba (Ververica) anchor it	https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/
Apache Beam SQL	2017	Google (Dataflow → Apache, 2016)	Compiles to Beam pipeline → runs on Flink, Spark, Dataflow, Samza, etc. (portability layer)	Active but lower velocity; Beam is more widely used as a Python/Java SDK than as SQL	https://beam.apache.org/documentation/dsls/sql/overview/
Apache Spark SQL / Structured Streaming	2014 (Spark SQL) / 2016 (Structured Streaming)	UC Berkeley AMPLab → Databricks	Micro-batch (default) and continuous-processing modes; SQL + Dataset/DataFrame API	Very active; the dominant Databricks runtime path for streaming workloads	https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Kafka Streams DSL	2016	Confluent / Apache Kafka project	Java/Scala builder API (KStream, KTable, GlobalKTable); not SQL but a constrained, fluent DSL-style API	Active, library-style alternative to running a Flink/Spark cluster	https://kafka.apache.org/documentation/streams/
Apache Pulsar Functions	2018	Yahoo / StreamNative (Apache, 2018)	Lightweight serverless functions over Pulsar topics; Java/Python/Go	Active but niche outside Pulsar shops	https://pulsar.apache.org/docs/functions-overview/
Apache Storm Trident	2012	BackType / Twitter (Nathan Marz)	Higher-level micro-batch DSL on top of Storm’s tuple-at-a-time core	Legacy; Storm is still an Apache project that ships occasional releases (it has not moved to the Apache Attic), but it has been largely superseded by Flink	https://storm.apache.org/releases/2.6.0/Trident-API-Overview.html
Apache Samza	2013	LinkedIn	Stateful stream processing on Kafka + YARN/Standalone; Java/Scala SDK and a SQL surface	Legacy, low activity since ~2020; LinkedIn moved most workloads to Flink	https://samza.apache.org/
Apache Heron	2015	Twitter (post-Storm)	Topology model similar to Storm, intended as Storm’s replacement	Largely retired; minimal commits in recent years	https://heron.apache.org/
Materialize SQL	2019 (founded), 2020 GA	Materialize Inc. (Frank McSherry, Arjun Narayan)	Incremental view maintenance via differential dataflow; standard PostgreSQL wire protocol	Active (Materialize Cloud)	https://materialize.com/docs/
RisingWave SQL	2021 founded, 2022 OSS	RisingWave Labs	Cloud-native streaming database; PostgreSQL-compatible SQL with materialized views as the streaming primitive	Active, fast-moving	https://docs.risingwave.com/
ClickHouse SQL streaming	2016 (ClickHouse OSS)	Yandex → ClickHouse Inc.	Materialized views over Kafka/Kinesis/RabbitMQ table engines; not a true stream processor but functions as one for many ingest+aggregate use cases	Very active (ClickHouse Cloud)	https://clickhouse.com/docs/integrations/kafka
Faust (Python streaming)	2017	Robinhood (Ask Solem of Celery)	Python Kafka Streams analogue; asyncio-based, agent abstraction	Maintenance mode (faust-streaming community fork after Robinhood archived the original); name collides with the audio Faust (music-audio)	https://faust-streaming.github.io/faust/
Pathway	2023	Pathway Inc.	Python framework, Rust core; declarative dataflow over streaming + batch with consistent semantics	Active, growing in LLM-RAG-over-streams territory	https://pathway.com/developers/
Bytewax	2022	Bytewax Inc.	Python frontend on Rust Timely Dataflow backend; dataflow API over Kafka and other connectors	Active	https://bytewax.io/
Apache NiFi	2014 (NSA → Apache)	NSA (originally “Niagarafiles”)	Visual drag-and-drop flow editor; processors connected by directed edges over flowfiles	Active, dual-classified (visual-dataflow)	https://nifi.apache.org/
Google Cloud Dataflow SQL	2019	Google Cloud	Streaming SQL surface over Beam → Dataflow; integrated with BigQuery and Pub/Sub	Active, niche to GCP customers	https://cloud.google.com/dataflow/docs/reference/sql/overview
Apache Druid SQL	2011 (Druid), SQL surface 2018+	Metamarkets → Apache → Imply	Real-time OLAP database with streaming ingest from Kafka/Kinesis; SQL via Apache Calcite	Active	https://druid.apache.org/docs/latest/querying/sql.html
Apache Pinot SQL	2014 (LinkedIn) → Apache 2018	LinkedIn → StarTree	Real-time OLAP, streaming ingest from Kafka/Kinesis; SQL via Calcite; tight cousin of Druid	Active (StarTree Cloud)	https://docs.pinot.apache.org/users/user-guide-query
Timely Dataflow / Differential Dataflow	2013 (Naiad paper, MSR) / 2015+ (DD)	Frank McSherry et al., Microsoft Research → open source	Rust libraries; Naiad lineage; the theoretical foundation under Materialize and (in part) Bytewax	Active (research/library; powers Materialize)	https://github.com/TimelyDataflow/timely-dataflow
Quix Streams	2023	Quix Analytics	“Pythonic Kafka Streams”; pure-Python library, no JVM, no separate cluster	Active	https://quix.io/docs/quix-streams/introduction.html
Debezium (CDC + SMTs)	2016	Red Hat	Kafka Connect source connectors that tail WAL/binlog/redo from Postgres/MySQL/Oracle/SQL Server/MongoDB; Single Message Transforms act as a small embedded DSL	Very active, the de facto OSS CDC layer	https://debezium.io/documentation/

Notable threads

Streaming SQL as the convergent end-state. Every major streaming engine that started with a fluent Java/Scala builder API (Flink DataStream, Spark DStream, Kafka Streams, Beam SDK) eventually grew a SQL surface, because SQL is what data engineers and analysts already speak. The interesting design space is no longer whether to expose SQL but which streaming-specific extensions (EMIT CHANGES, MATCH_RECOGNIZE, GROUP BY TUMBLE/HOP/SESSION, watermark declarations, retract/upsert semantics) to add and how to standardise them. Flink SQL is currently the most complete and is the de facto reference.
Incremental view maintenance is the cleanest theoretical model. Materialize, RisingWave, and Feldera reframe streaming as “SQL views that update on every input.” This is mathematically elegant — the user writes ordinary CREATE MATERIALIZED VIEW and the engine derives an incremental update plan via differential dataflow (Naiad, McSherry) or DBSP (Feldera). It took until ~2020 for production-readiness because (a) the math (Z-sets, lattice timestamps, generalised retractions) only stabilised in the 2010s research literature, and (b) you need a transactional storage layer + consistent snapshots to make it correct.
The exactly-once-delivery debate. “Exactly once” is the most-marketed and most-misunderstood property in streaming. Realisations include: Kafka transactions (KIP-98, idempotent producer + transactional consumer offsets) for end-to-end EOS within Kafka; Flink’s distributed-snapshot checkpoints (Chandy–Lamport variant) for engine-internal exactly-once; Spark Structured Streaming’s idempotent sink contract; and the practical reality that end-to-end EOS requires the sink to be either transactional or idempotent. Most production systems still settle for at-least-once + idempotency at the sink.
The Python streaming wave. Faust (2017), then Pathway, Bytewax, and Quix (2022–2023) emerged because the JVM-only world (Flink, Spark, Kafka Streams) was operationally heavy — you needed a cluster, a JVM, schema registries, and Java fluency. The Python frameworks target the “I just want a Python script that processes a Kafka topic” use case; Bytewax and Pathway also push into LLM-era stream-RAG and incremental-embedding pipelines.
The SQL-vs-API split inside the IVM crowd. Materialize and RisingWave are PostgreSQL-wire-compatible SQL-only; Feldera exposes SQL but also a Rust/Java API; Bytewax is Python-API-only despite running on the same Timely Dataflow runtime that powers Materialize. The split tracks audience: data-team-first products go SQL-only, engineering-team-first products go API-first.
CDC (Debezium) as the upstream “where do streams come from” answer. A streaming engine without an event source is decorative. Debezium turns the world’s installed base of OLTP databases (Postgres, MySQL, Oracle, SQL Server, MongoDB, Cassandra) into Kafka topics by tailing the write-ahead log / binlog. Debezium SMTs (Single Message Transforms) are themselves a small embedded DSL — tombstone handling, field renaming, masking, routing — and the broader pattern (log-based CDC → Kafka → Flink SQL → materialized view) is the dominant 2025-era reference architecture for real-time analytics.
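The exactly-once thread above notes that most production systems settle for at-least-once delivery plus idempotency at the sink. A minimal Python simulation of that pattern follows, assuming every event carries a unique id and that the consumer restarts from its last committed offset after a crash. `IdempotentSink` and `run_with_replay` are hypothetical names, and the unbounded in-memory `_seen` set is a simplification that a real sink would replace with keyed upserts or a bounded dedup store.

```python
class IdempotentSink:
    """Sink whose write is a no-op when the same event id is seen again."""

    def __init__(self):
        self.totals = {}
        self._seen = set()

    def write(self, event_id, account, amount):
        """Apply the event once; return False for a duplicate delivery."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.totals[account] = self.totals.get(account, 0) + amount
        return True


def run_with_replay(events, sink, commit_every=2, crash_at=3):
    """Deliver events at-least-once, with one simulated crash.

    Offsets are committed every `commit_every` events. The consumer crashes
    once just before processing index `crash_at` and resumes from the last
    committed offset, so uncommitted events are delivered a second time.
    Returns the total number of deliveries attempted.
    """
    committed = 0
    crashed = False
    deliveries = 0
    i = 0
    while i < len(events):
        if i == crash_at and not crashed:
            crashed = True
            i = committed           # restart from the last committed offset
            continue
        sink.write(*events[i])
        deliveries += 1
        i += 1
        if i % commit_every == 0:
            committed = i
    return deliveries


events = [("e1", "alice", 10), ("e2", "bob", 5),
          ("e3", "alice", 7), ("e4", "bob", 1)]
sink = IdempotentSink()
deliveries = run_with_replay(events, sink)
print(deliveries, sink.totals)
```

Four events produce five deliveries because `e3` is replayed after the crash, yet the totals match what a single pass would give. The observable effect is exactly-once even though the delivery itself is not.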

Citations

Apache Flink SQL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/
ksqlDB: https://docs.ksqldb.io/
Confluent on Flink SQL strategy: https://www.confluent.io/product/flink/
Apache Beam SQL: https://beam.apache.org/documentation/dsls/sql/overview/
Apache Beam programming model (windows + watermarks): https://beam.apache.org/documentation/programming-guide/
Apache Spark Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Apache Kafka Streams: https://kafka.apache.org/documentation/streams/
Apache Pulsar Functions: https://pulsar.apache.org/docs/functions-overview/
Apache Storm Trident: https://storm.apache.org/releases/current/Trident-API-Overview.html
Apache Samza: https://samza.apache.org/
Apache Heron: https://heron.apache.org/
Materialize: https://materialize.com/docs/
Differential Dataflow (source repository; McSherry et al.): https://github.com/TimelyDataflow/differential-dataflow
Naiad paper (SOSP 2013): https://www.microsoft.com/en-us/research/publication/naiad-a-timely-dataflow-system/
RisingWave: https://docs.risingwave.com/
Feldera (DBSP): https://www.feldera.com/
ClickHouse Kafka engine: https://clickhouse.com/docs/integrations/kafka
Faust (community fork): https://faust-streaming.github.io/faust/
Pathway: https://pathway.com/developers/
Bytewax: https://bytewax.io/
Apache NiFi: https://nifi.apache.org/
Google Cloud Dataflow SQL: https://cloud.google.com/dataflow/docs/reference/sql/overview
Apache Druid SQL: https://druid.apache.org/docs/latest/querying/sql.html
Apache Pinot SQL: https://docs.pinot.apache.org/users/user-guide-query
Timely Dataflow: https://github.com/TimelyDataflow/timely-dataflow
Quix Streams: https://quix.io/docs/quix-streams/introduction.html
Debezium: https://debezium.io/documentation/
Kafka KIP-98 (exactly-once): https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
“The world beyond batch: Streaming 101” (Tyler Akidau, O’Reilly; background for the Beam windowing and watermark model): https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
