Stream-Processing DSLs Family Index


Stream-Processing DSLs — Family Index

Family overview

Stream-processing languages are the class of DSLs (and constrained-pattern APIs) that apply SQL or relational/dataflow semantics to unbounded data. The defining problems they all confront are not present in batch SQL: the windowing problem (how do you GROUP BY when the input never ends? — answered by tumbling, sliding, hopping, and session windows, codified in the Apache Beam model), the watermark / late-data problem (how late is “too late” for an out-of-order event, and what do you emit when corrections arrive — accumulating, discarding, or accumulating-and-retracting modes), and the delivery-semantics problem (at-most-once vs at-least-once vs exactly-once, the last requiring transactional sinks, idempotent producers, or end-to-end checkpoint barriers à la Flink/Chandy-Lamport).

The 2015–2022 arc was a slow convergence on streaming SQL as the user-facing layer, with the underlying execution graph (Flink’s JobGraph, Spark’s DAG, Beam’s pipeline) hidden. ksqlDB (2017, Confluent), Flink SQL (~2017+), Beam SQL (2017+), and Spark Structured Streaming (2.0, 2016) all eventually grew SQL surfaces over what had previously been Java/Scala builder APIs (KStream, DataStream, PCollection, DStream). The newer incremental view maintenance (IVM) wave — Materialize (2019+), RisingWave (2022+), Feldera (DBSP, 2023+) — treats streaming as “SQL views that automatically update on every input row,” which is theoretically the cleanest formulation but took until ~2020 to become production-ready because the underlying differential dataflow / DBSP math had to mature first.

In parallel, a Python streaming subgenre emerged because the JVM-only Flink/Spark/Kafka-Streams world was too heavy for many use cases: Faust (originally Robinhood, now community-maintained), Pathway (2023+), Bytewax (Python frontend, Rust/Timely-Dataflow backend), and Quix Streams (“Pythonic Kafka Streams”). And upstream of all of it sits CDC — Debezium being the canonical answer to “where do streams come from in an OLTP world?” by tailing PostgreSQL/MySQL/Oracle WAL/binlog and turning row-level changes into Kafka topics, with Single Message Transforms (SMTs) acting as a small embedded DSL.

In our deep library

None catalogued. Stream-processing DSLs do not have standalone deep-library notes; they all sit on top of host languages (Java, Scala, Python, SQL) that are catalogued.

Cross-reference:

  • sql — covers ANSI SQL + the 5 major dialects but not the streaming SQL extensions (windowing, watermarks, MATCH_RECOGNIZE-on-streams, EMIT CHANGES). Streaming SQL is effectively a sixth dialect-family.
  • scientific — overlap with data-orchestration tools and dataframe APIs.
  • visual-dataflow — Apache NiFi is dual-classified (visual flow editor and a stream-processing system).
  • music-audioNAME COLLISION with Faust: the audio-synthesis Faust (Functional Audio Stream, GRAME/INRIA, 2002) is unrelated to the Python Kafka library Faust (Robinhood, 2017). Always disambiguate by context.
  • python, scala, java — host languages for the API-style frameworks (Kafka Streams DSL, Beam SDK, Flink DataStream, Bytewax, Faust, Pathway, Quix).
  • query — InfluxQL/Flux and similar time-series query layers are adjacent.

Tier 3 family table

LanguageFirst appearedOriginExecution modelStatus (2026)URL
ksqlDB / Kafka KSQL2017Confluent (Jay Kreps, Hojjat Jafarpour)SQL over Kafka topics, push + pull queries, runs on a Kafka Streams runtimeActive, but Confluent has shifted strategic focus to Flink SQL via Confluent Cloud after the 2023 acquisition of Immerokhttps://docs.ksqldb.io/
Apache Flink SQL~2017 (Table API), SQL surface matured 2018+Apache Flink (orig. Stratosphere, TU Berlin, 2010)Streaming dataflow on Flink runtime; unified batch+stream via Table APIVery active, the de facto JVM streaming SQL standard; Confluent + Alibaba (Ververica) anchor ithttps://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/
Apache Beam SQL2017Google (Dataflow → Apache, 2016)Compiles to Beam pipeline → runs on Flink, Spark, Dataflow, Samza, etc. (portability layer)Active but lower velocity; Beam is more widely used as a Python/Java SDK than as SQLhttps://beam.apache.org/documentation/dsls/sql/overview/
Apache Spark SQL / Structured Streaming2014 (Spark SQL) / 2016 (Structured Streaming)UC Berkeley AMPLab → DatabricksMicro-batch (default) and continuous-processing modes; SQL + Dataset/DataFrame APIVery active; the dominant Databricks runtime path for streaming workloadshttps://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Kafka Streams DSL2016Confluent / Apache Kafka projectJava/Scala builder API (KStream, KTable, GlobalKTable); not SQL but a constrained, fluent DSL-style APIActive, library-style alternative to running a Flink/Spark clusterhttps://kafka.apache.org/documentation/streams/
Apache Pulsar Functions2018Yahoo / StreamNative (Apache, 2018)Lightweight serverless functions over Pulsar topics; Java/Python/GoActive but niche outside Pulsar shopshttps://pulsar.apache.org/docs/functions-overview/
Apache Storm Trident2012BackType / Twitter (Nathan Marz)Higher-level micro-batch DSL on top of Storm’s tuple-at-a-time coreLegacy / mostly retired; Storm itself is in Apache Attic territory, superseded by Flinkhttps://storm.apache.org/releases/2.6.0/Trident-API-Overview.html
Apache Samza2013LinkedInStateful stream processing on Kafka + YARN/Standalone; Java/Scala SDK and a SQL surfaceLegacy, low activity since ~2020; LinkedIn moved most workloads to Flinkhttps://samza.apache.org/
Apache Heron2015Twitter (post-Storm)Topology model similar to Storm, intended as Storm’s replacementLargely retired; minimal commits in recent yearshttps://heron.apache.org/
Materialize SQL2019 (founded), 2020 GAMaterialize Inc. (Frank McSherry, Arjun Narayan)Incremental view maintenance via differential dataflow; standard PostgreSQL wire protocolActive (Materialize Cloud)https://materialize.com/docs/
RisingWave SQL2021 founded, 2022 OSSRisingWave LabsCloud-native streaming database; PostgreSQL-compatible SQL with materialized views as the streaming primitiveActive, fast-movinghttps://docs.risingwave.com/
ClickHouse SQL streaming2016 (ClickHouse OSS)Yandex → ClickHouse Inc.Materialized views over Kafka/Kinesis/RabbitMQ table engines; not a true stream processor but functions as one for many ingest+aggregate use casesVery active (ClickHouse Cloud)https://clickhouse.com/docs/integrations/kafka
Faust (Python streaming)2017Robinhood (Ask Solem of Celery)Python Kafka Streams analogue; asyncio-based, agent abstractionMaintenance mode (faust-streaming community fork after Robinhood archived the original); name collides with the audio Faust (music-audio)https://faust-streaming.github.io/faust/
Pathway2023Pathway Inc.Python framework, Rust core; declarative dataflow over streaming + batch with consistent semanticsActive, growing in LLM-RAG-over-streams territoryhttps://pathway.com/developers/
Bytewax2022Bytewax Inc.Python frontend on Rust Timely Dataflow backend; dataflow API over Kafka and other connectorsActivehttps://bytewax.io/
Apache NiFi2014 (NSA → Apache)NSA (originally “Niagarafiles”)Visual drag-and-drop flow editor; processors connected by directed edges over flowfilesActive, dual-classified (visual-dataflow)https://nifi.apache.org/
Google Cloud Dataflow SQL2019Google CloudStreaming SQL surface over Beam → Dataflow; integrated with BigQuery and Pub/SubActive, niche to GCP customershttps://cloud.google.com/dataflow/docs/reference/sql/overview
Apache Druid SQL2011 (Druid), SQL surface 2018+Metamarkets → Apache → ImplyReal-time OLAP database with streaming ingest from Kafka/Kinesis; SQL via Apache CalciteActivehttps://druid.apache.org/docs/latest/querying/sql.html
Apache Pinot SQL2014 (LinkedIn) → Apache 2018LinkedIn → StarTreeReal-time OLAP, streaming ingest from Kafka/Kinesis; SQL via Calcite; tight cousin of DruidActive (StarTree Cloud)https://docs.pinot.apache.org/users/user-guide-query
Timely Dataflow / Differential Dataflow2013 (Naiad paper, MSR) / 2015+ (DD)Frank McSherry et al., Microsoft Research → open sourceRust libraries; Naiad lineage; the theoretical foundation under Materialize and (in part) BytewaxActive (research/library; powers Materialize)https://github.com/TimelyDataflow/timely-dataflow
Quix Streams2023Quix Analytics”Pythonic Kafka Streams”; pure-Python library, no JVM, no separate clusterActivehttps://quix.io/docs/quix-streams/introduction.html
Debezium (CDC + SMTs)2016Red HatKafka Connect source connectors that tail WAL/binlog/redo from Postgres/MySQL/Oracle/SQL Server/MongoDB; Single Message Transforms act as a small embedded DSLVery active, the de facto OSS CDC layerhttps://debezium.io/documentation/

Notable threads

  • Streaming SQL as the convergent end-state. Every major streaming engine that started with a fluent Java/Scala builder API (Flink DataStream, Spark DStream, Kafka Streams, Beam SDK) eventually grew a SQL surface, because SQL is what data engineers and analysts already speak. The interesting design space is no longer whether to expose SQL but which streaming-specific extensions (EMIT CHANGES, MATCH_RECOGNIZE, GROUP BY TUMBLE/HOP/SESSION, watermark declarations, retract/upsert semantics) to add and how to standardise them. Flink SQL is currently the most complete and is the de facto reference.

  • Incremental view maintenance is the cleanest theoretical model. Materialize, RisingWave, and Feldera reframe streaming as “SQL views that update on every input.” This is mathematically elegant — the user writes ordinary CREATE MATERIALIZED VIEW and the engine derives an incremental update plan via differential dataflow (Naiad, McSherry) or DBSP (Feldera). It took until ~2020 for production-readiness because (a) the math (Z-sets, lattice timestamps, generalised retractions) only stabilised in the 2010s research literature, and (b) you need a transactional storage layer + consistent snapshots to make it correct.

  • The exactly-once-delivery debate. “Exactly once” is the most-marketed and most-misunderstood property in streaming. Realisations include: Kafka transactions (KIP-98, idempotent producer + transactional consumer offsets) for end-to-end EOS within Kafka; Flink’s distributed-snapshot checkpoints (Chandy–Lamport variant) for engine-internal exactly-once; Spark Structured Streaming’s idempotent sink contract; and the practical reality that end-to-end EOS requires the sink to be either transactional or idempotent. Most production systems still settle for at-least-once + idempotency at the sink.

  • The Python streaming wave. Faust (2017), then Pathway, Bytewax, and Quix (2022–2023) emerged because the JVM-only world (Flink, Spark, Kafka Streams) was operationally heavy — you needed a cluster, a JVM, schema registries, and Java fluency. The Python frameworks target the “I just want a Python script that processes a Kafka topic” use case; Bytewax and Pathway also push into LLM-era stream-RAG and incremental-embedding pipelines.

  • The SQL-vs-API split inside the IVM crowd. Materialize and RisingWave are PostgreSQL-wire-compatible SQL-only; Feldera exposes SQL but also a Rust/Java API; Bytewax is Python-API-only despite running on the same Timely Dataflow runtime that powers Materialize. The split tracks audience: data-team-first products go SQL-only, engineering-team-first products go API-first.

  • CDC (Debezium) as the upstream “where do streams come from” answer. A streaming engine without an event source is decorative. Debezium turns the world’s installed base of OLTP databases (Postgres, MySQL, Oracle, SQL Server, MongoDB, Cassandra) into Kafka topics by tailing the write-ahead log / binlog. Debezium SMTs (Single Message Transforms) are themselves a small embedded DSL — tombstone handling, field renaming, masking, routing — and the broader pattern (log-based CDC → Kafka → Flink SQL → materialized view) is the dominant 2025-era reference architecture for real-time analytics.

Citations