Query Engines and System Fit

How Triveni exposes one query surface while keeping the execution layer flexible for different serving models.

How users actually get data back, and when this architecture is worth building at all.

By this point in the series, the pipeline looks roughly like this:

  • ingest converts messages into typed batches
  • Parquet files land in object storage
  • a transaction layer defines consistent snapshots
  • a hybrid cache keeps the hot working set local

That still leaves the final user-facing question:

How does a person, dashboard, or API actually ask for data and get it back?

This is where the query engine enters the picture.

And this is also where many architectures become muddled.

Teams often spend a lot of time on serialization, storage, and caching, then treat the query engine as a late plug-in choice. In practice, the engine shapes:

  • latency
  • concurrency
  • deployment model
  • operational burden
  • extension points
  • what “SQL access” really means in production

For a query-first system, the engine is not just a reader. It is the user-facing execution layer.

The Engine Decision Is Really a System-Boundary Decision

At a high level, the engine choice is less about SQL syntax and more about where you want the query system to live.

Do you want:

  • a separate analytical service
  • an embedded library inside your own service
  • a local in-process database that you can ship with an application

That boundary matters because it changes:

  • who owns memory management
  • who owns concurrency control
  • who owns scaling
  • how much custom query behavior you can inject
  • how much infrastructure the team must operate

This is why ClickHouse, DataFusion, and DuckDB are useful to compare. They represent three meaningfully different ways to answer the same user need: “I want SQL over recent and historical data.”

A Quick Mental Model

User / API / Dashboard
  |
  v
SQL surface
  |
  +--> ClickHouse: external analytical service
  |
  +--> DataFusion: embedded Rust query engine
  |
  +--> DuckDB: embedded in-process analytical database

The SQL may look similar at the top. The operational model underneath is very different.

Option A: ClickHouse

ClickHouse is most naturally used as an external analytical system: a dedicated service that stores data itself or queries external data, and serves many concurrent analytical workloads.

That model is attractive because ClickHouse is built for:

  • large scans
  • high concurrency
  • analytical aggregation
  • operational dashboards
  • production serving workloads

In the context of this series, ClickHouse is the most natural fit when the query layer should behave like a standalone product subsystem rather than a library inside your application.

Where ClickHouse fits well

ClickHouse is strong when:

  • many users or tenants query the system concurrently
  • dashboards and APIs need predictable analytical performance
  • the team is comfortable operating a dedicated analytical service
  • data volumes are large enough that a specialized engine earns its keep
  • you want a mature external SQL endpoint

What you pay for

You are also choosing:

  • another service to operate
  • separate storage and execution planning concerns
  • deployment, scaling, and observability overhead
  • some distance between your application logic and query execution internals

This is not necessarily bad. It is often exactly the right trade for a serious multi-user analytics backend. But it is a trade.

How it interacts with a query-first pipeline

ClickHouse can sit in front of object storage, work with Parquet-oriented data flows, and serve as the query system that turns the lake-style storage layer into a user-facing analytics product.

Architecturally, that means:

  • object storage remains the durable history layer
  • ClickHouse becomes the execution and serving layer
  • caching may exist both inside the engine and in the surrounding system

The result is powerful, but it pushes the overall system toward “operated analytical platform” rather than “self-contained application component.”

Option B: DataFusion

DataFusion represents a very different model.

It is not primarily a standalone database server. It is a query engine library built around Arrow. That makes it attractive when you want to build the query system into your own Rust service rather than hand the job to an external database.

That gives DataFusion a distinctive strength:

  • your application owns the engine boundary directly

Where DataFusion fits well

DataFusion is strong when:

  • you want to embed SQL execution inside a Rust service
  • you want tight control over planning, execution, and extension points
  • you want to query Arrow and Parquet data with low additional infrastructure
  • the workload is real, but not necessarily large enough to justify a separate analytical cluster
  • custom functions, custom catalogs, or custom execution rules matter

What you pay for

DataFusion gives you flexibility, but it does not magically remove systems work.

You still need to decide:

  • how requests are served
  • how concurrency is managed
  • how memory is budgeted
  • how caches are integrated
  • how distributed execution works, if you need it

That is the central trade.

With ClickHouse, much of the analytical system already exists as a service.

With DataFusion, you get a powerful engine core, but more of the productization is your responsibility.

Why it matches this series well

For a query-first IoT backend, DataFusion is especially interesting when the design goal is:

  • one coherent application
  • one service boundary
  • one Rust-native control path from ingest to query

It fits naturally with:

  • Arrow in memory
  • Parquet on object storage
  • custom cache policy
  • a pluggable SQL layer

If the architecture wants “query engine as library,” DataFusion is the cleanest example in this comparison.

Option C: DuckDB

DuckDB sits in a third position.

It is an in-process analytical database. That makes it extremely attractive for:

  • local analytics
  • single-node data exploration
  • embedded applications
  • developer tooling
  • lightweight analytical services

DuckDB is excellent at making analytical SQL very accessible with minimal operational ceremony.

Where DuckDB fits well

DuckDB is strong when:

  • a single process can own the workload comfortably
  • local or near-local execution is acceptable
  • developer simplicity matters a lot
  • the system benefits from a very low-friction SQL layer
  • you want excellent support for file-based analytics, especially Parquet

What you pay for

DuckDB is less naturally positioned as the central multi-tenant analytical service for a large backend. You can absolutely build services around it, but that is a different posture from “ship a dedicated analytical database tier.”

The tradeoffs usually show up in:

  • concurrency expectations
  • service-shaping work around the engine
  • operational patterns for many simultaneous clients
  • how much distributed behavior you need outside the process

That does not make DuckDB weak. It makes it opinionated.

It is strongest when the problem can remain close to one process, one machine, or one well-bounded service instance.

ClickHouse vs DataFusion vs DuckDB

The useful comparison is not raw benchmark mythology. It is system fit.

| Engine | Natural shape | Best at | Main tradeoff | | --- | --- | --- | --- | | ClickHouse | external analytical service | high-concurrency analytical serving | more operational overhead | | DataFusion | embedded Rust query engine | custom application-owned query systems | more engine integration work | | DuckDB | in-process analytical database | local, embedded, and lightweight analytical execution | less natural as a central shared backend service |

That table is intentionally simple. Real deployments are always more nuanced. But as a design guide, it is the right level of abstraction.

External Service vs Embedded Engine

This is the architectural fork underneath the engine discussion.

External service model

clients
  |
  v
application / API
  |
  v
external query service
  |
  v
object storage / table snapshots / cache

Advantages:

  • clear operational separation
  • stronger multi-user serving posture
  • independent scaling of query infrastructure
  • easier to reason about as a shared analytical endpoint

Costs:

  • more moving parts
  • more infrastructure ownership
  • more network boundaries
  • less direct application-level control inside the engine

This is the natural ClickHouse posture.

Embedded engine model

clients
  |
  v
application / API
  |
  v
embedded query engine
  |
  v
object storage / table snapshots / cache

Advantages:

  • fewer moving parts
  • tighter integration with business logic
  • easier custom planning and domain-specific behavior
  • simpler small deployments

Costs:

  • your service now owns more execution complexity
  • memory and cache isolation are harder
  • concurrency and workload management become application concerns

This is the natural DataFusion posture, and often the most natural DuckDB posture as well.

The Same SQL Surface, Different Engine Beneath

One of the best design choices in this kind of system is to separate the SQL surface from the engine implementation.

That means users should ideally see:

  • one query API
  • one SQL dialect contract, or a constrained subset
  • one authentication and authorization layer
  • one result model

while the backend remains free to choose different engines underneath.

user-facing query API
  |
  +--> parser / validator / auth / limits
  |
  +--> logical query contract
          |
          +--> engine adapter: ClickHouse
          +--> engine adapter: DataFusion
          +--> engine adapter: DuckDB

This is valuable for two reasons.

1. It protects the product surface from engine churn

You do not want every engine decision to leak directly into the user contract.

If you later move:

  • from DataFusion to ClickHouse for scale
  • from DuckDB to DataFusion for custom integration
  • from one engine to a mixed model for different tenants

the application should not need a full product rewrite.

2. It lets the system choose engines by workload

In principle, the same logical SQL layer can route different workloads to different backends:

  • lightweight embedded execution for small tenants
  • heavier external serving for large tenants
  • local exploration with DuckDB-style execution
  • application-integrated query endpoints with DataFusion

You would not want to expose all of that complexity to end users. But it is a strong internal design pattern.

When Each Engine Is the Best Fit

The cleanest way to think about it is this.

Choose ClickHouse when:

  • query serving is a dedicated platform concern
  • concurrency is high
  • dashboards and APIs are central product surfaces
  • the team accepts running a specialized analytical service

Choose DataFusion when:

  • the application wants to own the query layer directly
  • Rust-native integration matters
  • custom planning, catalogs, or extensions matter
  • infrastructure simplicity matters more than maximum out-of-the-box serving scale

Choose DuckDB when:

  • the workload is local, embedded, or single-node enough
  • developer simplicity is a major goal
  • you want very fast adoption with minimal operational setup
  • the system does not need to behave like a large shared analytical service

When This Whole Architecture Is the Wrong Choice

This is the last and most important filter.

Even if the engine choice is sound, the entire query-first architecture can still be the wrong system.

It is usually the wrong choice when:

  • users rarely run analytical queries
  • the dominant problem is message delivery, not data retrieval
  • the workload needs ultra-low-latency stream processing more than SQL
  • data is mostly opaque blobs rather than queryable fields
  • the team really needs OLTP semantics, not analytical scans
  • the system cannot afford the complexity of object-store tables, cache policy, and query serving

In those cases, the better answer may be:

  • a broker-first pipeline
  • a simpler operational database
  • a direct stream processor
  • or even just raw durable files plus offline analysis

Architecture should follow the repeated user operation, not the elegance of the diagram.

The Strongest Practical Rule

If users need:

  • repeated analytical access
  • SQL over fresh data
  • cheap historical retention
  • acceptable but not extreme freshness latency

then the architecture in this series is worth serious consideration.

If users mostly need:

  • transport
  • replay
  • device state mutation
  • transactional updates
  • opaque payload handling

then this architecture is probably overbuilt.

That is the clearest system-fit rule in the whole series.

Decision Summary

The query engine is where the architecture becomes an actual product surface.

The main choices are:

  • external analytical service for scale and concurrency
  • embedded engine for tighter application ownership
  • in-process analytical database for simplicity and locality

Mapped onto the engines in this chapter:

  • ClickHouse is the strongest default when query serving is a platform in its own right
  • DataFusion is the strongest default when query execution should live inside your Rust application
  • DuckDB is the strongest default when local or lightweight embedded analytics matters most

The best long-term design is usually not to hard-code the product around one engine's identity. It is to expose one stable query surface and keep the engine boundary replaceable.

Closing the Series

Across these five parts, the argument has been consistent:

  • query-first is a different system design goal from ingest-first
  • Arrow and Parquet change where compute happens
  • object storage plus a transaction layer changes the durability boundary
  • caching makes the read path economically and operationally viable
  • the query engine determines how the system becomes usable

The common thread is simple:

if the product promise is immediate queryability, then representation, storage, cache, and execution all have to be chosen for the read path, not only the write path.

That is the real meaning of a query-first IoT data pipeline.