Serialization and Storage Layout
How JSON, Protobuf, Arrow, and Parquet affect CPU cost, data shape, and the efficiency of repeated scans.
Why Arrow and Parquet change the compute model, not just the file format.
In Part 1, the core argument was that query-first is a different system design goal from ingest-first. This part makes that argument concrete.
The first major design choice is not really "Which serializer should we use?" The deeper question is:
In what representation should data live if we expect to query it repeatedly?
That question spans both memory and storage. It affects:
- how much CPU ingest spends up front
- how much CPU every future query repeats
- whether data can be scanned efficiently
- whether storage behaves like a query substrate or just a durability sink
For a query-first system, serialization and layout are one decision expressed in two phases:
- Arrow shapes data in memory
- Parquet shapes data on disk
Together they change the path from "message arrived" to "query answered."
Start With the Actual Incoming Message
Assume each device publish looks roughly like this:
{
"device_id": "dev-42",
"ts": "2026-04-06T09:15:00Z",
"site_id": "plant-7",
"firmware_version": "1.8.3",
"temperature": 36.4,
"humidity": 48.2,
"battery_mv": 3890,
"status_code": 0
}
At the device edge, this is perfectly reasonable. It is readable, debuggable, and easy to publish.
The question is what happens next.
If the system keeps this as an independent row-shaped payload all the way through, then every analytical query will repeatedly reconstruct the same logical columns from separate messages. That is the hidden cost.
JSON, Protobuf, and Arrow Are Solving Different Problems
This comparison is uneven, and it is worth saying so up front.
- JSON is a text wire format
- Protobuf is a compact binary wire format with schema
- Arrow is a columnar in-memory representation
So this is not a pure apples-to-apples codec comparison. It is a pipeline-design comparison. These are the real choices teams make when deciding what form data should take after arrival.
Option A: JSON
JSON is attractive because it is simple and universal.
Advantages:
- easy for devices and services to produce
- human-readable
- flexible during early development
- no special tooling required to inspect payloads
Costs:
- field names are repeated in every record
- numbers and timestamps arrive as text that must be parsed
- schema mistakes are discovered late
- queries over raw JSON reparse the structure from scratch each time
In an ingest-first pipeline, these costs are often tolerated because JSON is just the transport envelope. But in a query-first pipeline, JSON is expensive because it keeps the data in a representation that analytical systems do not naturally want.
The CPU story looks like this:
JSON path
Write time:
cheap publish
cheap append
Read time:
parse text
validate fields
cast types
rebuild columns
then execute query
That is acceptable when queries are rare. It is a poor bargain when the same data is scanned again and again.
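The read path above can be made concrete with a minimal sketch. It assumes the payloads are stored as one JSON document per line, and the function name `avg_temperature_by_site` is made up for illustration. The point is only that each step in the diagram runs again on every call.

```python
import json

# Raw payloads as they might sit in an append-only log (sample values are illustrative).
raw_log = [
    '{"device_id": "dev-42", "ts": "2026-04-06T09:15:00Z", "site_id": "plant-7", "temperature": 36.4}',
    '{"device_id": "dev-43", "ts": "2026-04-06T09:15:05Z", "site_id": "plant-7", "temperature": 35.9}',
    '{"device_id": "dev-44", "ts": "2026-04-06T09:15:09Z", "site_id": "plant-9", "temperature": 37.1}',
]

def avg_temperature_by_site(lines):
    """Run one query over raw JSON; every call repeats the whole read path."""
    sums, counts = {}, {}
    for line in lines:
        record = json.loads(line)                      # parse text
        if "site_id" not in record or "temperature" not in record:
            continue                                   # validate fields
        site = str(record["site_id"])
        temp = float(record["temperature"])            # cast types
        sums[site] = sums.get(site, 0.0) + temp        # regroup values per site
        counts[site] = counts.get(site, 0) + 1
    return {site: sums[site] / counts[site] for site in sums}   # execute query

result = avg_temperature_by_site(raw_log)
```

Calling the function a second time does all of the parsing and casting again. Nothing from the first call is reusable, because the stored form is still text.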
Option B: Protobuf
Protobuf fixes several JSON problems immediately.
Advantages:
- smaller payload size on the wire
- explicit schema
- faster decode than text formats
- better control over compatibility and evolution
That makes Protobuf an excellent transport choice.
But Protobuf does not automatically solve the query path.
A Protobuf message is still fundamentally a message. It is compact and typed, but analytics still requires the system to:
- decode each message
- materialize fields row by row
- reshape them into columns if the engine prefers columnar execution
So Protobuf usually reduces wire cost and decode cost, but it does not eliminate the row-to-column conversion step that analytical scans care about.
Protobuf path
Write time:
efficient binary publish
schema-aware append
Read time:
decode messages
materialize rows
regroup into columns
then execute query
This is a meaningful improvement over JSON. It is still not the same thing as storing data in a query-native shape.
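The remaining work can be sketched without a Protobuf toolchain. The example below is not Protobuf: it uses Python's `struct` module with a made-up fixed layout as a stand-in for any typed binary row message. The layout, field subset, and helper names are assumptions for illustration. What carries over is the shape of the read path: decode, materialize a row, regroup into columns.

```python
import struct

# Hypothetical fixed layout standing in for a typed binary message:
# 8-byte site id, float64 temperature, uint16 battery_mv (little-endian, no padding).
ROW = struct.Struct("<8sdH")

def encode(site_id, temperature, battery_mv):
    """Pack one reading into the assumed binary layout."""
    return ROW.pack(site_id.encode("ascii"), temperature, battery_mv)

messages = [
    encode("plant-7", 36.4, 3890),
    encode("plant-7", 35.9, 3875),
    encode("plant-9", 37.1, 3910),
]

def to_columns(msgs):
    """Decode each message, materialize a row, then regroup rows into columns."""
    columns = {"site_id": [], "temperature": [], "battery_mv": []}
    for msg in msgs:
        site_raw, temperature, battery_mv = ROW.unpack(msg)          # decode message
        row = (site_raw.rstrip(b"\x00").decode("ascii"),             # materialize row
               temperature, battery_mv)
        for name, value in zip(columns, row):                        # regroup into columns
            columns[name].append(value)
    return columns

columns = to_columns(messages)
```

No text parsing or type casting is left, which is the improvement over JSON. The per-message loop and the regrouping step are still there, and they run on every scan.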
Option C: Arrow
Arrow changes the problem because it is already organized around arrays of typed values.
Instead of storing each message as a self-contained object, Arrow stores the batch as columns:
Row-shaped messages
msg1 -> {device_id, ts, site_id, firmware_version, temperature, humidity, battery_mv, status_code}
msg2 -> {device_id, ts, site_id, firmware_version, temperature, humidity, battery_mv, status_code}
msg3 -> {device_id, ts, site_id, firmware_version, temperature, humidity, battery_mv, status_code}
Arrow-style columns
device_id -> [dev-42, dev-43, dev-44, ...]
ts -> [t1, t2, t3, ...]
site_id -> [plant-7, plant-7, plant-9, ...]
firmware_version -> [1.8.3, 1.8.3, 1.8.3, ...]
temperature -> [36.4, 35.9, 37.1, ...]
humidity -> [48.2, 49.1, 44.8, ...]
battery_mv -> [3890, 3875, 3910, ...]
status_code -> [0, 0, 2, ...]
That change matters because analytical engines naturally want to:
- scan only the columns mentioned in the query
- apply vectorized operations over arrays
- avoid reparsing field structure for every row
- move batches across process boundaries with minimal transformation
Arrow is not just "faster serialization." It is a different execution substrate.
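A rough way to see the difference is to pivot a few rows into typed columns once and then query a single column. The sketch below uses Python's standard `array` module as a stand-in for Arrow arrays. It is not the Arrow API, and the field subset is chosen only to keep the example short.

```python
from array import array

rows = [
    {"device_id": "dev-42", "site_id": "plant-7", "temperature": 36.4, "battery_mv": 3890},
    {"device_id": "dev-43", "site_id": "plant-7", "temperature": 35.9, "battery_mv": 3875},
    {"device_id": "dev-44", "site_id": "plant-9", "temperature": 37.1, "battery_mv": 3910},
]

# Pivot once: one container per field, with numeric fields stored as contiguous typed values.
columns = {
    "device_id": [r["device_id"] for r in rows],
    "site_id": [r["site_id"] for r in rows],
    "temperature": array("d", (r["temperature"] for r in rows)),   # float64
    "battery_mv": array("H", (r["battery_mv"] for r in rows)),     # unsigned 16-bit
}

# A query on temperature touches only the temperature column.
max_temp = max(columns["temperature"])

# Row positions that pass a filter can then be used to pick values from another column.
hot_positions = [i for i, t in enumerate(columns["temperature"]) if t > 36.0]
hot_devices = [columns["device_id"][i] for i in hot_positions]
```

After the pivot, neither query looks at `site_id` or `battery_mv`, and neither one re-reads field names or re-checks types.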
Arrow in Memory, Parquet on Disk
Arrow and Parquet are closely related, but they solve different stages of the pipeline.
- Arrow is the in-memory batch format
- Parquet is the on-disk columnar file format
That division is useful.
At ingest time, the system can accumulate readings into an Arrow RecordBatch. Once the batch is large enough, it can flush that batch into Parquet for durable storage.
The path looks like this:
Incoming messages
|
v
Schema validation
|
v
Arrow builders / typed arrays
|
v
Arrow RecordBatch
|
v
Parquet row groups and pages
|
v
Object storage
This is the point where the compute model and the storage model stop being independent.
If ingest already produces typed columnar batches, persisting them into a columnar file is natural. If ingest preserves raw row messages, then later systems must do the reshaping work as a separate step.
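The ingest path in the diagram can be sketched as a small builder that validates each message, appends values to typed columns, and hands a full batch to a flush callback. Everything here is a hypothetical stand-in: `BatchBuilder`, `SCHEMA`, and `on_flush` are invented names, the schema covers only three fields, and the callback marks the place where a real system would write a Parquet row group.

```python
from array import array

# Assumed subset of the message fields and their expected Python types.
SCHEMA = {"site_id": str, "temperature": float, "battery_mv": int}

class BatchBuilder:
    """Validate messages, append to typed columns, flush when the batch is full."""

    def __init__(self, max_rows, on_flush):
        self.max_rows = max_rows
        self.on_flush = on_flush      # called with the finished columns
        self._reset()

    def _reset(self):
        self.columns = {
            "site_id": [],
            "temperature": array("d"),   # float64 values
            "battery_mv": array("q"),    # signed 64-bit integers
        }

    def append(self, message):
        for name, expected in SCHEMA.items():           # schema validation
            if not isinstance(message.get(name), expected):
                raise ValueError(f"field {name!r} is missing or not {expected.__name__}")
        for name in SCHEMA:                             # typed append
            self.columns[name].append(message[name])
        if len(self.columns["site_id"]) >= self.max_rows:
            self.flush()

    def flush(self):
        if len(self.columns["site_id"]) > 0:
            self.on_flush(self.columns)
            self._reset()

written = []   # stand-in for durable storage
builder = BatchBuilder(max_rows=2, on_flush=written.append)
for message in [
    {"site_id": "plant-7", "temperature": 36.4, "battery_mv": 3890},
    {"site_id": "plant-7", "temperature": 35.9, "battery_mv": 3875},
    {"site_id": "plant-9", "temperature": 37.1, "battery_mv": 3910},
]:
    builder.append(message)
builder.flush()   # flush the final partial batch
```

With `max_rows=2`, the three messages produce one full batch and one partial batch. A malformed message is rejected at append time, before it can reach storage.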
Why Columnar Layout Helps
Columnar layout helps because many analytical queries do not need the whole row.
Consider this query:
SELECT site_id, avg(temperature)
FROM telemetry
WHERE ts >= now() - interval '1 hour'
GROUP BY site_id;
This query does not need:
- device_id
- firmware_version
- humidity
- battery_mv
- status_code
With row-oriented storage, the engine often has to touch far more bytes than the query logically needs. With columnar storage, it can read only the relevant columns and skip the rest.
That helps in several ways.
1. Projection becomes cheap
The query above touches three of the eight fields in the sample message (ts, site_id, temperature). Reading three columns out of eight is much cheaper when the columns are physically separate.
2. Compression gets better
Values in the same column usually resemble each other more than values across a whole row. Timestamps compress well with delta encoding because consecutive values differ by small, regular amounts. Repeated IDs compress well with dictionary encoding. Numeric telemetry often compresses well because adjacent values are similar.
3. Vectorized execution becomes natural
Filtering 10,000 temperatures in a typed array is a better fit for modern analytical engines than stepping through 10,000 independently decoded objects.
4. File-level and row-group statistics become useful
Formats like Parquet can store min/max and other metadata per row group. That allows engines to skip chunks of data before reading them fully.
5. Query cost becomes more predictable
Instead of paying parsing cost per message, the engine works over typed batches. That typically makes latency more stable under scan-heavy workloads.
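Two of the points above, compression and row-group statistics, are easy to sketch in plain Python. The functions below are simplified illustrations, not Parquet's actual encoders or reader. The names, the sample values, and the two-rows-per-group size are assumptions.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def delta_encode(values):
    """Keep the first value, then store the difference between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def build_row_groups(timestamps, temperatures, rows_per_group):
    """Split two parallel columns into groups and record min/max ts for each group."""
    groups = []
    for start in range(0, len(timestamps), rows_per_group):
        ts = timestamps[start:start + rows_per_group]
        groups.append({
            "min_ts": min(ts),
            "max_ts": max(ts),
            "ts": ts,
            "temperature": temperatures[start:start + rows_per_group],
        })
    return groups

def scan_since(groups, cutoff):
    """Return temperatures with ts >= cutoff and the number of groups actually read."""
    groups_read, matches = 0, []
    for g in groups:
        if g["max_ts"] < cutoff:
            continue                      # skipped using statistics alone
        groups_read += 1
        matches.extend(t for ts, t in zip(g["ts"], g["temperature"]) if ts >= cutoff)
    return matches, groups_read

site_dictionary, site_codes = dictionary_encode(["plant-7", "plant-7", "plant-9", "plant-7"])
ts_deltas = delta_encode([1775466900, 1775466905, 1775466910, 1775466915])

groups = build_row_groups(
    timestamps=[100, 110, 120, 130, 140, 150],
    temperatures=[36.4, 35.9, 37.1, 36.0, 36.2, 36.8],
    rows_per_group=2,
)
recent, groups_read = scan_since(groups, cutoff=130)
```

The four site IDs collapse to a two-entry dictionary plus small integer codes, and the timestamps collapse to one large value followed by small deltas. In the scan, the first group is rejected from its `max_ts` alone, so only two of the three groups are read. Skipping works this well only because the data is ordered by time, which is one reason sort order matters for layout.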
Where Columnar Layout Does Not Help
Columnar storage is not universally better.
It is weaker when the workload looks like this:
- single-row point lookups
- heavy row-level updates or deletes
- extremely low-latency operational reads
- tiny datasets where scan efficiency does not matter
- workloads dominated by opaque binary content
That last category matters more than many designs admit.
Columnar layout helps when fields are independently queryable. It helps far less when the important thing is a large blob that the query engine cannot meaningfully filter, aggregate, or compress structurally.
The Main Tradeoff: Ingest CPU vs Query CPU
This is the real decision.
With JSON or Protobuf, ingest can be lighter because the system mostly accepts messages in their native form and postpones expensive reshaping.
With Arrow and Parquet, ingest does more up front:
- schema validation happens earlier
- values are appended into typed builders
- batches are formed explicitly
- files are written with a query-oriented layout
That is more CPU work on write.
But it often saves much more CPU later:
- no repeated JSON parsing
- less row-by-row decoding
- less row-to-column conversion
- fewer bytes scanned per query
The trade can be summarized like this:
| Format choice | Ingest CPU | Query CPU | Best fit |
| --- | --- | --- | --- |
| JSON | Low | High | Simplicity, debugging, loose schemas |
| Protobuf | Low to medium | Medium to high | Efficient transport, typed messaging |
| Arrow + Parquet | Medium to high | Low to medium | Repeated analytical scans and fresh query access |
The key asymmetry is still the same:
- an event is ingested once
- the same event may be queried many times
If the query fan-out is high enough, spending extra CPU during ingest is often the cheaper total-system decision.
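The asymmetry can be written as a simple break-even condition. The symbols are assumptions introduced here, not measurements: $I$ is the per-event ingest cost, $Q$ is the per-event cost of one scan, $N$ is the number of times an event is scanned over its lifetime, and the subscripts distinguish the row-shaped path from the columnar path.

```latex
\begin{aligned}
C_{\text{row}}(N) &= I_{\text{row}} + N \, Q_{\text{row}} \\
C_{\text{col}}(N) &= I_{\text{col}} + N \, Q_{\text{col}} \\
C_{\text{col}}(N) < C_{\text{row}}(N)
  &\iff N > \frac{I_{\text{col}} - I_{\text{row}}}{Q_{\text{row}} - Q_{\text{col}}}
  \qquad \text{(assuming } Q_{\text{row}} > Q_{\text{col}}\text{)}
\end{aligned}
```

With purely hypothetical numbers: if columnar ingest costs 3 units more per event and each scan costs 0.5 units less per event, the columnar path is cheaper once each event is scanned more than 6 times.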
A Concrete Example of CPU Shift
Assume a dashboard refreshes every 15 seconds and serves many users. Each refresh needs the last hour of telemetry aggregated by site and firmware.
If the source data is JSON:
- every refresh re-parses text
- every refresh re-casts numeric values
- every refresh rebuilds analytical columns
If the source data is Protobuf:
- text parsing disappears
- but row-by-row decode and reshape still remain
If the source data is already available as Arrow batches or Parquet files:
- the engine reads typed columns directly
- it scans fewer bytes
- it avoids rebuilding the same structure on every refresh
That is why query-first systems tolerate a heavier ingest stage. They are trying to avoid multiplying read-path work by dashboard frequency, user concurrency, and historical scans.
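The multiplication is worth doing once with numbers. Only the 15-second refresh and the one-hour window come from the example above. The viewer count, device count, and publish interval are assumptions, and the sketch assumes no result caching between refreshes.

```python
REFRESH_SECONDS = 15            # from the dashboard example
WINDOW_SECONDS = 3600           # "last hour of telemetry"

concurrent_viewers = 20         # assumption
devices = 1000                  # assumption
publish_interval_seconds = 10   # assumption

# Refreshes issued per hour across all viewers.
refreshes_per_hour = (WINDOW_SECONDS // REFRESH_SECONDS) * concurrent_viewers

# Rows that fall inside a one-hour window, which is also the hourly ingest volume.
rows_in_window = devices * WINDOW_SECONDS // publish_interval_seconds

# Each refresh scans the whole window.
rows_scanned_per_hour = refreshes_per_hour * rows_in_window

# How many times an ingested row is read while it stays inside the window.
scans_per_ingested_row = rows_scanned_per_hour // rows_in_window
```

Under these assumptions the system ingests 360,000 rows per hour and scans 1,728,000,000. Each row is written once and read 4,800 times, so any per-row cost left on the read path is paid 4,800 times over.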
Large Binary Payloads Change the Answer
The clean Arrow-to-Parquet story starts to break when messages contain large binary payloads such as:
- camera frames
- audio samples
- firmware bundles
- diagnostic dumps
- arbitrary attachments
These blobs behave differently from scalar telemetry fields.
Why blobs do not benefit much from columnar layout
A query usually does not ask:
- "compute the average of these JPEG bytes"
- "group these firmware binaries by content prefix"
The query usually asks about metadata around the blob:
- device ID
- timestamp
- camera ID
- content type
- object key
- inference label
That means the blob itself is often not the analytical surface. It is just a large payload attached to the event.
What goes wrong if you force blobs into the same columnar path
Several problems appear:
- files become large and awkward to compact
- writer memory pressure increases
- compression benefits are limited for high-entropy binary content
- queries that only want metadata still read files whose size and row-group layout are dominated by blob data
- cache efficiency suffers because small, frequently read metadata and large, rarely read payloads share the same files
Parquet can technically store binary columns, but "possible" is not the same thing as "good system design."
A Better Hybrid Layout
For blob-heavy workloads, a hybrid design is usually better:
Incoming event
|
+--> metadata fields ---------> Arrow / Parquet ---------> query engine
|
+--> large binary payload ----> object storage blob key -> fetched only when needed
In this model:
- searchable metadata stays in columnar form
- the large payload is stored separately in object storage
- the table stores a pointer, URI, checksum, or content key
- analytical queries stay fast because they scan only metadata
- blob retrieval happens lazily and only for the small subset of rows that need it
This keeps the analytical path clean without giving up access to the raw binary data.
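A minimal sketch of the split, under stated assumptions: a dict stands in for the object storage bucket, a list of dicts stands in for the columnar metadata table, and the `blobs/<sha256>` key scheme is a hypothetical choice, not a recommendation from this series.

```python
import hashlib

object_store = {}    # stand-in for an object storage bucket: key -> bytes
metadata_rows = []   # stand-in for the columnar metadata table

def ingest(event):
    """Store the blob separately and keep only metadata plus a pointer in the table."""
    blob = event["payload"]
    digest = hashlib.sha256(blob).hexdigest()
    key = f"blobs/{digest}"                  # hypothetical content-addressed key
    object_store[key] = blob
    row = {k: v for k, v in event.items() if k != "payload"}
    row.update(blob_key=key, blob_sha256=digest, blob_bytes=len(blob))
    metadata_rows.append(row)
    return row

def fetch_payload(row):
    """Fetch a blob lazily and verify it against the stored checksum."""
    blob = object_store[row["blob_key"]]
    if hashlib.sha256(blob).hexdigest() != row["blob_sha256"]:
        raise ValueError("checksum mismatch")
    return blob

events = [
    {"device_id": "dev-42", "ts": "2026-04-06T09:15:00Z", "camera_id": "cam-1",
     "content_type": "image/jpeg", "payload": b"\xff\xd8 placeholder frame A"},
    {"device_id": "dev-43", "ts": "2026-04-06T09:15:05Z", "camera_id": "cam-2",
     "content_type": "image/jpeg", "payload": b"\xff\xd8 placeholder frame B"},
]
for event in events:
    ingest(event)

# A metadata query never touches the blobs.
cam1_rows = [r for r in metadata_rows if r["camera_id"] == "cam-1"]

# The payload is fetched only for the rows that need it.
frame = fetch_payload(cam1_rows[0])
```

The metadata table stays small and scannable. The cost of moving large payloads is paid only for the rows a caller actually opens.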
When a Raw-Bytes Passthrough Mode Makes Sense
Some workloads should not be forced into an Arrow-first ingest path at all.
A raw-bytes passthrough mode makes sense when:
- payloads are mostly opaque binaries
- query demand is low or metadata-only
- ingest throughput matters more than immediate SQL over full payloads
- the product primarily stores and forwards content rather than analyzing it
In those cases, the system can still be query-first for metadata while being passthrough-first for the heavy payload itself.
That is an important distinction. A good architecture does not insist that every byte in the system must receive identical treatment.
Why Arrow and Parquet Change the Storage Model
At this point the key architectural shift should be clear.
JSON and Protobuf mostly answer:
- how do we represent one message?
Arrow and Parquet answer:
- how do we represent many messages together so scans are cheap?
That is a fundamentally different storage question.
Once data is stored in columnar batches on object storage, the storage system is no longer just keeping durable copies. It is actively shaping query behavior through:
- batch size
- row-group sizing
- partitioning
- sort order
- column statistics
- file count
This is why serialization and storage layout belong in the same discussion. They jointly determine how much work the system repeats on every read.
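Partitioning is the easiest of these knobs to illustrate. The sketch below derives a Hive-style `key=value` path prefix from a reading and then prunes files by prefix before anything is opened. The partition keys (site, date, hour) and the function names are assumptions. The right keys depend on the queries the system has to serve.

```python
from datetime import datetime, timezone

def partition_path(site_id, ts):
    """Build a key=value partition prefix from the site and the event hour (UTC)."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"site_id={site_id}/date={t:%Y-%m-%d}/hour={t:%H}"

def prune(paths, site_id):
    """Keep only the paths that can contain rows for the requested site."""
    prefix = f"site_id={site_id}/"
    return [p for p in paths if p.startswith(prefix)]

paths = [
    partition_path("plant-7", "2026-04-06T09:15:00Z"),
    partition_path("plant-7", "2026-04-06T10:02:00Z"),
    partition_path("plant-9", "2026-04-06T09:15:09Z"),
]
plant7_paths = prune(paths, "plant-7")
```

A query filtered to one site never lists or opens the other site's files. The same layout also sets the file count, which is why partitioning has to be weighed against batch size rather than chosen in isolation.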
The Dominant Tradeoff
The dominant tradeoff is straightforward:
- pay more structure cost once at ingest
- avoid paying decode and scan cost repeatedly at query time
That trade is good when:
- the same data is queried many times
- freshness matters
- analytical scans are a core product feature
That trade is bad when:
- reads are rare
- messages are mostly opaque blobs
- the simplest possible ingest path matters more than query efficiency
Decision Summary
If the system's promise is immediate queryability, then Arrow and Parquet are compelling because they align the ingest representation with the query engine's preferred shape.
That does not make JSON or Protobuf obsolete.
- JSON remains useful at the edge and for debugging
- Protobuf remains excellent for transport efficiency and schema discipline
- Arrow and Parquet become valuable when the system must repeatedly answer analytical questions over fresh data
The question is not "Which format is best in general?"
The right question is:
Where do you want the system to spend its CPU, and in what shape do you want data to exist when the first real query arrives?
Where the Series Goes Next
This part covered the serialization and layout decision:
- JSON is flexible but expensive to query repeatedly
- Protobuf is a better wire format, but still not a query-native layout
- Arrow and Parquet move work earlier so query execution does less reconstruction
- large binary payloads often need a hybrid design rather than a pure columnar path
The next part moves from format choice to durability and correctness: object storage, transaction layers, optimistic concurrency, fault tolerance, and why raw Parquet files alone are not enough for a real system of record.