Iceberg Locks Down Expressions and UDF Specific Names

The open lakehouse stack had a busy week, and Apache Iceberg accounted for much of it. The most significant move was a vote to adopt the new expressions spec, proposed by Ryan Blue. The spec defines the minimal structure and behavior of expressions, the "where date > Jan 1" parts of queries. Without a shared spec, each Iceberg implementation (Java, Python, Rust, Go, C++) had its own interpretation, which blocked features that depend on precise filtering. The vote thread gathered 32 messages and strong support, including binding +1s from Steven Wu and Szehon Ho.
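To make "structure and behavior of expressions" concrete, here is a minimal sketch of an expression tree and an evaluator. The node types (Ref, Lit, Cmp, And), the evaluate function, and the null-handling rule are assumptions made up for this illustration; they are not the names or semantics defined in the Iceberg spec.

```python
import operator
from dataclasses import dataclass
from datetime import date
from typing import Any, Union


@dataclass(frozen=True)
class Ref:
    """Reference to a column by name (hypothetical node type)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """Literal value (hypothetical node type)."""
    value: Any


@dataclass(frozen=True)
class Cmp:
    """Binary comparison; op must be one of the keys of _OPS."""
    op: str
    left: Union[Ref, Lit]
    right: Union[Ref, Lit]


@dataclass(frozen=True)
class And:
    """Logical conjunction of two sub-expressions."""
    left: Any
    right: Any


_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq}


def _value(term, row):
    # Column references are looked up in the row; literals evaluate to themselves.
    return row.get(term.name) if isinstance(term, Ref) else term.value


def evaluate(expr, row):
    """Evaluate an expression tree against one row (a dict of column -> value)."""
    if isinstance(expr, And):
        return evaluate(expr.left, row) and evaluate(expr.right, row)
    left, right = _value(expr.left, row), _value(expr.right, row)
    # Assumption for this sketch: a comparison with a missing value is false.
    # This is the kind of behavior a shared spec has to pin down.
    if left is None or right is None:
        return False
    return _OPS[expr.op](left, right)


# "where date > Jan 1", with an arbitrary year chosen for the example.
after_jan_1 = Cmp(">", Ref("date"), Lit(date(2024, 1, 1)))
print(evaluate(after_jan_1, {"date": date(2024, 3, 5)}))  # True
print(evaluate(after_jan_1, {"date": None}))              # False
```

Two implementations that agree on the tree shape but differ on the None branch would return different rows for the same filter, which is the kind of divergence a written spec is meant to rule out.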

Right after, Szehon Ho started a vote to add a specific-name field to the UDF (user-defined function) spec. In SQL, a function name like f can have multiple overloads; the specific-name field identifies exactly one of them. This matches the SQL standard and helps catalogs resolve function calls unambiguously.
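The idea can be sketched with a toy function registry. Everything here (Overload, FunctionEntry, the f_int and f_str names, the type strings) is hypothetical and only illustrates the lookup; it does not reflect the field layout in the Iceberg UDF spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Overload:
    specific_name: str            # unique identifier for this one overload
    param_types: Tuple[str, ...]  # parameter type names, e.g. ("int",)
    impl: Callable


class FunctionEntry:
    """All overloads that share one SQL-visible function name."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._by_specific: Dict[str, Overload] = {}

    def add(self, overload: Overload) -> None:
        # A specific name must point at exactly one overload.
        if overload.specific_name in self._by_specific:
            raise ValueError(f"duplicate specific name: {overload.specific_name}")
        self._by_specific[overload.specific_name] = overload

    def resolve_specific(self, specific_name: str) -> Overload:
        # Direct lookup: no signature matching, so no ambiguity.
        return self._by_specific[specific_name]

    def resolve_by_args(self, arg_types: Sequence[str]) -> Overload:
        # Lookup by call signature; fails unless exactly one overload matches.
        wanted = tuple(arg_types)
        matches = [o for o in self._by_specific.values() if o.param_types == wanted]
        if len(matches) != 1:
            raise LookupError(f"{len(matches)} overloads of {self.name} match {wanted}")
        return matches[0]


f = FunctionEntry("f")
f.add(Overload("f_int", ("int",), lambda x: x + 1))
f.add(Overload("f_str", ("string",), lambda s: s.upper()))

print(f.resolve_specific("f_int").impl(2))        # 3
print(f.resolve_by_args(["string"]).impl("ab"))   # AB
```

A reference by specific name stays valid even if further overloads of f are added later, whereas a reference by call signature has to be matched again each time.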

Cross-Implementation Conformance Testing

The most forward-looking discussion came from Neelesh Salian, who opened a thread on cross-implementation conformance testing. Iceberg now has five codebases, each with its own tests, but no shared way to verify that a table written by one is read correctly by another. Matt Topol echoed the need, citing cases where implementations silently disagreed. Tanmay Rauth put it sharply: "The hardest problems are not the outright bugs, they are the cases where two implementations both look correct and still produce different results." Danny Jones mentioned his team had already built similar test sets. The thread converged on creating a physical reference artifact: a real table that all implementations test against.

Concurrently, Sung Yun proposed a shared cross-language test fixtures repository called iceberg-testing. Anurag Mantripragada connected the two threads, and the contributors synced offline to merge their efforts. The planned result is a single shared test suite with reference tables, so that, for example, a Rust reader and a Java writer can be checked against the same data.
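As a rough sketch of what testing against a reference artifact could look like, the harness below runs several readers over the same table and reports any whose rows differ from the expected result. The reader functions, the table path, and the sample rows are stand-ins invented for this example; they are not part of iceberg-testing or any Iceberg library.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple
Reader = Callable[[str], Iterable[Row]]


def normalize(rows: Iterable[Row]) -> List[Row]:
    # Row order is not part of the contract here, so compare in a stable order.
    # Sorting by repr avoids errors when rows mix None with other values.
    return sorted(rows, key=repr)


def check_conformance(table_path: str,
                      expected: Iterable[Row],
                      readers: Dict[str, Reader]) -> Dict[str, List[Row]]:
    """Return {reader name: rows it produced} for every reader that disagrees."""
    want = normalize(expected)
    failures = {}
    for name, read in readers.items():
        got = normalize(read(table_path))
        if got != want:
            failures[name] = got
    return failures


# Stand-in data and readers (hypothetical): reader_b silently drops null rows.
REFERENCE_ROWS = [(1, "a"), (2, None), (3, "c")]


def reader_a(path: str) -> Iterable[Row]:
    return [(3, "c"), (1, "a"), (2, None)]


def reader_b(path: str) -> Iterable[Row]:
    return [(1, "a"), (3, "c")]


failures = check_conformance("reference/table", REFERENCE_ROWS,
                             {"reader_a": reader_a, "reader_b": reader_b})
print(sorted(failures))  # ['reader_b']
```

Each stand-in reader looks plausible on its own; only the comparison against a shared expected result exposes the disagreement.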

Column Update Representation: Dense vs. Sparse

A meaty spec debate centered on how to store updates to individual columns. Steven Wu argued that supporting both dense and sparse layouts forces every engine to implement the more complex sparse read path. Andrei Tserakhau made the sharp point that dense is a special case of sparse, so a reader that must accept both ends up carrying the heavier sparse code anyway. The thread leaned toward mandating a single dense representation now, with room for column families later.
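The "dense is a special case of sparse" point can be shown with a toy in-memory model. It assumes a dense update carries one new value per row and a sparse update carries explicit row positions plus values; these layouts and function names are illustrative only and are not the representation under discussion in the spec.

```python
from typing import Any, List, Sequence, Tuple


def apply_dense(base: Sequence[Any], values: Sequence[Any]) -> List[Any]:
    """Dense update: one new value for every row, in row order."""
    if len(values) != len(base):
        raise ValueError("dense update must cover every row")
    return list(values)


def apply_sparse(base: Sequence[Any],
                 positions: Sequence[int],
                 values: Sequence[Any]) -> List[Any]:
    """Sparse update: new values only for the listed row positions."""
    if len(positions) != len(values):
        raise ValueError("positions and values must have the same length")
    out = list(base)
    for pos, val in zip(positions, values):
        out[pos] = val  # rows not listed keep their base value
    return out


def dense_as_sparse(values: Sequence[Any]) -> Tuple[List[int], List[Any]]:
    """Express a dense update as a sparse one that lists every position."""
    return list(range(len(values))), list(values)


base = [10, 20, 30, 40]
new_column = [11, 21, 31, 41]

print(apply_sparse(base, [1, 3], [99, 77]))               # [10, 99, 30, 77]
print(apply_sparse(base, *dense_as_sparse(new_column)))   # [11, 21, 31, 41]
print(apply_dense(base, new_column))                      # [11, 21, 31, 41]
```

The dense path needs no merge with the base column, while the sparse path has to reconcile positions with existing values. That extra merge step is the read-path cost the thread was weighing.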

Polaris: Modular Design and Semantic Layer

Polaris had the busiest mailing list by volume. Dmitri Bourlatchkov opened a discussion on modular design for new features, warning that bolting every proposal onto the core makes the system harder to maintain. Russell Spitzer agreed that features should not be tightly coupled. The community settled on case-by-case judgment rather than blanket rules.

This set the backdrop for the semantic layer support discussion, which explored storing Open Semantic Interchange (OSI) data in Polaris. The proposal would allow Polaris to serve as a catalog for semantic models, bridging analytics and AI workloads.

Parquet and Arrow: Versioning and Benchmarking

Parquet dug into what a version number means when features ship faster than releases. The discussion aimed to clarify versioning semantics for the format, so that newly added features don't break existing readers.

Arrow rebuilt its benchmarking service, partly using an AI agent to generate code. The new service provides more reliable performance numbers across languages.

DataFusion Python Bindings and Polaris Release Vote

DataFusion cut a clean release of its Python bindings, making it easier for Python developers to run SQL queries on Arrow data without leaving the Python ecosystem.

Polaris held a release vote that failed, but for the right reasons. The community identified issues and called for a fix before proceeding, putting quality ahead of speed.

Why It Matters for Developers

If you run analytics on open formats, these threads directly affect your data's portability. The expressions spec should make filtering behavior consistent across Iceberg implementations, and so across engines like Spark, Flink, Trino, and DuckDB that rely on them. The conformance testing effort aims to let you trust that a table written by one engine will be read correctly by another. The column update decision will shape future performance for partial updates.

Next Steps

  • If you use Iceberg, review the expressions spec vote and consider how it affects your queries.
  • Watch for the iceberg-testing repository and consider contributing test cases from your own workloads.
  • For Polaris users, the modular design discussion signals that the project is maturing; the hoped-for result is cleaner APIs and fewer breaking changes.

Check the mailing lists for the latest on these proposals and vote outcomes.