Data Storage Evolution: From Rows to Columnar Files for Scalability

Engineering Kiosk · Feb 17, 2026 · German · 5 min read

Explore the evolution of data storage from traditional row-oriented databases to modern columnar formats and data lakes, optimizing scalability and cloud costs.

Key Insights

  • Insight

    The phrase "it doesn't scale" often leads to costly database replacements without proper root cause analysis. This can result in exchanging known problems for new, unknown ones, highlighting a common pitfall in tech decision-making.

    Impact

    Avoiding premature optimization can save significant engineering effort and financial resources, directing investment towards actual bottlenecks rather than perceived ones.

  • Insight

    Row-oriented databases are optimized for OLTP (Online Transactional Processing) workloads with frequent small reads/writes, while column-oriented databases excel in OLAP (Online Analytical Processing) for large-scale aggregations due to better compression and vectorized processing.

    Impact

    Selecting the appropriate database architecture based on workload type drastically improves performance, reduces operational costs, and enhances query efficiency for specific business needs.

  • Insight

Apache Parquet is a binary, columnar file format ideal for analytical systems, enabling efficient data compression and query performance through features like Row Groups and footer-based metadata, which supports streaming writes and partial reads.

    Impact

    Adopting Parquet can lead to significant reductions in storage footprint and faster query execution times, making large-scale data analytics more feasible and cost-effective.

  • Insight

    Apache Iceberg serves as a crucial management layer for Parquet files in data lakes, providing schema evolution, transactional guarantees, parallel write support, and 'time travel' capabilities, transforming raw files into robust tables.

    Impact

    Iceberg enhances data reliability and governance in data lakes, enabling complex analytical operations and historical data analysis without the overhead of traditional data warehousing.

  • Insight

    The separation of storage and compute, facilitated by columnar file formats and management layers, enables highly cost-efficient data architectures in the cloud, utilizing inexpensive object storage for vast datasets.

    Impact

    This modular approach allows organizations to scale storage and compute independently, leading to substantial cost savings and greater flexibility in deploying diverse analytical workloads.

  • Insight

    Tiered storage strategies, leveraging Parquet and Iceberg, are being integrated into high-volume systems like Kafka and Prometheus to manage escalating storage costs for long-term data retention, especially in observability.

    Impact

    Implementing tiered storage can drastically cut down operational expenses for large-scale data platforms, allowing for longer data retention policies essential for historical analysis and compliance.

Key Quotes

"You often replace a known problem with a new, still unknown problem."
"If you want to calculate the sum of all prices, you really only read that price column."
"Iceberg is the blueprint, the construction supervision, and the land registry office. The bricks are the storage material. Iceberg ensures that it becomes a stable house."

Summary

The Peril of Premature Optimization: "It Doesn't Scale"

The phrase "it doesn't scale" often triggers a rush to replace core database systems, frequently swapping a familiar problem for an unknown, potentially more complex one. This podcast episode delves into the fundamentals of data storage and processing, showing how a deeper understanding of data structures and access patterns leads to more strategic and cost-effective decisions.

Row-Oriented vs. Columnar Storage: A Foundational Choice

Traditional relational databases like PostgreSQL and MySQL are primarily row-oriented and excel at Online Transactional Processing (OLTP) workloads. These systems are optimized for frequent, small point operations: individual record lookups, updates, and inserts typical of banking or e-commerce transactions. Their efficiency stems from storing entire rows contiguously, allowing quick retrieval of all attributes of a given record.

Conversely, columnar data stores (e.g., ClickHouse, Google BigQuery, Snowflake) are built for Online Analytical Processing (OLAP). These are designed for massive data scans, complex aggregations, and business intelligence queries over vast datasets. By storing data column by column, they achieve superior data compression, enable "column pruning" (reading only necessary columns), and leverage CPU-friendly vectorized processing, drastically improving performance for analytical queries.
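The column-pruning and compression advantages described above can be demonstrated with a small, self-contained sketch (plain Python over illustrative toy data, not any particular database engine): summing a price column touches only that column's bytes, and similar values stored together compress far better than rows that interleave them with other attributes.

```python
import zlib

# Toy dataset: 10,000 orders with three attributes each.
orders = [{"id": i, "country": "DE", "price": 9.99} for i in range(10_000)]

# Row-oriented layout: whole records stored contiguously.
row_bytes = ";".join(f"{o['id']},{o['country']},{o['price']}" for o in orders).encode()

# Column-oriented layout: each attribute stored as its own contiguous block.
prices = [o["price"] for o in orders]
price_bytes = ",".join(str(p) for p in prices).encode()

# Column pruning: SUM(price) only needs the price column.
total = sum(prices)

# Similar values compress far better when stored together.
row_ratio = len(zlib.compress(row_bytes)) / len(row_bytes)
col_ratio = len(zlib.compress(price_bytes)) / len(price_bytes)
print(f"SUM(price) = {total:.2f}")
print(f"row layout compresses to {row_ratio:.1%}, price column to {col_ratio:.1%}")
```

The repetitive price column collapses to a tiny fraction of its raw size, while the row layout, whose records mix varying IDs with the repeated values, compresses noticeably worse.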

The Rise of File-Based Columnar Formats and Data Lake Management

Beyond database engines, Apache Parquet has emerged as a standard, open-source, binary columnar file format. Developed by Twitter and Cloudera, Parquet is optimized for analytical workloads, featuring "Row Groups" for internal partitioning and metadata stored in the footer, which facilitates efficient streaming writes and partial data reads. This design is crucial for handling petabytes of data in modern data lakes.

However, managing numerous Parquet files across distributed storage systems (like Amazon S3) introduces challenges such as schema evolution and transactional integrity. This is where Apache Iceberg steps in. Iceberg acts as a crucial management layer, providing a table format that sits atop Parquet files. It offers schema evolution, transactional guarantees, parallel write support, and a highly valuable "time travel" feature, allowing users to query data as it existed at past points in time. This transforms a collection of files into a robust, ACID-compliant table.
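The snapshot mechanism behind Iceberg's time travel can be sketched conceptually (a toy model, not the real Iceberg metadata spec): the table is an append-only log of immutable snapshots, each listing the data files that make up the table at that point, so reading an older snapshot reproduces the table's historical state.

```python
class ToyTable:
    """Conceptual sketch of a table format like Iceberg: commits append
    immutable snapshots to a metadata log; readers pick a snapshot."""

    def __init__(self):
        self.snapshots = []  # append-only metadata log

    def commit(self, added_files, removed_files=()):
        current = self.snapshots[-1]["files"] if self.snapshots else []
        files = [f for f in current if f not in removed_files] + list(added_files)
        self.snapshots.append({"id": len(self.snapshots), "files": files})

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or 'time travel' to an older one."""
        if not self.snapshots:
            return []
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]["files"]

table = ToyTable()
table.commit(["orders-0001.parquet"])
table.commit(["orders-0002.parquet"])
table.commit([], removed_files=["orders-0001.parquet"])  # e.g. compaction

print(table.scan())               # current view of the table
print(table.scan(snapshot_id=0))  # time travel: state after the first commit
```

Because old snapshots and data files are never mutated in place, historical queries stay consistent, and concurrent writers can coordinate by atomically appending the next snapshot.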

Strategic Implications: Cost Savings and Modularity

The adoption of columnar file formats like Parquet, orchestrated by Iceberg, enables a modular architecture where storage and compute are decoupled. This is a game-changer for cloud economics. By storing data on inexpensive object storage (e.g., S3) and processing it with various compute engines (e.g., Presto, DuckDB), organizations can significantly reduce costs compared to monolithic, tightly coupled database systems. This approach also allows for sophisticated tiered storage strategies, where hot data resides on fast storage while older, less frequently accessed data is moved to more cost-effective cold storage.

This tiered approach is finding increasing application in high-volume data systems like Kafka (for archiving older message segments) and Prometheus (for storing historical time-series blocks). By leveraging Parquet's efficiencies, these systems can extend data retention and manage escalating observability costs without compromising performance for hot data.
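At its core, a tiered-storage decision can be an age-based policy; the sketch below (hypothetical names and thresholds, not Kafka's or Prometheus's actual implementation) shows the rule: segments younger than a hot window stay on fast local storage, older ones move to cheap object storage.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: closed segments older than the hot
# window are rewritten as columnar files on cheap object storage.
HOT_WINDOW = timedelta(days=7)

def tier_for(segment_end, now, hot_window=HOT_WINDOW):
    """Decide where a closed segment should live based on its age."""
    return "hot-local-disk" if now - segment_end <= hot_window else "cold-object-storage"

now = datetime(2026, 2, 17, tzinfo=timezone.utc)
segments = {
    "seg-0042": datetime(2026, 2, 16, tzinfo=timezone.utc),  # 1 day old
    "seg-0007": datetime(2026, 1, 1, tzinfo=timezone.utc),   # 47 days old
}
for name, end in segments.items():
    print(name, "->", tier_for(end, now))
```

Production systems layer compaction, checksums, and catalog updates on top of this rule, but the cost lever is exactly this: only the hot window pays for fast storage.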

Conclusion: A Measured Approach to Modern Data Challenges

While the allure of cutting-edge technology is strong, the key lies in understanding your specific workload requirements. For traditional transactional systems, row-oriented databases remain highly effective. For analytical workloads demanding massive scale, performance, and cost efficiency, columnar formats like Parquet, managed by Iceberg, offer compelling advantages. Finance, investment, and leadership professionals should encourage strategic analysis of existing data architectures, challenge assumptions about scalability, and explore these modern data solutions to unlock significant operational efficiencies and cost savings.

Action Items

Implement a rigorous root cause analysis process for scalability issues, moving beyond superficial diagnoses to understand underlying data structures and access patterns before initiating major system overhauls.

Impact: Prevents costly and potentially counterproductive technology migrations, ensuring that resources are invested in solutions that address actual performance bottlenecks effectively.

Evaluate and adopt column-oriented storage solutions or file formats like Parquet for all new analytical workloads to maximize compression, optimize query performance, and reduce storage costs for large datasets.

Impact: Significantly improves the efficiency and speed of business intelligence and data analytics initiatives, directly impacting decision-making capabilities and operational insights.

Integrate data lake management frameworks like Apache Iceberg when working with large volumes of data stored in file formats, ensuring data integrity, schema evolution, and enabling advanced features like time-travel for robust analytics.

Impact: Establishes a more resilient and manageable data infrastructure, crucial for scaling data operations and supporting complex data governance and auditing requirements.

Develop and implement tiered storage strategies for high-volume data platforms such as Kafka and Prometheus, moving less-frequently accessed historical data to cost-effective object storage using columnar formats.

Impact: Achieves substantial cost reductions in cloud infrastructure, particularly for observability and streaming data, while maintaining necessary data accessibility for long-term analysis.

Adopt a pragmatic approach to technology adoption: start with simpler, proven database solutions for new projects and only introduce more complex, specialized architectures (e.g., columnar or data lakes) when dictated by genuine scale or analytical requirements.

Impact: Minimizes initial development complexity and operational overhead, allowing teams to focus on core business value and scale infrastructure incrementally as demand grows.

Mentioned Companies

  • Highlighted as a performant columnar database, capable of reading Parquet files natively and offering significant compression benefits for analytical workloads.

  • Google BigQuery: presented as a leading columnar data warehouse, utilizing Dremel's principles and supporting Iceberg.

  • Praised for its hybrid row/columnar storage capabilities and pioneering work in in-memory databases.

  • Prometheus: discussed positively for its ongoing development to integrate Parquet for tiered storage, addressing observability cost challenges.

  • DuckDB: recommended as an easy-to-use in-memory database capable of parsing Parquet, CSV, and JSON, facilitating local data analysis.

  • ClickHouse: cited as a prominent example of a columnar data store optimized for OLAP workloads.

  • Snowflake: mentioned as another key columnar data store in the cloud analytics space.

  • Twitter: credited as a co-developer of Apache Parquet, highlighting its role in open-source data formats.

  • Cloudera: acknowledged as a co-developer of Apache Parquet and for simplifying Hadoop distributions for enterprises.

  • Mentioned in the context of leveraging tiered storage with Iceberg and Parquet to manage log files on cheaper object storage.

  • Mentioned with historical issues of data loss and lack of normalization, contrasting it with more stable solutions.

  • Used as an example of observability platforms that can incur high cloud costs, prompting the need for tiered storage solutions.

Keywords

Database scaling strategies, Columnar vs row databases, Data lake architecture, Parquet file format benefits, Iceberg data management, Cloud cost optimization data, Tiered storage Kafka, Prometheus tiered storage, Analytical processing performance, Data storage trends