Parquet

Hands-On: Predicate Pushdown

Originally published on Efficiently (Substack) We’ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, let’s talk about this in practice. In this post, I’m going to: Read a CSV dataset into Spark Write the dataset into 5 Parquet files (treating each file as a block) Introspect metadata existing on the files Run queries demonstrating predicate pushdown power Hands-On: Setup The tutorial uses an airports dataset. Download it via: ...

On-Disk Storage Methods (w/ visualizations)

Originally published on Efficiently (Substack) A few years ago, I gave a talk at Spark Summit 2020 about File Formats covering Avro, ORC, and Parquet. I received numerous questions about that topic, responding point-to-point, leaving the knowledge confined to those forums alone. That isn’t helpful for most people. This post aims to fix that. In this series, I’ll outline the primitives of this topic and then explore the hands-on details. Problem In the efficiency space, minimizing “work” is key. Whether work requires compute, network, or storage, “the goal of efficient data usage is to get the most accurate answer in the fastest and cheapest way possible.” ...

Apache Parquet Website

The Parquet website was a bit dated - especially given its heavy usage. I took the opportunity to rebuild the website using Hugo. Check it out and please let me know if you have any feedback! Code https://github.com/apache/parquet-site/commit/3563721676b364b767058a953f2bcc3e2c0c4b09 Link https://parquet.apache.org

Migrating to Parquet

I work at a data-as-a-service (DaaS) company that delivers PBs of geospatial data to customers across a variety of industries. We build and manage a central data lake, housing years of data, and operationalize that data to solve our customers’ problems. I recently gave a talk about the specifics of file formats at Spark+AI Summit 2020 that generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand how this migration had a drastic effect for all of our customers. This session will drill into the operational burden of transforming the storage format in an ecosystem and its impact on the business. ...

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications to everything from the ongoing stability to compute cost of compute jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of the advent of these file formats from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll continue to describe the specific SparkConf / SQLConf settings that developers can use to tune the settings behind these file formats. We’ll conclude with specific industry examples of the impact of the file on the performance of the job or the stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow). ...