Guaranteeing pipeline SLAs and data quality standards with Databand

We’ve all heard the phrase “data is the new oil.” But imagine a world where the analogy is literal, where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses just as reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom lines, data engineering teams need to set higher standards for the availability, completeness, and fidelity of their data. ...

May 28, 2021 · 1 min · 193 words · Vinoo Ganesh

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of Spark jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem, namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. Next, we’ll describe the specific SparkConf / SQLConf settings that developers can use to tune these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance and stability of a job (including incorrect partition pruning introduced by a Parquet bug), and look ahead to emerging technologies (Apache Arrow). ...

June 24, 2020 · 2 min · 299 words · Vinoo Ganesh
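
As a rough illustration of the predicate pushdown idea mentioned in the abstract above, here is a minimal sketch in plain Python. It is not Spark or Parquet code: `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names invented for this example, and the model assumes only that each row group carries min/max statistics for a column, which a reader can check before deciding whether to read the group at all.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group: values of one column plus min/max stats."""
    rows: List[int]
    min_value: int
    max_value: int


def make_row_group(rows: List[int]) -> RowGroup:
    # Statistics are computed once at write time and stored alongside the data.
    return RowGroup(rows=rows, min_value=min(rows), max_value=max(rows))


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Pushdown check: if even the largest value fails the predicate,
        # the whole group can be skipped without reading its rows.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(value for value in group.rows if value > threshold)
    return matches, groups_read


groups = [
    make_row_group([1, 5, 9]),
    make_row_group([10, 14, 19]),
    make_row_group([20, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, threshold=18)
print(matches, groups_read)
```

In this toy run the first group is skipped purely from its statistics, so only two of the three groups are read. Real formats apply the same idea to on-disk metadata, where skipping a group avoids the IO entirely.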