Designing Data Pipelines — with Interactivity

The data pipeline has become a fundamental component of the data science, data analyst, and data engineering workflow. Pipelines serve as the glue that links the data cleansing, data validation, and data transformation stages. However, despite their importance to the data ecosystem, pipeline design is generally an afterthought - if it’s considered at all. This makes any change to a central pipeline error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that scale and that allow for monitoring is pivotal to the safe and continued use of data. ...

March 10, 2022 · 1 min · Vinoo Ganesh

O'Reilly Radar: Data & AI

O’Reilly Radar: Data & AI will showcase what’s new, what’s important, and what’s coming in the field. It includes two keynotes and two concurrent three-hour tracks, designed to lay out for tech leaders the issues, tools, and best practices that are critical to an organization at any step of its data and AI journey. You’ll explore everything from prototyping and pipelines to deployment and DevOps to responsible and ethical AI.

Links:
https://www.oreilly.com/videos/oreilly-radar-data/0636920654667/
https://www.businesswire.com/news/home/20210909005792/en/O%E2%80%99Reilly-Announces-O%E2%80%99Reilly-Radar-Data-AI-to-Help-Tech-Leaders-Drive-Innovation-and-Successful-Implementation

October 14, 2021 · 1 min · Vinoo Ganesh

Guaranteeing pipeline SLAs and data quality standards with Databand

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy holds literally, where problems in the flow of data - delays, low quality, high volatility - could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom lines, data engineering teams need to set higher standards for the availability, completeness, and fidelity of their data. ...

July 14, 2021 · 1 min · Vinoo Ganesh

Migrating to Parquet

I work at a data-as-a-service (DaaS) company that delivers PBs of geospatial data to customers across a variety of industries. We build and manage a central data lake, housing years of data, and operationalize that data to solve our customers’ problems. I recently gave a talk about the specifics of file formats at Spark+AI Summit 2020 that generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand the drastic effect this migration had on all of our customers. This session will drill into the operational burden of transforming the storage format of an ecosystem and its impact on the business. ...
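The core change behind a CSV-to-Parquet migration is moving from a row-oriented text layout to a typed, columnar one. A minimal sketch of that contrast in pure Python (this is illustrative only, not actual Parquet encoding; the field names and values are hypothetical):

```python
import csv
import io

# Hypothetical geospatial records in row-oriented CSV form.
csv_text = """id,lat,lon,speed
1,40.71,-74.00,3.2
2,34.05,-118.24,1.1
3,41.88,-87.63,7.9
"""

# Row-oriented: every record must be parsed even if a query needs one field.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Columnar: pivot the rows into typed columns, as a Parquet writer does on disk.
columns = {
    "id": [int(r["id"]) for r in rows],
    "lat": [float(r["lat"]) for r in rows],
    "lon": [float(r["lon"]) for r in rows],
    "speed": [float(r["speed"]) for r in rows],
}

# A query touching one column now scans one contiguous, already-typed array
# instead of re-parsing every field of every row.
max_speed = max(columns["speed"])
```

In a real migration the pivot and type inference are handled by the writer (e.g. Spark writing Parquet), but the payoff is the same: single-column reads and no per-query text parsing.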

July 13, 2021 · 1 min · Vinoo Ganesh

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability of jobs to their compute cost. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back them, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll also describe the specific SparkConf / SQLConf settings that developers can use to tune these file formats. We’ll conclude with industry examples of the impact of the file format on the performance or stability of a job (including incorrect partition pruning introduced by a Parquet bug), and look ahead to emerging technologies (Apache Arrow). ...
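The predicate pushdown and pruning mentioned above rest on a simple idea: a Parquet footer stores per-row-group min/max statistics, so a reader can skip whole row groups without touching their data. A toy pure-Python sketch of the mechanism (illustrative only; the values and row-group size are made up, and real Parquet stats live in a binary footer):

```python
# Hypothetical column values, chunked into "row groups" as a writer would.
values = [3, 7, 5, 12, 15, 11, 22, 25, 21]
ROW_GROUP_SIZE = 3

row_groups = [
    values[i:i + ROW_GROUP_SIZE] for i in range(0, len(values), ROW_GROUP_SIZE)
]

# Footer-style statistics: (min, max) recorded per row group at write time.
stats = [(min(g), max(g)) for g in row_groups]

def scan_greater_than(threshold):
    """Evaluate `value > threshold`, pruning row groups via stats alone."""
    hits, groups_read = [], 0
    for (lo, hi), group in zip(stats, row_groups):
        if hi <= threshold:
            continue  # no value in this group can match: skip it unread
        groups_read += 1
        hits.extend(v for v in group if v > threshold)
    return hits, groups_read

hits, groups_read = scan_greater_than(20)  # only 1 of 3 groups is read
```

A Spark reader with pushdown enabled does the analogous skip against Parquet (or ORC stripe) statistics before decompressing any column data, which is why a selective filter can cost a fraction of a full scan.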

June 24, 2020 · 2 min · Vinoo Ganesh