On-Disk Storage Methods (w/ visualizations)

Originally published on Efficiently (Substack). A few years ago, I gave a talk at Spark Summit 2020 about file formats, covering Avro, ORC, and Parquet. I received numerous questions about that topic and answered each one individually, which left the knowledge confined to those forums alone. That isn’t helpful for most people. This post aims to fix that. In this series, I’ll outline the primitives of this topic and then explore the hands-on details. Problem: In the efficiency space, minimizing “work” is key. Whether work requires compute, network, or storage, “the goal of efficient data usage is to get the most accurate answer in the fastest and cheapest way possible.” ...

January 14, 2023 · 6 min · Vinoo Ganesh

Introducing: Efficiently

Originally published on Efficiently (Substack). My name is Vinoo and I’ll be your guide, writer, and (likely) ranter throughout this series. 0. What this is: Recently, I had a friend recommend Austin Kleon’s Show Your Work. I read it over the course of a short plane ride and realized… it’s time to start showing my work. Publicly. This blog series, at its core, is a collection of things learned while building analytical tools, creating data products, and consuming data products. Mostly, it’s intended as a technical series for technical audiences, but the direction remains uncertain. ...

December 28, 2022 · 2 min · Vinoo Ganesh

O'Reilly Superstream Series: Data Pipelines

Data pipelines are the foundation for success in data analytics, so understanding how they work is of the utmost importance. Join us for four hours of expert-led sessions that will give you insight into how data is moved, processed, and transformed to support analytics and reporting needs. You’ll also learn how to address common challenges like monitoring and managing broken pipelines, explore considerations for choosing and connecting open source frameworks, commercial products, and homegrown solutions, and more. ...

August 10, 2022 · 2 min · Vinoo Ganesh

Designing Data Pipelines — with Interactivity

The data pipeline has become a fundamental component of the data science, data analyst, and data engineering workflow. Pipelines serve as the glue that links together various components of the data cleansing, data validation, and data transformation process. However, despite its importance to the data ecosystem, constructing the optimal data pipeline is generally an afterthought, if it’s considered at all. This makes any change to the central pipeline highly error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that are scalable and that allow for monitoring is pivotal for the safe and continued use of data. ...

March 10, 2022 · 1 min · Vinoo Ganesh

O'Reilly Radar: Data & AI

O’Reilly Radar: Data & AI will showcase what’s new, what’s important, and what’s coming in the field. It includes two keynotes and two concurrent three-hour tracks, designed to lay out for tech leaders the issues, tools, and best practices that are critical to an organization at any step of its data and AI journey. You’ll explore everything from prototyping and pipelines to deployment and DevOps to responsible and ethical AI.

Links: https://www.oreilly.com/videos/oreilly-radar-data/0636920654667/ and https://www.businesswire.com/news/home/20210909005792/en/O%E2%80%99Reilly-Announces-O%E2%80%99Reilly-Radar-Data-AI-to-Help-Tech-Leaders-Drive-Innovation-and-Successful-Implementation

October 14, 2021 · 1 min · Vinoo Ganesh