Hands-On Introduction: Data Engineering

In this course, instructor Vinoo Ganesh gives you an overview of the fundamental skills you need to become a data engineer. Learn how to solve complex data problems in a scalable, concrete way. Explore the core principles of the data engineer toolkit—including ELT, OLTP/OLAP, orchestration, DAGs, and more—as well as how to set up a local Apache Airflow deployment and full-scale data engineering ETL pipeline. Along the way, Vinoo helps you boost your technical skill set using real-world, hands-on scenarios. ...

April 28, 2023 · 1 min · Vinoo Ganesh

The Efficiently Guide to Snowflake (Top Down)

Originally published on Efficiently (Substack).

The majority of my career has been focused on making data systems more efficient, whether that means performance, scalability, or cost. This series aims to democratize knowledge about how to Efficiently operationalize data.

TLDR

4 changes you can make right now to run Snowflake more Efficiently:

1. File a Snowflake support ticket and request access to the GET_QUERY_STATS function
2. ALTER WAREHOUSE <warehouseName> SET AUTO_SUSPEND = 60;
3. For multi-cluster warehouses:
   ALTER WAREHOUSE <warehouseName> SET MIN_CLUSTER_COUNT = 1;
   ALTER WAREHOUSE <warehouseName> SET SCALING_POLICY = ECONOMY;
4. ALTER WAREHOUSE <warehouseName> SET STATEMENT_TIMEOUT_IN_SECONDS = 36000;

Snowflake + Driving

Snowflake optimization resembles efficient driving. There are four parallel constraints: ...

February 2, 2023 · 3 min · Vinoo Ganesh

Hands-On: Predicate Pushdown

Originally published on Efficiently (Substack).

We’ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let’s put it into practice. In this post, I’m going to:

1. Read a CSV dataset into Spark
2. Write the dataset into 5 Parquet files (treating each file as a block)
3. Introspect the metadata stored in the files
4. Run queries demonstrating the power of predicate pushdown

Hands-On: Setup

The tutorial uses an airports dataset. Download it via: ...
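
The idea behind the steps above can be sketched without Spark or Parquet. The following is a minimal, illustrative model in plain Python, not the post's actual code: each "block" keeps min/max statistics for one column, and a query skips any block whose range cannot contain the target value. The `Block` class, the helper names, and the elevation values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One file treated as a block, holding values for a single column."""
    name: str
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def split_into_blocks(values: List[int], num_blocks: int) -> List[Block]:
    """Sort the values and split them into roughly equal-sized blocks."""
    ordered = sorted(values)
    size = -(-len(ordered) // num_blocks)  # ceiling division
    return [
        Block(f"part-{i}", ordered[start:start + size])
        for i, start in enumerate(range(0, len(ordered), size))
    ]


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], List[str]]:
    """Return matching values and the names of the blocks that were read."""
    matches: List[int] = []
    blocks_read: List[str] = []
    for block in blocks:
        # Pushdown step: consult min/max metadata and skip the block entirely
        # when the target cannot be inside it.
        if target < block.min_value or target > block.max_value:
            continue
        blocks_read.append(block.name)
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


# Hypothetical elevation column: 100 values, written as 5 blocks.
elevations = list(range(0, 1000, 10))
blocks = split_into_blocks(elevations, 5)
matches, blocks_read = scan_equal(blocks, 420)
print(matches, blocks_read)
```

In this toy model only one of the five blocks is read for the query, because the other four are ruled out by their statistics alone. Sorting before splitting is a design choice made here so the block ranges do not overlap; with unsorted data, more blocks would have to be read.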

January 28, 2023 · 3 min · Vinoo Ganesh

Distributed Data and Blocks

Originally published on Efficiently (Substack).

This is a continuation of a previous blog post about efficient data partitioning. In the previous post, I discussed how data layout on disk impacts analytics performance. This post focuses on tactical implementation using open source technologies. Topics I’ll cover:

1. HDFS Blocks + Block Size
2. Block sizes + tradeoffs

Background

Data organization on disk dramatically affects analytics performance. I previously explored row-oriented, columnar, and hybrid storage models. Now let’s connect these concepts to modern data infrastructure. ...
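
To make the block-size tradeoff concrete, here is a small arithmetic sketch in Python. It assumes the HDFS default block size of 128 MiB (the `dfs.blocksize` default in Hadoop 2.x and later); the `block_layout` helper and the example file sizes are hypothetical and only illustrate how a file is cut into blocks.

```python
MIB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MIB  # assumed HDFS default (dfs.blocksize)


def block_layout(file_size: int, block_size: int = DEFAULT_BLOCK_SIZE):
    """Return (number of blocks, size of the last block in bytes)."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    if file_size == 0:
        return 0, 0
    full_blocks, remainder = divmod(file_size, block_size)
    if remainder == 0:
        return full_blocks, block_size
    # The final block only holds the leftover bytes.
    return full_blocks + 1, remainder


# A hypothetical 300 MiB file under two block sizes.
print(block_layout(300 * MIB))            # 128 MiB blocks
print(block_layout(300 * MIB, 64 * MIB))  # 64 MiB blocks
```

With 128 MiB blocks the 300 MiB file occupies 3 blocks; with 64 MiB blocks it occupies 5. Smaller blocks generally mean more units to track and schedule, while larger blocks mean fewer units available to process in parallel.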

January 24, 2023 · 3 min · Vinoo Ganesh

On-Disk Storage Methods (w/ visualizations)

Originally published on Efficiently (Substack).

A few years ago, I gave a talk at Spark Summit 2020 about file formats, covering Avro, ORC, and Parquet. I received numerous questions about that topic and answered them one by one, which left the knowledge confined to those forums. That isn’t helpful for most people. This post aims to fix that. In this series, I’ll outline the primitives of this topic and then explore the hands-on details.

Problem

In the efficiency space, minimizing “work” is key. Whether work requires compute, network, or storage, “the goal of efficient data usage is to get the most accurate answer in the fastest and cheapest way possible.” ...
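
One way to picture "minimizing work" is to compare how many values a query has to touch under different on-disk layouts. The sketch below is a simplified model in plain Python, not a real file format: the records, column names, and the "values touched" counter are hypothetical stand-ins for actual I/O. It stores the same records row by row (as a row-oriented format such as Avro does) and column by column (as columnar formats such as ORC and Parquet do), then sums a single column.

```python
records = [
    {"id": 1, "city": "Boston", "fare": 12.5},
    {"id": 2, "city": "Austin", "fare": 7.5},
    {"id": 3, "city": "Denver", "fare": 20.0},
]
columns = ["id", "city", "fare"]

# Row-oriented layout: all fields of a record are stored together.
row_store = [tuple(r[c] for c in columns) for r in records]

# Columnar layout: all values of a column are stored together.
column_store = {c: [r[c] for r in records] for c in columns}


def sum_from_rows(store, column_index):
    """Sum one column; every field of every row is read along the way."""
    total, touched = 0.0, 0
    for row in store:
        touched += len(row)  # the whole row is read to reach one field
        total += row[column_index]
    return total, touched


def sum_from_columns(store, column_name):
    """Sum one column; only that column's values are read."""
    values = store[column_name]
    return sum(values), len(values)


print(sum_from_rows(row_store, columns.index("fare")))
print(sum_from_columns(column_store, "fare"))
```

Both layouts give the same total, but in this model the row layout touches 9 values and the columnar layout touches 3. The gap grows with the number of columns the query does not need.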

January 14, 2023 · 6 min · Vinoo Ganesh