Advance Your SQL Skills with dbt for Data Engineering

Managing SQL code at scale is one of the biggest challenges in data engineering. As data teams grow and pipelines become more complex, traditional approaches to SQL development quickly become unwieldy. This LinkedIn Learning course explores how dbt (data build tool) transforms the way we think about SQL development, bringing software engineering best practices to analytics engineering. Course approach: real-world problem solving. Each chapter presents actual situations and challenges that data engineers face, with focused code examples showing practical solutions. ...

September 26, 2023 · 2 min · Vinoo Ganesh

The Efficiently Guide to Snowflake (Top Down)

Originally published on Efficiently (Substack). The majority of my career has been focused on making data systems more efficient, whether that means performance, scalability, or cost. This series aims to democratize knowledge about how to Efficiently operationalize data.

TLDR: 4 changes you can make right now to run Snowflake more Efficiently:

1. File a Snowflake support ticket and request access to the GET_QUERY_STATS function
2. ALTER WAREHOUSE <warehouseName> SET AUTO_SUSPEND = 60;
3. For multi-cluster warehouses: ALTER WAREHOUSE <warehouseName> SET MIN_CLUSTER_COUNT = 1; and ALTER WAREHOUSE <warehouseName> SET SCALING_POLICY = ECONOMY;
4. ALTER WAREHOUSE <warehouseName> SET STATEMENT_TIMEOUT_IN_SECONDS = 36000;

Snowflake + Driving: Snowflake optimization resembles efficient driving. There are four parallel constraints: ...
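To see why a low AUTO_SUSPEND matters, here is a minimal back-of-the-envelope sketch in Python of the warehouse billing model, assuming per-second billing with a 60-second minimum charge each time a warehouse resumes, and suspension AUTO_SUSPEND seconds after the last query finishes. The query timestamps and function name are invented for illustration; this is a toy model, not Snowflake's billing engine.

```python
def billed_seconds(query_windows, auto_suspend):
    """Estimate billed warehouse seconds for a list of (start, end)
    query windows, measured in seconds on a shared clock.

    Assumptions: per-second billing, a 60-second minimum charge per
    resume, and auto-suspension auto_suspend seconds after the last
    query in a session finishes."""
    sessions = []
    for start, end in sorted(query_windows):
        if sessions and start <= sessions[-1][1] + auto_suspend:
            # Warehouse is still up (or idling): extend the current session.
            sessions[-1][1] = max(sessions[-1][1], end)
        else:
            # Warehouse had suspended: a new resume/suspend session begins.
            sessions.append([start, end])
    # Each session bills its busy time plus the trailing idle window,
    # subject to the 60-second minimum per resume.
    return sum(max(end - start + auto_suspend, 60) for start, end in sessions)

# Two 10-second queries five minutes apart:
queries = [(0, 10), (300, 310)]
print(billed_seconds(queries, auto_suspend=600))  # 910: one long session
print(billed_seconds(queries, auto_suspend=60))   # 140: two short sessions
```

Under these assumptions, dropping AUTO_SUSPEND from 600 to 60 seconds cuts the billed time for this bursty workload by more than 6x, which is exactly the intuition behind tip 2 above.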

February 2, 2023 · 3 min · Vinoo Ganesh

Hands-On: Predicate Pushdown

Originally published on Efficiently (Substack). We’ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let’s see how it works in practice. In this post, I’m going to:

1. Read a CSV dataset into Spark
2. Write the dataset into 5 Parquet files (treating each file as a block)
3. Introspect the metadata on those files
4. Run queries demonstrating the power of predicate pushdown

Hands-On: Setup. The tutorial uses an airports dataset. Download it via: ...
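The block-skipping idea behind predicate pushdown can be sketched without Spark at all. Below is a pure-Python toy (data and helper names invented for illustration) that mimics the per-block min/max statistics Parquet keeps for each row group, and skips any block the filter can never match:

```python
# Toy model of predicate pushdown via block-level min/max statistics,
# analogous to the column statistics Parquet stores per row group.

def make_block(rows):
    # Each "block" holds its rows plus min/max metadata for the column.
    return {"rows": rows, "min": min(rows), "max": max(rows)}

def query_gt(blocks, threshold):
    """Return all values > threshold, skipping blocks whose max
    statistic proves no row inside can match."""
    scanned, result = 0, []
    for block in blocks:
        if block["max"] <= threshold:
            continue  # pruned: metadata alone rules this block out
        scanned += 1  # only now do we pay to read the block's rows
        result.extend(v for v in block["rows"] if v > threshold)
    return result, scanned

# 5 blocks of 100 values each: 0-99, 100-199, ..., 400-499.
blocks = [make_block(list(range(i * 100, i * 100 + 100))) for i in range(5)]
values, scanned = query_gt(blocks, 449)
print(scanned)       # 1: four of the five blocks were pruned via metadata
print(len(values))   # 50 matching values (450..499)
```

The real engines do the same thing one layer down: Spark pushes the filter into the Parquet reader, which consults row-group statistics in the file footer and skips whole row groups before any column data is decompressed.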

January 28, 2023 · 3 min · Vinoo Ganesh

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability of a job to its compute cost. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions.

This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem, namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll continue by describing the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We’ll conclude with specific industry examples of the impact of file format choice on job performance and stability (including incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow). ...
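As one concrete taste of the schema evolution the abstract mentions, here is a minimal pure-Python sketch of Avro-style reader/writer schema resolution. The Flight record, its field names, and the default value are invented for illustration; real Avro performs this resolution while decoding the binary payload.

```python
# Toy model of Avro-style schema evolution: a record written with an
# older schema is resolved against a newer reader schema.

def resolve(record, reader_schema):
    """Resolve a decoded record against the reader schema: fields the
    writer omitted take the reader's default; fields the reader doesn't
    know about are dropped; a missing field with no default is an error."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

reader_schema = {
    "type": "record",
    "name": "Flight",
    "fields": [
        {"name": "origin", "type": "string"},
        {"name": "dest", "type": "string"},
        # Added after the old data was written, so it carries a default:
        {"name": "delay_minutes", "type": "int", "default": 0},
    ],
}

old_record = {"origin": "SFO", "dest": "JFK"}  # written before delay_minutes existed
print(resolve(old_record, reader_schema))
```

This default-filling rule is why Avro can evolve schemas without rewriting historical files, in contrast to formats where adding a column forces a migration.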

June 24, 2020 · 2 min · Vinoo Ganesh