Teaching

Advance Your SQL Skills with dbt for Data Engineering

Managing SQL code at scale is one of the biggest challenges in data engineering. As data teams grow and pipelines become more complex, traditional approaches to SQL development quickly become unwieldy. This LinkedIn Learning course explores how dbt (data build tool) transforms the way we think about SQL development, bringing software engineering best practices to analytics engineering.

Course Approach

Real-World Problem Solving: Each chapter presents actual situations and challenges that data engineers face, with focused code examples showing practical solutions. ...

Hands-On Introduction: Data Engineering

In this course, instructor Vinoo Ganesh gives you an overview of the fundamental skills you need to become a data engineer. Learn how to solve complex data problems in a scalable, concrete way. Explore the core principles of the data engineer toolkit (including ELT, OLTP/OLAP, orchestration, DAGs, and more) as well as how to set up a local Apache Airflow deployment and a full-scale data engineering ETL pipeline. Along the way, Vinoo helps you boost your technical skill set using real-world, hands-on scenarios. ...

O'Reilly Superstream Series: Data Pipelines

Data pipelines are the foundation for success in data analytics, so understanding how they work is of the utmost importance. Join us for four hours of expert-led sessions that will give you insight into how data is moved, processed, and transformed to support analytics and reporting needs. You’ll also learn how to address common challenges like monitoring and managing broken pipelines, explore considerations for choosing and connecting open source frameworks, commercial products, and homegrown solutions, and more. ...

Designing Data Pipelines — with Interactivity

The data pipeline has become a fundamental component of the data science, data analysis, and data engineering workflow. Pipelines serve as the glue that links together the various components of the data cleansing, data validation, and data transformation processes. However, despite its importance to the data ecosystem, constructing the optimal data pipeline is generally an afterthought, if it’s considered at all. This makes any change to the central pipeline highly error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that are scalable and that allow for monitoring is pivotal for the safe and continued use of data. ...

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of Spark jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll go on to describe the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow). ...
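As a rough illustration of the predicate pushdown idea mentioned in the abstract above, the sketch below shows how per-group min/max statistics can let a reader skip whole chunks of data without scanning them. It is a toy model in plain Python, not the Parquet or ORC API: the `RowGroup` class, its fields, and `scan_greater_than` are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a columnar row group (or stripe).

    min_value and max_value play the role of the statistics a columnar
    format keeps in its metadata; values is the column data itself.
    """
    min_value: int
    max_value: int
    values: List[int]


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values greater than threshold and the number of groups read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Statistics check: if even the largest value fails the predicate,
        # nothing in this group can match, so it is skipped without a scan.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 9, [1, 5, 9]),
    RowGroup(12, 20, [12, 18, 20]),
    RowGroup(25, 40, [25, 31, 40]),
]
matches, groups_read = scan_greater_than(groups, 19)
print(matches, groups_read)
```

In this toy run the first group is skipped on its statistics alone, so only two of the three groups are scanned. Real formats apply the same idea at several levels, which is one reason the choice of format and its tuning settings can matter so much for job cost.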