Databricks Delta Live Tables 101

Fri, 08 Mar 2024 00:00:00 +0000

Originally published on Sync Computing

Databricks’ DLT offering marks a substantial improvement in the data engineering lifecycle and workflow. By offering a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic, end-to-end data engineering experience from inside its own product, with stronger support for raw data workflows, live batching, and a host of other benefits detailed below.

Since its release in 2022, Databricks’ Delta Live Tables have quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies, including Shell and H&R Block.

What Are Delta Live Tables?

Delta Live Tables, or DLT, is a declarative ETL framework that dramatically simplifies the development of both batch and streaming pipelines. Concretely, though, DLT is just another way of authoring and managing pipelines in Databricks. In Python notebooks, tables are created by placing the @dlt.table() decorator on top of functions that return the query defining the table.
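To make the decorator idea concrete, here is a minimal plain-Python sketch of the pattern. It is not the dlt library: the `table` decorator, the `REGISTRY` dictionary, and the order data are all hypothetical stand-ins, and lists of dictionaries take the place of real queries. The point is only that each function declares what a table contains, while the framework decides when to evaluate it.

```python
# Toy stand-in for a decorator-based table registry. This is NOT the dlt
# library; `table` and `REGISTRY` are hypothetical names for illustration.
from typing import Callable, Dict, List

Row = Dict[str, int]
REGISTRY: Dict[str, Callable[[], List[Row]]] = {}


def table(name: str) -> Callable[[Callable[[], List[Row]]], Callable[[], List[Row]]]:
    """Register the decorated function as the definition of table `name`."""
    def decorator(fn: Callable[[], List[Row]]) -> Callable[[], List[Row]]:
        REGISTRY[name] = fn
        return fn
    return decorator


@table(name="raw_orders")
def raw_orders() -> List[Row]:
    # In a real pipeline this function would return a query over a source.
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": -5}]


@table(name="valid_orders")
def valid_orders() -> List[Row]:
    # A downstream table is declared purely in terms of an upstream one.
    return [row for row in REGISTRY["raw_orders"]() if row["amount"] > 0]


# The "framework" decides when to evaluate each registered definition.
materialized = {name: fn() for name, fn in REGISTRY.items()}
```

The author of `valid_orders` never says when or how `raw_orders` gets computed, which is the declarative part of the idea.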

Delta Live Tables are built on Databricks’ foundational technologies, Delta Lake and the Delta table format, and operate in conjunction with both. However, whereas those two focus on the more static, storage-oriented portions of the data process, DLT focuses on the transformation piece. Specifically, the DLT framework allows data engineers to describe how data should be transformed between tables in the pipeline’s DAG (directed acyclic graph).

The magic of DLT is most apparent for datasets that involve both streaming and batch data. In the past, users had to be keenly aware of the “velocity” (batch vs. streaming) of the data being transformed and design their pipelines accordingly. DLT allows users to push this problem to the system itself: users write declarative transformations and let the system figure out how to handle the streaming or batch components.

How are Delta Live Tables, Delta Tables, and Delta Lake related?

The word “Delta” appears a lot in the Databricks ecosystem, and to understand why, it’s important to look back at history. In 2019, Databricks publicly announced Delta Lake, a foundational element for storing data (tables) in the Databricks Lakehouse. Delta Lake popularized the idea of a table format on top of files, with the goal of bringing reliability to data lakes.

Tables that live inside Delta Lake are written using the Delta table format and, as such, are called Delta tables. Delta Live Tables focus on the “live” part of the data flow between Delta tables, usually called the “transformation” step in the ETL paradigm. Delta Live Tables (DLTs) offer declarative pipeline development and visualization.

Breaking Down The Components of Delta Live Tables

There are two main types of datasets you can create within a Delta Live Tables pipeline:

Tables

Tables in DLT are materialized views that are stored in the lakehouse. They represent the physical datasets that will be persisted and can be queried directly. These tables are created using the @dlt.table() decorator and contain the actual transformed data.

Views

Views in DLT are temporary datasets that exist only during the pipeline execution. They’re useful for intermediate transformations and don’t consume storage since they’re computed on-demand. Views are created using the @dlt.view() decorator.

You can declare your datasets in DLT using either SQL or Python. These declarations can then trigger an update to calculate results for each dataset in the pipeline.

When to Use Views or Materialized Views in Delta Live Tables

The choice of View or Materialized View primarily depends on your use case. The biggest difference between the two is that Views are computed at query time, whereas Materialized Views are precomputed. Views also have the added benefit that they don’t actually require any additional storage, as they are computed on the fly.

The general rule of thumb when choosing between the two has to do with the performance requirements and downstream access patterns of the table in question. When performance is critical, having to compute a view on the fly may be an unnecessary slowdown, in which case, Materialized Views may be preferred. The same is true when there are multiple downstream consumers of a particular View.

However, there are many situations where users just need a quick view, computed in memory, to reference a particular state of a transformed table. Rather than materializing this table, creating a View is more straightforward and efficient.
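The trade-off can be illustrated with a small plain-Python analogy. This is not DLT code: the `View` and `MaterializedView` classes and the `evens` query are hypothetical, and an in-memory list stands in for a source table. The sketch counts how often the query runs and shows that a precomputed result stays stale until it is refreshed.

```python
# Plain-Python analogy, not DLT code: a "view" recomputes on every read,
# while a "materialized view" stores its result each time it is refreshed.
from typing import Callable, List


class View:
    def __init__(self, query: Callable[[], List[int]]) -> None:
        self.query = query
        self.computations = 0

    def read(self) -> List[int]:
        self.computations += 1  # the work happens at query time
        return self.query()


class MaterializedView:
    def __init__(self, query: Callable[[], List[int]]) -> None:
        self.query = query
        self.computations = 0
        self.stored: List[int] = []
        self.refresh()

    def refresh(self) -> None:
        self.computations += 1  # the work happens at refresh time
        self.stored = self.query()

    def read(self) -> List[int]:
        return list(self.stored)  # reads only touch the stored copy


source = [1, 2, 3, 4]


def evens() -> List[int]:
    return [x for x in source if x % 2 == 0]


view = View(evens)
mview = MaterializedView(evens)

# Three downstream consumers read each dataset.
for _ in range(3):
    view.read()
    mview.read()

# New data arrives; only the view sees it without a refresh.
source.append(6)
fresh = view.read()
stale = mview.read()
```

After this runs, the view has executed its query four times and the materialized view once, which is why many consumers or tight latency requirements favor materialization, at the price of storage and refreshes.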

What Are the Advantages of Delta Live Tables?

There are many benefits to using Delta Live Tables:

Unified Streaming/Batch Experience

By removing the need for data engineers to build distinct streaming/batch data pipelines, DLT simplifies one of the most difficult pain points of working with data, thereby offering a truly unified experience.

Opinionated Pipeline Management

The modern data stack is filled with orchestration players, observability players, data quality players, and many others. DLT offers an opinionated way to orchestrate and assert data quality.

Performance Optimization

DLTs offer the full advantages of Delta Tables, which are designed to handle large volumes of data and support fast querying. Their vectorized query execution allows them to process data in batches rather than one row at a time.

Built-in Quality Assertions

Delta Live Tables provide data quality features, such as data cleansing and data deduplication, out of the box. Users can specify rules to remove duplicates or cleanse data as data is ingested, ensuring data accuracy.
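As a rough illustration of rule-based cleansing and deduplication at ingestion time, here is a plain-Python sketch. It does not use DLT's own mechanism; the `clean` function, the rule names, and the sample rows are all hypothetical. Rows that fail any rule are dropped and counted per rule, and repeated keys are dropped as duplicates.

```python
# Plain-Python sketch of rule-based cleansing and deduplication.
# `clean`, the rule names, and the sample rows are illustrative only.
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]


def clean(
    rows: List[Row], rules: Dict[str, Rule], key: str
) -> Tuple[List[Row], Dict[str, int], int]:
    """Drop rows that fail any rule, then drop rows whose key was already seen."""
    kept: List[Row] = []
    seen = set()
    failures = {name: 0 for name in rules}
    duplicates = 0
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            for name in failed:
                failures[name] += 1  # track which rule rejected the row
            continue
        if row[key] in seen:
            duplicates += 1
            continue
        seen.add(row[key])
        kept.append(row)
    return kept, failures, duplicates


rules: Dict[str, Rule] = {
    "id_present": lambda row: row["id"] is not None,
    "amount_positive": lambda row: row["amount"] > 0,
}
rows: List[Row] = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},    # exact duplicate of the first row
    {"id": None, "amount": 5},  # fails id_present
    {"id": 3, "amount": -4},    # fails amount_positive
]
kept, failures, duplicates = clean(rows, rules, key="id")
```

Keeping per-rule failure counts, rather than silently dropping rows, is what makes this kind of check useful as a quality signal and not just a filter.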

ACID Transactions

Because DLTs use the Delta format, they support ACID transactions (Atomicity, Consistency, Isolation, and Durability), which have become the standard for data quality and exactness.

Pipeline Visibility

DLT provides a Directed Acyclic Graph of your data pipeline workloads, giving you a clear, visually compelling way to both see and introspect your pipeline at various points.
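To see why a DAG is a natural representation, consider this small sketch using Python's standard-library graphlib module (Python 3.9+). The table names and dependencies are hypothetical, and this is not how DLT is implemented internally; it only shows that once each table lists what it reads from, a valid execution order can be derived mechanically.

```python
# Deriving an execution order from declared dependencies.
# The table names are hypothetical; graphlib is in the standard library.
from graphlib import TopologicalSorter

# Each table maps to the set of upstream tables it reads from.
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}

# static_order() yields every table after all of its upstream tables.
order = list(TopologicalSorter(dependencies).static_order())
position = {name: index for index, name in enumerate(order)}
```

The relative order of independent tables (here, the two raw tables) is not fixed, which is exactly the freedom a scheduler can use to run them in parallel.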

Change Data Capture (CDC) in Delta Live Tables

One of the major benefits of Delta Live Tables is the ability to use Change Data Capture while streaming data. Change Data Capture refers to tracking all changes in a data source so they can be propagated to all destination systems.

With Delta Live Tables, data engineers can easily implement CDC with the Apply Changes API (either with Python or SQL). The capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse.

Delta Live Tables support Slowly Changing Dimensions (SCD) of both type 1 and type 2. The distinction matters: SCD type 1 overwrites existing records with the latest values, whereas SCD type 2 retains a full history of values, which means you can keep a history of records in your data lakehouse.
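The difference between the two SCD types is easiest to see on a tiny example. The sketch below is plain Python, not the Apply Changes API; the function names, the `start`/`end` fields, and the customer data are hypothetical. Type 1 overwrites the current value, while type 2 closes the old row and appends a new one.

```python
# Plain-Python sketch of SCD type 1 vs. type 2 semantics.
# Function names and the start/end fields are illustrative only.
from typing import Dict, List, Optional, Tuple

HistoryRow = Dict[str, Optional[object]]


def apply_scd1(target: Dict[int, str], key: int, value: str) -> None:
    """Type 1: overwrite in place, so only the latest value survives."""
    target[key] = value


def apply_scd2(history: List[HistoryRow], key: int, value: str, seq: int) -> None:
    """Type 2: close the currently open row for the key and append a new one."""
    for row in history:
        if row["key"] == key and row["end"] is None:
            if row["value"] == value:
                return  # nothing changed, keep the open row as-is
            row["end"] = seq
    history.append({"key": key, "value": value, "start": seq, "end": None})


# Change feed: (key, new value, sequence number).
changes: List[Tuple[int, str, int]] = [(1, "Austin", 1), (1, "Boston", 2)]

scd1: Dict[int, str] = {}
scd2: List[HistoryRow] = []
for key, value, seq in changes:
    apply_scd1(scd1, key, value)
    apply_scd2(scd2, key, value, seq)
```

After both changes, `scd1` holds only "Boston" for customer 1, while `scd2` holds a closed "Austin" row and an open "Boston" row, so the earlier value can still be queried.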

What is the Cost of Delta Live Tables?

The cost of Delta Live Tables depends on the type of compute used. On AWS, DLT compute can range from $0.20/DBU for DLT Core Compute Photon up to $0.36/DBU for DLT Advanced Compute. However, these prices can be up to twice as high when applying expectations and CDC.
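As a quick worked example of how these per-DBU rates translate into a bill, the sketch below multiplies a DBU count by the two rates quoted above. The 500 DBU workload is a made-up figure, the dictionary keys are illustrative labels, and the calculation covers only the DBU charge, not any underlying cloud infrastructure costs.

```python
# Back-of-the-envelope DBU cost estimate using the rates quoted in this post.
# The 500 DBU workload is a hypothetical figure for illustration.
RATES_USD_PER_DBU = {
    "core_photon": 0.20,  # DLT Core Compute Photon
    "advanced": 0.36,     # DLT Advanced Compute
}


def dlt_dbu_cost(dbus: float, rate_usd_per_dbu: float) -> float:
    """Return the DBU charge in dollars, rounded to cents."""
    return round(dbus * rate_usd_per_dbu, 2)


monthly_dbus = 500
low = dlt_dbu_cost(monthly_dbus, RATES_USD_PER_DBU["core_photon"])
high = dlt_dbu_cost(monthly_dbus, RATES_USD_PER_DBU["advanced"])
```

For the same hypothetical workload, the DBU charge comes to $100 at the lower rate and $180 at the higher one, so the choice of compute matters well before any efficiency gains are counted.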

From an efficiency perspective, DLT can reduce total cost of ownership. Tests of automatic orchestration by Databricks have shown total compute time reduced by as much as half with Delta Live Tables, ingesting up to 1 billion records for under $1.

Conclusion

Delta Live Tables represent a significant advancement in data engineering workflows, offering a unified approach to batch and streaming data processing. By providing built-in data quality checks, automatic orchestration, and comprehensive pipeline visibility, DLT simplifies many of the traditional pain points in data pipeline development.

While there are cost considerations to keep in mind, the efficiency gains and reduced operational overhead often justify the investment, especially for organizations dealing with complex data transformation workflows.

This post was originally published on Sync Computing’s blog on March 8, 2024.

The Apache Spark File Format Ecosystem

Wed, 24 Jun 2020 00:00:00 +0000

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability of jobs to their compute cost. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions.

This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem, namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll then describe the specific SparkConf / SQLConf settings that developers can use to tune these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow).

After this presentation, attendees should understand the core concepts behind the prevalent file formats, the relevant file-format-specific settings, and how to select the correct file format for their jobs. This presentation is relevant to Spark + AI Summit because, as more AI/ML workflows move into the Spark ecosystem (especially IO-intensive deep learning), leveraging the correct file format is paramount to performant model training.

Link

https://databricks.com/session_na20/the-apache-spark-file-format-ecosystem
