Technology

Databricks Delta Live Tables 101

Originally published on Sync Computing

Databricks’ DLT offering showcases a substantial improvement in the data engineer lifecycle and workflow. By offering a pre-baked and opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic end-to-end data engineering experience from inside its own product, which provides superior solutions for raw data workflow, live batching and a host of other benefits detailed below. Since its release in 2022, Databricks’ Delta Live Tables have quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies including Shell and H&R Block. ...

Rethinking Serverless: The Price of Convenience

Originally published on Sync Computing

Serverless functions have had their 15 minutes of fame (and runtime). As is the case with many concepts in technology, the term “serverless” is abusively vague. As such, discussing the idea of “serverless” usually invokes one of two feelings in developers. Either it’s thought of as the catalyst for a potentially incredible future, finally freeing developers from having to worry about resources or scaling concerns, or it’s thought of as the harbinger of yet another “we don’t need DevOps anymore” trend. ...

The Future in Tech: Data Engineering Powers AI Revolution

Originally streamed live on August 3, 2023 - LinkedIn Learning’s “The Future in Tech” series

Data engineering is the unsung hero fueling the rapid growth and consumption of artificial intelligence. It transforms AI’s potential into reality, driving digital innovation and reshaping the world. In this comprehensive discussion, we explore how data engineering unlocks and enables democratized use of Artificial Intelligence.

Video: The Future in Tech - Data Engineering and AI Discussion (1,668 views) ...

Hands-On: Predicate Pushdown

Originally published on Efficiently (Substack)

We’ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let’s talk about it in practice. In this post, I’m going to:

- Read a CSV dataset into Spark
- Write the dataset into 5 Parquet files (treating each file as a block)
- Introspect metadata existing on the files
- Run queries demonstrating predicate pushdown power

Hands-On: Setup

The tutorial uses an airports dataset. Download it via: ...

Distributed Data and Blocks

Originally published on Efficiently (Substack)

This is a continuation of a previous blog post about efficient data partitioning. In the previous post, I discussed how data layout on disk impacts analytics performance. This post focuses on tactical implementation using open source technologies. Topics I’ll cover:

- HDFS Blocks + Block Size
- Block sizes + tradeoffs

Background

Data organization on disk dramatically affects analytics performance. I previously explored row-oriented, columnar, and hybrid storage models. Now let’s connect these concepts to modern data infrastructure. ...