Databricks Delta Live Tables 101

Originally published on Sync Computing. Databricks’ DLT offering represents a substantial improvement in the data engineering lifecycle and workflow. By offering a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic end-to-end data engineering experience inside its own product, with superior solutions for raw data workflows, live batching, and a host of other benefits detailed below. Since its release in 2022, Databricks’ Delta Live Tables has quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies including Shell and H&R Block. ...

March 8, 2024 · 7 min · 1281 words · Vinoo Ganesh

Building Chatbots with Rasa

Tell us about your background Throughout my career, I have dedicated myself to creating tools, products, and technologies that help people effectively utilize their data. My passion lies in developing products that enable users to efficiently and scalably gain maximum value from their data. My journey in understanding the intricacies of data and its potential began at Palantir Technologies, where I worked on search and indexing products. As data volumes grew, I focused my efforts on solving some of Palantir customers’ core problems across the financial and defense verticals before leading customer-focused compute teams. After Palantir, I served as CTO at Veraset, a cloud-based data-as-a-service company. Veraset delivered high-quality, large-scale data to a number of enterprises and grew to 15M ARR before being acquired. Following this, I joined Citadel Investment Group as the Head of Business Engineering of Ashler. In that role, I managed crucial data operations, including overseeing data pipelines, investment platforms, data lakes, and the software and data engineering teams responsible for them. ...

February 9, 2024 · 5 min · 1027 words · Vinoo Ganesh

Data SLA Nightmares & Lessons Learned

Databricks Sr. Staff Developer Advocate, Denny Lee, Citadel Head of Business Engineering, Vinoo Ganesh, and Databand.ai Co-Founder & CEO, Josh Benamram, discuss the complexities and business necessity of setting clear data service-level agreements (SLAs). They share their experiences around the importance of contractual expectations and why data delivery success criteria are prone to disguising failures as successes, in spite of our best intentions. Denny, Vinoo, and Josh challenge businesses of all industries to see themselves as data companies by driving home a costly reality – what do businesses have to lose when their data is wrong? A lot more than they’d like to believe. ...

August 11, 2021 · 1 min · 106 words · Vinoo Ganesh

Accelerating Data Evaluation

As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge – demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. For smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets reach the terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high-quality dataset” can change daily or even hourly. ...
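As a hypothetical illustration of the simple measures mentioned above (the function and sample data here are mine, not from the talk), row count and per-column cardinality can be profiled for a small dataset in a few lines of Python:

```python
import csv
import io

def profile(rows):
    """Compute the row count and per-column distinct-value count
    (cardinality) for an iterable of dict-shaped records."""
    count = 0
    distinct = {}
    for row in rows:
        count += 1
        for col, val in row.items():
            distinct.setdefault(col, set()).add(val)
    return count, {col: len(vals) for col, vals in distinct.items()}

# Toy sample standing in for a real dataset.
sample = io.StringIO("city,country\nNYC,US\nSF,US\nParis,FR\n")
count, cardinality = profile(csv.DictReader(sample))
# count == 3; cardinality == {"city": 3, "country": 2}
```

At terabyte scale these exact counts give way to approximate techniques (sampling, sketches such as HyperLogLog), which is part of why differentiation gets harder as datasets grow.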

May 28, 2021 · 1 min · 184 words · Vinoo Ganesh

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from a job’s ongoing stability to its compute cost. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll continue by describing the specific SparkConf / SQLConf settings that developers can use to tune these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow). ...
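The SQLConf settings mentioned above are set directly on a Spark session. The fragment below is a minimal sketch, assuming an existing `SparkSession` named `spark`, of a few commonly tuned file-format options (it is not the specific list from the session):

```python
# Config sketch: file-format tuning via SQLConf, assuming an existing
# SparkSession named `spark` (e.g. from SparkSession.builder.getOrCreate()).

# Enable predicate pushdown in the Parquet and ORC readers.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# Merge schemas across Parquet files on read (off by default; can be costly,
# but relevant when schemas have evolved over time).
spark.conf.set("spark.sql.parquet.mergeSchema", "false")

# Compression codec used when writing Parquet (snappy, gzip, zstd, ...).
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Cap on bytes packed into a single read partition, which interacts with
# row-group / stripe layout when Spark splits files into tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB
```

Defaults shown are current Spark defaults for these keys; the right values depend on file layout and workload, which is precisely the tuning the session walks through.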

June 24, 2020 · 2 min · 299 words · Vinoo Ganesh