Data Pipelines

Context Is The Easy Part

Originally published on Kepler. Everyone’s talking about context engineering right now, but most of the conversation is focused on the wrong thing. Read the blog posts, the guides, the thought leadership. They’re all asking the same questions: What should I include in the context window? How do I manage tokens efficiently? How do I curate what the model sees? These are valid questions. They’re also the easy part. The hard part isn’t deciding what context to include. It’s building systems that deliver that context reliably, with provenance, at scale, every single time. That’s not a context problem. That’s an engineering problem. And engineering means something specific. ...

Databricks Delta Live Tables 101

Originally published on Sync Computing. Databricks’ DLT offering showcases a substantial improvement in the data engineering lifecycle and workflow. By providing a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic end-to-end data engineering experience from inside its own product, with superior solutions for raw data workflows, live batching, and a host of other benefits detailed below. Since its release in 2022, Databricks’ Delta Live Tables has quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies, including Shell and H&R Block. ...

Building Chatbots with Rasa

Tell us about your background. Throughout my career, I have dedicated myself to creating tools, products, and technologies that help people effectively utilize their data. My passion lies in developing products that enable users to efficiently and scalably gain maximum value from their data. My journey in understanding the intricacies of data and its potential began at Palantir Technologies, where I started out working on search and indexing products. As data volumes grew, I focused my efforts on solving some of Palantir customers’ core problems across the financial and defense verticals before leading customer-focused compute teams. After Palantir, I served as CTO at Veraset, a cloud-based data-as-a-service company. Veraset delivered high-quality, large-scale data to a number of enterprises and grew to 15M ARR before being acquired. Following this, I joined Citadel Investment Group as the Head of Business Engineering of Ashler. In that role, I managed crucial data operations, including overseeing data pipelines, investment platforms, data lakes, and the software and data engineering teams responsible for them. ...

Hands-On Introduction: Data Engineering

In this course, instructor Vinoo Ganesh gives you an overview of the fundamental skills you need to become a data engineer. Learn how to solve complex data problems in a scalable, concrete way. Explore the core principles of the data engineer toolkit—including ELT, OLTP/OLAP, orchestration, DAGs, and more—as well as how to set up a local Apache Airflow deployment and full-scale data engineering ETL pipeline. Along the way, Vinoo helps you boost your technical skill set using real-world, hands-on scenarios. ...

Hands-On: Predicate Pushdown

Originally published on Efficiently (Substack). We’ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let’s talk about it in practice. In this post, I’m going to: read a CSV dataset into Spark; write the dataset into 5 Parquet files (treating each file as a block); introspect the metadata stored on the files; and run queries demonstrating the power of predicate pushdown. Hands-On: Setup. The tutorial uses an airports dataset. Download it via: ...
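The idea behind predicate pushdown in the excerpt above can be sketched without Spark at all. The snippet below is a minimal, hypothetical illustration in plain Python, not the tutorial's code: it assumes a made-up schema of (airport code, elevation), synthetic rows, and five in-memory "blocks" that each carry min/max statistics, roughly in the way Parquet stores column statistics in file metadata. The names `Block`, `make_blocks`, and `scan_greater_than` are invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int]  # (airport_code, elevation_ft): a made-up schema


@dataclass
class Block:
    """One hypothetical 'file' of rows plus footer-style min/max statistics."""
    rows: List[Row]
    min_elev: int
    max_elev: int


def make_blocks(rows: List[Row], n_blocks: int) -> List[Block]:
    """Sort rows by elevation and split them into n_blocks chunks with stats."""
    ordered = sorted(rows, key=lambda r: r[1])
    size = -(-len(ordered) // n_blocks)  # ceiling division
    blocks = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        elevations = [r[1] for r in chunk]
        blocks.append(Block(chunk, min(elevations), max(elevations)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int):
    """Return rows with elevation > threshold and the number of blocks read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.max_elev <= threshold:
            # The statistics prove no row here can match, so skip the block
            # without touching its rows.
            continue
        blocks_read += 1
        matches.extend(r for r in block.rows if r[1] > threshold)
    return matches, blocks_read


# Synthetic data: 20 placeholder airports with elevations 0, 100, ..., 1900.
rows = [(f"A{i:02d}", i * 100) for i in range(20)]
blocks = make_blocks(rows, 5)
matches, blocks_read = scan_greater_than(blocks, 1500)
print(f"{len(matches)} matching rows, {blocks_read} of {len(blocks)} blocks read")
```

Because the rows are sorted before being split, the min/max ranges of the blocks do not overlap, which is what lets the filter skip most of them. With unsorted data the same check still returns correct results, but it would typically rule out fewer blocks.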