The data pipeline has become a fundamental component of data science, data analytics, and data engineering workflows. Pipelines are the glue that links together the data cleansing, validation, and transformation steps. Yet despite their importance to the data ecosystem, pipeline design is usually an afterthought, if it's considered at all, which makes any change to a central pipeline error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, building pipelines that scale and can be monitored is pivotal to the safe, continued use of data.
This session will cover the core components every data pipeline needs from an operational and functional perspective. We'll present a framework that helps practitioners set their pipelines up for success, show how to leverage pipelines for metrics gathering, and discuss how pipelines can be architected to surface potential data problems before they occur.
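To make the metrics and alerting idea concrete, here is a minimal sketch in Python. It is not the framework presented in the session; the names (Stage, StageMetrics, the alert_threshold parameter) are hypothetical and chosen for illustration. The sketch shows a pipeline stage that validates each record, tracks a per-stage failure metric, and raises an alert once the failure rate crosses a threshold, so problems surface before bad data moves downstream.

```python
# Illustrative sketch only: a pipeline stage that validates input,
# records per-stage metrics, and alerts when the failure rate is too high.

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class StageMetrics:
    processed: int = 0
    failed: int = 0

    @property
    def failure_rate(self) -> float:
        return self.failed / self.processed if self.processed else 0.0


@dataclass
class Stage:
    name: str
    transform: Callable[[dict], dict]
    validate: Callable[[dict], bool]
    alert_threshold: float = 0.05            # alert if >5% of records fail validation
    metrics: StageMetrics = field(default_factory=StageMetrics)

    def run(self, records: Iterable[dict]) -> list[dict]:
        out = []
        for record in records:
            self.metrics.processed += 1
            if not self.validate(record):
                self.metrics.failed += 1      # drop/quarantine invalid records
                continue
            out.append(self.transform(record))
        if self.metrics.failure_rate > self.alert_threshold:
            # In practice this would page an on-call rotation or post to a channel.
            print(f"ALERT [{self.name}]: failure rate "
                  f"{self.metrics.failure_rate:.1%} exceeds threshold")
        return out


# Example: a cleansing stage that rejects records missing an 'amount' value.
clean = Stage(
    name="cleanse",
    transform=lambda r: {**r, "amount": round(float(r["amount"]), 2)},
    validate=lambda r: r.get("amount") is not None,
)
rows = [{"amount": "10.5"}, {"amount": None}, {"amount": "3.2"}]
print(clean.run(rows))
print(clean.metrics)
```

The design choice illustrated here is that validation and measurement live inside each stage rather than at the end of the pipeline, which is what allows alerting to happen before a data problem propagates.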