<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Apache Spark on Vinoo Ganesh</title>
    <link>https://vinoo.io/tags/apache-spark/</link>
    <description>Recent content in Apache Spark on Vinoo Ganesh</description>
    <image>
      <title>Vinoo Ganesh</title>
      <url>https://vinoo.io/img/vinoo.jpg</url>
      <link>https://vinoo.io/img/vinoo.jpg</link>
    </image>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 14 Apr 2026 22:49:03 -0400</lastBuildDate>
    <atom:link href="https://vinoo.io/tags/apache-spark/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Databricks Delta Live Tables 101</title>
      <link>https://vinoo.io/writing/2024-03-08-databricks-delta-live-tables/</link>
      <pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2024-03-08-databricks-delta-live-tables/</guid>
      <description>A comprehensive guide to understanding Databricks Delta Live Tables and their role in modern data engineering workflows</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://synccomputing.com/databricks-delta-live-tables-101/">Sync Computing</a></em></p>
<p>Databricks&rsquo; Delta Live Tables (DLT) offering is a substantial improvement to the data engineering lifecycle and workflow. By providing a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic, end-to-end data engineering experience from inside its own product, with better support for raw data workflows, live batching, and a host of other benefits detailed below.</p>
<p>Since its release in 2022, Databricks&rsquo; Delta Live Tables has quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies including Shell and H&amp;R Block.</p>
<h2 id="what-are-delta-live-tables">What Are Delta Live Tables?</h2>
<p>Delta Live Tables, or DLT, is a declarative ETL framework that dramatically simplifies the development of both batch and streaming pipelines. Concretely though, DLT is just another way of authoring and managing pipelines in Databricks. Tables are created using the <code>@dlt.table()</code> decorator on top of functions (which return queries defining the table) in notebooks.</p>
<p>Delta Live Tables are built on Databricks&rsquo; foundational technologies, namely Delta Lake and the Delta file format, and operate in conjunction with both. However, whereas those two focus on the more &ldquo;stagnant&rdquo; portions of the data process, DLT focuses on the <em>transformation</em> piece. Specifically, the DLT framework allows data engineers to describe <em>how</em> data should be transformed between tables in the DAG.</p>
<p>The magic of DLT is most apparent with datasets that involve both streaming and batch data. Whereas, in the past, users had to be keenly aware of, and design pipelines for, the &ldquo;velocity&rdquo; (batch vs. streaming) of the data being transformed, DLT allows users to push this problem to the system itself. That is, users can write declarative transformations and let the system figure out how to handle the streaming or batch components.</p>
<h2 id="how-are-delta-live-tables-delta-tables-and-delta-lake-related">How are Delta Live Tables, Delta Tables, and Delta Lake related?</h2>
<p>The word &ldquo;Delta&rdquo; appears a lot in the Databricks ecosystem, and to understand why, it&rsquo;s important to look back at history. In 2019, Databricks publicly announced Delta Lake, a foundational element for storing data (tables) in the Databricks Lakehouse. Delta Lake popularized the idea of a <em>Table Format</em> on top of files, with the goal of bringing reliability to data lakes.</p>
<p>Tables that live inside of this Delta Lake are written using the Delta Table format and, as such, are called Delta Tables. Delta Live Tables focus on the &ldquo;live&rdquo; part of data flow between Delta tables – usually called the &ldquo;transformation&rdquo; step in the ETL paradigm. Delta Live Tables (DLTs) offer declarative pipeline development and visualization.</p>
<h2 id="breaking-down-the-components-of-delta-live-tables">Breaking Down The Components of Delta Live Tables</h2>
<p>There are two main types of datasets you can create within a Delta Live Tables pipeline:</p>
<h3 id="tables">Tables</h3>
<p>Tables in DLT are materialized views that are stored in the lakehouse. They represent the physical datasets that will be persisted and can be queried directly. These tables are created using the <code>@dlt.table()</code> decorator and contain the actual transformed data.</p>
<h3 id="views">Views</h3>
<p>Views in DLT are temporary datasets that exist only during the pipeline execution. They&rsquo;re useful for intermediate transformations and don&rsquo;t consume storage since they&rsquo;re computed on-demand. Views are created using the <code>@dlt.view()</code> decorator.</p>
<p>You can declare your datasets in DLT using either SQL or Python. These declarations can then trigger an update to calculate results for each dataset in the pipeline.</p>
<h2 id="when-to-use-views-or-materialized-views-in-delta-live-tables">When to Use Views or Materialized Views in Delta Live Tables</h2>
<p>The choice of View or Materialized View primarily depends on your use case. The biggest difference between the two is that Views are <strong>computed at query time</strong>, whereas Materialized Views are <strong>precomputed.</strong> Views also have the added benefit that they don&rsquo;t actually require any additional storage, as they are computed on the fly.</p>
<p>The general rule of thumb when choosing between the two has to do with the performance requirements and downstream access patterns of the table in question. When performance is critical, having to compute a view on the fly may be an unnecessary slowdown, in which case, Materialized Views may be preferred. The same is true when there are multiple downstream consumers of a particular View.</p>
<p>However, there are many situations where users just need a quick view, computed in memory, to reference a particular state of a transformed table. Rather than materializing this table, creating a View is more straightforward and efficient.</p>
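<p>The trade-off can be illustrated with a minimal plain-Python sketch. This is not the DLT API: the <code>View</code> and <code>MaterializedView</code> classes and the <code>expensive_query</code> function are hypothetical stand-ins that only show when the underlying query runs.</p>

```python
class View:
    """Recomputes its rows every time it is read; stores nothing."""

    def __init__(self, query):
        self.query = query

    def read(self):
        return self.query()


class MaterializedView:
    """Computes its rows once per refresh and serves reads from the stored copy."""

    def __init__(self, query):
        self.query = query
        self.rows = query()

    def refresh(self):
        self.rows = self.query()

    def read(self):
        return self.rows


query_runs = 0


def expensive_query():
    """Stand-in for a costly transformation; counts how often it runs."""
    global query_runs
    query_runs += 1
    return [n * n for n in range(5)]


view = View(expensive_query)
view.read()
view.read()
runs_for_view = query_runs          # one run per read

query_runs = 0
materialized = MaterializedView(expensive_query)
materialized.read()
materialized.read()
runs_for_materialized = query_runs  # one run in total, at creation

print(runs_for_view, runs_for_materialized)
```

<p>With several downstream readers, the view pays the query cost on every read, while the materialized version pays it once per refresh at the price of storing the rows.</p>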
<h2 id="what-are-the-advantages-of-delta-live-tables">What Are the Advantages of Delta Live Tables?</h2>
<p>There are many benefits to using Delta Live Tables:</p>
<h3 id="unified-streamingbatch-experience">Unified Streaming/Batch Experience</h3>
<p>By removing the need for data engineers to build distinct streaming/batch data pipelines, DLT simplifies one of the most difficult pain points of working with data, thereby offering a truly unified experience.</p>
<h3 id="opinionated-pipeline-management">Opinionated Pipeline Management</h3>
<p>The modern data stack is filled with orchestration players, observability players, data quality players, and many others. DLT offers an opinionated way to orchestrate and assert data quality.</p>
<h3 id="performance-optimization">Performance Optimization</h3>
<p>DLTs offer the full advantages of Delta Tables, which are designed to handle large volumes of data and support fast querying. Their vectorized query execution allows them to process data in batches rather than one row at a time.</p>
<h3 id="built-in-quality-assertions">Built-in Quality Assertions</h3>
<p>Delta Live Tables provide data quality features, such as data cleansing and data deduplication, out of the box. Users can specify rules to remove duplicates or cleanse data as data is ingested, ensuring data accuracy.</p>
<h3 id="acid-transactions">ACID Transactions</h3>
<p>Because DLT pipelines use the Delta format, they support ACID transactions (atomicity, consistency, isolation, and durability), which have become the standard for data quality and exactness.</p>
<h3 id="pipeline-visibility">Pipeline Visibility</h3>
<p>DLT provides a Directed Acyclic Graph of your data pipeline workloads, giving you a clear, visually compelling way to both see and introspect your pipeline at various points.</p>
<h2 id="change-data-capture-cdc-in-delta-live-tables">Change Data Capture (CDC) in Delta Live Tables</h2>
<p>One of the major benefits of Delta Live Tables is the ability to use Change Data Capture while streaming data. Change Data Capture refers to tracking all changes in a data source so they can be propagated to all destination systems.</p>
<p>With Delta Live Tables, data engineers can easily implement CDC with the Apply Changes API (either with Python or SQL). The capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse.</p>
<p>Delta Live Tables support both type 1 and type 2 Slowly Changing Dimensions (SCD). SCD type 1 overwrites a record in place, whereas SCD type 2 retains a full history of values, so you can keep a history of records in your data lakehouse.</p>
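<p>The difference between the two SCD types can be sketched in plain Python. This is an illustration only, not the Apply Changes API: the function names, the record fields (<code>key</code>, <code>value</code>, <code>changed_at</code>, <code>valid_from</code>, <code>valid_to</code>), and the sample change feed are all assumptions made for the example.</p>

```python
def apply_scd_type_1(current, change):
    """Type 1: overwrite the value in place, so no history is kept."""
    current[change["key"]] = change["value"]


def apply_scd_type_2(history, change):
    """Type 2: close the open record for the key, then append the new version."""
    for record in history:
        if record["key"] == change["key"] and record["valid_to"] is None:
            record["valid_to"] = change["changed_at"]
    history.append({
        "key": change["key"],
        "value": change["value"],
        "valid_from": change["changed_at"],
        "valid_to": None,
    })


# Hypothetical change feed: one customer moves from Boston to Denver.
changes = [
    {"key": "customer-1", "value": "Boston", "changed_at": "2024-01-01"},
    {"key": "customer-1", "value": "Denver", "changed_at": "2024-02-01"},
]

type_1_table = {}
type_2_table = []
for change in changes:
    apply_scd_type_1(type_1_table, change)
    apply_scd_type_2(type_2_table, change)

print(type_1_table)       # only the latest value survives
print(len(type_2_table))  # one record per version
```

<p>The type 1 table ends up with a single current value per key, while the type 2 table keeps every version along with the period in which it was valid.</p>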
<h2 id="what-is-the-cost-of-delta-live-tables">What is the Cost of Delta Live Tables?</h2>
<p>The cost of Delta Live Tables depends on the compute tier you run. On AWS, DLT compute can range from $0.20/DBU for DLT Core Compute Photon up to $0.36/DBU for DLT Advanced Compute. However, these prices can be up to twice as high when applying expectations and CDC.</p>
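<p>As a rough worked example, DBU-based compute cost is simply the number of DBUs consumed multiplied by the per-DBU rate. The sketch below uses the two rates quoted above; the function name and the monthly usage figure are hypothetical, and the estimate ignores the cloud provider&rsquo;s own instance charges.</p>

```python
# Per-DBU rates quoted above for DLT on AWS (USD); actual pricing may differ.
CORE_PHOTON_RATE = 0.20
ADVANCED_RATE = 0.36


def dlt_compute_cost(dbus_consumed, rate_per_dbu):
    """Estimated DBU cost in USD, rounded to cents."""
    return round(dbus_consumed * rate_per_dbu, 2)


monthly_dbus = 1_000  # hypothetical usage, for illustration only

core_cost = dlt_compute_cost(monthly_dbus, CORE_PHOTON_RATE)
advanced_cost = dlt_compute_cost(monthly_dbus, ADVANCED_RATE)
print(core_cost, advanced_cost)
```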
<p>From an efficiency perspective, DLT results in a reduction in total cost of ownership. Automatic orchestration tests by Databricks have shown total compute time to be reduced by as much as half with Delta Live Tables – ingesting up to 1 billion records for under $1.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Delta Live Tables represent a significant advancement in data engineering workflows, offering a unified approach to batch and streaming data processing. By providing built-in data quality checks, automatic orchestration, and comprehensive pipeline visibility, DLT simplifies many of the traditional pain points in data pipeline development.</p>
<p>While there are cost considerations to keep in mind, the efficiency gains and reduced operational overhead often justify the investment, especially for organizations dealing with complex data transformation workflows.</p>
<hr>
<p><em>This post was originally published on <a href="https://synccomputing.com/databricks-delta-live-tables-101/">Sync Computing&rsquo;s blog</a> on March 8, 2024.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Hands-On: Predicate Pushdown</title>
      <link>https://vinoo.io/writing/2023-01-28-hands-on-predicate-pushdown/</link>
      <pubDate>Sat, 28 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-01-28-hands-on-predicate-pushdown/</guid>
      <description>A practical demonstration of how query optimizers leverage Parquet metadata to skip unnecessary data reads.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/hands-on-predicate-pushdown">Efficiently (Substack)</a></em></p>
<p>We&rsquo;ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let&rsquo;s talk about how it works in practice.</p>
<p>In this post, I&rsquo;m going to:</p>
<ol>
<li>Read a CSV dataset into Spark</li>
<li>Write the dataset into 5 Parquet files (treating each file as a block)</li>
<li>Introspect metadata existing on the files</li>
<li>Run queries demonstrating predicate pushdown power</li>
</ol>
<p><img alt="Predicate Pushdown" height='507' loading="lazy" src="/writing/2023-01-28-hands-on-predicate-pushdown/predicate-pushdown-cover_hu_159dd2e821d6e856.webp" width='800'></p>
<h2 id="hands-on-setup">Hands-On: Setup</h2>
<p>The tutorial uses an airports dataset. Download it via:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-0-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-0-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>wget https://raw.githubusercontent.com/curran/data/gh-pages/vegaExamples/airports.csv -O dataset.csv
</span></span></code></pre></td></tr></table>
</div>
</div><p>The CSV contains columns: <code>iata</code>, <code>name</code>, <code>city</code>, <code>state</code>, <code>country</code>, <code>latitude</code>, <code>longitude</code>.</p>
<h3 id="loading-data-into-spark">Loading Data into Spark</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-1-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-1-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> dataset <span style="color:#ff79c6">=</span> spark<span style="color:#ff79c6">.</span>read<span style="color:#ff79c6">.</span>option<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;header&#34;</span><span style="color:#ff79c6">,</span><span style="color:#f1fa8c">&#34;true&#34;</span><span style="color:#ff79c6">).</span>option<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;inferSchema&#34;</span><span style="color:#ff79c6">,</span><span style="color:#f1fa8c">&#34;true&#34;</span><span style="color:#ff79c6">).</span>csv<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;dataset.csv&#34;</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The inferred schema shows latitude and longitude as double types.</p>
<h3 id="writing-parquet-files">Writing Parquet Files</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-2-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-2-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span>dataset<span style="color:#ff79c6">.</span>repartition<span style="color:#ff79c6">(</span><span style="color:#bd93f9">5</span><span style="color:#ff79c6">).</span>write<span style="color:#ff79c6">.</span>parquet<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;/root/parquet_dataset&#34;</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>This creates 5 Parquet files plus a <code>_SUCCESS</code> flag file.</p>
<h3 id="inspecting-parquet-metadata">Inspecting Parquet Metadata</h3>
<p>Install the inspection tools:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-3-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-3-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-3-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-3-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip3 install parquet-tools
</span></span><span style="display:flex;"><span>pip3 install parquet-metadata
</span></span></code></pre></td></tr></table>
</div>
</div><p>View file contents:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-4-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-4-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>parquet-tools show part-00000-53b27d15-b049-41db-a8aa-fa3033763836-c000.snappy.parquet
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="hands-on-query-plans">Hands-On: Query Plans</h2>
<h3 id="simple-filter-query">Simple Filter Query</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-5-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-5-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-5-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-5-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> simpleFilter <span style="color:#ff79c6">=</span> dataset<span style="color:#ff79c6">.</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&gt;</span> <span style="color:#bd93f9">30</span><span style="color:#ff79c6">)</span>
</span></span><span style="display:flex;"><span>simpleFilter<span style="color:#ff79c6">.</span>show<span style="color:#ff79c6">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The result shows all rows where latitude exceeds 30.</p>
<p>Inspecting the query plan reveals three logical stages: the parsed logical plan, the analyzed logical plan, and the optimized logical plan. The optimized logical plan adds a null check on <code>latitude</code> alongside our predicate.</p>
<h3 id="complex-filter-query">Complex Filter Query</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-6-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-6-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> complexFilter <span style="color:#ff79c6">=</span> dataset<span style="color:#ff79c6">.</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&gt;</span> <span style="color:#bd93f9">30</span><span style="color:#ff79c6">).</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&lt;</span> <span style="color:#bd93f9">40</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As you can see, the plan has combined both of our predicates into one step as part of the query process (meaning that what would previously take two passes over the data now only requires one).</p>
<p>The optimized plan consolidates the filters: <code>Filter ((isnotnull(latitude#21) AND (latitude#21 &gt; 30.0)) AND (latitude#21 &lt; 40.0))</code></p>
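<p>As a plain-Python analogy (not Spark&rsquo;s optimizer), the consolidation looks like the sketch below. The sample rows and function names are made up for illustration; a <code>None</code> latitude stands in for a NULL value.</p>

```python
# Made-up rows; a latitude of None stands in for a NULL value.
airports = [
    {"iata": "AAA", "latitude": 25.0},
    {"iata": "BBB", "latitude": 35.5},
    {"iata": "CCC", "latitude": None},
    {"iata": "DDD", "latitude": 61.2},
]


def chained_filters(rows):
    """Two passes: one intermediate list per filter, as the query was written."""
    above_30 = [r for r in rows if r["latitude"] is not None and r["latitude"] > 30]
    return [r for r in above_30 if r["latitude"] < 40]


def combined_filter(rows):
    """One pass: null check plus both bounds, mirroring the consolidated Filter."""
    return [
        r for r in rows
        if r["latitude"] is not None and r["latitude"] > 30 and r["latitude"] < 40
    ]


print([r["iata"] for r in combined_filter(airports)])
```

<p>Both functions return the same rows; the combined version simply evaluates every condition in a single scan.</p>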
<h2 id="hands-on-querying-with-parquet">Hands-On: Querying with Parquet</h2>
<h3 id="row-group-metadata-analysis">Row Group Metadata Analysis</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-7-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-7-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>parquet-metadata /root/parquet_dataset/part-00000-...
</span></span></code></pre></td></tr></table>
</div>
</div><p>Critical metadata fields include:</p>
<ul>
<li><code>stats:min</code> — smallest value in the column</li>
<li><code>stats:max</code> — largest value in the column</li>
</ul>
<p>Example statistics from one file:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-8-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-8-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-8-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-8-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>row_group 0 latitude stats:min 14.1743075
</span></span><span style="display:flex;"><span>row_group 0 latitude stats:max 70.46727611
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="predicate-pushdown-mechanism">Predicate Pushdown Mechanism</h3>
<p>For a second file with <code>stats:min=44.4430157</code> and <code>stats:max=74.46727611</code>, a query filtering for latitude between 30 and 40 would exclude this entire file — because we know from the metadata that no values in this file fall within our filter range.</p>
<p>In practice, this is called <strong>predicate pushdown</strong>. The requirements of the predicate (the query) have been pushed down, allowing the optimizers to look at the metadata on the row groups themselves to decide which row groups to read, and when they can be ignored.</p>
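<p>The skipping decision can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not how Spark or a Parquet reader is implemented: the <code>RowGroupStats</code> class and <code>may_contain</code> function are hypothetical, and the statistics are the two min/max pairs quoted above.</p>

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    name: str
    min_value: float
    max_value: float


def may_contain(stats, lower, upper):
    """True if a value strictly between lower and upper could be in the row group."""
    return stats.max_value > lower and stats.min_value < upper


row_groups = [
    RowGroupStats("file 1, row_group 0", 14.1743075, 70.46727611),
    RowGroupStats("file 2, row_group 0", 44.4430157, 74.46727611),
]

# Filter: latitude > 30 AND latitude < 40
to_read = [rg.name for rg in row_groups if may_contain(rg, 30, 40)]
print(to_read)
```

<p>Only the first row group overlaps the range, so the second can be skipped without reading any of its data. A row group that passes this check is still scanned in full, because min/max statistics only rule row groups out.</p>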
<h2 id="conclusion">Conclusion</h2>
<p>There is a lot of magic that goes into our ability to query data quickly and <em>Efficiently</em>. Query optimizers do a lot for us — and understanding how they work under the hood helps us write better queries and design better data layouts.</p>
]]></content:encoded>
    </item>
    <item>
      <title>On-Disk Storage Methods (w/ visualizations)</title>
      <link>https://vinoo.io/writing/2023-01-14-on-disk-storage-methods/</link>
      <pubDate>Sat, 14 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-01-14-on-disk-storage-methods/</guid>
      <description>The way you write data can affect your performance. Exploring row-wise, columnar, and hybrid storage methods with visualizations.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/on-disk-storage-methods">Efficiently (Substack)</a></em></p>
<p>A few years ago, I gave a talk at <a href="https://www.databricks.com/session_na20/the-apache-spark-file-format-ecosystem">Spark Summit 2020</a> about File Formats covering Avro, ORC, and Parquet. I received numerous questions about the topic and answered them one by one, which left the knowledge confined to those forums.</p>
<p>That isn&rsquo;t helpful for most people. This post aims to fix that.</p>
<p>In this series, I&rsquo;ll outline the primitives of this topic and then explore the hands-on details.</p>
<h2 id="problem">Problem</h2>
<p>In the efficiency space, minimizing &ldquo;work&rdquo; is key. Whether work requires compute, network, or storage, &ldquo;the goal of efficient data usage is to get the most accurate answer in the fastest and cheapest way possible.&rdquo;</p>
<p>File Formats help data practitioners store their data in ways that minimize work. When you think of a file format, you may think of extensions like .xlsx, .pdf, .pptx. Similarly, technologies like Parquet, Avro, and ORC serve this purpose.</p>
<h2 id="background--example-data">Background / Example Data</h2>
<p>A partition is a logical segment of data. In the big data world, this usually means a piece of a larger dataset. For our purposes, I&rsquo;m going to use the example dataset below.</p>
<p>This dataset has 3 columns (Column A, Column B, and Column C) and 4 rows (Row 0, Row 1, Row 2, and Row 3).</p>
<p><img alt="Example data table" height='313' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-example-table_hu_8baefa3a812c6cc7.webp" width='800'></p>
<p>This table should look familiar — something you&rsquo;ve seen in Excel, Pandas, etc. Let&rsquo;s take this example further and split the individual elements into their own logical &ldquo;pieces.&rdquo;</p>
<p><img alt="Cell reference notation" height='333' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-cell-reference_hu_2231e9ea91e14a12.webp" width='800'></p>
<p>We can refer to each &ldquo;cell&rdquo; by its &ldquo;&lt;column&gt;&lt;row&gt;.&rdquo; For example, the second row in Column B is called B1.</p>
<h2 id="storage">Storage</h2>
<h3 id="background">Background</h3>
<p>Data is stored on hard disks in what is called a <strong>block.</strong> A block is the minimum amount of data read during any read operation.</p>
<p>Blocks function like a suitcase. When checking a bag on a trip, you pay the same price regardless of how full or empty your suitcase is. It&rsquo;s optimal to fill your suitcase with as many relevant objects as possible, arranged so they are as easy to find as possible.</p>
<p>Extending this analogy: packing unnecessary stuff isn&rsquo;t great. Bringing too many suitcases (unless strictly necessary) also isn&rsquo;t great. Inside the suitcase, you want to &ldquo;group&rdquo; similar things together — each pair of socks should be next to each other in the same suitcase, rather than split across different ones.</p>
<p>These insights apply to hard drives as well. Reading unnecessary data is expensive. Reading fragmented data is expensive. Random seeks are expensive as well.</p>
<p>Our goal is to lay data out in a manner optimized for our workflows.</p>
<h3 id="row-wise-storage">Row-wise Storage</h3>
<p>In database land, the common way to store data used to be row-wise. It&rsquo;s pretty easy to understand why. Most people think about datasets as a list of rows.</p>
<p>Taking our dataset above, let&rsquo;s store this in a row-wise method.</p>
<p><img alt="Row-wise storage diagram" height='347' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/row-wise-storage_hu_9b6d37bdbdf0da2.webp" width='800'></p>
<p>I have taken each row in order and packed as many rows as I can into a block before moving to the next block.</p>
<p>This method works great when the goal is to read the data sequentially. All that&rsquo;s required is a simple linear scan of the block in order. It doesn&rsquo;t work as well if, for example, you want to only look at Column C. In that case, you&rsquo;re required to read all of the block (i.e., read all of the data) and filter down to Column C.</p>
<p>This is the <strong>row-wise</strong> storage methodology.</p>
<h3 id="columnar-column-wise-storage">Columnar (Column-wise) Storage</h3>
<p>Column-wise storage takes the opposite approach and orients around columns.</p>
<p><img alt="Columnar storage diagram" height='343' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/columnar-storage_hu_1bce7707ed5e2a6b.webp" width='800'></p>
<p>As you can see, we first take the entire column, pack it into a block, and then move onto the next column.</p>
<p>This method works great when the data is read in a columnar way (i.e., one column at a time). It doesn&rsquo;t work well if, for example, you want to reconstruct Row 0. In that situation, you&rsquo;d need to read all of the data and filter down to the elements that make up Row 0.</p>
<p>Now, we&rsquo;re in a dilemma — one approach seems to favor a row-oriented workflow, one approach seems to favor a column-oriented workflow. Luckily for us (and Goldilocks), there&rsquo;s a middle ground.</p>
<p><img alt="Goldilocks principle" height='533' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/goldilocks_hu_70cd99c3e3eb2552.webp" width='800'></p>
<h3 id="hybrid-storage">Hybrid Storage</h3>
<p>A hybrid storage model gives us the best of both worlds. First, we group a fixed number of rows together, and then lay out each group column by column. We call these segments &ldquo;Row Groups&rdquo; (at least in Parquet terminology).</p>
<p>In this example, we first selected two rows — Row 0 and Row 1. We then grouped those rows by column, and inserted them into our first Row Group.</p>
<p><img alt="Logical row groups" height='340' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-row-groups-logical_hu_28c4af6518c6d5c9.webp" width='800'></p>
<p>I call these logical Row Groups because this is how we should think about them, not necessarily how they end up on disk.</p>
<p><img alt="Row groups on disk" height='298' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-row-groups-physical_hu_b50ebd16ae2331a6.webp" width='800'></p>
<p>This representation of data is actually immensely powerful. It allows us to optimize our workflows for both row-oriented and column-oriented operations.</p>
<p>Let&rsquo;s talk about how this works.</p>
<p>In the case of a row-oriented workflow, let&rsquo;s say you want to recreate Row 2. To do this, you would simply need to look at Block 1 and Block 2. If you were operating in a Columnar storage model, you would need to look at Block 1, Block 2, and Block 3. You&rsquo;ve saved a whole Block!</p>
<p>In the case of a column-oriented workflow, let&rsquo;s say you want to recreate Column B. In this case, you would simply need to look at Block 1 and Block 2. If you were operating in a Row-wise storage model, you would need to look at Block 1, Block 2, and Block 3. You&rsquo;ve once again saved a whole Block!</p>
<p>Our examples used very small data, but you can imagine how these savings grow with larger datasets.</p>
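<p>The block counting can be sketched in Python by comparing the three cell orderings side by side. Everything concrete here is an assumption made for illustration: a four-row, three-column table, four cells per block, two rows per Row Group, and the helper names. Blocks are numbered from zero, so they will not line up with the diagrams above, and real formats such as Parquet size their row groups very differently.</p>
<pre><code class="language-python">N_ROWS = 4           # hypothetical table size
N_COLS = 3           # columns A, B, C map to indices 0, 1, 2
BLOCK_SIZE = 4       # cells per block (assumed)
ROW_GROUP_SIZE = 2   # rows per Row Group (assumed)


def row_wise_order(n_rows, n_cols):
    """Cells in the order a row-wise layout writes them."""
    return [(r, c) for r in range(n_rows) for c in range(n_cols)]


def column_wise_order(n_rows, n_cols):
    """Cells in the order a columnar layout writes them."""
    return [(r, c) for c in range(n_cols) for r in range(n_rows)]


def hybrid_order(n_rows, n_cols, group_size):
    """Walk the table one Row Group at a time, column by column inside it."""
    order = []
    for start in range(0, n_rows, group_size):
        stop = min(start + group_size, n_rows)
        for c in range(n_cols):
            for r in range(start, stop):
                order.append((r, c))
    return order


def blocks_touched(order, block_size, wanted):
    """Sorted indices of the blocks that hold at least one wanted cell."""
    return sorted({pos // block_size
                   for pos, (r, c) in enumerate(order) if wanted(r, c)})


layouts = {
    "row-wise": row_wise_order(N_ROWS, N_COLS),
    "columnar": column_wise_order(N_ROWS, N_COLS),
    "hybrid": hybrid_order(N_ROWS, N_COLS, ROW_GROUP_SIZE),
}
for name, order in layouts.items():
    row_2 = blocks_touched(order, BLOCK_SIZE, lambda r, c: r == 2)
    col_b = blocks_touched(order, BLOCK_SIZE, lambda r, c: c == 1)
    print(name, "Row 2:", row_2, "Column B:", col_b)
</code></pre>
<p>With these assumed sizes, the hybrid layout reads two blocks to rebuild Row 2 (columnar reads three) and two blocks to read Column B (row-wise reads three).</p>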
<h2 id="data-workflows">Data Workflows</h2>
<p>Throughout this post, I&rsquo;ve referred to my data workflows as &ldquo;row oriented&rdquo; or &ldquo;column oriented.&rdquo; Luckily for us, the big data community has come up with some terminology that should help bring these two workflows to life.</p>
<h3 id="oltp">OLTP</h3>
<p>Online Transaction Processing (OLTP) workloads generally involve a large number of short queries/transactions. These tend to be more focused on processing than analytics and, as such, have more data updates and deletions. Roughly, we can consider OLTP workflows as &ldquo;row oriented&rdquo; workflows.</p>
<h3 id="olap">OLAP</h3>
<p>Online Analytical Processing (OLAP) workloads are focused more on analysis than on processing. As such, there tends to be more analytical complexity per query and fewer CRUD transactions. Roughly, we can consider OLAP workflows as &ldquo;column oriented&rdquo; workflows.</p>
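<p>As a rough illustration of the two query shapes, here is a small sketch using Python&rsquo;s built-in <code>sqlite3</code> module. The <code>orders</code> table and its contents are hypothetical, and SQLite is used only because it ships with Python. The sketch shows what each kind of query touches, not how a given engine stores the data.</p>
<pre><code class="language-python">import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "ann", 10.0), (2, "bob", 25.0), (3, "ann", 5.0)],
)

# OLTP-style: a short transaction that updates and reads back a single row.
with conn:
    conn.execute("UPDATE orders SET amount = 30.0 WHERE id = 2")
one_row = conn.execute(
    "SELECT id, customer, amount FROM orders WHERE id = 2"
).fetchone()

# OLAP-style: an aggregate that reads one column across every row.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(one_row)
print(total)
</code></pre>
<p>The first query needs every column of one row, which is the row-oriented access pattern. The second needs one column from every row, which is the column-oriented access pattern.</p>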
<h2 id="conclusion">Conclusion</h2>
<p>Using data efficiently relies on using all levels of the &ldquo;data stack&rdquo; (storage, network, compute) efficiently. Reducing the amount of unnecessary data read during a query process can have compounding effects on the speed and efficiency of your analytics process.</p>
<p>In subsequent parts of this series, I&rsquo;ll be digging more into the details of how everything we have covered thus far can be applied in analytics workloads.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Apache Spark File Format Ecosystem</title>
      <link>https://vinoo.io/talks/2020-06-24-spark-file-format-ecosystem/</link>
      <pubDate>Wed, 24 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/talks/2020-06-24-spark-file-format-ecosystem/</guid>
      <description>Spark Summit 2020</description>
      <content:encoded><![CDATA[<p>In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of those jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll go on to describe the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow).</p>
<p>After this presentation, attendees should understand the core concepts behind the prevalent file formats, the relevant file-format-specific settings, and how to select the correct file format for their jobs. This presentation is relevant to Spark+AI Summit because, as more AI/ML workflows move into the Spark ecosystem (especially IO-intensive deep learning), leveraging the correct file format is paramount for performant model training.</p>
<h1 id="link">Link</h1>
<p><a href="https://databricks.com/session_na20/the-apache-spark-file-format-ecosystem">https://databricks.com/session_na20/the-apache-spark-file-format-ecosystem</a></p>
<h1 id="video">Video</h1>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/auNAzC3AU18?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

]]></content:encoded>
    </item>
  </channel>
</rss>
