Distributed Data and Blocks
Originally published on Efficiently (Substack)

This is a continuation of a previous blog post about efficient data partitioning. In the previous post, I discussed how data layout on disk impacts analytics performance. This post focuses on tactical implementation using open source technologies.

Topics I’ll cover:
- HDFS Blocks + Block Size
- Block sizes + tradeoffs

Background

Data organization on disk dramatically affects analytics performance. I previously explored row-oriented, columnar, and hybrid storage models; now let’s connect these concepts to modern data infrastructure. ...