Migrating to Parquet

I work at a data-as-a-service (DaaS) company that delivers PBs of geospatial data to customers across a variety of industries. We build and manage a central data lake, housing years of data, and operationalize that data to solve our customers’ problems. I recently gave a talk about the specifics of file formats at Spark+AI Summit 2020 that generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand how this migration had a drastic effect on all of our customers. This session will drill into the operational burden of transforming the storage format in an ecosystem and its impact on the business. ...

July 13, 2021 · 1 min · 116 words · Vinoo Ganesh

Accelerating Data Evaluation

As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge: demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. In smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets are in the terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high quality dataset” can change daily or even hourly. ...

May 28, 2021 · 1 min · 184 words · Vinoo Ganesh

Guaranteeing pipeline SLAs and data quality standards with Databand

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy is more real, where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data. ...

May 28, 2021 · 1 min · 193 words · Vinoo Ganesh

Strata Data Superstream Series: Creating Data-Intensive Applications

As the scale of data continues to grow (alongside an ever-expanding ecosystem of tools to work with it), developing successful applications is an increasingly challenging proposition, and a necessity. At each stage of the process, from architecting to processing and storing data to deployment, there is a range of aspects to consider, such as scalability, consistency, reliability, efficiency, and maintainability. It can be hard to figure out the right way forward. ...

May 4, 2021 · 1 min · 138 words · Vinoo Ganesh

Large Scale Data Analytics with Vinoo Ganesh

In this episode of The Data Standard, Catherine Tao and Vinoo Ganesh talk about large-scale data and data processing challenges. Vinoo starts the conversation by explaining his current obligations and how his company uses data to find working solutions for a wide range of problems. Then he talks about OLTP and OLAP models and how large-scale data can help improve workflows and offer better results. Optimization is needed for every specific application, and Vinoo talks about the methods he uses to enhance existing platforms. Even when the newly developed systems show positive results, the work is never done, as optimization is a constant, dynamic process. ...

February 5, 2021 · 1 min · 204 words · Vinoo Ganesh