Large Scale Data Analytics with Vinoo Ganesh

In this episode of The Data Standard, Catherine Tao and Vinoo Ganesh talk about large-scale data and data processing challenges. Vinoo starts the conversation by explaining his current responsibilities and how his company uses data to find working solutions for a wide range of problems. Then he talks about OLTP and OLAP models and how large-scale data can help improve workflows and offer better results. Every application needs its own optimization, and Vinoo talks about the methods he uses to enhance existing platforms. Even when newly developed systems show positive results, the work is never done, as optimization is a constant, dynamic process. ...

February 5, 2021 · 1 min · 204 words · Vinoo Ganesh

The Apache Spark File Format Ecosystem

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of those jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem, namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. Next, we’ll describe the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow). ...

June 24, 2020 · 2 min · 299 words · Vinoo Ganesh

Campus Circulator Application

I built an app to help Wash U students get around campus! Check out the story / link below. An engineering undergraduate student at Washington University in St. Louis helped create and launch a mobile app that helps students track the campus circulator shuttle. Vinoo Ganesh, a senior majoring in computer science in the School of Engineering & Applied Science, developed the real-time tracking app, titled “WUSTL Circulator.” The app also shows the circulator’s route and stopping points along with a full schedule that students can browse. ...

February 18, 2013 · 3 min · 515 words · Vinoo Ganesh