Optimization on Vinoo Ganesh

Optimizing Query Workloads

Wed, 28 Sep 2022 00:00:00 +0000

This week on The Data Stack Show, Eric and Kostas chat with Vinoo Ganesh. During the episode, Vinoo discusses how to benchmark cost, optimize your workloads, and Bluesky’s role in addressing your Snowflake bills.

Video

Link

https://datastackshow.com/podcast/optimizing-query-workloads-and-your-snowflake-bill-with-vinoo-ganesh-of-bluesky-data/

Data SLA Nightmares & Lessons Learned

Wed, 11 Aug 2021 00:00:00 +0000

Databricks Sr. Staff Developer Advocate, Denny Lee, Citadel Head of Business Engineering, Vinoo Ganesh, and Databand.ai Co-Founder & CEO, Josh Benamram, discuss the complexities and business necessity of setting clear data service-level agreements (SLAs). They share their experiences around the importance of contractual expectations and why data delivery success criteria are prone to disguise failures as success in spite of our best intentions. Denny, Vinoo, and Josh challenge businesses of all industries to see themselves as data companies by driving home a costly reality – what do businesses have to lose when their data is wrong? A lot more than they’d like to believe.

Link

https://databand.ai/mad-data-podcast/defining-data-quality-data-sla-nightmares-lessons-learned/

Large Scale Data Analytics with Vinoo Ganesh

Fri, 05 Feb 2021 00:00:00 +0000

In this episode of The Data Standard, Catherine Tao and Vinoo Ganesh talk about large-scale data and data processing challenges. Vinoo starts the conversation by explaining his current obligations and how his company uses data to find working solutions for a wide range of problems. Then he talks about OLTP and OLAP models and how large-scale data can help improve workflows and offer better results. Optimization is needed for every specific application, and Vinoo talks about the methods he uses to enhance existing platforms. Even when the newly developed systems show positive results, the work is never done, as optimization is a constant, dynamic process.

He then goes over the techniques used to extract useful data. The distribution of data and data types have the most significant impact on data quality. Vinoo talks about the challenges of working with data, where a simple data movement can present a massive problem. Constant profiling is needed to help scale the data and make sure that the computing power can cope.

Finally, the guest talks about handling messy data that doesn’t have the required quality. He talks about the multiple problems data scientists have to consider when sorting messy data to make it more useful.

Link

https://datastandard.io/podcast/large-scale-data-analytics-with-vinoo-ganesh-at-veraset/

The Apache Spark File Format Ecosystem

Wed, 24 Jun 2020 00:00:00 +0000

In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability of jobs to their compute cost. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem – namely Parquet, ORC, and Avro. We’ll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We’ll then dive deep into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), stripes and footers of ORC, and the schema evolution capabilities of Avro. We’ll go on to describe the specific SparkConf / SQLConf settings that developers can use to tune these file formats. We’ll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow).

After this presentation, attendees should understand the core concepts behind the prevalent file formats, the relevant format-specific settings, and how to select the correct file format for their jobs. This presentation is relevant to Spark + AI Summit because, as more AI/ML workflows move into the Spark ecosystem (especially IO-intensive deep learning), choosing the correct file format is paramount to performant model training.

Link

https://databricks.com/session_na20/the-apache-spark-file-format-ecosystem