Veraset on Vinoo Ganesh

Building Chatbots with Rasa

Fri, 09 Feb 2024 00:00:00 +0000

Tell us about your background

Throughout my career, I have dedicated myself to creating tools, products, and technologies that help people effectively utilize their data. My passion lies in developing products that enable users to efficiently and scalably gain maximum value from their data.

My journey in understanding the intricacies of data and its potential began at Palantir Technologies, where I worked on search and indexing products. As data volumes grew, I focused my efforts on solving some of Palantir customers’ core problems across the financial and defense verticals before leading customer-focused compute teams. After Palantir, I served as CTO at Veraset, a cloud-based data-as-a-service company. Veraset delivered high-quality, large-scale data to a number of enterprises and grew to 15M ARR before being acquired. Following this, I joined Citadel Investment Group as the Head of Business Engineering of Ashler. In that role, I managed crucial data operations, including overseeing data pipelines, investment platforms, data lakes, and the software and data engineering teams responsible for them.

Currently, I’m working on building a new company in the data platform + AI space.

What are you building?

I am developing a prototype to showcase how Large Language Models (LLMs) can revolutionize a major public institution in New York City. This institution’s data is currently scattered across various systems (not all of which are organized). The solution I’m creating will integrate external APIs, fundamental coding elements, user-generated commands, and intuitive interaction with a large language model. This will allow individuals to pose complex inquiries in everyday language. My objective is straightforward: by highlighting the recent progress and sophisticated tools available in our field, I aim to initiate this enterprise’s digital transformation.

What technology have you been using so far?

Before adopting CALM, I utilized a blend of LangChain/LlamaIndex for “routing,” Chroma for my Vector Database, and OpenAI’s models for my cloud-based LLM tasks. For on-prem LLMs, I used Databricks Dolly/MPT models. Each of these technologies excelled at its specific function, but there was a notable lack of cohesive integration among them. As such, my experience largely involved playing a guessing game with temperatures and prompts to steer a conversation into the fixed pattern that I wanted. Additionally, the unpredictable nature of the models meant that maintaining a natural, fluid conversation often had to be sacrificed in favor of rigid rules. This constraint resulted in a user experience that was less enchanting, less spontaneous, and, as early customers described it…less magical.

Describe CALM in 3 words:

“Transformative, Extensible, Simple.”

What are your biggest challenges in building and maintaining an AI assistant?

The majority of tools in the industry are developed with a primary emphasis on the tool itself, rather than creating a holistic user experience. This focus can make the development process challenging and unwieldy for developers who have to navigate and integrate various backend elements, models, and systems in an effort to achieve a smooth user experience.

Furthermore, there is a prevalent ‘all-or-nothing’ mentality in the utilization of Large Language Models (LLMs). This approach suggests that LLMs should either be responsible for the entire workflow, from data analysis and code generation to producing readable output, or not be used at all. Finding products that leverage the strengths of LLMs while also providing developers with flexibility and choice, especially in scenarios where LLMs may not be the most suitable option, remains a challenge.

Finally, AI assistants are fundamentally task oriented. These tasks generally have multiple steps and workflows associated with them. Actually visualizing these steps and ensuring they work under a variety of conditions is a challenging part of the AI assistant development process.

Which functionality of CALM helps the most to resolve the challenges of building AI assistants?

I don’t think I have ever seen a product that has increased my developer velocity more than Rasa Pro with CALM. Two CALM features in particular stood out to me.

Declarative YAML flows - CALM’s declarative, user-based flows put developers right in the seat of the user. By describing the exact user flow, coupled with validation and data gathered, developers can not only clearly and deliberately articulate the ideal conversation flow, but also visualize and extend it. It allowed me to start my development work with an articulation of the happy paths, so I was able to build quickly and with guardrails.

Conversation Repair - In the ideal world, users would follow our predefined conversation flows to the letter. However, real-world interactions are rarely so linear. Users often diverge from the set path, whether to ask questions, explore tangents, or jump into different flows entirely. This is where the magic of conversation repair comes into play. It’s a foundational element that gracefully handles these deviations, allowing users the freedom to stray from the main path without getting lost. They can ask something off-topic, follow a different thread, and then seamlessly return to the original flow. I’ve found that integrating conversation repair into chatbots is crucial for mimicking human interaction. It’s not just about guiding users through a set journey; it’s about creating an experience that acknowledges and adapts to their natural conversational behavior. This flexibility is key to making chatbots more relatable and engaging, transforming them from mere tools into conversational partners.
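The two ideas above, a declaratively described flow and the ability to digress and come back, can be illustrated with a toy sketch. This is not how Rasa or CALM implements either feature, and it uses no Rasa API: the `FLOWS` table, the `Frame` and `Assistant` classes, and the flow and slot names are all hypothetical, chosen only to show the shape of the behavior. Each flow is declared as an ordered list of values to collect, and a stack keeps a paused flow underneath whatever the user has wandered into.

```python
from dataclasses import dataclass, field

# Hypothetical declarative flow definitions: each flow lists, in order,
# the pieces of information it needs to collect from the user.
FLOWS = {
    "transfer_money": ["recipient", "amount"],
    "check_balance": ["account"],
}


@dataclass
class Frame:
    """One in-progress flow: which flow, which step, what was collected."""
    flow: str
    step: int = 0
    slots: dict = field(default_factory=dict)


class Assistant:
    def __init__(self, flows):
        self.flows = flows
        self.stack = []       # active flows; the last item is the current one
        self.completed = []   # (flow name, collected slots) for finished flows

    def start(self, flow):
        """Begin a flow. Any flow already in progress stays paused below it."""
        self.stack.append(Frame(flow))
        return self._prompt()

    def answer(self, value):
        """Fill the slot the current flow is waiting for, then move on."""
        frame = self.stack[-1]
        slot = self.flows[frame.flow][frame.step]
        frame.slots[slot] = value
        frame.step += 1
        if frame.step == len(self.flows[frame.flow]):
            # Flow finished: record it and resume whatever was paused.
            self.completed.append((frame.flow, frame.slots))
            self.stack.pop()
        return self._prompt()

    def _prompt(self):
        if not self.stack:
            return "done"
        frame = self.stack[-1]
        return f"{frame.flow}: please provide {self.flows[frame.flow][frame.step]}"


bot = Assistant(FLOWS)
print(bot.start("transfer_money"))  # asks for the recipient
print(bot.answer("Alice"))          # asks for the amount
print(bot.start("check_balance"))   # the user digresses into another flow
print(bot.answer("savings"))        # digression ends; asks for the amount again
print(bot.answer("50"))             # original flow completes
```

In this sketch the happy path lives entirely in the `FLOWS` data, while the stack is what lets the user leave it and return. A real assistant would also need to decide from free text when the user is digressing, which is the part this toy leaves out.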

What prompted you to explore CALM?

I quickly realized the power of a structured approach to chatbot creation while playing with Rasa’s open source tool. This platform enabled me to set specific rules for extracting entities and establish policies for guiding the flow of conversations. Working with Rasa’s open source solution was a game-changer for me as a developer. However, during this process, I began to recognize a limitation: while the Natural Language Understanding (NLU) models were powerful, they didn’t quite capture the more dynamic, almost magical interaction quality offered by Large Language Models (LLMs). This realization led me to explore how LLMs could be integrated with Rasa. This is when I discovered CALM, which combines the power of LLMs with the controls of an NLU-based approach.

The Rasa Pro Developer Edition from Rasa gave me the chance to explore the full scope of the product functionality and build a prototype.

Link

https://rasa.com/blog/navigating-calm-s-benefits-technical-insights-from-vinoo-ganesh/

Migrating to Parquet

Tue, 13 Jul 2021 00:00:00 +0000

I work at a data-as-a-service (DaaS) company that delivers petabytes of geospatial data to customers across a variety of industries. We build and manage a central data lake, housing years of data, and operationalize that data to solve our customers’ problems. I recently gave a talk about the specifics of file formats at Spark+AI Summit 2020 that generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand how this migration had a drastic effect on all of our customers. This session will drill into the operational burden of transforming the storage format in an ecosystem and its impact on the business.

Link

https://www.dremio.com/subsurface/migrating-to-parquet-the-veraset-story/

Accelerating Data Evaluation

Fri, 28 May 2021 00:00:00 +0000

As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge – demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. In smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets are in the terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high quality dataset” can change daily or even hourly.
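As a small illustration of the row count and cardinality measures mentioned above, here is a minimal sketch in plain Python. The `profile` function and the sample column names (`device_id`, `city`) are hypothetical and not taken from any Veraset dataset. It counts rows and the number of distinct values per column.

```python
from collections import defaultdict


def profile(rows):
    """Return (row_count, {column: distinct_value_count}) for an iterable of dicts."""
    distinct = defaultdict(set)
    row_count = 0
    for row in rows:
        row_count += 1
        for column, value in row.items():
            distinct[column].add(value)
    return row_count, {column: len(values) for column, values in distinct.items()}


# Hypothetical sample records, for illustration only.
sample = [
    {"device_id": "a", "city": "NYC"},
    {"device_id": "b", "city": "NYC"},
    {"device_id": "a", "city": "Boston"},
]

count, cardinality = profile(sample)
print(count, cardinality)  # 3 {'device_id': 2, 'city': 2}
```

Holding every distinct value in memory like this is only practical for small datasets, which is part of why these simple measures stop being enough at larger scale.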

This breakout session will describe Veraset’s partnership with Databricks, and how we have white-labeled Databricks to showcase and accelerate the value of our data. We’ll discuss the challenges that data brokers have faced to date and some of the primitives of our businesses that have guided our direction thus far. We will also demo our white-label instance and notebook live to show how we’ve been able to provide key insights to our customers and reduce the TTFB of data onboarding.

Link

https://databricks.com/session_na21/brokering-data-accelerating-data-evaluation-with-databricks-white-label

Guaranteeing pipeline SLAs and data quality standards with Databand

Fri, 28 May 2021 00:00:00 +0000

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy is more literal, where problems in the flow of data - delays, low quality, high volatility - could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises?

As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data.

In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues, and proactively alerts users to problems to avoid data downtime.

The session will be led by Josh Benamram, CEO and Cofounder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective.

Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!

Link

https://airflowsummit.org/sessions/2021/data-quality-standards-databand/

Large Scale Data Analytics with Vinoo Ganesh

Fri, 05 Feb 2021 00:00:00 +0000

In this episode of The Data Standard, Catherine Tao and Vinoo Ganesh talk about large-scale data and data processing challenges. Vinoo starts the conversation by explaining his current responsibilities and how his company uses data to find working solutions for a wide range of problems. Then he talks about OLTP and OLAP models and how large-scale data can help improve workflows and offer better results. Optimization is needed for every specific application, and Vinoo talks about the methods he uses to enhance existing platforms. Even when the newly developed systems show positive results, the work is never done, as optimization is a constant, dynamic process.

He then goes over the techniques used to extract useful data. The distribution of data and data types have the most significant impact on data quality. Vinoo talks about the challenges of working with data, where a simple data movement can present a massive problem. Constant profiling is needed to help scale the data and make sure that the computing power can cope.

Finally, the guest talks about handling messy data that doesn’t have the required quality. He talks about the multiple problems data scientists have to consider when sorting messy data to make it more useful.

Link

https://datastandard.io/podcast/large-scale-data-analytics-with-vinoo-ganesh-at-veraset/