177: AI-Based Data Cleaning, Data Labelling, and Data Enrichment with LLMs Featuring Rishabh Bhargava of Refuel
Feb 14, 2024
Rishabh Bhargava, CEO and co-founder of Refuel, discusses AI-based data cleaning, data labelling, and data enrichment with LLMs, covering the evolution of AI and LLMs, implementing use cases and cost considerations, categorizing search queries, benchmarking and evaluation, utilizing customer support ticket data, understanding confidence scores, and training models with human feedback.
Refuel is a platform for data cleaning, labeling, and enrichment using large language models (LLMs).
LLMs have rapidly evolved in recent years and are being implemented for tasks such as internal efficiency gains and improving data workflows.
Users should create personalized benchmarks when building LLM applications to select the appropriate model size and refine performance with iteration and feedback.
Deep dives
Overview of Refuel and the Founder's Background
Refuel is a platform for data cleaning, labeling, and enrichment using large language models (LLMs). The CEO and co-founder, Rishabh Bhargava, has a background in data, machine learning, and AI, with experience at Stanford and as an ML engineer. Refuel aims to make data work more efficient by letting users write instructions for LLMs to perform tasks instead of working with data manually.
The Evolution of LLMs and their Use in Companies
LLMs have rapidly evolved in recent years due to factors such as improved machine learning model architectures, increased data volumes, and improved hardware capabilities. Companies are starting to implement LLM-based technology for tasks such as internal efficiency gains, offering suggestions to users, and improving data workflows. While there are challenges in deploying LLMs at scale, the cost curve is promising, with costs expected to decrease over time.
Using Refuel for Documentation Improvement
One specific use case for Refuel discussed in the podcast is improving documentation at a software company. By leveraging customer support ticket data, Refuel helps identify problem areas in the documentation and suggests improvements. The difficulty of defining success for documentation is addressed through metrics like time on site, triangulated against customer support ticket data. Refuel's platform lets users plug in their data, select relevant templates, and write instructions for LLMs to categorize and analyze the data, making documentation improvement more accurate and efficient.
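As a rough illustration of that workflow, the sketch below categorizes support tickets with an LLM so the resulting tallies can point at weak spots in the docs. It uses the OpenAI Python client as a stand-in; the model name, categories, and prompt are assumptions for illustration, not Refuel's actual templates.

```python
# Minimal sketch: categorize support tickets to surface documentation gaps.
# Model, categories, and prompt are illustrative assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["setup", "authentication", "billing", "api-usage", "other"]
PROMPT_TEMPLATE = (
    "You are labeling customer support tickets to find weak spots in our docs. "
    "Assign exactly one category from: {cats}. "
    "Reply with the category name only.\n\nTicket: {ticket}"
)

def categorize(ticket: str) -> str:
    """Ask the LLM for a single category; fall back to 'other' on surprises."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you benchmark
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(
                cats=", ".join(CATEGORIES), ticket=ticket
            ),
        }],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"

# Tallying a batch of tickets shows which doc areas generate the most confusion.
tickets = ["How do I rotate my API key?", "My invoice shows the wrong plan."]
print(Counter(categorize(t) for t in tickets))
```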
Using LLMs for Data Management and Problem Solving
LLMs are valuable tools for data management and problem solving. Instead of writing and maintaining complex rules, the focus shifts to providing good instructions and incorporating feedback. The key is to write clear instructions and use binary feedback (thumbs up, thumbs down) to improve the LLM's performance. The availability of different LLM options and techniques offers more accuracy and flexibility, and the future may involve multiple specialized models rather than a single all-purpose model. This approach allows for scalability, cost-effectiveness, and quicker implementation.
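One simple way such a binary-feedback loop could work is sketched below, reusing the categories from the previous sketch: thumbs-down predictions, once corrected by a human, become few-shot examples for later prompts. This is an illustrative assumption, not Refuel's actual mechanism.

```python
# Illustrative sketch of improving an LLM labeler with binary feedback:
# confirmed or corrected labels are prepended to later prompts as
# few-shot examples. Not Refuel's actual mechanism.
CATEGORIES = ["setup", "authentication", "billing", "api-usage", "other"]
corrections: list[tuple[str, str]] = []  # (ticket, human-approved label)

def build_prompt(ticket: str) -> str:
    """Build a labeling prompt that includes all feedback gathered so far."""
    examples = "\n".join(f"Ticket: {t}\nCategory: {c}" for t, c in corrections)
    return (
        f"Assign exactly one category from: {', '.join(CATEGORIES)}. "
        "Reply with the category name only.\n"
        + (examples + "\n" if examples else "")
        + f"Ticket: {ticket}\nCategory:"
    )

def record_feedback(ticket: str, predicted: str, thumbs_up: bool,
                    corrected_label: str | None = None) -> None:
    # A thumbs up confirms the prediction; a thumbs down plus a human
    # correction turns the mistake into a few-shot example.
    if thumbs_up:
        corrections.append((ticket, predicted))
    elif corrected_label in CATEGORIES:
        corrections.append((ticket, corrected_label))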
Choosing Models and Benchmarking in LLM Applications
When building LLM applications on open-source models, users face several key decisions. First, they must select the appropriate base model for their application. Second, they should consider model size, since different sizes have different capabilities. Existing benchmarks in the academic literature, however, may not be sufficient for selecting the best model for a specific use case. To navigate this, users should create their own benchmarks aligned with their unique problems: start by testing a small, representative data set against different model sizes to find the smallest model that meets the desired accuracy, latency, and cost goals, then refine performance and confidence through iteration and feedback. Establishing a personalized benchmarking framework tailored to the use case is crucial.
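A minimal harness for that procedure might look like the sketch below. The candidate model list, targets, and `label_fn` hook are placeholders; real cost accounting would pull per-token pricing from your provider.

```python
# Minimal benchmarking sketch: run a small labeled sample through several
# candidate models, report accuracy and latency, and pick the smallest model
# that clears your targets. Model names and label_fn are placeholders.
import time

def benchmark(models: list[str],
              sample: list[tuple[str, str]],
              label_fn) -> dict[str, dict[str, float]]:
    """label_fn(model, ticket) -> predicted label; sample is (ticket, gold)."""
    results = {}
    for model in models:
        correct, start = 0, time.perf_counter()
        for ticket, gold in sample:
            if label_fn(model, ticket) == gold:
                correct += 1
        elapsed = time.perf_counter() - start
        results[model] = {
            "accuracy": correct / len(sample),
            "avg_latency_s": elapsed / len(sample),
        }
    return results

def pick_model(results: dict[str, dict[str, float]],
               models_small_to_large: list[str],
               min_accuracy: float = 0.90,
               max_latency_s: float = 2.0) -> str | None:
    # Walk from smallest to largest and return the first model that
    # meets the accuracy and latency targets (smaller usually means cheaper).
    for model in models_small_to_large:
        r = results[model]
        if r["accuracy"] >= min_accuracy and r["avg_latency_s"] <= max_latency_s:
            return model
    return None
```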
Chapters
Implementing LLM use cases and cost considerations (15:52)
User experience and fine-tuning LLM models (21:49)
Categorizing search queries (22:44)
Creating internal benchmark framework (29:50)
Benchmarking and evaluation (35:35)
Using Refuel for documentation (44:18)
The challenges of analytics (46:45)
Using customer support ticket data (48:17)
The tagging process (50:18)
Understanding confidence scores (59:22)
Training the model with human feedback (1:02:37)
Final thoughts and takeaways (1:05:48)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.