Markus Stoll, Co-Founder of Renumics and developer of the interactive ML dataset exploration tool Spotlight, shares insights on structuring unstructured data such as text and images. He discusses techniques such as UMAP for data visualization, which help with anomaly detection and make datasets easier to explore. Markus emphasizes the importance of specialized models in industrial AI and an iterative approach to managing complex automotive datasets. His methods aim to bridge the gap between machine learning and practical applications, making data analysis more accessible.
Podcast summary created with Snipd AI
Quick takeaways
Effective data visualization techniques, like UMAP, simplify the analysis of complex machine learning datasets by identifying clusters and anomalies.
Integrating multimodal data is essential for improving machine learning models, allowing specialized training for individual data types to enhance performance.
Deep dives
Data Visualization Techniques
Effective data visualization techniques are crucial for understanding complex datasets, especially in the context of machine learning and simulations. One approach discussed is using UMAP to reduce high-dimensional embeddings to two-dimensional representations. This makes clusters and anomalies in the data visible, so users can spot unwanted entries such as stray Wikipedia headers or footers. Because the embeddings are presented in a simpler format, team members can interact with the data more easily during presentations and discussions.
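The idea of projecting embeddings to two dimensions and looking for isolated points can be sketched as follows. This is a minimal illustration, not the workflow from the episode: it uses a PCA projection built with NumPy as a simple linear stand-in for UMAP (with the umap-learn package installed, `umap.UMAP(n_components=2).fit_transform(embeddings)` would take the place of `project_to_2d`). The synthetic embeddings, the function names, and the nearest-neighbour outlier rule are all assumptions made for the example.

```python
import numpy as np


def project_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project embeddings to 2D with PCA (a linear stand-in for UMAP)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def flag_outliers(points_2d: np.ndarray, k: int = 3, factor: float = 5.0) -> np.ndarray:
    """Flag points whose mean distance to their k nearest neighbours is
    much larger than the typical (median) value across the dataset."""
    diffs = points_2d[:, None, :] - points_2d[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return knn_mean > factor * np.median(knn_mean)


# Synthetic data (assumption): two tight clusters plus one stray entry,
# standing in for text embeddings with one unwanted record.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 16))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 16))
stray = np.zeros((1, 16))
stray[0, :8] = 5.0
embeddings = np.vstack([cluster_a, cluster_b, stray])

points_2d = project_to_2d(embeddings)
outlier_mask = flag_outliers(points_2d)
print("flagged indices:", np.flatnonzero(outlier_mask))
```

The flagged indices are candidates for manual review rather than confirmed errors. In practice the 2D points would be shown in an interactive scatter plot so that a person can inspect what each isolated point actually contains.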
Embedding Analysis for Model Optimization
Embedding space analysis is vital for optimizing machine learning models by evaluating the relevance of documents and their relationships to user queries. By embedding both questions and documents, it becomes possible to visualize which documents are most frequently referenced in response to specific queries. This analysis also reveals gaps in reference questions, prompting discussions with clients about data relevance and coverage. The goal is to ensure that the machine learning model effectively addresses user needs and expectations through continuous refinement based on feedback.
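A small sketch of this kind of analysis, under stated assumptions: questions and documents are already embedded (the three-dimensional vectors below are made up for illustration), retrieval is plain cosine-similarity ranking, and the names `retrieve`, `doc_vecs`, and `question_vecs` are hypothetical. It counts how often each document is retrieved and lists documents that no reference question reaches.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(question_vec, doc_vecs, top_k=1):
    """Return the ids of the top_k documents most similar to the question."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(question_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings (assumption): real ones would come from an embedding model.
doc_vecs = {
    "manual": [1.0, 0.0, 0.0],
    "faq": [0.0, 1.0, 0.0],
    "changelog": [0.0, 0.0, 1.0],
}
question_vecs = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.8, 0.2, 0.1],
    "q3": [0.1, 0.9, 0.0],
}

# Count how often each document is retrieved across all reference questions.
retrieval_counts = Counter()
for vec in question_vecs.values():
    retrieval_counts.update(retrieve(vec, doc_vecs, top_k=1))

# Documents that no question retrieves hint at gaps in the reference questions.
never_retrieved = sorted(set(doc_vecs) - set(retrieval_counts))
print(retrieval_counts.most_common(), never_retrieved)
```

A document that is never retrieved is not necessarily irrelevant. It may simply mean that the reference questions do not cover that part of the corpus, which is the kind of gap worth raising with the client.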
Challenges in Data Curation and Annotation
Curating and annotating data for machine learning applications poses significant challenges, especially when tests generate diverse data types. In automotive testing, vehicles equipped with extensive sensors produce large datasets that must be cleaned and standardized before analysis. Anomalies or misconfigurations in sensor setups can lead to discrepancies, so robust labeling and data normalization processes are essential. Historical data often helps refine current datasets, but users must remain pragmatic and aim for a workable solution rather than perfection from the start.
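As a rough sketch of what such a normalization step might look like, the example below maps differing channel names to one canonical name, converts units to m/s, and flags runs whose average deviates strongly from historical values. Everything here is hypothetical: the channel aliases, the record layout, the sample numbers, and the median-absolute-deviation rule are assumptions for illustration, not details from the episode.

```python
from statistics import mean, median

# Hypothetical mappings: different test setups name the same channel differently.
CHANNEL_ALIASES = {"veh_speed": "speed", "Speed_kmh": "speed", "speed": "speed"}
TO_METERS_PER_SECOND = {"km/h": 1 / 3.6, "m/s": 1.0}


def normalize_run(run):
    """Map a raw run to a canonical channel name with values in m/s."""
    scale = TO_METERS_PER_SECOND[run["unit"]]
    return {
        "run_id": run["run_id"],
        "channel": CHANNEL_ALIASES[run["channel"]],
        "unit": "m/s",
        "values": [v * scale for v in run["values"]],
    }


def flag_suspicious(runs, historical_means, tolerance=5.0):
    """Flag runs whose mean is far from the historical median, measured
    in multiples of the median absolute deviation (MAD)."""
    center = median(historical_means)
    mad = median([abs(m - center) for m in historical_means])
    return [
        r["run_id"] for r in runs if abs(mean(r["values"]) - center) > tolerance * mad
    ]


raw_runs = [
    {"run_id": "run_a", "channel": "veh_speed", "unit": "km/h", "values": [100.8, 108.0]},
    # Misconfigured setup (assumption): values are km/h but logged as m/s.
    {"run_id": "run_b", "channel": "Speed_kmh", "unit": "m/s", "values": [100.0, 110.0]},
]
historical_means = [27.0, 28.0, 29.0, 30.0, 31.0]  # m/s, from earlier test campaigns

normalized = [normalize_run(r) for r in raw_runs]
flagged = flag_suspicious(normalized, historical_means)
print(flagged)
```

A check like this only surfaces candidates. Whether a flagged run is a sensor misconfiguration or a genuinely unusual test still needs a human decision, which fits the pragmatic, iterative approach described above.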
Multimodal Data Integration
Integrating multimodal data—such as audio, visual, and sensor data—is becoming increasingly important in enhancing machine learning models. Currently, different models are usually trained for each data type, with a focus on collecting relevant data tailored to specific use cases. Although training a single model for all modalities is a future goal, the current approach maximizes the performance of individual models by specializing them. This strategy allows for the effective handling of unique datasets, even when quality varies significantly from publicly available datasets.
Markus Stoll is the Co-Founder of Renumics and the developer behind the open-source interactive ML dataset exploration tool, Spotlight. He shares insights on:
AI in Engineering and Manufacturing
Interactive ML Data Visualization
ML Data Exploration
Follow Markus for hands-on articles about leveraging ML while keeping a strong focus on data.
Visualize - Bringing Structure to Unstructured Data // MLOps Podcast #258 with Markus Stoll, CTO of Renumics.
A huge thank you to SAS for their generous support!
// Abstract
This talk is about how data visualization and embeddings can support you in understanding your machine-learning data. We explore methods to structure and visualize unstructured data like text, images, and audio for applications ranging from classification and detection to Retrieval-Augmented Generation. By using tools and techniques like UMAP to reduce data dimensions and visualization tools like Renumics Spotlight, we aim to make data analysis for ML easier. Whether you're dealing with interpretable features, metadata, or embeddings, we'll show you how to use them all together to uncover hidden patterns in multimodal data, evaluate the model performance for data subgroups, and find failure modes of your ML models.
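Evaluating model performance for data subgroups, as mentioned in the abstract, can be illustrated with a short sketch. The records, the `sensor` metadata field, and the function name are made up for the example; the point is only that a per-group metric can expose a failure mode that the overall accuracy hides.

```python
from collections import defaultdict


def accuracy_by_subgroup(records, group_key):
    """Compute accuracy separately for each value of a metadata field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        hits[group] += int(record["label"] == record["prediction"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy predictions (assumption) with one metadata column, "sensor".
records = [
    {"label": "ok", "prediction": "ok", "sensor": "cam_front"},
    {"label": "defect", "prediction": "defect", "sensor": "cam_front"},
    {"label": "ok", "prediction": "ok", "sensor": "cam_front"},
    {"label": "defect", "prediction": "ok", "sensor": "cam_rear"},
    {"label": "ok", "prediction": "defect", "sensor": "cam_rear"},
    {"label": "ok", "prediction": "ok", "sensor": "cam_rear"},
]

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_group = accuracy_by_subgroup(records, "sensor")
worst_group = min(per_group, key=per_group.get)
print(overall, per_group, worst_group)
```

Here the overall accuracy looks moderate while one subgroup performs much worse than the other. The same grouping could be applied to clusters found in an embedding plot instead of a metadata column.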
// Bio
Markus Stoll began his career in the industry at Siemens Healthineers, developing software for the Heavy Ion Therapy Center in Heidelberg. He learned about software quality while developing a treatment machine weighing over 600 tons. He earned a Ph.D., focusing on combining biomechanical models with statistical models, through which he learned how challenging it is to bridge the gap between research and practical application in the healthcare domain. Since co-founding Renumics, he has been active in the field of AI for Engineering, e.g., AI for Computer Aided Engineering (CAE), implementing projects, contributing to their open-source library for data exploration for ML datasets (Renumics Spotlight) and writing articles about data visualization.
// MLOps Jobs board
https://mlops.pallet.xyz/jobs
// MLOps Swag/Merch
https://mlops-community.myshopify.com/
// Related Links
Website: https://renumics.com/
MLSecOps Community: https://community.mlsecops.com/
Blogs: https://towardsdatascience.com/visualize-your-rag-data-evaluate-your-retrieval-augmented-generation-system-with-ragas-fc2486308557
https://medium.com/itnext/how-to-explore-and-visualize-ml-data-for-object-detection-in-images-88e074f46361
--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Markus on LinkedIn: https://www.linkedin.com/in/markus-stoll-b39a42138/