
DataTopics Unplugged: All Things Data, AI & Tech
#83 Who’s Minding the Metadata? Why Data Quality Matters in GenAI (Quality Time With Paolo)
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. DataTopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.
Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!
In this episode, host Murilo is joined by returning guest Paolo, Data Management Team Lead at dataroots, for a deep dive into the often-overlooked but rapidly evolving domain of unstructured data quality. Tune in for a field guide to navigating documents, images, and embeddings without losing your sanity.
What we unpack:
- Data management basics: Metadata, ownership, and why Excel isn’t everything.
- Structured vs unstructured data: How the wild west of PDFs, images, and audio is redefining quality.
- Data quality challenges for LLMs: From apples and pears to rogue chatbots with “legally binding” hallucinations.
- Practical checks for document hygiene: Versioning, ownership, embedding similarity, and tagging strategies.
- Retrieval-Augmented Generation (RAG): When ChatGPT meets your HR policies and things get weird.
- Monitoring and governance: Building systems that flag rot before your chatbot gives out 2017 vacation rules.
- Tooling and gaps: Where open source is doing well, and where we’re still duct-taping workflows.
- Real-world inspirations: A look at how QuantumBlack (McKinsey) is tackling similar issues with their AI for DQ framework.
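
For readers who want something concrete, the document hygiene checks listed above (ownership, staleness, embedding similarity) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the approach described in the episode: the `Doc` fields, the `hygiene_report` function, the one-year review window, and the 0.95 similarity threshold are all hypothetical, and the toy two-dimensional vectors stand in for real embeddings.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from math import sqrt
from typing import Optional


@dataclass
class Doc:
    """Hypothetical metadata record for one document in a RAG corpus."""
    doc_id: str
    owner: Optional[str]
    last_reviewed: date
    embedding: list[float]  # toy vector; a real one comes from an embedding model


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hygiene_report(docs, today, max_age_days=365, dup_threshold=0.95):
    """Return (id, issue) pairs; both thresholds are illustrative assumptions."""
    issues = []
    for d in docs:
        if not d.owner:
            issues.append((d.doc_id, "missing owner"))
        if (today - d.last_reviewed).days > max_age_days:
            issues.append((d.doc_id, "stale"))
    # Pairwise comparison is fine for a sketch; large corpora need an index.
    for a, b in combinations(docs, 2):
        if cosine(a.embedding, b.embedding) >= dup_threshold:
            issues.append((f"{a.doc_id}|{b.doc_id}", "near-duplicate"))
    return issues


docs = [
    Doc("vacation-2017", "hr", date(2017, 1, 1), [1.0, 0.0]),
    Doc("vacation-2024", "hr", date(2024, 1, 1), [0.99, 0.05]),
    Doc("expenses", None, date(2024, 3, 1), [0.0, 1.0]),
]
report = hygiene_report(docs, today=date(2024, 6, 1))
for item in report:
    print(item)
```

In this toy corpus the 2017 vacation policy is flagged as stale, the expenses document as unowned, and the two vacation policies as near-duplicates, which is the kind of conflict that can lead a chatbot to answer from outdated rules.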