Please Rate and Review us on your podcast app of choice!
Get involved with Data Mesh Understanding's free community roundtables and introductions: https://landing.datameshunderstanding.com/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding. Get in touch with Scott on LinkedIn.
Transcript for this episode (link) provided by Starburst. You can download their Data Products for Dummies e-book (info-gated) here and their Data Mesh for Dummies e-book (info-gated) here.
May's LinkedIn: https://www.linkedin.com/in/may-xu-sydney/
In this episode, Scott interviewed May Xu, Head of Technology, APAC Digital Engineering at Thoughtworks. To be clear, she was only representing her own views on the episode.
We will use the terms GenAI and LLMs to mean Generative AI and Large Language Models in this write-up rather than spell out the entire phrase each time :)
Some key takeaways/thoughts from May's point of view:
May started with three general approaches organizations are taking to generative AI (GenAI): 1) building their own LLMs from scratch, 2) fine-tuning specific, pre-trained existing LLMs, or 3) leveraging pre-trained LLMs as is. Many organizations may want to do the first, but training your own LLMs from scratch is prohibitively expensive in compute alone, and you also need (very expensive) people with very specific expertise to do so. Tuning pre-trained models will likely become the standard approach for many organizations. However, being able to leverage LLMs on internal data generally requires "existing good quality data and solid data architecture."
When considering training a model from scratch, May also pointed to time as an issue. It typically takes at least three months to properly train an LLM from scratch. Parallel training helps, but you need to fine-tune results and retrain, so you can't just throw compute at it and make the process that much faster. So again, you need high quality data - and you need a LOT of it - plus a fair amount of time plus a ton of money. Once a model is in production, it also takes a lot of money and effort to keep it running and tuned properly 😅 Luckily, according to some surveys Thoughtworks did, most organizations recognize training LLMs from scratch isn't the right call for them just yet.
May is seeing a trend of people moving away from the 'bigger is better' mentality. More people are starting to explore more targeted and specialized models with fewer parameters. And often, for specific tasks, they outperform the truly large models - the first L in LLM. So we may see a trend towards more and more targeted LLMs/models. Scott note: Madhav Srinath really leaned into this in his episode, #264.
Humanity in general has benefited greatly from machines through predictability and reliability, according to May. If they are made well, you essentially know what you should/will get from machines. But GenAI is designed specifically to act like humans, and humans are not predictable and often not that reliable. So people have to get used to interacting with machines that may give wrong answers and are designed - in a way - to do so 😅 We can't expect predictability and reliability from GenAI.
Relatedly, when thinking about where GenAI is the right choice versus traditional machine learning/AI, May believes you really have to dig into the tradeoffs. If you really understand the problem set and what you are trying to accomplish, traditional ML/AI is probably the better approach for you. You need to really understand where the strengths of GenAI will play and feed it the data/information it needs to succeed; otherwise you'll be asking an uninformed and unpredictable entity to solve your most pressing business problems. That's probably not going to go well…
May talked about going back to the basics of problem solving when it comes to Generative AI: start with what problem you are trying to solve, rather than picking a way to solve a problem and then working your way back to the problem. It can sound obvious, but many are in such a rush to leverage these tools that it's crucial to stop and consider. Start with the problem before the solution 😅
GenAI may also surface a number of internal business challenges that aren't spoken about or that people have essentially given up on tackling, according to May. We have a new tool in the toolbox, so people want to see if it will be useful for tackling something they haven't been able to address well previously. Lean into GenAI as a conversational lubricant. GenAI may not be the right tool for every one of these challenges, but it means there is more internal conversation and sharing :)
From what May is seeing, many to most organizations are still in the early experimenting and PoC phase with Generative AI. They are trying to figure out what opportunities GenAI brings and also what risks. Despite the hype, people are taking their time, and they aren't as focused on initial return on investment as on validating whether they can actually leverage GenAI to create value. Also, there is a strong trend towards domain-specific LLMs rather than general purpose ones, e.g. financial sector or media specific models.
May finished on the idea that data mesh and other data management paradigms are crucial to doing something like GenAI right. There is still a strong need for quality data that is accessible, interoperable, privacy-aware, secured, etc. to be able to leverage GenAI well.
Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
All music used in this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf