MLOps.community

Evaluation Panel // Large Language Models in Production Conference Part II

Aug 25, 2023
Language model interpretability experts and AI researchers discuss the challenges of evaluating large language models, the impact of ChatGPT on the industry, evaluating model performance and dataset quality, the use of large language models in machine learning, and tool sets, guardrails, and challenges in working with language models.
INSIGHT

Evaluation Starts With Desired Behavior

  • Generative LLM projects rarely start by building a formal dataset; they begin with desired behavior and prompt design.
  • This shifts evaluation challenges to choosing the right test data and metrics rather than held-out training sets.
ADVICE

Measure Instruction Following And Safety

  • Measure both whether the model follows instructions and whether its answers are correct and coherent.
  • Actively evaluate hallucinations, context retention, safety, and latency as core metrics (see the sketch below).
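
A minimal sketch of the kind of evaluation loop this advice points at. Everything here is an assumption for illustration, not the panel's tooling: `call_model` is a hypothetical stand-in for a real LLM client, the test cases are hand-written, and simple keyword and length checks stand in for real instruction-following and hallucination judges.

```python
import time

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM client; returns a canned answer so the sketch runs."""
    return "MLOps is the practice of operationalizing machine learning systems."

# Hand-written test cases with the desired behavior encoded as checks.
TEST_CASES = [
    {
        "prompt": "Answer in exactly one sentence: what is MLOps?",
        "must_contain": ["machine learning"],   # rough correctness check
        "must_not_contain": ["I cannot"],       # rough refusal / safety check
        "max_sentences": 1,                     # instruction-following check
    },
]

def evaluate(cases):
    results = []
    for case in cases:
        start = time.perf_counter()
        answer = call_model(case["prompt"])
        latency_s = time.perf_counter() - start  # latency as a first-class metric

        results.append({
            "prompt": case["prompt"],
            "latency_s": round(latency_s, 3),
            "follows_length_instruction": answer.count(".") <= case["max_sentences"],
            "contains_expected_facts": all(
                k.lower() in answer.lower() for k in case["must_contain"]
            ),
            "contains_red_flags": any(
                k.lower() in answer.lower() for k in case["must_not_contain"]
            ),
        })
    return results

if __name__ == "__main__":
    for row in evaluate(TEST_CASES):
        print(row)
```

In practice the keyword checks would be replaced by human review or an LLM-as-judge step, but the shape of the loop (prompt, measure latency, score instruction following and factuality separately) stays the same.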
INSIGHT

Good Answers Are User-Dependent

  • 'Good' answers vary by user expectations across tone, length, and conciseness.
  • Evaluation can and should be personalized to reflect those attribute preferences (see the sketch below).
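
A minimal sketch of attribute-based, per-user evaluation, under assumed structure rather than anything described in the episode: each hypothetical `UserPreferences` profile carries the response attributes that user cares about (length, tone), and the same answer is scored against different profiles.

```python
from dataclasses import dataclass

@dataclass
class UserPreferences:
    max_words: int      # preferred response length
    formal_tone: bool   # whether a formal register is expected

def score_response(response: str, prefs: UserPreferences) -> dict:
    words = response.split()
    informal_markers = {"hey", "gonna", "stuff", "kinda"}
    is_formal = not any(w.lower().strip(",.!") in informal_markers for w in words)
    return {
        "within_length": len(words) <= prefs.max_words,
        "tone_matches": is_formal == prefs.formal_tone,
    }

# The same answer can pass for one user and fail for another.
answer = "Hey, MLOps is basically the stuff you do to keep ML models running in production."
print(score_response(answer, UserPreferences(max_words=40, formal_tone=False)))  # passes both checks
print(score_response(answer, UserPreferences(max_words=40, formal_tone=True)))   # fails on tone
```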