Measuring Bias, Toxicity, and Truthfulness in LLMs With Python
Jan 19, 2024
Jodie Burchell, developer advocate for data science at JetBrains, discusses techniques and tools for evaluating large language models (LLMs) using Python. They explore measuring bias, toxicity, and truthfulness in LLMs, the challenges and limitations of AI language models, and the role of Python packages from Hugging Face. Jodie also shares benchmarking datasets and resources available on Hugging Face for evaluating LLMs.
Toxicity in large language models can be measured using the Evaluate package from Hugging Face, which utilizes a smaller ML model as a hate speech classifier.
Bias in large language models can be assessed by prompting the models with gender, race, and profession-related sentences and using sentiment analysis measures to determine the emotional sentiment towards different groups.
Hallucination rates in large language models can be measured by comparing the models' completed prompts against a set of correct answers using the TruthfulQA dataset.
Deep dives
Measuring Toxicity in Large Language Models
To measure the toxicity of large language models, researchers use the Evaluate package from Hugging Face. They pass the completed prompts through a smaller machine learning model that acts as a hate speech classifier, giving each completion a probability score indicating the likelihood that it contains hate speech.
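A minimal sketch of what this can look like with the evaluate package. The completions here are made-up examples, and the 0.5 threshold is an illustrative assumption; the measurement's default hate speech classifier is whatever checkpoint the toxicity module ships with.

```python
# Score a batch of model completions for toxicity with Hugging Face's
# evaluate package (pip install evaluate transformers torch).
import evaluate

# Loads the toxicity measurement, which wraps a small hate speech classifier.
toxicity = evaluate.load("toxicity")

completions = [
    "I hope you have a wonderful day.",
    "Everyone from that city is awful.",
]

# One probability-like score per completion.
scores = toxicity.compute(predictions=completions)["toxicity"]
for text, score in zip(completions, scores):
    print(f"{score:.3f}  {text}")

# Share of completions above an (assumed) 0.5 threshold.
flagged = sum(score > 0.5 for score in scores) / len(scores)
print(f"Flagged as likely hate speech: {flagged:.0%}")
```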
Assessing Bias in Large Language Models
To assess bias in large language models, researchers utilize datasets like WinoBias and BOLD. They prompt the model to complete sentences related to gender, race, and profession, and then use sentiment analysis measures to determine the emotional sentiment expressed towards different groups, providing ratings for the degree of positive or negative sentiment.
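A rough sketch of that sentiment comparison. The prompts below are stand-ins for WinoBias/BOLD-style prompts, and the generation model (gpt2) and default sentiment checkpoint are assumptions for illustration rather than the exact setup discussed in the episode.

```python
# Compare average sentiment of completions across two groups of prompts.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")  # default English sentiment model

prompts_by_group = {
    "group_a": ["The nurse said that she", "The nurse walked in and"],
    "group_b": ["The engineer said that he", "The engineer walked in and"],
}

for group, prompts in prompts_by_group.items():
    completions = [
        generator(p, max_new_tokens=20)[0]["generated_text"] for p in prompts
    ]
    results = sentiment(completions)
    # Signed score: positive sentiment counts up, negative counts down.
    avg = sum(
        r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results
    ) / len(results)
    print(f"{group}: average sentiment {avg:+.2f}")
```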
Measuring Hallucination Rates
Hallucination rates in large language models can be measured using the TruthfulQA dataset. By comparing the models' completed prompts against a set of correct answers, researchers can determine how often the model provides incorrect information or repeats internalized lies, misconceptions, or conspiracy theories.
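A simplified sketch of that comparison, assuming the "truthful_qa" dataset on the Hugging Face Hub (generation config) and a naive string-containment check; a real evaluation would use a stronger scoring method, such as the multiple-choice variant or a trained judge model.

```python
# Estimate a crude hallucination rate against TruthfulQA reference answers.
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("truthful_qa", "generation", split="validation")
generator = pipeline("text-generation", model="gpt2")  # illustrative model choice

sample = dataset.select(range(10))  # small sample, just for illustration
hallucinations = 0
for row in sample:
    answer = generator(row["question"], max_new_tokens=30)[0]["generated_text"]
    # Count as correct only if some reference answer appears in the completion.
    if not any(ref.lower() in answer.lower() for ref in row["correct_answers"]):
        hallucinations += 1

print(f"Hallucination rate: {hallucinations / len(sample):.0%}")
```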
Introduction to Hugging Face and its Open Source Arm
Hugging Face is an organization that provides access to generative AI models, large language models, and associated datasets. Their open-source arm aims to make these resources easily accessible and provide tools for model deployment and inference. They have built a range of Python packages to simplify the use of these models, such as the transformers package for working with large language models. The organization also focuses on making these models and datasets more transparent, including benchmarking for bias assessment.
Using Transformers and Langchain to Test Models and Evaluate Outputs
To test models, you can use the transformers package, which lets you prompt open-source causal language models and generate text from them. For proprietary models like ChatGPT, LangChain comes into play: it offers the capability to chain tools together and make sequential decisions using large language models, and it also supports retrieval-augmented generation (RAG). To evaluate the quality of model outputs, the evaluate package from Hugging Face can be used. It assesses aspects like bias, toxicity, and hallucinations. The use of reinforcement learning from human feedback has improved model quality, particularly in reducing bias rates.
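A minimal sketch of prompting an open-source causal language model with transformers; the model choice and generation parameters here are illustrative assumptions.

```python
# Generate a completion from an open-source causal language model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator(
    "Large language models can be evaluated by",
    max_new_tokens=40,
    do_sample=True,
)
print(output[0]["generated_text"])
```

For a proprietary model, the equivalent step would go through LangChain's chat model wrappers (for example, `ChatOpenAI` with an OpenAI API key) before passing the completions to the same evaluation code.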
How can you measure the quality of a large language model? What tools can measure bias, toxicity, and truthfulness levels in a model using Python? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to discuss techniques and tools for evaluating LLMs with Python.
Jodie provides some background on large language models and how they can absorb vast amounts of information about the relationship between words using a type of neural network called a transformer. We discuss training datasets and the potential quality issues with crawling uncurated sources.
We dig into ways to measure levels of bias, toxicity, and hallucinations using Python. Jodie shares three benchmarking datasets and links to resources to get you started. We also discuss ways to augment models using agents or plugins, which can access search engine results or other authoritative sources.
Course Spotlight
In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.
Topics:
00:00:00 – Introduction
00:02:19 – Testing characteristics of LLMs with Python
00:04:18 – Background on LLMs
00:08:35 – Training of models
00:14:23 – Uncurated sources of training
00:16:12 – Safeguards and prompt engineering
00:21:19 – TruthfulQA and creating a stricter prompt
00:23:20 – Information that is out of date
00:26:07 – WinoBias for evaluating gender stereotypes
00:28:30 – BOLD dataset for evaluating bias
00:30:28 – Sponsor: Intel
00:31:18 – Using Hugging Face to start testing with Python
00:35:25 – Using the transformers package
00:37:34 – Using LangChain for proprietary models
00:43:04 – Putting the tools together and evaluating
00:47:19 – Video Course Spotlight
00:48:29 – Assessing toxicity
00:50:21 – Measuring bias
00:54:40 – Checking the hallucination rate
00:56:22 – LLM leaderboards
00:58:17 – What helped ChatGPT leap forward?
01:06:01 – Improvements in what is being crawled
01:07:32 – Revisiting agents and RAG
01:11:03 – ChatGPT plugins and Wolfram|Alpha
01:13:06 – How can people follow your work online?