Using Binary Vectorization to Train Machine Learning Models

This is called binary vectorization, and it's basically the stupidest, most naive approach you can take. But it does actually get you results if you've got pretty good separation between your documents. So one thing you can do to clean up is just include your N most common words in your collection of documents. OK? And I want to say it's a dumb approach, but sometimes dumb is fine, true.
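
As a rough illustration of the idea in this clip, here is a minimal pure-Python sketch of binary vectorization limited to the N most common words. The function name `binary_vectorize`, the `n_most_common` parameter, the whitespace tokenization, and the sample documents are all assumptions made for this example, not something taken from the episode.

```python
from collections import Counter


def binary_vectorize(docs, n_most_common=None):
    """Return (vocabulary, rows) where each row marks word presence with 0/1."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Count each word across the whole collection of documents.
    counts = Counter(word for tokens in tokenized for word in tokens)

    # Optionally keep only the N most common words to limit the columns.
    if n_most_common is not None:
        kept = [word for word, _ in counts.most_common(n_most_common)]
    else:
        kept = list(counts)
    vocabulary = sorted(kept)

    # One row per document, one column per vocabulary word.
    rows = []
    for tokens in tokenized:
        present = set(tokens)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows


# Made-up sample documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked",
]
vocabulary, rows = binary_vectorize(docs, n_most_common=3)
print(vocabulary)
for row in rows:
    print(row)
```

Each row only records whether a word appears in a document, not how often, which is why the approach tends to work best when the documents are already well separated.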

Episode notes

How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.

Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.

We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.

Course Spotlight: Learn Text Classification With Python and Keras

In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.

Topics:

00:00:00 – Introduction
00:02:47 – Exploring the topic
00:06:00 – Perceived sentience of LaMDA
00:10:24 – How do we get started?
00:11:16 – What are classification and sentiment analysis?
00:13:03 – Transforming text into rows and columns
00:14:47 – Sponsor: Snyk
00:15:27 – Bag-of-words approach
00:19:12 – Stemming and lemmatization
00:22:05 – Capturing N-grams
00:25:34 – Count vectorization
00:27:14 – Stop words
00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
00:32:28 – Potential projects for bag-of-words techniques
00:34:07 – Video Course Spotlight
00:35:20 – WordNet and NLTK package
00:37:27 – Word embeddings and word2vec
00:45:30 – Previous training and too many dimensions
00:50:07 – How to use word2vec and Gensim?
00:51:26 – What types of projects for word2vec and Gensim?
00:54:41 – Getting into GPT and BERT in another episode
00:56:11 – How to follow Jodie’s work?
00:57:36 – Thanks and goodbye

Show Links:

Why Google’s “sentient” AI LaMDA is nothing like a person.
On NYT Magazine on AI: Resist the Urge to be Impressed | Emily M. Bender | Medium
ELIZA - Wikipedia
eliza.py - Python 2 version by Daniel Connelly
dabraude/Pyliza: Python3 Implementation of Eliza
magneticpoetry.com
Natural Language Processing With Python’s NLTK Package – Real Python
Practical Text Classification With Python and Keras – Real Python
Sentiment Analysis: First Steps With Python’s NLTK Library – Real Python
NLTK: Natural Language Toolkit
spaCy · Industrial-strength Natural Language Processing in Python
Natural Language Processing With spaCy in Python - Real Python
Stemming - Wikipedia
Lemmatization - Wikipedia
Binary/Count Vectorization: sklearn.feature_extraction.text.CountVectorizer — scikit-learn
TFIDF: sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn
Porter Stemmer: nltk.stem.porter module — NLTK
Snowball Stemmer: nltk.stem.snowball module — NLTK
WordNet Lemmatizer: nltk.stem.wordnet module — NLTK
Lemmatizer · spaCy API Documentation
Applying Bag of Words and Word2Vec models on Reuters-21578 Dataset | Elvin Ouyang’s Blog
UCI Machine Learning Repository: Reuters-21578 Text Categorization Collection Data Set
The Illustrated Word2vec – Jay Alammar
A Complete Guide to Using WordNET in NLP Applications
Gensim: Topic modeling for humans
Core Tutorials — gensim
Find Open Datasets and Machine Learning Projects | Kaggle
Engineering All Hands: Vectorise all the things! - YouTube
PyCon Portugal 2022
NDC Oslo 2022 | Conference for Software Developers
Jodie Burchell’s Blog - Standard error
Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) / Twitter
JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:

Data Cleaning With pandas and NumPy
Reading and Writing Files With pandas
Learn Text Classification With Python and Keras

Support the podcast & join our community of Pythonistas
