

Machine Learning Guide
OCDevel
Machine learning audio course, teaching the fundamentals of machine learning and artificial intelligence. It covers intuition, models (shallow and deep), math, languages, frameworks, etc. Where your other ML resources provide the trees, I provide the forest. Consider MLG your syllabus, with highly-curated resources for each episode's details at ocdevel.com. Audio is a great supplement during exercise, commute, chores, etc.
Episodes

Nov 8, 2020 • 35min
MLA 011 Practical Clustering Tools
Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.

Links: Notes and resources at ocdevel.com/mlg/mla-11. Try a walking desk: stay healthy & sharp while you learn & code.

K-means Clustering
K-means is the most widely used clustering algorithm and is typically the first method to try for general clustering tasks. The scikit-learn KMeans implementation is suitable for small to medium-sized datasets, while Faiss's kmeans is more efficient and accurate for very large datasets. K-means requires the number of clusters to be specified in advance and relies on the Euclidean distance metric, which performs poorly in high-dimensional spaces. When document embeddings have high dimensionality (e.g., 768 dimensions from sentence transformers), K-means becomes less effective due to the limitations of Euclidean distance in such spaces.

Alternatives to K-means for High Dimensions
For text embeddings with high dimensionality, agglomerative (hierarchical) clustering methods are preferable, particularly because they allow the use of different similarity metrics. Agglomerative clustering in scikit-learn accepts a pre-computed cosine similarity matrix, which is more appropriate for natural language processing. Constructing the pre-computed distance (or similarity) matrix involves normalizing vectors and computing dot products, which can be efficiently achieved with linear algebra libraries like PyTorch. Hierarchical algorithms do not use inertia in the same way as K-means and instead rely on external metrics, such as silhouette score. Other clustering algorithms exist, including spectral, mean shift, and affinity propagation, which are not covered in this episode.

Semantic Search and Vector Indexing
Libraries such as Faiss, Annoy, and HNSWlib provide approximate nearest neighbor search for efficient semantic search on large-scale vector data. These systems create an index of your embeddings to enable rapid similarity search, often with the ability to specify cosine similarity as the metric. Sample code using these libraries with sentence transformers can be found in the UKP Lab sentence-transformers examples directory.

Determining the Optimal Number of Clusters
Both K-means and agglomerative clustering require a predefined number of clusters, but this is often unknown beforehand. The "elbow" method involves running the clustering algorithm with varying cluster counts and plotting the inertia (sum of squared distances within clusters) to visually identify the point of diminishing returns; see kmeans.inertia_. The kneed package can automatically detect the "elbow" or "knee" in the inertia plot, eliminating subjective human judgment; sample code available here. The silhouette score, calculated via silhouette_score, considers both inter- and intra-cluster distances and allows for direct selection of the number of clusters with the maximum score. The silhouette score can be computed using a pre-computed distance matrix (such as from cosine similarities), making it well-suited for applications involving non-Euclidean metrics and hierarchical clustering.
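As a rough illustration of the K-means plus silhouette-score workflow above, here is a minimal sketch assuming scikit-learn and NumPy are installed; the random data stands in for a real low-dimensional feature matrix, and the range of candidate cluster counts is arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: 500 samples with 10 low-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

best_k, best_score = None, -1.0
for k in range(2, 11):  # silhouette score requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher is better, range [-1, 1]
    if score > best_score:
        best_k, best_score = k, score

print(f"best k by silhouette: {best_k} (score={best_score:.3f})")

The same loop could instead record kmeans.inertia_ per k and feed those values to the kneed package; with the silhouette approach you simply keep the k that maximizes the score.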
Density-Based Clustering: DBSCAN and HDBSCAN
DBSCAN is a density-based clustering method that does not require specifying the number of clusters, instead discovering clusters based on data density. HDBSCAN is a more popular and versatile implementation of density-based clustering, capable of handling various types of data without significant parameter tuning. DBSCAN and HDBSCAN can be preferable to K-means or agglomerative clustering when automatic determination of cluster count or robustness to noise is important. However, these algorithms may not perform well with all types of high-dimensional embedding data, as illustrated by the challenges faced when clustering 768-dimensional text embeddings.

Summary Recommendations and Links
For low- to medium-sized, low-dimensional data, use K-means with silhouette score to choose the optimal number of clusters: scikit-learn KMeans, silhouette_score. For very large data or vector search, use Faiss.kmeans. For high-dimensional data using cosine similarity, use Agglomerative Clustering with a pre-computed square matrix of cosine similarities; sample code. For density-based clustering, consider DBSCAN or HDBSCAN. Exploratory code and further examples can be found in the UKP Lab sentence-transformers examples.
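To make the agglomerative-clustering recommendation concrete, here is a hedged sketch of clustering high-dimensional embeddings with a pre-computed cosine distance matrix. The random embeddings are placeholders for real sentence-transformer vectors, and older scikit-learn releases name the metric="precomputed" argument affinity instead.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical stand-ins for high-dimensional sentence embeddings (e.g., 768-dim).
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))

# Cosine distance matrix: normalize rows, then 1 - dot product.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
dist = 1.0 - norm @ norm.T
np.fill_diagonal(dist, 0.0)          # guard against tiny negative values on the diagonal
dist = np.clip(dist, 0.0, 2.0)

# On recent scikit-learn the keyword is `metric`; older releases call it `affinity`.
model = AgglomerativeClustering(n_clusters=5, metric="precomputed", linkage="average")
labels = model.fit_predict(dist)

# Silhouette score also accepts the pre-computed distance matrix.
print(silhouette_score(dist, labels, metric="precomputed"))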

Oct 28, 2020 • 26min
MLA 010 NLP packages: transformers, spaCy, Gensim, NLTK
The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, spaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.

Links: Notes and resources at ocdevel.com/mlg/mla-10. Try a walking desk: stay healthy & sharp while you learn & code.

Historical Foundation: NLTK
NLTK ("Natural Language Toolkit") was one of the earliest and most popular Python libraries for natural language processing, covering tasks from tokenization and stemming to document classification and syntax parsing. NLTK remains a catch-all "Swiss Army knife" for NLP, but many of its functions have been supplemented or superseded by newer tools tailored to specific tasks.

Specialized Topic Modeling and Phrase Analysis: Gensim
Gensim emerged as the leading library for topic modeling in Python, most notably via its LDA Topic Modeling implementation, which groups documents according to topic distributions. Topic modeling workflows often use NLTK for initial preprocessing (tokenization, stop word removal, lemmatization), then vectorize with scikit-learn's TF-IDF, and finally model topics with Gensim's LDA. Gensim also provides effective Bigrams/Trigrams support, allowing the detection and combination of commonly used word pairs or triplets (n-grams) to enhance analysis accuracy.

Linguistic Structure and Manipulation: spaCy and Related Tools
spaCy is a deep-learning-based library for high-performance linguistic analysis, focusing on tasks such as part-of-speech tagging, named entity recognition, and syntactic parsing. spaCy supports integrated sentence and word tokenization, stop word removal, and lemmatization; for advanced lemmatization and inflection, LemmInflect can be used to derive proper inflections for part-of-speech tags. For even more accurate (but slower) linguistic analysis, consider the Stanford NLP pipeline (Stanza, the successor to CoreNLP) via the spacy-stanza integration. spaCy can examine parse trees to identify sentence components, enabling sophisticated NLP applications like grammatical corrections and intent detection in conversational agents.

High-Level NLP Tasks: Hugging Face Transformers
huggingface/transformers provides interfaces to transformer-based models (like BERT and its successors) capable of advanced NLP tasks including question answering, summarization, translation, and sentiment analysis. Its Pipelines allow users to accomplish over ten major NLP applications with minimal code. The library's model repository hosts a vast collection of pre-trained models that can be used for both research and production.

Semantic Search and Clustering: Sentence Transformers
UKPLab/sentence-transformers extends the transformer approach to create dense document embeddings, enabling semantic search, clustering, and similarity comparison via cosine distance or similar metrics. Example applications include finding the most similar documents, clustering user entries, or summarizing clusters of text. The repository offers application examples for tasks such as semantic search and clustering, often using cosine similarity.
For very large-scale semantic search (such as across Wikipedia), approximate nearest neighbor (ANN) libraries like Annoy, FAISS, and hnswlib enable rapid similarity search with embeddings; practical examples are provided in the Sentence Transformers documentation.

Additional Resources and Library Landscape
For a comparative overview and discovery of further libraries, see Analytics Steps' "Top 10 NLP Libraries in Python", which reviews several packages beyond those discussed here.

Summary of Library Roles and Use Cases
NLTK: Foundational and comprehensive for most classic NLP needs; still covers a broad range of preprocessing and basic analytic tasks.
Gensim: Best for topic modeling and phrase extraction (bigrams/trigrams); especially useful in workflows relying on document grouping and label generation.
spaCy: Leading tool for syntactic, linguistic, and grammatical analysis; supports integration with advanced lemmatizers and external tools like Stanford CoreNLP.
Hugging Face Transformers: The standard for modern, high-level NLP tasks and quick prototyping, featuring simple pipelines and an extensive model hub.
Sentence Transformers: The main approach for embedding text for semantic search, clustering, and large-scale document comparison, supporting ANN methodologies via companion libraries.
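A minimal sketch of the two usage styles described above: a high-level Transformers pipeline versus Sentence Transformers embeddings compared by cosine similarity. It assumes the transformers and sentence-transformers packages are installed; the example sentences are invented, and "all-MiniLM-L6-v2" is just one commonly used embedding model, not a requirement.

from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# High-level task via a Transformers pipeline (downloads a default model on first run).
sentiment = pipeline("sentiment-analysis")
print(sentiment("This library makes NLP prototyping painless."))

# Dense embeddings for semantic comparison with Sentence Transformers.
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "How do I cluster text documents?",
    "Ways to group similar articles together",
    "Best pasta recipes for beginners",
]
embeddings = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarity matrix
print(scores)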

Nov 6, 2018 • 25min
MLA 009 Charting and Visualization Tools for Data Science
Overview of Python charting libraries - Matplotlib, Seaborn, and Bokeh - explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifying where D3.js fits as a JavaScript alternative for end-user applications. The episode also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.

Links: Notes and resources at ocdevel.com/mlg/mla-9. Try a walking desk: stay healthy & sharp while you learn & code.

Core Phases in Data Science Visualization
Exploratory Data Analysis (EDA): EDA occupies an early stage in the Business Intelligence (BI) pipeline, positioned just before or sometimes merged with the data cleaning ("munging") phase. The outputs of EDA (e.g., correlation matrices, histograms) often serve as inputs to subsequent machine learning steps.

Python Visualization Libraries
1. Matplotlib: The foundational plotting library in Python, supporting static, basic chart types. Requires substantial boilerplate code for custom visualizations. Serves as the core engine for many higher-level visualization tools. Common EDA tasks (like plotting via .corr(), .hist(), and .scatter() methods on pandas DataFrames) depend on Matplotlib under the hood.
2. Pandas Plotting: Pandas integrates tightly with Matplotlib and exposes simple, one-line commands for common plots (e.g., df.corr(), df.hist()). Designed to make quick EDA accessible without requiring detailed knowledge of Matplotlib's verbose syntax.
3. Seaborn: A high-level wrapper around Matplotlib, analogous to how Keras wraps TensorFlow. Sets sensible defaults for chart styles, fonts, colors, and sizes, improving aesthetics with minimal effort. Importing Seaborn can globally enhance the appearance of all Matplotlib plots, even without direct usage of Seaborn's plotting functions.
4. Bokeh: A powerful library for creating interactive, web-ready plots from Python. Enables user interactions such as hovering, zooming, and panning within rendered plots. Exports visualizations as standalone HTML files or can operate as a server-linked app for live data exploration. Supports advanced features like cross-filtering, allowing dynamic slicing and dicing of data across multiple axes or columns. More suited for creating reusable, interactive dashboards than quick, one-off EDA visuals. (A short code sketch covering these Python libraries follows the Key Takeaways below.)
5. D3.js: Unlike the previous libraries, D3.js is a JavaScript framework for creating complex, highly customized data visualizations for web and mobile apps. Used predominantly on the client side to build interactive front-end graphics for end users, not as an EDA tool for analysts. Common in production-grade web apps, but not typically part of a Python-based data science workflow.

Dedicated Visualization and BI Software
Tableau: Leading commercial drag-and-drop BI tool for data visualization and dashboarding. Connects to diverse data sources (CSV, Excel, databases), auto-detects column types, and suggests default chart types. Users can interactively build visualizations, cross-filter data, and switch chart types without coding.
Power BI: Microsoft's BI suite, similar to Tableau, supporting end-to-end data analysis and visualization. Integrates data preparation, visualization, and increasingly, built-in machine learning workflows. Focused on empowering business users or analysts to run the BI pipeline without programming.
QlikView: Another major BI offering, emphasizing interactive dashboards and data exploration.
Excel: Still widely used for basic EDA and visualizations directly on spreadsheets. Offers limited but accessible charting tools for histograms, scatter plots, and simple summary statistics. Data often originates from Excel/CSV files before being ingested for further analysis in Python/pandas.

Trends & Insights
Workflow Integration: Modern BI tools are converging, adding both classic EDA capabilities and basic machine learning modeling, often through a code-free interface.
Automation Risks and Opportunities: As drag-and-drop BI tools increase in capability (including model training and selection), some data science coding work traditionally required for BI pipelines may become accessible to non-programmers.
Distinctions in Use: Python libraries (Matplotlib, Seaborn, Bokeh) excel in automating and scripting EDA, report generation, and static analysis as part of data pipelines. BI software (Tableau, Power BI, QlikView) shines for interactive exploration and democratized analytics, integrated from ingestion to reporting. D3.js stands out for tailored, production-level, end-user app visualizations, rarely leveraged by data scientists for EDA.

Key Takeaways
For quick, code-based EDA: use Pandas' built-in plotters (wrapping Matplotlib).
For pre-styled, pretty plots: use Seaborn (with or without direct API calls).
For interactive, shareable dashboards: use Bokeh for Python, or BI tools for no-code operation.
For enterprise, end-user-facing dashboards: choose BI software like Tableau, or build custom apps using D3.js for total control.
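The code sketch referenced above: a hedged example of the Python side of this tooling, assuming pandas, Matplotlib, Seaborn, and Bokeh are installed. The DataFrame columns are invented for illustration, and Bokeh writes a standalone HTML file you can open or share.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, output_file, show

# Hypothetical dataset for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 400, 300),
    "bedrooms": rng.integers(1, 6, 300),
    "price": rng.normal(300_000, 75_000, 300),
})

# Quick EDA with pandas (Matplotlib under the hood).
df.hist(figsize=(8, 4))
plt.tight_layout()
plt.show()

# Seaborn: nicer defaults, e.g., a correlation heatmap.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()

# Bokeh: an interactive scatter plot exported as standalone HTML.
output_file("price_vs_sqft.html")
p = figure(title="Price vs. square footage", tools="pan,wheel_zoom,hover,reset")
p.scatter(df["sqft"], df["price"], size=6, alpha=0.5)
show(p)  # opens the HTML file in a browser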

Oct 26, 2018 • 25min
MLA 008 Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) sits at the critical pre-modeling stage of the data science pipeline, focusing on uncovering missing values, detecting outliers, and understanding feature distributions through both statistical summaries and visualizations, such as Pandas' info(), describe(), histograms, and box plots. Visualization tools like Matplotlib, along with processes including imputation and feature correlation analysis, allow practitioners to decide how best to prepare, clean, or transform data before it enters a machine learning model.

Links: Notes and resources at ocdevel.com/mlg/mla-8. Try a walking desk: stay healthy & sharp while you learn & code.

EDA in the Data Science Pipeline
Position in Pipeline: EDA is an essential pre-processing step in the business intelligence (BI) or data science pipeline, occurring after data acquisition but before model training.
Purpose: The goal of EDA is to understand the data by identifying missing values (nulls), outliers, feature distributions, and relationships or correlations between variables.

Data Acquisition and Initial Inspection
Data Sources: Data may arrive from various streams (e.g., Twitter, sensors) and is typically stored in structured formats such as databases or spreadsheets.
Loading Data: In Python, data is often loaded into a Pandas DataFrame using commands like pd.read_csv('filename.csv').
Initial Review: df.info() displays data types and counts of non-null entries by column, quickly highlighting missing values. df.describe() provides summary statistics for each column, including count, mean, standard deviation, min/max, and quartiles.

Handling Missing Data and Outliers
Imputation: Missing values must often be filled (imputed), as most machine learning algorithms cannot handle nulls. Common strategies: impute with mean, median, or another context-appropriate value. For example, missing ages can be filled with the column's average rather than zero, to avoid introducing skew.
Outlier Strategy: Outliers can be removed, replaced (e.g., by nulls and subsequently imputed), or left as-is if legitimate. Treatment depends on whether outliers represent true data points or data errors.

Visualization Techniques
Purpose: Visualizations help reveal data distributions, outliers, and relationships that may not be apparent from raw statistics.
Common Visualization Tools: Matplotlib is the primary Python library for static data visualizations.
Visualization Methods: Histogram: ideal for visualizing the distribution of a single variable (e.g., age), making outliers visible as isolated bars. Box plot: summarizes quartiles, median, and range, with 'whiskers' showing min/max; useful for spotting outliers and understanding data spread. Line chart: used for time-series data, highlighting trends and anomalies (e.g., sudden spikes in stock price). Correlation matrix: a visual grid (often of scatterplots) comparing each feature against every other, helping to detect strong or weak linear relationships between features.

Feature Correlation and Dimensionality
Correlation Plot: Generated with df.corr() in Pandas to assess linear relationships between features. High correlation between features may suggest redundancy (e.g., number of bedrooms and square footage) and inform feature selection or removal.
Limitations: While correlation plots provide intuition, automated approaches like Principal Component Analysis (PCA) or autoencoders are typically superior for feature reduction and target prediction tasks.
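A minimal sketch of these inspection and imputation steps with pandas and Matplotlib; the housing.csv file and the age column are hypothetical stand-ins for your own dataset.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and columns; substitute your own dataset.
df = pd.read_csv("housing.csv")

df.info()             # dtypes and non-null counts per column (spot missing values)
print(df.describe())  # count, mean, std, min/max, quartiles

# Impute a missing numeric column with its mean rather than zero.
df["age"] = df["age"].fillna(df["age"].mean())

# Distribution and outlier checks.
df["age"].hist(bins=30)
plt.show()
df.boxplot(column="age")
plt.show()

# Linear relationships between numeric features (drop text columns first if needed).
print(df.corr())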
Data Transformation Prior to Modeling
Scaling: Machine learning models, especially neural networks, often require input features to be scaled (normalized or standardized). StandardScaler (from scikit-learn) standardizes features to zero mean and unit variance, but is sensitive to outliers. RobustScaler centers on the median and scales by the interquartile range, compressing the influence of outliers and simplifying preprocessing.

Summary of EDA Workflow
Initial Steps: Load data into a DataFrame, examine data types and missing values with df.info(), and review summary statistics with df.describe().
Visualization: Use histograms and box plots to explore feature distributions and detect anomalies. Leverage correlation matrices to identify related features.
Data Preparation: Impute missing values thoughtfully (e.g., with means or medians). Decide on treatment for outliers: removal, imputation, or scaling with tools like RobustScaler.
Outcome: Proper EDA ensures that data is cleaned, features are well understood, and inputs are suitable for effective machine learning model training.
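To illustrate the scaling point, a small sketch contrasting StandardScaler and RobustScaler from scikit-learn on data with one extreme outlier.

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature with an extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)  # centers on the median, scales by the IQR

print(standard.ravel())  # the outlier dominates the mean/std, squashing the rest
print(robust.ravel())    # the bulk of the data keeps a usable spread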

Oct 16, 2018 • 17min
MLA 007 Jupyter Notebooks
Jupyter Notebooks, originally conceived as IPython Notebooks, enable data scientists to combine code, documentation, and visual outputs in an interactive, browser-based environment supporting multiple languages like Python, Julia, and R. This episode details how Jupyter Notebooks structure workflows into executable cells - mixing markdown explanations and inline charts - which is essential for documenting, demonstrating, and sharing data analysis and machine learning pipelines step by step.

Links: Notes and resources at ocdevel.com/mlg/mla-7. Try a walking desk: stay healthy & sharp while you learn & code.

Overview of Jupyter Notebooks
Historical Context and Scope: Jupyter Notebooks began as IPython Notebooks focused solely on Python. The project was renamed Jupyter to support additional languages - namely Julia ("JU"), Python ("PY"), and R ("R") - broadening its applicability for data science and machine learning across multiple languages.
Interactive, Narrative-Driven Coding: Jupyter Notebooks allow for the mixing of executable code, markdown documentation, and rich media outputs within a browser-based interface. The coding environment is structured as a sequence of cells where each cell can independently run code and display its output directly underneath. Unlike traditional Python scripts, which output results linearly and impermanently, Jupyter Notebooks preserve the stepwise development process and its outputs for later review or publication.

Typical Workflow Example
Stepwise Data Science Pipeline Construction (see the code sketch at the end of these notes):
Import necessary libraries: each new notebook usually starts with a cell for imports (e.g., matplotlib, scikit-learn, keras, pandas).
Data ingestion: read data into a pandas DataFrame via read_csv for CSVs or read_sql for databases.
Exploratory analysis: use DataFrame methods like .info() and .describe() to inspect the dataset; results are rendered below the respective cell.
Model development: train a machine learning model - for example using Keras - and output performance metrics such as loss, mean squared error, or classification accuracy directly beneath the executed cell.
Data visualization: leverage charting libraries like matplotlib to produce inline plots (e.g., histograms, correlation matrices), which remain visible as part of the notebook for later reference.

Publishing and Documentation Features
Markdown Support and Storytelling: Markdown cells enable the inclusion of formatted explanations, section headings, bullet points, and even inline images and videos, allowing for clear documentation and instructional content interleaved with code. This format makes it simple to delineate different phases of a pipeline (e.g., "Data Ingestion", "Data Cleaning", "Model Evaluation") with descriptive context.
Inline Visual Outputs: Outputs from code cells, such as tables, charts, and model training logs, are preserved within the notebook interface, making it easy to communicate findings and reasoning steps alongside the code. Visualization libraries (like matplotlib) can render charts directly in the notebook without the need to generate separate files.
Reproducibility and Sharing: Notebooks can be published to platforms like GitHub, where the full code, markdown, and most recent cell outputs are viewable in-browser. This enables transparent workflow documentation and facilitates tutorials, blog posts, and collaborative analysis.
Practical Considerations and Limitations
Cell-based Execution Flexibility: Each cell can be run independently, so developers can repeatedly rerun specific steps (e.g., re-trying a modeling cell after code fixes) without needing to rerun the entire notebook. This is especially useful for iterative experimentation with large or slow-to-load datasets.
Primary Use Cases: Jupyter Notebooks excel at "storytelling" - presenting an analytical or modeling process along with its rationale and findings, primarily for publication or demonstration. For regular development, many practitioners prefer traditional editors or IDEs (like PyCharm or Vim) due to advanced features such as debugging, code navigation, and project organization.

Summary
Jupyter Notebooks serve as a central tool for documenting, presenting, and sharing the entirety of a machine learning or data analysis pipeline - combining code, output, narrative, and visualizations into a single, comprehensible document ideally suited for tutorials, reports, and reproducible workflows.
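The code sketch referenced above: what the cell-by-cell flow might look like, written here as a plain script with comments marking the cells. The dataset and columns are hypothetical, and a scikit-learn LinearRegression stands in for the Keras model mentioned in the notes to keep the example short.

# --- Cell 1: imports (in a notebook, "%matplotlib inline" renders plots beneath cells)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# --- Cell 2: data ingestion (hypothetical CSV)
df = pd.read_csv("housing.csv")

# --- Cell 3: exploratory analysis; output appears directly under the cell
df.info()
print(df.describe())

# --- Cell 4: a simple model in place of a Keras network
X, y = df[["sqft", "bedrooms"]], df["price"]
model = LinearRegression().fit(X, y)
print("R^2:", model.score(X, y))

# --- Cell 5: inline visualization preserved in the published notebook
df["price"].hist(bins=30)
plt.show()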

Jul 19, 2018 • 20min
MLA 006 Salaries for Data Science & Machine Learning
O'Reilly's 2017 Data Science Salary Survey finds that location is the most significant salary determinant for data professionals, with median salaries ranging from $134,000 in California to under $30,000 in Eastern Europe, and highlights that negotiation skills can lead to salary differences as high as $45,000. Other key factors impacting earnings include company age and size, job title, industry, and education, while popular tools and languages—such as Python, SQL, and Spark—do not strongly influence salary despite widespread use.

Links: Notes and resources at ocdevel.com/mlg/mla-6. Try a walking desk: stay healthy & sharp while you learn & code.

Global and Regional Salary Differences
Median Global Salary: $90,000 USD, up from $85,000 the previous year.
Regional Breakdown: United States: $112,000 median; California leads at $134,000. Western Europe: $57,000—about half the US median. Australia & New Zealand: second after the US. Eastern Europe: below $30,000. Asia: wide interquartile salary range, indicating high variability.

Demographic and Personal Factors
Gender: Women's median salaries are $8,000 lower than men's. Women make up 20% of respondents but are increasing in number.
Age & Experience: Higher age/experience correlates with higher salaries, but the proportion of older professionals declines.
Education: Nearly all respondents have at least a master's; PhD holders earn only about $5,000 more than those with a master's.
Negotiation Skills: Self-reported strong salary negotiation skills are linked to $45,000 higher median salaries (from $70,000 for the lowest to $115,000 for the highest bargaining skill).

Industry, Company, and Role
Industry Impact: Highest salaries are found in search/social networking and media/entertainment; education and non-profit offer the lowest pay.
Company Age & Size: Companies aged 2-5 years offer higher than average pay; those less than 2 years old offer much lower salaries (~$40,000). Large organizations generally pay more.
Job Title: "Data scientist" and "data analyst" titles carry higher medians than "engineer" titles by around $7,000. Executive titles (CTO, VP, Director) see the highest pay, with CTOs at a $150,000 median.

Tools, Languages, and Technologies
Operating Systems: Windows: 67% usage, but declining. Linux: 55%; Unix: 18%; macOS: 46%; Unix-based systems are rising in use.
Programming Languages: SQL: 64% (most used for database querying). Python: 63% (most popular procedural language). R: 54%. Others (Java, Scala, C/C++, C#): each less than 20%. Salary differences across languages are minor; C/C++ users earn more, but not enough to outweigh the difficulty.
Databases: MySQL (37%), MS SQL Server (30%), PostgreSQL (28%). Popularity of the database has little impact on pay.
Big Data and Search Tools: Spark is the most popular big data platform, especially for large-scale data processing. Elasticsearch is the most common search engine, but Solr pays more.
Machine Learning Libraries: Scikit-learn (37%) and Spark MLlib (16%) are most used.
Visualization Tools: R's ggplot2 and Python's matplotlib are leading choices.

Key Salary Differentiators (per Machine Learning Analysis)
Top Predictors (explaining ~60% of salary variance): world/US region, experience, gender, company size, education (amounting to only ~$5,000 difference), job title, and industry.
Lesser Impact: Specific tools, languages, and databases do not meaningfully affect salary.
Summary Takeaways
The greatest leverage for a higher salary comes from geography and individual negotiation capability, with up to $45,000 differences possible. Role/title selection, industry, company age, and size are also significant, while mastering the most commonly used tools is essential but does not strongly differentiate pay. For aspiring data professionals: focus on developing negotiation skills and, where possible, optimize for location and title to maximize earning potential.

Jun 9, 2018 • 27min
MLA 005 Shapes and Sizes: Tensors and NDArrays
Explains the fundamental differences between tensor dimensions, size, and shape, clarifying frequent misconceptions—such as the distinction between the number of features ("columns") and true data dimensions—while also demystifying reshaping operations like expand_dims, squeeze, and transpose in NumPy. Through practical examples from images and natural language processing, listeners learn how to manipulate tensors to match model requirements, including scenarios like adding dummy dimensions for grayscale images or reordering axes for sequence data.

Links: Notes and resources at ocdevel.com/mlg/mla-5. Try a walking desk: stay healthy & sharp while you learn & code.

Definitions
Tensor: A general term for an array of any number of dimensions. A 0D tensor (scalar) is a single number (e.g., 5); a 1D tensor (vector) is a simple list of numbers; a 2D tensor (matrix) is a grid of numbers (rows and columns); 3D and higher tensors are higher-dimensional arrays, such as images or batches of images.
NDArray (NumPy): Stands for "N-dimensional array," the foundational array type in NumPy, synonymous with "tensor."

Tensor Properties
Dimensions: The number of nested levels in the array (e.g., a matrix has two dimensions: rows and columns). Accessed in NumPy via the .ndim property (e.g., array.ndim).
Size: The total number of elements in the tensor. Examples: scalar: size = 1; vector: size equals the number of elements (e.g., 5 for [1, 2, 3, 4, 5]); matrix: size = rows × columns (e.g., 10×10 = 100). Accessed in NumPy via the .size property.
Shape: A tuple listing the number of elements per dimension. Example: an image with 256×256 pixels and 3 color channels has shape = (256, 256, 3).

Common Scenarios & Examples
Data Structures in Practice: CSV/spreadsheet example: a dataset with 1 million housing examples and 50 features has shape (1_000_000, 50) and size 50,000,000. Image example (RGB): a 256×256 pixel image has shape (256, 256, 3) and 3 dimensions (width, height, channels). Batching for models: for a convolutional neural network, the shape might become (batch_size, width, height, channels), e.g., (32, 256, 256, 3).
Conceptual Clarifications: The term "dimensions" in data science often refers to features (columns), but technically in tensors it means the number of structural axes. The "curse of dimensionality" often uses "dimensions" to refer to features, not tensor axes.

Reshaping and Manipulation in NumPy
Adding Dimensions: Useful when a model expects higher-dimensional input than currently available (e.g., converting a grayscale image from shape (256, 256) to (256, 256, 1)). Use np.expand_dims or array.reshape.
Removing Singleton Dimensions: Occurs when, for example, model output is (N, 1) and the singleton dimension should be removed to yield (N,). Use np.squeeze or array.reshape.
Wildcard with -1: In reshaping, -1 is a placeholder for NumPy to infer the correct size, useful when the batch size or another dimension is variable.
Flattening: Use np.ravel to turn a multi-dimensional tensor into a contiguous 1D array.
Axis Reordering: Transposing axes is needed when model input or output expects axes in a different order (e.g., sequence length and embedding dimensions in NLP). Use np.transpose for general axis permutations, or np.swapaxes to swap two specific axes (prefer transpose for clarity and flexibility).
Practical Example: In NLP sequence models, a 3D tensor with (batch_size, sequence_length, embedding_dim) might need to be reordered to (batch_size, embedding_dim, sequence_length) for certain models.
Achieved using array.transpose(0, 2, 1).

Core NumPy Functions for Manipulation
reshape: general function for changing the shape of a tensor, including adding or removing dimensions.
expand_dims: adds a new axis with size 1.
squeeze: removes axes with size 1.
ravel: flattens to 1D.
transpose: changes the order of axes.
swapaxes: swaps specified axes (less general than transpose).

Summary Table of Operations
Operation          | NumPy Function   | Purpose
Add dimension      | np.expand_dims   | Convert (256, 256) to (256, 256, 1)
Remove dimension   | np.squeeze       | Convert (N, 1) to (N,)
General reshape    | np.reshape       | Any change matching total size
Flatten            | np.ravel         | Convert (a, b) to (a*b,)
Swap axes          | np.swapaxes      | Exchange positions of two axes
Permute axes       | np.transpose     | Reorder any sequence of axes

Closing Notes
A deep understanding of tensor structure - dimensions, size, and shape - is vital for preparing data for machine learning models. Reshaping, expanding, squeezing, and transposing tensors are everyday tasks in model development, especially for adapting standard datasets and models to each other.
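A compact sketch of the operations in the table above, using only NumPy; the shapes mirror the grayscale-image and NLP-sequence examples from the notes.

import numpy as np

img = np.zeros((256, 256))            # grayscale image: 2 dimensions
print(img.ndim, img.shape, img.size)  # 2, (256, 256), 65536

# Add a channel axis so the shape matches what a CNN expects.
img = np.expand_dims(img, axis=-1)    # (256, 256, 1)

# Add a batch axis of size 1 as well.
batch = img.reshape(1, 256, 256, 1)   # equivalently np.expand_dims(img, 0)

# Remove singleton axes, e.g., turning model output (N, 1) into (N,).
preds = np.zeros((32, 1))
preds = np.squeeze(preds)             # (32,)

# -1 lets NumPy infer one dimension; ravel flattens to 1-D.
flat = batch.reshape(-1)              # same as np.ravel(batch)

# Reorder axes, e.g., (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len).
seq = np.zeros((8, 100, 768))
seq = seq.transpose(0, 2, 1)          # shape (8, 768, 100)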

May 24, 2018 • 18min
MLA 003 Storage: HDF, Pickle, Postgres
Practical workflow of loading, cleaning, and storing large datasets for machine learning, moving from ingesting raw CSVs or JSON files with pandas to saving processed datasets and neural network weights using HDF5 for efficient numerical storage. It clearly distinguishes among storage options—explaining when to use HDF5, pickle files, or SQL databases—while highlighting how libraries like pandas, TensorFlow, and Keras interact with these formats and why these choices matter for production pipelines.

Links: Notes and resources at ocdevel.com/mlg/mla-3. Try a walking desk: stay healthy & sharp while you learn & code.

Data Ingestion and Preprocessing
Data Sources and Formats: Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases. Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis).
Pandas as the Core Ingestion Tool: Pandas provides versatile functions such as read_csv, read_json, and others to load various file formats, with robust options for handling edge cases (e.g., file encodings, missing values). After loading, data cleaning is performed using pandas: dropping or imputing missing values, converting booleans and categorical columns to numeric form.
Data Encoding for Machine Learning: All features must be numerical before being supplied to machine learning models like TensorFlow or Keras. Categorical data is one-hot encoded using pandas.get_dummies, converting strings to binary indicator columns. The underlying NumPy array of a DataFrame is accessed via df.values for direct integration with modeling libraries.

Numerical Data Storage Options
HDF5 for Storing Processed Arrays: HDF5 (Hierarchical Data Format version 5) enables efficient storage of large multidimensional NumPy arrays. Libraries like h5py and built-in pandas functions (to_hdf) allow seamless saving and retrieval of arrays or DataFrames. TensorFlow and Keras use HDF5 by default to store neural network weights as multi-dimensional arrays for model checkpointing and early stopping, accommodating robust recovery and rollback.
Pickle for Python Objects: Python's pickle protocol serializes arbitrary objects, including machine learning models and arrays, into files for later retrieval. While convenient for quick iterations or heterogeneous data, pickle is less efficient with NDArrays compared to HDF5, lacks significant compression, and poses security risks if not properly safeguarded.
SQL Databases and Spreadsheets: For mixed or heterogeneous data, or when producing results for sharing and collaboration, relational databases like PostgreSQL or spreadsheets such as CSVs are used. Databases serve as the endpoint for production systems, where model outputs—such as generated recommendations or reports—are published for downstream use.

Storage Workflow in Machine Learning Pipelines
Typical Process: Data is initially loaded and processed with pandas, then converted to numerical arrays suitable for model training. Intermediate states and model weights are saved using HDF5 during model development and training, ensuring recovery from interruptions and facilitating early stopping. Final outputs, especially those requiring sharing or production use, are published to SQL databases or shared as spreadsheet files.
Best Practices and Progression: Quick project starts may involve pickle for accessible storage during early experimentation.
For large-scale, high-performance applications, migration to HDF5 for numerical data and SQL for production-grade results is recommended. Alternative options like Feather and PyTables (an interface on top of HDF5) exist for specialized needs.

Summary
HDF5 is optimal for numerical array storage due to its efficiency, built-in compression, and integration with major machine learning frameworks. Pickle accommodates arbitrary Python objects but is suboptimal for numerical data persistence or security. SQL databases and spreadsheets are used for disseminating results, especially when human consumption or application integration is required. The selection of a storage format is determined by data type, pipeline stage, and end-use requirements within machine learning workflows.
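A hedged sketch of this load-encode-store flow with pandas; the DataFrame is invented, to_hdf requires the PyTables package, and the file names are arbitrary.

import pandas as pd
import pickle

# Hypothetical raw data with a categorical column.
df = pd.DataFrame({
    "sqft": [1200, 1500, 900],
    "style": ["ranch", "condo", "ranch"],
    "price": [250_000, 320_000, 180_000],
})

# One-hot encode strings so every feature is numeric.
df = pd.get_dummies(df, columns=["style"])
X = df.values  # underlying NumPy array for TensorFlow/Keras

# HDF5 round-trip (requires the PyTables package: pip install tables).
df.to_hdf("processed.h5", key="housing", mode="w")
df_back = pd.read_hdf("processed.h5", key="housing")

# Pickle works for arbitrary Python objects but is less efficient for big arrays.
with open("model_state.pkl", "wb") as f:
    pickle.dump({"columns": list(df.columns)}, f)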

May 24, 2018 • 18min
MLA 002 Numpy & Pandas
NumPy enables efficient storage and vectorized computation on large numerical datasets in RAM by leveraging contiguous memory allocation and low-level C/Fortran libraries, drastically reducing memory footprint compared to native Python lists. Pandas, built on top of NumPy, introduces labelled, flexible tabular data manipulation—facilitating intuitive row and column operations, powerful indexing, and seamless handling of missing data through tools like alignment, reindexing, and imputation.

Links: Notes and resources at ocdevel.com/mlg/mla-2. Try a walking desk: stay healthy & sharp while you learn & code.

NumPy: Efficient Numerical Arrays and Vectorized Computation
Purpose and Design: NumPy ("Numerical Python") is the foundational library for handling large numerical datasets in RAM. It introduces the ndarray (n-dimensional array), which is synonymous with a tensor—enabling storage of vectors, matrices, or higher-dimensional data.
Memory Efficiency: NumPy arrays are homogeneous: all elements share a consistent data type (e.g., float64, int32, bool). This data type awareness enables allocation of tightly packed, contiguous memory blocks, optimizing both RAM usage and data access speed. The memory footprint can be orders of magnitude lower than equivalent native Python lists; for example, tasks that exhausted 32GB of RAM using Python lists could drop to just 6GB with NumPy structures.
Vectorized Operations: NumPy supports vectorized calculations: operations (such as squaring all elements) are applied across entire arrays in a single step, without explicit Python loops. These operations are operator-overloaded and executed by delegating instructions to low-level, highly optimized C or Fortran routines, delivering significant computational speed gains. Conditional operations and masking, such as zeroing out negative numbers (akin to a ReLU activation), can be done efficiently with Boolean masks.

Pandas: Advanced Tabular Data Manipulation
Relationship to NumPy: Pandas builds upon NumPy, leveraging its underlying optimized array storage and computation for numerical columns in its data structures. It supports additional types like strings for non-numeric data, which are common in real-world datasets.
2D Data Handling and Directional Operations: The core Pandas structure is the DataFrame, which handles labelled rows and columns, analogous to a spreadsheet or SQL table. Operations are equally intuitive row-wise and column-wise, facilitating both SQL-like ("row-oriented") and "columnar" manipulations. This dual orientation enables many complex data transformations to be succinct one-liners instead of lengthy Python code.
Indexing and Alignment: Pandas uses flexible and powerful indexing, enabling functions such as joining disparate datasets via a shared index (e.g., timestamp alignment in financial time series). When merging DataFrames (e.g., two stocks with differing trading days), Pandas automatically aligns data on the index, introducing NaN (null) values for unmatched dates.
Handling Missing Data (Imputation): Pandas includes robust features for detecting and filling missing values, known as imputation. Options include forward filling, backfilling, or interpolating missing values based on surrounding data. Datasets can be reindexed against standardized sequences, such as all valid trading days, to enforce consistent time frames and further identify or fill data gaps.
Use Cases and Integration
Pandas simplifies ETL (extract, transform, load) for CSV and database-derived data, merging NumPy's computation power with tools for advanced data cleaning and integration. When preparing data for machine learning frameworks (e.g., TensorFlow or Keras), Pandas DataFrames can be converted back into NumPy arrays for computation, maintaining tight integration across the data science stack.

Summary
NumPy underpins high-speed numerical operations and memory efficiency, while Pandas extends these capabilities to powerful, flexible, and intuitive manipulation of labelled multi-dimensional data - together forming the backbone of data analysis and preparation in Python machine learning workflows.
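A small sketch tying the two libraries together: vectorized NumPy masking (the ReLU-style example from the notes), then Pandas index alignment and forward-fill before handing a plain array to a modeling library. The two toy price series are invented.

import numpy as np
import pandas as pd

# Vectorized math on a NumPy array: no Python loop, delegated to optimized C routines.
x = np.array([-2.0, -0.5, 1.0, 3.0])
relu = np.where(x > 0, x, 0.0)   # boolean-mask style zeroing of negatives
squared = x ** 2                 # applied element-wise in one step

# Two "stocks" with different trading days; Pandas aligns them on the shared index.
days = pd.date_range("2024-01-01", periods=5, freq="D")
a = pd.Series([10, 11, 12, 13, 14], index=days)
b = pd.Series([100, 102, 103], index=days[[0, 2, 4]])
prices = pd.DataFrame({"a": a, "b": b})   # unmatched dates become NaN

# Impute the gaps, then hand a plain NumPy array to an ML framework.
prices = prices.ffill()
X = prices.to_numpy()
print(X)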

May 24, 2018 • 11min
MLA 001 Degrees, Certificates, and Machine Learning Careers
While industry-respected credentials like Udacity Nanodegrees help build a practical portfolio for machine learning job interviews, they remain insufficient stand-alone qualifications—most roles require a Master's degree as a near-hard requirement, especially compared to more flexible web development fields. A Master's, such as Georgia Tech's OMSCS, not only greatly increases employability but is strongly recommended for those aiming for entry into machine learning careers, while a PhD is more appropriate for advanced, research-focused roles with significant time investment.

Links: Notes and resources at ocdevel.com/mlg/mla-1.

Online Certificates: Usefulness and Limitations
Udacity Nanodegree: Provides valuable hands-on experience and a practical portfolio of machine learning projects. Demonstrates self-motivation and the ability to self-teach. Not industry-recognized as a formal qualification—it does not by itself suffice for job placement in most companies. Best used as a supplement to demonstrate applied skills, especially in interviews where coding portfolios (e.g., on GitHub) are essential.
Coursera Specializations: Another MOOC resource similar to Udacity, though Udacity's Nanodegree is cited as closer to real-world relevance among certificates. Neither is accredited or currently accepted as a substitute for formal university degrees by most employers.
The Role of a Portfolio: Possessing a portfolio with multiple sophisticated projects is critical, regardless of educational background. Interviewers expect examples showcasing data processing (e.g., with Pandas and NumPy), analysis, and end-to-end modeling using libraries like scikit-learn or TensorFlow.

Degree Requirements in Machine Learning
Bachelor's Degree: Often sufficient for software engineering and web development roles but generally inadequate for machine learning positions. In web development, non-CS backgrounds and bootcamp graduates are commonplace; the requirement is flexible. Machine learning employers treat "Master's preferred" as a near-required credential, sharply contrasting with the lax standards in web and mobile development.
Master's Degree: Significantly improves employability and is typically expected for most machine learning roles. The Georgia Tech Online Master of Science in Computer Science (OMSCS) is highlighted as a cost-effective, flexible, and industry-recognized path. Industry recruiters often filter out candidates without a master's, making advancement with only a bachelor's degree an uphill struggle. A master's degree reduces obstacles and levels the playing field with other candidates.
PhD: Necessary mainly for highly research-centric positions at elite companies (e.g., Google, OpenAI). Opens doors to advanced research and high salaries (often $300,000+ per year in leading tech sectors). Involves years of extensive commitment; suitable mainly for those with a passion for research.

Recommendations
For aspiring machine learning professionals: start with a bachelor's if you don't already have one. Strongly consider a master's degree (such as OMSCS) for solid industry entry. Only pursue a PhD if intent on working in cutting-edge research roles. Always build and maintain a robust portfolio to supplement academic achievements.

Summary Insight
A master's degree is becoming the de facto entry ticket to machine learning careers, with MOOCs and portfolios providing crucial, but secondary, support.