Machine Learning Guide

OCDevel
Feb 5, 2018 • 43min

MLG 029 Reinforcement Learning Intro

Notes and resources: ocdevel.com/mlg/29
Try a walking desk to stay healthy while you study or work!

Reinforcement Learning (RL) is a fundamental component of artificial intelligence, though not synonymous with AI itself. It is considered a key aspect of AI because it learns through interaction with an environment, using a system of rewards and punishments.

Links: openai/baselines, reinforceio/tensorforce, NervanaSystems/coach, rll/rllab, Differential Computers

Concepts and Definitions

Reinforcement Learning (RL): a framework where an "agent" learns by interacting with its environment and receiving feedback in the form of rewards or punishments. It is part of the broader machine learning category, alongside supervised and unsupervised learning. Unlike supervised learning, where a model learns from labeled data, RL focuses on decision-making and goal achievement.

Comparison with Other Learning Types

- Supervised Learning: a teacher-student paradigm where models are trained on labeled data. Common in applications like image recognition and language processing.
- Unsupervised Learning: not commonly used in practical applications, per the experience shared in the episode.
- Reinforcement Learning vs. Supervised Learning: RL allows agents to learn independently through interaction, whereas supervised learning requires labeled training data.

Applications of Reinforcement Learning

- Games and Simulations: deep reinforcement learning powers game-playing agents such as AlphaGo and video-game players, where the environment and the possible rewards or penalties are predefined.
- Robotics and Autonomous Systems: robots (e.g., Boston Dynamics mules) and autonomous vehicles that learn to navigate and make decisions in real-world environments.
- Finance and Trading: modeling trading strategies that aim to optimize financial returns over time, although breakthrough performance in trading isn't yet evidenced.

RL Frameworks and Environments

- Frameworks: OpenAI Baselines, TensorForce, and Intel's Coach, each with different capabilities and company backing.
- Environments: OpenAI's Gym is a suite of environments for training RL agents. A minimal sketch of the reward loop appears below.

Future Aspects and Developments

- Model-based vs. Model-free RL: model-based RL plans using knowledge of the world's dynamics, while model-free RL reacts from experience without an explicit world model.
- Remaining Challenges: current hurdles in AI include reasoning, knowledge representation, and memory, with ongoing efforts at institutions like Google DeepMind.
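To make the reward loop concrete, here is a minimal sketch of tabular Q-learning on Gym's FrozenLake. It assumes the classic pre-0.26 gym API (where env.reset() returns just the state and env.step() returns four values); the hyperparameter values are illustrative, not from the episode.

```python
# Tabular Q-learning on FrozenLake: a minimal agent-environment reward loop.
import gym
import numpy as np

env = gym.make("FrozenLake-v0")
Q = np.zeros((env.observation_space.n, env.action_space.n))  # state-action values
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, _ = env.step(action)
        # Nudge the value estimate toward reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```

Every framework listed above (Baselines, TensorForce, Coach) wraps some variation of this observe-act-reward-update loop.
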
Feb 4, 2018 • 51min

MLG 028 Hyperparameters 2

Notes and resources: ocdevel.com/mlg/28
Try a walking desk to stay healthy while you study or work!

More hyperparameters for optimizing neural networks: a focus on regularization, optimizers, feature scaling, and hyperparameter search methods.

Hyperparameter Search Techniques

- Grid Search tests all possible permutations of hyperparameters, but is computationally exhaustive and suited to simpler, less time-consuming models.
- Random Search selects random combinations of hyperparameters, potentially saving time, though it may miss the optimal combination (sketched in code below).
- Bayesian Optimization employs machine learning itself to continuously update and hone in on efficient hyperparameter combinations, avoiding the exhaustive or blind nature of grid and random search.

Regularization in Neural Networks

- L1 and L2 regularization penalize certain parameter configurations to prevent overfitting, often smoothing overfitted parameters.
- Dropout randomly deactivates neurons during training so the model doesn't over-rely on specific neurons, fostering better generalization.

Optimizers

Optimizers like Adam, which combines momentum and adaptive learning rates, are vital tools for refining the learning process of neural networks. Adam, the most sophisticated and most commonly used optimizer, improves on simpler techniques like momentum by incorporating more advanced adaptive features.

Initializers

Weight initialization matters: methods range from uniform random initialization to the more advanced Xavier initialization, which prevents neural networks from starting in "stuck" states.

Feature Scaling

Scaling methods such as standardization and normalization bring feature inputs into small, standardized ranges. Batch Normalization integrates scaling directly into the network, normalizing layer outputs to prevent issues like exploding and vanishing gradients.

Links

- Bayesian Optimization
- Optimizers (SGD): Momentum -> Adagrad -> RMSProp -> Adam -> Nadam
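As an illustration of random search (not the episode's exact code), here is a sketch using scikit-learn's RandomizedSearchCV on a small neural net; the model choice and parameter ranges are placeholders.

```python
# Random search over a few hyperparameters: architecture, L2 strength, learning rate.
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
param_distributions = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],  # architecture choices
    "alpha": loguniform(1e-5, 1e-1),                 # L2 regularization strength
    "learning_rate_init": loguniform(1e-4, 1e-1),
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=300), param_distributions,
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)  # the best combination found in 20 random draws
```

Grid search would enumerate every combination instead; Bayesian optimization would use earlier results to pick the next combination to try.
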
Jan 28, 2018 • 47min

MLG 027 Hyperparameters 1

Full notes and resources at ocdevel.com/mlg/27
Try a walking desk to stay healthy while you study or work!

Hyperparameters are crucial elements in the configuration of machine learning models. Unlike parameters, which are learned by the model during training, hyperparameters are set by humans before the learning process begins. They are the knobs and dials humans can turn to influence the training and performance of machine learning models.

Definition and Importance

Hyperparameters differ from parameters like theta in linear and logistic regression, which are learned weights. They are choices made by humans, such as the type of model, the number of neurons in a layer, or the model architecture. These choices can significantly affect the model's performance, making conscious and informed tuning vital.

Types of Hyperparameters

- Model Selection: choosing which model to use is itself a hyperparameter, e.g. deciding between linear regression, logistic regression, naive Bayes, or neural networks.
- Architecture of Neural Networks:
  - Number of layers and neurons: deciding the depth (number of layers) and width (number of neurons per layer).
  - Types of layers: whether to use LSTMs, convolutional layers, or dense layers.
- Activation Functions: transform linear outputs into non-linear outputs. Popular choices include ReLU, tanh, and sigmoid, with ReLU the default for most neural network layers.
- Regularization and Optimization: these influence the learning process. The use of L1/L2 regularization or dropout, as well as the choice of optimizer (e.g., Adam, Adagrad), are hyperparameters.

Optimization Techniques

Techniques like grid search, random search, and Bayesian optimization systematically explore combinations of hyperparameters to find the best configuration for a given task. While computationally expensive, they are necessary for achieving optimal model performance.

Challenges and Future Directions

The field strives to simplify the choice of hyperparameters, ideally automating them into parameters of the model itself. Efforts like Google's AutoML aim to handle hyperparameter tuning automatically. Understanding and optimizing hyperparameters is a cornerstone of machine learning, directly impacting a model's effectiveness and efficiency. Progress continues to fold these choices into model training, reducing dependency on human intervention and trial-and-error experimentation.

Decision Tree

Model selection:
- Unsupervised? K-means Clustering => DL
- Linear? Linear regression, logistic regression
- Simple? Naive Bayes, Decision Tree (Random Forest, Gradient Boosting)
- Little data? Boosting
- Lots of data, complex situation? Deep learning

Network layer architecture:
- Vision? CNN
- Time series? LSTM
- Other? MLP
- Trading: LSTM => CNN decision
- Layer size design (funnel, etc.); e.g. face pics, from the BTC episode
- Don't know? Layers=1, Neurons=mean(inputs, output) (link); see the sketch below

Activations / nonlinearity:
- Output layer:
  - Sigmoid = predict probability of output, usually at the output layer
  - Softmax = multi-class classification
  - Nothing = regression
- Hidden layers:
  - ReLU family (Leaky ReLU, ELU, SELU, ...) = mitigates vanishing gradients (the gradient is constant), better performance, usually the better default
  - Tanh = classification between two classes, where a mean of 0 is important
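A sketch of the "don't know?" rule of thumb above: one hidden layer whose width is the mean of input and output sizes, with ReLU hidden activation and a sigmoid output. Keras API; the feature and label sizes are made up for illustration.

```python
# One-hidden-layer starting point: Layers=1, Neurons=mean(inputs, outputs).
from tensorflow.keras import layers, models

n_inputs, n_outputs = 30, 1                  # e.g., 30 features, one binary label
hidden = (n_inputs + n_outputs) // 2         # the mean(inputs, outputs) heuristic

model = models.Sequential([
    layers.Dense(hidden, activation="relu", input_shape=(n_inputs,)),  # ReLU hidden layer
    layers.Dense(n_outputs, activation="sigmoid"),  # sigmoid: probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Swap the output activation per the decision tree: softmax for multi-class, no activation for regression.
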
Jan 27, 2018 • 39min

MLG 026 Project Bitcoin Trader

Try a walking desk to stay healthy while you study or work!
Full notes and resources at ocdevel.com/mlg/26

NOTE: This episode is no longer relevant, and tforce_btc_trader is no longer maintained. The current podcast project is Gnothi.

Episode Overview

TForce BTC Trader Project: Trading Crypto
- Special: intuitively highlights decisions: hypers, supervised vs. reinforcement, LSTM vs. CNN
- Crypto (vs. stock): Bitcoin, Ethereum, Litecoin, Ripple
  - Many benefits (immutable permanent distributed ledger; security; low fees; international; etc.)
  - For our purposes: popular, volatile, singular
  - Singular like Forex vs. Stock (instruments)
- Trading basics
  - Day, swing, investing
  - Patterns (technical analysis vs. fundamentals)
  - OHLCV / candles
  - Indicators
  - Exchanges & arbitrage (GDAX, Kraken)
- Good because it highlights a lot: LSTM vs. CNN, supervised vs. reinforcement, obvious net architectures (indicators, time-series, tanh vs. relu)

Episode Summary

The "Bitcoin Trader" project involves developing a Bitcoin trading bot using machine learning, capitalizing on the hot topic of cryptocurrency and its potential profitability. The project serves as a medium for complex machine learning engineering topics, such as hyperparameter selection and reinforcement learning, over subsequent episodes.

Cryptocurrency, specifically Bitcoin, is used for its universal and decentralized nature, akin to a digital, secure, and democratic financial instrument like the US dollar. Bitcoin mining involves running complex calculations to manage the currency's existence, similar to a distributed Federal Reserve system, with transactions recorded on a secure and permanent ledger known as the blockchain.

The flexibility of cryptocurrency trading allows for machine learning applications across unsupervised, supervised, and reinforcement learning paradigms. This project focuses on models such as LSTM recurrent neural networks and convolutional neural networks, highlighting Bitcoin's unique capacity to illustrate machine learning design decisions like network architecture.

Trading differs from investing by seeking profit from price fluctuations rather than relying on long-term value increase. It involves understanding patterns in price action to buy low and sell high. Day trading involves buying and selling within the same day; swing trading spans longer periods.

Trading decisions rely on patterns identified in price graphs, using time-series data. Representing the data as candlesticks (OHLCV: open-high-low-close-volume), coupled with indicators like moving averages and RSI, provides multiple input features for machine learning models, improving prediction accuracy (a sketch of deriving such features appears below).

Exchanges like GDAX and Kraken serve as platforms for converting traditional currencies into cryptocurrencies. The efficient market hypothesis suggests that an instrument's price reflects the collective analysis of market participants; differences in prices across exchanges create opportunities for arbitrage, further fueling trading strategies.

The project code, currently using deep reinforcement learning via TensorForce, employs convolutional neural networks over LSTMs to suit the intricacies of Bitcoin trading. The project is available at ocdevel.com for community engagement, with future episodes tackling hyperparameter selection and deep reinforcement learning techniques.
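As an illustrative sketch (not the tforce_btc_trader code), here is how OHLCV candles might be turned into indicator features with pandas; the CSV path and column names are hypothetical.

```python
# Derive moving-average and RSI features from OHLCV candles.
import pandas as pd

candles = pd.read_csv("btc_ohlcv.csv")  # assumed columns: open, high, low, close, volume

# Simple moving average over 10 candles.
candles["sma_10"] = candles["close"].rolling(10).mean()

# Relative Strength Index (RSI) over 14 candles, simple-moving-average variant.
delta = candles["close"].diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
candles["rsi_14"] = 100 - 100 / (1 + gain / loss)

# Raw candles plus indicators become the model's input features.
features = candles[["open", "high", "low", "close", "volume", "sma_10", "rsi_14"]].dropna()
```
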
Oct 30, 2017 • 45min

MLG 025 Convolutional Neural Networks

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/25

Filters and Feature Maps: filters are small matrices that detect visual features in an input image by being applied to local pixel patches, producing a 3D output called a feature map. Each filter is tasked with recognizing a specific pattern (e.g., edges, textures) in the input images.

Convolutional Layers: the filter is applied across the image to produce the feature map. A convolutional layer is composed of several feature maps, with depth corresponding to the number of filters applied.

Image Compression Techniques

- Window and Stride: the window is the size of the pixel patch examined by the filter, and the stride determines how far the window moves across the image each step. Together they compress images by reducing the number of windows examined, effectively downsampling the image.
- Padding: padding accounts for border pixels that do not fit perfectly within the window size. "Same" padding adds zero-padding so all pixels are included, while "valid" padding ignores excess pixels around the borders.

Max Pooling: a downsampling technique that reduces the spatial dimensions of feature maps by taking the maximum value over a defined window, further compressing the representation and reducing computational load.

Predefined Architectures: well-established architectures like LeNet, AlexNet, and ResNet have been fine-tuned through competitions such as the ImageNet Challenge, and can be used directly or adapted for specific computer-vision tasks.
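A Keras sketch tying these pieces together: filters, window (kernel), stride, padding, and max pooling. The sizes are illustrative.

```python
# One conv layer plus max pooling, annotated with the terms from the notes.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(
        filters=32,               # 32 filters -> a feature map 32 deep
        kernel_size=(3, 3),       # the "window": a 3x3 pixel patch
        strides=(1, 1),           # move the window 1 pixel at a time
        padding="same",           # zero-pad so border pixels are covered
        activation="relu",
        input_shape=(28, 28, 1),  # e.g. a 28x28 grayscale image
    ),
    layers.MaxPooling2D(pool_size=(2, 2)),  # keep the max of each 2x2 window
])
model.summary()  # conv output: 28x28x32 ("same" padding); after pooling: 14x14x32
```
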
Oct 7, 2017 • 1h 2min

MLG 024 Tech Stack

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/24

Hardware

Desktop if you're stationary, as you'll get the best performance bang-for-buck and improved longevity; laptop if you're mobile.

Desktops: build your own PC, better value than pre-built. See PC Part Picker, and make sure to use an Nvidia graphics card. Generally shoot for the 2nd-best of CPUs/GPUs, e.g. the RTX 4070 currently (2024-01): better value-to-price than the 4080+. For laptops, see this post (updated).

OS / Software

Use Linux (I prefer Ubuntu), or Windows with WSL2 and Docker. See mla/12 for details.

Programming Tech Stack

- Deep-learning frameworks: you'll use both TF & PT eventually, so don't get hung up. See mlg/9 for details.
  - TensorFlow (and/or Keras)
  - PyTorch (and/or Lightning)
- Shallow-learning / utilities: Scikit-Learn, Pandas, Numpy
- Cloud hosting: AWS / GCP / Azure. See mla/13 for details.

Episode Summary

The episode discusses setting up a tech stack tailored for machine learning, emphasizing the choice of a primary programming language and framework: here, Python and TensorFlow, supported by their ongoing popularity and community backing. This preference is further influenced by the need for GPU optimization, which TensorFlow provides through Nvidia's CUDA technology.

A notable change in the landscape is the decline of certain deep learning frameworks such as Theano, and the rise of competitors like PyTorch, which is gaining traction due to its ease of use compared to TensorFlow. The author emphasizes selecting frameworks with robust community support and resources, highlighting TensorFlow's market lead in this respect.

For hardware, the suggestion is a custom-built PC with a powerful Nvidia GPU, such as the 1080 Ti, running Ubuntu Linux for best compatibility. For those who favor cloud services, Amazon Web Services (AWS) and Google Cloud Platform (GCP) are viable options, with a preference for GCP due to cost and performance benefits, particularly with the upcoming Tensor Processing Units (TPUs).

On the software side, Pandas for data manipulation, NumPy for mathematical operations, and Scikit-Learn for shallow learning form a comprehensive toolkit for machine learning development. Abstraction libraries such as Keras (simplifying TensorFlow syntax) and TensorForce (reinforcement learning) are also recommended.

The episode further explores system architecture, suggesting a separation of concerns between a web app server and a machine learning (job) server. Communication between these components can be managed with a message queue like RabbitMQ, with Celery as a potential abstraction layer; a sketch follows below.

To support developers in implementing machine learning pipelines, the recommendations extend to leveraging existing datasets, using Scikit-Learn for convenient access, and standardizing data for effective training results. The author points to several books and resources for understanding and applying these technologies, ending with recommendations for building your own workstation and, as an advanced optimization, compiling TensorFlow from source for performance gains.
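A hedged sketch of the web-server / job-server split described above, using Celery over RabbitMQ. The task body, broker URL, and dataset path are hypothetical.

```python
# ml_jobs.py -- runs on the machine learning (job) server.
from celery import Celery

app = Celery("ml_jobs", broker="amqp://guest@localhost//", backend="rpc://")

@app.task
def train_model(dataset_path):
    # Long-running ML work happens here, on the job server with the GPU,
    # so the web app server stays responsive.
    ...
    return {"status": "done", "dataset": dataset_path}

# The web app enqueues work through RabbitMQ without blocking:
#   result = train_model.delay("s3://bucket/data.csv")
#   result.get(timeout=3600)  # poll or await the outcome later
```
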
Aug 20, 2017 • 43min

MLG 023 Deep NLP 2

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/23

Neural Network Types in NLP

- Vanilla Neural Networks (Feedforward Networks): used for general classification or regression tasks, e.g. predicting housing costs or classifying an image as cat, dog, or tree.
- Convolutional Neural Networks (CNNs): primarily used for image-related tasks.
- Recurrent Neural Networks (RNNs): used for sequence-based tasks such as weather prediction, stock market prediction, and natural language processing. They differ from feedforward networks by looping back onto previous steps to handle sequences over time.

Key Concepts and Applications

- Supervised vs. Reinforcement Learning: supervised learning trains models on labeled data so they can produce labels for new inputs autonomously. Reinforcement learning instead learns actions that maximize a reward function over time, suitable for tasks like gaming AI but less so for tasks like NLP.
- Encoder-Decoder Models: process the entire input sequence before producing output, crucial for tasks like machine translation where full context is needed before generation. The encoder transforms a sequence into a vector-space representation; the decoder reconstructs it into another sequence.
- Gradient Problems & Solutions: vanishing and exploding gradients occur when backpropagating through time steps, causing information loss or overflow, notably in longer sequences. Long Short-Term Memory (LSTM) cells solve this by letting RNNs retain important information over longer sequences, effectively mitigating gradient issues.

LSTM Functionality

An LSTM cell replaces traditional neurons in an RNN with machinery that regulates information flow. Components within an LSTM cell (sketched in code below):
- Forget Gate: decides which information to discard from the cell state.
- Input Gate: determines which information to update.
- Output Gate: controls the output from the cell.
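A minimal numpy sketch of a single LSTM step, showing the three gates. The weights here are random placeholders; a real cell learns them through backpropagation.

```python
# One forward step of an LSTM cell: forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(hidden, hidden + inputs)) for g in "fico"}  # gate weights
b = {g: np.zeros(hidden) for g in "fico"}                            # gate biases

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])           # previous hidden state + new input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to discard from cell state
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what to update
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate values to write
    c = f * c_prev + i * c_tilde              # new cell state (long-term memory)
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to emit
    h = o * np.tanh(c)                        # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden))
```

Because the cell state c flows forward through additions rather than repeated multiplications, gradients survive longer sequences, which is how LSTMs mitigate the vanishing-gradient problem.
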
Jul 29, 2017 • 50min

MLG 022 Deep NLP 1

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/22

Deep NLP Fundamentals

Deep learning has had a profound impact on natural language processing by introducing models like recurrent neural networks (RNNs) that are specifically adept at handling sequential data. Unlike traditional linear models such as linear regression, RNNs can address the complexities of language, which arise from its inherent non-linearity and hierarchy. These models learn complex features by combining data across multiple layers, which has revolutionized areas like sentiment analysis and machine translation.

Neural Networks and Their Use in NLP

Neural networks divide into regular feedforward networks and recurrent neural networks (RNNs). Feedforward networks suit non-sequential tasks, while RNNs suit sequential data such as language: the network's hidden layers connect across time steps, and this loopy architecture lets RNNs maintain a form of state or memory, making them effective where context is crucial. The challenge of mapping sequences to meaningful output has led to architectures like the encoder-decoder model, which reads an entire sequence before producing a response or translation, enhancing the network's ability to learn and remember context across long sequences.

Word Embeddings and Contextual Representations

A key challenge in processing natural language with machine learning models is representing words as numbers, since machine learning relies on mathematical operations. Initial representations like one-hot vectors were simple but carried no semantic meaning. Word embeddings such as those generated by Word2Vec address this by placing words in a vector space where distance and direction between vectors are meaningful, letting models capture semantic similarities and differences between words. Word2Vec learns these embeddings with a neural network that predicts a word from its context, or vice versa (a toy comparison follows below).

Advanced Architectures and Practical Implications

RNNs and their more sophisticated variants, LSTM and GRU cells, address challenges such as the vanishing gradient problem that arises during backpropagation through time. These architectures learn longer-range dependencies more effectively, vital for handling the nuances of human language. As a result, these models have become dominant in modern NLP, replacing older methods for tasks from part-of-speech tagging to machine translation.

Further Learning and Resources

For in-depth learning, see "The Unreasonable Effectiveness of RNNs", Stanford's deep NLP course by Christopher Manning, and continued education in deep learning. Both theoretical understanding and practical application are crucial for mastering the deep learning techniques that are transforming NLP.
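A toy illustration of why embeddings beat one-hot vectors: distance in the vector space carries meaning. The 3-dimensional vectors below are made up for illustration; real Word2Vec embeddings have hundreds of dimensions.

```python
# Cosine similarity between (made-up) word embeddings.
import numpy as np

emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["king"], emb["queen"]))  # high: semantically close
print(cosine(emb["king"], emb["apple"]))  # low: unrelated

# One-hot vectors, by contrast, give cosine 0 for every distinct word pair,
# so "king"/"queen" would look no more related than "king"/"apple".
```
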
Jul 23, 2017 • 41min

MLG 020 Natural Language Processing 3

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/20

NLP progresses through three main layers: text preprocessing, syntax tools, and high-level goals, each building on the last to achieve complex linguistic tasks.

Text Preprocessing

Text preprocessing involves essential steps such as tokenization, stemming, and stop-word removal. These foundational tasks clean and prepare text so that subsequent processes can be applied more effectively.

Syntax Tools

Syntax tools are crucial for understanding grammatical structure. Part-of-speech tagging identifies each word's role in a sentence, such as noun, verb, or adjective. Named entity recognition (NER) distinguishes entities such as people, organizations, and dates, leveraging models like maximum entropy, support vector machines, or hidden Markov models.

Achieving High-Level Goals

High-level NLP goals include text classification, sentiment analysis, and optimizing search engines. Techniques such as the Naive Bayes algorithm enable effective text classification by simplifying documents into word-occurrence models. Search engines benefit from TF-IDF in tandem with cosine similarity, allowing efficient document retrieval and relevance ranking (sketched in code below).

In-depth Look at Syntax Parsing

Syntax parsing analyzes sentence structure through two primary approaches: context-free grammars (CFGs) and dependency parsing. CFGs use production rules to break sentences into components like noun phrases and verb phrases; probabilistic CFGs learn from datasets like the Penn Treebank to estimate the likelihood of various grammatical structures. Dependency parsing, on the other hand, maps word relationships through directional arcs, producing a dependency tree that highlights connections such as those between subjects and verbs.

Applications of NLP Tools

Syntax parsing plays a vital role in tasks like relationship extraction, providing insight into how entities relate within text. Question answering integrates various tools, using TF-IDF and syntax parsing to locate and extract precise answers from relevant documents, as seen in systems like Google's snippet answers. Text summarization distills large texts into concise summaries: TF-IDF identifies sentences rich in informational content (due to their less frequent vocabulary) and redundancies are removed for a coherent summary, while TextRank, a graph-based method, scores sentence importance by connectedness within a document.

Machine Translation Evolution

Machine translation demonstrates the transformative impact of deep learning. Traditional pipelines, characterized by complexity and multiple separate models, have been surpassed by neural machine translation systems. These use recurrent neural networks (RNNs) for end-to-end translation, folding tasks that previously required separate linguistic models into a unified approach, simplifying development and improving accuracy.

The episode underscores the transition from shallow NLP approaches to deep learning methods, highlighting how advanced models, particularly those involving RNNs, are redefining language-processing tasks with efficiency and sophistication.
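A sketch of TF-IDF retrieval with cosine similarity using scikit-learn; the mini-corpus and query are illustrative.

```python
# TF-IDF vectors + cosine similarity: the core of a toy search engine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rose sharply today",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)  # rows: documents, columns: terms

query_vector = vectorizer.transform(["are dogs good pets"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(docs[best], scores[best])  # the pets document ranks highest
```

Rare terms get high IDF weight, so documents sharing a query's distinctive words rank above documents sharing only common ones.
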
Jul 11, 2017 • 1h 6min

MLG 019 Natural Language Processing 2

Try a walking desk to stay healthy while you study or work!
Notes and resources at ocdevel.com/mlg/19

Classical NLP Techniques

Origins and Phases in NLP History: NLP initially relied on hardcoded linguistic rules; its evolution pivoted significantly with the introduction of machine learning, first shallow learning algorithms, and eventually deep learning, the current standard.

Importance of Classical Methods: traditional methods remain valuable, providing historical context and a foundation for understanding NLP tasks, and can be advantageous with small datasets or limited compute power.

Edit Distance and Stemming
- Levenshtein Distance: used for spelling corrections by measuring the minimal edits needed to transform one string into another (implementation sketch below).
- Stemming: simplifying a word to its base form; the Porter Stemmer is a common algorithm.

Language Models
- Understand language legitimacy by calculating the joint probability of word sequences.
- Use n-grams to construct language models, increasing accuracy at the expense of computational power.

Naive Bayes for Classification
- Ideal for tasks like spam detection, document classification, and sentiment analysis.
- Relies on a "bag of words" model, simplifying documents down to word-frequency counts and disregarding sequence dependence.

Part-of-Speech Tagging and Named Entity Recognition
- Methods: maximum entropy models, hidden Markov models.
- Challenges: feature engineering for parts of speech; complexity of named entity recognition.

Generative vs. Discriminative Models
- Generative models estimate the joint probability distribution; useful with less data.
- Discriminative models focus on decision boundaries between classes.

Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) identifies topics within large sets of documents by clustering words into topics, allowing mixed membership of topics across documents.

Search and Similarity Measures
- Utilize TF-IDF to transform documents into vectors reflecting term importance, inversely correlated with document frequency in the corpus.
- Employ cosine similarity to measure semantic similarity between document vectors.
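A minimal sketch of Levenshtein distance via dynamic programming, the standard two-row formulation used for the spelling-correction idea above.

```python
# Levenshtein (edit) distance: minimal inserts, deletes, and substitutions
# needed to turn string a into string b.
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between a[:i-1] and b[:j] from the prior row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A spelling corrector can rank dictionary words by this distance from a misspelled input and suggest the closest ones.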
