Yannic Kilcher Videos (Audio Only)

Yannic Kilcher
May 26, 2021 • 42min

Expire-Span: Not All Memories are Created Equal: Learning to Forget by Expiring (Paper Explained)

#expirespan #nlp #facebookai Facebook AI (FAIR) researchers present Expire-Span, a variant of Transformer XL that dynamically assigns expiration dates to previously encountered signals. Because of this, Expire-Span can handle sequences of many thousands of tokens, while keeping the memory and compute requirements at a manageable level. It matches or outperforms baseline systems while consuming far fewer resources. We discuss its architecture, advantages, and shortcomings. OUTLINE: 0:00 - Intro & Overview 2:30 - Remembering the past in sequence models 5:45 - Learning to expire past memories 8:30 - Difference to local attention 10:00 - Architecture overview 13:45 - Comparison to Transformer XL 18:50 - Predicting expiration masks 32:30 - Experimental Results 40:00 - Conclusion & Comments Paper: https://arxiv.org/abs/2105.06548 Code: https://github.com/facebookresearch/t... ADDENDUM: I mention several times that the gradient signal of the e quantity only occurs inside the R ramp. By that, I mean the gradient stemming from the model loss. The regularization loss also acts outside the R ramp. Abstract: Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory. Recent work investigated mechanisms to reduce the computational cost of preserving and storing memories. However, not all content in the past is equally important to remember. We propose Expire-Span, a method that learns to retain the most important information and expire the irrelevant information. This forgetting of memories enables Transformers to scale to attend over tens of thousands of previous timesteps efficiently, as not all states from previous timesteps are preserved. We demonstrate that Expire-Span can help models identify and retain critical information and show it can achieve strong performance on reinforcement learning tasks specifically designed to challenge this functionality. Next, we show that Expire-Span can scale to memories that are tens of thousands in size, setting a new state of the art on incredibly long context tasks such as character-level language modeling and a frame-by-frame moving objects task. Finally, we analyze the efficiency of Expire-Span compared to existing approaches and demonstrate that it trains faster and uses less memory. Authors: Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, Jason Weston, Angela Fan Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
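
For intuition, here is a minimal PyTorch sketch of the soft expiration mask discussed in the video (my reading of the paper, not the official code; the single-linear-unit span predictor with parameters w and b is an assumption):

```python
import torch

def expire_span_mask(h_mem, t, mem_positions, w, b, max_span=1024.0, ramp=128.0):
    """Sketch of Expire-Span's soft expiration mask (not the authors' implementation).

    h_mem:         (M, d) hidden states currently stored in memory
    t:             current timestep (scalar)
    mem_positions: (M,) timestep at which each memory was written
    w, b:          parameters of an assumed single linear span predictor
    """
    # predicted expiration span for each memory, squashed into [0, max_span]
    e = max_span * torch.sigmoid(h_mem @ w + b)        # (M,)
    age = t - mem_positions                            # how long each memory has been kept
    # soft mask: 1 while the memory is young, decays linearly to 0 over a ramp of length `ramp`
    return torch.clamp((e - age) / ramp, 0.0, 1.0)     # (M,)

# Attention weights onto the memory are multiplied by this mask and renormalized;
# memories whose mask has reached 0 can be dropped, which is where the savings come from.
```
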
May 24, 2021 • 34min

FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained)

#fnet #attention #fourier Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix. OUTLINE: 0:00 - Intro & Overview 0:45 - Giving up on Attention 5:00 - FNet Architecture 9:00 - Going deeper into the Fourier Transform 11:20 - The Importance of Mixing 22:20 - Experimental Results 33:00 - Conclusions & Comments Paper: https://arxiv.org/abs/2105.03824 ADDENDUM: Of course, I completely forgot to discuss the connection between Fourier transforms and Convolutions, and that this might be interpreted as convolutions with very large kernels. Abstract: We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with simple nonlinearities in feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs. The resulting model, which we name FNet, scales very efficiently to long inputs, matching the accuracy of the most accurate "efficient" Transformers on the Long Range Arena benchmark, but training and running faster across all sequence lengths on GPUs and relatively shorter sequence lengths on TPUs. Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts. Authors: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
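
To make the token-mixing idea concrete, here is a minimal PyTorch sketch of one FNet-style encoder block (my reading of the paper; dimensions are illustrative, and this is not the official implementation):

```python
import torch
import torch.nn as nn

class FNetBlock(nn.Module):
    """Sketch of an FNet encoder block: the self-attention sublayer is replaced by an
    unparameterized 2D Fourier transform over the sequence and hidden dimensions."""
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        # token mixing: 2D DFT, keep only the real part -- no learned parameters at all
        x = self.norm1(x + torch.fft.fft2(x).real)
        # the usual position-wise feed-forward sublayer does the heavy lifting
        x = self.norm2(x + self.ff(x))
        return x
```
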
May 21, 2021 • 14min

AI made this music video | What happens when OpenAI's CLIP meets BigGAN?

#artificialintelligence #musicvideo #clip I used OpenAI's CLIP model and BigGAN to create a music video that goes along with the lyrics of a song that I wrote. The song lyrics are made from ImageNet class labels, and the song itself is performed by me on a looper. OUTLINE: 0:00 - Intro 1:00 - AI-generated music video for "be my weasel" 3:50 - How it was made 7:30 - My looping gear 9:35 - AI-generated music video #2 12:45 - Outro & Credits Code and references: https://github.com/yk/clip_music_video Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
May 15, 2021 • 55min

DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained)

#ddpm #diffusionmodels #openai GANs have dominated the image generation space for the majority of the last decade. This paper shows, for the first time, how a non-GAN model, a DDPM, can be improved to overtake GANs at standard evaluation metrics for image generation. The produced samples look amazing, and unlike GANs, the new model has a formal probabilistic foundation. Is there a future for GANs, or are Diffusion Models going to overtake them for good? OUTLINE: 0:00 - Intro & Overview 4:10 - Denoising Diffusion Probabilistic Models 11:30 - Formal derivation of the training loss 23:00 - Training in practice 27:55 - Learning the covariance 31:25 - Improving the noise schedule 33:35 - Reducing the loss gradient noise 40:35 - Classifier guidance 52:50 - Experimental Results Paper (this): https://arxiv.org/abs/2105.05233 Paper (previous): https://arxiv.org/abs/2102.09672 Code: https://github.com/openai/guided-diff... Abstract: We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for sample quality using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.85 on ImageNet 512×512. We release our code at this https URL Authors: Alex Nichol, Prafulla Dhariwal Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
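
A rough PyTorch sketch of the classifier-guidance step covered around 40:35 (hedged: the function and argument names are mine, and the classifier is assumed to be noise-aware and take the timestep t, as in the paper's setup):

```python
import torch

def classifier_guided_mean(x_t, t, y, mu, sigma, classifier, scale=1.0):
    """Shift the reverse-diffusion mean toward higher classifier confidence (sketch only).

    mu, sigma:  mean and (diagonal) covariance predicted by the diffusion model for this step
    classifier: assumed to be trained on noisy images and to return logits for (x_t, t)
    scale:      guidance scale s, trading diversity for sample quality
    """
    x = x_t.detach().requires_grad_(True)
    log_probs = classifier(x, t).log_softmax(dim=-1)           # (batch, num_classes)
    selected = log_probs[torch.arange(x.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x)[0]                 # d log p(y | x_t) / d x_t
    # guided mean: mu + s * Sigma * grad, then sample x_{t-1} ~ N(guided mean, Sigma)
    return mu + scale * sigma * grad
```
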
May 10, 2021 • 31min

Involution: Inverting the Inherence of Convolution for Visual Recognition (Research Paper Explained)

#involution #computervision #attention Convolutional Neural Networks (CNNs) have dominated computer vision for almost a decade by applying two fundamental principles: Spatial agnosticism and channel-specific computations. Involution aims to invert these principles and presents a spatial-specific computation, which is also channel-agnostic. The resulting Involution Operator and RedNet architecture are a compromise between classic Convolutions and the newer Local Self-Attention architectures and perform favorably in terms of computation accuracy tradeoff when compared to either. OUTLINE: 0:00 - Intro & Overview 3:00 - Principles of Convolution 10:50 - Towards spatial-specific computations 17:00 - The Involution Operator 20:00 - Comparison to Self-Attention 25:15 - Experimental Results 30:30 - Comments & Conclusion Paper: https://arxiv.org/abs/2103.06255 Code: https://github.com/d-li14/involution Abstract: Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision. In this work, we rethink the inherent principles of standard convolution for vision tasks, specifically spatial-agnostic and channel-specific. Instead, we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution, coined as involution. We additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over-complicated instantiation. The proposed involution operator could be leveraged as fundamental bricks to build the new generation of neural networks for visual recognition, powering different deep learning models on several prevalent benchmarks, including ImageNet classification, COCO detection and segmentation, together with Cityscapes segmentation. Our involution-based models improve the performance of convolutional baselines using ResNet-50 by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU absolutely while compressing the computational cost to 66%, 65%, 72%, and 57% on the above benchmarks, respectively. Code and pre-trained models for all the tasks are available at this https URL. Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
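
To make "spatial-specific, channel-agnostic" concrete, here is a simplified PyTorch sketch of the involution operator (the reduction ratio and the kernel-generation head are my assumptions; see the linked repo for the real implementation):

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Sketch of involution: each spatial location generates its own K x K kernel from its
    feature vector, and that kernel is shared across the channels within a group."""
    def __init__(self, channels, kernel_size=7, groups=1, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # kernel generation: a per-pixel bottleneck MLP (as 1x1 convs) outputs K*K*G weights
        self.kernel_gen = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1),
        )
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        kernels = self.kernel_gen(x)                           # (B, K*K*G, H, W)
        kernels = kernels.view(b, self.g, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k * self.k, h, w)
        # multiply each K*K neighborhood by its location-specific kernel and sum it up
        out = (kernels * patches).sum(dim=3)                   # (B, G, C//G, H, W)
        return out.view(b, c, h, w)
```
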
May 10, 2021 • 28min

MLP-Mixer: An all-MLP Architecture for Vision (Machine Learning Research Paper Explained)

#mixer #google #imagenet Convolutional Neural Networks have dominated computer vision for nearly 10 years, and that might finally come to an end. First, Vision Transformers (ViT) have shown remarkable performance, and now even simple MLP-based models reach competitive accuracy, as long as sufficient data is used for pre-training. This paper presents MLP-Mixer, which uses MLPs in a particular weight-sharing arrangement to achieve a competitive, high-throughput model, and it raises interesting questions for future research about the nature of learning, inductive biases, and their interaction with scale. OUTLINE: 0:00 - Intro & Overview 2:20 - MLP-Mixer Architecture 13:20 - Experimental Results 17:30 - Effects of Scale 24:30 - Learned Weights Visualization 27:25 - Comments & Conclusion Paper: https://arxiv.org/abs/2105.01601 Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers. Authors: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy ERRATA: Here is their definition of what the 5-shot classifier is: "we report the few-shot accuracies obtained by solving the L2-regularized linear regression problem between the frozen learned representations of images and the labels" Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
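
A minimal PyTorch sketch of one Mixer block, token-mixing MLP followed by channel-mixing MLP, based on my reading of the paper (hidden sizes are illustrative, not the paper's configurations):

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """Sketch of an MLP-Mixer block: one MLP mixes across patches (shared over channels),
    another mixes across channels (shared over patches)."""
    def __init__(self, num_patches, channels, token_dim=256, channel_dim=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_dim), nn.GELU(), nn.Linear(token_dim, num_patches))
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channel_dim), nn.GELU(), nn.Linear(channel_dim, channels))

    def forward(self, x):                            # x: (batch, num_patches, channels)
        # token mixing: transpose so the MLP acts across the patch dimension
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # channel mixing: the same MLP is applied independently at every patch
        x = x + self.channel_mlp(self.norm2(x))
        return x
```
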
May 3, 2021 • 12min

Is Google Translate Sexist? Gender Stereotypes in Statistical Machine Translation

#genderbias #algorithmicfairness #debiasing A brief look into gender stereotypes in Google Translate. The origin is a Tweet containing a Hungarian text. Hungarian is a gender-neutral language, so translating gender pronouns is ambiguous. Turns out that Google Translate assigns very stereotypical pronouns. In this video, we'll have a look at the origins and possible solutions to this problem. OUTLINE: 0:00 - Intro 1:10 - Digging Deeper 2:30 - How does Machine Translation work? 3:50 - Training Data Problems 4:40 - Learning Algorithm Problems 5:45 - Argmax Output Problems 6:45 - Pragmatics 7:50 - More on Google Translate 9:40 - Social Engineering 11:15 - Conclusion Songs: Like That - Anno Domini Beats Submarine - Dyalla Dude - Patrick Patrikios Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
May 3, 2021 • 30min

Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)

#perceiver #deepmind #transformer Inspired by the fact that biological creatures attend to multiple modalities at the same time, DeepMind releases its new Perceiver model. Based on the Transformer architecture, the Perceiver makes no assumptions on the modality of the input data and also solves the long-standing quadratic bottleneck problem. This is achieved by having a latent low-dimensional Transformer, where the input data is fed multiple times via cross-attention. The Perceiver's weights can also be shared across layers, making it very similar to an RNN. Perceivers achieve competitive performance on ImageNet and state-of-the-art on other modalities, all while making no architectural adjustments to input data. OUTLINE: 0:00 - Intro & Overview 2:20 - Built-In assumptions of Computer Vision Models 5:10 - The Quadratic Bottleneck of Transformers 8:00 - Cross-Attention in Transformers 10:45 - The Perceiver Model Architecture & Learned Queries 20:05 - Positional Encodings via Fourier Features 23:25 - Experimental Results & Attention Maps 29:05 - Comments & Conclusion Paper: https://arxiv.org/abs/2103.03206 My Video on Transformers (Attention is All You Need): https://youtu.be/iDulhoQ2pro Abstract: Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet. Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
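
A rough PyTorch sketch of the Perceiver's core loop, a small learned latent array that repeatedly cross-attends to the (possibly huge) input array and then self-attends (dimensions, depth, and the single shared cross-/self-attention modules are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class PerceiverSketch(nn.Module):
    """Sketch of the Perceiver: cross-attention from a latent array to the input avoids the
    O(M^2) self-attention cost on the raw input of size M."""
    def __init__(self, input_dim=64, latent_dim=512, num_latents=256, depth=4, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))   # learned queries
        self.input_proj = nn.Linear(input_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(latent_dim, heads, batch_first=True)
        self.depth = depth

    def forward(self, x):                          # x: (batch, M, input_dim), M can be ~50k pixels
        b = x.shape[0]
        kv = self.input_proj(x)                    # keys/values come from the raw input array
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.depth):                # weights are shared across repeats (RNN-like)
            # cross-attention costs O(num_latents * M), not O(M^2)
            z = z + self.cross_attn(z, kv, kv, need_weights=False)[0]
            z = self.self_attn(z)                  # cheap self-attention in the latent space
        return z.mean(dim=1)                       # pooled representation for classification
```
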
May 3, 2021 • 34min

Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)

#universalcomputation #pretrainedtransformers #finetuning Large-scale pre-training and subsequent fine-tuning is a common recipe for success with transformer models in machine learning. However, most such transfer learning is done when a model is pre-trained on the same or a very similar modality to the final task to be solved. This paper demonstrates that transformers can be fine-tuned to completely different modalities, such as from language to vision. Moreover, they demonstrate that this can be done by freezing all attention layers, tuning less than 0.1% of all parameters. The paper further claims that language modeling is a superior pre-training task for such cross-domain transfer. The paper goes through various ablation studies to make its point. OUTLINE: 0:00 - Intro & Overview 2:00 - Frozen Pretrained Transformers 4:50 - Evaluated Tasks 10:05 - The Importance of Training LayerNorm 17:10 - Modality Transfer 25:10 - Network Architecture Ablation 26:10 - Evaluation of the Attention Mask 27:20 - Are FPTs Overfitting or Underfitting? 28:20 - Model Size Ablation 28:50 - Is Initialization All You Need? 31:40 - Full Model Training Overfits 32:15 - Again the Importance of Training LayerNorm 33:10 - Conclusions & Comments Paper: https://arxiv.org/abs/2103.05247 Code: https://github.com/kzl/universal-comp... Abstract: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks. In particular, we find that such pretraining enables FPT to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks. Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
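
A small sketch of the Frozen Pretrained Transformer recipe applied to a Hugging Face GPT-2 model (the name matching on "ln_" and "wpe" is my assumption based on that library's module names, not the authors' code; the paper additionally trains new task-specific input and output layers):

```python
from transformers import GPT2Model  # assuming the Hugging Face GPT-2 implementation

def freeze_for_fpt(model: GPT2Model) -> None:
    """Freeze self-attention and feed-forward weights; keep layer norms and the
    positional embedding trainable, in the spirit of the FPT setup (sketch only)."""
    for name, param in model.named_parameters():
        # 'ln_' matches ln_1 / ln_2 inside each block and the final ln_f; 'wpe' is the
        # positional embedding table
        param.requires_grad = ("ln_" in name) or ("wpe" in name)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable}/{total} ({100 * trainable / total:.2f}%)")

# usage sketch:
# model = GPT2Model.from_pretrained("gpt2")
# freeze_for_fpt(model)   # then train only the unfrozen parts plus new input/output layers
```
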
May 3, 2021 • 59min

Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)

#selfsupervisedlearning #yannlecun #facebookai Deep Learning systems can achieve remarkable, even super-human performance through supervised learning on large, labeled datasets. However, there are two problems: First, collecting ever more labeled data is expensive in both time and money. Second, these deep neural networks will be high performers on their task, but cannot easily generalize to other, related tasks, or they need large amounts of data to do so. In this blog post, Yann LeCun and Ishan Misra of Facebook AI Research (FAIR) describe the current state of Self-Supervised Learning (SSL) and argue that it is the next step in the development of AI that uses fewer labels and can transfer knowledge faster than current systems. As a promising direction, they suggest building non-contrastive latent-variable predictive models, like VAEs, but ones that also provide high-quality latent representations for downstream tasks. OUTLINE: 0:00 - Intro & Overview 1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense 7:35 - Predicting Hidden Parts from Observed Parts 17:50 - Self-Supervised Learning for Language vs Vision 26:50 - Energy-Based Models 30:15 - Joint-Embedding Models 35:45 - Contrastive Methods 43:45 - Latent-Variable Predictive Models and GANs 55:00 - Summary & Conclusion Paper (Blog Post): https://ai.facebook.com/blog/self-sup... My Video on BYOL: https://www.youtube.com/watch?v=YPfUi... ERRATA: - The difference between loss and energy: Energy is for inference, loss is for training. - The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together. - The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :) Video approved by Antonio. Abstract: We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems. Authors: Yann LeCun, Ishan Misra Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-ki... BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
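
Since the post spends a while on contrastive joint-embedding methods (35:45), here is a minimal InfoNCE-style loss sketch for reference; it is purely illustrative of what "contrastive" means here, and the post itself argues for moving beyond such methods toward non-contrastive latent-variable models:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Sketch of a contrastive loss for a joint-embedding model: two embeddings of the same
    input are pulled together, embeddings of different inputs are pushed apart.

    z1, z2: (batch, dim) embeddings of two augmented views of the same batch of inputs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) cosine similarities
    labels = torch.arange(z1.shape[0], device=z1.device)
    # the matching pair (diagonal) is the positive; every other pair acts as a negative
    return F.cross_entropy(logits, labels)
```
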
