

Yannic Kilcher Videos (Audio Only)
Yannic Kilcher
I make videos about machine learning research papers, programming, issues of the AI community, and the broader impact of AI on society.
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Episodes

Jan 5, 2022 • 54min
Resolution-robust Large Mask Inpainting with Fourier Convolutions (w/ Author Interview)
#lama #inpainting #deeplearning
At the end of the video is an interview with the paper authors!
LaMa is a system that is amazing at removing foreground objects from images, especially when those objects cover a large part of the image. LaMa is specifically trained to reconstruct large masked areas and includes global information throughout its forward propagation by using Fourier convolutions in its layers. This makes it far more effective than regular convolutions at reconstructing periodic structures with long-range consistency.
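As a rough, hedged illustration of the core idea (not the authors' exact FFC layer, which also has a local branch and channel splitting), a spectral-transform block can be sketched in PyTorch like this; all names are illustrative:
```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Minimal sketch of a Fourier-convolution-style block: FFT over the
    spatial dims, a pointwise convolution on the spectrum, inverse FFT.
    Every output pixel depends on the whole image, i.e. the receptive
    field is image-wide."""
    def __init__(self, channels):
        super().__init__()
        # the real FFT yields complex values; stack real/imag parts as 2*channels
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")           # (b, c, h, w//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)   # (b, 2c, h, w//2+1)
        spec = self.act(self.conv(spec))                  # mix channels in the spectral domain
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# usage: out = SpectralTransform(64)(torch.randn(1, 64, 128, 128))
```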
OUTLINE:
0:00 - Intro
0:45 - Sponsor: ClearML
3:30 - Inpainting Examples
5:05 - Live Demo
6:40 - Locality as a weakness of convolutions
10:30 - Using Fourier Transforms for global information
12:55 - Model architecture overview
14:35 - Fourier convolution layer
21:15 - Loss function
24:25 - Mask generation algorithm
25:40 - Experimental results
28:25 - Interview with the authors
Paper: https://arxiv.org/abs/2109.07161
Code: https://github.com/saic-mdal/lama
Online Demo: https://cleanup.pictures/
Sponsor: ClearML
https://clear.ml
Abstract:
Modern image inpainting systems, despite the significant progress, often struggle with large missing areas, complex geometric structures, and high-resolution images. We find that one of the main reasons for that is the lack of an effective receptive field in both the inpainting network and the loss function. To alleviate this issue, we propose a new method called large mask inpainting (LaMa). LaMa is based on i) a new inpainting network architecture that uses fast Fourier convolutions (FFCs), which have the image-wide receptive field; ii) a high receptive field perceptual loss; iii) large training masks, which unlocks the potential of the first two components. Our inpainting network improves the state-of-the-art across a range of datasets and achieves excellent performance even in challenging scenarios, e.g. completion of periodic structures. Our model generalizes surprisingly well to resolutions that are higher than those seen at train time, and achieves this at lower parameter&time costs than the competitive baselines. The code is available at \url{this https URL}.
Authors: Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, Victor Lempitsky
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Dec 14, 2021 • 26min
[ML News] DeepMind tackles Math | Microsoft does more with less | Timnit Gebru launches DAIR
#mlnews #deepmind #ai
The most trusted model in News!
Get started with Weights & Biases here: https://wandb.me/yannic
(it's free forever for personal use)
OUTLINE:
0:00 - Intro
0:15 - Sponsor: Weights & Biases
3:10 - DeepMind tackles fundamental math
6:45 - Microsoft focuses on scaling effectively and efficiently
10:15 - NeurIPS Anthology Visualization
13:30 - Timnit Gebru launches research institute independent from big tech
16:50 - SageMaker Canvas for no-code ML
17:50 - Help, Help!
21:40 - Cornelius Emde wins the 3090
21:55 - A retrospective on the NeurIPS 2021 ethics review process
References:
DeepMind tackles fundamental math
https://deepmind.com/blog/article/exp...
https://www.nature.com/articles/s4158...
Microsoft focuses on scaling effectively and efficiently
https://www.microsoft.com/en-us/resea...
NeurIPS Anthology Visualization
https://neuripsav.vizhub.ai/blog/
https://neuripsav.vizhub.ai/
Timnit Gebru launches research institute independent from big tech
https://www.washingtonpost.com/techno...
https://www.dair-institute.org/about
https://www.theguardian.com/commentis...
SageMaker Canvas for no-code ML
https://aws.amazon.com/blogs/aws/anno...
Help, Help!
https://macberth.netlify.app/
https://huggingface.co/emanjavacas/Ma...
https://developer.nvidia.com/blog/nvi...
https://opacus.ai/
https://twitter.com/naotokui_en/statu...
https://colab.research.google.com/dri...
https://twitter.com/ThomasSimonini/st...
https://github.com/karpathy/arxiv-san...
https://arxiv-sanity-lite.com/
https://www.youtube.com/watch?v=01ENz...
https://github.com/Felix-Petersen/alg...
https://github.com/rentruewang/koila?...
https://github.com/YeWR/EfficientZero
Cornelius Emde wins the 3090
https://twitter.com/CorEmde/status/14...
A retrospective on the NeurIPS 2021 ethics review process
https://blog.neurips.cc/2021/12/03/a-...
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Dec 10, 2021 • 53min
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (ML Research Paper Explained)
#nuwa #microsoft #generative
NÜWA is a unifying architecture that can ingest text, images, and videos and bring all of them into a quantized latent representation to support a multitude of visual generation tasks, such as text-to-image, text-guided video manipulation, or sketch-to-video. This paper details how the encoders for the different modalities are constructed, and how the latent representation is transformed using their novel 3D nearby self-attention layers. Experiments are shown on 8 different visual generation tasks that the model supports.
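As a hedged sketch of the "nearby" part of 3D Nearby Attention (not the paper's implementation), one can build a boolean mask that only lets each latent token attend to tokens within a small window along the time, height, and width axes:
```python
import torch

def nearby_attention_mask(T, H, W, extent=1):
    """Sketch: mask of shape (T*H*W, T*H*W) where entry (i, j) is True iff
    token j lies within `extent` steps of token i along every axis of the
    (time, height, width) latent grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    ), dim=-1).reshape(-1, 3)                                # (N, 3) grid coordinates
    diff = (coords[:, None, :] - coords[None, :, :]).abs()   # pairwise coordinate distances
    return (diff <= extent).all(dim=-1)                      # (N, N) boolean mask

# usage: mask = nearby_attention_mask(4, 8, 8, extent=2)
#        attn_scores.masked_fill_(~mask, float("-inf"))      # before the softmax
```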
OUTLINE:
0:00 - Intro & Outline
1:20 - Sponsor: ClearML
3:35 - Tasks & Naming
5:10 - The problem with recurrent image generation
7:35 - Creating a shared latent space w/ Vector Quantization
23:20 - Transforming the latent representation
26:25 - Recap: Self- and Cross-Attention
28:50 - 3D Nearby Self-Attention
41:20 - Pre-Training Objective
46:05 - Experimental Results
50:40 - Conclusion & Comments
Paper: https://arxiv.org/abs/2111.12417
Github: https://github.com/microsoft/NUWA
Sponsor: ClearML
https://clear.ml
Abstract:
This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is this https URL.
Authors: Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Dec 3, 2021 • 29min
[ML News] OpenAI removes GPT-3 waitlist | GauGAN2 is amazing | NYC regulates AI hiring tools
#mlnews #gaugan #gpt-3
Your weekly dose of ML News!
More GauGAN images here: https://drive.google.com/drive/folder...
OUTLINE:
0:00 - Intro
0:20 - Sponsor: Weights & Biases
2:20 - OpenAI removes GPT-3 Waitlist
4:55 - NVIDIA releases GauGAN2 Webapp
9:45 - Everyday Robots tackles real-life tasks
12:15 - MetNet-2: 12-hour Rain Forecasting
14:45 - TinyML Dog Bark Stopper
15:55 - AI learns to drive Mario Kart 64 on real hardware
17:40 - NYC regulates bias in AI hiring tools
21:05 - Beverage companies big into AI
21:50 - How does AlphaZero play Chess?
23:35 - Helpful Things
28:00 - ArXiv founder awarded Einstein Foundation Award
References:
OpenAI removes GPT-3 Waitlist
https://openai.com/blog/api-no-waitlist/
https://beta.openai.com/playground?mo...
NVIDIA releases GauGAN2 Webapp
https://www.reddit.com/r/MachineLearn...
http://gaugan.org/gaugan2/
https://blogs.nvidia.com/blog/2021/11...
https://blogs.nvidia.com/blog/2019/03...
https://arxiv.org/abs/1903.07291
Everyday Robots tackles real-life tasks
https://everydayrobots.com/
https://www.wired.com/story/plaintext...
https://archive.ph/YC4XG#selection-92...
MetNet-2: 12-hour Rain Forecasting
https://ai.googleblog.com/2021/11/met...
TinyML Dog Bark Stopper
https://www.hackster.io/NathanielF/ti...
AI learns to drive Mario Kart 64 on real hardware
https://www.youtube.com/watch?v=z9E38...
NYC regulates bias in AI hiring tools
https://www.nbcnewyork.com/news/local...
Beverage companies big into AI
https://www.just-drinks.com/features/...
How does AlphaZero play Chess?
https://arxiv.org/pdf/2111.09259.pdf
https://storage.googleapis.com/uncert...
Helpful Things
https://huggingface.co/sberbank-ai/ru...
https://github.com/MathisFederico/Ope...
https://blog.tensorflow.org/2021/11/i...
https://github.com/tensorflow/gnn
https://github.com/jurgisp/pydreamer?...
https://danijar.com/project/dreamerv2/
https://github.com/danijar/dreamerv2
https://deepgenx.com/
https://github.com/DeepGenX/CodeGenX
https://devpost.com/software/heyoh-ca...
https://heyoh-app.github.io/heyoh-pro...
https://github.com/heyoh-app/heyoh-pr...
ArXiv founder awarded Einstein Foundation Award
https://idw-online.de/en/news781515?u...
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq

Dec 2, 2021 • 57min
Sparse is Enough in Scaling Transformers (aka Terraformer) | ML Research Paper Explained
#scalingtransformers #terraformer #sparsity
Transformers keep pushing the state of the art in language and other domains, mainly due to their ability to scale to ever more parameters. However, this scaling has made it prohibitively expensive to run a lot of inference requests against a Transformer, both in terms of compute and memory requirements. Scaling Transformers are a new kind of architecture that leverages sparsity in the Transformer blocks to massively speed up inference, and by including additional ideas from other architectures, they create the Terraformer, which is fast, accurate, and consumes very little memory.
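As a hedged, simplified sketch of the sparse feedforward idea (not the authors' Trax code), a cheap controller can pick one active unit per block of the hidden layer, so only a fraction of the d_ff dimension is actually used per token; all names and sizes below are illustrative:
```python
import torch
import torch.nn as nn

class SparseFeedForward(nn.Module):
    """Sketch of a sparse feedforward layer: a low-cost controller selects
    one active unit per block of the hidden dimension, so most of the big
    d_ff projection can be skipped at inference time."""
    def __init__(self, d_model=512, d_ff=2048, block=32):
        super().__init__()
        self.block = block
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.controller = nn.Linear(d_model, d_ff)   # cheap scoring of hidden units

    def forward(self, x):                            # x: (batch, d_model)
        scores = self.controller(x).view(x.shape[0], -1, self.block)
        # hard one-hot per block: only one unit per block stays active
        mask = torch.zeros_like(scores).scatter_(
            -1, scores.argmax(dim=-1, keepdim=True), 1.0
        ).view(x.shape[0], -1)
        h = torch.relu(self.w_in(x)) * mask          # sparse hidden activation
        return self.w_out(h)

# usage: y = SparseFeedForward()(torch.randn(4, 512))
```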
OUTLINE:
0:00 - Intro & Overview
4:10 - Recap: Transformer stack
6:55 - Sparse Feedforward layer
19:20 - Sparse QKV Layer
43:55 - Terraformer architecture
55:05 - Experimental Results & Conclusion
Paper: https://arxiv.org/abs/2111.12763
Code: https://github.com/google/trax/blob/m...
Abstract:
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Dec 1, 2021 • 41min
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning (Paper Explained)
#ext5 #transferlearning #exmix
The T5 model has been a staple of NLP research for the last several years. Both its size and its approach of formulating all NLP tasks as prompt-based language modeling make it a convenient choice for tackling new challenges and provide a strong baseline for most current datasets. ExT5 pushes T5 to its limits by pre-training not only on self-supervised mask filling, but simultaneously on 107 different supervised NLP tasks drawn from their new ExMix dataset. The resulting model compares very favorably to T5 when fine-tuned on downstream tasks.
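To make the setup concrete, here is a hedged sketch of how heterogeneous supervised examples can be cast into the single text-to-text format that T5-style models train on; the task names and templates are illustrative, not the ExMix originals:
```python
import random

def to_text_to_text(example):
    """Sketch: map examples from different task families to one
    (input string, target string) format, with self-supervised span
    denoising as the fallback objective."""
    task = example["task"]
    if task == "nli":
        src = f"nli premise: {example['premise']} hypothesis: {example['hypothesis']}"
        tgt = example["label"]                        # e.g. "entailment"
    elif task == "summarization":
        src = f"summarize: {example['document']}"
        tgt = example["summary"]
    else:                                             # span denoising on raw text
        words = example["text"].split()
        i = random.randrange(max(len(words) - 3, 1))
        tgt = " ".join(["<extra_id_0>"] + words[i:i + 3])
        src = " ".join(words[:i] + ["<extra_id_0>"] + words[i + 3:])
    return {"inputs": src, "targets": tgt}
```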
OUTLINE:
0:00 - Intro & Overview
2:15 - Recap: The T5 model
3:55 - The ExT5 model and task formulations
8:10 - ExMix dataset
9:35 - Do different tasks help each other?
16:50 - Which tasks should we include?
20:30 - Pre-Training vs Pre-Finetuning
23:00 - A few hypotheses about what's going on
27:20 - How much self-supervised data to use?
34:15 - More experimental results
38:40 - Conclusion & Summary
Paper: https://arxiv.org/abs/2111.10952
Abstract:
Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training.
Authors: Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m

Dec 1, 2021 • 59min
Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions (Paper Explained)
#imle #backpropagation #discrete
Backpropagation is the workhorse of deep learning, but unfortunately, it only works for continuous functions that are amenable to the chain rule of differentiation. Since discrete algorithms have no continuous derivative, deep networks that contain such algorithms cannot be effectively trained using backpropagation. This paper presents a method to incorporate a large class of algorithms, formulated as discrete exponential family distributions, into deep networks and derives gradient estimates that can easily be used in end-to-end backpropagation. This enables things like combinatorial optimizers to natively be part of a network's forward propagation.
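As a hedged, simplified sketch of the idea (not the released tf-imle or torch-imle code), a custom autograd function can run a discrete solver on perturbed scores in the forward pass and return a difference of solver outputs as the gradient estimate in the backward pass; the solver, lambda, and noise scale below are illustrative:
```python
import torch

class IMLE(torch.autograd.Function):
    """Sketch of an implicit-MLE-style estimator. Forward: MAP state of the
    perturbed scores (here a top-k selection stands in for a general
    combinatorial solver). Backward: difference between the solver's output
    for the original scores and for scores nudged along the negative
    incoming gradient (a crude target distribution)."""

    @staticmethod
    def solver(theta, k=3):
        z = torch.zeros_like(theta)                   # indicator of the top-k entries
        z.scatter_(-1, theta.topk(k, dim=-1).indices, 1.0)
        return z

    @staticmethod
    def forward(ctx, theta):
        noise = torch.randn_like(theta) * 0.1         # perturb-and-MAP sample
        ctx.save_for_backward(theta, noise)
        return IMLE.solver(theta + noise)

    @staticmethod
    def backward(ctx, grad_output):
        theta, noise = ctx.saved_tensors
        lam = 10.0                                    # target-distribution step size
        z = IMLE.solver(theta + noise)
        z_target = IMLE.solver(theta + noise - lam * grad_output)
        return z - z_target                           # gradient estimate w.r.t. theta

# usage:
# theta = torch.randn(2, 8, requires_grad=True)
# loss = (IMLE.apply(theta) * torch.randn(2, 8)).sum()
# loss.backward()
```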
OUTLINE:
0:00 - Intro & Overview
4:25 - Sponsor: Weights & Biases
6:15 - Problem Setup & Contributions
8:50 - Recap: Straight-Through Estimator
13:25 - Encoding the discrete problem as an inner product
19:45 - From algorithm to distribution
23:15 - Substituting the gradient
26:50 - Defining a target distribution
38:30 - Approximating marginals via perturb-and-MAP
45:10 - Entire algorithm recap
56:45 - Github Page & Example
Paper: https://arxiv.org/abs/2106.01798
Code (TF): https://github.com/nec-research/tf-imle
Code (Torch): https://github.com/uclnlp/torch-imle
Our Discord: https://discord.gg/4H8xxDF
Sponsor: Weights & Biases
https://wandb.com
Abstract:
Combining discrete probability distributions and combinatorial optimization problems with neural network components has numerous applications but poses several challenges. We propose Implicit Maximum Likelihood Estimation (I-MLE), a framework for end-to-end learning of models combining discrete exponential family distributions and differentiable neural components. I-MLE is widely applicable as it only requires the ability to compute the most probable states and does not rely on smooth relaxations. The framework encompasses several approaches such as perturbation-based implicit differentiation and recent methods to differentiate through black-box combinatorial solvers. We introduce a novel class of noise distributions for approximating marginals via perturb-and-MAP. Moreover, we show that I-MLE simplifies to maximum likelihood estimation when used in some recently studied learning settings that involve combinatorial solvers. Experiments on several datasets suggest that I-MLE is competitive with and often outperforms existing approaches which rely on problem-specific relaxations.
Authors: Mathias Niepert, Pasquale Minervini, Luca Franceschi
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Nov 26, 2021 • 11min
Peer Review is still BROKEN! The NeurIPS 2021 Review Experiment (results are in)
#neurips #peerreview #machinelearning
A look at the results of the 2021 NeurIPS peer review experiment.
https://arxiv.org/abs/2109.09774
https://www.reddit.com/r/MachineLearn...
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Nov 25, 2021 • 48min
Parameter Prediction for Unseen Deep Architectures (w/ First Author Boris Knyazev)
#deeplearning #neuralarchitecturesearch #metalearning
Deep neural networks are usually trained from a given parameter initialization using SGD until convergence at a local optimum. This paper goes a different route: Given a novel network architecture for a known dataset, can we predict the final network parameters without ever training them? The authors build a Graph Hypernetwork and train it on a novel dataset of diverse DNN architectures to predict high-performing weights. The results show that not only can the GHN predict weights with non-trivial performance, but it can also generalize beyond the distribution of training architectures to predict weights for networks that are much larger, deeper, or wider than anything seen during training.
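As a hedged toy sketch of the hypernetwork idea (the real GHN uses graph message passing and weight tiling to handle arbitrary shapes), a decoder can map a node embedding of the computation graph to that node's parameter tensor in one forward pass; all names and sizes are illustrative:
```python
import torch
import torch.nn as nn

class TinyHypernetwork(nn.Module):
    """Toy sketch: decode one node embedding into the weights of that node's
    operation, without any gradient-based training of those weights."""
    def __init__(self, emb_dim=64, max_params=3 * 3 * 64 * 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, max_params)
        )

    def forward(self, node_emb, shape):
        flat = self.decoder(node_emb)                 # predict a flat weight vector
        n = int(torch.tensor(shape).prod())
        return flat[:n].view(*shape)                  # crop and reshape to the layer's shape

# usage: w = TinyHypernetwork()(torch.randn(64), (64, 64, 3, 3))
```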
OUTLINE:
0:00 - Intro & Overview
6:20 - DeepNets-1M Dataset
13:25 - How to train the Hypernetwork
17:30 - Recap on Graph Neural Networks
23:40 - Message Passing mirrors forward and backward propagation
25:20 - How to deal with different output shapes
28:45 - Differentiable Normalization
30:20 - Virtual Residual Edges
34:40 - Meta-Batching
37:00 - Experimental Results
42:00 - Fine-Tuning experiments
45:25 - Public reception of the paper
ERRATA:
- Boris' name is obviously Boris, not Bori
- At 36:05, Boris mentions that they train the first variant, yet on closer examination, we decided it's more like the second
Paper: https://arxiv.org/abs/2110.13100
Code: https://github.com/facebookresearch/p...
Abstract:
Deep learning has been successful in automating the design of features in machine learning pipelines. However, the algorithms optimizing neural network parameters remain largely hand-designed and computationally inefficient. We study if we can use deep learning to directly predict these parameters by exploiting the past knowledge of training other networks. We introduce a large-scale dataset of diverse computational graphs of neural architectures - DeepNets-1M - and use it to explore parameter prediction on CIFAR-10 and ImageNet. By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass taking a fraction of a second, even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks. For example, it is able to predict all 24 million parameters of a ResNet-50 achieving a 60% accuracy on CIFAR-10. On ImageNet, top-5 accuracy of some of our networks approaches 50%. Our task along with the model and results can potentially lead to a new, more computationally efficient paradigm of training networks. Our model also learns a strong representation of neural architectures enabling their analysis.
Authors: Boris Knyazev, Michal Drozdzal, Graham W. Taylor, Adriana Romero-Soriano
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m

Nov 22, 2021 • 39min
Learning Rate Grafting: Transferability of Optimizer Tuning (Machine Learning Research Paper Review)
#grafting #adam #sgd
The last years of deep learning research have given rise to a plethora of different optimization algorithms, such as SGD, AdaGrad, Adam, LARS, LAMB, etc., which all claim to have their own peculiarities and advantages. In general, all of these algorithms modify two major things: the (implicit) learning rate schedule, and a correction to the gradient direction. This paper introduces grafting, which allows transferring the induced learning rate schedule of one optimizer to another. In doing so, the paper shows that much of the benefit of adaptive methods (e.g. Adam) is actually due to this schedule, and not necessarily to the gradient direction correction. Grafting allows for more fundamental research into differences and commonalities between optimizers, and a derived version of it makes it possible to compute static learning rate corrections for SGD, which potentially allows for large savings in GPU memory.
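As a hedged sketch of the core operation (a per-layer simplification, not the paper's exact algorithm), grafting combines the update magnitude of one optimizer with the update direction of another; the argument names are illustrative:
```python
import torch

def grafted_step(step_magnitude_opt, step_direction_opt, eps=1e-12):
    """Sketch of per-layer grafting: take the norm (implicit learning rate)
    of the update one optimizer would make and the direction of the update
    another optimizer would make, and combine them into one step."""
    m = step_magnitude_opt.norm()
    d = step_direction_opt
    return m * d / (d.norm() + eps)

# usage, assuming adam_update and sgd_update were computed for one layer:
# update = grafted_step(adam_update, sgd_update)   # Adam's schedule, SGD's direction
# param.data.add_(update)
```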
OUTLINE
0:00 - Rant about Reviewer #2
6:25 - Intro & Overview
12:25 - Adaptive Optimization Methods
20:15 - Grafting Algorithm
26:45 - Experimental Results
31:35 - Static Transfer of Learning Rate Ratios
35:25 - Conclusion & Discussion
Paper (OpenReview): https://openreview.net/forum?id=FpKgG...
Old Paper (Arxiv): https://arxiv.org/abs/2002.11803
Our Discord: https://discord.gg/4H8xxDF
Abstract:
In the empirical science of training large neural networks, the learning rate schedule is a notoriously challenging-to-tune hyperparameter, which can depend on all other properties (architecture, optimizer, batch size, dataset, regularization, ...) of the problem. In this work, we probe the entanglements between the optimizer and the learning rate schedule. We propose the technique of optimizer grafting, which allows for the transfer of the overall implicit step size schedule from a tuned optimizer to a new optimizer, preserving empirical performance. This provides a robust plug-and-play baseline for optimizer comparisons, leading to reductions to the computational cost of optimizer hyperparameter search. Using grafting, we discover a non-adaptive learning rate correction to SGD which allows it to train a BERT model to state-of-the-art performance. Besides providing a resource-saving tool for practitioners, the invariances discovered via grafting shed light on the successes and failure modes of optimizers in deep learning.
Authors: Anonymous (Under Review)
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n