
Long Context Language Models and their Biological Applications with Eric Nguyen - #690
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Training HyenaDNA on the Human Genome and Evo on a Wider Set of Genomes
HyenaDNA is initially trained on the human genome, roughly 3 billion nucleotides, using a simple vocabulary of just four bases. Training takes arbitrarily long sequences and performs next-token, that is, next-nucleotide, prediction; the challenge lies in scaling that process to the sheer volume of data involved. The focus is now shifting toward a larger DNA foundation model, Evo, which incorporates millions of genomes and species. DNA is a rich source of biological data, with trillions of tokens available for training across species.
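To make the setup concrete, here is a minimal sketch of next-nucleotide prediction with a four-base vocabulary. The shift-by-one target construction is the standard causal language-modeling recipe the episode describes; the embedding size and the toy linear head are placeholders, not HyenaDNA's actual architecture or configuration.

```python
# Sketch of next-nucleotide prediction over a 4-base DNA vocabulary.
# The toy model below stands in for the Hyena architecture (assumption:
# dimensions and layers here are illustrative only).
import torch

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}  # single-nucleotide tokens

def encode(seq: str) -> torch.Tensor:
    """Map a DNA string to a tensor of token ids."""
    return torch.tensor([VOCAB[base] for base in seq], dtype=torch.long)

def make_lm_pair(seq: str):
    """Build (input, target) for causal LM: predict each next nucleotide."""
    ids = encode(seq)
    return ids[:-1], ids[1:]  # targets are inputs shifted by one position

x, y = make_lm_pair("ACGTTGCA")
embed = torch.nn.Embedding(len(VOCAB), 16)
head = torch.nn.Linear(16, len(VOCAB))
logits = head(embed(x))  # shape: (seq_len, vocab_size)
loss = torch.nn.functional.cross_entropy(logits, y)
print(f"next-nucleotide cross-entropy: {loss.item():.3f}")
```

The same recipe scales from one genome to many: training on millions of genomes only changes the data pipeline, not the four-token vocabulary or the prediction objective.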