
Peter & Boris — Fine-tuning OpenAI's GPT-3

Gradient Dissent: Conversations on AI


Train Your Tokenizer for Different Languages?

A token is a sort of unit: we have about 50 thousand of these tokens, and we map them onto sequences of characters, so that a common word like hi or the ends up being one token. That just makes it easier and more efficient for these language models to consume text. In principle, you can actually do it at the character level as well; it just gets very inefficient.

But I would think that might make foreign languages really hard. Like, for example, would Asian languages be impossible then, if they have far more tokens? Or I guess maybe you could argue they've sort of done the tokenisation for you by
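As a rough illustration of the tokenisation being described, here is a minimal sketch using OpenAI's open-source tiktoken library with the r50k_base encoding, the roughly 50,000-token vocabulary used by the original GPT-3 models. The library and the specific example strings are assumptions for illustration; they are not mentioned in the episode.

```python
# A minimal sketch, assuming the `tiktoken` package is installed (pip install tiktoken).
# The library itself is not mentioned in the episode; it is used here only to illustrate
# how a ~50k-token BPE vocabulary splits text.
import tiktoken

# r50k_base is the ~50,000-token BPE vocabulary used by the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

for text in ["hi", "the", "the cat sat on the mat", "你好，世界"]:
    token_ids = enc.encode(text)
    # Decode each token individually to see how the text was split.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```

On this vocabulary, short common English words typically come back as a single token, while each Chinese character tends to split into two or three byte-level tokens, which is the efficiency gap the question above is getting at.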

