
63: Nihongo - Chris Vasselli
Launched
The Science of Tokenization
There's a whole area of natural language processing around that called tokenization. It's just a much harder problem in Japanese, and in Chinese too, because Chinese doesn't have spaces between words either. The route I probably should have gone was using one of the existing open-source libraries for this, like MeCab, which was the popular one at the time. But I ended up building my own, and the trick is that I really wanted it to be tightly integrated with the dictionary, for example.
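For context, here's a minimal sketch of the "existing library" route Chris mentions, using MeCab through its Python bindings. The package names and the exact segmentation shown are assumptions for illustration; the episode only names the library, and this is not how Nihongo's own tokenizer works.

```python
# A minimal sketch of Japanese tokenization with MeCab, assuming the
# mecab-python3 bindings and a dictionary are installed:
#   pip install mecab-python3 unidic-lite
import MeCab

# "-Owakati" tells MeCab to output the sentence as space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

# Japanese text has no spaces between words, so the segmenter must
# decide where one word ends and the next begins.
sentence = "日本語の文には単語の間にスペースがありません"
print(tagger.parse(sentence).strip())
# Roughly (exact segmentation depends on the dictionary used):
# 日本語 の 文 に は 単語 の 間 に スペース が あり ませ ん
```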