How Do You Tokenize Text?
Tokenization is the entry point to actually making all the data look like a sequence, because tokens are just these little puzzle pieces: we break anything down into these puzzle pieces. In Gato, for the text, there's a lot of prior work. You usually tokenize text by looking at commonly used substrings, so a common substring becomes a token. At the current level of granularity of tokenization, a token is maybe two to five characters. I don't know the exact statistics, but that's to give you an idea.
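The "commonly used substrings" idea can be sketched as byte-pair encoding (BPE), a standard subword tokenization scheme: start from individual characters and repeatedly merge the most frequent adjacent pair into a new token. This is a minimal illustrative sketch, not the exact tokenizer used in Gato; the corpus and function names are made up for the example.

```python
from collections import Counter

def get_pair_counts(tokens):
    # Count how often each adjacent pair of symbols occurs across the corpus.
    counts = Counter()
    for word in tokens:
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = []
    for word in tokens:
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

def learn_bpe(corpus, num_merges):
    # Start from single characters; each merge creates one new subword token.
    tokens = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break
        best = max(counts, key=counts.get)
        tokens = merge_pair(tokens, best)
        merges.append(best)
    return tokens, merges

# Hypothetical toy corpus: frequent substrings like "low" get merged first.
tokens, merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=3)
```

After a few merges, frequent words collapse into single tokens while rarer words stay split into multi-character pieces, which is why typical tokens end up a few characters long.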