
ARCHIVE: Open Models (with Arthur Mensch) and Video Models (with Stefano Ermon)

AI + a16z

NOTE

Efficient Models with Sparse Mixture of Experts

Mistral's sparse mixture of experts duplicates the dense feed-forward layers of a transformer into multiple experts, and each token is routed to a specific subset of experts for processing. As a result, only about 12 billion of the model's 46 billion total parameters are active per token. This improves performance, latency, throughput, and efficiency, surpassing even a highly compressed 12-billion-parameter dense transformer. The sparse mixture of experts proves more efficient during both training and inference.
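To make the routing idea concrete, here is a minimal sketch of a sparse mixture-of-experts layer with top-2 routing in PyTorch. It illustrates the general technique described above, not Mistral's actual implementation; the class name, expert count, and layer sizes are hypothetical.

# Minimal sketch of a sparse mixture-of-experts layer with top-2 routing.
# Illustrative only; not Mistral's implementation. Sizes and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a copy of the transformer's feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                             # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize the kept scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Process only the tokens routed to this expert.
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

# Only the selected experts run per token, so the active parameter count per token
# is a fraction of the total, which is why sparse MoE is cheaper at inference.
layer = SparseMoELayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])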
