
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Lex Fridman Podcast

DeepSeek's Complex MoE Implementation

  • DeepSeek's mixture-of-experts (MoE) implementation is unusually aggressive, with a high sparsity factor: only 8 of 256 experts are activated per token.
  • That sparsity forces the model to be split across GPUs with several kinds of parallelism, which creates load-balancing and communication-scheduling challenges.
  • DeepSeek's innovation is in the routing mechanism: instead of the usual auxiliary load-balancing loss, they add a per-expert bias term that balances expert usage across batches (see the sketch after this list).
  • This auxiliary-loss-free approach, potentially a world first, tackles problems like expert idling and wasted compute in sparse MoE models.
  • Serving the model still requires complex parallelism to shard experts across GPUs and route token traffic efficiently without creating bottlenecks.
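As a rough illustration of the auxiliary-loss-free idea discussed above, here is a minimal sketch (not DeepSeek's actual code): each expert carries a bias that is added to its gating score only when choosing the top-k experts, and that bias is nudged after each batch so overloaded experts are picked less often next time. The function names, the sigmoid gate, and the update rate `gamma` are illustrative assumptions.

```python
import torch

def route_tokens(gate_logits, expert_bias, k=8):
    """Pick top-k experts per token.

    The bias is used only for *selection*; the combine weights still come
    from the raw gate scores, so balancing does not distort the output mix.
    """
    scores = torch.sigmoid(gate_logits)                 # [tokens, n_experts]
    topk = torch.topk(scores + expert_bias, k, dim=-1).indices
    weights = torch.gather(scores, -1, topk)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return topk, weights

def update_bias(expert_bias, topk, n_experts, gamma=1e-3):
    """After a batch, lower the bias of overloaded experts and raise the
    bias of underloaded ones (no gradient, no auxiliary loss term)."""
    load = torch.bincount(topk.flatten(), minlength=n_experts).float()
    target = topk.numel() / n_experts                   # ideal tokens per expert
    expert_bias -= gamma * torch.sign(load - target)    # in-place nudge
    return expert_bias

# Toy usage: 16 tokens routed over 256 experts, 8 active each.
n_experts, k = 256, 8
bias = torch.zeros(n_experts)
logits = torch.randn(16, n_experts)
experts, weights = route_tokens(logits, bias, k)
bias = update_bias(bias, experts, n_experts)
```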