Exploring Quantization in Transformers
This chapter investigates quantization methods for transformers, focusing on improving quantizability by mitigating outlier issues. It discusses the implications of attention mechanisms and the challenges of updating token representations, comparing traditional softmax attention with gated attention variants. The findings show that while quantization generally outperforms pruning, the choice between the two techniques can depend on the specific scenario.
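The outlier issue the chapter refers to can be illustrated with a minimal sketch: under symmetric per-tensor int8 quantization, a single large activation inflates the scale factor, squeezing the remaining values into far fewer quantization levels and raising the overall error. The quantization scheme below is a generic illustration, not the specific method discussed in the chapter.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: scale set by the max magnitude."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=1024).astype(np.float32)

# Quantization error on well-behaved activations.
q, s = quantize_int8(x)
err_clean = np.abs(x - dequantize(q, s)).mean()

# A single large outlier inflates the scale, so the bulk of the
# distribution is represented with far coarser quantization steps.
x_out = x.copy()
x_out[0] = 100.0
q2, s2 = quantize_int8(x_out)
err_outlier = np.abs(x_out - dequantize(q2, s2)).mean()

print(err_clean < err_outlier)  # outliers raise the mean quantization error
```

This is why outlier mitigation matters for quantizability: taming the largest activations lets the fixed int8 range cover the typical values more finely.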