How to Scale a Deep Transformer
The maximum we trained was forty-eight layers, on this small dataset with a very small batch size. And overall, that data is just five thousand or less for training, less than five thousand queries. Once you are able to train this in a stable fashion, the transformer plus the relation-aware transformer can already perform a lot of the reasoning. The improvement on these hard cases is huge.