Think of a transformer as a series of blocks, and each of these blocks has attention and a little multi-layer perceptron (MLP). So you go off into a block and you come back to this residual pathway. And then you have a number of these blocks arranged sequentially. Because of the residual pathway, in the backward pass, the gradients flow along it uninterrupted.
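This "go off into a block and come back" idea can be sketched numerically. The toy code below is an illustration, not a real transformer: a tanh is a hypothetical stand-in for the attention + MLP computation inside a block, and the names (`block_forward`, `block_backward`) are made up for the sketch. The point is that each layer computes x + f(x), so the local derivative is 1 + f'(x), and the "1" from the residual pathway carries the gradient backward through every layer uninterrupted.

```python
import numpy as np

def block_forward(x):
    # hypothetical stand-in for the attention + MLP inside one block
    return np.tanh(x)

def block_backward(x, grad_out):
    # local gradient of the stand-in block (derivative of tanh)
    return grad_out * (1.0 - np.tanh(x) ** 2)

def forward(x, n_layers):
    cache = []
    for _ in range(n_layers):
        cache.append(x)
        x = x + block_forward(x)   # residual connection: x + f(x)
    return x, cache

def backward(grad_out, cache):
    grad = grad_out
    for x in reversed(cache):
        # d/dx (x + f(x)) = 1 + f'(x): the "1" is the residual
        # pathway, so the gradient passes through unchanged and the
        # block just adds its own contribution on top.
        grad = grad + block_backward(x, grad)
    return grad

x = np.array(0.5)
out, cache = forward(x, n_layers=8)
g = backward(np.array(1.0), cache)
print(g)   # stays at least 1.0: the residual stream keeps it from vanishing
```

If the blocks were composed without the residual additions, the gradient would instead be a product of eight tanh derivatives, each less than 1, and would shrink toward zero; the additive pathway is what avoids that.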