
129 - Transformers and Hierarchical Structure, with Shunyu Yao
NLP Highlights
Generalization to Longer Sequence Lengths?
In our theoretical construction, generalizing from shorter inputs to longer inputs is essentially automatic, because that is the fundamental strength of the distributed way of processing sequences. Whereas if you want to generalize from a smaller depth to a larger depth, thinking about the hard-attention construction, it requires stacking up more layers. And for the soft-attention network, it actually depends on the details of how you represent the depth information; it really depends on what the maximum depth is that you count on.
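To make the depth point concrete, here is a minimal, hypothetical Python sketch (not the construction discussed in the episode): it encodes the nesting depth of a Dyck-style bracket string with an assumed fixed bound D_MAX, so longer inputs of bounded depth pose no problem, while inputs deeper than D_MAX overflow the representation.

```python
# Hypothetical illustration: per-token depth encoding with a fixed bound D_MAX.
# Longer sequences of bounded depth are handled uniformly, but depths beyond
# D_MAX collapse into one slot, which is why generalizing to greater depth is
# harder than generalizing to longer sequences.

from typing import List

D_MAX = 8  # assumed maximum nesting depth baked into the representation


def depth_one_hot(depth: int) -> List[int]:
    """One-hot vector of length D_MAX + 1; depths beyond D_MAX overflow into the last slot."""
    vec = [0] * (D_MAX + 1)
    vec[min(depth, D_MAX)] = 1
    return vec


def encode_depths(brackets: str) -> List[List[int]]:
    """Per-token depth encodings for a Dyck-style string such as '(()(()))'."""
    depth = 0
    encodings = []
    for ch in brackets:
        if ch == "(":
            depth += 1
        encodings.append(depth_one_hot(depth))
        if ch == ")":
            depth -= 1
    return encodings


# Longer input, same bounded depth: the encoding works for any length.
print(encode_depths("()" * 5))
# Input deeper than D_MAX: depths 9 and 10 collapse onto the overflow slot.
print(encode_depths("(" * 10 + ")" * 10))
```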