
129 - Transformers and Hierarchical Structure, with Shunyu Yao
NLP Highlights
Generalization to Longer Sequence Lengths?
In our theoretical construction, generalizing from shorter inputs to longer inputs is essentially automatic, because that is the fundamental strength of the distributed way of processing sequences. Whereas if you want to generalize from a smaller depth to a larger depth, thinking about the hard-attention construction, it requires stacking up more layers. And for the soft-attention network, it actually depends on the details of how you represent the depth information; it really depends on what the maximum depth is that you count on.
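To make the depth point concrete, here is a minimal, hypothetical Python sketch (not the construction discussed in the episode): it encodes the nesting depth of a Dyck-style bracket string with an assumed fixed bound D_MAX, so longer inputs of bounded depth pose no problem, while inputs deeper than D_MAX overflow the representation.

```python
# Hypothetical illustration: per-token depth encoding with a fixed bound D_MAX.
# Longer sequences of bounded depth are handled uniformly, but depths beyond
# D_MAX collapse into one slot, which is why generalizing to greater depth is
# harder than generalizing to longer sequences.

from typing import List

D_MAX = 8  # assumed maximum nesting depth baked into the representation


def depth_one_hot(depth: int) -> List[int]:
    """One-hot vector of length D_MAX + 1; depths beyond D_MAX overflow into the last slot."""
    vec = [0] * (D_MAX + 1)
    vec[min(depth, D_MAX)] = 1
    return vec


def encode_depths(brackets: str) -> List[List[int]]:
    """Per-token depth encodings for a Dyck-style string such as '(()(()))'."""
    depth = 0
    encodings = []
    for ch in brackets:
        if ch == "(":
            depth += 1
        encodings.append(depth_one_hot(depth))
        if ch == ")":
            depth -= 1
    return encodings


# Longer input, same bounded depth: the encoding works for any length.
print(encode_depths("()" * 5))
# Input deeper than D_MAX: depths 9 and 10 collapse onto the overflow slot.
print(encode_depths("(" * 10 + ")" * 10))
```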