
129 - Transformers and Hierarchical Structure, with Shunyu Yao
NLP Highlights
Generalization to Longer Sequence Lengths?
In our theoretical construction, it's kind of automatic how you generalize from shorter inputs to longer inputs, because that is the fundamental strength of a distributed way of doing sequence processing. Whereas if you want to generalize from a smaller depth to a larger depth, for the hard-attention construction it requires stacking up more layers, and for the soft-attention network it actually depends on some details of how you represent the depth information. It really depends on what is the maximum depth that you count on.
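As a rough illustration (not from the episode, and assuming the hierarchical structure discussed is bracket nesting), the "depth information" at each position is just the nesting depth: a longer sequence at the same maximum depth reuses the same depth values, while deeper nesting introduces new, larger ones that the representation must accommodate. A minimal Python sketch:

```python
# Hypothetical illustration: the per-position "depth information" for a
# bracket sequence is its nesting depth. Length generalization keeps the
# same set of depth values; depth generalization requires representing
# new, larger depths.

def nesting_depths(tokens):
    """Return the nesting depth at each position of a bracket sequence."""
    depth = 0
    depths = []
    for tok in tokens:
        if tok == "(":
            depth += 1      # opening bracket increases depth
        depths.append(depth)
        if tok == ")":
            depth -= 1      # closing bracket decreases depth
    return depths

# Longer, but same maximum depth (2): same depth values reappear.
print(nesting_depths(list("(()())")))   # [1, 2, 2, 2, 2, 1]
# Same length, but deeper nesting: a new depth value (3) must be represented.
print(nesting_depths(list("((()))")))   # [1, 2, 3, 3, 2, 1]
```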