NLP Highlights

Allen Institute for Artificial Intelligence
undefined
May 26, 2020 • 44min

114 - Behavioral Testing of NLP Models, with Marco Tulio Ribeiro

We invited Marco Tulio Ribeiro, a Senior Researcher at Microsoft, to talk about evaluating NLP models using behavioral testing, a framework borrowed from Software Engineering. Marco describes three kinds of black-box tests the check whether NLP models satisfy certain necessary conditions. While breaking the standard IID assumption, this framework presents a way to evaluate whether NLP systems are ready for real-world use. We also discuss what capabilities can be tested using this framework, how one can come up with good tests, and the need for an evolving set of behavioral tests for NLP systems. Marco’s homepage: https://homes.cs.washington.edu/~marcotcr/
undefined
May 22, 2020 • 42min

113 - Managing Industry Research Teams, with Fernando Pereira

We invited Fernando Pereira, a VP and Distinguished Engineer at Google, where he leads NLU and ML research, to talk about managing NLP research teams in industry. Topics we discussed include prioritizing research against product development and effective collaboration with product teams, dealing with potential research interest mismatch between individuals and the company, managing publications, hiring new researchers, and diversity and inclusion.
undefined
May 13, 2020 • 33min

112 - Alignment of Multilingual Contextual Representations, with Steven Cao

We invited Steven Cao to talk about his paper on multilingual alignment of contextual word embeddings. We started by discussing how multilingual transformers work in general, and then focus on Steven’s work on aligning word representations. The core idea is to start from a list of words automatically aligned from parallel corpora and to ensure the representations of the aligned words are similar to each other while not moving too far away from their original representations. We discussed the experiments on the XNLI dataset in the paper, analysis, and the decision to do the alignment at word level and compare it to other possibilities such as aligning word pieces or higher level encoded representations in transformers. Paper: https://openreview.net/forum?id=r1xCMyBtPS Steven Cao’s webpage: https://stevenxcao.github.io/
undefined
Apr 27, 2020 • 38min

111 - Typologically diverse, multi-lingual, information-seeking questions, with Jon Clark

We invited Jon Clark from Google to talk about TyDi QA, a new question answering dataset, for this episode. The dataset contains information seeking questions in 11 languages that are typologically diverse, i.e., they differ from each other in terms of key structural and functional features. The questions in TyDiQA are information-seeking, like those in Natural Questions, which we discussed in the previous episode. In addition, TyDiQA also has questions collected in multiple languages using independent crowdsourcing pipelines, as opposed to some other multilingual QA datasets like XQuAD and MLQA where English data is translated into other languages. The dataset and the leaderboard can be accessed at https://ai.google.com/research/tydiqa.
undefined
Apr 6, 2020 • 44min

110 - Natural Questions, with Tom Kwiatkowski and Michael Collins

In this episode, Tom Kwiatkowski and Michael Collins talk about Natural Questions, a benchmark for question answering research. We discuss how the dataset was collected to reflect naturally-occurring questions, the criteria used for identifying short and long answers, how this dataset differs from other QA datasets, and how easy it might be to game the benchmark with superficial processing of the text. We also contrast the holistic design in Natural Questions to deliberately targeting specific linguistic phenomena of interest when building a QA dataset. Dataset: https://ai.google.com/research/NaturalQuestions Paper: https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00276
undefined
Mar 30, 2020 • 47min

109 - What Does Your Model Know About Language, with Ellie Pavlick

How do we know, in a concrete quantitative sense, what a deep learning model knows about language? In this episode, Ellie Pavlick talks about two broad directions to address this question: structural and behavioral analysis of models. In structural analysis, we often train a linear classifier for some linguistic phenomenon we'd like to probe (e.g., syntactic dependencies) while using the (frozen) weights of a model pre-trained on some tasks (e.g., masked language models). What can we conclude from the results of probing experiments? What does probing tell us about the linguistic abstractions encoded in each layer of an end-to-end pre-trained model? How well does it match classical NLP pipelines? How important is it to freeze the pre-trained weights in probing experiments? In contrast, behavioral analysis evaluates a model's ability to distinguish between inputs which respect vs. violate a linguistic phenomenon using acceptability or entailment tasks, e.g., can the model predict which is more likely: "dog bites man" vs. "man bites dog"? We discuss the significance of which format to use for behavioral tasks, and how easy it is for humans to perform such tasks. Ellie Pavlick's homepage: https://cs.brown.edu/people/epavlick/ BERT rediscovers the classical nlp pipeline , by Ian Tenney, Dipanjan Das, Ellie Pavlick https://arxiv.org/pdf/1905.05950.pdf?fbclid=IwAR3gzFibSBoDGdjqVu9Gq0mh1lDdRZa7dm42JuXXUfjG6rKZ44iHIOdV6jg Inherent Disagreements in Human Textual Inferences by Ellie Pavlick and Tom Kwiatkowski https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00293
undefined
Mar 23, 2020 • 50min

108 - Data-To-Text Generation, with Verena Rieser and Ondřej Dušek

In this episode we invite Verena Rieser and Ondřej Dušek on to talk to us about the complexities of generating natural language when you have some kind of structured meaning representation as input. We talk about when you might want to do this, which is often is some kind of a dialog system, but also generating game summaries, and even some language modeling work. We then talk about why this is hard, which in large part is due to the difficulty of collecting data, and how to evaluate the output of these systems. We then move on to discussing the details of a major challenge that Verena and Ondřej put on, called the end-to-end natural language generation challenge (E2E NLG). This was a dataset of task-based dialog generation focused on the restaurant domain, with some very innovative data collection techniques. They held a shared task with 16 participating teams in 2017, and the data has been further used since. We talk about the methods that people used for the task, and what we can learn today from what methods have been used on this data. Verena's website: https://sites.google.com/site/verenateresarieser/ Ondřej's website: https://tuetschek.github.io/ The E2E NLG Challenge that we talked about quite a bit: http://www.macs.hw.ac.uk/InteractionLab/E2E/
undefined
Feb 24, 2020 • 38min

107 - Multi-Modal Transformers, with Hao Tan and Mohit Bansal

In this episode, we invite Hao Tan and Mohit Bansal to talk about multi-modal training of transformers, focusing in particular on their EMNLP 2019 paper that introduced LXMERT, a vision+language transformer. We spend the first third of the episode talking about why you might want to have multi-modal representations. We then move to the specifics of LXMERT, including the model structure, the losses that are used to encourage cross-modal representations, and the data that is used. Along the way, we mention latent alignments between images and captions, the granularity of captions, and machine translation even comes up a few times. We conclude with some speculation on the future of multi-modal representations. Hao's website: http://www.cs.unc.edu/~airsplay/ Mohit's website: http://www.cs.unc.edu/~mbansal/ LXMERT paper: https://www.aclweb.org/anthology/D19-1514/
undefined
Feb 17, 2020 • 39min

106 - Ethical Considerations In NLP Research, with Emily Bender

In this episode, we talked to Emily Bender about the ethical considerations in developing NLP models and putting them in production. Emily cited specific examples of ethical issues, and talked about the kinds of potential concerns to keep in mind, both when releasing NLP models that will be used by real people, and also while conducting NLP research. We concluded by discussing a set of open-ended questions about designing tasks, collecting data, and publishing results, that Emily has put together towards addressing these concerns. Emily M. Bender is a Professor in the Department of Linguistics and an Adjunct Professor in the Department of Computer Science and Engineering at the University of Washington. She’s active on Twitter at @emilymbender.
undefined
5 snips
Feb 10, 2020 • 43min

105 - Question Generation, with Sudha Rao

In this episode we invite Sudha Rao to talk about question generation. We talk about different settings where you might want to generate questions: for human testing scenarios (rare), for data augmentation (has been done a bunch for SQuAD-like tasks), for detecting missing information / asking clarification questions, for dialog uses, and others. After giving an overview of the general area, we talk about the specifics of some of Sudha's work, including her ACL 2018 best paper on ranking clarification questions using EVPI. We conclude with a discussion of evaluating question generation, which is a hard problem, and what the exciting open questions there are in this research area. Sudha's website: https://raosudha.weebly.com/

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app