

NLP Highlights
Allen Institute for Artificial Intelligence
**The podcast is currently on hiatus. For more active NLP content, check out the Holistic Intelligence Podcast linked below.**
Welcome to the NLP highlights podcast, where we invite researchers to talk about their work in various areas in natural language processing. All views expressed belong to the hosts/guests, and do not represent their employers.
Episodes

Nov 16, 2018 • 41min
74 - Deep Reinforcement Learning Doesn't Work Yet, with Alex Irpan
Blog post by Alex Irpan titled "Deep Reinforcement Learning Doesn't Work Yet"
https://www.alexirpan.com/2018/02/14/rl-hard.html
In this episode, Alex Irpan talks about the limitations of current deep reinforcement learning methods and why we have a long way to go before they go mainstream. We discuss sample inefficiency, instability, the difficulty of designing reward functions, and overfitting to the environment. Alex concludes with a list of recommendations he has found useful when training models with deep reinforcement learning.
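To make the sample-inefficiency point concrete, here is a toy sketch of our own (not from the blog post): tabular Q-learning on a trivial 10-state chain still burns through thousands of environment steps, and real deep RL problems are vastly harder.

```python
# Toy tabular Q-learning on a 10-state chain (our own illustration, not from
# the blog post): even this trivial problem takes thousands of steps.
import random

N_STATES = 10            # states 0..9; reaching state 9 ends the episode with reward 1
ACTIONS = [-1, +1]       # step left or right
EPISODES = 500
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

q = [[0.0, 0.0] for _ in range(N_STATES)]
steps_taken = 0

for _ in range(EPISODES):
    s = 0
    while s != N_STATES - 1:
        if random.random() < EPSILON or q[s][0] == q[s][1]:
            a = random.randrange(2)            # explore (or break ties randomly)
        else:
            a = 0 if q[s][0] > q[s][1] else 1  # exploit the current value estimates
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        q[s][a] += ALPHA * (r + GAMMA * max(q[s_next]) - q[s][a])
        s = s_next
        steps_taken += 1

print(f"environment steps used: {steps_taken}")  # typically several thousand
```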

Nov 13, 2018 • 53min
73 - Supersense Disambiguation of English Prepositions and Possessives, with Nathan Schneider
ACL 2018 paper by Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, Omri Abend.
In this episode, Nathan discusses how the meaning of prepositions varies, proposes a hierarchy for classifying the semantics of function words (e.g., comparison, temporal, purpose), and describes empirical results using the provided dataset for disambiguating preposition semantics. Along the way, we talk about lexicon-based semantics, multilinguality and pragmatics.
https://www.semanticscholar.org/paper/Comprehensive-Supersense-Disambiguation-of-English-Schneider-Hwang/8310213af102913b9e74e7dfe6864f3aa62a5a5e

Oct 16, 2018 • 43min
72 - The Anatomy of a Question Answering Task, with Jordan Boyd-Graber
Our first episode in a new format: broader surveys of areas, instead of specific discussions on individual papers. In this episode, we talk with Jordan Boyd-Graber about question answering.
Matt starts the discussion by giving five different axes on which question answering tasks vary: (1) how complex is the language in the question, (2) what is the genre of the question / nature of the question semantics, (3) what is the context or knowledge source used to answer the question, (4) how much "reasoning" is required to answer the question, and (5) what's the format of the answer?
We talk about each of these in detail, giving examples from Jordan's and others' work. In the end, we conclude that "question answering" is a format for studying a particular phenomenon, not a phenomenon in itself. Sometimes it's useful to pose a phenomenon you want to study as a question answering task, and sometimes it's not.
During the conversation, Jordan mentioned the QANTA competition; you can find that here: http://qanta.org. We also talked about an adversarial question creation task for Quiz Bowl questions; the paper on that can be found here: https://www.semanticscholar.org/paper/Trick-Me-If-You-Can%3A-Adversarial-Writing-of-Trivia-Wallace-Boyd-Graber/11caf090fef96605d6d67c7505572b1a26796971.

Oct 12, 2018 • 34min
71 - DuoRC: Complex Language Understanding with Paraphrased Reading Comprehension, with Amrita Saha
ACL 2018 paper by Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, Karthik Sankaranarayanan
Amrita and colleagues at IBM Research introduced a harder dataset for "reading comprehension", where you have to answer questions about a given passage of text. Amrita joins us on the podcast to talk about why a new dataset is necessary, what makes this one unique and interesting, and how well initial baseline systems perform on it. Along the way, we talk about the problems with using BLEU or ROUGE as evaluation metrics for question answering systems.
https://www.semanticscholar.org/paper/DuoRC%3A-Towards-Complex-Language-Understanding-with-Saha-Aralikatte/1e70a4830840d48486ecfbc6c89b774cdd0b6399
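To illustrate the BLEU/ROUGE problem discussed in the episode, here is a small sketch (with invented answer strings, not examples from the DuoRC paper) showing how an n-gram overlap metric can rank a wrong answer above a correct paraphrase:

```python
# Why n-gram overlap is awkward for QA evaluation: a correct paraphrase of
# the reference answer can score lower than a wrong answer that happens to
# share surface words. Example strings are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu

reference = "he was shot by the sheriff".split()
correct_paraphrase = "the sheriff gunned him down".split()
wrong_but_overlapping = "he was shot by the bandit".split()

# Unigram BLEU (a single weight), since 4-gram matches vanish on short answers.
print(sentence_bleu([reference], correct_paraphrase, weights=(1.0,)))
print(sentence_bleu([reference], wrong_but_overlapping, weights=(1.0,)))
# The wrong answer scores far higher: lexical overlap is not correctness.
```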

Sep 18, 2018 • 41min
70 - Measuring the Evolution of a Scientific Field through Citation Frames, with David Jurgens
TACL 2018 paper (presented at ACL 2018) by David Jurgens, Srijan Kumar, Raine Hoover, Daniel A. McFarland, and Daniel Jurafsky
David comes on the podcast to talk to us about citation frames. We discuss the dataset they created by painstakingly annotating the "citation type" for all of the citations in a large collection of papers (around 2000 citations in total), then training a classifier on that data to annotate the rest of the ACL anthology. This process itself is interesting, including how exactly the citations are classified, and we talk about this for a bit. The second half of the podcast talks about the analysis that David and colleagues did using the (automatically) annotated ACL anthology, trying to gauge how the field has changed over time.
https://www.semanticscholar.org/paper/Measuring-the-Evolution-of-a-Scientific-Field-Jurgens-Kumar/65118f3a7463f54bdf9b9e5cdd655953a2488c2f
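The annotate-then-scale recipe David describes can be sketched in a few lines; the labels, sentences, and features below are invented for illustration, and the paper's actual citation-type taxonomy and classifier are richer:

```python
# Sketch of the recipe: train a classifier on a small set of hand-labeled
# citation contexts, then apply it to the unannotated remainder of a corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

annotated = [
    ("We follow the method of [CITE] for preprocessing.", "uses"),
    ("Unlike [CITE], we model context directly.", "contrast"),
    ("[CITE] first proposed attention for alignment.", "background"),
    ("Our results improve over [CITE] by two points.", "contrast"),
    ("We adopt the evaluation protocol of [CITE].", "uses"),
    ("Early work such as [CITE] studied this setting.", "background"),
]
texts, labels = zip(*annotated)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# Pretend this is a sentence from the unannotated rest of the anthology.
print(clf.predict(["We reuse the parser of [CITE] without modification."]))
```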

Sep 10, 2018 • 35min
69 - Second language acquisition modeling, with Burr Settles
A shared task held in conjunction with a NAACL 2018 workshop, organized by Burr Settles and collaborators at Duolingo.
Burr tells us about the shared task. The goal of the task was to predict errors that a language learner would make when doing exercises on Duolingo. We talk about the details of the data, why this particular data is interesting to study for second language acquisition, what could be better about it, and what systems people used to approach this task. We also talk a bit about what you could do with a system that can predict these kinds of errors to build better language learning systems.
https://www.semanticscholar.org/paper/Second-Language-Acquisition-Modeling-Settles-Brust/10685728fab1dfe9d1cf0cd4240ed687dd601ac6
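The prediction setup can be sketched as per-token binary classification; the features and labels below are invented for illustration (the real shared-task data includes much richer user, exercise, and morphological information):

```python
# Sketch of per-token error prediction: each row is one token in one
# exercise, labeled 1 if the learner erred on it. Toy data throughout.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rows = [
    ({"token": "ella", "pos": "PRON", "days_on_platform": 2}, 0),
    ({"token": "quiere", "pos": "VERB", "days_on_platform": 2}, 1),
    ({"token": "comer", "pos": "VERB", "days_on_platform": 30}, 0),
    ({"token": "quiere", "pos": "VERB", "days_on_platform": 30}, 0),
    ({"token": "manzana", "pos": "NOUN", "days_on_platform": 5}, 1),
    ({"token": "la", "pos": "DET", "days_on_platform": 5}, 0),
]
feats, labels = zip(*rows)

vec = DictVectorizer()
X = vec.fit_transform(feats)          # one-hot strings, pass numbers through
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# The shared task ranked systems by AUC on held-out tokens; here we just
# score the training rows as a smoke test.
print(roc_auc_score(labels, clf.predict_proba(X)[:, 1]))
```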

Sep 4, 2018 • 37min
68 - Neural models of factuality, with Rachel Rudinger
NAACL 2018 paper, by Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme
Rachel comes on to the podcast, telling us about what factuality is (did an event happen?), what datasets exist for doing this task (a few; they made a new, bigger one), and how to build models to predict factuality (turns out a vanilla biLSTM does quite well). Along the way, we have interesting discussions about how you decide what an "event" is, how you label factuality (whether something happened) on inherently uncertain text (like "I probably failed the test"), and how you might use a system that predicts factuality in some end task.
https://www.semanticscholar.org/paper/Neural-models-of-factuality-Rudinger-White/4d62a1e7819f9e3f8c837832c66659db5a6d9b37
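As a minimal sketch of the "vanilla biLSTM" baseline mentioned above: encode the sentence, read off the hidden state at the predicate's position, and regress to a factuality score (the paper uses a [-3, 3] scale). Sizes, the toy forward pass, and the loss choice here are our own assumptions:

```python
# Minimal biLSTM factuality regressor: a scalar score per event predicate.
import torch
import torch.nn as nn

class FactualityRegressor(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # 2x for the two LSTM directions

    def forward(self, token_ids, predicate_index):
        # token_ids: (batch, seq_len); predicate_index: (batch,)
        states, _ = self.lstm(self.emb(token_ids))       # (batch, seq, 2*hidden)
        batch = torch.arange(token_ids.size(0))
        pred_states = states[batch, predicate_index]     # state at the predicate
        return self.head(pred_states).squeeze(-1)        # (batch,) scalar scores

model = FactualityRegressor(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 7))   # two toy sentences of 7 tokens
pred_ix = torch.tensor([3, 1])            # position of each event predicate
scores = model(tokens, pred_ix)
loss = nn.SmoothL1Loss()(scores, torch.tensor([2.5, -1.0]))  # regression loss
```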

Aug 27, 2018 • 39min
67 - GLUE: A Multi-Task Benchmark and Analysis Platform, with Sam Bowman
Paper by Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
Sam comes on to tell us about GLUE. We talk about the motivation behind setting up a benchmark framework for natural language understanding, how the authors defined "NLU" and chose the tasks for this benchmark, a very nice diagnostic dataset that was constructed for GLUE, and what insight they gained from the experiments they've run so far. We also have some musings about the utility of general-purpose sentence vectors, and about leaderboards.
https://www.semanticscholar.org/paper/GLUE%3A-A-Multi-Task-Benchmark-and-Analysis-Platform-Wang-Singh/a2054eff8b4efe0f1f53d88c08446f9492ae07c1
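One convenient way to poke at the benchmark today (an assumption of this sketch: the HuggingFace `datasets` library, which postdates the paper, hosts the tasks under the "glue" name):

```python
# Load two GLUE tasks and inspect one example from each.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # single-sentence sentiment task
mnli = load_dataset("glue", "mnli")   # sentence-pair NLI task
print(sst2["train"][0])               # sentence plus binary label
print(mnli["train"][0])               # premise/hypothesis pair plus label
```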

Aug 20, 2018 • 26min
66 - Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods, with Jieyu Zhao
NAACL 2018 paper, by Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang.
Jieyu comes on the podcast to talk about bias in coreference resolution models. This bias makes models rely disproportionately on gender when making decisions for whether "she" refers to a noun like "secretary" or "physician". Jieyu and her co-authors show that coreference systems do not actually exhibit much bias in standard evaluation settings (OntoNotes), perhaps because there is a broad document context to aid in making coreference decisions. But they then construct a really nice diagnostic dataset that isolates simple coreference decisions, and evaluates whether the model is using common sense, grammar, or gender bias to make those decisions. This dataset shows that current models are quite biased, particularly when it comes to common sense, using gender to make incorrect coreference decisions. Jieyu then tells us about some simple methods to correct the bias without much of a drop in overall accuracy.
https://www.semanticscholar.org/paper/Gender-Bias-in-Coreference-Resolution%3A-Evaluation-Zhao-Wang/e4a31322ed60479a6ae05d1f2580dd0fa2d77e50
Also, there was a very similar paper also published at NAACL 2018 that used similar methodology and constructed a similar dataset: https://www.semanticscholar.org/paper/Gender-Bias-in-Coreference-Resolution-Rudinger-Naradowsky/be2c8b5ec0eee2f32da950db1b6cf8cc4a621f8f.
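The core trick of the diagnostic dataset can be sketched with templates: sentence pairs that differ only in the pronoun, so any accuracy gap between pro-stereotypical and anti-stereotypical versions isolates gender bias. The occupations and template below are illustrative, not the paper's exact lists:

```python
# Generate pro-/anti-stereotypical sentence pairs for probing a coreference
# system. The gold antecedent of the pronoun is occ1 in both versions.
TEMPLATE = "The {occ1} called the {occ2} because {pron} needed help."

# (occupation pair, pronoun that stereotypically matches occ1)
cases = [(("physician", "secretary"), "he"), (("nurse", "mechanic"), "she")]

for (occ1, occ2), stereo_pron in cases:
    anti_pron = "she" if stereo_pron == "he" else "he"
    pro = TEMPLATE.format(occ1=occ1, occ2=occ2, pron=stereo_pron)
    anti = TEMPLATE.format(occ1=occ1, occ2=occ2, pron=anti_pron)
    # Feed both to a coreference model; a biased model gets "pro" right
    # and "anti" wrong, since only the gender cue changed.
    print("PRO: ", pro)
    print("ANTI:", anti)
```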

Aug 13, 2018 • 39min
65 - Event Representations with Tensor-based Compositions, with Niranjan Balasubramanian
AAAI 2018 paper by Noah Weber, Niranjan Balasubramanian, and Nathanael Chambers
Niranjan joins us on the podcast to tell us about his latest contribution in a line of work going back to Schank's scripts. This work tries to model sequences of events to get coherent narrative schemas, mined from large collections of text. For example, given an event like "She threw a football", you might expect future events involving catching, running, or scoring, but if the event is instead "She threw a bomb", you would expect future events involving explosions, damage, or arrests. We spend much of our conversation talking about why these scripts are interesting to study and how one might learn them from text, and spend a little time on the particular contribution of this paper: a better model that captures interactions among all of the arguments to an event.
https://www.semanticscholar.org/paper/Event-Representations-With-Tensor-Based-Weber-Balasubramanian/418f405a60b8d9009099777f7ae37f4496542f90
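The flavor of the tensor-based composition can be sketched in a few lines of numpy: the subject and object are combined through a predicate-specific 3-way tensor, so the arguments interact multiplicatively rather than just being summed or concatenated. Dimensions and random parameters below are invented; the paper's models are trained, not random:

```python
# Toy sketch: compose event arguments through a 3-way tensor so that every
# subject dimension interacts with every object dimension.
import numpy as np

d = 50
rng = np.random.default_rng(0)
subj = rng.standard_normal(d)              # embedding of "she"
obj_football = rng.standard_normal(d)      # embedding of "football"
obj_bomb = rng.standard_normal(d)          # embedding of "bomb"
W_threw = rng.standard_normal((d, d, d))   # tensor for the predicate "threw"

# event_k = sum_ij W[i, j, k] * subj_i * obj_j
# Swapping "football" for "bomb" can move the event vector to a very
# different neighborhood, even though subject and verb are unchanged.
event_football = np.einsum("ijk,i,j->k", W_threw, subj, obj_football)
event_bomb = np.einsum("ijk,i,j->k", W_threw, subj, obj_bomb)

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(event_football, event_bomb))     # near zero for this toy setup
```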