
Machine Learning Street Talk (MLST) Facebook Research - Unsupervised Translation of Programming Languages
Jun 24, 2020
Marie-Anne Lachaux, Baptiste Roziere, and Guillaume Lample are researchers at Facebook AI Research (FAIR) in Paris, specializing in the unsupervised translation of programming languages. They discuss their method, which leverages shared embeddings and tokenization to translate source code between languages. The conversation covers the balance between human insight and machine learning in coding, the challenges posed by structural differences between languages, and the collaborative culture that fuels innovation at FAIR.
AI Snips
Unsupervised Translation
- Unsupervised machine translation models learn a shared embedding space for different languages.
- Similar concepts are mapped to similar locations in this space, regardless of the language.
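As a toy illustration of the shared embedding space (the vectors below are made up for illustration, not the model's actual embeddings): analogous concepts from different languages end up near each other, which a simple cosine similarity can show.

```python
import math

# Hypothetical 3-d embeddings: analogous concepts in different
# languages sit near each other in the shared space.
embeddings = {
    ("java",   "ArrayList"): (0.9, 0.1, 0.2),
    ("python", "list"):      (0.8, 0.2, 0.1),
    ("java",   "HashMap"):   (0.1, 0.9, 0.3),
    ("python", "dict"):      (0.2, 0.8, 0.2),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Cross-language pairs for the same concept score high...
sim_same = cosine(embeddings[("java", "ArrayList")], embeddings[("python", "list")])
# ...while unrelated concepts score lower.
sim_diff = cosine(embeddings[("java", "ArrayList")], embeddings[("python", "dict")])
```

In the real model, these proximities emerge from training rather than being hand-assigned, which is what lets the decoder treat "the same concept in another language" as a nearby point.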
Shared Vocabulary
- Shared vocabularies and word piece tokenization help align different languages in unsupervised translation.
- Special language tokens guide the decoder to generate the correct target language.
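A minimal sketch of these two ideas, using a crude regex tokenizer as a stand-in for the word-piece/BPE tokenization the researchers describe (the `<python>` token name is illustrative, not the paper's exact symbol):

```python
import re

def tokenize(code):
    """Crude identifier/number/symbol tokenizer — a stand-in for
    word-piece/BPE, just to show vocabulary overlap across languages."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

java_src   = "if (x > 0) { return x; }"
python_src = "if x > 0: return x"

java_toks = tokenize(java_src)
py_toks   = tokenize(python_src)

# One vocabulary is built over all languages, so common tokens
# (keywords, operators, identifiers) are shared and reuse one embedding.
shared = set(java_toks) & set(py_toks)

# A special language token prepended to the decoder input tells it
# which language to generate (Java in, Python out here).
decoder_input = ["<python>"] + java_toks
```

The shared tokens (`if`, `return`, `x`, `>`, `0`) are what anchor the two languages in the same embedding space, while the language token is the only signal the decoder needs to pick the output language.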
Unsupervised Translator
- The researchers trained an unsupervised translator for programming languages like Java, Python, and C++.
- Previous methods were mostly rule-based, requiring extensive expertise and lacking generalizability.