Transformers and self-attention have
become the dominant approach
to natural language processing, with systems such as BERT and GPT-3
rapidly displacing more established RNN and CNN architectures. Recent
results have shown that transformers are also well suited to multi-modal
perception and multi-modal interaction.
In this advanced tutorial we review the emergence of attention
for bilingual language models, and show how this led to the transformer
architecture composed of stacked encoder and decoder layers using
multi-headed attention. We discuss techniques for token and position
embeddings for natural language, and show how these can be trained
using a masked language model. We describe how this approach can
be extended to multiple modalities by concatenating encodings of
modalities, and discuss problems and approaches for adapting transformers
for use with computer vision and spoken language interaction. We
conclude with a review of current research challenges, performance
evaluation metrics and benchmark data sets, followed by a discussion of
potential applications such as multimodal sentiment analysis, affective
interaction, and narrative description of activities.
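The multi-headed attention at the heart of the encoder and decoder layers mentioned above can be summarized in a few lines. The following is a minimal NumPy sketch of scaled dot-product multi-head self-attention; the dimensions, random weights, and function names are illustrative assumptions, not part of any particular system covered in the tutorial.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product multi-head self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def project(W):
        # Project, then split the model dimension into heads:
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    # Per-head attention weights from scaled dot products.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    A = softmax(scores, axis=-1)
    # Weighted sum of values, heads concatenated back, output projection.
    out = (A @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(Y.shape)  # (5, 16)
```

In a full transformer layer this operation is followed by a residual connection, layer normalization, and a position-wise feed-forward network, and it is stacked to form the encoder and decoder.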
Francois Yvon is a senior researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique) at the Univ. Paris Saclay. He currently focuses mainly on machine translation using statistical and neural methods and, more generally, on machine learning applied to both written and spoken multilingual language data.
Marc Evrard is a junior professor (Maître de Conférences) at Univ. Paris Saclay and works as a researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique). He received his doctorate in Computer Science from the Univ. Paris-Sud (now Univ. Paris Saclay) in 2015, with a thesis on expressive text-to-speech synthesis. His research focuses on speech processing and natural language processing in the context of digital humanities.
Camille Guinaudeau is a junior professor (Maître de Conférences) at Univ. Paris Saclay and works as a researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique). She received her doctorate in 2011 with a thesis on the automatic structuring of TV streams. Her current research concerns spoken language processing, information retrieval, and the structure of multimedia documents.
James L. Crowley
is a Professor Emeritus of the
Institut Polytechnique de Grenoble (Grenoble INP). Prior to
October 2021, he taught courses in Computer Vision, Machine
Learning and Intelligent Systems, and directed the Pervasive
Interaction research team at the INRIA Grenoble Research Center. In 2019 he
was appointed to the chair of Collaborative Intelligent Systems at the
Univ. Grenoble Alpes Multidisciplinary AI Institute (MIAI), where he
remains active in research on Artificial Intelligence and Multimodal
Human-Computer Interaction. His current research combines multi-modal
perception with cognitive modeling to explore new forms of interaction
with intelligent systems using multimodal Transformers trained with
self-supervised learning.