Transformers and self-attention have
become the dominant approach
to natural language processing, with systems such as BERT and GPT-3
rapidly displacing more established RNN and CNN architectures. Recent
results have shown that transformers are also well suited to multi-modal
perception and multi-modal interaction.
In this advanced tutorial we review the emergence of attention
for bilingual language models, and show how this led to the transformer
architecture composed of stacked encoder and decoder layers using
multi-headed attention. We discuss techniques for token and position
embeddings for natural language, and show how these can be trained
using a masked language model. We describe how this approach can
be extended to multiple modalities by concatenating encodings of
modalities, and discuss problems and approaches for adapting transformers
for use with computer vision and spoken language interaction. We
conclude with a review of current research challenges, performance
evaluation metrics and benchmark data sets, followed by a discussion of
potential applications such as multimodal sentiment analysis, affective
interaction, and narrative description of activities.
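The multi-headed attention at the heart of the encoder and decoder layers mentioned above can be summarized in a few lines. The following is a minimal NumPy sketch of scaled dot-product multi-head self-attention; the dimensions, random weights, and function names are illustrative assumptions, not part of any particular system covered in the tutorial.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product multi-head self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def project(W):
        # Project, then split the model dimension into heads:
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    # Per-head attention weights from scaled dot products.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    A = softmax(scores, axis=-1)
    # Weighted sum of values, heads concatenated back, output projection.
    out = (A @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(Y.shape)  # (5, 16)
```

In a full transformer layer this operation is followed by a residual connection, layer normalization, and a position-wise feed-forward network, and it is stacked to form the encoder and decoder.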
Francois Yvon is a senior researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique) at the Univ. Paris Saclay. He currently focuses mainly on machine translation using statistical and neural methods and, more generally, on machine learning applied to both written and spoken multilingual language data.
Marc Evrard is a junior professor (Maître de Conférences) at Univ. Paris Saclay and works as a researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique). He received his doctorate in Computer Science from the Univ. Paris-Sud (now Univ. Paris Saclay) in 2015, with a thesis on expressive text-to-speech synthesis. His research focuses on speech processing and natural language processing in the context of digital humanities.
Camille Guinaudeau is a junior professor (Maître de Conférences) at Univ. Paris Saclay and works as a researcher in the Spoken Language Processing Group of the CNRS LISN Laboratory (Laboratoire Interdisciplinaire des Sciences du Numérique). She received her doctorate in 2011 with a thesis on the automatic structuring of TV streams. Her current research concerns spoken language processing, information retrieval, and the structure of multimedia documents.
James L. Crowley
is a Professor Emeritus of the
Institut Polytechnique de Grenoble (Grenoble INP). Prior to
October 2021, he taught courses in Computer Vision, Machine
Learning and Intelligent Systems, and directed the Pervasive
Interaction research team at the INRIA Grenoble Research Center. In 2019 he
was appointed to the chair of Collaborative Intelligent Systems at the
Univ. Grenoble Alpes Multidisciplinary AI Institute (MIAI), where he
remains active in research on Artificial Intelligence and Multimodal
Human-Computer Interaction. His current research combines multi-modal
perception with cognitive modeling to explore new forms of interaction
with intelligent systems using multimodal Transformers trained with
self-supervised learning.