Visually Grounded Interaction and Language (ViGIL)

June 10, 2021. NAACL Workshop, Mexico City, Mexico.

Underline Link (NAACL Registration Required)

VIGIL is a one-day interdisciplinary workshop that will push the boundaries of language grounding systems. We aim to bridge the fields of human cognition and machine learning through discussions on combining language, perception and other modalities via interaction.

Schedule

08:50 AM	Opening Remarks
09:00 AM	Talk 1: Roger Levy Semantics, Pragmatics, and Context in Human Grounded Language Understanding [Abstract] [Slides] Abstract: Computational systems for grounded language understanding have seen impressive advances over the last decade, due largely to advances in multimodal datasets, neural and symbolic modeling techniques, and computational power. But human meaning interpretation in grounded contexts remains far deeper and more sophisticated. In this talk I describe several recent studies in our research group that illustrate the subtlety and richness of human meaning interpretation using very simple, experimentally controlled utterances and visual grounding contexts. These studies shed light on the compositional structure of the semantic representations underlying human language comprehension, their relationship with the pragmatic inference mechanisms that support contextually conditioned interpretation, and the likely requirements for truly human-like language understanding in artificial systems.
09:45 AM	Talk 2: Stefanie Tellex Towards Complex Language in Partially Observed Environments [Abstract] Abstract: Robots can act as a force multiplier for people, whether a robot assisting an astronaut with a repair on the International Space station, a UAV taking flight over our cities, or an autonomous vehicle driving through our streets. Existing approaches use action-based representations that do not capture the goal-based meaning of a language expression and do not generalize to partially observed environments. The aim of my research program is to create autonomous robots that can understand complex goal-based commands and execute those commands in partially observed, dynamic environments. I will describe demonstrations of object-search in a POMDP setting with information about object locations provided by language, and mapping between English and Linear Temporal Logic, enabling a robot to understand complex natural language commands in city-scale environments. These advances represent steps towards robots that interpret complex natural language commands in partially observed environments using a decision theoretic framework.
10:30 AM	Break 1
11:00 AM	Talk 3: Katerina Fragkiadaki Linking Language with World Common Sense Using 3D Visual Feature Representations [Abstract] Abstract: To link language processing with spatial reasoning, we propose associating natural language utterances to a mental workspace of their meaning, encoded as 3-dimensional visual feature representations of the world scenes they describe. We learn such 3-dimensional visual representations---we call them visual imaginations--- by predicting images a mobile agent sees while moving around in the 3D world. The input image streams the agent collects are unprojected into egomotion-stable 3D scene feature maps of the scene, and projected from novel viewpoints to match the observed RGB image views in an end-to-end differentiable manner. We then train modular neural models to generate such 3D feature representations given language utterances, to localize the objects an utterance mentions in the 3D feature representation inferred from an image, and to predict the desired 3D object locations given a manipulation instruction. We empirically show the proposed models outperform by a large margin existing 2D models in spatial reasoning, referential object detection and instruction following, and generalize better across camera viewpoints and object arrangements.
11:45 AM	Talk 4: Max Garagnani Action-Perception Circuits for Word Learning and Semantic Grounding [Abstract] Abstract: Embodied semantic theories posit that word meaning is grounded in the perception and action systems of the human brain. Such theories are supported by a growing body of experimental results, indicating that processing of words belonging to specific semantic categories (e.g., visual object or motor action related, e.g. “sun” or “run”) leads to selective activation of corresponding modality-preferential areas. I highlight a deep, spiking neurocomputational architecture of the left-hemispheric fronto-temporal areas that has been used to simulate and explain putative brain processes underlying word learning and semantic grounding in action and perception. The model closely replicates neuroanatomical and neurobiological features of the relevant brain regions and implements exclusively mechanisms mimicking known cellular- and synaptic-level features of the mammalian cortex. Lastly I discuss some recent experimental evidence confirming the model’s main predictions and conclude by suggesting elements of a unifying theory for the emergence of cognition based on the spontaneous formation of cortically distributed action-perception circuits (APCs) in the brain.
12:30 PM	Break 2
13:00 PM	Panel Discussion
14:00 PM	Break 3
14:30 PM	Talk 5: Yejin Choi Grounded Causal Commonsense Reasoning [Abstract] [Slides] Abstract: In this talk, we will consider Harnad’s symbol grounding problem from three different angles: learning the functional meaning of objects and actions through interactions in a 3D environment, learning the grounded meaning of more complex language by watching YouTube videos at extreme scale, and learning causal commonsense inferences of the visual scenes through a large-scale symbolic knowledge graph.
15:15 PM	Talk 6: Justin Johnson Learning Visual Representations from Language [Abstract] Abstract: Standard practice in vision+language is to treat multimodal vision+language tasks as downstream from vision: generic unimodal representations are combined for multimodal end tasks. In this talk I'll argue that this should be flipped: multimodal vision+language tasks should be used to learn powerful representations that can be transferred to downstream visual representation tasks. Our approach, termed VirTex, uses image captioning as a pretext task for learning visual features. When trained on COCO captions, VirTex learns representations that match or exceed supervised ImageNet pretraining on many downstream visual recognition tasks. I will also discuss our efforts to scale up this algorithm, for which we've created a new dataset of 11.7M high-quality images and natural-language captions.
16:00 PM	Spotlight Presentations
16:10 PM	Poster
17:30 PM	Break
18:00 PM	Talk 7: Trevor Darrell (presented by Daniel Fried and Rudy Corona) Modularity in Grounded Interaction [Abstract] [Slides] Abstract: Neural networks have made great strides in language grounding, but still leave room for improvement in robustness, ease of design, and interpretability. Modularity, a staple of complex system design, has the potential to help on all of these. We find that modular neural nets outperform their non-modular counterparts on a grounded collaborative dialogue task and in compositional generalization settings for embodied instruction following.
18:45 PM	Talk 8: Sandra Waxman How (and how early) do infants link language and cognition? [Abstract] Abstract: Language is a signature of our species. It is the pathway through which we share the contents of our minds, imagine new ideas and ignite them in others. But how, and how early, do infants link language and thought? How do they identify which signals are part of their language and discover how these are linked to fundamental representations of objects and events? Infants begin to forge this language-cognition interface in the first months of life. Even before they say their first words, listening to human language promotes core cognitive capacities, including object categorization and rule-learning. Moreover, this precocious link emerges from a broader template that initially includes vocalizations of non-human primates, but is rapidly tuned specifically to human vocalizations. I’ll describe an exquisitely timed developmental cascade, fueled by both ‘nature’ and ‘nurture’, leading infants to discover increasingly precise links between language and cognition, and use this link to learn about their world.
19:30 PM	Closing Remark

Speakers

Yejin Choi

University of Washington

Trevor Darrell

Berkeley

Katerina Fragkiadaki

CMU

Max Garagnani

Goldsmiths, University of London

Justin Johnson

University of Michigan

Roger Levy

MIT

Stephanie Tellex

Brown University

Sandra Waxman

Northwestern University

Accepted Papers

Note: 3 additional papers were accepted but are not listed here because of an anonymity period.

[Spotlight] Emergent Communication of Generalizations
Jesse Mu (Stanford University); Noah Goodman (Stanford University)
[PDF] [Supp] [Spotlight Video]
[Spotlight] SocialAI 0.1: Towards a Benchmark to Stimulate Research on Socio-Cognitive Abilities in Deep Reinforcement Learning Agents
Rémy Portelas (Inria Bordeaux); Grgur Kovač (INRIA, Flowers team); Katja Hofmann (Microsoft Research); Pierre-Yves Oudeyer (Inria)
[PDF] [Spotlight Video]
[Spotlight] Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
Andrea Burns (Boston University); Deniz Arsan (University of Illinois at Urbana Champaign); Sanjna Agrawal (Boston University); Ranjitha Kumar (UIUC: CS); Kate Saenko (Boston University); Bryan Plummer (Boston University)
[PDF] [Spotlight Video]
Curriculum Learning for Vision-Grounded Instruction Following
Guan-Lin Chao (Carnegie Mellon University); Ian Lane (Carnegie Mellon University)
[PDF]
RefineCap: Concept-Aware Refinement for Image Captioning
Yekun Chai (Institute of Automation, Chinese Academy of Sciences); Shuo Jin (University of Pittsburgh); Junliang Xing (Institute of Automation, Chinese Academy of Sciences)
[PDF]
Image Translation Model
Puneet Jain (Google); Orhan Firat (Google ); Qi Ge (Google); Sihang Liang (Princeton University)
[PDF]
EVOQUER: Enhancing Temporal Grounding with Video-Pivoted Back Query Generation
Yanjun Gao (Penn State University NLP Lab); Lulu Liu (Penn State University); Jason Wang (Penn State University); Xin Chen (Kuaishou Technology); Huayan Wang (Kuaishou Technology); Rui Zhang (Penn State University)
[PDF]
“Yes” and “No”: Visually Grounded Polar Answers
Claudio Greco (University of Trento); Alberto Testoni (University of Trento); Raffaella Bernardi (University of Trento)
[PDF]
VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator
Ayush Shrivastava (Georgia Institute of Technology); Karthik Gopalakrishnan (Amazon Alexa AI); Yang Liu (Amazon Alexa AI); Robinson Piramuthu (Amazon Alexa AI); Gokhan Tur; Devi Parikh (Georgia Tech & Facebook AI Research); Dilek Hakkani-Tur (Amazon Alexa AI)
[PDF]
Interactive Learning from Activity Description
Khanh X Nguyen (University of Maryland); Dipendra Misra (Microsoft); Robert Schapire (Microsoft); Miroslav Dudik (Microsoft); Patrick Shafto (Rutgers University-Newark)
[PDF]
Language-Conditional Imitation Learning
Julian Skirzynski (University of California, San Diego), Bobak Baghi (McGill University), David Meger (McGill University)
[PDF]
Learning a natural-language to LTL executable semantic parser for grounded robotics
Christopher Wang (MIT); Candace Ross (Massachusetts Institute of Technology); Yen-Ling Kuo (MIT); Boris Katz (MIT); Andrei Barbu (MIT)
[PDF]
Zero-Shot Generalization using Intrinsically Motivated Compositional Emergent Protocols
Rishi Hazra (IISc Bangalore); Sonu Dixit (Indian Institue of Science); Sayambhu Sen (IISc, Bangalore)
[PDF]
Pose-Guided Sign Language Video GAN with Dynamic Lambda
Christopher Kissel (Beuth University of Applied Sciences Berlin); Christopher Kümmel (Beuth University of Applied Sciences Berlin); Dennis Ritter (Beuth University of Applied Sciences Berlin); Kristian Hildebrand (Beuth University of Applied Sciences Berlin)
[PDF]
Extracting Phone Numbers from Adversarial & Visually Corrupted Text
Timothy Forman (United States Naval Academy); Nathanael Chambers (United States Naval Academy)
[PDF]
gComm: An environment for investigating generalization in grounded language acquisition
Rishi Hazra (IISc Bangalore); Sonu Dixit (Indian Institue of Science)
[PDF]
Locate then Segment: A Strong Pipeline for Referring Image Segmentation
Ya Jing (Institute of Automation, Chinese Academy of Sciences); Tao Kong (Bytedance); Wei Wang (Institute of Automation Chinese Academy of Sciences); Liang Wang (NLPR, China); Lei Li (ByteDance AI Lab); Tieniu Tan (NLPR, China)
[PDF]
Towards Multi-Modal Text-Image Retrieval to improve Human Reading
Florian Schneider (University of Hamburg); Ozge Alacam (University of Hamburg); Xintong Wang (University of Hamburg); Chris Biemann (University of Hamburg)
[PDF]
Attend, tell and ground: Weakly-supervised Object Grounding with Attention-based Conditional Variational Autoencoders
Effrosyni Mavroudi (Johns Hopkins University); Rene Vidal (Johns Hopkins University, USA)
[PDF]
Referring to the recently seen: reference and perceptual memory in situated dialog
John Kelleher (Technological University Dublin); Simon Dobnik (University of Gothenburg); John D Kelleher (Technological University Dublin)
[PDF]
How Important are Visual Features for Visual Grounding? It Depends.
Fan Yang (Amazon); Prashan Wanigasekara (Amazon); Mingda Li (Amazon); Chengwei Su (Amazon); Emre Barut (Amazon)
[PDF]
Leveraging Language for Abstraction and Program Search
Catherine Wong (Massachusetts Institute of Technology); Kevin M Ellis (MIT); Jacob Andreas (MIT); Joshua Tenenbaum (MIT)
[PDF]
SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
Ioannis Kazakos (National Technical University of Athens); Carles Ventura (Universitat Oberta de Catalunya); Miriam Bellver (Barcelona Supercomputing Center); Carina Silberer (Institute for Natural Language Processing, University of Stuttgart); Xavier Giro-i-Nieto (Universitat Politecnica de Catalunya)
[PDF] [Supp]
Do Videos Guide Translations? Evaluation of a Video-Guided Machine Translation dataset
Zhishen Yang (Tokyo Institute of Technology); Tosho Hirasawa (Tokyo Metropolitan University); Naoaki Okazaki (Tokyo Institute of Technology); Mamoru Komachi (Tokyo Metropolitan University)
[PDF]

Organizers

Cătălina Cangea

University of Cambridge

Abhishek Das

Facebook AI Research

Drew Hudson

Stanford

Jacob Krantz

Oregon State University

Stefan Lee

Oregon State University

Jiayuan Mao

MIT

Florian Strub

DeepMind

Alane Suhr

Cornell

Erik Wijmans

Georgia Tech

Scientific Committee

Aaron Courville

University of Montreal

Mateusz Malinowski

DeepMind

Olivier Pietquin

Google Brain

Harm de Vries

University of Montreal | Element AI

Program Committee

Abhishek Das (Facebook AI Research)
Adria Recasens (DeepMind)
Alane Suhr (Cornell)
Anna Potapenko (DeepMind)
Arjun Majumdar (Georgia Tech)
Cătălina Cangea (University of Cambridge)
Catherine Wong (Massachusetts Institute of Technology)
Christopher Davis (University of Cambridge)
Daniel Fried (UC Berkeley)
Drew Hudson (Stanford University)
Erik Wijmans (Georgia Tech)
Florian Strub (DeepMind)
Gabriel Ilharco (University of Washington)
Geoffrey Cideron (InstaDeep)
Hammad Ayyubi (Columbia University)
Hao Tan (University of North Carolina Chapel Hill)
Hao Wu (Fudan University)
Haoyue Shi (Toyota Technological Institute at Chicago)
Hedi Ben-younes (Sorbonne université)
Jack Hessel (Allen Institute for AI)
Jacob Krantz (Oregon State University)
Jean-Baptiste Alayrac (DeepMind)

Jiayuan Mao (MIT)
Joel Ye (Georgia Tech)
Johan Ferret (Google Research, Brain Team)
Karan Desai (University of Michigan)
Lisa Anne Hendricks (DeepMind)
Luca Celotti (Université de Sherbrooke)
Mateusz Malinowski (DeepMind)
Mathieu Rita (École polytechnique)
Mathieu Seurin (University of Lille)
Meera Hahn (Georgia Institute of Technology)
Nicholas Tomlin (UC Berkeley)
Olivier Pietquin (2)
Peter Anderson (Google)
Rodolfo Corona (UC Berkeley)
Rowan Zellers (University of Washington)
Ryan Benmalek (Cornell University)
Sanjay Subramanian (Allen Institute for Artificial Intelligence)
Sidd Karamcheti (Stanford University)
Stefan Lee (Oregon State University)
Valts Blukis (Cornell University)
Volkan Cirik (Carnegie Mellon University)

Call for Papers

The authors are welcome to submit a 4-page paper based on in-progress work, or relevant paper being presented at the main conference, on any of the following topics:

grounded and interactive language acquisition;
reasoning and planning in language, vision, and interactive domains;
machine translation with visual cues;
transfer learning in language and vision tasks;
visual captioning, dialog, storytelling, and question-answering;
visual synthesis from language;
embodied agents: language instructions, agent co-ordination through language, interaction;
language-grounded robotic learning with multimodal inputs;
human-machine interaction with language through robots or within virtual world;
audio-visual scene understanding and dialog systems;
novel tasks that combine language, vision, interactions, and other modalities;
understanding and modeling the relationship between language and vision in humans;
semantic systems and modeling of natural language and visual stimuli representations in the human brain;
epistemology and research reflexions about language grounding, human embodiment and other related topics
visual and linguistic cognition in infancy and/or adults

We welcome review and positional papers that may foster discussions. We also encourage published papers from *non-ML* conferences, e.g. epistemology, cognitive science, psychology, neuroscience, that are within the scope of the workshop. Accepted papers will be presented during joint poster sessions, with exceptional submissions selected for spotlight oral presentations. Accepted papers will be made publicly available as *non-archival* reports, allowing future submissions to archival conferences or journals.

Submission Guidelines

Please upload submissions at: cmt3.research.microsoft.com/VIGIL2021.

Previously published work: We welcome previously published papers from non-ML conferences, will also accept cross-submissions from ML conferences (including NAACL 2021) which are within the scope of the workshop without re-formatting. These specific papers do not have to be anonymous. They are eligible for poster sessions and will only have a very light review process.
Unpublished work: All submissions must be in PDF format. The submissions must be formated using the NAACL 2021 LaTeX style file. Submissions are limited to 4 content pages, including all figures and tables; additional pages containing statements of acknowledgements and funding disclosures, and references are allowed. The maximum file size for submissions is 50MB. The CMT-based review process will be double-blind to avoid potential conflicts of interests.

In case of any issues, feel free to email the workshop organizers at: vigilworkshop@gmail.com.

Introduction

Language is neither learned nor used in a vacuum, but rather grounded within a rich, embodied experience rife with physical groundings (vision, audition, touch) and social influences (pragmatic reasoning about interlocutors, commonsense reasoning, learning from interaction) [1]. For example, studies of language acquisition in children show a strong interdependence between perception, motor control, and language understanding [2]. Yet, AI research has traditionally carved out individual components of this multimodal puzzle—perception (computer vision, audio processing, haptics), interaction with the world or other agents (robotics, reinforcement learning), and natural language processing—rather than adopting an interdisciplinary approach.

This fractured lens makes it difficult to address key language understanding problems that future agents will face in the wild. For example, describing "a bird perched on the lowest branch singing in a high pitch trill" requires grounding to perception. Likewise, providing the instruction to "move the jack to the left so it pushes on the frame of the car" requires not only perceptual grounding, but also physical understanding. For these reasons, language, perception, and interaction should be learned and bootstrapped together. In the last several years, efforts to merge subsets of these areas have gained popularity through tasks like instruction-guided navigation in 3D environments [3–5], audio-visual navigation [6], video descriptions [7], question-answering [8–11], and language-conditioned robotic control [12, 13], though these primarily study disembodied problems via static datasets. As such, there remains considerable scientific uncertainty around how to bridge the gap from current monolithic systems to holistic agents. What are the tasks? The environments? How to design and train such models? To transfer knowledge between modalities? To perform multimodal reasoning? To deploy language agents in the wild?

As in past incarnations, the goal of this 4th ViGIL workshop is to support and promote this research direction by bringing together scientists from diverse backgrounds—natural language processing, machine learning, computer vision, robotics, neuroscience, cognitive science, psychology, and philosophy—to share their perspectives on language grounding, embodiment, and interaction. ViGIL provides a unique opportunity for interdisciplinary discussion. We intend to utilize this variety of perspectives to foster new ideas about how to define, evaluate, learn, and leverage language grounding. This one-day session would enable in-depth conversations on understanding the boundaries of current work and establishing promising avenues for future work, with the overall aim to bridge the scientific fields of human cognition and machine learning.

Previous Sessions

References

Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language. In EMNLP, 2020.
L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005.
P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-andlanguage Navigation: Interpreting Visually-grounded Navigation Instructions in Real Environments. In CVPR, 2018.
A. Suhr, C. Yan, J. Schluger, S. Yu, H. Khader, M. Mouallem, I. Zhang, and Y. Artzi. Executing instructions in situated collaborative interactions. In EMNLP-IJCNLP, 2019.
D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi. Mapping instructions to actions in 3D environments with visual goal prediction. In EMNLP, 2018.
C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. SoundSpaces: Audio-Visual Navigation in 3D Environments. ECCV, 2019.
S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In ICCV, 2015.
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, L. C. Zitnick, and D. Parikh. VQA: Visual question answering. In CVPR, 2015.
A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied Question Answering. In CVPR, 2018.
R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
J. Lei, L. Yu, M. Bansal, and T. L. Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018.
S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek. Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems, 3:25–55, 2020.
V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi. Learning to map natural language instructions to physical quadcopter control using simulated flight. In CoRL, 2019.