Visually Grounded Interaction and Language (ViGIL)
June 10, 2021. NAACL Workshop, Mexico City, Mexico.
ViGIL is a one-day interdisciplinary workshop that will push the boundaries of language grounding systems. We aim to bridge the fields of human cognition and machine learning through discussions on combining language, perception, and other modalities via interaction.
Schedule
08:50 AM | Opening Remarks
09:00 AM | Talk 1: Roger Levy, "Semantics, Pragmatics, and Context in Human Grounded Language Understanding" [Abstract] [Slides]
Abstract: Computational systems for grounded language understanding have seen impressive advances over the last decade, due largely to advances in multimodal datasets, neural and symbolic modeling techniques, and computational power. But human meaning interpretation in grounded contexts remains far deeper and more sophisticated. In this talk I describe several recent studies in our research group that illustrate the subtlety and richness of human meaning interpretation using very simple, experimentally controlled utterances and visual grounding contexts. These studies shed light on the compositional structure of the semantic representations underlying human language comprehension, their relationship with the pragmatic inference mechanisms that support contextually conditioned interpretation, and the likely requirements for truly human-like language understanding in artificial systems.
09:45 AM | Talk 2: Stefanie Tellex, "Towards Complex Language in Partially Observed Environments" [Abstract]
Abstract: Robots can act as a force multiplier for people, whether a robot assisting an astronaut with a repair on the International Space Station, a UAV taking flight over our cities, or an autonomous vehicle driving through our streets. Existing approaches use action-based representations that do not capture the goal-based meaning of a language expression and do not generalize to partially observed environments. The aim of my research program is to create autonomous robots that can understand complex goal-based commands and execute those commands in partially observed, dynamic environments. I will describe demonstrations of object-search in a POMDP setting with information about object locations provided by language, and mapping between English and Linear Temporal Logic, enabling a robot to understand complex natural language commands in city-scale environments. These advances represent steps towards robots that interpret complex natural language commands in partially observed environments using a decision-theoretic framework.
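As a hedged illustration of the English-to-LTL mapping mentioned in this abstract, a goal-based command can be compiled into a temporal-logic formula that a planner then tries to satisfy; the command, proposition names, and formula below are hypothetical examples, not taken from the talk.

```latex
% Hypothetical command: "Go to the kitchen, but avoid the wet floor on the way."
% Using the LTL "until" operator U and negation:
\varphi \;=\; \lnot\,\mathrm{wet\_floor} \;\; \mathcal{U} \;\; \mathrm{kitchen}
% Read: wet_floor must stay false until kitchen becomes true;
% a planner then searches for a policy whose execution traces satisfy \varphi.
```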
10:30 AM | Break 1
11:00 AM | Talk 3: Katerina Fragkiadaki, "Linking Language with World Common Sense Using 3D Visual Feature Representations" [Abstract]
Abstract: To link language processing with spatial reasoning, we propose associating natural language utterances to a mental workspace of their meaning, encoded as 3-dimensional visual feature representations of the world scenes they describe. We learn such 3-dimensional visual representations, which we call visual imaginations, by predicting images a mobile agent sees while moving around in the 3D world. The input image streams the agent collects are unprojected into egomotion-stable 3D feature maps of the scene, and projected from novel viewpoints to match the observed RGB image views in an end-to-end differentiable manner. We then train modular neural models to generate such 3D feature representations given language utterances, to localize the objects an utterance mentions in the 3D feature representation inferred from an image, and to predict the desired 3D object locations given a manipulation instruction. We empirically show that the proposed models outperform existing 2D models by a large margin in spatial reasoning, referential object detection, and instruction following, and generalize better across camera viewpoints and object arrangements.
11:45 AM | Talk 4: Max Garagnani, "Action-Perception Circuits for Word Learning and Semantic Grounding" [Abstract]
Abstract: Embodied semantic theories posit that word meaning is grounded in the perception and action systems of the human brain. Such theories are supported by a growing body of experimental results indicating that processing words belonging to specific semantic categories (e.g., visual-object-related words such as “sun” or motor-action-related words such as “run”) leads to selective activation of the corresponding modality-preferential areas.
I highlight a deep, spiking neurocomputational architecture of the left-hemispheric fronto-temporal areas that has been used to simulate and explain putative brain processes underlying word learning and semantic grounding in action and perception. The model closely replicates neuroanatomical and neurobiological features of the relevant brain regions and implements exclusively mechanisms mimicking known cellular- and synaptic-level features of the mammalian cortex. Lastly, I discuss some recent experimental evidence confirming the model’s main predictions and conclude by suggesting elements of a unifying theory for the emergence of cognition based on the spontaneous formation of cortically distributed action-perception circuits (APCs) in the brain.
12:30 PM | Break 2
13:00 PM | Panel Discussion
14:00 PM | Break 3
14:30 PM | Talk 5: Yejin Choi, "Grounded Causal Commonsense Reasoning" [Abstract] [Slides]
Abstract: In this talk, we will consider Harnad’s symbol grounding problem from three different angles: learning the functional meaning of objects and actions through interactions in a 3D environment, learning the grounded meaning of more complex language by watching YouTube videos at extreme scale, and learning causal commonsense inferences of the visual scenes through a large-scale symbolic knowledge graph.
15:15 PM | Talk 6: Justin Johnson, "Learning Visual Representations from Language" [Abstract]
Abstract: Standard practice in vision+language is to treat multimodal vision+language tasks as downstream from vision: generic unimodal representations are combined for multimodal end tasks. In this talk I'll argue that this should be flipped: multimodal vision+language tasks should be used to learn powerful representations that can be transferred to downstream visual representation tasks. Our approach, termed VirTex, uses image captioning as a pretext task for learning visual features. When trained on COCO captions, VirTex learns representations that match or exceed supervised ImageNet pretraining on many downstream visual recognition tasks. I will also discuss our efforts to scale up this algorithm, for which we've created a new dataset of 11.7M high-quality images and natural-language captions.
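The captioning-as-pretext recipe described above can be summarized in a short sketch. The PyTorch code below is a minimal illustration under assumed module names and sizes, not the authors' VirTex implementation: a ResNet-50 backbone produces spatial features, a small Transformer decoder predicts caption tokens from them, and after pretext training only the backbone would be transferred to downstream recognition tasks.

```python
# Minimal sketch (not the VirTex code) of captioning as a pretext task:
# train a visual backbone to support a caption decoder, then reuse the
# backbone alone for downstream visual recognition.
import torch
import torch.nn as nn
import torchvision


class CaptioningPretext(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        # Visual backbone: ResNet-50 trunk without avgpool/fc head.
        trunk = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])  # (B, 2048, 7, 7)
        self.project = nn.Conv2d(2048, d_model, kernel_size=1)
        # Textual head: a small Transformer decoder over caption tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.project(self.backbone(images))        # (B, d, 7, 7)
        memory = feats.flatten(2).transpose(1, 2)           # (B, 49, d)
        # Causal mask so each position only attends to earlier tokens.
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.to_vocab(out)                           # next-token logits


model = CaptioningPretext()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
# After pretext training, model.backbone would be transferred to downstream
# tasks (e.g., linear classification or detection heads).
```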
16:00 PM | Spotlight Presentations
16:10 PM | Poster Session
17:30 PM | Break
18:00 PM | Talk 7: Trevor Darrell (presented by Daniel Fried and Rudy Corona), "Modularity in Grounded Interaction" [Abstract] [Slides]
Abstract: Neural networks have made great strides in language grounding, but still leave room for improvement in robustness, ease of design, and interpretability. Modularity, a staple of complex system design, has the potential to help on all of these. We find that modular neural nets outperform their non-modular counterparts on a grounded collaborative dialogue task and in compositional generalization settings for embodied instruction following.
18:45 PM | Talk 8: Sandra Waxman, "How (and how early) do infants link language and cognition?" [Abstract]
Abstract: Language is a signature of our species. It is the pathway through which we share the contents of our minds, imagine new ideas and ignite them in others. But how, and how early, do infants link language and thought? How do they identify which signals are part of their language and discover how these are linked to fundamental representations of objects and events? Infants begin to forge this language-cognition interface in the first months of life. Even before they say their first words, listening to human language promotes core cognitive capacities, including object categorization and rule-learning. Moreover, this precocious link emerges from a broader template that initially includes vocalizations of non-human primates, but is rapidly tuned specifically to human vocalizations. I’ll describe an exquisitely timed developmental cascade, fueled by both ‘nature’ and ‘nurture’, leading infants to discover increasingly precise links between language and cognition, and use this link to learn about their world.
19:30 PM | Closing Remarks
Speakers
- Yejin Choi (University of Washington)
- Trevor Darrell (UC Berkeley)
- Katerina Fragkiadaki (Carnegie Mellon University)
- Max Garagnani (Goldsmiths, University of London)
- Justin Johnson (University of Michigan)
- Roger Levy (MIT)
- Stefanie Tellex (Brown University)
- Sandra Waxman (Northwestern University)
Accepted Papers
Note: 3 additional papers were accepted but are not listed here because of an anonymity period.
- [Spotlight] Emergent Communication of Generalizations
- [Spotlight] SocialAI 0.1: Towards a Benchmark to Stimulate Research on Socio-Cognitive Abilities in Deep Reinforcement Learning Agents
- [Spotlight] Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
- Curriculum Learning for Vision-Grounded Instruction Following
- RefineCap: Concept-Aware Refinement for Image Captioning
- Image Translation Model
- EVOQUER: Enhancing Temporal Grounding with Video-Pivoted Back Query Generation
- “Yes” and “No”: Visually Grounded Polar Answers
- VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator
- Interactive Learning from Activity Description
- Language-Conditional Imitation Learning
- Learning a natural-language to LTL executable semantic parser for grounded robotics
- Zero-Shot Generalization using Intrinsically Motivated Compositional Emergent Protocols
- Pose-Guided Sign Language Video GAN with Dynamic Lambda
- Extracting Phone Numbers from Adversarial & Visually Corrupted Text
- gComm: An environment for investigating generalization in grounded language acquisition
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation
- Towards Multi-Modal Text-Image Retrieval to improve Human Reading
- Attend, tell and ground: Weakly-supervised Object Grounding with Attention-based Conditional Variational Autoencoders
- Referring to the recently seen: reference and perceptual memory in situated dialog
- How Important are Visual Features for Visual Grounding? It Depends.
- Leveraging Language for Abstraction and Program Search
- SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
- Do Videos Guide Translations? Evaluation of a Video-Guided Machine Translation dataset
Organizers
University of Cambridge
Facebook AI Research
Stanford
Oregon State University
Oregon State University
MIT
DeepMind
Cornell
Georgia Tech
Scientific Committee
University of Montreal
DeepMind
Google Brain
University of Montreal | Element AI
Program Committee
- Abhishek Das (Facebook AI Research)
- Adria Recasens (DeepMind)
- Alane Suhr (Cornell)
- Anna Potapenko (DeepMind)
- Arjun Majumdar (Georgia Tech)
- Cătălina Cangea (University of Cambridge)
- Catherine Wong (Massachusetts Institute of Technology)
- Christopher Davis (University of Cambridge)
- Daniel Fried (UC Berkeley)
- Drew Hudson (Stanford University)
- Erik Wijmans (Georgia Tech)
- Florian Strub (DeepMind)
- Gabriel Ilharco (University of Washington)
- Geoffrey Cideron (InstaDeep)
- Hammad Ayyubi (Columbia University)
- Hao Tan (University of North Carolina Chapel Hill)
- Hao Wu (Fudan University)
- Haoyue Shi (Toyota Technological Institute at Chicago)
- Hedi Ben-younes (Sorbonne université)
- Jack Hessel (Allen Institute for AI)
- Jacob Krantz (Oregon State University)
- Jean-Baptiste Alayrac (DeepMind)
- Jiayuan Mao (MIT)
- Joel Ye (Georgia Tech)
- Johan Ferret (Google Research, Brain Team)
- Karan Desai (University of Michigan)
- Lisa Anne Hendricks (DeepMind)
- Luca Celotti (Université de Sherbrooke)
- Mateusz Malinowski (DeepMind)
- Mathieu Rita (École polytechnique)
- Mathieu Seurin (University of Lille)
- Meera Hahn (Georgia Institute of Technology)
- Nicholas Tomlin (UC Berkeley)
- Olivier Pietquin (Google Research, Brain Team)
- Peter Anderson (Google)
- Rodolfo Corona (UC Berkeley)
- Rowan Zellers (University of Washington)
- Ryan Benmalek (Cornell University)
- Sanjay Subramanian (Allen Institute for Artificial Intelligence)
- Sidd Karamcheti (Stanford University)
- Stefan Lee (Oregon State University)
- Valts Blukis (Cornell University)
- Volkan Cirik (Carnegie Mellon University)
Call for Papers
Authors are welcome to submit a 4-page paper based on in-progress work, or a relevant paper being presented at the main conference, on any of the following topics:
- grounded and interactive language acquisition;
- reasoning and planning in language, vision, and interactive domains;
- machine translation with visual cues;
- transfer learning in language and vision tasks;
- visual captioning, dialog, storytelling, and question-answering;
- visual synthesis from language;
- embodied agents: language instructions, agent co-ordination through language, interaction;
- language-grounded robotic learning with multimodal inputs;
- human-machine interaction with language through robots or within virtual worlds;
- audio-visual scene understanding and dialog systems;
- novel tasks that combine language, vision, interactions, and other modalities;
- understanding and modeling the relationship between language and vision in humans;
- semantic systems and modeling of natural language and visual stimuli representations in the human brain;
- epistemology and research reflections on language grounding, human embodiment, and other related topics;
- visual and linguistic cognition in infancy and/or adulthood.
We welcome review and position papers that may foster discussions. We also encourage published papers from *non-ML* venues (e.g., epistemology, cognitive science, psychology, neuroscience) that are within the scope of the workshop. Accepted papers will be presented during joint poster sessions, with exceptional submissions selected for spotlight oral presentations. Accepted papers will be made publicly available as *non-archival* reports, allowing future submissions to archival conferences or journals.
Submission Guidelines
Please upload submissions at: cmt3.research.microsoft.com/VIGIL2021.
- Previously published work: We welcome previously published papers from non-ML conferences and will also accept cross-submissions from ML conferences (including NAACL 2021) that are within the scope of the workshop, without re-formatting. These papers do not have to be anonymized; they are eligible for poster sessions and will undergo only a very light review process.
- Unpublished work: All submissions must be in PDF format and must be formatted using the NAACL 2021 LaTeX style file (see the sketch after this list). Submissions are limited to 4 content pages, including all figures and tables; additional pages containing acknowledgements, funding disclosures, and references are allowed. The maximum file size for submissions is 50MB. The CMT-based review process will be double-blind to avoid potential conflicts of interest.
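For orientation, a minimal submission skeleton might look like the sketch below. This is an illustrative assumption, not an official template: it presumes the NAACL 2021 style file is named naacl2021.sty and supports a review option; consult the official NAACL 2021 template for the exact file name and options.

```latex
% Minimal skeleton (a sketch; the style-file name and its options are
% assumptions -- check the official NAACL 2021 template before submitting).
\documentclass[11pt]{article}
\usepackage[review]{naacl2021}  % assumed: 'review' enables anonymous, line-numbered mode
\usepackage{times}
\usepackage{latexsym}

\title{Your 4-Page ViGIL Submission}
\author{Anonymous submission}

\begin{document}
\maketitle
\begin{abstract}
Up to 4 content pages; acknowledgements, funding disclosures, and
references may go on additional pages.
\end{abstract}

\section{Introduction}
% ...

\end{document}
```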
In case of any issues, feel free to email the workshop organizers at: vigilworkshop@gmail.com.
Introduction
Language is neither learned nor used in a vacuum, but rather grounded within a rich, embodied experience rife with physical groundings (vision, audition, touch) and social influences (pragmatic reasoning about interlocutors, commonsense reasoning, learning from interaction) [1]. For example, studies of language acquisition in children show a strong interdependence between perception, motor control, and language understanding [2]. Yet, AI research has traditionally carved out individual components of this multimodal puzzle—perception (computer vision, audio processing, haptics), interaction with the world or other agents (robotics, reinforcement learning), and natural language processing—rather than adopting an interdisciplinary approach.
This fractured lens makes it difficult to address key language understanding problems that future agents will face in the wild. For example, describing "a bird perched on the lowest branch singing in a high pitch trill" requires grounding to perception. Likewise, providing the instruction to "move the jack to the left so it pushes on the frame of the car" requires not only perceptual grounding, but also physical understanding. For these reasons, language, perception, and interaction should be learned and bootstrapped together. In the last several years, efforts to merge subsets of these areas have gained popularity through tasks like instruction-guided navigation in 3D environments [3–5], audio-visual navigation [6], video descriptions [7], question-answering [8–11], and language-conditioned robotic control [12, 13], though these primarily study disembodied problems via static datasets. As such, there remains considerable scientific uncertainty around how to bridge the gap from current monolithic systems to holistic agents. What are the tasks? The environments? How to design and train such models? To transfer knowledge between modalities? To perform multimodal reasoning? To deploy language agents in the wild?
As in past incarnations, the goal of this 4th ViGIL workshop is to support and promote this research direction by bringing together scientists from diverse backgrounds (natural language processing, machine learning, computer vision, robotics, neuroscience, cognitive science, psychology, and philosophy) to share their perspectives on language grounding, embodiment, and interaction. ViGIL provides a unique opportunity for interdisciplinary discussion. We intend to use this variety of perspectives to foster new ideas about how to define, evaluate, learn, and leverage language grounding. This one-day session will enable in-depth conversations on the boundaries of current work and promising avenues for future work, with the overall aim of bridging the scientific fields of human cognition and machine learning.
References
1. Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language. In EMNLP, 2020.
2. L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.
3. P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
4. A. Suhr, C. Yan, J. Schluger, S. Yu, H. Khader, M. Mouallem, I. Zhang, and Y. Artzi. Executing instructions in situated collaborative interactions. In EMNLP-IJCNLP, 2019.
5. D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi. Mapping instructions to actions in 3D environments with visual goal prediction. In EMNLP, 2018.
6. C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In ECCV, 2020.
7. S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV, 2015.
8. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
9. A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In CVPR, 2018.
10. R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
11. J. Lei, L. Yu, M. Bansal, and T. L. Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018.
12. S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek. Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems, 3:25–55, 2020.
13. V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi. Learning to map natural language instructions to physical quadcopter control using simulated flight. In CoRL, 2019.