Visually Grounded Interaction and Language (ViGIL)

NeurIPS 2019 Workshop, Vancouver, Canada

Friday, 13th December, 08:30 AM to 06:30 PM, Room: West 202 - 204

Latest Workshop: https://vigilworkshop.github.io

Introduction

The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, methods in this family suffer from limited semantic understanding, as they miss learning from the multimodal and interactive environment in which communication often takes place - the symbols of language thus are not grounded in anything concrete. The symbol grounding problem first highlighted this limitation, that "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [1].

On the other hand, humans acquire language by communicating about and interacting within a rich, perceptual environment. This behavior provides the necessary grounding for symbols by tying them to concrete objects or concepts, whether physical or psychological. Thus, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (ReferIt [2], Visual Question Answering [3-6], Visual Dialog [7-10], Captioning [11, 32-35], Visual-Audio Correspondence [30]) or through embodied agents performing interactive tasks [12-29] in physically simulated environments (DeepMind Lab [12], Baidu XWorld [13], Habitat [14], StreetLearn [18], AI2-THOR [21], House3D [22], Matterport3D [23], GIBSON [27], MINOS [28]), often drawing on the recent successes of deep learning and reinforcement learning. We believe this line of research poses a promising, long-term solution to the grounding problem faced by current, popular language understanding models.

While machine learning research exploring visually-grounded language learning may be in its earlier stages, it may be possible to draw insights from the rich research literature on human language acquisition. In neuroscience, recent progress in fMRI technology has enabled better understanding of the interaction between language, vision and other modalities [17, 18], suggesting that the brain shares neural representations of concepts across vision and language. In concurrent work, developmental cognitive scientists have argued that word acquisition in children is closely linked to their learning of the underlying physical concepts in the real world [15, 31], and that they generalize surprisingly well at this from sparse evidence [36].

This workshop thus aims to gather people from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share and debate their perspectives on why grounding may (or may not) be important in building machines that truly understand natural language.

Important Dates

Paper Submission Deadline	September 18, 2019 (11:59 PM Pacific time)
Decision Notifications	~~September 26, 2019~~ September 28, 2019
Workshop	December 13, 2019

Call for Papers

We invite high-quality paper submissions on the following topics:

language acquisition or learning through interactions
image/video captioning, visual dialogues, visual question-answering, and other visually grounded language challenges
reasoning in language and vision
transfer learning in language and vision tasks
audiovisual scene understanding and generation
navigation and question answering in virtual worlds with natural-language instructions
original multimodal works that can be extended to vision, language or interaction
human-machine interaction with vision and language
understanding and modeling the relationship between language and vision in humans
semantic systems and modeling of natural language and visual stimuli representations in the human brain
epistemology and research reflections about language grounding, human embodiment and other related topics
visual and linguistic cognition in infancy and/or adults

Submission

Please upload submissions at: cmt3.research.microsoft.com/VIGIL2019

Accepted papers will be presented during joint poster sessions, with exceptional submissions selected for spotlight oral presentations. Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals.

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should follow NeurIPS format. The CMT-based review process will be double-blind to avoid potential conflicts of interest.

We welcome published papers from *non-ML* conferences that are within the scope of the workshop (without re-formatting). These specific papers do not have to be anonymous. They are eligible for poster sessions and will only have a very light review process.

A limited pool of NeurIPS registrations might be available for accepted papers.

In case of any issues, feel free to email the workshop organizers at: vigilworkshop@gmail.com.

D-Day Workshop Information

Posters will be taped to the wall (we will provide tape). Please make sure they are printed on lightweight paper without lamination and no larger than 36 x 48 inches or 90 x 122 cm in portrait orientation.

Schedule

08:20 AM - 08:30 AM	Opening remarks [Video]
08:30 AM - 09:10 AM	Invited talk: Jason Baldridge [Video]
09:10 AM - 09:50 AM	Invited talk: Jesse Thomason [Video]
09:50 AM - 10:30 AM	Coffee break
10:30 AM - 10:50 AM	Spotlights [Video]: VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering (Cătălina Cangea, Eugene Belilovsky, Pietro Liò, Aaron Courville); General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping (Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge); Structural and functional learning for learning language use (Angeliki Lazaridou, Anna Potapenko, Olivier Tieleman); Deep compositional robotic planners that follow natural language commands (Yen-Ling Kuo, Boris Katz, Andrei Barbu); Analyzing Compositionality in Visual Question Answering (Sanjay Subramanian, Sameer Singh, Matt Gardner)
10:50 AM - 11:30 AM	Invited talk: Jay McClelland [Video]
11:30 AM - 12:10 PM	Invited talk: Louis-Philippe Morency [Video]
12:10 PM - 01:50 PM	Poster session + Lunch
01:50 PM - 02:30 PM	Invited talk: Lisa Anne Hendricks [Video]
02:30 PM - 03:10 PM	Invited talk: Linda Smith [Video]
03:10 PM - 04:00 PM	Poster session + Coffee break
04:00 PM - 04:40 PM	Invited talk: Timothy Lillicrap [Video]
04:40 PM - 05:20 PM	Invited talk: Josh Tenenbaum (presented by Jiayuan Mao) [Video]
05:20 PM - 06:00 PM	Panel Discussion [Video]
06:00 PM - 06:10 PM	Closing remarks

Invited Speakers

Linda Smith is a Distinguished Professor, Psychological and Brain Sciences at Indiana University. Her recent work at the intersection of cognitive development and machine learning focuses specifically on the statistics of infants' visual experience and how it affects concept and word learning. [Webpage]

Josh Tenenbaum is a Professor in Computational Cognitive Science at MIT. His work studies learning and reasoning in humans and machines, with the twin goals of understanding human intelligence in computational terms and bringing artificial intelligence closer to human-level capacities. [Webpage]

Jay McClelland is a Professor in the Psychology Department and Director of the Center for Mind, Brain and Computation at Stanford University. His research spans a broad range of topics in cognitive science and cognitive neuroscience, including perception and perceptual decision making; learning and memory; language and reading; semantic and mathematical cognition; and cognitive development. [Webpage]

Jesse Thomason is a postdoctoral researcher at the University of Washington. His research focuses on language grounding and natural language processing applications for robotics, including how dialog with humans can facilitate both robot task execution and learning. [Webpage]

Lisa Anne Hendricks is a research scientist at DeepMind (previously a PhD student in Computer Vision at UC Berkeley). Her work focuses on building systems which can express information about visual content using natural language and retrieve visual information given natural language. [Webpage]

Timothy Lillicrap is a research scientist at DeepMind. His research focuses on machine learning and statistics for optimal control and decision making, as well as using these mathematical frameworks to understand how the brain learns. He has developed algorithms and approaches for exploiting deep neural networks in the context of reinforcement learning, and recurrent memory architectures for one-shot learning. [Webpage]

Jason Baldridge is a research scientist at Google. His research focuses on the theoretical and applied aspects of computational linguistics -- from formal and computational models of syntax to machine learning for natural language processing and geotemporal grounding of natural language. [Webpage]

Louis-Philippe Morency is an Associate Professor at the Language Technology Institute at Carnegie Mellon University. His research focuses on building the computational foundations to enable computers with the abilities to analyze, recognize and predict subtle human communicative behaviors during social interactions. [Webpage]

Accepted Papers

Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
Khanh Nguyen (University of Maryland)*; Hal Daume (University of Maryland, College Park)
[PDF]
What is needed for simple spatial language capabilities in VQA?
Alexander Kuhnle (University of Cambridge)*; Ann Copestake (University of Cambridge)
[PDF] [Supplementary]
Learning from Observation-Only Demonstration for Task-Oriented Language Grounding via Self-Examination
Tsu-Jui Fu (UC Santa Barbara); Yuta Tsuboi (Preferred Networks)*; Sosuke Kobayashi (Preferred Networks); Yuta Kikuchi (Preferred Networks)
[PDF]
Not All Actions Are Equal: Learning to Stop in Language-Grounded Urban Navigation
Jiannan Xiang (University of Science and Technology of China); Xin Wang (University of California, Santa Barbara)*; William Yang Wang (UC Santa Barbara)
[PDF]
Hidden State Guidance: Improving Image Captioning Using an Image Conditioned Autoencoder
Jialin Wu (UT Austin)*; Raymond Mooney (Univ. of Texas at Austin)
[PDF]
Situated Grounding Facilitates Multimodal Concept Learning for AI
Nikhil Krishnaswamy (Brandeis University)*; James Pustejovsky (Brandeis University)
[PDF]
VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering
Cătălina Cangea (University of Cambridge)*; Eugene Belilovsky (Mila); Pietro Liò (University of Cambridge); Aaron Courville (Universite de Montreal)
[PDF]
Induced Attention Invariance: Defending VQA Models against Adversarial Attacks
Vasu Sharma (Carnegie Mellon University)*; Ankita Kalra (CMU); Louis-Philippe Morency (Carnegie Mellon University)
[PDF]
Natural Language Grounded Multitask Navigation
Xin Wang (University of California, Santa Barbara)*; Vihan Jain (Google Research); Eugene Ie (Google Research); William Yang Wang (UC Santa Barbara); Zornitsa Kozareva (Google Cloud); Sujith Ravi (Google Research)
[PDF]
Contextual Grounding of Natural Language Entities in Images
Farley Lai (NEC Laboratories America, Inc.)*; Ning Xie (Wright State University); Derek Doran (Wright State University); Asim Kadav (NEC Labs)
[PDF] [Code]
Visual Dialog for Radiology: Data Curation and First Steps
Olga Kovaleva (UMass Lowell)*; Chaitanya Shivade (IBM Research); Satyananda Kashyap (IBM Research); Karina Kanjaria (IBM Research); Adam Coy (IBM Research); Deddeh Ballah (IBM Research); Yufan Guo (IBM Research); Joy Wu (IBM Research); Alexandros Karargyris (IBM Research); David Beymer (IBM); Anna Rumshisky (University of Massachusetts Lowell); Vandana Mukherjee (IBM Research)
[PDF]
Multimodal Generative Learning Utilizing Jensen-Shannon-Divergence
Thomas Sutter ()*; Imant Daunhawer (ETH Zurich); Julia Vogt (ETH Zurich)
[PDF]
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
Guan-Lin Chao (Carnegie Mellon University)*; Abhinav Rastogi (Google); Semih Yavuz (University of California, Santa Barbara); Dilek Hakkani-Tur (Amazon Alexa AI); Jindong Chen (Google); Ian Lane (Carnegie Mellon University)
[PDF]
Structural and functional learning for learning language use
Angeliki Lazaridou (DeepMind)*; Anna Potapenko (DeepMind); Olivier Tieleman (DeepMind)
[PDF]
Community size effect in artificial learning systems
Olivier Tieleman (DeepMind)*; Angeliki Lazaridou (DeepMind); Shibl Mourad (DeepMind); Charles Blundell (DeepMind); Doina Precup (DeepMind)
[PDF]
CLOSURE: Assessing Systematic Generalization of CLEVR Models
Harm De Vries (Montreal Institute for Learning Algorithms); Dzmitry Bahdanau (University of Montreal)*; Shikhar Murty (MILA, UdeM); Aaron Courville (MILA, Université de Montréal); Philippe Beaudoin (Element AI)
[PDF]
A Comprehensive Analysis of Semantic Compositionality in Text-to-Image Generation
Chihiro Fujiyama (Ochanomizu University)*; Ichiro Kobayashi (Ochanomizu University, Tokyo)
[PDF]
Recurrent Instance Segmentation using Sequences of Referring Expressions
Alba Maria Herrera-Palacio (Universitat Politecnica de Catalunya); Carles Ventura (Universitat Oberta de Catalunya); Carina Silberer (Universitat Pompeu Fabra); Ionut-Teodor Sorodoc (Universitat Pompeu Fabra); Gemma Boleda (Universitat Pompeu Fabra); Xavier Giro-i-Nieto (Universitat Politecnica de Catalunya)*
[PDF] [Supplementary]
Visually Grounded Video Reasoning in Selective Attention Memory
T.S. Jayram (IBM Research)*; Vincent Albouy (IBM Research); Tomasz Kornuta (IBM Research, Almaden); Emre Sevgen (University of Chicago); Ahmet Ozcan (IBM Almaden Research)
[PDF] [Supplementary]
Modulated Self-attention Convolutional Network for VQA
Jean-Benoit Delbrouck (UMONS)*
[PDF]
General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
Gabriel Ilharco (University of Washington)*; Vihan Jain (Google Research); Alexander Ku (Google Research); Eugene Ie (Google Research); Jason Baldridge (Google Inc.)
[PDF]
A Simple Baseline for Visual Commonsense Reasoning
Jingxiang Lin (UIUC)*; Unnat Jain (UIUC); Alexander Schwing (UIUC)
[PDF]
Language Grounding through Social Interactions and Curiosity-Driven Multi-Goal Learning
Nicolas Lair (Inserm U1093 CAPS)*; Cédric Colas (Inria Bordeaux - Sud-Ouest); Rémy Portelas (Inria Bordeaux - Sud-Ouest); Jean-Michel Dussoux (Cloud Temple); Peter Dominey (INSERM); Pierre-Yves Oudeyer (Inria)
[PDF]
Deep compositional robotic planners that follow natural language commands
Yen-Ling Kuo (MIT)*; Boris Katz (MIT); Andrei Barbu (MIT)
Can adversarial training learn image captioning?
Jean-Benoit Delbrouck (UMONS)*
[PDF]
Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog
Shachi H Kumar (Intel Labs); Eda Okur (Intel Labs)*; Saurav Sahay (Intel); Jonathan Huang (Intel); Lama Nachman (Intel Labs)
[PDF]
Supervised Multimodal Bitransformers for Classifying Images and Text
Douwe Kiela (Facebook AI Research)*; Suvrat Bhooshan (Facebook); Hamed Firooz (Facebook); Davide Testuggine (Facebook)
[PDF]
Shaping Visual Representations with Language for Few-shot Classification
Jesse Mu (Stanford University)*; Percy Liang (Stanford University); Noah Goodman (Stanford University)
[PDF]
Self-Educated Language Agent with Hindsight Experience Replay for Instruction Following
Geoffrey Cideron (University of Lille)*; Mathieu Seurin (University of Lille); Florian Strub (DeepMind); Olivier Pietquin (Google Research - Brain Team)
[PDF]
Analyzing Compositionality in Visual Question Answering
Sanjay Subramanian (Allen Institute for Artificial Intelligence)*; Sameer Singh (University of California, Irvine); Matt Gardner (AI2)
[PDF]
On Agreements in Visual Understanding
Yassine Mrabet (NLM/NIH)*; Dina Demner-Fushman (NLM/NIH)
[PDF]
A perspective on multi-agent communication for information fusion
Homagni Saha (Iowa State University)*; Vijay Venkataraman (Honeywell); Alberto Speranzon (Honeywell); Soumik Sarkar (Iowa State University)
[PDF]
Cross-Modal Mapping for Generalized Zero-Shot Learning by Soft-Labeling
Shabnam Daghaghi ()*; Anshumali Shrivastava (Rice University); Tharun Medini (Rice University)
[PDF]
Learning Language from Vision
Candace Ross (Massachusetts Institute of Technology); Cheahuychou Mao (MIT); Boris Katz (MIT); Andrei Barbu (MIT)*
[PDF]
Commonsense and Semantic-Guided Navigation through Language in Embodied Environment
Dian Yu (University of California, Davis)*; Chandra Khatri (Uber); Alexandros Papangelis (UberAI); Andrea Madotto (Hong Kong University of Science and Technology); Mahdi Namazifar (Uber Technologies, Inc.); Joost Huizinga (UberAI); Adrien Ecoffet (UberAI); Huaixiu Zheng (Uber Technologies); Piero Molino (Uber AI); Jeff Clune (Uber AI Labs); Zhou Yu (UC Davis); Kenji Sagae (University of California, Davis); Gokhan Tur (Uber)