Active Information Gathering to Disambiguate Referring Expressions

Natural language has the potential to be a powerful interface for communication in human-robot interaction. Robots can be trained to retrieve objects described by unconstrained natural language expressions. To do so, the robot must ground the object using the semantic and spatial information provided in the referring expression. Existing grounding work requires the referring expression to describe the referred object uniquely. To relax this assumption, we need a way to handle the uncertainty arising from ambiguous expressions. In this thesis, we investigate how the visual and spatial attributes of objects can be used to gather information about the referred object. While there are many ways to gather information, directly asking questions is the most natural. Specifically, we focus on asking questions about the spatio-semantic attributes of the objects. Further, the robot should ask the most informative questions to minimise user vexation. We first formulate the problem as a Partially Observable Markov Decision Process (POMDP) to decide when to ask and what to ask. We then formulate a surrogate objective function to find the most informative spatio-semantic question, and show that this objective is adaptive submodular, which guarantees that a greedy solution is near-optimal. In both approaches, we utilise INGRESS' grounding-by-generation pipeline to compose questions. Experimental results show that the proposed approaches are about 11% more accurate and 33% faster than the baselines, and can handle unseen objects described with unconstrained language expressions.
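The greedy question-selection idea summarised above can be sketched as follows. This is a toy illustration, not the thesis implementation: it assumes a discrete belief over candidate objects and yes/no attribute questions, and uses expected entropy reduction (information gain) as a stand-in for the surrogate objective; all names and the `gain_threshold` parameter are illustrative.

```python
import math

def entropy(belief):
    """Shannon entropy (bits) of a belief {object: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def bayes_update(belief, question, answer, likelihood):
    """Posterior over objects given an answer: P(o | a) ∝ P(a | o, q) P(o)."""
    post = {o: likelihood(question, o, answer) * p for o, p in belief.items()}
    z = sum(post.values())
    return {o: p / z for o, p in post.items()} if z > 0 else dict(belief)

def expected_info_gain(belief, question, likelihood, answers=("yes", "no")):
    """Expected reduction in belief entropy from asking `question`."""
    h0 = entropy(belief)
    gain = 0.0
    for a in answers:
        # Marginal probability of the answer under the current belief.
        p_a = sum(likelihood(question, o, a) * p for o, p in belief.items())
        if p_a > 0:
            gain += p_a * (h0 - entropy(bayes_update(belief, question, a, likelihood)))
    return gain

def choose_action(belief, questions, likelihood, gain_threshold=0.5):
    """Greedily pick the most informative question; grasp the most likely
    object once no question is informative enough to be worth asking."""
    best_q = max(questions, key=lambda q: expected_info_gain(belief, q, likelihood))
    if expected_info_gain(belief, best_q, likelihood) > gain_threshold:
        return ("ask", best_q)
    return ("grasp", max(belief, key=belief.get))
```

With two equally likely candidate mugs, one red and one blue, the colour question yields one full bit of expected information gain, so the robot asks; after the answer resolves the ambiguity, the residual gain is zero and the robot grasps.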