When humans communicate, we use deictic expressions to refer to objects in our surroundings and to place them in the context of our actions. In face-to-face interaction, we complement verbal expressions with gestures and therefore need not be overly precise in our verbal protocols. Our interlocutors hear our words, see our gestures, and even read our eyes. They interpret our deictic expressions, identify the referents, and they understand. If only machines could do likewise.
This thesis contributes to the research on multimodal conversational interfaces where humans are engaged in natural dialogues with computer systems:
- It provides precise data on multimodal interactions consisting of manual pointing gestures and eye gaze, which were collected using modern 3D tracking technology. In particular, new methods to measure the 3D point of regard of visual attention are introduced.
- It presents Gesture Space Volumes and Attention Volumes to represent gestural data and visual attention in space and provides informative 3D visualizations.
- It offers a data-driven model for human pointing, which specifies exactly how the direction of manual pointing is defined and, in particular, highlights the role of gaze in this process.
- It presents technologies for the recording, integration and analysis of multimodal data, which allow for a multimodal exploration of multimedia corpus data in an immersive virtual environment.
- It introduces the DRIVE framework for real-time deictic multimodal interaction in virtual reality, which implements the developed models.
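To make the pointing contribution above concrete, the following is a minimal geometric sketch of one widely used family of gaze-informed pointing models: the eye-fingertip ray, in which the pointing direction runs from the (dominant) eye through the index fingertip rather than along the arm or finger. This is an illustrative assumption, not the thesis's actual data-driven model; all coordinates and function names are hypothetical.

```python
# Hypothetical eye-fingertip ray model (illustration only, not the
# thesis's fitted model). Coordinates are metric 3D points (x, y, z).

def normalize(v):
    """Return v scaled to unit length."""
    mag = sum(c * c for c in v) ** 0.5
    return tuple(c / mag for c in v)

def pointing_ray(eye, fingertip):
    """Ray from the eye through the fingertip: (origin, unit direction)."""
    direction = normalize(tuple(f - e for e, f in zip(eye, fingertip)))
    return fingertip, direction

def intersect_plane_z(origin, direction, z=0.0):
    """Intersect the ray with a horizontal plane at height z (e.g. a table).

    Returns the 3D intersection point, or None if the ray is parallel
    to the plane or the plane lies behind the pointer.
    """
    if abs(direction[2]) < 1e-9:
        return None
    t = (z - origin[2]) / direction[2]
    if t < 0:
        return None
    return tuple(o + t * d for o, d in zip(origin, direction))

# Example: eye at (0, 0, 1.6) m, fingertip at (0.3, 0.4, 1.2) m,
# table surface at z = 0.8 m.
origin, direction = pointing_ray((0.0, 0.0, 1.6), (0.3, 0.4, 1.2))
target = intersect_plane_z(origin, direction, z=0.8)
```

A purely arm-based model would instead take the origin and direction from the forearm or finger orientation; the role of gaze in the thesis's model is precisely what distinguishes it from such body-only definitions.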