Towards more natural and intuitive communication between humans and machines

5 September 2014


Current speech interfaces are often perceived as unintuitive and error-prone. In this talk I will present several strategies for making the human-machine interface more intuitive and less fragile: acoustic features that are more robust against noise, the learning of novel words, and approaches for capturing prosodic cues. Prosodic cues are known to play an important role in human-human communication, e.g. they distinguish questions from statements and emphasize certain words (such as novel information) in an utterance. Yet they are rarely used in spoken dialog systems.

I will start with the hierarchical spectro-temporal features we developed. Inspired by findings on neuron responses in the mammalian cortex, these features jointly extract spectral and temporal information, which is particularly beneficial in noisy conditions when combined with conventional spectral features.

Next, I will briefly present a framework for the efficient learning of novel words from only a few examples. This allows rapid adaptation to the way a particular user speaks, i.e. to words that are common for that user but rare in the overall population.

Finally, I will talk about my recent experiments on the audio-visual discrimination of prominent from non-prominent words. The target scenario is the detection of a spoken correction by the user; in this case the corrected word is expected to be highly prominent. I will show that there is large variation from speaker to speaker, and that for some speakers the features extracted from the visual channel are as strong as those extracted from the acoustic channel. Furthermore, I will highlight that modeling the prosodic contours via functional principal component analysis improves performance compared to using conventional functionals (i.e. mean, min, max, ...).
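To give a flavor of the spectro-temporal features mentioned above: a common way to extract joint spectro-temporal information is to filter a log-spectrogram with a bank of 2D Gabor kernels, each tuned to a spectral/temporal modulation pair. The sketch below is only an illustration of that general idea, not the hierarchical features from the talk; the kernel parameters and the toy spectrogram are invented for the example.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel_2d(omega_f, omega_t, size=15, sigma=3.0):
    """2D Gabor kernel: a Gaussian envelope times a plane-wave carrier,
    tuned to one spectral (omega_f) and one temporal (omega_t) modulation."""
    half = size // 2
    f, t = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(f**2 + t**2) / (2 * sigma**2))
    carrier = np.cos(omega_f * f + omega_t * t)
    return envelope * carrier

def spectro_temporal_features(log_spec, modulations):
    """Filter a log-spectrogram (freq x time) with a bank of 2D Gabor
    kernels and return one filtered map per kernel."""
    return np.stack([
        convolve2d(log_spec, gabor_kernel_2d(wf, wt),
                   mode="same", boundary="symm")
        for wf, wt in modulations
    ])

# toy log-spectrogram: a slowly rising "formant" track buried in noise
rng = np.random.default_rng(0)
spec = rng.normal(0.0, 0.1, (40, 100))
for t in range(100):
    spec[min(39, 10 + t // 5), t] += 1.0

feats = spectro_temporal_features(spec, [(0.5, 0.0), (0.0, 0.5), (0.5, 0.5)])
print(feats.shape)  # (3, 40, 100)
```

Each output map responds strongly where the spectrogram contains energy modulated at the kernel's tuned rates, which is what makes such features complementary to purely spectral ones in noise.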
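Learning a novel word from only a few examples can be illustrated with the classic template-matching approach: store one or more example feature sequences per word and classify a new utterance by its nearest template under dynamic time warping (DTW). This is a generic sketch of that idea, not the framework presented in the talk; the class name and the toy 1-D "features" are invented for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (frames x dims), via the standard cumulative-cost recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized

class FewShotWordRecognizer:
    """Stores a handful of example sequences per word and classifies a
    query by nearest-template DTW distance (hypothetical helper class)."""
    def __init__(self):
        self.templates = {}  # word -> list of (frames x dims) arrays

    def add_word(self, word, example):
        self.templates.setdefault(word, []).append(example)

    def classify(self, query):
        return min(
            (dtw_distance(query, ex), word)
            for word, exs in self.templates.items() for ex in exs
        )[1]

# toy sequences standing in for two invented words
rng = np.random.default_rng(2)
ramp = np.linspace(0, 1, 30)[:, None]
flat = np.full((25, 1), 0.5)
rec = FewShotWordRecognizer()
rec.add_word("up", ramp + rng.normal(0, 0.02, ramp.shape))
rec.add_word("level", flat + rng.normal(0, 0.02, flat.shape))
print(rec.classify(np.linspace(0, 1, 28)[:, None]))  # up
```

Because each word needs only one or two stored examples, such a template scheme adapts immediately to vocabulary that is common for one user but rare in the overall population.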
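The contrast between conventional functionals and functional PCA can be sketched as follows. Functionals collapse an F0 contour into a few summary statistics; FPCA instead treats each contour as a function, and (in the simple grid-sampled approximation shown here) runs PCA over contours resampled to a common length, so the leading components capture dominant contour *shapes* such as rises versus falls. The toy contours below are synthetic and only illustrate the idea.

```python
import numpy as np

def conventional_functionals(contour):
    """Baseline prosodic descriptors: summary statistics of a contour."""
    return np.array([contour.mean(), contour.min(), contour.max()])

def fpca_scores(contours, n_components=2):
    """Functional PCA, approximated by ordinary PCA on contours resampled
    to a common grid: rows of vt are principal contour shapes, the
    returned scores describe each contour in that shape basis."""
    X = np.stack(contours)
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# toy F0 contours on a common 50-point grid: noisy rises vs noisy falls
rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 50)
rises = [100 + 20 * grid + rng.normal(0, 1, 50) for _ in range(5)]
falls = [120 - 20 * grid + rng.normal(0, 1, 50) for _ in range(5)]

scores = fpca_scores(rises + falls)
print(scores.shape)  # (10, 2)
```

Note that mean, min, and max are nearly identical for these rises and falls, while the first FPCA score separates the two shapes cleanly; this is the kind of shape information that functionals discard.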