Interaction Capabilities

Attention management in smart environments

Attention is a valuable resource for human users. The CSRA provides a unique opportunity to investigate attentional behavior in more natural situations, with the aim of developing a consistent approach to how an assistive system should address a user's attention. One important strand of research deals with the question of how to keep the user's attention and how to behave attentively in foreground interactions. In future research, we will also investigate less resource-demanding approaches.

Re-gaining the user's attention

Systems in a smart environment are capable of supporting a user by providing localized information about objects, for example about the capabilities of items in a yet unknown apartment, or ongoing household activities such as cooking or cleaning. We addressed the question of how to deal, in such joint task situations, with distractions that arise from the social environment or from autonomous devices seeking the user’s attention. Based on an incremental speech and dialog processing architecture (Carlmeyer et al., 2014), we developed and evaluated a strategy for implicitly regaining the user’s attention through self-interruptions. The strategy proved effective, albeit at the cost of less positive subjective evaluations, an issue we will address in the further development of our interaction strategy (Carlmeyer et al., 2016a, Carlmeyer et al., 2016b).
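
The following is a minimal sketch of how a self-interruption strategy could be expressed on top of incrementally delivered speech chunks. It is an illustration under stated assumptions, not the implementation evaluated in the cited studies: the function name, the event labels, and the per-step attention signal are all hypothetical.

```python
from typing import Iterable, List, Tuple


def speak_with_self_interruption(
    chunks: List[str], attention: Iterable[bool]
) -> List[Tuple[str, str]]:
    """Deliver speech chunks one per time step, pausing while attention is lost.

    `attention` is a hypothetical per-step signal (True = user is attending).
    Returns a log of (event, chunk) pairs; event labels are illustrative only.
    """
    log: List[Tuple[str, str]] = []
    pending = list(chunks)
    interrupted = False
    for attentive in attention:
        if not pending:
            break  # everything has been said
        if attentive:
            # Resume with the chunk that was held back, or simply continue.
            event = "resume" if interrupted else "say"
            log.append((event, pending.pop(0)))
            interrupted = False
        elif not interrupted:
            # Stop before the next chunk and wait for attention to return.
            log.append(("interrupt", pending[0]))
            interrupted = True
    return log
```

In this sketch the system falls silent as soon as the attention signal drops and continues with the withheld chunk once it returns, so the interruption itself serves as the implicit cue.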

How to be (not too) attentive

In the context of smart environments with various interactive agents and 24/7 activity, like the CSRA, it becomes increasingly important for agents to be able to decide whether they are addressed by human speech or not. Although the common solution of directly addressing an agent using a designated keyword yields quite acceptable results, we examine multimodal social cues to achieve a more effective and natural way of interaction. In (Richter et al., 2016a) we use addressee recognition based on mutual gaze and mouth movements in combination with a multimodal attention management system to enhance the interaction capabilities of the robot Floka.
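
As an illustration of combining the two cues, the sketch below treats an agent as addressed when mutual gaze and mouth movement co-occur in a sufficient share of recent video frames. This is an assumed fusion rule for illustration only; the `Frame` type, the `is_addressed` function, and the `min_ratio` threshold are hypothetical and are not taken from the cited work.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Frame:
    """Per-frame cue estimates (hypothetical detector outputs)."""
    mutual_gaze: bool   # user and agent look at each other
    mouth_moving: bool  # visible mouth movement of the user


def is_addressed(frames: Sequence[Frame], min_ratio: float = 0.6) -> bool:
    """Return True if both cues co-occur in at least `min_ratio` of the frames.

    `min_ratio` is an assumed threshold, not an empirically derived value.
    """
    if not frames:
        return False
    hits = sum(1 for f in frames if f.mutual_gaze and f.mouth_moving)
    return hits / len(frames) >= min_ratio
```

Requiring both cues in the same frame is a deliberately conservative choice: speech directed at another person or device in the room would typically lack mutual gaze with the agent.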

Addressing behavior in smart environments

When users without prior knowledge about the interfaces of their environment are allowed to act freely to solve daily tasks like switching the light, what do they do to narrow down the addressee of their actions? In (Richter et al., 2016b) we show how informative various situational and social cues can be when a smart environment needs to decide which appliance an inhabitant is addressing, and we develop a first model of this addressing behavior.

Situated Speech Synthesis

Current speech synthesis approaches aim to optimize synthesis performance in lab environments. However, when we are faced with tasks that are tedious, time-critical or very important to us, and we are also subject to distractions, we are likely to change our expectations of the communicative style of an assistive system. Within the CSRA our objective is to provide speech synthesis approaches that are more robust in real-life situations.

Speech Synthesis with Attitudes

In human communication, the expression of attitudes such as surprise, doubt, or insecurity concerning the situation or ongoing dialogue is a frequently occurring phenomenon. It provides an interlocutor with important information as to how the speaker evaluates the situation or the preceding message, and it can trigger clarifications, reformulations or confirmations without explicitly requesting them. An ability to express attitudinal speech can therefore be expected to increase the rapport and transparency in the interaction between human and artificial agent within the CSRA. We have evaluated the acoustic characteristics and facial expressions of various attitudes in human speech production (Hönemann et al., 2015a, Hönemann et al., 2015b) and created a first model for their acoustic realization within the TTS platform integrated into the CSRA dialogue system (Hönemann and Wagner, 2016). Furthermore, we have developed ways to synchronize facial expressions and speech output within the architecture, thus paving the way for a multimodal model for attitudinal expression in verbal human-machine interaction.

Fundamentals of Hesitation Synthesis

Like human interlocutors in search of the right words, complex system architectures such as the CSRA sometimes need to deal with system-internal processing issues that delay a system response. In order to equip dialogue systems with capabilities similar to those of human speakers, we therefore developed a strategy for synthesizing hesitations. These aim at "buying" extra dialogue time without the user noticing it, using subtle word lengthening. In case of severe internal delays that need more time to be resolved, the strategy can resort to overtly noticeable disfluencies such as filled pauses, which are known to provide important cues for the listener to wait for a continuation.
We conducted research on the fundamentals of word lengthening and developed a simple classifier to aid lengthening detection in speech corpora (Betz et al., 2016). We tested the effect of increasing lengthening on user feedback and interaction times to find thresholds for the acceptability of synthesized hesitation (Betz et al., 2017). In addition, we refined the elasticity hypothesis for segment duration distribution in disfluent contexts to ensure correct placement of the synthesized hesitation (Voße et al., 2017).
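
The two-stage strategy described above (subtle lengthening first, overt filled pauses for longer delays) can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not the model used in the CSRA: the names `HesitationPlan` and `plan_hesitation`, the default word duration, the maximum stretch factor, and the filler token are all hypothetical placeholders rather than the empirically determined thresholds.

```python
from dataclasses import dataclass


@dataclass
class HesitationPlan:
    kind: str              # "none", "lengthening", or "filled_pause"
    stretch_factor: float  # duration multiplier for the target word
    filler: str            # filler token to insert, empty if none


def plan_hesitation(
    expected_delay_s: float,
    word_duration_s: float = 0.4,  # assumed duration of the target word
    max_stretch: float = 1.5,      # assumed upper bound for unobtrusive lengthening
    filler: str = "uhm",           # assumed filled-pause token
) -> HesitationPlan:
    """Choose a hesitation type that covers the expected processing delay."""
    if expected_delay_s <= 0:
        return HesitationPlan("none", 1.0, "")
    # Extra time that lengthening alone can buy without exceeding max_stretch.
    max_gain_s = word_duration_s * (max_stretch - 1.0)
    if expected_delay_s <= max_gain_s:
        factor = 1.0 + expected_delay_s / word_duration_s
        return HesitationPlan("lengthening", factor, "")
    # Delay is too long to hide: lengthen fully and add an overt filled pause.
    return HesitationPlan("filled_pause", max_stretch, filler)
```

For example, with these assumed defaults a delay of 0.1 s would be covered by stretching the target word, while a delay of 1 s would fall back to a filled pause.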

A Social Turn in Speech Synthesis Evaluation

The CSRA environment enables us to test our speech technology applications within realistic, interactive settings, thus necessarily going beyond standard approaches in current speech technology evaluations (Wagner et al., 2015). Within a larger research program that aims to investigate speech communication in more interactive, less controlled settings (Wagner et al., 2017), we are in the process of devising standardized approaches that allow for optimized diagnostics in speech synthesis quality estimation (Betz et al., 2017a). This will be achieved by developing suitable behavioural and physiological estimates of speech synthesis quality that can be assessed during ongoing interactions, and combining those