Video remote interpreting in clinical communication: A multimodal analysis


Video remote interpreting both enables and constrains language-discordant clinical consultations.

Video remote interpreting requires the healthcare professional to ensure appropriate technical and spatial arrangements.

Healthcare professionals need to acquire special skills for working with interpreters via video link.



Objective

To investigate how the spatial and audiovisual conditions in video remote interpreting (VRI) shape communicative interaction in a language-discordant clinical consultation.

Methods

We conducted a multimodal analysis of an authentic VRI-mediated consultation with special reference to spatial arrangements, audiovisual conditions, and the healthcare professional’s use of embodied communicative resources (body orientation, eye gaze, gestures).

Results

The physician is found to pursue his communicative goals for the consultation by first creating an appropriate spatial and technical environment and then supporting his information-giving and relationship-building actions through the use of nonverbal (embodied) resources like body orientation, gaze and gestures as well as specific turn-management behaviour.

Conclusion

VRI allows healthcare professionals to access professional interpreters for language-discordant consultations but requires appropriate technical and spatial arrangements as well as users capable of adapting their communicative behaviour to spatial and audiovisual constraints.

Practice implications

Alongside telephone interpreting, VRI is the solution of choice for language-discordant clinical encounters in times of the Covid-19 pandemic. Its use requires appropriate technical and spatial arrangements as well as specific skills on the part of healthcare professionals to cope with inherent audiovisual constraints.


Video remote interpreting (VRI)

Language barriers

Language-discordant consultations

Professional interpreter

Triadic interaction

Multimodal analysis


  1. Introduction

As patient-centredness and shared decision-making have become widely adopted benchmarks for quality healthcare, the role of communication between healthcare professionals (HCPs) and patients has received unprecedented levels of attention, and many research and teaching initiatives have been designed to address the challenges of achieving effective communication in patient–provider interactions in a wide range of settings [1][2]. Much less consideration has been given to language-discordant encounters (i.e., where HCPs and patients do not share a language), even though the negative effects of language barriers on quality of care and patient safety and satisfaction are well documented [3][4][5][6]. The use of professional (i.e., trained and certified) interpreters has proved essential in a variety of settings [7][8][9]. Nevertheless, the need for research and training on how to work most effectively with interpreters has yet to be fully addressed [10][11], and timely and cost-efficient access to qualified interpreters remains a pervasive challenge. Accessing a professional interpreter via a videoconference link constitutes an innovative solution that offers particular advantages in times of a pandemic requiring distancing and a reduction of personal contacts.

Recent guidance [12] on language-support solutions for healthcare communication, including such choices as professional vs informal interpreters and on-site vs remote (telephone vs video link) interpreting as well as machine translation applications, leaves video-mediated interpreting underexplored but calls for best available evidence for implementing strategies to overcome language barriers. To the extent that such evidence has emerged, it focuses mainly on demonstrating the feasibility of videoconference-based remote interpreting [13] and on stakeholder preferences in relation to on-site and telephone-based interpreting [14][15][16][17]. A survey study among professional medical interpreters [18] points to limitations of telephone interpreting and of video-mediated interpreting in clinical scenarios with a strong interpersonal and psychosocial dimension.

1.1. Aims and approach

Our aim in studying video remote interpreting (VRI) is not a comparative evaluation of different interpreting service delivery modalities but a fuller engagement with this recent practice, treating VRI as a complex form of mediated communication in its own right. Its complexity stems from the confluence of video-mediated interaction [19] and interpreter-mediated communication set in a specific institutional context [10][12]. We therefore study this dual mediation by focusing on the affordances of the technical medium in relation to the communicative behaviour of participants in clinical interactions. Our special interest lies in the use of embodied multimodal resources which are generally important in face-to-face communication and of special relevance in technology-mediated interactions. While our conceptual and methodological approach is mainly informed by Interpreting Studies [20][21], which has in turn drawn heavily on interdisciplinary frameworks for the study of discourse and interaction, our analysis does not centre on the translatorial performance of the interpreter. Rather, we focus on the healthcare professional controlling and actively using the VRI service and investigate how the goals of clinical consultations, particularly regarding information and rapport [22], are pursued in what is ideally regarded as a joint professional performance by the physician and the interpreter. The mediality of VRI calls for an analysis foregrounding the visual modality [23], and our analysis of communicative behaviour is distinctly multimodal, highlighting the use and function of embodied resources such as body and head orientation, eye gaze and gestures (see 1.2).

Before elaborating on our conceptual framework and analytical approach, we acknowledge that our study is limited to a single consultation that involves only three participants (physician, patient, interpreter). The data were collected within a limited time frame for a pilot study in preparation for a larger-scale project. Although the case we present does not pose any apparent professional or psychosocial challenges, we hope that the insights derived from our analysis help exemplify some aspects of VRI-mediated interaction in a particular constellation of participants (see 3.1). After all, we work with authentic data from a real-life medical setting, unlike authors reporting findings based on simulations. De Boe, for instance, in one of the few discourse-based analyses of VRI to date [24], found significant individual variability in VRI user behaviour even in a relatively controlled study based on three simulated doctor–patient interviews. Considerable diversity also characterises the set of eleven authentic VRI-mediated healthcare encounters analysed by Hansen [25]. Her study incorporates insights from conversation-analytical work on video-mediated interpreting in other settings [26][27] and provides important foundations for the present analysis.

1.2. Conceptual and analytical framework

While set in several different research traditions, multimodal approaches to interpreting in face-to-face communication (dialogue interpreting) [28] invariably assume that verbal as well as nonverbal communicative resources need to be accounted for. The present analysis therefore discusses verbal behaviour as well as the use of embodied resources such as eye gaze and gestures. More fundamentally, it also seeks to account for the interactants’ spatial positioning, which is highly relevant in any triadic interaction – in this case among two primary interlocutors and an interpreter as intermediary [29] – but assumes particular significance in video-mediated settings. Unlike in onsite interaction, where all three (or more) co-present interactants have unmediated visual access to the other participants and hence to their use of embodied communicative resources, visual access in VRI depends on the technical set-up (screen size, camera position and angle of view) and the participants’ positioning in relation to the technical equipment. This is different again in telehealth scenarios where all participants are online in different locations. In VRI with only the interpreter being remote, as in the case we present, visual access is constrained both for the interpreter, whose view depends on what is captured by the camera, and for the onsite participants, who are only able to see a two-dimensional image of the interpreter appearing as a talking head or in an upper-body view [26] and a small insert of the interpreter’s view of the scene. Aside from this inherently constrained or fractured ecology [30], video-mediated interaction in VRI also does not allow for mutual eye gaze between the interpreter and the onsite participants, which may impede the interpreter’s efforts to coordinate the interaction [27][28] and to gain and maintain the patient’s trust in the same way as the co-present physician.

Our multimodal analysis takes the spatial and audiovisual conditions of VRI, referred to as the visuospatial ecology, as a point of departure and investigates its implications for the dynamics of interpreter-mediated interaction in clinical encounters. It is in this unique ecology that the physician will be found to pursue the communicative goals of the consultation in a joint professional performance with the interpreter, using a combination of verbal and embodied resources as well as technological artifacts.

  2. Methods

2.1. Data collection

A ten-minute interview between a German-speaking physician and a Bulgarian-speaking patient was video-recorded in October 2019 at neunerhaus Health Centre in Vienna. This is an outpatient medical facility that serves uninsured individuals, many of whom come from different cultural and linguistic backgrounds. The interaction was facilitated by a professional video remote interpreter working for a Vienna-based agency with which the medical facility has a service agreement. A Canon EOS 600S video camera was used for the audiovisual recording, and an additional audio-recording was made with a Samsung S4 Mini smartphone. Both devices were put in place and operated by the first author, who was present in the room and made field notes on the physical arrangements and activities during the consultation.

Though the interpreter could not see the observer in the room (see Fig. 4), she was made aware of her presence, having been informed and asked for consent beforehand via a written declaration. In that declaration, the first author disclosed her affiliation with the Centre for Translation Studies, so that the interpreter may have concluded that her work would be under study and might have performed differently knowing that another professional was in the room. Also, the physician assured the patient (with the help of the interpreter) that the study would not focus on him as a patient but on the way the doctor communicates with the interpreter.

Before and after the recording, the first author had informal talks with the physician to gather information about the case and about his experience with VRI use (see 3.1); to discuss the placement of the video camera so as to ensure the patient’s anonymity; and to agree on procedures for securing the patient’s informed consent.

In line with institutional guidance on informed consent by the University of Vienna, permission for the recording was obtained orally from all three participants, and the physician and the interpreter were additionally provided with a document containing a data privacy and confidentiality statement in advance. Consent by the patient, whose face was never in full view as the camera was placed at an angle behind him, was obtained twice: first by a previously informed Bulgarian-speaking social worker who led the patient into the consultation room, and then, on camera, by the physician via the interpreter. This is documented in the first part of the recording, for both the patient and the interpreter (see Table 1, 01:18–02:24). The camera was only turned on after the patient had expressed his agreement.

Table 1. Timeline of the consultation (DOC = doctor; PAT = patient; INT = interpreter).

Time code | Duration (min:sec) | Activity | Interview stage [22]
00:00 – 01:18 | 01:18 | DOC sets up video call (operator) | Initiating the session
01:18 – 01:42 | 00:24 | DOC confirms interpreter’s consent |
01:42 – 02:24 | 00:42 | DOC confirms patient’s consent |
02:24 – 02:51 | 00:27 | DOC asks about present condition | Gathering information
02:51 – 03:33 | 00:42 | Physical examination | Physical examination
03:34 – 06:03 | 02:29 | DOC informs PAT about appointment | Explanation and planning
06:04 – 06:56 | 00:52 | DOC gets up and leaves camera range |
06:56 – 08:08 | 01:12 | DOC reassures PAT about appointment |
08:08 – 08:44 | 00:36 | DOC tells PAT about medication |
08:44 – 09:14 | 00:30 | DOC repeats date of appointment | Closing the session
09:14 – 09:20 | 00:06 | Goodbye to INT |
09:20 – 09:28 | 00:08 | Goodbye to PAT |
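
The durations in Table 1 follow directly from the time codes and can be recomputed as a consistency check. The following is a minimal Python sketch offered for illustration only; it is not part of the study's method, and the names `SEGMENTS`, `to_seconds`, `to_code`, `computed_durations` and `recording_length` are hypothetical. The time codes and listed durations are copied from Table 1.

```python
# Rows of Table 1: (start, end, listed duration), all in min:sec.
SEGMENTS = [
    ("00:00", "01:18", "01:18"),  # DOC sets up video call (operator)
    ("01:18", "01:42", "00:24"),  # DOC confirms interpreter's consent
    ("01:42", "02:24", "00:42"),  # DOC confirms patient's consent
    ("02:24", "02:51", "00:27"),  # DOC asks about present condition
    ("02:51", "03:33", "00:42"),  # Physical examination
    ("03:34", "06:03", "02:29"),  # DOC informs PAT about appointment
    ("06:04", "06:56", "00:52"),  # DOC gets up and leaves camera range
    ("06:56", "08:08", "01:12"),  # DOC reassures PAT about appointment
    ("08:08", "08:44", "00:36"),  # DOC tells PAT about medication
    ("08:44", "09:14", "00:30"),  # DOC repeats date of appointment
    ("09:14", "09:20", "00:06"),  # Goodbye to INT
    ("09:20", "09:28", "00:08"),  # Goodbye to PAT
]


def to_seconds(code: str) -> int:
    """Convert a min:sec time code to a number of seconds."""
    minutes, seconds = code.split(":")
    return int(minutes) * 60 + int(seconds)


def to_code(total_seconds: int) -> str:
    """Convert a number of seconds back to a zero-padded min:sec code."""
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


def computed_durations(segments):
    """Recompute each segment's duration as end minus start."""
    return [to_code(to_seconds(end) - to_seconds(start))
            for start, end, _listed in segments]


def recording_length(segments):
    """Span from the first start code to the last end code."""
    return to_code(to_seconds(segments[-1][1]) - to_seconds(segments[0][0]))


if __name__ == "__main__":
    listed = [row[2] for row in SEGMENTS]
    print("durations match table:", computed_durations(SEGMENTS) == listed)
    print("recording length:", recording_length(SEGMENTS))
```

The span from the first to the last time code is what corresponds to the overall length of the recording. Summing the individual durations gives a slightly smaller figure, because two pairs of adjacent segments in the table are separated by one second (03:33 to 03:34 and 06:03 to 06:04).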

2.2. Transcription

The video recording (duration: 9 min 28 s) was transcribed for analysis using the annotation software ELAN [31], which allows for user-defined multi-tier annotation linked to the time codes of the video. ELAN was used to transcribe the original spoken utterances as well as the physician’s gaze and body orientation. The Bulgarian utterances were transcribed by a professional translator, who received only the audio track to ensure participants’ anonymity. The transcript was then complemented with working translations into English by the Bulgarian translator and the second author. A sample transcript is shown in Fig.
