MTRANS: A multi-channel, multi-tier speech annotation tool
Julián Villegas¹, Martin Cooke¹,², Vincent Aubanel¹, and Marco A. Piccolino-Boniforti³
¹ Ikerbasque (Basque Science Foundation), Spain
² Language and Speech Laboratory, Universidad del Pais Vasco, Spain
³ Dept. of Linguistics, Univ. Cambridge, UK
j.villegas@laslab.org, m.cooke@ikerbasque.org, v.aubanel@laslab.org, map55@cam.ac.uk
Abstract
MTRANS, a freely available tool for annotating multi-channel
speech, is presented. This software tool is designed to provide the visual and aural display flexibility required for transcribing multi-party conversations; in particular, it eases the analysis
of speech overlaps by overlaying waveforms and spectrograms
(with controllable transparency), and simplifies the mapping from media
channels to annotation tiers by allowing arbitrary associations
between them. MTRANS supports interoperability with other
tools via the Open Sound Control protocol.
Index Terms: speech tools, multi-channel speech annotation,
conversation analysis.
1. Introduction
The analysis of multi-channel speech material, such as that available in the ICSI [1] and AMI [2] corpora or in recordings
from studies of the effect of background conversations
on foreground conversations [3], often requires the transcription
and annotation of recordings of more than two talkers, here denoted ‘multi-channel’ conversations.
While some tools can handle multi-channel media, they
rarely possess the flexibility in audio output and visual display required when transcribing and annotating multi-channel
conversational speech. Overlaps, frequent in natural conversations, are particularly interesting, and their analysis would benefit from support for visual signal separation in both time and
frequency domains. A further issue is the mapping from annotation tiers to channels, where typical one-to-one assumptions
are too restrictive in cases where the tier corresponds to events,
such as inter-turn gaps, which are only meaningful with reference to more than one channel. To support rapid and flexible
annotation of multi-channel speech with many annotation levels, we have developed MTRANS, a free and open source tool
available from www.laslab.org/tools/mtrans.
2. Tools and Frameworks
Tools such as Praat [5] (probably the most widely used software for speech analysis), the Speech Filing System [6], or arguably
MATLAB® [4] (via its Signal Processing Toolbox), offer solutions to speech-signal analysis, including modules for capturing,
displaying, editing, annotating, and analysing media. These
powerful tools also benefit from user-contributed plug-ins and
libraries such as the ‘voicebox’ collection of MATLAB functions [7]. In some cases, these applications are forced to compromise between flexibility and generality, rendering them somewhat cumbersome for tasks required in the analysis of multi-channel speech.
Numerous software applications for editing signals can be
used for conversational analysis. A complete review of these
applications is beyond the scope of this article: only the most
relevant tools will be discussed.1 Anvil [8], a video-oriented annotation tool, integrates multimodal information including motion capture data. The CLAN programs [9] adhere to the CHAT
transcription convention, and are useful for frequency counts,
co-occurrence and interactional analyses. ELAN [10] is a multimedia (video and audio) annotation tool supporting multiple
channel video. The EMU Speech Database System [11] allows
signal editing, hierarchical annotation handling and automatised
signal modification via batch processes. Being database oriented, large corpora can be analysed in detail and through its
interface with R it is possible to perform sophisticated statistical
analyses and create figures in an integrated environment. EXMARaLDA [12], through its Partitur-Editor, allows multiple-level
annotation of single audio and video files. Folker [13] shares
many similarities with EXMARaLDA's Partitur-Editor. LaBB-CAT (a.k.a. ONZE Miner) [14] offers monochromatic spectrogram display (indirectly) for single-channel analysis. The commercial application Transana [15] is another video-annotation-oriented tool which is simple to use and works with single
sources. Transcriber [16], designed specifically for the annotation of broadcast news recordings, supports multiple media
channels via the ‘channelTrans’ extension.2 Note that there
is a homonymous version of Transcriber,3 currently unmaintained, which allows monophonic transcription of audio files
with spectrogram, signal, and energy plots. Wavesurfer [17]
supports multi-channel audio analysis and spectrogram display;
its development, halted for about six years, has recently been
resumed. This tool is widely reused; for example, EMU's signal-editing
functionality is built upon Wavesurfer.
The aforementioned applications provide the user with
common features such as time-alignment, basic signal processing (intensity control, lateralisation, etc.), basic editing (cut,
copy, and paste), interfaces to other programs and several import/export formats. Table 1 summarises some of these software
features.
Handling multi-channel audio is achieved in some applications (e.g., EMU and Praat) by synchronously rendering several
media files in independent windows. Such a mechanism creates
flexibility and scalability problems.
1 A representative list of annotation tools can be found at annotation.exmaralda.org/index.php/Linguistic_Annotation
2 www.icsi.berkeley.edu/Speech/mr/channeltrans
3 www.isip.piconepress.com/projects/speech/software/legacy/transcriber
Figure 1: MTRANS interface. This example has four audio channels, of which two spectrograms and 14 tiers are shown.
Table 1: Some applications related to MTRANS (all are freely available, multi-platform, and feature waveform representation).

Tool          Spectrogram  Other media    Multi-channel  Open source  Scriptable
Anvil         No           Video, motion  Yes            No           No
CLAN          No           Video          No             Yes          Yes
ELAN          No           Video          Yes            Yes          Yes
EMU           Yes          jaw, tongue    No             Yes          Yes
EXMARaLDA     No           Video          No             No           No
Folker        No           No             No             No           No
LaBB-CAT      Yes          No             No             No           Yes
MTRANS        Yes          Partial        Yes            Yes          Yes
Praat         Yes          No             Partial        Yes          Yes
SFS           Yes          No             No             Yes          Yes
Transcriber   No           No             Yes            Yes          Yes
Wavesurfer    Yes          No             Yes            Yes          Yes
3. MTRANS philosophy
One of the goals in creating MTRANS was to fill the gap left
by existing applications in the analysis of natural, real-world,
multi-party conversations (rich in overlaps) by allowing the visual inspection of overlaid signals in different representations
(spectrographic and waveform). In its design, the following features were considered important:
Ease of experimentation: As an example, spectrogram parameters (e.g., maximum frequency, frequency resolution, energy range, and individual signal transparency) can be adjusted
continuously in real time. This fine control facilitates visual
speech analysis, which often demands tuning of display parameters depending on the speaker, task, and segment of speech material currently under analysis.
Flexibility: Arbitrary tier grouping and ordering allows a
logical relationship between tier types and channels (e.g., overlaps between channels) and determines which signals
are required for playback. Tier type and tier association provide useful semantic information in subsequent processing of
annotation files.
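The many-to-many tier/channel association described above can be modelled as a simple mapping. The sketch below is illustrative only (the names and data model are hypothetical, not MTRANS's internals); it shows how the set of channels needed for playback follows from the selected tiers.

```python
from dataclasses import dataclass

# Hypothetical sketch: a tier may reference one channel (e.g., orthography)
# or several (e.g., an overlap tier defined over two talkers).
@dataclass(frozen=True)
class Tier:
    name: str
    tier_type: str
    channels: frozenset  # indices of the associated audio channels

def playback_channels(tiers, selected_names):
    """Union of the channels needed to audition the selected tiers."""
    chans = set()
    for t in tiers:
        if t.name in selected_names:
            chans |= t.channels
    return sorted(chans)

tiers = [
    Tier("ortho-A", "orthography", frozenset({0})),
    Tier("ortho-B", "orthography", frozenset({1})),
    Tier("overlap-AB", "overlap", frozenset({0, 1})),
]
```

Selecting the overlap tier, for example, would pull in both of its underlying channels for playback.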
Information filtering: To better focus on a particular task,
MTRANS allows the user to display or hide information such as individual tiers, tier groups, waveforms, annotations, and spectrograms. Aurally, one can lateralize each audio channel as well as
increase its relative sound intensity (reflected in its spectrogram
transparency).
Interoperability: MTRANS specialises in creating annotations from multiple audio channels, but it is capable of sending
real-time transport information (e.g., playback starting point,
segment duration, and speed rate) to other applications via
the Open Sound Control (OSC) protocol.4 Provided that OSC-enabled clients exist, there is no need to switch between tools
for a particular task nor to overload MTRANS with peripheral
features used infrequently. The use of OSC allows MTRANS to
focus on its core functionality while recruiting other specialised
tools for tasks such as video playback (see later) with no additional coding.
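As an illustration of this style of interoperation, the sketch below hand-encodes a basic OSC message following the OSC 1.0 wire format (NUL-terminated, 4-byte-padded strings; big-endian arguments) and shows how a transport message could be sent over UDP. The address pattern `/transport/play` and the argument layout are hypothetical, not MTRANS's actual message schema.

```python
import socket
import struct

def _pad(b: bytes) -> bytes:
    # OSC strings are NUL-terminated and padded to a 4-byte boundary.
    return b + b"\x00" * (4 - len(b) % 4)

def osc_message(address: str, *args) -> bytes:
    """Encode a basic OSC message with int32, float32, or string arguments."""
    tags = ","
    payload = b""
    for a in args:
        if isinstance(a, float):
            tags += "f"
            payload += struct.pack(">f", a)
        elif isinstance(a, int):
            tags += "i"
            payload += struct.pack(">i", a)
        else:
            tags += "s"
            payload += _pad(str(a).encode())
    return _pad(address.encode()) + _pad(tags.encode()) + payload

# Transport-style message: start time, segment duration, speed rate.
# Address and argument order are illustrative only.
msg = osc_message("/transport/play", 12.5, 3.2, 1.0)

# A client would send it over UDP, e.g. to a local OSC server:
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(msg, ("localhost", 2012))
```

In practice an OSC library would be used; the point is that the wire format is simple enough for lightweight clients in any language.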
Speed and reactiveness: Regardless of media length,
MTRANS is designed to be fast. By implementing resizable panels, users can examine speech in more detail, or conveniently
shrink them for improved performance. Since only the visible
spectrogram pixels are computed at a given time, reducing the
size of this panel increases overall performance.
4 opensoundcontrol.org
4. Main Features
MTRANS runs on platforms supporting MATLAB (such as Windows, Mac OS X, and Linux) and is extensible via MATLAB scripts. It consists of two windows: the main display and
a control & search panel. Table 2 summarises the main features
of MTRANS.
Table 2: MTRANS features

Browsing: Audio channel lateralization and volume control, transparency control, several levels of zooming, resizable panels, overlaid hue-encoded spectrograms and waveforms, text and regular expression search, help via tool-tips.
Interoperation: Import/export of tiers to Praat or HTK annotation files. Real-time inter-operation with OSC-enabled software. Supports tier lexica to ensure uniformity amongst transcribers. Automatic generation of overlap and turn tiers as the starting point for manual correction.
Navigation: Several ways to move between annotations (skip to the previous/next non-empty annotation, or to one with the same label, etc.), access annotations through search results.
Playback speed: Audio streams are processed with a vocoder to control the reproduction speed with no significant formant artefacts.
Real-time rendering: Spectrograms and waveforms are rendered online according to the current parameters, e.g., panel size, gain, transparency, zoom, frequency resolution.
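The vocoder-based speed control listed in Table 2 can be approximated with a standard phase vocoder: time-stretch the short-time Fourier transform while propagating phase so that bin frequencies (and hence formant positions) stay roughly intact. The sketch below is a textbook implementation in NumPy, not MTRANS's actual MATLAB code.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Hann-windowed short-time Fourier transform; frames as columns."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft, hop)]
    return np.array(frames).T

def istft(spec, n_fft=1024, hop=256):
    """Overlap-add resynthesis (window normalisation omitted for brevity)."""
    n_frames = spec.shape[1]
    win = np.hanning(n_fft)
    y = np.zeros(n_fft + hop * (n_frames - 1))
    for t in range(n_frames):
        y[t * hop:t * hop + n_fft] += win * np.fft.irfft(spec[:, t], n_fft)
    return y

def phase_vocoder(spec, rate, n_fft=1024, hop=256):
    """Stretch an STFT in time by 1/rate with phase accumulation."""
    n_bins, n_frames = spec.shape
    steps = np.arange(0, n_frames - 1, rate)
    expected = 2 * np.pi * hop * np.arange(n_bins) / n_fft  # phase advance/hop
    out = np.empty((n_bins, len(steps)), dtype=complex)
    phase = np.angle(spec[:, 0])
    for t, step in enumerate(steps):
        i = int(step)
        frac = step - i
        # interpolate magnitudes between the two surrounding frames
        mag = (1 - frac) * np.abs(spec[:, i]) + frac * np.abs(spec[:, i + 1])
        out[:, t] = mag * np.exp(1j * phase)
        # instantaneous frequency from the measured phase increment
        dphi = np.angle(spec[:, i + 1]) - np.angle(spec[:, i]) - expected
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))  # wrap to [-pi, pi]
        phase += expected + dphi
    return out
```

Running a 440 Hz sine through `phase_vocoder` at `rate=0.5` roughly doubles its duration while leaving its dominant frequency in place.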
(a) monochromatic version
(b) hue-encoded
Figure 2: Hue-encoded spectrograms facilitate visual source
separation at overlaps and f0 contour visualisation. This feature
is especially useful in combination with transparency control in
the analysis of overlaps when the signals are very similar.
4.1. Main Window
The resizable panels inside the main window (shown in Fig. 1)
correspond to the spectrogram, waveform, and tier views of the multiple channels to analyse. Hue-encoded spectrograms, with controllable transparency, facilitate visual source separation at overlaps, as shown in Fig. 2. Identifying crosstalk between audio
channels (common even when the audio is captured using close-talking microphones) improves the analysis quality, and controlling each channel's transparency aids crosstalk detection.
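The hue-encoded overlay of Fig. 2 can be approximated by tinting each channel's normalised log-magnitude spectrogram with a per-channel colour and alpha-blending the layers. The NumPy-only sketch below is illustrative; the display parameters (dB range, tints, alphas) are assumptions, not MTRANS's defaults.

```python
import numpy as np

def log_spectrogram(x, n_fft=512, hop=128, floor_db=-60.0):
    """Magnitude spectrogram mapped to [0, 1] over a fixed dB range."""
    win = np.hanning(n_fft)
    mags = [np.abs(np.fft.rfft(win * x[i:i + n_fft]))
            for i in range(0, len(x) - n_fft, hop)]
    s = np.array(mags).T
    db = 20 * np.log10(s + 1e-9)
    # map [max+floor_db, max] dB onto [0, 1]
    return np.clip((db - db.max() - floor_db) / -floor_db, 0.0, 1.0)

def overlay(specs, tints, alphas):
    """Alpha-blend per-channel tinted spectrograms over a dark background."""
    img = np.zeros(specs[0].shape + (3,))
    for s, tint, a in zip(specs, tints, alphas):
        layer = s[..., None] * np.asarray(tint, dtype=float)
        alpha = a * s[..., None]  # louder cells are more opaque
        img = (1 - alpha) * img + alpha * layer
    return np.clip(img, 0.0, 1.0)
```

With two talkers tinted, say, red and green, regions where both are active blend towards yellow, which is what makes overlaps visually separable.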
Annotations can be created, deleted, or edited in place using standard GUI conventions; their contents can be typed in or
selected from user-defined lexica specific to each tier type. In
fast annotation mode, a single click inserts annotation boundaries and, optionally, a default symbol.
Tiers map to multiple or individual media channels; e.g., in
Fig. 1, overlap tiers are associated with two audio channels. By
clicking on a tier one can add, remove, and edit it. Support for
large numbers of tiers is achieved via grouping, flexible ordering, hiding/display, and colouring options.
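The automatic overlap-tier generation listed in Table 2 amounts to intersecting the speech intervals of two channels. The sketch below shows that starting point (the interval lists and function name are hypothetical); the result would then be corrected manually.

```python
def overlaps(intervals_a, intervals_b):
    """Pairwise intersections of two sorted lists of (start, end) speech
    intervals -- a candidate overlap tier for manual correction."""
    out = []
    i = j = 0
    while i < len(intervals_a) and j < len(intervals_b):
        start = max(intervals_a[i][0], intervals_b[j][0])
        end = min(intervals_a[i][1], intervals_b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if intervals_a[i][1] < intervals_b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

The two-pointer sweep is linear in the number of intervals, so it scales to long multi-party recordings.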
4.2. Control and Search Window
The control tab (highlighted in the left window in Fig. 1) provides a
mechanism for selecting what to display and how to display it. The top panel
allows online adjustments to, for example, frequency resolution,
upper frequency limit, and playback speed. These parameters
can be adjusted to permit visual analysis of formants and f0
contours. Through the middle panel users can select what information is displayed for each audio channel, including waveform, annotation, spectrogram, etc. Deselecting unwanted information improves performance and usage of the visual space. Display control of tiers and tier types is possible via the bottom panel.
Figure 3: Search & Browse window (fragment). The resizable
bottom panel shows the result tree.
Through the Search tab (shown in Fig. 3), users can navigate a set of retrieved annotations, selecting them freely from
the result tree or stepping through them consecutively with the arrow keys.
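The regular-expression search mentioned in Table 2 can be sketched as a filter over flat annotation records. The `(tier, start, end, label)` layout below is illustrative, not MTRANS's file format.

```python
import re

# Hypothetical flat annotation records: (tier, start, end, label).
annotations = [
    ("ortho-A", 0.00, 1.20, "yeah I think so"),
    ("ortho-B", 1.10, 1.40, "mm-hmm"),
    ("events",  1.10, 1.40, "backchannel"),
]

def search(records, pattern, tier=None):
    """Return records whose label matches the regular expression,
    optionally restricted to a single tier."""
    rx = re.compile(pattern)
    return [r for r in records
            if rx.search(r[3]) and (tier is None or r[0] == tier)]
```

The matched records would feed the result tree, from which the user jumps to the corresponding point in the signal.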
4.3. OSC functionality
To demonstrate the benefits of this feature, a simple four-channel video player (shown in Fig. 4) was created. This OSC client performs synchronised video playback corresponding to
the annotation currently being played in MTRANS. For instance,
the application can reproduce the video segment at the playback
speed specified in MTRANS.
MTRANS users can specify the server, port, and OSC bundling
(by default localhost, 2012, and individual messages, respectively). If media and OSC applications are available, one could,
for example, enrich the analysis of conversations by also looking
at participants' video, motion capture, articulatory settings, as
well as other synchronous data such as EEG, without necessarily harming each individual application's performance and, crucially, without additional coding.
Figure 4: Multi-video player synchronised with MTRANS via the OSC protocol (videos from the AMI corpus [2]).
5. Experience with MTRANS
The need for a detailed analysis of multi-channel speech data
motivated the development of MTRANS. Speech material came
from an experiment on speaking in the presence of competing speech. MTRANS was used to annotate pairs of simultaneous conversations recorded in three sessions of 20 minutes
each. Five channels of synchronous audio data (four close-talking microphones and an omnidirectional table microphone)
were available during annotation. Annotation involved more
than 20 tiers grouped into seven tier types (session structure,
speech/non-speech, orthography, interactional events, inter-turn
gaps, turn-types, miscellaneous). More than 7000 individual
annotation symbols were created, including nearly 1300 events
(dysfluencies, alignments, etc.) and more than 2000 turn components (back-channels, interruptions, smooth change, etc.).
6. Further developments
MTRANS currently saves transcriptions in plain text files for
ease of use in subsequent text-processing pipelines, but they
could be converted to XML files or stored in a database. Retrieving corpora from central databases via a database connection is
also an attractive feature to implement, given current trends
in data storage, retrieval, and document sharing in the speech
community.
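An XML conversion of the kind mentioned above could be sketched with Python's standard ElementTree module; the element and attribute names below are illustrative, not an actual MTRANS schema.

```python
import xml.etree.ElementTree as ET

def tiers_to_xml(tiers):
    """Serialise {tier_name: [(start, end, label), ...]} to an XML string.
    Element and attribute names are hypothetical, for illustration only."""
    root = ET.Element("transcription")
    for name, annots in tiers.items():
        tier_el = ET.SubElement(root, "tier", name=name)
        for start, end, label in annots:
            a = ET.SubElement(tier_el, "annotation",
                              start=f"{start:.3f}", end=f"{end:.3f}")
            a.text = label
    return ET.tostring(root, encoding="unicode")
```

A schema of this shape would keep the plain-text tier structure intact while making the files queryable with standard XML tooling.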
Much of the information necessary to generate a conversation analysis style of transcription (e.g. following Jefferson’s
conventions [18]) is available in MTRANS. This output format
will be automatically generated in future versions.
LaTeX commands can be directly inserted in annotations to
produce the most common diacritics and digraphs used with the
Latin alphabet, Greek letters, and some phonetic symbols. Full
support for other languages and IPA symbols would be available via Unicode (UTF-8) but is yet to be fully implemented in MATLAB. The current version of MTRANS is a hybrid
of MATLAB and Java code. Full migration to Java is planned for
future versions.
So far, MTRANS has OSC server capabilities; implementing
client capabilities is currently under consideration.
Acknowledgements. The authors thank the EU Future and Emerging Technology (FET-OPEN) Project LISTA (The Listening Talker) and the EU Marie Curie Research Training Network S2S (Sound to Sense) for support and encouragement.

7. References
[1] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke, "The Meeting Project at ICSI," 2001.
[2] J. Carletta, "Announcing the AMI Meeting Corpus," The ELRA Newsletter, vol. 11, no. 1, pp. 3–5, Jan. 2006.
[3] J. C. Webster and R. G. Klumpp, "Effects of Ambient Noise and Nearby Talkers on a Face-to-Face Communication Task," J. Acoust. Soc. Am., vol. 34, no. 7, pp. 936–941, 1962.
[4] Mathworks, "MATLAB," Software, available [Mar. 2011] from www.mathworks.com.
[5] P. Boersma and D. Weenink, “Praat: doing phonetics by computer,” available [Mar. 2011] from www.praat.org.
[6] M. Huckvale, D. Brookes, L. Dworkin, M. Johnson, D. Pearce,
and L. Whitaker, “The SPAR Speech Filing System,” in Proc.
of European Conf. on Speech Technology, Edinburgh, 1987, pp.
305–308.
[7] M. Brookes, “VOICEBOX: Speech Processing Toolbox for MATLAB,” Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/
staff/dmb/voicebox/voicebox.html.
[8] M. Kipp, “Spatiotemporal coding in ANVIL,” in Proc. Sixth Int.
Conf. on Language Resources and Evaluation (LREC-08), 2008.
[9] B. MacWhinney, The CHILDES project: tools for analyzing talk,
3rd ed. Mahwah, NJ, USA: Lawrence Erlbaum, 2000.
[10] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and
H. Sloetjes, “ELAN: a professional framework for multimodality
research,” in Proc. of Language Resources and Evaluation Conference (LREC), 2006.
[11] S. Cassidy and J. Harrington, “Multi-level annotation in the Emu
speech database management system,” Speech Commun., vol. 33,
pp. 61–77, Jan 2001.
[12] T. Schmidt, “EXMARaLDA: un système pour la constitution et
l’exploitation de corpus oraux,” in Pour une épistémologie de
la sociolinguistique. Actes du colloque Int. de Montpellier 10–12
déc. 2009. Lambert-Lucas, 2010, pp. 319–327.
[13] T. Schmidt and W. Schütte, “FOLKER: An Annotation Tool for
Efficient Transcription of Natural, Multi-party Interaction,” in
Proc. Seventh Conf. on Int. Language Resources and Evaluation
(LREC’10), Valletta, Malta, 2010.
[14] R. Fromont and J. Hay, “ONZE Miner: the development of a
browser-based research tool,” Corpora, vol. 3, no. 2, pp. 173–193,
2008.
[15] D. Woods and C. Fassnacht, Transana. Madison, WI, USA:
The Board of Regents of the U. of Wisconsin System [Computer
software], 2007.
[16] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman, “Transcriber:
Development and Use of a Tool for Assisting Speech Corpora
Production,” Speech Communication, vol. 33, no. 1–2, pp. 5–22,
2001.
[17] K. Sjölander and J. Beskow, “WaveSurfer - an open source speech
tool,” in Proc. of ICSLP 2000: Sixth Int. Conf. on Spoken Language Processing, B. Yuan, T. Huang, and X. Tang, Eds., vol. 4,
Beijing, Oct. 2000, pp. 464–467.
[18] G. Jefferson, Structures of Social Action: Studies in Conversation
Analysis. New York, USA: Cambridge University Press, 1984,
ch. Transcript notation.