Siridus System Architecture and Interface Report (Baseline)

Ian Lewin, C.J. Rupp, Jim Hieronymus,
David Milward, Staffan Larsson, Alexander Berman
Distribution: Public
Specification, Interaction and Reconfiguration in
Dialogue Understanding Systems: IST-1999-10516
Deliverable D6.1
July 2000
Göteborg University
Department of Linguistics
SRI Cambridge
Natural Language Processing Group
Telefónica Investigación y Desarrollo SA Unipersonal
Speech Technology Division
Universität des Saarlandes
Department of Computational Linguistics
Universidad de Sevilla
Julietta Research Group in Natural Language Processing
For copies of reports, updates on project activities and other SIRIDUS-related information, contact:
The SIRIDUS Project Administrator
SRI International
23 Millers Yard,
Mill Lane,
Cambridge, United Kingdom
CB2 1RQ
milward@cam.sri.com
See also our internet homepage http://www.cam.sri.com/siridus
© 2000, The Individual Authors
SIRIDUS project Ref. IST-1999-10516, March 12, 2001
Page 3/58
No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval
system, without permission from the copyright owner.
Primary responsibility for authorship is divided as follows. Ian Lewin and C.J. Rupp wrote
Chapter 1. Ian Lewin was the overall editor and author of chapters 2 through 5. Chapter 6
was written by C.J. Rupp. Chapter 7, describing the SIRIDUS architecture, is the result of a
collaborative effort by all the authors.
Contents
1 Introduction
2 Dialogue architectures and Dialogue management
    2.1 Context Sensitivity
    2.2 Early decision making
    2.3 Asynchronous interaction
3 The DARPA Communicator Architecture
    3.1 Instance 1: The Mitre ‘CommandTalk’ system
        3.1.1 Outline
        3.1.2 Discussion
    3.2 Instance 2: The CU/CMU Communicator System
        3.2.1 Outline
        3.2.2 Discussion
    3.3 Conclusion
4 The Open Agent Architecture
    4.1 Instance: SRI’s CommandTalk
        4.1.1 Outline
        4.1.2 Discussion
    4.2 Conclusion
5 The TrindiKit Architecture
    5.1 Instance 1: GoDiS - Gothenburg Dialogue System
        5.1.1 Discussion
    5.2 Instance 2: Conversational Game Player
        5.2.1 Discussion
    5.3 Conclusion
6 The Verbmobil Communications Architecture
    6.1 Diverse Lessons from Direct Experience with a Pool Communications Architecture
        6.1.1 Specifying Functionalities
        6.1.2 Publishing and Subscribing
        6.1.3 Distributing Control
        6.1.4 Interrupting Processing
        6.1.5 Knowing your Segments and Fragments
    6.2 Conclusions
7 An Architecture for SIRIDUS
    7.1 Components and Processes
    7.2 TrindiKit and OAA
    7.3 Dataflows and Interfaces
    7.4 Dataflow timings
    7.5 Word Lattices as a Data Format
    7.6 Speech Recognizers
    7.7 Prosodic Markings
    7.8 Speech Generation
    7.9 Speech Synthesis
    7.10 Platforms
    7.11 The Repair Module
    7.12 Parsing and Semantic Interpretation
    7.13 Translation to Dialogue Moves
Chapter 1
Introduction
The purpose of this deliverable is to discuss the requirements for a dialogue system architecture
for the Siridus Project and describe a baseline architecture for the initial Siridus demonstrator
dialogue system. This system is conceived of as an instantiation of an integrated dialogue toolkit.
Therefore, in considering architectural requirements it is important to bear in mind a range of
configuration options, overall modularity and the potential for scalability, in keeping with the
general goals of the Siridus project. In addition, the proposed architecture should be capable
of sustaining the more immediate project objectives. These include: supporting the Information
State Update approach to dialogue, as promoted by the Trindi Project; extending this approach
to cover new types of dialogue; and, in particular, applying the approach in the management of
spoken dialogues. The latter will involve both exploiting the information state as a resource in
speech recognition and synthesis, and incorporating new approaches to the robust processing of
spoken dialogues.
Indeed, the most ambitious form of this goal would involve an attempt to enhance the state of
the art in spoken dialogue systems with the fruits of the information state update paradigm. This
presupposes an architecture that is capable of supporting the state of the art not only in the management of spoken dialogues but in the natural treatment of spoken language behaviours, such as
maintaining realistic time performance, handling performance phenomena, backchannelling and
interruptions. At this stage of development it will not be necessary to know how specific phenomena are to be treated but the general architectural implications must be catered for. This will
affect, in particular, process control, module dependencies and incrementality. The constraints
implied by the processing of spoken language will be formulated as assumed user requirements.
Some implications arising from these constraints can be drawn from actual systems. However,
since there are relatively few documented systems that attempt to meet all of these constraints,
conclusions must be drawn judiciously.
In order to set an appropriate context for our own architecture we describe and discuss some other
current major proposals concerning the architecture of spoken dialogue systems. The proposals
we discuss include the Darpa Communicator HUB architecture for spoken dialogue systems
which is currently the focus of a major and very well funded development effort in the United
States. We also discuss SRI’s Open Agent Architecture which is not an architecture for spoken
dialogue systems per se but which can be used as the backbone of such a system. It bears interesting comparison with the Darpa Hub architecture even though it predates it by several years.
Finally we discuss the TrindiKit architecture resulting from the predecessor Trindi Project and
which promotes the Information State Update approach to dialogue. In general, an architecture
can be instantiated in different ways and one can gain a better understanding of the space of possibilities permitted by an architecture by considering very different instantiations. We therefore
include a short examination of example systems for each of the architectures. It should be noted
that the examples we pick are not chosen because they represent particularly good instances of
spoken dialogue systems. They are chosen because they exploit particularly interesting features
of their underlying architecture.
We begin by discussing the very notion of a dialogue system architecture and, in particular, the
relation that it bears to the notion of dialogue management.
Chapter 2
Dialogue architectures and Dialogue management
A dialogue system architecture may be conceived of as a particular way of plugging together,
and perhaps controlling, a number of dialogue system components such that the whole is capable
of undertaking a dialogue with a user. A typical component list will include at least the following
linguistically oriented components:
speech recognizer
parser
semantic interpreter
dialogue manager
generator
speech synthesizer
Systems will also include other components, for example, a database query component or perhaps a component capable of executing commands.
Each of the linguistically oriented components, with the possible exception of dialogue management, has a reasonably well agreed function within the language engineering community.
Thus, speech recognizers decode speech signals into word strings, or n-best lists, or perhaps
word-graphs; parsers generate syntactic structures (generally trees) from strings or word-graphs;
semantic interpreters generate logical forms from syntactic structures; and so forth. Indeed,
such an understanding of the function of these components leads very naturally to the classical
“pipeline” architecture for joining them together. The output of the recognizer is simply piped
into the input of the parser whose output is itself piped into a semantic interpreter. The output
of semantic interpretation is piped into a dialogue manager which, typically, integrates the latest
interpretation into a dialogue interpretation (a semi-persistent structure that is built up and updated as a dialogue progresses and is the result of possibly many passes through the pipeline) and
then generates something, say a function call, that is meaningful to the back-end system. System outputs may also be thought of as generated through another pipeline consisting of meaning
generation by the back-end, utterance generation and then synthesis.
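The classical pipeline described above can be sketched as a chain of function calls. This is a minimal illustration only: every function, the toy "parser" and the data are hypothetical stand-ins, not components of any system discussed in this report.

```python
# Minimal sketch of the classical pipeline; all components are hypothetical stubs.

def recognize(audio):
    # Stand-in recognizer: pretend the audio decodes to this word string.
    return audio.lower().split()

def parse(words):
    # Stand-in parser: wrap the words in a trivial "tree".
    return ("S", words)

def interpret(tree):
    # Stand-in semantic interpreter: pull out a destination if one is mentioned.
    words = tree[1]
    if "to" in words and words.index("to") + 1 < len(words):
        return {"destination": words[words.index("to") + 1]}
    return {}

class DialogueManager:
    def __init__(self):
        # The semi-persistent dialogue interpretation, updated on every pass.
        self.dialogue_interpretation = {}

    def update(self, logical_form):
        self.dialogue_interpretation.update(logical_form)
        # Produce something meaningful to the back-end, here a query tuple.
        return ("lookup", dict(self.dialogue_interpretation))

def run_pipeline(audio, manager):
    # Each stage sees only the output of the stage before it.
    return manager.update(interpret(parse(recognize(audio))))

manager = DialogueManager()
first = run_pipeline("a ticket to Bedford", manager)
second = run_pipeline("leaving at five", manager)
```

The second pass adds nothing new, yet the call to the back-end still carries the destination from the first pass, because the dialogue interpretation persists across passes through the pipeline.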
The precise functions of a dialogue manager component are not however universally agreed.
Indeed ‘dialogue management’ is a term that can cover a great many different sorts of activity
and phenomena. These activities can include, but are not limited to:
1. turn-taking management: who can speak next, when, even for how long
2. topic management: what can be spoken about next
3. utterance understanding: understanding the content of an utterance in the context of previous dialogue
4. ‘intentional’ understanding: understanding the point or aim behind an utterance in the
context of previous dialogue
5. context maintenance: maintaining a linguistic and dialogue context
6. ‘intentional’ generation (or planning): generating a system objective given a current dialogue context
7. utterance generation: generating a suitable form to express an intention in the current
dialogue context
Sometimes systems will contain a component called ‘dialogue manager’ which will execute some
of these functions. Sometimes other components will execute or at least contribute to these
functions. Furthermore, the contribution that the dialogue manager makes to those functions for
which it does have responsibility has also to be considered in relation to the overall dialogue
system architecture itself. To take a very simple example, in the simplest sequential pipeline
architecture, it is not generally possible for users to take a turn to say something except when
the system permits it. The speech recognizer will not even be listening for user utterances until
the current information flow down the pipeline begun by the previous user utterance has made its
way to the end. Nevertheless, dialogue managers may still be claimed to be responsible for turn
taking in the following sense: they determine whether to solicit input from the user at any given
point or whether they will keep the turn to themselves. As another example, we can consider the
interpretation of elliptical or fragmentary utterances. These often occur in dialogue as answers to
questions. In very many systems, e.g. the Philips train-timetable system [4], ellipsis resolution is
a task solely of the dialogue manager. At this point, certain sorts of information can be brought
to bear on the resolution and, generally, certain other sorts cannot. If data is piped from parsing
and semantic interpretation into dialogue management, then, generally, syntactic or semantic
structure will no longer be available. In the SRI system described in [12], the resolution or filling
in of ellipsis occurs within SRI’s generic Core Language Engine processor (CLE) ([2]) and not
within Dialogue Management. In this way, syntactic and semantic structures which are not
generally available to dialogue management can be used to help resolve ellipsis. Nevertheless, it
is still required that the language processor have access to some dialogue information, namely the
dialogue (game or task) structure over utterances. In this way, proximity to the latest utterance
(in task structure, not necessarily historical sequence) can be used to score different candidate
resolutions of the ellipsis. So information flow from the dialogue manager to the linguistic
processor is required. [12] claim that ellipsis is a linguistic phenomenon and should be handled
in the linguistic processor. System reconfiguration becomes easier in that no additional work is
required (in respect of ellipsis) to port the system to a new domain. The general point is simply
that the role of an individual component can only properly be considered in relation to that of the
other components and the dataflows between them.
The two examples above illustrate three important general issues that arise in considering the
merits of a pipeline architecture (or indeed any other): context sensitivity, early decision making
and the possibility of asynchronous interaction.
2.1 Context Sensitivity
Context sensitivity arises in general because if the pipeline is such that each component can only
use the information sent to it by an upstream component and possibly some internal state of its
own, then recognizers and parsers, for example, simply cannot be made sensitive to context and
state uncovered by the dialogue manager. The picture resembles the game of Chinese Whispers
in which one whispers something to the first child in a line of children who then repeats it, as best
he can, in a whisper to the next child. The message is repeated, or mis-repeated down the line. At
the end of the line, one generally finds the message has been transformed out of all recognition.
Unless a child mis-pronounces the message whilst relaying it, the final version of the message
will be determined simply by the initial message and how it was processed by each child using
only the input message and their own internal state as information to go on.
In order to build in context sensitivity, one must alter the simple version of the pipeline. One very
common desire is to make speech recognition dependent upon dialogue state. If one can be sure
enough about which dialogue state one is in, for example, if one can be sure one has just asked
a yes-no question, then it makes sense to use this information to constrain the recognizer to be
especially on the look-out for ‘yes’ and ‘no’ and all other indirect ways of saying ‘yes’ and ‘no’.
In Chinese Whispers, the equivalent notion would be having the last child in the line prime the
first child in the line as to what to expect to hear. In Chinese Whispers, this strategy will probably
not help at all unless the last child has good reason for his expectations. Perhaps the game leader
is his father and the child knows his father’s favourite epigram. In any case, there appear to
be two main strategies available for building in such context sensitivity. First, the downstream
component can directly update the upstream component by making a call to it or sending it a
message. Secondly, the downstream component can update a piece of global state which the
upstream component consults before it undertakes its processing. The component may consult
it either by reading it itself off the global state or by receiving it as an argument or parameter in
whatever call causes the component to act. The two strategies have quite different implications.
In particular, in the first strategy, components must now be viewed as objects providing several
services. For example, the Nuance speech recognizer [16] is a component that offers both an
audio-to-string conversion service and a set-my-grammar-expectations service. Such services
could be called upon by any other component in the system. This encourages a distributed object
view of the architecture. On the second strategy, the recognizer only provides the first service
albeit one which is sensitive to a piece of global state which must be set by the other components.
This encourages a blackboard view of the architecture.
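The difference between the two strategies can be made concrete with a small sketch. The class names, the grammar table and the toy decoding function are all hypothetical; they are not the API of the Nuance recognizer or of any other product.

```python
# Sketch of the two strategies; recognizer classes and grammar names are hypothetical.

GRAMMARS = {"yes_no": {"yes", "no", "yeah", "nope"}, "open": None}

def decode(words, grammar_name):
    # Toy decoding: keep only words allowed by the active grammar.
    allowed = GRAMMARS[grammar_name]
    return [w for w in words if allowed is None or w in allowed]

class ServiceRecognizer:
    """Strategy 1: the recognizer is an object offering two services."""
    def __init__(self):
        self._grammar = "open"

    def set_grammar_expectations(self, grammar_name):
        # Called directly by a downstream component such as the dialogue manager.
        self._grammar = grammar_name

    def audio_to_string(self, words):
        return decode(words, self._grammar)

class BlackboardRecognizer:
    """Strategy 2: one service, sensitive to shared global state."""
    def __init__(self, blackboard):
        self._blackboard = blackboard

    def audio_to_string(self, words):
        # Consult the global state before processing.
        return decode(words, self._blackboard.get("expected_grammar", "open"))

heard = ["um", "yeah", "please"]

# Strategy 1: the dialogue manager sends a message to the recognizer.
rec1 = ServiceRecognizer()
rec1.set_grammar_expectations("yes_no")
result1 = rec1.audio_to_string(heard)

# Strategy 2: the dialogue manager writes to the blackboard instead.
blackboard = {}
rec2 = BlackboardRecognizer(blackboard)
blackboard["expected_grammar"] = "yes_no"
result2 = rec2.audio_to_string(heard)
```

Both variants yield the same result here. What differs is who needs to know about whom: in the first, the dialogue manager must hold a reference to the recognizer; in the second, both only need to agree on the name of a key in the shared state.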
2.2 Early decision making
The issue of early decision making arises in a simple pipeline architecture again because downstream components may have access to information which could be usefully exploited by upstream components. For example, a statistical n-gram recognizer only has limited grammatical
knowledge to deploy in order to decide on the best output string for a given audio input. The
downstream parsing component will possess much more detailed grammatical knowledge but
this information is not available to the upstream component. It may not be possible to package
this information (e.g. into a dialogue state that one can associate with a grammar or language
model) such that the upstream component can use it.
There again appear to be two main ways to avoid the problem of early decision making. First, one
can simply not make (all of) the decisions early. Secondly, one can alter the pipeline architecture. The
Spoken Language Translator system [6] is a good example of the first method. The output from
the statistical recognizer is a list of the top 5 hypotheses rather than just the top 1. Each of the
hypotheses is parsed by the downstream component (technically, a lattice constructed over the hypotheses is parsed, to save duplicating effort over common parts of different hypotheses) which, by invoking the extra information
available to it, can decide which the best string hypothesis actually was. In fact, the policy is
propagated throughout the system so that an evaluation of the possible output translations also
helps determine the system’s final belief about the best speech hypothesis. The success of this
policy naturally depends on whether the top 5 hypotheses are indeed likely to contain the correct
hypothesis even when the top hypothesis is not itself correct. In general, the policy requires
upstream components to be able to generate some set of results (a proper subset of all possible
results) likely to contain a good hypothesis in the absence of information available downstream.
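The first method, deferring the decision, might look roughly as follows. The scores and the toy grammaticality check are invented for illustration and do not reproduce the Spoken Language Translator's actual processing.

```python
# Sketch of deferring the decision: the recognizer emits an N-best list and a
# downstream stage with more grammatical knowledge picks the winner.
# The scores and the toy grammar check are invented for illustration.

def parses(hypothesis):
    # Hypothetical stand-in for a parser: accept only "<verb> ... to <place>".
    words = hypothesis.split()
    return len(words) >= 3 and words[0] in {"fly", "go"} and "to" in words[1:-1]

def choose_hypothesis(n_best):
    """n_best: list of (string, acoustic_score) pairs, best score first."""
    parsable = [(text, score) for text, score in n_best if parses(text)]
    if parsable:
        # Among parsable hypotheses keep the recognizer's own preference.
        return max(parsable, key=lambda pair: pair[1])[0]
    # Nothing parses: fall back on the recognizer's top choice.
    return n_best[0][0]

n_best = [
    ("fly two boston", 0.41),
    ("fly to boston", 0.38),
    ("flight to boston", 0.12),
]
best = choose_hypothesis(n_best)
```

Here the recognizer's top choice is rejected by the downstream check and the second hypothesis wins, which is only possible because the correct string was somewhere in the list.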
In the second method, one passes information derived by a downstream component back to the
upstream component in order to generate a new and improved hypothesis by that component. A
very simple example of this is provided by Mitre’s version of CommandTalk ([1], discussed further below). In this system, the recognizer produces a top hypothesis only, but if the downstream
component cannot make sense of it, recognition is re-invoked in order to produce the next best
hypothesis. This scheme may possibly have a speed advantage over generating an n-best hypothesis list straightaway if the top hypothesis is generally correct. [6] report, for example, that 5-best
operation of SRI’s Decipher recognizer was a little more than twice as slow as 1-best operation.
It is also arguably more flexible in that one does not have to fix in advance the value of N in
N-best. On the other hand, it may turn out to be considerably slower in practice and, in any case,
does not permit the comparison of different hypotheses. The first processable hypothesis one
obtains is the only processable hypothesis one obtains. In any case, a far more interesting policy
would be one which involved sending more information back from downstream processing than
just a simple reject message. For instance, in N-best processing on a statistical recognizer it can
easily happen that there are 2 (or more) particularly significant points at which the recognizer
cannot easily distinguish the competing hypotheses. The top 4 hypotheses in a 5 best list may
then just contain different ways of resolving one of these points whilst keeping the other point
fixed on the locally preferred but incorrect solution. Ideally, one would like downstream processing to recognize the partial correctness of the 5th solution and ask upstream processing to deliver
another 5-best list all of which extend the known partially correct solution. (Again, it may not be
practically feasible to configure the upstream processor appropriately in advance).
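In its simplest form, the second method is a retry loop driven by a reject signal from downstream. The sketch below assumes a hypothetical recognizer interface that can be asked for the hypothesis at a given rank; the example strings are invented.

```python
# Sketch of re-invoking recognition on rejection; the interfaces are hypothetical.

def recognize_with_retry(next_hypothesis, understands, max_attempts=5):
    """Ask for one hypothesis at a time until downstream accepts one.

    next_hypothesis(rank) returns the rank-th best string, or None if exhausted.
    understands(text) is the downstream accept/reject decision.
    Returns (accepted_text_or_None, number_of_recognizer_calls).
    """
    for rank in range(max_attempts):
        text = next_hypothesis(rank)
        if text is None:
            return None, rank + 1
        if understands(text):
            # The first processable hypothesis is the only one ever seen,
            # so competing hypotheses are never compared.
            return text, rank + 1
    return None, max_attempts

ranked = ["move the tank two checkpoint one", "move the tank to checkpoint one"]

def next_hypothesis(rank):
    return ranked[rank] if rank < len(ranked) else None

def understands(text):
    return " to checkpoint " in text

accepted, calls = recognize_with_retry(next_hypothesis, understands)
```

If one assumes, purely for the sake of argument, that each re-invocation costs as much as a fresh 1-best call, then with 5-best operation costing a little more than two 1-best calls the retry scheme is cheaper only when a hypothesis is accepted within the first two attempts.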
2.3 Asynchronous interaction
A pipeline architecture does not permit asynchronous processing in the sense that an upstream
component must complete its operation before its immediate downstream neighbour can start.
Consequently, any controlling module must call each component in turn and wait for it to return
before calling the next. This does not rule out all parallelism, of course, as one
can exploit temporal parallelism by permitting an upstream component to begin processing its
next input while downstream components are still processing earlier inputs. In fact, academic literature on Computer Architectures reserves the word ‘pipelining’ for just this sort of parallelism.
However, such parallelism appears only to have a limited use in the standard pipeline of spoken
dialogue systems. The problem is simply that by permitting speech recognition to occur before
the dialogue manager has finished processing the previous utterance, one creates an opportunity
for adding or altering what has been said which is not real. It is exactly like conducting a conversation over a very slow communications link. One can say something to one’s partner and,
during the time that it takes to transmit the message, one can attempt to add to or alter one’s first
expression. However, while the additional message is being transmitted, the partner constructs
a response to the original utterance and he receives the addition only after he has issued his response. The result is inevitably confusion. The confusion is particularly serious if the partner is
liable to interpret the new message in the light of making his own utterance even though that utterance played no part in generating that new message. In such circumstances, people soon learn
to observe strict turn-taking. Not every instance of ‘pipelining’ parallelism need be confusing,
of course. In SmartSpeak ([11]), the system permits a user to begin outlining his requirements
for the return leg of a journey while the back-end system is consulting an online database for
an itinerary for the outward leg. This sort of parallelism can be successful precisely because
both partners are aware that the new dialogue will not affect the results to come of the previous
dialogue.
Evidently, asynchronous processing is very much a feature of human-human dialogues at least
in circumstances where the communications link is fast enough. People interrupt each other.
That is, they process part of what is being said and understand enough to make them interrupt
so that they may correct it or stop any more being said. People also backchannel. For example,
they confirm or assent to early parts of utterances while later parts of the same utterance are still
being generated. A relatively recent addition to speech recognition software is the possibility of
‘barge-in’ in which, if the system is saying something, the user may barge in and say something
themselves. In such systems, the system stops saying what it was saying and starts listening
instead. This can be useful if the user indeed wishes to interrupt but may lead to undesired
behaviour if the user in fact was ‘back-channelling’. Furthermore, detecting an interruption is
only useful if one can successfully interpret the interruption. Unfortunately, the meaning of an
interruption often depends on the material immediately preceding it.
S: Depart Bideford at 5 pm for Bradford // via the
U:                                         No Bedford
Unless one knows that ‘No Bedford’ immediately follows ‘for Bradford’, one cannot interpret the
interruption correctly. Current implementations of barge-in software do not register how much of
an utterance has been synthesized so far. Interruptions have to be interpreted either in the context
that preceded generation of the interrupted utterance or in the context that would have resulted
had it not been interrupted.
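To illustrate what is missing, the sketch below assumes a hypothetical synthesizer that can report how many words it had spoken when the user barged in. Neither this interface nor the slot representation comes from any system discussed here.

```python
# Sketch only: assumes a hypothetical synthesizer that can report how many
# words had been spoken when the user barged in.

def spoken_prefix(utterance_words, words_spoken):
    return utterance_words[:words_spoken]

def apply_correction(slots, utterance_words, words_spoken, correction):
    """Replace the most recently spoken slot value with the user's correction.

    slots maps slot names to the values mentioned in the system utterance.
    """
    prefix = spoken_prefix(utterance_words, words_spoken)
    # Walk backwards from the interruption point to the nearest slot value.
    for word in reversed(prefix):
        for name, value in slots.items():
            if value == word:
                return {**slots, name: correction}
    return dict(slots)

utterance = "Depart Bideford at 5 pm for Bradford via the".split()
slots = {"origin": "Bideford", "time": "5", "destination": "Bradford"}

# The user interrupts just after 'Bradford' (7 words spoken) with 'No Bedford'.
corrected = apply_correction(slots, utterance, 7, "Bedford")
# Had the interruption come after only 'Depart Bideford', the origin would change.
early = apply_correction(slots, utterance, 2, "Bedford")
```

The same two words from the user change a different slot depending on how far synthesis had progressed, which is exactly the information that barge-in implementations of the kind described do not record.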
Chapter 3
The DARPA Communicator Architecture
The DARPA Communicator Architecture is primarily designed to provide a common framework
for dialogue systems that promotes interoperability of components by permitting a plug-and-play
approach. It is hoped that this will support rapid and cost-effective development of spoken dialogue systems (indeed, multi-modal systems that include speech). Multiple developers ought
to be able to combine different commercial and research components so long as they are architecture compliant. Furthermore, collaboration between researchers will be fostered and a more
structured approach to the testing of both whole systems and their components will be enabled.
The current version of the Architecture takes the form of a Hub-and-Spoke system. The architecture is often referred to simply as the Darpa Hub, and components are described as being
Hub-compliant. The architecture consists of a central Hub process which connects with any number of Servers. A typical instance of the Darpa Hub architecture is depicted in Figure 3.1. Typically,
each server will host one dialogue system component as described above: speech recognition,
natural language parsing, dialogue management and so forth. The Hub itself has three major
functions:
1. Routing. The Hub is responsible for correctly directing messages from one component to
another
2. State Maintenance. The Hub can store global state information and make it accessible to
all servers
3. Flow Control. The Hub can also direct the flow of processing control, deciding which
servers should be called upon next
Figure 3.1: A typical Darpa Hub Configuration
An instance of the Darpa Hub system is declared in a script file which is read by the Hub. The
script file declares all the information that the Hub needs to know about the servers, for example,
for each server, its name, its address and the operations that the server provides. The script file
can also contain a program for directing flow of control. When the Hub is started, it attempts
to make connections with all the declared servers, processes an initial “token”, and then begins
monitoring for incoming messages. Tokens are frames with a name, an index and a list of key
value pairs. The Hub tries to associate each incoming message with a token. If the message
contains a token index, then the message is associated with the existing token with that index
and is taken to be a response to a previous message that the Hub sent to that server. Otherwise
the message corresponds to a new token and the token is initialized with any keys and values
specified in the message. New tokens are then processed by executing a program (with the same
name as the frame) that contains a set of rules determining what to do next. Each rule consists
of a pre-condition (on keys and values, for example) and an action to undertake. Old tokens are
processed by resuming execution of the instance of the rule already in existence for the token.
Tokens are destroyed when their programs terminate.
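The association of incoming messages with tokens can be sketched as follows. The class, the field names and the decision to merge a reply's key-value pairs into the existing token are assumptions made for illustration; they are not the Hub's actual API.

```python
# Sketch of how incoming messages might be associated with tokens.
# The class and field names are hypothetical, not the Hub's actual API.
import itertools

class Hub:
    def __init__(self):
        self.tokens = {}                 # token index -> token frame
        self._next_index = itertools.count(1)

    def receive(self, message):
        """Return (token, is_new) for an incoming message (a dict)."""
        index = message.get("token_index")
        if index in self.tokens:
            # A reply to an earlier Hub request. Merging its pairs into the
            # existing token is an assumption of this sketch.
            token = self.tokens[index]
            token["pairs"].update(message.get("pairs", {}))
            return token, False
        # Otherwise the message starts a new token initialized from its contents.
        index = next(self._next_index)
        token = {"name": message["name"], "index": index,
                 "pairs": dict(message.get("pairs", {}))}
        self.tokens[index] = token
        return token, True

    def finish(self, index):
        # Tokens are destroyed when their programs terminate.
        del self.tokens[index]

hub = Hub()
token, is_new = hub.receive({"name": "main", "pairs": {":audio": b"\x00\x01"}})
reply, reply_is_new = hub.receive(
    {"name": "main", "token_index": token["index"],
     "pairs": {":sr-hypothesis": "move the tank"}})
```

The first message carries no index and so creates a token; the second quotes that token's index and is therefore treated as a response rather than as the start of a new program.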
The Darpa Hub can also operate in “scriptless” mode, that is, where no programs are specified
for the Hub to find for new incoming messages. In this case, the Hub does not direct the flow
of processing control. Each message is then expected to specify an operation which is provided
by (at least one of) the servers. The function of the Hub then becomes simply to select a server
which supports the operation and dispatch it correctly.
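Scriptless dispatch then reduces to a lookup. In the sketch below the server table is illustrative, and the tie-breaking rule (first server in alphabetical order) is an assumption, since the text says only that the Hub selects a server which supports the operation.

```python
# Sketch of scriptless dispatch; the server table and tie-breaking are illustrative.

SERVERS = {
    "recognizer": {"reinitialize", "RecognizeAudio", "UpdateGrammar", "ChangeRegion"},
    "nl": {"reinitialize", "Do-NL-understanding"},
}

def select_server(operation, servers=SERVERS):
    """Pick a server that supports the requested operation."""
    # Assumption: when several servers qualify, take the first alphabetically.
    for name in sorted(servers):
        if operation in servers[name]:
            return name
    raise LookupError(f"no server provides {operation!r}")

def dispatch(message, servers=SERVERS):
    # In scriptless mode the message itself names the operation to run.
    return select_server(message["operation"], servers), message["operation"]

target = dispatch({"operation": "ChangeRegion", "region": "example-region"})
```

Because no program runs in this mode, any sequencing of operations has to be arranged by the servers themselves through the messages they send.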
3.1 Instance 1: The Mitre ‘CommandTalk’ system
3.1.1 Outline
The Mitre Corporation has built a conversational system for use in a distributed wargame simulation, based heavily on SRI’s original CommandTalk system [18], but particularly configured
for the Darpa Hub Architecture. The system consists of a number of servers including: speech
recognition, language understanding, a pragmatics server (for generating back-end simulation
commands), a back-end interface to the wargame simulator itself and, apparently, a dialogue
manager although it is unclear quite what its function is.
Figure 3.2 is an example of a Hub script server declaration:
SERVER: recognizer
HOST: rec-host
PORT: 15003
OPERATIONS: reinitialize RecognizeAudio UpdateGrammar ChangeRegion
INIT: :grammar "stricom"
Figure 3.2: Hub script: server declarations
The operations declared by the recognizer include RecognizeAudio, an operation which expects a token containing binary data and which generates a value containing the most likely string
of words. Another operation is ChangeRegion which determines which part of the recognizer
grammar is active. This expects a token containing the name of a grammar region.
Figure 3.3 shows example flow of control rules:
The first rule specifies that if the current token contains a speech recognition hypothesis, then
the operation Do-NL-understanding should be invoked with an input parameter of sr-hypothesis
and two output parameters: nl-output and synthesis-test. Do-NL-understanding is an operation
supplied by the natural language processing server.
    RULE:  :sr-hypothesis Do-NL-understanding
    IN:    :sr-hypothesis
    OUT:   :nl-output :synthesis-test

    RULE:  nl-output Do-context-tracking
    IN:    :nl-output :sr-hypothesis
    OUT:   ct-output

Figure 3.3: Hub script: program declaration

The flow of control in the Mitre CommandTalk system is determined by a Hub script. There is
one principal token in the system containing keys for audio data, a speech recognition hypothesis
(:sr-hypothesis), a logical form (:nl-output), a resolved logical form (a logical form with
pronominal and other referential expressions resolved), a set of back-end simulation commands,
a back-end response, an output logical form, an output text string, and, finally, output audio data.
Typically, the Hub will give a token to start capture of input binary audio data, and, if data is
indeed captured, the value of the appropriate key in the token will be filled in. The Hub will then
determine that speech recognition should be called next with the value of this key. If a hypothesis
results from speech recognition, then :sr-hypothesis is filled in and the Hub then determines that
natural language understanding should be called next. At each point, the Hub examines the state
of its token in order to decide what service to call next.
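The token-driven control just described can be sketched as a small rule interpreter. This is an illustration under stated assumptions, not the Hub's actual implementation: the `Rule` tuple, the `run_hub` function, the "each rule fires at most once" policy and the stand-in servers are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    trigger: str          # key that must be present in the token
    operation: str        # operation to invoke when the trigger is present
    inputs: List[str]     # keys passed to the server
    outputs: List[str]    # keys the server is allowed to fill in


def run_hub(token: dict, rules: List[Rule],
            servers: Dict[str, Callable[[dict], dict]]) -> List[str]:
    """Fire rules until no unfired rule has its trigger key in the token."""
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.operation in fired or rule.trigger not in token:
                continue
            args = {k: token[k] for k in rule.inputs if k in token}
            result = servers[rule.operation](args)
            # Only the declared output keys are written back to the token.
            token.update({k: v for k, v in result.items() if k in rule.outputs})
            fired.append(rule.operation)
            progress = True
    return fired


rules = [
    Rule(":sr-hypothesis", "Do-NL-understanding",
         [":sr-hypothesis"], [":nl-output", ":synthesis-test"]),
    Rule(":nl-output", "Do-context-tracking",
         [":nl-output", ":sr-hypothesis"], [":ct-output"]),
]
servers = {  # hypothetical stand-ins for the real servers
    "Do-NL-understanding": lambda a: {":nl-output": ("parsed", a[":sr-hypothesis"])},
    "Do-context-tracking": lambda a: {":ct-output": ("resolved", a[":nl-output"])},
}
token = {":sr-hypothesis": "move the tank"}
order = run_hub(token, rules, servers)
```

No server names its successor; the order of calls follows entirely from which keys of the token have been filled in.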
3.1.2 Discussion
One of the advantages claimed by [1] for the Hub architecture is that it facilitates different execution flows than the standard ‘pipeline’ picture. One might expect, then, to see context-sensitivity
of the sort outlined earlier highlighted. Somewhat surprisingly this does not appear to be the
case.
The first example given of a non-pipeline information flow is this: if the pragmatics server cannot
fill in any back-end simulation commands from the input logical form, then it can simply write
an appropriate text message into the output text string key in the token. In the standard pipeline,
this key would not receive a value until after the back-end server (which executes commands),
the pragmatics output server (which generates logical forms) and the natural language generator
(which generates output text strings) had all been called. There is a simple rule in the Hub script
which determines that the synthesizer should be called once the output text string key has been
filled in. This rule is quite insensitive to which processes have run, what other pieces of the
token have or have not been filled or indeed who filled in the output text string key. Clearly,
this non-standard flow is essentially one of ‘skipping’ parts of the pipeline although the flow of
information is still in the same direction. One can of course replicate the effect in a standard
pipeline by arranging for a special message to be passed downstream from the pragmatics server
to the synthesizer. Intervening servers must know not to act on the message but simply to relay
it. Clearly, the Hub architecture avoids having both to set up such a special message and to
configure the intervening servers.
The second example is that the language understanding server can cause the speech recognizer
to be reinvoked by generating a new token. The speech recognizer will then issue its next best
hypothesis and this cycle can continue until the recognizer issues a hypothesis that can be understood or until the recognizer has no more hypotheses (as discussed above). Although information
is passed back in this example, that information is minimal.
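The re-recognition cycle amounts to walking down the recognizer's ranked hypotheses until one can be understood. The sketch below collapses the token-passing through the Hub into a plain loop; `first_understood` and `toy_understand` are hypothetical names and the toy grammar is invented for illustration.

```python
from typing import Callable, Iterable, Optional


def first_understood(hypotheses: Iterable[str],
                     understand: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Try hypotheses in rank order; return the first analysis that succeeds."""
    for hypothesis in hypotheses:
        analysis = understand(hypothesis)
        if analysis is not None:
            return analysis
    return None  # the recognizer has no more hypotheses


def toy_understand(text: str) -> Optional[dict]:
    # Hypothetical stand-in for the language understanding server.
    words = text.split()
    if len(words) == 2 and words[0] == "advance":
        return {"act": "advance", "unit": words[1]}
    return None


nbest = ["a dance platoon", "advance platoon"]
result = first_understood(nbest, toy_understand)
```

The only information flowing back to the recognizer is the implicit "try again", which is the sense in which the feedback is minimal.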
Besides, a much more significant point in both these examples appears to concern who is deciding
on the action. In the first example, the pragmatics server is effectively deciding on the next action,
what should be said and, indeed, how to say it. In the Mitre CommandTalk system, this consists
simply in issuing a standard error message, but one can easily imagine circumstances in which a
more flexible response is desired, one which depends in part on the current state of the dialogue.
In this case, one would want the pragmatics server to send a failure message (and possibly a
reason) to a dialogue manager with access to dialogue state information. Could this manager
be the Hub itself? It appears this cannot be so since all a Hub script can do is test certain
conditions on tokens and then call another server. Consequently, for more flexible behaviour,
one is almost bound to set up a dialogue management server and have the Hub simply direct
the failure message to it. Indeed, as a general rule, it is not difficult to imagine that one will
nearly always want to achieve flexibility by this means. The issue then becomes: what is the
value of Hub scripting if this is so? Similar considerations apply also in the second example of
re-invoking the recognizer. The issue is again whether the language understanding server should
be deciding that re-recognition is required. If it doesn’t decide but the dialogue manager does,
then what really is the role of the Hub script in flow of control?
From the published description of Mitre’s CommandTalk, there do appear to be two more interesting cases of information flow. First, since the simulation itself can generate new objects to be
talked about, the simulator needs to inform the module in charge of reference resolution (e.g. of
pronouns) of their presence. This can happen at any time. In fact, this is not so much a case of
information flow being up or down a pipeline, as the result of action by an autonomous agent
entirely outside the realm of linguistic processing. The Hub can of course be used to ensure
that a message from the simulator about new objects always gets routed to the reference resolver.
The point is that this will not involve the flow-of-control property of Hub scripting. The second
example is that an Agent called Dialogue Management (it is far from clear what its other roles
are) can send a message to change the recognizer grammar, presumably based on a belief about
what the current dialogue state is. It may be that this example is not highlighted simply because
this feature remains unchanged from the original CommandTalk system developed by SRI. Also,
it again requires a simple control message to be sent from one server to another and the Hub is
required to function just as a router.
3.2 Instance 2: The CU/CMU Communicator System
3.2.1 Outline
The Colorado University system [19] is another Hub compliant dialogue system. The system
contains 7 servers: an audio server, speech recognition, language processing, dialogue management, database interface, synthesis and one other peripheral (for our purposes) server. The
particular emphasis in this system is on robust processing. This is to be achieved by robust parsing strategies which emphasise semantics over syntax and by an event-driven dialogue manager.
(Indeed, apart from the parser and dialogue manager, the system is essentially the CMU Communicator system; hence we have called it the CU/CMU system.) The dialogue manager
itself has a simple structure strongly reminiscent of that employed in the Philips train timetable
system ([4]). Any parse (a set of filled slots) that it receives is first integrated into the dialogue
context maintained by the dialogue manager and then, in order, the dialogue manager will
1. clarify the interpretation if necessary
2. finish if done
3. submit a database query (if sufficient information is present) and give the user the first
answer
4. prompt the user for the next unfilled slot or highest priority information
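The four steps above can be sketched as a single decision function. This is a simplified illustration, not the Colorado implementation: `next_action` and its parameters are hypothetical, and "sufficient information" is assumed here to mean that every slot is filled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, Optional[str]]


def next_action(context: Context,
                parse: Dict[str, str],
                slot_order: List[str],
                needs_clarification: Callable[[Context], Optional[str]],
                done: bool = False) -> Tuple[str, object]:
    """Integrate a parse, then apply the four checks in the stated order."""
    context.update(parse)                         # integrate new slot values first
    unclear = needs_clarification(context)
    if unclear is not None:
        return ("clarify", unclear)               # 1. clarify if necessary
    if done:
        return ("finish", None)                   # 2. finish if done
    missing = [s for s in slot_order if context.get(s) is None]
    if not missing:
        return ("query_database", dict(context))  # 3. enough information present
    return ("prompt", missing[0])                 # 4. prompt for next unfilled slot


context = {"departure-location": None, "arrival-location": None}
action = next_action(context, {"departure-location": "Denver"},
                     ["departure-location", "arrival-location"],
                     needs_clarification=lambda c: None)
```

The only state consulted is the current set of slot values, which matches the event-driven character of the manager.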
3.2.2 Discussion
There is a Hub script in the Colorado system but it is used simply for message routing and
not complex flow of control decisions. The Hub script ensures that audio input is passed to
recognition, that the results of recognition are passed to language understanding, that the results
of language understanding are passed to dialogue management and so on. That is, although the
Colorado system uses the Darpa Hub, it appears to be used simply to implement the standard
pipeline dataflow. Consequently, questions arise as to how, if at all, issues of context-sensitivity
and early decision making are tackled.
One clear instance of early decision making in the Colorado system is that the recognizer currently delivers its best string hypothesis to the language processor for analysis. Unsurprisingly,
given the general architecture, the designers intend to make the interface between recognition
and parsing contain a word lattice rather than a string.
As far as context sensitivity is concerned, it is very unclear how even simple short answers
to questions can be handled in the Colorado system. (Of course, by clever design of the set
of prompts, one might avoid the problem.) The language processor maps strings onto frames
consisting of a set of slots which the system aims to fill. In the travel domain, these slots include
departure-location, arrival-location and so forth. Each slot has a context free
grammar associated with it which defines the word strings that can match it. It is unclear whether
a short answer such as ‘Boston’ must be mapped into one of departure-location and
arrival-location. Given the standard pipeline, it is unclear how the language processor
could use information about the last question to prefer one or the other. Equally, the dialogue
manager appears not to maintain any history or state other than its beliefs about the current
values of slots. In this case, the decision about the interpretation of a short answer cannot even
be delayed until dialogue management. Indeed the absence of state in the dialogue manager is
claimed to make the system more robust since there is nothing to lose track of.
3.3 Conclusion
The Darpa Communicator Architecture is a Hub and Spoke architecture designed to support
‘plug and play’ for different linguistic components. The various components sit on different
servers. The services they can provide are declared to the Hub. One can plug different components into the Darpa Hub and play with the resulting system just so long as they provide the
same service. Nothing is stipulated about the internal structure of these services or the platform
on which they are provided.
The Hub is primarily a data router. It ensures that messages from one component to another are
correctly transmitted. The Hub can also store global information common to all servers. The
Hub can also direct the flow of control amongst the components based on the contents of a token
which persists over a period of time and which messages from servers can refer to. Although the
Hub can direct control in this way, many systems in fact only use it to implement the standard
data pipeline. Furthermore, for the most flexible dialogue control, it appears best to vest control
in a dedicated dialogue management server rather than in the Hub itself.
Chapter 4
The Open Agent Architecture
The Open Agent Architecture (OAA) is not designed especially for dialogue systems. It provides
a general framework for building distributed software systems. OAA predates the Darpa Communicator Architecture by several years but there are several clear resemblances between them.
First, OAA has a Hub and Spoke architecture. The OAA Hub is called the Facilitator. The spokes
are client agents which register the services they can perform with the Facilitator. In a similar
fashion to CORBA [8] and DCOM [14], agents build in notions of encapsulation whilst a system
of agents can be spread across multiple hardware and software platforms. In general, the client
agents are to be thought of as a community of agents cooperating through the Facilitator. If one
agent needs a service, it sends a request to the Facilitator which checks its registry of services and
forwards the request to an appropriate agent. The Facilitator also receives replies from that agent
and sends them back to the original requesting agent. That is, an agent does not need to know
the identity of another agent who can carry out the service. Indeed agents can submit complex
goals to the Facilitator (e.g. Request-1 AND Request-2) and the Facilitator can delegate
different parts to different agents in parallel. The submitting agent need know nothing about how
the Facilitator delegates the task. Facilitators can also store global information for all agents and
thereby permit a blackboard style of interaction. Agents can make asynchronous requests.
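The Facilitator's role can be sketched as follows. This is a toy model and not the OAA library or its API: the `Facilitator` class, its method names and the use of a thread pool for parallel delegation are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class Facilitator:
    """Toy facilitator: agents register services; requesters never name an agent."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[str], str]] = {}
        self.blackboard: Dict[str, object] = {}   # global data shared by all agents

    def register(self, service: str, agent: Callable[[str], str]) -> None:
        self._registry[service] = agent

    def solve(self, service: str, argument: str) -> str:
        # Look up a provider and relay its reply to the requester.
        return self._registry[service](argument)

    def solve_all(self, goals: List[Tuple[str, str]]) -> List[str]:
        # A conjunctive goal: delegate each part to its agent in parallel.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(self.solve, s, a) for s, a in goals]
            return [f.result() for f in futures]


facilitator = Facilitator()
facilitator.register("recognize", lambda audio: audio.upper())
facilitator.register("synthesize", lambda text: f"<audio:{text}>")
results = facilitator.solve_all([("recognize", "hello"), ("synthesize", "hi")])
```

The requesting code submits goals by service name only and learns nothing about which agent handled each part.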
4.1 Instance: SRI’s CommandTalk
4.1.1 Outline
CommandTalk [18] is a spoken language interface to a battlefield simulator which incorporates
a number of agents including: speech recognition, natural language parsing, prosody, synthesis,
dialogue management, and the simulator itself. Speech recognition is undertaken by the Nuance
recognition engine [16] enclosed in a ‘thin’ OAA wrapper which declares two services to the
Facilitator: speech recognition and change of grammar. Natural language parsing and interpretation is undertaken by the unification based Gemini parsing system. The dialogue manager is
responsible for managing the linguistic dialogue context, interpreting user utterances within that
context, planning the next move and also for changing the grammar within the recognizer. For
any new input, it first considers whether it is a correction of a prior utterance, then it considers
whether it continues the current discourse segment, and finally it considers whether it initiates a
new segment. A segment represents a local interaction with the user which is defined by a simple finite state machine. Most machines have only two states corresponding to the user saying
something and the system responding. Longer machines encoding form-filling dialogues are also
possible. The current state of a segment is represented by a frame containing: a dialogue state
identifier, semantic representations of the user’s and system’s utterances; the background ‘open
proposition’ for the user response; focus spaces of objects and gestures for anaphoric resolution;
a guard. A stack of frames together with a history trail of all operations on the stack represents
the state of the dialogue as a whole. If a new user utterance is deemed to continue the current
segment (the one on top of the stack), e.g. the user answers a system question, then the system
calculates the next state in the appropriate finite state machine and pops the current segment.
If the next state is not final, a new segment containing the new state identifier is pushed back
onto the stack. If the new user utterance begins a new segment, then the system first considers
whether the current segment is actually finished or not. If the current frame contains an open
proposition, then it is deemed not to be finished and the new frame is just pushed onto the stack.
Otherwise the current frame is popped before the new one is pushed. Once a frame is popped,
the information that was in it is no longer available for interpretation. The history trail can be
used to recover in situations where the system has considered an interaction closed only for the
user to issue a subsequent correction.
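The stack discipline described above can be sketched in a few lines. This is a simplified model, not the CommandTalk code: the `Frame` fields are reduced to a machine name, a state identifier and an open proposition, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Frame:
    machine: str
    state: str
    open_proposition: Optional[str] = None


class DialogueStack:
    def __init__(self, machines: Dict[str, Dict[Tuple[str, str], str]],
                 final: Set[str]) -> None:
        self.machines = machines      # machine -> {(state, move): next state}
        self.final = final            # names of final states
        self.stack: List[Frame] = []
        self.history: List[Tuple[str, Frame]] = []   # trail of stack operations

    def _push(self, frame: Frame) -> None:
        self.stack.append(frame)
        self.history.append(("push", frame))

    def _pop(self) -> Frame:
        frame = self.stack.pop()
        self.history.append(("pop", frame))
        return frame

    def continue_segment(self, move: str) -> None:
        # Pop the current segment; push a successor unless the machine is done.
        current = self._pop()
        nxt = self.machines[current.machine][(current.state, move)]
        if nxt not in self.final:
            self._push(Frame(current.machine, nxt))

    def start_segment(self, frame: Frame) -> None:
        # An unfinished segment (open proposition) stays beneath the new one.
        if self.stack and self.stack[-1].open_proposition is None:
            self._pop()
        self._push(frame)


machines = {"ask": {("asked", "answer"): "done"}}
dialogue = DialogueStack(machines, final={"done"})
dialogue.start_segment(Frame("ask", "asked", open_proposition="which unit?"))
dialogue.start_segment(Frame("ask", "asked", open_proposition="which route?"))
depth_after_nested = len(dialogue.stack)
dialogue.continue_segment("answer")
```

Popped frames survive only in `history`, which is what would allow recovery when the user corrects an interaction the system had already closed.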
4.1.2 Discussion
CommandTalk agents are connected through the Open Agent Architecture. It appears, although
the matter is not explicitly referred to in any of the literature, that the role of the OAA Facilitator in the CommandTalk system is simply one of matching up service requests with service
providers. Blackboard facilities do not appear to be used and certainly no control-flow is vested
in the Facilitator (as it is in Darpa Hub scripting). Consequently the flow of control is indeed
distributed throughout the agents in the system. The order in which agents are called will depend on information that is spread throughout the community of agents. One possible set-up
would be for the speech recognition agent to finish its processing by sending a request for language understanding on its output (but not to wait for a reply). The language understanding
agent could similarly call dialogue management after it had finished processing. Code
within the dialogue manager might invoke several services in different agents (e.g. consulting
the simulator database, updating the recognition grammar) before calling generation and synthesis.
CommandTalk appears to include some parallel processing. At any time, the simulator can
notify the dialogue manager of the appearance of a new object that can be referred to (as also
discussed above, under Mitre’s version of CommandTalk, section 3.1). This update can occur at
any time, so long as the Dialogue Manager is not actually busy doing something else (in which
case, messages are presumably queued).
The main source of context sensitivity in a CommandTalk component other than dialogue management is in speech recognition. The Dialogue Manager may send messages to the recognizer
reconfiguring it dynamically according to what it believes the current state is. Again, these messages may happen at any time so long as the recognizer is not actually busy at the time. It appears
that natural language parsing and interpretation is not context sensitive. Certainly, anaphoric and
ellipsis resolution are all delayed until dialogue management. It is unclear whether any asynchronous calls are made in the CommandTalk system at all.
4.2 Conclusion
OAA offers a very flexible environment for constructing a dialogue system. In many ways, it
appears that the scriptless version of the Darpa Communicator Architecture, which is a very
recent development, is an attempt to offer the same sort of architecture. The price to be paid for
the flexibility is of course the amount of attention one has to give in ensuring that the community
of agents one defines really does collaborate in order to compute the desired result. For instance,
the possibility of updating the dialogue manager at any time with the identities of new objects is
very attractive but one must ensure that the dialogue manager is not permanently busy on other
tasks in order for this to succeed. The built-in support for asynchronous processing is also very
attractive but the price again is simply the sheer complexity of programming such systems.
Chapter 5
The TrindiKit Architecture
The TrindiKit is a toolkit resulting from the Trindi project for building and experimenting with
dialogue move engines and information states. Utterances in dialogue are understood to be moves
which transform information states. The notion of information state is quite general. In the simplest form, a state might be just the name of a node in a transition network and the moves would
then be the labels on the arcs connecting them. However, states can be much more complex and in
the TrindiKit one can use different sorts of data structure (e.g. typed feature structures, Discourse
Representation Structures, record structures) to represent them. Part of the discipline involved in
using the TrindiKit is simply declaring formally all the aspects of Information State that one will
need to access during a dialogue state. It is not just the states themselves that are more complex.
Transitioning from state to state need not be a case of actually traversing a declared network
structure. Rather, a set of update rules must be provided. Each rule consists of a pre-condition
and an action. The pre-condition is a set of conditions on the Information State and the action is
a set of operations to carry out on the Information State. Again, it is part of TrindiKit discipline
that the conditions and operations are not just arbitrary but must be supported for the datatypes
used in the Information State. As a simple example, if one declares a feature of Information
State called questions under discussion to be a stack, then one cannot invoke a membership
operation on it in the action of an update rule. Of course, one could define a set of update rules
which used pop, and identity to implement membership but these rules would certainly not
correspond to dialogue moves in any intuitive way. It is also possible to define external resources
which can supply additional operations for updating Information States.
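The datatype discipline can be illustrated with a minimal sketch. It is written in Python for illustration and does not reproduce the TrindiKit's own rule syntax; the `Stack`, `InformationState` and `UpdateRule` classes and the rule name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class Stack:
    """Stack datatype: only push, pop, top and is_empty (no membership test)."""

    def __init__(self) -> None:
        self._items: List[str] = []

    def push(self, item: str) -> None:
        self._items.append(item)

    def pop(self) -> str:
        return self._items.pop()

    def top(self) -> str:
        return self._items[-1]

    def is_empty(self) -> bool:
        return not self._items


@dataclass
class InformationState:
    qud: Stack = field(default_factory=Stack)      # questions under discussion
    beliefs: set = field(default_factory=set)


@dataclass
class UpdateRule:
    name: str
    precondition: Callable[[InformationState], bool]
    action: Callable[[InformationState], None]

    def apply(self, state: InformationState) -> bool:
        """Run the action only if the precondition holds on the state."""
        if self.precondition(state):
            self.action(state)
            return True
        return False


resolve_top = UpdateRule(
    name="resolveTopQuestion",                     # hypothetical rule name
    precondition=lambda s: not s.qud.is_empty(),
    action=lambda s: s.beliefs.add(("resolved", s.qud.pop())),
)
state = InformationState()
state.qud.push("destination?")
applied = resolve_top.apply(state)
```

Because `Stack` offers no membership operation, a rule that wrote `"destination?" in state.qud` would fail with a `TypeError`, mirroring the restriction described above.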
There is no requirement that only one rule ever be applicable in any state, consequently one also
needs a strategy for choosing between them or for applying multiple rules. In fact, TrindiKit
permits one to define an update algorithm in a specially constructed language, the DME Algorithm Definition Language (DME-ADL). The control structures of the language include a variety
of constructs (including if-then, while, try etc.) over primitives which are rule-names or
rule-class-names. The interpretation of a rule-class-name is that the first applicable rule of that
class should be executed. TrindiKit comes with a default set-up in which all rules are in one class
(universal) and the Update Algorithm is the one-line program containing the rule-class-name
universal.
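The interpretation of a rule-class-name can be sketched as follows. The sketch models only a program that is a sequence of class names; the if-then, while and try constructs of DME-ADL are not modelled, and `run_class`, `run_algorithm` and the example rules are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Rule = Tuple[str, Callable[[State], bool], Callable[[State], None]]


def run_class(rules: List[Rule], state: State) -> str:
    """Execute the first applicable rule of a class; return its name, or ''."""
    for name, precondition, action in rules:
        if precondition(state):
            action(state)
            return name
    return ""


def run_algorithm(program: List[str], classes: Dict[str, List[Rule]],
                  state: State) -> List[str]:
    # In this sketch a program is just a sequence of rule-class names.
    return [run_class(classes[class_name], state) for class_name in program]


classes = {
    "universal": [
        ("greet", lambda s: not s["greeted"], lambda s: s.update(greeted=True)),
        ("ask", lambda s: bool(s["greeted"]), lambda s: s.update(asked=True)),
    ]
}
state = {"greeted": False, "asked": False}
trace = run_algorithm(["universal"], classes, state)
```

With the one-line program `["universal"]`, exactly one rule fires per run, even though a second rule becomes applicable as a result of the first.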
A dialogue move engine can also have several different update algorithms though only the possibility of two is ever discussed. One algorithm is for interpretation, somewhat confusingly
sometimes also called update; the other is for move selection. The intention is that one set of
rules interprets the last move made and another chooses the next move to make. Although the
same update rule formalism is to be used for both types (or all types, if more types are required),
different algorithms can be applied to them.
The TrindiKit architecture, which encompasses a complete dialogue system and not just the Dialogue Move Engine, is shown in Figure 5.1.
The DME heart of the system is contained within two boxes: ‘Dialogue Move Engine’ and
the largest box labelled ‘TIS’ which represents the Information State. ‘TIS’ stands for Total
Information state, and is the sum of the Information State (as discussed above), links to any
external resources and links between the Dialogue Move Engine and other TrindiKit components.
The other components shown in the diagram are input, interpretation, generation,
output and control. In a spoken language dialogue system, input and output might
naturally correspond to recognition and synthesis.
The TrindiKit comes with a default control algorithm which, unsurprisingly, implements a repeating loop over the sequence: input, interpret, update, select, generate, output. In fact, if the
system ought to make the first move, then one needs to supply a different control algorithm
(namely, one which starts with the generation half of the cycle). However, in general, one can
write one’s own control algorithm and it is a design feature that one can include test conditions
against the Information State, since these are exported to the control module. The current version
of the TrindiKit does not include asynchronous control possibilities.
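A control algorithm of this kind can be sketched as a loop over named modules. The sketch is illustrative only: the `control` function, the dictionary-based state and the `'quit'` value of `program_state` are assumptions, and the cycle is passed as a parameter so that a different ordering can be supplied.

```python
from typing import Callable, Dict, List, Sequence

DEFAULT_CYCLE = ("input", "interpret", "update", "select", "generate", "output")


def control(modules: Dict[str, Callable[[dict], None]],
            state: dict,
            cycle: Sequence[str] = DEFAULT_CYCLE) -> None:
    """Repeat the cycle until state['program_state'] becomes 'quit'."""
    while state.get("program_state") != "quit":
        for name in cycle:
            modules[name](state)
            # Test condition against the shared state after every module call.
            if state.get("program_state") == "quit":
                break


calls: List[str] = []


def make_module(name: str) -> Callable[[dict], None]:
    def module(state: dict) -> None:
        calls.append(name)
        if name == "output":          # stop after one full cycle in this demo
            state["program_state"] = "quit"
    return module


modules = {name: make_module(name) for name in DEFAULT_CYCLE}
control(modules, {"program_state": "run"})
```

The loop is strictly sequential, consistent with the absence of asynchronous control noted above.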
5.1 Instance 1: GoDiS - Gothenburg Dialogue System
GoDiS is an experimental dialogue system designed in particular for experimenting with notions
of ‘question under discussion’ [7] and ‘accommodation’ [13] in dialogue management. Information states are represented as record structures and, in particular, model the private beliefs and
goals of an agent and those parts which are shared. The intuition is that a dialogue progresses as
information becomes shared. An agent also has a private plan which is more of a long term set
of goals which can also be used for accommodating information (for example, behaving, on the
basis of something in the plan, as if a question has been explicitly asked even though it has not).
[Figure 5.1: The TRINDIKIT architecture. The diagram shows the Control module, the Dialogue
Move Engine (DME) and the Input, Interpretation, Generation and Output modules arranged around
the TIS. The TIS contains the IS (of the declared Information State Type), the interface
variables input, latest_moves, latest_speaker, next_moves, output and program_state, an
Information State Interface, and a Resource Interface to resources such as a database, a
dialogue grammar and a plan library. The legend distinguishes optional from obligatory
components.]
The GoDiS dialogue move engine consists of two main modules update and select, which
respectively update the Information State on the basis of the last user move and select the next
system move. There is a set of selection rules each of which consists of a condition on the
Information State and an action which simply records the decision of the dialogue manager to
make a particular move. The selection algorithm chooses the first applicable selection rule it can
find. (That is, as explained above, it consists just of the name of the class of all selection rules).
The update rules divide into 6 classes: grounding, integration, accommodation, agenda refill,
database enquiry and store. These classes are intended to represent a natural classification of
agent actions whilst interpreting input utterances. Grounding refers to the process of moving information from the private sectors of the Information State to the shared sectors. Accommodation
adds information from the private plan to the shared questions under discussion, given the stimulus of a particular input content. Integration adds new content to Information State. Agenda refill
transfers information from the private plan to the private short term agenda and corresponds to
an agent setting up a short term conversational goal from his long-range overall plan. Database
enquiry refers to looking up an external resource. Store refers to saving the current shared Information State - this is in case a problem arises later in which case the ‘old’ shared Information
State needs to be restored. Finally, there is an algorithm for exploiting these various rule classes.
Simplifying slightly, new inputs should be grounded and then integrated. If integration fails, one
should accommodate and then try integrating again. Repeat until integration succeeds. Then, if
the input was the user’s, one should refill the agenda and attempt database lookup. Otherwise
one simply stores the current Information State.
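The update algorithm just outlined can be sketched as follows. This is a reading of the prose description, not the GoDiS source: each rule class is modelled as a function returning whether some rule applied, and the bound on accommodation rounds and the early return when integration never succeeds are added assumptions.

```python
from typing import Callable, Dict, List


def godis_update(state: dict,
                 rule_classes: Dict[str, Callable[[dict], bool]],
                 max_rounds: int = 10) -> List[str]:
    """Apply the rule classes in the order described; return the classes run."""
    trace: List[str] = []

    def run(name: str) -> bool:
        trace.append(name)
        return rule_classes[name](state)

    run("grounding")
    integrated = run("integration")
    rounds = 0
    while not integrated and rounds < max_rounds:
        rounds += 1
        if not run("accommodation"):
            break                          # nothing left to accommodate
        integrated = run("integration")
    if not integrated:
        return trace                       # assumption: give up on this input
    if state["latest_speaker"] == "user":
        run("agenda_refill")
        run("database_enquiry")
    else:
        run("store")
    return trace


def accommodate(state: dict) -> bool:
    state["accommodated"] = True
    return True


rule_classes = {  # hypothetical stand-ins for the six GoDiS rule classes
    "grounding": lambda s: True,
    "integration": lambda s: bool(s["accommodated"]),
    "accommodation": accommodate,
    "agenda_refill": lambda s: True,
    "database_enquiry": lambda s: True,
    "store": lambda s: True,
}
state = {"latest_speaker": "user", "accommodated": False}
trace = godis_update(state, rule_classes)
```

In this run integration fails once, accommodation supplies the missing question, and the second integration attempt succeeds.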
Architecturally, GoDiS uses the default control algorithm, namely the standard sequencing of
select, generate, update, input, interpret, update (that is, the standard sequence when the system
makes the first move in a dialogue). Update occurs twice, once as a result of generating a move
and once as a result of interpreting a move. Although the same algorithm is executed twice,
the update rules themselves always distinguish whether the latest move was actually made by
the system or the user. No module apart from the Dialogue Move Engine modules accesses the
Information State.
5.1.1 Discussion
The main focus of our interest in dialogue system architectures is the architecture and dataflows
between components. From this perspective, GoDiS is less interesting because it does not exploit
to any great degree the possibilities provided for in the TrindiKit. The default control algorithm
resembles the standard control flow in a pipeline albeit with the additional benefit that each component updates the global Information State in turn rather than passing on information to each
other directly. That is, the pipeline is not a dataflow pipeline but merely an order of calling components. Given the focus of GoDiS on certain particular theoretical issues, it is perhaps not too
surprising that the components other than the dialogue manager itself are treated in a somewhat
cursory manner. The input module accepts typed text. The interpretation module appears to be
a simple phrase spotter. Neither of these modules appears to access the global Information State
other than to read their input from the right module interface variable. The interpretation module, for example, reads the input word string from the interface variable input which the input
module placed there. It writes its output into another interface variable called latestmove
which the Dialogue Manager reads from.
The role of the interface variables does however raise some more substantive issues in the design of the TrindiKit architecture. Although interface variables are part of the ‘Total Information
State’, they are not part of the Information State proper. Consider, for instance, the interface
variable nextmove which is written by the dialogue move engine (in fact, the selection algorithm within the dialogue move engine) and read by the generator. As noted above, the output
of the selection algorithm is the choice of the best move to make next. This move is recorded
in the nextmove interface variable. One might ask why, in a system that is designed to model
beliefs and goals, it is not recorded in a part of the Information State representing Intentions. Of
course, one could record it there also and thereby make it available to other components. But
why then should the generator access only the interface variable and not the Intention itself?
The latter approach appears to be the more general solution. The only reason for the interface
variable appears to be to facilitate a ‘plug and play’ architecture in which it is a simple matter
to plug in different components and test the resulting configuration. Having to code the name of
a particular location in one’s Information State into the generator would detract seriously from
re-configurability. Indeed, a minor change in one’s Information State could stop the generator
from working at all. Ideally, of course, one would also like to experiment with plugging different
dialogue managers into an overall dialogue system. Again, the interface variables make this at
least a conceptual possibility. If different dialogue managers all write to the nextmove interface variable, then any generator that reads it can work with all of them. Nevertheless, the status
of interface variables as outside the Information State proper looks an unhappy one. Presumably,
what is really required is a means of mapping interface variables directly onto parts of the Information State. That is, the interface between a dialogue manager and a component should be
mediated by an indirection.
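One way to picture such an indirection is a small mapping from interface-variable names to paths inside the Information State. This is a speculative sketch of the idea, not a TrindiKit feature: the `InterfaceMap` class and the `private/intentions/next` path are invented for illustration.

```python
from typing import Dict, List, Tuple


class InterfaceMap:
    """Resolve interface-variable names to paths inside the Information State."""

    def __init__(self, state: dict, mapping: Dict[str, List[str]]) -> None:
        self._state = state
        self._mapping = mapping

    def _locate(self, variable: str) -> Tuple[dict, str]:
        # Walk to the record that holds the final path component.
        *path, leaf = self._mapping[variable]
        node = self._state
        for key in path:
            node = node[key]
        return node, leaf

    def read(self, variable: str):
        node, leaf = self._locate(variable)
        return node[leaf]

    def write(self, variable: str, value) -> None:
        node, leaf = self._locate(variable)
        node[leaf] = value


state = {"private": {"intentions": {"next": None}}}
interface = InterfaceMap(state, {"nextmove": ["private", "intentions", "next"]})
interface.write("nextmove", "ask(destination)")
```

A generator written against `nextmove` keeps working if the Information State is reorganised, since only the mapping needs to change.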
The interface variables in GoDiS are essentially a device for creating a data pipeline in the
TrindiKit, albeit one that is mediated by the Information State. One corollary of this is that
the selection algorithm, which is responsible for choosing the next move to make, can be viewed
as selecting and executing an update rule whose action updates the Information State. Selection
therefore follows the general pattern outlined for Information State updates. This attractive corollary is however liable to over-interpretation. The part of the Information State that is updated is
of course the interface variable. Since, in GoDiS, interface variables are actually designed purely
for data pipelining, it is evident that the use of Information State update here is really more of an
implementational detail than an instantiation of a theoretical claim. That is, for the purposes of
GoDiS one might just as well have pipelined the outcome of selection directly to generation. Although this is in itself a comparatively trivial matter, it is important that there is really no reason
why selection should take the form of an algorithm operating over update rules. Selection could
take any form so long as its input source is the Information State and its output is a move.
The point of course threatens also to extend to the role of update rules in Interpretation. Why
should Dialogue Interpretation itself consist of an algorithm over update rules and not just some
algorithm whose input is the Information State and whose output is an updated Information State?
Of course, by examining the GoDiS update algorithm, one realizes that it embodies theoretical
claims. That is, its division into grounding, accommodation and so forth is theoretically significant. However, it is perhaps unclear whether any particular significance attaches to individual
rules or the rule format.
5.2 Instance 2: Conversational Game Player
The SRI Autoroute demonstrator is an instance of the TrindiKit architecture. Like GoDiS, it is
designed with the aim of exploring certain theoretical issues in dialogue management. The issue
in question is formalizing and using Conversational Game Theory as a basis for dialogue management. Also like GoDiS, the treatment of the other system components (input, interpretation,
generation and output) is somewhat rudimentary.
In Conversational Game Theory, rational agents have beliefs, goals and a set of operators for
undertaking actions in the world. Included in these actions is the playing of conversational
games - these being joint actions between dialogue participants. Knowledge of the structure of
conversational games is shared between dialogue participants so that, once one partner realizes
the other partner has started a game, the other partner both knows how to continue it and cannot
avoid doing so on pain of being deemed uncooperative. In the SRI demonstrator system, game
knowledge is encoded in simple recursive transition networks which specify not only the valid
sequences of dialogue moves but also their meaning. The meaning of each move is specified
as a context update function on propositions under discussion. That is, the role of each move
in a game is to modify a set of propositions. The meaning of a game is to update a context of
propositions that have been agreed.
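As a rough illustration of how a game definition can fix both the valid move sequences and their meaning, the following Python sketch encodes one hypothetical game as a flat transition network (without recursion into sub-games) whose arcs carry context update functions. The move names, the GAME table and the functions play and commit are assumptions made for this example; they are not taken from the SRI demonstrator.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Props = FrozenSet[str]                    # propositions under discussion
Update = Callable[[Props, str], Props]    # the meaning of a single move


def add_prop(props: Props, content: str) -> Props:
    """Put a proposition under discussion."""
    return props | {content}


def drop_prop(props: Props, content: str) -> Props:
    """Withdraw a proposition from discussion."""
    return props - {content}


def keep(props: Props, content: str) -> Props:
    """Leave the propositions under discussion unchanged."""
    return props


# Hypothetical game: state -> {move: (next state, update function)}.
GAME: Dict[str, Dict[str, Tuple[str, Update]]] = {
    "start": {"propose": ("proposed", add_prop)},
    "proposed": {"accept": ("done", keep), "reject": ("done", drop_prop)},
}
FINAL = "done"


def play(moves: Iterable[Tuple[str, str]]) -> Tuple[str, Props]:
    """Follow the network; the speaker of each move is never consulted."""
    state: str = "start"
    props: Props = frozenset()
    for move, content in moves:
        arcs = GAME.get(state, {})
        if move not in arcs:
            raise ValueError(f"move {move!r} is not valid in state {state!r}")
        state, update = arcs[move]
        props = update(props, content)
    return state, props


def commit(agreed: Props, state: str, props: Props) -> Props:
    """A completed game adds its propositions to the agreed context."""
    return agreed | props if state == FINAL else agreed
```

Because play never inspects who made a move, the same definition serves both speaker and hearer, which is the property described in the next paragraph.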
One of the features of Conversational Game Theory is that the game and move definitions are independent of who happens to be the speaker and who happens to be the hearer. That is, the update
effect of an utterance should not differ according to whether one is a speaker or hearer. Of course,
whether and how one chooses to make a certain move or not will depend on one’s mental state
and this may differ from speaker to hearer but the effect of the act itself is invariant. The control
strategy for the SRI Autoroute demonstrator reflects this perspective: it consists of a repeating
loop over two constructs, update and generate-or-acquire. The update construct updates the Information State with the latest input, regardless of who generated it. The generate-or-acquire construct
tests to see whose turn it is in the dialogue and then either calls input followed by interpret
or calls select followed by generate and then output.
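The control strategy just described can be sketched as follows. This is a minimal Python illustration under simplifying assumptions: the Information State is reduced to three fields, interpretation and selection are trivial stand-ins, and turns simply alternate, whereas in the demonstrator the turn is determined by the game definitions. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InformationState:
    turn: str = "system"                 # whose turn it is to move next
    latest_move: Optional[str] = None    # move not yet integrated
    history: List[str] = field(default_factory=list)


def update(state: InformationState) -> None:
    """Integrate the latest move, regardless of who produced it."""
    if state.latest_move is not None:
        state.history.append(state.latest_move)
        state.latest_move = None
        # Assumption: strict alternation stands in for the game definitions.
        state.turn = "user" if state.turn == "system" else "system"


def generate_or_acquire(state: InformationState,
                        user_inputs: List[str],
                        outputs: List[str]) -> None:
    """Test whose turn it is, then either acquire or generate a move."""
    if state.turn == "user":
        text = user_inputs.pop(0)                  # input
        state.latest_move = f"user:{text}"         # interpret (trivial)
    else:
        move = "ask_destination" if not state.history else "acknowledge"  # select
        outputs.append(move)                       # generate and output
        state.latest_move = f"system:{move}"


def control(state: InformationState, user_inputs: List[str], steps: int) -> List[str]:
    """The repeating loop over the two constructs."""
    outputs: List[str] = []
    for _ in range(steps):
        update(state)
        generate_or_acquire(state, user_inputs, outputs)
    return outputs
```

Note that the system's own move is written to latest_move and integrated by the same update call that integrates user moves on the next pass through the loop.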
Update itself (and indeed Selection) is realized by a set of update rules in TrindiKit format which,
together with the update (or selection) algorithm, encode a general agenda-based search strategy.
The search is over a set of possible paths through the recursive transition networks. When a new
input (user or system) occurs in a dialogue, there may be several possible interpretations of what
has happened so far in the dialogue. Each possible path is generated and stored in an agenda.
The Update algorithm has two parts: first, repeatedly finding the first applicable rule and then
executing it until no more rules are applicable; second, sorting the agenda according to a simple
utility-based preference metric. The top-ranking hypothesis in the agenda represents the state
that the agent believes is the correct one.
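The two-part Update algorithm can be sketched in Python as below. This is an illustration under stated assumptions rather than the demonstrator's code: the two rules, the READINGS table of candidate interpretations with utility scores, and the FOLLOWS table of sequencing constraints are all invented for the example, and the transition networks are collapsed into that single constraint table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Hypothesis:
    path: Tuple[str, ...]    # one possible analysis of the dialogue so far
    utility: float


@dataclass
class State:
    pending: List[str]                              # inputs not yet integrated
    agenda: List[Hypothesis] = field(default_factory=list)


# Hypothetical data: candidate readings of each input, with utility scores.
READINGS: Dict[str, List[Tuple[str, float]]] = {
    "sys:where_to": [("ask", 1.0), ("unintelligible", 0.1)],
    "usr:pardon": [("pardon", 0.5)],
    "usr:london": [("answer", 0.5)],
}
# Hypothetical constraint: which act each act may legitimately follow.
FOLLOWS: Dict[str, Set[str]] = {"pardon": {"unintelligible"}, "answer": {"ask"}}


def can_follow(path: Tuple[str, ...], act: str) -> bool:
    allowed = FOLLOWS.get(act)
    return allowed is None or (len(path) > 0 and path[-1] in allowed)


def seed(state: State) -> None:
    state.agenda.append(Hypothesis((), 0.0))


def extend(state: State) -> None:
    """Extend every stored path with every compatible reading of the next input."""
    move = state.pending.pop(0)
    state.agenda = [Hypothesis(h.path + (act,), h.utility + score)
                    for h in state.agenda
                    for act, score in READINGS[move]
                    if can_follow(h.path, act)]


Rule = Tuple[str, Callable[[State], bool], Callable[[State], None]]
RULES: List[Rule] = [
    ("seed", lambda s: bool(s.pending) and not s.agenda, seed),
    ("extend", lambda s: bool(s.pending), extend),
]


def update(state: State) -> List[Hypothesis]:
    # Part 1: repeatedly execute the first applicable rule until none applies.
    while True:
        rule: Optional[Rule] = next((r for r in RULES if r[1](state)), None)
        if rule is None:
            break
        rule[2](state)
    # Part 2: sort the agenda by the preference metric, best first.
    state.agenda.sort(key=lambda h: h.utility, reverse=True)
    return state.agenda
```

After the system's own utterance both readings are kept on the agenda, with the question reading ranked first. If the next input is 'pardon', only the low-ranked reading can be extended, so it becomes the top hypothesis without any recomputation of earlier analyses.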
5.2.1 Discussion
The control algorithm used for the CGT demonstrator within TrindiKit is interestingly different
from that employed in GoDiS. The control algorithm is to a certain extent context sensitive. If a
new move is required, the Control algorithm consults the Information State in order to see whose
turn it is to make the next move. Once the move is made, the update algorithm is called to
integrate its effects, regardless of who made the move. In GoDiS, the update rules for system
moves differ greatly from those for user moves.
A number of points can be made about this simple difference. First, the system analyses its own
utterances after they have been made, just as it analyses the user’s. One advantage of this procedure is that it is straightforward for the system to recognize possible alternative interpretations
of its own utterances. This can be useful for detecting when dialogue has not gone according to
plan. The simplest case is the user simply not understanding the system and saying ‘pardon’. At
this point, the system effectively has to backtrack - it thought it had accomplished a particular
speech act (say, asked a question) but in fact it had failed to do that. In the CGT demonstrator,
both interpretations of the system’s original utterance (‘I asked a question’, ‘I said something
unintelligible’) are generated and maintained in the agenda. If ‘pardon’ is the next move, then
only the second analysis can be extended to incorporate it. A pardon move may only legitimately
follow something which was unintelligible. In GoDiS, the state that existed before the question
was asked is stored in a special tmp field and is simply restored when ‘pardon’ is encountered.
However, this sort of treatment will not extend to cases, probably rare in Human-Computer dialogues, in which the second utterance does not simply cancel the first but coerces a particular
interpretation onto it. Witness the child who responds with ‘I don’t think so, mummy’ to his
mother’s ‘You’re going to school tomorrow’. Of course, one could try adding another sort of
update rule to GoDiS to be used when the usual interpretation procedures failed (just as accommodation is invoked when ordinary integration fails) but it is not clear how such a rule would
work.
The second point is that the CGT demonstrator does not actually backtrack. In fact, it generates
all analyses in advance and merely selects an alternative that was already known to it if the current
favourite analysis cannot be maintained. This is probably neither psychologically realistic nor
even a very good engineering solution. If you both intended to ask a question and believe that
you have, there is little point in calculating that you might be held to have done something else
until a problem actually arises that causes you to re-assess your own actions.
The third point is that, even though the control algorithm is context sensitive in a limited fashion,
the system as a whole still enforces a fairly strict turn-taking model. The control algorithm uses
the Information State simply to see whose turn it is to make a move and this is determined by
the conversational game definitions themselves and what the system currently believes the game
state is. In fact, since the system is maintaining alternative interpretations of dialogue state, it
is conceivable that the system and user both believe that the dialogue is in a state in which it is
their turn to speak next. This would give rise to both attempting to take the floor and require
a mechanism for resolving the conflict. The CGT game definitions are constructed so that this
cannot occur. In any case, if the system believes it has the turn, the user does not even have an
opportunity to try to take the floor since he will not be suitably prompted for input.
The interpretation module for the CGT system is also mildly context sensitive. The interpreter is
essentially a simple phrase spotter which uses knowledge of the last question asked in order to
interpret fragments. For example, the meaning of ‘London’ is taken to be ‘that the destination is
London’ given a context in which the last question asked was ‘Where do you want to go?’
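A phrase spotter of this kind might look roughly like the following Python sketch. The city list, the mapping from questions to slots and the two patterns are assumptions chosen for illustration; the second question string in particular is invented.

```python
import re
from typing import Dict, Optional

# Hypothetical lexicon and question-to-slot mapping.
CITIES = {"london", "cambridge", "leeds"}
QUESTION_SLOT = {
    "Where do you want to go?": "destination",
    "Where are you starting from?": "origin",
}


def interpret(utterance: str, last_question: Optional[str]) -> Dict[str, str]:
    """Spot explicit phrases first; fall back on the last question for fragments."""
    text = utterance.lower()
    result: Dict[str, str] = {}

    # Explicit phrases carry their own slot, whatever the context.
    for slot, pattern in (("destination", r"\bto (\w+)"), ("origin", r"\bfrom (\w+)")):
        match = re.search(pattern, text)
        if match and match.group(1) in CITIES:
            result[slot] = match.group(1)

    # A bare city name is a fragment: its meaning depends on the last question.
    if not result and last_question is not None:
        slot_for_question = QUESTION_SLOT.get(last_question)
        cities = [w for w in re.findall(r"\w+", text) if w in CITIES]
        if slot_for_question is not None and len(cities) == 1:
            result[slot_for_question] = cities[0]
    return result
```

With no question in context the fragment 'London' yields no interpretation at all, which is the sense in which the interpreter is context sensitive.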
5.3 Conclusion
The TrindiKit architecture is designed to exploit an Information State Update approach to dialogue. The heart of the system is the Dialogue Move Engine which accesses and updates a
well-defined Information State, a data object that comes with pre-defined mechanisms for testing values of its component parts and updating them. The Information State is available to all
other components in a dialogue system, including the control module. The internal structure of
the Dialogue Move Engine is also constrained in that it is intended to consist of a number of
modules each of which invokes a user-specifiable algorithm over a number of update rules of a
certain pre-defined format. (A default algorithm is also available). Each rule must consist of a
set of pre-conditions on the Information State and a set of actions which are operations on the
Information State.
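The rule format described above (a set of pre-conditions on the Information State plus a set of actions on it) can be pictured with the Python sketch below. It is not TrindiKit syntax: the UpdateRule class, the dictionary-based Information State, the example rule and the run_first_applicable algorithm are hypothetical, and the algorithm shown is only one possible choice, not necessarily the TrindiKit default.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

InfoState = Dict[str, Any]


@dataclass
class UpdateRule:
    name: str
    preconditions: List[Callable[[InfoState], bool]]
    actions: List[Callable[[InfoState], None]]

    def applicable(self, state: InfoState) -> bool:
        return all(test(state) for test in self.preconditions)

    def apply(self, state: InfoState) -> None:
        for action in self.actions:
            action(state)


def latest_is_answer(state: InfoState) -> bool:
    move = state["latest_move"]
    return move is not None and move[0] == "answer"


def question_is_open(state: InfoState) -> bool:
    return len(state["qud"]) > 0


def record_answer(state: InfoState) -> None:
    state["shared"].append(state["latest_move"][1])


def close_question(state: InfoState) -> None:
    state["qud"].pop(0)


def clear_latest(state: InfoState) -> None:
    state["latest_move"] = None


# Hypothetical rule: integrate an answer to the question under discussion.
INTEGRATE_ANSWER = UpdateRule(
    name="integrate_answer",
    preconditions=[latest_is_answer, question_is_open],
    actions=[record_answer, close_question, clear_latest],
)


def run_first_applicable(rules: List[UpdateRule], state: InfoState) -> List[str]:
    """One possible algorithm: fire the first applicable rule until none applies."""
    fired: List[str] = []
    while True:
        rule: Optional[UpdateRule] = next(
            (r for r in rules if r.applicable(state)), None)
        if rule is None:
            return fired
        rule.apply(state)
        fired.append(rule.name)
```

The algorithm is a parameter of the module rather than part of the rule format, which matches the description of a user-specifiable algorithm over rules of a fixed format.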
Instantiations of the TrindiKit have so far focussed mainly on particular theoretical objectives
within Dialogue management and have not paid particular attention to links between Dialogue
management and other components. Generally, the standard data pipeline has been implemented.
SRI’s CGT demonstrator system employs a control algorithm that is sensitive to Information
State but in general one might expect this facility to be of limited value, just as Hub scripting
is in the Darpa Communicator Architecture. In most TrindiKit systems, components other than
Dialogue management do not access the Information State.
Chapter 6
The Verbmobil Communications Architecture
The system architectures and underlying toolkits considered so far have been quite similar in
their aims, scale and range of components. Before we proceed to consider an architecture for
the Siridus demonstrator and toolkit, it may be useful to look at an environment that, while also
addressing the processing of spoken language dialogues, has quite distinct goals and correspondingly different challenges.
The Verbmobil system differs from a typical dialogue system in a number of ways. The most
evident of these is that it does not perform dialogue management in the accepted sense, because
it processes dialogues between humans and is only concerned with providing a translation of the
individual dialogue turns. This entails the maintenance of a dialogue model, but it is used for
tracking the flow of the dialogue and supporting translation decisions, as the system never has to
take the initiative.
For our current purposes the more interesting aspects of Verbmobil architecture are not so much
the modules that are required for the dialogue translation task, as the way that the system
architecture copes with three crucial challenges:
The scale of the system and its coverage.
The interchangeability of modules.
The time constraints implied by processing spoken dialogues in quasi real time.
Since Siridus’ prime concerns consist of the scalability and reconfigurability of spoken dialogue
systems it is useful to have as a yardstick a system which achieved considerable complexity, scale
and numbers of configuration options within approximately the same domain.
The final Verbmobil system comprised 69 modules whose communication requirements would
probably have defeated any architecture based on direct communication between individual modules, e.g. via communications channels, since 2,380 point-to-point communications channels
would have been required. An alternative communications concept was developed based on a
notion of local blackboards [10, 3], reducing the number of connections to be defined by a factor
of about ten: in fact, 224 pools were defined.
The multi-blackboard architecture also facilitates the use of interchangeable modules and indeed
of modules that operate in parallel, since an individual functionality can be defined as connecting
two (or more) pools. Multiple modules supplying this functionality can be employed in parallel
or in alternation. To some extent the availability of multiple modules is a luxury derived from
the scale of the project, but some aspects of the architectural requirements are paralleled in
the design of a toolkit if the notion of plug and play with multiple off-the-shelf components is
adopted.
The processing strategy in Verbmobil is essentially ‘pipelining’ in the technical sense referred to
in 2.3, since the sequence of processes for any given input is sequential, but the modules operate
asynchronously, because the basic increment is taken at a lower level than the turn. The spoken
input is segmented according to various criteria, including the length of pauses, prosodic cues
and syntactic structure, so that different modules may be working on different segments at the
same time. The purpose of this incrementality is to keep the end-to-end processing requirements
within a small multiple of real time.
We have outlined the main properties of the Verbmobil communications architecture as a source
of comparisons, but what can be directly learnt for the Siridus architecture where the initial
scale and resources are much more limited, but where the ultimate processing task and level of
intended incrementality are more demanding?
6.1 Diverse Lessons from Direct Experience with a Pool Communications Architecture.
There are a number of practical lessons that can be drawn from the Verbmobil experience with
communication via a number of data pools. Here it is helpful to maintain a contrast between the
design of the communications architecture and the way it was actually used. The development
method adopted in Verbmobil involved maintaining a running system throughout virtually the
whole development period. The persistent performance pressure, combined with the fact that
only one system was ever intended, meant that operations were encoded that went to the limits
of the intended architecture. So the positive properties of the communications architecture can
be contrasted with anecdotal evidence of some slightly perverse mechanisms that were coded
within it. Conversely, options that the architecture supports but were deliberately avoided in the
implementation may indicate pitfalls to be avoided. Verbmobil is not only a large system but a
very complex one, supporting numerous configuration options1. To some extent it is therefore
also like a toolkit, since multiple solutions to the same problem may be encoded within it. Another aspect of its complexity is the overhead of maintaining consistency which may or may not
be a requirement in a smaller system.
In summary we should take account of the benefits of the overall framework, but also, in both
a positive and a negative sense, the cases that take this approach to its limits. We should also
bear in mind that some of the mechanisms required here will not be essential in a more modest
system, but there again may be desirable for the sake of scalability.
6.1.1 Specifying Functionalities
The fact that interface specifications are conceived of in terms of abstract functionalities, rather
than the requirements of specific module instances, has positive consequences both for the interchangeability of modules and for the management of communications. The processing modules
read from and write to data pools, so they are not required to know which modules supply or consume their data and are, hence, impervious to non-local changes in configuration. The
actual message passing is performed by a PCA (Pool Communication Architecture) package.
This simply has to know which module is currently registered for the functionality associated as
either consumer or producer with a given data pool. Where modules are exchanged it is only
the instantiation of the functionality that is affected. Similarly, competing modules for the same
functionality may be employed, or one module may fulfill several functionalities, e.g. the same
service for several languages or language pairs. There may also be sequences of several functionalities that, when composed, are equivalent to one functionality also defined in the system.
The set of functionalities defined is therefore the basic level of the architecture specification.
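The role of the table of functionalities can be sketched in Python as follows. The functionality name is taken from the examples later in this section; everything else (the ModuleManager class, its methods and the toy modules used to exercise it) is a hypothetical stand-in and does not reproduce the actual PCA interface.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Module = Callable[[str], str]


class ModuleManager:
    """Keeps the table of which modules currently fulfill each functionality."""

    def __init__(self) -> None:
        self._table: DefaultDict[str, List[Module]] = defaultdict(list)

    def register(self, functionality: str, module: Module) -> None:
        """Add a (possibly competing) module for a functionality."""
        self._table[functionality].append(module)

    def exchange(self, functionality: str, module: Module) -> None:
        """Replace whatever currently instantiates a functionality."""
        self._table[functionality] = [module]

    def providers(self, functionality: str) -> List[Module]:
        return list(self._table.get(functionality, []))


def run(manager: ModuleManager, functionality: str, data: str) -> List[str]:
    """Callers name a functionality, never a particular module instance."""
    return [module(data) for module in manager.providers(functionality)]
```

Exchanging a module changes only one row of the table; code that calls run with the functionality name is unaffected, which is the interchangeability property discussed above.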
The current state of the module configuration is maintained by a module manager which essentially provides the PCA with the table of which processes currently fulfill each functionality. It
should be noted that this design does not assume that all data pools associated with a given functionality will necessarily be in use. Hence, it is conceivable that modules for a given functionality
may be interchangeable, even though they have somewhat different data requirements, though it
will be assumed that the main input and output specifications will be maintained. Equivalently
the reconfiguration of a local module may mean that existing functionalities are not actually
called on in a given processing session.
1 At the last count there were 196 individual module configuration options in a given configuration, but not all are
independent of each other.
Examples using Functionality Definitions
The standard examples for the use of functionality definitions are the interchangeable modules
for speech recognition in German and optional language identification. The main examples of
modules working in parallel are more complex, because they also involve the concatenation of
functionalities and, implicitly, result selection. In addition, the definition of multiple functionalities can obviate an absolute choice for the form of recognition output. Local configuration of
individual modules can make so-called subfunctionalities optional.
There are three possible instantiations for the functionality Acoustic.Recognition.Continuous.German.Frequency16kHz. Each maps the recorded microphone signal to an
initial word lattice. The choice of which recogniser module to use is fixed by the arbitration module as part of the system configuration. A further optional functionality can be activated for Acoustic.Recognition.Continuous.Unknown.Frequency16kHz; this
provides language identification.
The functionality Linguistic.Analysis.German is normally instantiated by four different modules in parallel. Three of these also instantiate Linguistic.Transfer.German.English
and Linguistic.Generation.English. That is, the modules for statistical translation, example-based translation and case-based translation implement a complete mapping from
recognition output to synthesis input as well as translation from source to target language. The
fourth instantiation of the analysis functionality is the entry point to a more complex deep linguistic translation process involving, not just the three functionalities mentioned above but several
subfunctionalities and the further subdivision of analysis into Linguistic.Analysis.German.Syntax
and Linguistic.Analysis.German.Semantic.
The convergence point of these various translation paths is the selection module, the only instantiation of Selection.English, but also a module that has to determine when the preceding
tasks are complete and ready for selection and which priorities to apply. The choices can be
affected by local module configuration which then in turn affect the sequence of processes in the
whole system, since the preference for an efficient option will leave more expensive processes
incomplete. Local module configuration can also make subfunctionalities and, even, data pools
redundant. This is particularly noticeable with request and response loops that are implemented,
essentially, as module to module communications within the data pool framework. Transfer in
particular makes use of context-based semantic disambiguation of predicates to support translation choices, but this is comparatively expensive in execution time and can be deselected. Then
not only the subfunctionality but also the relevant data pools are ignored for the remainder of
the session. This can be contrasted with the provision of additional information to a recogniser
by a dialogue manager. Just because the information is there does not mean you have to be able
to use it.
The convergence of distinct translation paths is interesting because it imposes additional control
tasks. Their divergence is interesting because different analysis methods have different data requirements from the recognition results. These vary from the best hypothesis over n-best strings
to the complete word lattice. The choice is familiar, but here it is not actually made. There are
distinct pools for each type of recognition result. The best hypothesis and full lattice are taken
direct from the recognisers. The n-best strings, actually flat lattices, are generated on demand
in the first stage of deep linguistic analysis as input to the most sophisticated parsers, so that all
options are catered for within the same architecture2.
6.1.2 Publishing and Subscribing.
In Verbmobil individual modules have considerable autonomy. It is the modules that declare
which functionalities they fulfill and hence which pools they will use for reading and writing. The
metaphor used for communication requirements is that of publication and subscription. When
data is produced it is published on the relevant pool, at which point the PCA sends a message to
each of the subscribers for that pool. This removes from the module some of the requirement
to poll an input pool for the arrival of new data. This also means that the pool can be seen as an
autonomous agent that dispatches data according to the requirements specified in advance by
the modules, although in practice this is achieved by a central communications manager. This
effect could be replicated either through a few common data pools defined as individual agents
or by a hub module functioning primarily as a communications manager.
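The publish and subscribe pattern can be illustrated with a small Python sketch of a central communications manager. It is deliberately synchronous and in-process, whereas the Verbmobil modules run asynchronously; the PoolManager class, the pool names and the toy modules are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Subscriber = Callable[[Any], None]


class PoolManager:
    """Central manager that notifies subscribers when data reaches a pool."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)

    def subscribe(self, pool: str, callback: Subscriber) -> None:
        """A module declares in advance which pool it reads from."""
        self._subscribers[pool].append(callback)

    def publish(self, pool: str, data: Any) -> None:
        """Push the data to every subscriber; nobody has to poll the pool."""
        for callback in list(self._subscribers.get(pool, [])):
            callback(data)


def build_demo(manager: PoolManager) -> List[str]:
    """Wire a toy analyser between two hypothetical pools and collect its output."""
    collected: List[str] = []
    # Toy analyser: reads recognition output, writes to the analysis pool.
    manager.subscribe("recognition.best",
                      lambda words: manager.publish("analysis", words.upper()))
    # Toy consumer of the analysis pool.
    manager.subscribe("analysis", collected.append)
    return collected
```

The producer of recognition results never refers to the analyser directly: it only names the pool it writes to, and a pool with no subscribers is simply ignored.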
6.1.3 Distributing Control
Despite the overall complexity of the system architecture Verbmobil has no overall control module. The control of the incremental processing is devolved to the individual processing modules
which determine local decisions that, in turn, go to make up the overall processing strategy. This
has a side effect that the proportion of processing to controlling carried out by each module may
vary. In most cases modules carry no controlling responsibilities. They fulfill their designated
function on the basis of simple input and output requirements. Where an interaction based on
request and response is required, the calling module must initiate the request and determine how
long it will wait for a response, but the responding module treats each request as an individual
task.
2 This is generally known as having your cake and eating it.
More complex control problems arise where processing paths merge and decisions have to be
made that are based not on absolute timings, but the relative completion of a task by a set of
modules working in parallel. A consequence of the devolution of control on a demand-driven
basis is that the modules that inherit the largest controlling responsibilities may themselves carry
out relatively limited functionalities. This is not in itself a bad thing, but it does not facilitate the
management of the project resources or the debugging of system performance, since it obscures
the overview of where the most significant events take place. Such instances did occur in the
actual Verbmobil architecture and were essentially accidental. While these accidents were not
necessarily harmful, a more hierarchically oriented control architecture would have precluded the
need for idiosyncratic knowledge. Any assertions as to the benefits for Verbmobil of a more centralised control strategy would be counterfactual, but, as the TrindiKit examples show, a central
controlling instance is usually assumed in dialogue systems.
Problems with Distributed Control
There are two main points where parallel processing paths converge in the standard Verbmobil
architecture. The first, already mentioned, is where the main selection between competing translations occurs. This is complex, but not the most complex instance. Translations only have to
be synthesised when a turn is complete and preferences for more efficient processes with lower
accuracy may mean that it is not necessary to wait for all results to be present. Selection can
be carried out when the preferred translation path delivers a complete result. The other main
selection point occurs in the semantics module which receives results from various linguistic
analysers. Here segment incrementality must be maintained so a selection has to be made as
soon as all parsers have made an adequate contribution. This is somewhat more difficult to judge
as it also takes account of the quality of fragmentary analyses and the segmentation carried out
by the parsers. However, this is a significant control decision because it affects “downstream”
processing for that increment. The weight of this responsibility can be compared with the limited
remit of the local functionality of the semantics module itself which comprises robust semantics
processing, as to be applied in the Siridus repair module, and some linguistic resolution of predicate ambiguities, ellipses and anaphoric bindings.
6.1.4 Interrupting Processing
Another direct lesson from Verbmobil experience is the implicit overhead involved in handling
the interruption of normal turn processing. While Verbmobil does not support the full set of
barge in options, nor should it given that the system never has the initiative, user barge in and the
processing of spoken commands can lead to the interruption of turn processing at virtually any
point. Actually, Verbmobil does not retain the content of a turn that is interrupted by the user or
re-interpreted as a command, but this is really a detail of how interruptions are treated by local
modules. The real problem is ensuring that interruptions are perceived quickly by all affected
modules and that the synchronisation of turn processing is maintained. The direct implication of
this impinges on interchangeability of modules, in that it is assumed that each processing module
regularly checks for incoming control messages and is prepared to abort processing the current
turn at relatively short notice. If this condition is not met then inconsistent system states may
occur and the overall processing of the dialogue may actually be halted. These conditions are
to some extent the consequences of local decisions about how interruptions are to be handled.
However, these decisions have to be taken. If they are not treated with due care, requirements can
result which preclude the use of pre-existing processors which do not have the appropriate hooks
to trap real time control messages.
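The assumption that each module regularly checks for control messages can be pictured with the Python sketch below. The "abort" message, the process_turn function and the use of a standard library queue as the control channel are illustrative assumptions, not a description of the Verbmobil mechanism.

```python
import queue
from typing import Callable, List, Tuple


def process_turn(segments: List[str],
                 control: "queue.Queue[str]",
                 work: Callable[[str], str]) -> Tuple[bool, List[str]]:
    """Process a turn segment by segment, checking for control messages.

    Returns (completed, results); completed is False if the turn was aborted.
    """
    results: List[str] = []
    for segment in segments:
        # Check the control channel before each unit of work, without blocking.
        try:
            message = control.get_nowait()
        except queue.Empty:
            message = None
        if message == "abort":
            return False, results
        results.append(work(segment))
    return True, results
```

A pre-existing processor that handles a whole turn in a single call offers no point at which such a check could be made, which is why the requirement can preclude its use.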
6.1.5 Knowing your Segments and Fragments
Verbmobil is segment incremental. This means that at any one time different segments of the
same turn may be in different stages of processing. In addition, multiple segmentations are
allowed. Consequently different processing paths may recognise different segmentations of the
same turn. For the final output these segmentations have to be combined. These requirements
can only be met by keeping track at all times of which portion of the input signal corresponds to
the message that is being processed locally. This is achieved by attaching a symbolic segment
identifier to each message that is passed. Relative to the size of some messages this is a relatively
small overhead, but it is a necessary one in this context where relatively small and also variable
increments are allowed. There may be simpler ways of achieving segment incrementality, but
direct experience has shown that it is difficult to find a segmentation that serves all components
of a complex system.
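One way to picture the bookkeeping is the Python sketch below, in which every message carries an identifier for the stretch of signal it covers and the boundaries shared by all processing paths are computed. Representing the identifier as a turn number plus start and end times in milliseconds is an assumption made for illustration; the text states only that the identifier is symbolic. The producer names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Message:
    turn: int          # which turn the message belongs to
    start_ms: int      # start of the covered portion of the input signal
    end_ms: int        # end of the covered portion of the input signal
    producer: str      # processing path that produced the message
    payload: str


def boundaries(messages: List[Message], producer: str) -> Set[int]:
    """Segment boundaries used by one producer; its segments must be contiguous."""
    segs = sorted((m for m in messages if m.producer == producer),
                  key=lambda m: m.start_ms)
    for left, right in zip(segs, segs[1:]):
        if left.end_ms != right.start_ms:
            raise ValueError(f"gap or overlap in segmentation of {producer}")
    return {m.start_ms for m in segs} | {m.end_ms for m in segs}


def common_boundaries(messages: List[Message]) -> List[int]:
    """Boundaries on which every producer agrees, in signal order."""
    producers = {m.producer for m in messages}
    if not producers:
        return []
    per_producer = [boundaries(messages, p) for p in producers]
    return sorted(set.intersection(*per_producer))
```

Points on which every path agrees are candidate places at which differently segmented results could be aligned for the final output.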
6.2 Conclusions
Although Verbmobil was designed as a single system and not as a toolkit, the development
method, involving an almost uninterrupted sequence of working prototypes, ensured that the
underlying architecture framework exhibits several properties that are desirable in a toolkit. In
particular, the communications architecture, through the use of abstract functionality definitions,
multiple data pools and the publish and subscribe metaphor for message passing, supports the
wide range of configuration options available in the final system. These include exchanging
execution paths through the architecture and varying modules with the same functionality, as
well as the local configuration of individual modules.
Where the Verbmobil method is less efficient is in the selection of which modules and functionalities are defined and the apportioning of control tasks to modules. Here rather too many
accidental choices were allowed to develop over the course of the system's development, and the success
of the resulting system relies more on the individual module implementers than on the architectural
design. Although the system does not have to carry out dialogue management in the accepted
sense, this is not an adequate explanation for why there is no overall or even local controlling
instance.
Verbmobil does take the processing of spoken language seriously and achieves quasi real time
performance by adopting segment incrementality, but the notion of a segment is a flexible one.
This follows from the large number of levels and methods of processing involved. The bookkeeping overhead that is required in any asynchronous incremental system is amplified by the
need to coordinate diverse segmentations. However, turn and increment identification are necessary in any dialogue system that allows the interruption and resumption of processing, whether
it be for barge in or other functions, such as user commands, in the Verbmobil case.
To summarise and put these conclusions in the perspective of the preceding discussion of architectures for more standard spoken dialogue systems, the combination of abstract functionalities,
data pools and publish and subscribe communications would seem to be a useful communications architecture where reconfigurability of both the modules and the architecture is a priority.
Some of these facilities are already offered where a partitioned blackboard is adopted or the hub
is primarily used as a communications controller. However, control of communications needs to
be supplemented with genuine control of the processing tasks if real dialogue management is to
be carried out with realistic time performance. In this context a degree of incrementality below
the turn level is likely to be required, perhaps even different increments for different processing
tasks. That would imply turn and segment identification in all messages to maintain consistent
processing.
Chapter 7
An Architecture for SIRIDUS
The objective of the SIRIDUS architecture is to integrate developments in three main research
areas of the SIRIDUS project. These areas are: extending the Information State Update approach
of the Trindi project to cover new types of dialogue, building in and developing new approaches
to robustness in dialogue processing and exploiting the Information State Update approach in
speech recognition and synthesis. The Baseline Architecture for SIRIDUS is designed primarily
as a means of bringing into a common computational framework research that was originally
developed outside of the SIRIDUS project as well as providing a foundation for developing
both that and other work originating wholly within the SIRIDUS project.
7.1 Components and Processes
The baseline architecture for dialogue components and their processes is shown in figure 7.1.
One of our principal interests in this architecture is the possibility of running the various components in it in parallel. In particular, in our work on developing interactions between
recognition, synthesis and dialogue management we are interested in developing methods to
cope with user interruptions, user ‘back-channeling’ and system interruptions. For example, as
mentioned earlier, although commercial speech recognizers now permit the possibility of user
barge in, they are limited in that anything the user says at all (so long as it reaches recognition
threshold) counts as an interruption. Users cannot therefore backchannel by confirming parts
of a system utterance while the system is still speaking. Conversely, if the system continues a
dialogue while simultaneously looking up a database (as in the SmartSpeak example discussed
earlier) then one wants the system to be able to interrupt the dialogue if important information
comes back from the database which needs to be shared. Although it is not a first year SIRIDUS
SIRIDUS project Ref. IST-1999-10516, March 12, 2001
Page 43/58
Figure 7.1: The Component architecture
objective to build in all these capabilities into a working demonstration, it is important that our
baseline architecture permits their possibility.
7.2 TrindiKit and OAA
For implementation of this architecture, the main options are an asynchronous version of
TrindiKit, currently under development, and the Open Agent Architecture (OAA) from SRI. The final
choice of implementation route cannot be made at the time of this deliverable since the asynchronous TrindiKit is not yet available. The asynchronous TrindiKit is our preferred implementation route in that it
naturally builds in the Trindi Dialogue Move Engine. Extending the Dialogue Move Engine to cope with new types of dialogue is another SIRIDUS project objective. The optimal
solution might be to combine OAA and TrindiKit, and this section examines different ways of
seeing the relation between these two architectures.
We have decided not to use the DARPA Communicator Hub, mainly because its use in asynchronous processing remains largely untested. Also, as SIRIDUS partners are not formally
members of the DARPA project, the level of support that we might reasonably expect to receive is
somewhat unclear. Use of the current release version of TrindiKit is not a suitable route
for us since it imposes sequential, pipelined calling of dialogue system components.
Some preliminary points
The asynchronous TrindiKit is built on top of AE (Agent Environment), which can be seen
as a stripped-down version of OAA (see footnote 1). TrindiKit allows both asynchronous systems (running as
several processes) and synchronous (serial) systems, running as a single process. It is also possible
to replace AE with OAA and run TrindiKit on top of OAA.
An OAA agent declares a set of solvables (Prolog goals) and is accessed by giving it a goal
(an instantiation of any of its solvables) and getting a response. AE agents are similar to OAA
agents, but simpler. AE agents offer services (similar to OAA solvables).
TrindiKit is implemented on top of AE in roughly the following way. The TIS handler is an AE
agent offering services for accessing (checking and updating) the TIS. TrindiKit modules are
AE agents which use the services of the TIS handler, and which export the service of executing
the module algorithm to the TIS. Each module has a trigger condition (actually one or more trigger
conditions), and when such a condition holds the TIS handler will send a non-blocking request
to the module to run the appropriate algorithm.
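The triggering scheme just described can be sketched as follows. This is a minimal illustration only, using Python threads and queues in place of AE agents and services; the class TISHandler, its methods and the variable names "input" and "move" are assumptions made for the example and are not the TrindiKit or AE API.

```python
import queue
import threading
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


class TISHandler:
    """Toy TIS handler: serialises access and notifies modules whose trigger holds."""

    def __init__(self) -> None:
        self._state: State = {}
        self._lock = threading.Lock()
        self._modules: List[Tuple[Callable[[State], bool], queue.Queue]] = []

    def register(self, trigger: Callable[[State], bool], inbox: queue.Queue) -> None:
        self._modules.append((trigger, inbox))

    def check(self, name: str) -> Any:
        with self._lock:
            return self._state.get(name)

    def update(self, name: str, value: Any) -> None:
        with self._lock:
            self._state[name] = value
            snapshot = dict(self._state)
        # Requests are only queued, so the handler never waits for a module.
        for trigger, inbox in self._modules:
            if trigger(snapshot):
                inbox.put("run")


def demo() -> Any:
    tis = TISHandler()
    inbox: queue.Queue = queue.Queue()
    # Trigger condition: there is input that has not yet been interpreted.
    tis.register(lambda s: "input" in s and "move" not in s, inbox)

    def interpret_module() -> None:
        inbox.get(timeout=5)  # wait for the handler's request
        tis.update("move", "ask(" + tis.check("input") + ")")

    worker = threading.Thread(target=interpret_module)
    worker.start()
    tis.update("input", "price")  # makes the trigger condition hold
    worker.join(timeout=5)
    return tis.check("move")


if __name__ == "__main__":
    print(demo())
```

The module uses the handler's check and update services, while the handler only enqueues a request when a trigger holds; this mirrors the division of labour described above without claiming to reproduce it.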
In the rest of this section, we will investigate four ways of seeing the relation between the TrindiKit
TIS, modules and resources, AE agents, and OAA agents.
TrindiKit modules are AE agents
In this scenario, TrindiKit is run on top of AE, as described above. This solution as it stands
does not permit interaction between TrindiKit and OAA agents.
1. AE was developed solely for use with TrindiKit. The reason OAA was not used is that OAA was not available
for SICStus Prolog when implementation of the asynchronous TrindiKit started.
TrindiKit modules and the TIS are OAA agents
In this scenario, TrindiKit is run on top of OAA. The OAA agents offer solvables corresponding to the AE services. This will allow TrindiKit to interact with OAA in a natural way.
Unfortunately, preliminary tests indicate that this may be too slow for any interesting applications; in GoDiS, the number of interactions between the DME and the TIS is quite large (as
it is bound to be in any interesting system) and it takes about 2 seconds from user utterance to
system utterance (compared to about 0.5 seconds when running on top of AE). However, it must
be stressed that these results are very preliminary, and require further investigation.
A TrindiKit system, run on top of AE, is an OAA agent
On this view, TrindiKit modules and the TIS are AE agents; OAA agents may serve as “backend” services to the TrindiKit system in the form of TrindiKit modules, but not as part of
the TrindiKit dialogue system per se.
The TrindiKit system would (probably) not offer any services to the OAA community, but
OAA agents could act as TrindiKit modules, given an interface rule similar to DARPA’s Hub
scripts. The interface can either run as a separate process or be included in the OAA agent (or
possibly in the TIS handler). In general these interface rules (not to be confused with TIS update
rules) will have the form
RULE: Trigger solve(Solvable)
IN: Conditions to check before query
OUT: Operations to apply after response
An impressionistic example (assuming the TIS contains two variables, to_calendar and from_calendar,
each of whose values is a record with the fields date and entries, and that some OAA agent offers the solvable entries(Date,Entries)):
RULE: readable(to_calendar) solve(entries(Date,Entries))
IN: to_calendar.date = Date
OUT: set(from_calendar.entries,Entries)
This rule means that the interface will wait for a trigger from the TIS; on receiving the trigger it
asks the OAA agent to solve Solvable, whose arguments may include information found in the
TIS by the IN conditions; the answer is then stored in the TIS by the OUT operation.
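The behaviour of such an interface rule can be sketched in a few lines. This is an illustration under stated assumptions: the TIS is modelled as a plain dictionary, the calendar agent is a local stub rather than a real OAA agent, and the names InterfaceRule, check_in and apply_out are hypothetical. Following the module algorithms given in this section, the answer is written to from_calendar.entries.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

TIS = Dict[str, Dict[str, Any]]


@dataclass
class InterfaceRule:
    trigger: Callable[[TIS], bool]               # RULE: when to fire
    check_in: Callable[[TIS], Optional[Tuple]]   # IN: query arguments, or None
    solve: Callable[..., Any]                    # stand-in for the agent call
    apply_out: Callable[[TIS, Any], None]        # OUT: store the answer

    def fire(self, tis: TIS) -> bool:
        if not self.trigger(tis):
            return False
        args = self.check_in(tis)
        if args is None:
            return False
        self.apply_out(tis, self.solve(*args))
        return True


# Hypothetical calendar data, standing in for an OAA agent that
# offers the solvable entries(Date,Entries).
CALENDAR = {"2000-07-14": ["project meeting"]}


def entries(date: str) -> List[str]:
    return CALENDAR.get(date, [])


def demo() -> List[str]:
    tis: TIS = {
        "to_calendar": {"date": "2000-07-14", "entries": []},
        "from_calendar": {"date": None, "entries": []},
    }
    rule = InterfaceRule(
        trigger=lambda t: t["to_calendar"]["date"] is not None,
        check_in=lambda t: (t["to_calendar"]["date"],),
        solve=entries,
        apply_out=lambda t, answer: t["from_calendar"].update(entries=answer),
    )
    rule.fire(tis)
    return tis["from_calendar"]["entries"]


if __name__ == "__main__":
    print(demo())
```

Keeping the trigger, the IN conditions and the OUT operations as separate fields makes the rule declarative, so the same interface code could in principle serve any solvable.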
This solution can be implemented in TrindiKit 2.0 by building an interfacing TrindiKit
module which is also an OAA agent. The module would have a trigger (as do all modules in the
asynchronous TrindiKit) and a simple algorithm which could have the same function as the
interface rules shown above, e.g.
if to_calendar.date $= Date and
oaa_solve(entries(Date,Entries),_)
then
set(from_calendar.entries,Entries).
A similar solution (which only works for OAA agents implemented in SICStus Prolog) is to modify the original OAA agent so that it also becomes a TrindiKit module; in this case there is no
separate interface outside the OAA agent. The algorithm could be e.g.
if to_calendar/date $= Date &
entries(Date,Entries)
then
set(from_calendar/entries,Entries).
Presumably, if the OAA agent offers the service entries(Date,Entries), that predicate is available
inside the OAA agent.
In addition to calling OAA agents directly, TrindiKit could also access OAA agents indirectly
by communicating with the OAA facilitator in a way similar to that indicated above.
TrindiKit modules are either AE or OAA agents
This solution extends the previous one a bit further by allowing OAA agents (properly interfaced)
inside the TrindiKit dialogue system. This solution is probably the most flexible and useful.
It would be implemented in the same way as the previous solution.
One variant of this solution would be to run the DME and TIS serially in TrindiKit, and to use
this process as an OAA agent. This variant would not require any asynchronous behaviour from
TrindiKit. It would amount to ripping out the core of TrindiKit and using OAA as a
basic architecture.
TrindiKit resources are OAA agents
In this scenario, OAA agents are not modules in TrindiKit, as in some of the solutions above.
Instead, they are implemented as TrindiKit resources.
There is one possible problem in using agents as resources. In TrindiKit, resources are called
from TIS update rules, which are executed by the TIS handler. Only one rule can be executed at
a time, and while a rule is being executed the TIS handler will not do anything else. The idea is
that rules are bundles of TIS conditions and operations which are protected from asynchronicity;
this guarantees that the preconditions of a rule still hold when its effects are executed. The
same holds for conditions and operations; while a condition is being checked (or an operation
applied), nothing else happens in the TIS. So if, for example, a condition calls a resource which is an OAA
agent, and it takes 2 seconds before the agent replies, then the TIS will be blocked for 2 seconds
(and requests from other modules will be put in a queue).
However, the TIS-blocking problem can be avoided by defining the OAA resource interface
as a standalone part of the TIS, which means that it runs as a separate process. The TIS will still be blocked if the call to the
resource is made from an update rule, but if it is made from a module
algorithm there will be no block.
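The difference between the two call sites can be made concrete with a small sketch. It assumes a toy TIS handler that protects rule execution with a single lock; the names TISHandler, run_rule and slow_resource are hypothetical, and the lock merely stands in for whatever mechanism actually serialises TIS access.

```python
import threading
from typing import Callable, Dict, List


class TISHandler:
    """Toy handler: update rules run atomically under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.state: Dict[str, object] = {}

    def is_blocked(self) -> bool:
        return self._lock.locked()

    def run_rule(self, rule: Callable[[Dict[str, object]], None]) -> None:
        # Conditions and operations of a rule are protected from asynchronicity.
        with self._lock:
            rule(self.state)

    def set(self, key: str, value: object) -> None:
        with self._lock:
            self.state[key] = value


def demo() -> List[bool]:
    tis = TISHandler()
    blocked_during_call: List[bool] = []

    def slow_resource(date: str) -> List[str]:
        # Record whether the TIS is locked while the resource is answering.
        blocked_during_call.append(tis.is_blocked())
        return ["project meeting"]

    # Case 1: the resource is called from inside an update rule.
    def rule(state: Dict[str, object]) -> None:
        state["entries"] = slow_resource("2000-07-14")

    tis.run_rule(rule)

    # Case 2: the resource is called from a module algorithm, which only
    # touches the TIS briefly once the answer is available.
    answer = slow_resource("2000-07-14")
    tis.set("entries", answer)
    return blocked_during_call


if __name__ == "__main__":
    print(demo())
```

In the first case the lock is held for as long as the resource takes to reply; in the second the TIS is held only for the final assignment, which is the behaviour the standalone resource interface is meant to achieve.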
The resource interface (call it oaa) should import the OAA library. In addition, one should build
a TrindiKit module which calls the standalone OAA resource. Like any module, it has a trigger
and an algorithm. This allows us to define the kind of interface rules mentioned above using
the standard TrindiKit module definition format. To make life easier for users, one could
include the OAA resource interface in the TrindiKit distribution (since it is generic).
Preliminary conclusion
Above, we have presented various ways to conceive of the relation between TrindiKit and
OAA. All of these deserve further exploration, and TrindiKit 2.0 will make it possible to
implement them all and experiment with them to find the best solution. Of course, it is quite
possible that there is no single optimal solution for all situations; consequently, experimentation
may be required to determine a suitable compromise or a system-specific optimisation.
7.3 Dataflows and Interfaces
One of the aims of the Siridus project is to provide a dialogue system incorporating new features concerned with robustness and dialogue management. In particular, one project objective is to combine a ‘repair’ approach to robustness with a ‘shallow’ approach to interpretation. That is, faced
with input which cannot literally be made sense of using a linguistically motivated grammar of
the well-formed strings of a language, one can take two approaches. First, one can attempt to
repair the input by transforming it so that it is as if the input were initially perfectly well-formed.
Secondly, one can take a shallower approach to interpretation, using whatever information one can
find in the input together with suitable dialogue expectations. Our shallow approach to interpretation is also designed to be suitably gradable: the more linguistically structured information
there is in the input, the more notice the interpreter will take of it.
In our initial baseline architecture, we will be building in the combination of these two ideas
using a simple dataflow pipeline between the two components of repair and interpretation. We
shall also use a simple dataflow pipeline between recognition, parsing and repair. Of course,
the component architecture outlined in section 7.1 above permits more complicated patterns of
dataflow but we only intend to examine a pipeline model in our first year. For the data interfaces,
we intend to use a lattice or chart structure. The output of speech recognition will be a word graph
rather than a simple 1-best string or n-best list of hypotheses. Our reasons for this are precisely
the same as for the Colorado University Communicator system described in section 3.2 above.
We wish to examine whether and how much valuable information can be found clustering around
the speech recognizer’s best hypothesis even if the actual best string that the recognizer would
select on the information available to it is not optimal. The output of the repair module will also
be a chart structure but with additional edges added into it by the repair module. These edges
will be annotated with scoring information indicating the repair module’s estimate of confidence
in its own work. Finally, the ‘gradably shallow’ interpreter will be updated to take account of the
original edges and the repaired edges when available.
The resulting picture is summarized in figure 7.2.
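A chart interface of this kind can be sketched as follows. The example assumes a deliberately trivial repair strategy (bridging filled pauses with an empty edge), which is not the project's repair method; the Edge fields, the confidence value 0.8 and the example words are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Edge:
    start: int     # chart vertex where the edge begins
    end: int       # chart vertex where the edge ends
    label: str     # word (empty string for a repair edge that skips material)
    score: float   # recognizer score or repair confidence
    source: str    # "recognizer" or "repair"


class Chart:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, edge: Edge) -> None:
        self.edges.append(edge)


def repair_fillers(chart: Chart, fillers: Tuple[str, ...] = ("uh", "um"),
                   confidence: float = 0.8) -> int:
    """Add an empty edge over every filler; the original edges are kept."""
    added = 0
    for edge in list(chart.edges):
        if edge.source == "recognizer" and edge.label in fillers:
            chart.add(Edge(edge.start, edge.end, "", confidence, "repair"))
            added += 1
    return added


def readings(chart: Chart, start: int, goal: int) -> List[List[str]]:
    """Word sequences along every path from start to goal."""
    if start == goal:
        return [[]]
    result: List[List[str]] = []
    for edge in chart.edges:
        if edge.start == start:
            for rest in readings(chart, edge.end, goal):
                result.append(([edge.label] if edge.label else []) + rest)
    return result


def demo() -> List[List[str]]:
    chart = Chart()
    for i, word in enumerate(["to", "uh", "paris"]):
        chart.add(Edge(i, i + 1, word, 1.0, "recognizer"))
    repair_fillers(chart)
    return readings(chart, 0, 3)


if __name__ == "__main__":
    print(demo())
```

Because repair only adds edges, a downstream interpreter can still see the unrepaired reading and can weigh the repaired one by the repair module's own score.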
7.4 Dataflow timings
Given the use of a parallel and asynchronous architecture, it is not sufficient simply to map out
where information flows. When it flows also becomes very significant. In our baseline
architecture, we intend to explore one particular possibility: the impact on parsing, repair and
shallow interpretation of the recognizer delivering a sequence of partial word-graphs as speech
recognition proceeds. That is, speech recognition will not wait until it has processed everything
there is to be processed before delivering its output. Rather, it will output partial hypotheses
during recognition. It is an interesting research question what stability one will find in the partial
hypotheses and how much use can be made of them by downstream processing.
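One simple way to quantify the stability of partial hypotheses is sketched below. It assumes that each partial output is reduced to its best word string and that a word counts as stable once the last k partial hypotheses agree on it; both the policy and the function name committed_prefix are assumptions made for illustration, not a project decision.

```python
from typing import List


def committed_prefix(partials: List[List[str]], k: int = 2) -> List[str]:
    """Longest prefix on which the last k partial hypotheses agree."""
    recent = partials[-k:]
    if len(recent) < k:
        return []
    prefix: List[str] = []
    for words in zip(*recent):  # stops at the shortest hypothesis
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix


def demo() -> List[str]:
    # Hypothetical best strings delivered as recognition proceeds.
    partials = [
        ["i"],
        ["i", "want"],
        ["i", "want", "a", "fright"],
        ["i", "want", "a", "flight", "to"],
    ]
    return committed_prefix(partials, k=2)


if __name__ == "__main__":
    print(demo())
```

A downstream parser could begin work on the committed prefix and treat the remainder as provisional; a larger k trades latency for stability.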
Figure 7.2: Baseline Dataflows
7.5 Word Lattices as a Data Format
Speech recognition systems often produce many plausible word strings for each utterance and
need to pass this ambiguity on to the next stage of processing. Early systems output only one
word string, the best scoring one. Because the path score is based on the acoustic model
of the context-conditioned phonemes and a word sequence model, sometimes the second or even the 99th
best hypothesis is actually the correct one. This led to the idea of n-best lists of hypothesized
word strings. Because each string often differs from the others by a single word, the whole of a 10-best list
might consist of the possible permutations of 4 words. This is not a very good way to represent
ambiguity of this sort. An alternative is to provide a word lattice which contains the words and
their start and stop times spanning the length of the utterance. This way a four-way ambiguity
between two time points is just a list of words with the same start and end times and their
probability scores. The depth of the lattice at any point depends on the level of pruning. The idea
is to set the pruning threshold so that the correct sentence is not lost from the search. Figure 7.3
shows a word lattice in graph form for a simple sentence in the Wall Street Journal domain: “It’s
like having an umbrella insurance policy, Camuzzi said.” The pruning threshold was very low
for this word lattice.
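The compactness argument can be illustrated with a small sketch that expands a lattice into its n-best list. The arcs, times and log-probability scores below are invented for illustration (only the first words echo the example sentence), and the exhaustive path enumeration is a toy technique suitable only for very small lattices, not a description of how any recognizer computes its n-best output.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Arc:
    start: int      # start time (hypothetical units)
    end: int        # end time
    word: str
    logprob: float  # assumed convention: higher is better


def n_best(arcs: List[Arc], start: int, end: int, n: int) -> List[Tuple[float, List[str]]]:
    """Enumerate all complete paths and return the n best by total score."""
    by_start: Dict[int, List[Arc]] = {}
    for arc in arcs:
        by_start.setdefault(arc.start, []).append(arc)

    def expand(t: int) -> List[Tuple[float, List[str]]]:
        if t == end:
            return [(0.0, [])]
        paths: List[Tuple[float, List[str]]] = []
        for arc in by_start.get(t, []):  # a time with no arcs is a dead end
            for score, words in expand(arc.end):
                paths.append((arc.logprob + score, [arc.word] + words))
        return paths

    return sorted(expand(start), key=lambda p: p[0], reverse=True)[:n]


LATTICE = [
    Arc(0, 30, "it's", -0.1),
    Arc(30, 60, "like", -0.2),
    Arc(30, 60, "light", -1.5),     # competing word over the same time span
    Arc(60, 100, "having", -0.3),
    Arc(60, 100, "heaving", -2.0),  # competing word over the same time span
    Arc(100, 120, "an", -0.1),
]


def demo() -> List[List[str]]:
    return [words for _, words in n_best(LATTICE, 0, 120, n=10)]


if __name__ == "__main__":
    for words in demo():
        print(" ".join(words))
```

Six arcs here encode four complete strings; each further two-way ambiguity adds two arcs to the lattice but doubles the length of the equivalent n-best list.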
Word lattices can also be provided at time intervals of approximately 1/4 of a second. These
lattices should have the correct words in them, but the scores might be wrong. How much the
scores change and whether impossible paths appear in the lattice is a research question which
will be studied during the course of the project. Some special pruning methods are available,
which attempt to eliminate most dead end paths, before putting out a word lattice. There may be
paths which fail on the last few words, which would have to be removed from the parsing near
the end of processing.
Figure 7.3: Word Lattice
7.6 Speech Recognizers
The Siridus project has available to it several recognizers which will be tried and refined for our
purposes. First of all there is the Decipher recognizer from SRI USA, which has successfully
competed in the DARPA project evaluations. This recognizer uses triphone acoustic models and
an n-gram grammar for word sequence modeling. It is capable of outputting a word lattice, which
is the preferred interchange format between the recognizer and the Siridus parser. This recognizer
runs on Suns and on PCs under Windows NT. Another recognizer is the Entropic HTK recognizer,
which uses similar phonetic models and n-gram grammars for recognition. This recognizer can
output partial word graphs and thus can provide a left-to-right progressive hypothesis for the
recognized utterance. This is very useful for parsing, since the parser can begin before the end
of the sentence, thus cutting down the delay due to recognition. We plan to use this capability
extensively in the Siridus system.
The Siridus partner Gothenburg University has been given the CMU DARPA Communicator system in a cooperative arrangement with Carnegie Mellon University. Since this is a complete
DARPA Communicator system which uses the DARPA hub only as a pass-through mechanism,
it gives a platform for integrating the Siridus architecture into a larger system which includes
travel database lookup and processing of the database response into a generated speech response.
The present recognizer in the CMU system is the Sphinx II system, which is known to be fast but
less accurate than the current Sphinx III system. The Sphinx III system is to become available
in the next few months. This system runs on Windows NT (recognizer) and Solaris (dialogue
manager, database lookup).
Finally, we have been experimenting with the IBM Via Voice recognizer, which runs on Linux
and Windows, and the Dragon Naturally Speaking recognizer, which runs on Windows. These
are large vocabulary dictation systems, which recognize many words that are outside our travel
dialogue domain. This should mean that it takes longer to recognize the words within
the dialogue domain, because there are so many words (approximately 100,000) in the dictation
recognition system. Also, these systems do not provide word lattices, but output only the best scoring
transcription for the utterance. We do not anticipate using these in the Siridus system.
7.7 Prosodic Markings
Important words and phrase boundaries are marked by the speaker prosodically, using loudness,
pitch rise-falls, and duration. Using acoustic measures of the syllabic nuclei found by the recognizer, it is possible to find the most probable focus words and phrase boundaries in an utterance.
These help in determining what the speaker means in the utterance. Given a word lattice and
the prosodic markings it is possible to mark the focus words, make sure that the lexical stress
aligns with the acoustic stress and to mark major phrase boundaries. Major phrase boundaries
help the parser to eliminate unlikely parses, and partition the parses, in case the parse of the
whole sentence fails. Phrase final intonation may also help to tell questions from statements,
but this is not always the case. Certain questions (yes/no questions and wh-questions) are often
produced with declarative (falling) intonation, whereas other questions would be produced with
rising intonation.
A sentential stress detector based on a finite state machine architecture was developed by Hieronymus and Williams [9].
This detector was tested on spontaneous monologues and achieved good agreement with hand
labels. Several attempts at stress detectors integrated with the speech recognizer have failed
in the past. This remains an interesting area of research. The Siridus Project plans to use the
Hieronymus and Williams detector in initial prosodic markings of the word lattices.
7.8 Speech Generation
The dialogue manager needs to generate questions for the dialogue. Since the Information State
(IS) has a representation of the information it needs, the correct question with correct intonation
can be generated. Each speech synthesis system has its own way of marking intonation, and there
are some standard marking methods like SABLE, which allow standard intonation marking to
be added to the text. Intonation markings for synthesis can be generated automatically from the
IS. Some preliminary work on generating speech from the IS forms has shown that the resulting
intonation is less ambiguous in intent than the default intonation provided by the TTS system.
We intend to extend this work to more complex questions which might be asked in the travel
dialogue domain.
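As a minimal sketch of how intonation marking might be derived automatically, assume (hypothetically) that the IS yields the words of a question together with the positions of the words to be focused. The function name `mark_focus` and this input format are assumptions made for illustration; only the SABLE-style `<EMPH>` element is taken from the markup standard mentioned above.

```python
from xml.sax.saxutils import escape


def mark_focus(words, focus_positions):
    """Wrap the words at the given positions in SABLE-style <EMPH> elements.

    `words` is a list of tokens; `focus_positions` is a set of indices into
    that list. Both are assumed to come from the Information State.
    """
    marked = []
    for position, word in enumerate(words):
        text = escape(word)  # protect characters such as '&' and '<'
        if position in focus_positions:
            text = "<EMPH>" + text + "</EMPH>"
        marked.append(text)
    return "<SABLE>" + " ".join(marked) + "</SABLE>"


# Hypothetical question with the focus on the word 'city'.
question = ["Which", "city", "do", "you", "want", "to", "fly", "to"]
print(mark_focus(question, {1}))
```

Using positions rather than word strings keeps the two occurrences of 'to' apart, so that only the intended token is emphasised.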
Dialogue systems need to ask questions in different ways, rather than asking the same question
during subsequent dialogues. We will explore different ways of generating novel questions which
have essentially the same meaning. One source is to use data from natural dialogues to collect
a list of questions which humans use in similar circumstances. By choosing from the list of
questions in a random way, the system appears to be more natural. Another technique developed
at CMU is to generate questions from word trigrams for the set of similar circumstance questions.
The trigrams sometimes produce repeated words, but these can be eliminated simply. This in
principle gives a greater variety of sentences, but some of them may seem to be completely
unnatural to a native speaker. We will experiment with ways of generating questions which give
the dialogue system many ways to ask the same logical question.
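The two ideas above can be sketched as follows. This is only an illustration: the list of question variants is invented, and we assume that 'repeated words' means the same word occurring twice in a row, which is one simple reading of the clean-up step for trigram-generated questions.

```python
import random

# Hypothetical variants of one logical question, as might be collected
# from natural dialogues.
DESTINATION_QUESTIONS = [
    "Where would you like to go?",
    "What is your destination?",
    "Which city are you travelling to?",
]


def pick_question(variants, rng=random):
    """Choose one surface form at random from a list of equivalent questions."""
    return rng.choice(variants)


def drop_repeated_words(words):
    """Remove immediate repetitions such as 'to to' from a generated word list."""
    cleaned = []
    for word in words:
        if not cleaned or cleaned[-1].lower() != word.lower():
            cleaned.append(word)
    return cleaned


print(pick_question(DESTINATION_QUESTIONS))
print(" ".join(drop_repeated_words("where do you you want to to go".split())))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when testing a dialogue system.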
7.9 Speech Synthesis
A range of speech synthesizers is being tried in the Siridus system, both to test different levels of quality and prosodic control, and to provide other languages besides American English.
Presently a Bell Labs system which runs on Windows, the IBM ViaVoice synthesizer from Eloquent Systems which runs on Linux and a Telia Synthesizer which runs on Windows have been
tried. A new set of synthesis systems has been developed by ATR, AT&T Research and Lernout
and Hauspie, which concatenate larger sections of speech, such as words and phrases.
These produce more natural sounding speech, but have strange intonations and abrupt transitions at
times. We will explore the trade-offs between normal diphone synthesis and larger unit synthesis
in the Siridus system. It is not yet clear if these larger unit synthesis systems are good at changing
the focus of the words being produced, if they are marked as focused in the text. This feature
seems essential in our use of speech output for dialogues.
7.10 Platforms
Many of the latest recognition systems aimed at telephone dialogues use PCs and Windows to
do the speech recognition and synthesis. This is because there is a much greater variety of
telephony interfaces to PC platforms. Given the recognition interface, the subsequent processing
can be done on other machines running Sun Solaris or Linux. During the Siridus project we
will experiment with different platform configurations. Since the core DME, the asynchronous
DME, and the OAA are written in Prolog, we expect to use Linux or Solaris platforms to run these
components. The Siridus system will use standard interfaces, so that components running on
different platforms can be integrated together into a final system.
7.11 The Repair Module
In this section we will specify the functionality of the repair module we intend to incorporate in the
initial Siridus prototype architecture. The core of this module is a set of rules for handling spontaneous speech phenomena adopted from the corresponding Verbmobil robust semantic processing
submodule [20, 21, 17]. The rules represent heuristics for recognising phenomena such as hesitation, self corrections and false starts and attempting to remedy their effects by reconstructing the
intended utterance. In this sense the rules repair a fragmented input by reconstructing a coherent
utterance out of the consistent parts of the input.
The input and output behaviour of the module, or more abstractly the corresponding functionality,
is fairly symmetrical, in that the input and output data structures are of the same type and vary
in principle in their size and number. The level at which the rule base is encoded requires that
the objects processed be linguistic analyses that exhibit clearly defined syntactic and semantic
properties. The original application made use of Verbmobil Interface Terms (VITs) [5], but the
individual tests and operations used in the rules are implemented via an ADT definition so that
the rule application can be adapted to any structures that exhibit the appropriate properties, e.g.
HPSG signs.
The Verbmobil module maintained its own internal chart for storing fragments and results, as a
VIT Hypothesis Graph [20]. In Siridus we intend to employ a common chart for all linguistic
analyses and recognition results, so that storage will be external to the module. Hence the module
will read a sequence of fragmentary analyses from the semantic chart and write its result to the
same chart. At this stage the process does not impose many constraints on the type of analyses
the chart contains, but two conditions must be met:
- The provision of ADT functions supporting the tests and operations in the rules.
- The inclusion of a quality measure in the resulting analyses, since the rules are heuristic in
  both the tests for the recognition of phenomena covered and in the repair operations carried
  out.
The repair module is not dependent on a preceding analyser completing its task, or for that matter
on the completion of the word lattice. It can commence processing as soon as viable fragments
are available. However, this does imply access to fragments, which is guaranteed by taking the
common semantic chart as one of the basic data pools.
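The two conditions on chart analyses can be illustrated with a small sketch. Everything here is hypothetical: the `Analysis` interface stands in for the ADT, `Fragment` is one possible implementation of it, and the single rule is a toy heuristic for self-corrections, not one of the Verbmobil rules. The point is only that the rule touches analyses through the ADT operations and that its result carries a quality measure.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class Analysis(Protocol):
    """Hypothetical ADT: the only operations a repair rule may rely on."""

    def category(self) -> str: ...
    def words(self) -> List[str]: ...
    def quality(self) -> float: ...


@dataclass
class Fragment:
    """One possible implementation of the ADT; VITs or HPSG signs would be others."""

    cat: str
    tokens: List[str]
    score: float = 1.0

    def category(self) -> str:
        return self.cat

    def words(self) -> List[str]:
        return self.tokens

    def quality(self) -> float:
        return self.score


def repair_self_correction(left: Analysis, right: Analysis,
                           penalty: float = 0.8) -> Optional[Fragment]:
    """Toy rule: two adjacent fragments of the same category are read as a
    self-correction, so the later one replaces the earlier one.

    The result is penalised because the rule is only a heuristic.
    """
    if left.category() != right.category():
        return None
    new_score = penalty * min(left.quality(), right.quality())
    return Fragment(right.category(), list(right.words()), new_score)


repaired = repair_self_correction(Fragment("PP", ["to", "London"]),
                                  Fragment("PP", ["to", "Boston"]))
print(repaired)
```

Because the rule never inspects the fields of `Fragment` directly, the same rule could be applied to any structure that provides the three operations.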
7.12 Parsing and Semantic Interpretation
The parsing module is designed to take as input a word lattice and create extra edges encoding
syntactic and semantic information. The input need not be complete: the parser creates all the
edges it can with the existing partial input, then adds to these as new input arrives.
The parser works similarly to a chart parser. A standard chart parser takes edges between word
positions and joins them together. For example, a Determiner from positions 1 to 2 and a Noun
from position 2 to 3 might be joined to give a NP from position 1 to 3. Here we just adapt this
to a lattice so that we are taking a Determiner from Node1 to Node2 and a Noun from Node2
to Node3 to give a NP from Node1 to Node3. In a lattice, edges are between nodes rather than
word positions or time intervals since we can only join together two edges if the end node of
the first matches the start node of the second. The first edge finishing at the same time point as
the second edge starting is not sufficient. For example, consider a recogniser which hypothesises
words “abc” and “uvw” between times T1 and T2, and words “def” and “xyz” between times
T2 and T3. It may be the case that the recogniser allows “abc” followed by “def” and “uvw”
followed by “xyz” but not “abc” followed by “xyz”. Thus “abc” will be an edge finishing in a
different node, but at the same time as “uvw”.
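The difference between matching on nodes and matching on times can be made concrete with a short sketch. The node names and the `Edge` type are assumptions for illustration; the lattice encodes the example above, with two distinct nodes, N2a and N2b, that both lie at time T2. We further assume that "uvw" followed by "def" is also disallowed, which the example leaves open.

```python
from typing import List, NamedTuple, Tuple


class Edge(NamedTuple):
    start: str  # name of the start node
    end: str    # name of the end node
    label: str  # word hypothesised by the recogniser


def can_follow(first: Edge, second: Edge) -> bool:
    """Two edges may be joined only if they share a node, not merely a time."""
    return first.end == second.start


# N2a and N2b are different nodes at the same time point T2.
EDGES = [
    Edge("N1", "N2a", "abc"),
    Edge("N1", "N2b", "uvw"),
    Edge("N2a", "N3", "def"),
    Edge("N2b", "N3", "xyz"),
]


def word_pairs(edges: List[Edge]) -> List[Tuple[str, str]]:
    """List every pair of words that the lattice allows in sequence."""
    return [(a.label, b.label) for a in edges for b in edges if can_follow(a, b)]


print(word_pairs(EDGES))
```

A test on times alone would also have admitted "abc" followed by "xyz", which this lattice excludes.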
As well as creating new edges, the parser also creates a record of how each larger edge was
formed, e.g. that the NP edge from Node1 to Node3 can be formed from the Determiner edge
between Node1 and Node2 and the Noun edge between Node2 and Node3. The new edges and
records encode the syntactic structure in a convenient packed format.
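A minimal sketch of this bookkeeping, assuming a toy grammar with binary rules only and edges represented as (start node, end node, category) triples. The names `parse` and `RULES` are illustrative and do not describe the actual Siridus parser. Each derived edge is stored once, and every way of forming it is appended to its list of records, which is what gives the packed format.

```python
from collections import defaultdict

# Hypothetical grammar: (left category, right category) -> parent category.
RULES = {("Det", "Noun"): "NP"}


def parse(edges, rules):
    """Add every edge licensed by `rules` and record how it was formed.

    Returns the chart (a set of edges) and a mapping from each derived edge
    to the list of (left edge, right edge) pairs that can form it.
    """
    chart = set(edges)
    records = defaultdict(list)
    changed = True
    while changed:  # repeat until no new edge or record appears
        changed = False
        for left in list(chart):
            for right in list(chart):
                if left[1] != right[0]:  # end node must equal start node
                    continue
                parent_cat = rules.get((left[2], right[2]))
                if parent_cat is None:
                    continue
                parent = (left[0], right[1], parent_cat)
                if (left, right) not in records[parent]:
                    records[parent].append((left, right))
                    changed = True
                chart.add(parent)
    return chart, records


chart, records = parse([("node1", "node2", "Det"), ("node2", "node3", "Noun")], RULES)
print(records[("node1", "node3", "NP")])
```

If the same NP edge could be formed in a second way, the second pair would simply be appended to the same list rather than creating a duplicate edge.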
For semantics, the parser similarly creates new edges between nodes in the lattice. Each edge is
associated with an indexed semantic representation. This allows similar packing to the syntax, so
working with a relatively large lattice should be plausible. Details of the semantic representation
and further motivation are given in [15].
The simplest model of processing assumes that the lattice grows ‘edge monotonically’, in the
sense that the recogniser adds extra edges as it absorbs more input, and no edges are deleted,
but weightings on existing edges may change. This would allow the interaction to be relatively
simple: the parser takes a new edge from the lattice (or waits until a new edge is ready), and sends
back a new set of syntactic and semantic edges to update the lattice. If weightings on individual
edges change, the changes would have to be percolated up to derived edges (alternatively all
derived weightings would be expressed as formulae and evaluated on demand).
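The second alternative, evaluating derived weightings on demand, might look like the following sketch. The class name and the choice of a product as the combining formula are assumptions made purely for illustration; the text does not fix how derived weightings are computed.

```python
class WeightedLattice:
    """Stores recogniser weights and derives all other weights on demand."""

    def __init__(self):
        self.word_weight = {}  # recogniser edge -> current weight
        self.parts = {}        # derived edge -> edges it was built from

    def set_word_weight(self, edge, weight):
        """Record or update the recogniser's weight for a word edge."""
        self.word_weight[edge] = weight

    def add_derived(self, edge, parts):
        """Record that a parser edge was built from the given edges."""
        self.parts[edge] = tuple(parts)

    def weight(self, edge):
        """Evaluate the weight formula for an edge at the time of the call."""
        if edge in self.word_weight:
            return self.word_weight[edge]
        result = 1.0
        for part in self.parts[edge]:
            result *= self.weight(part)  # assumed formula: product of parts
        return result


lattice = WeightedLattice()
lattice.set_word_weight("det", 0.5)
lattice.set_word_weight("noun", 0.5)
lattice.add_derived("np", ["det", "noun"])
print(lattice.weight("np"))          # 0.25
lattice.set_word_weight("noun", 0.25)
print(lattice.weight("np"))          # 0.125
```

No percolation step is needed when the recogniser revises a weight, at the cost of recomputing derived weights each time they are requested.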
Unfortunately, parsing the whole lattice is unlikely to be practical. We will explore several
options. One is for the parser to work just with the best weighted edges at any point. This would
allow a single uniform lattice incorporating the output of the recogniser, parser and repair modules.
Another is for the recogniser to output pruned lattices. These would be much smaller and much
easier to parse, but there would be no guarantee of edge monotonicity. A pruned lattice at time
T2 may not contain all the edges of a pruned lattice at time T1. There are again options here:
if we find that in practice the pruned lattices are usually edge monotonic, we may be able to
deal efficiently with the occasional non-monotonic cases by adding any new edges required, and
keeping the discarded edges with weights set to zero.
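The last option can be sketched as a merge of the lattice held so far with the latest pruned lattice. We assume here that a lattice is represented simply as a mapping from edges to weights; the function name `merge_pruned` is illustrative.

```python
def merge_pruned(current, pruned):
    """Merge a newly pruned lattice into the lattice held so far.

    Both arguments map edges to weights. Edges that the recogniser has
    discarded are kept with weight zero; edges that are new are added;
    edges present in both take the latest weight.
    """
    merged = {}
    for edge in current:
        merged[edge] = pruned.get(edge, 0.0)  # discarded edges drop to zero
    for edge, weight in pruned.items():
        merged.setdefault(edge, weight)       # edges not seen before
    return merged


at_t1 = {("N1", "N2", "to"): 0.6, ("N1", "N2", "two"): 0.4}
at_t2 = {("N1", "N2", "to"): 0.7, ("N2", "N3", "Boston"): 0.3}
print(merge_pruned(at_t1, at_t2))
```

Since no edge is ever removed from the merged lattice, syntactic and semantic edges already derived from a discarded edge remain valid and need only have their weights re-evaluated.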
7.13 Translation to Dialogue Moves
In this section we will specify the functionality of the translation module which we intend to
incorporate in the initial Siridus prototype architecture. The translation module takes the output
of the repair module plus the dialogue context and maps this to one or more dialogue moves
which act as input to the dialogue manager.
The translation module uses the output of the repair module, which it assumes to be a lattice that
includes semantic edges. We tend to think of the edges as comprising a semantic representation
for the utterance (a ‘semantic chart’ or ‘semantic lattice’): it may be partial i.e. containing
unconnected fragments, and may also represent several different packed readings.
Mapping consists of going from the semantic chart to database slots or a task language. This is
achieved by providing mapping rules which are of the form:
partial semantic representation + constraints on context
→ database slot value/command
The partial semantic representation is matched against the chart and the contextual constraints are
checked. If there is a match, then the database slot value is added to a set of potential mappings.
The most specific set of consistent mappings is chosen. [15] provides example mappings and
more motivation for the approach.
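The matching step can be sketched under strong simplifying assumptions: the semantic chart is reduced to a set of atomic facts written as strings, specificity is measured by the number of facts a rule mentions, and consistency is taken to mean at most one value per slot. None of these choices is taken from [15]; the rule format and all names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class MappingRule:
    pattern: FrozenSet[str]                       # facts required in the chart
    context_ok: Callable[[Dict[str, str]], bool]  # constraint on the context
    slot: str
    value: str


def choose_mappings(rules: Iterable[MappingRule],
                    chart: FrozenSet[str],
                    context: Dict[str, str]) -> Dict[str, str]:
    """Return one value per slot, preferring the most specific matching rule."""
    matches = [rule for rule in rules
               if rule.pattern <= chart and rule.context_ok(context)]
    chosen: Dict[str, str] = {}
    # Larger patterns are tried first, so a general rule cannot override
    # a more specific one for the same slot.
    for rule in sorted(matches, key=lambda r: len(r.pattern), reverse=True):
        chosen.setdefault(rule.slot, rule.value)
    return chosen


def always(context: Dict[str, str]) -> bool:
    return True


RULES = [
    MappingRule(frozenset({"prep:to"}), always, "destination", "unspecified"),
    MappingRule(frozenset({"prep:to", "city:Boston"}), always,
                "destination", "Boston"),
    MappingRule(frozenset({"city:Boston"}),
                lambda ctx: ctx.get("last_question") == "origin",
                "origin", "Boston"),
]

chart = frozenset({"prep:to", "city:Boston"})
print(choose_mappings(RULES, chart, {"last_question": "destination"}))
```

In this example the third rule fails its contextual constraint, and the two remaining rules compete for the destination slot, where the more specific one wins.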
The constraints on context may refer to the last utterance, or the current state of the task constraints e.g. which slots are filled in a slot-value model for a particular task. Thus the translation
module needs information created and stored by the dialogue manager.
The output of the translation module is a sequence of moves. We take a move to be something
with both content and function. For example, ‘to Boston’ results in a move with function ‘add’
and content ‘destination=Boston’. A more complex utterance such as ‘not to London, to Boston’
results in a sequence of two moves:
retract(destination=London);add(destination=Boston)
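As an illustration of a move as a pairing of function and content, the following sketch applies such a sequence to a simple slot-value state. The `Move` type and the update behaviour are assumptions; how the dialogue manager actually consumes moves is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Move:
    function: str  # 'add' or 'retract' in this minimal scheme
    slot: str
    value: str


def apply_moves(state: Dict[str, str], moves: Iterable[Move]) -> Dict[str, str]:
    """Return a new slot-value state after applying the moves in order."""
    new_state = dict(state)
    for move in moves:
        if move.function == "add":
            new_state[move.slot] = move.value
        elif move.function == "retract":
            # Only retract a value that is actually held for the slot.
            if new_state.get(move.slot) == move.value:
                del new_state[move.slot]
        else:
            raise ValueError("unknown move function: " + move.function)
    return new_state


moves = [Move("retract", "destination", "London"),
         Move("add", "destination", "Boston")]
print(apply_moves({"destination": "London"}, moves))
```
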
We intend to experiment with various granularities of move ‘function’, ranging from the minimal
distinction above which states just whether to add or retract, to the finer grained distinctions
provided by Conversational Game Theory.
Bibliography
[1] A. Goldschen and D. Loehr. The role of the DARPA Communicator architecture as a human
computer interface for distributed simulations. In 1999 Simulation Interoperability Standards Organization (SISO) Spring Simulation Interoperability Workshop (SIW), Orlando,
Florida, March 14-19 1999, 1999.
[2] H. Alshawi. The Core Language Engine. M.I.T.Press, 1992.
[3] Marko Auerswald. Kommunikation und synchronization der verarbeitung in einem modularen speech-to-speech translation system. Master’s thesis, Department of Computer Science, University of Kaiserslautern, Germany, 1997.
[4] H. Aust and M. Oerder. Dialogue control in automatic inquiry systems. In Proceedings of
ESCA Workshop on Spoken Dialogue Systems, Vigso, Denmark, pages 121–124, 1995.
[5] Johan Bos, Bianka Buschbeck-Wolf, Michael Dorna, and C. J. Rupp. Managing information at linguistic interfaces. In Proc. of the 17th COLING/36th ACL, Montréal, Canada,
1998.
[6] D. Carter, M. Rayner, P. Bouillon, and M. Wirén. Spoken Language Translation. Cambridge
University Press, 2000.
[7] J. Ginzburg. Dynamics and the semantics of dialogue. In J. Seligman and D. Westerståhl,
editors, Logic, Language and Computation, volume 1. CSLI publications, 1996.
[8] Object Management Group. The complete CORBA/IIOP 2.1 specification.
http://www.omg.org/corba/corbiiop.htm, 1997.
[9] J.L. Hieronymus and B.J. Williams. An investigation of the relation between perceived
pitch accent and automatically-located accent in British English. In Proceedings of Eurospeech 91, Genoa, Italy, Vol. 3, pages 1157–1160, 1991.
[10] Andreas Klüter, Alassane Ndiaye, and Heinz Kirchmann. Verbmobil from a software engineering point of view: System design and software integration. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Heidelberg, 2000.
To appear.
[11] I. Lewin, R. Becket, J. Boye, D. Carter, M. Rayner, and M. Wirén. Language processing for
spoken dialogue systems: is shallow parsing enough? In Accessing Information in Spoken
Audio: Proceedings of ESCA ETRW Workshop, Cambridge, 19 & 20th April 1999, pages
37–42, 1999.
[12] I. Lewin and S.G. Pulman. Inference in the resolution of ellipsis. In Proceedings of ESCA
Workshop on Spoken Dialogue Systems, Vigso, Denmark, pages 53–56, 1995.
[13] D. Lewis. Scorekeeping in a language game. Journal of Philosophical Logic, 8:339–359,
1979.
[14] Microsoft. Distributed component object model protocol DCOM/1.0.
http://www.microsoft.com/activex/+dcom, 1996.
[15] D. Milward. Distributing representation for robust interpretation of dialogue utterances. In
Proceedings of ACL 2000, Hong Kong, 2000.
[16] Nuance-Communications. Nuance speech recognition system, version 5, developer’s
manual. Technical report, Nuance Communications, Menlo Park, California, 1996.
http://www.nuance.com.
[17] Manfred Pinkal, C.J. Rupp, and Karsten L. Worm. Robust semantic processing of spoken language. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech
Translation. Springer, Heidelberg, 2000. To appear.
[18] A. Stent, J. Dowding, J.M. Gawron, E.O. Bratt, and R. Moore. The CommandTalk spoken
dialogue system. In Proceedings of the 37th ACL, Maryland, pages 183–190, 1999.
[19] W. Ward and B. Pellom. The CU Communicator. In IEEE Workshop on Automatic Speech
Recognition and Understanding, Keystone, Colorado, 1999.
[20] Karsten L. Worm. Robust Semantic Processing for Spoken Language. PhD thesis, Universität des Saarlandes, Saarbrücken, Germany, June 2000.
[21] Karsten L. Worm and C. J. Rupp. Towards robust understanding of speech by combination
of partial analyses. In Proc. of the 13th ECAI, pages 190–194, Brighton, UK, 1998.