Siridus System Architecture and Interface Report (Baseline)
Ian Lewin, C.J. Rupp, Jim Hieronymus, David Milward, Staffan Larsson, Alexander Berman

Distribution: Public

Specification, Interaction and Reconfiguration in Dialogue Understanding Systems: IST-1999-10516
Deliverable D6.1
July 2000

Göteborg University, Department of Linguistics
SRI Cambridge, Natural Language Processing Group
Telefónica Investigación y Desarrollo SA Unipersonal, Speech Technology Division
Universität des Saarlandes, Department of Computational Linguistics
Universidad de Sevilla, Julietta Research Group in Natural Language Processing

For copies of reports, updates on project activities and other SIRIDUS-related information, contact:

The SIRIDUS Project Administrator
SRI International
23 Millers Yard, Mill Lane, Cambridge, United Kingdom CB2 1RQ
milward@cam.sri.com

See also our internet homepage http://www.cam.sri.com/siridus

© 2000, The Individual Authors

SIRIDUS project Ref. IST-1999-10516, March 12, 2001

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Primary responsibility for authorship is divided as follows. Ian Lewin and C.J. Rupp wrote Chapter 1. Ian Lewin was the overall editor and author of chapters 2 through 5. Chapter 6 was written by C.J. Rupp. Chapter 7, describing the SIRIDUS architecture, is the result of a collaborative effort by all the authors.

Contents

1 Introduction
2 Dialogue architectures and Dialogue management
    2.1 Context Sensitivity
    2.2 Early decision making
    2.3 Asynchronous interaction
3 The DARPA Communicator Architecture
    3.1 Instance 1: The Mitre ‘CommandTalk’ system
        3.1.1 Outline
        3.1.2 Discussion
    3.2 Instance 2: The CU/CMU Communicator System
        3.2.1 Outline
        3.2.2 Discussion
    3.3 Conclusion
4 The Open Agent Architecture
    4.1 Instance: SRI’s CommandTalk
        4.1.1 Outline
        4.1.2 Discussion
    4.2 Conclusion
5 The TrindiKit Architecture
    5.1 Instance 1: GoDiS - Gothenburg Dialogue System
        5.1.1 Discussion
    5.2 Instance 2: Conversational Game Player
        5.2.1 Discussion
    5.3 Conclusion
6 The Verbmobil Communications Architecture
    6.1 Diverse Lessons from Direct Experience with a Pool Communications Architecture
        6.1.1 Specifying Functionalities
        6.1.2 Publishing and Subscribing
        6.1.3 Distributing Control
        6.1.4 Interrupting Processing
        6.1.5 Knowing your Segments and Fragments
    6.2 Conclusions
7 An Architecture for SIRIDUS
    7.1 Components and Processes
    7.2 TrindiKit and OAA
    7.3 Dataflows and Interfaces
    7.4 Dataflow timings
    7.5 Word Lattices as a Data Format
    7.6 Speech Recognizers
    7.7 Prosodic Markings
    7.8 Speech Generation
    7.9 Speech Synthesis
    7.10 Platforms
    7.11 The Repair Module
    7.12 Parsing and Semantic Interpretation
    7.13 Translation to Dialogue Moves
Chapter 1

Introduction

The purpose of this deliverable is to discuss the requirements for a dialogue system architecture for the Siridus Project and describe a baseline architecture for the initial Siridus demonstrator dialogue system. This system is conceived of as an instantiation of an integrated dialogue toolkit. Therefore, in considering architectural requirements it is important to bear in mind a range of configuration options, overall modularity and the potential for scalability, in keeping with the general goals of the Siridus project.

In addition, the proposed architecture should be capable of sustaining the more immediate project objectives. These include: supporting the Information State Update approach to dialogue, as promoted by the Trindi Project; extending this approach to cover new types of dialogue; and, in particular, applying the approach in the management of spoken dialogues. The latter will involve both exploiting the information state as a resource in speech recognition and synthesis, and incorporating new approaches to the robust processing of spoken dialogues.

Indeed, the most ambitious form of this goal would involve an attempt to enhance the state of the art in spoken dialogue systems with the fruits of the information state update paradigm. This presupposes an architecture that is capable of supporting the state of the art not only in the management of spoken dialogues but in the natural treatment of spoken language behaviours, such as maintaining realistic time performance, handling performance phenomena, backchannelling and interruptions. At this stage of development it will not be necessary to know how specific phenomena are to be treated, but the general architectural implications must be catered for. This will affect, in particular, process control, module dependencies and incrementality.

The constraints implied by the processing of spoken language will be formulated as assumed user requirements. Some implications arising from these constraints can be drawn from actual systems. However, since there are relatively few documented systems that attempt to meet all of these constraints, conclusions must be drawn judiciously.

In order to set an appropriate context for our own architecture we describe and discuss some other current major proposals concerning the architecture of spoken dialogue systems. The proposals we discuss include the Darpa Communicator HUB architecture for spoken dialogue systems, which is currently the focus of a major and very well funded development effort in the United States. We also discuss SRI’s Open Agent Architecture, which is not an architecture for spoken dialogue systems per se but which can be used as the backbone of such a system. It bears interesting comparison with the Darpa Hub architecture even though it predates it by several years. Finally we discuss the TrindiKit architecture, which results from the predecessor Trindi Project and promotes the Information State Update approach to dialogue.

In general, an architecture can be instantiated in different ways and one can gain a better understanding of the space of possibilities permitted by an architecture by considering very different instantiations. We therefore include a short examination of example systems for each of the architectures. It should be noted that the examples we pick are not chosen because they represent particularly good instances of spoken dialogue systems. They are chosen because they exploit particularly interesting features of their underlying architecture.

We begin by discussing the very notion of a dialogue system architecture and, in particular, the relation that it bears to the notion of dialogue management.
Chapter 2

Dialogue architectures and Dialogue management

A dialogue system architecture may be conceived of as a particular way of plugging together, and perhaps controlling, a number of dialogue system components such that the whole is capable of undertaking a dialogue with a user. A typical component list will include at least the following linguistically oriented components:

    speech recognizer
    parser
    semantic interpreter
    dialogue manager
    generator
    speech synthesizer

Systems will also include other components, for example, a database query component or perhaps a component capable of executing commands.

Each of the linguistically oriented components, with the possible exception of dialogue management, has a reasonably well agreed function within the language engineering community. Thus, speech recognizers decode speech signals into word strings, or n-best lists, or perhaps word-graphs; parsers generate syntactic structures (generally trees) from strings or word-graphs; semantic interpreters generate logical forms from syntactic structures; and so forth. Indeed, such an understanding of the function of these components leads very naturally to the classical “pipeline” architecture for joining them together. The output of the recognizer is simply piped into the input of the parser whose output is itself piped into a semantic interpreter. The output of semantic interpretation is piped into a dialogue manager which, typically, integrates the latest interpretation into a dialogue interpretation (a semi-persistent structure that is built up and updated as a dialogue progresses and is the result of possibly many passes through the pipeline) and then generates something, say a function call, that is meaningful to the back-end system.
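The input side of the classical pipeline described above can be pictured as plain function composition. The following is a minimal sketch only, not a description of any of the systems discussed: every component is a toy stand-in, and the names `recognize`, `parse`, `interpret`, `DialogueManager` and `find_journey` are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def recognize(audio: str) -> str:
    """Toy recognizer (assumption): the 'audio' is already text; normalise it."""
    return audio.lower().strip()


def parse(words: str) -> List[str]:
    """Toy parser (assumption): a flat token list stands in for a syntax tree."""
    return words.split()


def interpret(tree: List[str]) -> Dict[str, str]:
    """Toy semantic interpreter: fill slots from 'from X' and 'to Y' patterns."""
    logical_form: Dict[str, str] = {}
    for previous, word in zip(tree, tree[1:]):
        if previous == "from":
            logical_form["origin"] = word
        elif previous == "to":
            logical_form["destination"] = word
    return logical_form


@dataclass
class DialogueManager:
    """Holds the semi-persistent dialogue interpretation across pipeline passes."""
    interpretation: Dict[str, str] = field(default_factory=dict)

    def integrate(self, logical_form: Dict[str, str]) -> dict:
        # Merge the latest interpretation, then emit a (hypothetical) back-end call.
        self.interpretation.update(logical_form)
        return {"call": "find_journey", "args": dict(self.interpretation)}


def run_pipeline(audio: str, manager: DialogueManager) -> dict:
    # Each stage sees only the output of the stage immediately upstream.
    return manager.integrate(interpret(parse(recognize(audio))))
```

Calling `run_pipeline` twice on the same `DialogueManager`, first with "From Bideford to Bradford" and then with "to Bedford", shows the dialogue interpretation being built up over several passes: the second call keeps the origin and overwrites only the destination.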
System outputs may also be thought of as generated through another pipeline consisting of meaning generation by the back-end, utterance generation and then synthesis.

The precise functions of a dialogue manager component are not, however, universally agreed. Indeed ‘dialogue management’ is a term that can cover a great many different sorts of activity and phenomena. These activities can include, but are not limited to:

1. turn-taking management: who can speak next, when, even for how long
2. topic management: what can be spoken about next
3. utterance understanding: understanding the content of an utterance in the context of previous dialogue
4. ‘intentional’ understanding: understanding the point or aim behind an utterance in the context of previous dialogue
5. context maintenance: maintaining a linguistic and dialogue context
6. ‘intentional’ generation (or planning): generating a system objective given a current dialogue context
7. utterance generation: generating a suitable form to express an intention in the current dialogue context

Sometimes systems will contain a component called ‘dialogue manager’ which will execute some of these functions. Sometimes other components will execute or at least contribute to these functions. Furthermore, the contribution that the dialogue manager makes to those functions for which it does have responsibility has also to be considered in relation to the overall dialogue system architecture itself.

To take a very simple example, in the simplest sequential pipeline architecture, it is not generally possible for users to take a turn to say something except when the system permits it. The speech recognizer will not even be listening for user utterances until the current information flow down the pipeline begun by the previous user utterance has made its way to the end.
Nevertheless, dialogue managers may still be claimed to be responsible for turn taking in the following sense: they determine whether to solicit input from the user at any given point or whether they will keep the turn to themselves.

As another example, we can consider the interpretation of elliptical or fragmentary utterances. These often occur in dialogue as answers to questions. In very many systems, e.g. the Philips train-timetable system [4], ellipsis resolution is a task solely of the dialogue manager. At this point, certain sorts of information can be brought to bear on the resolution and, generally, certain other sorts cannot. If data is piped from parsing and semantic interpretation into dialogue management, then, generally, syntactic or semantic structure will no longer be available. In the SRI system described in [12], the resolution or filling in of ellipsis occurs within SRI’s generic Core Language Engine processor (CLE) ([2]) and not within Dialogue Management. In this way, syntactic and semantic structures which are not generally available to dialogue management can be used to help resolve ellipsis. Nevertheless, it is still required that the language processor have access to some dialogue information, namely the dialogue (game or task) structure over utterances. In this way, proximity to the latest utterance (in task structure, not necessarily historical sequence) can be used to score different candidate resolutions of the ellipsis. So information flow from the dialogue manager to the linguistic processor is required. [12] claim that ellipsis is a linguistic phenomenon and should be handled in the linguistic processor. System reconfiguration becomes easier in that no additional work is required (in respect of ellipsis) to port the system to a new domain.
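To make the idea of scoring by proximity in task structure concrete, here is a small sketch. It is not the method of [12]; it only assumes that the task structure is a tree of dialogue segments and that closeness is measured as the number of edges between segments. The segment names and the functions `tree_distance` and `resolve` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical task structure: each dialogue segment names its parent segment.
PARENT: Dict[str, Optional[str]] = {
    "book_trip": None,
    "outward_leg": "book_trip",
    "return_leg": "book_trip",
    "ask_outward_time": "outward_leg",
    "ask_return_time": "return_leg",
}


def path_to_root(segment: str) -> List[str]:
    """List the segment and all of its ancestors, nearest first."""
    path = [segment]
    parent = PARENT[segment]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges between two segments in the task structure."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(node for node in path_a if node in path_b)  # lowest common ancestor
    return path_a.index(common) + path_b.index(common)


def resolve(latest_segment: str, candidates: List[str]) -> str:
    """Prefer the candidate whose segment is closest to the latest utterance."""
    return min(candidates, key=lambda c: tree_distance(latest_segment, c))
```

With the latest utterance in `ask_return_time`, a candidate in `return_leg` (one edge away) is preferred over one in `ask_outward_time` (four edges away), regardless of which was uttered more recently.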
The general point is simply that the role of an individual component can only properly be considered in relation to that of the other components and the dataflows between them. The two examples above illustrate three important general issues that arise in considering the merits of a pipeline architecture (or indeed any other): context sensitivity, early decision making and the possibility of asynchronous interaction.

2.1 Context Sensitivity

Context sensitivity arises in general because if the pipeline is such that each component can only use the information sent to it by an upstream component and possibly some internal state of its own, then recognizers and parsers, for example, simply cannot be made sensitive to context and state uncovered by the dialogue manager. The picture resembles the game of Chinese Whispers in which one whispers something to the first child in a line of children who then repeats it, as best he can, in a whisper to the next child. The message is repeated, or mis-repeated down the line. At the end of the line, one generally finds the message has been transformed out of all recognition. Unless a child mis-pronounces the message whilst relaying it, the final version of the message will be determined simply by the initial message and how it was processed by each child using only the input message and their own internal state as information to go on.

In order to build in context sensitivity, one must alter the simple version of the pipeline. One very common desire is to make speech recognition dependent upon dialogue state. If one can be sure enough about which dialogue state one is in, for example, if one can be sure one has just asked a yes-no question, then it makes sense to use this information to constrain the recognizer to be especially on the look-out for ‘yes’ and ‘no’ and all other indirect ways of saying ‘yes’ and ‘no’.
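As a rough illustration of recognition that depends on dialogue state, the sketch below boosts the scores of words that the current state makes likely. It is a toy under stated assumptions: the scores, the state name `asked_yes_no`, the boost factors and the function `recognize` are all invented for the example and do not correspond to any real recognizer.

```python
from typing import Dict, Optional

# Hypothetical recognizer scores for one utterance: higher is better.
# "know" narrowly beats "no" when no context is taken into account.
ACOUSTIC_SCORES: Dict[str, float] = {"know": 0.41, "no": 0.39, "now": 0.20}

# Hypothetical mapping from dialogue state to the words that state makes likely.
EXPECTATIONS: Dict[str, Dict[str, float]] = {
    "asked_yes_no": {"yes": 2.0, "no": 2.0, "sure": 1.5, "nope": 1.5},
}


def recognize(scores: Dict[str, float], dialogue_state: Optional[str] = None) -> str:
    """Return the best hypothesis after boosting words the state expects."""
    boosts = EXPECTATIONS.get(dialogue_state, {}) if dialogue_state else {}
    return max(scores, key=lambda word: scores[word] * boosts.get(word, 1.0))
```

Without a dialogue state the toy recognizer returns "know"; once it is told that a yes-no question has just been asked, the boosted "no" wins instead.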
In Chinese Whispers, the equivalent notion would be having the last child in the line prime the first child in the line as to what to expect to hear. In Chinese Whispers, this strategy will probably not help at all unless the last child has good reason for his expectations. Perhaps the game leader is his father and the child knows his father’s favourite epigram.

In any case, there appear to be two main strategies available for building in such context sensitivity. First, the downstream component can directly update the upstream component by making a call to it or sending it a message. Secondly, the downstream component can update a piece of global state which the upstream component consults before it undertakes its processing. The component may consult it either by reading it itself off the global state or by receiving it as an argument or parameter in whatever call causes the component to act.

The two strategies have quite different implications. In particular, in the first strategy, components must now be viewed as objects providing several services. For example, the Nuance speech recognizer [16] is a component that offers both an audio-to-string conversion service and a set-my-grammar-expectations service. Such services could be called upon by any other component in the system. This encourages a distributed object view of the architecture. On the second strategy, the recognizer only provides the first service, albeit one which is sensitive to a piece of global state which must be set by the other components. This encourages a blackboard view of the architecture.

2.2 Early decision making

The issue of early decision making arises in a simple pipeline architecture again because downstream components may have access to information which could be usefully exploited by upstream components. For example, a statistical n-gram recognizer only has limited grammatical knowledge to deploy in order to decide on the best output string for a given audio input.
The downstream parsing component will possess much more detailed grammatical knowledge but this information is not available to the upstream component. It may not be possible to package this information (e.g. into a dialogue state that one can associate with a grammar or language model) such that the upstream component can use it.

There again appear to be two main ways to avoid the problem of early decision making. First, one can simply not make (all of) the decisions early. Secondly, one can alter the pipeline architecture.

The Spoken Language Translator system [6] is a good example of the first method. The output from the statistical recognizer is a list of the top 5 hypotheses rather than just the top 1. Each of the hypotheses is parsed by the downstream component (technically, a lattice constructed over the hypotheses is parsed, to save duplicating effort over common parts of different hypotheses) which, by invoking the extra information available to it, can decide which string hypothesis was actually the best. In fact, the policy is propagated throughout the system so that an evaluation of the possible output translations also helps determine the system’s final belief about the best speech hypothesis. The success of this policy naturally depends on whether the top 5 hypotheses are indeed likely to contain the correct hypothesis even when the top hypothesis is not itself correct. In general, the policy requires upstream components to be able to generate some set of results (a proper subset of all possible results) likely to contain a good hypothesis in the absence of information available downstream.

In the second method, one passes information derived by a downstream component back to the upstream component in order to generate a new and improved hypothesis by that component. A very simple example of this is provided by Mitre’s version of CommandTalk ([1], discussed further below).
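The first method above, delaying the decision by handing an n-best list to the downstream component, can be sketched as follows. This is an illustration only and not the Spoken Language Translator's implementation: the hypothesis list, its scores and the toy acceptance test `parses` are assumptions, and whole strings are checked one by one rather than via a lattice.

```python
from typing import List, Optional, Tuple

# Hypothetical 5-best list from a statistical recognizer: (word string, score).
N_BEST: List[Tuple[str, float]] = [
    ("show me flights two boston", 0.30),
    ("show me flights to boston", 0.28),
    ("show me flight to boston", 0.20),
    ("so me flights to boston", 0.12),
    ("show me flights to austin", 0.10),
]


def parses(words: str) -> bool:
    """Toy stand-in for the parser's richer grammatical knowledge (assumption)."""
    tokens = words.split()
    return len(tokens) == 5 and tokens[:4] == ["show", "me", "flights", "to"]


def choose_hypothesis(n_best: List[Tuple[str, float]]) -> Optional[str]:
    """Pick the highest-scoring hypothesis that the downstream parser accepts."""
    for words, _score in sorted(n_best, key=lambda h: h[1], reverse=True):
        if parses(words):
            return words
    return None  # no acceptable string was in the list at all
```

Here the recognizer's top hypothesis is rejected by the toy parser and the second hypothesis is chosen. If the list is cut down to its first entry (1-best operation), `choose_hypothesis` returns `None`, which is the early decision problem in miniature.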
In the Mitre system, the recognizer produces a top hypothesis only, but if the downstream language understanding component cannot make sense of it, recognition is re-invoked in order to produce the next best hypothesis. This scheme may possibly have a speed advantage over generating an n-best hypothesis list straightaway if the top hypothesis is generally correct. [6] report, for example, that 5-best operation of SRI’s Decipher recognizer was a little more than twice as slow as 1-best operation. It is also arguably more flexible in that one does not have to fix in advance the value of N in N-best. On the other hand, it may turn out to be considerably slower in practice and, in any case, does not permit the comparison of different hypotheses. The first processable hypothesis one obtains is the only processable hypothesis one obtains.

In any case, a far more interesting policy would be one which involved sending more information back from downstream processing than just a simple reject message. For instance, in N-best processing on a statistical recognizer it can easily happen that there are two (or more) particularly significant points at which the recognizer cannot easily distinguish the competing hypotheses. The top 4 hypotheses in a 5-best list may then just contain different ways of resolving one of these points whilst keeping the other point fixed on the locally preferred but incorrect solution. Ideally, one would like downstream processing to recognize the partial correctness of the 5th solution and ask upstream processing to deliver another 5-best list all of which extend the known partially correct solution. (Again, it may not be practically feasible to configure the upstream processor appropriately in advance.)

2.3 Asynchronous interaction

A pipeline architecture does not permit asynchronous processing in the sense that an upstream component must complete its operation before its immediate downstream neighbour can start.
Consequently, any controlling module must call each component in turn and wait for it to return before calling the next. This does not rule out all parallelism, of course: one can exploit temporal parallelism by permitting an upstream component to begin processing its next input while downstream components are still processing earlier inputs. In fact, academic literature on Computer Architectures reserves the word ‘pipelining’ for just this sort of parallelism. However, such parallelism appears only to have a limited use in the standard pipeline of spoken dialogue systems. The problem is simply that by permitting speech recognition to occur before the dialogue manager has finished processing the previous utterance, one creates an apparent opportunity to add to or alter what has been said, an opportunity which is not real. It is exactly like conducting a conversation over a very slow communications link. One can say something to one’s partner and, during the time that it takes to transmit the message, one can attempt to add to or alter one’s first expression. However, while the additional message is being transmitted, the partner constructs a response to the original utterance and he receives the addition only after he has issued his response. The result is inevitably confusion. The confusion is particularly serious if the partner is liable to interpret the new message in the light of making his own utterance even though that utterance played no part in generating that new message. In such circumstances, people soon learn to observe strict turn-taking.

Not every instance of ‘pipelining’ parallelism need be confusing, of course. In SmartSpeak ([11]), the system permits a user to begin outlining his requirements for the return leg of a journey while the back-end system is consulting an online database for an itinerary for the outward leg.
This sort of parallelism can be successful precisely because both partners are aware that the new dialogue will not affect the results to come of the previous dialogue.

Evidently, asynchronous processing is very much a feature of human-human dialogues, at least in circumstances where the communications link is fast enough. People interrupt each other. That is, they process part of what is being said and understand enough to make them interrupt so that they may correct it or stop any more being said. People also backchannel. For example, they confirm or assent to early parts of utterances while later parts of the same utterance are still being generated.

A relatively recent addition to speech recognition software is the possibility of ‘barge-in’ in which, if the system is saying something, the user may barge in and say something themselves. In such systems, the system stops saying what it was saying and starts listening instead. This can be useful if the user indeed wishes to interrupt but may lead to undesired behaviour if the user in fact was ‘back-channelling’. Furthermore, detecting an interruption is only useful if one can successfully interpret the interruption. Unfortunately, the meaning of an interruption often depends on the material immediately preceding the interruption.

    S: Depart Bideford at 5 pm for Bradford // via the
    U: No Bedford

Unless one knows that ‘No Bedford’ immediately follows ‘for Bradford’, one cannot interpret the interruption correctly. Current implementations of barge-in software do not register how much of an utterance has been synthesized so far. Interruptions have to be interpreted either in the context that preceded generation of the interrupted utterance or in the context that would have resulted had it not been interrupted.
Chapter 3

The DARPA Communicator Architecture

The DARPA Communicator Architecture is primarily designed to provide a common framework for dialogue systems that promotes interoperability of components by permitting a plug-and-play approach. It is hoped that this will support rapid and cost-effective development of spoken dialogue systems (indeed, multi-modal systems that include speech). Multiple developers ought to be able to combine different commercial and research components so long as they are architecture compliant. Furthermore, collaboration between researchers will be fostered and a more structured approach to the testing of both whole systems and their components will be enabled.

The current version of the Architecture takes the form of a Hub-and-Spoke system and the architecture is often referred to simply as the Darpa Hub and components are described as being Hub-compliant. The architecture consists of a central Hub process which connects with any number of Servers. A typical instance of the Darpa Hub architecture is depicted in Figure 3.1. Typically, each server will host one dialogue system component as described above: speech recognition, natural language parsing, dialogue management and so forth.

Figure 3.1: A typical Darpa Hub Configuration

The Hub itself has three major functions:

1. Routing. The Hub is responsible for correctly directing messages from one component to another
2. State Maintenance. The Hub can store global state information and make it accessible to all servers
3. Flow Control. The Hub can also direct the flow of processing control, deciding which servers should be called upon next

An instance of the Darpa Hub system is declared in a script file which is read by the Hub. The script file declares all the information that the Hub needs to know about the servers: for example, for each server, its name, its address and the operations that the server provides. The script file can also contain a program for directing flow of control.

When the Hub is started, it attempts to make connections with all the declared servers, processes an initial “token”, and then begins monitoring for incoming messages. Tokens are frames with a name, an index and a list of key-value pairs. The Hub tries to associate each incoming message with a token. If the message contains a token index, then the message is associated with the existing token with that index and is taken to be a response to a previous message that the Hub sent to that server. Otherwise the message corresponds to a new token and the token is initialized with any keys and values specified in the message. New tokens are then processed by executing a program (with the same name as the frame) that contains a set of rules determining what to do next. Each rule consists of a pre-condition (on keys and values, for example) and an action to undertake. Old tokens are processed by resuming execution of the instance of the rule already in existence for the token. Tokens are destroyed when their programs terminate.

The Darpa Hub can also operate in “scriptless” mode, that is, where no programs are specified for the Hub to find for new incoming messages. In this case, the Hub does not direct the flow of processing control. Each message is then expected to specify an operation which is provided by (at least one of) the servers. The function of the Hub then becomes simply to select a server which supports the operation and dispatch it correctly.
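The token and rule machinery described above can be approximated in a few lines. The sketch below is not the Darpa Hub implementation or its API: the classes `Token`, `Rule` and `Hub` are invented for illustration, servers are reduced to in-process functions, a pre-condition is simplified to "this key is present", and each rule is assumed to fire at most once per token.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An operation maps input key-value pairs to output key-value pairs.
Operation = Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Token:
    """A frame: a name, an index and a set of key-value pairs."""
    name: str
    index: int
    pairs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Rule:
    """A pre-condition (simplified to a required key) plus an operation to call."""
    condition_key: str
    operation: str
    in_keys: List[str]
    out_keys: List[str]


class Hub:
    def __init__(self) -> None:
        self.operations: Dict[str, Operation] = {}  # operation name -> provider
        self.program: List[Rule] = []

    def register(self, operation: str, provider: Operation) -> None:
        self.operations[operation] = provider

    def dispatch(self, operation: str, args: Dict[str, object]) -> Dict[str, object]:
        """Scriptless mode: route a message to a provider of the named operation."""
        return self.operations[operation](args)

    def run(self, token: Token) -> Token:
        """Scripted mode: fire each rule once, as soon as its key is present."""
        fired = set()
        progress = True
        while progress:
            progress = False
            for number, rule in enumerate(self.program):
                if number not in fired and rule.condition_key in token.pairs:
                    args = {key: token.pairs[key] for key in rule.in_keys}
                    result = self.dispatch(rule.operation, args)
                    token.pairs.update({key: result[key] for key in rule.out_keys})
                    fired.add(number)
                    progress = True
        return token
```

In this sketch flow of control lives entirely in `Hub.program`: the order in which operations are called depends on which keys the token currently holds, not on the order in which rules or servers were declared. Calling `dispatch` directly corresponds to the scriptless mode, where the message itself names the operation.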
3.1 Instance 1: The Mitre ‘CommandTalk’ system

3.1.1 Outline

The Mitre Corporation has built a conversational system for use in a distributed wargame simulation, based heavily on SRI’s original CommandTalk system [18], but particularly configured for the Darpa Hub Architecture. The system consists of a number of servers including: speech recognition, language understanding, a pragmatics server (for generating back-end simulation commands), a back-end interface to the wargame simulator itself and, apparently, a dialogue manager although it is unclear quite what its function is.

Figure 3.2 is an example of a Hub script server declaration:

    SERVER: recognizer
    HOST: rec-host
    PORT: 15003
    OPERATIONS: reinitialize RecognizeAudio UpdateGrammar ChangeRegion
    INIT: :grammar "stricom"

Figure 3.2: Hub script: server declarations

The operations declared by the recognizer include RecognizeAudio, an operation which expects a token containing binary data and which generates a value containing the most likely string of words. Another operation is ChangeRegion which determines which part of the recognizer grammar is active. This expects a token containing the name of a grammar region.

Figure 3.3 shows example flow of control rules:

    RULE: :sr-hypothesis Do-NL-understanding
    IN: :sr-hypothesis
    OUT: :nl-output :synthesis-test

    RULE: nl-output Do-context-tracking
    IN: :nl-output :sr-hypothesis
    OUT: ct-output

Figure 3.3: Hub script: program declaration

The first rule specifies that if the current token contains a speech recognition hypothesis, then the operation Do-NL-understanding should be invoked with an input parameter of sr-hypothesis and two output parameters: nl-output and synthesis-test. Do-NL-understanding is an operation supplied by the natural language processing server.

The flow of control in the Mitre CommandTalk system is determined by a Hub script. There is one principal token in the system containing keys for audio data, a speech recognition hypothesis (:sr-hypothesis), a logical form (:nl-output), a resolved logical form (a logical form with pronominal and other referential expressions resolved), a set of back-end simulation commands, a back-end response, an output logical form, an output text string, and, finally, output audio data. Typically, the Hub will give a token to start capture of input binary audio data, and, if data is indeed captured, the value of the appropriate key in the token will be filled in. The Hub will then determine that speech recognition should be called next with the value of this key. If a hypothesis results from speech recognition, then :sr-hypothesis is filled in and the Hub then determines that natural language understanding should be called next. At each point, the Hub examines the state of its token in order to decide what service to call next.

3.1.2 Discussion

One of the advantages claimed by [1] for the Hub architecture is that it facilitates different execution flows than the standard ‘pipeline’ picture. One might expect, then, to see context-sensitivity of the sort outlined earlier highlighted. Somewhat surprisingly this does not appear to be the case.

The first example given of a non-pipeline information flow is this: if the pragmatics server cannot fill in any back-end simulation commands from the input logical form, then it can simply write an appropriate text message into the output text string key in the token. In the standard pipeline, this key would not receive a value until after the back-end server (which executes commands), the pragmatics output server (which generates logical forms) and the natural language generator (which generates output text strings) had all been called.
There is a simple rule in the Hub script which determines that the synthesizer should be called once the output text string key has been filled in. This rule is quite insensitive to which processes have run, what other pieces of the token have or have not been filled or indeed who filled in the output text string key. Clearly, this non-standard flow is essentially one of ‘skipping’ parts of the pipeline although the flow of information is still in the same direction. One can of course replicate the effect in a standard pipeline by arranging for a special message to be passed downstream from the pragmatics server to the synthesizer. Intervening servers must know not to act on the message but simply to relay it. Clearly, the Hub architecture avoids having both to set up such a special message and to configure the intervening servers.

The second example is that the language understanding server can cause the speech recognizer to be reinvoked by generating a new token. The speech recognizer will then issue its next best hypothesis and this cycle can continue until the recognizer issues a hypothesis that can be understood or until the recognizer has no more hypotheses (as discussed above). Although information is passed back in this example, that information is minimal.

Besides, a much more significant point in both these examples appears to concern who is deciding on the action. In the first example, the pragmatics server is effectively deciding on the next action, what should be said and, indeed, how to say it. In the Mitre CommandTalk system, this consists simply in issuing a standard error message, but one can easily imagine circumstances in which a more flexible response is desired, one which depends in part on the current state of the dialogue.
In this case, one would want the pragmatics server to send a failure message (and possibly a reason) to a dialogue manager with access to dialogue state information. Could this manager be the Hub itself? It appears this cannot be so since all a Hub script can do is test certain conditions on tokens and then call another server. Consequently, for more flexible behaviour, one is almost bound to set up a dialogue management server and have the Hub simply direct the failure message to it. Indeed, as a general rule, it is not difficult to imagine that one will nearly always want to achieve flexibility by this means. The issue then becomes: what is the value of Hub scripting if this is so?

Similar considerations apply also in the second example of re-invoking the recognizer. The issue is again whether the language understanding server should be deciding that re-recognition is required. If it doesn’t decide but the dialogue manager does, then what really is the role of the Hub script in flow of control?

From the published description of Mitre’s CommandTalk, there do appear to be two more interesting cases of information flow. First, since the simulation itself can generate new objects to be talked about, the simulator needs to inform the module in charge of reference resolution (e.g. of pronouns) of their presence. This can happen at any time. In fact, this is not so much a case of information flow being up or down a pipeline, as the result of action by an autonomous agent entirely outside the realm of linguistic processing. The Hub can of course be used to ensure that a message from the simulator about new objects always gets routed to the reference resolver. The point is that this will not involve the flow-of-control property of Hub scripting.
The second example is that an agent called Dialogue Management (it is far from clear what its other roles are) can send a message to change the recognizer grammar, presumably based on a belief about what the current dialogue state is. It may be that this example is not highlighted simply because this feature remains unchanged from the original CommandTalk system developed by SRI. Also, it again requires only a simple control message to be sent from one server to another and the Hub is required to function just as a router.

3.2 Instance 2: The CU/CMU Communicator System

3.2.1 Outline

The Colorado University system [19] is another Hub compliant dialogue system. The system contains seven servers: an audio server, speech recognition, language processing, dialogue management, database interface, synthesis and one other peripheral (for our purposes) server. The particular emphasis in this system is on robust processing. This is to be achieved by robust parsing strategies which emphasise semantics over syntax and by an event-driven dialogue manager. (Indeed, apart from the parser and dialogue manager, the system is essentially the CMU Communicator system; hence we have entitled it the CU/CMU system.) The dialogue manager itself has a simple structure strongly reminiscent of that employed in the Philips train timetable system ([4]). Any parse (a set of filled slots) that it receives is first integrated into the dialogue context maintained by the dialogue manager and then, in order, the dialogue manager will

1. clarify the interpretation if necessary
2. finish if done
3. submit a database query (if sufficient information is present) and give the user the first answer
4. prompt the user for the next unfilled slot or highest priority information

3.2.2 Discussion

There is a Hub script in the Colorado system but it is used simply for message routing and not complex flow of control decisions.
The Hub script ensures that audio input is passed to recognition, that the results of recognition are passed to language understanding, that the results of language understanding are passed to dialogue management and so on. That is, although the Colorado system uses the Darpa Hub, it appears to be used simply to implement the standard pipeline dataflow. Consequently, questions arise as to how, if at all, issues of context-sensitivity and early decision making are tackled.

One clear instance of early decision making in the Colorado system is that the recognizer currently delivers its best string hypothesis to the language processor for analysis. Unsurprisingly, given the general architecture, the designers intend to make the interface between recognition and parsing contain a word lattice rather than a string.

As far as context sensitivity is concerned, it is very unclear how even simple short answers to questions can be handled in the Colorado system. (Of course, by clever design of the set of prompts, one might avoid the problem.) The language processor maps strings onto frames consisting of a set of slots which the system aims to fill. In the travel domain, these slots include departure-location, arrival-location and so forth. Each slot has a context free grammar associated with it which defines the word strings that can match it. It is unclear whether a short answer such as ‘Boston’ must be mapped into one of departure-location and arrival-location. Given the standard pipeline, it is unclear how the language processor could use information about the last question to prefer one or the other. Equally, the dialogue manager appears not to maintain any history or state other than its beliefs about the current values of slots. In this case, the decision about the interpretation of a short answer cannot even be delayed until dialogue management.
Indeed the absence of state in the dialogue manager is claimed to make the system more robust since there is nothing to lose track of.

3.3 Conclusion

The Darpa Communicator Architecture is a Hub and Spoke architecture designed to support ‘plug and play’ for different linguistic components. The various components sit on different servers. The services they can provide are declared to the Hub. One can plug different components into the Darpa Hub and play with the resulting system just so long as they provide the same service. Nothing is stipulated about the internal structure of these services or the platform on which they are provided.

The Hub is primarily a data router. It ensures that messages from one component to another are correctly transmitted. The Hub can also store global information common to all servers. The Hub can also direct the flow of control amongst the components based on the contents of a token which persists over a period of time and which messages from servers can refer to. Although the Hub can direct control in this way, many systems in fact only use it to implement the standard data pipeline. Furthermore, for the most flexible dialogue control, it appears best to vest control in a dedicated dialogue management server rather than in the Hub itself.

Chapter 4 The Open Agent Architecture

The Open Agent Architecture (OAA) is not designed especially for dialogue systems. It provides a general framework for building distributed software systems. OAA predates the Darpa Communicator Architecture by several years but there are several clear resemblances between them. First, OAA has a Hub and Spoke architecture. The OAA Hub is called the Facilitator. The spokes are client agents which register the services they can perform with the Facilitator.
In a similar fashion to CORBA [8] and DCOM [14], agents build in notions of encapsulation whilst a system of agents can be spread across multiple hardware and software platforms. In general, the client agents are to be thought of as a community of agents cooperating through the Facilitator. If one agent needs a service, it sends a request to the Facilitator which checks its registry of services and forwards the request to an appropriate agent. The Facilitator also receives replies from that agent and sends them back to the original requesting agent. That is, an agent does not need to know the identity of another agent who can carry out the service. Indeed agents can submit complex goals to the Facilitator (e.g. Request-1 AND Request-2) and the Facilitator can delegate different parts to different agents in parallel. The submitting agent need know nothing about how the Facilitator delegates the task. Facilitators can also store global information for all agents and thereby permit a blackboard style of interaction. Agents can make asynchronous requests.

4.1 Instance: SRI’s CommandTalk

4.1.1 Outline

CommandTalk [18] is a spoken language interface to a battlefield simulator which incorporates a number of agents including: speech recognition, natural language parsing, prosody, synthesis, dialogue management, and the simulator itself. Speech recognition is undertaken by the Nuance recognition engine [16] enclosed in a ‘thin’ OAA wrapper which declares two services to the Facilitator: speech recognition and change of grammar. Natural language parsing and interpretation is undertaken by the unification based Gemini parsing system. The dialogue manager is responsible for managing the linguistic dialogue context, interpreting user utterances within that context, planning the next move and also for changing the grammar within the recognizer.
For any new input, it first considers whether it is a correction of a prior utterance, then it considers whether it continues the current discourse segment, and finally it considers whether it initiates a new segment. A segment represents a local interaction with the user which is defined by a simple finite state machine. Most machines have only two states corresponding to the user saying something and the system responding. Longer machines encoding form-filling dialogues are also possible. The current state of a segment is represented by a frame containing: a dialogue state identifier; semantic representations of the user’s and system’s utterances; the background ‘open proposition’ for the user response; focus spaces of objects and gestures for anaphoric resolution; and a guard. A stack of frames together with a history trail of all operations on the stack represents the state of the dialogue as a whole.

If a new user utterance is deemed to continue the current segment (the one on top of the stack), e.g. the user answers a system question, then the system calculates the next state in the appropriate finite state machine and pops the current segment. If the next state is not final, a new segment containing the new state identifier is pushed back onto the stack. If the new user utterance begins a new segment, then the system first considers whether the current segment is actually finished or not. If the current frame contains an open proposition, then it is deemed not to be finished and the new frame is just pushed onto the stack. Otherwise the current frame is popped before the new one is pushed. Once a frame is popped, the information that was in it is no longer available for interpretation. The history trail can be used to recover in situations where the system has considered an interaction closed only for the user to issue a subsequent correction.

4.1.2 Discussion

CommandTalk agents are connected through the Open Agent Architecture.
It appears, although the matter is not explicitly referred to in any of the literature, that the role of the OAA Facilitator in the CommandTalk system is simply one of matching up service requests with service providers. Blackboard facilities do not appear to be used and certainly no control-flow is vested in the Facilitator (as it is in Darpa Hub scripting). Consequently the flow of control is indeed distributed throughout the agents in the system. The order in which agents are called will depend on information that is spread throughout the community of agents. One possible set-up would be for the speech recognition agent to finish its processing by sending a request for language understanding on its output (but not to wait for a reply). The language understanding agent could similarly call dialogue management after it had finished processing. Code within the dialogue manager might invoke several services in different agents (e.g. consulting the simulator database, updating the recognition grammar) before calling generation and synthesis.

CommandTalk appears to include some parallel processing. At any time, the simulator can notify the dialogue manager of the appearance of a new object that can be referred to (as also discussed above, under Mitre’s version of CommandTalk, section 3.1). This update can occur at any time, so long as the Dialogue Manager is not actually busy doing something else (in which case, messages are presumably queued).

The main source of context sensitivity in a CommandTalk component other than dialogue management is in speech recognition. The Dialogue Manager may send messages to the recognizer reconfiguring it dynamically according to what it believes the current state is. Again, these messages may be sent at any time so long as the recognizer is not actually busy at the time. It appears that natural language parsing and interpretation is not context sensitive.
Certainly, anaphoric and ellipsis resolution are all delayed until dialogue management. It is unclear whether any asynchronous calls are made in the CommandTalk system at all.

4.2 Conclusion

OAA offers a very flexible environment for constructing a dialogue system. In many ways, it appears that the scriptless version of the Darpa Communicator Architecture, which is a very recent development, is an attempt to offer the same sort of architecture. The price to be paid for the flexibility is of course the amount of attention one has to give in ensuring that the community of agents one defines really does collaborate in order to compute the desired result. For instance, the possibility of updating the dialogue manager at any time with the identities of new objects is very attractive but one must ensure that the dialogue manager is not permanently busy on other tasks in order for this to succeed. The built-in support for asynchronous processing is also very attractive but the price again is simply the sheer complexity of programming such systems.

Chapter 5 The TrindiKit Architecture

The TrindiKit is a toolkit resulting from the Trindi project for building and experimenting with dialogue move engines and information states. Utterances in dialogue are understood to be moves which transform information states. The notion of information state is quite general. In the simplest form, a state might be just the name of a node in a transition network and the moves would then be the labels on the arcs connecting them. However, states can be much more complex and in the TrindiKit one can use different sorts of data structure (e.g. typed feature structures, Discourse Representation Structures, record structures) to represent them. Part of the discipline involved in using the TrindiKit is simply declaring formally all the aspects of Information State that one will need to access during a dialogue state.
It is not just the states themselves that are more complex. Transitioning from state to state need not be a case of actually traversing a declared network structure. Rather, a set of update rules must be provided. Each rule consists of a pre-condition and an action. The pre-condition is a set of conditions on the Information State and the action is a set of operations to carry out on the Information State. Again, it is part of TrindiKit discipline that the conditions and operations are not just arbitrary but must be supported for the datatypes used in the Information State. As a simple example, if one declares a feature of Information State called questions under discussion to be a stack, then one cannot invoke a membership operation on it in the action of an update rule. Of course, one could define a set of update rules which used pop, and identity to implement membership but these rules would certainly not correspond to dialogue moves in any intuitive way. It is also possible to define external resources which can supply additional operations for updating Information States.

There is no requirement that only one rule ever be applicable in any state; consequently one also needs a strategy for choosing between them or for applying multiple rules. In fact, TrindiKit permits one to define an update algorithm in a specially constructed language, the DME Algorithm Definition Language (DME-ADL). The control structures of the language include a variety of constructs (including if-then, while, try etc.) over primitives which are rule-names or rule-class-names. The interpretation of a rule-class-name is that the first applicable rule of that class should be executed. TrindiKit comes with a default set-up in which all rules are in one class (universal) and the Update Algorithm is the one-line program containing the rule-class-name universal.
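The rule format and the reading of a rule-class-name can be made concrete with a short sketch. This is written in Python rather than in the TrindiKit's own notation or DME-ADL, and every name in it (`Stack`, `UpdateRule`, `apply_rule_class`, the two example rules and the layout of the state) is a hypothetical illustration, not part of the toolkit. It assumes an Information State with a stack of questions under discussion, a latest move and a list of shared facts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class Stack:
    """A stack datatype: only push, pop, top and is_empty are offered,
    so a rule cannot ask whether an item is a member of the stack."""

    def __init__(self) -> None:
        self._items: List[object] = []

    def push(self, item: object) -> None:
        self._items.append(item)

    def pop(self) -> object:
        return self._items.pop()

    def top(self) -> object:
        return self._items[-1]

    def is_empty(self) -> bool:
        return not self._items


State = Dict[str, object]


@dataclass
class UpdateRule:
    name: str
    rule_class: str
    precondition: Callable[[State], bool]  # conditions on the Information State
    effect: Callable[[State], None]        # operations on the Information State


def apply_rule_class(rules: List[UpdateRule], rule_class: str,
                     state: State) -> Optional[str]:
    """Execute the first applicable rule of the class; return its name."""
    for rule in rules:
        if rule.rule_class == rule_class and rule.precondition(state):
            rule.effect(state)
            return rule.name
    return None


def _integrate_answer(state: State) -> None:
    question = state["qud"].pop()
    state["shared"].append((question, state["latest_move"][1]))


RULES = [
    UpdateRule("integrate_ask", "universal",
               lambda s: s["latest_move"][0] == "ask",
               lambda s: s["qud"].push(s["latest_move"][1])),
    UpdateRule("integrate_answer", "universal",
               lambda s: s["latest_move"][0] == "answer"
               and not s["qud"].is_empty(),
               _integrate_answer),
]


def update_algorithm(state: State) -> Optional[str]:
    # Counterpart of the default one-line program: the class name "universal".
    return apply_rule_class(RULES, "universal", state)


state: State = {"qud": Stack(), "latest_move": ("ask", "destination?"),
                "shared": []}
fired_first = update_algorithm(state)
state["latest_move"] = ("answer", "Boston")
fired_second = update_algorithm(state)
state["latest_move"] = ("greet", None)
fired_third = update_algorithm(state)
```

Because the order of `RULES` decides which applicable rule fires, the rule ordering is itself part of the strategy in this sketch; a richer update algorithm would name several classes and combine them with conditionals and loops.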
A dialogue move engine can also have several different update algorithms though only the possibility of two is ever discussed. One algorithm is for interpretation, somewhat confusingly sometimes also called update; the other is for move selection. The intention is that one set of rules interprets the last move made and another chooses the next move to make. Although the same update rule formalism is to be used for both types (or all types, if more types are required), different algorithms can be applied to them.

The TrindiKit architecture, which encompasses a complete dialogue system and not just the Dialogue Move Engine, is shown in figure 5.1. The DME heart of the system is contained within two boxes: ‘Dialogue Move Engine’ and the largest box labelled ‘TIS’ which represents the Information State. ‘TIS’ stands for Total Information State, and is the sum of the Information State (as discussed above), links to any external resources and links between the Dialogue Move Engine and other TrindiKit components. The other components shown in the diagram are input, interpretation, generation, output and control. In a spoken language dialogue system, input and output might naturally correspond to recognition and synthesis.

The TrindiKit comes with a default control algorithm which, unsurprisingly, implements a repeating loop over the sequence: input, interpret, update, select, generate, output. In fact, if the system ought to make the first move, then one needs to supply a different control algorithm (namely, one which starts with the generation half of the cycle). However, in general, one can write one’s own control algorithm and it is a design feature that one can include test conditions against the Information State, since these are exported to the control module. The current version of the TrindiKit does not include asynchronous control possibilities.
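The default control loop, and the variant needed when the system speaks first, might be sketched as follows. This is a toy illustration and not TrindiKit code: the module bodies, the `running` flag used as the test condition on the Information State, the key names and the canned utterances are all assumptions. Only the ordering of the six module names follows the text.

```python
from typing import Callable, Dict, List

TIS = Dict[str, object]

# Default ordering, and a rotation starting with the generation half.
USER_FIRST = ["input", "interpret", "update", "select", "generate", "output"]
SYSTEM_FIRST = ["select", "generate", "output", "input", "interpret", "update"]


def make_modules(user_turns: List[str]) -> Dict[str, Callable[[TIS], None]]:
    pending = list(user_turns)  # scripted user input, standing in for a recognizer

    def input_(tis: TIS) -> None:
        tis["input"] = pending.pop(0) if pending else "bye"

    def interpret(tis: TIS) -> None:
        tis["latest_move"] = "quit" if tis["input"] == "bye" else "say"

    def update(tis: TIS) -> None:
        tis["history"].append(tis["latest_move"])
        if tis["latest_move"] == "quit":
            tis["running"] = False

    def select(tis: TIS) -> None:
        tis["next_move"] = "acknowledge" if tis["running"] else "farewell"

    def generate(tis: TIS) -> None:
        texts = {"acknowledge": "OK.", "farewell": "Goodbye."}
        tis["output"] = texts[tis["next_move"]]

    def output(tis: TIS) -> None:
        tis["transcript"].append(tis["output"])

    return {"input": input_, "interpret": interpret, "update": update,
            "select": select, "generate": generate, "output": output}


def new_tis() -> TIS:
    return {"running": True, "history": [], "transcript": [],
            "latest_move": None}


def run(control: List[str], modules: Dict[str, Callable[[TIS], None]],
        tis: TIS, max_cycles: int = 20) -> TIS:
    cycles = 0
    # The loop condition is a test against the Information State.
    while tis["running"] and cycles < max_cycles:
        for name in control:
            modules[name](tis)
        cycles += 1
    return tis


user_first = run(USER_FIRST, make_modules(["hello", "bye"]), new_tis())
system_first = run(SYSTEM_FIRST, make_modules(["bye"]), new_tis())
```

Each module communicates only by reading and writing the shared state, so changing who moves first amounts to supplying a different list of module names.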
5.1 Instance 1: GoDiS - Gothenburg Dialogue System

GoDiS is an experimental dialogue system designed in particular for experimenting with notions of ‘question under discussion’ [7] and ‘accommodation’ [13] in dialogue management. Information states are represented as record structures and, in particular, model the private beliefs and goals of an agent and those parts which are shared. The intuition is that a dialogue progresses as information becomes shared. An agent also has a private plan which is more of a long term set of goals which can also be used for accommodating information (for example, behaving, on the basis of something in the plan, as if a question has been explicitly asked even though it has not).

[Figure 5.1: The TRINDIKIT architecture. Components: Control, Input, Interpretation, Dialogue Move Engine (DME) (update), Generation, Output and the TIS. TIS contents: the interface variables input, latest_moves, latest_speaker, next_moves, output and program_state; IS : (Information State Type); resources: database dialogue grammar plan library ... Interfaces: Resource Interface, Information State Interface. Legend: optional component, obligatory component.]

The GoDiS dialogue move engine consists of two main modules update and select, which respectively update the Information State on the basis of the last user move and select the next system move. There is a set of selection rules each of which consists of a condition on the Information State and an action which simply records the decision of the dialogue manager to make a particular move. The selection algorithm chooses the first applicable selection rule it can find. (That is, as explained above, it consists just of the name of the class of all selection rules.) The update rules divide into 6 classes: grounding, integration, accommodation, agenda refill, database enquiry and store.
These classes are intended to represent a natural classification of agent actions whilst interpreting input utterances. Grounding refers to the process of moving information from the private sectors of the Information State to the shared sectors. Accommodation adds information from the private plan to the shared questions under discussion, given the stimulus of a particular input content. Integration adds new content to the Information State. Agenda refill transfers information from the private plan to the private short term agenda and corresponds to an agent setting up a short term conversational goal from his long-range overall plan. Database enquiry refers to looking up an external resource. Store refers to saving the current shared Information State; this is in case a problem arises later, in which case the ‘old’ shared Information State needs to be restored.

Finally, there is an algorithm for exploiting these various rule classes. Simplifying slightly, new inputs should be grounded and then integrated. If integration fails, one should accommodate and then try integrating again. Repeat until integration succeeds. Then, if the input was the user’s, one should refill the agenda and attempt database lookup. Otherwise one simply stores the current Information State.

Architecturally, GoDiS uses the default control algorithm, namely the standard sequencing of select, generate, update, input, interpret, update. (The standard sequence when the system makes the first move in a dialogue, that is.) Update occurs twice, once as a result of generating a move and once as a result of interpreting a move. Although the same algorithm is executed twice, the update rules themselves always distinguish whether the latest move was actually made by the system or the user. No module apart from the Dialogue Move Engine modules accesses the Information State.
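The simplified update algorithm just described can be written out as a short sketch. It is a Python paraphrase of the prose, not the GoDiS DME-ADL program: `godis_update`, `toy_apply_class`, the state layout and the bound on retries are assumptions introduced here. In particular, the text says only "repeat until integration succeeds"; the sketch additionally stops when accommodation has nothing left to offer, so that it always terminates, and then carries on regardless.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
ApplyClass = Callable[[str, State], bool]  # True if some rule of the class fired


def godis_update(state: State, apply_class: ApplyClass,
                 max_rounds: int = 10) -> List[Tuple[str, bool]]:
    trace: List[Tuple[str, bool]] = []

    def attempt(rule_class: str) -> bool:
        fired = apply_class(rule_class, state)
        trace.append((rule_class, fired))
        return fired

    attempt("grounding")
    rounds = 0
    while not attempt("integration"):
        rounds += 1
        # Assumed guard: give up if nothing can be accommodated.
        if rounds > max_rounds or not attempt("accommodation"):
            break
    if state["latest_speaker"] == "user":
        attempt("agenda_refill")
        attempt("database_enquiry")
    else:
        attempt("store")
    return trace


def toy_apply_class(rule_class: str, state: State) -> bool:
    """Stand-in for the real rule classes, just enough to drive the loop."""
    if rule_class == "grounding":
        state["shared_moves"].append(state["latest_move"])
        return True
    if rule_class == "integration":
        if not state["qud"]:
            return False  # nothing under discussion to attach the content to
        state["shared_facts"].append((state["qud"].pop(), state["latest_move"]))
        return True
    if rule_class == "accommodation":
        if not state["plan"]:
            return False
        state["qud"].append(state["plan"].pop(0))  # act as if it had been asked
        return True
    if rule_class == "agenda_refill":
        if state["plan"] and not state["agenda"]:
            state["agenda"].append(state["plan"][0])
            return True
        return False
    if rule_class == "store":
        state["stored"] = list(state["shared_facts"])
        return True
    return False  # database_enquiry: no database in this toy


state: State = {"latest_speaker": "user", "latest_move": "Boston",
                "qud": [], "plan": ["destination?", "date?"], "agenda": [],
                "shared_moves": [], "shared_facts": [], "stored": None}
trace = godis_update(state, toy_apply_class)
```

In the toy run the bare answer "Boston" cannot be integrated at first because no question is under discussion; accommodation then raises a question from the plan and the second integration attempt succeeds.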
5.1.1 Discussion

The main focus of our interest in dialogue system architectures is the architecture and dataflows between components. From this perspective, GoDiS is less interesting because it does not exploit to any great degree the possibilities provided for in the TrindiKit. The default control algorithm resembles the standard control flow in a pipeline albeit with the additional benefit that each component updates the global Information State in turn rather than passing on information to each other directly. That is, the pipeline is not a dataflow pipeline but merely an order of calling components.

Given the focus of GoDiS on certain particular theoretical issues, it is perhaps not too surprising that the components other than the dialogue manager itself are treated in a somewhat cursory manner. The input module accepts typed text. The interpretation module appears to be a simple phrase spotter. Neither of these modules appears to access the global Information State other than to read their input from the right module interface variable. The interpretation module, for example, reads the input word string from the interface variable input which the input module placed there. It writes its output into another interface variable called latestmove which the Dialogue Manager reads from.

The role of the interface variables does however raise a more substantive issue in the design of the TrindiKit architecture. Although interface variables are part of the ‘Total Information State’, they are not part of the Information State proper. Consider, for instance, the interface variable nextmove which is written by the dialogue move engine (in fact, the selection algorithm within the dialogue move engine) and read by the generator. As noted above, the output of the selection algorithm is the choice of the best move to make next. This move is recorded in the nextmove interface variable.
One might ask why, in a system that is designed to model beliefs and goals, it is not recorded in a part of the Information State representing Intentions. Of course, one could record it there also and thereby make it available to other components. But why then should the generator access only the interface variable and not the Intention itself? The latter approach appears to be the more general solution. The only reason for the interface variable appears to be to facilitate a ‘plug and play’ architecture in which it is a simple matter to plug in different components and test the resulting configuration. Having to code the name of a particular location in one’s Information State into the generator would detract seriously from re-configurability. Indeed, a minor change in one’s Information State could stop the generator from working at all. Ideally, of course, one would also like to experiment with plugging different dialogue managers into an overall dialogue system. Again, the interface variables make this at least a conceptual possibility. If different dialogue managers all write to the nextmove interface variable, then any generator that reads it can work with all of them. Nevertheless, the status of interface variables as outside the Information State proper looks an unhappy one. Presumably, what is really required is a means of mapping interface variables directly onto parts of the Information State. That is, the interface between a dialogue manager and a component should be mediated by an indirection. The interface variables in GoDiS are essentially a device for creating a data pipeline in the TrindiKit, albeit one that is mediated by the Information State. One corollary of this is that the selection algorithm, which is responsible for choosing the next move to make, can be viewed as selecting and executing an update rule whose action updates the Information State. Selection therefore follows the general pattern outlined for Information State updates. 
This attractive corollary is however liable to over-interpretation. The part of the Information State that is updated is of course the interface variable. Since, in GoDiS, interface variables are actually designed purely for data pipelining, it is evident that the use of Information State update here is really more of an implementational detail than an instantiation of a theoretical claim. That is, for the purposes of GoDiS one might just as well have pipelined the outcome of selection directly to generation. Although this is in itself a comparatively trivial matter, it is important that there is really no reason why selection should take the form of an algorithm operating over update rules. Selection could take any form so long as its input source is the Information State and its output is a move.

The point of course threatens also to extend to the role of update rules in Interpretation. Why should Dialogue Interpretation itself consist of an algorithm over update rules and not just some algorithm whose input is the Information State and whose output is an updated Information State? Of course, by examining the GoDiS update algorithm, one realizes that it embodies theoretical claims. That is, its division into grounding, accommodation and so forth is theoretically significant. However, it is perhaps unclear whether any particular significance attaches to individual rules or the rule format.

5.2 Instance 2: Conversational Game Player

The SRI Autoroute demonstrator is an instance of the TrindiKit architecture. Like GoDiS, it is designed with the aim of exploring certain theoretical issues in dialogue management. The issue in question is formalizing and using Conversational Game Theory as a basis for dialogue management. Also like GoDiS, the treatment of the other system components (input, interpretation, generation and output) is somewhat rudimentary.
In Conversational Game Theory, rational agents have beliefs, goals and a set of operators for undertaking actions in the world. Included in these actions is the playing of conversational games - these being joint actions between dialogue participants. Knowledge of the structure of conversational games is shared between dialogue participants so that, once one partner realizes that the other has started a game, the first partner both knows how to continue it and cannot avoid doing so on pain of being deemed uncooperative. In the SRI demonstrator system, game knowledge is encoded in simple recursive transition networks which specify not only the valid sequences of dialogue moves but also their meaning. The meaning of each move is specified as a context update function on propositions under discussion. That is, the role of each move in a game is to modify a set of propositions. The meaning of a game is to update a context of propositions that have been agreed.

One of the features of Conversational Game Theory is that the game and move definitions are independent of who happens to be the speaker and who happens to be the hearer. That is, the update effect of an utterance should not differ according to whether one is a speaker or hearer. Of course, whether and how one chooses to make a certain move or not will depend on one’s mental state and this may differ from speaker to hearer, but the effect of the act itself is invariant. The control strategy for the SRI Autoroute demonstrator reflects this perspective: it consists of a repeating loop over two constructs, update and generate-or-acquire. The update construct updates the Information State with the latest input, regardless of who generated it. The generate-or-acquire construct tests to see whose turn it is in the dialogue and then either calls input followed by interpret or calls select followed by generate and then output.

Update itself (and indeed Selection) is realized by a set of update rules in TrindiKit format which, together with the update (or selection) algorithm, encode a general agenda-based search strategy. The search is over a set of possible paths through the recursive transition networks. When a new input (user or system) occurs in a dialogue, there may be several possible interpretations of what has happened so far in the dialogue. Each possible path is generated and stored in an agenda. The Update algorithm has two parts: first, repeatedly finding the first applicable rule and then executing it until no more rules are applicable; second, sorting the agenda according to a simple utility-based preference metric. The top-ranking hypothesis in the agenda represents the state that the agent believes is the correct one.

5.2.1 Discussion

The control algorithm used for the CGT demonstrator within TrindiKit is interestingly different from that employed in GoDiS. The control algorithm is to a certain extent context sensitive. If a new move is required, the control algorithm consults the Information State in order to see whose turn it is to make the next move. Once the move is made, the update algorithm is called to integrate its effects, regardless of who made the move. In GoDiS, the update rules for system moves differ greatly from those for user moves.

A number of points can be made about this simple difference. First, the system analyses its own utterances after they have been made, just as it analyses the user’s. One advantage of this procedure is that it is straightforward for the system to recognize possible alternative interpretations of its own utterances. This can be useful for detecting when dialogue has not gone according to plan. The simplest case is the user simply not understanding the system and saying ‘pardon’. At this point, the system effectively has to backtrack - it thought it had accomplished a particular speech act (say, asked a question) but in fact it had failed to do that.
In the CGT demonstrator, both interpretations of the system’s original utterance (‘I asked a question’, ‘I said something unintelligible’) are generated and maintained in the agenda. If ‘pardon’ is the next move, then only the second analysis can be extended to incorporate it. A pardon move may only legitimately follow something which was unintelligible. In GoDiS, the state that existed before the question was asked is stored in a special tmp field and is simply restored when ‘pardon’ is encountered. However, this sort of treatment will not extend to cases, probably rare in Human-Computer dialogues, in which the second utterance does not simply cancel the first but coerces a particular interpretation onto it. Witness the child who responds with ‘I don’t think so, mummy’ to his mother’s ‘You’re going to school tomorrow’. Of course, one could try adding another sort of update rule to GoDiS to be used when the usual interpretation procedures failed (just as accommodation is invoked when ordinary integration fails) but it is not clear how such a rule would work.

The second point is that the CGT demonstrator does not actually backtrack. In fact, it generates all analyses in advance and merely selects an alternative that was already known to it if the current favourite analysis cannot be maintained. This is probably neither psychologically realistic nor even a very good engineering solution. If you both intended to ask a question and believe that you have, there is little point in calculating that you might be held to have done something else until a problem actually arises that causes you to re-assess your own actions.

The third point is that, even though the control algorithm is context sensitive in a limited fashion, the system as a whole still enforces a fairly strict turn-taking model.
The control algorithm uses the Information State simply to see whose turn it is to make a move, and this is determined by the conversational game definitions themselves and what the system currently believes the game state is. In fact, since the system is maintaining alternative interpretations of dialogue state, it is conceivable that the system and user both believe that the dialogue is in a state in which it is their turn to speak next. This would give rise to both attempting to take the floor and require a mechanism for resolving the conflict. The CGT game definitions are constructed so that this cannot occur. In any case, if the system believes it has the turn, the user does not even have an opportunity to try to take the floor since he will not be suitably prompted for input.

The interpretation module for the CGT system is also mildly context sensitive. The interpreter is essentially a simple phrase spotter which uses knowledge of the last question asked in order to interpret fragments. For example, the meaning of ‘London’ is taken to be ‘that the destination is London’ given a context in which the last question asked was ‘Where do you want to go?’

5.3 Conclusion

The TrindiKit architecture is designed to exploit an Information State Update approach to dialogue. The heart of the system is the Dialogue Move Engine which accesses and updates a well-defined Information State, a data object that comes with pre-defined mechanisms for testing values of its component parts and updating them. The Information State is available to all other components in a dialogue system, including the control module. The internal structure of the Dialogue Move Engine is also constrained in that it is intended to consist of a number of modules each of which invokes a user-specifiable algorithm over a number of update rules of a certain pre-defined format. (A default algorithm is also available.)
Each rule must consist of a set of pre-conditions on the Information State and a set of actions which are operations on the Information State.

Instantiations of the TrindiKit have so far focussed mainly on particular theoretical objectives within Dialogue management and have not paid particular attention to links between Dialogue management and other components. Generally, the standard data pipeline has been implemented. SRI’s CGT demonstrator system employs a control algorithm that is sensitive to Information State but in general one might expect this facility to be of limited value, just as Hub scripting is in the Darpa Communicator Architecture. In most TrindiKit systems, components other than Dialogue management do not access the Information State.

Chapter 6 The Verbmobil Communications Architecture

The system architectures and underlying toolkits considered so far have been quite similar in their aims, scale and range of components. Before we proceed to consider an architecture for the Siridus demonstrator and toolkit, it may be useful to look at an environment that, while also addressing the processing of spoken language dialogues, has quite distinct goals and correspondingly different challenges.

The Verbmobil system differs from a typical dialogue system in a number of ways. The most evident of these is that it does not perform dialogue management in the accepted sense, because it processes dialogues between humans and is only concerned with providing a translation of the individual dialogue turns. This entails the maintenance of a dialogue model, but it is used for tracking the flow of the dialogue and supporting translation decisions, as the system never has to take the initiative.
For our current purposes the more interesting aspects of the Verbmobil architecture are not so much the modules that are required for the dialogue translation task, as the way that the system architecture copes with three crucial challenges: the scale of the system and its coverage; the interchangeability of modules; and the time constraints implied by processing spoken dialogues in quasi real time.

Since Siridus’ prime concerns are the scalability and reconfigurability of spoken dialogue systems, it is useful to have as a yardstick a system which achieved considerable complexity, scale and numbers of configuration options within approximately the same domain. The final Verbmobil system comprised 69 modules whose communication requirements would probably have defeated any architecture based on direct communication between individual modules, e.g. via communications channels, since 2,380 point-to-point communications channels would have been required. An alternative communications concept was developed based on a notion of local blackboards [10, 3], reducing the number of connections to be defined by a factor of about ten: in fact, 224 pools were defined.

The multi-blackboard architecture also facilitates the use of interchangeable modules and indeed of modules that operate in parallel, since an individual functionality can be defined as connecting two (or more) pools. Multiple modules supplying this functionality can be employed in parallel or in alternation. To some extent the availability of multiple modules is a luxury derived from the scale of the project, but some aspects of the architectural requirements are parallelled in the design of a toolkit if the notion of plug and play with multiple off the shelf components is adopted.
The processing strategy in Verbmobil is essentially ‘pipelining’ in the technical sense referred to in Section 2.3, since the sequence of processes for any given input is sequential, but the modules operate asynchronously, because the basic increment is taken at a lower level than the turn. The spoken input is segmented according to various criteria, including the length of pauses, prosodic cues and syntactic structure, so that different modules may be working on different segments at the same time. The purpose of this incrementality is to keep the end-to-end processing requirements within a small multiple of real time.

We have outlined the main properties of the Verbmobil communications architecture as a source of comparisons, but what can be directly learnt for the Siridus architecture where the initial scale and resources are much more limited, but where the ultimate processing task and level of intended incrementality are more demanding?

6.1 Diverse Lessons from Direct Experience with a Pool Communications Architecture

There are a number of practical lessons that can be drawn from the Verbmobil experience with communication via a number of data pools. Here it is helpful to maintain a contrast between the design of the communications architecture and the way it was actually used. The development method adopted in Verbmobil involved maintaining a running system throughout virtually the whole development period. The persistent performance pressure, combined with the fact that only one system was ever intended, meant that operations were encoded that went to the limits of the intended architecture. So the positive properties of the communications architecture can be contrasted with anecdotal evidence of some slightly perverse mechanisms that were coded within it. Conversely, options that the architecture supports but were deliberately avoided in the implementation may indicate pitfalls to be avoided.
Verbmobil is not only a large system but a very complex one, supporting numerous configuration options (footnote 1). To some extent it is therefore also like a toolkit, since multiple solutions to the same problem may be encoded within it. Another aspect of its complexity is the overhead of maintaining consistency, which may or may not be a requirement in a smaller system.

In summary we should take account of the benefits of the overall framework, but also, in both a positive and a negative sense, the cases that take this approach to its limits. We should also bear in mind that some of the mechanisms required here will not be essential in a more modest system, but there again may be desirable for the sake of scalability.

6.1.1 Specifying Functionalities

The fact that interface specifications are conceived of in terms of abstract functionalities, rather than the requirements of specific module instances, has positive consequences both for the interchangeability of modules and for the management of communications. The processing modules read from and write to data pools, so they are not required to know which modules supply or consume their data and are, hence, impervious to non-local changes in configuration. The actual message passing is performed by a PCA (Pool Communication Architecture) package. This simply has to know which module is currently registered, as either consumer or producer, for the functionality associated with a given data pool. Where modules are exchanged it is only the instantiation of the functionality that is affected. Similarly, competing modules for the same functionality may be employed, or one module may fulfill several functionalities, e.g. the same service for several languages or language pairs. There may also be sequences of several functionalities that, when composed, are equivalent to one functionality also defined in the system. The set of functionalities defined is therefore the basic level of the architecture specification.
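The indirection described here, in which a producer writes to a pool and the routing is resolved through whichever module currently fulfils a functionality, can be illustrated with a minimal sketch. It is not the Verbmobil PCA: the class names `FunctionalityTable` and `Pool`, the module names `parser_a` and `parser_b`, and the pool name `word_lattice` are all hypothetical, and only the functionality name follows the naming style used in the examples of this chapter.

```python
from collections import defaultdict


class FunctionalityTable:
    """Hypothetical record of which modules currently fulfil each functionality."""

    def __init__(self):
        self._modules = defaultdict(list)

    def register(self, functionality, module):
        if module not in self._modules[functionality]:
            self._modules[functionality].append(module)

    def unregister(self, functionality, module):
        self._modules[functionality].remove(module)

    def modules_for(self, functionality):
        # Return a copy so callers cannot alter the table by accident.
        return list(self._modules.get(functionality, []))


class Pool:
    """A data pool read by whoever fulfils one named functionality (assumption)."""

    def __init__(self, name, consumer_functionality, table):
        self.name = name
        self.consumer_functionality = consumer_functionality
        self.table = table

    def publish(self, item):
        # The producer never names a consumer; routing is looked up at publish time.
        consumers = self.table.modules_for(self.consumer_functionality)
        return [(module, item) for module in consumers]


table = FunctionalityTable()
table.register("Linguistic.Analysis.German", "parser_a")
lattice_pool = Pool("word_lattice", "Linguistic.Analysis.German", table)

before_swap = lattice_pool.publish("lattice-1")

# Exchange the module: only the instantiation of the functionality changes.
table.unregister("Linguistic.Analysis.German", "parser_a")
table.register("Linguistic.Analysis.German", "parser_b")
after_swap = lattice_pool.publish("lattice-2")
```

In this sketch the producer's call to `publish` is identical before and after the exchange, which is the property the text attributes to specifying interfaces by functionality rather than by module instance.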
The current state of the module configuration is maintained by a module manager which essentially provides the PCA with the table of which processes currently fulfill each functionality. It should be noted that this design does not assume that all data pools associated with a given functionality will necessarily be in use. Hence, it is conceivable that modules for a given functionality may be interchangeable, even though they have somewhat different data requirements, though it will be assumed that the main input and output specifications will be maintained. Equivalently the reconfiguration of a local module may mean that existing functionalities are not actually called on in a given processing session.

Footnote 1: At the last count there were 196 individual module configuration options in a given configuration, but not all are independent of each other.

Examples using Functionality Definitions

The standard examples for the use of functionality definitions are the interchangeable modules for speech recognition in German and optional language identification. The main examples of modules working in parallel are more complex, because they also involve the concatenation of functionalities and, implicitly, result selection. In addition, the definition of multiple functionalities can obviate an absolute choice for the form of recognition output. Local configuration of individual modules can make so-called subfunctionalities optional.

There are three possible instantiations for the functionality Acoustic.Recognition.Continuous.German.Frequency16kHz. Each maps the recorded microphone signal to an initial word lattice. The choice of which recogniser module to use is fixed by the arbitration module as part of the system configuration. A further optional functionality can be activated for Acoustic.Recognition.Continuous.Unknown.Frequency16kHz; this provides language identification.
The functionality Linguistic.Analysis.German is normally instantiated by four different modules in parallel. Three of these also instantiate Linguistic.Transfer.German.English and Linguistic.Generation.English. That is, the modules for statistical translation, example-based translation and case-based translation implement a complete mapping from recognition output to synthesis input as well as translation from source to target language. The fourth instantiation of the analysis functionality is the entry point to a more complex deep linguistic translation process involving not just the three functionalities mentioned above but several subfunctionalities and the further subdivision of analysis into Linguistic.Analysis.German.Syntax and Linguistic.Analysis.German.Semantic. The convergence point of these various translation paths is the selection module, the only instantiation of Selection.English, but also a module that has to determine when the preceding tasks are complete and ready for selection and which priorities to apply. The choices can be affected by local module configuration, which then in turn affects the sequence of processes in the whole system, since the preference for an efficient option will leave more expensive processes incomplete.

Local module configuration can also make subfunctionalities and even data pools redundant. This is particularly noticeable with request and response loops that are implemented, essentially, as module to module communications within the data pool framework. Transfer in particular makes use of context-based semantic disambiguation of predicates to support translation choices, but this is comparatively expensive in execution time and can be deselected. Then not only the subfunctionality but also the relevant data pools are ignored for the remainder of the session. This can be contrasted with the provision of additional information to a recogniser by a dialogue manager.
Just because the information is there does not mean you have to be able to use it.

The convergence of distinct translation paths is interesting because it imposes additional control tasks. Their divergence is interesting because different analysis methods have different data requirements from the recognition results. These vary from the best hypothesis over n-best strings to the complete word lattice. The choice is familiar, but here it is not actually made. There are distinct pools for each type of recognition result. The best hypothesis and full lattice are taken direct from the recognisers. The n-best strings, actually flat lattices, are generated on demand in the first stage of deep linguistic analysis as input to the most sophisticated parsers, so that all options are catered for within the same architecture (footnote 2).

6.1.2 Publishing and Subscribing

In Verbmobil individual modules have considerable autonomy. It is the modules that declare which functionalities they fulfill and hence which pools they will use for reading and writing. The metaphor used for communication requirements is that of publication and subscription. When data is produced it is published on the relevant pool, at which point the PCA sends a message to each of the subscribers for that pool. This removes from the module some of the requirement to poll an input pool for the arrival of new data. This also means that the pool can be seen as an autonomous agent that dispatches data according to the requirements specified in advance by the modules, although in practice this is achieved by a central communications manager. This effect could be replicated either through a few common data pools defined as individual agents or by a hub module functioning primarily as a communications manager.

6.1.3 Distributing Control

Despite the overall complexity of the system architecture, Verbmobil has no overall control module.
The control of the incremental processing is devolved to the individual processing modules which determine local decisions that, in turn, go to make up the overall processing strategy. This has a side effect that the proportion of processing to controlling carried out by each module may vary. In most cases modules carry no controlling responsibilities. They fulfill their designated function on the basis of simple input and output requirements. Where an interaction based on request and response is required, the calling module must initiate the request and determine how long it will wait for a response, but the responding module treats each request as an individual task. More complex control problems arise where processing paths merge and decisions have to be made that are based not on absolute timings, but the relative completion of a task by a set of modules working in parallel.

Footnote 2: This is generally known as having your cake and eating it.

A consequence of the devolution of control on a demand-driven basis is that the modules that inherit the largest controlling responsibilities may themselves carry out relatively limited functionalities. This is not in itself a bad thing, but it does not facilitate the management of the project resources or the debugging of system performance, since it obscures the overview of where the most significant events take place. Such instances did occur in the actual Verbmobil architecture and were essentially accidental. While these accidents were not necessarily harmful, a more hierarchically oriented control architecture would have precluded the need for idiosyncratic knowledge. Any assertions as to the benefits for Verbmobil of a more centralised control strategy would be counterfactual, but, as the TrindiKit examples show, a central controlling instance is usually assumed in dialogue systems.
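The request and response interaction mentioned in this section, in which the calling module alone decides how long to wait while the responder treats each request as an individual task, might look roughly as follows. This is only a sketch under stated assumptions: it uses Python threads and queues in place of Verbmobil's data pools, and the names `responder` and `request_with_deadline`, the payload and the timings are all hypothetical.

```python
import queue
import threading
import time


def responder(requests, work_seconds):
    """Serve requests one at a time; the responder knows nothing about deadlines."""
    while True:
        request = requests.get()
        if request is None:  # shutdown signal used only by this sketch
            return
        payload, reply_box = request
        time.sleep(work_seconds)  # stands in for expensive processing
        reply_box.put(payload.upper())


def request_with_deadline(requests, payload, wait_seconds):
    """The caller initiates the request and decides how long it will wait."""
    reply_box = queue.Queue(maxsize=1)
    requests.put((payload, reply_box))
    try:
        return reply_box.get(timeout=wait_seconds)
    except queue.Empty:
        return None  # carry on without the optional result


requests = queue.Queue()
worker = threading.Thread(target=responder, args=(requests, 0.05), daemon=True)
worker.start()

# A caller that can afford to wait receives the answer.
patient = request_with_deadline(requests, "predicate", wait_seconds=1.0)
# A caller with a tight budget gives up and proceeds without it.
impatient = request_with_deadline(requests, "predicate", wait_seconds=0.001)

requests.put(None)
worker.join()
```

The control decision (the length of the wait) sits entirely with the caller here, which mirrors the way responsibility for control ends up distributed across modules rather than held by a central instance.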
Problems with Distributed Control

There are two main points where parallel processing paths converge in the standard Verbmobil architecture. The first, already mentioned, is where the main selection between competing translations occurs. This is complex, but not the most complex instance. Translations only have to be synthesised when a turn is complete, and preferences for more efficient processes with lower accuracy may mean that it is not necessary to wait for all results to be present. Selection can be carried out when the preferred translation path delivers a complete result.

The other main selection point occurs in the semantics module which receives results from various linguistic analysers. Here segment incrementality must be maintained, so a selection has to be made as soon as all parsers have made an adequate contribution. This is somewhat more difficult to judge as it also takes account of the quality of fragmentary analyses and the segmentation carried out by the parsers. However, this is a significant control decision because it affects “downstream” processing for that increment. The weight of this responsibility can be compared with the limited remit of the local functionality of the semantics module itself, which comprises robust semantics processing, as to be applied in the Siridus repair module, and some linguistic resolution of predicate ambiguities, ellipses and anaphoric bindings.

6.1.4 Interrupting Processing

Another direct lesson from Verbmobil experience is the implicit overhead involved in handling the interruption of normal turn processing. While Verbmobil does not support the full set of barge in options, nor should it given that the system never has the initiative, user barge in and the processing of spoken commands can lead to the interruption of turn processing at virtually any point.
Actually, Verbmobil does not retain the content of a turn that is interrupted by the user or re-interpreted as a command, but this is really a detail of how interruptions are treated by local modules. The real problem is ensuring that interruptions are perceived quickly by all affected modules and that the synchronisation of turn processing is maintained. The direct implication of this impinges on interchangeability of modules, in that it is assumed that each processing module regularly checks for incoming control messages and is prepared to abort processing the current turn at relatively short notice. If this condition is not met then inconsistent system states may occur and the overall processing of the dialogue may actually be halted. These conditions are to some extent the consequences of local decisions about how interruptions are to be handled. However, these decisions have to be taken. If they are not treated with due care, requirements can result which preclude the use of pre-existing processors which do not have the appropriate hooks to trap real time control messages.

6.1.5 Knowing your Segments and Fragments

Verbmobil is segment incremental. This means that at any one time different segments of the same turn may be in different stages of processing. In addition, multiple segmentations are allowed. Consequently different processing paths may recognise different segmentations of the same turn. For the final output these segmentations have to be combined. These requirements can only be met by keeping track at all times of which portion of the input signal corresponds to the message that is being processed locally. This is achieved by attaching a symbolic segment identifier to each message that is passed. Relative to the size of some messages this is a small overhead, but it is a necessary one in this context where relatively small and also variable increments are allowed.
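To make the bookkeeping concrete, the following sketch shows one way a segment identifier carried on every message allows results from different segmentations of the same turn to be put back in order and checked for coverage. The representation is an assumption made for illustration: the text says only that the identifier is symbolic, so the fields `turn`, `start_ms` and `end_ms`, the producer names and the sample contents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentId:
    """Hypothetical identifier tying a message to a stretch of the input signal."""
    turn: int
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class Message:
    segment: SegmentId
    producer: str
    content: str


def assemble(messages, turn, producer):
    """Return one producer's contents for a turn, ordered by signal position."""
    own = [m for m in messages
           if m.segment.turn == turn and m.producer == producer]
    return [m.content for m in sorted(own, key=lambda m: m.segment.start_ms)]


def tiles_turn(messages, turn, producer, turn_end_ms):
    """True if the producer's segments cover the turn without gaps or overlaps."""
    spans = sorted((m.segment.start_ms, m.segment.end_ms) for m in messages
                   if m.segment.turn == turn and m.producer == producer)
    position = 0
    for start, end in spans:
        if start != position:
            return False
        position = end
    return position == turn_end_ms


# Messages may arrive out of order and with different segmentations per path.
messages = [
    Message(SegmentId(7, 900, 2000), "parser_a", "tomorrow at ten"),
    Message(SegmentId(7, 0, 900), "parser_a", "we could meet"),
    Message(SegmentId(7, 0, 2000), "parser_b", "we could meet tomorrow at ten"),
    Message(SegmentId(6, 0, 500), "parser_a", "hello"),
]

ordered = assemble(messages, turn=7, producer="parser_a")
complete_a = tiles_turn(messages, 7, "parser_a", turn_end_ms=2000)
complete_b = tiles_turn(messages, 7, "parser_b", turn_end_ms=2000)
```

Because every message names its turn as well as its span, a module that is told to abandon a turn can also discard late messages for it simply by comparing identifiers.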
There may be simpler ways of achieving segment incrementality, but direct experience has shown that it is difficult to find a segmentation that serves all components of a complex system.

6.2 Conclusions

Although Verbmobil was designed as a single system and not as a toolkit, the development method, involving an almost uninterrupted sequence of working prototypes, ensured that the underlying architecture framework exhibits several properties that are desirable in a toolkit. In particular, the communications architecture, through the use of abstract functionality definitions, multiple data pools and the publish and subscribe metaphor for message passing, supports the wide range of configuration options available in the final system. These include exchanging execution paths through the architecture and varying modules with the same functionality, as well as the local configuration of individual modules.

Where the Verbmobil method is less efficient is in the selection of which modules and functionalities are defined and the apportioning of control tasks to modules. Here rather too many accidental choices were allowed to arise during the development of the system, and the success of the resulting system relies more on the individual module implementers than on the architectural design. Although the system does not have to carry out dialogue management in the accepted sense, this is not an adequate explanation for why there is no overall or even local controlling instance.

Verbmobil does take the processing of spoken language seriously and achieves quasi real time performance by adopting segment incrementality, but the notion of a segment is a flexible one. This follows from the large number of levels and methods of processing involved. The bookkeeping overhead that is required in any asynchronous incremental system is amplified by the need to coordinate diverse segmentations.
However, turn and increment identification are necessary in any dialogue system that allows the interruption and resumption of processing, whether it be for barge in or other functions, such as user commands in the Verbmobil case.

To summarise and put these conclusions in the perspective of the preceding discussion of architectures for more standard spoken dialogue systems, the combination of abstract functionalities, data pools and publish and subscribe communications would seem to be a useful communications architecture where reconfigurability of both the modules and the architecture is a priority. Some of these facilities are already offered where a partitioned blackboard is adopted or the hub is primarily used as a communications controller. However, control of communications needs to be supplemented with genuine control of the processing tasks if real dialogue management is to be carried out with realistic time performance. In this context a degree of incrementality below the turn level is likely to be required, perhaps even different increments for different processing tasks. That would imply turn and segment identification in all messages to maintain consistent processing.

Chapter 7 An Architecture for SIRIDUS

The objective of the SIRIDUS architecture is to integrate developments in three main research areas of the SIRIDUS project. These areas are: extending the Information State Update approach of the Trindi project to cover new types of dialogue; building in and developing new approaches to robustness in dialogue processing; and exploiting the Information State Update approach in speech recognition and synthesis.
The Baseline Architecture for SIRIDUS is designed primarily as a means of bringing into a common computational framework research that was originally developed outside of the SIRIDUS project, as well as providing a foundation for developing both that and other work originating wholly within the SIRIDUS project.

7.1 Components and Processes

The baseline architecture for dialogue components and their processes is shown in figure 7.1. One of our principal interests in this architecture is the possibility of running the various components in it in parallel. In particular, in our work on developing interactions between recognition, synthesis and dialogue management we are interested in developing methods to cope with user interruptions, user ‘back-channeling’ and system interruptions. For example, as mentioned earlier, although commercial speech recognizers now permit the possibility of user barge in, they are limited in that anything the user says at all (so long as it reaches recognition threshold) counts as an interruption. Users cannot therefore backchannel by confirming parts of a system utterance while the system is still speaking. Conversely, if the system continues a dialogue while simultaneously looking up a database (as in the SmartSpeak example discussed earlier) then one wants the system to be able to interrupt the dialogue if important information comes back from the database which needs to be shared. Although it is not a first year SIRIDUS objective to build all these capabilities into a working demonstration, it is important that our baseline architecture permits their possibility.

Figure 7.1: The Component architecture

7.2 TrindiKit and OAA

For implementation of this architecture, the main options are an asynchronous version of the TrindiKit currently under development, and the Open Agent Architecture from SRI.
The final choice of implementation route cannot be made at the time of this deliverable since the asynchronous TrindiKit is not yet available. It is our preferred implementation route in that it naturally builds in for us the Trindi Dialogue Move Engine (DME). Extending the Dialogue Move Engine to cope with new types of dialogue is another SIRIDUS project objective. The optimal solution might be to combine OAA and TrindiKit, and this section examines different ways of seeing the relation between these two architectures.

We have decided not to use the Darpa Communicator Hub, mainly because its use in asynchronous processing remains largely untested. Also, as SIRIDUS partners are not formally members of the Darpa project, the level of support that we might reasonably expect to receive is somewhat unclear. Use of the current release version of the TrindiKit is not a suitable route for us since it enjoins a sequential pipeline calling of dialogue system components.

Some preliminary points

The asynchronous TrindiKit is built on top of AE (Agent Environment), which can be seen as a stripped-down version of OAA (see footnote 1). TrindiKit allows both asynchronous systems (running as several processes) and synchronous (serial) systems running as a single process. It is also possible to replace AE with OAA and run TrindiKit on top of OAA. An OAA agent declares a set of solvables (Prolog goals) and is accessed by giving it a goal (an instantiation of any of its solvables) and getting a response. AE agents are similar to OAA agents, but simpler. AE agents offer services (similar to OAA solvables). TrindiKit is implemented on top of AE in roughly the following way. The TIS handler is an AE agent offering services for accessing (checking and updating) the TIS. TrindiKit modules are AE agents which use the services of the TIS handler, and which export the service of executing the module algorithm to the TIS.
Each module has a trigger condition (actually one or more trigger conditions), and when such a condition holds the TIS handler will send a non-blocking request to the module to run the appropriate algorithm. In the rest of this section, we will investigate four ways of seeing the relation between TrindiKit TIS, modules and resources, AE agents, and OAA agents.

TrindiKit modules are AE agents

In this scenario, TrindiKit is run on top of AE, as described above. This solution as it stands does not permit interaction between TrindiKit and OAA agents.

Footnote 1: AE was developed solely for use with TrindiKit. The reason OAA was not used is that OAA was not available for SICStus Prolog when implementation of the asynchronous TrindiKit started.

TrindiKit modules and the TIS are OAA agents

In this scenario, TrindiKit is run on top of OAA. The OAA agents offer solvables corresponding to the AE services. This will allow TrindiKit to interact with OAA in a natural way. Unfortunately, preliminary tests indicate that this may be too slow for any interesting applications; in GoDiS, the number of interactions between the DME and the TIS is quite large (as it is bound to be in any interesting system) and it takes about 2 seconds from user utterance to system utterance (compared to about 0.5 seconds when running on top of AE). However, it must be stressed that these results are very preliminary, and require further investigation.

A TrindiKit system, run on top of AE, is an OAA agent

On this view, TrindiKit modules and TIS are AE agents; OAA agents may serve as “backend” services to the TrindiKit system in the form of TrindiKit modules, but not as part of the TrindiKit dialogue system per se. The TrindiKit system would (probably) not offer any services to the OAA community, but OAA agents could act as TrindiKit modules, given an interface rule similar to DARPA’s Hub scripts.
The interface can either run as a separate process or be included in the OAA agent (or possibly in the TIS handler). In general these interface rules (not to be confused with TIS update rules) will have the form

    RULE: <Trigger> solve(<Solvable>)
    IN: Conditions to check before query
    OUT: Operations to apply after response

An impressionistic example (assuming the TIS contains two variables, to_calendar and from_calendar, each of whose values is a record with the fields date and entries, and that some OAA agent offers the solvable entries(Date,Entries)):

    RULE: readable(to_calendar) solve(entries(Date,Entries))
    IN: to_calendar.date = Date
    OUT: set(to_calendar.entries,Entries)

This rule means that the interface will wait for a trigger from the TIS; on receiving the trigger it asks the OAA agent to solve Solvable, whose arguments may include information found in the TIS by the IN conditions; the answer is then stored in the TIS by the OUT operation.

This solution can be implemented in TrindiKit 2.0 by building an interfacing TrindiKit module which is also an OAA agent. The module would have a trigger (as do all modules in the asynchronous TrindiKit) and a simple algorithm which could have the same function as the interface rules shown above, e.g.

    if to_calendar.date $= Date and oaa_solve(entries(Date,Entries),_)
    then set(from_calendar.entries,Entries).

A similar solution (which only works for OAA agents implemented in SICStus Prolog) is to modify the original OAA agent so that it also becomes a TrindiKit module; in this case there is no separate interface outside the OAA agent. The algorithm could be e.g.

    if to_calendar/date $= Date & entries(Date,Entries)
    then set(from_calendar/entries,Entries).

Presumably, if the OAA agent offers the service entries(Date,Entries), that predicate is available inside the OAA agent.
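The trigger / IN / solve / OUT shape of an interface rule can be illustrated with a small Python sketch. Everything in it is a hypothetical stand-in rather than TrindiKit or OAA code: the TIS is a plain dictionary, `solve_entries` plays the role of an OAA agent offering entries(Date,Entries), the trigger is read as "to_calendar has a date", and the answer is written to from_calendar.entries as in the module algorithms quoted in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
TIS = Dict[str, Record]


def make_tis() -> TIS:
    """Hypothetical TIS with the two record-valued variables from the text."""
    return {
        "to_calendar": {"date": None, "entries": None},
        "from_calendar": {"date": None, "entries": None},
    }


# Hypothetical data behind the stand-in calendar agent.
CALENDAR: Dict[str, List[str]] = {"2000-07-14": ["project meeting", "dentist"]}


def solve_entries(date: str) -> Optional[List[str]]:
    """Stand-in for solving entries(Date, Entries); None means the goal failed."""
    return CALENDAR.get(date)


@dataclass
class InterfaceRule:
    trigger: Callable[[TIS], bool]                     # RULE: when to fire
    in_conditions: Callable[[TIS], str]                # IN: read query arguments
    solve: Callable[[str], Optional[List[str]]]        # query to the agent
    out_operations: Callable[[TIS, List[str]], None]   # OUT: store the answer

    def run(self, tis: TIS) -> bool:
        """Apply the rule once; return True if an answer was stored."""
        if not self.trigger(tis):
            return False
        argument = self.in_conditions(tis)
        answer = self.solve(argument)
        if answer is None:
            return False
        self.out_operations(tis, answer)
        return True


def store_entries(tis: TIS, entries: List[str]) -> None:
    tis["from_calendar"]["entries"] = entries


calendar_rule = InterfaceRule(
    trigger=lambda tis: tis["to_calendar"]["date"] is not None,
    in_conditions=lambda tis: tis["to_calendar"]["date"],
    solve=solve_entries,
    out_operations=store_entries,
)
```

Calling `calendar_rule.run(tis)` on a fresh TIS does nothing; once a dialogue module has set `tis["to_calendar"]["date"]`, the same call queries the stand-in agent and stores the result.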
In addition to calling OAA agents directly, TrindiKit could also access OAA agents indirectly by communicating with the OAA facilitator in a similar way to that indicated above.

TrindiKit modules are either AE or OAA agents

This solution extends the previous one a bit further by allowing OAA agents (properly interfaced) inside the TrindiKit dialogue system. This solution is probably the most flexible and useful. It would be implemented in the same way as the previous solution. One variant of this solution would be to run the DME and TIS serially in TrindiKit, and to use this process as an OAA agent. This variant would not require any asynchronous behaviour from the TrindiKit. It would amount to ripping out the core of the TrindiKit and using OAA as the basic architecture.

TrindiKit resources are OAA agents

In this scenario, OAA agents are not modules in TrindiKit, as in some of the solutions above. Instead, they are implemented as TrindiKit resources. There is one possible problem in using agents as resources. In TrindiKit, resources are called from TIS update rules, which are executed by the TIS handler. Only one rule can be executed at a time, and while a rule is being executed the TIS handler will not do anything else. The idea is that rules are bundles of TIS conditions and operations which are protected from asynchronicity; this guarantees that the preconditions of a rule still hold when its effects are executed. The same holds for conditions and operations; while a condition is being checked (or an operation applied), nothing else happens in the TIS. So if, e.g., a condition calls a resource which is an OAA agent, and it takes 2 seconds before the agent replies, then the TIS will be blocked for 2 seconds (and requests from other modules will be put in a queue).
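The blocking behaviour described above can be reproduced in a few lines of Python. This is only a toy model under stated assumptions: `TISHandler`, `apply_rule` and `slow_resource` are hypothetical names, a lock stands in for "only one rule at a time", and a short sleep stands in for the slow OAA agent.

```python
import threading
import time
from typing import Callable, Dict, List


class TISHandler:
    """Toy TIS handler that executes one update rule at a time."""

    def __init__(self) -> None:
        self.state: Dict[str, object] = {}
        self.log: List[str] = []
        self._lock = threading.Lock()

    def apply_rule(
        self,
        name: str,
        precondition: Callable[[Dict[str, object]], bool],
        effect: Callable[[Dict[str, object]], None],
    ) -> bool:
        # The lock makes precondition check and effect one atomic bundle,
        # so the precondition still holds when the effect is applied.
        with self._lock:
            self.log.append(f"start {name}")
            holds = precondition(self.state)
            if holds:
                effect(self.state)
            self.log.append(f"end {name}")
            return holds


def slow_resource(delay: float = 0.2) -> bool:
    """Stand-in for a resource call answered by a slow external agent."""
    time.sleep(delay)
    return True


def demo() -> List[str]:
    tis = TISHandler()
    inside_slow_rule = threading.Event()

    def slow_precondition(state: Dict[str, object]) -> bool:
        inside_slow_rule.set()   # the slow rule now holds the TIS
        return slow_resource()

    slow = threading.Thread(
        target=tis.apply_rule,
        args=("slow_rule", slow_precondition, lambda s: s.update(answer=42)),
    )
    slow.start()
    inside_slow_rule.wait()
    # This request has to wait until the slow rule has finished.
    tis.apply_rule("fast_rule", lambda s: True, lambda s: s.update(fast=True))
    slow.join()
    return tis.log
```

In `demo()`, the fast rule is requested while the slow rule is still waiting for its resource, yet the log shows it starting only after the slow rule has ended: the request was effectively queued.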
However, the TIS-blocking problem can be avoided by defining the OAA resource interface as a standalone part of the TIS, which means it runs as a separate process. Still, if the call to the resource is made from an update rule the TIS will be blocked; but if it is made from a module algorithm there will be no block. The resource interface (call it oaa) should import the OAA library. In addition, one should build a TrindiKit module which calls the standalone OAA resource. Like any module, it has a trigger and an algorithm. This allows us to define the kind of interface rules mentioned above using the standard TrindiKit module definition format. To make life easier for users, one could include the OAA resource interface in the TrindiKit distribution (since it is generic).

Preliminary conclusion

Above, we have presented various ways to conceive of the relation between TrindiKit and OAA. All of these deserve further exploration, and TrindiKit 2.0 will make it possible to implement them all and experiment with them to find the best solution. Of course, it is very possible that there is no single optimal solution for all situations; consequently, experimentation may be required to determine a suitable compromise or a system-specific optimisation.

7.3 Dataflows and Interfaces

One of the aims of the Siridus project is to provide a dialogue system incorporating new features concerned with robustness and dialogue management. In particular, one project objective is to combine a ‘repair’ approach to robustness with a ‘shallow’ approach to interpretation. That is, faced with input which cannot literally be made sense of using a linguistically motivated grammar of the well-formed strings of a language, one can take two approaches. First, one can attempt to repair the input by transforming it so that it is as if the input were initially perfectly well-formed.
Secondly, one can take a shallower approach to interpretation, using whatever information one can find in the input together with suitable dialogue expectations. Our shallow approach to interpretation is also designed to be suitably gradable: the more linguistically structured information there is in the input, the more notice the interpreter will take of it.

In our initial baseline architecture, we will be building in the combination of these two ideas using a simple dataflow pipeline between the two components of repair and interpretation. We shall also use a simple dataflow pipeline between recognition, parsing and repair. Of course, the component architecture outlined in section 7.1 above permits more complicated patterns of dataflow, but we only intend to examine a pipeline model in our first year.

For the data interfaces, we intend to use a lattice or chart structure. The output of speech recognition will be a word graph rather than a simple 1-best string or n-best list of hypotheses. Our reasons for this are precisely the same as for the Colorado University Communicator system described in section 3.2 above. We wish to examine whether, and how much, valuable information can be found clustering around the speech recognizer’s best hypothesis, even if the actual best string that the recognizer would select on the information available to it is not optimal. The output of the repair module will also be a chart structure, but with additional edges added into it by the repair module. These edges will be annotated with scoring information indicating the repair module’s estimate of confidence in its own work. Finally, the ‘gradably shallow’ interpreter will be updated to take account of the original edges and the repaired edges when available. The resulting picture is summarized in figure 7.2.

7.4 Dataflow timings

Given the use of a parallel and asynchronous architecture, it is not sufficient simply to map out where information flows.
When information flows is also very significant. In our baseline architecture, we intend to explore one particular possibility: the impact on parsing, repair and shallow interpretation of the recognizer delivering a sequence of partial word graphs as speech recognition proceeds. That is, speech recognition will not wait until it has processed everything there is to be processed before delivering its output. Rather, it will output partial hypotheses during recognition. It is an interesting research question what stability one will find in the partial hypotheses and how much use can be made of them by downstream processing.

Figure 7.2: Baseline Dataflows

7.5 Word Lattices as a Data Format

Speech recognition systems often produce many plausible word strings for each utterance and need to pass this ambiguity on to the next stage in the processing. Early systems put out only one word string, the best scoring one. Because the path score is based on the acoustic model of the context-conditioned phonemes and a word sequence model, sometimes the second or even the 99th best hypothesis is actually the correct one. This led to the idea of n-best lists of hypothesized word strings. Because each string often differs from the next by a single word, the whole of a 10-best list might consist of the possible permutations of 4 words. This is not a very good way to represent ambiguity of this sort. An alternative is to provide a word lattice, which contains the words and their start and stop times, spanning the length of the utterance. This way a four-way ambiguity between two time points is just a list of words with the same start and end times, together with their probability scores. The depth of the lattice at any point depends on the level of pruning. The idea is to set the pruning threshold so that the correct sentence is not lost from the search.
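To make the contrast between an n-best list and a word lattice concrete, here is a minimal Python sketch. The `Edge` type, the integer frame times, the words and the log scores are all invented for illustration and do not correspond to the output format of any recognizer mentioned in this report.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Edge:
    start: int     # start time, in frames
    end: int       # end time, in frames
    word: str
    score: float   # log score; higher is better


def n_best(edges: List[Edge], t_start: int, t_end: int) -> List[Tuple[List[str], float]]:
    """Expand a lattice into all complete word strings, best score first."""
    by_start: Dict[int, List[Edge]] = defaultdict(list)
    for edge in edges:
        by_start[edge.start].append(edge)

    results: List[Tuple[List[str], float]] = []

    def extend(time: int, words: List[str], score: float) -> None:
        if time == t_end:
            results.append((words, score))
            return
        for edge in by_start.get(time, []):
            extend(edge.end, words + [edge.word], score + edge.score)

    extend(t_start, [], 0.0)
    return sorted(results, key=lambda item: item[1], reverse=True)


# Two ambiguous time spans with two candidate words each: 5 edges in total.
LATTICE = [
    Edge(0, 30, "an", -1.0),
    Edge(0, 30, "and", -1.5),
    Edge(30, 80, "umbrella", -0.5),
    Edge(80, 130, "insurance", -0.7),
    Edge(80, 130, "assurance", -2.0),
]

hypotheses = n_best(LATTICE, 0, 130)
```

Here five edges stand for four complete hypotheses. Each further ambiguous span multiplies the number of strings in the n-best list but only adds a few edges to the lattice, which is the sense in which the lattice is the more economical representation.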
Figure 7.3 shows a word lattice in graph form for a simple sentence in the Wall Street Journal domain: “It’s like having an umbrella insurance policy, Camuzzi said.” The pruning threshold was very low for this word lattice. Word lattices can also be provided at time intervals of approximately 1/4 of a second. These lattices should have the correct words in them, but the scores might be wrong. How much the scores change, and whether impossible paths appear in the lattice, is a research question which will be studied during the course of the project. Some special pruning methods are available which attempt to eliminate most dead-end paths before putting out a word lattice. There may be paths which fail on the last few words, which would have to be removed from the parsing near the end of processing.

Figure 7.3: Word Lattice [graph of numbered lattice nodes and scored arcs; not reproduced in this transcription]

7.6 Speech Recognizers

The Siridus project has available to it several recognizers which will be tried and refined for our purposes. First of all there is the Decipher recognizer from SRI USA, which has successfully competed in the DARPA Project evaluations. This recognizer uses triphone acoustic models and an n-gram grammar for word sequence modeling. It is capable of outputting a word lattice, which is the preferred interchange format between the recognizer and the Siridus parser. This recognizer runs on Suns and on PCs under Windows NT. Another recognizer is the Entropic HTK recognizer, which uses similar phonetic models and n-gram grammars for recognition. This recognizer can put out partial word graphs and thus can provide a left-to-right progressive hypothesis for the recognized utterance. This is very useful for parsing, since the parser can begin before the end of the sentence, thus cutting down the delay due to recognition. We plan to use this capability extensively in the Siridus system.

The Siridus partner Gothenburg University has been given the CMU Darpa Communicator system in a cooperative arrangement with Carnegie-Mellon University.
Since this is a complete DARPA Communicator system which uses the DARPA hub only as a pass-through mechanism, it gives us a platform for integrating the Siridus architecture into a larger system which includes travel database lookup and processing of the database response into a generated speech response. The present recognizer in the CMU system is the Sphinx II system, which is known to be fast but less accurate than the current Sphinx III system. The Sphinx III system is to become available in the next few months. This system runs on Windows NT (recognizer) and Solaris (dialogue manager, database lookup).

Finally, we have been experimenting with the IBM ViaVoice recognizer, which runs on Linux and Windows, and the Dragon NaturallySpeaking recognizer, which runs on Windows. These are large vocabulary dictation systems, which recognize many words outside our travel dialogue domain. This should mean that it takes longer to recognize the words within the dialogue domain, because there are so many words (approximately 100,000) in the dictation recognition system. Also, these systems do not provide word lattices, but output only the best scoring transcription for the utterance. We do not anticipate using these in the Siridus system.

7.7 Prosodic Markings

Important words and phrase boundaries are marked by the speaker prosodically, using loudness, pitch rise-falls, and duration. Using acoustic measures of the syllabic nuclei found by the recognizer, it is possible to find the most probable focus words and phrase boundaries in an utterance. These help in determining what the speaker means in the utterance. Given a word lattice and the prosodic markings, it is possible to mark the focus words, to make sure that the lexical stress aligns with the acoustic stress, and to mark major phrase boundaries.
Major phrase boundaries help the parser to eliminate unlikely parses, and to partition the parses in case the parse of the whole sentence fails. Phrase-final intonation may also help to tell questions from statements, but this is not always the case. Certain questions (yes/no questions and wh-questions) are often produced with declarative (falling) intonation, whereas other questions would be produced with rising intonation.

A sentential stress detector based on a finite state machine architecture was developed by Hieronymus and Williams [9]. This detector was tested on spontaneous monologues and achieved good agreement with hand labels. Several attempts at stress detectors integrated with the speech recognizer have failed in the past. This remains an interesting area of research. The Siridus Project plans to use the Hieronymus and Williams detector in initial prosodic markings of the word lattices.

7.8 Speech Generation

The dialogue manager needs to generate questions for the dialogue. Since the Information State (IS) has a representation of the information it needs, the correct question with correct intonation can be generated. Each speech synthesis system has its own way of marking intonation, and there are some standard marking methods, like SABLE, which allow standard intonation marking to be added to the text. Intonation markings for synthesis can be generated automatically from the IS. Some preliminary work on generating speech from the IS forms has shown that the resulting intonation is less ambiguous in intent than the default intonation provided by the TTS system. We intend to extend this work to more complex questions which might be asked in the travel dialogue domain.

Dialogue systems need to ask questions in different ways, rather than asking the same question in the same way in subsequent dialogues. We will explore different ways of generating novel questions which have essentially the same meaning.
One approach is to use data from natural dialogues to collect a list of questions which humans use in similar circumstances. By choosing from this list at random, the system appears more natural. Another technique, developed at CMU, is to generate questions from word trigrams computed over the set of questions used in similar circumstances. The trigrams sometimes produce repeated words, but these can easily be eliminated. This in principle gives a greater variety of sentences, but some of them may seem completely unnatural to a native speaker. We will experiment with ways of generating questions which give the dialogue system many ways to ask the same logical question.

7.9 Speech Synthesis

A range of speech synthesizers is being tried in the Siridus system, both to test different levels of quality and prosodic control, and to provide other languages besides American English. So far we have tried a Bell Labs system which runs on Windows, the IBM ViaVoice synthesizer from Eloquent Systems which runs on Linux, and a Telia synthesizer which runs on Windows. A new set of synthesis systems has been developed by ATR, AT&T Research and Lernout and Hauspie, which concatenate larger sections of speech (words and phrases). These produce more natural sounding speech, but have strange intonations and abrupt transitions at times. We will explore the trade-offs between normal diphone synthesis and larger unit synthesis in the Siridus system. It is not yet clear whether these larger unit synthesis systems are good at placing focus on the words being produced when those words are marked as focused in the text. This feature seems essential in our use of speech output for dialogues.

7.10 Platforms

Many of the latest recognition systems aimed at telephone dialogues use PCs and Windows to do the speech recognition and synthesis.
This is because there is a much greater variety of telephony interfaces for PC platforms. Given the recognition interface, the subsequent processing can be done on other machines running Sun Solaris or Linux. During the Siridus project we will experiment with different platform configurations. Since the core DME, the asynchronous DME, and the OAA are written in Prolog, we expect to use Linux or Solaris platforms to run these components. The Siridus system will use standard interfaces, so that components running on different platforms can be integrated into a final system.

7.11 The Repair Module

In this section we specify the functionality of the repair module we intend to incorporate in the initial Siridus prototype architecture. The core of this module is a set of rules for handling spontaneous speech phenomena, adopted from the corresponding Verbmobil robust semantic processing submodule [20, 21, 17]. The rules represent heuristics for recognising phenomena such as hesitations, self-corrections and false starts, and for attempting to remedy their effects by reconstructing the intended utterance. In this sense the rules repair a fragmented input by reconstructing a coherent utterance out of the consistent parts of the input.

The input and output behaviour of the module, or more abstractly the corresponding functionality, is fairly symmetrical, in that the input and output data structures are of the same type and vary in principle in their size and number. The level at which the rule base is encoded requires that the objects processed be linguistic analyses that exhibit clearly defined syntactic and semantic properties. The original application made use of Verbmobil Interface Terms (VITs) [5], but the individual tests and operations used in the rules are implemented via an abstract data type (ADT) definition, so that the rule application can be adapted to any structures that exhibit the appropriate properties, e.g. HPSG signs.

The Verbmobil module maintained its own internal chart for storing fragments and results, as a VIT Hypothesis Graph [20]. In Siridus we intend to employ a common chart for all linguistic analyses and recognition results, so that storage will be external to the module. Hence the module will read a sequence of fragmentary analyses from the semantic chart and write its result to the same chart. At this stage the process does not impose many constraints on the type of analyses the chart contains, but two conditions must be met: first, the provision of ADT functions supporting the tests and operations in the rules; second, the inclusion of a quality measure in the resulting analyses, since the rules are heuristic both in the tests for recognising the phenomena covered and in the repair operations carried out.

The repair module is not dependent on a preceding analyser completing its task, or for that matter on the completion of the word lattice. It can commence processing as soon as viable fragments are available. However, this does require access to fragments, which is guaranteed by taking the common semantic chart as one of the basic data pools.

7.12 Parsing and Semantic Interpretation

The parsing module is designed to take a word lattice as input and create extra edges encoding syntactic and semantic information. The input need not be complete: the parser creates all the edges it can with the existing partial input, then adds to these as new input arrives.

The parser works similarly to a chart parser. A standard chart parser takes edges between word positions and joins them together. For example, a Determiner from position 1 to 2 and a Noun from position 2 to 3 might be joined to give an NP from position 1 to 3. Here we just adapt this to a lattice, so that we take a Determiner from Node1 to Node2 and a Noun from Node2 to Node3 to give an NP from Node1 to Node3.
In a lattice, edges are between nodes rather than word positions or time intervals, since we can only join together two edges if the end node of the first matches the start node of the second. It is not sufficient for the first edge to finish at the same time point as the second edge starts. For example, consider a recogniser which hypothesises words “abc” and “uvw” between times T1 and T2, and words “def” and “xyz” between times T2 and T3. It may be the case that the recogniser allows “abc” followed by “def” and “uvw” followed by “xyz” but not “abc” followed by “xyz”. Thus “abc” will be an edge finishing in a different node from “uvw”, but at the same time.

As well as creating new edges, the parser also creates a record of how each larger edge was formed, e.g. that the NP edge from Node1 to Node3 can be formed from the Determiner edge between Node1 and Node2 and the Noun edge between Node2 and Node3. The new edges and records encode the syntactic structure in a convenient packed format.

For semantics, the parser similarly creates new edges between nodes in the lattice. Each edge is associated with an indexed semantic representation. This allows similar packing to the syntax, so working with a relatively large lattice should be feasible. Details of the semantic representation and further motivation are given in [15].

The simplest model of processing assumes that the lattice grows ‘edge monotonically’, in the sense that the recogniser adds extra edges as it absorbs more input, and no edges are deleted, but weightings on existing edges may change. This would allow the interaction to be relatively simple: the parser takes a new edge from the lattice (or waits until a new edge is ready), and sends back a new set of syntactic and semantic edges to update the lattice.
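
The node-based joining and the derivation records described in this section can be illustrated with a minimal sketch. It is not the Siridus parser: the names Edge, GRAMMAR, can_join and LatticeChart are hypothetical, the grammar has a single invented rule, and weights and semantics are left out. It assumes the edge-monotonic model, in which edges are only ever added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: str  # lattice node where the edge begins
    end: str    # lattice node where the edge ends
    label: str  # a word or a syntactic category


# Hypothetical toy grammar: (left category, right category) -> parent category.
GRAMMAR = {("Determiner", "Noun"): "NP"}


def can_join(left, right):
    """Edges combine only if they share a node, not merely a time point."""
    return left.end == right.start


class LatticeChart:
    def __init__(self):
        self.edges = set()
        # Packed records: derived edge -> list of (left, right) daughter pairs.
        self.records = {}

    def add(self, edge):
        """Add one edge and return the new derived edges it makes possible."""
        derived_new, agenda = [], [edge]
        while agenda:
            current = agenda.pop()
            if current in self.edges:
                continue  # already known; its records were updated below
            self.edges.add(current)
            if current is not edge:
                derived_new.append(current)
            for other in list(self.edges):
                for left, right in ((current, other), (other, current)):
                    parent = GRAMMAR.get((left.label, right.label))
                    if parent is None or not can_join(left, right):
                        continue
                    derived = Edge(left.start, right.end, parent)
                    pairs = self.records.setdefault(derived, [])
                    if (left, right) not in pairs:
                        pairs.append((left, right))
                    agenda.append(derived)
        return derived_new


# Determiner + Noun example from the text.
chart = LatticeChart()
det = Edge("Node1", "Node2", "Determiner")
noun = Edge("Node2", "Node3", "Noun")
chart.add(det)
print(chart.add(noun))

# "abc" and "uvw" end at the same time (T2) but in different nodes.
abc, xyz = Edge("N1", "N2a", "abc"), Edge("N2b", "N3", "xyz")
print(can_join(abc, xyz))
```

Keeping the daughter pairs in a separate table, rather than inside the edge, is what lets several derivations share one derived edge, which is the packing the text refers to.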
If weightings on individual edges change, the changes would have to be percolated up to derived edges (alternatively, all derived weightings would be expressed as formulae and evaluated on demand).

Unfortunately, parsing the whole lattice is unlikely to be practical. We will explore several options. One is for the parser to work just with the best weighted edges at any point. This would allow a single uniform lattice incorporating the output of the recogniser, parser and repair modules. Another is for the recogniser to output pruned lattices. These would be much smaller and much easier to parse, but there would be no guarantee of edge monotonicity: a pruned lattice at time T2 may not contain all the edges of a pruned lattice at time T1. There are again options here: if we find that in practice the pruned lattices are usually edge monotonic, we may be able to deal efficiently with the occasional non-monotonic cases by adding any new edges required, and keeping the discarded edges with weights set to zero.

7.13 Translation to Dialogue Moves

In this section we specify the functionality of the translation module which we intend to incorporate in the initial Siridus prototype architecture. The translation module takes the output of the repair module plus the dialogue context and maps this to one or more dialogue moves which act as input to the dialogue manager.

The translation module uses the output of the repair module, which it assumes to be a lattice that includes semantic edges. We tend to think of the edges as comprising a semantic representation for the utterance (a ‘semantic chart’ or ‘semantic lattice’): it may be partial, i.e. containing unconnected fragments, and may also represent several different packed readings. Mapping consists of going from the semantic chart to database slots or a task language.
This is achieved by providing mapping rules which are of the form:

partial semantic representation + constraints on context → database slot value/command

The partial semantic representation is matched against the chart and the contextual constraints are checked. If there is a match, then the database slot value is added to a set of potential mappings. The most specific set of consistent mappings is chosen. Milward [15] provides example mappings and more motivation for the approach.

The constraints on context may refer to the last utterance, or to the current state of the task constraints, e.g. which slots are filled in a slot-value model for a particular task. Thus the translation module needs information created and stored by the dialogue manager.

The output of the translation module is a sequence of moves. We take a move to be something with both content and function. For example, ‘to Boston’ results in a move with function ‘add’ and content ‘destination=Boston’. A more complex utterance such as ‘not to London, to Boston’ results in a sequence of two moves:

retract(destination=London);add(destination=Boston)

We intend to experiment with various granularities of move ‘function’, ranging from the minimal distinction above, which states just whether to add or retract, to the finer grained distinctions provided by Conversational Game Theory.

Bibliography

[1] A. Goldschen and D. Loehr. The role of the DARPA Communicator architecture as a human computer interface for distributed simulations. In 1999 Simulation Interoperability Standards Organization (SISO) Spring Simulation Interoperability Workshop (SIW), Orlando, Florida, March 14-19, 1999.
[2] H. Alshawi. The Core Language Engine. M.I.T. Press, 1992.
[3] Marko Auerswald. Kommunikation und synchronization der verarbeitung in einem modularen speech-to-speech translation system. Master’s thesis, Department of Computer Science, University of Kaiserslautern, Germany, 1997.
[4] H. Aust and M. Oerder. Dialogue control in automatic inquiry systems. In Proceedings of ESCA Workshop on Spoken Dialogue Systems, Vigso, Denmark, pages 121–124, 1995.
[5] Johan Bos, Bianka Buschbeck-Wolf, Michael Dorna, and C. J. Rupp. Managing information at linguistic interfaces. In Proc. of the 17th COLING/36th ACL, Montréal, Canada, 1998.
[6] D. Carter, M. Rayner, P. Bouillon, and M. Wirén. Spoken Language Translation. Cambridge University Press, 2000.
[7] J. Ginzburg. Dynamics and the semantics of dialogue. In J. Seligman and D. Westerståhl, editors, Logic, Language and Computation, volume 1. CSLI Publications, 1996.
[8] Object Management Group. The complete CORBA/IIOP 2.1 specification. http://www.omg.org/corba/corbiiop.htm, 1997.
[9] J.L. Hieronymus and B.J. Williams. An investigation of the relation between perceived pitch accent and automatically-located accent in British English. In Proceedings of Eurospeech 91, Genoa, Italy, Vol. 3, pages 1157–1160, 1991.
[10] Andreas Klüter, Alassane Ndiaye, and Heinz Kirchmann. Verbmobil from a software engineering point of view: System design and software integration. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Heidelberg, 2000. To appear.
[11] I. Lewin, R. Becket, J. Boye, D. Carter, M. Rayner, and M. Wirén. Language processing for spoken dialogue systems: is shallow parsing enough? In Accessing Information in Spoken Audio: Proceedings of ESCA ETRW Workshop, Cambridge, 19 & 20th April 1999, pages 37–42, 1999.
[12] I. Lewin and S.G. Pulman. Inference in the resolution of ellipsis. In Proceedings of ESCA Workshop on Spoken Dialogue Systems, Vigso, Denmark, pages 53–56, 1995.
[13] D. Lewis. Scorekeeping in a language game. Journal of Philosophical Logic, 8:339–359, 1979.
[14] Microsoft. Distributed component object model protocol DCOM/1.0. http://www.microsoft.com/activex/+dcom, 1996.
[15] D. Milward. Distributing representation for robust interpretation of dialogue utterances. In Proceedings of ACL 2000, Hong Kong, 2000.
[16] Nuance Communications. Nuance speech recognition system, version 5, developer’s manual. Technical report, Nuance Communications, Menlo Park, California, 1996. http://www.nuance.com.
[17] Manfred Pinkal, C.J. Rupp, and Karsten L. Worm. Robust semantic processing of spoken language. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Heidelberg, 2000. To appear.
[18] A. Stent, J. Dowding, J.M. Gawron, E.O. Bratt, and R. Moore. The CommandTalk spoken dialogue system. In Proceedings of the 37th ACL, Maryland, pages 183–190, 1999.
[19] W. Ward and B. Pellom. The CU Communicator. In IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado, 1999.
[20] Karsten L. Worm. Robust Semantic Processing for Spoken Language. PhD thesis, Universität des Saarlandes, Saarbrücken, Germany, June 2000.
[21] Karsten L. Worm and C. J. Rupp. Towards robust understanding of speech by combination of partial analyses. In Proc. of the 13th ECAI, pages 190–194, Brighton, UK, 1998.