Lecture SS 2013: Multilinguale Mensch-Maschine Kommunikation (Multilingual Human-Machine Communication) – Introduction
Prof. Dr. Tanja Schultz, Dipl.-Inform. Tim Schlippe, Dipl.-Inform. Ngoc Thang Vu, Dipl.-Inform. Dominic Telaar
Tuesday, 16 April 2013

Overview of Lecture 1: Overview and Introduction
• General information about the course
• Introduction of the chair
• Leading up to the topic
• Application examples

General Information: The Course
• Advanced course in the Hauptdiplom – no prior knowledge required
• Examination: yes, within Kognitive Systeme and Anthropomatik
• Schedule: annually in the summer semester, 4+0; examinations only during the lecture period (register early!)
• Dates: Tue 14:00–15:30 (HS -101) and Thu 14:00–15:30 (SR 131); start 16.04.2013, end 18.07.2013
• Lecturers: Prof. Dr. Tanja Schultz and further staff members of the chair

General Information: Course Materials
• All course materials can be found at http://csl.anthropomatik.kit.edu > Studium und Lehre > SS2013 > Multilinguale Mensch-Maschine Kommunikation
  – All slides as PDF (no password protection)
  – Current changes, announcements, syllabus
  – Additional material (papers) where applicable
• Basis for examinations: lecture content, slides, additional material
• Questions, problems, and comments are welcome at any time during the lecture, or in person: CSL, Laborgebäude Kinderklinik, Geb. 50.21, Adenauerring 4
  – Tanja Schultz (tanja.schultz@kit.edu), room 113
  – Tim Schlippe, Ngoc Thang Vu, Dominic Telaar, room 117
• Office hours with Tanja Schultz by appointment

General Information: CSL
• Lehrstuhl für Kognitive Systeme (Cognitive Systems Lab) since 1 June 2007
  – Karlsruher Institut für Technologie, Fakultät für Informatik
  – Institut für Anthropomatik (new since 2009)
  – Homepage: http://csl.anthropomatik.kit.edu
  – Address: Adenauerring 4, 76131 Karlsruhe
• Contact:
  – Prof. Dr.-Ing. Tanja Schultz, tanja.schultz@kit.edu, +49 721 608 46300
  – Secretariat: Helga Scherer, helga.scherer@kit.edu, +49 721 608 46312

Research: Human-Centered Technologies
• Application area human-machine interaction – challenges and tasks: productivity and usability
• Application area human-human communication – challenges and tasks: diversity of languages, cultural barriers, effort and costs
• Human communication with the environment in the broadest sense: speech, motion, biosignals
• Technologies and methods: recognition, understanding, identification; statistical modeling, classification, ...

Teaching at CSL – Winter Semester
• Biosignale und Benutzerschnittstellen – 4+0, examinable in Kognitive Systeme and Anthropomatik; introduction to the acquisition and interpretation of biosignals; application examples
• Design und Evaluation Innovativer Benutzerschnittstellen – 2+0, examinable in Kognitive Systeme and Anthropomatik
• Multilingual Speech Processing – 2+0, lab course; development of speech recognition systems using Rapid Language Adaptation Tools
• Praktikum Biosignale 2: Emotion und Kognition – 2+0; recording and analysis of biosignals (e.g. pulse, skin conductance, respiration) to capture human emotional and cognitive processes

Teaching at CSL – Summer Semester
• Multilinguale Mensch-Maschine Kommunikation – 4+0, examinable in Kognitive Systeme and Anthropomatik; introduction to automatic speech recognition and processing; signal processing, statistical modeling, practical approaches and methods, multilinguality; applications in human-human communication and human-machine interaction; application examples
• Praktikum: Biosignale – practical development: recording of motion data (in cooperation with the sports institute), various biosensors (Vicon, accelerometers, EMG), automatic motion recognition
• Kognitive Modellierung – 2+0, examinable in Kognitive Systeme and Anthropomatik; modeling of human cognition and affect in the context of human-machine interaction; models of human behavior, human learning (relation to and differences from machine learning methods), knowledge representation, emotion models, and cognitive architectures

Working at CSL
• Bachelor's and Master's theses
• Studienarbeiten
• Diplomarbeiten
• Hiwi jobs (student assistant positions)

Attendance List (Hörerliste) – please fill in!
Columns: No. | Last name, first name | Matriculation no. | Subject, semester | Email
Example entry: 1 | SCHULTZ, Tanja | – | Informatik, 36 | tanja@ira.uka.de

Literature
• Xuedong Huang, Alex Acero, and Hsiao-wuen Hon, Spoken Language Processing, Prentice Hall PTR, NJ, 2001 ($81.90 internet price)
• Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series, Englewood Cliffs, NJ, 1993
• Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA, 1997 ($35)
• Schultz and Kirchhoff, Multilingual Speech Processing, Elsevier, Academic Press, 2006 (ask the authors for discounts!)
• Plus various articles (PDF) that we make available on the web (do read them!)
Useful Links, Additional Material
• All slides are posted on the web as PDF: http://csl.anthropomatik.kit.edu > Studium und Lehre > SS2013 > Multilinguale Mensch-Maschine Kommunikation
• Electronic archive of many proceedings and reports of the most important conferences on "Speech and Language":
  – ICASSP (International Conference on Acoustics, Speech, and Signal Processing)
  – Interspeech (merger of Eurospeech and ICSLP)
  – ASRU (Automatic Speech Recognition and Understanding)
  – ACL (Association for Computational Linguistics), NA-ACL (North American ACL)
  – HLT (Human Language Technologies)
  – ...

Related Courses
• Biosignale und Benutzerschnittstellen (Schultz) – speech as a biosignal in a more general framework
• Maschinelle Übersetzung (Waibel) – connection: speech translation, statistical methods, language modeling
• Mustererkennung (Beyerer) – foundations of pattern recognition
• Automatische Spracherkennung (Waibel/Stüker) – foundations of speech recognition
• Praktikum: Multilingual Speech Processing (Schultz) (WS)
• Praktikum: Automatische Spracherkennung (Waibel)
• Seminar: Sprach-zu-Sprach-Übersetzung (Waibel)
• Praktikum: Natürlichsprachliche Dialogsysteme (Waibel)

General Information: Goals of the Course
• Speech in human-machine communication
  – Advantages and disadvantages of speech as an input signal
  – Aspects of multilinguality in speech recognition
• Foundations of speech recognition
  – Basic concepts
  – Speech production and perception
  – Digital signal processing, feature extraction
  – Statistical modeling, classification
  – Acoustic modeling, HMMs
  – Language modeling
• Further topics in speech processing
  – Dialog modeling, synthesis (translation: covered by Prof. Waibel)
• Application examples from research

Today: Application Examples
• Speech recognition: from speech input signal to text
• Speech synthesis: from text to speech output signal
• Speech translation (across language boundaries): from a speech signal in language L1 to a speech signal in L2 = speech recognition + MT + speech synthesis (a minimal code sketch of this cascade follows a few slides below)
• Speech understanding, summarization: from speech input signal to meaning
• Speech activity, however, is not only about what is spoken:
  – Who is speaking? → speaker identification
  – Which language is spoken? → language ID
  – What is being talked about? → topic ID
  – How is it spoken? → emotion ID
  – Who is being addressed? → focus of attention
• Translation (across species boundaries): example dolphins

Introduction
• Each of the lessons covers one topic from "speech recognition and understanding"
• The course covers the most important areas of today's research and also discusses some historic issues
• The goal of the course is to introduce you to the science of automatic speech recognition and understanding
• Today's topic:
  – Why are we doing speech recognition? What are the advantages and disadvantages?
  – Where is it useful? Examples of applications, demos

Why Automatic Speech Recognition? ADVANTAGES:
• Natural way of communication for human beings
• No practicing necessary for users, i.e. speech does not require any teaching, as opposed to reading/writing
• High bandwidth (speaking is faster than typing)
• Additional communication channel (multimodality)
• Hands and eyes are free for other tasks → works in the car / on the run / in the dark
• Mobility (microphones are smaller than keyboards)
• Some communication channels (e.g. the phone) are designed for speech
• ...
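The decomposition mentioned above, speech translation as a cascade of speech recognition, machine translation, and speech synthesis, can be illustrated with a minimal Python sketch. The functions `recognize`, `translate`, and `synthesize` are hypothetical placeholders, not the lab's actual components; only the cascade structure is the point.

```python
# Minimal sketch of a cascaded speech-to-speech translation pipeline
# (ASR -> MT -> TTS). All three components are hypothetical stand-ins;
# a real system would call actual recognizer, translator, and
# synthesizer back ends.

from typing import List

def recognize(audio_l1: List[float], lang: str) -> str:
    """ASR stand-in: map a speech signal in language `lang` to text."""
    return "hello world"          # dummy hypothesis for illustration

def translate(text_l1: str, src: str, tgt: str) -> str:
    """MT stand-in: map source-language text to target-language text."""
    return {"hello world": "hallo welt"}.get(text_l1, text_l1)

def synthesize(text_l2: str, lang: str) -> List[float]:
    """TTS stand-in: map target-language text to an output waveform."""
    return [0.0] * 16000          # one second of "audio" as a placeholder

def speech_to_speech(audio_l1, src="en", tgt="de"):
    """Cascade: speech in L1 -> text in L1 -> text in L2 -> speech in L2."""
    text_l1 = recognize(audio_l1, src)
    text_l2 = translate(text_l1, src, tgt)
    return synthesize(text_l2, tgt)

if __name__ == "__main__":
    out = speech_to_speech([0.0] * 16000)
    print(f"synthesized {len(out)} output samples")
```

The cascade design also makes clear why errors compound: a recognition error is passed on to translation and synthesis, which is one reason later lectures treat each component's statistical modeling in detail.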
Why Automatic Speech Recognition? DISADVANTAGES:
• Unusable where silence/confidentiality is required (meetings, library, spoken access codes) ... we are working on solutions (see later)
• Still unsatisfactory recognition rates when:
  – the environment is very noisy (party, restaurant, train)
  – domains are unknown or unlimited
  – speakers are uncooperative (whisper, mumble, ...)
  – accents, dialects, or code-switching cause problems
  – cultural factors play a role (e.g. collectivism, uncertainty avoidance)
• Speech input is still more expensive than a keyboard

Input Speeds (Characters per Minute)
Mode         | Standard | Best
Handwriting  | 200      | 500
Typewriter   | 200      | 1000
Stenography  | 500      | 2000
Speech       | 1000     | 4000

Where is Speech Recognition and Understanding useful? Human-Machine Interaction:
1. Remote control applications
   • Operating machines over the phone
2. Hands/eyes busy or not useful
   • Speech recognition in cars
   • Help for physically challenged people, nurse bots
3. Authentication
   • Speaker identification/verification/segmentation
   • Language/accent identification
4. Entertainment / convenience
   • Speech recognition for entertainment
   • Gaming
5. Indexing and transcribing acoustic documents
   • Archive, summarize, search and retrieve

Where is Speech Recognition and Understanding useful? Human-Human Interaction:
1. Mediate communication across language boundaries
   • Speech translation
   • Language learning
   • Synchronization / sign language
2. Support human interaction
   • Meeting and lecture systems
   • Non-verbal cue identification
   • Multimodal applications
   • Speech therapy support

Operating Machines over the Phone
• Remote-controlled home: operate heating / air conditioning, turn lights on/off, check email
• Voice-operated answering machine: call the answering machine from anywhere and review recent calls
• Access databases:
  – Pittsburgh bus information with CMU's Let's Go at 412-268-3526
  – Check the weather with MIT's Jupiter at 1-888-573-8255
  – German examples: train information (Erlangen), directory assistance, airlines, cinema programs
• Call centers: route or dispatch calls, 911 emergency line
  – AT&T: "How may I help you?" The HMIHY system was deployed in 2001 and, according to AT&T, was handling more than 2 million calls per month by the end of 2001.
• Use interactive services worldwide: plan your next trip with an artificial travel agent

Hands-Free / Eyes-Free Tasks
• Hands and/or eyes are busy with tools: radio repair, construction site
• Hands and/or eyes are needed to operate machines/cars: hold the steering wheel; pull levers, turn knobs, operate switches; watch the street while driving; monitor a production line
• Hands are working on other people: hair stylist cutting hair, surgeon working on a patient
• Hands and/or eyes are not helpful in the environment: dark rooms (photography), outer space (remote control)

Speech Recognition in Cars
• Use your cellular phone while keeping your hands on the wheel and eyes on the street, e.g. voice dialing
• Operate your audio device while driving
• Dictate messages (e-mails, SMS) – today several companies and services are emerging which do exactly this
• Talk to your personal digital assistant
• Navigation: ask your way through a foreign city, find the nearest restaurant

Support in Everyday Life, Help for Elderly and Physically Challenged
People who are immobile, e.g. lying in bed or in hospital, or who cannot use their hands due to illness or accidents, can
• operate parts of their environment/machines by voice
• ask a robot for help
Examples: Nursebot Pearl and Florence, CMU's robotic assistants for the elderly; ISAC feeding a physically challenged individual, Center for Intelligent Systems, Vanderbilt University.
Children with speaking disorders make significant improvements by trying to make a speech recognizer understand them. Children with dyslexia and similar problems learn to read faster using automatic speech recognition.

Information in Speech
Example utterance (Turkish): "Onune baksana be adam!"
• Speech recognition → words
• Language recognition → language name: Turkish
• Speaker recognition → speaker name: Umut
• Emotion recognition → emotion: angry
• Accent recognition → accent: Istanbul
• Topic ID: chemicals; entity tracking: Istanbul; acoustic scene: bus station; discourse analysis: negotiation
Tanja Schultz, Speaker Characteristics, in: C. Müller (Ed.), Speaker Classification, Lecture Notes in Computer Science / Artificial Intelligence, Volume 4343, Springer, Heidelberg, Berlin, New York.

Speaker Recognition
• Identification: Whose voice is it?
• Verification/detection: Is it Sally's voice?
• Segmentation and clustering: Where are the speaker changes? Which segments are from the same speaker? (figure: segments from speakers Tim and Will)

Speaker Identification/Verification/Recognition
• Verification: verify someone's claimed identity, i.e. is the person who s/he claims to be; instead of a password, say something instead of typing
• Identification: "who is speaking"; identifies a speaker from an enrolled population by searching the database; personalized behavior: customize the machine's reaction automatically to the current user
• Recognition: often used to refer to all problems of verification, identification, and segmentation & clustering

Speaker Segmentation and Clustering
• Segmentation: automatically segment incoming speech by speaker
• Clustering: cluster segments of the same speaker
• Adaptation: use recognition parameters that are optimized for a specific speaker
(figure: Mandarin Broadcast News example with speaker-turn misses, overlapping speech, speech over noise)

Language Identification
• Selecting the recognizer (in multilingual speech recognition)
• Call routing (e.g. 911 emergency line)
• Data analysis and selection
• Special case: accent identification
  – optimization of all system parameters for the speaker's accent
  – e-language learning
(figure: language ID example, output "Japanese")
Tanja Schultz, Identifizierung von Sprachen - Exemplarisch aufgezeigt am Beispiel der Sprachen Deutsch, Englisch und Spanisch, Diplomarbeit, Institut für Logik, Komplexität und Deduktionssysteme, Universität Karlsruhe, April 1995

FarSID: Far-Field Speaker Recognition
• Original signal
• Effect of echo
• Effect of distance
• Effect of room size (e.g. 1 m distance, 0.5 s echo, small room)
Q. Jin, Y. Pan, T. Schultz, Far-Field Speaker Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Toulouse, France, 2006
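One common way to explore the echo, distance, and room-size effects listed on the FarSID slide is to simulate far-field recordings from close-talking speech. The sketch below is only a rough illustration under stated assumptions (a synthetic exponentially decaying impulse response and a simple 1/distance attenuation); it is not the experimental setup of the cited paper.

```python
# Illustrative far-field simulation: convolve a clean (close-talking)
# signal with a synthetic room impulse response and attenuate with
# distance. The impulse response and attenuation model are assumptions
# for demonstration, not the FarSID recording conditions.

import numpy as np

def synthetic_rir(sr=16000, rt60=0.5, length_s=0.6, seed=0):
    """Exponentially decaying noise as a crude room impulse response."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * length_s)) / sr
    decay = np.exp(-6.9 * t / rt60)       # roughly -60 dB after rt60 seconds
    rir = rng.standard_normal(t.size) * decay
    rir[0] = 1.0                          # direct path
    return rir / np.max(np.abs(rir))

def simulate_far_field(clean, sr=16000, distance_m=1.0, rt60=0.5):
    """Reverberate and attenuate a clean signal (simple 1/distance model)."""
    rir = synthetic_rir(sr=sr, rt60=rt60)
    reverberant = np.convolve(clean, rir)[: len(clean)]
    return reverberant / max(distance_m, 1.0)

if __name__ == "__main__":
    sr = 16000
    clean = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s test tone
    far = simulate_far_field(clean, sr=sr, distance_m=1.0, rt60=0.5)
    print(far.shape, float(np.max(np.abs(far))))
```

Comparing speaker recognition accuracy on the clean versus the simulated reverberant signal gives an intuition for why far-field conditions degrade performance.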
Global Communication
The dream (?) of communicating across language boundaries: a babelfish for everybody
• Fun, everyday life: chat in your mother tongue worldwide; travel without communication problems
• Business: negotiate and be sure that your partner is getting it right; the computer has no stakes, e.g. neutral, not lopsided translation
• Face-to-face communication, over the phone or internet
• Text-to-text vs. speech-to-speech
"The Building of the Tower of Babel", 1563, by Pieter Brueghel, Kunsthistorisches Museum, Vienna. The building of the Tower of Babel and the confusion of tongues (languages) in ancient Babylon is mentioned in Genesis. "Babel" is composed of two words, "bab" meaning "gate" and "el" meaning "god"; hence "the gate of god". A related word in Hebrew, "balal", means "confusion".

GALE
GALE = Global Autonomous Language Exploitation: process huge volumes of speech and text data in multiple languages (Arabic, Chinese, English)
• Broadcast news, shows, telephone conversations
• Apply automatic technology to spoken and written language: absorb, analyze, and interpret
• Deliver pertinent information in easy-to-understand forms to monolingual analysts
• Three engines: transcription, translation, distillation

Demonstration GALE – Chinese TV
Mandarin Broadcast News, CCTV, recorded in the US over satellite
• ASR: transforming the Mandarin speech into Chinese text using automatic speech recognition
• SMT: translating the Chinese text into English text using statistical machine translation
H. Yu, Y.C. Tam, T. Schaaf, S. Stüker, Q. Jin, M. Noamany, T. Schultz, The ISL RT04 Mandarin Broadcast News Evaluation System, EARS Rich Transcription Workshop, Palisades, NY, November 2004

PDA Speech Translation in Mobile Scenarios
• Tourism: needs in a foreign country, international events (conferences, business, Olympics)
• Humanitarian needs: humanitarian and government use, emergency line 911 (USA, multicultural population), army, peace corps
A. Waibel, A. Badran, A. Black, R. Frederking, D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna, J. Zhang, Speechalator: Two-way Speech-to-Speech Translation in your Hand, HLT-NAACL 2003, Edmonton, Alberta, Canada, 2003

Verbmobil
Talk to people (face-to-face) from/in other countries in your own language: a step towards Star Trek's "Universal Translator".

Mobility: Personal Digital Assistants
Use your PDA or cellular phone to get help
• Navigation
• Translation
• Information (travel, transportation, medical, ...)
(Demo)

RLAT: Rapid Language Adaptation Tools
Major problem: tremendous costs and time for development
• Very few languages (about 50 out of roughly 6,900) have many resources
• Lack of conventions (e.g. languages without a writing system)
• Gap between technology and language expertise
SPICE: intelligent system that learns a language from the user
• Speech Processing: Interactive Creation and Evaluation toolkit
• Develop web-based toolkits for speech processing: ASR, MT, TTS
• http://cmuspice.org and http://csl.ira.uka.de/rlat-dev (Demo)
Interactive, efficient learning
• Interactive learning: solicit knowledge from the user in the loop; rapid adaptation of language-independent models
• Efficiency: reduce time and costs by a factor of 10
T. Schultz, A. Black, S. Badaskar, M. Hornyak, J. Kominek, SPICE: Web-based Tools for Rapid Language Adaptation in Speech Processing Systems, Proceedings of Interspeech, Antwerp, Belgium, August 2007

Meeting Room
The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea. http://www.is.cs.cmu.edu/meeting_room/

Indexing Acoustic Documents
The world is flooded with information, and more and more of it comes through audio-visual channels. Finding information in acoustic documents requires an intelligent acoustic search engine.

View4You / Informedia
Automatically records broadcast news and allows the user to retrieve video segments of news items on different topics using spoken language input. (Kemp/Waibel 1999)

Education, Learning Languages
• LISTEN: automated reading tutor that listens to a child reading a displayed text aloud and helps where needed
• CHENGO: web-based language learning in a gaming environment for English and Chinese
• Program CALL at CMU on Computer Assisted Language Learning

Robust and Confidential Speech Recognition
Traditional speech recognition:
• Capture the acoustic sound wave with a microphone
• Transform the signal into electrical energy
Requirements and challenges:
• Audibility: speech needs to be perceivable by the microphone (no low voice or whispering, no silent speech)
• Interference: speech disturbs others (no speaking in libraries, theaters, meetings)
• Privacy: the speech signal can be captured by others (no confidential phone calls in public places)
• Robustness: the signal is corrupted by noisy environments (difficult to recognize in restaurants, bars, cars)

Bone Conduction
• When we speak normally, our body is a resonance box: skin and bones vibrate when we speak (try this!)
• Capture this vibration with so-called bone-conducting stethoscopic or skin-conducting microphones (figures: Zheng et al., Jou et al. / Intecs)
• Whispered speech is defined as the articulated production of respiratory sound, with little or no vibration of the vocal folds, produced by the motion of the articulatory apparatus and transmitted through the soft tissue or bones of the head (Nakajima)

Electromyography – Silent Speech
Approach: surface electromyography (EMG)
• Surface = no needles
• Electro = electrical activity
• Myo = muscle
• Graphy = recording
(figure: EMG signals s1, s2 and their difference s1 - s2)
• Measure the electrical activity of the facial muscles by capturing electrical potential differences
• MOTION is recorded, not the acoustic signal; silently moving the lips / articulators is good enough → SILENT SPEECH (Demo)

Lautlose Kommunikation (Silent Communication)
• Demo video: http://csl.anthropomatik.kit.edu/img/EMGDemoVideo1Final.m4v

Delphinisch (Dolphin Communication)
Communication across language boundaries and across species boundaries
• Collaboration with the Wild Dolphin Project (http://wilddolphinproject.com)
• Free-living Atlantic spotted dolphins: identification, behavior, communication
• Communication with dolphins; dolphins try to make contact
• Information from a roughly 20-million-year-old species
• "Dolphone" and "Delphinisch"
• Sound production, perception, frequency, medium
• Pattern recognition, feature extraction, clustering, statistical modeling (see the sketch below)
• Audio and video indexing, archiving, retrieval
• Audio recording, analysis, synthesis, translation
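As a toy illustration of the "pattern recognition, feature extraction, clustering" step mentioned above, the sketch below frames an audio signal, computes crude log band-energy features per frame, and groups the frames with a plain k-means. The feature choice, frame size, number of clusters, and the synthetic test signal are all assumptions for demonstration; this is not the actual dolphin-sound processing chain.

```python
# Illustrative clustering of audio frames: band-energy features + k-means.
# All parameters are demonstration assumptions, not the CSL / Wild Dolphin
# Project pipeline.

import numpy as np

def band_energy_features(signal, frame=1024, n_bands=8):
    """Split the signal into frames and return log band energies per frame."""
    n_frames = len(signal) // frame
    feats = []
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(signal[i * frame:(i + 1) * frame])) ** 2
        bands = np.array_split(spec, n_bands)
        feats.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(feats)

def kmeans(X, k=2, iters=50, seed=0):
    """Plain k-means: assign frames to the nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

if __name__ == "__main__":
    sr = 48000
    t = np.arange(sr) / sr
    # Synthetic test signal: a low tone followed by a high "whistle"-like tone
    sig = np.concatenate([np.sin(2 * np.pi * 500 * t),
                          np.sin(2 * np.pi * 8000 * t)])
    X = band_energy_features(sig)
    labels = kmeans(X, k=2)
    print(labels[:5], labels[-5:])   # early vs. late frames fall into different clusters
```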
Even Beyond Human Speech ... Towards Communication with Dolphins
Why do we want to talk to dolphins?
• They might have a lot to say (a roughly 20-million-year-old species)
• It is a challenging scientific problem
  – crosses language boundaries and species boundaries
  – different sound production, perception, ...
  – different medium (water), transmission, omni-directional
• Nothing is known about dolphins' language
• It involves spending a lot of time in the Bahamas
Why do dolphins want to talk to us? We don't know ... but there is evidence that they try hard.
CMU: www.cs.cmu.edu/~tanja; Wild Dolphin Project (http://wilddolphinproject.com)

Quaero
• Collaborative research and development program
• Developing multimedia and multilingual indexing and management tools, e.g. automatic analysis, classification, extraction, and exploration of information
• Facilitates the extraction of information from unlimited quantities of multimedia and multilingual documents, including written texts, speech and music audio files, and images and videos
• Available to everyone via personal computers, television, and handheld terminals

Conclusions
Speech:
• Is the most natural way of communication for human beings
• Does not require any teaching or practicing
• Has high bandwidth (speaking is faster than typing)
• Supplements other communication channels (multimodality)
Speech recognition is useful:
• In hands-busy and eyes-busy environments
• For mobile / small devices
• As support in everyday life and help for physically challenged people
Speech recognition and understanding:
• Allows machines to be operated (remotely)
• Supports global communication between humans
• Breaks language (and sometimes cultural) barriers