Order number: 2008telb0096

THESIS presented at TELECOM BRETAGNE, under the seal of the Université Européenne de Bretagne, in joint accreditation with the Université de Rennes 1, to obtain the degree of DOCTEUR de TELECOM BRETAGNE, speciality "Traitement du Signal et Télécommunications" (Signal Processing and Telecommunications), by Emmanuel Rossignol THEPIE FAPI.

Noise Reduction and Acoustic Echo Cancellation in the Parameter Domain of CELP-Type Coders Integrated in Mobile Networks (Réduction du Bruit et Annulation de l'Écho Acoustique dans le Domaine des Paramètres des Codeurs de Type CELP, Intégrés dans les Réseaux Mobiles).

Defended on 9 January 2009 before the examination committee:

Rapporteurs:
– Geneviève BAUDOIN, Professor, ESIEE Paris
– Stéphane AZOU, Maître de Conférences HDR, Université de Brest

Examiners:
– Régine LE BOUQUIN-JEANNES, Professor, Université de Rennes 1
– Ramesh PYNDIAH, Directeur d'études, TELECOM Bretagne
– Dominique PASTOR, Professor, TELECOM Bretagne
– Christophe BEAUGEANT, Research Engineer, Infineon Technologies

Invited members:
– Hervé TADDEI, Research Engineer, Huawei Technologies
– Gang FENG, Professor, Université Stendhal - Grenoble 3

Acknowledgements

First of all, I would like to thank Dr. Christophe BEAUGEANT, Dr. Hervé TADDEI and Pr. Dominique PASTOR for offering me the chance to work on this thesis, for always ensuring good working conditions, and for proofreading my dissertation so many times. I am thankful to the sponsors from Siemens Mobile, BenQ Siemens, Siemens Networks and Nokia Siemens Networks for financing my study. Special thanks go to Dr. Jarmo HILLO, Dr. Marcel WAGNER and Ms Stephanie DAHLHOFF. I would like to thank my brother Simon Prosper HAPPI for his affection, as well as my parents Noé and Marie-Claire FAPI. I also want to express my gratitude to my supervisors Dr. Ramesh PYNDIAH and Pr. Dominique PASTOR from TELECOM Bretagne, and to thank Dr. André GOALIC. I am grateful to Pr. Geneviève BAUDOIN and Dr. Stéphane AZOU for accepting to act as rapporteurs for my work, and many thanks go to Pr. Régine LE BOUQUIN-JEANNES for presiding over the examination. I am also indebted to Dr. Mickael De MEULENEIRE and Nicolas DUESTCH for their contributions prior to my work. Many thanks go to the colleagues and students I met over these four years at Siemens Mobile, BenQ Siemens, Siemens Networks and Nokia Siemens Networks, especially Ketra KANG, with whom I had a really good time, for their help and participation in the listening tests. I would like to deeply thank my family and the YEMNGA family, as well as my fiancée Odette YEMNGA and my son Ludovic Fabrice, for their support and their love throughout all these years. I finally dedicate this thesis to the memory of my big sister Brigitte-Chantal TOUKAM FAPI (Dec. 1966 - Feb. 2005).

Abstract

Voice Quality Enhancement (VQE) solutions are now moving from the mobile device to the network, driven by constraints of low complexity and low delay and by the need for centralized control of the network. The deployment of incompatible standardized speech codecs raises interoperability issues between telecommunication networks. To ensure interconnection between networks, transcoding from one codec format to another is necessary. The common point between classical network VQE and standard transcoding is that both process the speech signal in PCM format. An alternative way to perform network VQE is developed in this thesis: the CELP parameters themselves are modified to perform network VQE.
A noise reduction algorithm is implemented in this thesis by modifying the fixed codebook gain and the LPC coefficients of the noisy speech signal, and an acoustic echo canceller is developed by filtering the fixed gain of the microphone signal. These algorithms extrapolate existing time-domain or frequency-domain algorithms into the CELP parameter domain. During this thesis, the algorithms developed in the coded domain have also been integrated into smart transcoding algorithms. The smart transcoding strategy is applied to the fixed codebook gain, the LPC coefficients and the adaptive codebook gain. With this approach, the non-linearity introduced by the coders does not affect the performance of the network AEC. Many functions at the target encoder are skipped, leading to a significant computational load reduction of about 27% compared to the classical approach. The network VQE embedded in smart transcoding has been implemented. Objective metrics (the Signal-to-Noise Ratio Improvement (SNRI) and the Total Noise Level Reduction (TNLR)) indicate that noise reduction integrated in smart transcoding performs better than the classical Wiener method when transcoding from the AMR-NB 7.4 kbps mode to the 12.2 kbps mode; the performance is equivalent when transcoding from the 12.2 kbps mode to the 7.4 kbps mode. The Echo Return Loss Enhancement (ERLE) values of our proposed algorithms improve on the standard NLMS by up to 40 dB, and the 45 dB ERLE required in GSM is achieved.

Keywords: CELP, AMR-NB, VQE based on CELP parameters, Wiener filter, GSM network, smart transcoding.

Résumé

Voice quality enhancement is progressively moving into the networks rather than the mobile terminals. This new approach is motivated by the constraints of delay reduction, complexity reduction and the desire for centralized control of the networks. The deployment of standardized speech coders raises interoperability problems between networks. To ensure interconnection between these networks, transcoding the bit-stream from one coder to the target coder is indispensable. Classical quality enhancement solutions and classical transcoding require the signal in PCM format, that is, the signal samples. An alternative concept for enhancing speech quality in the networks is proposed in this thesis. This approach relies on processing the parameters of CELP-type coders. A noise reduction system is implemented in this thesis by modifying the fixed gain and the LPC coefficients, and two algorithms developed for acoustic echo cancellation modify the fixed gain. These algorithms rely on extrapolating and transposing existing techniques from the time or frequency domain into the domain of CELP coder parameters. During this thesis, we also integrated the above-mentioned algorithms into smart transcoding schemes involving the fixed and adaptive gains as well as the LPC coefficients. With this approach, the complexity of the system is reduced by about 27%, and the problems linked to the non-linearity introduced by the coders are significantly reduced. Regarding noise reduction, objective tests indicate that the performance is better than that of the classical Wiener filter when transcoding from AMR-NB 7.4 kbps to 12.2 kbps.
It is essentially equivalent when transcoding from the AMR-NB 12.2 kbps mode to the 7.4 kbps mode. Objective measurements of acoustic echo cancellation (ERLE) show a gain of more than 40 dB for the proposed algorithms compared to the NLMS, and the minimum threshold of 45 dB set for GSM is reached.

Keywords: CELP coder, AMR-NB, noise reduction and acoustic echo cancellation in the CELP parameter domain, Wiener filter, GSM network, smart transcoding.

Summary of the Chapters

Chapter 1: Introduction and Context of the Thesis

This first chapter introduces the context of the thesis. We highlight the problems caused by environmental noise and acoustic echo during communications over mobile networks. Classical techniques for reducing the effects of noise or acoustic echo essentially use the signal in PCM format. These classical techniques can be implemented in the mobile terminals or directly in the networks; their major drawbacks include the increase in computational cost and the delay they may introduce into the communication. The multiplication of networks, mobile or not, creates interoperability problems today. The classical approach once again requires the signal in PCM format: a transcoding operation, also called tandeming (decoding followed by re-encoding), is required, with degraded speech quality, delay and high computational cost as consequences. Current architectures are progressively turning to implementations of noise reduction and acoustic echo cancellation directly in the networks, also referred to as centralized Voice Quality Enhancement. This introductory chapter states the idea that it is advantageous to directly modify the parameters transmitted by the speech coders in order to enhance speech quality. To address the interoperability problems between networks, smart transcoding is tested as well; it builds on the near similarity of the speech coders deployed in the networks. The central point of this thesis is the embedding of our new algorithms into smart transcoding schemes, the expected result being to solve, in a single module, the interoperability problems and the quality and intelligibility problems due to the presence of noise and acoustic echo.

Chapter 2: Speech Coding and CELP Coding Techniques

The second chapter is devoted to speech coding by linear prediction, in particular Code-Excited Linear Prediction (CELP). CELP coders operate on consecutive segments of the input signal of equal length, called frames. Depending on the coder, the frame length varies between 10 ms and 30 ms so as to stay within the assumption that the speech signal is stationary. The parameters transmitted by CELP-type coders have a physical meaning closely tied to the human speech production system.
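As a side note, this frame-wise processing is easy to picture in code. The following minimal sketch (an illustration only, assuming the 20 ms / 8 kHz framing of the AMR-NB; not code from the thesis) cuts a PCM signal into frames:

```python
import numpy as np

FS = 8000                  # narrow-band sampling rate (Hz)
FRAME_MS = 20              # AMR-NB frame length: 20 ms = 160 samples at 8 kHz
FRAME_LEN = FS * FRAME_MS // 1000

def split_into_frames(speech: np.ndarray) -> np.ndarray:
    """Cut a PCM signal into consecutive frames of equal length.

    Frames of 10-30 ms are short enough for speech to be treated as
    quasi-stationary, the working assumption of CELP coders.
    """
    n_frames = len(speech) // FRAME_LEN
    return speech[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

frames = split_into_frames(np.zeros(FS))  # one second of signal
print(frames.shape)                       # -> (50, 160)
```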
First, the short-term correlation over a frame of the signal is removed by linear prediction: each sample of the frame is estimated as a linear combination of a finite number of preceding samples. The set of prediction coefficients forms the synthesis filter, which in fact models the vocal tract. A prediction residual is obtained as the difference between the input signal and its linear-prediction estimate. This residual is quantized as a linear combination of two codewords taken from two codebooks, the fixed codebook and the adaptive codebook. The first codebook, called adaptive, models the long-term correlation present in the residual, which results from the vibration of the vocal cords. This codebook contains a set of quantized excitations from the most recently coded frames; its codewords are indexed by a value called the pitch, which characterizes the periodicity of the signal in the current frame. Once the optimal codeword is found, its associated gain is also computed. The second codebook, called fixed, contains a set of predefined sequences and encodes the non-predictable information, called the innovation; the coder determines the optimal codeword as well as its associated gain. In both cases, the codeword and its gain are obtained by minimizing the mean squared error between the original signal and the reconstructed signal. This method is called analysis-by-synthesis. The excitation is the sum of the two codewords, each weighted by its quantized gain, and the adaptive codebook is updated by concatenating this excitation to the excitations of the previous frames. The masking properties of the auditory system can be taken into account by weighting the error with a function that depends on the short-term prediction coefficients. The AMR-NB coder, standardized by the 3GPP, is briefly described at the end of the chapter; it is the basis for all the simulations carried out in this thesis. The AMR-NB encodes signals sampled at 8 kHz and can produce a variable bit-rate, also called mode, depending on the channel resources. The bit-rates are: 4.75 kbps, 5.15 kbps, 5.90 kbps, 6.70 kbps (identical to PDC-Japan), 7.4 kbps (identical to D-AMPS IS-136), 7.95 kbps, 10.2 kbps and 12.2 kbps (identical to GSM-EFR). The total number of bits allocated depends on the mode in which the coder operates, but it is the quantization of the fixed codeword that requires the most bits. This coder includes other functionalities, among them two voice activity detection options.

Chapter 3: Noise Reduction in the Parameter Domain of CELP-Type Coders

Algorithms dedicated to the reduction of additive noise are generally implemented in the frequency domain. The noisy signal is first transformed into the frequency domain, and a filter is applied to each frequency component of the noisy signal. The result of this filtering is an estimate of the useful signal, which is then converted back to the time domain via an inverse FFT. The parameters of the filter, or of the gain used, essentially depend on the noise and on the noisy signal. Filtering in the frequency domain comes in several variants, among them spectral attenuation, the Wiener filter and the Ephraim-Malah rule, which are summarized in the first part of this chapter. A minimal sketch of this classical pipeline is given below.
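Here is a minimal sketch of that classical frequency-domain pipeline (a generic short-time Wiener attenuation; the gain floor and the maximum-likelihood SNR estimate are illustrative choices, not the exact configuration studied in the thesis):

```python
import numpy as np

def wiener_enhance_frame(noisy_frame, noise_psd, gain_floor=0.1):
    """One frame of classical frequency-domain noise reduction:
    FFT -> per-bin attenuation -> inverse FFT.

    noise_psd: noise power per rfft bin, e.g. obtained from minimum
    statistics or from a VAD-driven estimator, as described in the text.
    """
    spectrum = np.fft.rfft(noisy_frame)
    noisy_psd = np.abs(spectrum) ** 2
    # Maximum-likelihood estimate of the per-bin SNR from the a posteriori SNR.
    snr = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, 0.0)
    # Wiener rule: G = SNR / (1 + SNR); the floor limits musical noise.
    gain = np.maximum(snr / (1.0 + snr), gain_floor)
    return np.fft.irfft(gain * spectrum, n=len(noisy_frame))
```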
In most cases the filtering requires an estimate of the Signal-to-Noise Ratio (SNR). This chapter proposes two algorithms that reduce noise by modifying, on the one hand, the fixed gain and, on the other hand, the LPC coefficients of the noisy signal. The algorithm dedicated to noise reduction via modification of the fixed gain exploits the Wiener filter and follows up on work initiated before this thesis. The major innovation of the approach is the transposition and extrapolation of the minimum statistics method, generally used in the frequency domain, together with a counterpart of the a priori Signal-to-Noise Ratio introduced in the parameter domain. Since the coding operation is not linear, we propose a simple relation linking the fixed gain of the noisy signal to those of the useful signal and of the noise. The estimated fixed gain of the useful signal is then inserted into the bit-stream. Listening tests showed that this algorithm significantly reduces noise at the decoder, provided the SNR is not too low. The second algorithm proposed in this chapter performs noise reduction via modification of the LPC coefficients of the noisy signal. It exploits the relation linking the LPC coefficients of the noisy signal to those of the noise and of the useful signal. This approach requires voice activity detection (VAD), since the relation between the LPC coefficients only holds if the useful signal and/or the noise are non-zero. When speech is present, we use this relation to estimate the LPC coefficients of the useful signal; when there is no voice activity, we favor spectral attenuation. Experimental results showed that modifying the LPC coefficients improves the spectral characteristics of the speech signal in the presence of noise, especially in voiced sections. Compared with classical Wiener filtering, objective tests show a significant improvement.

Chapter 4: Acoustic Echo Cancellation in the Parameter Domain of CELP-Type Coders

To reduce the undesirable effects of acoustic echo, the traditional approach is to estimate the impulse response of the filter that models the acoustic cavity. This estimation can be done in the time domain or in the frequency domain, and makes it possible to reconstruct the echo, which is then subtracted from the microphone signal. The LMS, the NLMS and the Wiener filter belong to this category. Other techniques limit themselves to computing a gain as a function of the energies of the microphone and loudspeaker signals; once applied to the microphone signal, this gain reduces the impact of the acoustic echo. This approach is better known as Gain Loss Control. These approaches are discussed as a state of the art at the beginning of the chapter. The chapter then proposes two algorithms that directly modify the fixed gain of the microphone signal to reduce the effects of acoustic echo. The first algorithm draws on the classical time-domain Gain Loss Control (GLC): a parallel is drawn between the amplitude of the signal and the fixed gain, and the attenuation coefficients are computed by estimating the signal energy from the coder parameters, as sketched below.
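The following sketch conveys the gain-loss-control idea transposed to the parameter domain (the per-subframe energy estimates, threshold and attenuation value are hypothetical placeholders; the thesis derives the energies from the full set of codec parameters):

```python
import numpy as np

def gain_loss_control(g_fixed_mic, e_mic, e_spk, threshold=2.0, att=0.1):
    """Gain-loss control transposed to the codec parameter domain.

    g_fixed_mic : fixed codebook gains of the microphone bit-stream (one per subframe)
    e_mic, e_spk: per-subframe energy estimates derived from the codec parameters

    When the loudspeaker (far-end) energy dominates, the microphone path is
    assumed to carry mostly echo and its fixed gain is attenuated.
    """
    g_out = np.array(g_fixed_mic, dtype=float)
    for m in range(len(g_out)):
        if e_spk[m] > threshold * e_mic[m]:   # far-end single talk: echo dominates
            g_out[m] *= att                   # attenuate the fixed codebook gain
    return g_out
```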
Experimental checks showed that this algorithm behaves remarkably well during so-called single-talk periods (only one speaker talking). A second algorithm filters the fixed gain. This approach is based on an analogy between classical Wiener filtering in the frequency domain and filtering applied to the coder parameters. It requires an estimate of the fixed gain of the acoustic echo, for which we used a correlation-based method. The filtering accounts for both double-talk and single-talk periods. For this pseudo-Wiener filter, the notion of a Signal-to-Echo Ratio (SER) in the coder parameter domain, analogous to the Signal-to-Noise Ratio (SNR), is introduced; the proposed SER estimate is inspired by the recursive approach of Ephraim and Malah. The advantage of this method is that it behaves relatively well during double-talk periods compared to the GLC. Its performance depends on the type of filter modeling the environment that generates the echo, as well as on the SER.

Chapter 5: Noise Reduction and Acoustic Echo Cancellation in the CELP Parameter Domain, and Smart Transcoding

Chapter 5 constitutes the central point of this thesis. It deals with the concept of integrating the algorithms implemented in chapters 3 and 4 into smart transcoding schemes, that is, with centralized processing: acoustic echo cancellation and noise reduction inside the network. Smart transcoding in this chapter is applied to the LPC coefficients, the fixed gain and the adaptive gain. The LPC coefficients and the fixed gain extracted from the source decoder are passed to our speech quality enhancement modules. At the end of the processing, the enhanced LPC coefficients and fixed gain, together with the adaptive gain, are mapped directly into the target encoder. With this transcoding scheme, the linear prediction analysis at the target encoder is no longer performed, nor is it necessary to compute the fixed and adaptive gains. The computational cost of transcoding between the 12.2 kbps mode and the 7.4 kbps mode, in either direction, is reduced by about 27%, and the delay is reduced by 5 ms when transcoding from the 12.2 kbps mode to the 7.4 kbps mode. Noise reduction behaves similarly to the classical Wiener filter when transcoding from the 12.2 kbps mode to the 7.4 kbps mode; when transcoding from 7.4 kbps to 12.2 kbps, however, the method proposed in this thesis achieves objective performance superior to that of the classical Wiener filter. The most interesting result concerns acoustic echo cancellation: processing the speech coder parameters sidesteps the non-linearity problems introduced by the coders, which degrade the performance of classical stochastic algorithms such as the NLMS. The results (ERLE analysis) show that processing the speech coder parameters easily reaches ERLE values above the 45 dB recommended for GSM.

Chapter 6: General Conclusion

Chapter 6 is devoted to the general conclusion of the work carried out during this thesis.
We first review the achievements of this thesis and the results obtained for noise reduction, acoustic echo cancellation and smart transcoding. The chapter ends with several research directions that would contribute, on the one hand, to improving performance and, on the other hand, to generalizing the concept introduced in this thesis.

Contents

List of Figures
List of Tables

1 Introduction and Context of the Thesis
Introduction
1.1 Motivations
1.1.1 Voice Quality Enhancement in Terminal
1.1.2 Network Voice Quality Enhancement
1.1.3 Transcoding
1.1.4 Alternative Approach
1.2 Objective of the PhD
1.3 Organization of the Document

2 Speech Coding and CELP Techniques
Introduction
2.1 Speech Coding: General Overview
2.1.1 Speech Coder Classification
2.1.2 Speech Coding Techniques
2.2 Analysis-by-Synthesis Principle
2.3 CELP Coding Overview
2.3.1 CELP Decoder
2.3.2 Physical Description of CELP Parameters
2.3.3 Speech Perception
2.4 Standard
2.4.1 The Standard ETSI-GSM AMR
2.5 Conclusion

3 Noise Reduction
Introduction
3.1 Noise
3.2 Noise Reduction in the Frequency Domain
3.2.1 Overview: General Synoptic
3.2.2 Spectral Attenuation Filters
3.2.3 Techniques to Estimate the Noise Power Spectral Density
3.2.3.1 Estimation of the Noise PSD based on the Voice Activity Detection
3.2.3.2 The Minimum Statistic Technique
3.3 Introduction to Noise Reduction in the Coded Domain
3.3.1 Some Previous Works in the Codec Domain
3.4 Noise Reduction by Weighting the Fixed Gain
3.4.1 Estimation of the Noise Fixed Codebook Gain
3.4.2 Relation between Fixed Codebook Gains
3.4.3 Attenuation Function
3.4.4 Noise Reduction Control: Post Filtering
3.5 Noise Reduction through Modification of the LPC Coefficients
3.5.1 Estimation during Voice Activity Periods
3.5.1.1 Estimation of the Noise LPC Vector: Â_D
3.5.1.2 Estimation of the Noise Autocorrelation Matrix: Γ̂_D
3.5.1.3 Estimation of the Speech Autocorrelation Matrix: Γ̂_S
3.5.2 Estimation during Noise Only Periods
3.5.3 Some Experimental Results
3.6 Conclusion

4 Acoustic Echo Cancellation
4.1 Introduction
4.2 Acoustic Echo Scenario
4.3 Acoustic Echo Cancellation: State of the Art
4.3.1 The Least Mean Square Algorithm
4.3.2 The Gain Loss Control in the Time Domain
4.3.3 The Wiener Filter Applied to Acoustic Echo Cancellation
4.3.4 The Double Talk Problem
4.4 Overview on Acoustic Echo Cancellation Approaches in the Coded Parameter Domain
4.5 The Gain Loss Control in the Coded Parameter Domain
4.5.1 Estimation of the Energy
4.5.2 Computation of the Attenuation Gain Factors
4.5.3 Experimental Results Analysis
4.6 Acoustic Echo Cancellation by Filtering the Fixed Gain
4.6.1 System Overview
4.6.2 Approximation of the Joint Function f(.,.) and the Filter G(m)
4.6.3 Estimation of the Echo Signal Fixed Codebook Gain: ĝ_{f,Z}
4.6.4 Experimental Results
4.7 Conclusion

5 Voice Quality Enhancement and Smart Transcoding
5.1 Introduction
5.2 Network Interoperability Problems and Voice Quality Enhancement
5.2.1 Classical Speech Transcoding Scenarios
5.2.2 Classical Speech Transcoding and Voice Quality Enhancement
5.3 Alternative Approach: the Speech Smart Transcoding
5.3.1 The Speech Smart Transcoding Principle and Strategies
5.3.2 Mapping Strategy of the LPC Coefficients
5.3.3 Mapping Strategy of the Fixed and Adaptive Codebook Gains
5.4 Network Voice Quality Enhancement and Smart Transcoding
5.5 The Proposed Architecture
5.5.1 Noise Reduction Integrated in the Smart Transcoding Algorithm
5.5.2 Acoustic Echo Cancellation Integrated in the Smart Transcoding Algorithm
5.6 Experimental Results
5.6.1 Overall Computational Load and Algorithmic Delay
5.6.2 Overall Voice Quality Improvement
5.6.3 Noise Reduction
5.6.3.1 The ITU-T Objective Measurement Standard for GSM Noise Reduction
5.6.3.2 Noise Reduction: Simulation Results
5.6.4 Acoustic Echo Cancellation
5.6.5 Simulation Results
5.7 Conclusion

6 General Conclusion
6.1 Context
6.2 Thesis Contribution
6.2.1 Perspectives

A GSM Network and Interconnection
A.1 GSM Networks Architecture

B CELP Speech Coding Tools
B.1 The Recursive Levinson-Durbin Algorithm
B.1.1 Steps of the Recursive Levinson-Durbin Algorithm
B.2 The Inverse Recursive Levinson-Durbin Algorithm
B.3 The ITU-T P.160
B.3.1 Assessment of SNR Improvement (SNRI)
B.3.2 Assessment of Total Noise Level Reduction (TNLR)

Bibliography

List of Figures

1.1 Acoustic Echo Scenario.
1.2 VQE in a Digital Wireless Network.
1.3 Codec Domain Speech Enhancement.
2.1 Generic Design of a Speech Coder.
2.2 Illustration of Coding Delay.
2.3 Encoder based on Analysis-by-Synthesis.
2.4 Typical CELP Decoder.
2.5 Human Speech Production Mechanism.
2.6 Voiced Sound.
2.7 Unvoiced Sound.
2.8 LPC Coefficients as Spectral Estimation.
2.9 Structure of the Analysis Window.
2.10 Rectangular Window.
2.11 Hamming Window.
2.12 Typical CELP Encoder.
2.13 Decoding Block of the AMR-NB.
3.1 Existing Noise Reduction Unit Location.
3.2 Spectral Attenuation Principle.
3.3 Simplified Block Diagram of the AMR VAD Algorithm, Option 1.
3.4 Example of VAD Decision, Option 1.
3.5 Experimental Setup for the Exchange of Parameters.
3.6 Coded Domain Scaling.
3.7 Fixed Codebook Gain Modification in Parameter Domain.
3.8 Example of Noise Fixed Codebook Gain Estimation.
3.9 (a) clean speech signal; (b) noisy speech fixed gain (red), clean fixed gain (blue); (c) noisy speech fixed gain (red), noise fixed gain (blue).
3.10 (a) clean speech; (b) noisy speech; (c) noisy fixed gain (red), estimated clean fixed gain (blue).
3.11 Estimated Fixed Codebook Gain.
3.12 Principle of NR based on LPC Coefficients.
3.13 Estimation Flowchart of the Clean Speech LPC Coefficients.
3.14 Lag Windowing Values.
3.15 Damping Factor Characteristics.
3.16 Typical Example of Spectrum Damping.
3.17 Typical Estimated Spectrum (SNR_seg = 12 dB): (a) the proposed method displayed with the noisy spectrum; (b) the proposed method compared to the noisy, clean and Wiener-method spectra.
4.1 Acoustic Echo Scenario.
4.2 System Identification in AEC.
4.3 Control Characteristics of the Microphone in Gain Loss Control.
4.4 Combined AEC/CELP Predictor.
4.5 Gain Loss Control in the Codec Parameter Domain.
4.6 Example of Energy Estimation in the Codec Parameter Domain.
4.7 Characteristics of the Attenuation Gains.
4.8 Example of AEC based on Gain Loss Control.
4.9 Typical Example of the Evolution of the Attenuation Factor.
4.10 Filtering of the Microphone Fixed Codebook Gain Principle.
4.11 Example of AEC by Filtering the Fixed Gain.
5.1 Generic GSM Interconnection Architecture.
5.2 Transcoding, Classical Approach.
5.3 Network VQE, Classical Solution.
5.4 Smart Transcoding Principle.
5.5 Transcoding Example from 7.4 kbps mode to 12.2 kbps mode: Spectrum of the Associated Synthesis Filters.
5.6 Transcoding Example from 12.2 kbps mode to 7.4 kbps mode: Spectrum of the Associated Synthesis Filters.
5.7 Adaptive Gains in Transcoding: Typical Example during Transcoding from 7.4 kbps to 12.2 kbps mode.
5.8 Typical Example of Decoded Fixed Codebook Gains during Transcoding from 7.4 kbps mode to 12.2 kbps mode.
5.9 Fixed Codebook Gains Mapping, Transcoding from 7.4 kbps mode to 12.2 kbps mode.
5.10 Fixed Codebook Gains Mapping, Transcoding from 12.2 kbps mode to 7.4 kbps mode.
5.11 Structure of the Codec Domain VQE Embedded in Smart Transcoding.
5.12 Proposed Architecture.
5.13 Flowchart of Noise Reduction in Smart Transcoding.
5.14 Overview of the Gain Loss Control Integrated in Smart Transcoding.
5.15 Filtering of the Fixed Codebook Gain Integrated in Smart Transcoding.
5.16 Objective Metrics versus Segmented SNR, Transcoding from 12.2 kbps mode to 7.4 kbps mode: Proposed NR Method (blue dashed circle), Wiener NR Method (red dashed diamond).
5.17 Objective Metrics versus Segmented SNR, Transcoding from 7.4 kbps mode to 12.2 kbps mode: Proposed NR Method (blue dashed circle), Wiener NR Method (red dashed diamond).
5.18 Spectrogram of the Noisy Speech Signal: 6 dB Segmented SNR.
5.19 Spectrogram of the Noisy Speech Enhanced with the Standard Wiener Filter.
5.20 Spectrogram of Coded Domain Enhancement: Transcoding from 12.2 kbps mode to 7.4 kbps mode.
5.21 Spectrogram of Coded Domain Enhancement: Transcoding from 7.4 kbps mode to 12.2 kbps mode.
5.22 Time Evolution of the ERLE: from 12.2 kbps mode to 7.4 kbps mode, Case Filter h1.
5.23 Time Evolution of the ERLE: from 7.4 kbps mode to 12.2 kbps mode, Case Filter h1.
A.1 Generic GSM Interconnection Architecture.

List of Tables

2.1 Bit-rate Classification of Speech Coders.
2.2 Vocoder Relationship.
2.3 12.2 kbps Mode Algebraic Codebook Positions.
4.1 Mean Linear Coefficients in Double Talk Mode.
4.2 Mean and Standard Deviation of Opinion Scores.
5.1 Total Average of the Objective Metrics.
5.2 Echo Return Loss Enhancement Values.
B.1 Threshold Level for Speech Classification.
B.2 Objective Metrics Requirement.
List of Abbreviations

AbS: Analysis-by-Synthesis
ACR: Absolute Category Rating
ACELP: Algebraic Code-Excited Linear Prediction
ADC: Analog-to-Digital Converter
ADPCM: Adaptive Differential Pulse Code Modulation
AE: Acoustic Echo
AEC: Acoustic Echo Cancellation
AMR-NB: Adaptive Multi-Rate Narrow-Band
AMR-WB: Adaptive Multi-Rate Wide-Band
ATH: Absolute Threshold of Hearing
BTS: Base Transceiver Station
BSC: Base Station Controller
BSS: Base Station Subsystem
CELP: Code-Excited Linear Prediction
CODEC: COder/DECoder
CNG: Comfort Noise Generator
CPU: Central Processing Unit
DAC: Digital-to-Analog Converter
dB: Decibel
dBov: Decibel-Overload
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
DPCM: Differential Pulse Code Modulation
DSN: Difference SNRI TNLR
DSP: Digital Signal Processor
DTD: Double Talk Detection
DTX: Discontinuous Transmission
FFT: Fast Fourier Transform
FFG: Filtering of the Fixed Gain
FIR: Finite Impulse Response
GLC: Gain Loss Control
GMSC: Gateway Mobile Switching Center
GSM: Global System for Mobile communications
HLR: Home Location Register
ISDN: Integrated Services Digital Network
ITU-T: International Telecommunication Union, Telecommunication standardization sector
IIR: Infinite Impulse Response
IP: Internet Protocol
KBPS: Kilo Bits Per Second
LMS: Least Mean Square
LP: Linear Prediction
LPC: Linear Prediction Coefficients
LSF: Line Spectral Frequencies
LSP: Line Spectral Pair
LTP: Long Term Prediction
MA: Moving Average
MOS: Mean Opinion Score
MSC: Mobile Switching Center
MD: Mobile Device
NLMS: Normalized Least Mean Square
NPLR: Noise Power Level Reduction
NSS: Network Switching Subsystem
NR: Noise Reduction
PCM: Pulse Code Modulation
PESQ: Perceptual Evaluation of Speech Quality
PSTN: Public Switched Telephone Network
PLMN: Public Land Mobile Network
QoS: Quality of Service
SER: Signal-to-Echo Ratio
SNR: Signal-to-Noise Ratio
SNRI: Signal-to-Noise Ratio Improvement
SSNR: Segmental Signal-to-Noise Ratio
STP: Short Term Prediction
TNLR: Total Noise Level Reduction
TRAU: Transcoder and Rate Adaptation Unit
UMTS: Universal Mobile Telecommunications System
VAD: Voice Activity Detection
VoIP: Voice over Internet Protocol
VQE: Voice Quality Enhancement
X-LMS: X-Filter Band Least Mean Square

Chapter 1
Introduction and Context of the Thesis

Introduction

In mobile communication, voice quality and intelligibility are two of the most important customer satisfaction factors. Therefore, in wireless telecommunication scenarios (mobile to mobile, or mobile to other networks), as described in Sec. 1.1, the communication system should be designed so that the perceived sound impression on the listener's side is as close as possible to a face-to-face conversation. In a telecommunication scenario, the speech signal reaching the end user is generally affected by external impairments due to three main problems. The first problem is that Mobile Device (MD) systems, or terminals, are usually used on the move, especially in noisy surroundings (busy offices, high street traffic, airport halls, restaurants, moving vehicles, etc.). As a consequence, both the receiver and transmitter ends are surrounded by background noise. Voice quality and intelligibility can be significantly affected by noise, resulting in tiredness for the listeners and difficulty in understanding each other. Speech quality and intelligibility can be improved by a noise reduction algorithm.
Hence, the noise reduction algorithm should efficiently reduce the background noise by significantly increasing the signal-to-noise ratio of the speech. The algorithm should also have minimal effect on the speech signal itself (distortion, clicks, buzzing), and it should be able to smartly manage various conditions: reducing the noise to a comfortable level while retaining its basic characteristics, maximum noise attenuation, and a controlled degree of noise reduction aggressiveness [Loizou 2007].

A second problem is the Acoustic Echo (AE), whose presence is largely due to the low-quality amplification systems of mobile devices, hands-free telephony, and the room environment around the speaker. AE materializes as the acoustic coupling between the loudspeaker and the microphone. As depicted in Fig. 1.1, the microphone at the sending side does not only capture the near-end speech signal s(t); it also captures a delayed version of the far-end speech signal z(t), leading to the superposition of sound waves captured by the microphone: y(t) = s(t) + z(t). This phenomenon greatly affects conversation quality, and the remote speaker (the receiver side in Fig. 1.1) experiences the annoying effect of hearing his own voice with a delay, usually of about 200-500 ms, since transmission times over mobile networks can be particularly high. Acoustic echo additionally increases the voice activity factor; this is particularly true with the Adaptive Multi-Rate (AMR) coder, where a Discontinuous Transmission (DTX) module is integrated, so that in the uplink the radio efficiency is reduced [3GPP 1999b]. Acoustic Echo Cancellation (AEC) is strongly recommended [ITU-T 2004] to reduce echo effects, and it should not introduce excessive delay.

Figure 1.1: Acoustic Echo Scenario (the far-end signal x(t) played by the loudspeaker traverses the echo path, and the microphone captures y(t) = z(t) + s(t)).

The third problem is related to the fact that the world of telecommunication is becoming more and more heterogeneous. The proliferation of mobile devices and the development of several speech coders for different networks have led to the deployment of coders that are not interoperable with each other. To interoperate, bit-streams need to be converted in the gateways separating the networks: the bit-stream of one codec is decoded and re-encoded into the target codec's bit-stream format. This process can also be performed inside the Base Station Subsystem (BSS). Such a solution, called transcoding, is computationally demanding, degrades speech quality and increases algorithmic delay. Solutions to overcome transcoding problems are in development, but are still limited to configurations where the coders (sending and receiving) at every switching stage are similar.

Due to these external impairments, voice quality enhancement algorithms are more critical than ever. Noise reduction and acoustic echo cancellation algorithms can be used to overcome speech degradation. Current solutions are based on highly elaborate and complex algorithms, and their common principle for reducing noise effects and/or cancelling acoustic echo is to deal directly with the speech samples. To describe where NR and AEC are actually implemented, a short overview of the GSM architecture is necessary (see the next section, Annex A, and [Halonen et al. 2002]).
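Before moving on, a toy simulation may help fix the echo scenario of Fig. 1.1 and the Echo Return Loss Enhancement (ERLE) measure quoted in the abstract. The sketch below uses synthetic signals, an invented echo path and a textbook NLMS canceller; none of it is the thesis's test setup:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, L = 8000, 64
x = rng.standard_normal(2 * fs)                           # far-end signal x(n)
h = rng.standard_normal(L) * np.exp(-np.arange(L) / 8.0)  # toy echo path
z = np.convolve(x, h)[: len(x)]                           # echo z(n) at the microphone
y = z.copy()                                              # single talk: y(n) = z(n)

w = np.zeros(L)                     # FIR estimate of the echo path
mu, eps = 0.5, 1e-6
residual = np.zeros(len(x))
for n in range(L - 1, len(x)):
    xv = x[n - L + 1 : n + 1][::-1]  # last L far-end samples, most recent first
    e = y[n] - w @ xv                # residual echo after cancellation
    w += mu * e * xv / (xv @ xv + eps)  # normalized LMS update
    residual[n] = e

# ERLE: ratio (in dB) of echo power to residual echo power, after convergence.
erle = 10 * np.log10(np.sum(y[fs:] ** 2) / np.sum(residual[fs:] ** 2))
print(f"ERLE after convergence: {erle:.1f} dB")
```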
Figure 1.2: VQE in a Digital Wireless Network (possible NR/AEC locations in the terminal, in the radio access network (BTS, BSC, TRAU) and in the core network (MSC), between the speech and channel coding stages).

1.1 Motivations

This thesis took place in the audio coding and speech processing departments of Siemens Mobile, BenQ Mobile and Nokia Siemens Networks, successively. These departments have been involved in the International Telecommunication Union - Telecommunication standardization sector (ITU-T) activities related to embedded speech and audio coding and speech enhancement. The purpose of the thesis is to answer the three problems mentioned above (background noise and/or acoustic echo presence, and transcoding problems). For a better understanding of how these three problems will be addressed, a brief overview of a wireless network is useful. In Fig. 1.2, three modules of a typical wireless network (Universal Mobile Telecommunications System (UMTS) or GSM) are depicted: the mobile device (receiver or transmitter side), the radio access network, and the core network. Owing to the symmetry between the transmitter and receiver ends, only the transmitter end is shown in detail. The transmitter-end device captures noise d(n) and/or echo signals z(n) simultaneously with the clean speech signal s(n). The second module represents the radio access network that controls the radio link with the mobile device. The third module shows the core network, where the Mobile Switching Center (MSC) and the gateway are located. Fig. 1.2 shows where Voice Quality Enhancement (VQE) algorithms, like NR and AEC, are generally implemented or deployed. The first location is directly inside the terminal; the second possible location is within the network.

1.1.1 Voice Quality Enhancement in Terminal

According to Fig. 1.2, enhancement is achieved as pre-processing before encoding, or after decoding near the loudspeaker. Existing techniques use the corrupted speech signal in PCM format to perform noise reduction and/or acoustic echo cancellation. The algorithms deployed are generally based on transform techniques (FFT, DCT, DFT, block processing, sub-band implementations, etc.). The complexity of such approaches is high and is constrained by the CPU of the DSP design; real-time processing constraints also make it difficult and expensive to implement more advanced algorithms. Additionally, terminal solutions depend today on the terminal manufacturers; as a consequence, the network providers cannot control the quality delivered by all terminals. Some solutions were initiated by incorporating speech enhancement inside speech coders, but they lead to the same problems as other terminal solutions.

1.1.2 Network Voice Quality Enhancement

Placing speech enhancement in the network leads to a technique called network voice quality enhancement, performed either at the MSC or near the radio access network ([Eriksson 2006], [Cotanis 2003]). The algorithms implemented (NR, AEC) are similar to those used in terminal devices.
As PCM samples are not available, processing is achieved by decoding the bit-stream, performing noise reduction and/or acoustic echo cancellation (on the uplink) in the time or frequency domain, and re-encoding. Such solutions are implemented near the MSC or between the transcoder and the MSC. Another interesting configuration is where user A from a (wireless or wireline) network A (coder A) is in conversation with user B from another (wireless or wireline) network B (coder B). Transcoding then needs to be performed inside the network for bit-stream conversion and interoperability; in practice it is performed in the gateway. The bit-stream from coder A is decoded by decoder A and the decoded speech signal is re-encoded with encoder B. Such a solution always degrades speech quality and introduces distortion due to the superposition of multiple quantization noises (encoding-decoding-encoding-decoding). In the presence of noise and/or echo, 'classical' speech enhancement (NR and AEC) must be performed after the decoding stage and before the re-encoding process: see Fig. 1.2.

1.1.3 Transcoding

If coder A and coder B use a similar technology, Code-Excited Linear Prediction (CELP) for example, the parameters that are transmitted are of the same kind. An interesting technique involves directly mapping some parameters of encoder A inside encoder B, leading to partial encoding: the encoding process is thus reduced. This approach is suitable when the speech signal is not corrupted by noise and/or acoustic echo. The technique, known as Smart Transcoding, has already been experimented with good results, as in [Kang et al. 2003], [Beaugeant, Taddei 2007] and [Ghenania 2005]. If the speech signal is corrupted by noise and/or echo, the bit-stream from encoder A is decoded with decoder A, speech enhancement is performed on the PCM samples, and the enhanced speech signal is then re-encoded using encoder B. Current voice quality enhancement methods in the network are computationally expensive and introduce algorithmic delay.

1.1.4 Alternative Approach

Developments of acoustic echo cancellation and noise reduction units are now moving from the terminals (MD) into the networks. There are several reasons for, and advantages to, this new placement of VQE algorithms. First of all, central control of the network quality is desirable. Indeed, network providers have a high diversity of devices in their networks, with various levels of speech quality. At the same time, the quality of mobile devices has not been particularly enhanced over the last decade: new challenges have appeared, like the miniaturization of devices, without speech quality being a high-priority focus. Concrete industrial development therefore leads to placing VQE in the network as a solution that is as efficient as, or even better than, those built into terminals. Enhancement of speech quality and intelligibility is related to the characteristics of the perturbation source, and even to the platform design of the dedicated algorithm [Loizou 2007]. The drawbacks of the 'classical' solutions mentioned in Sec. 1.1.1 and 1.1.2 lead to the idea that VQE could be performed directly by modifying the available bit-stream. Modifying the parameters composing the bit-stream avoids the entire decoding/encoding process of the 'classical' solution: it reduces the computational load and avoids tandeming effects.
Such solutions were initiated in [Chandran, Marchok 2000] and were recently extended to automatic level control [Pasanen 2006] and frame energy estimation [Doh-Suk et al. 2008]. The principle of this concept is depicted in Fig. 1.3.

Figure 1.3: Codec Domain Speech Enhancement (the corrupted parameters are extracted from the bit-stream, processed, and mapped back into the enhanced bit-stream; the other parameters pass through unchanged).

Additionally, this new speech enhancement solution (modification of the coded parameters) can easily be integrated into Smart Transcoding schemes. In combination with Smart Transcoding, our idea involves modifying some CELP parameters before mapping them into the target bit-stream. With such solutions, delay is minimized, computational demand is reduced, and the problems of the classical transcoding solution are kept under control.

1.2 Objective of the PhD

This PhD investigation addresses both the interoperability problem in speech coding and the voice quality enhancement problem. Its purpose is the conception and implementation of flexible, single-microphone speech enhancement algorithms providing good speech quality and intelligibility. The proposed algorithms operate on CELP parameters. Above all, these algorithms can be located anywhere inside the network, since they are applied to the bit-stream; their main feature is that they do not require the decoded speech signal. Based on the discussion of Sec. 1 and 1.1, this PhD first proposes algorithms to enhance CELP parameters (AMR-NB) degraded by background noise and/or acoustic echo without requiring the PCM speech samples. The developed algorithms are compared to the two existing 'classical' voice quality enhancement locations described in Sec. 1.1.1 and 1.1.2. The proposed speech enhancement algorithms should also satisfy constraints of computational load, implementation flexibility and real-time processing (algorithmic delay). Additionally, this study relies on knowledge of CELP coding techniques and of the parameters transmitted in the bit-stream. This PhD also explores the Smart Transcoding principle and interoperability problems; this part includes investigations of different architectures and configurations of transrating between AMR-NB modes, and presents results regarding the mapping of parameters and their impact on speech quality and intelligibility. The final step of the PhD is the investigation of embedded solutions, meaning the integration of our new algorithms (noise reduction and/or acoustic echo cancellation) inside Smart Transcoding schemes. The chosen embedded architecture should be implemented within a transrating scheme between different AMR-NB modes. Our main contributions in this domain of research are:

– The proposition and implementation of noise reduction algorithms that modify the coded parameters of the AMR-NB. Such algorithms can be located inside the network, or in any area where the coded parameters can be recovered.
– The proposition and implementation of acoustic echo canceller algorithms located inside the network, operating through the modification of the coded parameters of the AMR-NB. As above, these algorithms can be implemented in any area where the coded parameters can be recovered.
– NR and/or AEC solutions embedded inside smart transcoding schemes between different AMR-NB modes, the smart transcoding being applied during the mapping of the parameters.

1.3 Organization of the Document

The present document is organized in six chapters, including the present one.

– Chap. 2 introduces the elementary principles of CELP coding techniques. It also highlights the physical description and signification of the parameters transmitted inside the bit-stream by CELP coders. An overview of the AMR-NB architecture is also given.
– Chap. 3 and 4 are dedicated to noise reduction and acoustic echo cancellation, respectively. Before developing new algorithms in the parameter domain, we start with a state of the art of existing techniques and present recent works based on the modification of coded parameters. New algorithms dealing with CELP parameters are then presented, for noise reduction in chapter 3 and for acoustic echo cancellation in chapter 4. Both chapters conclude with experimental results from objective and subjective tests comparing with the classical speech enhancement approach.
– Chap. 5 can be viewed as an application of the new algorithms introduced in chapters 3 and 4. The problem of centralized, or network, VQE is introduced. We start with a description of Smart Transcoding and of the interoperability problems. The purpose of the chapter is to integrate the proposed noise reduction and acoustic echo cancellation algorithms inside a smart transcoding scheme; the performance of such an architecture is studied via objective tests.
– Chap. 6 concludes this work. The critical points of the PhD are highlighted, and some perspectives for improving this new voice quality enhancement approach are indicated.

Chapter 2
Speech Coding and CELP Techniques

Introduction

Digital technology is based on sampling theory, which states that a continuous signal can be reconstructed without distortion from a discrete set of samples. The rate at which samples are taken from the continuous signal determines the bandwidth of the discrete signal, i.e. the span of frequencies it can contain. The Nyquist-Shannon sampling theorem stipulates that the sampling rate must be at least twice the highest frequency present in the signal [Goldberg, Riek 2000], [Shannon 1949]. Traditional telephony provides a bandwidth limited to approximately 3.4 kHz, much less than what our ear can perceive. To improve the user experience one could increase the bandwidth or the sampling frequency; telecom operators, however, focus on reducing the quantity of information to be transmitted rather than on widening the frequency band, and it is at this point that coding theory is exploited for speech compression.
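As a small numerical illustration of the sampling condition (hypothetical numbers, not taken from the thesis): a 3 kHz tone inside the telephone band is represented faithfully at 8 kHz, but folds to a lower apparent frequency when sampled below its Nyquist rate of 6 kHz.

```python
f_signal = 3000.0  # a 3 kHz tone, inside the 3.4 kHz telephone band

for fs in (8000.0, 5000.0):
    # A sampled tone is indistinguishable from its alias: the apparent
    # frequency folds back around the nearest multiple of fs.
    alias = abs(f_signal - fs * round(f_signal / fs))
    verdict = "faithful" if fs >= 2 * f_signal else "aliased"
    print(f"fs = {fs:.0f} Hz -> apparent frequency {alias:.0f} Hz ({verdict})")

# fs = 8000 Hz -> apparent frequency 3000 Hz (faithful)
# fs = 5000 Hz -> apparent frequency 2000 Hz (aliased)
```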
This technique tries to mimic the human speech production apparatus. The vocal tract is modeled by a set of LPC coefficients. The excitation signal is a linear combination of an adaptive excitation (vocal cords) and of a fixed excitation (noise-like signal). This approach produces a good approximation of the human speech production model. In this chapter, computational details of the CELP coder are not presented. The objective here is to acquire a good knowledge of the parameters transmitted by the CELP encoder. We will especially focus on the physical description and significance of the transmitted parameters received at the CELP decoder.

This chapter starts with a general overview of speech coding in Sec. 2.1. In Sec. 2.2, the principle of analysis-by-synthesis is presented. The CELP coding technique is widely discussed in Sec. 2.3, where a physical interpretation of the CELP parameters is also given. As an application of the CELP technique, Sec. 2.4 is dedicated to the description of the AMR-NB standard.

2.1 Speech Coding: General Overview

A basic structure of a speech encoder and decoder is depicted in Fig. 2.1 (see [CHU 2000]). The processing at the encoder is frame-wise, taking into account the quasi-stationarity of the speech signal. The input speech signal in PCM format (16-bit PCM at a sampling rate of 8 kHz, which would require a bit-rate of 128 kbps without compression) is analyzed to extract a number of pertinent parameters. These parameters characterize the speech frame under analysis. The computed parameters are quantized and sent together as a compressed bit-stream. As speech coding aims at reducing the bit-rate, the compressed bit-stream should have a bit-rate lower than 128 kbps. The channel coding processes the encoded digital speech data for error protection. Finally, the channel-protected bit-stream is transmitted. This PhD does not address specific problems related to source coding and channel coding; information on channel theory and channel coding can be found in [Bossert 1999]. At the decoder side, Fig. 2.1 (b), the bit-stream is unpacked and the quantized parameters are obtained. The synthetic speech is generated by synthesizing and processing the decoded parameters.

Figure 2.1: Generic Design of a Speech Coder — (a) encoder: analysis and processing, extraction/encoding of parameters 1 to N, packing (multiplexing) into the bit-stream; (b) decoder: unpacking, decoding of parameters 1 to N, synthesis and processing.

2.1.1 Speech Coder Classification

In speech coding, many encoding techniques have been developed. Current coders, candidates for standardization, should satisfy some particular attributes which are used for classification [Kleijn, Paliwal 1995]. These attributes are the following.

Bit-Rate

The bit-rate specifies the number of bits required to encode a speech signal. The minimum bit-rate achievable by a speech coder is limited by the amount of information and redundancy contained in the speech signal. A sampling frequency of 8 kHz is commonly used for speech encoding (e.g. in telephony) and the input samples are usually 2-byte (16-bit) samples. Therefore the input bit-rate that the coder attempts to reduce is: 8 kHz × 16 bits = 128 kbps.
Tab. 2.1 classifies speech codecs according to their bit-rate.

Category          | Bit-rate
High bit-rate     | > 15 kbps
Medium bit-rate   | 5 to 15 kbps
Low bit-rate      | 2 to 5 kbps
Very low bit-rate | < 2 kbps

Table 2.1: Bit-rate Classification of Speech Coders.

Subjective Quality

This attribute refers to the perceived quality of the reconstructed speech signal at the receiver end, meaning the intelligibility and naturalness of the spoken words and their ability to be understood.

Complexity

Computational demand is one of the key issues; usually, a low bit-rate implies a high complexity. There is also an important requirement on memory storage, directly related to the algorithmic complexity: sophisticated coders need a large amount of fast memory to store intermediate coefficients and codebooks.

Delay

The algorithmic complexity generally implies algorithmic delay. As depicted in Fig. 2.2, the overall coder delay is the sum of several delay components: encoder buffering delay, encoder processing delay, transmission delay, decoder buffering delay and decoder processing delay. Buffering for real-time implementation entails some delay that should also be minimized.

Figure 2.2: Illustration of Coding Delay — from input frame buffering through encoding, transmission and decoding to the output frame.

Bandwidth

This characteristic refers to the frequency range which the coder is able to reproduce.

2.1.2 Speech Coding Techniques

Another attribute that differentiates coders is their coding technique. The speech coders developed so far can be classified into three groups:

– Waveform approximating coders: here, the speech signal is digitized and each sample is coded with a constant number of bits (G.711 or PCM, [ITU-T 1988]). They provide high quality at bit-rates greater than 16 kbps; below this limit, the quality degrades rapidly. The number of bits for the quantization can be reduced when the difference between one sample and its predicted version is coded. The G.726 or ADPCM (Adaptive Differential Pulse Code Modulation) is an example of such a waveform coder [Daumer et al. 1984].
– Parametric coders: based on frame-wise processing of the input digital speech signal, this kind of coder uses a model to generate and estimate a set of parameters that are quantized and transmitted. The frame size is about 10 − 30 ms and the decoded speech signal is intelligible, but the perceptual quality of such coders depends on the model used. The most successful coder in this group is the LP vocoder, where a filter modeling the vocal tract is derived from linear prediction. The parameters sent to the decoder are the filter coefficients, the unvoiced/voiced state of the frame, the variance of the excitation signal and, for voiced frames, the pitch period. The bit-rate of such coders is within the range of 2 to 4 kbps, and these coders are especially used in military applications, permitting high data protection and encryption [Federal Standard 1015, Telecommunications: Analog to Digital Conversion of Radio Voice By 2400 Bit/Second Linear Predictive Coding 1984].
– Hybrid coders: these coders can be regarded as a combination of parametric and waveform coders. The parameters of the model are fitted such that the decoded speech approximates the original time-domain waveform as closely as possible.
In this class, the commonly used technique is analysis-by-synthesis. This technique makes use of the same linear prediction as vocoders. The excitations, however, are computed in a uniform way, independently of the type of speech segment (voiced or unvoiced): the excitation signal is a linear combination of a periodic part (adaptive excitation) and a noise-like part (fixed excitation). The bit-rate lies between 4 and 16 kbps [Schroeder, Atal 1985a].

2.2 Analysis-by-Synthesis Principle

The analysis-by-synthesis principle is used in CELP coding. This section gives a brief description of the technique (see [Kondoz 1994], [Atal, Remde 1982] and [Schroeder, Atal 1985b]). In CELP coders, the speech is represented by a set of parameters. One way to select the parameters, called open-loop, is to analyze the speech and extract a group of parameters. With the analysis-by-synthesis scheme, also called closed-loop, the speech signal s(n) is reconstructed by the encoder itself, giving a synthetic speech signal s̃(n). The reconstruction is performed by a model of speech production that depends on certain parameters. During the closed-loop procedure, the reconstructed signal is compared to the original input speech signal according to a defined criterion (typically, a perceptually weighted mean squared error between the original and the synthesized speech). Based on this criterion, the best configuration of the quantized parameters is selected and its index or indices are transmitted to the receiver. At the receiver side, the decoder uses techniques similar to those implemented at the encoder side to re-synthesize the original speech signal.

As shown in Fig. 2.3, the parameters are thus chosen conditionally on an error criterion. This principle is common to many coders, and CELP coders use it to find optimum excitation sequences. Additionally, the quantization exploits spectral masking: the quantization noise is shaped such that its energy is located in the spectral regions where the original signal has most of its energy. This effect will be discussed in Sec. 2.3, concerning CELP coders.

Figure 2.3: Encoder based on Analysis-by-Synthesis — parameter selection/encoding feeds a local decoder, and the error between the input speech and the synthetic speech is minimized.
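To make the closed-loop idea concrete, the following minimal Python sketch selects the codebook entry whose synthesized output is closest to the target frame. The toy codebook and the abstract synthesize() function (standing in for the local decoder) are illustrative assumptions, not part of any standard; real CELP encoders minimize a perceptually weighted error rather than the plain mean squared error used here.

```python
import numpy as np

def closed_loop_search(target, codebook, synthesize):
    """Pick the codebook entry whose synthesized output is closest
    to the target frame in the mean-squared-error sense."""
    best_index, best_error = 0, np.inf
    for j, entry in enumerate(codebook):
        candidate = synthesize(entry)               # local decoder: rebuild speech
        error = np.sum((target - candidate) ** 2)   # error criterion
        if error < best_error:
            best_index, best_error = j, error
    return best_index

# Toy usage: with an identity "synthesis", the search reduces to
# finding the nearest codevector.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 40))             # 8 candidate excitations
target = codebook[5] + 0.01 * rng.standard_normal(40)
assert closed_loop_search(target, codebook, lambda e: e) == 5
```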
2.3 CELP Coding Overview

CELP coders can be seen as improved versions of LP vocoders, where a mathematical model provides an equivalent of the human speech production system (Tab. 2.2). The vocoder model assumes that the digital speech is the output of a digital filter whose input excitation is either white noise or a train of impulses.

Human Speech Production     | Mathematical Model
Vocal tract                 | LPC synthesis filter: H(z)
Air from the lungs          | Input excitation: u(n)
Vocal cords vibration       | Voiced speech: v(n)
Vibration period            | Pitch period: T
Fricatives and plosives     | Unvoiced speech: uv(n)
Air volume from the lungs   | Gain applied to the excitation: G

Table 2.2: Vocoder Relationship.

To model the filter and its excitation, the LP vocoder analyzes the speech signal to extract its parameters; the speech synthesis is only performed at the decoder. With CELP coders, a different way to encode the speech signal was developed by introducing the principle of analysis-by-synthesis at the encoder. This principle was highlighted by Schroeder and Atal in [Atal, Remde 1982] and [Schroeder, Atal 1985a].

The main innovation at this point was that the excitation was no longer based on a strict voiced/unvoiced classification of the speech frame. A synthesis phase was added to the encoding process, such that the excitation was modeled using short-term and long-term linear prediction of speech combined with an excitation codebook. The analysis step was based on linear prediction analysis, and the synthesis filter was estimated similarly as in the LP vocoder. The synthesis filter is thus excited by excitation vectors selected inside codebooks, which explains the terminology code-excited in CELP. The excitation signal is selected by matching the reconstructed speech waveform to the original signal as closely as possible.

An easy way to understand a CELP coder is to start with the decoder, where the quantized parameters are decoded from the bit-stream to synthesize the speech signal. In this work, we will not go too deep into the CELP encoding process; we will limit our study to the physical description of the CELP parameters and to a description of the decoder. A large overview of CELP coders and related standards can be found in ([Vary, Martin 2005], [Kleijn, Paliwal 1995] and [Kondoz 1994]).

2.3.1 CELP Decoder

The speech signal is processed frame by frame by the CELP encoder. The parameters are then extracted and transmitted frame by frame. At the CELP decoder side, the parameters of each sub-frame are used to synthesize the speech signal. A sub-frame is entirely characterized by two groups of parameters, namely the excitation parameters and the vocal tract parameters. The CELP decoder, as presented in Fig. 2.4, involves five different steps.

Figure 2.4: Typical CELP Decoder — (i) synthesis filter 1/Â(z) built from the quantized LPC coefficients; (ii) adaptive codebook vector v(n) selected at lag T; (iii) fixed codebook vector c_j(n) selected by the index j; (iv) decoding of the quantized gains; (v) excitation construction and post filtering, yielding the synthesized speech s̃(n).

The vocal tract parameters are represented by the LPC coefficients. During step (i), the quantized LPC coefficient vector Â = (â₁, . . . , â_M) is used to build the synthesis filter Ĥ(z), given by:

$\hat{H}(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{M} \hat{a}_i \, z^{-i}}$  (2.1)

where M is the order of the linear prediction analysis. The quantized LPC coefficients â_i are also used to construct the post filter. Generally, the LPC coefficients are computed on a frame basis and are then interpolated to obtain the LPC coefficients of each sub-frame.

The excitation parameters are divided into fixed and adaptive excitations. In step (ii), the received pitch delay T is used to select a section of the past excitation. The selected section v(n) = u(n − T) is called the adaptive codebook vector. In step (iii), the received index j of the fixed codebook is used to select the optimum fixed codebook vector c_j(n). In step (iv), the quantized indices of the fixed and adaptive codebook gains are decoded, yielding the quantized fixed codebook gain ĝ_f and the quantized adaptive codebook gain ĝ_a. The quantized codebook gains are used to scale c_j(n) and v(n) respectively. The final excitation u(n) is constructed during step (v) as follows:

u(n) = ĝ_a · u(n − T) + ĝ_f · c_j(n)  (2.2)

The excitation parameters are generally computed for each sub-frame at the decoder. For each sub-frame, the final excitation is filtered through the synthesis filter, as sketched below.
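The decoding steps (ii)-(v) can be illustrated by the following minimal Python sketch. The parameter values are hypothetical; the post filtering of step (v) and the fractional pitch handling of real coders are omitted.

```python
import numpy as np
from scipy.signal import lfilter

def decode_subframe(exc_mem, T, g_a, g_f, c_j, a_hat):
    """Rebuild one sub-frame: u(n) = g_a*u(n-T) + g_f*c_j(n) (Eq. 2.2),
    then filter u(n) through 1/A_hat(z) (Eq. 2.1)."""
    L = len(c_j)
    u = np.empty(L)
    mem = list(exc_mem)                        # past excitation (adaptive codebook)
    for n in range(L):
        u[n] = g_a * mem[-T] + g_f * c_j[n]    # adaptive + fixed contributions
        mem.append(u[n])                       # the lag T may fall inside the sub-frame
    # Synthesis filter 1/(1 + sum a_i z^-i): denominator [1, a_1, ..., a_M]
    s = lfilter([1.0], np.concatenate(([1.0], a_hat)), u)
    return s, np.array(mem[-len(exc_mem):])    # speech, updated adaptive codebook

# Toy usage with hypothetical parameter values:
past = np.zeros(160); past[-50] = 1.0
c = np.zeros(40); c[3], c[21] = 1.0, -1.0      # sparse algebraic codevector
speech, past = decode_subframe(past, T=50, g_a=0.8, g_f=1.5, c_j=c,
                               a_hat=np.array([-0.9]))
```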
The output of the synthesis filter is then enhanced by a post-processing stage, yielding the decoded or synthesized speech signal s̃(n). In most CELP coders, the post processing combines a long-term post filter, a short-term post filter and a tilt compensation. Basically, the post filtering enhances the perceptual quality by lowering the perceived noise in the synthesized speech signal; this is performed by attenuating the signal in the valleys of the spectrum. In addition, the output signal is filtered through a high-pass filter to remove low-frequency components. The speech samples are also up-scaled to recover an appreciable speech level, and an adaptive gain control ensures that the energy of the post-filtered speech signal is the same as that of the input speech signal. Step (v) is completed by storing the final excitation inside the adaptive codebook; this final excitation will be used during the computation of the next sub-frame.

Because this work aims at implementing a VQE system based on the CELP codec parameters, this chapter highlights the CELP parameters together with their physical description and representation. We will not detail computational aspects that are not needed for the general purpose of this thesis. The following sections provide a general overview of the CELP encoder.

2.3.2 Physical Description of CELP Parameters

As described in Sec. 2.3.1, the coders based on the CELP technique transmit approximately the same kind of parameters. The differences generally relate to the number of parameters, the way they are computed and quantized, and the number of bits allocated for their transmission. Current speech coders such as CELP ones exploit the human speech production apparatus as described in Fig. 2.5. The lungs generate air pressure that flows through the trachea, the vocal cords, the pharynx, and the oral and nasal cavities.

Figure 2.5: Human Speech Production Mechanism — lungs, trachea, vocal cords, pharyngeal cavity, velum, oral and nasal cavities, tongue hump, mouth and nose outputs.

The vocal tract comprises all the cavities above the vocal folds. The shape of the vocal tract determines the sound somebody makes. Changes of the vocal tract are relatively slow (10 ms to 100 ms). The amount of air coming from the lungs characterizes the loudness of the sound and acts as its energy. When somebody speaks, the speech sounds are created according to the following scenarios:

– First, the flow of air sets the vocal folds in an oscillating motion. Typical sounds are vowels (/a/, /o/ and /i/) and nasals (/m/, /n/). The vocal cords vibrate, and the rate at which they vibrate determines the pitch. These types of sound are called voiced sounds. Women and young children tend to have a high pitch (fast vibration), whereas adult males tend to have a low pitch (slow vibration).
– Second, the flow is constricted, as for fricatives (/f/, /s/ and /h/), or completely stopped for a short interval. These sounds are named unvoiced sounds and have noise-like characteristics. The vocal cords do not vibrate, but remain constantly open.
– Third, sounds such as /z/ are produced by both exciting the vocal tract with a periodic excitation and forcing air through a constriction of the vocal tract. These sounds are called mixed sounds.

A typical waveform generated at the vocal folds (voiced phoneme) is represented in Fig. 2.6.
The time representation in Fig. 2.6 (a) is characterized by the periodicity of the signal. In the frequency representation of Fig. 2.6 (b), we can observe the harmonic structure of the spectrum. The spectrum also indicates a dominant low-frequency content. In the band [0, 1500] Hz, four significant peaks can be observed. These peaks correspond to resonances of the vocal tract and are also called formants.

Figure 2.6: Voiced Sound — (a) time-domain waveform; (b) power spectral density.

The time representation of an unvoiced phoneme in Fig. 2.7 (a) is noise-like, with no significant periodic component. We can also see on the spectrum in Fig. 2.7 (b) that there is a significant amount of high-frequency components. This basically corresponds to rapid signal changes and to the random nature of unvoiced sounds.

Figure 2.7: Unvoiced Sound — (a) time-domain waveform; (b) power spectral density.

The structure observed in these speech signal representations reflects the human speech production system. The source-filter model is the basis of the technique used in CELP coders to characterize the transmitted parameters. The power spectra of voiced and unvoiced speech segments are characterized by two attributes: the envelope of the power spectrum (LPC coefficients) and the fine structure of the power spectrum (pitch delay) (cf. [Kleijn, Paliwal 1995]).

The Vocal Tract Parameters

According to the foregoing, the vocal tract can be considered as a time-varying filter. In each frame, the vocal tract is modeled by a linear filter whose impulse response is represented by the LPC coefficient vector A = (a₁, . . . , a_M). The LPC coefficients are computed assuming that the speech signal in a given frame follows an Auto-Regressive (AR) model. The computation of the LPC coefficients is called linear prediction analysis. The LPC coefficients can be viewed as a spectral envelope estimate of the signal over a frame. Using the LPC coefficients of an original signal, it is possible to generate another signal whose spectral characteristics are approximately those of the original signal. As indicated in Fig. 2.8 (b), the LPC coefficients are used to build the synthesis filter (cf. Eq. 2.1), whose spectrum corresponds to the speech signal envelope. As a consequence, in a perturbed environment, if the LPC coefficients of the useful speech signal are well estimated before the decoding steps, it is possible to recover a reasonably good speech spectrum envelope. This idea is used further in this thesis to enhance a signal in a noisy environment. In speech coding, the linear prediction analysis can be defined as a procedure to remove redundancy, where short-term redundant information is eliminated.

Figure 2.8: LPC Coefficients as Spectral Estimation — (a) time-domain waveform; (b) speech spectrum and synthesis filter spectrum.

The synthesis filter associated with the computed LPC coefficients should be stable. An IIR filter is said to be stable if all the poles of its transfer function are inside the unit circle. If there is a pole outside the unit circle, then there is an exponentially increasing component in the impulse response [Haykin 2002a].
In other words, a filter H is stable if its impulse response h(n) decreases to zero as n goes to infinity. In most CELP coders, especially in the AMR-NB standard, the LPC coefficients are transmitted through their Line Spectral Frequencies (LSF), introduced in [Itakura 1975]. An efficient computation using Chebyshev polynomials was proposed in [Kabal, Ramachandran 1986]. The LSF representation is motivated by several advantages:

– The LSF are bounded and ordered: 0 ≤ LSF₁ ≤ · · · ≤ LSF_M ≤ π. With this property, the stability check of the LPC coefficients can be easily performed. The stability of the synthesis filter can also be enforced during encoding by controlling the range of the LPC coefficients (cf. [Kabal, Ramachandran 1986] and [Kabal 2003]).
– The number M of LSF and the range of the values to quantize allow a better quantization behavior at low bit-rates. The LSF parameters of adjacent frames are highly correlated; this property leads to the adoption of efficient quantization techniques such as prediction and interpolation.
– The LSF parameters are directly linked to the spectral envelope of the speech signal. As a consequence, the formants can be described by the repartition of consecutive LSF parameters: for example, two or three consecutive LSF coefficients can describe a formant.

Computation of LPC Coefficients: Linear Prediction Analysis

The linear prediction analysis, or LP analysis, is performed block-by-block (frame by frame) and starts by windowing the frame to be analyzed. The windowing process, as depicted in Fig. 2.9, serves to select the appropriate section (frame or sub-frame) of the speech signal. Simultaneously with the windowing, the individual blocks need to be overlapped to prevent loss of information at the edges of frames: the overlapping process includes portions of the adjacent frames in the current frame. Windowing the speech also reduces the audible distortion of the reconstructed speech. The impact of the windowing can be assessed by analyzing the spectral representation of the window. The simplest analysis window is the rectangular window, given by:

$w_{rect}(n) = \begin{cases} 1, & n = 0, \ldots, N-1, \\ 0, & \text{otherwise.} \end{cases}$  (2.3)

Figure 2.9: Structure of the Analysis Window — analysis window overlapping a frame of four sub-frames.

The rectangular window has a disadvantage in the frequency domain. Its Fourier transform does have a narrow main lobe, but it also has appreciable side lobes (secondary lobes), as seen in Fig. 2.10. The rectangular window gives the best frequency resolution, but because of the side lobes it is not used much in applications.

Figure 2.10: Rectangular Window — time-domain shape and magnitude spectrum.

In order to reduce the side lobes after the Fourier transform, the speech signal is windowed using particular windows. The Hamming window, or hybrid windows (combinations of Hamming halves), are the most used in speech coding [Kleijn, Paliwal 1995]. The expression of a typical Hamming window is:

$w_{hamming}(n) = \begin{cases} 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), & n = 0, \ldots, N-1, \\ 0, & \text{otherwise.} \end{cases}$  (2.4)

As shown in Fig. 2.11, the spectral magnitude of the Hamming window is much greater around low frequencies than at high frequencies, i.e. its side lobes are much lower. Point-to-point multiplication with a frame makes the edges of the frame insignificant.

Figure 2.11: Hamming Window — time-domain shape and magnitude spectrum.
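As an illustration, the following Python sketch builds the Hamming window of Eq. 2.4 and cuts a signal into overlapping windowed analysis frames. The frame length and shift (240 and 160 samples) are illustrative values, not those of a particular standard.

```python
import numpy as np

def hamming(N):
    """Hamming window of Eq. 2.4."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def frames(signal, N=240, shift=160):
    """Split a signal into overlapping windowed analysis frames:
    N-sample windows hopped by `shift` samples, so each frame
    includes part of the previous one."""
    w = hamming(N)
    starts = range(0, len(signal) - N + 1, shift)
    return np.array([w * signal[p:p + N] for p in starts])

# Each row is one windowed frame, ready for LP or spectral analysis.
x = np.random.default_rng(1).standard_normal(8000)   # 1 s at 8 kHz
F = frames(x)
```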
Another issue is the window size. Its choice affects the time and frequency resolutions, following Heisenberg's uncertainty principle. In the following, n = 0 corresponds to the first sample of the current frame; the samples for n < 0 are those of the previous frame. A linear prediction analysis of order M writes the current windowed speech sample s_w(n) as a linear combination of M past samples plus a residual signal e(n), also called the LP or LPC residual signal:

$s_w(n) = -\sum_{k=1}^{M} a_s(k) \, s_w(n-k) + e(n)$  (2.5)

In the following, the vector A_S = (a_s(1), . . . , a_s(M)) of linear filter coefficients represents the LPC coefficient vector of the windowed speech frame s_w(n), n = 0, . . . , N − 1. To estimate the optimal LPC coefficients, the autocorrelation and covariance methods are generally used. These methods select the linear filter coefficients by minimizing, in the least-squares sense, the short-term energy (squared error) of the residual signal:

$E_{ST} = \sum_{n=0}^{N-1} e(n)^2 = \sum_{n=0}^{N-1} \left( s_w(n) + \sum_{k=1}^{M} a_s(k) \, s_w(n-k) \right)^2$  (2.6)

Setting the partial derivative of the squared error with respect to each coefficient a_s(j) to zero, we obtain the set of LPC coefficients that minimizes this squared error. The autocorrelation method achieves this computation as follows:

$\frac{\partial E_{ST}}{\partial a_s(j)} = 0 \;\Leftrightarrow\; \sum_{n=0}^{N-1} s_w(n) \, s_w(n-j) = -\sum_{k=1}^{M} a_s(k) \sum_{n=0}^{N-1} s_w(n-k) \, s_w(n-j)$  (2.7)

In Eq. 2.7, the term $r_S(j) = \sum_{n=j}^{N-1} s_w(n) \, s_w(n-j)$, j = 0, . . . , M, is the autocorrelation function introduced previously. In matrix form, the system of Eq. 2.7 can be written as:

$-\Gamma_S \cdot A_S = R_S$  (2.8)

where the M × M autocorrelation matrix is defined as:

$\Gamma_S = \begin{pmatrix} r_S(0) & r_S(1) & \cdots & r_S(M-1) \\ r_S(1) & r_S(0) & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_S(1) \\ r_S(M-1) & \cdots & r_S(1) & r_S(0) \end{pmatrix}$  (2.9)

and the autocorrelation vector is given by:

$R_S = (r_S(1), \ldots, r_S(M))$  (2.10)

The autocorrelation matrix Γ_S is a Toeplitz matrix, so the LPC coefficients can be computed efficiently. Iterative techniques are usually preferred, since matrix inversion is computationally expensive; in particular, the recursive Levinson-Durbin algorithm is implemented in most speech coders (a minimal sketch is given after the following list). References [Kondoz 1994] and [Haykin 2002a] give an extensive explanation of the Levinson-Durbin algorithm. Moreover, the Toeplitz structure of the autocorrelation matrix guarantees an estimate of A_S such that the associated synthesis filter H(z) = 1/A_S(z) is stable.

The linear prediction analysis is an important module in recent coders. It can be defined in three different manners:

– It is an identification technique where the parameters of a system are found from observations: speech in this situation is modeled as an Auto-Regressive (AR) signal, which is appropriate in practice [Kleijn, Paliwal 1995].
– The linear prediction analysis can be viewed as a spectral estimation method: it yields the LPC coefficients, which characterize the PSD of the signal itself. Based on the LPC coefficients of an original signal, it is possible to generate another signal whose spectral characteristics are close to the original ones (see Fig. 2.8).
– Above all, for speech coding applications, the linear prediction analysis can be defined as a procedure to remove redundancy, where short-term repeated information is eliminated.
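The following Python sketch illustrates the Levinson-Durbin recursion for the system of Eq. 2.8, using the sign convention of Eq. 2.5, i.e. A(z) = 1 + Σ a_s(i) z⁻ⁱ. It is a textbook floating-point version, not the fixed-point implementation of an actual coder.

```python
import numpy as np

def levinson_durbin(r, M):
    """Solve the Toeplitz system of Eq. 2.8 for the LPC coefficients
    a_s(1..M), given the autocorrelations r[0..M]."""
    a = np.zeros(M + 1)
    a[0] = 1.0
    err = r[0]                                   # residual energy E_ST
    for i in range(1, M + 1):
        # acc = r[i] + sum_{j=1}^{i-1} a[j] * r[i-j]
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                           # reflection coefficient
        a[1:i+1] = a[1:i+1] + k * a[i-1::-1][:i] # order-update of the coefficients
        err *= (1.0 - k * k)                     # updated residual energy
    return a[1:], err

# Usage: windowed frame -> autocorrelations r_S(0..M) -> LPC of order 10
x = np.random.default_rng(2).standard_normal(240)
r = np.array([np.dot(x[:len(x) - j], x[j:]) for j in range(11)])
a_s, residual_energy = levinson_durbin(r, M=10)
```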
Perceptual Weighting Filter

The perceptual weighting filter intervenes in the analysis-by-synthesis procedure of the CELP encoder: the optimum parameters are computed by minimizing the error in the perceptual domain (the error is filtered through the perceptual filter before the minimization). Frequency masking experiments have shown that the effects of quantization noise are not audible in the frequency bands where the speech has more energy (formant regions). The quantization of the parameters exploits this effect by allocating the quantization noise to the bands with high energy. This technique is also called noise shaping, as introduced in [Schroeder et al. 1979]. The perceptual weighting filter, which derives from the LPC filter, exploits the characteristics of the human auditory system. It was originally designed to shape the noise so that its energy is lower between formants and higher in the formant zones (cf. [Kondoz 1994], [Spanias 1994]). Early perceptual weighting filters were computed from the LPC coefficients as follows:

$W(z) = \frac{A(z)}{A(z/\gamma)}$  (2.11)

The poles of the weighting filter are those of the synthesis filter, but drawn towards the center of the unit circle. This filter is an all-pole filter and is in fact a bandwidth expansion filter [CHU 2000], since the constant γ introduces a dilation effect [Kabal 2003]. By applying this filter to the speech signal, a listener will notice little (quantization) noise in the formant regions. Current CELP coders use a perceptual weighting filter of the form:

$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}$  (2.12)

where the perceptual factors lie in the range 0 < γ₁, γ₂ < 1.
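A minimal Python sketch of Eq. 2.12 is given below: replacing z by z/γ amounts to scaling the i-th LPC coefficient by γⁱ (bandwidth expansion). The default values γ₁ = 0.94 and γ₂ = 0.6 anticipate the AMR-NB figures quoted in Sec. 2.4; other coders use different values, and the LPC set in the example is hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma**i
    (`a` excludes the leading 1)."""
    return a * gamma ** np.arange(1, len(a) + 1)

def perceptual_weighting(signal, a, gamma1=0.94, gamma2=0.6):
    """Filter a signal through W(z) = A(z/gamma1) / A(z/gamma2), Eq. 2.12."""
    num = np.concatenate(([1.0], bandwidth_expand(a, gamma1)))
    den = np.concatenate(([1.0], bandwidth_expand(a, gamma2)))
    return lfilter(num, den, signal)

# Usage with a hypothetical (stable) second-order LPC set:
a = np.array([-1.2, 0.6])
x = np.random.default_rng(5).standard_normal(160)
xw = perceptual_weighting(x, a)
```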
Pitch Delay: Adaptive Excitation

The speech signal is also characterized by its long-term dependency: the fine structure of the speech spectrum corresponds to the long-term autocorrelation. This dependency is caused by the fundamental frequency of the speaker, which lies in the range [40 − 500] Hz according to the speaker's gender and age. This fundamental frequency F₀ corresponds to the vibration of the vocal cords. A voiced speech segment, as shown in Fig. 2.6, is quasi-periodic in the time domain. It can be identified by the positions of the largest signal peaks and by analysis of the fine structure. The distance between the largest signal peaks is referred to as the pitch period or pitch lag. The spectrum of such a signal exhibits spectral peaks (in Fig. 2.6, for example, four spectral envelope peaks). These peaks are the manifestation of the resonances of the vocal tract at these frequencies.

The adaptive codebook search exploits the quasi-periodic structure of the speech that occurs during voiced speech segments. The adaptive codebook vector v(n) is computed by simply weighting the past excitation at lag T = F_s/F₀, where F_s is the sampling rate, with the adaptive codebook gain g_a. The LPC coefficients and the adaptive excitation represent the predictable contribution of the speech signal. After these contributions have been removed, the remaining non-predictable (noise-like) excitation, or final residual signal, needs to be modeled by the so-called fixed excitation.

Fixed Excitation

Signals with a slowly varying power spectrum can be used as an approximation of this final residual, non-predictable signal. A table containing pulses, called the fixed codebook, is used in CELP coding to approximate this signal. Fixed codebooks are generally characterized by their size, their dimension, and their search complexity. Among all these properties, complexity is a very important criterion. Complexity is significantly reduced by simplifying the codebook dictionary and imposing a structure on it.

Depending on the type of CELP coder, fixed codebooks can be classified as deterministic or non-deterministic. The Gaussian codebook, of the non-deterministic class, was the first to be used in CELP codecs. This kind of codebook was proposed in [Schroeder, Atal 1985b] and was made of Gaussian noise with 1024 entries. The complexity of this codebook is high, and it needs memory space for storing the codevectors at the encoder and at the decoder. Trained Gaussian codebooks [Moriya et al. 1993] can also be used, where the fixed codebook vectors are trained on a representative set of input signals.

Deterministic codebooks, in general, are those where the excitations consist of P non-zero pulses designed as follows (cf. [Atal, Remde 1982]):

$c_j(n) = \sum_{i=0}^{P-1} \beta_i \cdot \delta(n - m_i)$  (2.13)

where β_i denotes the pulse amplitude and m_i its position. Among deterministic codebooks, regular pulse codebooks and algebraic codebooks have been developed. Regular pulse codebooks use pulses with a constant distance between two consecutive pulses [Kroon et al. 1986]. The algebraic codebook was introduced in [Adoul et al. 1987]; the pulse amplitudes are given by β_i = ±1, ∀i ∈ {0, . . . , P − 1}. Most recent coders are based on this type of codebook: only P non-zero values are needed, and the complexity and memory demand are low. Algebraic codebooks allow the implementation of special codebooks for CELP (ACELP) that alleviate storage and reduce complexity. In this category, the ternary codebook is the most implemented: as in [Goldberg, Riek 2000] and [Kleijn et al. 1990], the codebook components are either 1, 0 or −1. This kind of codebook offers an interesting flexibility, since only additions and subtractions are performed. The complexity of such a codebook during the filtering process can be significantly reduced by varying the number of zeros inside the codebook.

Finally, the entire CELP coding process, as illustrated by Fig. 2.12, can be summarized as follows:

– The vocal tract parameters, represented by the LPC coefficients a_s(i), i = 1, . . . , M, are computed, quantized (generally using LSF quantization) and transmitted.
– The excitations are searched based on the analysis-by-synthesis principle in the perceptual domain, using the perceptual weighting filter W(z). An adaptive contribution and a fixed contribution are combined to compute the excitation signal.
– The adaptive excitation v(n) is characterized by an adaptive gain and a pitch lag: g_a and T respectively.
– The fixed codebook search provides the index (signs and positions) of the fixed codebook vector c_j(n) and the associated gain g_f.

Figure 2.12: Typical CELP Encoder — LPC analysis of the input PCM speech, adaptive (z^{-T}, gain g_a) and fixed (codevector c_j, gain g_f) contributions, synthesis filter 1/Â(z), perceptual weighting W(z) and minimization of the weighted error over (T, j, g_a, g_f).

The quantized parameters are transmitted to the decoder, and:

– The decoder reconstructs the final excitation by using the decoded parameters.
– The LPC coefficients are decoded to build the synthesis filter and the post filter.
– The final excitation is filtered through the synthesis filter.
– The output of the synthesis filter is enhanced through the (optional) post filtering to improve the quality of the decoded speech waveform.
2.3.3 Speech Perception

The human ear is made of various components; its inner part, the basilar membrane, has been well investigated, see [Goldberg, Riek 2000]. The basilar membrane can be considered as a non-uniform filter bank or spectral analyzer whose bands (also called critical bands) vary with frequency. Different points of the basilar membrane react differently depending on the frequency of the incoming sound wave. The human ear can perceive a sound whose frequency ranges over the interval 15 − 20000 Hz, provided its level is significant: the level of the sound must be above the absolute audition threshold, which depends on the frequency. The absolute audition threshold, also called Absolute Threshold of Hearing (ATH), is generally characterized by the amount of energy needed for a sound to be detected by a listener in a silent environment. The ATH varies from one person to another. The human ear tends to be most sensitive to frequencies in the range [1, 4] kHz (cf. [Painter, Spanias 2000]). In contrast, the Maximum Threshold of Hearing (MTH) characterizes the threshold above which the hearing sensation becomes painful.

The masking property

The masking effect is defined as the situation where the perceptibility of a sound (the maskee) is disturbed by the presence of another sound (the masker) (see [Vary, Martin 2005] and [Schroeder et al. 1979]). Two masking effects, known as temporal masking and frequency masking, are mostly encountered. Temporal masking is difficult to exploit: in this type of masking, a sound is not noticed if it directly follows a louder sound. Both temporal and frequency masking are used in audio coding to design the psychoacoustic model. Frequency masking is observed when two or more sounds are simultaneously presented to the auditory system: depending on the shape of the magnitude spectrum, the presence of certain spectral energy will mask the presence of other spectral energy. Frequency masking is generally used in speech coding through the weighting synthesis filter. Masked frequencies do not need to be quantized and transmitted, hence the codec can focus on quantizing the most important information.

2.4 Standard

In this section we present a short overview of the standardized 3GPP (3rd Generation Partnership Project) AMR ACELP [3GPP 1999b]. This CELP-based coder is used in this thesis as a reference for our simulations and applications. AMR-NB is the mandatory speech coder for narrow-band 8 kHz communication over UMTS. This coder follows the legacy of the original CELP and uses algebraic techniques. The parameters of the AMR coder are exactly the same as those described in Sec. 2.3; some additional enhanced features are summarized in the following.

2.4.1 The Standard ETSI-GSM AMR

The AMR narrowband codec is the 3GPP mandatory standard codec for narrowband speech and multimedia messaging services over 2.5G/3G wireless systems based on evolved GSM core networks (WCDMA, EDGE, and GPRS). Initially developed for the GSM system, AMR-NB was standardized by the European Telecommunications Standards Institute (ETSI) in 1999.
The AMR-NB uses an 8 kHz speech sampling rate and operates at various bit-rates, also called modes: 4.75, 5.15, 5.90, 6.70 (same as PDC-Japan), 7.4 (same as D-AMPS IS-136), 7.95, 10.2 and 12.2 kbps (same as GSM-EFR). The input speech signal is divided into frames of 20 ms (160 samples), each with four sub-frames of 5 ms (40 samples). The exact processing of the speech depends on the AMR-NB encoder mode and is fully described in [3GPP 1999b]. As depicted in Fig. 2.13, the decoder block diagram of the AMR-NB shows steps similar to the generic CELP decoder of Fig. 2.4, but includes new features.

Figure 2.13: Decoding Block of the AMR-NB — (i) LSF decoding and interpolation to LPC coefficients; (ii) adaptive codebook vector from the pitch index (T and T_frac); (iii) fixed codebook vector from the codebook index j; (iv) gain decoding; (v) synthesis filter 1/Â(z) and post filter.

LPC Analysis

A 10th-order LPC analysis is performed once or twice per frame, from a windowed signal (hybrid window: halves of Hamming windows with different sizes). The LPC analysis is performed twice per frame in 12.2 kbps mode, with no look-ahead, and only once per frame in the other modes, with 5 ms look-ahead. The Levinson-Durbin algorithm is used to compute the LPC coefficients, which are transmitted through their LSF representation. The perceptual weighting filter is given by W(z) = A(z/γ₁)/A(z/γ₂), where γ₂ = 0.6, whereas γ₁ = 0.9 (for the 10.2 kbps and 12.2 kbps modes) or γ₁ = 0.94 (for all other modes).

Excitation Parameters (Computed for each sub-frame)

The adaptive excitation is computed in two steps: an open-loop search coupled with a closed-loop search. The open-loop search is performed twice per frame on 10 ms of signal samples, but only once per frame in the 4.75 kbps and 5.15 kbps modes. The open-loop search yields an intermediate open-loop pitch T_OP. The closed-loop search is then performed in each sub-frame, confined to a small number of lag values around the open-loop pitch T_OP. Finally, the pitch delay obtained at the end of the closed-loop search in each sub-frame has two components: an integer part T and a fractional part T_frac. The pitch computation is completed by that of the adaptive gain g_a, limited to the interval 0 ≤ g_a ≤ 1.2.

The fixed codebook is algebraic, with ternary ACELP vectors (see Tab. 2.3). The number of pulses inside the codebook depends on the mode, leading to several bit allocations. In 12.2 kbps mode, the fixed codebook is encoded using 35 bits per sub-frame. The fixed codebook has five tracks, and each track can host two pulses, which can be assigned to 8 different positions inside the sub-frame. A little trick is used for the transmission of the two pulses of a track: if the position number of the second pulse is lower than that of the first pulse, then the second pulse has the opposite sign of the first pulse; if the position number is higher, then both pulses have the same sign. Hence, only the sign of the first pulse of the track needs to be transmitted. As all the pulses have the same amplitude (equal to one), no bit is transmitted for the pulse amplitude. The transmitted indices are used at the decoder to choose the appropriate codebook vector.

Track | Pulses  | Positions
1     | i0, i5  | 0, 5, 10, 15, 20, 25, 30, 35
2     | i1, i6  | 1, 6, 11, 16, 21, 26, 31, 36
3     | i2, i7  | 2, 7, 12, 17, 22, 27, 32, 37
4     | i3, i8  | 3, 8, 13, 18, 23, 28, 33, 38
5     | i4, i9  | 4, 9, 14, 19, 24, 29, 34, 39

Table 2.3: 12.2 kbps Mode Algebraic Codebook Positions.
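The following Python sketch rebuilds a ternary codevector c_j(n) from decoded (position, sign) pairs laid out on the tracks of Tab. 2.3. The pulse pair chosen in the example is hypothetical, and the actual bit unpacking of the AMR-NB indices is not shown.

```python
import numpy as np

def build_acelp_codevector(pulses, L=40):
    """Build a ternary algebraic codevector c_j(n) from (position, sign)
    pairs, e.g. the ten pulses of the 12.2 kbps mode (two per track).
    Positions on track t are t, t+5, ..., t+35 (Tab. 2.3)."""
    c = np.zeros(L)
    for pos, sign in pulses:
        c[pos] += sign          # two pulses may share the same position
    return c

# Hypothetical example: one pulse pair on track 1 (positions 0, 5, ..., 35)
c = build_acelp_codevector([(5, +1.0), (30, -1.0)])
```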
In the 12.2 kbps and 7.95 kbps modes, the fixed codebook and adaptive codebook gains are quantized separately; in the other modes, they are jointly quantized. In 12.2 kbps mode, the adaptive gain g_a is quantized with four bits through a non-uniform scalar quantization. The fixed codebook gain is not directly quantized or transmitted; instead, a correction factor γ̂_gf is computed based on the predicted fixed codebook gain g_f′. The predicted gain is computed using a fourth-order MA prediction (see [3GPP 1999b]), and the correction factor is quantized with five bits. The fixed codebook gain quantization is performed by minimizing:

$E_{Q_{12.2\,kbps}} = \left( g_f - \hat{\gamma}_{g_f} \cdot g_f' \right)^2$  (2.14)

In Eq. 2.14, g_f′ is the predicted gain and γ̂_gf is the correction factor. In the 7.4 kbps mode, the correction factor and the adaptive gain are jointly quantized by minimizing (an illustrative search is sketched below):

$E_{Q_{7.4\,kbps}} = \left\| X - g_a \cdot Y - g_f \cdot Z \right\|^2$  (2.15)

where X is the target vector, Y the filtered adaptive codebook vector and Z the filtered fixed codebook vector (see [3GPP 1999b]). To complete the processing, at the end of each sub-frame of the encoding, the final excitation is computed and stored inside the adaptive codebook dictionary for the next sub-frame. The filter states are also saved, so that they can be used in the next sub-frame.
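The joint quantization of Eq. 2.15 can be illustrated by an exhaustive search over a gain codebook, as in the following Python sketch. The 4-entry codebook is a toy example: the real AMR-NB tables are much larger, and the search is combined with the gain prediction described above.

```python
import numpy as np

def joint_gain_search(X, Y, Z, gain_codebook):
    """Select the (g_a, g_f) pair minimizing Eq. 2.15:
    E = ||X - g_a*Y - g_f*Z||^2, where X is the target vector, Y the
    filtered adaptive codevector and Z the filtered fixed codevector."""
    errors = [np.sum((X - ga * Y - gf * Z) ** 2) for ga, gf in gain_codebook]
    return int(np.argmin(errors))

# Hypothetical 4-entry gain codebook:
book = [(0.2, 0.5), (0.6, 1.0), (0.9, 1.5), (1.1, 2.5)]
rng = np.random.default_rng(3)
Y, Z = rng.standard_normal(40), rng.standard_normal(40)
X = 0.9 * Y + 1.5 * Z                    # target built exactly from entry 2
assert joint_gain_search(X, Y, Z, book) == 2
```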
Additional Properties

Many special features are integrated in AMR-NB. Functionalities such as Frame Erasure Concealment, Voice Activity Detection (VAD) [26.094 2002], Comfort Noise Generation (CNG) [26.092 2002] and Discontinuous Transmission (DTX) are integrated [3GPP 1999b]. For Voice over Internet Protocol (VoIP), the eight available bit-rates allow for Quality of Service (QoS) improvement. The flexibility of AMR-NB enables the development of applications that control the bit-rate according to the network characteristics.

2.5 Conclusion

This chapter has presented the general principles of CELP coding techniques, especially the decoder. The AMR-NB standard has been briefly introduced. This standard processes speech in 20 ms frames. Several groups of parameters are computed, quantized and transmitted to the decoder. At the decoder side, the transmitted parameters are reconstructed or approximated to synthesize the speech waveform. As vocal tract parameters, one or two sets of LPC coefficients, depending on the coding mode, are decoded: a_i(m), i = 1, . . . , M. These LPC coefficients model the vocal tract and are transmitted via their LSF representation. As far as the excitations are concerned, the following parameters are computed:

– Four adaptive codebook gains and four fixed codebook gains per frame, i.e. one set of gains (g_a(m), g_f(m)) for every sub-frame m. As some special processing techniques are integrated in AMR-NB, the gains are individually or jointly quantized, according to the mode.
– Four pitch delays, one for each sub-frame m. The pitch delay has an integer part and a fractional part, T and T_frac respectively, and is used to compute the adaptive excitation v(n), which is in fact the previous excitation at lag (T, T_frac).
– Four fixed codebook vectors, also called stochastic codebook vectors, one for each sub-frame m: c_j(m). Algebraic structures have permitted the implementation, inside the encoder and the decoder, of flexible codebook dictionaries depending on the coding mode. Only the signs and positions of the codebook vectors are transmitted; these parameters are also called the index. This information is used at the decoder side to select, inside the stored codebook, the best matching fixed codevector. The quantization of the fixed codevectors is where most of the bits are allocated.

We should mention that the CELP encoding process is much more complex than this summary suggests. In real-time processing, the CELP encoding is therefore segmented into smaller, more manageable sequential searches of parameters. If the speech signal to be encoded is corrupted by background noise and/or acoustic echo, the encoder performs poorly and the transmitted CELP parameters need to be restored. In line with the context of this PhD as introduced in Chap. 1, Voice Quality Enhancement (VQE) will be performed by processing the CELP parameters. In the next chapters, we present in detail the algorithms we have developed to modify the AMR-NB ACELP parameters.

Chapter 3

Noise Reduction

Introduction

In mobile communication scenarios, when the speech signal is corrupted by a strong background noise, the speech coders used for low bit-rate coding are drastically affected and do not perform well. This induces fatigue for the far-end listener and difficulties for the speakers to understand each other: the quality and intelligibility of the synthesized speech are degraded in such conditions. To overcome this drawback, a noise reduction processing is necessary. Noise reduction algorithms generally use the Pulse Code Modulation (PCM) samples of the available signal (see Fig. 3.1). As depicted in Fig. 3.1 (a), if the noise reduction system is located inside the mobile device, as in existing systems, the algorithm is implemented as a pre-processing before encoding or as a post-processing after decoding. As shown in Fig. 3.1 (b), if the noise reduction system is located inside the network, the bit-stream needs to be decoded; the decoded signal is enhanced and then re-encoded. This way of overcoming the noise problem has many disadvantages: increased computational load, increased delay, increased complexity and signal distortion.

Figure 3.1: Existing Noise Reduction Unit Locations — (a) noise reduction in the mobile device, on the PCM samples between the A/D converter and the speech coder; (b) noise reduction in the network, between decoding and re-encoding of the bit-stream.

The idea of de-noising the coded parameters so as to yield noise reduction and speech enhancement was originally proposed in [Chandran, Marchok 2000]. The authors of this contribution successfully carried out a feasibility study of speech enhancement based on coded parameters, especially CELP parameters. This chapter starts with a brief definition of noise in Sec. 3.1. Sec. 3.2 presents an overview of the techniques used to reduce noise in the frequency domain, also known as spectral attenuation. Sec. 3.3 is an overview of previous approaches operating in the coded domain. Two algorithms developed or enhanced during this work are described in Sec. 3.4 and Sec. 3.5 respectively.
Experimental results are presented in Sec. 3.5.3 and a partial conclusion ends the chapter.

3.1 Noise

In the real world, noise can appear in different shapes and forms. Two important characteristics are generally taken into account [Loizou 2007]:

1. Noise can be stationary, meaning that its characteristics remain unchanged over time. Car noise, such as the one used in this work, is stationary over short periods of up to about 30 ms. The energy of such car noise is concentrated at low frequencies.
2. Noise can be non-stationary. This is the case for cafeteria noise and street noise, whose characteristics can change every 20 ms or so. The implementation of algorithms dedicated to the suppression of non-stationary noise is more complicated than for stationary noise.

In practice, the assumption is that noise is typically stationary over short time frames of up to 30 ms.

3.2 Noise Reduction in the Frequency Domain

Among all the techniques used in noise reduction, spectral attenuation is historically the most used. Spectral attenuation is typically performed in the frequency domain through the Short-Time Fourier Transform (STFT) or the Discrete Cosine Transform (DCT). In the frequency domain, the filtering is performed by weighting the spectral components; the weighting factor is generally computed as a function of the Signal-to-Noise Ratio (SNR). After the corrupted speech signal has been enhanced in the frequency domain, the enhanced speech signal is transformed back to the time domain using the inverse STFT or DCT. Spectral attenuation is widely discussed in [Loizou 2007] and [Vary, Martin 2005].

3.2.1 Overview: General Synoptic

A noise reduction algorithm in the frequency domain generally follows four main steps, as described in Fig. 3.2. The first step is the transformation of the signal from the time domain into the frequency domain. The second step is the estimation of the noise signal features. An attenuation rule is applied on the corrupted signal in the third step and, finally, the enhanced speech signal is converted back to PCM samples. The corrupted speech signal y(n) is given by the relation:

y(n) = s(n) + d(n)  (3.1)

where s(n) is the clean speech signal, d(n) is the additive noise and n the sample index.

Figure 3.2: Spectral Attenuation Principle — FFT/DCT of the noisy speech y(n), noise estimation, weighting function (gain) G(p, f_k), and inverse transform reusing the noisy phase.

The temporal and spectral characteristics of the speech signal change over time [Allen 1977]. A representation of the signal over short periods (10 − 30 ms), or frames, is suitable to cope with these speech properties. A window is used to select the corresponding segment of speech and to weight the speech samples for processing. Simultaneously with the windowing, the individual blocks need to be overlapped to prevent loss of information at the edges of frames; with the overlapping, portions of the adjacent frames are included in the current windowed samples. The STFT is obtained by computing the DFT of each overlapping windowed frame as follows [Martin et al. 2004]:

$Y(p, f_k) = \sum_{n=0}^{N-1} y(p \cdot \sigma + n) \cdot w(n) \cdot e^{-j \frac{2\pi}{N} k n}$  (3.2)

where σ represents the frame shift, p the frame index, f_k, k = 0, . . . , N − 1, the frequency bin index related to the normalized center frequency Ω_k = 2πk/N, and w(n) the window function. During this process, the effect of the phase is not taken into account: φ_Ŝ(p, f_k) = φ_Y(p, f_k).
This is motivated by the fact that the human ear is largely insensitive to phase effects when the Signal-to-Noise Ratio is high enough ([Wang, Lim 1982] and [Pobloth, Kleijn 1999]). As depicted in Fig. 3.2, the STFT Y(p, f_k) of the noisy signal is used to estimate the noise spectrum D̂(p, f_k). Both Y(p, f_k) and D̂(p, f_k) are used to compute the filter G(p, f_k), also called the attenuation factor. The filter G(p, f_k) is then applied to Y(p, f_k), and the result of this filtering process is the estimated clean speech STFT Ŝ(p, f_k), given by:

Ŝ(p, f_k) = G(p, f_k) · Y(p, f_k)  (3.3)

Finally, to terminate the process, the enhanced speech signal is obtained by transforming Ŝ(p, f_k) back to the time domain.

3.2.2 Spectral Attenuation Filters

In the following, we introduce some particular filters used in spectral attenuation.

Spectral Subtraction

The spectral subtraction attenuation (cf. [Berouti et al. 1979] and [Lim, Oppenheim 1979]) can be summarized by the formula below:

$G_{SS}(p, f_k) = \max\left( \left[ \max\left( 1 - \zeta \cdot \left( \frac{\hat{D}(p, f_k)}{|Y(p, f_k)|} \right)^{a}, \, 0 \right) \right]^{1/a}, \; \epsilon \right)$  (3.4)

The noise floor factor ε is introduced so that some noise remains, in order to preserve the naturalness of the environment. The exponent a has an interesting effect on the level of reduction of the spectral subtraction filter [Lim, Oppenheim 1979]. The factor ζ controls the amount of subtraction; typical values proposed in [Berouti et al. 1979] are between 3 and 5. Varying the noise over-estimation parameter ζ affects the musical noise level. The case a = 1 corresponds to magnitude spectral subtraction [Boll 1979], and a = 2 leads to power spectral subtraction [McAulay, Malpass 1980].

Spectral subtraction has some drawbacks, such as phase distortion when the Signal-to-Noise Ratio is close to zero. Multiplications in the frequency domain result in circular convolutions in the time domain and lead to artifacts. The most annoying and disturbing artifact for listeners is musical noise, due to isolated tone bursts distributed randomly over the frequencies. These drawbacks have motivated research resulting in many methods derived from spectral subtraction, see [Loizou 2007] and [Vary, Martin 2005].

The Wiener Filter

In this work, the Wiener filter is considered as the reference classical solution for noise reduction. The Wiener method is also based on a short-term frequency analysis of the noisy signal. Considering a windowed analysis frame p of about 20 ms, the noisy signal is transformed into the frequency domain through the STFT. The resulting signal Y(p, f_k) is filtered so that an estimate Ŝ(p, f_k) of the clean signal S(p, f_k) is given by:

Ŝ(p, f_k) = G_W(p, f_k) · Y(p, f_k)  (3.5)

G_W is the Wiener filter, computed as [Loizou 2007]:

$G_W(p, f_k) = \frac{SNR(p, f_k)}{1 + SNR(p, f_k)}$  (3.6)

where SNR(p, f_k) stands for the ratio between the Power Spectral Density (PSD) of the clean speech signal s(n) and the PSD of the noise d(n).
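The following Python sketch applies the Wiener rule of Eqs. 3.5-3.6 to a single STFT frame, reusing the noisy phase as stated in Sec. 3.2.1. The SNR is crudely estimated from the a posteriori SNR (cf. Eq. 3.9 below); the frame size and noise level are illustrative, and the overlap-add resynthesis of a full system is omitted.

```python
import numpy as np

def wiener_gain(snr):
    """Eq. 3.6: G_W = SNR / (1 + SNR)."""
    return snr / (1.0 + snr)

def enhance_frame(Y, noise_psd, floor=1e-12):
    """Spectral attenuation (Eqs. 3.3/3.5) on one STFT frame Y(p, f_k);
    multiplying the complex spectrum keeps the noisy phase."""
    snr = np.maximum(np.abs(Y) ** 2 / np.maximum(noise_psd, floor) - 1.0, 0.0)
    return wiener_gain(snr) * Y

# Toy usage on a single windowed frame (256-point FFT assumed):
rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
noisy = clean + 0.3 * rng.standard_normal(256)
w = np.hamming(256)
Y = np.fft.rfft(w * noisy)
noise_psd = np.full(len(Y), 0.3 ** 2 * np.sum(w ** 2))  # known white-noise level
S_hat = enhance_frame(Y, noise_psd)
s_hat = np.fft.irfft(S_hat)                             # back to the time domain
```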
The Ephraim and Malah Rule

In [Ephraim, Malah 1984], the authors proposed a modified maximum-likelihood envelope estimation and added an estimator of the a priori signal-to-noise ratio. They also introduced an exponential smoothing in the time domain, which leads to a performance improvement in comparison with spectral subtraction. The Ephraim and Malah rule can be summarized by the formula:

$G_{EMSR}(p, f_k) = \frac{\sqrt{\pi}}{2} \sqrt{\frac{1}{1 + SNR_{post}}} \sqrt{\frac{SNR_{prio}}{1 + SNR_{prio}}} \cdot M\!\left( (1 + SNR_{post}) \, \frac{SNR_{prio}}{1 + SNR_{prio}} \right)$  (3.7)

where $M(\theta) = \exp\!\left(-\frac{\theta}{2}\right) \left[ (1 + \theta) \cdot I_0\!\left(\frac{\theta}{2}\right) + \theta \cdot I_1\!\left(\frac{\theta}{2}\right) \right]$, the terms I₀ and I₁ being the modified Bessel functions of zero and first order [Bowman 1958].

Estimation of the Signal-to-Noise Ratio

An efficient estimation of the signal-to-noise ratio SNR(p, f_k) is based on the "decision directed" approach, as proposed in [Scalart, Filho 1996]:

$\hat{SNR}(p, f_k) = \beta \cdot \frac{|\hat{S}(p-1, f_k)|^2}{\hat{D}^2(p, f_k)} + (1 - \beta) \cdot SNR_{post}(p, f_k)$  (3.8)

with

$SNR_{post}(p, f_k) = \frac{|Y(p, f_k)|^2}{\hat{D}^2(p, f_k)} - 1$  (3.9)

In these relations, D̂(p, f_k) represents an estimate of the noise PSD and β corresponds to the mixing factor between present and past estimates of the SNR. Usually, β = 0.98 provides a good performance of the estimator. The estimate D̂(p, f_k) is computed by smoothing the noisy signal PSD during non-voice-activity periods:

$\hat{D}(p, f_k) = \delta \cdot |Y(p, f_k)|^2 + (1 - \delta) \cdot \hat{D}(p-1, f_k)$  (3.10)

The adaptation is frozen during speech activity periods. In the Ephraim and Malah approach, the a priori signal-to-noise ratio SNR_prio is the most significant parameter. It is computed from a smoothed value of the a posteriori signal-to-noise ratio SNR_post as follows:

$SNR_{prio}(p, f_k) = (1 - \beta) \cdot SNR_{post}(p, f_k) + \beta \cdot \frac{|G_{EMSR}(p-1, f_k) \cdot Y(p-1, f_k)|^2}{\hat{D}^2(p, f_k)}$  (3.11)

where SNR_post is computed according to Eq. 3.9. This approach leads to small variations of the signal attenuation over successive frames, which significantly reduces the musical noise artifacts [Cappe 1994].

3.2.3 Techniques to Estimate the Noise Power Spectral Density

One of the most important steps in noise reduction is the estimation of the noise power spectrum D̂(p, f_k). The noise PSD is unknown during the noise reduction process and must be estimated. The noise signal is not stationary in principle, and tracking the changes of the noise PSD is a difficult task, especially when noise and speech are simultaneously present. Fortunately, it is possible to observe the noise during speech pauses: speech pause detection can thus be used to track the noise PSD. This solution is known as the Voice Activity Detection method (see [Vahatalo, Johansson 1999] and [3GPP 1999b]). Another estimation technique, introduced in [Martin 1994] and enhanced in [Martin 2001], called the Minimum Statistics technique, makes it possible to estimate the noise parameters without discriminating between speech presence and absence.

3.2.3.1 Estimation of the Noise PSD based on the Voice Activity Detection

Principle of the VAD

The voice activity detection module computes, in each frame or sub-frame, a logical decision (VAD). The decision VAD = 1 corresponds to speech presence, whereas VAD = 0 corresponds to a speech pause, thus to the presence of noise only. On the basis of this decision, the noise PSD D(p, f_k) can be estimated as follows:

$\hat{D}^2(p, f_k) = \begin{cases} cst \cdot \hat{D}^2(p-1, f_k) + (1 - cst) \cdot |Y(p, f_k)|^2, & \text{if } VAD = 0, \\ \hat{D}^2(p-1, f_k), & \text{if } VAD = 1. \end{cases}$  (3.12)
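Eq. 3.12 translates directly into a recursive update that is frozen whenever the VAD flags speech, as in the following Python sketch (the smoothing constant cst = 0.9 is an illustrative value).

```python
import numpy as np

def update_noise_psd(D2_prev, Y, vad, cst=0.9):
    """Eq. 3.12: recursive smoothing of the noise power per frequency bin,
    frozen when the VAD flags speech (vad == 1)."""
    if vad == 0:                       # noise-only frame: adapt the estimate
        return cst * D2_prev + (1.0 - cst) * np.abs(Y) ** 2
    return D2_prev                     # speech frame: keep the last estimate

# Hypothetical frame-by-frame loop, with vad_flags[p] from any VAD module:
# D2 = np.zeros(K)
# for p in range(P):
#     D2 = update_noise_psd(D2, Y_frames[p], vad_flags[p])
```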
3.2.3 Techniques to Estimate the Noise Power Spectral Density

One of the most important steps in noise reduction is the estimation of the noise power spectrum D̂(p, fk). The noise PSD is unknown during the noise reduction process and must be estimated. The noise signal is in principle not stationary, and tracking the changes of the noise PSD is a difficult task, especially when noise and speech are simultaneously present. Fortunately, it is possible to observe the noise during speech pauses, so speech pause detection can be used to track the noise PSD. This solution is known as the Voice Activity Detection method (see [Vahatalo, Johansson 1999] and [3GPP 1999b]). Another estimation technique, introduced in [Martin 1994] and enhanced in [Martin 2001], called the Minimum Statistics technique, makes it possible to estimate the noise parameters without discriminating between speech presence and absence.

3.2.3.1 Estimation of the Noise PSD based on the Voice Activity Detection

Principle of the VAD

The voice activity detection module computes a logical decision (VAD) in each frame or sub-frame. The decision VAD = 1 corresponds to speech presence, whereas the decision VAD = 0 corresponds to a speech pause, thus to the presence of noise only. On the basis of this decision, the noise PSD D(p, fk) can be estimated as follows:

if VAD = 0:  D̂²(p, fk) = cst · D̂²(p − 1, fk) + (1 − cst) · |Y(p, fk)|²
if VAD = 1:  D̂²(p, fk) = D̂²(p − 1, fk)    (3.12)

Example of Voice Activity Detection: The AMR VAD

An interesting tool integrated in AMR [3GPP 1999b] is the VAD, which is used to control the discontinuous transmission (DTX) module. This VAD analyzes several features of the input speech frame to discriminate speech presence periods. Two VAD options can be used. Thereafter we present only VAD option 1, initially proposed in [Vahatalo, Johansson 1999], because both options lead to similar results. As depicted in Fig. 3.3, the block diagram of the VAD involves four main functions. The first block analyzes the input speech signal over 9 sub-bands; the output of this module is the signal energy over each sub-band (level). The second block is dedicated to pitch and tone detection: it detects vowel sounds and other periodic signals, making use of the pitch values computed during the open-loop pitch search in the encoder (pitch-flag). In addition, this function detects tone information and signals with very strong periodic components (tone-flag). The third block is used to detect highly correlated signals such as music, based on the open-loop pitch search performed at the encoder; a complex-warning flag and a complex-timer are returned. The fourth block performs an estimation of the background noise. The output of this module is combined into an intermediate voice decision, which goes through a hangover generator to output a final voice decision flag. If speech is not detected, then VAD-flag = 0, otherwise VAD-flag = 1.

Figure 3.3: Simplified Block Diagram of the AMR VAD Algorithm, Option 1.

Performance of the AMR VAD

We clearly see in Fig. 3.4 (a) that the VAD decision (VAD option 1 of the AMR-NB) on clean speech differs from that obtained on the noisy speech, Fig. 3.4 (b). Estimation based on the VAD can thus be disturbed by the presence of noise, leading to a non optimal estimation of the noise power.

Figure 3.4: Example of VAD Decision, Option 1: (a) clean speech with its VAD decision, (b) noisy speech with its VAD decision (amplitude versus time in seconds).

3.2.3.2 The Minimum Statistics Technique

In contrast to the VAD method, the minimum statistics technique does not make any assumption on the presence of speech. It permits the estimation and updating of the noise power during speech-plus-noise periods as well as during noise-only periods. The minimum statistics technique assumes that speech and noise signals are statistically independent and that the power of the corrupted speech signal converges most of the time to the power of the noise. This is due to the fact that speech communication contains frequent pauses. The noise power is, under this assumption, viewed as a spectral floor. It is then possible, by tracking the minimum of the noisy speech signal power, to estimate the noise spectral power. The algorithm is based on the minimization of the short-time power estimate in each frequency bin for each frame, as detailed in [Martin 1994]:

P(p, fk) = α · P(p − 1, fk) + (1 − α) · |Y(p, fk)|²    (3.13)

Typically, α ∈ [0.9, 0.98], and the minimum of P(p, fk) is searched within a window of U frames:

P_min(p, fk) = min( P(p, fk), P(p − 1, fk), . . . , P(p − U + 1, fk) )    (3.14)
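A minimal sketch of this tracking loop is given below; the bias parameter anticipates the compensation factor ς introduced just after (EQ. 3.15). Function name and defaults are ours.

    import numpy as np

    def minimum_statistics_noise_psd(Y_mag, alpha=0.95, U=100, bias=1.5):
        """Noise PSD tracking via EQs. 3.13-3.14; bias plays the role of
        the compensation factor of EQ. 3.15.
        Y_mag: array of shape (num_frames, num_bins) of |Y(p, fk)|."""
        num_frames, num_bins = Y_mag.shape
        P = Y_mag[0] ** 2
        history = np.full((U, num_bins), np.inf)   # sliding window of P
        D_hat = np.zeros((num_frames, num_bins))
        for p in range(num_frames):
            P = alpha * P + (1.0 - alpha) * Y_mag[p] ** 2   # EQ. 3.13
            history[p % U] = P
            D_hat[p] = bias * history.min(axis=0)           # EQ. 3.14
        return D_hat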
At this point, the estimation of the noise PSD D̂(p, fk) is this minimum of P multiplied by a factor ς, to compensate the bias of the minimum estimate and to reduce musical noise:

D̂(p, fk) = ς · P_min(p, fk)    (3.15)

The technique introduced by the author in [Martin 2001] to enhance the initial Minimum Statistics is related to the computation of an optimal time-varying smoothing parameter. The author also proposed the computation of a more accurate bias compensation. Finally, an increase of the noise tracking speed is achieved.

3.3 Introduction to Noise Reduction in the Coded Domain

Noise reduction in classical approaches involves a high complexity and an additional computational load when it is applied within the network. An alternative approach, proposed in this work, involves modifying the parameters provided by the CELP encoder. The CELP coders are based on the Analysis-by-Synthesis principle, as depicted in Chap. 2. If we consider the CELP decoder synthesis filter H_m(z) on a sub-frame basis m in the Z-domain [Chandran, Marchok 2000], we can approximately write that:

H_m(z) = g_f(m) / [ (1 − g_a(m) · z^(−T(m))) · (1 + Σ_{i=1}^{M} a_i(m) · z^(−i)) ]    (3.16)

In EQ. 3.16, M represents the order of the linear prediction filter, a_i is the i-th LPC coefficient, g_a is the adaptive gain, g_f is the fixed codebook gain, T is the pitch delay and m is the sub-frame index. The synthesis filter H_m(z) can be viewed as the cascade of two filters: the Long-Term Prediction (LTP) filter H_LTP(z), which simulates the vocal source, and the LPC filter H_LPC(z), which models the vocal tract. They are given by:

H_LTP(z) = 1 / (1 − g_a(m) · z^(−T(m)))
H_LPC(z) = 1 / (1 + Σ_{i=1}^{M} a_i(m) · z^(−i))    (3.17)

The fixed codebook gain g_f appears as a multiplicative factor in the expression of H_m(z) in EQ. 3.16. As a consequence, g_f has a great influence on the decoded speech signal amplitude: a weighting of g_f modifies the signal amplitude. Noise reduction applied to the fixed codebook gain is motivated by this remark and has already been experimented in [Chandran, Marchok 2000] and [Taddei et al. 2004]. This approach will be discussed in Sec. 3.4.

The LTP and LPC filters can be considered as filters representing the spectral shape of the speech signal. As a result, a modification of these filters has an impact on the spectral characteristics of the synthesized speech signal. Considering noise reduction, the analyses in [Chandran, Marchok 2000] and [Duetsch 2003] show that estimating the LPC and LTP filters of the clean speech signal provides the spectral shape of the clean speech signal. Compared to the modification of the fixed codebook gain, the enhancement of the LPC filter as developed in [Thepie et al. 2008] not only influences the amplitude of the signal but also its spectral characteristics. Modifying the LPC filter thus has a potential positive effect on reducing the distortion of the decoded speech signal. This second approach will be exposed in Sec. 3.5.
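To illustrate the cascade structure of EQs. 3.16-3.17, the following sketch evaluates the frequency response of the sub-frame synthesis filter from its CELP parameters. It is an illustration only, not part of any codec reference implementation; the function name and FFT size are ours.

    import numpy as np
    from scipy.signal import freqz

    def celp_synthesis_response(gf, ga, T, a, n_fft=512, fs=8000.0):
        """Frequency response of the synthesis filter of EQ. 3.16, built
        as the cascade of the LTP and LPC filters of EQ. 3.17.
        a: LPC coefficients [a_1, ..., a_M]; T: pitch delay in samples."""
        den_lpc = np.concatenate(([1.0], np.asarray(a)))  # 1 + sum a_i z^-i
        den_ltp = np.zeros(T + 1)                          # 1 - ga z^-T
        den_ltp[0], den_ltp[T] = 1.0, -ga
        den = np.convolve(den_ltp, den_lpc)                # cascade of EQ. 3.17
        freqs, h = freqz([gf], den, worN=n_fft, fs=fs)
        return freqs, h

One can see directly in this form that gf only scales the response, which is the remark motivating Sec. 3.4.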
3.3.1 Some Previous Works in the Codec Domain

The authors in [Chandran, Marchok 2000] and [Duetsch 2003] have demonstrated the effectiveness of speech enhancement through the modification of the CELP codec parameters. An experimental investigation with replacement of parameters was performed as depicted in Fig. 3.5: the bit-streams or CELP parameters of two different noisy speech signals were exchanged. The AMR-NB codec was used in 12.2 kbps mode [3GPP 1999b]. The decoder was modified to use some parameters from another bit-stream. A speech signal was corrupted with two different background noise levels, resulting in noisy speech signals with a segmental signal-to-noise ratio (SNR_seg) of 20 dB and of 10 dB. The segmental signal-to-noise ratio was computed in this experiment only where VAD = 1. The experiments consist in replacing the parameters of the 20 dB noisy speech by those of the 10 dB one, as shown in Fig. 3.5 below.

Figure 3.5: Experimental Setup for the Exchange of Parameters (the LPC coefficients, adaptive codebook vector/gain and fixed codebook vector/gain are exchanged, in the network area, between the bit-streams b10 and b20 of the 10 dB and 20 dB noisy signals).

Listening tests show that exchanging the CELP parameters between bit-streams has a great influence on noise perception. The background noise level is more or less represented by the fixed codebook gain g_f. Another remark was that exchanging the LPC coefficients introduces distortions in the decoded signal. The noise perception is more noticeable if the 20 dB bit-stream parameters are replaced by those of the 10 dB bit-stream. Conversely, if the 10 dB bit-stream parameters are replaced by those of the 20 dB one, the perception of the noise is reduced. The noise perception is thus also carried by the LPC coefficients. This experiment indicates that, in a noisy environment, a good estimation of the clean speech signal CELP parameters will achieve noise reduction.

In [Chandran, Marchok 2000], the modification of both fixed and adaptive codebook gains was motivated by the trade-off between efficiency and low computational cost. This approach leads to a new transfer function defined as follows:

H_m(z) = γ_f(m) · g_f(m) / [ (1 − γ_a(m) · g_a(m) · z^(−T(m))) · (1 + Σ_{i=1}^{M} a_i(m) · z^(−i)) ]    (3.18)

Since the adaptive codebook vector generally has a higher signal-to-noise ratio than the fixed codebook vector, especially during voiced speech segments, γ_f(m) should be computed such that the decoded speech signal power is optimal. Special care should be taken in this approach to ensure a trade-off between a high noise attenuation and the preservation of the original speech signal power. The technique based on EQ. 3.18 clearly reduces the noise effect. The steps and computational details of this approach are explained in [Chandran, Marchok 2000].

Noise Reduction, Acoustic Echo Cancellation and Gain Control can all be considered as automatic dynamic amplitude scaling. The authors in [Sukkar et al. 2006] have proposed techniques to directly and dynamically scale the coded parameters of the speech signal. They simultaneously scale the fixed codebook and the adaptive codebook gains according to a predefined speech target contour and the CELP encoding process. The entire coded domain scaling process, as performed in [Sukkar et al. 2006], is described in Fig. 3.6 below. The bit-stream is partially decoded to extract the required parameters. Once the gains (fixed and adaptive codebook gains) are modified, the scaled gains are quantized and mapped back into the bit-stream.

Figure 3.6: Coded Domain Scaling (the bit-stream x(k) is partially decoded in the network area, the codebook gains g_a and g_f are scaled according to the target scale contour, then quantized and mapped back into the modified bit-stream x′(k)).

The principle of this experiment is to modify the coded parameters so as to obtain a decoded speech signal that follows a defined characteristic.
Let x(n) be a signal; the signal x_d, with a characteristic defined according to a contour G_x(n), is given by:

x_d(n) = x(n) · G_x(n)    (3.19)

The contour characteristic G_x(n) is used to scale x(k), which represents the coded parameters of the signal x(n). The signal x_sd(n) denotes the encoded and decoded version of the signal x_d(n). The result of this experiment is that the decoded signal x′(n) obtained from the modified bit-stream x′(k) approximates the signal x_sd(n). Experimental results indicate that scaling the coded parameters according to the target contour restores the speech signal contour G_x(n) without affecting other speech quality aspects.

Another consideration is that when speech calls are connected through two different network end-to-end point devices, the volume levels are often unbalanced, one side being higher or lower than the other. A level control processing is then needed to achieve a sufficient signal level. Classical speech level control is performed by multiplying the speech signal in PCM format by a suitable gain. The idea of controlling the level in the parameter domain was proposed in [Pasanen 2006], where the quantized speech parameters are directly modified. This approach reduces the system complexity, and the end-to-end delay is reduced as well. Experimental results have shown that the performance of level control in the coded domain is similar to that obtained by level control in the time domain. The great advantages are that this new approach preserves quality and that complexity is reduced, as no transcoding is involved.

Based on the studies previously mentioned, we follow the same principle by proposing an estimation of the speech coded parameters in a noisy environment. Specifically, we propose two noise reduction algorithms. First, a noise reduction based on the weighting of the fixed codebook gain is explained. Then, a noise reduction algorithm based on the modification of the LPC coefficients is described.

3.4 Noise Reduction Based on Weighting the Fixed Gain

Noise reduction in the speech codec parameter domain is performed following the same steps as noise reduction in the frequency domain, except that there is no need to transform the codec parameters. As highlighted in Fig. 3.7, the involved steps are as follows:

Figure 3.7: Fixed Codebook Gain Modification in the Parameter Domain (the noisy fixed codebook gain g_f,Y is modified using an estimate of the noise fixed codebook gain g_f,D, followed by a post filter; the LPC coefficients, adaptive codebook vector/gain and fixed codebook vector are kept unchanged).

1. In contrast to the DFT transform, the parameter domain transform refers to the quantized representation of the speech signal as a set of parameters.

2. The second step is where the noise fixed codebook gain g_f,D is estimated. It is assumed that the noisy speech signal fixed codebook gain g_f,Y results from the contribution of the clean speech signal gain g_f,S and the noise gain g_f,D: g_f,Y is thus a function of g_f,S and g_f,D.

3. The third step is the effective noise reduction, performed by applying a weighting or modification rule to the corrupted speech signal gain, such that ĝ_f,S = γ_c · g_f,Y.
4. One important and last step is the control of the amount of noise reduction through a post filtering, as depicted in Fig. 3.7.

3.4.1 Estimation of the Noise Fixed Codebook Gain

A transposition of the minimum statistics approach of the frequency domain, as described in Sec. 3.2.3.2, is used. This technique is implemented by mimicking the minimum statistics method, transposed from the frequency domain to the parameter domain. We start this estimation by assuming that the fixed codebook gain of the noise signal g_f,D characterizes the noise amplitude. Similarly to the classical minimum statistics, we consider that g_f,D is the floor of the fixed codebook gain g_f,Y of the noisy speech signal y(n). Based on this assumption, finding the minimum of g_f,Y over sub-frames leads to an estimate of g_f,D over these sub-frames. Applying the minimum statistics to the fixed gain parameter involves the following steps. A smoothing factor is applied to the noisy speech fixed codebook gain g_f,Y so as to compute a smoothed version g_Y of g_f,Y, according to:

g_Y²(m) = α(m) · g_Y²(m − 1) + (1 − α(m)) · g_f,Y²(m)    (3.20)

where α(m) is defined below. The minimum of g_Y² is then searched within a window of τ = 100 sub-frames (cf. Chap. 2):

g_Y,min²(m) = min( g_Y²(m), g_Y²(m − 1), . . . , g_Y²(m − τ) )    (3.21)

To compensate the bias introduced by this estimation problem (estimation of the noise fixed gain by minimizing g_Y² over several sub-frames), the estimated noise fixed gain is obtained by weighting the minimum noisy speech signal fixed gain with an over-estimation factor β_overest, and thus:

ĝ_f,D(m) = β_overest · g_Y,min(m)    (3.22)

To relate the estimated noise fixed codebook gain ĝ_f,D(m) to the SNR, the smoothing factor α(m) is taken equal to:

α(m) = max( α_min , α_max / ( 1 + [ (g_Y(m − 1) / ĝ_f,D(m − 1))² − 1 ] ) )    (3.23)

where α_min = 0.3 and α_max = 0.98.

Once the noise fixed gain is estimated, the next step consists in designing the filter to apply to the noisy speech fixed codebook gain. We start by simulating a communication scenario between two talkers through a network using the same codec, the AMR-NB 12.2 kbps mode. The noisy speech signal is first encoded with encoder A, resulting in a noisy bit-stream. The noisy bit-stream is then decoded through decoder A, which is modified such that it is possible to extract all the parameters needed by our noise reduction system. After the enhancement of the fixed codebook gain, the enhanced fixed codebook gain is introduced inside the noisy bit-stream. This enhanced bit-stream is finally decoded by decoder A. The noisy speech signal is the sum of car noise and a speech signal. For each noisy signal, the signal-to-noise ratio is computed only during speech activity (VAD = 1), as follows:

SNR_seg = (1/L) · Σ_{l=0}^{L−1} SNR(l)    (3.24)

where L is the total number of sub-frames and SNR(l) is given by the relation:

SNR(l) = 10 · log10( Σ_{n=0}^{N−1} s(l · N + n)² / Σ_{n=0}^{N−1} d(l · N + n)² )    (3.25)

where N is the number of samples per sub-frame. In this example, the resulting signal-to-noise ratio is SNR_seg = 6 dB. The first step of the noise reduction consists in the estimation of the noise signal fixed codebook gain ĝ_f,D.
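A minimal sketch of this estimation loop, under the assumptions above, is given below. The function name and the default β_overest are ours; note that EQ. 3.23 only bounds α(m) from below, and the additional cap at α_max in the sketch is a practical safeguard we add for the case where g_Y falls below the estimate.

    import numpy as np

    def estimate_noise_fixed_gain(gf_Y, tau=100, beta_overest=1.5,
                                  alpha_min=0.3, alpha_max=0.98):
        """Noise fixed codebook gain estimate of EQs. 3.20-3.23.
        gf_Y: noisy fixed codebook gains, one value per sub-frame."""
        gY2 = gf_Y[0] ** 2                  # smoothed squared gain of EQ. 3.20
        gD_hat = np.sqrt(gY2)               # initial noise gain estimate
        window = []                         # last tau values of gY2
        out = np.zeros(len(gf_Y))
        for m in range(len(gf_Y)):
            # SNR-dependent smoothing factor, EQ. 3.23
            ratio2 = (np.sqrt(gY2) / max(gD_hat, 1e-12)) ** 2
            alpha = max(alpha_min,
                        min(alpha_max, alpha_max / max(1.0 + (ratio2 - 1.0), 1e-12)))
            gY2 = alpha * gY2 + (1.0 - alpha) * gf_Y[m] ** 2      # EQ. 3.20
            window.append(gY2)
            if len(window) > tau:
                window.pop(0)
            gD_hat = beta_overest * np.sqrt(min(window))          # EQs. 3.21-3.22
            out[m] = gD_hat
        return out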
In Fig. 3.8, an example of the performance of the estimation of the noise fixed codebook gain is given. It can be observed that during speech periods (2 − 4 s and 6 − 8 s), the estimated noise fixed gain ĝ_f,D is still updated: the method does not stop estimating the noise fixed codebook gain. The largest updates are observed during speech activity; these values correspond to the variations or fluctuations appearing during speech sections. The estimation is fairly constant during non speech periods. It can be seen that the estimated noise fixed gain ĝ_f,D in noise-only periods approximates the mean value of the noisy speech signal fixed codebook gains g_f,Y.

Figure 3.8: Example of Noise Fixed Codebook Gain Estimation (top: noisy speech signal versus time; bottom: noisy speech fixed gain and estimated noise fixed gain, in dB, versus sub-frames).

3.4.2 Relation between Fixed Codebook Gains

From the observation of the curves of the fixed codebook gains in Fig. 3.9, it can be stated that:

– In speech periods, the fixed codebook gains (those of the noisy speech signal and of the clean speech signal) have approximately the same amplitude.
– In noise-only periods, the clean speech codebook gains have a very low amplitude, while the noisy fixed codebook gains correspond exactly to the noise signal fixed codebook gains.
– These observations tell us that g_f,Y is a function of g_f,S and g_f,D, leading to:

g_f,Y(m) = f( g_f,S(m), g_f,D(m) )    (3.26)

Figure 3.9: (a) clean speech signal, (b) noisy speech fixed gain (red) and clean fixed gain (blue), (c) noisy speech fixed gain (red) and noise fixed gain (blue).

In addition, the CELP coder used to compute those parameters is not a linear process. The function f is constructed intuitively, based on several observations. Let us take into account the two hypotheses H1 and H2, which denote the presence and the absence of the clean speech signal respectively; they correspond to high and low SNR respectively. The noisy speech signal can be considered as the clean speech signal when the SNR is high, and g_f,Y is then set to g_f,S. When the SNR is low, the noisy speech signal can be approximated by the noise signal, and g_f,Y is set to g_f,D. With these assumptions, setting f as a linear function is the simplest choice. This choice is not limitative. If the clean speech and the noise have the same energy, this estimation will exhibit some bias. To reduce this bias, we use an exponent factor δ(m), depending on the sub-frame index m:

g_f,Y^δ(m)(m) = g_f,S^δ(m)(m) + g_f,D^δ(m)(m)    (3.27)

This choice has a good behavior since, according to the observation of several fixed gain curves, the approximation g_f,Y(m) ≈ g_f,S(m) can be made at high SNR when δ(m) > 1. In fact, if the noise fixed codebook gain is constant, g_f,Y^δ(m)(m) ≫ g_f,S^δ(m)(m) already holds for smaller values of g_f,S than those required to have g_f,Y(m) ≫ g_f,S(m).

3.4.3 Attenuation Function

Similar to the filter smoothing process in the frequency domain, the modification filter is inspired by the Wiener filter. The noise reduction process achieved in the parameter domain is the estimation of the fixed codebook gain of the clean speech by:

ĝ_f,S(m) = γ_c(m) · g_f,Y(m)    (3.28)

The weighting function γ_c(m), like the standard Wiener filter, is computed on the basis of the signal-to-noise ratio; the SNR in this case is the a priori SNR.
γ_c(m) = SNR_δ / (1 + SNR_δ)    (3.29)

where

SNR_δ = ĝ_f,S^δ(m)(m) / ĝ_f,D^δ(m)(m) = ( g_f,Y^δ(m)(m) − ĝ_f,D^δ(m)(m) ) / ĝ_f,D^δ(m)(m)    (3.30)

For efficient results, we follow the same suggestion made in [Ephraim, Malah 1984] by introducing the a priori Signal-to-Noise Ratio. The factor δ(m) introduced in Sec. 3.4.2 (cf. [Cappe 1994]) must be linked to the SNR and thus to the weighting function γ_c(m). We propose the following choice:

δ(m) = δ1 if γ_c(m) > 0.5, δ(m) = δ2 if γ_c(m) ≤ 0.5    (3.31)

with δ1 = 2 and δ2 = 0.75. The final estimation of SNR_δ is drawn from the Ephraim and Malah decision directed approach [Ephraim, Malah 1984]:

SNR_δ(m) = (1 − β_fg) · g_f,Y^δ(m)(m) / ĝ_f,D^δ(m)(m) + β_fg · ( γ_c(m − 1) · g_f,Y(m − 1) )^δ(m) / ĝ_f,D^δ(m)(m)    (3.32)

where β_fg is taken in the interval [0, 1] and affects the SNR updating rate. By analogy with the frequency domain, the a priori SNR is here calculated according to:

SNR_prio = g_f,Y^δ(m)(m) / ĝ_f,D^δ(m)(m)    (3.33)

The term

ĝ_f,S^δ(m)(m) = ( γ_c(m) · g_f,Y(m) )^δ(m)    (3.34)

in EQ. 3.32 corresponds to

|G_EMSR(p − 1, fk) · Y(p, fk)|²    (3.35)

in EQ. 3.11. The last term, ĝ_f,D^δ(m)(m), is equivalent to D̂(p, fk).
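The following sketch combines EQs. 3.29-3.32 for one sub-frame. One practical choice we make here (and flag as ours): since EQ. 3.31 defines δ(m) from γ_c(m) itself, the sketch selects δ(m) from the previous attenuation value γ_c(m − 1).

    def attenuation_factor(gf_Y_m, gD_hat_m, gamma_prev, gf_Y_prev,
                           beta_fg=0.9, delta1=2.0, delta2=0.75):
        """Wiener-like weighting gamma_c(m) of EQs. 3.29-3.32.
        gamma_prev, gf_Y_prev: gamma_c(m-1) and gf_Y(m-1)."""
        # EQ. 3.31, evaluated on the previous attenuation (our choice)
        delta = delta1 if gamma_prev > 0.5 else delta2
        gD = max(gD_hat_m, 1e-12) ** delta
        # Decision-directed SNR in the gain domain, EQ. 3.32
        snr = ((1.0 - beta_fg) * gf_Y_m ** delta / gD
               + beta_fg * (gamma_prev * gf_Y_prev) ** delta / gD)
        return snr / (1.0 + snr)            # EQ. 3.29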
A typical example of the estimation of the clean speech fixed codebook gain is presented in Fig. 3.10. The system maintains the estimated clean speech fixed codebook gain as close as possible to the noisy one during speech periods, where the SNR is high. This result can be explained by the assumption introduced in Sec. 3.4.2. The reduction of the corrupted fixed codebook gain amplitude is effective during noise-only periods; the reduction is about 5 dB in this example, where the signal-to-noise ratio is SNR_seg = 6 dB.

Figure 3.10: (a) clean speech, (b) noisy speech, (c) noisy fixed gain (red) and estimated clean fixed gain (blue).

3.4.4 Noise Reduction Control: Post Filtering

To keep the estimated fixed codebook gain of the clean speech signal as close as possible to the noisy speech fixed gain, especially in periods of high SNR, the attenuation function γ_c(m) has to be controlled. The control also helps to avoid abrupt variations of γ_c(m), and thus to reduce artifacts. We compute the energies in the parameter domain by evaluating the overall energy of both the noisy speech signal and the estimated clean speech signal. These energies represent the energy before and after the noise reduction process.

Let us consider E_u(m) and E′_u(m), the energies before and after the process respectively:

E_u(m) = Σ_{i=1}^{Q} ( g_a,Y(m) · v_i(m) + g_f,Y(m) · c_i(m) )²    (3.36)

E′_u(m) = Σ_{i=1}^{Q} ( g_a,Y(m) · v_i(m) + γ_c(m) · g_f,Y(m) · c_i(m) )²    (3.37)

where v_i(m) and c_i(m) stand for the adaptive codebook excitation and the fixed codebook excitation respectively, g_a,Y is the adaptive codebook gain and Q is the number of samples per sub-frame.

The control rule is as follows: depending on the values of E_u(m) and E′_u(m), the attenuation function is compensated so that the noisy speech signal fixed gain stays unchanged during high SNR, as illustrated by Fig. 3.10. According to the foregoing and EQ. 3.27, we propose to compute the smoothing function as follows:

γ_c(m) = γ_c(m), if 10 · log10( E′_u(m) / E_u(m) ) ≥ Th_dB
γ_c(m) = 1, if 10 · log10( E′_u(m) / E_u(m) ) < Th_dB    (3.38)

A CCR (Comparison Category Rating) test as specified in [ITU-T 1996] was performed by comparing the proposed algorithm against the standard Wiener filter approach. The simulations in [Taddei et al. 2004] and [DeMeuleneire 2003] show that the modification of the fixed codebook gain based on the principle proposed in this work provides good results for medium SNR. In Fig. 3.11, the curves of the noisy, clean and estimated fixed codebook gains are jointly compared. In comparison with the original clean fixed codebook gain during noise-only periods (see Fig. 3.10 (b)), the noisy speech fixed codebook gain is not completely eliminated during noise reduction. As a consequence, a slight amount of background noise will remain in the enhanced speech signal. In noise-only periods (sub-frames 1 to 400 for example), the estimated fixed codebook gain can be regarded as an attenuated version of the noisy speech fixed codebook gain. In speech periods (sub-frames 400 to 800 for example), the attenuation is performed by taking into account the clean speech fixed codebook gain shape. This result follows from the formulation introduced in EQ. 3.27 and the assumption on which this equation is based.

Figure 3.11: Estimated Fixed Codebook Gain (noisy, clean and estimated clean speech fixed gains versus sub-frames).

3.5 Noise Reduction through Modification of the LPC Coefficients

In CELP coders, the technique used to estimate the LPC coefficients is based on the assumption that the speech signal follows an autoregressive model. This assumption leads to the Yule-Walker equations, which can be solved using the Levinson-Durbin algorithm [Haykin 2002a]. If the clean speech signal is encoded, the Yule-Walker equations are given by the relation:

R_S = −Γ_S · A_S    (3.39)

where R_S = [r_S(1), . . . , r_S(M)] is the vector of clean speech autocorrelation coefficients, with r_S(j) = Σ_n s_w(n) · s_w(n − j) for j = 0, . . . , M, computed over the windowed analysis frame s_w. The number M represents the order of the linear prediction analysis, Γ_S is the Toeplitz autocorrelation matrix and A_S is the vector of LPC coefficients. In the same way, if the noise is encoded, then:

R_D = −Γ_D · A_D    (3.40)

where R_D is the vector of the noise autocorrelation coefficients, Γ_D is the noise autocorrelation matrix and A_D is the vector of noise LPC coefficients. If both the speech signal and the noise are present, the Yule-Walker equation is:

R_Y = −Γ_Y · A_Y    (3.41)

In EQ. 3.41, R_Y, Γ_Y and A_Y stand for the vector of the noisy autocorrelation coefficients, the noisy autocorrelation matrix and the noisy LPC vector respectively.

The aim of this section is to exhibit and analyze the relations between the noisy speech LPC vector, the clean speech LPC vector and the noise LPC vector. In [Thepie et al. 2008], we have proposed a noise reduction system based on the modification of the LPC coefficients. This method relies on the VAD decision (option 1) implemented in AMR-NB; the VAD decision indicates how to process the extracted noisy LPC coefficient vector A_Y. A general overview of the approach is depicted in Fig. 3.12. The noisy speech signal is first coded and transmitted over the network.
Inside the network, the LPC coefficient vector A_Y of the noisy speech is extracted, whereas the remaining parameters are kept unchanged. If speech is present, that is if VAD = 1, a modification function F is applied to A_Y to obtain an estimation ÂS of the clean speech LPC coefficients. If VAD = 0, meaning that only noise is present, A_Y is damped through a function G. The estimated clean speech LPC coefficient vector ÂS is then mapped into the bit-stream. The decoding of the modified bit-stream provides an enhanced speech signal. The design of the functions F and G will be described in the next sections.

Figure 3.12: Principle of NR based on LPC Coefficients (the noisy LPC coefficients are decoded from the bit-stream in the network area, processed through F if VAD = 1 or through G if VAD = 0, and mapped back into the bit-stream; the pitch delay index, adaptive gain index and fixed codebook index are kept unchanged).

3.5.1 Estimation during Voice Activity Periods

In this section, we exploit how the additive perturbation influences the autoregressive model, in order to design the function F. We assume that the noisy signal is the sum of the speech signal and the noise. Moreover, we assume that the speech signal and the noise are not correlated, so that the autocorrelation matrix and the vector of the noisy autocorrelation coefficients can be decomposed according to:

R_Y = R_S + R_D  ⇒  Γ_Y = Γ_S + Γ_D    (3.42)

Introducing EQ. 3.42 into EQ. 3.41, we obtain the relation:

R_S + R_D = −(Γ_S + Γ_D) · A_Y    (3.43)

By taking into account the equations obtained during the linear prediction analysis of the clean speech signal and of the noise (EQ. 3.39 and EQ. 3.40), EQ. 3.43 can be formulated as follows:

−Γ_S · A_S − Γ_D · A_D = −(Γ_S + Γ_D) · A_Y    (3.44)

Finally, reorganizing EQ. 3.44, the LPC coefficient vector of the clean speech signal can be computed via the formula:

A_S = A_Y + (Γ_S)^(−1) · Γ_D · (A_Y − A_D)    (3.45)

EQ. 3.45 can be interpreted as a filtering process applied to A_Y to obtain A_S through the function F shown in Fig. 3.12. In this formula, the term (Γ_S)^(−1), which represents the inverse of the clean speech autocorrelation matrix, appears as a particularly critical term. First, the clean speech autocorrelation matrix Γ_S is unknown. Second, the estimate of Γ_S must be inverted when evaluating EQ. 3.45. In fact, the filtering process given by EQ. 3.45 requires the estimation of the autocorrelation matrices Γ_S and Γ_D, as well as the estimation of the LPC coefficients A_D of the noise.
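Given estimates of all the quantities involved, EQ. 3.45 reduces to one linear solve per sub-frame, as in the sketch below (function name ours; the minus sign follows from the reorganization of EQ. 3.44). Solving the system avoids forming the explicit inverse of Γ_S, which is numerically preferable.

    import numpy as np

    def estimate_clean_lpc(A_Y, A_D_hat, Gamma_S_hat, Gamma_D_hat):
        """Clean speech LPC vector per EQ. 3.45:
        A_S = A_Y + inv(Gamma_S) . Gamma_D . (A_Y - A_D)."""
        correction = np.linalg.solve(Gamma_S_hat,
                                     Gamma_D_hat @ (A_Y - A_D_hat))
        return A_Y + correction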
The flowchart needed to estimate the clean speech LPC coefficients is described in Fig. 3.13. To realize such an estimation, we simulate the network area, where the PCM samples of the speech signal are not available. The noisy speech is first encoded using an AMR-NB encoder in 12.2 kbps mode. We should note that in 12.2 kbps mode two sets of LPC coefficients are computed by the encoder, whereas only one set is computed in the other modes. For a given frame p of four sub-frames m = 1, . . . , 4, the LPC coefficients are computed in 12.2 kbps mode for sub-frames m = 2 and m = 4; for the other modes, the LPC coefficients are computed only for sub-frame m = 4.

During the encoding process, we extract from the encoder the VAD decision, the noisy autocorrelation coefficients R_Y used to build the autocorrelation matrix Γ_Y, and the noisy LPC coefficients A_Y.

Figure 3.13: Estimation Flowchart of the Clean Speech LPC Coefficients (the VAD decision, Γ_Y and A_Y extracted from the AMR encoder feed the estimation of Γ̂_D, Â_D and Γ̂_S, followed by the clean speech LPC estimation and a post filtering).

These extracted parameters are then used to estimate, step by step, the parameters needed to compute the estimated clean LPC coefficient vector via EQ. 3.45. First, the VAD decision and Γ_Y are used to compute an estimation Γ̂_D of the noise signal autocorrelation matrix. Then, the VAD decision and A_Y are used to estimate the noise signal LPC coefficients Â_D. The estimated noise autocorrelation matrix Γ̂_D and the noisy signal autocorrelation matrix Γ_Y are used to compute the clean speech autocorrelation matrix Γ̂_S. Afterwards, the process estimates the clean speech LPC coefficients ÂS according to EQ. 3.45, using A_Y, Â_D, Γ̂_S and Γ̂_D instead of the actual but unknown values.

3.5.1.1 Estimation of the Noise LPC Vector: Â_D

The estimation of Â_D is based on averaging the noisy speech LPC vector during periods of noise alone. During speech periods, the smoothing is frozen:

if VAD = 0:  Â_D(m) = α_lpc · Â_D(m − 1) + (1 − α_lpc) · A_Y(m)
if VAD = 1:  Â_D(m) = Â_D(m − 1)    (3.46)

where m is the sub-frame number and α_lpc ≈ 0.8 is chosen experimentally. The estimation during speech activity is thus held at the latest estimate obtained while the VAD was zero. This estimation has a minimal effect on the estimation specified by EQ. 3.45 if all the remaining parameters are known.

3.5.1.2 Estimation of the Noise Autocorrelation Matrix: Γ̂_D

The noise signal characteristics can be used to estimate the noise autocorrelation matrix, which plays a central role in EQ. 3.45. If the noise is white, the non-diagonal components of the autocorrelation matrix are zero:

Γ_D = r_D(0) · I    (3.47)

where I is the identity matrix. Accordingly, we would just need to estimate the signal energy, which alleviates the estimation problem. However, experimental results based on this technique achieve poor estimations with any noise different from white noise, and white noise is rarely met in real applications. A different method, exposed below, proposes an estimation of the noise autocorrelation matrix when the noise is not necessarily white.

The estimation of the noise autocorrelation matrix starts with the estimation of the noise autocorrelation coefficients. We propose here to estimate the noise autocorrelation coefficients by the Inverse Recursive Levinson-Durbin algorithm [Haykin 2002a]. The implementation of this algorithm is similar to that of the Recursive Levinson-Durbin algorithm: it needs the knowledge of the final prediction error and of the LPC coefficients to compute the associated autocorrelation coefficients. The Inverse Recursive Levinson-Durbin algorithm performs the inverse operation of the Recursive Levinson-Durbin algorithm carried out by the encoder to compute the LPC coefficients. As we are working without the PCM speech samples, the estimation of the final prediction error is performed based on the coded parameters.
During the computation of the noisy LPC coefficients, the Recursive Levinson-Durbin algorithm uses the autocorrelation sequence R_Y to compute the LPC coefficients A_Y, the reflection coefficients K_Y and the final prediction error power err_Y (cf. Appendix B). The final prediction error err_D that would be obtained if only the noise were encoded can be estimated from the noisy final prediction error err_Y. To that end, err_Y is computed at the decoder side as the energy of the total noisy excitation signal. With respect to the voice activity decision, the final prediction error err_D is estimated as follows:

if VAD = 0:  êrr_D(m) = μ_lpc · êrr_D(m − 1) + (1 − μ_lpc) · êrr_Y(m)
if VAD = 1:  êrr_D(m) = êrr_D(m − 1)    (3.48)

The smoothing parameter μ_lpc is set to 0.8.

The principle of the Inverse Recursive Levinson-Durbin algorithm is to compute an autocorrelation sequence knowing the associated LPC coefficients and the final prediction error power. The noise autocorrelation sequence can thus be estimated by applying the Inverse Recursive Levinson-Durbin algorithm to (Â_D(m), êrr_D(m)). The estimated noise autocorrelation matrix Γ̂_D(m) is then directly built, since it is the symmetric Toeplitz matrix whose first column (and first row) is R̂_D(m) = [r̂_D(0), . . . , r̂_D(M − 1)]:

Γ̂_D(m) = Toeplitz( r̂_D(0), r̂_D(1), . . . , r̂_D(M − 1) )    (3.49)

3.5.1.3 Estimation of the Speech Autocorrelation Matrix: Γ̂_S

This estimation is the central point of the process: a matrix inversion is required by EQ. 3.45, and if the estimation is not good enough, it will be impossible to evaluate EQ. 3.45. The authors in [Un, Choi 1981] have proposed estimating the autocorrelation matrix Γ̂_S in the frequency domain, introducing a threshold acting as a voice detection. After Γ̂_S is computed, they use the Recursive Levinson-Durbin algorithm to compute the clean LPC coefficients.

The estimation of the clean speech autocorrelation matrix is achieved in two steps. The first step involves the computation of Γ̂_S(m) using R̂_S(m), whereas the second step leads to the enhancement of the estimation.

Step 1: The noisy speech signal autocorrelation coefficients R_Y(m) are obtained by applying the Inverse Recursive Levinson-Durbin algorithm to (A_Y(m), err_Y(m)); this is possible because the noisy speech parameters are all available inside the network. An estimation of the clean speech autocorrelation coefficients is performed by exploiting the non-correlation between noise and speech (see EQ. 3.42), leading to:

R̂_S(m) = R_Y(m) − R̂_D(m)    (3.50)

The estimated clean speech autocorrelation matrix Γ̂_S is obtained, according to its Toeplitz structure, from R̂_S(m):

Γ̂_S(m) = Toeplitz( r̂_S(0), r̂_S(1), . . . , r̂_S(M − 1) )    (3.51)

The estimated autocorrelation matrix is a Toeplitz matrix and is positive definite, thus its inverse exists. As the matrix is symmetric and its first component is positive, it can be inverted using techniques such as the Cholesky decomposition [Haykin 2002a].
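Step 1 and the associated matrix handling can be sketched as follows (helper names ours; the autocorrelation sequences are assumed already estimated as described above):

    import numpy as np
    from scipy.linalg import toeplitz, cho_factor, cho_solve

    def estimate_clean_autocorr_matrix(R_Y, R_D_hat):
        """EQ. 3.50 and the Toeplitz build of EQ. 3.51.
        R_Y, R_D_hat: autocorrelation sequences [r(0), ..., r(M-1)]."""
        R_S_hat = np.asarray(R_Y) - np.asarray(R_D_hat)   # EQ. 3.50
        return toeplitz(R_S_hat)                          # EQ. 3.51

    def solve_toeplitz_system(Gamma, b):
        """Solve Gamma . x = b by Cholesky factorization; cho_factor
        raises LinAlgError when Gamma is not positive definite, which
        signals the ill-conditioning discussed in Step 2 below."""
        c, low = cho_factor(Gamma)
        return cho_solve((c, low), b)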
Step 2: In general, the amplitude of the first autocorrelation coefficient is large; this value represents the energy of the signal over the window used for the signal analysis. The lags are defined as the index locations of the autocorrelation vector: lag(R̂_S(m)) = [1, 2, . . . , M]. When dealing with a noisy signal (speech + noise), the autocorrelation coefficients at lags close to zero are generally more corrupted by noise than the coefficients at more distant lags [Haykin 2002a]. For example, in a sequence of autocorrelation coefficients R_Y(m) = [r_Y(1), . . . , r_Y(M)] at sub-frame m, the coefficient r_Y(1) is much more corrupted by noise than r_Y(M).

In the presence of noise, the voice detection is sometimes wrong. This impacts the estimation of the noise parameters, as the parameters are then not well updated. A sequence of estimated autocorrelation coefficients that differs too much from the original one can cause the ill-conditioning of the resulting autocorrelation matrix. The ill-conditioning of the estimated matrix is particularly annoying for the computation of the LPC coefficients [Kabal 2003]: if the estimated matrix is ill-conditioned, its inversion will diverge. To overcome the problems due to transition periods and wrong VAD decisions, we introduce a threshold Th_VAD such that, if the noisy speech signal energy is higher than this threshold, the estimation is based on EQ. 3.45, while if the noisy speech signal energy is lower than this threshold, a damping procedure is used instead to estimate the clean speech LPC coefficients. Therefore, the formulation of EQ. 3.45 is used only if:

if VAD = 1 and r_Y(1) > Th_VAD:  A_S is estimated via EQ. 3.45
else:  the damping procedure is applied    (3.52)

Filter Stability Problems: The estimate ÂS obtained via EQ. 3.45 must lead to a stable filter. This condition is satisfied if and only if the poles of the associated synthesis polynomial are located inside the unit circle [El-Jaroudi, Makhoul 1991]. If the autocorrelation matrix is ill-conditioned, some poles can be located outside the unit circle [Kondoz 1994]. To solve this problem, a bandwidth expansion procedure is used.

A bandwidth expansion is used in AMR-NB [3GPP 1999b] to reduce the ill-conditioning of the autocorrelation matrix, which is generally a source of stability problems [Kabal 2003]. The bandwidth expansion involves multiplying the autocorrelation coefficients by a lag-window vector given by:

W_lag(i) = exp( −(1/2) · (2π · f0 · i / fs)² ),  i = 0, . . . , M    (3.53)

with f0 = 60 Hz and fs = 8000 Hz.

Figure 3.14: Lag Windowing Values (multiplication factor versus lag number, for f0 = 60, 70, 80, 90 and 100 Hz).

The effect of the lag-window vector is to weight the autocorrelation coefficients according to their lag: the higher the lag of a coefficient, the larger the attenuation. We have tested several values of f0: 60, . . . , 100 Hz, see Fig. 3.14. Using f0 = 100 Hz, we experimentally notice that stability problems do not occur. As seen in Fig. 3.14, the lag window with f0 = 100 Hz significantly reduces the level of the autocorrelation coefficients at high lags. As a result, the coefficients with a high noise correlation contribute less to the computation of the autocorrelation matrix. The level of the coefficients at higher lags decreases when f0 increases.
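EQ. 3.53 translates directly into a few lines (function name ours):

    import numpy as np

    def lag_window(M=10, f0=100.0, fs=8000.0):
        """Lag-window vector of EQ. 3.53 for lags i = 0, ..., M."""
        i = np.arange(M + 1)
        return np.exp(-0.5 * (2.0 * np.pi * f0 * i / fs) ** 2)

    # Bandwidth expansion of an autocorrelation sequence r:
    # r_windowed = lag_window(M)[:len(r)] * r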
3.5.2 Estimation during Noise Only Periods

The estimation of the clean speech LPC coefficients during noise-only periods uses a different technique, because EQ. 3.45 cannot be evaluated when there is no speech signal: during such periods Γ_S(m) = 0 and, accordingly, its inverse cannot be computed. The idea behind the modification of the LPC coefficients is then to attenuate the noisy spectrum amplitude. This attenuation is efficient if it follows the noise amplitude variations. To do so, we compute in each sub-frame a damping factor [Kabal 2003] λ(m) > 1 as a linear function of the estimated noise energy Ê_D(m), with coefficients ν and η described in the following way:

λ(m) = ν · Ê_D(m) + η    (3.54)

In the Z-domain, when the damping factor is applied to the polynomial A_Y(z) associated with the noisy LPC coefficients, the estimated clean speech LPC polynomial in sub-frame m is given by:

ÂS(z) = A_Y(z / λ(m))    (3.55)

In fact, EQ. 3.55 amounts to modifying the k-th noisy LPC coefficient as follows:

â_S(k) = a_Y(k) / (λ(m))^k    (3.56)

To avoid too strong or too weak an attenuation, we introduce two thresholds, T_min = 27 dB and T_max = 60 dB, to control the attenuation level. If the noise energy Ê_D(m) is lower than the threshold T_min, then the attenuation factor is set to a lower value, λ(m) = λ_min = 2. Conversely, when the noise energy Ê_D(m) is above T_max, λ(m) is limited to λ_max = 10. It follows that the coefficients of the linear function introduced in EQ. 3.54 are computed as:

ν = (λ_min − λ_max) / (T_min − T_max)  and  η = λ_min − ν · T_min    (3.57)

Fig. 3.15 shows the evolution of the damping factor in noise-only periods. We clearly see that the attenuation factor is constant, equal to λ_max, if the noise energy is larger than T_max; if the noise energy is below T_min, then λ(m) = λ_min. Such an approach has an interesting behavior, as the attenuation follows the noise energy characteristic. The estimation of the noise energy in the coded parameter domain is performed based on the idea developed in [Doh-Suk et al. 2008].

Figure 3.15: Damping Factor Characteristics (λ(m) versus Ê_D(m), constant at λ_min below T_min and at λ_max above T_max).

A typical example of a spectrum where the damping procedure has been applied to noisy LPC coefficients is presented in Fig. 3.16. The original clean speech LPC spectrum in silence periods is particularly flat, while the noisy speech LPC spectrum exhibits a significant peak; the noisy speech and the clean speech LPC spectra have no similarity. The high peak is erased in the estimated LPC spectrum. We can also see that the estimated spectrum better approximates the original spectrum than the noisy one does.

Figure 3.16: Typical Example of Spectrum Damping (clean, estimated and noisy LPC spectra over frequency bins).
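The damping procedure of EQs. 3.54-3.57 can be sketched as follows (function names ours; the noise energy is assumed to be available in dB):

    import numpy as np

    def damping_factor(E_D, T_min=27.0, T_max=60.0, lam_min=2.0, lam_max=10.0):
        """Damping factor of EQs. 3.54 and 3.57, clipped to
        [lam_min, lam_max] as in Fig. 3.15. E_D: noise energy in dB."""
        nu = (lam_min - lam_max) / (T_min - T_max)
        eta = lam_min - nu * T_min
        return float(np.clip(nu * E_D + eta, lam_min, lam_max))

    def damp_lpc(a_Y, lam):
        """EQ. 3.56: a_hat_S(k) = a_Y(k) / lam^k for k = 1, ..., M."""
        k = np.arange(1, len(a_Y) + 1)
        return np.asarray(a_Y) / lam ** k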
3.5.3 Some Experimental Results

We studied several speech files in the presence of car noise. A classical noise reduction in a simulated network environment was implemented: the noisy speech signal is first encoded into an AMR-NB 12.2 kbps mode bit-stream; to simulate the VQE, the bit-stream is decoded into PCM format, processed by the standard Wiener filter in the frequency domain and coded back into an AMR-NB 12.2 kbps mode bit-stream. Finally, the signal is decoded into PCM, as it would be at the far-end terminal. The Wiener filter implemented in this simulation is similar to that of Sec. 3.2.2 (also described in [Beaugeant 1999]).

The properties of our proposed method are analyzed by comparing the spectra of the synthesis filters associated with the LPC coefficients. The synthesis filters, or transfer functions in the frequency domain, obtained from the LPC vectors ÂS, Â_Wiener, A_Y and A_S are denoted respectively H_Ŝ(f), H_Wiener(f), H_Y(f) and H_S(f). We compare the spectra of these transfer functions during speech and non speech periods. The mean spectral errors between the spectra of the transfer functions are also computed for each sub-frame:

Error = (1 / N_FFT) · Σ_f |H_S(f) − H_U(f)|    (3.58)

where N_FFT is the length of the FFT analysis and H_U(f) stands for any of H_Ŝ(f), H_Y(f) or H_Wiener(f).

The spectra of a sub-frame in a speech period are presented in Fig. 3.17. In Fig. 3.17 (a), one can notice that the proposed method preserves and enhances the speech formants: the LPC modification implemented in this work allows the reconstruction of the formants. The Wiener filter method, as presented in Fig. 3.17 (b), tends to amplify the formants at low frequencies, especially in the range [0 − 1000] Hz, and also increases the energy at these frequencies. In the frequency range [1000 − 3000] Hz, the Wiener filter method also tends to move the formants towards lower frequencies.

Figure 3.17: Typical Estimated Spectrum (SNR_seg = 12 dB): (a) our proposed method displayed with the noisy spectrum, (b) our proposed method compared to the noisy, clean and Wiener method spectra.

The spectral effect of our algorithm can be evaluated by the spectral error in this example. The spectral error obtained with the proposed method is about 0.7 dB, the spectral error without processing is 1.7 dB, and the spectral error achieved by the Wiener method is 2.5 dB. Therefore, the noise reduction method we propose achieves a gain of about 1 dB compared to the noisy spectrum. In most of the worst cases, the estimated spectrum remains close to the noisy one. The concrete achievements and improvements of this noise reduction algorithm will be discussed further in Chap. 5. The noise reduction system combining fixed gain weighting and LPC coefficient modification will be integrated inside the Smart Transcoding process.

3.6 Conclusion

This chapter has shown that techniques used to perform noise reduction in the frequency domain can be successfully transposed into the coded parameter domain. We have presented two algorithms dedicated to noise reduction in the parameter domain:

1. A noise reduction via modification of the fixed codebook gain. This noise reduction system is a direct transposition of the spectral or amplitude attenuation generally performed in the frequency domain.

2. A noise reduction based on modification of the LPC coefficients. This second approach enhances the spectral characteristics of the corrupted speech signal. The noise reduction based on the LPC coefficients is influenced by the properties of the associated filter, and we have proposed some techniques to obtain a stable estimated LPC filter.

We should mention that the CELP encoding process is computationally expensive.
The two techniques proposed in this chapter avoid the complete decoding of the noisy speech signal into PCM format. These methods reduce the noise contribution and increase the SNR, while maintaining the original dynamics of the signal. One advantage is that these algorithms can be applied to any kind of codec based on the CELP technique. The noise reduction system used in the following will be based on the combination of the fixed codebook gain filtering and the LPC coefficient modification.

Chapter 4

Acoustic Echo Cancellation

4.1 Introduction

Current wireless communication networks enable high mobility but are still, in general, affected by external impairments. An impairment such as acoustic echo, due to the coupling between the loudspeaker and the microphone of the mobile device, can drastically impact the communication quality. In modern digital communication networks, a new challenge appears today with the integration of Voice Quality Enhancement units, such as the AEC, inside the network. For providers and industry, a centralized AEC located at a specific area inside the network is very attractive. Several reasons motivate this new approach. The integration of the AEC in the network avoids the implementation of an AEC inside the Mobile Devices; this is very important since the power supply and the computational load are critical issues for Mobile Devices. This solution also leads to a reduction of the overall cost of the communication system.

Speech transmission over current wireless networks is in general achieved using speech codecs based on CELP techniques. The CELP codecs exhibit non-linear characteristics. In cascade with the acoustic echo path, the entire echo path in such an application presents strong and large non-linearities. The performance of standard AEC algorithms, based on adaptive filters to estimate the acoustic echo path, is drastically degraded by these non-linearities [Huang, Goubran 2000]. Based on these observations, we deal in this chapter with new AEC algorithms where the processing is performed through the modification of the CELP coded parameters. The developed algorithms can easily be implemented as a centralized AEC for the GSM network. These algorithms may appear as a solution and a contribution to the problems raised by the standard AEC approach.

The communication system studied in this PhD models an end-to-end conversation between two users (a far-end speaker and a near-end speaker) over a GSM network. This model exhibits a situation where acoustic echo problems are encountered; in particular, acoustic echo appears during communication in hands-free mode.

4.2 Acoustic Echo Scenario

In a mobile communication scenario, the voice of each speaker is recorded at the mobile device by the microphone and digitized for source and channel coding. The coded signal at the near-end speaker Mobile Device is transmitted over a channel to the counterpart of the conversation, where it is decoded and played out through the loudspeaker.

Figure 4.1: Acoustic Echo Scenario (the far-end signal x(t) is played by the loudspeaker at the near-end side, convolved with the echo path H into z(t) = x(t) ∗ H(t), and picked up by the microphone together with the near-end speech: y(t) = z(t) + s(t)).

As described in Fig. 4.1, the symmetry of the system allows us to focus on the near-end speaker side. The communication path between the far-end and the near-end speakers is the loudspeaker path, while the reverse link is the microphone path.
The sound wave x(t) is the input of the loudspeaker at the near-end speaker side and is spread along the echo path (Loudspeaker-Room-Microphone). The propagation of the sound within the near-end environment/room leads to a coupling between the loudspeaker and the microphone. Through the different propagation paths, the acoustic signal is delayed and attenuated. Such a phenomenon is modeled as the convolution of the input signal x(t) with the time-varying impulse response H(t), which characterizes the echo path. Therefore, the acoustic echo signal z(t) is given by:

z(t) = (x ∗ H)(t)    (4.1)

This type of echo is called acoustic echo. There is a superposition of sound waves at the microphone: the microphone at the near-end side records both the modified version z(t) of the loudspeaker sound and the near-end speech signal s(t), so that:

y(t) = s(t) + z(t)    (4.2)

If the acoustic echo is not cancelled or attenuated, the far-end speaker experiences his own voice with a delay of about 200 − 500 ms. The delay results from the accumulation of several delays caused by the speech coding and decoding, the channel coding and decoding, the network transport and possible transcoding. This phenomenon badly impacts the quality of the conversation. Acoustic echo most often appears in hands-free telephony, as the sound spreads not only along the direct path (Loudspeaker - Microphone) but also reflects over the surrounding Loudspeaker-Room-Microphone environment.

Standard solutions to cancel acoustic echo imply the estimation of the filter that models the acoustic echo path. This estimation becomes critical when the near-end speaker signal and the echo signal are simultaneously present, a situation also called double talk. Many techniques have been developed to handle the effect of acoustic echo (see [Haykin 2002b] - [Messerschmidt 1984]). Amongst all these techniques, the family of Stochastic Gradient algorithms and their derived versions are the most implemented solutions [Breining et al. 1999]. These families of algorithms use PCM samples to perform AEC.

This chapter presents in Sec. 4.3 a state of the art in acoustic echo cancellation. The Least Mean Square (LMS) algorithm and the Normalized Least Mean Square (NLMS) algorithm are first introduced. Then the Gain Loss Control in the time domain is presented, followed by the Wiener filter method applied to AEC. Sec. 4.3 ends with a short overview of the double talk problem. Sec. 4.4 analyzes former AEC approaches in the coded parameter domain. The Gain Loss Control in the coded parameter domain is presented in Sec. 4.5. In Sec. 4.6, an AEC algorithm based on filtering the fixed codebook gain is depicted. Sec. 4.7 concludes the chapter.

4.3 Acoustic Echo Cancellation: State of the Art

Stochastic gradient algorithms are the most implemented techniques to adaptively solve estimation problems. The most popular of these algorithms is the LMS algorithm, which has been widely applied to all kinds of adaptive signal processing problems, and in particular to the adaptive estimation of the acoustic echo path.

4.3.1 The Least Mean Square Algorithm

The principle of the LMS algorithm is presented in Fig. 4.2.
The residual error signal e(n) is the difference between the corrupted signal y(n) and the output ŷ(n) of the estimated impulse response:

e(n) = y(n) − ŷ(n) = y(n) − Ĥᵀ(n) · X(n)    (4.3)

Figure 4.2: System Identification in AEC (the adaptive filter Ĥ estimates the echo path H; the residual e(n) = y(n) − ŷ(n) drives the adaptation algorithm).

The estimated impulse response Ĥ of the filter that models the acoustic echo path of order L is given by:

Ĥ(n) = (h0, . . . , h_{L−1})    (4.4)

The input vector X(n) represents the L past samples: X(n) = (x(n), . . . , x(n − L + 1)). Since the filter H(n) is unknown, the LMS algorithm improves Ĥ(n) by adaptively estimating its value until the error between H(n) and Ĥ(n) is minimal. A measure of the error during this estimation is the residual error signal e(n). The LMS algorithm thus searches the Ĥ(n) that minimizes the squared residual signal. One approach to do this is to move Ĥ(n) in the direction of the gradient of the expected error. The update of Ĥ(n) can be written as:

∇Ĥ(n) = −(μ/2) · ∇E[ |y(n) − ŷ(n)|² ] = −(μ/2) · ∇E[ e(n)² ]    (4.5)

The parameter μ is the step size and is used to control the rate of change, E is the mathematical expectation and ∇ is the gradient with respect to Ĥ(n). The stochastic gradient approach replaces the expected error by its instantaneous value. In practice, since ∇e(n) = −X(n), we compute ∇Ĥ(n) as follows:

∇Ĥ(n) ≈ −(μ/2) · ∇[ e(n)² ] = −μ · e(n) · ∇e(n) = μ · e(n) · X(n)    (4.6)

At time index n + 1, the adaptive filter is updated as follows:

Ĥ(n + 1) = Ĥ(n) + μ · e(n) · X(n)    (4.7)

Generally, a sufficient condition for the stability of the LMS algorithm is that the step size μ lies within the range 0 < μ < 1/λ_max, where λ_max is the largest eigenvalue of the input signal autocorrelation matrix Γ_X. The autocorrelation matrix is the symmetric Toeplitz matrix:

Γ_X = Toeplitz( r_X(0), r_X(1), . . . , r_X(L − 1) )    (4.8)

where r_X(j) = Σ_{n=j}^{L−1} x(n) · x(n − j), j = 0, . . . , L − 1.

The Normalized Least Mean Square Algorithm

A fast adaptation of the acoustic echo path can be achieved if the input signal is not strongly correlated. The input speech signal x(n) is highly correlated, and a power normalization can be performed by dividing μ · e(n) · X(n) by a locally estimated power. This technique is called the Normalized Least Mean Square (NLMS) algorithm, where the adaptation procedure is given by:

Ĥ(n + 1) = Ĥ(n) + μ · e(n) · X(n) / ( X(n)ᵀ · X(n) )    (4.9)
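A per-sample sketch of the NLMS update is given below (function name and the regularization term eps are ours; eps simply guards against a zero-power input buffer):

    import numpy as np

    def nlms_step(H_hat, x_buf, y_n, mu=0.5, eps=1e-6):
        """One sample of the NLMS adaptation, EQs. 4.3 and 4.9.
        H_hat: filter estimate of length L; x_buf: the L most recent
        far-end samples [x(n), ..., x(n-L+1)]; y_n: microphone sample."""
        y_hat = H_hat @ x_buf                        # filter output
        e_n = y_n - y_hat                            # residual error, EQ. 4.3
        # Power-normalized update of EQ. 4.9
        H_hat = H_hat + mu * e_n * x_buf / (x_buf @ x_buf + eps)
        return H_hat, e_n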
The short-term average magnitude ȳs(n) of the corrupted signal can be computed with the first-order non-linear recursion below:

ȳs(n) = (1 − αf) · |y(n)| + αf · ȳs(n − 1), if |y(n)| ≤ ȳs(n − 1)
ȳs(n) = (1 − αr) · |y(n)| + αr · ȳs(n − 1), otherwise (4.10)

The parameters αf and αr determine the time constants of the averaging process and are chosen so that a rising edge of the corrupted signal level is followed faster than a falling edge. As a result, the control unit reacts quickly to a sudden increase of the energy level while neglecting short speech pauses. EQ. 4.10 can also be interpreted as a low-pass filtering. The attenuation gain ay(n) applied to the microphone signal is then computed as follows:

ay(n) = c1, if ȳs(n) ≤ y0max/ω̃
ay(n) = c2 · (ȳs(n))^(q−1), if y0max/ω̃ < ȳs(n) < y0max
ay(n) = c3 / ȳs(n), if y0max ≤ ȳs(n) (4.11)

The constants are chosen such that ay(n) is a continuous function. The threshold y0max separates the possible modes, as indicated in Fig. 4.3:

1. Above this threshold, the level of the modified microphone signal (in log) ay(n) · ȳs(n) is almost constant, equal to c3, characterizing speech periods.
2. Below the threshold, there is a small expansion region whose width, log(y0max) − log(y0max/ω̃), is determined by the factor ω̃. The degree of expansion is controlled by the parameter q, see Fig. 4.3.
3. Below the expansion region, the modified microphone signal is distinctly attenuated according to c1.

Figure 4.3: Control Characteristics of the Microphone in Gain Loss Control.

The threshold y0max is adjustable and has to be chosen such that the background noise does not reach the expansion area and that the average magnitude of the speech signal lies above this threshold. Furthermore, the authors in [Heitkamper, Walker 1993] suggest computing y0max from the coupling factor between the input speech signal and the corrupted signal. They also remark that the attenuation factor of the loudspeaker path should depend on the speech intensity of the input signal.

4.3.3 The Wiener Filter Applied to Acoustic Echo Cancellation

The Wiener filter applied to echo cancellation, also called the Minimum Mean Squared Error (MMSE) filter, operates in the frequency domain (see [Vaseghi 1996]). It aims at minimizing the mean squared error J between the clean speech spectrum S(p, fk) and its estimate Ŝ(p, fk) (cf. Chap. 3):

J = E[|S(p, fk) − Ŝ(p, fk)|²] (4.12)

The estimated spectrum Ŝ(p, fk) is computed by filtering the microphone signal spectrum with the Wiener filter GWiener(p, fk), so that Ŝ(p, fk) = GWiener(p, fk) · Y(p, fk). The Wiener filter GWiener(p, fk) is obtained by minimizing the cost function J, using this expression of Ŝ(p, fk) during the minimization. The Wiener filtering technique assumes that the useful speech s(n) and the echo signal z(n) are statistically independent, leading to:

GWiener(p, fk) = SER(p, fk) / (1 + SER(p, fk)) (4.13)

where SER denotes the Signal-to-Echo Ratio, computed as the ratio between the power density spectrum of the useful speech ΦSS(p, fk) and that of the echo signal ΦZZ(p, fk): SER(p, fk) = ΦSS(p, fk)/ΦZZ(p, fk).
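A small Python sketch of this control law is given below. It is illustrative only: the threshold y0max, the expansion factor ω̃ and the exponent q are assumed values, and the constants c1, c2, c3 are derived here from the continuity requirement stated above, with c3 chosen so that loud (speech) periods pass almost unattenuated.

```python
import numpy as np

def smooth_magnitude(y, alpha_f=0.95, alpha_r=0.7):
    """Short-term average magnitude of EQ. 4.10: rising edges are tracked
    faster (alpha_r) than falling edges (alpha_f)."""
    ybar = np.zeros(len(y))
    for n in range(1, len(y)):
        a = alpha_f if abs(y[n]) <= ybar[n - 1] else alpha_r
        ybar[n] = (1 - a) * abs(y[n]) + a * ybar[n - 1]
    return ybar

def attenuation_gain(ybar_n, y0max=1000.0, w=10.0, q=3.0):
    """Piecewise attenuation characteristic of EQ. 4.11. c1..c3 are derived
    from the continuity of a_y at both thresholds; c3 = y0max pins the
    output level of loud periods to y0max (illustrative choice)."""
    lo = y0max / w                   # lower edge of the expansion region
    c3 = y0max                       # constant output level above y0max
    c2 = c3 / y0max ** q             # continuity at y0max
    c1 = c2 * lo ** (q - 1)          # continuity at y0max / w
    if ybar_n <= lo:
        return c1
    if ybar_n < y0max:
        return c2 * ybar_n ** (q - 1)
    return c3 / ybar_n
```

With these choices, low-level signals (residual echo) are strongly attenuated, the gain expands smoothly across the region (y0max/ω̃, y0max), and speech periods above the threshold are left essentially untouched.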
4.3.4 The Double Talk Problem

A practical issue of importance in AEC is the situation where the near-end speaker and the far-end speaker talk simultaneously. This situation is known as the double talk period (cf. [Benesty et al. 2000] and [Ye, Wu 1991]) and is characterized by:

s(n) ≠ 0 and x(n) ≠ 0 (4.14)

where, as above, s(n) is the near-end speech signal and x(n) is the far-end speaker signal. In practice, double talk periods disturb the identification of the echo path, possibly causing the adaptive filter to diverge. An effective way to overcome the double talk problem is to slow down or stop the adaptation when double talk is detected. Double Talk Detection (DTD) generally computes a detection statistic ξ. This statistic is then compared with a fixed threshold Tsx, independent of the input data. If double talk is detected, the filter adaptation is disabled during this period of time. The statistic is usually computed with the normalized cross-correlation method (see [Ye, Wu 1991]).

4.4 Overview of Acoustic Echo Cancellation Approaches in the Coded Parameter Domain

Acoustic echo cancellation algorithms are generally deployed as close as possible to the echo source to mitigate its effect on speech quality. But existing acoustic echo cancellers do not provide a sufficient level of cancellation. After enhancement in the terminal, the speech signal is coded and transmitted over the network. At the receiver side, the decoded speech quality depends on the network capability. Therefore, it can be interesting to implement the Voice Quality Enhancement unit (cf. [Beaugeant 1999]) directly inside the network. Such an approach enables harmonized VQE solutions: placing VQE in the network makes a centralized management of network quality possible, independently of the type of impairment.

Acoustic echo cancellation implemented in the network has been simulated in [Lu, Champagne 2003]. Based on this idea, several works have been conducted to adapt algorithms to this new concept. In [Gordy, Goubran 2006], the authors proposed estimating the residual echo power spectrum with a frequency-dependent scalar factor. The algorithm is incorporated into a psychoacoustic post-filter for residual echo suppression as developed in [Lu, Champagne 2003]. The advantage is that the algorithm is placed inside the network. One relationship to codec-domain processing is that the LPC coefficients are integrated in their model. Nevertheless, the system still uses the PCM speech samples.

In [Gordy, Goubran 2004], the authors implemented an adaptive X-LMS (X-filter band Least Mean Square) echo canceller located inside the network. Instead of a single de-correlation filter as in [Mboup et al. 1994], they use a filter bank of short de-correlation filters whose coefficients are obtained from the LPC-based encoded representation of the loudspeaker speech signal x(n). The processing block diagram is depicted in Fig. 4.4. The LPC-based decoder provides a compressed representation of the loudspeaker signal x(k) inside the network. The loudspeaker signal can be reconstructed as follows:

x(n) = Σ_{i=1}^{M} a(n, i) · x(n − i) + r(n) (4.15)

where, at time n, a(n, i) is the i-th LPC coefficient and r(n) stands for the total excitation signal given by:

r(n) = ga,X · r(n − TX) + gf,X · c(n) (4.16)

In EQ. 4.16, ga,X, TX, gf,X and c(n) are respectively the adaptive gain, the pitch delay, the fixed codebook gain and the fixed codebook vector (see Chap. 2).

Figure 4.4: Combined AEC/CELP Predictor.
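As a concrete illustration of EQ. 4.15 and EQ. 4.16, the sketch below reconstructs the total excitation and the synthesized signal for one sub-frame from decoded CELP parameters. It is a conceptual sketch only, assuming generic parameter names and toy values; it does not reproduce the bit-exact AMR-NB or G.729 reference implementations.

```python
import numpy as np

def total_excitation(g_a, T, g_f, c, r_hist):
    """EQ. 4.16 over one sub-frame: r(n) = g_a * r(n - T) + g_f * c(n).
    r_hist holds past excitation samples (the adaptive codebook memory);
    it must contain at least T samples."""
    mem = np.concatenate([r_hist, np.zeros(len(c))])
    off = len(r_hist)
    for n in range(len(c)):
        mem[off + n] = g_a * mem[off + n - T] + g_f * c[n]
    return mem[off:]

def lpc_synthesis(a, r, x_hist):
    """EQ. 4.15: x(n) = sum_{i=1..M} a(i) * x(n - i) + r(n), the all-pole
    LPC synthesis; x_hist must contain at least M past samples."""
    M = len(a)
    x = np.concatenate([x_hist, np.zeros(len(r))])
    off = len(x_hist)
    for n in range(len(r)):
        x[off + n] = r[n] + sum(a[i] * x[off + n - 1 - i] for i in range(M))
    return x[off:]

# toy usage on one 40-sample sub-frame with illustrative parameter values
rng = np.random.default_rng(1)
c = np.zeros(40); c[[5, 15, 25, 35]] = 1.0        # sparse ACELP-like codevector
r = total_excitation(g_a=0.8, T=50, g_f=2.0, c=c, r_hist=rng.standard_normal(160))
x = lpc_synthesis(a=np.array([0.9, -0.2]), r=r, x_hist=np.zeros(10))
```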
The total excitation signal r(n) can easily be computed inside the network. The LPC coefficients at the decoder are used to compute the de-correlation filter coefficients f(k). The estimated adaptive filter is denoted by h̄n and the original acoustic path by hn. The difference between the target and the estimated impulse response at time n is given by:

∇hn(j) = hn(j) − h̄n(j) (4.17)

The de-correlation filter bank is constructed using the inverse of the all-pole LPC synthesis filter at the decoder. At time n − l, the filter is computed as: f(0) = 1, f(k) = −a(n − l, k), 1 ≤ k ≤ M. The filtered error of the X-LMS is then:

ef(n, l) ≈ Σ_{j=0}^{N−1} ∇hn(j) · r(n − j) + Σ_{k=0}^{L−1} f(k) · ν(n − k) (4.18)

where ν(n) represents the uncorrelated adaptation noise. The estimated adaptive filter is updated according to:

h̄n+1(j) = h̄n(j) + µ · ef(n, l) · r(n − l) / Σ_{k=0}^{N−1} r²(n − k) (4.19)

The parameter µ is the step size of the standard LMS algorithm. Simulations using the G.729 standard show a more constant and faster convergence rate than the classical NLMS. This algorithm is easily applicable to other LPC-based coders, but it requires the decoding of the excitation signal.

The same idea was introduced in [Gnaba et al. 2003] by combining the acoustic echo canceller and the CELP structure. This approach is based on the fact that the residual echo in the case of NLMS is equal to the quantization noise of the speech codec. In [Gnaba et al. 2003], the NLMS was modified by introducing the quantization noise of the CELP-based speech codec into the estimation.

The specificity of all these algorithms is that they still need PCM speech samples; the codec parameters are incorporated inside the algorithm structure to increase performance. Their interest lies in the fact that they are implemented inside the network. Such an approach is said to be centralized. In the following, we propose algorithms dedicated to acoustic echo cancellation. The main difference between our approach and those previously mentioned is that our algorithms do not require decoding the coded speech signal into PCM samples. Our algorithms deal exclusively with the CELP codec parameters.

4.5 The Gain Loss Control in the Coded Parameter Domain

As introduced in Chap. 3, modifying the fixed codebook gain gf modifies the amplitude of the decoded signal. This principle is used in this section to modify the corrupted (microphone) fixed codebook gain gf,Y and the loudspeaker fixed codebook gain gf,X in each sub-frame. Instead of attenuating the signals themselves, the attenuation factors ay and ax are applied to gf,Y and gf,X respectively. This algorithm does not discriminate double talk periods.
Figure 4.5: Gain Loss Control in the Codec Parameter Domain.

The structure of the Gain Loss Control in the codec parameter domain is depicted in Fig. 4.5. The process takes place on each sub-frame. The bit-stream x(k) of the far-end speaker is decoded at the near-end side, resulting in the decoded loudspeaker speech x(t). The microphone signal y(t) = s(t) + (x ∗ H)(t) is the sum of the near-end speech signal s(t) and the acoustic echo z(t) = (x ∗ H)(t). The microphone signal is encoded, providing the bit-stream y(k). The control module has two functions: it estimates the energy of the loudspeaker signal and the energy of the microphone signal, and it computes the attenuation gain factors ax and ay as specified in Sec. 4.3.2. As we use the AMR-NB speech coder [3GPP 1999b], the encoder and the decoder are modified so that the adaptive codebook gains (ga,Y, ga,X) and the fixed codebook gains (gf,Y, gf,X) can be extracted, and so that the attenuation factors to be applied to the fixed codebook gains can be fed back to them. We now describe the estimation of the signal energies and the computation of the attenuation gain factors.

4.5.1 Estimation of the Energy

As the fixed codebook gain and the adaptive gain are computed for each sub-frame, the signal energy estimation is performed on a sub-frame basis. On sub-frame m, the total speech signal energy E(m) can be estimated by summing the adaptive codebook vector energy Ea(m) and the fixed codebook vector energy Ef(m), weighted by their corresponding gains ga and gf respectively. The adaptive codebook vector is computed at the encoder side as a scaled version of the total excitation of the previous sub-frame at a lag equal to the pitch period. A recursive formula can therefore be used to simplify the computation of the signal energy. The energy of the adaptive excitation at sub-frame m makes use of the total energy computed at sub-frame m − 1, leading to:

ÊX(m) = Ef,X(m) · gf,X(m) + ÊX(m − 1) · ga,X(m)
ÊY(m) = Ef,Y(m) · gf,Y(m) + ÊY(m − 1) · ga,Y(m) (4.20)

The encoder and the decoder have been transformed so that the needed parameters can be extracted from, or introduced into, the bit-stream in each sub-frame. As described in Fig. 4.5, the fixed codebook gain and the adaptive codebook gain are extracted from the encoder (microphone side: gf,Y, ga,Y) and from the decoder (loudspeaker side: gf,X, ga,X). These parameters are then used inside the control module to compute estimates of the microphone and loudspeaker energies, ÊY(m) and ÊX(m) respectively.
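The recursion of EQ. 4.20 translates directly into a few lines of code. The sketch below is a minimal illustration, assuming the per-sub-frame gains and fixed codebook vector energies have already been extracted from the bit-stream; names are illustrative, not the reference code API.

```python
import numpy as np

def subframe_energy(E_f, g_f, g_a):
    """Recursive energy estimate of EQ. 4.20, one value per sub-frame:
    E_hat(m) = E_f(m) * g_f(m) + E_hat(m - 1) * g_a(m),
    where E_f holds fixed codebook vector energies and g_f, g_a the
    decoded fixed and adaptive codebook gains."""
    E_hat = np.zeros(len(E_f))
    for m in range(len(E_f)):
        prev = E_hat[m - 1] if m > 0 else 0.0
        E_hat[m] = E_f[m] * g_f[m] + prev * g_a[m]
    return E_hat
```

The same function is applied to the loudspeaker-side parameters (yielding ÊX) and to the microphone-side parameters (yielding ÊY).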
4.5.2 Computation of the Attenuation Gain Factors

In Sec. 4.3.2, it was indicated that the low-pass filtered magnitude of a speech signal is used to compute the attenuation gains. This approach eliminates fast changes of the speech signal energy (see [Heitkamper 1995] and [Heitkamper 1997]), and small speech pauses are neglected during the determination of the attenuation factors. Applying the same recursion as in EQ. 4.10 to the considered energies, the short-term energies Êi(m) are filtered as follows:

ẼX(m) = (1 − αf) · ÊX(m) + αf · ẼX(m − 1), if ÊX(m) ≤ ẼX(m − 1)
ẼX(m) = (1 − αr) · ÊX(m) + αr · ẼX(m − 1), else (4.21)

ẼY(m) = (1 − αf) · ÊY(m) + αf · ẼY(m − 1), if ÊY(m) ≤ ẼY(m − 1)
ẼY(m) = (1 − αr) · ÊY(m) + αr · ẼY(m − 1), else (4.22)

The smoothing factors were chosen experimentally as αf = 0.95 and αr = 0.7. An abrupt increase of the speech level is thus followed quickly, whereas when the amplitude of the speech signal decreases, the energies ẼX(m) and ẼY(m) decay more slowly; short speech pauses are neglected by EQ. 4.21 and EQ. 4.22. Applying EQ. 4.21 to the loudspeaker signal and EQ. 4.22 to the microphone signal corresponds to estimating the long-term energies ẼX(m) and ẼY(m) respectively. The long-term estimate matches the shape of the short-term energy: high variations of the estimated energy are followed, and in particular the curve of the long-term estimated energy matches the maximum contour of the computed short-term energy. During speech pauses, the long-term energy floor is not equal to the short-term energy floor. This is because, during periods detected as silence, the speech samples are not exactly zero; in the AMR coder the codebook gains, and especially the adaptive ones, generally differ from zero. The adaptive behavior of the estimation can be observed during transition phases, where the estimated long-term energy moves slowly towards the minimum in silence periods. In [Duetsch 2003], several experiments were performed to support the use of expressions 4.21 and 4.22 to characterize the signal energy in the coded parameter domain. As shown in Fig. 4.6, this estimation provides a good approximation of the short-term energy of the signal.

Figure 4.6: Example of Energy Estimation in the Codec Parameter Domain (speech signal, short-term and long-term energy estimates).

These estimated long-term energies are used to compute the attenuation gains ax and ay respectively. The attenuation gains are limited to the interval [0, 1]; depending on the case, the fixed codebook gains are attenuated anywhere between completely and not at all. A metric Ẽdiff(m) is then computed to characterize the energy level. This metric is a scaled logarithm of the ratio between the loudspeaker long-term energy ẼX(m) and the microphone long-term energy ẼY(m):

Ẽdiff(m) = 10 · log10(ẼX(m) / ẼY(m)) (4.23)

Specifically, the attenuation gains are computed as follows:

ax(m) = 0, if Ẽdiff(m) < −µGLC/2
ax(m) = Ẽdiff(m)/µGLC + 0.5, if −µGLC/2 < Ẽdiff(m) < µGLC/2
ax(m) = 1, if µGLC/2 < Ẽdiff(m)

ay(m) = 1 − ax(m) (4.24)

EQ. 4.24 guarantees that at least one of the loudspeaker and microphone fixed codebook gains is attenuated. To set µGLC, silence periods and speech activity periods have to be treated distinctly. If the estimated energies of the loudspeaker and the microphone are both below a certain threshold (−50 dB), which corresponds to silence, the parameter µGLC is changed to µGLC = cst · µGLC with cst = 5. As a result, the width of the linear region Ẽdiff(m)/µGLC + 0.5 in EQ. 4.24 is increased by the factor cst, and the attenuation gains ax(m) and ay(m) stay close to 0.5, as shown in Fig. 4.7.
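The following sketch strings EQ. 4.21 to EQ. 4.24 together on a stream of sub-frame energies. It is illustrative only: the value of µGLC and the handling of the first sub-frame are assumptions, while αf, αr, the −50 dB silence threshold and cst = 5 are quoted from the text above.

```python
import numpy as np

def smooth_energy(E_hat, a_f=0.95, a_r=0.7):
    """Long-term energy of EQ. 4.21 / 4.22: fast rise, slow decay."""
    E_til = np.zeros(len(E_hat))
    E_til[0] = E_hat[0]
    for m in range(1, len(E_hat)):
        a = a_f if E_hat[m] <= E_til[m - 1] else a_r
        E_til[m] = (1 - a) * E_hat[m] + a * E_til[m - 1]
    return E_til

def attenuation_gains(E_til_X, E_til_Y, mu_glc=6.0, cst=5.0, eps=1e-12):
    """EQ. 4.23 / 4.24: attenuation factors for the loudspeaker (a_x) and
    microphone (a_y) fixed codebook gains; mu_glc is an assumed value."""
    a_x = np.empty(len(E_til_X))
    for m in range(len(E_til_X)):
        d = 10.0 * np.log10((E_til_X[m] + eps) / (E_til_Y[m] + eps))
        mu = mu_glc
        # silence: both energies below -50 dB -> widen the linear region
        if (10 * np.log10(E_til_X[m] + eps) < -50 and
                10 * np.log10(E_til_Y[m] + eps) < -50):
            mu = cst * mu_glc
        a_x[m] = np.clip(d / mu + 0.5, 0.0, 1.0)   # piecewise law of EQ. 4.24
    return a_x, 1.0 - a_x
```

The `np.clip` call implements the three branches of EQ. 4.24 in one step, and a_y(m) = 1 − a_x(m) by construction.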
We can observe in Fig. 4.7 that when the attenuation factor of the loudspeaker characteristic is zero, the attenuation factor of the microphone characteristic is equal to one, and conversely. These situations correspond to single talk periods, where either the far-end speaker or the near-end speaker is talking. The attenuation gains are entirely characterized by the relation between the microphone and loudspeaker signal energies.

Figure 4.7: Characteristics of the Attenuation Gains (loudspeaker characteristic ax(m) and microphone characteristic ay(m) versus Ẽdiff(m)).

In Fig. 4.7, above µGLC/2 and below −µGLC/2 the attenuation factors are constant. This algorithm has no double talk detection mechanism; instead, it provides a dual processing between the loudspeaker and the microphone fixed codebook gains. The critical point is 0.5, where the two fixed codebook gains are identically attenuated. If the estimated energies of the loudspeaker and microphone are such that Ẽdiff(m) lies between −µGLC/2 and µGLC/2, the attenuation is applied distinctly, as desired, to the two fixed codebook gains. The Gain Loss Control in the codec parameter domain may fail when the near-end and far-end speakers are speaking at the same time, that is, in double talk mode, since the method contains no additional system to determine double talk periods. The loudspeaker output is cut off completely only if the near-end speaker is talking, and the microphone input is interrupted only if the far-end speaker is talking.

4.5.3 Experimental Results Analysis

The results presented in Fig. 4.8 include both single talk and double talk periods. The period from 0 s to 3 s represents single talk of the far-end speaker. The period from 4 s to 6 s corresponds to near-end single talk. Finally, the period from 6.5 s to 8 s is double talk, where the near-end and far-end speakers talk simultaneously. The echo signal was simulated by filtering the far-end signal through a car impulse response. The system implemented in this work simulates the network processing, where the needed parameters (fixed codebook gain and adaptive codebook gain) are extracted from the bit-stream.

Figure 4.8: Example of AEC based on Gain Loss Control ((a) unprocessed y(n) and echo z(n); (b) unprocessed y(n) and enhanced signal).

In the echo-only region of Fig. 4.8 (from t = 0 s to t = 3 s), the Gain Loss Control brings a substantial improvement: the acoustic echo is completely cancelled. During near-end single talk, the microphone signal is not modified. Difficulties appear during double talk periods. Analysis of the estimated energies in Fig. 4.9 (from t = 6.5 s to t = 10 s) indicates that the microphone and loudspeaker signals have the same energy level. During such periods it is rather difficult to discriminate the amplitude of the loudspeaker signal from that of the microphone signal, and the attenuation gains are not well estimated since there is no significant difference between the energy levels. As shown in Fig. 4.9, double talk periods are characterized by rapid changes of the attenuation factors: the microphone input signal and the loudspeaker output signal are alternately attenuated.
Informal listening tests indicate that during double talk periods, the processed microphone signal is slightly distorted. The characteristics of the control module are shown in Fig. 4.9. During single talk of the far-end speaker, that is during echo-only periods, the estimated energy of the microphone is clearly above the estimated energy of the loudspeaker. In these periods, the attenuation factors are constant: 1 for the loudspeaker and 0 for the microphone. The result is that the loudspeaker output signal is cut off completely when the near-end speaker is talking, and the microphone input signal is cut off completely when the far-end speaker is talking. During double talk periods, both the microphone and the loudspeaker signals are attenuated. The Gain Loss Control algorithm relies on the microphone and loudspeaker energies to compute the attenuation gains, and it is particularly difficult in double talk to compare the signal energies.

Figure 4.9: Typical Example of the Evolution of the Attenuation Factors (estimated energies ẼX(m), ẼY(m) and attenuation gains ax(m), ay(m)).

4.6 Acoustic Echo Cancellation by Filtering the Fixed Gain

This section proposes a more sophisticated acoustic echo cancellation algorithm, whose contents were published in [Thepie et al. 2006]. The general principle consists in using the codec parameters only, to design a complete AEC algorithm. The Gain Loss Control proposed in the previous section has shown its limits during double talk periods. The new algorithm includes a double talk detector, leading to better echo reduction performance. Compared to classical solutions on the PCM signal, this approach reduces complexity and allows the integration of echo cancellation in the network without adding any delay or creating the so-called tandeming effect. We also introduce an innovative estimation of the Signal-to-Echo Ratio (SER) based on a linear model of the fixed codebook gain parameters. Listening test results presented at the end of the section validate the good quality achieved, even though the complexity of the proposed algorithm is low.

4.6.1 System Overview

Considering the CELP synthesis model introduced in Chap. 2, the idea is to design a filter G(m) such that, in the presence of an echo signal, the microphone fixed codebook gain gf,Y(m) is replaced by its weighted version:

ĝf,S(m) = G(m) · gf,Y(m) (4.25)

Figure 4.10: Filtering of the Microphone Fixed Codebook Gain Principle.

In our experiment, see Fig. 4.10, the encoder and decoder blocks are modified to get access to the parameters needed for the echo reduction. Such a scheme mimics what would happen in the network, where only the bit-streams, and in particular the gains, are available. At the decoder side, the input loudspeaker speech bit-stream x(k) is decoded and the fixed codebook gain gf,X(m) is extracted; the remaining parameters are kept unchanged. This loudspeaker fixed gain is used to estimate the echo signal fixed codebook gain ĝf,Z(m).
At the encoder block side, the microphone fixed codebook gain gf,Y(m) is extracted during the encoding process. The fixed gains gf,Y(m) and ĝf,Z(m) are used during the filtering process to obtain the estimated clean speech fixed codebook gain ĝf,S(m). The estimated speech gain is then mapped inside the bit-stream y(k) of the microphone speech signal. The obtained bit-stream in fact corresponds to an estimate of the clean speech bit-stream ŝ(k).

The CELP coding process is not a linear mapping. We therefore consider the microphone fixed codebook gain as a function of the near-end speech fixed codebook gain and of the echo signal fixed codebook gain:

gf,Y(m) = f(gf,S(m), gf,Z(m)) (4.26)

The joint function f(·, ·) is not clearly identified, because of the CELP encoding process. In the following, we propose an approximation of the joint function f(·, ·) and of the weighting filter G(m).

4.6.2 Approximation of the Joint Function f(·, ·) and the Filter G(m)

In this work, we approximate the joint function with a linear combination based on three parameters:

f(gf,S(m), gf,Z(m)) = a · gf,S(m) + b · gf,Z(m) + c (4.27)

The parameter c is set to 0, as the fixed gain is null if no signal is encoded. EQ. 4.27 then reduces to:

f(gf,S(m), gf,Z(m)) = a · gf,S(m) + b · gf,Z(m) (4.28)

Introducing the Signal-to-Echo Ratio SER(m) = gf,S(m)/gf,Z(m), it follows from EQ. 4.25 and EQ. 4.28 that:

G(m) = SER(m) / (1 + SER(m)) (4.29)

This expression can be interpreted as a weighted Wiener filtering of the gain, showing the similarity of our method to the filter developed in 'classical' frequency-domain noise reduction (cf. [Ephraim, Malah 1984]).
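The per-sub-frame filtering of EQ. 4.25 and EQ. 4.29 can be sketched as follows. This fragment is illustrative only: it assumes the SER estimate of each sub-frame is already available (its recursive computation is given below in EQ. 4.36) and anticipates the single-talk/double-talk weighting ς derived below in EQ. 4.34 and EQ. 4.35; names are not taken from the reference code.

```python
def wiener_gain(ser, sigma=1.0):
    """Weighted Wiener-like gain of EQ. 4.29 / 4.35:
    G = SER / (sigma + SER), with sigma = 1 during single talk and
    sigma = 4/3 during double talk (values derived below)."""
    return ser / (sigma + ser)

def filter_fixed_gain(g_f_y, ser, double_talk):
    """EQ. 4.25: replace the microphone fixed codebook gain by its
    weighted version, sub-frame by sub-frame."""
    return [wiener_gain(s, 4.0 / 3.0 if dt else 1.0) * g
            for g, s, dt in zip(g_f_y, ser, double_talk)]
```

For large SER (near-end speech dominates) the gain tends to 1 and the bit-stream is left untouched; for small SER (echo dominates) the fixed codebook gain is strongly attenuated.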
The main problem lies in the estimation of the two parameters a and b, which are needed to obtain the joint function f(·, ·) and the filter G(m) via EQ. 4.28 and EQ. 4.29 respectively. To compute a and b, the discrimination between single talk and double talk periods has to be taken into account.

The first observation is that if only the near-end speaker is talking (single talk), the echo fixed codebook gain is gf,Z = 0 and the near-end speech fixed codebook gain is gf,S ≠ 0. Similarly, if only the far-end speaker is talking (single talk), we must have gf,S = 0 and gf,Z ≠ 0. Therefore, during single talk of the near-end speaker, (a, b) = (1, 0); in the same way, during single talk of the far-end speaker, (a, b) = (0, 1).

To compute the parameters a and b during double talk periods, we encode a large database of acoustic echo scenarios and extract the fixed codebook gains of the echo signal, the microphone signal and the loudspeaker signal. We construct three vectors of size I, the total number of sub-frames per speech file:

ℑy = [gf,Y(1), . . . , gf,Y(I)], ℑs = [gf,S(1), . . . , gf,S(I)], ℑz = [gf,Z(1), . . . , gf,Z(I)] (4.30)

In this work, the parameters a and b are assumed to stay constant during the estimation and during double talk periods. The purpose is now to find the optimum a and b from the following system:

[ℑs ℑz] · [a b]ᵀ = ℑy (4.31)

The system of equations in EQ. 4.31 is over-determined; the matrix ℑ = [ℑs ℑz] is not square. We solve the system in the least mean square sense using the pseudo-inverse ℑ⁺ of the matrix ℑ, also called the Moore-Penrose generalized inverse [Golub, Loan 1996]. With ℑ⁺ = (ℑᵀ · ℑ)⁻¹ · ℑᵀ, the optimal parameters a and b are given by:

[a b]ᵀ = ℑ⁺ · ℑy (4.32)

In order to evaluate the results obtained via EQ. 4.32, we simulate acoustic echo scenarios in six different car kits, see Tab. 4.1. The parameters obtained by EQ. 4.32 are compared to the exact values by analyzing the normalized squared error:

error = Σ_{m=1}^{I} [gf,Y(m) − (a · gf,S(m) + b · gf,Z(m))]² / Σ_{m=1}^{I} gf,Y(m)² (4.33)

As the database scenarios were recorded in similar environments, the results of Tab. 4.1 suggest that the optimal values of a and b are approximately:

a ≈ 1 and b ≈ 4/3 (4.34)

This latter result is used to compute a generalized version of the weighting filter of EQ. 4.29:

G(m) = SER(m) / (ς + SER(m)) (4.35)

where ς = 1 during single talk periods and ς = 4/3 during double talk periods.

Car Kit | 'Optimal a' | 'Optimal b' | 'error (%)'
c1 | 1.3 | 1.33 | 9.30
c2 | 0.99 | 1.39 | 6.85
c3 | 0.97 | 1.36 | 3.35
c4 | 0.98 | 1.34 | 8.27
c5 | 0.96 | 1.38 | 5.75
c6 | 0.97 | 1.27 | 3.25

Table 4.1: Mean Linear Coefficients in Double Talk Mode.

To simplify the algorithm, we introduce a recursive estimation of the Signal-to-Echo Ratio (SER), as in [Taddei et al. 2004]:

SER(m) = βe · (G(m − 1) · gf,Y(m − 1)) / ĝf,Z(m) + (1 − βe) · gf,Y(m) / ĝf,Z(m) (4.36)

where the smoothing factor is set to βe = 0.9, according to preliminary experiments. The interest of EQ. 4.36 is that it needs no estimate of the near-end fixed codebook gain gf,S, similarly to the Signal-to-Noise Ratio computed in [Ephraim, Malah 1984]; only the estimate of the echo signal fixed codebook gain ĝf,Z is required. The acoustic echo cancellation problem thus reduces to the single talk/double talk discrimination and the estimation of the echo signal fixed codebook gain ĝf,Z.

4.6.3 Estimation of the Echo Signal Fixed Codebook Gain ĝf,Z

The fixed codebook gain of the echo signal is estimated by mimicking the approach followed in [Heitkamper 1997]. The estimated echo fixed codebook gain is defined as a shifted and attenuated version of the far-end speech fixed codebook gain gf,X:

ĝf,Z(m) = κopt(m) · gf,X(m − τopt(m)) (4.37)

The parameter τopt(m) represents the number of shifted sub-frames and κopt(m) is the attenuation parameter. They are computed in two steps: the first step is the echo mode detection (echo presence and double talk periods) and the second step is the effective estimation of the parameters.

Echo Mode Detection: The echo mode detection is performed in two stages: first we detect whether an echo is present, second whether there is a double talk period. The echo mode detection starts with the estimation of the signal energy in the codec parameter domain, where a smoothed version of the fixed codebook gain is taken as the signal energy. The smoothed microphone and loudspeaker fixed codebook gains, ĝsmooth,Y(m) and ĝsmooth,X(m) respectively, are computed as follows:

ĝsmooth,Y(m) = γe · ĝsmooth,Y(m − 1) + (1 − γe) · gf,Y(m)
ĝsmooth,X(m) = γe · ĝsmooth,X(m − 1) + (1 − γe) · gf,X(m) (4.38)

where typically γe = 0.9. We decide that a possible echo mode is detected if:

ĝsmooth,X(m) > max(Tsilence, ĝsmooth,Y(m)) (4.39)

where the threshold parameter is experimentally set to Tsilence = 10. This relation verifies whether the far-end speaker is talking; it is always satisfied in the presence of echo periods and/or double talk periods, and it is further used to verify whether the coupling between the loudspeaker and the microphone is effective.
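A compact sketch of this first detection stage, combining EQ. 4.38 and EQ. 4.39, is given below. The only numerical values taken from the text are γe = 0.9 and Tsilence = 10; everything else (names, array layout) is illustrative.

```python
def echo_mode_detection(g_f_x, g_f_y, gamma_e=0.9, t_silence=10.0):
    """First stage of the echo mode detector: smooth the fixed codebook
    gains (EQ. 4.38) and flag the sub-frames where EQ. 4.39 holds."""
    gx = gy = 0.0
    flags = []
    for m in range(len(g_f_x)):
        gx = gamma_e * gx + (1 - gamma_e) * g_f_x[m]
        gy = gamma_e * gy + (1 - gamma_e) * g_f_y[m]
        flags.append(gx > max(t_silence, gy))   # possible echo mode
    return flags
```

These flags alone cannot separate echo from double talk; the cross-correlation test described next provides the second stage of the decision.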
The echo period is finally detected by analyzing the normalized cross-correlation function between the loudspeaker (far-end) fixed codebook gain gf,X and the microphone fixed codebook gain gf,Y. Writing φ(i) for this normalized cross-correlation in the coded parameter domain:

φ(i) = [Σ_{j=0}^{Ncc−i−1} gf,X(j + i) · gf,Y(j)] / sqrt(Σ_{j=0}^{Ncc−1} |gf,X(j)|² · Σ_{j=0}^{Ncc−1} |gf,Y(j)|²), if i ≥ 0
φ(i) = φ(−i), else (4.40)

The term Ncc represents the length of the normalized cross-correlation analysis. The maximum cmax(m) of the normalized cross-correlation function and its corresponding lag lmax(m) are obtained as:

cmax(m) = max_i φ(i) and lmax(m) = arg max_i φ(i) (4.41)

We decide that if cmax(m) is above ta = 0.75 (enough correlation) and EQ. 4.39 is satisfied, echo is detected, and the current parameters τopt(m) and κopt(m) are updated as described in the next sections.

Determination of τopt(m) and κopt(m): The determination of τopt(m) and κopt(m) is performed similarly to [Heitkamper 1997]. In contrast to [Heitkamper 1997], the fixed codebook gains are used here as a sufficiently good representation of the energy. In general, the lag of the maximum cross-correlation characterizes the amount of correlation between the analyzed signals. In the same sense, the maximum normalized cross-correlation lag lmax(m) is used to determine the optimum sub-frame shift τopt(m). To prevent rapid variations of the fixed gain during echo periods, a short-term lag lst(m) is first computed on the basis of lmax(m):

lst(m) = α̂(m) · lst(m − 1) + (1 − α̂(m)) · lmax(m) (4.42)

The smoothing factor α̂(m) is computed adaptively as a function of the correlation coefficient cmax(m) and the echo detection threshold ta:

α̂(m) = αe − δe · (cmax(m) − ta)/(1 − ta), if cmax(m) > ta
α̂(m) = αe, else (4.43)

The smoothing parameters are set to αe = 0.96 and δe = 0.25. The interest of such an approach is that the short-term lag lst(m) is adapted faster or slower depending on the correlation between the fixed codebook gains of the microphone and loudspeaker signals. The update of the short-term lag therefore needs to be controlled differently during echo and non-echo periods, so as to avoid critical variations between two consecutive updates. To achieve this control, an intermediate short-term lag (also called long-term lag) l⊕st(m) is computed, during echo periods only, as follows:

l⊕st(m) = µe · l⊕st(m − 1) + (1 − µe) · lst(m) (4.44)

where µe = 0.95 is a smoothing factor. This value is kept in memory and used to update the short-term lag during non-echo periods. During non-echo periods, the intermediate short-term lag l⊕st(m) is used as a convergence point of the short-term lag according to:

lst(m) = αe · lst(m − 1) + (1 − αe) · l⊕st(m) (4.45)

The interpretation is that if a non-echo period is really short, lst(m) keeps a value close to the last computed value; in that case, for the next echo period, we consider that the echo path has not changed drastically and we reuse approximately the previous short-term lag. If the non-echo period is long, lst(m) converges to the average short-term lag l⊕st(m).
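The lag-tracking recursions of EQ. 4.42 to EQ. 4.45 (and the rounding of EQ. 4.46 just below) can be summarized in a few lines. The sketch below is illustrative; it assumes cmax(m) and lmax(m) have been obtained from EQ. 4.41 and uses the constants quoted in the text (ta = 0.75, αe = 0.96, δe = 0.25, µe = 0.95).

```python
class LagTracker:
    """Tracks the optimum sub-frame shift tau_opt (EQ. 4.42 - 4.46)."""

    def __init__(self, t_a=0.75, alpha_e=0.96, delta_e=0.25, mu_e=0.95):
        self.t_a, self.alpha_e, self.delta_e, self.mu_e = t_a, alpha_e, delta_e, mu_e
        self.l_st = 0.0      # short-term lag
        self.l_lt = 0.0      # intermediate (long-term) lag

    def update(self, c_max, l_max, echo_detected):
        if echo_detected:
            # adaptive smoothing of EQ. 4.43: higher correlation -> faster update
            if c_max > self.t_a:
                a = self.alpha_e - self.delta_e * (c_max - self.t_a) / (1 - self.t_a)
            else:
                a = self.alpha_e
            self.l_st = a * self.l_st + (1 - a) * l_max                     # EQ. 4.42
            self.l_lt = self.mu_e * self.l_lt + (1 - self.mu_e) * self.l_st # EQ. 4.44
        else:
            # non-echo: let the short-term lag converge to the long-term lag
            self.l_st = self.alpha_e * self.l_st + (1 - self.alpha_e) * self.l_lt  # EQ. 4.45
        return round(self.l_st)                                             # EQ. 4.46
```

The attenuation factor κopt(m) described next is tracked with the same short-term/long-term pair of recursions.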
We assume during our experiments that the echo path does not change considerably, so that consecutive values can be used to estimate the next short-term lag when the next echo period starts. Finally, the optimum sub-frame shift during echo and non-echo periods is obtained by:

τopt(m) = round(lst(m)) (4.46)

During echo-only periods, the attenuation factor κopt(m) is simply the ratio between the microphone fixed codebook gain and the shifted loudspeaker fixed codebook gain:

κopt(m) = gf,Y(m) / gf,X(m − τopt(m)) (4.47)

The fixed codebook gain of the echo-only signal should be smaller than the fixed codebook gain of the loudspeaker. If the ratio in EQ. 4.47 is larger than one, a double talk period is detected; the current parameters are then set to their previous values. The short-term attenuation factor is updated with its long-term value in a similar way as for the computation of τopt(m) during non-echo periods. The short-term attenuation factor during echo periods is given by:

κst(m) = α̂(m) · κst(m − 1) + (1 − α̂(m)) · gf,Y(m) / gf,X(m − τopt(m)) (4.48)

and the long-term attenuation is computed according to:

κ⊕st(m) = µe · κ⊕st(m − 1) + (1 − µe) · κst(m) (4.49)

During non-echo periods, the short-term attenuation factor converges towards its long-term value and is kept in memory as follows:

κst(m) = αe · κst(m − 1) + (1 − αe) · κ⊕st(m) (4.50)

The attenuation factor in echo periods is finally set from EQ. 4.48 through:

κopt(m) = κst(m) (4.51)

To complete the process, the values of the parameters τopt(m) and κopt(m) are used to compute the estimate of the echo signal fixed codebook gain according to EQ. 4.37.

4.6.4 Experimental Results

To assess the performance of this acoustic echo cancellation algorithm (filtering of the microphone fixed codebook gain), an Absolute Category Rating (ACR) listening test (see [ITU-T 2006c] and [Thepie et al. 2006]) was performed. Ten naïve and expert listeners participated. They scored using a Mean Opinion Score (MOS) [ITU-T 2006e], ranging from 1 (unacceptable) to 5 (excellent). Each scenario was defined as a conversation between a near-end speaker and a far-end speaker. The files used in the test contain single talk and double talk periods and were classified into 4 groups of 15 files each. Group A is composed of clean speech files with no echo, group B of files with unprocessed echo, group C of files processed with our echo reduction, and group D of files enhanced using a standard NLMS method [Beaugeant et al. 2006]. The acoustic echo was simulated with three car impulse responses: h1, h2 and h3. Mean scores and standard deviations of the MOS obtained in the listening test are displayed in Tab. 4.2 below.

Considering the mean scores over all scenarios, the results show that our relatively simple solution yields results very close to the intensively studied and more costly NLMS method. The obtained average MOS (3.5) indicates an absolute quality between fair and good. This result is far better than the assessment of the unprocessed scenarios, for which the MOS is 1.9. The result of this test depends strongly on the kind of impulse response used and on the Signal-to-Echo Ratio (SER).
The measurements of the SER during double talk periods were as follows: 11 dB for h1, 15 dB for h2, and 10 dB for h3. We can observe that the codec parameter domain method was not as good as the standard NLMS for low SER, whereas it was rated as good or even better for an SER of 15 dB (cf. Tab. 4.2).

Groups | h1 | h2 | h3 | Total
GrpA | 4.66/0.36 | 4.58/0.62 | 4.56/0.57 | 4.60/0.69
GrpB | 1.6/0.93 | 1.64/0.72 | 1.84/0.88 | 1.69/0.85
GrpC | 2.68/0.87 | 3.1/0.70 | 3.68/0.76 | 3.15/0.88
GrpD | 4.10/0.78 | 2.82/0.80 | 3.72/0.82 | 3.56/0.99

Table 4.2: Mean and Standard Deviation of the Opinion Scores (mean/standard deviation).

An example of processing based on the filtering of the microphone fixed codebook gain is presented in Fig. 4.11. We can observe that during single talk periods of the far-end speaker, the echo signal is not completely cancelled: a slight amount of residual echo remains after processing, as with the standard NLMS. This residual echo is not particularly annoying with our method. During single talk periods of the near-end speaker, the proposed algorithm completely restores the clean speech envelope. With this approach based on the filtering of the fixed codebook gain, the double talk periods are significantly improved: the enhanced microphone signal in Fig. 4.11 shows that the signal envelope is recovered as well as with the NLMS method. During isolated echo-only periods (echo presence between two consecutive near-end speech activities), the proposed algorithm significantly reduces the echo effects. This capability can be observed between t = 7 s and t = 8 s.

Figure 4.11: Example of AEC by Filtering the Fixed Gain (unprocessed y(n) with the echo z(n), with the proposed processing, and with the NLMS processing).

4.7 Conclusion

This chapter has provided an overview of techniques used to model the acoustic echo path, which is generally modeled as a multi-tap time-varying filter. In Sec. 4.3, we described current algorithms that model the acoustic echo path. These techniques are implemented in the time domain or in the frequency domain, and their core issue is to adaptively compute an approximation of the acoustic echo path. Whether in the time or the frequency domain, these algorithms rely on the PCM speech samples of the far-end and near-end speakers. Moreover, during double talk periods, the classical AEC algorithms do not achieve a good estimation of the acoustic echo path. Finally, these algorithms require a substantial computational load.

We have also seen in this chapter that during double talk, existing acoustic echo cancellers do not adapt the acoustic echo path estimate. Several techniques exist to detect double talk periods; detection algorithms based on cross-correlation are the most used in practice.

To overcome these shortcomings, this chapter has proposed two new algorithms to reduce acoustic echo. Their common feature is that they do not need to decode the speech signal into PCM format: only a partial decoding is done to extract the relevant parameters, namely the fixed codebook gain and the adaptive codebook gain. More specifically, these two algorithms are:

1. The Gain Loss Control in the Codec Parameter Domain.
This algorithm provides two attenuation factors used to control the fixed codebook gains of the microphone signal and the loudspeaker signal. In single talk periods, the algorithm performs well. It can also be integrated in a system as a post-processing stage of another acoustic echo cancellation algorithm, where it eliminates the residual echo that remains after the main processing. In double talk periods, the Gain Loss Control has the drawback that both the loudspeaker and the microphone are attenuated.

2. The filtering of the fixed codebook gain of the microphone signal, which appears as a more efficient algorithm. The codec parameters are used to build a double talk detector, which is important for the performance of the AEC. The performance of this algorithm depends on the kind of impulse response and on the Signal-to-Echo Ratio (SER); an accurate estimation of the SER improves the result. This second algorithm precisely aims at a good improvement during double talk periods.

Experimental results show that our algorithms achieve results similar to those attained by the standard NLMS, while their complexity is lower than that of the NLMS.

After proposing speech enhancement algorithms (Noise Reduction in Chap. 3 and Acoustic Echo Cancellation in this chapter), the purpose is now to deploy these algorithms together with smart transcoding strategies. Transcoding and Smart Transcoding were already introduced in Chap. 1. In recent years, investigations on Smart Transcoding have provided interesting results: in a Smart Transcoding procedure, the coded parameters are used to reduce the complexity of the target encoder. The objective of the next chapter is therefore to propose an integration of our coded-domain Voice Quality Enhancement algorithms inside Smart Transcoding schemes.

Chapter 5

Voice Quality Enhancement and Smart Transcoding

5.1 Introduction

As introduced in Chap. 1, the telecommunication world is characterized by the successive development of new networks. In general, a particular international or regional speech/audio standard is dedicated to each network: UMTS works with the Adaptive Multi Rate (AMR) codec [3GPP 1999b], the PSTN with G.711, G.726 or G.728, and services and applications over the Internet with G.723.1 or G.729. The networks with their associated speech codecs are usually not interoperable with each other [ITU-T 2004]: due to their operational characteristics, the coders transmit parameters that have different formats. As these networks may have to interoperate, a bit-stream conversion is necessary to achieve interconnection. In such a case, the network systems must establish the speech communication between the sending network and the receiving network. The usual way to handle this issue is to perform transcoding [3GPP 1999b]. Classical transcoding involves a full decoding [Kang et al. 2003]: it decodes one codec bit-stream and re-encodes it into the target codec bit-stream format. This is achieved by placing the decoder/encoder of one end point and the encoder/decoder of the other end point in a so-called gateway between the networks. Such a classical solution is far from optimal; it generally implies three major problems: an increase of the computational load, a decrease of speech quality and intelligibility, and an increase of the algorithmic delay.

An alternative approach to classical transcoding is to exploit the similarity of the speech coders
[Kang et al. 2003]. This method, also called smart transcoding, achieves interesting improvements over the classical transcoding approach by reducing the computational load and the algorithmic delay. If there is no external impairment (background noise and/or acoustic echo), smart transcoding in addition provides better speech quality than standard transcoding (see [Beaugeant, Taddei 2007] - [Ghenania 2005]).

External impairments such as background noise and/or acoustic echo generally degrade the speech quality and intelligibility of a mobile communication. The common way to reduce their impact is to implement Acoustic Echo Cancellation (AEC) and/or Noise Reduction (NR) algorithms or units in the mobile device [Eriksson 2006], [Beaugeant 1999]. This solution has not achieved optimal results over many years [Cotanis 2003]. Therefore, AEC and NR units are now progressively being implemented in the network [Eriksson 2006]. There are several reasons and advantages for such an integration of Voice Quality Enhancement (VQE) algorithms in the network [Enzner et al. 2005]. First of all, implementing VQE in the network answers the desire for a central network quality control. Indeed, network providers have a high diversity of devices in their network, leading to various levels of speech quality, while the quality of mobile devices has not particularly improved over the last decade. Second, new challenges have appeared, such as device miniaturization, where speech quality has not been the main focus: mobile terminals concentrate more on multimedia applications, whereas DECT phones concentrate more on price reduction. As a consequence, even if, from a technical viewpoint, voice enhancement a priori yields good results when implemented within the terminal, concrete industrial constraints mean that VQE placed in the network can in practice be as efficient in terms of speech quality as a solution built into the terminal. Furthermore, the low-complexity constraint in the terminal restricts the choice of algorithms and thus limits the system performance.

Moving VQE from the terminal to the network is a general trend, and many network providers have already deployed AEC and NR units, such as the Tellabs 5500 (ref). The existing systems are based on the analysis and processing of the decoded signal. This solution generally introduces delay on the processed signal, as well as complexity due to decoding and re-encoding into the same format. It also decreases the speech quality because of the codec tandeming (decoding-encoding) effect. Especially for AEC, an important fact is that the performance of an algorithm implemented inside the network can be severely disturbed by the non-linearity and the unpredictability of the effective acoustic echo path, see [Enzner et al. 2005], [Rages, Ho 2002], [Huang, Goubran 2000] and [Fermo et al. 2000]. All these drawbacks lead to the idea that VQE could be performed directly on the available bit-stream [Chandran, Marchok 2000], notably by integrating the algorithms described in Chap. 3 (see [Taddei et al. 2004], [Thepie et al. 2008]) and Chap. 4 ([Thepie et al. 2006]). Modifying the parameters composing the bit-stream avoids the complete decoding/encoding process necessary in classical solutions. This new approach reduces the computational load, and the quality loss due to the tandeming effect is circumscribed.
The purpose of this chapter is to embed the VQE algorithms described in Chap. 3 and 4 inside a smart transcoding structure. It also discusses the performance obtained by integrating the AEC and NR units into a smart transcoding algorithm. The organization of this chapter is as follows. Sec. 5.2 explains the problems due to network interoperability and VQE, and also presents former solutions. In Sec. 5.3, the smart transcoding principles and solutions are described. The integration of the VQE based on codec parameters in the smart transcoding module is discussed in Sec. 5.4 and 5.5, where an optimal architecture dedicated to the different AMR-NB coder modes is proposed. Experimental results are discussed in Sec. 5.6, and the chapter ends with a conclusion.

5.2 Network Interoperability Problems and Voice Quality Enhancement

To ensure mobility, continuity or interoperability, networks need to interoperate [Yoon et al. 2003]. Interoperation is usually achieved via transcoding. A typical diagram of wireless network interoperability can be seen in Fig. 5.1. Each GSM network involves three main components: the Mobile Device (MD), the Base Station Sub-system (BSS) and the Network Sub-system (NSS), generally called the core network. The GSM network is also connected to an Operation and Support System (OSS) for maintenance and control (see Annex A). The gateway is the point where a given network is interconnected with another network: network A to network B, network A to a PLMN, network A to the PSTN, etc. The algorithms discussed in this chapter address solutions to high background noise and/or acoustic echo problems.

5.2.1 Classical Speech Transcoding Scenarios

Transcoding is a key element for network interconnection. Transcoding is the process of transforming the representation format of codec A into a target format B. As presented in Fig. 5.2, if codecs A and B are different, the bit-stream of encoder A of the near-end speaker is first decoded through decoder A. The obtained decoded speech signal sA(n) is then encoded in the target format B by encoder B, giving bit-stream B. Bit-stream B is transmitted to decoder B at the far-end speaker side, where s′(n) is decoded via decoder B.

This transcoding solution implies three particular problems. First, the quality of the decoded speech signal at decoder B is degraded: encoder B uses the decoded signal from decoder A and re-encodes it, rather than using the original speech signal. The speech quality degradation is due to the succession of encoding/decoding operations, which accumulates quantization errors, twice in this example. Second, the computational load is increased, since two or more coders run simultaneously; this becomes all the more critical as a large number of users can be interconnected. The third impairment is the increase of the algorithmic delay.
To obtain the target bit-stream B, an additional look-ahead delay for the LPC analysis is necessary during the transcoding process, which can introduce delay into the communication. It is well known that a long transmission delay in the communication link can be particularly annoying for the end users.

Figure 5.1: Generic GSM Interconnection Architecture.

Figure 5.2: Transcoding, Classical Approach.

5.2.2 Classical Speech Transcoding and Voice Quality Enhancement

Classical VQE solutions such as NR and/or AEC implemented inside the network can lead to transcoding effects. In fact, with the classical approach, and in the presence of perturbations (noise and/or acoustic echo), the enhancement is entirely performed on speech samples in PCM format. Current algorithms implemented in the network follow the principle described in Fig. 5.3 below. Consider a communication scenario established between two wireless networks A and B. The corrupted speech signal is encoded by encoder A at the mobile device of the near-end speaker, and the corrupted bit-stream A is transmitted. Inside the network, the corrupted bit-stream A is decoded using decoder A and the decoded signal is sent to the VQE unit for speech enhancement. The output of the VQE unit is re-encoded by encoder B. At the far-end side, the VQE-processed bit-stream B is decoded and the enhanced version of the speech signal is obtained.

Figure 5.3: Network VQE, Classical Solution.

The VQE unit shown in Fig. 5.3 only deals with PCM speech samples. The algorithms deployed in such systems are those described in the state of the art of NR and AEC in Chap. 3 and 4 respectively. The network solution presented in Fig. 5.3 has one major disadvantage: this architecture cumulates the problems of the classical transcoding solution and the problems of VQE algorithms in PCM format, such as delay, computational load and quality degradation.

5.3 Alternative Approach: the Speech Smart Transcoding

The goal of interoperability is to translate a bit-stream of coder A into the format of coder B. Many research activities are being conducted on so-called smart transcoding techniques. If coders A and B are similar (CELP coders, for example, as in this thesis), they make use of a similar set of parameters. The transmitted CELP parameters are: the LPC coefficients, the pitch delay, the fixed codebook vector, and the fixed and adaptive codebook gains.
In this thesis we restrict our attention to GSM networks and AMR-NB speech coders.

Figure 5.4: Smart Transcoding Principle.

5.3.1 The Speech Smart Transcoding Principle and Strategies

The key idea in smart transcoding consists in avoiding the re-computation, at the target encoder B, of some parameters or groups of parameters. The difference between the parameters at encoder A and those at encoder B is mostly related to the way these parameters are computed [Ghenania, Lamblin 2004]. Typically, the estimation method, the resolution of the estimation and the quantization technique differentiate the parameters of codec A from those of codec B when both use the CELP technique. Based on these observations, up to three smart transcoding strategies can be applied to the relevant parameters.

– First, smart transcoding performed at the binary stage: the coded parameter is directly transferred from bit-stream A to bit-stream B; of course, a mapping operation is needed. This approach is suitable if the parameter at encoders A and B differs only by the quantization technique. It has already been experimented on the pitch delay in [Tsai, Yang 2001], [Yasuji et al. 2002] and [Kang et al. 2000]. In [Yasuji et al. 2002] and [Kang et al. 2000], smart transcoding was applied to the fixed codebook index by restricting the search of the fixed codebook vector at encoder B.

– Second, in case the parameter computed at encoder A does not correspond to that of encoder B, a partial decoding of the relevant parameter is required. Partial decoding at decoder A here means a decoding process that does not require the PCM signal reconstruction. As an example, for the LPC coefficients, the decoding process can be carried out up to the LSF representation; the LSF parameters from decoder A are then mapped into encoder B [Ghenania, Lamblin 2004].

– The last smart transcoding scheme addresses the case where the formats of the parameters at encoders A and B are different. In this type of smart transcoding, also called the parameter approach, a tandem is necessary and the speech signal is fully decoded at decoder A. What is specific to this approach is that, during the encoding at encoder B, the functions assigned to compute the relevant parameters are either not performed or only partially performed: the parameters decoded at decoder A are directly used as input of the quantization units of encoder B. This strategy was experimented in [Kim et al. 2001] and [Yoon et al. 2001] by mapping the fixed codebook index and gain.

The mapping of the relevant parameters can also be followed by the suppression of redundant pre-processing and post-processing functions. As an example, the high-pass filtering performed in the AMR encoder can be suppressed, which significantly reduces the computational load, cf. [Beaugeant, Taddei 2007].

The parameter-based smart transcoding approach is retained in this thesis. In a noisy environment and/or in the presence of acoustic echo, the parameters decoded at decoder A are corrupted and need to be enhanced. As described in Fig. 5.4, the third smart transcoding approach is performed.
At decoder A, the relevant parameters are extracted. At encoder B, the encoding process is not fully performed: some of the functions assigned to the computation of the relevant parameters are skipped. In general, a parameter extracted from decoder A is first modified so that it matches the characteristics of the corresponding parameter at encoder B. Smart transcoding is thus achieved by intelligently mapping the parameters available from bit-stream A into those of the bit-stream of encoder B. During our simulations, the C floating-point platform of the AMR-NB is used [3GPP 1999b]. This work addresses transcoding between different AMR-NB speech codec modes, especially between the 12.2 kbps mode and the 7.4 kbps mode and vice versa.

Before investigating transcoding in the presence of background noise and acoustic echo, it is useful to study the impact of mapping a parameter or group of parameters on speech quality and intelligibility. This study helps us design the mapping functions used in our smart transcoding. In this part of the work, experiments were conducted using clean speech signals, with the AMR-NB 7.4 kbps mode as coder A and the 12.2 kbps mode as coder B, and vice versa. Similarly to the smart transcoding scheme of [Beaugeant, Taddei 2007], and based on several experiments, we focus on three codec parameters in this work: the LPC coefficients, the fixed codebook gain and the adaptive codebook gain. Note that a parameter extracted at the decoder A side is the quantized version of the parameter computed at encoder A. Therefore, the parameters used in the smart transcoding algorithm are the quantized versions, denoted by $\tilde{A}_S$, $\tilde{g}_{f,S}$ and $\tilde{g}_{a,S}$, of the LPC coefficients, the fixed codebook gain and the adaptive codebook gain respectively.

5.3.2 Mapping Strategy of the LPC Coefficients

As described in Chap. 2, cf. [3GPP 1999b], on the encoder side the LPC coefficients are computed twice per frame (every 10 ms) in 12.2 kbps mode (two sets) and only once per frame (every 20 ms) in the other modes (one set). They are then converted into the Line Spectral Frequency (LSF) representation. The LSF are interpolated to provide four sets, one per sub-frame, and these interpolated LSF are converted back to LPC, giving four sets of quantized LPC coefficients, one per sub-frame. The LPC analysis windows of the 12.2 kbps mode allow computation of LPC coefficients localized at sub-frames 2 and 4; no look-ahead is required. In the other modes, the analysis window is concentrated on sub-frame 4, leading to the computation of only one set of LPC coefficients; these modes use a 5 ms look-ahead.

The LPC coefficients are extracted from decoder A and a mapping procedure is applied before introducing them into encoder B. These LPC coefficients are then used directly; no further computation is needed. At decoder A or decoder B, four sets of quantized LPC coefficients are decoded in each frame: $\tilde{A}^{k}_{S,\mathrm{Decoder}}$, where the index k represents the sub-frame number, k = 1, 2, 3, 4.
Depending on the transcoding direction, the mapping inside encoder B of the LPC coefficients extracted from decoder A follows the equations below. When transcoding from 12.2 kbps mode to 7.4 kbps mode, only one set of LPC coefficients is computed per frame at the 7.4 kbps encoder, leading to the following mapping at frame p:

$(A_S)_{\mathrm{Encoder}}(p) = \tilde{A}^{4}_{S,\mathrm{Decoder}}(p)$    (5.1)

When transcoding from 7.4 kbps mode to 12.2 kbps mode, two sets of LPC coefficients are computed per frame at the 12.2 kbps encoder. The mapping is given by:

$(A_S)^{k}_{\mathrm{Encoder}}(p) = \tilde{A}^{2k}_{S,\mathrm{Decoder}}(p), \quad k = 1, 2$    (5.2)

In Fig. 5.5 and 5.6 below, the spectra of the synthesis filters associated with the LPC coefficients are represented. The blue curve is the original spectrum of the unquantized synthesis filter, the black curve is obtained after smart transcoding, and the red curve corresponds to standard transcoding. On a frame basis, the spectrum obtained by smart transcoding is the best approximation of the original spectrum. The spectrum obtained from classical transcoding is slightly amplified at low frequencies. Fig. 5.6 clearly shows how the spectrum obtained by classical transcoding is significantly distorted: there is a formant loss at high frequency with the standard transcoding approach. Formant degradation can induce speech distortion and quality degradation. The spectrum of the synthesis filter associated with the LPC coefficients obtained from smart transcoding remains very close to that of the original speech signal, and its formants are preserved.

Figure 5.5: Transcoding Example from 7.4 kbps mode to 12.2 kbps mode: Spectrum of the associated synthesis filters (smart transcoding, standard transcoding and reference spectra, PSD in dB versus frequency).

Figure 5.6: Transcoding Example from 12.2 kbps mode to 7.4 kbps mode: Spectrum of the associated synthesis filters.

5.3.3 Mapping Strategy of the Fixed and Adaptive Codebook Gains

For each frame, four adaptive gains are decoded in 12.2 kbps mode or 7.4 kbps mode. The adaptive gain from decoder A is simply mapped inside encoder B as follows:

$(g_{a,S})_{\mathrm{Encoder}}(m) = (\tilde{g}_{a,S})_{\mathrm{Decoder}}(m)$    (5.3)

where m is the sub-frame index.

Figure 5.7: Adaptive Gains in Transcoding, typical example during transcoding from 7.4 kbps to 12.2 kbps mode.

From the mapping of the adaptive codebook gain presented in Fig. 5.7, we can notice that the adaptive gains from the smart transcoding are less disturbed than those of the classical transcoding.
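In code, Eqs. (5.1) to (5.3) reduce to simple index selection on the four decoded sets; a minimal sketch with 0-based list indices follows (function names are illustrative):

```python
def map_lpc_sets(decoded_lpc_sets, target_mode):
    """Map the four decoded LPC sets (sub-frames k = 1..4) into encoder B.

    12.2 -> 7.4 : encoder B needs one set, taken from sub-frame 4, Eq. (5.1).
    7.4 -> 12.2 : encoder B needs two sets, taken from sub-frames 2 and 4, Eq. (5.2).
    """
    if target_mode == "7.4":
        return [decoded_lpc_sets[3]]                       # k = 4
    if target_mode == "12.2":
        return [decoded_lpc_sets[1], decoded_lpc_sets[3]]  # k = 2 and k = 4
    raise ValueError("mapping studied here only for the 7.4 and 12.2 kbps modes")

def map_adaptive_gain(decoded_adaptive_gains):
    """Eq. (5.3): the adaptive gain is passed through unchanged, sub-frame by sub-frame."""
    return list(decoded_adaptive_gains)
```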
It is clearly beneficial to perform smart transcoding on this parameter instead of transmitting the adaptive gains produced by classical transcoding. The adaptive gains obtained from classical transcoding (in red) fluctuate noticeably but always stay around the original adaptive gains (in black). The estimation of the pitch delay, used together with the adaptive gain to compute the adaptive excitation in CELP coders, is particularly robust: in practice there is no significant difference between the pitch from smart transcoding and that from standard transcoding. The interest of the gain obtained from smart transcoding is that it improves the adaptive excitation.

The fixed codebook gain plays a major role in the decoded speech amplitude, see Fig. 5.8: the curve of the fixed codebook gains matches the shape of the speech signal amplitude. We have noticed experimentally that, at decoder B, the fixed codebook gain is delayed compared to the original fixed codebook gain, and this delay impacts the speech quality. Based on several experiments ([ITU-T 2001]) and informal listening tests, the optimal mapping of the fixed codebook gain is:

$(g_{f,S})_{\mathrm{Encoder}}(m) = (\tilde{g}_{f,S})_{\mathrm{Decoder}}(m-1)$    (5.4)

where m is the sub-frame index. This mapping accounts for the delay of one sub-frame introduced by the 'encoding-decoding-encoding-decoding' chain.

In Fig. 5.8 and Fig. 5.9, one can see that, for a transcoding from 7.4 kbps mode to 12.2 kbps mode, the fixed codebook gains are strongly attenuated. This results from the fact that the transcoding is performed from a lower quality mode (7.4 kbps) to a much higher quality mode (12.2 kbps). The reduction of the fixed gain level in classical transcoding is more noticeable during speech periods: in Fig. 5.8, from 1.4 s to 1.8 s, the fixed gain attenuation in classical transcoding is particularly high. The transcoding from 12.2 kbps mode to 7.4 kbps mode behaves rather differently, see Fig. 5.10: the fixed codebook gain from classical transcoding is not attenuated, and no amplitude attenuation comparable to the 7.4 to 12.2 kbps case is found. The classical transcoding gains tend to be higher than those of the smart transcoding in the presence of high speech energy.

The smart transcoding of the fixed and adaptive codebook gains does not noticeably impact the computational load. Simulations with informal listening tests show that the mapping of the fixed and adaptive gains significantly improves the decoded speech quality when transcoding from 7.4 kbps mode to 12.2 kbps mode. Another remark is that the smart transcoding of the gains has an impact only if encoder B operates in 12.2 kbps mode. This is due to the quantization technique used: the quantization in the other modes does not allow the gains to be replaced directly. The fixed and adaptive gains are quantized differently according to the codec mode.

Figure 5.8: Typical Example of Decoded Fixed Codebook Gains during transcoding from 7.4 kbps mode to 12.2 kbps mode.
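Eq. (5.4) amounts to a one-sub-frame delay line on the decoded fixed gains. A minimal sketch follows; the initialisation of the very first sub-frame is an assumption of this sketch, as the text does not specify it.

```python
def map_fixed_gain(decoded_fixed_gains, previous_last_gain=0.0):
    """Eq. (5.4): the fixed gain fed to encoder B at sub-frame m is the decoded
    gain of sub-frame m-1.

    previous_last_gain -- decoded fixed gain of the last sub-frame of the
                          previous frame (state carried across frames).
    """
    mapped = [previous_last_gain] + list(decoded_fixed_gains[:-1])
    return mapped, decoded_fixed_gains[-1]  # gains for this frame, state for the next
```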
Figure 5.9: Fixed Codebook Gains mapping during transcoding from 7.4 kbps mode to 12.2 kbps mode.

Figure 5.10: Fixed Codebook Gains mapping during transcoding from 12.2 kbps mode to 7.4 kbps mode.

The gains are quantized separately in 12.2 kbps mode; in the other modes, they are jointly quantized. The quantization in those modes is based on Moving-Average prediction, which makes it rather difficult to replace the parameters directly; further studies are needed.

The results obtained by these smart transcoding algorithms can be interpreted as follows. The classical transcoding approach needs to synthesize speech in order to re-compute the LPC coefficients, and these re-computed coefficients diverge from the original ones. The speech synthesized at decoder B has been distorted twice by successive quantizations, in the first encoding-decoding chain and in the second one.

5.4 Network Voice Quality Enhancement and Smart Transcoding

In the presence of external impairments such as high background noise and/or acoustic echo, the CELP coder alone does not deliver a clean speech signal [Eriksson 2006], because the transmitted parameters are contaminated by noise and/or acoustic echo. It has been demonstrated in Chap. 3 and 4 that noise reduction and/or acoustic echo cancellation can be performed by directly modifying the coded parameters. This allows us to integrate our proposed algorithms directly into smart transcoding, so that smart transcoding and coded-domain speech enhancement are performed simultaneously. The general overview of the VQE unit embedded in the smart transcoding process is depicted in Fig. 5.11 below.

This approach may be used as an alternative to classical transcoding: the scheme converts the bit-stream from compression format A to compression format B without fully decoding to PCM and re-encoding the signal. The complete VQE unit is now located in the network. At the near-end speaker side, the Mobile Device encodes the corrupted speech with encoder A and the corrupted bit-stream is transmitted over the network. At the transcoding stage, the corrupted bit-stream is decoded by decoder A. In parallel, the relevant corrupted parameters are extracted: the LPC coefficients and the fixed and adaptive codebook gains. The parameters needed to estimate the noise and/or acoustic echo parameters are also buffered. The relevant parameters are sent to the VQE unit for enhancement and modification (see the yellow box in Fig. 5.11). During encoding at encoder B, the enhanced parameters are mapped and used as input to the quantization units of encoder B; there is no need to re-compute the LPC coefficients, the fixed codebook gains or the adaptive codebook gains.

The implementation of such an approach has three main advantages in comparison with classical transcoding combined with frequency- or time-domain speech enhancement.
Figure 5.11: Structure of the Codec Domain VQE Embedded in Smart Transcoding.

– Reduction of the computational load. Considering the whole chain, encoder A to decoder A, decoder A to encoder B and encoder B to decoder B, the overall computational load is reduced: there is no need to compute, inside AMR encoder B, the LPC coefficients or the fixed and adaptive codebook gains. In addition, the enhancement algorithms use the codec parameters instead of the PCM speech samples, which further reduces the computational load. According to [Beaugeant, Taddei 2007], up to 27% of computational load reduction can be achieved. For example, for noise reduction, the proposed algorithms process four fixed codebook gains and only one or two sets of LPC coefficients per frame. In contrast, in the traditional frequency-domain approach, the PCM samples of the corrupted frame are first converted using the Fourier transform, noise reduction is performed on each frequency bin, and the enhanced signal is then transformed back into PCM format for re-encoding.

– Enhancement of the decoded speech quality. It has been demonstrated that avoiding transcoding reduces speech distortion. Furthermore, the corrupted parameters (the LPC coefficients and the fixed codebook gain), in the presence of background noise and/or acoustic echo, are replaced by their enhanced versions. It has also been verified in [Enzner et al. 2005] and [Rages, Ho 2002] that the performance of a standard AEC is strongly reduced inside the network. This degradation is due to the non-linearity introduced by the coder, created by the approximation and quantization of the speech excitation; indeed, the mobile speech signal recovered inside the network has passed through two speech codecs. The parameter-domain algorithms avoid this non-linearity and unpredictability since they are applied directly to the coded parameters, and no estimation of the linear acoustic echo path is required.

– Reduction of the algorithmic delay. Many functions are skipped during decoding and encoding. Compared to classical transcoding, part of the delay reduction comes from avoiding the look-ahead due to LPC windowing: a gain of 5 ms can be achieved, as the proposed system does not require the same amount of processing functions. Moreover, some classical VQE algorithms are based on time- or frequency-domain transforms, which introduce additional delay (typically 5 to 10 ms). The coded-domain VQE integrated into smart transcoding does not require any transform, so the delay can be substantially reduced.

5.5 The Proposed Architecture

In [Gnaba et al. 2003], the authors studied the integration of CELP structures of the 12.2 kbps coder in an acoustic echo canceller for the GSM network. Smart transcoding was not taken into account, but the authors aimed to minimize the quantization noise introduced by the coders.
We propose in this section a transcoding architecture between the 12.2 kbps and 7.4 kbps modes. Several experiments as well as informal listening tests were conducted to assess the impact of each parameter on speech quality, see Sec. 5.3.1. We tested several architectures and found during the simulations that the architecture described in Fig. 5.12 achieves a good compromise in terms of speech enhancement quality and objective tests. Due to the symmetry of the communication link, only one direction (from the near-end speaker A to the far-end speaker B) is presented in Fig. 5.12; the other direction (from the far-end speaker B to the near-end speaker A) is similar.

Figure 5.12: Proposed Architecture (AMR decoder A with parameter extraction, AEC/NR unit, and mapping of the enhanced LPC coefficients and of the fixed and adaptive gains into AMR encoder B, the skipped analysis functions being by-passed).

Fig. 5.12 presents the smart transcoding strategy between two GSM networks. If bit-stream A is in 7.4 kbps mode, then bit-stream B is in 12.2 kbps mode and vice versa. In the presence of perturbations (background noise and/or acoustic echo), the corrupted LPC coefficients and the corrupted fixed and adaptive codebook gains are first extracted during the decoding process at decoder A. The corrupted speech signal is then fully decoded and the needed parameters are recorded; these are the CELP parameters used during the estimation of the clean speech parameters. The LPC coefficients and the fixed codebook gain are processed inside the VQE unit developed in Chap. 3 and 4; the adaptive codebook gain remains unchanged. During the re-encoding of the decoded noisy signal, the enhanced LPC coefficients, the enhanced fixed codebook gain and the adaptive codebook gain are directly mapped inside encoder B. This procedure makes it possible to by-pass the functions normally used to compute these parameters, shown as yellow boxes in Fig. 5.12.

The fixed and adaptive gains are computed for each sub-frame. They are individually quantized in 12.2 kbps mode and jointly quantized in the other modes [3GPP 1999b]. Due to this constraint, as shown in Fig. 5.12, the modified fixed codebook gain is mapped into both decoder A and encoder B. During our investigations and tests, we saw that the performance improves significantly when the enhanced fixed codebook gain is also re-introduced inside decoder A, showing that the fixed codebook gain is a key parameter for a good VQE system. Two situations arise with this architecture. First, if encoder B is in 12.2 kbps mode, the coded-domain speech enhancement is performed both at decoder A and at encoder B. Second, if encoder B is in a mode other than 12.2 kbps, the speech enhancement takes place only at decoder A, since the mapping of the fixed codebook gain inside a target encoder in a mode other than 12.2 kbps has minimal effect.
We also found, through simulations and informal listening tests, that to achieve good results it is suitable to map the fixed codebook gain into encoder B and decoder A simultaneously. The enhanced bit-stream is finally sent to user B, whose Mobile Device synthesizes the enhanced speech signal with decoder B. The proposed architecture can easily be applied to transcoding between other modes if their characteristics are taken into account.

5.5.1 Noise Reduction Integrated in the Smart Transcoding Algorithm

In the presence of high background noise, the speech coder, especially a CELP coder, computes LPC coefficients and a fixed codebook gain that are affected by noise, and the speech quality delivered by decoder B after classical transcoding is degraded. We propose in this section to integrate into the smart transcoding algorithm the noise reduction algorithms discussed in Chap. 3. These algorithms enhance the corrupted, quantized parameters extracted from AMR decoder A. The VQE unit shown in Fig. 5.13 is the combination of the LPC NR and the fixed gain filtering.

Figure 5.13: Flowchart of Noise Reduction in Smart Transcoding.

The mapping is applied to three groups of parameters: the enhanced LPC coefficients $\hat{A}_S$, the enhanced fixed codebook gain $\hat{g}_{f,S}$ and the noisy adaptive codebook gain $g_{a,Y}$. The quantized adaptive codebook gain from AMR decoder A is used as input to the adaptive gain quantization unit of AMR encoder B. As there is no need to perform the LPC analysis, it is skipped at encoder B; the enhanced LPC coefficients are used directly to compute the LSF parameters at AMR encoder B. The enhanced fixed codebook gain $\hat{g}_{f,S}$ is likewise used as input to the gain quantization module of AMR encoder B.

5.5.2 Acoustic Echo Cancellation Integrated in the Smart Transcoding Algorithm

If the perturbation is an acoustic echo, only the fixed codebook gain is modified. The smart transcoding algorithm is identical to that used for NR: the corrupted LPC coefficients and the adaptive codebook gain from decoder A are mapped into encoder B. The VQE module used to modify the fixed codebook gain can be either the Gain Loss Control or the filtering of the fixed codebook gain. The end-to-end communication is required, since parameters from both the microphone signal y(n) and the loudspeaker signal x(n) are needed.

The Gain Loss Control Integrated in the Smart Transcoding Algorithm

In this section, the GLC is integrated into the smart transcoding. The fixed codebook gains of the microphone signal $g_{f,Y}$ and of the loudspeaker signal $g_{f,X}$ are needed by the GLC module; the adaptive codebook gains are also used during the estimation of the signal energies. The GLC unit computes the attenuation gains $a_y(m)$ and $a_x(m)$, which weight the microphone and loudspeaker fixed codebook gains respectively. As shown in Fig. 5.14, the enhanced microphone fixed codebook gain $a_y(m) \cdot g_{f,Y}$ is mapped into encoder B and decoder A, and the enhanced loudspeaker fixed codebook gain $a_x(m) \cdot g_{f,X}$ is mapped into encoder A and decoder B.
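The exact GLC attenuation rule belongs to Chap. 4; since only its inputs and outputs are described here, the sketch below uses a simple placeholder rule (the energy proxies, the smoothing factor and the attenuation floor are assumptions of this sketch, not the thesis' rule) to show how the two attenuation factors can be derived from the CELP gains alone.

```python
def glc_subframe(gf_y, ga_y, gf_x, ga_x, state, alpha=0.9, floor=0.1):
    """Schematic Gain Loss Control on CELP gains (illustrative placeholder).

    gf_y, ga_y -- microphone fixed and adaptive codebook gains at sub-frame m
    gf_x, ga_x -- loudspeaker fixed and adaptive codebook gains at sub-frame m
    state      -- dict holding the long-term energy estimates 'ey' and 'ex'
    Returns the attenuation factors (a_y, a_x) weighting gf_y and gf_x.
    """
    # Crude per-sub-frame energy proxies built from the codebook gains only.
    state['ey'] = alpha * state['ey'] + (1.0 - alpha) * (gf_y**2 + ga_y**2)
    state['ex'] = alpha * state['ex'] + (1.0 - alpha) * (gf_x**2 + ga_x**2)

    # If the loudspeaker energy dominates, the microphone sub-frame is assumed
    # to be mostly echo and is attenuated, and vice versa.
    ratio = state['ey'] / max(state['ex'], 1e-12)
    a_y = max(floor, min(1.0, ratio))                    # microphone attenuation
    a_x = max(floor, min(1.0, 1.0 / max(ratio, 1e-12)))  # loudspeaker attenuation
    return a_y, a_x
```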
Figure 5.14: Overview of the Gain Loss Control Integrated in Smart Transcoding.

The process is completed by encoding the PCM samples ($y_A(n)$ and $x_B(n)$) and mapping the corresponding LPC coefficients and adaptive gains: $(A_Y(m), g_{a,Y}(m))$ for the microphone path and $(A_X(m), g_{a,X}(m))$ for the loudspeaker path. The resulting bit-stream at the far-end speaker side leads to decoded speech in which the acoustic echo has been attenuated or cancelled.

Fixed Gain Filtering Integrated in the Smart Transcoding Algorithm

The architecture of the filtering of the fixed codebook gain for AEC is similar to that of the GLC. Acoustic echo cancellation is considered to be performed only on the microphone signal y(n). In this technique, the microphone fixed codebook gain is replaced by an estimate of the clean speech fixed codebook gain, computed as $G(m) \cdot g_{f,Y}$. The enhanced fixed codebook gain is mapped both into encoder B and into decoder A, as described in Fig. 5.15. The LPC coefficients and the adaptive codebook gains are not enhanced, but are directly mapped into encoder B during the re-encoding of the decoded signal $y_A(n)$.

Figure 5.15: Filtering of the Fixed Codebook Gain Integrated in Smart Transcoding.

5.6 Experimental Results

In this section, we present the overall and detailed performance of our proposed architecture through several objective tests. We first analyze the computational load and delay improvement of the smart transcoding algorithm. Then, we compute various objective measurements to compare our approach with former or classical ones. The proposed AEC algorithms are compared to the standard NLMS algorithm [Haykin 2002b], in particular the implementation of [Beaugeant et al. 2006]. The proposed NR system is compared to NR based on the classical Wiener filter approach, see [Beaugeant 1999].

5.6.1 Overall Computational Load and Algorithmic Delay

As demonstrated in [Beaugeant, Taddei 2007], the mapping of the LPC coefficients allows the Levinson-Durbin function to be skipped, reducing the complexity of the entire encoding process at encoder B by about 20%. Besides, if the down- and up-scaling at decoder A and encoder B are skipped, as well as the low-pass and high-pass filtering at encoder B, the proposed smart transcoding algorithm yields a further complexity reduction of about 7%. These two simplifications permit a total computational load reduction of approximately 27% in comparison with the classical transcoding approach. In smart transcoding from 12.2 kbps mode to 7.4 kbps mode, the LPC analysis is avoided and the look-ahead at the 7.4 kbps encoder due to the windowing is also skipped. Accordingly, this smart transcoding scheme provides a delay decrease of 5 ms.
Conversely, during smart transcoding from 7.4 kbps mode to 12.2 kbps mode, there is no delay reduction, as the LPC analysis window of the 12.2 kbps encoder has no look-ahead. For noise reduction, we process four fixed codebook gains and one or two sets of LPC coefficients per frame. The LPC vector has in practice M + 1 components, $A = [1, a_1, \cdots, a_M]$; since the first coefficient is not modified, only M or 2M LPC coefficients are processed by the algorithms. For the AEC algorithm, four fixed codebook gains are used per frame. The coded-domain VQE does not require any transformation such as an STFT or a filter-bank decomposition. In comparison with classical approaches such as the Wiener filter or the NLMS, the coded-domain VQE thus involves a low complexity.

5.6.2 Overall Voice Quality Improvement

The objective performance of the network VQE embedded in smart transcoding can be summarized as follows:

– The proposed noise reduction system tends to have objective performance similar to that of the classical Wiener approach. In transcoding from 7.4 kbps mode to 12.2 kbps mode, the performance of the proposed NR is above that of the standard Wiener filter.

– The proposed AEC algorithms are clearly better than the classical NLMS algorithm. With the GLC or the FFG, we obtain an ERLE ([Hansler, Schmidt 2004]) improvement of up to about 40 dB over the classical NLMS. The coded-domain network AEC discussed in this thesis is therefore suitable for network VQE.

The results supporting these statements were obtained through several experiments, as described in Sec. 5.6.3 and 5.6.4.

5.6.3 Noise Reduction

In order to check the behaviour and interest of our codec parameter domain network NR system (modification of the LPC coefficients and of the fixed codebook gain), we compare the objective measures provided by the classical NR algorithm (Wiener method in the frequency domain, as described in Chap. 3, Sec. 3.2.2) with those of our system. The comparison is meaningful because the LPC coefficient enhancement and the fixed codebook gain modification are in fact complementary: the former impacts the spectral representation of the noisy signal, while the latter impacts the noisy signal amplitude, i.e. the signal-to-noise ratio. The objective measurement tool used in this work ([ITU-T 2006a] and [Knappe, Goubran 1994]) is a further development of a tool included in the 3GPP GSM Technical Specification 06.77 [3GPP-GSM 1999] and the TIA TR45 Specification TIA/EIA/IS-853 [3GPP 1999a].

5.6.3.1 The ITU-T Objective Measurement Standard for GSM Noise Reduction

One of the main targets of an NR system is to maintain the power level of the speech signal, so that reducing the background noise does not also reduce the level of the speech. The ITU-T objective measurement standard [ITU-T 2006a] presents in detail objective instruments characterizing the effect of an NR method. These are the objective metrics defining the Signal-to-Noise Ratio Improvement (SNRI) and the Total Noise Level Reduction (TNLR). The SNRI is measured during speech activity and determines how much the processing improves the SNR of the speech signal. The TNLR estimates the level of noise reduction during both speech and non-speech periods. The analysis is completed by the computation of a Delta Measurement (DSN), which determines the balance between SNRI and TNLR.
The DSN metric reveals possible speech attenuation or undesired speech amplification caused by the NR algorithm. The objective performance of the proposed NR system is evaluated as an average over all test conditions, in dB overload (dBov), based on [ITU-T 2006d]. According to the recommendation in [ITU-T 2006a], the metrics should satisfy:

$\mathrm{SNRI} \geq 4\ \mathrm{dB}, \quad \mathrm{TNLR} \leq -5\ \mathrm{dB}, \quad -4\ \mathrm{dB} \leq \mathrm{DSN} \leq 3\ \mathrm{dB}$    (5.5)

The Delta Measurement also indicates whether the effects of a proposed noise reduction algorithm match the recommendation, and thus indicates the interest of a noise reduction system. A more detailed description of these objective measures is given in Appendix B.

5.6.3.2 Noise Reduction: Simulation Results

Objective Measurements

This section presents the objective performance of our proposed NR system when applied to two smart transcoding schemes: 7.4 kbps to 12.2 kbps and 12.2 kbps to 7.4 kbps. Clean speech signals were constructed from 24 utterances: 6 utterances from each of 4 speakers (2 female, 2 male). As for the noise signals, three sequences were considered: two car noises and one street noise. In total 216 noisy files were generated, covering the background noise types and Signal-to-Noise Ratio conditions of 6 dB, 12 dB and 18 dB.

                         7.4 kbps -> 12.2 kbps             12.2 kbps -> 7.4 kbps
    Metric               Codec Domain   Wiener Method      Codec Domain   Wiener Method
    DSN  (dBov)           -0.4415         0.1323             0.0061         0.1082
    SNRI (dBov)           11.9968         9.9732             8.8064         9.6539
    TNLR (dBov)          -12.4124        -9.9205            -8.7060        -9.9772

Table 5.1: Total Average of the Objective Metrics.

The overall observation from Table 5.1 is that there is a clear benefit in using our proposed method, as with the Wiener method: the objective measurements meet the recommendation requirements. In particular, the new approach proposed in this thesis reduces the noise level ($\mathrm{TNLR} = -12.41$ dBov) and amplifies the signal ($\mathrm{SNRI} \approx 12$ dBov) better than the Wiener method ($\mathrm{TNLR} = -9.92$ dBov and $\mathrm{SNRI} \approx 10$ dBov) in the 7.4 kbps to 12.2 kbps smart transcoding mode. This result is explained by the fact that, in this smart transcoding mode, the noise reduction is performed both in the decoder of the 7.4 kbps mode and in the encoder of the 12.2 kbps mode: the enhanced fixed codebook gain is mapped both into the 7.4 kbps decoder and into the 12.2 kbps encoder.

From 12.2 kbps mode to 7.4 kbps mode, the Wiener method reduces the noise level and amplifies the signal slightly more than our proposed method. This is explained by the fact that, in this transcoding mode, the mapping of the modified fixed codebook gain has minimal effect: as previously indicated, the fixed and adaptive gains are jointly quantized in 7.4 kbps mode, and our smart transcoding strategy does not modify the quantization. Our proposed method is thus influenced by the AMR-NB modes, the gain mapping having a very low effect in smart transcoding from 12.2 kbps mode to 7.4 kbps mode. Based on this observation, we found through several experiments that the noise reduction algorithm is more efficient if the modified fixed gain is also mapped inside decoder A. As a consequence, the noise reduction is performed only at decoder A in this smart transcoding mode. The principle of the dual mapping of the fixed gain (inside encoder B and decoder A) can be seen in Fig. 5.13 (operation in red).
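For readers wishing to reproduce measures of this kind, the sketch below computes simplified stand-ins for SNRI and TNLR. It is not the standardized tool used for Table 5.1, which works on calibrated levels in dBov; this version assumes an available, time-aligned clean reference and neglects speech distortion in the residual-noise estimate.

```python
import numpy as np

def snri_tnlr(clean_ref, noisy, enhanced, vad, eps=1e-12):
    """Simplified SNRI/TNLR stand-ins (not the standardized implementation).

    clean_ref, noisy, enhanced -- time-aligned float arrays
    vad -- boolean array, True where speech is active
    """
    noise_in = noisy - clean_ref      # noise component before processing
    noise_out = enhanced - clean_ref  # residual noise after processing (approximation)

    def snr_db(noise, mask):
        return 10.0 * np.log10((np.mean(clean_ref[mask]**2) + eps) /
                               (np.mean(noise[mask]**2) + eps))

    # SNRI: SNR gain measured during speech activity.
    snri = snr_db(noise_out, vad) - snr_db(noise_in, vad)
    # TNLR: change of the noise level over the whole file (negative = noise reduced).
    tnlr = 10.0 * np.log10((np.mean(noise_out**2) + eps) /
                           (np.mean(noise_in**2) + eps))
    return snri, tnlr
```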
During smart transcoding from 12.2 kbps mode to 7.4 kbps mode, our proposed method achieves an interesting balance between the total noise reduction level and the possible speech amplification: in this scheme, the Delta Measurement is particularly low, $\mathrm{DSN} = 0.0061$ dBov. Another remark is that the Wiener method provides fairly consistent results, independent of the AMR modes. This is because the Wiener method analyzes the completely decoded speech signal, whereas our proposed method only uses coded parameters to perform noise reduction.

The objective measurements also depend on the Signal-to-Noise Ratio levels. Figs. 5.16 and 5.17 present the objective measurements when transcoding from 12.2 kbps mode to 7.4 kbps mode and from 7.4 kbps mode to 12.2 kbps mode respectively. On the one hand, the total SNRI decreases when the segmental SNR increases; on the other hand, the total TNLR increases with the segmental SNR. The Delta Measurement (DSN) is more difficult to evaluate, since it depends on the levels of the SNRI and the TNLR. In these simulations, the DSN lies between -1 and +1 dBov, revealing a good balance between the SNR improvement and the total noise level reduction; it also shows that any speech level reduction is well compensated by the speech amplification. The outcomes of the proposed noise reduction algorithm are confirmed in Figs. 5.16 and 5.17: its performance relative to the classical Wiener method depends on the transcoding strategy. If the transcoding is performed from 12.2 kbps mode to 7.4 kbps mode, the Wiener approach achieves a signal amplification about 2 dBov better than the coded-domain approach; this gap decreases to less than 1 dBov as the segmental SNR increases to 18 dB. The same remarks hold for the total noise level reduction (TNLR).

Figure 5.16: Objective Metrics (DSN, SNRI, TNLR) versus Segmental SNR. Transcoding from 12.2 kbps mode to 7.4 kbps mode. Proposed NR method (blue dashed circles), Wiener NR method (red dashed diamonds).

If the transcoding is performed from 7.4 kbps mode to 12.2 kbps mode, our method performs better than the classical Wiener approach, see Fig. 5.17: the coded-domain approach increases the SNR (SNRI) by around 2 dBov more than the classical Wiener method, and the total noise attenuation (TNLR) is almost 2 dBov greater than that of the classical Wiener approach.

Figure 5.17: Objective Metrics (DSN, SNRI, TNLR) versus Segmental SNR. Transcoding from 7.4 kbps mode to 12.2 kbps mode. Proposed NR method (blue dashed circles), Wiener NR method (red dashed diamonds).

Spectrograms

As noise attenuation and speech amplification alone cannot fully describe the speech enhancement, we analyze in the following the spectrograms of the processed and unprocessed noisy speech signals.
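Spectrograms such as those of Figs. 5.18 to 5.21 can be produced with standard tools; a minimal sketch follows, where the window length and overlap are illustrative assumptions (the analysis settings of the thesis figures are not specified) and fs = 8000 Hz matches narrowband speech.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

def plot_speech_spectrogram(x, fs=8000):
    """Grey-scale spectrogram display (darker = more energy)."""
    f, t, sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
    plt.pcolormesh(t, f, 10.0 * np.log10(sxx + 1e-12), shading='auto', cmap='gray_r')
    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Hz)')
    plt.show()
```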
Fig. 5.18 represents the spectrogram of a speech signal corrupted by car noise at a segmental SNR of 6 dB. The spectrogram allows a joint temporal and frequency analysis of the speech signal: the vertical axis represents frequency, the horizontal axis time, and the signal amplitude is proportional to the darkness of the picture. The spectrogram in Fig. 5.19 represents the noisy speech processed by the standard Wiener filter. Figs. 5.20 and 5.21 present the spectrograms of the speech processed by the coded-domain noise reduction when transcoding from 12.2 kbps mode to 7.4 kbps mode and from 7.4 kbps mode to 12.2 kbps mode respectively. One can see the effectiveness of the coded-domain noise reduction in the interval from 3 s to 5 s. The noise is more attenuated when transcoding from 7.4 kbps mode to 12.2 kbps mode (Fig. 5.21) than when transcoding from 12.2 kbps mode to 7.4 kbps mode (Fig. 5.20); this additional attenuation is due to the mapping of the fixed codebook gain inside the 12.2 kbps encoder. The shape of the speech signal is relatively unmodified in transcoding from 12.2 kbps mode to 7.4 kbps mode, and slightly distorted in transcoding from 7.4 kbps mode to 12.2 kbps mode. These distortions in speech periods (1 s to 3 s and 5 s to 7 s) appear mainly during speech transitions and at the end of speech bursts. In these specific areas, SNR ≈ 0 dB, which leads to a poor estimation of the LPC coefficients.

Figure 5.18: Spectrogram of the Noisy Speech Signal: 6 dB Segmental SNR.

Figure 5.19: Spectrogram of the Noisy Speech Enhanced with the Standard Wiener Filter.

5.6.4 Acoustic Echo Cancellation

The corrupted files used in this section are those already tested in Chap. 4. Three different filters h1, h2 and h3 were considered to simulate the acoustic echo. The files were built such that each contains an echo-only period, a single-talk period of the near-end speaker and a double-talk period. The overall Signal-to-Echo Ratio (SER), computed during the double-talk periods, was respectively 11 dB, 15 dB and 17 dB. In periods of echo only (remote single talk), the Echo Return Loss Enhancement (ERLE) [Rages, Ho 2002] is a suitable performance criterion for an AEC algorithm. The ERLE characterizes the ratio between the energy of the original echo and the energy of the residual echo; if the echo energy is completely attenuated, the acoustic echo is no longer noticeable. We compare in the following the ERLE of the classical NLMS with that of the coded-domain processing (cf. Chap. 4). For each scenario, the ERLE is computed by averaging the per-frame ERLE values:

$\overline{\mathrm{ERLE}} = \frac{1}{C(N_p)} \sum_{\ell=1}^{C(N_p)} \mathrm{ERLE}(\ell)$    (5.6)

where $C(N_p)$ represents the total number of frames and $N_p$ the frame length. For each file, $\mathrm{ERLE}(\ell)$ is computed as follows:

$\mathrm{ERLE}(\ell) = -10 \cdot \log_{10} \left( \frac{\sum_{n=1}^{N_p} \hat{s}(\ell N_p + n)^2}{\sum_{n=1}^{N_p} y(\ell N_p + n)^2} \right)$    (5.7)

5.6.5 Simulation Results

The overall ERLE measurements are presented in Tab. 5.2. The simulation environment corresponds to a typical simulated GSM network.
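Eqs. (5.6) and (5.7) translate directly into code; a minimal sketch (the small constants guarding the logarithm are an implementation detail added here):

```python
import numpy as np

def erle(y, s_hat, frame_len):
    """Per-frame ERLE of Eq. (5.7) and its average, Eq. (5.6).

    y         -- microphone signal (echo-corrupted)
    s_hat     -- enhanced signal after AEC
    frame_len -- N_p, number of samples per frame
    """
    n_frames = min(len(y), len(s_hat)) // frame_len
    erle_l = np.empty(n_frames)
    for l in range(n_frames):
        seg = slice(l * frame_len, (l + 1) * frame_len)
        num = np.sum(np.asarray(s_hat[seg])**2) + 1e-12  # residual energy
        den = np.sum(np.asarray(y[seg])**2) + 1e-12      # microphone energy
        erle_l[l] = -10.0 * np.log10(num / den)          # Eq. (5.7)
    return erle_l, float(erle_l.mean())                  # Eq. (5.6)
```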
Figure 5.20: Spectrogram of Coded-Domain Enhancement: Transcoding from 12.2 kbps mode to 7.4 kbps mode.

Figure 5.21: Spectrogram of Coded-Domain Enhancement: Transcoding from 7.4 kbps mode to 12.2 kbps mode.

The results presented in Tab. 5.2 characterize the mean ERLE over all test conditions as well as the ERLE for each filter. They report the ERLE of the corrupted files enhanced by the Gain Loss Control (GLC), by the Filtering of the Fixed Gain (FFG) algorithm and by the standard Normalized Least-Mean-Square (NLMS). ERLEref represents the ERLE obtained with the microphone signal when there is no echo.

Echo-Only Periods

During echo-only periods, the ERLE is appropriate for characterizing AEC performance. The NLMS implemented inside the network is a linear AEC, and this linear AEC is not efficient enough to decrease the acoustic echo level: the voice quality is not improved, and the NLMS produces an ERLE below 10 dB. Its performance increases somewhat in transcoding from 12.2 kbps mode to 7.4 kbps mode. The overall ERLE results show that, independently of the SER, the coded-domain AEC integrated in smart transcoding (FFG) achieves the ERLE required in GSM networks, that is to say 45 dB [ITU-T 2006b]. The ERLE is particularly high (up to 50 dB) during transcoding from 7.4 kbps mode to 12.2 kbps mode. This is justified by the fact that, in this transcoding, the acoustic echo cancellation is performed both at the decoder of the 7.4 kbps mode and in the encoder of the 12.2 kbps mode, see Tab. 5.2 (a). The poor NLMS result was already observed in [Huang, Goubran 2000]: the effect of the vocoder is responsible for the low performance of classical AEC approaches such as the NLMS. The ERLE of the GLC algorithm increases with the SER in transcoding from 12.2 kbps mode to 7.4 kbps mode; by contrast, the ERLE of the classical NLMS decreases when the SER increases. The FFG algorithm tends to produce an ERLE that depends on the kind of acoustic echo path.

Double-Talk Periods

The GLC method computes and applies attenuation coefficients to both the microphone and the loudspeaker fixed codebook gains. The impression at high SER is that the echo signal level is reduced. This effect is confirmed by the ERLE values: 12 dB in transcoding from 12.2 kbps mode to 7.4 kbps mode and 9 dB in transcoding from 7.4 kbps mode to 12.2 kbps mode. The FFG shows good improvements during double-talk periods: the acoustic echo level is reduced. As shown in Tab. 5.2 (a) and (b), the overall ERLE is about 9 dB when transcoding from 7.4 kbps mode to 12.2 kbps mode and around 15 dB in transcoding from 12.2 kbps mode to 7.4 kbps mode. The NLMS does not achieve any significant enhancement of the microphone signal in these double-talk periods: the ERLE is low, 2.7 dB during transcoding from 12.2 kbps mode to 7.4 kbps mode and only 1.5 dB when transcoding from 7.4 kbps mode to 12.2 kbps mode.
Near-End Speaker Single Talk

The ERLE in single-talk periods of the near-end speaker represents the distortion introduced by the AEC system. The overall results show that the FFG and the GLC introduce approximately the same distortion to the decoded speech. The NLMS algorithm behaves well in these periods: its ERLE is lower than 1 dB when transcoding from 7.4 kbps mode to 12.2 kbps mode, and about 1 dB in the transcoding from 12.2 kbps mode to 7.4 kbps mode.

Typical Evolution of the ERLE

Fig. 5.22 and Fig. 5.23 show a typical comparative evolution of the ERLE of the GLC, the FFG and the NLMS. These examples present processing where the acoustic echo was simulated using the impulse response h1 (11 dB SER). In the time representation of the microphone signal y(n), the interval from t = 0 s to t = 3 s corresponds to the echo-only period (single talk of the far-end speaker), the single talk of the near-end speaker goes from t = 3 s to t = 6 s, and the double-talk period occurs from t = 6 s to t = 10 s. During echo-only periods, the FFG technique (curve in blue) provides the highest ERLE: on average, the ERLE of the FFG algorithm is improved by about 35 dB to 45 dB compared to the standard NLMS when transcoding from 12.2 kbps mode to 7.4 kbps mode. During the near-end single-talk period of this transcoding, the ERLE is particularly low, implying little distortion of the enhanced speech. As shown in Fig. 5.23, during transcoding from 7.4 kbps mode to 12.2 kbps mode, the FFG algorithm cancels much more echo than the GLC and the NLMS: the FFG ERLE is in this case 30 dB to 40 dB higher than that of the classical NLMS and 20 dB to 30 dB higher than that of the GLC.

Figure 5.22: Time Evolution of the ERLE: from 12.2 kbps mode to 7.4 kbps mode, filter h1.

Figure 5.23: Time Evolution of the ERLE: from 7.4 kbps mode to 12.2 kbps mode, filter h1.

5.7 Conclusion

In this chapter, we have demonstrated a smart transcoding algorithm applied to the LPC coefficients, the fixed codebook gain and the adaptive codebook gain. The coded-domain VQE unit was embedded inside the smart transcoding strategies. The VQE algorithms were achieved by modifying the fixed codebook gain and the LPC coefficients. The effectiveness of the NR and/or AEC improvement was verified through several experiments and objective evaluations. The possibility of jointly performing AEC and NR inside a smart transcoding strategy was also explored.

Coded-domain NR (modification of the fixed codebook gain and of the LPC coefficients) integrated inside smart transcoding provides objective results approximately similar to those of the standard Wiener filter. If smart transcoding is performed from 7.4 kbps mode to 12.2 kbps mode, our proposed method achieves objective results better than the classical Wiener filter method. The performance of our proposed NR system thus depends on the coder modes used during transcoding. With regard to the AEC, the Gain Loss Control achieves good improvement, especially during single talk of the near-end speaker.
This coded-domain GLC algorithm is simple, and the requirement for GSM network AEC (an ERLE of at least 45 dB) is achieved. The filtering of the fixed codebook gain as an AEC algorithm, called FFG in this thesis, provides the best objective results among all the simulations. The FFG is most effective in echo-only periods, as shown in the ERLE tables (ERLE better than 45 dB). During double-talk periods, the FFG reduces the acoustic echo effect while keeping the near-end speech understandable. In comparison, the classical NLMS introduces high distortion on the enhanced speech signal in double-talk periods: at high SER, the useful speech becomes inaudible. Informal listening tests show that the Gain Loss Control slightly attenuates both the microphone and the loudspeaker signals during double talk. Another critical aspect of AEC implemented in the network is that the performance of the classical solution (NLMS) is drastically reduced: the ERLE results show that the classical NLMS values are in general below 10 dB in echo-only periods. The NLMS used in our simulations is affected by the non-linearity and the unpredictability introduced by the speech codecs.

This chapter also points out the interest of such an implementation in terms of computational load and delay. Several processing functions during the second encoding are skipped. The AEC is performed by enhancing a single parameter, the fixed codebook gain; the NR is achieved by modifying two parameters, the fixed codebook gain and the set of LPC coefficients. The proposed system does not require any transformation such as an FFT: the parameters are extracted and directly modified or filtered before the mapping. The algorithmic delay in such an approach is slightly reduced, especially if encoder B is not in the AMR 12.2 kbps mode.
(a) ERLE values (dB) during transcoding from AMR 7.4 kbps mode to AMR 12.2 kbps mode

Filter (SER)    Algorithm   Single Talk s(n)   Double Talk   Echo-only   Total ERLE
h1 (11 dB)      ERLEref           2.4329          4.3700       57.8149      20.4490
                NLMS              0.6011          2.7468        8.3659       3.8144
                GLC               1.9584          9.5187       44.7805      18.1106
                FFG               1.2982          8.5758       55.7542      20.9807
h2 (15 dB)      ERLEref           2.6239          3.6132       49.7853      17.7191
                NLMS              0.8968          0.0032        4.5665       1.7150
                GLC               1.0704          6.0136       36.1725      13.8502
                FFG               0.2114          6.1459       39.2706      14.5912
h3 (17 dB)      ERLEref           2.4288          3.7193       52.0316      18.3972
                NLMS              0.5456          1.2582        5.3884       2.3137
                GLC               4.1010         13.6213       51.3593      22.3873
                FFG               0.7930         10.6313       49.7075      19.6809
Overall         ERLEref           2.4791          3.9368       53.6388      18.9952
                NLMS              0.6542          1.5027        6.2920       2.7268
                GLC               3.6475         12.8212       40.9326      18.6600
                FFG               0.8371          8.7354       49.3658      18.8959

(b) ERLE values (dB) during transcoding from AMR 12.2 kbps mode to AMR 7.4 kbps mode

Filter (SER)    Algorithm   Single Talk s(n)   Double Talk   Echo-only   Total ERLE
h1 (11 dB)      ERLEref           3.1036          4.9666       58.9073      21.2192
                NLMS              1.1332          3.0465       11.7010       5.1368
                GLC               1.6522          7.2345       21.5751       9.9251
                FFG               2.1057         13.1562       49.3460      20.9194
h2 (15 dB)      ERLEref           3.0776          8.0276       53.9084      20.7731
                NLMS              1.2798          2.9885        9.1269       4.3584
                GLC               3.0964         16.4988       44.4998      20.9502
                FFG               2.4219         17.0227       40.3211      19.6211
h3 (17 dB)      ERLEref           2.9290          5.4576       53.2751      19.5841
                NLMS              0.8912          1.9921        9.0305       3.8379
                GLC               2.7448         12.2837       49.0305      20.7054
                FFG               1.8950         15.4311       40.7390      18.9985
Overall         ERLEref           3.0367          6.1506       55.3636      20.5255
                NLMS              1.1014          2.6757        9.9529       4.4443
                GLC               2.4978         12.0057       38.3684      17.1936
                FFG               2.1409         15.2033       43.4687      19.8464

Table 5.2: Echo Return Loss Enhancement Values.

Chapter 6

General Conclusion

6.1 Context

External impairments such as high background noise and acoustic echo reduce speech quality and intelligibility during communication. The solutions to these problems, Noise Reduction (NR) and/or Acoustic Echo Cancellation (AEC) algorithms, are generally implemented in the Mobile Device. However, network operators have started to deploy Voice Quality Enhancement (VQE) solutions directly inside the network. This transposition from the Mobile Device to the network is motivated by several advantages. A centralized control of quality across the network is achieved with such an implementation. Moreover, the low-complexity constraint imposed by the power supply of Mobile Devices restricts the choice of algorithms and thus limits system performance, whereas there is no comparable complexity restriction inside the network. With standard network-based VQE, the corrupted speech signal must be decoded, enhanced and then re-encoded. In practice, classical approaches to acoustic echo control based on linear methods can be drastically affected by the presence of CELP-based speech codecs in the communication chain. The drawbacks of the standard approaches are increased computational load and delay, and quality degradation.

The development of new networks generally comes with the deployment of new speech codecs which are not interoperable with existing ones. Bit-rate conversion via standard transcoding is then necessary for network interoperability: decoding of the source bit-stream and re-encoding into the target bit-stream. This standard transcoding always degrades the speech quality and can introduce additional delay.
An alternative approach, called smart transcoding, exploits the similarity between speech codecs to perform the network interconnection: in a smart transcoding scheme, the parameters of the source coder are directly mapped into the target codec. The common point between the classical NR and AEC algorithms and classical transcoding is that the processing is carried out on the signal in PCM format. Many codecs deployed in modern digital networks follow the legacy of CELP; the 3GPP AMR-NB speech codec standard, with its different modes, is used in this thesis as the platform for experiments and applications. The challenge in the context of this thesis is to maintain the Quality of Service (QoS) at low computational cost in modern digital networks.

6.2 Thesis Contribution

Following the context detailed in Chap. 1, this thesis is based on two key concepts:

– VQE algorithms that deal exclusively with the CELP codec parameters (coded-domain VQE algorithms) and that are implemented inside the network.

– Integration of the coded-domain VQE algorithms inside a smart transcoding strategy.

Four algorithms have been developed during this thesis: two algorithms dedicated to acoustic echo cancellation via the processing of the fixed codebook gain of the CELP codec, and two algorithms for noise reduction. The first NR algorithm modifies the fixed codebook gain; the second enhances the spectral characteristics of the speech through modification of the LPC coefficients. These concepts were implemented after the physical meaning of the CELP parameters was examined in Chap. 2.

Noise Reduction

An NR algorithm based on the filtering of the fixed codebook gain of the CELP codec has been proposed in Sec. 3.4 of Chap. 3. In each sub-frame, a filter is computed and applied to the fixed codebook gain of the noisy speech; the result of this filtering is an estimate of the fixed codebook gain of the useful speech, which then replaces the noisy fixed gain inside the bit-stream. This filter is Wiener-based. The a priori and a posteriori Signal-to-Noise Ratios have also been transposed into the codec parameter domain, the SNR estimation being based on an extrapolation and transposition of the Ephraim and Malah rule. The Minimum Statistics method, generally implemented in the frequency domain to estimate the noise PSD, has been transposed and used to estimate the fixed codebook gain of the noise. The perceived noise level is reduced thanks to this approach.

An NR based on enhancing the spectral characteristics of the noisy speech signal has been presented in Sec. 3.5 of Chap. 3. This new algorithm modifies the corrupted LPC coefficients; the contribution of this section was published at ITG-SCC 2008 [Thepie et al. 2008]. Two methods are adopted, based on the voice activity detection integrated in the AMR-NB. If no speech is present, the noisy LPC coefficients are damped, proportionally to the noise signal amplitude. In the presence of both noise and useful speech, a modification function is necessary. This modification has been obtained by exploiting the LPC analysis procedure of the noisy speech signal; the modification function is finally designed as the relation between the LPC coefficients of the useful speech, of the noisy speech and of the noise.
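To make this gain-domain filtering concrete, here is a hedged sketch of a decision-directed Wiener rule transposed to the fixed codebook gain; the smoothing constant and the flooring are placeholders, and the noise-gain input would come from the Minimum Statistics transposition mentioned above.

```python
def wiener_fixed_gain(gf_noisy, gf_noise_est, gf_prev_clean, beta=0.98, eps=1e-12):
    """Decision-directed Wiener rule on the fixed codebook gain (schematic).

    gf_noisy      -- decoded noisy fixed gain of the current sub-frame
    gf_noise_est  -- estimated fixed gain of the noise (e.g. via a Minimum
                     Statistics transposition, cf. Chap. 3)
    gf_prev_clean -- clean-gain estimate of the previous sub-frame
    """
    # A posteriori SNR in the gain domain.
    snr_post = gf_noisy**2 / max(gf_noise_est**2, eps)
    # Decision-directed a priori SNR, in the spirit of Ephraim and Malah.
    snr_prio = (beta * gf_prev_clean**2 / max(gf_noise_est**2, eps)
                + (1.0 - beta) * max(snr_post - 1.0, 0.0))
    # Wiener gain applied to the noisy fixed codebook gain.
    return (snr_prio / (1.0 + snr_prio)) * gf_noisy
```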
An NR based on the enhancement of the spectral characteristics of the noisy speech signal has been presented in Sec. 3.5 of Chap. 3. This new algorithm has been implemented by modifying the corrupted LPC coefficients; the contribution of this section has been published at ITG-SCC 2008 [Thepie et al. 2008]. Two methods have been adopted, based on the voice activity detection integrated in AMR-NB. If no speech is present, the noisy LPC coefficients are damped, proportionally to the noise signal amplitude. In the presence of both noise and useful speech, a modification function is necessary. This function has been obtained by exploiting the LPC analysis procedure of the noisy speech signal, and is finally designed as the relation between the LPC coefficients of the useful speech, the noisy speech and the noise signals. As only the noisy speech parameters are available, it is necessary to estimate the noise autocorrelation matrix, the noise LPC coefficients, the clean speech autocorrelation matrix and the clean speech LPC coefficients. The Inverse Recursive Levinson-Durbin algorithm has been introduced to estimate the noise autocorrelation coefficients, and thus the autocorrelation matrix. The modification function introduced in this NR algorithm is considered a useful estimator of the clean speech LPC coefficients, especially during voiced speech segments.

Acoustic Echo Cancellation

The first AEC discussed in this thesis appears in Sec. 4.5 of Chap. 4. This algorithm is the transposition of the Gain Loss Control generally implemented in the time domain. The idea consists in computing two attenuation gains to be applied to the microphone and loudspeaker fixed gains. In the CELP parameter domain, the fixed codebook gain has been considered a good representation of the signal energy, so the energies of the microphone and loudspeaker signals are estimated based only on the fixed codebook gains and the adaptive gains. The metric used to compute the attenuation gains is the ratio between the estimated microphone and loudspeaker energies. Finally, the microphone and loudspeaker signals are distinctly attenuated according to the ratio of their long-term estimated energies. This algorithm has no double talk detection, but its performance is very promising, as attenuation is performed even during double talk periods.

The second AEC algorithm, proposed in Sec. 4.6 of Chap. 4, is a more complex approach based on the filtering of the fixed codebook gain of the microphone signal. The filter derives from an extrapolation of the standard Wiener filter and is built as a function of the Signal-to-Echo Ratio (SER) in the coded domain. The SER is computed in the parameter domain using a transposition of the Ephraim and Malah rule, as in noise reduction. Discrimination between echo presence and double talk periods has been introduced to design the filter applied to the fixed codebook gain: echo presence and double talk periods are detected by analyzing the loudspeaker energy and the normalized cross-correlation function between the microphone and loudspeaker fixed codebook gains. The estimation of the echo signal fixed codebook gain in each sub-frame assumes that the echo fixed codebook gain is an attenuated and shifted version of the current loudspeaker fixed codebook gain. The sub-frame shift derives from the maximum of the normalized cross-correlation, and the attenuation coefficient is computed from the ratio between the shifted loudspeaker fixed gain and the current microphone fixed codebook gain. The sub-frame shift and the attenuation coefficients are updated only if echo is detected. This coded-domain AEC approach provides listening test results similar to those of the NLMS. This contribution has been presented at IWAENC 2006 [Thepie et al. 2006].
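The gain loss control principle can be illustrated by a short Python sketch operating on one pair of sub-frame gains. This is a simplified illustration rather than the algorithm of Sec. 4.5: the function name, the smoothing factor and the attenuation floor are hypothetical, and the energy proxies are reduced here to the fixed gains alone, whereas the thesis also involves the adaptive gains.

```python
def gain_loss_control(g_mic, g_ls, state, lam=0.9, att_min=0.1):
    """One sub-frame of a gain loss control transposed to the gain domain.

    g_mic, g_ls -- fixed codebook gains of the microphone and loudspeaker
                   paths, used here as proxies for the signal energies
    state       -- dict with long-term energy estimates 'e_mic' and 'e_ls'
    """
    eps = 1e-12
    # Long-term energy estimates, recursively smoothed in the gain domain.
    state['e_mic'] = lam * state['e_mic'] + (1.0 - lam) * g_mic * g_mic
    state['e_ls'] = lam * state['e_ls'] + (1.0 - lam) * g_ls * g_ls
    ratio = state['e_mic'] / max(state['e_ls'], eps)
    # Attenuate the weaker path: when the loudspeaker dominates, the
    # microphone sub-frame is likely echo, and vice versa; no explicit
    # double talk detector is required.
    if ratio < 1.0:
        att_mic, att_ls = max(ratio, att_min), 1.0
    else:
        att_mic, att_ls = 1.0, max(1.0 / ratio, att_min)
    return att_mic * g_mic, att_ls * g_ls
```

The state dictionary (e.g., {'e_mic': 0.0, 'e_ls': 0.0} at start-up) is carried from sub-frame to sub-frame, so the attenuation decision relies on long-term trends rather than on instantaneous gains.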
Network Voice Quality Enhancement and Interconnection Solution

Smart transcoding refers to the ability to provide an effective, quality-wise transparent way to map the parameters between two speech coders. In Chap. 5, a smart transcoding strategy has been implemented between the 12.2 kbps and 7.4 kbps modes of the 3GPP AMR-NB. After several experiments and informal listening tests, the smart transcoding algorithm proposed in this thesis is applied to the LPC coefficients and to the fixed and adaptive codebook gains. Chap. 5 thus proposes the integration of the coded-domain VQE algorithms developed above into the smart transcoding scheme. The overall improvement is as follows:

Computational Load Reduction: the retained smart transcoding strategy involves a computational load reduction of about 27% compared to the standard transcoding between the AMR-NB modes. This complexity gain is due to the fact that the computations of the LPC coefficients and of the fixed and adaptive gains are not performed at the target encoder.

Delay Reduction: the delay with this smart transcoding strategy is reduced only if the target encoder is not the 12.2 kbps mode. The look-ahead needed in LPC analysis is then skipped, leading to a delay reduction of 5 ms. The processing delay with this approach is particularly low: the proposed NR and AEC algorithms enhance only two sets of parameters (the LPC coefficients and the fixed codebook gain) at each sub-frame and, in comparison with the standard approach, no transform such as the STFT is required.

The TIA TR-45 specification TIA/EIA-IS-853 objective measurement methodology has been adopted to evaluate the performance of the proposed NR algorithm (the combination of the fixed codebook gain and LPC coefficient modifications). The metrics used are the Total Noise Level Reduction (TNLR) and the Signal-to-Noise Ratio Improvement (SNRI). An additional metric (DSN) is computed to quantify the balance between possible over-attenuation of the useful signal and amplification of the noise. The proposed noise reduction algorithm integrated inside the smart transcoding strategy has been compared to the classical Wiener filter approach. The overall conclusion is that our proposed algorithm performs better than the standard Wiener filter NR when transcoding from the AMR 7.4 kbps mode to the AMR 12.2 kbps mode; when transcoding from the 12.2 kbps mode to the 7.4 kbps mode, its performance is similar to that of the classical Wiener filter. The DSN measures reveal a good balance between the SNR improvement and the noise reduction: −0.5 ≤ DSN ≤ 0.5 for our solution.

The performance of an AEC algorithm located inside the network has been evaluated in this thesis using the Echo Return Loss Enhancement (ERLE). Our proposed algorithms (Gain Loss Control (GLC) and Filtering of the Fixed Gain (FFG)) have been tested against the standard NLMS algorithm located inside a GSM network. It has been demonstrated in [Huang, Goubran 2000] that the presence of CELP speech coders degrades the ERLE of AEC algorithms. The 45 dB ERLE required in GSM has been achieved with both GLC and FFG during echo-only periods, whereas the ERLE of the standard NLMS is in general below 10 dB. The ERLE measures also show that the NLMS method has minimal effect during double talk periods, with an ERLE of less than 3 dB. The proposed algorithms are designed to reduce the acoustic echo even during double talk periods: the ERLE in double talk is about 15 dB with the FFG and up to 12 dB with the GLC. The ERLE performance of the proposed coded-domain AEC when transcoding from the 12.2 kbps mode to the 7.4 kbps mode is very similar to that obtained when transcoding from the 7.4 kbps mode to the 12.2 kbps mode.
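For reference, ERLE figures such as those of Table 5.2 are obtained from segment power ratios of the kind sketched below (the function name and the flooring constant are illustrative); the measure is applied separately to the single talk, double talk and echo-only segments.

```python
import numpy as np

def erle_db(mic, residual, eps=1e-12):
    """ERLE over one period type (single talk, double talk or echo-only):
    power ratio, in dB, between the microphone signal entering the
    canceller and the residual signal after echo cancellation."""
    p_mic = np.mean(np.square(np.asarray(mic, dtype=float)))
    p_res = np.mean(np.square(np.asarray(residual, dtype=float)))
    return 10.0 * np.log10((p_mic + eps) / (p_res + eps))
```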
This thesis has proposed a low-cost solution to centralized VQE, in particular to network NR, network AEC and network interoperability problems. The results obtained during this thesis are compelling, not only because the computational demand is low, but also because the objective performance figures are good. The approach behaves all the better since the resources required by classical filtering algorithms are no longer a limiting factor.

1. The NR performance achieved in this thesis is close to that of the standard Wiener approach, while the computational load and the delay are significantly reduced compared to the standard techniques.
2. The performance obtained with our proposed AEC located inside the network is better than that of classical adaptive approaches. The performance degradation due to the non-linearities introduced on the acoustic echo path by the CELP codecs is avoided.
3. The interconnection problem between two compatible networks is also eased through the reduction of the computational load and the delay: there is no need to compute the LPC coefficients and the fixed and adaptive gains during the transcoding.

6.2.1 Perspectives

The concept and the algorithms proposed in this thesis can be improved in several directions. The first, and most critical, perspective is the algorithmic complexity evaluation (exact total number of multiplications and additions) of the proposed NR and AEC algorithms, which would enable a global complexity comparison with standard techniques. Appropriate quality evaluation tools are also necessary for the new parameter-domain algorithms; the explanation and interpretation of the results obtained with these algorithms need to be specifically defined.

This work can be continued by enhancing the techniques used to estimate the unknown parameters. For example, the VAD decision used in this thesis is based on the analysis of PCM samples; the development of a VAD operating directly on the CELP coded parameters would be very beneficial to the NR algorithm based on modification of the LPC coefficients. The techniques implemented in this thesis combine extrapolation/transposition of standard filtering with many estimation tools (Minimum Statistics, Inverse Recursive Levinson-Durbin, VAD, normalized cross-correlation, LPC analysis, short-term and long-term energy estimation). An interesting perspective would be to study, in the CELP parameter domain, the interaction between the filtering process and the estimation techniques used.

The algorithms proposed in this thesis have a broad range of applicability. The principle of modifying the LPC coefficients and the fixed codebook gain can be extended to any codec based on the CELP technique, principally to transcoding between AMR-WideBand modes and from AMR-WideBand modes to AMR-NarrowBand modes. The critical case is transcoding from AMR-NarrowBand to AMR-WideBand, where a bandwidth extension is necessary [ITU-T 2003]. It would also be useful to investigate other CELP parameters that could be modified or filtered, since only the fixed codebook gain and the LPC coefficients are modified in this thesis.
The process could be applied to the LSF coefficients or to the fixed codebook vector. It would also be interesting to investigate different smart transcoding strategies: mapping the entire set of CELP parameters, instead of only the three parameters proposed in this thesis, would further reduce the complexity of the system, since the target encoder would no longer need to compute any of these parameters. Centralized dual processing, that is, parameter-domain NR and AEC inside the network, can be suggested as another research direction. The interaction between centralized CELP parameter-domain NR and AEC should then be studied, as the fixed codebook gain is modified by both the NR and the AEC algorithms. Another perspective in the same direction is to study the possibility for the network operator to control both the acoustic echo and the line echo through the CELP parameters.

We can conclude that, with the multiplication of networks and codecs, the need for delay reduction and complexity minimization, the constraints of real-time processing, and the non-linearities introduced by CELP coders on the acoustic echo path, centralized VQE and network interconnection solutions based on coded parameters could become inescapable in the future.

Appendix A
GSM Network and Interconnection

A.1 GSM Networks Architecture

In Fig. A.1 below, a generic interconnection architecture between two GSM networks A and B is described [Cotanis 2003]. This presentation does not detail the GSM architecture; the protocols and other specific aspects of the GSM network are not needed in this work. The GSM network can be summarized in three main components: the Mobile Device (MD), the Base Station Sub-System (BSS) and the Core Network (also called Network Sub-System, NSS).

The MD, or Terminal, connects a user to the GSM network over the air. The analog-to-digital conversion and the speech coding are performed in the MD. The different types of terminals are distinguished by their output power and their application. An MD is associated with a Subscriber Identity Module (SIM) card, which works in conjunction with the network Authentication Centre (AuC).

The BSS, also called the Radio Access Network, controls the radio link between the MD and the network. The BSS comprises the Base Transceiver System (BTS), the Base Station Controller (BSC) and the Transcoder and Adaptation Unit (TRAU). The BTS is made of antennas, or Transceivers (TRX), and electronic equipment; operations such as coding, encryption, modulation, synchronization and multiplexing are performed at the BTS side. The BSC is used to translate the 12.2 kbps voice channel to a 64 kbps channel. Several BTSs are generally connected to one BSC, which joins the MSC via the TRAU. The TRAU handles the conversion between the speech codec bit-stream and PCM, and is therefore an ideal location for transcoding during the communication. The operations performed near the BSC are frequency hopping, time and frequency synchronization, power management and time delay measurement.

The main component of the Core Network, or NSS, is the Mobile Service Switching Center (MSC). Any user of a given GSM network is registered with an MSC, and the subscriber data are stored inside a particular database called the Home Location Register (HLR). All calls from and to the user are controlled by the MSC. A GSM network can have one or more MSCs, geographically distributed, which synchronize the BSS. Another component is the Gateway Mobile Switching Center (GMSC).
The GMSC controls the mobile terminating calls and the timing between the GSM network and other networks. The Gateway can also be considered a potential location for transcoding, since this part of the network is the gate to other networks. The GMSC is the interconnection point from one GSM network to another GSM network, or from a GSM network to other networks such as a PLMN or the PSTN.

[Figure A.1: Generic GSM Interconnection Architecture — two GSM networks A and B, each made of Mobile Devices, a Base Station Sub-System (BTS + BSC + TRAU, with OSS) and a Network Sub-System (MSC + GMSC, with HLR, EIR, AuC and VLR), interconnected through their GMSCs and connected to the PSTN, PLMN or other networks.]

The Visitor Location Register (VLR) contains the information from a subscriber's HLR necessary to provide the subscribed services to visiting users. When a subscriber enters the coverage area of a new MSC, the VLR associated with this MSC requests information about the new subscriber from its corresponding HLR. The VLR then has enough data to provide the subscribed services without needing to query the HLR each time a communication is established. The VLR is always implemented together with an MSC; thus, the area under control of the MSC is also the area under control of the VLR.

The Authentication Centre (AuC) serves security purposes: it provides the parameters needed for the authentication and encryption functions, which allow verification of the subscriber's identity. The Equipment Identity Register (EIR) stores security-sensitive information about the MD. It maintains a list of all valid terminals, identified by their International Mobile Equipment Identity (IMEI). The EIR thus makes it possible to forbid calls from stolen or unauthorized terminals (e.g., a terminal which does not respect the specifications concerning the output Radio Frequency (RF) power).

In principle, a communication between two Mobile Devices A and B, from GSM network A to GSM network B, goes from the BTS nearest to MD A, through the BSC, to the associated MSC A. The call is then routed to MSC B via GMSC A and GMSC B. MSC B addresses the call to the BSC, which contacts the BTS nearest to MD B. Communication between two MDs, even within the same GSM network, never follows a direct link: it always goes through the BTS, the BSC and the MSC. A GSM network is in addition connected to a so-called Operation and Support System (OSS), from which the network operator is able to monitor and control the system.

Appendix B
CELP Speech Coding Tools

B.1 The Recursive Levinson-Durbin Algorithm

The LPC analysis is an important step in the CELP encoding process. The LPC block in the CELP encoder aims to find the best prediction coefficients, also called LPC coefficients, which minimize the squared error between the current windowed speech samples and the predicted ones. The prediction coefficients $A_S = (1, a_s(1), \ldots, a_s(M))$ minimizing that squared error are the solution of a linear system known as the Yule-Walker equations:

$$\Gamma_S \cdot \big(a_s(1), \ldots, a_s(M)\big)^T = -R_S \qquad (B.1)$$

where $M$ is the order of the LPC analysis and the $M \times M$ autocorrelation matrix $\Gamma_S$ is defined as:

$$\Gamma_S = \begin{pmatrix} r_S(0) & r_S(1) & \cdots & r_S(M-1) \\ r_S(1) & r_S(0) & \cdots & r_S(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_S(M-1) & r_S(M-2) & \cdots & r_S(0) \end{pmatrix} \qquad (B.2)$$

$$R_S = \big(r_S(1), \ldots, r_S(M)\big)^T \qquad (B.3)$$

with $r_S(j) = \sum_{n=j}^{N-1} s_w(n)\, s_w(n-j)$, $j = 0, \ldots, M$, where $N$ is the size of the analysis window.
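For illustration, the system (B.1) can be solved directly with a generic linear solver. The following Python sketch (illustrative code, not the AMR-NB reference implementation) builds the autocorrelation sequence and matrix from a windowed frame and solves the normal equations; this is the brute-force approach that the recursion of Sec. B.1.1 avoids.

```python
import numpy as np

def lpc_direct(sw, M):
    """Solve Eq. (B.1) by direct matrix inversion; a reference solution
    used here only to check the recursion of Sec. B.1.1."""
    sw = np.asarray(sw, dtype=float)
    N = len(sw)
    # Autocorrelation sequence r_S(j) of the windowed frame.
    r = np.array([np.dot(sw[j:], sw[:N - j]) for j in range(M + 1)])
    # Toeplitz autocorrelation matrix Gamma_S (Eq. B.2).
    Gamma = np.array([[r[abs(i - k)] for k in range(M)] for i in range(M)])
    # Gamma_S (a(1), ..., a(M))^T = -R_S (Eq. B.1).
    a = np.linalg.solve(Gamma, -r[1:M + 1])
    return np.concatenate(([1.0], a)), r
```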
The solution of the linear system (B.1) can be obtained by inverting the autocorrelation matrix $\Gamma_S$. For a real-time implementation, such an approach is too complex, and an iterative solution is necessary. Current CELP coders use the Recursive Levinson-Durbin algorithm to find the optimum LPC coefficients instead of the matrix inversion. On a windowed-frame basis, the algorithm uses the autocorrelation sequence $r_S(j)$, $j = 0, \ldots, M$, and the desired LPC analysis order $M$ to recursively estimate the LPC coefficients and the reflection coefficients.

B.1.1 Steps of the Recursive Levinson-Durbin Algorithm

The algorithm can be run only if the condition $r_S(0) > 0$ is satisfied. For convenience, when we write $a_m^{(k)}$, $m$ is the step in the iteration of the algorithm, ranging from 0 to $M$, and the index $(k)$ identifies the computed LPC coefficient. The algorithm computes $M+1$ coefficients: the first coefficient is set to 1 and the $M$ remaining ones are the LPC coefficients. The algorithm can be summarized as follows:

1. If $m = 0$, then $err_S(0) = r_S(0)$, also called the final prediction error power.

2. If $m = 1$:
$$a_1^{(1)} = -\frac{r_S(1)}{r_S(0)}, \qquad a_1^{(0)} = 1, \qquad err_S(1) = err_S(0) \cdot \left[1 - \big(a_1^{(1)}\big)^2\right] \qquad (B.4)$$

3. For $m = 2$ to $m = M$, $a_m^{(0)} = 1$ and the terms $a_m^{(m)} = K_m$, generally called reflection coefficients, are first computed as:
$$a_m^{(m)} = -\frac{1}{err_S(m-1)} \cdot \left(r_S(m) + \sum_{j=1}^{m-1} a_{m-1}^{(j)} \cdot r_S(m-j)\right) \qquad (B.5)$$

The LPC coefficients are then updated at iteration $m$ by:
$$a_m^{(j)} = a_{m-1}^{(j)} + a_m^{(m)} \cdot a_{m-1}^{(m-j)}, \qquad j = 1, \ldots, m-1 \qquad (B.6)$$

and
$$err_S(m) = err_S(m-1) \cdot \left[1 - \big(a_m^{(m)}\big)^2\right] \qquad (B.7)$$

The LPC coefficients at the end of the iterating process are given by $A_S = (1, a_s(1), \ldots, a_s(M)) = \big(1, a_M^{(1)}, \ldots, a_M^{(M)}\big)$.

B.2 The Inverse Recursive Levinson-Durbin Algorithm

With the explicit knowledge of the autocorrelation values $r_S(j)$, $j = 0, \ldots, M$, of the signal $s(n)$, the Recursive Levinson-Durbin algorithm estimates the LPC coefficients $A_S$ and the final prediction error power $err_S(M)$ using the Yule-Walker equations. Conversely, given the set of LPC coefficients $A_S$, is it possible to compute the corresponding autocorrelation values $r_S(j)$, $j = 0, \ldots, M$? A solution to this problem is provided by the Inverse Recursive Levinson-Durbin algorithm. This algorithm takes as inputs the LPC coefficients and the final prediction error power, and proceeds in two steps: it first computes the associated reflection coefficients $K_m$, $m = 1, \ldots, M$; the autocorrelation values are then derived from the reflection coefficients.

Knowing the entire set of LPC coefficients $\big(1, a_M^{(1)}, \ldots, a_M^{(M)}\big)$, the reflection coefficients are extracted by iterating downwards from $t = M$ to $t = 1$. From (B.5) we obtain:
$$\sum_{k=0}^{t-1} a_{t-1}^{(k)} \cdot r_S(t-k) = -K_t \cdot err_S(t-1) \qquad (B.8)$$

Using the fact that $r_S(k) = r_S^*(-k)$, the autocorrelation coefficients can then be computed recursively: with $r_S(0) = err_S(0)$ and $a_t^{(0)} = 1$, they are rebuilt for $t = 1, \ldots, M$ from:
$$r_S(t) = -K_t \cdot err_S(t-1) - \sum_{k=1}^{t-1} a_{t-1}^{(k)} \cdot r_S(t-k) \qquad (B.9)$$

The prediction error power $err_S(t)$ is updated at each step based on (B.7).
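The two recursions can be summarized by the following Python sketch. This is illustrative code: the reconstruction of the lower-order predictors in the inverse algorithm uses the standard step-down inversion of (B.6), which the text above leaves implicit, and it assumes all reflection coefficients satisfy |K_m| < 1 (stable predictor).

```python
import numpy as np

def levinson_durbin(r, M):
    """Recursive Levinson-Durbin (Sec. B.1.1): autocorrelations r[0..M]
    -> LPC vector A_S, final prediction error power, reflection coeffs."""
    a = np.zeros(M + 1)
    a[0] = 1.0
    K = np.zeros(M + 1)
    err = r[0]                                    # err_S(0), step 1
    for m in range(1, M + 1):
        # Reflection coefficient K_m (Eq. B.5).
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        K[m] = k
        # Coefficient update (Eq. B.6) and new order-m coefficient.
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        # Prediction error power update (Eq. B.7).
        err = err * (1.0 - k * k)
    return a, err, K

def inverse_levinson_durbin(a_M, err_M):
    """Inverse Recursive Levinson-Durbin (Sec. B.2): LPC coefficients and
    final error power -> autocorrelations r[0..M]."""
    M = len(a_M) - 1
    A = [None] * (M + 1)
    A[M] = np.asarray(a_M, dtype=float)
    K = np.zeros(M + 1)
    err = np.zeros(M + 1)
    err[M] = err_M
    # Step 1: recover reflection coefficients and lower-order predictors
    # by inverting the update of Eq. (B.6); Eq. (B.7) gives err backwards.
    for m in range(M, 0, -1):
        k = A[m][m]
        K[m] = k
        err[m - 1] = err[m] / (1.0 - k * k)
        prev = np.zeros(m)
        prev[0] = 1.0
        prev[1:m] = (A[m][1:m] - k * A[m][m - 1:0:-1]) / (1.0 - k * k)
        A[m - 1] = prev
    # Step 2: rebuild the autocorrelations with Eq. (B.9).
    r = np.zeros(M + 1)
    r[0] = err[0]
    for t in range(1, M + 1):
        r[t] = -K[t] * err[t - 1] - np.dot(A[t - 1][1:t], r[t - 1:0:-1])
    return r
```

Applying inverse_levinson_durbin to the output of levinson_durbin recovers the original autocorrelation sequence, which is the property exploited by the LPC-based NR algorithm of Sec. 3.5.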
B.3 The ITU-T P.160

One of the targets of noise suppression is to maintain the power level of the speech signal, so as not to attenuate the speech signal together with the noise during the Noise Reduction (NR) processing. The methodology presented here is an objective way of characterizing the basic effect of NR methods. It evaluates an NR solution in terms of Signal-to-Noise Ratio Improvement (SNRI) and Total Noise Level Reduction (TNLR). The SNRI is measured during speech activity, focusing on the effect of the NR on the speech signal. The TNLR estimates the overall level of noise reduction, both during speech and speech pauses. In addition, a delta measurement (DSN) is computed to reveal speech attenuation or undesired speech amplification caused by the NR solution.

This objective measurement tool can only be applied under specific conditions on the test material. The test signals should comprise at least:
– $I = 24$ original clean speech utterances $s_i$: 6 utterances from each of 4 speakers, 2 male and 2 female.
– $J = 6$ original noise utterances $d_j$, covering total SNRs of 6 dB, 12 dB and 18 dB. Typical noise sequences are an interior car noise at 100 km/h, with fairly constant power level, and a street noise; the noise signals should have slowly varying power.

The noisy speech signal $yin_{i,j}$, taken as the reference signal, is given by:
$$yin_{i,j} = s_i + \beta_{i,j}(SNR) \cdot d_j \qquad (B.10)$$

The analysis is performed on frames of 80 samples. The processed speech signal $yout$ is referenced as:
$$yout_{i,j} = NR(yin_{i,j}) \qquad (B.11)$$

Based on the ITU-T Recommendation [ITU-T 2006d], the test signals should be normalized to an active speech level of −26 dBov and represented with 16-bit integers. The average power $P_x$ of a signal $x$ over an 80-sample frame is defined by:
$$P_x = \frac{1}{80} \sum_{n=1}^{80} x^2(n) \qquad (B.12)$$

The power level in decibels relative to the overload point (dBov) is defined with respect to the reference power level $P_0 = 32768^2$ as follows:
$$L_x = 10 \cdot \log_{10}\left(\frac{P_x}{P_0}\right) \qquad (B.13)$$

The frames of the noisy speech signal $yin$ and of the processed signal $yout$ are classified by their average power levels in comparison with clean speech average power thresholds in dBov ($th_h$, $th_m$, $th_l$, $th_{nh}$, $th_{nl}$):

1. a frame of signal $x$ belongs to the high speech class $K_{sph}$ if $L_x > th_h$;
2. a frame belongs to the medium speech class $K_{spm}$ if $th_m < L_x \le th_h$;
3. a frame belongs to the low speech class $K_{spl}$ if $th_l < L_x \le th_m$;
4. a frame belongs to the $K_{nse}$ class if $th_{nl} \le L_x \le th_{nh}$;
5. a frame belongs to the $K_{pse}$ class if $L_x < th_{nl}$.

Table B.1: Threshold Levels for Speech Classification.

  Threshold   Explanation                                  Value
  th_h        Lower bound for high speech power class       −1 dBov
  th_m        Lower bound for medium speech power class    −10 dBov
  th_l        Lower bound for low speech power class       −16 dBov
  th_nh       Higher bound for speech pause class          −25 dBov
  th_nl       Lower bound for speech pause class           −40 dBov
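The level computation and frame classification can be sketched in Python as follows. This is illustrative code: the constant names are ours, and the overload reference P0 = 32768² corresponds to the 16-bit representation assumed above.

```python
import numpy as np

# Thresholds of Table B.1 (dBov) and the dBov reference of Eq. (B.13).
TH_H, TH_M, TH_L, TH_NH, TH_NL = -1.0, -10.0, -16.0, -25.0, -40.0
P0 = 32768.0 ** 2  # overload power of a 16-bit signal (assumption)

def frame_level_dbov(frame):
    """80-sample frame power (Eq. B.12) expressed in dBov (Eq. B.13)."""
    p = np.mean(np.square(np.asarray(frame, dtype=float)))
    return 10.0 * np.log10(max(p, 1e-12) / P0)

def classify_frame(level):
    """Power class of Sec. B.3 for one frame level in dBov."""
    if level > TH_H:
        return 'Ksph'            # high speech
    if TH_M < level <= TH_H:
        return 'Kspm'            # medium speech
    if TH_L < level <= TH_M:
        return 'Kspl'            # low speech
    if TH_NL <= level <= TH_NH:
        return 'Knse'            # short speech pause (noise) frames
    if level < TH_NL:
        return 'Kpse'            # long speech pause
    return 'unclassified'        # levels between TH_NH and TH_L
```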
B.3.1 Assessment of SNR Improvement (SNRI)

The SNRI metric measures the SNR improvement achieved by the NR algorithm, that is, the amplification of the useful speech. This SNR is computed in three speech power classes, so as to evaluate the effect separately for strong, medium and weak speech, on both the input and the output speech signals. The computations below are written for the high speech class; the other classes follow from the same calculation. We start with the SNR of the output $yout_{i,j}$ and input $yin_{i,j}$ speech signals:

$$SNR\_out\_h_{i,j} = 10 \cdot \log_{10}\left[\max\left(\xi,\; \frac{10^{\frac{1}{K_{sph}} \sum_{l=1}^{K_{sph}} \log_{10}\left(\xi + \sum_n yout_{i,j}^2(l,n)\right)}}{10^{\frac{1}{K_{nse}} \sum_{m=1}^{K_{nse}} \log_{10}\left(\xi + \sum_p yout_{i,j}^2(m,p)\right)}} - 1\right)\right] \qquad (B.14)$$

$$SNR\_in\_h_{i,j} = 10 \cdot \log_{10}\left[\max\left(\xi,\; \frac{10^{\frac{1}{K_{sph}} \sum_{l=1}^{K_{sph}} \log_{10}\left(\xi + \sum_n yin_{i,j}^2(l,n)\right)}}{10^{\frac{1}{K_{nse}} \sum_{m=1}^{K_{nse}} \log_{10}\left(\xi + \sum_p yin_{i,j}^2(m,p)\right)}} - 1\right)\right] \qquad (B.15)$$

where $\xi = 10^{-5}$. The indices $n$ and $p$ run within frames of 80 samples: $n$ over the speech frames of the class, and $p$ over the noise frames, whose frame power lies between the $th_{nh}$ and $th_{nl}$ bounds. The SNRI for the high speech class in a single scenario $(i,j)$ is given by:
$$SNRI\_h_{i,j} = SNR\_out\_h_{i,j} - SNR\_in\_h_{i,j} \qquad (B.16)$$

The total $SNRI_{i,j}$ over the high, medium and low speech classes for a single scenario is obtained by:
$$SNRI_{i,j} = \frac{K_{sph} \cdot SNRI\_h_{i,j} + K_{spm} \cdot SNRI\_m_{i,j} + K_{spl} \cdot SNRI\_l_{i,j}}{K_{sph} + K_{spm} + K_{spl}} \qquad (B.17)$$

Extending the SNRI processing to all the speech utterances, we obtain:
$$SNRI_j = \frac{1}{I} \sum_{i=1}^{I} SNRI_{i,j} \qquad (B.18)$$

Finally, the SNRI over the entire set of experiments is given by:
$$SNRI = \frac{1}{J} \sum_{j=1}^{J} SNRI_j \qquad (B.19)$$

B.3.2 Assessment of Total Noise Level Reduction (TNLR)

The total noise level reduction measure, or TNLR, quantifies the capability of the noise reduction method to attenuate the background noise level, measured during both speech activity and speech pauses. Because of the difference between the number of frames during speech activity and during long speech pauses, the TNLR mainly measures the capability of an NR to reduce the noise during long speech pauses. The TNLR is computed as follows:

$$TNLR_{i,j} = \frac{10}{K_{pse}} \cdot \sum_{m=1}^{K_{pse}} \left[\log_{10}\left(\xi + \sum_q yout_{i,j}^2(m,q)\right) - \log_{10}\left(\xi + \sum_q yin_{i,j}^2(m,q)\right)\right] \qquad (B.20)$$

where the index $q$ runs within 80-sample noise frames whose power is below the $th_{nh}$ bound. Then:
$$TNLR_j = \frac{1}{I} \sum_{i=1}^{I} TNLR_{i,j} \qquad (B.21)$$

and, finally, the TNLR over the entire set of experiments is given by:
$$TNLR = \frac{1}{J} \sum_{j=1}^{J} TNLR_j \qquad (B.22)$$

An improvement in the SNR of a noisy speech signal can be achieved simply by amplifying the high-energy portion of the signal; both attenuation and amplification can therefore bias the apparent noise reduction improvement. The balance between the SNR improvement (amplification) and the noise level reduction (attenuation) obtained during speech activity must be monitored. For this purpose, the Noise Power Level Reduction (NPLR) is introduced; it is computed like the TNLR, except that the NPLR is restricted to frames between $th_{nh}$ and $th_{nl}$, that is, to short speech pause periods:

$$NPLR_{i,j} = \frac{10}{K_{nse}} \cdot \sum_{m=1}^{K_{nse}} \left[\log_{10}\left(\xi + \sum_q yout_{i,j}^2(m,q)\right) - \log_{10}\left(\xi + \sum_q yin_{i,j}^2(m,q)\right)\right] \qquad (B.23)$$

Here the index $q$ runs within 80-sample noise frames whose power lies between the $th_{nh}$ and $th_{nl}$ bounds. The NPLR provides the counterpart of the SNRI, and these two metrics together form the basis for evaluating the balance. The SNRI-to-NPLR Difference (DSN) is proposed as a measure indicating possible speech attenuation or speech amplification produced by the NR method:
$$DSN = SNRI - NPLR \qquad (B.24)$$
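A direct transcription of these formulas into Python reads as follows (illustrative code; frames are assumed to be already classified as in Sec. B.3, and the helper names are ours). The DSN of Eq. (B.24) is then simply the difference between the SNRI and the NPLR values returned by these helpers.

```python
import numpy as np

XI = 1e-5  # flooring constant of Eqs. (B.14)-(B.15)

def mean_log_energy(frames):
    """Mean over frames of log10(xi + frame energy), as in Eqs. (B.14)-(B.15)."""
    return float(np.mean([np.log10(XI + np.sum(np.square(np.asarray(f, dtype=float))))
                          for f in frames]))

def class_snr_db(speech_frames, noise_frames):
    """Per-class SNR of Eqs. (B.14)/(B.15): ratio of mean powers minus one,
    floored at xi, expressed in dB."""
    ratio = 10.0 ** (mean_log_energy(speech_frames) - mean_log_energy(noise_frames))
    return 10.0 * np.log10(max(XI, ratio - 1.0))

def snri_db(out_sp, out_nse, in_sp, in_nse):
    """SNRI for one class and one (i, j) scenario (Eq. B.16)."""
    return class_snr_db(out_sp, out_nse) - class_snr_db(in_sp, in_nse)

def level_reduction_db(out_frames, in_frames):
    """Mean level change over pause frames, as in Eqs. (B.20)/(B.23):
    over frames below th_nh it gives the TNLR; restricted to the K_nse
    frames it gives the NPLR."""
    diffs = [10.0 * (np.log10(XI + np.sum(np.square(np.asarray(o, dtype=float))))
                     - np.log10(XI + np.sum(np.square(np.asarray(i, dtype=float)))))
             for o, i in zip(out_frames, in_frames)]
    return float(np.mean(diffs))
```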
The objective performance is defined through the average values of the metrics over all test B.3. THE ITU-T P. 160 Objective Metric SNRI TNLR DSN 153 Required Performance SN RI ≥ 4 dBov ’as average over all test conditions’ T N LR ≤ −5 dBov ’as average over all test conditions’ −4 dBov ≤ DSN ≤ 3 dBov ’as average over all test conditions’ Table B.2: Objective Metrics Requirement. conditions. Hence, NPLR is typically negative and the DSN should be close to zero. If NPLR is higher in absolute values than SNRI making DSN clearly negative then the NR solution produces speech level attenuation. But if DSN is clearly positive, the SNRI indicated SNR improvement without a decrease in noise level: the speech level has been amplified. 154 CELP SPEECH CODING TOOLS BIBLIOGRAPHY 155 Bibliography 26.092, G. T. (2002). Mandatory Speech Codec speech processing functions, AMR Speech codec, Comfort noise aspects. 26.094, G. T. (2002). Mandatory Speech Codec speech processing functions, AMR Speech codec, Voice Activity Detector. 3GPP (1999a). TIA/EIA-IS-853: Noise Suppression Minimum Performance for AMR, Technical report, 3GPP. 3GPP (1999b). TS 26.090: Mandatory Speech Codec speech processing functions; AMR Speech codec; General Description, Technical report, 3GPP. 3GPP-GSM (1999). TS.06.77: Minimum Performance Requirement for Noise Suppresser Application to the AMR Speech Encoder, Release 7, Technical report, 3GPP. Adoul, J.; Mabilleau, P.; Delprat, M.; Morissette, S. (1987). Fast CELP Coding Based on Algebraic Codes, Proceedings of ICASSP’87, IEEE, pp. 1957–1960. Allen, J. (1977). Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 3, June, pp. 235–239. Atal, B.; Remde, J. (1982). A New Model of Excitation for Producing Natural-Sounding Speech at Low Bit Rates, Proceedings of ICASSP’82, vol. 1, pp. 614–617. Beaugeant, C. (1999). Réduction du Bruit et Contrôle d’Echo pour les Applications Radimobiles, PhD thesis, Université de Rennes I. Beaugeant, C.; Schoenle, M.; Loellman, H.; Sauert, B.; Steinert, K.; Vary, P. (2006). Hand-Free Audio and its Application to Telecommunication Termnals, In Proccedings of AES’06. Beaugeant, C.; Taddei, H. (2007). Quality Computation Load Reduction Achieved by Applying Smart Transcoding between CELP speech Coders, EUSIPCO’07, pp. 1372–1376. Benesty, J.; Morgan, D.; Cho, J. (2000). A New Double Talk Detection Based on CrossCorrelation, Proceedings of IEEE Transaction on Speech and Audio Precessing, IEEE, vol. 8, pp. 168–172. Berouti, M.; Schwartz, R.; Makhoul, J. (1979). Enhancement of Speech Corrupted by Acoustic Noise, Proceedings of ICASSP’79, vol. 4, pp. 208–211. 156 BIBLIOGRAPHY Boll, S. (1979). Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Trans. Acoust. Speech, Signal Processing, vol. ASSP-27, no. 2, April, pp. 113–120. Bossert, M. (1999). Channel Coding for Telecommunication, Jonh Wiley. Bowman, F. (1958). Introduction fo Bessel Functions, New York Dover, 1958. Breining, C.; Dreiseitel, P.; Hansler, E.; Mader, A.; Nitsch, B.; Puder, H.; Schertler, T.; Schmidt, G.; Tilp, J. (1999). Acoustic Echo Control, an Application to Very High Order Adaptive Filters, Proceedings of IEEE Signal Processing Magazine, IEEE, vol. 16, pp. 42– 69. Cappe, O. (1994). Elimination of Musical Tone Phenomenon with the Ephraim and Malah Suppressor, Proc. IEEE Trans. Acoust. Speech, Signal Processing, vol. 4, April, pp. 435– 349. Chandran, R.; Marchok, D. (2000). 
Compressed Domain Noise Reduction and Echo Suppression for Network Speech Enhancement, Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, Lansing, USA, vol. 1, pp. 10–13.
Chu, W. (2000). Speech Coding Algorithms: Foundation and Evolution of Standardized Coders, Wiley Interscience.
Cotanis, I. (2003). Speech in the VQE Device Environment, Proceedings of WCNC'03, IEEE, vol. 2, March, pp. 1102–1106.
Daumer, W.; Mermelstein, P.; Maitre, X.; Tokizawa, I. (1984). Overview of the ADPCM Coding Algorithm, Proceedings of GLOBECOM'84, pp. 23.1.1–23.1.4.
DeMeuleneire, M. (2003). Noise Reduction on Codec Parameters, Master's thesis, ENST Bretagne, Brest, France.
Doh-Suk, K.; Cao, B.; Tarraf, A. (2008). Frame Energy Estimation Based on Speech Coded Parameters, Proceedings of ICASSP'08, IEEE, March, pp. 1641–1644.
Duetsch, N. (2003). Integrated Echo Cancellation in Speech Coding, Master's thesis, Munich University of Technology (TUM).
El-Jaroudi, A.; Makhoul, J. (1991). Discrete All-Pole Modeling, IEEE Trans. Signal Processing, vol. 39, April, pp. 411–423.
Enzner, G.; Kruger, H.; Vary, P. (2005). On the Problem of Acoustic Echo Control in Cellular Networks, Proceedings of IWAENC'05, pp. 213–215.
Ephraim, Y.; Malah, D. (1984). Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 32, no. 6, December, pp. 1109–1121.
Eriksson, A. (2006). Speech Enhancement in Mobile Devices, Technical report, Ericsson.
Federal Standard 1015 (1984). Telecommunications: Analog to Digital Conversion of Radio Voice by 2400 Bit/Second Linear Predictive Coding, Technical report, National Communications System - Office of Technology and Standards.
Fermo, A.; Carini, A.; Sicuranza, G. (2000). Analysis of Different Low Complexity Nonlinear Filters for Acoustic Echo Cancellation, Proceedings of the IEEE Workshop on Speech Coding, pp. 261–266.
Ghenania, M. (2005). Format Speech Conversion between Standardized CELP Coders, PhD thesis, Université de Rennes I.
Ghenania, M.; Lamblin, C. (2004). Low-Cost Smart Transcoding Algorithm between ITU-T G.729 (8 kbit/s) and 3GPP NB-AMR (12.2 kbit/s), Proceedings of EUSIPCO'04.
Gnaba, H.; Turki, M.; Jaidane, M.; Scalart, P. (2003). Introduction of the CELP Structure of the GSM Coder in the Acoustic Echo Canceller for the GSM Network, Proceedings of Eurospeech'03, pp. 1389–1392.
Goldberg, L.; Riek, L. (2000). A Practical Handbook of Speech Coders, CRC Press.
Golub, G.; Van Loan, C. (1996). Matrix Computations, The Johns Hopkins University Press, Baltimore.
Gordy, J.; Goubran, R. (2004). A Combined LPC-Based Speech Coder and Filtered-X LMS Algorithm for Acoustic Echo Cancellation, Proceedings of ICASSP'04, IEEE, vol. 4, pp. 125–128.
Gordy, J.; Goubran, R. (2006). Post Filtering for Suppression of the Residual Vocoder Distortion on Packet-Based Telephony, Proceedings of ICME'06, IEEE, pp. 1957–1960.
Halonen, T.; Romero, J.; Melero, J. (2002). GSM, GPRS and EDGE Performance: Evolution towards 3G/UMTS, John Wiley.
Hansler, E.; Schmidt, G. (2004). Acoustic Echo and Noise Control: A Practical Approach, Wiley Interscience.
Haykin, S. (2002a). Adaptive Filter Theory, Prentice Hall, Information and System Sciences Series.
Haykin, S. (2002b). Adaptive Filter Theory, Chapter 5: Least Mean Square Adaptive Filters, Prentice Hall, Information and System Sciences Series.
Heitkamper, P. (1995).
Optimization of an Acoustic Echo Canceller Combined with Adaptive Gain Control, Proceedings of ICASSP'95, IEEE, vol. 5, pp. 3047–3050.
Heitkamper, P. (1997). An Adaptation Control for Acoustic Echo Cancellers, IEEE Signal Processing Letters, vol. 4, pp. 170–172.
Heitkamper, P.; Walker, M. (1993). Adaptive Gain Control for Speech Quality and Echo Suppression, Proceedings of Eurospeech'93, Berlin, Germany, pp. 1077–1080.
Huang, Y.; Goubran, R. (2000). Effects of Vocoder Distortion on Network Echo Cancellation, Proceedings of ICME'00, IEEE, vol. 1, pp. 437–439.
Itakura, F. (1975). Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals, J. Acoust. Soc. Am., vol. 57, p. S35.
ITU-T (1988). Recommendation G.711: Pulse Code Modulation (PCM) of Voice Frequencies.
ITU-T (1996). Recommendation P.800: Methods for Subjective Determination of Transmission Quality.
ITU-T (2001). Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs.
ITU-T (2003). Recommendation G.722.2: Wideband Coding of Speech at around 16 kbit/s Using Adaptive Multi-Rate Wideband (AMR-WB).
ITU-T (2004). Recommendation G.161: Interaction Aspects of Signal Processing Network Equipments.
ITU-T (2006a). Recommendation G.160: Voice Enhancement Devices for Mobile Networks, Appendix II; Series G: Transmission Systems and Media, Digital Systems and Networks; International Telephone Connections and Circuits; Apparatus Associated with Long-Distance Telephone Circuits.
ITU-T (2006b). Recommendation G.168: Digital Network Echo Cancellers.
ITU-T (2006c). Recommendation P.10: Vocabulary for Performance and Quality of Service.
ITU-T (2006d). Recommendation P.56: Objective Measurement of Active Speech Level.
ITU-T (2006e). Recommendation P.800.1: Mean Opinion Score (MOS) Terminology.
Kabal, P. (2003). Ill-Conditioning and Bandwidth Expansion in Linear Prediction of Speech, Proceedings of ICASSP'03, IEEE, vol. 1, April, pp. 824–827.
Kabal, P.; Ramachandran, R. (1986). The Computation of Line Spectral Frequencies Using Chebyshev Polynomials, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 6, December, pp. 1419–1425.
Kang, H. G.; Kim, H. K.; Cox, R. V. (2003). Improving the Transcoding Capability of Speech Coders, IEEE Trans. on Multimedia, vol. 5, no. 1, March, pp. 23–33.
Kang, H.; Kim, H.; Cox, R. (2000). Improving Transcoding Capability of Speech Coders in Clean and Frame Erasured Channel Environments, Proceedings of the IEEE Workshop on Speech Coding, pp. 78–80.
Kim, K.; Jung, S.; Park, Y.; Choi, Y.; Youn, D. (2001). An Effective Transcoding Algorithm for G.723.1 and EVRC Speech Coders, Proceedings of IEEE VTS'01, pp. 1561–1564.
Kleijn, B.; Paliwal, K. (1995). Speech Coding and Synthesis, Elsevier.
Kleijn, W.; Krasinski, D.; Ketchum, R. (1990). Fast Methods for the CELP Speech Coding Algorithm, Proceedings of ICASSP'90, IEEE, pp. 1330–1342.
Knappe, M.; Goubran, R. (1994). Steady-State Performance Limitations of Full-Band Acoustic Echo Cancellers, Proceedings of ICASSP'94, IEEE, vol. 2, pp. 73–76.
Kondoz, A. (1994). Digital Speech Coding for Low Bit Rate Communication Systems, John Wiley and Sons.
Kroon, P.; Deprettere, E.; Sluyter, R. (1986). Regular-Pulse Excitation - A Novel Approach to Effective and Efficient Multipulse Coding of Speech, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 5, pp.
1054–1063.
Lim, J.; Oppenheim, A. (1979). Enhancement and Bandwidth Compression of Noisy Speech, Proceedings of the IEEE, vol. 67, pp. 1586–1604.
Loizou, P. (2007). Speech Enhancement: Theory and Practice, CRC Press, Taylor and Francis Group.
Lu, X.; Champagne, B. (2003). A Centralized Acoustic Echo Canceller Exploiting Masking Properties of the Human Ear, Proceedings of ICASSP'03, IEEE, vol. 5, pp. 377–380.
Martin, R. (1994). Spectral Subtraction Based on Minimum Statistics, Proceedings of EUSIPCO'94, pp. 1182–1185.
Martin, R. (2001). Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 5, July, pp. 504–512.
Martin, R.; Malah, D.; Cox, R.; Accardi, J. (2004). A Noise Reduction Preprocessor for Mobile Voice Communication, EURASIP Journal on Applied Signal Processing, no. 8, pp. 1046–1058.
Mboup, M.; Bonnet, M.; Bershad, N. (1994). LMS Coupled Adaptive Prediction and System Identification: A Statistical Model and Transient Mean Analysis, IEEE Trans. Signal Processing, vol. 42, pp. 2607–2614.
McAulay, R.; Malpass, M. (1980). Speech Enhancement Using a Soft-Decision Noise Suppression Filter, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, no. 2, April, pp. 137–145.
Messerschmitt, D. (1984). Echo Cancellation in Speech and Data Transmission, IEEE Journal on Selected Areas in Communications, vol. SAC-2, pp. 283–297.
Moriya, T.; Miki, S.; Mano, K.; Ohmuro, H. (1993). Training Method of the Excitation Codebook for CELP, Proceedings of Eurospeech'93, pp. 1155–1158.
Oppenheim, A.; Schafer, R. (1999). Discrete-Time Signal Processing, Second Edition, Prentice Hall, Englewood Cliffs, New Jersey.
Painter, T.; Spanias, A. (2000). Perceptual Coding of Digital Audio, Proceedings of the IEEE, vol. 88, pp. 451–513.
Pasanen, A. (2006). Coded Domain Level Control for the AMR Speech Codec, Proceedings of ICASSP'06, IEEE, vol. 1.
Pobloth, H.; Kleijn, W. (1999). On Phase Perception in Speech, Proceedings of ICASSP'99, IEEE, vol. 1, pp. 29–30.
Rages, M.; Ho, K. (2002). Limits on Echo Return Loss Enhancement on a Voice Coded Speech Signal, Proceedings of the 45th IEEE Midwest Symposium on Circuits and Systems, vol. 2, pp. 152–155.
Scalart, P.; Filho, J. V. (1996). Speech Enhancement Based on A Priori Signal to Noise Ratio Estimation, Proceedings of ICASSP'96, IEEE, vol. 2, pp. 626–632.
Schroeder, M.; Atal, B. (1985a). Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates, Proceedings of ICASSP'85, IEEE, vol. 10, pp. 937–940.
Schroeder, M.; Atal, B. (1985b). Stochastic Coding of Speech at Very Low Bit Rates: The Importance of Speech Perception, Speech Communication, vol. 4, pp. 155–162.
Schroeder, M.; Atal, B.; Hall, J. (1979). Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear, The Journal of the Acoustical Society of America, vol. 66, December, pp. 1647–1652.
Shannon, C. (1949). Communication in the Presence of Noise, Proc. Institute of Radio Engineers, vol. 37, pp. 10–21. Reprinted as a classic paper in: Proc. IEEE, vol. 86, no. 2 (Feb. 1998).
Spanias, A. (1994). Speech Coding: A Tutorial Review, Proceedings of the IEEE, vol. 82, pp. 1541–1582.
Sukkar, R.; Younce, R.; Zhang, P. (2006). Dynamic Scaling of Encoded Speech Through the Direct Modification of Coded Parameters, Proceedings of ICASSP'06, IEEE, pp. 677–680.
Taddei, H.; Beaugeant, C.; DeMeuleneire, M. (2004). Noise Reduction on Speech Codec Parameters, Proceedings of ICASSP'04, IEEE, vol. 1, pp. 497–500.
Thepie, E.; Beaugeant, C.; Taddei, H.; Duetsch, N.; Pastor, D. (2006). Echo Reduction Based on Speech Codec Parameters, Proceedings of IWAENC'06.
Thepie, E.; Beaugeant, C.; Taddei, H.; Pastor, D. (2008). Noise Reduction within Network through Modification of the LPC Parameters, Proceedings of ITG-SCC'08.
Tsai, S.; Yang, J. (2001). GSM to G.729 Speech Transcoder, Proceedings of ICECS'01, IEEE, pp. 485–488.
Un, C.; Choi, K. (1981). Improving LPC Analysis of Noisy Speech by Autocorrelation Subtraction Method, Proceedings of ICASSP'81, IEEE, vol. 6, pp. 1082–1085.
Vahatalo, A.; Johansson, I. (1999). Voice Activity Detection for GSM Adaptive Multirate Codec, Proceedings of the IEEE Workshop on Speech Coding, Porvoo, Finland, pp. 55–57.
Vary, P.; Martin, R. (2005). Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley and Sons, Ltd.
Vaseghi, S. (1996). Advanced Signal Processing and Noise Reduction, Wiley-Teubner.
Wang, D.; Lim, J. (1982). The Unimportance of Phase in Speech Enhancement, IEEE Trans. Acoust., Speech, Signal Processing, vol. 30, no. 4, August, pp. 679–681.
Yasuji, O.; Suzuki, M.; Tsuchinaga, Y.; Tanaka, M.; Sasaki, S. (2002). Speech Coding Translation for IP and 3G Mobile Integrated Network, Proceedings of ICC'02, IEEE, pp. 114–118.
Ye, H.; Wu, X. (1991). A New Double Talk Detection Based on the Orthogonality Theorem, IEEE Transactions on Communications, vol. 39, pp. 1542–1545.
Yoon, S.; Kang, H.; Park, Y.; Youn, D. (2001). An Effective Transcoding Algorithm for G.723.1 and G.729A Speech Coders, Proceedings of Eurospeech'01, pp. 2499–2502.
Yoon, S.; Kang, H.; Park, Y.; Youn, D. (2003). Transcoding Algorithm for G.723.1 and AMR Speech Coders: For Interoperability between VoIP and Mobile Networks, Proceedings of Eurospeech'03, pp. 1101–1104.