Speech Quality Investigation using PESQ in a simulated Climax
2007:260 CIV MASTER'S THESIS
Speech Quality Investigation using PESQ in a simulated Climax system for ATM
Alexander Storm
Luleå University of Technology
MSc Programmes in Engineering, Space Engineering
Department of Computer Science and Electrical Engineering
Division of Signal Processing
2007:260 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--07/260--SE

Abstract

The demand for obtaining and maintaining a certain speech quality in systems and networks has grown in recent years. It is becoming increasingly common to specify the quality formally instead of settling for "it sounds OK". The most accurate way to perform a quality test is to let users of the services express their opinion about the perceived quality. This subjective test method is unfortunately expensive and very time consuming. New objective methods have recently been developed to replace these subjective tests. These objective methods have high correlation to subjective tests, are fast and are relatively cheap. In this work the performance of one of these objective methods was investigated, with the main focus on speech distorted by impairments commonly occurring in air traffic control radio transmissions. The software in which the objective method is implemented was evaluated for a recommendation on further usage. Some of the test cases were also tested on people for a subjective judgment and then compared with the objective results.

Keywords: Objective speech quality evaluation, PESQ, Mean Opinion Score (MOS), MOS-LQO, Climax.

Preface

This is the report of the final master thesis that concludes my journey towards my MSc degree in Space Engineering at Luleå University of Technology. The thesis was carried out at Saab Communication in Arboga and involved examination of objective methods for grading speech quality in communication links.
I would like to thank Saab Communication for this opportunity; special gratitude goes to my supervisors Alf Nilsson, Ronnie Olsson and Lars Eugensson at Saab for their inspiration and knowledge. At the Department of Computer Science and Electrical Engineering I would like to thank my examiners Magnus Lundberg Nordenvaad and James LeBlanc. I would also like to express my gratitude to all the inspiring people I have had the opportunity to meet and work with during my five years at LTU. You have made these years some of the best of my life and I wish you all a delightful future. Finally, thanks to my family and friends for your support.

Arboga, October 2007
Alexander Storm

Content

1 Introduction ...................................................................... 5
2 What is Speech Quality and how is it measured? ................ 6
   2.1 Impairments ................................................................ 7
   2.2 Quality measurements ................................................. 9
   2.3 PESQ – Perceptual Evaluation of Speech Quality ........ 16
   2.4 Intelligibility ............................................................. 21
   2.5 Future measurement methods .................................... 24
3 Theory ............................................................................ 25
   3.1 CLIMAX ................................................................... 25
   3.2 Impairments .............................................................. 27
   3.3 Objective measurements ............................................ 30
   3.4 PESQ-verification ...................................................... 30
   3.5 Subjective measurements ........................................... 30
4 Methods .......................................................................... 32
   4.1 PESQ-verification ...................................................... 32
   4.2 Objective quality measurements ................................. 34
   4.3 Subjective measurements ........................................... 40
5 Result ............................................................................. 42
   5.1 PESQ-verification ...................................................... 42
   5.2 Objective measurements ............................................ 45
   5.3 Subjective measurements ........................................... 58
6 Conclusion and Discussion ............................................... 59
   6.1 PESQ-verification ...................................................... 59
   6.2 Objective measurements ............................................ 59
   6.3 Subjective measurements ........................................... 62
   6.4 Error sources ............................................................ 63
   6.5 The GL-VQT Software® ............................................. 63
   6.6 Additional measurements ........................................... 64
References ......................................................................... 65
Appendix 1 Glossary .......................................................... 68
Appendix 2 ITU-T P.862, Amendment 2. Conformance test 2(b) ... 69
Appendix 3 Subjektivt test av talkvalitet (Subjective test of speech quality) ... 70
Appendix 4 Results for case 9 of the objective measurements ... 71
Appendix 5 The MOS-LQS result of the subjective measurement ... 72

1 Introduction

The European Organisation for Civil Aviation Equipment (Eurocae¹) is developing a technical specification for a communication system using Voice over IP (VoIP) for Air Traffic Management (ATM). The specification is planned to be completed by the end of 2007 and is expected to contain a speech quality recommendation for ATM according to the International Telecommunication Union (ITU) MOS scale. To measure and verify the quality it is proposed that the objective PESQ algorithm be used. To get a feeling for what quality demands may be reasonable, some cases of speech impairments typical for ATM were investigated and tested in this work using the PESQ algorithm. For comparison and software evaluation, some of the test cases were also tested on humans to obtain subjective opinions. The purpose of this work was to investigate what objective methods for speech quality assessment are available on the market and how they perform.
Another purpose was to simulate and investigate how different impairments influence the speech quality using the objective PESQ algorithm. Evaluation and verification of the algorithm for these specific impairments were made using subjective tests and recent research. The software in which the algorithm is implemented was also investigated and evaluated. A small comparison between intelligibility and quality of speech was also performed.

¹ Eurocae is an organization where European administrations, airlines and industry can discuss technical problems. The members of Eurocae are European administrations, aircraft manufacturers, equipment manufacturers and service providers, and their objective is to work out technical specifications and recommendations for the electrical equipment in the air and on the ground [1].

2 What is Speech Quality and how is it measured?

With the introduction of IP services like Voice over IP (VoIP), methods to measure the performance of these services are required. VoIP is introduced to reduce expenses by using one type of network for both voice and data. Over the years users have become accustomed to the quality that the "ordinary" Public Switched Telephone Network (PSTN) provides, to the extent that PSTN nowadays sets the standard in quality and predictability. VoIP needs to meet this standard to be widely accepted [2]. To cope with this challenge it is important to understand the many differences between VoIP and PSTN. The main differences are presented below.

PSTN was designed for time-sensitive delivery of voice traffic. It was constructed with non-compression analog-to-digital encoding techniques, always with the voice channel in mind, to give the right amount of bandwidth and frequency response [2]. The IP networks were, on the other hand, designed for non-real-time applications like file transfers and e-mails.
In PSTN, call setup and management are provided by the core of the network, while VoIP networks have moved this management into the endpoints, such as personal computers and IP telephones [2]. Because of this the network core is not equally controlled and regulated, which can have a negative impact on the quality.

A telephone call in PSTN gets a dedicated channel with dedicated bandwidth. This guarantees a certain quality, which is about the same for every call. VoIP, on the other hand, can neither guarantee nor predict voice quality. In VoIP the calls are divided into small frames, or packets, which can take different routes between the caller and the receiver. The available bandwidth cannot be guaranteed; it depends on the performance and load of the network [3].

In PSTN the codec ITU-T G.711 [4] is used. It is a linear waveform codec that almost reproduces the waveform at decoding. G.711 works at an 8 kHz sampling rate (8000 samples/s) and each encoded segment is 8 bits long (0,125 ms), which gives a data rate of 64 kbit/s (or bps). The codecs G.729 (10 ms segments, ~80 bits/segment, ~8 kbps) and G.723.1 (30 ms segments, ~180 bits/segment, ~6 kbps) are non-linear because they only try to process the parts of the waveform that are important for perception, leading to a smaller bandwidth requirement. The drawbacks are longer segments and low bit rate, which can lead to higher end-to-end delay.

VoIP introduces factors, like headers, that increase the bandwidth requirement. After encoding, the code words are accumulated into frames, usually of 20 ms. The frames are then placed in packets before transmission. For correct delivery, headers are added to the packets. First, the IP, UDP and RTP protocols each add a header, in total 320 bits. The transmission medium layer, typically Ethernet, adds an additional header of 304 bits. This adds up to a total of 95,2 kbps for the VoIP transmission, eq.(1), using G.711 with its 64 kbps payload.
(1280 bits/frame + 320 bits + 304 bits) × 50 frames/s = 95200 bps    (1)

2.1 Impairments

There are many factors influencing the quality of speech transmitted over a network. It is possible to measure most of these factors, but it is not assured that these measures give a correct estimation of the quality. Quality is highly subjective; it is the user who decides whether the quality is acceptable or not. Voice quality can be described by three key parameters [3]:

end-to-end delay – the time it takes for the signal to travel from speaker to listener.
echo – the sound of the talker's voice returning to the talker's ear.
clarity – a voice signal's fidelity, clearness, lack of distortion and intelligibility.

The first two are often considered the most important, but the relationship between the factors is complex, and if any one of the three becomes unacceptable the overall quality is unacceptable.

2.1.1 Delay

Delay only affects the conversational quality; it does not introduce any distortions into the signal. The delay in PSTN depends on the distance the signal travels: the longer the distance, the higher the delay. In VoIP the delay depends on the handling of the packets: switching, signal processing (encoding, compression), packet size, jitter buffers etc. [2]. Delay becomes an issue when it reaches about 250 ms; between 300 ms and 500 ms conversation is difficult, and at delays over 550 ms a normal conversation is impossible. In PSTN the end-to-end delay is usually under 10 ms, but in VoIP networks the delay can reach 50–100 ms due to the operations (packetization, compression etc.) of the codec [2].

2.1.2 Echo

Like delay, echo is a bigger issue when dealing with conversational quality. It does not affect the sound quality, even though a talker can perceive it as being as disturbing as other distortions. There are two different kinds of echo: acoustic and electrical echo.
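The per-call bandwidth arithmetic of eq.(1) generalizes directly to other codec and header assumptions. A minimal sketch (the header sizes are those given in the text; the function name and default parameters are illustrative, not from any standard API):

```python
# Sketch: per-call VoIP bandwidth including IP/UDP/RTP and Ethernet headers,
# following the framing assumptions of eq.(1) (20 ms frames, 50 frames/s).
# Header sizes (320 + 304 bits) are taken from the text above.

def voip_bandwidth_bps(payload_bits_per_frame, frames_per_s=50,
                       ip_udp_rtp_bits=320, ethernet_bits=304):
    """Total on-the-wire bit rate for one voice stream, in bit/s."""
    return (payload_bits_per_frame + ip_udp_rtp_bits + ethernet_bits) * frames_per_s

# G.711: 64 kbit/s payload -> each 20 ms frame carries 1280 bits
print(voip_bandwidth_bps(1280))   # -> 95200, matching eq.(1)
```

The same function shows the relative overhead penalty of low bit-rate codecs: for G.729 (~160 payload bits per 20 ms packet) the headers account for a far larger share of the total rate than for G.711.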
Acoustic echo can be heard when a portion of the speech comes out of the loudspeaker at the far end, is picked up by the microphone and is sent back to the talker. Electrical echo is introduced where a 2-wire analog line is connected to a 4-wire system. These connections are made by hybrids, and if there is an impedance mismatch between the 2-wire and the 4-wire side the speech will leak back to the talker. If the echo returns less than 30 ms after it was sent the talker will usually not perceive it as annoying; this also depends on the level of the echo. If the echo returns a little more than 50 ms after the transmission the conversation will be affected and the talker will perceive the conversation as "hollow" or "cave-like". Echo is a bigger problem in VoIP than in PSTN. VoIP does not introduce more echo, but it introduces more delay, which makes the echo more noticeable and annoying [3].

2.1.3 Clarity

Clarity is the most subjective of the three parameters. Clarity depends on the amount of various distortions introduced to and by the network. There are several kinds of distortions that influence the clarity. Some examples [5]:

Encoding and decoding of the signal: which codec is used and what its features are.
Time-clipping, for example front end clipping (FEC) introduced by a Voice Activity Detector (VAD).
Temporary signal loss caused by packet loss.
Jitter: variance in delay of received packets.
Noise, for example background noise.
Level clipping: when an amplifier is driven beyond its voltage or current capacity.

Of these, the following impairments are introduced in a PSTN network:

Analog filtering and attenuation in a telephone handset and on line transmissions.
Encoding via non-uniform PCM, which introduces quantization distortion. This has minimal impact on the clarity and is accepted in ordinary PSTN telephony.
Bit errors due to channel noise.
Echo due to many hybrid wire junctions.
With VoIP, new impairments have been introduced because of the new technology for transmitting speech signals:

Low bit-rate codecs are more often used to limit the need for bandwidth. These introduce nonlinear distortion, filtering and delay.
Front-end clipping (FEC). To lower the bandwidth requirement even more, silence suppression is used together with VADs.
Packet losses, which introduce dropouts and time-clipping.
Packet jitter: variance in packet arrival times at the receiver. This is limited by jitter buffers.
Packet delay, which can cause packet loss and jitter.

In this work, only impairments affecting the clarity and the listening quality are investigated.

2.2 Quality measurements

2.2.1 Subjective Assessment

Using people to grade the quality of speech is the most accurate way to measure the quality, since the user of the services is human. People are also used because it is hard for instruments and machines to mimic how humans perceive speech quality. The drawbacks are that subjective tests are very expensive and time consuming. They are usually used in the development phase of systems and services and are not suitable for real-time monitoring. The International Telecommunication Union (ITU) has standardized a method for subjective speech quality testing, described in the standard ITU-T P.800 [6]. Subjective tests are performed in a carefully controlled environment with a large number of people. A large number is required since the subjective judgment is influenced by expectations, context/environment, physiology and mood. The large number of subjects increases the accuracy and decreases the influence of deviating results. The participants listen to the transmitted/processed speech samples and grade the perceived quality according to the scale stated in P.800, see table 2.1.

Table 2.1. Opinion scale according to ITU-T P.800.
Score   Quality of the speech
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

After the test, the individual scores are collected and the results are treated statistically to produce the desired information. The most common result is the mean value. This mean is evaluated as the quantity MOS (Mean Opinion Score); a MOS score of 3,6–4,2 is widely accepted as being a good score for a network. Letters are added to state the kind of test. For a listening-only test the notation is either MOS-LQS (listening-quality-subjective) or just MOS. For a conversational test the notation is MOS-CQS or MOSc (table 2.2) [7].

Table 2.2. MOS notation according to P.800.1.

                         Subjective   Objective   Estimated
Listening-Quality        MOS-LQS      MOS-LQO     MOS-LQE
Conversational-Quality   MOS-CQS      MOS-CQO     MOS-CQE

In every subjective test some references should be employed; usually Modulated Noise Reference Units (MNRUs) are used. MNRU is standardized in ITU-T P.810 [8], which describes how to distort speech samples in a controlled mathematical way. The amount of MNRU is measured in dBQ, where the Q-value is the ratio in decibels between the signal and the added white noise; for subjective tests several Q-values are used. These extra reference speech samples are mixed with the original samples. After the test the reference samples have both a MOS value and a Q-value. By plotting these it is possible to obtain a relationship between MOS and Q. Figure 2.1 [9] shows an example of a regression of this relationship curve; it usually has an S-shape. With this relationship it is possible to translate every MOS score to a Q-score. The Q-scores tend to be more language and experiment independent, which makes it possible to compare scores from different experiments at different laboratories, something that is not possible using the MOS scores only.

Figure 2.1. An example of a regression of the relationship between MOS- and Q-values.

Together with listening and conversational tests there are also talking quality tests [6].
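The statistical treatment of the individual scores described above (mean over listeners giving the MOS) can be sketched as follows. The confidence interval is not part of P.800's scale definition but is commonly reported alongside the mean; the ratings here are invented for illustration:

```python
# Sketch: raw P.800 ACR listener scores -> MOS-LQS with a simple 95%
# confidence interval (normal approximation, sample standard deviation).
import math

def mos(scores):
    """Return (mean opinion score, half-width of a 95% confidence interval)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    half = 1.96 * math.sqrt(var / n)
    return mean, half

ratings = [4, 4, 3, 5, 4, 3, 4, 5, 4, 4]   # hypothetical scores from 10 listeners
m, ci = mos(ratings)
print(f"MOS-LQS = {m:.2f} +/- {ci:.2f}")
```

The shrinking confidence interval as the listener count grows is exactly why P.800 prescribes a large number of subjects.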
Listening tests are by far the most widely used since they are the easiest to perform. The judgment is also stricter in listening tests, since the participants will be more focused and sensitive to small impairments that would not be caught during a conversational test. In ITU-T P.800 three different listening tests are described: the ACR, DCR and CCR methods.

ACR – Absolute Category Rating
Here the subjects are presented with sentences with a length of 6–10 s. After each sentence the listeners rate the perceived quality according to table 2.1. The mean value of all ratings is the MOS-LQS. ACR is the most frequently used listening test. It works well at low Q-values (Q < 20 dB) but shows a reduction in sensitivity at higher Q-values (good quality circuits). One reason for the low sensitivity can be that in ACR different sentences are often used for different systems.

DCR – Degradation Category Rating
DCR shows higher sensitivity at high Q-values (Q > 20 dB) than ACR. In DCR the listeners are presented with pairs of the same sentence, where the first sample is of high quality and the second one has been processed by the system. After each pair the listeners rate the degradation of the last sample compared to the first, unprocessed sample according to a degradation opinion scale (table 2.3). Afterwards the degradation MOS (DMOS) is calculated.

Table 2.3. Degradation opinion scale (ITU-T P.800).

Score   The degradation is:
5       Inaudible
4       Audible but not annoying
3       Slightly annoying
2       Annoying
1       Very annoying

CCR – Comparison Category Rating
The CCR method is similar to DCR, but the order in which the samples are presented to the listener is random. In half of the pairs the processed sample is the first sample, and in the other half the second sample is the processed one. After each pair the listeners rate the quality of the second sample compared to the quality of the first sample.
The rating is done according to the comparison opinion scale in table 2.4.

Table 2.4. Comparison opinion scale.

Score   The quality of the 2nd compared to the 1st is:
 3      Much better
 2      Better
 1      Slightly better
 0      About the same
-1      Slightly worse
-2      Worse
-3      Much worse

This leads to a Comparison MOS (CMOS). An advantage of CCR over DCR is the possibility to assess processes that have either degraded or improved the speech quality.

2.2.2 Objective Assessment

Even though subjective tests give the most accurate measurement of speech quality, objective methods are still desired. Subjective tests are, as stated earlier, both expensive and time consuming. There have been many different techniques for objective assessment over the years, and they can be divided into different groups [10]. First of all, the measurements can be either passive or active.

Passive measurements
The passive measurements are divided into planning and monitoring tools.

Planning tools

The E-model, ITU-T G.107
The E-model is a method for estimating the performance of networks. It is used as a transmission planning tool and it is described in ITU-T G.107 [15]. The foundation of the model is eq.(2).

R = Ro − Is − Id − Ie + A    (2)

Ro is the basic signal-to-noise ratio. Is represents all impairments which occur simultaneously with speech, for example loudness, quantization distortion and side tone level. Id is the "delay impairment factor", which includes all impairments due to delay and echo effects. Ie is the "equipment impairment factor" and represents all impairments caused by the equipment, for example low bit-rate codecs. Finally, the "advantage factor" A represents the user's expectation of quality. For example, when using a mobile phone out in the woods, people can be more forgiving on quality issues because they are satisfied with just being able to establish a connection. All this sums up to the Rating Factor, R, which ranges from 0 to 100, where 100 is the highest rating, i.e. best quality.
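Eq.(2) and the conversion from R to an estimated MOS can be sketched as below. The R-to-MOS polynomial is the mapping published in ITU-T G.107; the factor values in the example are illustrative, not the G.107 defaults:

```python
# Sketch: the E-model rating factor of eq.(2) and the G.107 mapping from
# R to an estimated MOS. Input values below are invented for illustration.

def rating_factor(Ro, Is, Id, Ie, A=0.0):
    """R = Ro - Is - Id - Ie + A (eq. 2)."""
    return Ro - Is - Id - Ie + A

def r_to_mos(R):
    """ITU-T G.107 mapping from rating factor R to an estimated MOS."""
    if R <= 0:
        return 1.0
    if R >= 100:
        return 4.5
    return 1 + 0.035 * R + R * (R - 60) * (100 - R) * 7e-6

# Hypothetical planning case: a low bit-rate codec costs Ie = 11
R = rating_factor(Ro=93.2, Is=1.4, Id=8.0, Ie=11.0)
print(R, round(r_to_mos(R), 2))
```

Because the mapping saturates at R = 0 and R = 100, the estimated MOS stays within the 1,0–4,5 range regardless of the impairment factors fed in.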
The R-value can then be converted into a MOS-CQE (conversational-quality-estimated) or a MOS-LQE score for comparison with other objective measurements.

Monitoring tools
In non-intrusive (passive) monitoring measurements the actual traffic is examined; there is no need for speech samples to be sent through the system. It is possible to monitor the system 24 hours a day, and the monitoring does not affect or intrude on the system. The drawback is that the accuracy and correlation to subjective tests are lower than for intrusive measurements.

ITU-T P.563
ITU-T P.563 [16] describes a new standard for non-intrusive measurements. P.563 is a single-ended method for objective speech quality assessment. It is based on models of voice production and perception. It measures the effects of one-way distortions and noise on speech and delivers a MOS score that can be mapped to, for example, a MOS-LQO score.

Active measurements
Active measurements are divided into electroacoustic and psychoacoustic measurements, and the basic idea is to transmit a waveform from one end of the system and receive it at the other end. The received (degraded) waveform is then compared to the original waveform, resulting in a quality score based on the difference between the two waveforms. The advantages of these methods are that they have the highest correlation with subjective measurements and that the original waveform can be constructed to match the objective of the measurements: different languages, specific distortions etc. The drawback is that the technique uses a specific speech sample which is transmitted, not live traffic. Among these tests, signal-to-noise ratio (SNR) and total harmonic distortion (THD) can be mentioned [10].

Electroacoustic measurements
Electroacoustic measurements were among the first objective techniques to measure the perceived quality of waveforms [10].
One example of the earlier methods is the 21- and 23-Tone Multifrequency Test, where a complex waveform containing several equally spaced frequencies is transmitted through the system. At the receiving end the signal-to-distortion ratio (SDR) is calculated as a power ratio in decibels (dB); the ratio is an indication of the quality. This method was soon questioned, since it gave very low SDR values for some codecs even though the users did not perceive any degradation. The reason was that the codecs in question affected parts of the transmission that were not very important for human perception. Later (1989) proposals were made to change the multifrequency waveform to digital files containing recorded speech. The processing was basically the same, but the method was no success due to poor correlation to subjective test results [10 p.120].

Psychoacoustic measurements
The problem with electroacoustic measurements was that they only consider and measure different characteristics of the transmitted signal; the actual content of what was being transmitted was not considered. The increasing use of communication services raised the need for weighted measurements which consider how humans perceive different kinds of impairments.

PSQM – Perceptual Speech Quality Measure (KPN, the Netherlands)
PSQM was one of the earliest standardized methods to measure speech quality from a human perception point of view. PSQM was standardized in 1996 through the ITU-T Recommendation P.861 [11]. The purpose was to objectively measure the quality of narrow-band telephone (300–3400 Hz) speech signals transmitted through different codecs under different controlled conditions. PSQM measured the perceptual distance between the input and the output signal. The result was a score from 0 to infinity, where 0 corresponded to a perfect match.
The objective was to map this score onto a MOS scale, but because the results varied with the language used there was no good mapping function, resulting in low correlation to subjective scores [11]. Another reason for the low correlation was the weak time alignment function. The PSQM algorithm was developed further in 1997 to cope with these limitations. The new method, which was included in P.861, was called PSQM+ and it had solved problems like how to judge and handle severe distortions and time clipping.

PAMS – Perceptual Analysis Measurement System (British Telecom)
The PAMS algorithm is based on a different signal processing technique from PSQM. Both compare a source signal with the transmitted version of the same signal, but PAMS gives a score between 0 and 5, which corresponds to the same scale as subjective MOS testing. PAMS calculates and analyses the Error Surface to get the score. The error surface is the difference between the Sensation Surfaces of the output and input speech samples. The score is then the average of the error surface at different frequencies. This process is described in the next section.

Both PSQM+ and PAMS showed unsatisfactory correlation with subjective tests for a couple of test cases. The solution was the combination of the perceptual model of PSQM99 (an extension of PSQM+) and the powerful time alignment function of PAMS. The new algorithm was called Perceptual Evaluation of Speech Quality (PESQ) and became the new standard ITU-T P.862 [12] in February 2001. With the introduction of P.862, the PSQM standard P.861 was withdrawn. As with the earlier methods, PESQ is intended for narrow-band telephone signals. Since this work focuses on the performance of the PESQ algorithm, it is given a more elaborate explanation.

2.3 PESQ – Perceptual Evaluation of Speech Quality

Figure 2.2. The basic functionality of the PESQ method.

Figure 2.2 shows the basic block representation of the PESQ test procedure.
A speech sample is first inserted into the system under test and then collected at the output of the system. The collected sample is compared to the original speech sample in the PESQ algorithm, resulting in a PESQ raw score, which is mapped to obtain the highest correlation to subjective MOS scores. The resolution of the inserted speech sample file should be 16 bits and the sample rate should be 8 kHz (PESQ is also validated for a sample rate of 16 kHz).

Figure 2.3. Block representation of the PESQ algorithm.

The PESQ algorithm is illustrated in figure 2.3 [13]. The first step of the processing is to compensate for any gain or attenuation of the system under test. The signals are aligned to the same constant power level in the level alignment block; this level is the same as the normal listening level used in subjective tests. In the input filter the algorithm models and compensates for the filtering that takes place in the handset of the telephone in a listening test; it is assumed that the handset's frequency response follows the characteristics of an IRS (Intermediate Reference System) receiver. Since the exact filtering is hard to characterize, PESQ is rather insensitive to the filtering of the handset.

To enable comparison between the two signals, time alignment is required. The degraded signal is often delayed, sometimes with variable delays, and PESQ uses the technique from PAMS to cope with this problem. The time alignment process is divided into two main stages: an envelope-based crude delay estimation and a histogram-based fine time alignment. The envelope-based approach starts by calculating the envelopes of the whole length of the degraded and reference signals respectively. This is achieved with the help of a voice activity detector (VAD). These envelopes are then cross-correlated by frames in order to find the delay. This procedure yields a resolution of about ±4 ms.
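The idea behind the envelope-based crude delay estimation can be sketched as below. Real PESQ derives its envelopes with a VAD and works on frames; this toy version uses block-averaged absolute values and searches only non-negative delays, and all signals are invented:

```python
# Sketch: crude delay estimation by cross-correlating coarse signal
# envelopes and picking the lag with the highest correlation, in the
# spirit of PESQ's first time-alignment stage.

def envelope(signal, block=4):
    """Coarse envelope: mean absolute value over non-overlapping blocks."""
    return [sum(abs(s) for s in signal[i:i + block]) / block
            for i in range(0, len(signal) - block + 1, block)]

def crude_delay(ref, deg, max_lag=10):
    """Delay (in samples) of deg relative to ref, quantized to one block."""
    e_ref, e_deg = envelope(ref), envelope(deg)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(0, max_lag + 1):
        corr = sum(a * b for a, b in zip(e_ref, e_deg[lag:]))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag * 4  # blocks back to samples (block=4)

ref = [0.0] * 20 + [1.0, 0.8, 1.0, 0.9] * 5 + [0.0] * 20   # one speech burst
deg = [0.0] * 8 + ref[:-8]                                  # same burst, 8 samples late
print(crude_delay(ref, deg))   # -> 8
```

The block size plays the role of the frame spacing in PESQ, and exactly as in the real algorithm it limits the resolution of this stage; the histogram-based stage described next refines the estimate.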
Subsequently the signals are divided into utterances; an utterance is a continuous speech burst with pauses shorter than a certain length (200 ms). These utterances are examined using the same envelope-based delay estimation. The first step in the histogram-based estimation is to divide the signals into frames of 64 ms with 75% overlap. These frames are Hann-windowed and cross-correlated, and the index of the maximum of the cross-correlation gives the delay estimate for each frame. A weighted histogram of the delay estimates is then constructed, normalized and smoothed by convolution with a symmetric triangular kernel of a width of 1 ms. The location of the maximum in the histogram is then combined with the previous delay estimation, yielding the final delay estimation for the utterance. The maximum is also divided by the sum of the histogram before convolution to give a confidence measure between 0 (no confidence) and 100 (full confidence).

In many cases there can be delay changes within the utterance. To test for this, each utterance is split into smaller parts on which the envelope- and histogram-based delay estimations are performed. The splitting process is repeated at several points and the confidence is measured and compared to the confidence before the split. As long as the confidence is higher than before the split, the process continues to find the right delay estimation.

The auditory transform is a psychoacoustic model that mimics the properties of human hearing. In it, the signals are mapped into an internal representation in the time-frequency domain by a short-term Fast Fourier Transform (FFT) with a Hann window over 32 ms frames. The result is components called cells, see figure 2.4. During the FFT the frequency scale is warped into a modified Bark scale, called the pitch power density, which reflects the human sensitivity at lower frequencies.
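The framing, windowing and transform stage that produces the time-frequency cells can be sketched as follows. PESQ uses 32 ms frames at 8 kHz (256 samples) with overlap and an FFT; this toy version uses tiny frames and a naive DFT so the grid of cells is easy to inspect, and omits the Bark warping and loudness mapping:

```python
# Sketch: split a signal into overlapping frames, Hann-window each frame,
# and take a per-frame magnitude spectrum -- yielding the grid of
# time-frequency "cells" of figure 2.4.
import cmath, math

def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft_cells(signal, frame=8, hop=4):
    """One magnitude spectrum per frame (time axis of the cell grid)."""
    win = hann(frame)
    cells = []
    for start in range(0, len(signal) - frame + 1, hop):
        seg = [s * w for s, w in zip(signal[start:start + frame], win)]
        spectrum = [abs(sum(seg[k] * cmath.exp(-2j * math.pi * f * k / frame)
                            for k in range(frame)))
                    for f in range(frame // 2 + 1)]  # bins 0 .. Nyquist
        cells.append(spectrum)
    return cells

tone = [math.sin(2 * math.pi * 0.25 * n) for n in range(32)]  # f = fs/4
cells = stft_cells(tone)
# a tone at fs/4 concentrates its energy in bin 2 of an 8-point DFT
print(max(range(len(cells[0])), key=lambda f: cells[0][f]))   # -> 2
```

Each entry `cells[t][f]` corresponds to one cell in figure 2.4; in PESQ the power of each cell is then warped to the Bark scale and mapped to loudness before any comparison is made.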
This Bark spectrum is then mapped to a loudness scale (Sone) to obtain the perceived loudness in each time-frequency cell. Figure 2.4. The time-frequency cells. During this mapping, equalisation is made to compensate for filtering in the tested system and for time-varying gain [12]. The resulting representation is called the Sensation Surface. In the Disturbance Processing block the sensation surface of the degraded signal is subtracted from the sensation surface of the reference signal, resulting in the Error Surface, which contains the difference in loudness for every cell. An example of an error surface is shown in figure 4.2. Two different disturbance parameters are calculated: the absolute (symmetric) disturbance and the additive (asymmetric) disturbance. The absolute disturbance is a measure of the absolute audible error and is obtained by examining the error surface. If the difference in the error surface is positive, components such as noise have been added; if the difference is negative, parts of the signal have been lost, due to coding distortion for example. For each cell the minimum of the original and degraded loudness is computed and divided by 4. This gives a threshold which is subtracted from the absolute loudness difference; values that are less than zero after this subtraction are set to zero. This is called masking: the influence of small distortions that are inaudible in the presence of loud signals is neglected. The additive disturbance is a measure of the audible errors that are significantly louder than the reference. It is calculated for each cell by multiplying the absolute disturbance with an asymmetry factor. This asymmetry factor is the ratio of the degraded and original pitch power densities raised to the power of 1,2.
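The per-cell masking rule for the symmetric disturbance just described can be sketched as below; this is an illustrative simplification of the disturbance computation, not the P.862 reference code:

```python
def symmetric_disturbance(ref_loudness, deg_loudness):
    # Absolute (symmetric) disturbance for one frame of time-frequency
    # cells, with the min/4 masking threshold from the text: loudness
    # differences smaller than a quarter of the quieter cell are taken
    # as inaudible and zeroed out.
    out = []
    for r, d in zip(ref_loudness, deg_loudness):
        threshold = min(r, d) / 4.0
        out.append(max(abs(d - r) - threshold, 0.0))
    return out
```

A cell with reference loudness 4,0 and degraded loudness 4,5 is fully masked (threshold 1,0 exceeds the difference 0,5), while a degraded loudness of 8,0 leaves an audible disturbance of 3,0.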
Asymmetry factors less than 3 are set to zero, and factors over 12 are clipped to that value. This means that only the cells where the degraded pitch power density exceeds the reference pitch power density remain, i.e. the additive disturbance covers positive disturbances only. The two disturbance parameters are aggregated along the frequency axis, resulting in two frame disturbances. If these frame disturbances are above a threshold of 45 they are identified as bad intervals. The delay is then recalculated for these intervals, which are once again cross-correlated in the time alignment block. If this correlation is below a threshold it is concluded that the interval is matching noise against noise, and the interval is no longer considered bad. For a correlation above the threshold a new frame disturbance is calculated, and it replaces the original disturbance if it is smaller. In the Cognitive Model the frame disturbance values and the asymmetrical frame disturbance values are aggregated over intervals of 20 frames. These summed values are then aggregated over the entire active interval of the speech signal. Finally the PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value, and it ranges from -0,5 to 4,5, eq.(3):

PESQMOS = 4,5 - 0,1·dSYM - 0,0309·dASYM (3)

where dSYM is the average disturbance value and dASYM is the average asymmetrical disturbance value. This PESQ Raw-score shows in some cases poor correlation with MOS-LQS. To obtain higher correlation, the PESQ Raw-score is usually mapped to the MOS-LQO (MOS-Listening Quality Objective) score (ITU-T P.862.1 -11/2003 [14]). The mapping function, shown in eq.(4) and in figure 2.5, gives a score from 1,02 to 4,55, which corresponds to the P.800 MOS-LQS, see table 2.1. The maximum value 4,5 for the PESQ-score was chosen because it is the same as for a clear and undistorted condition in a typical ACR-LQ test.
y = 0,999 + (4,999 - 0,999) / (1 + e^(-1,4945·x + 4,6607)) (4)

where x represents the PESQ Raw-score and y the MOS-LQO score. Figure 2.5. The MOS-LQO mapping function. The produced MOS-LQO score estimates the listening quality only; it does not account for impairments that influence the conversational quality (MOS-CQO), such as delay, jitter, echo, sidetone and the level of the incoming speech.

Table 2.5. Comparison between some objective methods. The average and worst-case correlation coefficients for 38 subjective tests are shown.

Type (no. of tests)      Corr.coeff.   PESQ    PAMS    PSQM    PSQM+
Mobile network (19)      average       0,962   0,954   0,924   0,935
                         worst-case    0,905   0,895   0,843   0,859
Fixed network (9)        average       0,942   0,936   0,881   0,897
                         worst-case    0,902   0,805   0,657   0,652
VoIP/multitype (10)      average       0,918   0,916   0,674   0,726
                         worst-case    0,810   0,758   0,260   0,469

Table 2.5 [13] shows a comparison between PESQ, PAMS, PSQM and PSQM+. The table shows the correlation coefficients of the different algorithms against 38 subjective tests. The conclusion is that PESQ has the highest correlation, both on average and in the worst case. PESQ shows high accuracy for a wide range of conditions. For some conditions PAMS is close, but it is less accurate for others. PSQM and PSQM+ show lower correlation for conditions including VoIP, packet loss etc.

2.4 Intelligibility In communications, quality is closely related to intelligibility (the degree to which speech can be understood); high quality usually means high intelligibility. However, it is important to distinguish between the two; in many cases intelligibility is crucial while high quality is a desirable bonus [17]. Even though they correlate well in many cases, the relationship is much less clear in others. For example, a small quality drop can have a big influence on intelligibility.
On the other hand, even at low quality scores it might be possible to apprehend and understand the transmitted information without great effort. It is also possible to improve the quality while decreasing the intelligibility, and vice versa. An example is the use of noise suppression schemes to lower the background noise and improve the perceived quality; such systems tend to decrease intelligibility [18]. The PESQ-algorithm was not developed to assess speech intelligibility. However, since there is a relation between quality and intelligibility, it might be possible to extend the PESQ-algorithm to correlate well with subjective intelligibility tests. Research is being done to investigate this relation and how PESQ performs in intelligibility tests [18], [19] and [20].

2.4.1 Subjective measurements Just like quality, intelligibility is a subjective judgment indicating how well a human listener can decode speech information [17]. It is measured using statistical methods where trained talkers speak standardized word lists through the system under evaluation. The words are received at the far end, and trained listeners try to recognize which words have been spoken. There are a number of different standardized word lists to use; one is the Modified Rhyme Test (MRT) [21], [22]. It consists of 50 six-word lists of rhyming words. The whole list is presented to the listener, and the talker pronounces one of the six words in each list. The listener marks the word he thinks was spoken. After the test has been done by at least five listeners, the results are collected and treated statistically to extract the desired information. Table 2.6 shows the first five rows of six rhyming words in the MRT.

Table 2.6. The first five rows of words in the MRT.
went   sent   bent   dent   tent   rent
hold   cold   told   fold   sold   gold
pat    pad    pan    path   pack   pass
lane   lay    late   lake   lace   lame
kit    bit    fit    hit    wit    sit

A similar method is the Diagnostic Rhyme Test (DRT).
It consists of 96 rhyming pairs of words constructed from a consonant-vowel-consonant sound sequence. Examples of the word pairs are presented in table 2.7. The words differ only in the initial consonant, and they are chosen so that the result can be interpreted in different ways to show which kinds of consonants are hard to recognize, and thereby pinpoint what needs to be altered in the system to obtain correct intelligibility. Consonants are chosen because they are more important for intelligibility than vowels [10]. They are also more sensitive to additive impairments like noise, tones etc., as they contain 20 times less average power than vowels. Since consonants are shorter in duration, 10-100ms, compared to vowels, 10-300ms, they are also more sensitive to losses and additive pulses.

Table 2.7. Examples of word pairs in the DRT, grouped by the speech feature they probe.
Voicing:      veal/feel, bean/peen, gin/chin, dint/tint, zoo/sue
Nasality:     meat/beat, need/deed, bit/mitt, dip/nip, moot/boot
Sustenation:  vee/bee, sheet/cheat, bill/vill, thick/tick, pooh/foo
Sibilation:   zee/thee, cheep/keep, gilt/jilt, thing/sing, goose/juice
Graveness:    reed/weed, peak/teak, bid/did, fin/thin, moon/noon
Compactness:  yield/wield, key/tea, hit/fit, gill/dill, coop/poop

The subjective intelligibility tests result in a percentage, 0-100%, representing the proportion of words that were recognized correctly. These results are more straightforward to interpret than the corresponding subjective MOS (1-5); the MOS-score reflects more impressions than just intelligibility, and the scores can vary quite a lot between listeners [17]. These subjective tests do not always reflect reality. In normal life, speech is made up of sentences, which increases intelligibility because of the flow of words. The MRT and the DRT consist of random words, and even when equally distorted, real-life sentences are perceived as having higher overall intelligibility.
2.4.2 Objective measurements There are a couple of indices for objective speech intelligibility. The two most fundamental are the Speech Transmission Index (STI) [23] and the Speech Intelligibility Index (SII) [24]. The STI gives a number between 0 and 1, where 1 represents good intelligibility and low influence of acoustical system properties and/or background noise (compare with the 0-100% of the subjective tests). The STI is based on the assumption that speech can be described as a fundamental waveform which is modulated by low-frequency signals [23]. The STI-score is calculated from the Modulation Transfer Function (MTF) of the system (figure 2.6). Figure 2.6. The STI-method. The MTF is the reduction in modulation index between the transmitter, m(1), and the receiver, m(2), eq.(5):

MTF = m(2) / m(1) (5)

The SII-method is described in [24]. It is a development of the STI-method; it works in a similar way and produces scores between 0 and 1.

Correlation The SII-method correlates well with subjective tests. A problem is that this objective method, along with the others, is limited to linear systems. Testing modern applications such as low bit-rate coding does not produce well-correlated scores. Research is being done to find new objective methods that can deal with these non-linear systems [25].

2.5 Future measurement methods The area of objectively measuring speech quality is expanding fast. It is getting more and more common for specifications of communication solutions to include demands on speech quality. A couple of years ago this was not the case; measurements to obtain a quality score had to be made subjectively, which was far too expensive and time consuming to be used on an everyday basis. Today the objective tools have become accurate, fast and cheap enough for extensive usage. Subjective tests are still more accurate, but given the benefits of objective measurements, their areas of usage will continue to expand.
The research of today is struggling with the following tasks:
- Higher correlation between non-intrusive measurements, like P.563, and subjective tests. Today intrusive measurements, like P.862, give more accurate estimates of the speech quality.
- A new ITU-T standard under development with the working name P.VTQ: a tool for predicting the quality impact of IP-network impairments and for monitoring the transmission quality. It uses metrics from the RTCP-XR (RTP Control Protocol-Extended Report) to calculate the quality, and it gives a MOS-score on the ACR Listening Quality Scale [26].
- Combining quality and intelligibility measurements, i.e. extending the PESQ-algorithm to include intelligibility measurements and give a common score for both quality and intelligibility.
- Making intelligibility measurements work in VoIP applications. The objective STI-method is inaccurate in non-linear and time-variant packetized networks.
- Modifying and extending in-service standards like ITU-T P.862 and ITU-T G.107 to reach the same accuracy for VoIP systems as in "old" PSTN networks.
- ITU-T P.OLQA, a new standard under development. It will be the "universal" model for objective prediction of listening quality, covering not only speech but also other new 3G applications [26].
- Developing tools for predicting and monitoring conversational quality from mouth to ear, including both the electrical connection and the acoustical part. The new ITU-T P.CQO standard is being developed to deal with this task [26].

3 Theory The main objective of this work is to investigate how different impairments degrade the quality of speech in ATM (Air Traffic Management) radio. The ATM system under consideration is the CLIMAX system.

3.1 CLIMAX Climax, or the offset carrier system, is an Air Traffic Control (ATC) communications system working in the VHF-band (30-300MHz). The construction of Climax started in the United Kingdom in the 1960's and it is now widely used in Europe.
Sweden does not use the Climax system, but it becomes an issue for Swedish pilots when flying to countries where the system is used. This multi-carrier system is intended for ground-to-air communication and is based on the idea of having 2-5 transmitters transmitting on the same frequency with a slight offset. Climax offers greater ground coverage, higher redundancy and better coverage at low altitude and in harsh environments [27]. Climax is limited to a 25kHz channel-spacing environment. This can cause problems since 8.33kHz channel spacing is spreading in Europe because of the need for more available frequencies; an 8.33kHz receiver does not have enough bandwidth to operate correctly in a Climax environment. To prevent audible beats, homodynes, caused by frequency differences, the multiple carriers are separated in frequency according to table 3.1 [28].

Table 3.1. Frequency arrangement for Climax channels (fc is the assigned channel frequency).
No. of legs   Leg 1        Leg 2        Leg 3        Leg 4        Leg 5
2             fc +5kHz     fc -5kHz     -            -            -
3             fc +7.5kHz   fc           fc -7.5kHz   -            -
4             fc +7.5kHz   fc -7.5kHz   fc +2.5kHz   fc -2.5kHz   -
5             fc -2.5kHz   fc -7.5kHz   fc           fc +2.5kHz   fc +7.5kHz

3.1.1 Operations The Air Traffic Control Centre (ATCC) transmits the audio signal to all the transmitters, and the aircraft receives the signal from all the transmitters within coverage (figure 3.1). The pilot will then hear all the incoming transmissions simultaneously. Figure 3.1. The basic structure of Climax. The aircraft receives the signal from three antennas. When the pilot receives multiple transmissions there is a great risk that the transmissions are mutually delayed due to different transmission paths. It is crucial that this delay does not reduce the quality and intelligibility of the transmitted speech. The European Organisation for Civil Aviation Equipment (EUROCAE) has proposed that this delay difference may vary between 0 and 10ms for Climax.
For delay differences over 10ms, echo effects start to become disturbing and annoying [29]. For air-to-ground communication it works differently. The transceiver on the aircraft transmits at the centre frequency, and the transmission is received at each aerial within range. The ATCC then selects the aerial with the best performance by Best Signal Selection (BSS), so that only one transmission reaches the air traffic controller (figure 3.2) [27]. Figure 3.2. The aircraft transmits to the ATCC; BSS is used for better performance.

3.2 Impairments Five main impairments are examined in this work:
I. The Climax case (delay).
II. Speech with added noise.
III. Speech with an added tone.
IV. Packet (frame) losses.
V. Speech with added noise pulses.

3.2.1 Case I Here the unique feature of the Climax system was investigated. It was simulated that a pilot receives two transmissions of the same speech from two different transmitters. One of the speech samples was delayed to simulate the echo effect that the pilot will perceive; longer delays give a more disturbing echo. How disturbing the echo gets also depends on the propagation loss, which may differ between the two paths, resulting in different levels of the two received signals. Only simulations were investigated; no real radio transmissions were made. Most of the delay originates in the equipment used for the transmission; the propagation time in air is negligible. The propagation loss, on the other hand, depends mostly on the transmission path in the air. Examples of measured delays are shown in table 3.2 [30].

Table 3.2. Examples of measured delays from Sundsvall ATCC.
Station      Round-trip (ms)   One-way (ms)   Delay difference (ms)
Arvidsjaur   13,4              6,70           0,0
Gällivare    16,1              8,05           1,4
Måttsund     18,1              9,05           2,4
Storuman     18,7              9,35           2,7

The measurements in table 3.2 were made at the Sundsvall ATCC in Sweden.
The one-way latency is obtained by assuming that the round-trip latency is twice the one-way latency. The delay difference is the delay relative to the shortest one-way latency. It should be noted that the stations operated at 125,60MHz and that these measurements were performed on TDM-connections, not VoIP.

3.2.2 Case II and III For the cases with added noise or tones, the Signal-to-Noise Ratio (SNR) was the measure being altered. The SNR compares the level of the desired signal to the level of the background noise. It is measured in decibels (dB) and is calculated as eq.(6):

SNR(dB) = 10·log10(Psignal / Pnoise) = Psignal(dB) - Pnoise(dB) (6)

For the cases investigated in this work, Psignal is the average RMS power of the entire clean signal and Pnoise is the same measure for the noise/tone. Where noise was added, the influence of both white and pink noise was investigated. White noise contains all frequencies with the same probability and the same mean energy; its power is evenly distributed over all frequencies. Pink noise emphasizes the lower part of the frequency spectrum: it distributes its energy evenly over all octaves, i.e. the power density decreases by 3dB/octave towards higher frequencies. This makes it, for example, more pleasant to listen to.

3.2.3 Case IV In a digital voice transmission the speech is divided into packets and frames, usually containing 20ms of speech each. Depending on the performance and load of the network, some of these packets can be lost during transmission. Losses can occur if the network is congested, i.e. components receive too many packets, which makes their buffers overflow and causes packets to be discarded. Congestion can also lead to packet rerouting, which can result in packets arriving too late to the jitter buffers and being discarded.
Individual packets can also be discarded by different applications because they are damaged with bit errors due to circuit noise or equipment malfunction [2], [3]. The effect of a packet loss on quality depends on many factors. First, what is the content of the lost packet? Lost packets containing speech naturally affect quality more than lost packets containing silence. What kind of speech the packets contain is also important: whether it is a vowel or a consonant sound, whether it occurs at the beginning or end of a syllable, or whether a whole syllable is lost. The time at which the packet is lost also matters, especially for bursts of lost packets [31]. For example, bursts towards the end of a telephone call are subjectively perceived as more detrimental to quality than bursts occurring at the beginning of the call. Another factor influencing the impact of packet loss is the codec being used. Waveform codecs like G.711 encode the whole waveform, with no compression and a high bit-rate. With such a codec the perceived quality is affected much less than with perceptual codecs like G.729, G.723 and G.721, which encode only the relevant part of the voice signal. The PESQ-algorithm has earlier been tested and verified for packet losses with a normal distribution. Studies have also been made of how PESQ measures the impact of specific packet losses [32].

3.2.4 Case V Noise pulses occur when an analog radio is disturbed by transmitters using frequency hopping. When the signals of the transmitters are mixed together, intermodulation products are created, and if these products coincide with the frequency of the radio, a noise pulse is perceived by the radio, see table 3.3. The length of the noise pulse is the time the frequency hoppers remain at that specific frequency. An example is the Bluetooth® technology, which uses frequency hopping. It changes channels 1600 times per second, i.e.
it remains 0,625ms on each channel [33].

Table 3.3. Examples of intermodulation products (fA).
f1 + f2 = fA
f1 - f2 = fA
2f1 - f2 = fA
2f2 - f1 = fA

3.3 Objective measurements As PESQ is the most accurate and most widely used objective speech quality tool, it was used for the investigation of the five cases. Several files were created to cover most realistic real-life cases. The files were examined using software in which the PESQ-algorithm has been implemented. Each tested file resulted in a MOS-LQO score.

3.4 PESQ-verification Before testing the five cases, the software itself was tested to make sure that it worked as expected. Speech files with impairments that are neglected by PESQ were examined; these were expected to result in a maximum quality score. To fully verify the algorithm, a conformance test can be made according to ITU-T P.862, Amendment 2 [36]. This test contains three test cases for narrow-band operation, where the test scores of the enclosed files may not diverge from the scores of a reference implementation by more than a certain value. Test case number 2 of the three was performed; it validates P.862 with variable delay.

3.5 Subjective measurements To get an indication of whether PESQ judges the impaired files correctly, a subjective test was performed with some of the files from the objective measurements. The Absolute Category Rating method used delivers a MOS-LQS score, which was compared with the objective MOS-LQO score. The results should be treated very carefully though; to compare the numeric values, the subjective test needs to be performed in a standardized way in a tightly controlled environment [6]. No calibration with MNRUs [8] was performed, for example; the test could therefore not be repeated somewhere else with consistent results. The test was only made to investigate how PESQ ranks the files compared to the subjects: does PESQ rank added noise vs.
an added tone differently than humans, for example?

4 Methods
4.1 PESQ-verification
4.1.1 Test with neglected impairments To verify the accuracy of the software, four tests were made with impairments not considered by PESQ. The degraded speech file was (table 4.1):
1. given added silence in the beginning to simulate end-to-end delay.
2. amplified or attenuated in level.
3. given a combination of added silence and changed level.
4. not processed, just resaved.
All these cases should result in maximum quality if the software works correctly. The test was made with four Swedish voices, four English voices and four Russian voices; two males and two females of each language. Table 4.2 shows the delays and levels of case 3, and table 4.3 shows which files were used.

Table 4.1. Numerical values for the test cases in the PESQ-verification.
Case 1, delay (ms): 10 50 75 140 213 538 973 1688 2222
Case 2, level (dB): -15 -10 -7,5 -4,3 -2,1 -0,2 0,3 2,2 4,4 7,6 10,1 15,1
Case 3, combination (no.): 1-8
Case 4, file (no.): 1-12

Table 4.2. Tested values for case 3.
Combination number   Delay (ms)   Level (dB)
1                    50           +7,6
2                    140          -2,1
3                    538          -10
4                    1688         +4,4
5                    50           -15
6                    140          +10,1
7                    538          +0,3
8                    1688         -7,5

Table 4.3. Tested files/languages for case 4.
File number: 1 swe_f2, 2 swe_f7, 3 swe_m5, 4 swe_m8, 5 B_eng_f1, 6 B_eng_f3, 7 B_eng_m3, 8 B_eng_m6, 9 Ru_f2, 10 Ru_f5, 11 Ru_m1, 12 Ru_m7.

4.1.2 Conformance test, P.862, Amendment 2 The enclosed 40 file pairs of the second test case were tested. To pass the test, the absolute difference between the PESQ Raw-score of the tested implementation and that of the reference implementation should not be greater than 0,05 for a single file pair and 0,5 over all pairs.

4.2 Objective quality measurements The PESQ-algorithm requires two samples of the speech: one reference file and a copy of the reference processed by the system or by hand. These two samples are compared in the algorithm, resulting in a PESQ Raw-score.
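The Raw-score produced here is what the P.862.1 function of eq.(4) converts to MOS-LQO; a minimal sketch of that mapping (decimal points instead of the report's decimal commas):

```python
import math

def mos_lqo(raw):
    # ITU-T P.862.1 mapping from PESQ Raw-score to MOS-LQO, eq.(4):
    # y = 0.999 + (4.999 - 0.999) / (1 + e^(-1.4945*x + 4.6607))
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw + 4.6607))
```

The S-shaped curve compresses the Raw-score range of -0,5 to 4,5 onto roughly 1,02 to 4,55, matching the MOS-LQS scale of table 2.1.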
The speech samples should be speech-like; naturally recorded speech may be used, but artificial voice should be avoided. The samples should have the right level, typically -26dBov (dB relative to the overload point of the system), and be around 8s in duration. The reference should be as distortion-free as possible. The sample rate should be 8kHz or 16kHz for both the reference and the degraded sample, and the sample resolution should be 16-bit. The degraded sample should be recorded at a level where amplitude clipping is avoided. Speech samples from ITU-T P.50 [34] were used for these tests; P.50 consists of speech samples in several languages, with 8 female and 8 male voices per language. In the rest of this report these files are denoted by an abbreviation of the language followed by the gender and the file number; for example, the fourth Swedish male is denoted swe_m4. For these tests Swedish was used, with all 8 male and 8 female voices. The samples were first saved as 16-bit Windows PCM wave. The samples were then copied and distorted with different impairments according to table 4.4. In almost all real transmissions the speech passes through a codec. To simulate this, all the distorted samples were saved as ITU-T G.711 A-law wave. G.711 [4] is an 8-bit, waveform-preserving codec using logarithmic (A-law) companding, used in PSTN networks. The samples then need to be resaved as 16-bit PCM wave to work with the PESQ-algorithm. To examine the files with the PESQ-algorithm, the software "Voice Quality Testing" (VQT) from GL Communications Inc.® was used. Figure 4.1 shows a screen dump of the VQT main screen, viewing 2-D representations of the two sensation surfaces and the error surface. Figure 4.2 shows the 3-D sensation surface for a degraded file; the x-axis shows time, the y-axis frequency and the z-axis loudness. 15 cases plus a reference case were tested (table 4.4). Figure 4.1. Screen dump of the GL VQT software. Figure 4.2. The Sensation Surface for a degraded file.

Table 4.4. Overview of the different test cases.
Test case   Impairment                        Used file
Reference   G.711                             multiple
1           Delay                             swe_m3, swe_f4
2           Delay with changed level          swe_f4
3           Pink noise                        swe_m3
4           White noise                       swe_m3
5           400Hz tone                        swe_m3
6           1200Hz tone                       swe_m3
7           Packet losses                     swe_m3, swe_f1
8           Packet losses with concealment    swe_m3, swe_f1
9           Noise pulses                      swe_m3
10          Different speech                  swe_m1, swe_f4
11          Delay and pink noise              multiple
12          Delay and 400Hz tone              multiple
13          Delay and 1200Hz tone             multiple
14          Delay and packet loss             swe_f1
15          Delay and noise pulses            swe_m3

Reference case, G.711 The following files were used to investigate how much the G.711-codec itself lowered the quality: swe_m1, swe_m2, swe_m3, swe_m5, swe_m7, swe_m8, swe_f1, swe_f2, swe_f3, swe_f4, swe_f5, swe_f6, swe_f8, Ru_m1, Ru_f4, B_eng_f7, B_eng_m2, Fr_m6, Fr_f2.

Case 1 Two identical files were mixed together to represent the degraded file. One of the files was delayed relative to the other before mixing, simulating what a pilot may hear in his headset. The tested delays (ms) were: 0,1 0,25 0,5 0,75 1,0 1,5 2,0 2,5 3,0 4,0 5,0 6,0 7,5 10 12,5 15 20 50.

Case 2 As case 1, but the level of the delayed file was first attenuated by 3dB and then by 6dB before mixing.

Case 3 Speech and pink noise were mixed together and investigated; the SNR was the parameter that was altered. SNR(dB) = speech average RMS power (dB) - noise average RMS power (dB). Tested SNRs (dB): -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.

Case 4 As case 3, but with white noise instead.

Case 5 A speech file and a sine tone were mixed together. The tone was at 400Hz and different SNRs were tested. SNR(dB) = signal average RMS power (dB) - tone average RMS power (dB). Tested SNRs (dB): -25 -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.

Case 6 As case 5, but with the tone at 1200Hz instead.
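The delay-and-mix degradation of cases 1 and 2 can be sketched as below; the function name and the choice not to renormalize the mixed signal are illustrative assumptions, not the exact processing chain used for the thesis files:

```python
def climax_mix(speech, delay_ms, attenuation_db=0.0, fs=8000):
    # Simulate the Case 1/2 degradation: mix a speech signal with a
    # delayed (and optionally attenuated) copy of itself, as a pilot
    # would hear two Climax carriers. Output has the input's length;
    # in practice the sum should also be scaled to avoid clipping.
    shift = round(delay_ms * fs / 1000.0)
    gain = 10.0 ** (-attenuation_db / 20.0)
    delayed = [0.0] * shift + list(speech[:len(speech) - shift])
    return [s + gain * d for s, d in zip(speech, delayed)]
```

For example, a 0,25ms delay at 8kHz corresponds to a two-sample shift, and a 6dB attenuation scales the delayed copy by roughly 0,5.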
Case 7 Speech with simulated packet losses. Each packet contained 20ms of speech, and every lost packet was replaced with silence. The losses were evenly spread throughout the sentence. Two sentences were tested, and the loss densities (s-1) were: 0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1%), 0,7 (1,4%), 1 (2%), 2 (4%), 4 (8%), 10 (20%), 20 (40%). The number in brackets is how much of the sentence that was lost.

Case 8 As case 7, but the lost packets were replaced by the preceding packet. Each packet contained 20ms of speech, and the tested loss densities (s-1) were: 0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1%), 0,7 (1,4%), 1 (2%). The number in brackets is how much of the whole sentence that was lost.

Case 9 Speech with noise pulses that replaced the speech. Different pulse lengths, pulse levels and pulse densities were tested. The pulses were evenly spread over the sentence, but their locations were chosen where there was speech and not silence. The same locations in the sentence were used for all pulse lengths and levels. SNR(dB) = peak value of speech (dB) - peak value of noise (dB). Tested SNRs (dB): -6 -3 0 3 6 9 12. Tested pulse densities (s-1): 0,1 0,2 0,3 0,5 0,7 1,0 2,0 4,0. Tested pulse lengths: a = 1ms, b = 3ms, c = 5ms, d = 7ms, e = 10ms, f = 15ms, g = 20ms.

Table 4.5. How much (in %) of the sentence that was noise.
              Pulse density (s-1)
Pulse length  0,1    0,2    0,3    0,5    0,7    1,0    2,0    4,0
a (1ms)       0,01   0,02   0,03   0,05   0,07   0,1    0,2    0,4
b (3ms)       0,03   0,06   0,09   0,15   0,21   0,3    0,6    1,2
c (5ms)       0,05   0,1    0,15   0,25   0,35   0,5    1,0    2,0
d (7ms)       0,07   0,14   0,21   0,35   0,49   0,7    1,4    2,8
e (10ms)      0,1    0,2    0,3    0,5    0,7    1,0    2,0    4,0
f (15ms)      0,15   0,3    0,45   0,75   1,05   1,5    3,0    6,0
g (20ms)      0,2    0,4    0,6    1,0    1,4    2,0    4,0    8,0

Case 10 Two different sentences were mixed together and tested. The file swe_m1 kept its level throughout the whole test, while the level of swe_f4 was changed to test different SNRs.
SNR(dB) = swe_m1 average RMS power (dB) - swe_f4 average RMS power (dB). Tested SNRs (dB): -15 -10 -6 -5 -3 -2,5 0 2,5 3 5 6 10 15 20 25 30 35 40.

Case 11 Combinations of added delay and pink noise were investigated. SNRs (dB): -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40. The tested delays and files were: a, 0,1ms, swe_m3; b, 0,25ms, swe_f5; c, 0,5ms, swe_m7; d, 1,0ms, swe_f8; e, 4,0ms, swe_m5; f, 10ms, swe_f3; g, 20ms, swe_m2.

Case 12 Combinations of added delay and a 400Hz sine tone were tested. SNRs (dB): -25 -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40. The same delays and files as in case 11 were used (a-g).

Case 13 As case 12, but with a 1200Hz sine tone. The same SNRs, delays and files were used (a-g).

Case 14 Speech with packet losses and added delay. Packets of 20ms were lost; the delayed file did not contain any packet losses. Delays (ms): 0,10 0,25 0,50 1,0 4,0 10. The delayed file was tested at three levels: the same as the original file, attenuated 3dB, and attenuated 6dB. Tested loss densities (s-1): 0,5 (1%), 1,0 (2%), 2,0 (4%). The number in brackets is how much of the sentence that was lost.

Case 15 Combinations of added delay and 20ms noise pulses. The noise pulses were evenly spread throughout the sentence and added on top of the speech. Delays (ms): 0,1 0,25 0,5 1,0 4,0 10. Pulse densities (s-1): 0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1,0%), 0,7 (1,4%), 1,0 (2,0%), 2,0 (4,0%), 4,0 (8,0%), 10 (20%), 20 (40%).
The number in brackets is the share of the whole sentence that was noise.

4.3 Subjective measurements

To enable a comparison with the objective results, a subjective test was performed based on the Absolute Category Rating (ACR) method [6]. 19 of the voice files investigated by the VQT were selected for the test; a copy of the score sheet is found in appendix 3. The subjects were instructed to listen to the files and grade them according to table 2.1. The results were collected and an average MOS was calculated for each file by taking the sum of all subjective scores and dividing it by the number of subjects. The files used are shown in table 4.6.

Table 4.6. Overview of the test files used in the subjective test.

File | Impairment | Talker
1 | Delay 10 ms | swe_f2
2 | Delay 10 ms | swe_m8
3 | Delay 10 ms, delayed file attenuated 6 dB | swe_f4
4 | Delay 10 ms and pink noise, SNR=25 dB | swe_f3
5 | Delay 10 ms and a 400 Hz tone, SNR=20 dB | swe_f3
6 | Delay 10 ms and a 1200 Hz tone, SNR=20 dB | swe_f8
7 | Tone at 400 Hz, SNR=20 dB | swe_f5
8 | Tone at 1200 Hz, SNR=20 dB | swe_m1
9 | Packet loss 1%, lost packets = silence | swe_m7
10 | Packet loss 1%, lost packets = preceding packet | swe_f1
11 | White noise, SNR=25 dB | swe_f7
12 | Pink noise, SNR=25 dB | swe_m6
13 | Speech over speech, SNR=25 dB | swe_m5(f3)
14 | Noise pulses 1 ms, 0 dB, 0,05% noise | swe_m4
15 | Noise pulses 3 ms, 0 dB, 0,15% noise | swe_f6
16 | Noise pulses 3 ms, 0 dB, 1,2% noise | swe_m3
17 | Delay 50 ms | swe_f6
18 | Pink noise, SNR=10 dB | swe_m1
19 | No impairments | swe_m5

To avoid any comparison between similar files, the file numbers were shuffled before testing. The score sheet with instructions and the voice files were e-mailed to colleagues and friends. For those subjects with a file size limitation on their e-mail inbox, the files were compressed to 70 kbit/s MP3 format. The compression caused a small, almost unnoticeable quality degradation. This was not a standardized way to perform a subjective quality test, so the results must be treated very carefully.
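The per-file averaging described above reduces to the arithmetic mean of the 1-5 ACR scores, together with a standard deviation as reported in table 5.2. A minimal sketch; the use of the sample (n − 1) standard deviation is an assumption, since the thesis does not state which estimator was used:

```python
def mean_opinion_score(ratings):
    """ACR MOS: sum of the subjects' 1-5 scores divided by the number of subjects."""
    return sum(ratings) / len(ratings)

def std_deviation(ratings):
    """Sample standard deviation of the subjective scores (n - 1 in the denominator,
    an assumption about the estimator used in table 5.2)."""
    m = mean_opinion_score(ratings)
    return (sum((r - m) ** 2 for r in ratings) / (len(ratings) - 1)) ** 0.5
```

For example, four subjects rating a file 3, 4, 5 and 4 give a MOS of 4,0.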
The ITU recommendation P.800 [6] describes how a subjective test should be performed; it should be carried out under controlled conditions and involve many subjects.

5 Result

5.1 PESQ-verification

5.1.1 Test with neglected impairments

Table 5.1. Overview of the test cases in the PESQ-verification.

Case | Processing
1 | Added silence
2 | Changed level
3 | Combination of case 1 and 2
4 | No processing

The 12 speech files used were: swe_f2, swe_f7, swe_m5, swe_m8, B_eng_f1, B_eng_f3, B_eng_m3, B_eng_m6, Ru_f2, Ru_f5, Ru_m1, Ru_m7.

Figure 5.1. MOS-LQO for 12 different speech files with silence added in the beginning.
Figure 5.2. MOS-LQO for 12 different speech files with changed level of the degraded file.
Figure 5.3. MOS-LQO for 12 different speech files with both added silence and changed level.
Figure 5.4. MOS-LQO for 12 different speech files, unprocessed.

5.1.2 Conformance test P.862, Amendment 2

All the file pairs ended up within the test limits. In two cases (marked with * in appendix 2) the difference was 0,01; for the remaining 38 pairs no difference was found. The complete test sheet is attached in appendix 2.

5.2 Objective measurements

Reference G.711

Figure 5.5. MOS-LQO for 19 speech files processed by the G.711 codec.
Figure 5.6. Difference in MOS-LQO, for 19 files, from the maximum score 4,55.
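The ceiling of 4,55 in figure 5.6 is the upper end of the MOS-LQO scale: raw PESQ scores (at most 4,5) are transformed to MOS-LQO through the logistic mapping of ITU-T P.862.1 [14]. A short sketch of that mapping:

```python
import math

def pesq_to_mos_lqo(raw_pesq):
    """ITU-T P.862.1 mapping from a raw PESQ score to MOS-LQO."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))
```

A raw score of 4,5 maps to about 4,55, which is the maximum used as reference in figure 5.6.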
Case 1

Figure 5.7. MOS-LQO for two speech samples (swe_m3, swe_f4) with different delays added.
Figure 5.8. As figure 5.7 for delays 0-10 ms.

Case 2

Figure 5.9. MOS-LQO for speech samples (swe_f4) where the level of the delayed sample is attenuated 3 dB and 6 dB.
Figure 5.10. As figure 5.9 for delays 0-10 ms.

Case 3 and 4

Figure 5.11. MOS-LQO for two speech samples with noise (pink and white) added.

Case 5 and 6

Figure 5.12. MOS-LQO for speech samples with a sine tone (400 Hz and 1200 Hz) added.

Case 7

Figure 5.13. MOS-LQO for two speech samples (swe_m3, swe_f1) with lost packets.
Figure 5.14. As figure 5.13 for losses 0-2%.

Case 8

Figure 5.15. MOS-LQO for two speech samples (swe_m3, swe_f1) with packet losses replaced with the preceding packet.
Figure 5.16. Comparison of different replacements for packet losses (silence, preceding packet, and noise pulses from case 9g at -6 dB).
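The two loss patterns compared in figure 5.16 (silence versus the preceding packet, cases 7 and 8) can be sketched as follows. This is an illustrative NumPy reconstruction assuming evenly spread 20 ms packets; `apply_packet_loss` and its parameters are hypothetical names, not the software actually used in the measurements:

```python
import numpy as np

def apply_packet_loss(speech, fs, loss_density, mode="silence", packet_ms=20):
    """Drop evenly spread 20 ms packets, replacing each with silence (case 7)
    or with a copy of the preceding packet (case 8)."""
    out = speech.astype(float).copy()
    packet_len = int(fs * packet_ms / 1000)            # samples per packet
    duration_s = len(speech) / fs
    n_losses = max(1, round(loss_density * duration_s))  # losses/s times length
    # spread the lost packets evenly over the sentence
    starts = np.linspace(0, len(speech) - packet_len, n_losses).astype(int)
    for s in starts:
        if mode == "silence":
            out[s:s + packet_len] = 0.0
        else:  # "preceding": repeat the packet just before the loss
            src = max(0, s - packet_len)
            out[s:s + packet_len] = out[src:src + packet_len]
    return out
```

For a 1 s file at 8 kHz and a loss density of 1 s⁻¹, one 160-sample packet (2% of the sentence) is affected, matching the densities listed in section 4.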
Case 9

Figure 5.17. MOS-LQO for speech samples with added noise pulses of SNR = -6 dB and different lengths (1-20 ms).
Figure 5.18. MOS-LQO for speech samples with added noise pulses of SNR = 0 dB and different lengths.
Figure 5.19. MOS-LQO for speech samples with added noise pulses of SNR = +6 dB and different lengths.
Figure 5.20. MOS-LQO for speech samples with added noise pulses of SNR = +12 dB and different lengths.

The results for all SNRs are shown in appendix 4.

Case 10

Figure 5.21. MOS-LQO for a speech sample with a different speech sample added.
Figure 5.22. Comparison of five different additive impairments (400 Hz tone, 1200 Hz tone, white noise, pink noise and speech).

Case 11

Figure 5.23. MOS-LQO for different delays (0-20 ms) at varying SNR (pink noise added).

Case 12

Figure 5.24. MOS-LQO for different delays at varying SNR (400 Hz tone added).
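The noise and tone conditions in cases 10-13 require scaling the interferer to a target SNR before mixing, using the average-RMS-power definition of SNR given in section 4. A minimal NumPy sketch; `mix_at_snr` is an illustrative name, not the actual processing tool:

```python
import numpy as np

def mix_at_snr(speech, interferer, snr_db):
    """Scale `interferer` so that the average RMS power of the speech exceeds
    that of the interferer by `snr_db` dB, then add it to the speech."""
    n = min(len(speech), len(interferer))
    speech, interferer = speech[:n], interferer[:n]
    rms_s = np.sqrt(np.mean(speech ** 2))
    rms_i = np.sqrt(np.mean(interferer ** 2))
    # gain that places the interferer snr_db below the speech RMS level
    gain = (rms_s / rms_i) * 10 ** (-snr_db / 20)
    return speech + gain * interferer
```

At 20 dB SNR with equal-level inputs, the interferer is attenuated by a factor of 10 before mixing.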
Case 13

Figure 5.25. MOS-LQO for different delays at varying SNR (1200 Hz tone added).

Case 14

Figure 5.27. MOS-LQO for speech samples with added delay at 0 dB and packet losses.
Figure 5.28. MOS-LQO for speech samples with added delay at -3 dB and packet losses.
Figure 5.29. MOS-LQO for speech samples with added delay at -6 dB and packet losses.

Case 15

Figure 5.30. MOS-LQO for speech samples with added delay and noise pulses of 20 ms.
Figure 5.31. As figure 5.30 for delays 0-1 ms.

5.3 Subjective measurements

The MOS-LQO column shows the results produced by the PESQ software, the MOS-LQS column shows the mean score of the 31 subjects used in this experiment, and the rightmost column shows the standard deviation of the subjective results. The complete result can be viewed in appendix 5.

Table 5.2. The result of the subjective measurement.
File | Impairment | MOS-LQO | MOS-LQS | Standard deviation
1 | Delay 10 ms | 2,65 | 3,06 | 0,93
2 | Delay 10 ms | 2,63 | 1,97 | 0,75
3 | Attenuation -6 dB and delay 10 ms | 3,69 | 3,39 | 0,92
4 | Delay 10 ms and pink noise, SNR=25 dB | 2,42 | 2,55 | 0,72
5 | Delay 10 ms and a 400 Hz tone, SNR=20 dB | 2,59 | 2,42 | 0,96
6 | Delay 10 ms and a 1200 Hz tone, SNR=20 dB | 2,23 | 2,03 | 0,91
7 | Tone at 400 Hz, SNR=20 dB | 3,31 | 2,81 | 0,87
8 | Tone at 1200 Hz, SNR=20 dB | 3,21 | 2,71 | 1,01
9 | Packet losses 1%, silence | 3,54 | 2,48 | 0,72
10 | Packet losses 1%, preceding packet | 3,56 | 3,35 | 1,02
11 | White noise, SNR=25 dB | 3,22 | 2,45 | 0,85
12 | Pink noise, SNR=25 dB | 3,59 | 3,42 | 0,92
13 | Speech over speech, SNR=25 dB | 3,74 | 3,26 | 0,93
14 | Noise pulses 1 ms, 0 dB, 0,05% noise | 3,80 | 3,45 | 0,93
15 | Noise pulses 3 ms, 0 dB, 0,15% noise | 3,27 | 3,16 | 0,97
16 | Noise pulses 3 ms, 0 dB, 1,20% noise | 1,86 | 1,84 | 0,86
17 | Delay 50 ms | 1,47 | 1,48 | 0,63
18 | Pink noise, SNR=10 dB | 1,92 | 1,94 | 0,93
19 | No impairments | 4,49 | 4,35 | 0,71

6 Conclusion and Discussion

6.1 PESQ-verification

The PESQ algorithm worked as expected in all four verification cases. The quality drops for some files in cases 2 and 3 were caused by the high level of the speech: some of the speech had reached the point where amplitude clipping occurred, with lower quality as a result. There are still features of the algorithm that need to be investigated; for example, the time alignment function should be tested further to examine how the algorithm deals with delays in the middle of sentences.

The algorithm passed the second case of the conformance test. The results of all but two file pairs agreed, after rounding, with the reference scores. The two remaining pairs were still within the error limit, and the 0,01 difference was probably due to rounding inaccuracy.

6.2 Objective measurements

Reference G.711

The G.711 codec lowered the quality by up to 0,15 on the MOS-LQO scale. This is quite low; the PESQ developers state a drop of about 0,30 for an analog G.711 network [13].
The difference may stem from the fact that they used real measurements, with more tightly controlled speech files, and that they measured over an entire network, not just the codec itself.

Case 1

The delay limit proposed by Eurocae is 10 ms; this simulation showed a MOS-LQO of about 2,8 at that delay. 2,8 is considered acceptable, but one should remember that in real life more degrading factors than delay alone are likely to occur.

Case 2

In the cases where the delayed file was attenuated, the simulations showed the expected increase in quality. For the 10 ms case the quality was about 0,8 higher than when both files were at equal level. This corresponds to real life, where one file usually has a lower level than the other. The 7,5 ms and 12,5 ms delays showed unexpectedly lower quality, an indication that multiple tests should be performed to avoid random errors.

Case 3 and 4

Adding noise to the files produced smooth, regular curves. White noise gave a somewhat flatter curve than pink noise, but the difference was small over the whole SNR range. At very low SNRs the quality was almost constant, indicating the need to keep the SNR above a certain limit.

Case 5 and 6

When a tone was added instead of noise, the results were somewhat different. The curves were flatter, and at 0 dB the quality difference was almost 1,0 in favor of the cases with an added tone. Some remarkable results were obtained at SNRs below 0 dB: the quality started to increase again, and at -20 dB it had reached the same level as at about +10 dB. At very low SNRs (-25 dB) the quality was lower, as expected, although still about 0,7 higher than for added noise. The reason for the quality increase at low SNRs is a case for further investigation.

Case 7

Simulating lost packets or frames showed the expected results. Ref. [35] shows a MOS of about 3,6 at 1% loss, which correlates well with these results.
One should remember that the content of a lost packet, speech or silence, is very important. The PESQ algorithm considers whether the content is speech or silence, but it is unclear how well it can handle the varying importance of different parts of the speech.

Case 8

Replacing the simulated packet losses with the preceding packet instead of silence gave somewhat higher quality, an indication that error concealment is useful. Replacing the losses with noise should, on the other hand, be avoided.

Case 9

As the level of the pulses decreased, the length of the pulses became more important. At -6 dB the quality difference was about 0,7 at the most, while it was about 1,8 at low noise levels. For most of the cases where 1% of the sentence was noise, the quality was around 3,0 for all pulse lengths, making noise worse than silence or the preceding packet as error concealment.

Case 10

Letting another human voice impair the original speech gave quality results lower than in the cases with an added tone. On the other hand, the quality was somewhat higher than for noisy speech at SNRs of 0-30 dB. These results should be treated carefully, as the PESQ algorithm has not been validated for cases with multiple talkers [12].

Case 11

At low SNRs the amount of delay was negligible; the noise made the quality too poor for the delay to make any difference. At higher SNRs the delay had more influence on the quality, and at 40 dB the result was comparable with case 1, where no noise was added.

Case 12

Adding a 400 Hz tone to the delayed files gave curves with the same shape as without delay. Noticeably, below 0 dB the results became more erratic and any trend was hard to discern.

Case 13

Changing to a 1200 Hz tone showed no surprising results. The curves agreed better with each other than for the 400 Hz tone, indicating that the delay had less influence in the 1200 Hz case than in the 400 Hz case.

Case 14

More packet losses led to lower quality, just as expected.
Adding more delay gave less consistent results, especially for the 1 ms and 4 ms delays, where the quality tended to rise. The level change of the delayed file gave the expected results, even though the difference was smaller than in case 2. The conclusion is that packet losses are more degrading than delays, even when the delayed file has the same level as the original sample.

Case 15

The results indicated once again that noise is the worst way to replace a lost packet; for the 1% case the score was around 2,5. As for many other cases, some results deviated from the expected trend, which indicates that further investigation is desired.

Cases 11-15 consisted of combinations of delays and other impairments. The objective was to investigate whether any combination showed remarkable results and whether any impairment was more important to avoid. None of them showed any surprising results; the resulting quality score was about what one obtains by adding up the individual degradations from the measurements where only one impairment was considered.

6.3 Subjective measurements

As mentioned earlier, the results from the subjective measurement should be treated very carefully. With this in mind, the comparison shows well correlated results; in only three cases is the difference higher than 0,5. Files 1 and 2 differ considerably; the reason is probably that the male voice is much more slurred and not as well articulated as the female voice. Looking at files 9 and 10, PESQ scored them almost the same, while the subjective results show that the error concealment simulation is perceived as almost 1,0 higher than silence. Comparing this with file 16 shows that both PESQ and the subjects agree that noise pulses should be avoided when more than 1% of the speech is impaired.
For the cases with added background noise, white noise was the worst according to the subjects, about 1,0 worse than pink noise, which is a somewhat larger difference than PESQ shows. Using different speech as the impairment is somewhat worse than pink noise according to the subjects, while PESQ grades it higher than when noise or tones are added. Finally, some cases (16-19) show very well correlated results; the reason is probably that these results lie near the ends of the quality scale.

The standard deviation is almost 1 for most of the cases, indicating that the individual results differ considerably. On the other hand, that is what one can expect when dealing with people with different opinions about quality.

6.4 Error sources

One of the most obvious sources of error is that the original input speech files do not completely fulfill the requirements on input files according to ITU-T P.862 [12]; the level and filtering, for example, should be controlled. To be able to fully compare the results, the same speech file should be used in every test. As seen in many cases, some of the scores differ considerably from the expected ones; multiple measurements of the same case should be made to avoid this.

For the subjective measurement a number of errors can occur. Letting the subjects perform the test on their own, using different equipment, contributes to the uncertainty and to the deviation between individual scores. Some of this uncertainty could be avoided with more detailed instructions, for example in case 2, where the subjects might have taken the actual voice into consideration. The compression of the files to MP3 format did lower the quality, and even though the author had difficulty hearing the difference, it might have lowered the subjects' scores.

6.5 The GL-VQT Software®

The test of the software in which the PESQ algorithm is implemented gave satisfying results.
There are still functions that need to be investigated; in this work only the manual function has been used, with manually made files simulating some of the impairments that can occur in a network. For further investigation, tests should be performed on a real network, using both real speech and, for comparison, speech files similar to those used in this work. To fully validate the software there is a conformance test in ITU-T P.862, Amendment 2 [36]; it comes with about 1800 file pairs with known PESQ scores for testing and confirmation of the algorithm's functions.

This tool is a good choice if manual offline measurements are desired. There is no need to bring the software out into the field: just bring a high quality recorder, record the transmitted speech and investigate the quality later. To get a more complete understanding of the overall speech quality, PESQ should be used together with instruments measuring impairments such as delay, echo and other factors for which PESQ has not been validated.

6.6 Additional measurements

For the Climax case, with delayed speech samples, it should be investigated how additional speech influences the quality. In this work only one delayed file has been investigated, but it is possible that two or even more voice samples can impair the radio transmission. An ordinary transmission is usually degraded by many impairments; here only combinations of at most two impairments have been tested, and it is necessary to continue the investigation by combining more impairments. It is also important to perform tests in real networks with real transmissions. Only files simulating the real world have been investigated, and it is necessary to compare them with the real thing and investigate the difference.

References

[1] http://www.eurocae.org/, "General Description".
[2] Hardman, Dennis, Noise and Voice Quality in VoIP Environments, Agilent Technologies, Inc., White Paper, 2003.
[3] Pracht, S., Hardman, D.,
Voice Quality in Converging Telephony and IP Networks, Agilent Technologies, Inc., White Paper, 2000-2001.
[4] ITU-T Rec. G.711, Pulse Code Modulation (PCM) of voice frequencies, Nov. 1988.
[5] Anderson, John, Methods for Measuring Perceptual Speech Quality, Agilent Technologies, Inc., White Paper, 2000-2001.
[6] ITU-T Rec. P.800, Methods for subjective determination of transmission quality, Aug. 1996.
[7] ITU-T Rec. P.800.1, Mean opinion score (MOS) terminology, Jul. 2006.
[8] ITU-T Rec. P.810, Modulated noise reference unit (MNRU), International Telecommunication Union, Geneva, Switzerland, Feb. 1996.
[9] Ericsson AB, Speech Quality Index in CDMA2000, Figure 1 in EAB06:009546 Uen Rev B, Technical Paper, May 2006.
[10] Hardy, William C., VoIP Service Quality, McGraw-Hill, Blacklick, OH, USA, 2003, ISBN 0-07-141076-7.
[11] ITU-T Rec. P.861, Objective quality measurement of telephone band (300-3400 Hz) speech codecs, Feb. 1998.
[12] ITU-T Rec. P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, Feb. 2001.
[13] Psytechnics Limited, PESQ – Product description, www.psytechnics.com, Nov. 2002.
[14] ITU-T Rec. P.862.1, Mapping function for transforming P.862 raw result scores to MOS-LQO, Nov. 2003.
[15] ITU-T Rec. G.107, The E-model, a computational model for use in transmission planning, Mar. 2005.
[16] ITU-T Rec. P.563, Single-ended method for objective speech quality assessment in narrow-band telephony applications, International Telecommunication Union, Geneva, Switzerland, May 2004.
[17] Li, F.F., Speech Intelligibility of VoIP to PSTN Interworking – A Key Index for the QoS, Department of Computing and Mathematics, Manchester Metropolitan University, UK, 2004.
[18] Beerends, J.G., van Wijngaarden, S., van Buuren, R.,
Extension of ITU-T Recommendation P.862 PESQ towards Measuring Speech Intelligibility with Vocoders, NATO Research & Technology Organisation, New Directions for Improving Audio Effectiveness (pp. 10-1 – 10-6), RTO-MP-HFM-123, 2005.
[19] Liu, W.M., Jellyman, K.A., Mason, J.S.D., Evans, N.W.D., Assessment of Objective Quality Measures for Speech Intelligibility Estimation, School of Engineering, University of Wales Swansea, 2006.
[20] Beerends, J.G., Larsen, E., Nandini, I., van Vugt, J.M., Measurement of Speech Intelligibility based on the PESQ Approach.
[21] ISO/TR 4870:1991, Acoustics – The construction and calibration of speech intelligibility tests.
[22] ANSI S3.2-1989, Method for Measuring the Intelligibility of Speech over Communication Systems.
[23] IEC International Standard, IEC 60268-16, 3rd edition, 2003.
[24] ANSI S3.5-1997, Methods for Calculation of the SII.
[25] Chernick, C.M., Leigh, S., Mills, K.L., Toense, R., Testing the Ability of Speech Recognizers to Measure the Effectiveness of Encoding Algorithms for Digital Speech Transmission, IEEE Int. Military Comm. Conf. (MILCOM), 1999.
[26] ITU-T Study Group 12, Voice Quality Assessment, information slides, ETSI STQ Workshop, Taiwan, 13 February 2006.
[27] Rihacek, Christoph, B-VHF Reference Environment, Project Report D08, Broadband VHF Aeronautical Communications System based on MC-CDMA, Ref: 04A02 E515.10, 2005.
[28] ICAO, Annex 10, Volume III, Attachment A to Part II, Guidance material for communication systems, Nov. 1997.
[29] Eurocae Climax recommendations, Working Group 67, SG1, 2006.
[30] Nilsson, Alf, Latency in the Radio Network of the Swedish CAA, Ver. 01.01, Saab Communication, 30 January 2006.
[31] Tan, X-C., Wänstedt, S., Heikkilä, G., Experiments and Modeling of Perceived Speech Quality of Long Samples, AWARE @ Ericsson Research, Luleå, Sweden, 2001.
[32] Hoene, C., Rathke, B., Wolisz, A.,
On the Importance of a VoIP Packet, Technical University of Berlin, 2003.
[33] http://www.bluetooth.com/Bluetooth/Learn/Basics/
[34] ITU-T Rec. P.50, Artificial Voices, Sep. 1999.
[35] Ding, Lijing, Goubran, Rafik A., Assessment of Effects of Packet Loss on Speech Quality in VoIP, Department of Systems and Computer Engineering, Carleton University, Canada, 2003.
[36] ITU-T Rec. P.862, Amendment 2, Revised Annex A – Reference implementations and conformance testing for ITU-T Recs P.862, P.862.1 and P.862.2, International Telecommunication Union, Nov. 2005.

Appendix 1

Glossary

ACR – Absolute Category Rating
ATC – Air Traffic Controller
ATM – Air Traffic Management
CCR – Comparison Category Rating
Climax – Offset Carrier System
dBov – the decibel value relative to the overload point of a digital system
DCR – Degradation Category Rating
DRT – Diagnostic Rhyme Test
Eurocae – European Organization of Civil Aviation Equipment
FEC – Front End Clipping
FFT – Fast Fourier Transform
ITU – International Telecommunication Union
MNRU – Modulated Noise Reference Unit
MOS – Mean Opinion Score
MOS-LQO – Mean Opinion Score – Listening Quality Objective
MOS-LQS – Mean Opinion Score – Listening Quality Subjective
MRT – Modified Rhyme Test
ms – millisecond
MTF – Modulation Transfer Function
PAMS – Perceptual Analysis Measurement System
PCM – Pulse Code Modulation
PESQ – Perceptual Evaluation of Speech Quality
PSQM – Perceptual Speech Quality Measure
PSTN – Public Switched Telephony Network
RTP – Real-time Transport Protocol
SII – Speech Intelligibility Index
SNR – Signal-to-Noise Ratio
STI – Speech Transmission Index
VAD – Voice Activity Detector
VoIP – Voice over IP
VQT – Voice Quality Tester

Appendix 2

ITU-T P.862, Amendment 2.
Conformance test 2(b). All file pairs were sampled at 8000 Hz.

Reference | Degraded | PESQ_score | VQT score
voip/or105.wav | voip/dg105.wav | 2.237 | 2,24
voip/or109.wav | voip/dg109.wav | 3.180 | 3,18
voip/or114.wav | voip/dg114.wav | 2.147 | 2,15
voip/or129.wav | voip/dg129.wav | 2.680 | 2,68
voip/or134.wav | voip/dg134.wav | 2.365 | 2,36*
voip/or137.wav | voip/dg137.wav | 3.670 | 3,67
voip/or145.wav | voip/dg145.wav | 3.016 | 3,02
voip/or149.wav | voip/dg149.wav | 2.558 | 2,56
voip/or152.wav | voip/dg152.wav | 2.768 | 2,77
voip/or154.wav | voip/dg154.wav | 2.694 | 2,69
voip/or155.wav | voip/dg155.wav | 2.606 | 2,61
voip/or161.wav | voip/dg161.wav | 2.608 | 2,61
voip/or164.wav | voip/dg164.wav | 2.850 | 2,85
voip/or166.wav | voip/dg166.wav | 2.527 | 2,53
voip/or170.wav | voip/dg170.wav | 2.452 | 2,45
voip/or179.wav | voip/dg179.wav | 1.828 | 1,83
voip/or221.wav | voip/dg221.wav | 2.774 | 2,77
voip/or229.wav | voip/dg229.wav | 2.940 | 2,94
voip/or246.wav | voip/dg246.wav | 2.205 | 2,20*
voip/or272.wav | voip/dg272.wav | 3.288 | 3,29
voip/u_am1s01.wav | voip/u_am1s01b1c1.wav | 3.483 | 3,48
voip/u_am1s01.wav | voip/u_am1s01b1c7.wav | 2.420 | 2,42
voip/u_am1s02.wav | voip/u_am1s02b1c9.wav | 4.042 | 4,04
voip/u_am1s01.wav | voip/u_am1s01b1c15.wav | 3.179 | 3,18
voip/u_am1s03.wav | voip/u_am1s03b1c16.wav | 2.872 | 2,87
voip/u_am1s03.wav | voip/u_am1s03b1c18.wav | 2.806 | 2,81
voip/u_am1s01.wav | voip/u_am1s01b2c1.wav | 4.300 | 4,30
voip/u_am1s02.wav | voip/u_am1s02b2c4.wav | 3.634 | 3,63
voip/u_am1s02.wav | voip/u_am1s02b2c5.wav | 3.369 | 3,37
voip/u_am1s03.wav | voip/u_am1s03b2c5.wav | 3.911 | 3,91
voip/u_am1s03.wav | voip/u_am1s03b2c6.wav | 2.905 | 2,91
voip/u_am1s03.wav | voip/u_am1s03b2c7.wav | 3.579 | 3,58
voip/u_am1s01.wav | voip/u_am1s01b2c8.wav | 2.198 | 2,20
voip/u_am1s03.wav | voip/u_am1s03b2c11.wav | 3.276 | 3,28
voip/u_am1s02.wav | voip/u_am1s02b2c14.wav | 3.316 | 3,32
voip/u_af1s01.wav | voip/u_af1s01b2c16.wav | 3.307 | 3,31
voip/u_af1s03.wav | voip/u_af1s03b2c16.wav | 3.592 | 3,59
voip/u_af1s02.wav | voip/u_af1s02b2c17.wav | 2.614 | 2,61
voip/u_af1s03.wav | voip/u_af1s03b2c17.wav | 2.806 | 2,81
voip/u_am1s03.wav | voip/u_am1s03b2c18.wav | 2.540 | 2,54

Appendix 3

Subjective test of speech quality

Part of a master's thesis at Saab Communication in Arboga by Alexander Storm, March 2007.

This is a speech quality test in which you are asked to give your opinion on the quality of recorded speech. The test consists of 19 sound files of 7-9 seconds each. You should listen to them and give each file a quality score. The whole test should be done in one sitting, in as undisturbed an environment as possible, without influence from any other person. It takes 5-10 minutes. Headphones are preferred, but speakers are acceptable if headphones are not available.

First, listen to a reference file to set the volume of your headphones/speakers. The level should be comfortable, and what is said in the file should be heard clearly and distinctly with as little effort as possible. The reference file may be listened to any number of times, while the other files should be listened to only once before scoring. After the test, e-mail this document, filled in, back to me. All information will be treated confidentially.

Scores are given from 1 to 5 as follows (whole numbers only):

Speech quality | Score
Excellent | 5
Good | 4
Fair | 3
Poor | 2
Bad | 1

Thank you very much for your help!
/Alexander

The test

I used (mark with an X): headphones:___ speakers:___
Gender: male:___ female:___
Age: 0-30:___ 31-50:___ 51-:___

Scores: File 1:____ File 2:____ File 3:____ File 4:____ File 5:____ File 6:____ File 7:____ File 8:____ File 9:____ File 10:____ File 11:____ File 12:____ File 13:____ File 14:____ File 15:____ File 16:____ File 17:____ File 18:____ File 19:____

Appendix 4

Results for case 9 of the objective measurements.

The figures show MOS-LQO versus noise pulse density (s⁻¹) for pulse lengths 1, 3, 5, 7, 10, 15 and 20 ms, one panel per SNR: -6 dB, -3 dB, 0 dB, +3 dB, +6 dB, +9 dB and +12 dB.

Appendix 5

The MOS-LQS result of the subjective measurement.

Gender: M = male, F = female. Age: 1 = 0-30, 2 = 31-50, 3 = 51-. H/S: H = headphones, S = speakers (used for listening). The table shows the file numbers after the file mixing.

MOS-LQS (with standard deviation in brackets) per mixed file number 1-19: 3,16 (0,97); 3,26 (0,93); 2,42 (0,96); 3,35 (1,02); 1,94 (0,93); 3,06 (0,93); 3,45 (0,93); 2,03 (0,91); 3,42 (0,92); 3,39 (0,92); 2,81 (0,87); 1,84 (0,86); 1,97 (0,75); 2,45 (0,85); 2,48 (0,72); 1,48 (0,63); 2,71 (1,01); 2,55 (0,72); 4,35 (0,71).

Subject | Gender | Age | H/S | Scores for files 1-19
1 | M | 1 | H | 3 3 2 4 3 2 5 2 3 3 2 3 2 3 2 2 3 2 5
2 | M | 3 | S | 5 4 4 4 3 4 5 4 5 4 4 3 2 3 2 2 3 2 5
3 | M | 1 | S | 3 2 2 3 2 4 3 2 4 5 2 2 2 3 2 1 2 3 5
4 | M | 2 | H | 3 2 1 3 1 2 4 1 2 4 2 2 1 2 2 1 3 2 4
5 | F | 3 | S | 3 3 2 3 1 4 2 2 4 4 3 1 1 1 2 1 3 3 5
6 | M | 3 | S | 2 5 2 2 1 3 2 1 4 3 2 1 2 1 2 1 2 2 5
7 | M | 2 | H | 2 2 2 3 2 2 3 1 3 2 2 1 2 2 2 1 2 3 3
8 | M | 2 | S | 3 3 4 5 2 5 4 2 3 4 3 1 3 3 4 2 1 4 4
9 | M | 2 | S | 5 4 3 5 2 3 4 2 4 3 3 2 2 2 2 1 3 2 3
10 | M | – | H | 4 4 3 5 2 3 3 2 4 5 3 2 2 3 4 2 2 3 5
11 | M | 2 | S | 4 3 3 4 2 3 4 3 3 4 3 3 2 2 3 2 3 2 4
12 | M | 2 | S | 4 5 4 4 4 4 4 4 5 5 4 3 3 4 3 2 4 4 5
13 | M | 1 | S | 2 3 1 2 1 3 2 1 3 2 2 1 1 3 1 1 4 3 4
14 | F | 1 | S | 2 3 3 2 1 2 2 2 2 3 4 1 1 2 2 1 3 2 5
15 | M | 1 | H | 3 3 2 4 2 2 3 2 3 3 4 2 2 2 3 2 2 3 4
16 | M | 1 | S | 4 3 3 4 3 4 4 3 4 4 4 3 4 4 4 3 3 3 5
17 | M | 3 | S | 4 5 4 5 4 5 4 3 5 4 4 3 2 3 3 3 4 3 5
18 | F | 1 | H | 3 2 1 3 2 3 4 2 3 3 2 2 2 3 3 1 3 2 3
19 | M | 3 | S | 2 3 2 3 1 3 4 1 3 2 3 2 1 2 3 2 2 2 5
20 | M | 2 | S | 3 4 2 3 1 2 3 2 4 3 2 1 2 2 2 1 3 2 5
21 | M | 1 | S | 2 3 2 2 1 3 3 2 3 3 2 2 1 3 2 1 4 2 4
22 | F | 1 | H | 4 3 3 4 2 4 4 2 4 3 2 2 1 4 2 2 2 4 5
23 | M | 1 | H | 3 4 1 2 1 2 2 1 3 2 2 1 1 2 2 1 2 2 4
24 | M | 1 | H | 5 5 4 4 3 4 4 3 4 4 4 2 3 3 2 2 4 3 5
25 | M | 1 | H | 3 2 2 3 2 3 3 1 2 3 2 1 3 2 3 1 2 2 4
26 | M | 3 | H | 3 3 3 3 2 2 4 2 3 3 3 2 2 3 2 1 4 2 4
27 | M | 1 | H | 2 2 1 3 1 2 2 1 2 2 2 1 2 1 2 1 1 1 3
28 | F | 1 | H | 4 3 2 4 1 3 4 2 3 4 2 1 2 2 3 1 1 3 4
29 | M | 1 | S | 2 4 2 1 3 2 5 4 5 3 4 1 3 2 2 1 4 3 4
30 | M | 1 | S | 2 3 2 3 1 4 3 1 2 5 2 1 2 1 3 1 1 2 4
31 | M | 1 | H | 4 3 3 4 3 3 4 2 4 3 4 4 2 3 3 2 4 3 5