here - The Good Recording Project

Sound Recording Techniques
MediaCity, Salford
Wednesday 26th March, 2014
www.goodrecording.net
Perception and automated assessment
of recorded audio quality, focussing on
user generated content.
How distortion affects the perceived
quality of music: Psychoacoustic
experiments
Iain Jackson, Bruno M. Fazenda, Trevor J. Cox, Paul
Kendrick, Francis F. Li, Stephen Groves-Kirkby, & Alex
Wilson
Acoustics Research Centre, University of Salford
• How does clipping affect the perception
of quality in music?
– Are hard clipping and soft clipping
perceived differently in terms of quality?
• How well does HASQI predict subjective
quality ratings of clipped music?
• How robust is HASQI across different
styles of music?
What is HASQI?
• Hearing Aid Speech Quality Index (Kates & Arehart, 2010)
• Models the effect of degradation on quality.
• Measures the combined effect of noise, nonlinear
distortion, and linear filters.
• For both normal-hearing and hearing-impaired
listeners.
• Good performance for speech signals (Kressner et al., 2013)
– What happens when applied to music?
Arehart, Kates & Anderson (2011)
• Wide variety of degradation/processing:
– Additive noise, peak clipping, amplitude quantisation, compression,
compression + babble, spectral sub, high-pass filter, low-pass filter, bandpass
filter, positive spectral tilt, negative spectral tilt, single resonance peak,
multiple peaks, stationary noise...
– ...a total of 112 conditions.
• But...
– Only 3 samples of music.
“Haydn”
“jazz”
“vocalise”
• Quality ratings “reasonably well predicted” by HASQI.
– Were also “significantly affected by genre of music”.
Experiment 1
The effect of hard clipping on perceptions
of quality
• In contrast to previous work, we assess the effect of
a single type of processing – hard clipping – against
a comprehensive range of musical styles.
Sample Selection
• Aim: Select a representative sample of as wide a
range of musical styles as possible.
• Guided by previous work (Rentfrow & Gosling, 2003)
– 25 prototype songs from each of 14 Genres:
• Classical, jazz, blues, folk, alternative, rock, heavy metal, country,
pop, religious, rap/hip-hop, soul, funk, and electronica/dance.
• Final sample library of 140 songs.
– We obtained CD copies of 117 songs on the list.
• How to scale down to a manageable number of
songs for test?
– Sort and cluster by timbre.
Sample Selection
Why select by timbre, not genre?
• Genre
– Intuitively useful but lacking in objectivity.
• Timbre
– Apply objective methods to compare songs.
• Samples clustered using modified version of
technique used by Aucouturier and Pachet
(2002).
– Gaussian Mixture Model (GMM) fitted to Mel
Frequency Cepstrum Coefficients (MFCC) for 3 sections
of each song, which are then clustered by similarity.
• Total number of clusters is an emergent feature:
– In this case it was found to be 6.
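One way a cluster count can emerge from the data rather than being fixed in advance is sketched below. This is an illustrative stand-in, not the authors' method or the Aucouturier and Pachet technique: it assumes a matrix of pairwise timbre distances between songs has already been computed (the MFCC and GMM steps are not shown) and simply links any two songs closer than a chosen threshold. The function name `cluster_by_threshold`, the threshold, and the distance values are all hypothetical.

```python
def cluster_by_threshold(distances, threshold):
    """Group items linked by any pairwise distance below threshold.

    distances is a symmetric matrix given as nested lists.
    Returns a sorted list of clusters, each a list of item indices.
    """
    n = len(distances)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] < threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical distances between four songs (illustration only).
example = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
groups = cluster_by_threshold(example, threshold=0.5)
```

With these made-up distances the number of groups depends only on the threshold and the data, which is the sense in which the cluster count is emergent.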
The test set
• From each of our 6 timbre clusters we draw two
samples.
– One cluster, number 4, however contains only one sample.
• Additionally, we include the three samples used by
Arehart et al. (2011) in their previous assessment of
HASQI and music (“jazz”, “Haydn”, “vocalise”).
• Thus the final test set consists of 14 samples.
Table 1. The 14 songs the final test samples were taken from, by
cluster number.
Cluster numbers: 1, 2, 3, 4, 5, 6
Song Name                                                                   Artist/Composer
Riverboat Set: Denis Dillon’s Square Dance Polka, Dancing on the Riverboat  John Whelan
Crazy Train                                                                 Ozzy Osbourne
“Haydn” *                                                                   *
Ave Maria                                                                   Franz Schubert
Packin' Truck                                                               Leadbelly
“vocalise” *                                                                Tierney Sutton
Kalifornia                                                                  Fatboy Slim
Brown Sugar                                                                 The Rolling Stones
The Four Seasons: Spring                                                    Antonio Vivaldi
For What It's Worth                                                         Buffalo Springfield
The Girl From Ipanema                                                       Stan Getz
Spoonful                                                                    Howlin' Wolf
Nobody Loves Me But My Mother                                               B.B. King
“jazz” *                                                                    *
Method
Distortion of samples
• HASQI is continuous between 0 and 1.
– HASQI values used to estimate discrete levels.
• 10 Levels per song sample:
– 9 levels of distortion, spread at equal intervals over full
range of (available) HASQI values.
– Plus original, clean sample.
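The level-setting step above can be sketched as follows. This is a minimal illustration, assuming hard clipping is applied symmetrically at a threshold expressed as a fraction of the sample's peak level. The helper names `hard_clip` and `target_levels` and the example range are hypothetical, and the HASQI computation itself is not shown.

```python
def hard_clip(samples, threshold_fraction):
    """Clip samples symmetrically at threshold_fraction of the peak level."""
    peak = max(abs(s) for s in samples)
    limit = threshold_fraction * peak
    return [max(-limit, min(limit, s)) for s in samples]


def target_levels(lowest, highest, n_levels=9):
    """Return n_levels values spread at equal intervals from lowest to highest."""
    step = (highest - lowest) / (n_levels - 1)
    return [lowest + i * step for i in range(n_levels)]


# Hypothetical range of distortion (1 - HASQI) reachable for one song.
levels = target_levels(0.1, 0.9)

# Four made-up sample values clipped at 60% of their peak.
clipped = hard_clip([0.0, 0.5, -1.0, 0.8], 0.6)
```

In practice a clipping threshold would then be searched for each target level, since the mapping from threshold to HASQI differs from song to song.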
Relationship between HASQI values and threshold
[Figure: threshold (% of peak level, 0 to 100) against distortion level (1-HASQI, 0 to 1), with curves for “Crazy Train” and “For What It's Worth”.]
Example samples: clean, medium and high distortion levels for the songs in Table 1.
• Broadly reproduced method used by Arehart et al.
• 30 participants.
– Mean age 23.7 years (SD: 4.7 years)
– No reported hearing impairments
• Sounds presented over headphones.
– Sennheiser HD 650
– Stereo, 72 dB (linear)
• 140 trials.
– 14 songs x 10 processing conditions
– 7 second samples (randomised presentation order)
• Ratings of overall quality.
– Slider labelled Bad and Excellent at either end (output: 0 to 100)
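The trial structure above (14 songs x 10 processing conditions, presented in random order) could be generated along these lines. This is a sketch only: it assumes every song/condition pair is presented exactly once in a single fully shuffled list, which the slides do not spell out, and `build_trials` and the placeholder song names are hypothetical.

```python
import itertools
import random


def build_trials(songs, n_levels=10, seed=None):
    """Return every (song, level) pair once, in shuffled order."""
    trials = list(itertools.product(songs, range(n_levels)))
    random.Random(seed).shuffle(trials)
    return trials


songs = [f"song_{i:02d}" for i in range(1, 15)]  # placeholder names
trials = build_trials(songs, seed=1)
print(len(trials))  # 14 songs x 10 conditions
```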
Results
Figure 1. Mean quality ratings of each cluster,
as a function of distortion level. (Error bars
show 95% CIs.)
• Differences in quality between timbre clusters?
– Repeated-measures ANOVA
• Independent variables: Level of distortion, cluster
• Dependent variable: Mean quality ratings
• Significant main effect for distortion level (F(4.97, 144.26) =
458.38, p < .01, ηp² = .94).
• Significant main effect for cluster (F(2.33, 67.48) = 42.43, p < .01,
ηp² = .59).
• Significant interaction of cluster x distortion level
(F(11.91, 345.41) = 6.98, p < .01, ηp² = .19).
• Each successive level of distortion is associated with
a significant decrease in quality ratings, but the rate
of degradation is not perceived equally across all
timbres.
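As a consistency check on the effect sizes reported above, partial eta squared can be recovered from an F ratio and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three reported effects; the function name is a hypothetical helper.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F, effect df and error df as reported on the slide.
reported = {
    "distortion level": (458.38, 4.97, 144.26),
    "cluster": (42.43, 2.33, 67.48),
    "cluster x distortion level": (6.98, 11.91, 345.41),
}
for effect, (f_value, df1, df2) in reported.items():
    print(f"{effect}: {partial_eta_squared(f_value, df1, df2):.2f}")
```

This gives .94, .59 and .19, matching the reported values. The non-integer degrees of freedom are consistent with a sphericity correction having been applied, although the slides do not say which one.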
Table 2. Clusters grouped according to (between-group)
significantly different quality ratings. Rows list the songs and
artists of Table 1 by cluster.
Results
HASQI performance
Table 3. Correlation coefficients for quality ratings and values
predicted by HASQI for each timbre cluster.
Cluster      Quality
1            .828
2            .689
3            .693
4            .671
5            .801
6            .755
Mean (SD)    .732 (.065)
HASQI performance: for speech = .942 (Kates & Arehart, 2010)
for music = .838 (range = .770 to .849; Arehart et al., 2011)
Rnonlin performance: for music = .95 (1 music sample, 10 participants; Moore et al., 2004)
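For readers unfamiliar with how such coefficients are obtained, the sketch below computes a Pearson correlation between predicted and rated quality. It assumes the coefficients in Table 3 are Pearson correlations, which the slides do not state, and the two data series are made-up values for illustration, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (not data from the study).
hasqi_predicted = [0.95, 0.80, 0.62, 0.45, 0.30, 0.12]
mean_rating = [88.0, 71.0, 60.0, 38.0, 35.0, 15.0]
r = pearson_r(hasqi_predicted, mean_rating)
```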
Conclusions
• How robust is the HASQI model over a
comprehensive range of musical styles?
– The performance of HASQI was found to be (a little) less
accurate than previous work suggests.
– Overall correlation of predicted vs actual quality ratings =
.73 (compared with the equivalent value of .84 in Arehart et al., 2011).
• Predictive accuracy of HASQI can be improved by
factoring in timbral features of samples.
Experiment 2
The effect of Hard Vs Soft clipping on
perceptions of quality
Hard versus soft clipping
• Partial replication of Experiment 1.
– Both hard and soft clipping processing conditions included
in test set.
– Equivalent to distortion levels 1 to 5 from Experiment 1 (as
opposed to levels 1 to 9 considered in Experiment 1).
• Samples (original, clean files), experimental set-up,
procedure, and number of participants all as per
Experiment 1.
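The difference between the two processing types can be illustrated with a short sketch. Hard clipping is a simple limit at the threshold; for soft clipping the slides do not specify the transfer function, so a tanh curve is assumed here purely for illustration. Both function names are hypothetical.

```python
import math


def hard_clip(x, threshold):
    """Limit x to the range [-threshold, +threshold] with a sharp corner."""
    return max(-threshold, min(threshold, x))


def soft_clip(x, threshold):
    """Assumed tanh soft clipper: near-linear for small x, smoothly
    approaching +/-threshold for large x."""
    return threshold * math.tanh(x / threshold)


for x in (0.01, 0.3, 0.9):
    print(x, hard_clip(x, 0.5), round(soft_clip(x, 0.5), 4))
```

Small inputs pass through both functions almost unchanged, while large inputs are flattened abruptly by the hard clipper and gradually by the soft one.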
Hard Clipping Thresholds
[Figure: threshold (% of peak level, 0 to 70) against distortion level (1-HASQI, 0 to 0.6).]
Soft Clipping Thresholds
[Figure: threshold (% of peak level, 0 to 70) against distortion level (1-HASQI, 0 to 0.6).]
Hard versus soft clipping
Table 4. Comparison examples of hard and soft clipping at equivalent HASQI levels.
Examples given at distortion levels: Clean, Low, Medium.
Song Name        Artist/Composer    Hard/Soft Clip
Ave Maria        Franz Schubert     Hard, Soft
Packin' Truck    Leadbelly          Hard, Soft
Hard versus soft clipping
Figure 4. Mean quality ratings for hard (left) and soft (right) distortion conditions,
shown by cluster (Clusters 1 to 6). Error bars show 95% CIs.
[Axes: mean quality rating (0 to 100) against distortion level (0 is clean, 9 is most distorted).]
Hard versus soft clipping
• Across all samples, no significant difference
between ratings for hard and soft clipping.
• HASQI performance is unaffected by type of
distortion.
Experiment 3
Descriptions of quality attributes in
different distortion categories
Digital audio sample statistics
Since digital audio is encoded as discrete samples of the
audio waveform, much can be said about a recording by
the statistical properties of these samples.
The Probability Mass Function can show the presence of
distortion in mastered audio. Consider three categories:
1. The ‘clean’ distribution, where there is no clipping and
a wide dynamic range.
2. Audio with hard-clipping will feature a PMF with high
values at its extreme values, where the maximum
amplitude has been reached.
3. Where softer distortions are used, there is not one
single large value at extremes but more gentle bumps
in the nearby regions.
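The three categories above can be told apart by looking at how much probability mass sits at the extreme sample values. The sketch below is a minimal illustration on integer sample values; `pmf` and `extreme_mass` are hypothetical helper names, and the test signal is a made-up ramp rather than real audio.

```python
from collections import Counter


def pmf(samples):
    """Probability mass function: fraction of samples at each value."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def extreme_mass(samples):
    """Fraction of samples sitting at the minimum or maximum value."""
    dist = pmf(samples)
    lowest, highest = min(dist), max(dist)
    if lowest == highest:
        return 1.0
    return dist[lowest] + dist[highest]


# Made-up ramp from -8 to 8, then the same ramp hard-clipped at +/-4.
clean = list(range(-8, 9))
clipped = [max(-4, min(4, s)) for s in clean]
print(extreme_mass(clean), extreme_mass(clipped))
```

A hard-clipped signal piles samples onto its two extreme values, so its extreme mass is much larger than that of the unclipped signal. Detecting the gentler bumps of soft distortion would need a look at the bins near the extremes, which is not attempted here.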
Subjective Test (Wilson & Fazenda, submitted)
• 63 samples of music, containing a mix
of ‘clean’, hard-clipping distortion and
soft distortions.
• 22 participants gave quality ratings
for each sample on a 5-point scale
and also provided 2 descriptors.
• Ratings for clean samples were
significantly higher than for the two
distorted categories.
• The two distortion categories did not
significantly differ from each
other (F(1, 2) = 5.72, p < 0.001,
η² = 0.008).
Verbal descriptions of distortion categories
• As well as giving a rating out of 5, participants were asked to provide two words
describing the attributes on which quality was assessed.
• For example:
• “I gave this sample 5 stars because it was clear and full”
• “I gave this sample 1 star because it was distorted and dull”
Word-clouds of the most common attributes associated with (a) clean samples,
(b) hard clipped samples, (c) soft distortion samples.
Verbal descriptions of distortion categories
• The table shows the five most commonly used descriptor words and their
absolute frequencies for each of the clean, hard-clipped and soft distortion
categories.
• Chi-square analysis shows that there is significant variation in the
distribution of words used to describe each of the three categories (χ²(8, N =
547) = 33.28; p < 0.001).
• Bold frequencies in the table indicate values significantly greater (>) or less
than (<) the expected counts of the null hypothesis.
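The chi-square statistic for a table of word counts can be computed as sketched below. The counts are invented for illustration (three categories by five descriptor words) and are not the study's data; `chi_square` is a hypothetical helper. For a 3 x 5 table the degrees of freedom are (3 - 1) x (5 - 1) = 8, in line with the χ²(8, ...) reported above.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Invented counts: rows = clean, hard-clipped, soft distortion;
# columns = five descriptor words.
counts = [
    [30, 5, 20, 15, 10],
    [10, 25, 15, 10, 10],
    [8, 30, 6, 12, 9],
]
statistic, dof = chi_square(counts)
```

Cells whose observed count differs strongly from the expected count are the ones that would be flagged as greater or less than expected.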
Verbal descriptions of distortion categories
• “Distorted” is used less than expected by chance to describe
samples in the ‘clean’ category. The opposite is true for both
other categories, the hard and soft clipped distortion samples.
• Samples in the soft category are more frequently described as
“Distorted” than those in the hard category. This suggests that
small amounts of hard-clipping can go unnoticed.
• “Punchy” is used less often when describing the soft distortions,
compared to hard-clipping. This may be due to the lesser influence
of inter-sample peaks in soft distortions compared to hard-clipping.
• “Harsh” was not associated with either of the distortion categories
but does appear more often than expected by chance for words
describing the clean samples.
Conclusions
• Overall, HASQI found to predict degradation
in music quality reasonably well.
– Performance across hard and soft clipping is very
good.
• Limitation of HASQI for music - not
developed for stereo.
– Model does not account for stereo width and
panning.
References
• K.H. Arehart, J.M. Kates and M.C. Anderson. Effects of noise, nonlinear processing, and linear
filtering on perceived music quality. Int. J. Audiol. 50(3): 177–190. (2011).
• J.J. Aucouturier and F. Pachet. Music similarity measures: What’s the use? Proc. ISMIR.
(2002).
• J.M. Kates and K.H. Arehart. The Hearing-Aid Speech Quality Index (HASQI). J. Audio Eng.
Soc. 58(5): 363–381. (2010).
• A. Kressner, D. Anderson and C. Rozell. Evaluating the generalization of the Hearing Aid
Speech Quality Index (HASQI). IEEE Trans. Audio Speech Lang. Processing. 21(2): 407–415.
(2013).
• B.C.J. Moore, C.-T. Tan, N. Zacharov and V.-V. Mattila. Measuring and predicting the perceived
quality of music and speech subjected to combined linear and nonlinear distortion. J. Audio
Eng. Soc. 52(12): 1228–1244. (2004).
• P.J. Rentfrow and S.D. Gosling. The Do Re Mi’s of everyday life: The structure and personality
correlates of music preferences. J. Pers. Soc. Psychol. 84(6): 1236–1256. (2003).
• A. Wilson and B.M. Fazenda. Sonic character: Categorisation of distortion profiles in relation
to audio quality of music recordings. Submitted to 17th Int. Conference on Digital Audio
Effects (DAFx-14).