NORTEL NETWORKS CONFIDENTIAL 2 Evaluation of Objective Quality Estimators: Methods used with Voice Models & Implications for Video Testing Leigh Thorpe.

Nortel CTO Services Group, VQEG Ottawa Meeting, Sept 10-14, 2007

Overview
  Evaluation goals and analysis approach
  Database characteristics
  Subjective results: internal consistency
  Specific characteristics of measurement: resolution & performance on specific types of impairment

Evaluation of Measurement Models
  Want to understand how well the model predicts quality as rated by users.
  Need to assess performance against an evaluation database:
    1. How close are the predictions for a set of test cases to the subjective ratings for those same cases?
    2. Does the model differentiate neighbouring points in the correct direction?
  Interested in three aspects of performance:
    Accuracy: is the model good at predicting the subjective rating?
    Resolution/Monotonicity:

Three methods of analysis
(1) Graphical: scatterplot and regression line
  Plot subjective scores on the x-axis and the objective measure on the y-axis.
  The spread of dots shows visually how closely the variables track each other and how close their relationship is to the ideal (the main diagonal).
  By inspection, one can see how subgroups behave compared to overall performance.
(2) The correlation coefficient r, the Pearson Product-Moment Correlation
  Measures the strength of the linear relationship: the tendency for two variables to increase or decrease together.
  Does not indicate how close the values of the two variables are.
  Perfect correlation gives r = 1 or −1; no relationship gives r = 0.
  The measurement units for the two variables may be the same or different.
  The number of points and the dynamic range of the variables (difference from highest to lowest) will each affect the value of the correlation coefficient.
(3) The Standard Error of the Estimate (SEE)
  A measure of deviation of the dependent variable from its regression line.
  Can be computed for subsets of the conditions tested.
  Smaller is better: the closer the points are to the line (the better the prediction), the smaller the SEE value.
  SEE is a measure of dispersion similar to standard deviation, and behaves like standard deviation.
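The correlation coefficient and the SEE described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the analysis code behind these results: the function names and the sample numbers are made up, and SEE is computed as the RMS deviation from the least-squares line with divisor n (many textbooks divide by n − 2 instead).

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def see(x, y, slope, intercept):
    """RMS deviation of y from the given line (assumption: divisor n, not n - 2)."""
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Made-up example: subjective MOS on x, objective model output on y.
subjective = [1.2, 1.9, 2.5, 3.1, 3.8, 4.4]
objective = [1.5, 1.8, 2.9, 3.0, 3.6, 4.1]

r = pearson_r(subjective, objective)
slope, intercept = fit_line(subjective, objective)
print(f"r = {r:.2f}, SEE = {see(subjective, objective, slope, intercept):.2f} (units of y)")
```

Because `see` takes the line as an argument, the same function can score a subset of points against a line fitted to the whole data set.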

Performance on subgroups of points: what correlation tells us
  Computing the correlation coefficient for a subgroup can mislead us about how the subgroup relates to the overall group.
  The red points show a different relationship between the variables than is seen for the overall group. The correlation for those points tells us about their relationship to each other, but not to the rest of the data.
  [Figure: scatterplot with a subgroup of points highlighted in red; labels r = 0.83 and r = 0.94]

What SEE tells us
  Analogous to a standard deviation, SEE is the square root of the average of squared deviations. It is the RMS deviation from the regression line for a given set of points.
  It can be calculated for any set of points with sufficient n, say n ≥ 6.
  Compare two groups of points: SEE is smaller for the yellow deviations than for the red deviations.
  SEE is in the same units as the variable for which it captures the variation. For this example, SEE has the units of y.
  [Figure: scatterplot with regression line; deviations for two groups of points shown in yellow and red]
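The contrast between the two statistics for a subgroup can be shown with a small, invented data set. The sketch below assumes that a subgroup's SEE is measured against the regression line fitted to all points; the data, group names, and function names are hypothetical. The subgroup's points line up perfectly with each other (so its own r is essentially 1) but sit below the main trend, which only the SEE reveals.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def rms_deviation(x, y, slope, intercept):
    """RMS deviation of the points from a given line (divisor n)."""
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Hypothetical (subjective, objective) pairs.
# Main group: scattered by +/-0.1 around the diagonal.
main = [(1.0 + 0.3 * i, 1.0 + 0.3 * i + (0.1 if i % 2 == 0 else -0.1))
        for i in range(12)]
# Subgroup: perfectly collinear, but about 1 unit below the diagonal.
subgroup = [(2.0 + 0.4 * i, 1.0 + 0.4 * i) for i in range(6)]

all_x, all_y = zip(*(main + subgroup))
main_x, main_y = zip(*main)
sub_x, sub_y = zip(*subgroup)

slope, intercept = fit_line(all_x, all_y)  # one line fitted to all points

r_sub = pearson_r(sub_x, sub_y)
see_sub = rms_deviation(sub_x, sub_y, slope, intercept)
see_main = rms_deviation(main_x, main_y, slope, intercept)
print(f"subgroup r = {r_sub:.3f}")
print(f"SEE vs overall line: subgroup = {see_sub:.2f}, main group = {see_main:.2f}")
```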

Evaluation Samples: The “Database”
  The evaluation database consists of:
    a number of samples of the signal of interest
    a mean subjective rating for each sample
  Ideally, the database should contain samples (test cases) covering the full range of types and levels of impairments that the model will encounter in usage conditions.
  Single database: all subjects have rated all test cases.
  Where multiple databases are used, there should be sufficient common test cases across the databases to show whether the subjective ratings line up.

Criteria used for new Voice Quality Database
  Cover a broad range of impairment types and levels:
    different types of codecs, range of packet loss, background noise (for these cases, noise is in the reference)
    combinations of these: coding, noise, packet loss, tandeming
  Two languages: English, French
  Multiple talkers: eight, four per language
  Include conditions that will challenge candidate methods: time warping (temporal shift) and noise reduction
  A large number of judgments to obtain stable scores: we used n = 60 for each sample

Effect of Truncating Quality Range
  [Figures: full-range database, r = 0.85; small-range database, r = 0.53]
  This small-range database is simulated from the above by restricting the range of subjective values. (Care was taken in the simulation to keep the number of points about the same.)
  The range restriction reduced the correlation coefficient from 0.85 to 0.53.
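The effect of range restriction is easy to reproduce with synthetic data. The sketch below is not a reconstruction of the simulation on this slide: it draws two invented data sets with the same number of points and the same scatter about the diagonal, one spanning subjective scores from 1 to 5 and one confined to 2.5 to 3.5. The seed, noise level, and ranges are arbitrary assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(lo, hi, n, noise_sd, rng):
    """Draw n (subjective, objective) pairs; subjective is uniform on [lo, hi]."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(2007)   # fixed seed so the run is repeatable
n, noise_sd = 200, 0.5      # same point count and same scatter in both cases

r_full = pearson_r(*simulate(1.0, 5.0, n, noise_sd, rng))
r_narrow = pearson_r(*simulate(2.5, 3.5, n, noise_sd, rng))
print(f"full range: r = {r_full:.2f}; restricted range: r = {r_narrow:.2f}")
```

The prediction error is identical in both cases by construction; only the spread of the subjective scores differs, and that alone lowers r.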

Database details
  Languages tested separately; listeners were native speakers of the language heard
  Samples 6 – 8 sec duration, each made up of two unrelated sentences from the same talker
  Four talkers per language; talkers crossed with conditions
  1304 samples (326 x 4)
  Test room ambient noise low
  Presented at nominal telephone listening volume
  Too many samples to complete in one session: samples were divided across four test sessions
    each session included one instance of each condition
    the four talkers were represented equally in all sessions
    therefore, every listener heard every test case, but not always with the same talker

Internal Consistency of Database: English
  [Figure: English Database: Internal Consistency (per-condition means, arbitrary split); axes: one half vs. other half; r = 0.995]
  English samples: r = 0.995. This is the upper limit of performance that can be detected with this database.
  The variability of these samples indicates a resolution of about 0.25 MOS, as would be expected for n = 30 (i.e., half).
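A split-half check of this kind can be sketched as follows. The data layout (a dict mapping each condition to a list of scores, in the same listener order for every condition), the function names, and the synthetic ratings are all assumptions made for illustration; the actual split used for the database is described only as arbitrary.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_r(ratings, rng):
    """Correlate per-condition means from two arbitrary halves of the listeners.

    ratings maps condition -> list of scores, one per listener, with the
    listeners in the same order for every condition.
    """
    n_listeners = len(next(iter(ratings.values())))
    order = list(range(n_listeners))
    rng.shuffle(order)
    half_a = order[: n_listeners // 2]
    half_b = order[n_listeners // 2:]
    means_a = [sum(s[i] for i in half_a) / len(half_a) for s in ratings.values()]
    means_b = [sum(s[i] for i in half_b) / len(half_b) for s in ratings.values()]
    return pearson_r(means_a, means_b)

# Synthetic example: 20 conditions, 60 listeners, scores on a 1-5 scale.
rng = random.Random(1)
true_mos = [1.0 + 4.0 * k / 19 for k in range(20)]
ratings = {
    f"cond{k}": [min(5, max(1, round(rng.gauss(m, 0.7)))) for _ in range(60)]
    for k, m in enumerate(true_mos)
}

r_split = split_half_r(ratings, rng)
print(f"split-half r = {r_split:.3f} (two halves of 30 listeners each)")
```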

Internal Consistency of Database: French
  French samples: r = 0.995

Correlation Coefficient (r) by Algorithm

  Subj Data    Model A   Model B   Model C   Model D
  English      0.93      0.92      0.85      0.90
  French                 0.90      0.78      0.87
  Merged       0.91      0.90      0.82      0.83
  Averaged*    0.93      0.92      0.84      0.91

  * This is the correlation for French and English scores averaged together, not the average of the correlation coefficients!

Results for Model A
  [Figure: scatterplot, r = 0.93]
  The spread of these points shows that Model A can resolve subjective quality to no better than about 0.5 MOS.

Results for Model C
  [Figure: scatterplot, r = 0.84]
  This model shows a tendency to compress the range of its output score, relative to the subjective scores.
  There are a number of outliers in the lower left quadrant.
  The mid-range resolution is about 3/4 MOS.

Example: data plotted by subgroup

Example of results for subgroups: SEE* values

  Combined                Model A   Model B   Model C   Model D
  MNRU                    0.41      0.49      0.29      0.30
  Codecs                  0.27                0.29      0.18
  Random Packet Loss      0.23      0.31      0.23      0.29
  Constrained Random PL   0.24      0.34      0.22      0.29
  Bursty Packet Loss      0.16      0.23      0.32      0.33
  Constrained Bursty PL   0.20      0.26      0.21      0.17
  Temporal Clipping       0.30      0.32      0.22      0.29
  Noise                   0.22      0.32      0.41      0.27
  Noise + Packet Loss     0.25      0.32      0.40      0.26
  Noise Reduction         0.22      0.32      0.28      0.21
  Overall                 0.23                0.30      0.26

  * based on means across languages

What can we learn from the voice metric testing that can assist in evaluation of video metrics?
  1. Ensure the use of a range of quality in the subjective test samples (next slide): this can affect the correlation observed.
  2. Include all the impairments you are going to want to assess with the model, or that may be encountered in signals that pass through networks.
  3. Within reason, any subjective metric can be used, as long as it is sufficiently sensitive to the variation in quality over the range used. It doesn’t need to be MOS.
  4. Collect data from as many viewers as practicable: n > 30 if possible.
  5. Examine internal consistency of subjective ratings.
  6. Examine performance of the models on subgroups within the data: select a statistic that provides an unbiased result (r is not unbiased in this application). The SEE statistic provides a credible alternative.
  7. Examine resolution and monotonicity: quantitative metrics??

Interpreting regression and correlation
  [Figure: four scatterplots with regression lines, one for each case below]
  Weak relationship: the points fall far from the line, and the cloud of points is about as long as it is wide. It looks as though a line in any direction would be as good.
  Strong relationship, and the line is very similar to the diagonal: on average, the objective measure is closely tracking the subjective score. For MOS prediction, this is the most desirable result.
  Strong relationship, but the line is canted relative to the diagonal: the objective measure is using a smaller range than the subjective score. Note: the value of the correlation coefficient does not indicate whether the line tracks the diagonal.
  Deviation from linear: the objective measure follows the diagonal for the lower portion, but underestimates the quality of the conditions in the upper range. We can compute a regression line, but it will not account for the non-linearity. We could compute a best-fit curve, but there is no “correlation” statistic to indicate the strength of a non-linear relationship.
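The note that r does not indicate whether the line tracks the diagonal can be checked numerically. In the sketch below (invented data and hypothetical names), a second set of objective scores is made by squeezing the first into a smaller range with a linear rescaling; r is unchanged, while the slope and intercept of the regression line show the compression.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

subjective = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
# Hypothetical model output that follows the diagonal closely.
tracking = [1.2, 1.4, 2.1, 2.4, 3.2, 3.4, 4.1, 4.4]
# The same pattern squeezed into a smaller range (a canted line).
compressed = [2.0 + 0.4 * v for v in tracking]

for name, objective in (("tracking", tracking), ("compressed", compressed)):
    r = pearson_r(subjective, objective)
    slope, intercept = fit_line(subjective, objective)
    print(f"{name:>10}: r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```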

Working with correlation (1)
  Correlation coefficients cannot be averaged. Why not?
  Correlation is not a linear process, and so the correlations cannot be treated with linear operations (like averaging).
  [Figure: three scatterplots labelled Database A, Database B, and Databases A & B Merged; r values shown: 0.94, 0.92, 0.93; merged r = 0.65]
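A small invented example makes the point concrete. In the sketch below, two hypothetical databases each show a near-perfect linear relationship, but their objective scores sit at different offsets; the average of the two coefficients says nothing about the correlation of the merged data. The values are made up and are not taken from the databases discussed here.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical databases: same subjective scores, objective scores on
# parallel lines with different offsets.
x_a = [1, 2, 3, 4, 5]
y_a = [0.8, 1.1, 1.4, 1.7, 2.0]
x_b = [1, 2, 3, 4, 5]
y_b = [3.3, 3.6, 3.9, 4.2, 4.5]

r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)
r_avg = (r_a + r_b) / 2                      # the tempting, but wrong, summary
r_merged = pearson_r(x_a + x_b, y_a + y_b)   # correlation of the pooled points

print(f"r_a = {r_a:.2f}, r_b = {r_b:.2f}, average = {r_avg:.2f}")
print(f"merged r = {r_merged:.2f}")
```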

Nortel Database: Summary of Impairment Conditions
  326 cases x 4 talkers x 2 languages = 2608 test samples in the database

  Category                No. of Cases   Range of Quality
  MNRU                    7              5 - 35 dBQ
  Clean                   2              High quality only
  Codecs                  7              G.711, G.729, AMR, tandem
  Random Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
  Constrained Random PL   22             same speech & mask for each codec
  Bursty Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
  Constrained Bursty PL   22             same speech & mask for each codec
  Temporal Clipping       21             15-60 ms clip, +/-80 ms shift, 120 ms mute
  Noise                   33             20, 10, 0 dB SNR, Hoth, car, babble, street
  Noise + Packet Loss     54             2%, 4%, random & bursty
  Noise Reduction         48             good and poor noise reduction algorithm
  Total                   326

Results for Model A by subgroup: English
