ASSESS: a descriptive scheme for speech in databases Roddy Cowie.

To refresh people’s memory…
ASSESS embodies an approach to processing the audio element of a database
It is about going beyond the raw audio signal:
    - providing processing that a lot of people might want,
    - but that not everyone can do.

ASSESS covers several levels:
    - Basic transformations of the signal;
    - Key boundaries and the units that go with them;
    - Properties of the units.
The system generates a lot of files, but a lot of the things you might want are there if you know where to look

The processes ASSESS uses
A reasonable model:
    - Developed for inconsiderate inputs
    - Robust
    - Maximise availability
    - Systematic rather than selective

ASSESS input characteristics
Input file:
    - Reasonably long (up to 2.5 mins)
    - 20 kHz sampling rate
    - No header (.raw, not .wav)
Messy, but conversion techniques are easily available
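
One such conversion can be sketched with Python's standard `wave` module: read the samples out of a .wav file and write them back with no header. The function name `wav_to_raw` is hypothetical, and the sample width and byte order that ASSESS expects are not stated on the slide, so this sketch copies the sample bytes unchanged and only checks the 20 kHz rate.

```python
import wave

def wav_to_raw(wav_path, raw_path, expected_rate=20000):
    """Write the bare sample bytes of a WAV file to raw_path.

    Returns (sample_rate, n_frames). Raises ValueError if the rate is not
    expected_rate, because no resampling is attempted here.
    """
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        if rate != expected_rate:
            raise ValueError(f"{wav_path}: {rate} Hz, expected {expected_rate} Hz")
        frames = wav.readframes(n_frames)
    # A headerless file is simply the sample bytes with nothing in front.
    with open(raw_path, "wb") as raw:
        raw.write(frames)
    return rate, n_frames
```

A recording at another rate would need resampling first, which is outside this sketch.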

Using ASSESS
Woefully undramatic
Supply 3 command lines, e.g. for a file called ‘test’ lasting x secs:
    filterbank test.raw test.spc 20000
    howard test.raw test.tx
    stage2 test
Wait about x/2 secs
Admire outputs
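
For batch work, the three command lines can be generated per file. The sketch below only builds the argument lists and does not run anything. The helper names are hypothetical, and reading the final `20000` as the sampling rate is an assumption based on the input characteristics, not something the slide states.

```python
def assess_commands(stem, sample_rate=20000):
    """Build the three command lines for a file called <stem>.raw."""
    return [
        # Assumption: the trailing number is the sampling rate in Hz.
        ["filterbank", f"{stem}.raw", f"{stem}.spc", str(sample_rate)],
        ["howard", f"{stem}.raw", f"{stem}.tx"],
        ["stage2", stem],
    ]

def expected_wait_seconds(duration_seconds):
    """Rule of thumb from the slide: wait about x/2 secs for x secs of speech."""
    return duration_seconds / 2

for command in assess_commands("test"):
    print(" ".join(command))
```

Each list could be handed to `subprocess.run` on a machine where the ASSESS programs are installed.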

Basic transformations and 1st order output
Intensity
1/3 octave spectrum
‘Pulses’ corresponding to vocal cord openings
    - basis for estimating pitch
1st order output consists of 2 files:
    - intensity & 1/3 octave spectrum
    - estimated ‘pulses’
Everything else ASSESS calculates is derived from those
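
To make two of these quantities concrete, here is a minimal sketch, not the ASSESS code itself: intensity as RMS level in dB per frame, and pitch as the reciprocal of the interval between successive pulses. The frame length, the dB floor and both function names are assumptions for illustration.

```python
import math

def frame_intensity_db(samples, frame_length):
    """RMS level in dB for consecutive non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        mean_square = sum(s * s for s in frame) / frame_length
        # The floor avoids log(0) on digitally silent frames.
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels

def f0_from_pulses(pulse_times):
    """Pitch estimates in Hz, one per pair of successive pulse times (secs)."""
    return [1.0 / (b - a) for a, b in zip(pulse_times, pulse_times[1:]) if b > a]
```

Pulses 10 ms apart, for instance, correspond to a pitch of about 100 Hz.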

Conditioning 1st order outputs in ASSESS
Raw intensity
    - Scaled by a parameter derived from a ‘reference’ file
    - representing normal speaking level under the same recording conditions
Clumsy, but checks show it allows reasonable comparison across files
Same scaling applied to spectrum
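
The slide does not say how the scaling parameter is computed. As an illustration only, the sketch below assumes it is the median dB level of the reference file and subtracts it from both the intensity and every spectral band, so that 0 dB means "normal speaking level". The function names and sample values are hypothetical.

```python
from statistics import median

def reference_offset_db(reference_intensity_db):
    """Hypothetical scaling parameter: median level of the reference file."""
    return median(reference_intensity_db)

def rescale_db(levels_db, offset_db):
    """Express dB levels relative to the reference level."""
    return [level - offset_db for level in levels_db]

offset = reference_offset_db([58.0, 60.0, 62.0])
# Frame intensities of the file being analysed.
intensity = rescale_db([60.0, 66.0], offset)
# The same offset is applied to every band of every spectral frame.
spectrum = [rescale_db(frame, offset) for frame in [[50.0, 40.0], [55.0, 45.0]]]
```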

Conditioning 1st order outputs in ASSESS
Raw pulse estimates cleaned by selecting sequences where intervals are very close
Results (in pink) comparable to standard autocorrelation, but easier to clean further
High noise associated with frication filtered using spectrum
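
The idea of keeping only runs of near-equal intervals can be sketched as follows. The 10% tolerance, the minimum run of three intervals and the name `clean_pulses` are assumptions; the slide does not give the criteria ASSESS actually uses.

```python
def clean_pulses(pulse_times, tolerance=0.1, min_intervals=3):
    """Keep pulses that belong to runs of near-equal successive intervals.

    Two neighbouring intervals count as "very close" when they differ by
    less than `tolerance` as a fraction of the larger one.
    """
    intervals = [b - a for a, b in zip(pulse_times, pulse_times[1:])]
    kept = set()
    run_start = 0
    for i in range(1, len(intervals) + 1):
        run_continues = (
            i < len(intervals)
            and abs(intervals[i] - intervals[i - 1])
            < tolerance * max(intervals[i], intervals[i - 1])
        )
        if not run_continues:
            if i - run_start >= min_intervals:
                # Intervals run_start..i-1 span pulses run_start..i.
                kept.update(range(run_start, i + 1))
            run_start = i
    return [pulse_times[k] for k in sorted(kept)]
```

Isolated pulses and erratic stretches are dropped, which leaves a sparser but more trustworthy pulse train for later stages.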

Conditioning 1st order outputs in ASSESS
Fitting a flexible ‘rope’ filters out extremes and captures the broad shape
(zeroes mark pause boundaries – taken into account)

Conditioning 1st order outputs in ASSESS
In contrast, standard methods try to correct for octave jumps, with the kind of result shown in the lower panel

Boundary finding in ASSESS
Silences are found iteratively:
    - find an intensity level that separates a cluster of low-intensity samples (pauses) from a cluster of high-intensity samples (speech);
    - fine-tune using the spectrum of the definite pauses.
Again, robust: in a comparison sample
    - a phonetician identified 503 pauses
    - ASSESS identified 498
Difference between times of corresponding bounds averaged
    - pause starts: 10.4 ms
    - pause ends: -1.7 ms
A similar approach is applied to frication
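
The first step, finding a level that separates a low-intensity cluster from a high-intensity one, can be illustrated with a generic iterative two-cluster threshold. This is a stand-in technique, not the ASSESS algorithm, and the spectral fine-tuning step is left out. Function names are hypothetical.

```python
def pause_threshold(intensity_db, max_iterations=100):
    """Iteratively place a threshold midway between two cluster means."""
    threshold = (min(intensity_db) + max(intensity_db)) / 2.0
    for _ in range(max_iterations):
        low = [v for v in intensity_db if v <= threshold]
        high = [v for v in intensity_db if v > threshold]
        if not low or not high:
            break  # only one cluster present, nothing to separate
        updated = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        if abs(updated - threshold) < 1e-9:
            break  # converged
        threshold = updated
    return threshold

def find_pauses(intensity_db, threshold):
    """Return (start, end) frame index pairs, end exclusive, for low runs."""
    pauses, start = [], None
    for i, value in enumerate(intensity_db):
        if value <= threshold and start is None:
            start = i
        elif value > threshold and start is not None:
            pauses.append((start, i))
            start = None
    if start is not None:
        pauses.append((start, len(intensity_db)))
    return pauses
```

Multiplying the frame indices by the frame duration gives boundary times that can be compared with hand labels, as in the phonetician comparison above.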

2nd order output of ASSESS
.exm files specify
pitch and intensity contours
    - in terms of local maxima and minima
    - and speech/silence boundaries
episodes with frication (boundaries & average spectra)
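
Describing a contour by its local maxima and minima can be sketched as below. The layout of the .exm files is not shown on the slide, so the tuple format and the name `local_extrema` are assumptions for illustration.

```python
def local_extrema(contour):
    """Return (index, value, kind) for interior local maxima and minima.

    Comparisons are strict, so flat plateaus are not reported.
    """
    extrema = []
    for i in range(1, len(contour) - 1):
        prev, here, nxt = contour[i - 1], contour[i], contour[i + 1]
        if here > prev and here > nxt:
            extrema.append((i, here, "max"))
        elif here < prev and here < nxt:
            extrema.append((i, here, "min"))
    return extrema
```

A pitch contour of several hundred frames typically reduces to a short list of such turning points.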

Describing units – 3rd order outputs of ASSESS
Basic units:
    - Pauses
    - Tunes (structures between pauses lasting over 150 ms)
Pauses have only duration
Tunes have multiple attributes, and ASSESS covers them systematically
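
A minimal sketch of the split into units, assuming the 150 ms figure applies to the pauses (shorter silences are absorbed into the surrounding tune). That reading, the function name and the tuple layout are assumptions.

```python
def split_into_units(silences, total_duration, min_pause=0.150):
    """Split a recording into alternating tunes and pauses.

    `silences` is a sorted list of (start, end) times in seconds. Silences
    no longer than `min_pause` are absorbed into the surrounding tune.
    """
    units = []
    cursor = 0.0
    for start, end in silences:
        if end - start <= min_pause:
            continue  # too short to count as a pause
        if start > cursor:
            units.append(("tune", cursor, start))
        units.append(("pause", start, end))
        cursor = end
    if cursor < total_duration:
        units.append(("tune", cursor, total_duration))
    return units
```

The duration of each pause is then simply its end time minus its start time.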

Describing units – 3rd order outputs of ASSESS
Basic module of description (in .psg file)
    - Pattern repeated for pitch, & for each tune

Describing units – structural properties
Tune properties include
    - global slope & curvature of pitch contour,
    - movement at start and end,
    - measures of spectral balance & change
Relations between tunes include
    - abruptness of change from last tune
    - ‘crescendo’ …
etc.
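
One plausible way to obtain a global slope and curvature for a tune is a least-squares quadratic fitted to its pitch contour. The slide does not say how ASSESS defines these measures, so the sketch below is an assumption: slope is the fitted gradient at the tune's midpoint and curvature is the second derivative of the fit.

```python
def fit_quadratic(times, values):
    """Least-squares fit of values ~ a + b*t + c*t**2; returns (a, b, c).

    Needs at least three distinct times.
    """
    # Power sums for the 3x3 normal equations.
    s = [sum(t ** k for t in times) for k in range(5)]
    r = [sum(v * t ** k for t, v in zip(times, values)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [r[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        rest = sum(m[row][k] * coeffs[k] for k in range(row + 1, 3))
        coeffs[row] = (m[row][3] - rest) / m[row][row]
    return tuple(coeffs)

def slope_and_curvature(times, f0):
    """Gradient at the midpoint (Hz/s) and second derivative (Hz/s^2)."""
    mid = (times[0] + times[-1]) / 2.0
    _, b, c = fit_quadratic([t - mid for t in times], f0)
    return b, 2.0 * c
```

A falling, arched contour gives a negative slope and negative curvature; a final rise would show up as positive curvature.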

Summary
ASSESS is part system, part philosophy
The system delivers robust estimates of spectrum, F0 and intensity contours, key boundaries, and properties of the units they define
The philosophy is using signal processing expertise to make multiple alternatives at multiple levels available to others.
