Content-Based Classification, Search & Retrieval of Audio
Erling Wold, Thom Blum, Douglas Keislar, James Wheaton
Presented by: Adelle C. Knight

Agenda
– Introduction
– Previous Research
– Analysis Techniques
– Statistical Techniques
– Performance
– Applications
– Future Work
Previous Research
– Sounds are traditionally described by pitch, loudness, duration, and timbre
– Tones can be identified as sharing a timbre because of their similar spectral energy distributions
– There is too much variation across the range of pitches and dynamic levels to “fingerprint” a single instrument tone
– Algorithms that extract audio structure (e.g., find the first occurrence of G-sharp)
  – These algorithms were tuned to specific musical constructs and are not appropriate for all sounds
– Neural nets to index audio databases
  – Some success, but it was difficult for the user to specify which features were important and which to ignore
Methods To Access Sounds
– Simile
– Acoustical/Perceptual Features
– Subjective Features
– Onomatopoeia

Accomplishing These Methods
1. Analysis Techniques
   – Reduce a sound to a small set of parameters
2. Statistical Techniques
   – Accomplish classification & retrieval
Analysis Techniques
Analysis & Retrieval Engine
[Diagram labels: Exact Text Search, Sound Level, Fuzzy Text Search, Speech or Musical content]
1. Measure a variety of acoustical features of each sound:
   1. Loudness
   2. Pitch
   3. Brightness
   4. Bandwidth
   5. Harmonicity
2. The set of N features is represented as an N-vector.
3. Different aural properties map to different regions of N-space.
Acoustical Features: Loudness
– Approximated by the signal’s root-mean-square (RMS) level, measured in decibels
  – RMS is calculated by taking a series of windowed frames of the sound and, for each frame, computing the square root of the mean of the squares of the sample values
– Human ear: 120 dB range
– Software: 100 dB range from 16-bit recordings
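The loudness measure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rms_db`, the frame and hop sizes, the rectangular (unweighted) window, and the -100 dB floor for silent frames are all assumptions. Samples are assumed to be floats scaled so that full scale is 1.0.

```python
import math

def rms_db(samples, frame_size=1024, hop=512, floor_db=-100.0):
    """Return one RMS level in dB (relative to full scale 1.0) per frame.

    Frame size, hop, and the floor are illustrative choices, and a plain
    rectangular window is used for simplicity.
    """
    if not samples:
        return []
    levels = []
    last_start = max(1, len(samples) - frame_size + 1)
    for start in range(0, last_start, hop):
        frame = samples[start:start + frame_size]
        # Root of the MEAN of the squares, as the name root-mean-square implies.
        mean_square = sum(x * x for x in frame) / len(frame)
        rms = math.sqrt(mean_square)
        level = 20.0 * math.log10(rms) if rms > 0 else floor_db
        levels.append(max(floor_db, level))
    return levels
```

A constant full-scale signal gives 0 dB in every frame, and a signal at one tenth of full scale gives about -20 dB.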
Acoustical Features: Pitch
– Estimated by taking a series of short-time Fourier spectra
– Frequencies & amplitudes of the peaks are measured for each frame
– An approximate greatest-common-divisor algorithm calculates the pitch estimate
– Stored as log frequency
– Human ear: 20 Hz – 20 kHz
– Software: 50 Hz – 10 kHz
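The slide does not spell out the approximate greatest-common-divisor step, so the following is only one plausible sketch of the idea. Given the peak frequencies of a single frame, it looks for the largest frequency of which every peak is close to an integer multiple. The function name `approximate_gcd`, the tolerance, the divisor limit, and the 50 Hz lower bound (taken from the software range quoted above) are assumptions.

```python
import math

def approximate_gcd(peaks_hz, tolerance=0.05, max_divisor=10, f_min=50.0):
    """Return the largest candidate f0 such that every peak lies within
    `tolerance` (in harmonic-number units) of an integer multiple of it.

    Candidates are the lowest peak divided by 1, 2, 3, ...; this is an
    illustrative scheme, not the algorithm used by the authors.
    """
    lowest = min(peaks_hz)
    for divisor in range(1, max_divisor + 1):
        candidate = lowest / divisor
        if candidate < f_min:
            break
        if all(abs(f / candidate - round(f / candidate)) <= tolerance
               for f in peaks_hz):
            return candidate
    return None  # no acceptable pitch estimate for this frame

# Slightly detuned harmonics of 220 Hz still yield 220 Hz.
pitch = approximate_gcd([220.0, 440.5, 659.0])
log_pitch = math.log2(pitch)  # stored as log frequency (base 2 is an assumption)
```

Peaks at 400, 600, and 800 Hz have no peak at the fundamental, yet the same search returns 200 Hz, which is the behaviour a GCD-style estimate is meant to provide.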
Acoustical Features: Brightness
– Measure of the higher-frequency content of the signal
– Computed as the centroid of the short-time Fourier magnitude spectra
– Stored as log frequency
– Varies over the same range as pitch
– Can’t be less than the pitch estimate at any given instant
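As a rough illustration of the centroid computation, the sketch below takes the bin frequencies and magnitudes of one short-time spectrum and returns the magnitude-weighted mean frequency. The function names and the choice of base-2 logarithm for "log frequency" are assumptions; computing the spectrum itself is left out.

```python
import math

def spectral_centroid(freqs_hz, magnitudes):
    """Magnitude-weighted mean frequency of one spectrum frame (Hz)."""
    total = sum(magnitudes)
    if total == 0:
        return None  # silent frame: centroid is undefined
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

def brightness_log(freqs_hz, magnitudes):
    """Centroid stored as log frequency (base 2 assumed here)."""
    centroid = spectral_centroid(freqs_hz, magnitudes)
    if centroid is None or centroid <= 0:
        return None
    return math.log2(centroid)
```

A spectrum with equal energy at 100, 200, and 300 Hz has its centroid at 200 Hz; shifting energy toward the higher bins raises the value, which is why the centroid serves as a brightness measure.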
Acoustical Features: Bandwidth
– Take the difference between each frequency component and the centre frequency
– Sum the differences
– Divide by the number of components to get the average
– Examples:
  – A single sine wave has a bandwidth of 0
  – Ideal white noise has infinite bandwidth
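The averaging described above can be sketched as follows. Two assumptions are made here: the centre frequency is taken to be the spectral centroid, and the differences are weighted by magnitude, so that bins carrying no energy do not contribute. With equal magnitudes this reduces to the plain average over components. The function name `bandwidth` is hypothetical.

```python
def bandwidth(freqs_hz, magnitudes):
    """Magnitude-weighted average distance of the spectral components
    from the centroid, in Hz. Returns None for a silent frame."""
    total = sum(magnitudes)
    if total == 0:
        return None
    centroid = sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total
    spread = sum(abs(f - centroid) * m for f, m in zip(freqs_hz, magnitudes))
    return spread / total
```

A spectrum with energy in a single bin, the sine-wave case, gives a bandwidth of 0, while two equal components at 100 and 300 Hz give 100 Hz.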
Acoustical Features: Harmonicity
– Distinguishes harmonic, inharmonic, and noise spectra
– Computed by measuring the deviation of the sound’s line spectrum from a perfectly harmonic spectrum
– Normalized to the range 0–1
– Optional feature
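The slide gives no formula for the deviation, so the sketch below is purely illustrative. It assumes a pitch estimate `f0_hz` is already available, measures how far each spectral peak lies from the nearest integer multiple of that pitch, and maps the average deviation to a score between 0 and 1. The function name and the scoring rule are hypothetical, not taken from the original work.

```python
def harmonicity(peaks_hz, f0_hz):
    """Score in [0, 1]: 1.0 when every peak is an exact harmonic of f0,
    0.0 when every peak lies halfway between two harmonics."""
    if not peaks_hz or f0_hz <= 0:
        return 0.0
    deviations = []
    for f in peaks_hz:
        harmonic_number = f / f0_hz
        # Distance to the nearest integer harmonic is at most 0.5.
        deviations.append(abs(harmonic_number - round(harmonic_number)))
    mean_deviation = sum(deviations) / len(deviations)
    return 1.0 - 2.0 * mean_deviation
```

Peaks at 200, 400, and 600 Hz against a 200 Hz pitch score 1.0; replacing one peak with a frequency midway between two harmonics pulls the score down.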
Storage – Feature Vector
– Each feature’s trajectory in time is computed but not stored
– For each trajectory, the system computes & stores:
  – Average
  – Variance
  – Autocorrelation
– The duration of the sound is also stored
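Reducing one trajectory to its stored statistics might look like the sketch below. The population variance, the single lag of one frame, and the normalization of the autocorrelation by the variance are assumptions, as is the function name `summarize_trajectory`.

```python
def summarize_trajectory(values, lag=1):
    """Return (mean, variance, autocorrelation at `lag`) for one feature
    trajectory, e.g. the per-frame loudness values of a sound."""
    n = len(values)
    if n == 0:
        raise ValueError("trajectory is empty")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance = sum(d * d for d in deviations) / n
    if variance == 0 or lag >= n:
        autocorrelation = 0.0  # constant or too-short trajectory
    else:
        lagged = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
        autocorrelation = lagged / (n * variance)
    return mean, variance, autocorrelation

# Example: a steadily rising trajectory.
stats = summarize_trajectory([1.0, 2.0, 3.0, 4.0])
```

Concatenating these three numbers for every feature, plus the duration, would give the N-vector used for training and classification.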
Training The System
– For each sound entered into the database, the N-vector, a, is computed
– The mean vector µ and covariance matrix R of the a vectors in each class are calculated over the M sounds in the class:
  $\mu = \frac{1}{M} \sum_{j} a[j]$
  $R = \frac{1}{M} \sum_{j} (a[j] - \mu)(a[j] - \mu)^{T}$
– Mean + covariance = the system’s model of the perceptual property being trained by the user
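The two training formulas translate directly into code. The sketch below uses plain Python lists so that it runs without third-party packages; the function name `train_class` is hypothetical, and each inner list is one sound's N-vector.

```python
def train_class(vectors):
    """Return (mu, R): the mean vector and covariance matrix of the
    feature vectors belonging to one class, both normalized by 1/M."""
    m = len(vectors)
    if m == 0:
        raise ValueError("a class needs at least one example sound")
    n = len(vectors[0])
    mu = [sum(v[d] for v in vectors) / m for d in range(n)]
    cov = [
        [sum((v[r] - mu[r]) * (v[c] - mu[c]) for v in vectors) / m
         for c in range(n)]
        for r in range(n)
    ]
    return mu, cov

# Two example sounds described by two features each.
mu, R = train_class([[1.0, 2.0], [3.0, 6.0]])
```

With only two perfectly correlated examples, as here, R is singular, so a usable class model needs more example sounds than this toy case.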
Statistical Techniques
Classifying Sounds
– When a new sound needs to be classified, a distance measure is calculated between the new sound’s a vector and the class model (µ, R)
– Using a weighted L2 (Euclidean) distance:
  $D = \left( (a - \mu)^{T} R^{-1} (a - \mu) \right)^{1/2}$
– The likelihood value L is based on the normal distribution and given by:
  $L = \exp(-D^{2}/2)$
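A sketch of the distance and likelihood computation follows. A distance weighted by the inverse covariance in this way is commonly called the Mahalanobis distance. To stay free of third-party packages, the code solves R x = (a - µ) with a small Gaussian elimination instead of forming the inverse explicitly; the helper names `solve_linear` and `classify` are hypothetical.

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def classify(a, mu, cov):
    """Return (D, L): weighted distance of `a` from the class model and
    the likelihood exp(-D^2 / 2)."""
    diff = [ai - mi for ai, mi in zip(a, mu)]
    weighted = solve_linear(cov, diff)          # R^{-1} (a - mu)
    d_squared = sum(d * w for d, w in zip(diff, weighted))
    distance = math.sqrt(max(0.0, d_squared))
    return distance, math.exp(-d_squared / 2.0)

distance, likelihood = classify([1.0, 2.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]])
```

In this example the second feature has four times the variance of the first, so a deviation of 2 along it counts the same as a deviation of 1 along the first.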
Retrieving Sounds
– Sort sounds by all acoustic features
– Example: retrieve the top M sounds in a class
  – Get all sounds in a hyper-rectangle centered around the mean with volume V such that $V/V_0 = M/M_0$
  – Compute the distance measure for all of these sounds
  – Return the closest M sounds
  – Increase the ratio & iterate if not enough sounds are returned
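The retrieval loop can be sketched as below. Several details are assumptions, since the slide does not define them: V0 is taken to be the volume of the bounding box around the whole database, M0 the total number of sounds, the ratio is doubled on each retry, and a plain Euclidean distance stands in for the covariance-weighted one. The function name `retrieve_closest` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def retrieve_closest(database, center, m, growth=2.0):
    """Return the `m` vectors closest to `center`, pre-filtering with a
    hyper-rectangle whose volume fraction starts at m / len(database)."""
    m0 = len(database)
    m = min(m, m0)
    if m == 0:
        return []
    n_dims = len(center)
    lows = [min(v[d] for v in database) for d in range(n_dims)]
    highs = [max(v[d] for v in database) for d in range(n_dims)]
    ratio = m / m0
    candidates = []
    while len(candidates) < m:
        # Scaling every side by ratio**(1/N) scales the volume by ratio.
        scale = ratio ** (1.0 / n_dims)
        if scale > 2.0:
            candidates = list(database)  # box exceeds the data extent: take all
            break
        candidates = [
            v for v in database
            if all(abs(v[d] - center[d]) <= scale * (highs[d] - lows[d]) / 2.0
                   for d in range(n_dims))
        ]
        ratio *= growth
    candidates.sort(key=lambda v: math.dist(v, center))
    return candidates[:m]

sounds = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [10.0, 10.0]]
nearest = retrieve_closest(sounds, [1.2, 1.2], 2)
```

The rectangle test is cheap when the database is indexed by feature, so the more expensive distance computation only runs on the few sounds that pass it.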
Two Quality Measures
1. Magnitude of the covariance matrix R
   – Measure of the compactness of the class
   – Quality measure of the classification
2. Size of the covariance matrix in each dimension
   – Measure of that particular dimension’s importance to the class
   – User can see if a feature is too important or not important enough
Segmentation
– Apply the acoustical analyses
– Look for transitions
– Transitions define segments of the signal, which are then treated like individual sounds
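One simple way to look for transitions is to flag frames where a feature jumps sharply from one frame to the next. The sketch below does this for a single trajectory such as loudness in dB; the threshold rule and the function name `segment_by_transitions` are assumptions, since the slide does not say how transitions are detected.

```python
def segment_by_transitions(trajectory, threshold):
    """Split a per-frame feature trajectory into (start, end) frame ranges
    wherever consecutive frames differ by more than `threshold`."""
    boundaries = [0]
    for i in range(1, len(trajectory)):
        if abs(trajectory[i] - trajectory[i - 1]) > threshold:
            boundaries.append(i)
    boundaries.append(len(trajectory))
    # Pair consecutive boundaries; drop empty ranges (e.g. empty input).
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if e > s]

# Loudness in dB: quiet, then a loud event, then quiet again.
loudness = [-60.0, -60.0, -20.0, -20.0, -20.0, -55.0, -55.0]
segments = segment_by_transitions(loudness, threshold=10.0)
```

Each returned frame range could then be analyzed and classified as if it were a separate sound.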
Performance & Results
– Example: Laughter classification
  – Returned: laughing sounds, animal sounds
– Example: Touchtone classification
  – Returned: 1 recording out of the training set
  – Low likelihood: touchtone recording of a 7-digit telephone number
  – High likelihood: single-digit tones
Applications
– Audio databases & file systems
  – Fields: file name, sample rate, sample size, file format, channels, dates, keywords, analysis feature vector, etc.
– Audio database browser
  – A front-end database application (e.g., SoundFisher) lets the user search for sounds using queries that can be content based
  – Permits general maintenance of entries: adding, deleting, describing sounds

Applications
– Audio editors
  – Include knowledge of audio content
  – Search commands work like queries; new classes can be built on the fly
– Surveillance
  – Identical to the editor, but identification & classification are done in real time
  – Detect sounds associated with criminal activity (e.g., glass breaking, screams)
– Automatic segmentation of audio & video
  – For large archives of raw audio & video
  – Audio-to-MIDI (Studio Vision Pro 3.0)
Future Work
– More analytic features
– General phrase-level content-based retrieval
– Source separation
– Sound synthesis
Conclusions