Content-Based Classification, Search & Retrieval of Audio Erling Wold, Thom Blum, Douglas Keislar, James Wheaton Presented By: Adelle C. Knight

Agenda
– Introduction
– Previous Research
– Analysis Techniques
– Statistical Techniques
– Performance
– Applications
– Future Work

Previous Research
– Sounds are traditionally described by pitch, loudness, duration, and timbre.
– An instrument's timbre can sometimes be identified from a tone because its tones have similar spectral energy distributions, but there is too much variation across the range of pitches and dynamic levels to "fingerprint" an instrument from a single tone.
– Algorithms that extract audio structure (e.g., find the first occurrence of a G-sharp) were tuned to specific musical constructs and are not appropriate for all sounds.
– Neural nets used to index audio databases had some success, but it was difficult for the user to specify which features were important and which to ignore.

Methods To Access Sounds
– Simile
– Acoustical/Perceptual Features
– Subjective Features
– Onomatopoeia

Accomplishing the Methods
1. Analysis techniques – reduce each sound to a small set of parameters.
2. Statistical techniques – perform the classification and retrieval.

Analysis Techniques

Analysis & Retrieval Engine
– Supports searching by exact text, by fuzzy text, and at the sound level (speech or musical content).
1. Measure a variety of acoustical features of each sound: loudness, pitch, brightness, bandwidth, harmonicity.
2. The set of N features is represented as an N-vector.
3. Different aural properties map to different regions of N-space.

Acoustical Features: Loudness
– Approximated by the signal's root-mean-square (RMS) level, measured in decibels.
– RMS is calculated by taking a series of windowed frames of the sound and computing the square root of the mean of the squared sample values in each frame.
– Human ear: roughly 120 dB of range.
– Software: about 100 dB of range from 16-bit recordings.
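A minimal sketch of this loudness computation, assuming a mono signal normalised to [-1, 1]; the frame size, hop size, and Hann window are illustrative choices, not values given in the paper:

```python
import numpy as np

def loudness_db(signal, frame_size=1024, hop=512, eps=1e-10):
    """Per-frame RMS level in dB for a signal scaled to [-1, 1]."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size)
    levels = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        rms = np.sqrt(np.mean(frame ** 2))          # root of the mean of the squares
        levels.append(20.0 * np.log10(rms + eps))   # convert to decibels
    return np.array(levels)
```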

Acoustical Features: Pitch
– Estimated by taking a series of short-time Fourier spectra.
– The frequencies and amplitudes of the spectral peaks are measured for each frame.
– An approximate greatest-common-divisor algorithm calculates the pitch estimate from the peak frequencies.
– Stored as log frequency.
– Human ear: 20 Hz – 20 kHz.
– Software: 50 Hz – 10 kHz.
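The slides do not spell out the approximate greatest-common-divisor step; the sketch below is one plausible reading that scores candidate fundamentals derived from the lowest spectral peak (the function name, the subharmonic limit, and the octave penalty are all assumptions):

```python
import numpy as np

def estimate_pitch_logfreq(peak_freqs, max_subharmonic=8):
    """Approximate-common-divisor pitch estimate from one frame's spectral peaks.

    Candidate fundamentals are the lowest peak divided by small integers; the
    candidate whose integer multiples best explain all peaks wins.
    Returns log2(f0), matching the slide's "store as log frequency".
    """
    peaks = np.sort(np.asarray(peak_freqs, dtype=float))
    best_f0, best_score = peaks[0], np.inf
    for k in range(1, max_subharmonic + 1):
        f0 = peaks[0] / k
        harmonics = np.maximum(np.round(peaks / f0), 1.0)   # nearest harmonic numbers
        err = np.mean(np.abs(peaks - harmonics * f0) / peaks)
        score = err + 0.01 * (k - 1)    # small penalty against octave-down errors
        if score < best_score:
            best_f0, best_score = f0, score
    return np.log2(best_f0)
```

In practice the peak list would come from picking local maxima of each short-time Fourier magnitude spectrum, as described on this slide.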

Acoustical Features: Brightness
– A measure of the higher-frequency content of the signal.
– Computed as the centroid of the short-time Fourier magnitude spectra.
– Stored as log frequency.
– Varies over the same range as pitch, but cannot be less than the pitch estimate at any given instant.
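A small sketch of the brightness (spectral centroid) measure for a single frame; the Hann window and the epsilon guards are assumptions, not details from the paper:

```python
import numpy as np

def brightness_logfreq(frame, sample_rate):
    """Spectral centroid of one windowed frame, returned as log frequency."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    return np.log2(centroid + 1e-12)
```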

Acoustical Features: Bandwidth
– Take the difference between each spectral component's frequency and the centre (centroid) frequency.
– Sum these differences and divide by the number of components to get an average.
– Examples:
  – A single sine wave has a bandwidth of 0.
  – Ideal white noise has infinite bandwidth.
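A hedged sketch of the bandwidth measure. It weights the deviations by spectral magnitude, which is one common reading of this computation and is what makes a pure sine wave come out near zero, as in the example above:

```python
import numpy as np

def bandwidth_hz(frame, sample_rate):
    """Average deviation of spectral components from the centroid frequency,
    weighted by magnitude so a single sine wave yields (near) zero bandwidth."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = np.sum(spectrum) + 1e-12
    centroid = np.sum(freqs * spectrum) / total
    return np.sum(np.abs(freqs - centroid) * spectrum) / total
```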

Acoustical Features: Harmonicity
– Distinguishes harmonic vs. inharmonic vs. noise spectra.
– Computed by measuring the deviation of the sound's line spectrum from a perfectly harmonic spectrum.
– Normalised to the range 0–1.
– Optional feature.
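A rough sketch of a harmonicity score in the normalised 0–1 range, assuming the spectral peaks and a fundamental estimate f0 are already available; the exact deviation measure used by the authors is not given in the slides:

```python
import numpy as np

def harmonicity(peak_freqs, f0):
    """1 minus the normalised deviation of measured peaks from exact integer
    multiples of f0. Returns ~1 for harmonic sounds, ~0 for noise-like spectra."""
    peaks = np.asarray(peak_freqs, dtype=float)
    harmonics = np.maximum(np.round(peaks / f0), 1.0)
    # A peak can be at most f0/2 away from its nearest harmonic, so dividing
    # by 0.5 * f0 keeps each deviation in [0, 1].
    deviation = np.abs(peaks - harmonics * f0) / (0.5 * f0)
    return float(1.0 - np.clip(deviation, 0.0, 1.0).mean())
```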

Storage – Feature Vector
– The trajectory in time of each feature is computed but not stored.
– For each trajectory, the system computes and stores:
  – Average
  – Variance
  – Autocorrelation
– The duration of the sound is also stored.
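A sketch of how the per-frame trajectories might be collapsed into the stored statistics and concatenated into one N-vector; the autocorrelation lag and the feature ordering are assumptions for illustration:

```python
import numpy as np

def summarize_trajectory(values, lag=1):
    """Collapse one feature's per-frame trajectory into (average, variance,
    autocorrelation at a small lag)."""
    v = np.asarray(values, dtype=float)
    mean, var = v.mean(), v.var()
    centered = v - mean
    denom = np.sum(centered ** 2) + 1e-12
    autocorr = np.sum(centered[:-lag] * centered[lag:]) / denom
    return mean, var, autocorr

def feature_vector(loudness, pitch, brightness, bandwidth, duration_seconds):
    """Concatenate the per-feature statistics plus the sound's duration
    into a single N-vector."""
    stats = [summarize_trajectory(t) for t in (loudness, pitch, brightness, bandwidth)]
    return np.array([x for triple in stats for x in triple] + [duration_seconds])
```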

Training The System
– For each sound entered into the database, the N-vector a is computed.
– The mean vector µ and covariance matrix R for the a vectors in each class are calculated:
  µ = (1/M) ∑_j a[j]
  R = (1/M) ∑_j (a[j] − µ)(a[j] − µ)^T
– The mean and covariance together form the system's model of the perceptual property being trained by the user.
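The training step maps directly onto a few lines of NumPy; this sketch follows the slide's (1/M) formulas, with variable names chosen for illustration:

```python
import numpy as np

def train_class(vectors):
    """Fit the class model from the slide's formulas:
    mu = (1/M) * sum_j a[j],   R = (1/M) * sum_j (a[j] - mu)(a[j] - mu)^T."""
    A = np.asarray(vectors, dtype=float)     # shape (M, N): one a-vector per sound
    mu = A.mean(axis=0)
    centered = A - mu
    R = centered.T @ centered / A.shape[0]   # biased (1/M) covariance, as on the slide
    return mu, R
```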

Statistical Techniques

Classifying Sounds
– When a new sound needs to be classified, a distance is computed between the new sound's a vector and the previously trained model.
– Using the weighted L2 (Euclidean) distance:
  D = ((a − µ)^T R⁻¹ (a − µ))^(1/2)
– A likelihood value L, based on a normal distribution, is given by:
  L = exp(−D²/2)
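A sketch of the distance and likelihood computation; it solves a linear system rather than explicitly inverting R, and assumes R is non-singular (a small ridge could be added to the diagonal otherwise):

```python
import numpy as np

def classify(a, mu, R):
    """Weighted L2 (Mahalanobis-style) distance and likelihood from the slide:
    D = sqrt((a - mu)^T R^-1 (a - mu)),   L = exp(-D^2 / 2)."""
    diff = np.asarray(a, dtype=float) - mu
    D2 = diff @ np.linalg.solve(R, diff)   # (a - mu)^T R^-1 (a - mu) without forming R^-1
    D = np.sqrt(max(float(D2), 0.0))
    return D, np.exp(-0.5 * D2)
```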

Retrieving Sounds
– The sounds in the database are sorted (indexed) by each acoustic feature.
– Example: retrieve the top M sounds in a class:
  – Get all sounds in a hyper-rectangle centred on the class mean, with volume V chosen so that V/V0 = M/M0 (V0 being the total feature-space volume and M0 the total number of sounds).
  – Compute the distance measure for these sounds.
  – Return the closest M sounds.
  – If not enough sounds are returned, increase the ratio and iterate.
(An illustrative sketch of this loop follows.)
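The sketch below simplifies the slide's V/V0 = M/M0 volume rule: it just grows the hyper-rectangle until enough candidates fall inside, then ranks them by distance. All names and parameters (database layout, box_scale, grow) are made up for illustration:

```python
import numpy as np

def retrieve_top_m(class_mu, class_R, database, M, box_scale=2.0, grow=1.5):
    """Prefilter with a hyper-rectangle around the class mean, rank the
    survivors by weighted distance, and widen the box if too few match.
    `database` is a dict {sound_id: feature_vector}."""
    half_widths = box_scale * (np.sqrt(np.diag(class_R)) + 1e-9)   # per-feature box size
    while True:
        lo, hi = class_mu - half_widths, class_mu + half_widths
        inside = {sid: v for sid, v in database.items()
                  if np.all(v >= lo) and np.all(v <= hi)}
        if len(inside) >= M or len(inside) == len(database):
            break
        half_widths *= grow                                         # enlarge box and iterate

    def dist2(v):
        d = v - class_mu
        return float(d @ np.linalg.solve(class_R, d))

    return sorted(inside, key=lambda sid: dist2(inside[sid]))[:M]
```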

Two Quality Measures
1. The overall magnitude of the covariance matrix R is a measure of the compactness of the class, and so serves as a quality measure for the classification.
2. The size of the covariance along a particular dimension is a measure of that feature's importance to the class; the user can see whether a feature is weighted too heavily or not heavily enough.

Segmentation
– Apply the acoustical analyses to the signal.
– Look for transitions.
– The transitions define segments of the signal, which are then treated like individual sounds.
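A toy sketch in the spirit of this slide: normalise the per-frame features and flag frames where the feature vector jumps sharply. The z-score normalisation and the threshold are assumptions, not details from the paper:

```python
import numpy as np

def segment_boundaries(frame_features, threshold=2.0):
    """Return frame indices where a new segment starts, based on large
    frame-to-frame jumps in the normalised feature trajectory."""
    F = np.asarray(frame_features, dtype=float)            # shape (frames, features)
    z = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)     # normalise each feature
    jumps = np.linalg.norm(np.diff(z, axis=0), axis=1)     # size of frame-to-frame change
    return np.where(jumps > threshold)[0] + 1              # transition frame indices
```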

Performance & Results
– Two classification examples: laughter and touchtones.
– Example: laughter classification – returned laughing sounds along with some animal sounds.
– Example: touchtone classification – returned 1 recording out of the training set; a 7-digit telephone number received a low touchtone likelihood, while single-digit tones received a high likelihood.

Applications
– Audio databases & file systems
  – Fields: file name, sample rate, sample size, file format, channels, dates, keywords, analysis feature vector, etc.
– Audio database browser
  – A front-end database application (e.g., SoundFisher) lets the user search for sounds using queries that can be content-based.
  – Permits general maintenance of entries: adding, deleting, and describing sounds.

Applications
– Audio editors
  – Include knowledge of audio content.
  – Search commands act like queries; new classes can be built on the fly.
– Surveillance
  – Identical to the editor case, but identification and classification are done in real time.
  – Detect sounds associated with criminal activity (e.g., glass breaking, screams).
– Automatic segmentation of audio & video
  – For large archives of raw audio and video.
  – Audio-to-MIDI (e.g., Studio Vision Pro 3.0).

Future Work
– More analytic features
– General phrase-level content-based retrieval
– Source separation
– Sound synthesis

Conclusions