Audio and Speech Computers & New Media
Topics for Today
  - General Audio
    - Basics of audio signal
    - Features
    - Event detection
  - Speech
    - Detection
    - Segmentation
    - Speaker identification
    - Recognition
  - Audio generation in software applications
The Audio Signal
  - Energy at each frequency step for every recorded point in time
Features for Audio Analysis
  - Data over time and frequency
Energy Over Time
  - What are these?
    - Speech
    - Music
    - Gunshot
Summarizing the Audio Signal
  - Sum energy for bands of frequencies over intervals of time
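A minimal sketch of the band summary described above, assuming non-overlapping frames and equal-width bands. The function names `frame_spectrum_energy` and `band_energy_summary` are hypothetical, and a naive DFT stands in for an FFT so the example needs only the standard library.

```python
import cmath
import math


def frame_spectrum_energy(frame):
    """Return |X[k]|^2 for bins 0..N/2 of one frame, using a naive DFT.

    An FFT computes the same values much faster; the naive form is used
    here only to keep the sketch dependency-free.
    """
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        energies.append(abs(acc) ** 2)
    return energies


def band_energy_summary(signal, frame_len, num_bands):
    """Sum spectral energy into `num_bands` bands for each frame.

    Returns one row per frame (time) and one column per band (frequency).
    """
    summary = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        energies = frame_spectrum_energy(signal[start:start + frame_len])
        n_bins = len(energies)
        # Band edges in bin indices; the last band absorbs any remainder.
        edges = [b * n_bins // num_bands for b in range(num_bands + 1)]
        summary.append([sum(energies[edges[b]:edges[b + 1]])
                        for b in range(num_bands)])
    return summary
```

Each row of the result is a coarse spectrum for one time interval, so a low tone followed by a high tone shows up as energy moving from the first column to the last.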
Audio Signal Analysis
  - Fast Fourier Transform (FFT)
    - Commonly used on audio signals
    - Allows for analysis of frequency features across time
  - Discrete Wavelet Transform (DWT)
    - FFTs have equal-sized windows, whereas wavelet windows can vary with frequency
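To illustrate the point about window sizes, here is a minimal sketch of a multi-level Haar wavelet transform. The Haar wavelet is chosen only because it is the simplest one, and the name `haar_dwt` is hypothetical. Each level halves the number of coefficients, so each coefficient at a deeper level (lower frequency) spans a longer stretch of the signal.

```python
import math


def haar_dwt(signal, levels):
    """Multi-level Haar DWT.

    Returns (approximation, details), where details[0] holds the finest
    (highest-frequency) coefficients and details[-1] the coarsest.
    The signal length must be divisible by 2 ** levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2 ** levels")
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences capture detail; sums carry on to the next level.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details
```

For an 8-sample input and 3 levels, the detail lists have 4, 2, and 1 coefficients, covering 2, 4, and 8 samples each. A fixed-window FFT would instead use the same window length for every frequency.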
Audio Signal Analysis
  - Mel-frequency cepstral coefficients (MFCC)
    - Based on FFTs
    - Maps results into bands approximating the human auditory system
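A rough sketch of the mel mapping, assuming the common formula mel = 2595 · log10(1 + f / 700), triangular filters, and an unnormalized DCT-II over log band energies. All function names are hypothetical, and production toolkits differ in normalization and filter details.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, low_hz, high_hz):
    """Return num_bands + 2 edge frequencies (Hz), equally spaced in mel."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_bands + 1)
    return [mel_to_hz(low + i * step) for i in range(num_bands + 2)]


def triangular_weight(freq_hz, left, center, right):
    """Weight of a triangular filter that peaks at `center`."""
    if left < freq_hz <= center:
        return (freq_hz - left) / (center - left)
    if center < freq_hz < right:
        return (right - freq_hz) / (right - center)
    return 0.0


def mel_band_energies(power_spectrum, sample_rate, num_bands):
    """Map a power spectrum (bins 0..N/2) onto mel-spaced bands."""
    bin_hz = (sample_rate / 2) / (len(power_spectrum) - 1)
    edges = mel_band_edges(num_bands, 0.0, sample_rate / 2)
    energies = []
    for b in range(num_bands):
        left, center, right = edges[b:b + 3]
        energies.append(sum(
            p * triangular_weight(k * bin_hz, left, center, right)
            for k, p in enumerate(power_spectrum)))
    return energies


def mfcc_from_band_energies(band_energies, num_coeffs):
    """Take logs of the band energies, then an unnormalized DCT-II."""
    logs = [math.log(e + 1e-10) for e in band_energies]  # avoid log(0)
    m = len(logs)
    return [sum(logs[j] * math.cos(math.pi * i * (j + 0.5) / m)
                for j in range(m))
            for i in range(num_coeffs)]
```

The bands come out narrow at low frequencies and wide at high frequencies, which is the sense in which they approximate human hearing.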
Event Detection
  - Mapping audio cues to events
  - Recognizing sounds related to particular events (e.g., gunshot, falling, scream)
Classifying Audio Signals
  - Features are extracted from audio signals
    - Can be time or frequency or both
  - Features create a multidimensional space of data points
  - Supervised learning
    - Train classifier with set of labeled signals
    - SVMs, neural nets, …
  - Unsupervised learning
    - Cluster unlabeled signals based on similarity
    - HAC, K-means, …
  - Same for almost any type of signal, not just audio
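The two learning modes can be sketched on toy feature vectors. A nearest-centroid classifier stands in for the supervised case because it is short (it is not an SVM or neural net), and a plain K-means loop covers the unsupervised case. All names are hypothetical, and seeding K-means with the first k points is a simplification.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_nearest_centroid(features, labels):
    """Supervised: store one mean feature vector per label."""
    groups = {}
    for vec, label in zip(features, labels):
        groups.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in groups.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    return min(centroids, key=lambda label: distance(centroids[label], vec))


def k_means(features, k, iterations=20):
    """Unsupervised: group unlabeled vectors into k clusters."""
    centroids = [list(v) for v in features[:k]]  # naive initialization
    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: distance(centroids[c], v))
                      for v in features]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(features, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment, centroids
```

The feature vectors could be band energies or MFCCs from audio, but nothing in the code is specific to audio, which matches the last bullet above.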
Speech Detection
  - Another audio signal classification task
  - Complicated by background sounds
Distinguishing between Speakers
  - Speaker segmentation/diarization
    - Identify when a change in speaker occurs
    - Self-similarity assessments
    - Useful for basic indexing or summarization of speech content
  - Speaker identification
    - Requires label attached to training data or label attached to cluster from unsupervised learning
    - Enables search (and other features) based on speaker
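One way a self-similarity assessment could work, sketched under the assumption that each frame is already a feature vector (for example, MFCCs) and that cosine similarity is the comparison measure. The names `self_similarity_matrix` and `change_points`, the window size, and the threshold are all hypothetical choices.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def self_similarity_matrix(frames):
    """Similarity of every frame with every other frame."""
    return [[cosine_similarity(a, b) for b in frames] for a in frames]


def change_points(frames, window, threshold):
    """Frame indices where the audio before and after look dissimilar.

    Compares the mean of the `window` frames before index i with the mean
    of the `window` frames starting at i.
    """
    points = []
    for i in range(window, len(frames) - window + 1):
        before = mean_vector(frames[i - window:i])
        after = mean_vector(frames[i:i + window])
        if cosine_similarity(before, after) < threshold:
            points.append(i)
    return points
```

Segmentation only finds where the speaker changes. Attaching a name to each segment is the separate identification step and needs labels.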
Speech Recognition
  - Segment utterances & characterize phonemes
    - Use gaps to segment
  - Group phoneme segments into words
  - Group words into requests or sentences
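Gap-based segmentation can be sketched with a simple energy threshold, assuming that quiet frames mark the gaps between utterances. The function names, the threshold, and the `min_gap` parameter are hypothetical, and real recognizers use more robust cues than raw energy.

```python
def frame_energies(signal, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def segment_by_gaps(energies, threshold, min_gap):
    """Return (start, end) frame ranges of loud regions, end exclusive.

    A region ends only after at least `min_gap` consecutive quiet frames,
    so short pauses inside an utterance do not split it.
    """
    segments = []
    start = None
    quiet = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                # Exclude the trailing quiet frames from the segment.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, len(energies) - quiet))
    return segments
```

Each returned range is a candidate utterance that later stages would break into phonemes and group into words.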
Speech Recognition
  - Continuous speech
    - Language models for disambiguation
    - Speaker-dependent training improves recognition
  - What to do for noisy signal
    - Topic spotting
    - Heuristic search
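A toy illustration of how a language model can disambiguate between candidate transcripts that sound alike. It assumes a bigram model with add-one smoothing trained on a tiny made-up corpus. The function names and the corpus are hypothetical, and real systems combine this score with acoustic scores.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count words and adjacent word pairs; "<s>" marks sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_probability(model, sentence):
    """Log probability of a sentence under add-one smoothed bigrams."""
    unigrams, bigrams = model
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(words, words[1:]))


def pick_transcript(model, candidates):
    """Choose the candidate the language model finds most likely."""
    return max(candidates, key=lambda c: log_probability(model, c))
```

With a corpus that contains "recognize speech", the model prefers that reading over a sound-alike such as "wreck a nice beach", whose word pairs it has rarely or never seen.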
Playing Back or Generating Audio
  - Where do you find audio cues in software outside of games?
  - Mapping events in software to audio cues
    - LogoMedia included audio cues to speed up stepping through code
    - InfoSound used audio to aid in program comprehension
    - Caitlin mapped code elements to different instruments
Spatialized Audio
  - Additional geographic/navigational channel
  - Examples
    - Joyce’s interactive Central Park hyperaudio
    - Audio maps of city for the visually impaired
      - Conveys distances, directions, and object sizes
      - Not for use while moving at time of writing
Spatialized Audio Generation
  - Head-related transfer function (HRTF)
    - Differences in timing and signal strength determine how we identify the position of a sound
    - Easy to apply with headphones
  - In open space
    - Beamforming
      - Timing for constructive interference to create stronger signal at desired location
    - Crosstalk cancellation
      - Destructive interference to remove parts of signal at desired location
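A geometric sketch of the timing ideas above, assuming a point source in free space, 2D coordinates in meters, and a speed of sound of about 343 m/s. It covers only the time and level differences between two ears and the delays for delay-and-sum beamforming. A measured HRTF also captures head shadowing and ear-shape filtering, which this ignores. Function names and the ear spacing are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def arrival_times(source, receivers):
    """Travel time in seconds from `source` to each receiver position."""
    return [math.dist(source, r) / SPEED_OF_SOUND for r in receivers]


def interaural_cues(source, ear_spacing=0.18):
    """Return (time difference, level ratio) between the two ears.

    Ears sit on the x-axis, centered on the origin. A positive time
    difference means the sound reaches the right ear first. The level
    ratio is right-ear amplitude over left-ear amplitude, assuming
    amplitude falls off as 1 / distance.
    """
    left = (-ear_spacing / 2, 0.0)
    right = (ear_spacing / 2, 0.0)
    d_left = math.dist(source, left)
    d_right = math.dist(source, right)
    time_difference = (d_left - d_right) / SPEED_OF_SOUND
    level_ratio = d_left / d_right
    return time_difference, level_ratio


def beamforming_delays(speakers, target):
    """Delay per loudspeaker so all wavefronts reach `target` together.

    The farthest speaker plays immediately; nearer ones wait, so the
    signals add constructively at the target.
    """
    times = arrival_times(target, speakers)
    latest = max(times)
    return [latest - t for t in times]
```

Over headphones, applying the delay and gain to the two channels is straightforward. In open space, the same timing logic is applied across a loudspeaker array.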
Echology: Interacting with Spatialized Audio
  - An interactive 2D soundscape combining human collaboration with aquarium activity
  - Goal: engage visitors to spend more time with (and learn more about) Beluga whales
  - Spatialized sound based on whale activity and human interaction
Echology Interaction
  - Whale activity is classified to create different sounds in soundstage
  - Visitors determine how sounds move through space
Echology Architecture