Automatic Transcription System of Kashino et al. MUMT 611 Doug Van Nort

Objective
To give an overview of Kashino et al.'s technique for automatic transcription
– Original implementation: ICMC 1993

Introduction
Sound source separation system
– Extracting a sound source in the presence of multiple sources
Physical vs. perceptual sound source
– Physical: the actual vibrating source itself
– Perceptual: what humans hear as a single source
– Ex: a piano heard through a loudspeaker (the loudspeaker is the physical source; the piano is the perceptual one)

Perceptual Sound Source Separation
Creating a system that simulates the human perceptual system
Extraction of parameters based on a perceptual model, then grouping of those parameters based on certain criteria

This PSSS System
Kashino et al.
– University of Tokyo
OPTIMA: Organized Processing Towards Intelligent Music Scene Analysis
First to use human auditory separation rules

This PSSS System
Suppose: input = mono audio signal; output = multiple MIDI channels (and a graphic display)
Given a signal S(t) comprised of a mix of M sound sources:
– Assume S(t) = {F_1(t), …, F_L(t)}, where F_j(t) = (p_j(t), f_j(t), ψ_j(t))
– p_j = power of the spectral peak
– f_j = frequency of the spectral peak
– ψ_j = bandwidth of the spectral peak
We wish to:
– Extract the F_j(t) from S(t)
– Cluster the F_j(t) into groups which (ultimately) represent different sound sources
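
A minimal sketch (not from the paper) of how one might represent a tracked frequency component F_j(t) in code; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FrequencyComponent:
    """One tracked spectral peak F_j(t), stored frame by frame."""
    power: list[float]      # p_j(t): power of the peak
    freq: list[float]       # f_j(t): center frequency in Hz
    bandwidth: list[float]  # psi_j(t): bandwidth in Hz
```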

System Overview
Extraction of frequency components
– Analysis is performed first
– All signals are 16-bit / 48 kHz
– A bank of 2nd-order IIR bandpass filters (on a log frequency scale) is implemented
Peak selection/tracking: the "pinching plane" method
– Regression planes, calculated via least squares
– In other words, minimization of the sum of squared errors in the z direction (power), leaving x and y (time and frequency) fixed
– The normal vector of each plane is calculated; the angle between the two normals gives ψ_j(t), and the direction vector gives f_j(t) and p_j(t)
– The first regression-plane analysis sets a threshold against which other potential peaks are measured
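
As a rough illustration of the analysis front end, here is a log-spaced bank of 2nd-order IIR resonator bandpass filters in Python/SciPy. The band density, Q, and frequency range are assumptions, since the slides do not give the paper's exact filter design:

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

def log_filterbank(x, fs=48000, f_lo=55.0, f_hi=8000.0,
                   bands_per_octave=24, q=30.0):
    """Run a signal through 2nd-order IIR resonator bandpass filters whose
    center frequencies are spaced on a log frequency scale.
    Returns (center_freqs, band_signals), band_signals of shape
    (n_bands, len(x))."""
    n_bands = int(np.log2(f_hi / f_lo) * bands_per_octave)
    centers = f_lo * 2.0 ** (np.arange(n_bands) / bands_per_octave)
    bands = []
    for fc in centers:
        b, a = iirpeak(fc, q, fs=fs)  # 2nd-order resonant bandpass at fc
        bands.append(lfilter(b, a, x))
    return centers, np.stack(bands)
```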

Pinching Planes
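
The figure itself is not reproduced here, but the regression-plane step it illustrates can be sketched as follows: fit a plane z = a·t + b·f + c to the power surface on each side of a candidate peak by least squares, then compare the plane normals. The patch layout and the angle computation below are assumptions, not the paper's exact geometry:

```python
import numpy as np

def fit_plane(times, freqs, powers):
    """Least-squares fit of powers ~ a*t + b*f + c over a time-frequency
    patch, minimizing squared error in the power (z) direction only.
    Returns the plane normal (a, b, -1)."""
    A = np.column_stack([times, freqs, np.ones_like(times)])
    coeffs, *_ = np.linalg.lstsq(A, powers, rcond=None)
    a, b, _ = coeffs
    return np.array([a, b, -1.0])

def pinch_angle(patch_below, patch_above):
    """Angle between the normals of the two planes fitted on either side
    of a candidate peak; each patch is a (times, freqs, powers) triple."""
    n1, n2 = fit_plane(*patch_below), fit_plane(*patch_above)
    cos = n1 @ n2 / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```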

Bottom-Up Clustering of Frequency Components
Grouping frequency components based on perceptual criteria
– Goal is to group sounds that humans hear as one
Calculations are made for harmonic mistuning and onset asynchrony between pairwise frequency components, which are then evaluated for the probability of auditory separation
– The probability functions are based on approximations of psychoacoustic experiments
Given probability functions p1 and p2, the integrated probability of auditory separation is m = 1 − (1 − p1)(1 − p2)
– This follows Dempster's rule for combining probabilities
– m is used as the distance measure in clustering
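
The combination rule is simple to state in code. The clustering shown here uses SciPy's standard agglomerative linkage as a stand-in for the paper's own clustering procedure, and the pairwise cue probabilities are made-up numbers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def separation_prob(p1, p2):
    """Integrated probability of auditory separation from two cues:
    m = 1 - (1 - p1)(1 - p2)."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# Made-up pairwise cue probabilities for 4 frequency components,
# in SciPy's condensed (upper-triangle) distance order.
p_mistune = np.array([0.10, 0.80, 0.90, 0.75, 0.85, 0.15])
p_onset   = np.array([0.20, 0.70, 0.85, 0.90, 0.60, 0.10])

m = separation_prob(p_mistune, p_onset)   # distance: high = likely separate
Z = linkage(m, method='average')          # agglomerative clustering
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)  # components sharing a label are grouped as one sound
```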

Clustering for Source Identification
Identify sound sources by the global characteristics of clusters
– Goal is to group sounds produced by the same source (thus using direct signal attributes, apart from any psychoacoustic metric)
– If a cluster contains a single note, we're good

Clustering for Source Identification
Uses a distance function to determine the source:
– D = c1·fp + c2·fq + c3·ta + c4·ts
Where:
– fp = peak power ratio of the second harmonic to the fundamental component
– fq = peak power ratio of the third harmonic to the fundamental component
– ta = attack time
– ts = sustain time
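
A sketch of this distance in Python; the weights c1..c4 and the use of absolute feature differences are assumptions, since the slides give only the form of D:

```python
import numpy as np

def cluster_features(harmonic_powers, attack_time, sustain_time):
    """Global features of a cluster: fp and fq are the power ratios of the
    2nd and 3rd harmonics to the fundamental (harmonic_powers[0])."""
    fp = harmonic_powers[1] / harmonic_powers[0]
    fq = harmonic_powers[2] / harmonic_powers[0]
    return np.array([fp, fq, attack_time, sustain_time])

def source_distance(feat_a, feat_b, c=(1.0, 1.0, 1.0, 1.0)):
    """D = c1*fp + c2*fq + c3*ta + c4*ts, applied here to absolute feature
    differences between two clusters (weights are placeholders)."""
    return float(np.dot(c, np.abs(feat_a - feat_b)))
```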

Tone Model Based Processing
The unit of input is a "processing scope"
– A processing scope consists of one cluster, or several if they share a frequency component
A tone model is a 2D matrix in which each row is a frequency component over time (columns represent time)
– Each element is a 2D vector of normalized power and frequency
"Mixture hypotheses" are generated from the tone models and matched against a processing scope to find the closest fit
– The distance function minimizes the power difference at each time/frequency location
– Effective in recognizing chords
– But it is model based
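
One simple way to realize hypothesis matching is brute-force enumeration: superpose subsets of tone models and keep the subset whose predicted power is closest to the observed scope. The paper's actual search strategy is not described on these slides, so the enumeration below is only illustrative:

```python
import numpy as np
from itertools import combinations

def hypothesis_distance(scope_power, tone_models, hypothesis):
    """Squared power difference between the observed processing scope and
    the superposition of the tone models named in the hypothesis. Both the
    scope and each model are (n_components, n_frames) power matrices."""
    predicted = sum(tone_models[name] for name in hypothesis)
    return float(np.sum((scope_power - predicted) ** 2))

def best_hypothesis(scope_power, tone_models, max_notes=3):
    """Enumerate mixture hypotheses of up to max_notes simultaneous tones
    and return the subset that best explains the scope."""
    names = list(tone_models)
    candidates = [h for k in range(1, max_notes + 1)
                  for h in combinations(names, k)]
    return min(candidates,
               key=lambda h: hypothesis_distance(scope_power, tone_models, h))
```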

Automatic Tone Modeling
Automatic acquisition of tone models from the analyzed signal
– Based on the "old-plus-new heuristic" [Bregman 90]
– A complex sound is interpreted as a continuation of everything old; whatever remains is perceived as a new sound
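
A minimal numeric reading of the heuristic: subtract the power explained by the already-known tone models ("old") and treat the non-negative residual as the new sound from which a tone model can be acquired:

```python
import numpy as np

def old_plus_new(observed_power, known_models):
    """Explain the observed power with the known tone models and return
    the non-negative residual, treated as a candidate new tone model."""
    explained = sum(known_models) if known_models else 0.0
    return np.clip(observed_power - explained, 0.0, None)
```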

Hierarchy of Perceptual Sound Events

A Few Problems and Limitations
Octaves = no good (the partials of the upper note coincide with those of the lower)
Psychoacoustic models
– Not tested over a large enough group of subjects
Detuning
– The probability function may not leave enough room (2.6%) for the variance of real instruments
Lots of free parameters
– Seemingly a lot of tuning involved

Conclusion
Works well for 3-note polyphony
– Anssi Klapuri's claim: an 18-note range; works for flute, piano, and trumpet
Groundbreaking in that it used a perceptual-system model
– Based on auditory scene analysis
Lots of free parameters
– Seemingly a lot of tuning involved