1
Automatic Transcription System of Kashino et al.
MUMT 611
Doug Van Nort
2
Objective
To give an overview of this particular technique for automatic transcription
–Original implementation: ICMC 1993
3
Introduction
Sound Source Separation System
–Extracting a sound source in the presence of multiple sources
Physical vs. Perceptual sound source
–Physical: the actual source itself
–Perceptual: what humans hear as a single source
Ex: Piano, Loudspeaker
4
Perceptual Sound Source Separation
Creating a system which simulates the human perceptual system
Extraction of parameters based on a perceptual model, grouping of parameters based on certain criteria
5
This PSSS System
Kashino et al. – U. of Tokyo
OPTIMA: Organized Processing Towards Intelligent Music Scene Analysis
First to use human auditory separation rules
6
This PSSS System
Suppose: input = mono audio signal, output = multiple MIDI channels (and a graphic display)
Given a signal S(t), comprised of a mix of M sound sources
–Assume S(t) = {F_1(t), …, F_L(t)}, where F_j(t) = {p_j(t), f_j(t), ψ_j(t)}
–p_j = power of the spectral peak
–f_j = frequency of the spectral peak
–ψ_j = bandwidth of the spectral peak
Wish to:
–Extract the F_j(t) from S(t)
–Cluster the F_j(t) into groups which (ultimately) represent different sound sources
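A minimal Python sketch of this representation (illustrative only; the names and structure are assumptions, not the original system's code):

from dataclasses import dataclass, field
from typing import List

@dataclass
class FrequencyComponent:
    # one entry per analysis frame t
    power: List[float] = field(default_factory=list)      # p_j(t)
    frequency: List[float] = field(default_factory=list)  # f_j(t)
    bandwidth: List[float] = field(default_factory=list)  # psi_j(t)

# The task is then to extract a list of FrequencyComponent objects from S(t)
# and cluster them into groups, each group ideally corresponding to one note
# from one sound source.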
7
System Overview
Extraction of Frequency Components
–Analysis taken first: all signals are 16 bit / 48 kHz
–Bank of 2nd-order IIR bandpass filters (log frequency scale) implemented
–Peak Selection/Tracking: "pinching plane" method
»Regression planes, calculated via least squares; in other words, minimization of the sum of squares in the z direction (power), leaving x and y (time and freq) fixed
»A normal vector is calculated for each plane. The angle between them gives ψ_j(t); the direction vector gives f_j(t) and p_j(t)
–The first regression-plane analysis sets the threshold by which other potential peaks are measured
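A rough sketch of this analysis front end, assuming a bank of 2nd-order IIR resonators with log-spaced centre frequencies (the frequency range, band count, and Q below are assumptions, not the authors' values):

import numpy as np
from scipy import signal

fs = 48000                                   # 16-bit / 48 kHz input, per the slide
f_lo, f_hi, n_bands = 55.0, 8000.0, 96       # assumed range and resolution
centres = np.geomspace(f_lo, f_hi, n_bands)  # log-spaced centre frequencies

def filterbank_outputs(x, Q=30.0):
    """Return one bandpassed signal per centre frequency (Q is assumed)."""
    outputs = []
    for fc in centres:
        b, a = signal.iirpeak(fc, Q, fs=fs)  # 2nd-order IIR peak (resonator) filter
        outputs.append(signal.lfilter(b, a, x))
    return np.array(outputs)                 # shape: (n_bands, n_samples)

The pinching-plane peak tracking would then fit least-squares regression planes to the time-frequency power surface built from these band outputs.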
8
Pinching Planes
9
Bottom-Up Clustering of Freq Components
Grouping freq components based on perceptual criteria
–Goal is to group sounds humans hear as one
Calculations are made for harmonic mistuning and onset asynchrony between pairwise freq components, then evaluated for probability of auditory separation
–Probability functions based on approximations of psychoacoustic experiments
Given probability functions p1 and p2, the integrated probability of auditory separation is given by m = 1 - (1 - p1)(1 - p2)
–This follows Dempster's rule for combining probabilities
–m is used as the distance measure in clustering
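A toy sketch of this grouping step: the two cue probabilities are combined as m = 1 - (1 - p1)(1 - p2) and m is used as the pairwise distance for agglomerative clustering. The cue-to-probability mappings below are placeholders, not the paper's psychoacoustic fits:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def p_mistuning(f_a, f_b):
    # placeholder: separation probability grows with deviation from an integer harmonic ratio
    ratio = max(f_a, f_b) / min(f_a, f_b)
    return min(1.0, abs(ratio - round(ratio)) / (round(ratio) * 0.03))

def p_onset_async(t_a, t_b, tau=0.05):
    # placeholder: separation probability grows with onset time difference
    return 1.0 - np.exp(-abs(t_a - t_b) / tau)

def separation_distance(a, b):
    p1 = p_mistuning(a["freq"], b["freq"])
    p2 = p_onset_async(a["onset"], b["onset"])
    return 1.0 - (1.0 - p1) * (1.0 - p2)     # integrated probability m

def cluster_components(components, threshold=0.5):
    # components: list of dicts with "freq" and "onset" keys (assumed format)
    n = len(components)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = separation_distance(components[i], components[j])
    z = linkage(squareform(d), method="average")
    return fcluster(z, t=threshold, criterion="distance")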
10
Clustering for Source Identification
Identify sound sources by global characteristics of clusters
–Goal is to group sounds coming from the same source (thus uses direct signal attributes, apart from any psychoacoustic metric)
–If a cluster contains a single note, we're good
11
Clustering for Source Identification
Uses a distance function to determine the source
–D = c1·fp + c2·fq + c3·ta + c4·ts
Where:
–fp = peak power ratio of the second harmonic to the fundamental component
–fq = peak power ratio of the third harmonic to the fundamental component
–ta = attack time
–ts = sustain time
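A direct transcription of this distance function; the weights c1..c4 below are placeholders, since the slide does not give their values:

def source_distance(fp, fq, ta, ts, c=(1.0, 1.0, 1.0, 1.0)):
    """D = c1*fp + c2*fq + c3*ta + c4*ts
    fp: power ratio of 2nd harmonic to fundamental
    fq: power ratio of 3rd harmonic to fundamental
    ta: attack time, ts: sustain time
    """
    c1, c2, c3, c4 = c
    return c1 * fp + c2 * fq + c3 * ta + c4 * ts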
12
Tone Model Based Processing
The unit of input is a "processing scope"
–A processing scope consists of one cluster, or several clusters if they share a freq component
–A tone model is a 2D matrix with each row being a freq component over time (columns represent time); each element is a 2D vector of normalized power and freq
–"Mixture hypotheses" are generated for each tone model and matched with a processing scope to find the closest fit
–The distance function minimizes the power difference at a given time/freq location
–Effective in recognizing chords
–But, is model based
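A simplified sketch of this matching step: each tone model is treated here as a (frames × bins) array of normalized powers, a mixture hypothesis is a subset of models, and the hypothesis whose summed power surface is closest to the observed processing scope wins. The storage format and exhaustive hypothesis generation are assumptions, not the original implementation:

import itertools
import numpy as np

def hypothesis_error(observed, models, gains):
    """Sum of squared power differences between observation and mixture."""
    mixture = sum(g * m for g, m in zip(gains, models))
    return float(np.sum((observed - mixture) ** 2))

def best_mixture(observed, tone_models, max_notes=3):
    """Try every combination of up to max_notes tone models (unit gains assumed)."""
    best = (np.inf, ())
    names = list(tone_models)                 # tone_models: dict name -> array
    for k in range(1, max_notes + 1):
        for combo in itertools.combinations(names, k):
            err = hypothesis_error(observed, [tone_models[n] for n in combo], [1.0] * k)
            if err < best[0]:
                best = (err, combo)
    return best                               # (error, tuple of tone-model names)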
13
Automatic Tone Modeling
Automatic acquisition of tone models from the analysed signal
–Based on the "old-plus-new heuristic" [Bregman 90]: a complex sound is interpreted as a continuation of the already-present ("old") sound wherever possible, and whatever remains is perceived as a new sound
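A rough sketch of the old-plus-new idea: explain as much of the observed power as possible with the already known ("old") tone models, and treat the residual as a new sound from which a new tone model is acquired. The residual thresholding and normalization below are assumptions:

import numpy as np

def acquire_new_model(observed, old_models, floor=1e-3):
    """observed and each model: (frames x freq bins) power arrays."""
    explained = sum(old_models) if old_models else np.zeros_like(observed)
    residual = np.clip(observed - explained, 0.0, None)   # the "new" part
    if residual.max() < floor:
        return None                        # nothing new to learn
    return residual / residual.max()       # normalized power -> new tone model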
14
Hierarchy of Perceptual Sound Events
15
A Few Problems and Limitations
Octaves = no good
Psychoacoustic models
–Not tested over a large enough group
Detuning
–May not leave enough room for variance in real instruments (2.6% in the probability function)
Lots of free parameters
–Seemingly a lot of tuning involved
16
Conclusion
Works well for 3-note polyphony
–Anssi Klapuri's claim: 18-note range; works for flute, piano, trumpet
Groundbreaking in that it used a perceptual-system model
–Based on auditory scene analysis
Lots of free parameters
–Seemingly a lot of tuning involved