Published by Denis Pierce. Modified over 9 years ago.
1
Piano Music Transcription
Wes “Crusher” Hatch
MUMT-614
Thurs., Feb. 13
2
Introduction
Polyphonic pitch extraction
Want to realize “computational scene analysis” (Klapuri)
Problem is comparable to speech recognition
3
Current state of affairs
Many different approaches
–Nothing is 100% reliable, or even 90%…or 80%…
–Drawback: with no single shared heuristic, no one is building on, or learning from, previous work and experience
4
Parameters to extract
Pitch
Amplitude
Onset and duration
Do NOT require:
–Spatial location
–Timbre
5
Benefits of knowing timbre
Can assume a piano sound for input, and:
–Simplifies things down the road
–Don’t need to calculate a “sound source model of an instrument” (Marolt)
–Can make assumptions about strengths of various partials (Martin)
–Makes other techniques possible (e.g. differential spectrum analysis, Hawley)
6
Recent developments
A few techniques are gaining prominence:
–Blackboard systems (Bello, Monti & Sandler, Martin)
–Neural networks
–Pitch perception models based on human audition (gammatone filterbank front-end)
–To a lesser extent: hidden Markov models
7
Benefits of Blackboards
Can incorporate all previous approaches and methodologies
Top-down or bottom-up
Easily expandable
–Can be easily updated to accommodate new technology
8
A very general heuristic
Front-end
Analysis, representation, pitch hypotheses
Top-down processes (which in turn affect front-end analysis and pitch guesses)
Transcribed notes out (Guido, MIDI, etc.)
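The last stage, writing notes out as MIDI, needs a mapping from an estimated frequency to a note number. A minimal sketch, assuming equal temperament with A4 = 440 Hz (MIDI note 69); the function names are illustrative and not taken from any of the cited systems:

```python
import math

A4_HZ = 440.0   # assumed tuning reference
A4_MIDI = 69    # MIDI note number of A4


def freq_to_midi(freq_hz):
    """Round a frequency in Hz to the nearest equal-tempered MIDI note."""
    return round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ))


def midi_to_freq(note):
    """Nominal frequency in Hz of a MIDI note number."""
    return A4_HZ * 2 ** ((note - A4_MIDI) / 12)


if __name__ == "__main__":
    for f in (27.5, 261.63, 440.0):
        print(f, "Hz -> MIDI", freq_to_midi(f))
```

Rounding to the nearest semitone discards tuning detail, which is usually acceptable for piano input.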
9
Commonalities between systems
Transform data into a frequency representation
–STFT & tracking phase vocoder (Dixon)
–Sinusoid tracks (Martin)
–Gammatone filterbank (Marolt, Martin)
Top-down organization
System has the ability to learn
–Neural nets (Marolt, Bello)
–HMM (Raphael)
–“timbre adaption” (Dixon, soon)
10
Top-down is super
Bottom-up: analysis --> note hypotheses
–Unidirectional
–Doesn’t know about past analyses; only concern is hierarchical flow of data
–Inflexible
Top-down: high --> low level
–Different levels of the system are determined by predictive models and previous knowledge
–Implemented by neural nets, blackboard systems
11
Happy schematic
Low level --> mid-level --> high level
12
Front-end techniques
Sinusoidal
–STFT: constant frequency spacing means better resolution (relative to pitch) in high frequencies, poorer resolution in the low frequency range
–Tracking phase vocoder
–Sinusoid tracks: track continuous regions of local energy maxima in the time-frequency domain (e.g. Dixon)
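The STFT resolution point can be made concrete by comparing the fixed bin spacing with the width of a semitone at different pitches. A rough sketch; the 44.1 kHz sample rate and 4096-point FFT are assumed values chosen for illustration, not settings from the cited systems:

```python
SEMITONE_RATIO = 2 ** (1 / 12)


def bin_spacing_hz(sample_rate, fft_size):
    """Distance between adjacent STFT bins (the same at every frequency)."""
    return sample_rate / fft_size


def semitone_gap_hz(freq_hz):
    """Distance in Hz from a pitch to the semitone above it."""
    return freq_hz * (SEMITONE_RATIO - 1)


def bins_per_semitone(freq_hz, sample_rate, fft_size):
    """How many STFT bins fit inside one semitone at this pitch."""
    return semitone_gap_hz(freq_hz) / bin_spacing_hz(sample_rate, fft_size)


if __name__ == "__main__":
    SAMPLE_RATE = 44100  # assumed
    FFT_SIZE = 4096      # assumed
    for name, freq in (("A0", 27.5), ("A4", 440.0), ("C8", 4186.01)):
        ratio = bins_per_semitone(freq, SAMPLE_RATE, FFT_SIZE)
        print(f"{name}: {ratio:.2f} bins per semitone")
```

With these settings, adjacent notes at the bottom of the piano fall within a single bin, while at the top each semitone spans many bins.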
13
Front-end techniques, cont.
Correlation
–Try to model human audition
  Constant Q: mimics logarithmic resolution of the human ear
–Gammatone filterbank
  Output of each filter is then processed by a model of “inner hair cell” dynamics
  Further analysis by short-time autocorrelation
  Variable filter widths; filters generally implemented across ~70 - 6000 Hz
–Same problems as found in scene analysis
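The short-time autocorrelation step can be sketched on its own. This simplified example runs autocorrelation directly on a raw test tone; it omits the gammatone filterbank and hair-cell model entirely, and the search range and signal parameters are assumptions for illustration:

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of one frame at a single lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch(frame, sample_rate, fmin=70.0, fmax=1000.0):
    """Return the frequency whose period gives the strongest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # assumed, kept low so the pure-Python loop stays fast
# Test signal: a 200 Hz sine, i.e. a period of exactly 40 samples.
frame = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1024)]
pitch = estimate_pitch(frame, SAMPLE_RATE)

if __name__ == "__main__":
    print("estimated pitch:", pitch, "Hz")
```

This only finds a single pitch per frame. The polyphonic case is what makes per-channel analysis and the higher-level stages necessary.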
14
Onset detection
Neural nets
–Differences between 6 ms and 18 ms amplitude envelopes (Marolt)
Change in high frequency content (Bello)
Zero-lag correlation for each filterbank channel
–Running estimate of energy (Martin)
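The idea of comparing a fast and a slow amplitude envelope can be illustrated without a neural net. A rough sketch only: the moving-average envelopes, the fixed threshold and the test signal are assumptions, and this is not Marolt's actual onset detector:

```python
import math


def moving_average(values, window):
    """Trailing moving average; samples before the start count as zero."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / window)
    return out


def detect_onset(signal, sample_rate, fast_ms=6.0, slow_ms=18.0, threshold=0.1):
    """Index of the first sample where the fast envelope leads the slow one."""
    rectified = [abs(x) for x in signal]
    fast = moving_average(rectified, int(sample_rate * fast_ms / 1000))
    slow = moving_average(rectified, int(sample_rate * slow_ms / 1000))
    for n, (f, s) in enumerate(zip(fast, slow)):
        if f - s > threshold:
            return n
    return None


SAMPLE_RATE = 8000  # assumed
TRUE_ONSET = 800    # 0.1 s of silence, then a 440 Hz tone
signal = [0.0] * TRUE_ONSET + [
    math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(800)
]
onset = detect_onset(signal, SAMPLE_RATE)

if __name__ == "__main__":
    print("true onset:", TRUE_ONSET, "detected:", onset)
```

The fast envelope reacts to the attack before the slow one does, so their difference spikes shortly after the note starts and decays once both have settled.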
15
Analysis & pitch hypotheses
Blackboards
–Contain a variety of knowledge sources (KSs)
Neural nets “fuzzy logic”
–May contain front-end processing, or may be fed results thereof
–Can be used for the entire process (front-end, data representation, pitch hypotheses) or just to tabulate pitch guesses at the end
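The blackboard idea (a shared workspace that independent knowledge sources read from and write to) can be shown with a toy example. Everything here is hypothetical: the class, the two knowledge sources and the greedy harmonic grouping are illustrative only and do not reproduce any of the cited systems:

```python
class Blackboard:
    """Shared workspace holding hypotheses at different levels."""

    def __init__(self, partials_hz):
        self.partials = list(partials_hz)  # low level: unexplained peaks
        self.pending = None                # mid level: candidate fundamental
        self.notes = []                    # high level: accepted notes (Hz)


def propose_fundamental(board):
    """KS 1: take the lowest unexplained partial as a candidate f0."""
    if board.pending is not None or not board.partials:
        return False
    board.pending = min(board.partials)
    return True


def explain_harmonics(board, tolerance=0.03):
    """KS 2: remove partials near integer multiples of the candidate."""
    if board.pending is None:
        return False
    f0 = board.pending

    def is_harmonic(p):
        h = round(p / f0)
        return h >= 1 and abs(p / f0 - h) / h < tolerance

    board.partials = [p for p in board.partials if not is_harmonic(p)]
    board.notes.append(f0)
    board.pending = None
    return True


def run(board, knowledge_sources):
    """Scheduler: keep firing the first applicable KS until none applies."""
    while any(ks(board) for ks in knowledge_sources):
        pass


board = Blackboard([220.0, 440.0, 660.0, 277.2, 554.4, 831.6])
run(board, [propose_fundamental, explain_harmonics])

if __name__ == "__main__":
    print("note hypotheses (Hz):", board.notes)
```

New knowledge sources (a neural-net note recognizer, a musical-rules checker) could be appended to the list without changing the scheduler, which is the expandability argument in miniature. This greedy grouping would fail when two notes share partials.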
16
Analysis & pitch hypotheses
Peak-picking together w/ phase spectrum (helps to resolve low-frequency uncertainties)
–“atoms of energy localized in time and frequency” (Dixon)
HMM
Neural nets (note, chord recognizers)
–Trained to look for one given note (e.g. C4)
–Can also be a KS in a blackboard system
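Peak-picking on a magnitude spectrum reduces to finding local maxima above a threshold. A minimal sketch with a made-up spectrum; it covers only the magnitude part and leaves out the phase-based frequency refinement mentioned above:

```python
def pick_peaks(magnitudes, threshold):
    """Return bin indices that are local maxima above the threshold."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        if (magnitudes[k] > threshold
                and magnitudes[k] > magnitudes[k - 1]
                and magnitudes[k] >= magnitudes[k + 1]):
            peaks.append(k)
    return peaks


def bin_to_hz(k, sample_rate, fft_size):
    """Centre frequency of STFT bin k."""
    return k * sample_rate / fft_size


# Toy magnitude spectrum (invented values, for illustration only).
spectrum = [0.0, 0.1, 0.9, 0.2, 0.1, 0.5, 0.1, 0.05, 0.3, 0.0]
peaks = pick_peaks(spectrum, threshold=0.25)

if __name__ == "__main__":
    print("peak bins:", peaks)
```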
17
Pitfalls
Octave errors: most common error source
Some solutions:
–“feedback to provide inhibition from the output of the note recognition stage to its input” (Marolt)
–Instrumental models (have knowledge about strengths of various partials: “spectral shape”)
–Apply general musical knowledge (voice leading rules, harmony & counterpoint, etc.) (Kashino)
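Octave errors are common because every partial of a note coincides with a partial of the note an octave below. A small sketch, assuming ideal harmonic partials (real piano strings are slightly inharmonic) and integer frequencies so the comparison is exact:

```python
def partials(f0_hz, count):
    """First `count` ideal harmonic partials of a fundamental."""
    return [f0_hz * h for h in range(1, count + 1)]


def shared_fraction(upper_f0, lower_f0, count=10):
    """Fraction of the upper note's partials also present in the lower note."""
    lower = set(partials(lower_f0, 2 * count))
    upper = partials(upper_f0, count)
    return sum(1 for p in upper if p in lower) / len(upper)


if __name__ == "__main__":
    print("A4 partials shared with A3:", shared_fraction(440, 220))
    print("E4 (approx. 330 Hz) partials shared with A3:", shared_fraction(330, 220))
```

Because the octave overlap is total, spectral evidence for A4 alone cannot rule out A3 (or A3 and A4 together). This is why the slide's remedies rely on partial strengths and musical context.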
18
Different systems’ results
Dixon: 70-80% correct
SONIC (Marolt): 80-95% correct (13-25% extra notes)
Monti & Sandler: 74% correct
Raphael: 39% wrong/missed
Bello, Martin: no data available
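Figures like "% correct" and "% extra notes" depend on how notes are matched, and the cited systems do not necessarily share a definition. One possible scoring scheme, stated as an assumption: a transcribed note is correct if it has the same MIDI pitch as a reference note and its onset is within a tolerance; "% correct" is relative to the reference notes and "% extra" to the transcribed notes. The note lists are invented for illustration:

```python
def score_transcription(reference, transcribed, onset_tol=0.05):
    """Return (percent_correct, percent_extra) for non-empty note lists.

    Each note is a (midi_pitch, onset_seconds) pair.
    """
    unmatched = list(transcribed)
    correct = 0
    for pitch, onset in reference:
        for cand in unmatched:
            if cand[0] == pitch and abs(cand[1] - onset) <= onset_tol:
                unmatched.remove(cand)  # each transcribed note matches once
                correct += 1
                break
    percent_correct = 100.0 * correct / len(reference)
    percent_extra = 100.0 * len(unmatched) / len(transcribed)
    return percent_correct, percent_extra


# Invented example: a C major chord followed by a high C.
reference = [(60, 0.0), (64, 0.0), (67, 0.0), (72, 0.5)]
transcribed = [(60, 0.01), (64, 0.02), (72, 0.52), (48, 0.0)]
scores = score_transcription(reference, transcribed)

if __name__ == "__main__":
    print("percent correct, percent extra:", scores)
```

Here the missed G4 lowers the correct rate, and the spurious C3 (an octave error) counts as an extra note.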
19
Conclusions
Exponentially more difficult than monophonic transcription
Are slowly approaching very good, robust systems
–Compare to Moore, 1975
–Very few restrictions in the input data
Top-level organizations are key
–Blackboards, neural networks