Using Blackboard Systems for Polyphonic Transcription: A Literature Review
by Cory McKay
Outline
  Intro to polyphonic transcription
  Intro to blackboard systems
  Keith Martin’s work
  Kunio Kashino’s work
  Recent contributions
  Conclusion
Polyphonic Transcription
  Represent an audio signal as a score
  Must segregate notes belonging to different voices
  Problems: variations of timbre within a voice, voice crossing, identification of correct octave
  No successful general-purpose system to date
Polyphonic Transcription
  Can use simplified models:
    –Music for a single instrument (e.g. piano)
    –Extract only a given instrument from mix
    –Use music which obeys restrictive rules
  Simplified systems have had success rates of between 80% and 90%
  These rates may be exaggerated, since only very limited test suites are generally used
Polyphonic Transcription
  Systems to date generally identify only rhythm, pitch and voice
  Would like systems that also identify other notated aspects such as dynamics and vibrato
  Ideal is to have a system that can identify and understand parameters of music that humans hear but do not notate
Blackboard Systems
  Used in AI for decades but only applied to music transcription in the early 1990s
  Term “blackboard” comes from the notion of a group of experts standing around a blackboard working together to solve a problem
  Each expert writes contributions on the blackboard
  Experts watch the problem evolve on the blackboard, making changes until a solution is reached
Blackboard Systems
  “Blackboard” is a central dataspace
  Usually arranged in a hierarchy so that input is at the lowest level and output is at the highest
  “Experts” are called “knowledge sources” (KSs)
  KSs generally consist of a set of heuristics and a precondition whose satisfaction results in a hypothesis that is written on the blackboard
  Each KS forms hypotheses based on information from the front end of the system and hypotheses presented by other KSs
Blackboard Systems
  Problem is solved when all KSs are satisfied with all hypotheses on the blackboard to within a given margin of error
  Eliminates need for a global control module
  Each KS can be easily updated and new KSs can be added with little difficulty
  Combines top-down and bottom-up processing
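The blackboard cycle described on the preceding slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any of the reviewed systems: the class names, the level names "partial" and "note", and the toy heuristic (treat the lowest partial as the fundamental) are all hypothetical, and a simple loop stands in for whatever mechanism decides which KS acts next.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hypothesis:
    level: str      # hypothetical level name, e.g. "partial" or "note"
    value: object


@dataclass
class Blackboard:
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def at(self, level):
        """Return all hypotheses currently written at one level."""
        return [h for h in self.hypotheses if h.level == level]

    def add(self, hyp):
        """Write a hypothesis; return True only if it was not already present."""
        if hyp in self.hypotheses:
            return False
        self.hypotheses.append(hyp)
        return True


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[Blackboard], bool]
    action: Callable[[Blackboard], List[Hypothesis]]


def run(board, sources, max_rounds=100):
    """Fire every KS whose precondition holds until nothing new is written."""
    for _ in range(max_rounds):
        changed = False
        for ks in sources:
            if ks.precondition(board):
                for hyp in ks.action(board):
                    changed = board.add(hyp) or changed
        if not changed:
            break
    return board


def partials_to_note(board):
    # Toy heuristic (an assumption): the lowest partial is the fundamental.
    freqs = sorted(h.value for h in board.at("partial"))
    return [Hypothesis("note", freqs[0])]


note_ks = KnowledgeSource(
    name="partials->note",
    precondition=lambda b: len(b.at("partial")) >= 2,
    action=partials_to_note,
)

board = Blackboard([Hypothesis("partial", f) for f in (220.0, 440.0, 660.0)])
run(board, [note_ks])
print([h.value for h in board.at("note")])  # [220.0]
```

New knowledge sources can be appended to the list passed to `run` without touching the existing ones, which is the modularity the slides point to.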
Blackboard Systems
  Music has a naturally hierarchical structure that lends itself well to blackboard systems
  Allow integration of different types of expertise:
    –signal processing KSs at low level
    –human perception KSs at middle level
    –musical knowledge KSs at upper level
Blackboard Systems
  Limitation: giving upper-level KSs too much specialized knowledge and influence limits the generality of transcription systems
  Ideal system would not use knowledge above the level of human perception and the most rudimentary understanding of music
  Current trend is to increase the significance of upper-level musical KSs in order to increase success rate
Keith Martin (1996a)
  “A Blackboard System for Automatic Transcription of Simple Polyphonic Music”
  Used a blackboard system to transcribe a four-voice Bach chorale with appropriate segregation of voices
  Limited input signal to synthesized piano performances
  Gave system only rudimentary musical knowledge, although the choice of a Bach chorale allowed lower-level KSs to use assumptions that would not generally be acceptable
Keith Martin (1996a)
  Front-end system used a short-time Fourier transform on the input signal
  Equivalent to a filter bank that is a gross approximation of the way the human cochlea processes auditory signals
  Blackboard system fed sets of associated onset times, frequencies and amplitudes
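As a rough illustration of a short-time Fourier transform front end, the sketch below computes magnitude spectra frame by frame using a naive DFT from the standard library. The frame size, hop, Hann window and sample rate are assumptions chosen for the example; the slides do not state the parameters Martin used. Reading one bin index across successive frames gives the output of one band-pass channel, which is the sense in which the transform behaves like a filter bank.

```python
import cmath
import math


def stft_magnitudes(signal, frame_size, hop):
    """Return one magnitude spectrum (bins 0..frame_size/2) per frame."""
    # Periodic Hann window to reduce leakage between bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


sample_rate = 8000      # Hz (assumed)
frame_size = 64         # bin spacing = 8000 / 64 = 125 Hz
tone_hz = 1000.0        # falls exactly on bin 8
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(256)]

frames = stft_magnitudes(signal, frame_size, hop=32)
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
print(peak_bin, peak_bin * sample_rate / frame_size)  # 8 1000.0
```

A practical front end would use an FFT and then pick peaks per frame to produce the onset, frequency and amplitude sets mentioned above; that step is omitted here.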
Keith Martin (1996a)
  Knowledge sources made five classes of hierarchically organized hypotheses:
    –“Tracks”
    –Partials
    –Notes
    –Intervals
    –Chords
Keith Martin (1996a)
  Three types of knowledge sources:
    –Garbage collection
    –Physics
    –Musical practice
  Thirteen knowledge sources in all
  Each KS only authorized to make certain classes of hypotheses
Keith Martin (1996a)
  KSs with access to upper-level hypotheses can put “pressure” on KSs with lower-level access to make certain hypotheses, and vice versa
  Example: if the hypotheses have been made that the notes C and G are present in a beat, a KS with information about chords might put forward the hypothesis that there is a C chord, thus putting pressure on other KSs to find an E or Eb
  Used a sequential scheduler to coordinate KSs
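The C and G example can be made concrete with a small sketch. The function name and the restriction to major and minor triads are assumptions made for illustration; this is not Martin's chord KS. Given a root and fifth already on the blackboard, it lists which note each candidate chord would still need, which is the "pressure" passed down to lower-level KSs.

```python
NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]
TRIADS = (("major", (0, 4, 7)), ("minor", (0, 3, 7)))


def chord_expectations(observed, root):
    """Map each candidate triad on `root` to the notes still missing.

    Candidates are proposed only when both the root and the fifth
    have already been observed.
    """
    root_pc = NAMES.index(root)
    present = {NAMES.index(name) for name in observed}
    if root_pc not in present or (root_pc + 7) % 12 not in present:
        return {}
    expectations = {}
    for label, pattern in TRIADS:
        members = {(root_pc + step) % 12 for step in pattern}
        missing = sorted(members - present)
        expectations[f"{root} {label}"] = [NAMES[pc] for pc in missing]
    return expectations


print(chord_expectations({"C", "G"}, "C"))
# {'C major': ['E'], 'C minor': ['Eb']}
```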
Keith Martin (1996b)
  “Automatic Transcription of Simple Polyphonic Music: Robust Front End Processing”
  Previous system often misidentified octaves
  Attempted to improve performance by shifting the octave identification task from a top-down process to a bottom-up process
Keith Martin (1996b)
  Proposes the use of log-lag correlograms in the front end
  Models the inner hair cells in the cochlea with a bank of filters
  Determines pitch by measuring the periodic energy in each filter channel as a function of lag
  Correlograms now the basic unit fed to the blackboard system
  No definitive results as to which approach is better
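Measuring periodic energy as a function of lag amounts to an autocorrelation of each channel. The sketch below does this for a single synthetic channel and reads the pitch off the strongest lag. It is a simplification under assumed parameters: a full correlogram repeats this for every filter channel, and the log-lag variant resamples the lag axis logarithmically, neither of which is shown here.

```python
import math


def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation r[lag] for lag = 0..max_lag.

    Every lag is summed over the same number of samples so that
    values at different lags are directly comparable.
    """
    span = len(x) - max_lag
    return [sum(x[i] * x[i + lag] for i in range(span))
            for lag in range(max_lag + 1)]


def best_lag(x, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the most periodic energy."""
    r = autocorrelation(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])


sample_rate = 8000      # Hz (assumed)
f0 = 200.0              # fundamental; period = 8000 / 200 = 40 samples
channel = [math.sin(2 * math.pi * f0 * n / sample_rate)
           + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
           for n in range(380)]

lag = best_lag(channel, min_lag=10, max_lag=60)
print(lag, sample_rate / lag)  # 40 200.0
```

Because the peak sits at the period of the fundamental even when a strong second harmonic is present, this kind of measurement addresses octave identification from the bottom up.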
Kashino, Nadaki, Kinoshita and Tanaka (1995)
  “Application of Bayesian Probability Networks to Music Scene Analysis”
  Work slightly preceded that of Martin
  Used test patterns involving more than one instrument
  Uses principles of stream segregation from auditory scene analysis
  Implements more high-level musical knowledge
  Uses a Bayesian network instead of Martin’s simple scheduler to coordinate KSs
Kashino, Nadaki, Kinoshita and Tanaka (1995)
  Knowledge sources used:
    –Chord transition dictionary
    –Chord-note relation
    –Chord naming rules
    –Tone memory
    –Timbre models
    –Human perception rules
  Used very specific instrument timbres and musical rules, so has limited general applicability
Kashino, Nadaki, Kinoshita and Tanaka (1995)
  Tone memory: frequency components of different instruments played with different parameters
  Found that the integration of tone memory with the other KSs greatly improved success rates
Kashino, Nadaki, Kinoshita and Tanaka (1995)
  Bayesian networks well known for finding good solutions despite noisy input or missing data
  Often used in implementing learning methods that trade off prior belief in a hypothesis against its agreement with current data
  Therefore seem to be a good choice for coordinating KSs
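The trade-off between prior belief and agreement with current data is Bayes' rule. A minimal sketch for two competing chord hypotheses follows; all of the numbers are made up for illustration and do not come from the paper. The prior favours C major, the evidence favours C minor, and the posterior balances the two.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a discrete set of competing hypotheses."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Made-up prior belief, e.g. from a chord transition dictionary.
priors = {"C major": 0.7, "C minor": 0.3}
# Made-up agreement of each hypothesis with the current signal evidence.
likelihoods = {"C major": 0.2, "C minor": 0.8}

post = posterior(priors, likelihoods)
print({h: round(p, 3) for h, p in post.items()})
# {'C major': 0.368, 'C minor': 0.632}
```

A Bayesian network applies the same rule across many linked hypotheses at once; this sketch shows only a single node.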
Kashino, Nadaki, Kinoshita and Tanaka (1995)
  No experimental comparisons of this approach and Martin’s simple scheduler
  Only used simple test patterns rather than real music
Kashino and Hagita (1996)
  “A Music Scene Analysis System with the MRF-Based Information Integration Scheme”
  Suggests replacing Bayesian networks with a Markov Random Field (MRF) hypothesis network
  Successful in correcting the two most common problems in the previous system:
    –Misidentification of instruments
    –Incorrect octave labelling
Kashino and Hagita (1996)
  MRF-based networks use simulated annealing to converge to a low-energy state
  MRF approach enables information to be integrated on a multiply connected hypothesis network
  Bayesian networks only allow singly connected networks
  Could now deal with two kinds of transition information within a single hypothesis network:
    –chord transitions
    –note transitions
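Simulated annealing on a small, multiply connected label network might look like the sketch below. Everything in it is a toy assumption: three notes, two octave labels each, made-up unary energies for how poorly each label fits local evidence, and a fixed penalty when linked notes disagree. It is meant only to show how a looped network settles into a low-energy labelling, not to reproduce the paper's energy function.

```python
import math
import random
from itertools import product

LABELS = ["C4", "C5"]   # two octave candidates per note (toy example)
# Unary energy: how poorly each label fits local evidence (made-up values).
UNARY = [
    {"C4": 0.2, "C5": 1.0},
    {"C4": 0.6, "C5": 0.5},     # ambiguous when considered alone
    {"C4": 0.1, "C5": 1.2},
]
EDGES = [(0, 1), (1, 2), (0, 2)]    # forms a loop, i.e. multiply connected
JUMP_PENALTY = 0.8                  # cost when linked notes disagree


def energy(state):
    total = sum(UNARY[i][label] for i, label in enumerate(state))
    total += sum(JUMP_PENALTY for a, b in EDGES if state[a] != state[b])
    return total


def anneal(initial, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Return the lowest-energy labelling visited during annealing."""
    rng = random.Random(seed)
    state = list(initial)
    best = list(state)
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = list(state)
        candidate[rng.randrange(len(state))] = rng.choice(LABELS)
        delta = energy(candidate) - energy(state)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state = candidate
            if energy(state) < energy(best):
                best = list(state)
    return best


result = anneal(["C5", "C5", "C5"])
brute = min(product(LABELS, repeat=len(UNARY)), key=energy)
print(result, energy(result))
print(list(brute), energy(brute))
```

The second note is ambiguous by itself, but its links to the other two pull it toward the same octave, which is the kind of correction the MRF network is credited with above. The brute-force search is included only as a check and is feasible here because the example has just eight states.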
Kashino and Hagita (1996)
  Instrument and octave identification errors corrected, but some new errors introduced
  Overall, performed roughly 10% better than the Bayesian-based system at transcribing a 3-part arrangement of Auld Lang Syne
  Still only had a recognition rate of 71.7%
Kashino and Murase (1998)
  Shifts some work away from the blackboard system by feeding it higher-level information
  Simplifies and mathematically formalizes the notion of knowledge sources
  Switches back to a Bayesian network
  Perhaps not truly a blackboard system anymore
  Has very good recognition rate
  Scalability of system is seriously compromised by new approach
Kashino and Murase (1998)
  Uses adaptive template matching
  Implemented using a bank of filters arranged in parallel and a number of templates corresponding to particular notes played by particular instruments
  The correlation between the outputs of the filters is calculated and a match is then made to one of the templates
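The matching step can be illustrated with a plain normalised correlation against stored templates. This sketch leaves out the adaptive part and the filter bank, and the template values, instrument set and four-element feature vectors are made up for the example.

```python
import math


def normalised_correlation(a, b):
    """Cosine-style correlation between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up templates: relative strength of the first four harmonics
# for one note on each instrument.
TEMPLATES = {
    ("piano", "A4"): [1.0, 0.5, 0.25, 0.1],
    ("flute", "A4"): [1.0, 0.1, 0.3, 0.0],
    ("violin", "A4"): [1.0, 0.8, 0.6, 0.4],
}


def best_match(observed):
    """Return the (instrument, note) key whose template correlates best."""
    return max(TEMPLATES,
               key=lambda key: normalised_correlation(observed, TEMPLATES[key]))


observed = [0.9, 0.45, 0.3, 0.1]
print(best_match(observed))  # ('piano', 'A4')
```

Every added instrument and note contributes another template to compare against, which gives a sense of why the template set grows quickly as coverage widens.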
Kashino and Murase (1998)
  Achieved recognition rate of 88.5% on real recordings of piano, violin and flute
  Including templates for many more instruments could make adaptive template matching intractable
  Particularly a problem for instruments with
    –Similar frequency spectra
    –A great deal of spectral variation from note to note
Hainsworth and Macleod (2001)
  “Automatic Bass Line Transcription from Polyphonic Music”
  Wanted to be able to extract a single given instrument from an arbitrary musical signal
  Contrast to previous approaches of using recordings of only one instrument or a set of pre-defined instruments
Hainsworth and Macleod (2001)
  Chose to work with bass
    –Can filter out high frequencies
    –Notes usually fairly steady
  Used simple mathematical relations to trim hypotheses rather than a true blackboard system
  Had a 78.7% success rate on a Miles Davis recording
Bello and Sandler (2000)
  “Blackboard Systems and Top-Down Processing for the Transcription of Simple Polyphonic Music”
  Return to a true blackboard system
  Based on Martin’s implementation, using a conventional scheduler
  Refines knowledge sources and adds high-level musical knowledge
  Implements one of the knowledge sources as a neural network
Bello and Sandler (2000)
  The chord recognizer KS is a feedforward network
  Trained using the spectrograms of different chords played on a piano
  Trained network is fed a spectrogram and outputs possible chords
  Can therefore output more than one hypothesis at each iteration
  Gives other KSs more information and allows parallel exploration of the solution space
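A feedforward chord recognizer that returns several ranked hypotheses per input could be sketched as below. The weights are made-up placeholders standing in for a trained network, and the four-element input, three chord classes, tanh hidden layer, softmax output and 0.25 threshold are all assumptions chosen to keep the example small; none of them come from the paper.

```python
import math

CHORDS = ["C major", "C minor", "G major"]
# Made-up weights standing in for a trained network (4 inputs, 3 hidden units).
W_HIDDEN = [[0.9, -0.4, 0.3, 0.1],
            [-0.2, 0.8, 0.5, -0.3],
            [0.4, 0.1, -0.6, 0.7]]
W_OUT = [[1.2, -0.5, 0.3],
         [1.0, 0.4, -0.2],
         [-0.7, 0.9, 0.6]]


def forward(features):
    """One forward pass: tanh hidden layer, then softmax over chords."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in W_HIDDEN]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def chord_hypotheses(features, threshold=0.25):
    """Return every chord whose probability clears the threshold, best first."""
    ranked = sorted(zip(CHORDS, forward(features)),
                    key=lambda pair: pair[1], reverse=True)
    return [(chord, prob) for chord, prob in ranked if prob >= threshold]


features = [0.8, 0.1, 0.6, 0.2]     # stand-in for one spectrogram frame
for chord, prob in chord_hypotheses(features):
    print(f"{chord}: {prob:.2f}")
```

Returning every chord above a threshold, rather than only the top one, is what lets a KS of this kind post several hypotheses at once for other KSs to confirm or reject.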
Bello and Sandler (2000)
  Could automatically retrain network to recognize spectrograms of other instruments with no manual modifications needed
  Preliminary testing showed a tendency to misidentify octaves and to identify note onsets incorrectly
  These problems could potentially be corrected by the signal processing system that feeds the blackboard system
Conclusions
  Bass transcription system and more recent work of Kashino useful for specific applications, but limited potential for general transcription purposes
  True blackboard approach scales well and appears to hold the most potential for general-purpose polyphonic transcription
Conclusions
  Use of adaptive learning in knowledge sources seems promising
  Interchangeable modules could be automatically trained to specialize in different areas
  Could have semi-automatic transcription, where user chooses correct modules and system performs transcription using them