“Which Line Is it Anyway”: Clustering for Staff Line Removal in Optical Music Recognition Sam Vinitsky NAME
Optical Music Recognition: An Overview OMR is the task of taking an image of sheet music and converting it into a format readable by computers T: WTF is sheet music
Optical Music Recognition: An Overview OMR is the task of taking an image of sheet music and converting it into a format readable by computers T: WTF is sheet music
Sheet Music Primer Sheet music is the what musicians use to write down + read music Visual language T: Consists of staff lines and symbols
Sheet Music Primer: Staff Lines Traintracks that you follow along See a symbol, do a thing
Sheet Music Primer: Symbols Everything else is a “symbol” (black) Symbols tell you what to do What notes to play How loud to be Location with respect to staff lines matters T: Now that we’re all on the same page...
Back to Optical Music Recognition... Again, our goal... Computer readable format (many of these exist) T: You might be wondering “why do we want to do this”?
Applications of OMR: Handwritten Music Biggest application Taking handwritten music and converting it into a clean digital format (for mass producing or achive) people like to write by hand Easier Time consuming to digitize Other cool applications... T: So how are we gonna actually do this?
Optical Music Recognition: Pipeline
Optical Music Recognition: Pipeline Find and remove staff lines They’re just gonna get in the way We do care where they are though (symbols located wrt them) Write down for later
Optical Music Recognition: Pipeline Find and remove staff lines Find symbols Using CC’s
Optical Music Recognition: Pipeline Find and remove staff lines Find symbols Determine type of each symbol 1 = note 2 = treble clef 3 = sharp 4 = sharp 5 = three 6 = note ...
Optical Music Recognition: Pipeline Find and remove staff lines Find symbols Determine type of each symbol Determine how symbols relate to each other 1 = note (G4) 2 = treble clef (first clef) 3 = sharp (key sig with 4) 4 = sharp (key sig with 3) 5 = three (triplet for 6 8 11) 6 = note (E5, triplet 5) ... Semantically Hierarchical structure
Optical Music Recognition: Pipeline Find and remove staff lines Find symbols Determine type of each symbol Determine how symbols relate to each other Convert into computer readable format (musicXML, MIDI, etc...) 1 = note (G4) 2 = treble clef (first clef) 3 = sharp (key sig with 4) 4 = sharp (key sig with 3) 5 = three (triplet for 6 8 11) 6 = note (E5, triplet 5) ... THIS IS HARD Each step depends on the last one T: We’re just gonna focus on the first step...
Optical Music Recognition: Pipeline *** Find and remove staff lines *** Find symbols Determine type of each symbol Determine how symbols relate to each other Convert into computer readable format (musicXML, MIDI, etc...) 1 = note (G4) 2 = treble clef (first clef) 3 = sharp (key sig with 4) 4 = sharp (key sig with 3) 5 = three (triplet for 6 8 11) 6 = note (E5, triplet 5) ... Even this is a formidable task! T: let’s go
Staff Line Removal: Overview Input: original score image Output: staffless image To reiterate: get rid of staff lines T: Now you might be thinking to yourself, “Hey Sam, didn’t we already learn how to find lines in an image????” (If you forgot, Hough transform, we did a HW on it…)
Reminder: Hough Transform... The details don’t matter, because it doesn’t work T: This isn’t gonna work… big reason is….
Staff “Line” Removal... Staff lines aren’t always lines… T: Let’s zoom
Staff “Line” Removal...
Staff “Line” Removal...
Staff “Line” Removal... Curved due to page (scanning from a book) Thin → everything ruins it Rotations (can’t just derotate) Distortion or noise from scanning Need to keep this in mind → robust T: so how we actually gonna do this?
Staff Line Removal: Algorithms Handcrafted: Run-length [Carter & Bacon ‘92] Shortest Path [Cardoso et al. ‘08] Etc… [see Dalitz et al. ‘08] Handcrafted (use domain) Good if the music is perfect Usually not that good in PRACTICE (most scores aren’t perfect) Because not robust / don’t generalize T: Work to use ML to get better generalization
Staff Line Removal: Algorithms Handcrafted: Run-length [Carter & Bacon ‘92] Shortest Path [Cardoso et al. ‘08] Etc… [see Dalitz et al. ‘08] Supervised Machine Learning: (State of the Art) Classification of pixels [Calvo-Zarogoza et al. ‘16] Deep learning on image Convolutional Neural Networks [Calvo-Zarogoza et al. ‘17] Generative Adversarial Networks [Bhunia et al. ‘18] Supervised learning: (SOA) At pixel level or image level Pixel level use labelled data (pixels labelled staff + nonstaff) learn a rule to discriminate between them GAN/CNN: image level Images are LARGE (3000x1000) = SUPER COMPUTATIONALLY INTENSIVE Both have issued: Lots of VARIATION in sheet music style… / distortions / style Requires a good variety of labelled data to generalize well - (and we don’t have that cuz expensive to label) T: So I thought, how can we leverage the power of ML without needing so much labelled data?
Staff Line Removal: Algorithms Handcrafted: Run-length [Carter & Bacon ‘92] Shortest Path [Cardoso et al. ‘08] Etc… [see Dalitz et al. ‘08] Supervised Machine Learning: (State of the Art) Classification of pixels [Calvo-Zarogoza et al. ‘16] Deep learning on image Convolutional Neural Networks [Calvo-Zarogoza et al. ‘17] Generative Adversarial Networks [Bhunia et al. ‘18] Unsupervised Machine Learning: Clustering of pixels [Vinitsky ‘18] Unsupervised: Doesn’t need labelled data Robust + generalize well (FIXED BOTH OUR PROBLMES) T: Alg I came up with is to use clustering
Staff Line Removal via Pixel Clustering Algorithm: (Vinitsky ‘18) Convert each black pixel into a feature vector Idea: convert each pixel into a point in n-dimensional space (like a dot) T: How? Image modified from http://sli.ics.uci.edu/Classes-CS178-Notes/LinearClassify
Staff Line Removal via Pixel Clustering Image source: Calvo-Zaragova et al., ‘16 Lots of ways, just talk about one… Look at a window around each pixel, convert it into a row vector. This is a point in 2w^2 dimensional space… T: so again, the idea...
Staff Line Removal via Pixel Clustering Algorithm: (Vinitsky ‘18) Convert each black pixel into a feature vector Each point is a pixel’s window Image modified from http://sli.ics.uci.edu/Classes-CS178-Notes/LinearClassify
Staff Line Removal via Pixel Clustering Algorithm: (Vinitsky ‘18) Convert each black pixel into a feature vector “Cluster” the feature vectors into two groups Clustering into groups These all look alike... Image modified from http://sli.ics.uci.edu/Classes-CS178-Notes/LinearClassify
Staff Line Removal via Pixel Clustering Algorithm: (Vinitsky ‘18) Convert each black pixel into a feature vector “Cluster” the feature vectors into two groups Figure out which cluster is “staff” and which is “non-staff” Remove pixels from staff one Why should this work? Each image has hundreds of thousands of black pixels Staff and non-staff pixels look very different (in right FV) T: How did we do? Image modified from http://sli.ics.uci.edu/Classes-CS178-Notes/LinearClassify
Staff Line Removal: Results * = implemented ** = original algorithm Used standard datasets (handwritten + typeset) sampled 12 Compared to some of my other algo’s vs reported results of SOA (on the same handwritten) If you don’t know F1 score, think of it as accuracy from 0 to 1 (it’s not) Ours looks worse, but... Only one of mine that did okay on both datasets Not as good as SOA, but pretty close T: Qualitative results
Staff Line Removal: Results Clustering output Ground truth staff-less Typeset: Got almost all of the stafflines Notes look good Chopped up C (classify) Hashtags + clef degraded but recognizable T: handwritten
Staff Line Removal: Results Clustering output Ground truth staff-less Didn’t mess up any symbols too badly, Messed up lines did leave staff lines in… T: we did okay, how could we do better? Stuff I want to work on...
Future Work Clustering for Staff Line Removal: Other clustering methods I used K-means for finding the clusters… others out there
Future Work Clustering for Staff Line Removal: Other clustering methods Better feature vectors I tried a few different FV’s... Biggest factor
Future Work Clustering for Staff Line Removal: Other clustering methods Better feature vectors Picking the “staff” cluster There’s also the question of how to pick which cluster is staff First: picking the smaller one (naive) → ended up doing it by hand
Future Work Clustering for Staff Line Removal: Other clustering methods Better feature vectors Picking the “staff” cluster White pixels can be staff (due to noise) The issue that... 3 clusters - Background, staff, or nonstaff
Future Work Clustering for Staff Line Removal: Other clustering methods Better feature vectors Picking the “staff” cluster White pixels can be staff (due to noise) Optical Music Recognition: The rest of the pipeline...
References [1] Jorge Calvo-Zaragoza, Luisa Mico, and Jose Oncina. Music staff removal with supervised pixel classification. International Journal on Document Analysis and Recognition (IJDAR), 19(3):211–219, Sep 2016. [2] Jorge Calvo-Zaragoza, Jose J. Valero-Mas, and Antonio Pertusa. End-to-end optical music recognition using neural networks. In ISMIR, 2017. [3] Jaime Cardoso, Artur Capela, Ana Rebelo, and Carlos Guedes. A connected path approach for staff detection on a music score, 2008. [4] I. Fujinaga, M. Droettboom, C. Dalitz, and B. Pranzas. A comparative study of staff removal algorithms. IEEE Transactions on Pattern Analysis & Machine Intelligence, 30:753–766, 2008. [5] A. Konwer, A. K. Bhunia, A. Bhowmick, P. Banerjee, P. Pratim Roy, and U. Pal. Staff line Removal using Generative Adversarial Networks. ArXiv e-prints, 2018. [6] J. Novotny and J. Pokorny. Introduction to optical music recognition: Overview and practical challenges. 2015. [7] N.P. Carter, R.A. Bacon: Automatic Recognition of Printed Music. In H.S. Baird, H. Bunke, K. Yamamoto (editors): “Structured Document Image Analysis”, pp. 454-65, Springer, 1992.
OMR is the task of taking an image of sheet music and converting it into a format readable by computers T: WTF is sheet music
Staff Line Removal: Hough Transform (Issues) Original image Predicted staff (red) Actual staff (red) Suppose we DID find the line Hough will find ENTIRE line CROSSING over symbols, not stop at each symbol (we can modify this with trimming)
Staff Line Removal: Hough Transform (Issues) Original image Predicted staff-less Actual staff-less This will break each symbol into multiple CC’s...
Staff Line Removal: Hough Transform (Issues) Original image Predicted staff-less Actual staff-less So when we try to do step 2 (finding symbols), we’ll have trouble Step 3 (classification) is impossible There are techniques to choose which pixels to remove in a smart way (or put them back together), these tend to be #bad Keep this in mind when designing & effects evaluation (def of success) T: Okay so Hough Transform was bad, can we do better?
Staff Line Removal: Results Algorithm F1 - Typeset F1 - Handwritten Run-length* 0.9690 0.5956 Run-length (smarter)** 0.9841 0.0191 Clustering (4-dist)** 0.9040 0.4912 Clustering (window, w=3)** 0.7535 0.8007 Classifier (k-nn) n/a 0.8969 Classifier (SVM) 0.9603 Deep Learning (GAN’s) 0.9932 * = implemented ** = original algorithm Used standard datasets (handwritten + typeset) sampled 12 Compared to some of my other algo’s vs reported results of SOA (on the same handwritten) If you don’t know F1 score, think of it as accuracy from 0 to 1 (it’s not) Ours looks worse, but... Only one of mine that did okay on both datasets Not as good as SOA, but pretty close T: Qualitative results
Staff Line Removal: Results Algorithm F1 - Typeset F1 - Handwritten Run-length* 0.9690 0.5956 Run-length (smarter)** 0.9841 0.0191 Clustering (4-dist)** 0.9040 0.4912 Clustering (window, w=3)** 0.7535 0.8007 Classifier (k-nn) n/a 0.8969 Classifier (SVM) 0.9603 Deep Learning (GAN’s) 0.9932 * = implemented ** = original algorithm Used standard datasets (handwritten + typeset) sampled 12 Compared to some of my other algo’s vs reported results of SOA (on the same handwritten) If you don’t know F1 score, think of it as accuracy from 0 to 1 (it’s not) Ours looks worse, but... Only one of mine that did okay on both datasets Not as good as SOA, but pretty close T: Qualitative results