2015/10/10 Query-by-Singing/Humming: An Overview. J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab, CS Dept., Tsing Hua Univ., Taiwan.

-1- Query-by-Singing/Humming: An Overview
- J.-S. Roger Jang (張智星)
- Multimedia Information Retrieval Lab, CS Dept., Tsing Hua Univ., Taiwan
- http://mirlab.org/jang

-2- Outline
- Introduction
- Methods for QBSH
  - Pitch tracking
  - Database comparison
- Demos and commercial applications
- Conclusions

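-3- Categories of Music Information Retrieval (MIR)
- Metadata-based
  - Example: song title, singer, tags, lyricist, composer
  - Query input: text or speech
- Content-based
  - Example: melody, chord, note onsets, moods...
  - Query input:
    - Symbolic: notes, chords, text
    - Acoustic: singing/humming, whistling, tapping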

-4- Acoustic Inputs for MIR
- Singing/humming
  - Query by humming (usually "ta" or "da")
  - Query by singing
- Whistling
  - Query by whistling
- Tapping
  - Query by tapping (at the onsets of notes)
- Speech
  - Query by the user's speech input (for metadata)
- Original music recordings
  - Query by recordings made with mobile phones
- Beatboxing

-5- Introduction to QBSH
- QBSH: Query by Singing/Humming
  - Input: singing or humming from a microphone
  - Output: a ranked list retrieved from the song database
- Progression
  - First paper: around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX

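-6- QBSH Workflow
- Preprocessing:
  - Collect single-track reference melodies (usually MIDI files)
  - Convert them into an intermediate format suitable for comparison
- Real-time processing:
  - Convert the user's audio input into a pitch vector
  - Convert the pitch vector into notes (optional)
  - Compare against the reference melodies
  - List the ranking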

-7- Flowchart of QBSH
[Figure: flowchart. On-line processing blocks: microphone input, filtering, pitch tracking, pitch vector smoothing, similarity comparison, query results (ranked song list). Off-line processing blocks: MIDI files, melody track extraction, frame-based representation.]

-8- Pitch Tracking for QBSH
- Two categories of pitch tracking algorithms
  - Time domain
    - ACF (autocorrelation function)
    - AMDF (average magnitude difference function)
    - SIFT (simple inverse filtering tracking)
  - Frequency domain
    - Harmonic product spectrum method
    - Cepstrum method

-9- Frame Blocking for Pitch Tracking
- Frame size = 256 points
- Overlap = 84 points
- Frame rate = 11025/(256-84) ≈ 64 pitch points/sec
[Figure: zoomed-in waveform showing a frame and its overlap with the next frame]
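
The frame parameters on this slide can be checked with a short sketch. The function names `frame_blocking` and `frame_rate` are illustrative and do not come from the slides, and the sketch assumes that trailing samples which do not fill a whole frame are simply dropped.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sequence of samples into overlapping frames."""
    hop = frame_size - overlap  # number of new samples contributed by each frame
    frames = []
    start = 0
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += hop
    return frames


def frame_rate(sample_rate, frame_size=256, overlap=84):
    """Number of frames (one pitch point per frame) produced per second."""
    return sample_rate / (frame_size - overlap)


if __name__ == "__main__":
    one_second = [0.0] * 11025          # one second of audio at 11025 Hz
    print(len(frame_blocking(one_second)), frame_rate(11025))
```

With a hop of 256 - 84 = 172 samples, one second at 11025 Hz yields roughly 64 pitch points, which matches the figure quoted on the slide.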

-10- ACF: Auto-correlation Function
- Frame: s(i)
- Shifted frame: s(i+τ), shown for τ = 30
- acf(30) = inner product of the overlapping part
[Figure: the pitch period is marked on the plot]
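
A minimal sketch of ACF-based pitch period estimation follows. It computes acf(τ) as the inner product of the overlapping part of the frame and its shifted copy, as on the slide. The search bounds `min_lag` and `max_lag` and the function names are assumptions added for illustration; a real tracker would derive the bounds from the expected pitch range.

```python
import math


def acf(frame):
    """acf[tau] = inner product of the frame and the frame shifted by tau samples,
    taken over the part where the two overlap."""
    n = len(frame)
    return [sum(frame[i] * frame[i + tau] for i in range(n - tau))
            for tau in range(n)]


def pitch_period_by_acf(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest ACF value."""
    values = acf(frame)
    return max(range(min_lag, max_lag + 1), key=lambda tau: values[tau])


if __name__ == "__main__":
    fs = 11025
    # Synthetic test frame: a sine wave whose period is exactly 30 samples.
    frame = [math.sin(2 * math.pi * i / 30) for i in range(256)]
    period = pitch_period_by_acf(frame, 10, 100)
    print(period, fs / period)  # period in samples, pitch in Hz
```

Because the overlap shrinks as τ grows, later ACF peaks are lower than the first one, which helps the maximum land on the fundamental period instead of a multiple of it.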

-11- Pitch Tracking via ACF
- Specs
  - Sample rate = 11025 Hz
  - Frame size = 32 ms
  - Overlap = 0
  - Frame rate = 31.25 frames/sec
- Playback
  - soo.wav
  - sooPitch.wav

-12- AMDF: Average Magnitude Difference Function
- Frame: s(i)
- Shifted frame: s(i+τ), shown for τ = 30
- amdf(30) = sum of absolute differences over the overlapping part
[Figure: the pitch period is marked on the plot]
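
The AMDF counterpart can be sketched the same way. Here the pitch period is taken at the minimum of the curve, not the maximum. The names and the lag bounds `min_lag` and `max_lag` are illustrative assumptions. Because this unnormalized form sums fewer terms at larger lags, the search range should be kept narrow enough to exclude multiples of the true period.

```python
import math


def amdf(frame):
    """amdf[tau] = sum of absolute differences between the frame and the frame
    shifted by tau samples, taken over the part where the two overlap."""
    n = len(frame)
    return [sum(abs(frame[i] - frame[i + tau]) for i in range(n - tau))
            for tau in range(n)]


def pitch_period_by_amdf(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the smallest AMDF value."""
    values = amdf(frame)
    return min(range(min_lag, max_lag + 1), key=lambda tau: values[tau])


if __name__ == "__main__":
    # Synthetic test frame: a sine wave whose period is exactly 30 samples.
    frame = [math.sin(2 * math.pi * i / 30) for i in range(256)]
    print(pitch_period_by_amdf(frame, 10, 50))
```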

-13- UPDUDP (1/4)
- UPDUDP: Unbroken Pitch Determination Using DP
  - Goal: to take pitch smoothness into consideration
- Quantities defined on the slide (symbols not preserved):
  - A given path in the AMDF matrix
  - Number of frames
  - Transition penalty
  - Exponent of the transition difference
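
The symbols and the equation on this slide did not survive extraction. One plausible form of the cost function, consistent with the four quantities listed, is sketched here. The names $\mathbf{p}$, $n$, $\theta$, and $m$ are assumptions chosen for illustration, not symbols confirmed by the source.

```latex
\mathrm{cost}(\mathbf{p}, \theta, m)
  = \sum_{i=1}^{n} \mathrm{amdf}_i(p_i)
  + \theta \sum_{i=1}^{n-1} \lvert p_i - p_{i+1} \rvert^{m}
```

Under this reading, $\mathbf{p} = [p_1, \dots, p_n]$ is a path through the AMDF matrix (one lag per frame), $n$ is the number of frames, $\theta$ is the transition penalty, and $m$ is the exponent of the transition difference. The first sum favors lags with low AMDF values, and the second sum penalizes jumps between consecutive frames, which is what enforces smoothness.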

-14- UPDUDP (2/4)
- Optimum-value function D(i, j): the minimum cost starting from frame 1 to position (i, j)
- Recurrence formula: [equation not preserved]
- Initial conditions: [equation not preserved]
- Optimum cost: [equation not preserved]
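
A sketch of the dynamic programming described on this slide follows. It assumes that the cost of a path is the sum of its AMDF values plus `theta` times the sum of `|p_i - p_(i+1)| ** m` over consecutive frames. That cost form and the names `theta`, `m`, and `updudp` are assumptions, since the slide equations are not preserved. Under that assumption the table is filled with D(i, j) = amdf_i(j) + min over k of [D(i-1, k) + theta * |k - j|^m], starting from D(1, j) = amdf_1(j), and the optimum cost is the minimum of the last row.

```python
def updudp(amdf_matrix, theta, m=2):
    """Find the smoothest low-cost path through an AMDF matrix.

    amdf_matrix[i][j] is the AMDF value of frame i at lag index j.
    Returns (path, cost), where path[i] is the chosen lag index for frame i.
    """
    n = len(amdf_matrix)
    k = len(amdf_matrix[0])
    D = [list(amdf_matrix[0])]      # initial condition: first frame costs
    back = [[0] * k]                # back pointers for path recovery
    for i in range(1, n):
        row, ptr = [], []
        for j in range(k):
            # Best predecessor lag, including the transition penalty.
            best = min(range(k),
                       key=lambda q: D[i - 1][q] + theta * abs(q - j) ** m)
            row.append(amdf_matrix[i][j] + D[i - 1][best]
                       + theta * abs(best - j) ** m)
            ptr.append(best)
        D.append(row)
        back.append(ptr)
    # Optimum cost is the smallest entry in the last row; backtrack from it.
    j = min(range(k), key=lambda q: D[n - 1][q])
    cost = D[n - 1][j]
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, cost
```

With `theta = 0` the result is the frame-by-frame minimum of the AMDF, so an isolated outlier frame produces a jump in the path. A positive `theta` trades a slightly higher AMDF value for an unbroken pitch contour.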

-15- UPDUDP (3/4)
- A typical example of UPDUDP using AMDF

-16- UPDUDP (4/4)
- Insensitivity in [symbols not preserved]

-17- Frequency to Semitone Conversion
- Semitone: a musical scale based on A440
- Reasonable pitch range:
  - E2 - C6
  - 82 Hz - 1047 Hz
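
The conversion formula itself is not preserved on this slide. A common convention, assumed here, is the MIDI note numbering in which A440 maps to semitone 69 and each octave spans 12 semitones. The offset 69 and the function names are assumptions, not something the slide confirms.

```python
import math


def freq_to_semitone(freq_hz):
    """Convert a frequency in Hz to a semitone value (MIDI convention: A440 = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def semitone_to_freq(semitone):
    """Inverse conversion: semitone value back to frequency in Hz."""
    return 440.0 * 2 ** ((semitone - 69) / 12)


if __name__ == "__main__":
    for name, freq in (("E2", 82.41), ("A4", 440.0), ("C6", 1046.5)):
        print(name, round(freq_to_semitone(freq), 2))
```

Working in semitones makes a key transposition a constant offset, which is why the comparison methods later in the deck can handle it with a simple shift.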

-18- Vectors after Pitch Tracking
[Figure: two panels, "With rests" and "Without rests"]

-19- Typical Result of Pitch Tracking
[Figure: pitch tracking via autocorrelation for "茉莉花" (Jasmine Flower)]

-20- Comparison of Pitch Vectors
[Figure: the yellow line is the target pitch vector]

-21- Demo of Pitch Tracking
- Real-time display of ACF for pitch tracking
  - toolbox/sap/goPtByAcf.mdl
- Real-time pitch tracking for real-time mic input
  - toolbox/sap/goPtByAcf2.mdl
- Pitch scaling
  - pitchShiftDemo/project1.exe
  - pitchShift-multirate/multirate.m

-22- Comparison Methods of QBSH
- Categories of approaches to QBSH
  - Histogram/statistics-based
  - Note vs. note
    - Edit distance
  - Frame vs. note
    - HMM
  - Frame vs. frame
    - Linear scaling, DTW, recursive alignment

-23- Range Comparison
- Concept
  - Reject a song if the range does not match: [condition not preserved]
- Characteristics
  - Extremely fast
  - Not effective
  - Good for initial filtering
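
The exact rejection condition is not preserved on this slide. As one possible reading, the sketch below compares the pitch span (highest minus lowest voiced pitch, in semitones) of the query and of a candidate, and rejects the candidate when the two spans differ by more than a tolerance. The `tolerance` parameter, the function names, and the convention that a rest is stored as 0 are all assumptions.

```python
def pitch_range(pitch_vector):
    """Span between the highest and lowest voiced pitch; rests (0) are ignored."""
    voiced = [p for p in pitch_vector if p > 0]
    return max(voiced) - min(voiced) if voiced else 0.0


def passes_range_filter(query, candidate, tolerance=3.0):
    """Keep a candidate only if its pitch span is close to the query's span."""
    return abs(pitch_range(query) - pitch_range(candidate)) <= tolerance
```

A check like this costs one pass over each vector, so it fits the slide's description of a very fast but weak test that is mainly useful as a first filtering stage.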

-24- Linear Scaling (LS)
- Concept
  - Scale the query linearly to match the candidates
- Example: [figure not preserved]

-25- Linear Scaling (II)
- Strengths
  - Handles key transposition in one shot
  - Efficient and effective
  - Indexing methods available
- Weakness
  - Cannot deal with non-uniform tempo variations
- Typical mapping path: [figure not preserved]

-26- Linear Scaling (III)
- Distance function for LS
  - Normalized L1-norm
  - Normalized L2-norm
- Rest handling
  - Extend the previous non-zero note
- Alignment example: [figure not preserved]
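
The three linear scaling slides can be summarized in a short sketch: fill rests with the previous non-zero pitch, stretch or compress the query by several factors, remove the key difference with a mean shift, and keep the smallest normalized L1 distance. The list of scaling factors, the linear-interpolation resampler, and all function names are assumptions made for illustration.

```python
def fill_rests(vec):
    """Replace each rest (0) with the previous non-zero pitch.
    Leading rests stay at 0 because there is nothing to extend."""
    out, last = [], 0.0
    for p in vec:
        if p > 0:
            last = p
        out.append(last)
    return out


def resample(vec, new_len):
    """Stretch or compress vec to new_len points by linear interpolation."""
    if new_len == 1:
        return [vec[0]]
    out = []
    for k in range(new_len):
        pos = k * (len(vec) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def linear_scaling_distance(query, reference,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Return (best distance, best factor) over the given scaling factors."""
    query = fill_rests(query)
    best = (float("inf"), None)
    for f in factors:
        length = int(round(len(query) * f))
        if length < 2 or length > len(reference):
            continue
        scaled = resample(query, length)
        target = reference[:length]          # compare from the beginning
        # Mean shift: one-shot handling of key transposition.
        shift = sum(target) / length - sum(scaled) / length
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, target)) / length
        if dist < best[0]:
            best = (dist, f)
    return best
```

For example, a query sung twice as fast and five semitones higher than the reference should be matched best at a factor of 2.0, with a small residual distance caused only by interpolation at the note boundaries.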

-27- Dynamic Time Warping (DTW)
- Goal:
  - Allow comparison with high tolerance to tempo variation
- Characteristics:
  - Robust to irregular tempo variations
  - Trial and error for dealing with key transposition
  - Computationally expensive
  - Does not satisfy the triangle inequality
  - Some indexing algorithms do exist
- #1 method for task 2 in QBSH/MIREX 2006

-28- Dynamic Time Warping: Type 1
- t: input pitch vector (8 sec, 128 points)
- r: reference pitch vector
- Local paths: 27-45-63 degrees
- DTW recurrence: [equation not preserved]
[Figure: DTW table with axes i and j; cells labeled t(i-1), t(i), r(j-1), r(j)]

-29- Dynamic Time Warping: Type 2
- t: input pitch vector (8 sec, 128 points)
- r: reference pitch vector
- Local paths: 0-45-90 degrees
- DTW recurrence: [equation not preserved]
[Figure: DTW table with axes i and j; cells labeled t(i-1), t(i), r(j-1), r(j)]

-30- Local Path Constraints
- Type 1:
  - 27-45-63 local paths
- Type 2:
  - 0-45-90 local paths
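
The recurrences for the two DTW types are not preserved in this transcript. The sketch below uses the forms commonly associated with these local paths, which is an assumption: for type 1, D(i, j) = |t(i) - r(j)| + min{D(i-1, j-2), D(i-1, j-1), D(i-2, j-1)}, and for type 2, D(i, j) = |t(i) - r(j)| + min{D(i-1, j), D(i-1, j-1), D(i, j-1)}. The function name `dtw`, the `local_path` argument, and the anchored-start, free-end boundary conditions are also illustrative choices.

```python
INF = float("inf")


def dtw(t, r, local_path="type1"):
    """DTW distance between query t and reference r.

    The path starts at (t[0], r[0]) and may end on any point of r,
    so the query only has to match the beginning of the reference.
    """
    if local_path == "type1":
        steps = [(1, 1), (1, 2), (2, 1)]   # slopes 1, 2 and 1/2 (27-45-63 degrees)
    else:
        steps = [(1, 1), (1, 0), (0, 1)]   # diagonal plus axis moves (0-45-90 degrees)
    m, n = len(t), len(r)
    D = [[INF] * n for _ in range(m)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            # Cheapest predecessor among the allowed local paths.
            best = INF
            for di, dj in steps:
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0 and D[pi][pj] < best:
                    best = D[pi][pj]
            if best < INF:
                D[i][j] = abs(t[i] - r[j]) + best
    return min(D[m - 1])   # free end: best cell in the last query row
```

Type 1 never lets one frame map to more than two frames of the other sequence, which keeps the tempo ratio between 1/2 and 2. Type 2 allows arbitrary stretching, so it is more flexible but also more permissive of degenerate alignments.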

-31- DTW Paths of "Match Beginning"
- We assume the speed of a user's acoustic input is between 1/2 and 2 times that of the intended song.
- The right end is free to move.
- Typical DTW table size = 128 x 180
[Figure: DTW table with axes i and j]

-32- DTW Paths of "Match Anywhere"
- Both ends are free to move.
- Typical DTW table size = 128 x 2880
[Figure: DTW table with axes i and j]
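
"Match anywhere" can be sketched as a DTW in which the first query frame may align with any reference frame (free start) and the last query frame may also end anywhere (free end). The sketch uses 0-45-90 local paths for brevity. The function name and the choice of local paths are assumptions, not the configuration confirmed by the slides.

```python
def dtw_match_anywhere(t, r):
    """Return (distance, end_index) of the best match of query t inside reference r."""
    n = len(r)
    # Free start: the path may begin at any reference frame at no extra cost.
    prev = [abs(t[0] - rj) for rj in r]
    for i in range(1, len(t)):
        cur = [0.0] * n
        for j in range(n):
            best = prev[j]                                  # 0-degree step
            if j > 0:
                best = min(best, prev[j - 1], cur[j - 1])   # 45- and 90-degree steps
            cur[j] = abs(t[i] - r[j]) + best
        prev = cur
    # Free end: pick the cheapest cell in the last query row.
    end = min(range(n), key=lambda j: prev[j])
    return prev[end], end
```

Only two rows of the table are kept in memory, but every cell is still visited, so the running time grows with the full table size. That is consistent with the much larger table quoted for "match anywhere" than for "match beginning".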

-33- DTW Path of “Match Beginning”

-34- DTW Path of “Match Anywhere”

-35- DTW Path of “Match Anywhere”

-36- Demos of DTW
- Match beginning
  - toolbox/dcpr/dtw/demoMelodyPath02.m
- Match anywhere
  - toolbox/dcpr/dtw/demoMelodyPath02.m
- Alignment and note segmentation
  - Toolbox/dcpr/dtw/demoNoteCut.m

-37- Key Transposition
- Goal:
  - Allow users' input in different keys
- Method 1:
  - Mean shift and heuristic modification
  - 5 DTW computations when compared to each song
[Figure: mean-shifted query t, trial shifts t-2 and t+2 (best denoted t'), then t'-1 and t'+1]
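
One way to read "mean shift and heuristic modification" with five distance computations is sketched below: align the means, evaluate offsets of 0, -2, and +2 semitones, then evaluate the two neighbors at distance 1 from the best of those three. This reading of the garbled figure is an assumption, as are the function names. `mean_abs_distance` is only a cheap stand-in so the example runs on its own; in the deck's setting the `distance` argument would be a DTW.

```python
def mean_abs_distance(a, b):
    """Stand-in distance (mean absolute difference over the common prefix)."""
    n = min(len(a), len(b))
    return sum(abs(a[i] - b[i]) for i in range(n)) / n


def transpose_search(query, reference, distance=mean_abs_distance):
    """Return (total shift, best distance, number of distance evaluations)."""
    n = min(len(query), len(reference))
    # Mean shift: bring the query to the same average pitch as the reference.
    base = sum(reference[:n]) / n - sum(query[:n]) / n

    def cost(offset):
        return distance([p + base + offset for p in query], reference)

    scores = {o: cost(o) for o in (0, -2, 2)}       # three coarse trials
    center = min(scores, key=scores.get)
    for o in (center - 1, center + 1):              # two fine trials
        scores[o] = cost(o)
    best = min(scores, key=scores.get)
    return base + best, scores[best], len(scores)
```

The mean alone can be thrown off by a few outlying notes, which is why a small search around it is worthwhile. Limiting that search to five evaluations keeps the per-song cost bounded.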

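-38- Type-3 DTW: Frame to Note Alignment
- DP-based method for filling the table
- Recurrence formula: [equation not preserved]
- Local constraint: [equation not preserved]
[Figure: frame-level pitch vector aligned against notes; note values shown include 67, 64, 65 and 62, 65]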

-39- Type-3 DTW
- Characteristics
  - Frame-based query input vs. note-based music database
  - Note duration unused
  - More efficient, less effective
  - Heuristics for key transposition
- Mapping path: [figure not preserved]
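
A frame-to-note alignment can be sketched with a recurrence in which each query frame either stays on the current note or advances to the next one: D(i, j) = |t(i) - r(j)| + min{D(i-1, j), D(i-1, j-1)}. This recurrence is an assumption, since the slide's formula and local constraint are not preserved. The function name and the free-end boundary condition are likewise illustrative.

```python
def frame_to_note_dtw(frames, notes):
    """Align a frame-level pitch vector with a sequence of note pitches.

    Note durations are not used: any number of consecutive frames may map
    to one note. The query may end on any note of the reference.
    """
    inf = float("inf")
    n = len(notes)
    prev = [inf] * n
    prev[0] = abs(frames[0] - notes[0])        # first frame starts on first note
    for i in range(1, len(frames)):
        cur = [inf] * n
        for j in range(n):
            best = prev[j]                      # stay on the same note
            if j > 0 and prev[j - 1] < best:    # or advance to the next note
                best = prev[j - 1]
            if best < inf:
                cur[j] = abs(frames[i] - notes[j]) + best
        prev = cur
    return min(prev)
```

The table has one column per note, not one per reference frame, which is where the efficiency gain comes from. Ignoring note durations is also why the method is less discriminative.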

-40- RA (Recursive Alignment)
- Characteristics
  - Combines characteristics of LS and DTW
  - #1 method for task 1 in QBSH/MIREX 2006
- A typical mapping path: [figure not preserved]

-41- Modified Edit Distance
- Note segmentation
- Modified edit distance
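
The slide does not say how the edit distance is modified. For orientation, here is the plain textbook edit distance applied to pitch-interval sequences, which makes the comparison independent of key. This is the unmodified baseline, not the deck's method, and the function names are illustrative.

```python
def intervals(notes):
    """Pitch differences between consecutive notes (key-invariant representation)."""
    return [b - a for a, b in zip(notes, notes[1:])]


def edit_distance(a, b):
    """Classic edit distance with unit costs for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]
```

A note-level comparison like this depends on the note segmentation step listed on the slide, because errors in splitting the pitch vector into notes show up directly as insertions and deletions.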

-42- Challenges in QBSH Systems
- Song database preparation
  - MIDIs, singing clips, or audio music
- Reliable pitch tracking for acoustic input
  - Input from mobile devices or noisy karaoke bars
- Efficient/effective retrieval
  - Karaoke machine: ~10,000 songs
  - Internet music search engine: ~500,000,000 songs

-44- Goal and Approach
- Goal: to retrieve songs effectively within a given response time, say 5 seconds or so
- Our strategy
  - Multi-stage progressive filtering
  - Indexing for different comparison methods
  - Repeating pattern identification
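
Multi-stage progressive filtering can be sketched as a chain of distance functions ordered from cheap to expensive, where each stage keeps only its best-scoring candidates for the next one. The stage structure, the `keep` counts, and the two toy distance functions are assumptions made for illustration; the slides do not specify how the stages are configured.

```python
def mean_gap(query, song):
    """Very cheap toy distance: difference between average pitches."""
    return abs(sum(query) / len(query) - sum(song) / len(song))


def l1_gap(query, song):
    """More expensive toy distance: frame-by-frame absolute difference."""
    return sum(abs(a - b) for a, b in zip(query, song))


def progressive_filter(query, songs, stages):
    """Rank songs with a cascade of (distance_fn, keep_count) stages.

    songs maps a song name to its pitch vector. Each stage sorts the
    surviving candidates by its distance and passes on the best keep_count.
    """
    candidates = list(songs.items())
    for distance_fn, keep in stages:
        candidates = sorted(candidates,
                            key=lambda item: distance_fn(query, item[1]))[:keep]
    return [name for name, _ in candidates]
```

In a real system the stages would be methods from this deck, such as range comparison, then linear scaling, then DTW, so that the expensive comparison only runs on a small shortlist.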

-45- Demo: MIRACLE
- MIRACLE: Music Information Retrieval Acoustically via CLuster Engines
- Demo page of MIR Lab:
  - http://mirlab.org/new/mir_products.asp
- MIRACLE demo:
  - http://cuda.mirlab.org

-46- Internet Music Search Engine
- Client-server distributed computing
- Cloud computing via clustered PCs and GPUs
[Figure: clients (PC, PDA, cellular phone) send a request (pitch vector) to the master server, which works with clustered slave servers and returns a response (search result)]

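-47- Challenge 1: Collecting the Music Database
- Music files collected from the Internet:
  - MIDI files
    - For accuracy, the track holding the main melody has to be identified manually. Automatic identification reaches a recognition rate of about 85%.
    - MIDI file formats are complex and inconsistent
    - MIDI main melodies are not clean (intros, overlapping notes, variations, etc.)
  - MP3 files
    - Pop music: the pitch of the vocal is very hard to extract. According to the ISMIR 2011 contest results, the best pitch recognition rate is 84%.
    - Symphonic music: there may be no main melody at all
- Manual labeling:
  - To support text search, information such as singer, lyrics, and genre has to be added.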

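-48- Challenge 2: Speeding Up the Comparison
- Factors affecting comparison speed (and their typical values)
  - Length of the sung/hummed input: 8 sec (128 pitch points)
  - Database size: about 13,000 songs
  - Comparison method: LS+DTW
  - CPU: Pentium 2G (comparatively unaffected by memory size)
- Comparison position
  - From the beginning: about 2 sec
  - From the middle:
    - At the start of the refrain
    - At the start of every note: about 45 sec
    - Anywhere: about 60 sec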

-49- Response Time of MIRACLE
- 8 sec recording of "小毛驢", comparison from beginning:
  - LS: 0.4 sec
  - DTW: 3.5 sec
  - LS+DTW: 0.6 sec
- 8 sec recording of the refrain of "夢醒時分", comparison from anywhere:
  - LS: 40 sec
  - DTW: IIS time out
  - LS+DTW: 45 sec
  - NBDTW: IIS time out

-50- Could It Be More Efficient?
- Algorithms
  - Indexing of LS/DTW
  - Progressive filtering
- New platforms
  - GPU (66 times faster for QBSH!)
  - Grid/clustered computing
  - Multi-core platforms

-51- Commercial Applications
- www.midomo.com
- www.soundhound.com
- www.shazam.com

-52- Conclusions
- QBSH
  - A fun and interesting way to retrieve music
  - Can be extended to singing scoring
  - Commercial applications are maturing
- Challenges
  - How to deal with massive music databases?
  - How to extract melody from audio music?
