GCT731 Fall 2014 Topics in Music Technology - Music Information Retrieval
Pitch Detection and Tracking
Juhan Nam
Introduction
What is music described with?
– The majority of musical symbols are notes, which mainly contain pitch information
– We (our brains) usually memorize music as a melody, that is, a sequence of pitches
Outline
Introduction
– Definition of Pitch
– Information in Pitch
– Pitch and Harmonicity
Pitch Detection Algorithms
– Time-Domain Approaches
– Frequency-Domain Approaches
– Psychoacoustic Model Approaches
– Learning-based Approaches
Pitch Tracking
Applications
Definition of Pitch
Pitch
– Defined as the auditory attribute of sound according to which sounds can be ordered on a scale from low to high (ANSI, 1994)
– One way of measuring pitch is to find the frequency of a sine wave that listeners match to the target sound in a psychophysical experiment
– Thus it is subjective and varies between individuals (e.g. tone-deaf listeners)
Fundamental Frequency
– Physical attribute of sound, measured from periodicity
– Often called F0
Pitch should therefore be distinguished from F0
– However, the two are very close for the sounds of interest here (i.e. musical sounds), so "pitch" is often used interchangeably with "F0"
Information in Pitch
Music
– Melody or notes
– Harmony (when there are multiple notes with different pitches)
– Size (or register) of musical instruments: bass, cello, violin
Speech
– Person: gender, age, identity
– Context: question, mood, attitude
– Meaning: tonal languages such as Mandarin Chinese
Others
– Vocalization of animals (e.g. bird chirps, whale calls): size and type, communication
Pitch and Harmonicity
Not all sounds have pitch
Harmonic sounds
– Regularly spaced harmonic partials
– Speech or singing voice: vowels
– Musical instruments: piano*, guitar, strings, woodwind, brass, organ
Non-harmonic sounds
– No harmonic pattern, or irregularly spaced partials
– Speech or singing voice: consonants
– Musical instruments: drums, mallet instruments (these have pitch but are not harmonic)
*Piano partials exhibit inharmonicity
[Figure: vibraphone example, from Klapuri’s slides]
Pitch Detection Algorithms
Taxonomy of Algorithms
– Time-Domain Approaches
– Frequency-Domain Approaches
– Psychoacoustic Model Approaches
– Learning-based Approaches
Time-Domain Approach
Basic Ideas
– Periodicity: x(t) = x(t+T)
– Measure the similarity (or distance) between two segments of the signal
– Find the period T that gives the greatest similarity (smallest distance)
Two main approaches
– Auto-correlation function (ACF): similarity measured by inner product
– Average magnitude difference function (AMDF): distance measured by difference (e.g., L1 or L2 norm)
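Auto-Correlation Function (ACF)
Measuring self-similarity by
$$r(\tau) = \sum_{n=0}^{N-1-\tau} x(n)\, x(n+\tau)$$
where N is the frame length and τ is the lag
[Figure: singing voice example]
(Sondhi 1967)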
Auto-Correlation Function (ACF)
– Biased auto-correlation
– Unbiased auto-correlation
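The biased and unbiased variants can be sketched in a few lines of NumPy. This is a minimal illustration and not the slides' own implementation: the normalizations 1/N (biased) and 1/(N - τ) (unbiased) are the standard textbook definitions, and the function names, the search range `fmin`/`fmax`, and the synthetic test tone are assumptions made for the example.

```python
import numpy as np

def acf(x, biased=True):
    """Auto-correlation for lags 0..N-1 (standard definitions, assumed here)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # 'full' mode returns lags -(N-1)..(N-1); keep the non-negative lags.
    r = np.correlate(x, x, mode="full")[n - 1:]
    if biased:
        return r / n                   # same 1/N factor for every lag
    return r / (n - np.arange(n))      # 1/(N - tau): compensates for fewer terms

def acf_pitch(x, sr, fmin=80.0, fmax=800.0):
    """Pick the lag with the largest biased ACF inside [sr/fmax, sr/fmin]."""
    r = acf(x, biased=True)
    min_lag = int(sr / fmax)
    max_lag = int(sr / fmin)
    tau = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
    return sr / tau

# Hypothetical test tone: 200 Hz with three harmonic partials.
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0 = acf_pitch(x, sr)
```

The biased version tapers toward zero at long lags, which here helps the peak at one period win over the peak at two periods; the unbiased version removes that taper at the cost of noisier values at large lags.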
Comparison of Spectrogram and ACF
[Figure: spectrogram (tracking max values) vs. ACF (tracking max values)]
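Interpretation of ACF in Frequency Domain
By the convolution theorem, the auto-correlation can also be computed in the frequency domain, and efficiently so using the FFT
Thus, the ACF can be computed as
$$r(\tau) = \mathrm{IDFT}\{\, |X(k)|^2 \,\}, \quad X(k) = \mathrm{DFT}\{x(n)\}$$
(with x(n) zero-padded to at least twice its length so that circular wrap-around does not occur)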
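Interpretation of ACF in Frequency Domain
This is equivalent to
$$r(\tau) = \frac{1}{K} \sum_{k=0}^{K-1} |X(k)|^2 \cos\!\left(\frac{2\pi k \tau}{K}\right)$$
where X(k) is the K-point DFT of x(n)
ACF is a simple template-based approach in the frequency domain
– The cosine gives positive weights to the (harmonic) peaks and negative weights to the valleys between them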
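The frequency-domain view can be checked numerically. The sketch below assumes zero-padding to twice the frame length (so the circular ACF equals the linear one) and compares three routes to the same value: direct correlation, the inverse FFT of the power spectrum, and the cosine-weighted sum over the power spectrum. Function names and the random test frame are illustrative assumptions.

```python
import numpy as np

def acf_fft(x):
    """Linear ACF via the power spectrum (zero-padded to 2N)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    nfft = 2 * n
    power = np.abs(np.fft.rfft(x, nfft)) ** 2
    return np.fft.irfft(power, nfft)[:n]

def acf_cosine_template(x, tau):
    """ACF at one lag as a cosine-weighted sum of the power spectrum."""
    x = np.asarray(x, dtype=float)
    nfft = 2 * len(x)
    power = np.abs(np.fft.fft(x, nfft)) ** 2
    k = np.arange(nfft)
    # Weights are +1 at multiples of nfft/tau and -1 halfway in between.
    return np.sum(power * np.cos(2 * np.pi * k * tau / nfft)) / nfft

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
r_fft = acf_fft(frame)
r_direct = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
r_template = acf_cosine_template(frame, 17)
```

The cosine in `acf_cosine_template` is the "template": for lag τ it peaks at every multiple of the candidate fundamental and is most negative midway between them.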
Problems in ACF
Bias toward the large peak around zero lag
Prone to octave errors, particularly toward lower octaves
– ACF is sensitive to amplitude changes
Equal weights for all harmonic partials
– In general, low-numbered harmonic partials are more important in determining pitch
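Average Magnitude Difference Function (AMDF)
Measuring self-similarity by
$$d(\tau) = \sum_{n=0}^{N-1-\tau} |x(n) - x(n+\tau)|^p$$
In YIN, p is set to 2, which expands to
$$d(\tau) = \sum_{n=0}^{N-1-\tau} x(n)^2 + \sum_{n=0}^{N-1-\tau} x(n+\tau)^2 - 2 \sum_{n=0}^{N-1-\tau} x(n)\, x(n+\tau)$$
– Minimizing d(τ) therefore minimizes the negative ACF plus a lag-dependent term
And the AMDF is normalized as
$$d'(\tau) = \begin{cases} 1, & \tau = 0 \\ d(\tau) \Big/ \left[ \dfrac{1}{\tau} \sum_{j=1}^{\tau} d(j) \right], & \text{otherwise} \end{cases}$$
(de Cheveigné & Kawahara, 2002)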
Average Magnitude Difference Function (AMDF)
[Figure: AMDF and normalized AMDF]
Why YIN (AMDF) Works Better
Robust to changes in amplitude
– The difference takes care of amplitude changes
– This reduces octave errors
Zero-lag bias is avoided by the normalized AMDF
The normalized AMDF allows using a fixed threshold
– Can choose multiple candidates and refine them
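The points above can be illustrated with a simplified YIN-style estimator. This is a sketch under stated assumptions, not the reference implementation: it uses the squared difference (p = 2) summed over the overlapping part of one frame, the cumulative-mean normalization from de Cheveigné & Kawahara (2002), and a fixed threshold of 0.1 chosen for illustration; the parabolic interpolation and best-local-estimate steps of the paper are omitted. All function and parameter names are hypothetical.

```python
import numpy as np

def difference_function(x, max_lag):
    """Squared-difference function d(tau) for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff = x[:n - tau] - x[tau:]
        d[tau] = np.sum(diff ** 2)
    return d

def normalized_difference(d):
    """Cumulative-mean normalization: d'(0) = 1, d'(tau) = d(tau) / mean(d[1..tau])."""
    out = np.ones_like(d)
    taus = np.arange(1, len(d))
    running_sum = np.cumsum(d[1:])
    out[1:] = d[1:] * taus / np.maximum(running_sum, 1e-12)
    return out

def yin_pitch(x, sr, fmin=80.0, fmax=800.0, threshold=0.1):
    max_lag = int(sr / fmin)
    min_lag = int(sr / fmax)
    dn = normalized_difference(difference_function(x, max_lag))
    below = np.where(dn[min_lag:] < threshold)[0]
    if len(below) > 0:
        # First lag under the fixed threshold, then walk down to the local dip.
        tau = min_lag + int(below[0])
        while tau + 1 <= max_lag and dn[tau + 1] < dn[tau]:
            tau += 1
    else:
        # No candidate under the threshold: fall back to the global minimum.
        tau = min_lag + int(np.argmin(dn[min_lag:]))
    return sr / tau

# Hypothetical test tone: 200 Hz with three harmonic partials.
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0 = yin_pitch(x, sr)
```

Because the normalized function starts at 1 and only drops well below it near true periods, the same threshold can be reused across frames of different loudness, and taking the first dip under the threshold favors the shortest period over its multiples.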
Example of AMDF (YIN)
Frequency-Domain Approach
Basic Ideas
– Periodic in the time domain ↔ harmonic in the frequency domain
– Measure how harmonic the spectrum is
– Find the F0 that best explains the harmonic pattern (harmonic partials)
Template matching
– Harmonic Sieve or Spectral Template
Cepstrum
Harmonic Product Spectrum (HPS)
Harmonic Sieves (or Comb-filtering)
Use sharp harmonic sieves to pick up the peak regions only
– ACF is similar to this, but its template is not sharp enough
sigmund~ (Pd) and fiddle~ (Max/MSP) are based on weighted harmonic sieves (Puckette et al. 1998)
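A weighted harmonic sieve can be sketched as a sum of spectral magnitudes sampled at the expected harmonic positions of each candidate F0. The sketch below is only an illustration of the idea and is not how fiddle~ or sigmund~ are implemented: the 1/h weights, the five-harmonic limit, the 1 Hz candidate grid, and the nearest-bin lookup are all assumptions.

```python
import numpy as np

def harmonic_sieve_pitch(x, sr, f0_grid, n_harmonics=5, nfft=4096):
    """Return the candidate F0 with the largest weighted harmonic sum."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    salience = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, n_harmonics + 1):
            k = int(round(h * f0 * nfft / sr))   # nearest bin to the h-th harmonic
            if k >= len(mag):
                break
            salience[i] += mag[k] / h            # assumed 1/h weighting
    return f0_grid[int(np.argmax(salience))], salience

# Hypothetical test tone: 200 Hz with three harmonic partials.
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
f0_grid = np.arange(80.0, 500.0, 1.0)
f0, salience = harmonic_sieve_pitch(x, sr, f0_grid)
```

Unlike the cosine template implied by the ACF, the sieve only looks at narrow regions around each harmonic, so energy between partials neither helps nor hurts a candidate.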
Spectral Template
Cross-correlation with an ideal template on a log-scale spectrogram
[From Ellis’ e4896 course slides]
Cepstrum
The real cepstrum is defined as
$$c(n) = \mathrm{IDFT}\{ \log |X(k)| \}, \quad X(k) = \mathrm{DFT}\{x(n)\}$$
Basic ideas
– Harmonic partials are periodic in the frequency domain
– The (inverse) FFT finds this periodicity
[Figure: liftering]
(Noll, 1967)
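A cepstral pitch estimate can be sketched by taking the inverse FFT of the log-magnitude spectrum and picking the largest value within a plausible range of quefrencies (lags in samples). The Hann window, the small constant added before the logarithm, the search range, and the synthetic tone with many partials are assumptions for this example; the cepstrum needs a reasonably rich harmonic series to show a clear peak.

```python
import numpy as np

def real_cepstrum(x, nfft=4096):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    return np.fft.irfft(np.log(mag + 1e-10), nfft)

def cepstrum_pitch(x, sr, fmin=80.0, fmax=500.0, nfft=4096):
    c = real_cepstrum(x, nfft)
    min_q = int(sr / fmax)          # shortest period considered, in samples
    max_q = int(sr / fmin)          # longest period considered, in samples
    q = min_q + int(np.argmax(c[min_q:max_q + 1]))
    return sr / q

# Hypothetical test tone: 200 Hz with 19 harmonic partials (up to 3.8 kHz).
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 20))
f0 = cepstrum_pitch(x, sr)
```

Restricting the search to quefrencies above `sr / fmax` acts as a simple lifter: it discards the low-quefrency region that mostly encodes the spectral envelope.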
Harmonic Product Spectrum (HPS)
The Harmonic Product Spectrum (HPS) is obtained by multiplying the original magnitude spectrum by copies of itself decimated by integer factors
$$\mathrm{HPS}(k) = \prod_{m=1}^{M} |X(mk)|$$
(Noll, 1969)
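The decimate-and-multiply step can be sketched as follows. The number of decimated copies (four), the FFT size, the Hann window, and the lower search limit `fmin` are assumptions for this example, and the test tone is synthetic.

```python
import numpy as np

def hps_pitch(x, sr, n_harmonics=4, nfft=8192, fmin=80.0):
    """Multiply the magnitude spectrum by its integer-decimated copies."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    n_bins = len(mag) // n_harmonics
    hps = mag[:n_bins].copy()
    for m in range(2, n_harmonics + 1):
        # mag[::m][k] == mag[m * k]: the m-th harmonic of bin k.
        hps *= mag[::m][:n_bins]
    k_min = int(fmin * nfft / sr)
    k = k_min + int(np.argmax(hps[k_min:]))
    return k * sr / nfft

# Hypothetical test tone: 200 Hz with 19 harmonic partials.
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 20))
f0 = hps_pitch(x, sr)
```

Because the terms are multiplied, a single missing or very weak harmonic can suppress the correct candidate, which is one practical difference from the summed sieve.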
Auditory Model
[Diagram labels: input, HC, ACF, correlogram, summary ACF; "Stabilize & Combine"]
Correlogram is formed by concatenating the ACFs of the individual hair-cell (HC) outputs
Summary ACF is computed by summing the ACF across all channels
– The peaks in the ACF represent periodicity features
– This is known to be robust to band-limited noise
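The correlogram and summary ACF can be sketched with a heavily simplified front end. Everything about the front end here is an assumption made to keep the example short: real auditory models typically use a cochlear (e.g. gammatone) filterbank and a more elaborate hair-cell model, whereas this sketch uses ideal FFT-domain band-pass filters with hypothetical band edges and plain half-wave rectification.

```python
import numpy as np

def bandpass_fft(x, sr, lo, hi):
    """Crude band-pass filter: zero all FFT bins outside [lo, hi)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / sr)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, len(x))

def correlogram(x, sr, edges, max_lag):
    """One ACF row per channel: filter, half-wave rectify, auto-correlate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hc = np.maximum(bandpass_fft(x, sr, lo, hi), 0.0)   # stand-in for the HC stage
        power = np.abs(np.fft.rfft(hc, 2 * n)) ** 2
        rows.append(np.fft.irfft(power, 2 * n)[:max_lag + 1])
    return np.array(rows)

def summary_acf_pitch(x, sr, edges, fmin=80.0, fmax=500.0):
    max_lag = int(sr / fmin)
    min_lag = int(sr / fmax)
    cg = correlogram(x, sr, edges, max_lag)
    sacf = cg.sum(axis=0)                                   # sum across channels
    tau = min_lag + int(np.argmax(sacf[min_lag:]))
    return sr / tau, cg, sacf

# Hypothetical test tone and band edges (Hz).
sr = 8000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 20))
edges = [100, 300, 600, 1200, 2400, 4000]
f0, cg, sacf = summary_acf_pitch(x, sr, edges)
```

Channels that contain a single partial peak at that partial's own period, while channels containing several partials peak at the common period; only the common period lines up in every row, so it dominates the sum.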
Example of Auditory Model
[Figure: summary ACF]
Pitch Tracking
Pitch is usually continuous over time
– Once a pitch with strong harmonicity is detected in a frame, the following frames form a smooth pitch contour
Pitch tracking methods
– Post-processing: first detect pitch frame by frame, then find a continuous path by smoothing
  Median filtering
  Dynamic programming (Talkin, 1995)
– Probabilistic approach: detect multiple pitch candidates in every frame and find the best path
  Viterbi decoding: probabilistic YIN (Mauch, 2014)
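Both families of tracking methods can be sketched briefly. The first function is a plain median filter over a frame-wise F0 sequence; the second is a generic dynamic-programming (Viterbi-style) search over per-frame candidates. Neither is RAPT or pYIN: the transition cost (absolute pitch jump in octaves times a hypothetical `jump_weight`) and the toy candidate costs are assumptions chosen only to show the mechanism.

```python
import numpy as np

def median_smooth(f0, width=5):
    """Median-filter a frame-wise F0 track (edges padded by repetition)."""
    f0 = np.asarray(f0, dtype=float)
    half = width // 2
    padded = np.pad(f0, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(f0))])

def dp_track(candidates, costs, jump_weight=1.0):
    """Best path through (T, C) candidates; lower local cost is better."""
    cand = np.asarray(candidates, dtype=float)
    cost = np.asarray(costs, dtype=float)
    n_frames, n_cand = cand.shape
    logf = np.log2(cand)
    total = cost[0].copy()
    back = np.zeros((n_frames, n_cand), dtype=int)
    for t in range(1, n_frames):
        # trans[i, j]: cost of moving from candidate i at t-1 to candidate j at t.
        trans = jump_weight * np.abs(logf[t - 1][:, None] - logf[t][None, :])
        scores = total[:, None] + trans
        back[t] = np.argmin(scores, axis=0)
        total = scores[back[t], np.arange(n_cand)] + cost[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmin(total))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return cand[np.arange(n_frames), path]

# Frame-wise estimates with one octave error in the middle frame.
raw = [200.0, 200.0, 400.0, 200.0, 200.0]
smoothed = median_smooth(raw, width=3)

# Two candidates per frame; the middle frame slightly prefers the wrong octave.
candidates = [[200.0, 400.0]] * 5
costs = [[0.0, 0.5], [0.0, 0.5], [0.3, 0.2], [0.0, 0.5], [0.0, 0.5]]
track = dp_track(candidates, costs)
```

In the toy data a frame-by-frame decision would pick 400 Hz in the middle frame, but the cost of jumping an octave up and back outweighs the small local advantage, so the path stays at 200 Hz.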
Issues and Challenges
Voice activity detection (VAD) / singing voice detection (SVAD)
– Discriminate voiced / unvoiced / silent frames
Latency: real-time implementation
– The use of long windows results in a slight delay
– Post-processing / probabilistic approaches need a larger delay
Noisy environments
– Learning-based approaches: NMF or classifiers
– Active research topic
Melody transcription
– Predominant pitch detection
– Singing voice separation + pitch tracking
– Active research topic
Polyphonic pitch
Musical Applications
Sound modification
– Time-stretching using PSOLA
– Auto-Tune: pitch correction or the T-Pain effect
Music performance
– Tuning musical instruments
– Pitch-based sound control: e.g. fiddle~
– Score following and auto-accompaniment
Query-by-humming
– Relative pitch change might be more important than absolute pitch
Singing evaluation (e.g. karaoke) and visualization
[Figure: original (variable) vs. time-stretched (N. Bryan, 2012)]
References
A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,”
A. Noll, “Cepstrum Pitch Determination,”
A. Noll, “Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum and a Maximum Likelihood Estimate,” 1969
M. Puckette, T. Apel and D. Zicarelli, “Real-time Audio Analysis Tools for Pd and MSP,” 1998
M. Sondhi, “New Methods of Pitch Extraction,”
D. Talkin, “A Robust Algorithm for Pitch Tracking (RAPT),”
M. Mauch and S. Dixon, “PYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions,”