Content-Based Retrieval of Music and Audio
Seminar: CS591k Multimedia Systems, 20th November, Fall 2003
By Rahul Parthe and Anirudha Vaidya


Instructor: Dr. Donald Adjeroh

Introduction
An audio search engine that retrieves, from a large database, sound files similar to an input query sound. Sounds are characterized by "templates" derived from a tree-based vector quantizer trained to maximize mutual information (MMI).

Basic Operation
1. A corpus containing different classes of audio files is parameterized into feature vectors.
2. A tree-based quantizer is constructed.
3. An audio template is generated from the parameterized data; the template captures the salient characteristics of the input audio.
4. A template is constructed for the query audio and matched against the templates in the database.

Basic Operation [cont.]
Fig 1: Audio template construction.

Audio Parameterization
The basic objective is to parameterize the audio files into mel-scaled cepstral coefficients (MFCCs). The audio waveform, sampled at 16 kHz, is transformed into a sequence of 13-dimensional feature vectors (12 MFCC coefficients + an energy term).

Audio Parameterization: Steps
1. The audio is Hamming-windowed in overlapping steps. Each window is 25 ms wide, and one second of audio contains 500 overlapping windows.
2. Compute the log power spectrum of each window using the DFT.
3. Apply mel scaling, which emphasizes the mid-frequency bands in order of their perceptual importance.
4. Transform the mel-scaled coefficients into cepstral coefficients with another DFT, yielding dimensionally uncorrelated features.
The audio waveform is thus transformed into 13-dimensional feature vectors (12 MFCCs + energy).
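The steps above can be sketched as a minimal MFCC-style front end in Python. The 25 ms window and 2 ms hop match the 500 windows/sec figure from the slide; the crude mel-style band pooling, the DCT sizes, and the energy term are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def mfcc_like(signal, sr=16000, win=0.025, hop=0.002, n_mels=24, n_ceps=12):
    """Rough MFCC front-end sketch: Hamming windows, log power spectrum,
    mel-style pooling, then a cosine transform to decorrelate."""
    n, step = int(sr * win), int(sr * hop)
    window = np.hamming(n)
    frames = [signal[i:i + n] * window
              for i in range(0, len(signal) - n + 1, step)]
    feats = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f)) ** 2            # power spectrum
        logspec = np.log(spec + 1e-10)                # log power
        # crude mel-style pooling: average neighboring bins into n_mels bands
        mel = np.array([b.mean() for b in np.array_split(logspec, n_mels)])
        # DCT-II to decorrelate (the slide calls this "another DFT")
        k = np.arange(n_ceps)[:, None]
        m = np.arange(n_mels)[None, :]
        ceps = np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels)) @ mel
        energy = logspec.mean()                       # energy term (assumed form)
        feats.append(np.concatenate([ceps, [energy]]))  # 12 coefficients + energy
    return np.array(feats)                            # shape (n_frames, 13)
```

With a 16 kHz signal this produces 13-dimensional vectors at roughly 500 frames per second, as described on the slide.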

Audio Parameterization [cont.]
Fig 2: Audio parameterization into mel-cepstral coefficients.

Tree-Structured Quantization
The quantization tree is grown offline using as much training data as possible. Quantization is supervised and tree-based, i.e., the quantizer learns the critical distinctions between classes while ignoring other variability. The advantage of this technique is that it can find similarities between similar sounds, intervals, or scales despite lumping the time-dependent vectors into one time-independent template.

Tree Construction
The quantizer tree partitions the feature space into distinct regions. Each threshold in the tree is chosen to maximize the mutual information I(X;C) between the data X and the associated class C. The best MMI split is found by considering all possible thresholds in all possible dimensions. Consider an MMI split on dimension d at threshold value t: the hyperplane divides the set of N training vectors X into two sets. At the first split (the root node), the left child inherits the training samples below the threshold, while the right child inherits those above it. The splitting process is repeated recursively on each child, producing more nodes and splits in the tree.
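A self-contained sketch of this construction, using a simple dict-based tree; the stopping threshold `min_mass_mi` is an illustrative choice, not a value from the paper:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy of a label array, in bits."""
    _, c = np.unique(labels, return_counts=True)
    p = c / c.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, C):
    """Search all dimensions and candidate thresholds for the split that
    maximizes I(split; class) = H(C) - H(C | side of threshold)."""
    best_mi, best_d, best_t = -1.0, None, None
    h0, n = entropy(C), len(C)
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d])[1:]:      # keeps both sides non-empty
            left = X[:, d] < t
            mi = h0 - (left.sum() / n) * entropy(C[left]) \
                    - ((~left).sum() / n) * entropy(C[~left])
            if mi > best_mi:
                best_mi, best_d, best_t = mi, d, t
    return best_mi, best_d, best_t

def grow(X, C, n_total, min_mass_mi=0.05):
    """Recursive MMI tree. Stops when the best-split MI, weighted by the
    cell's probability mass N_j / N (the slides' stopping rule), is small."""
    mi, d, t = best_split(X, C)
    if d is None or (len(C) / n_total) * mi < min_mass_mi:
        return {"leaf": True}
    left = X[:, d] < t
    return {"leaf": False, "dim": d, "t": float(t),
            "l": grow(X[left], C[left], n_total, min_mass_mi),
            "r": grow(X[~left], C[~left], n_total, min_mass_mi)}
```

On well-separated two-class data the root split lands on the separating threshold and both children become pure leaves.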

Tree Construction [cont.]
Each node in the tree corresponds to a hyper-rectangular cell in the feature space. The leaves of the tree partition the feature space into non-overlapping regions, as shown.
Fig 4: Nearest-neighbor MMI tree.

Estimating I(X;C)
H2 is the binary entropy function. The probabilities Pr(ci) and Pr(ai) are estimated from the empirical counts of training vectors of each class falling on either side of the split.
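The equations on this slide did not survive the transcript. A standard reconstruction, assuming a binary split variable A (with outcomes a1, a2 for the two sides of the threshold) and class variable C, estimates the split's mutual information from empirical counts:

```latex
I(A;C) = H(C) - \sum_{i=1}^{2} \Pr(a_i)\, H(C \mid a_i),
\qquad \Pr(a_i) = \frac{N_i}{N},
```

and, for a two-class problem, both conditional entropies reduce to the binary entropy function:

```latex
H(C \mid a_i) = H_2\!\big(\Pr(c_1 \mid a_i)\big),
\qquad H_2(p) = -p \log_2 p - (1-p)\log_2(1-p).
```

Here N_i is the number of training vectors on side a_i of the threshold and N is the total count in the cell.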

Stopping Condition
The stopping rule decides when further splits are unnecessary and halts the recursive splitting process. The best-split mutual information is weighted by the probability mass inside the cell to be split; the stopping metric for cell l_j is stop(l_j) = (N_j / N) · I_j(X;C), where N_j is the number of data points in cell j and N is the total number of data points.

Template Generation
The tree partitions the feature space into L non-overlapping regions, or cells, each corresponding to a leaf of the tree. One way to use it would be to label each leaf with a class name and then use the tree as a classifier, but this would not be robust, since the cells overlap classes and contain data from many of them. The approach the paper suggests instead is to use the ensemble of leaf probabilities from the quantized class data: in short, the histogram of leaf probabilities over a sequence of frames. The resulting histogram captures the essential class qualities, so it can be compared with other histograms.

Template Generation [cont.]
Since the size of the tree determines the size of the templates, the tree can easily be pruned to give a variable number of free parameters per application, allowing better characterization of the data. Because processing is one-dimensional, quantization is rapid, taking only O(log N) time for an N-leaf tree.
Fig: Visual approximation of the vectors.
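The leaf-histogram idea can be sketched as follows; `Node`, `leaf_of`, and `template` are hypothetical names for illustration, not the paper's code:

```python
import numpy as np

class Node:
    """Minimal quantizer-tree node: either a split (dim, t) or a leaf."""
    def __init__(self, dim=None, t=None, left=None, right=None, leaf_id=None):
        self.dim, self.t = dim, t
        self.left, self.right, self.leaf_id = left, right, leaf_id

def leaf_of(node, x):
    """Drop one feature vector down the tree; each step tests one dimension."""
    while node.leaf_id is None:
        node = node.left if x[node.dim] < node.t else node.right
    return node.leaf_id

def template(frames, root, n_leaves):
    """Histogram of leaf occupancy over all frames: the time-dependent
    frame sequence collapses into one time-independent template."""
    h = np.zeros(n_leaves)
    for x in frames:
        h[leaf_of(root, x)] += 1
    return h / max(len(frames), 1)    # normalize to leaf probabilities
```

Because each frame follows at most depth-many scalar comparisons, quantizing a frame is O(log N) for a balanced N-leaf tree, as the slide notes.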

Distance Metrics
The generated templates must be compared to references in order to determine which class they belong to; comparing them measures acoustic similarity. Several distance measures have been proposed, but the two main ones in use are Euclidean distance and cosine distance. The Euclidean measure treats the histograms as N-dimensional vectors and computes the L2 norm between them. The cosine measure also treats the histograms as N-dimensional vectors but measures the relative angle between them; it is often more effective because it is independent of the magnitudes of the vectors.

Distance Metrics [cont.]
Euclidean distance: D_E(p, q) = sqrt( Σ_i (p_i − q_i)² )
Cosine distance: D_C(p, q) = (p · q) / (‖p‖ ‖q‖)
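A minimal sketch of the two measures applied to template histograms:

```python
import numpy as np

def euclidean(p, q):
    """L2 norm between two template histograms."""
    return float(np.linalg.norm(p - q))

def cosine_sim(p, q):
    """Cosine of the angle between two template histograms;
    insensitive to overall magnitude, unlike the Euclidean measure."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```

Scaling a histogram leaves its cosine similarity to itself at 1, which is why cosine comparison is robust to differences in clip length or overall energy.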

Classification
The query template is matched against the corpus templates using the distance measures discussed previously. The results are sorted into a list ordered by similarity; they can be imagined as the results of a search engine such as Google. The search must cover the full database, making it an exhaustive search, since all the distances must be computed for comparison.
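The exhaustive search can be sketched as a linear scan that scores every corpus template against the query and sorts by similarity; `rank_matches` and the corpus layout are illustrative assumptions:

```python
import numpy as np

def rank_matches(query, corpus):
    """Exhaustive linear scan over a {name: template} corpus:
    score every template by cosine similarity to the query and
    return the names in best-first order, like search-engine results."""
    def cos(p, q):
        return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))
    return sorted(corpus, key=lambda name: cos(query, corpus[name]),
                  reverse=True)
```

The cost is linear in the corpus size, which is the point the slide makes: every stored template must be compared to the query.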

Experiments & Results: 1. Sound Retrieval
A simple test was conducted to compare the performance of the system with the Muscle Fish system on the web. Two types of trees were used: one quasi-supervised and one supervised. Quasi-supervised means the tree was used to classify the whole sample space into distinct classes, resulting in a number of cells in the feature space equal to the size of the sample space. The supervised tree was used to classify each sample into a subclass, or group with similar properties, which gives better results.

Experiments & Results [cont.]
Table: Retrieval average precision (AP) for different schemes: quantization tree (unsupervised and supervised, cosine distance D_c) versus Muscle Fish (with and without DPL), over the sound classes Laughter (M), Oboe, Agogo, Speech (F), Touchtone, and Rain/Thunder, plus mean AP; the numeric values did not survive the transcript. The quantization-tree results used the unweighted cosine distance measure. Both kinds of distance measure were tried but, as mentioned, cosine performed much better.

Experiments & Results [cont.]: 2. Music Retrieval
In this application, music clips were used for classification. The genres included jazz, pop, rock, and rap. Clips from the same artist were treated as belonging to the same class. Each artist had 5 clips in the corpus, which consisted of 5 clips per artist from 40 artists (200 clips in total).
Table: Retrieval average precision (AP) for the music retrieval experiment, comparing Euclidean distance D_E (supervised and unsupervised) with cosine distance D_c (supervised); the numeric values did not survive the transcript.

Conclusions
Retrieval works effectively for complex data and measures acoustic similarity. The sorted comparison results give the order of similarity between the query and the references in the corpus. The computational and storage requirements are modest, since the feature vectors are just arrays of integers and the Q-tree quantization and classification operate in one dimension. The method can be used to automatically segment multimedia data based on changes in speaker, pauses, musical interludes, etc. Finally, by varying the number of free parameters and ignoring dimensions that are never used, the templates can be optimized for the application's requirements.

Limitations
The classifier is a simple straight-plane classifier, which can only distinguish between two subspaces with a single plane; real-life vector distributions may not be separable this way. Only simple acoustic parameters are used for matching; more sophisticated systems could use other parameters such as pitch and speaker-dependent properties. Pre-recorded music clips are needed to identify genres, and with distortion or losses in the clips the system will not work well. There is no provision for dynamic training, i.e., the system is not self-updating.

Suggestions
Neural networks could be used to divide the feature space in a more complex fashion, forming curved, concave, and hybrid decision surfaces. If the dimensionality of the feature vectors is increased, simple Euclidean distance will no longer suffice and other reasoning methods will be needed. Additional features such as pitch, brightness, and speaker-dependent parameters could be used to achieve good classification with a smaller database. Dynamic training should be added so the system includes new samples in the database as it encounters them. Speech recognition could possibly be added so the user can search for data with a spoken query.

References
[1] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," Tech. Rep., University of Mannheim, Mannheim, Germany, April 1996.
[2] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, pp. 27–36, Fall 1996.
[3] T. Blum, D. Keislar, J. Wheaton, and E. Wold, "Audio analysis for content-based retrieval," Tech. Rep., Muscle Fish LLC, 2550 Ninth St., Suite 207B, Berkeley, CA 94710, USA, 1996.
[4] B. Feiten and S. Gunzel, "Automatic indexing of a sound database using self-organizing neural nets," Computer Music Journal, 18(3), pp. 53–65, 1994.

Questions & Comments

Thanks