1. Kernel-Based Detectors and Fusion of Phonological Attributes
Brett Matthews, Mark Clements
The Center for Signal & Image Processing, Georgia Institute of Technology

2. Outline
- Frame-Based Detection
  - One-vs-all detectors
  - Context-dependent framewise detection
  - Probabilistic outputs
- Kernel-Based Attribute Detection
  - SVM
  - Least-Squares SVM
- Evaluating Probabilistic Estimates
  - Naïve Bayes combinations
  - Hierarchical manner classification
- Detector Fusion
  - Genetic programming

3. Frame-Based Detection
- One-vs-all classifiers
  - Manner of articulation: vowel, fricative, stop, nasal, glide/semivowel, silence
  - Place of articulation: dental, labial, coronal, palatal, velar, glottal, back, front
  - Vowel manners: high, mid, low, back, round
- Framewise detection
  - 10 ms frame rate
  - 12 MFCCs + energy
  - 8 context-dependent frames
- Classifier types & posterior probabilities
  - Artificial neural nets: probabilistic outputs
  - Kernel-based classifiers: SVM, with empirically determined posterior probabilities
  - LS-SVMs: probabilistic outputs
[Diagram: attribute detectors (vowel, silence, dental, velar, voicing) feeding event fusion]
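The framewise inputs above (a 13-D frame of 12 MFCCs plus energy, augmented with 8 context frames) can be sketched as a simple frame-stacking step. This is a minimal illustration, not the authors' code; the symmetric 4+4 context split and edge padding are assumptions, since the slide says only "8 context dependent frames".

```python
import numpy as np

def stack_context(feats, context=4):
    """Build framewise classifier inputs by stacking each 13-D frame
    (12 MFCCs + energy) with `context` frames on each side.
    With context=4 this yields 8 context frames, i.e. 9 * 13 = 117-D inputs.
    Utterance edges are handled by repeating the first/last frame (an
    assumption; the original padding scheme is not stated)."""
    n, d = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack(
        [padded[i : i + 2 * context + 1].ravel() for i in range(n)]
    )
```

Each row of the output is then one training or test vector for the one-vs-all detectors.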

4. Kernel-Based Classifiers
- Support Vector Machines (SVM)
- LS-SVM classifier
  - Kernel-based classifier like the SVM
  - Least-squares formulation
  - Probabilistic output scores
  - LS-SVMlab package (Katholieke Universiteit Leuven)
- Same decision function as the SVM: f(x) = sign( sum_k alpha_k y_k K(x, x_k) + b )
- Equality constraints instead of inequality constraints: y_k (w^T phi(x_k) + b) = 1 - e_k
- No margin optimization; training reduces to solving a linear system
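The "linear system solution" mentioned above can be made concrete. The following is a minimal sketch of LS-SVM training in the standard Suykens formulation (solving the dual system for the weights alpha and bias b), not the LS-SVMlab implementation; the RBF kernel and parameter values are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    """Train an LS-SVM classifier by solving the dual linear system
        [ 0    y^T            ] [b]     [0]
        [ y    Omega + I/gamma] [alpha] [1]
    where Omega_kl = y_k y_l K(x_k, x_l)."""
    n = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]  # alpha, b

def lssvm_predict(X_test, X, y, alpha, b, sigma=1.0):
    """Decision values f(x) = sum_k alpha_k y_k K(x, x_k) + b."""
    return rbf_kernel(X_test, X, sigma) @ (alpha * y) + b
```

Because only a linear solve is involved, training is simple, but (unlike the SVM) every training point typically gets a nonzero alpha, so there is no sparse support-vector set.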

5. Least-Squares SVMs
- "Support vector" weights alpha found by solving a linear system
- Kernel functions:
  - Linear: K(x, z) = x^T z
  - Polynomial: K(x, z) = (x^T z + c)^d
  - RBF: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
- Probabilistic outputs
  - Bayesian inference for posterior probabilities
  - Moderated outputs can be directly interpreted as posterior probabilities
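LS-SVMlab obtains posteriors via Bayesian moderation of the outputs; a commonly used alternative for turning raw kernel-classifier scores into posterior estimates is Platt's sigmoid fit, sketched below as an illustration (this is not the slides' method, and the simple gradient-descent fit stands in for the usual likelihood maximization).

```python
import math

def sigmoid_posterior(score, A, B):
    """Map a raw decision value to P(y=+1 | score) via a fitted sigmoid
    (Platt scaling); a well-fit A is negative so larger scores give
    larger posteriors."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit A, B by gradient descent on the cross-entropy of
    p = 1 / (1 + exp(A*s + B)) against binary labels (0/1)."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # dL/dA = (p - y) * dz/dA with z = -(A*s + B), so dz/dA = -s
            gA += (p - y) * (-s)
            gB += (p - y) * (-1.0)
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B
```

Whatever mapping is used, the resulting probabilities still need the reliability checks discussed on the next slides.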

6. Evaluating Probabilistic Estimates
- Reliability and accuracy of probabilistic scores
- Initial fusion experiments
  - Hierarchical manner classification (LS-SVM, SVM)
  - Naïve Bayes combination for phone detection (LS-SVM, SVM, ANN)
[Plots: reliability diagrams for LS-SVM and SVM]

7. Hierarchical Combinations
- Probabilistic phonetic-feature hierarchy for classifying frames into 6 manner classes
  - Train binary detectors on each split in the hierarchy
  - 5 detectors, 6 classes:
    - silence vs. speech
    - sonorant | speech
    - vowel | sonorant
    - stop | non-sonorant
    - semivowel | sonorant consonant
- Example: P(fricative | x) = (1 - P(stop | non-sonorant)) · (1 - P(sonorant | speech)) · P(speech | x)
[Plot: fricative detection scores and ground truth]
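The products along each branch of the hierarchy can be written out for all 6 manner classes. This sketch follows the fricative example above; the exact placement of the nasal class (as the remaining sonorant-consonant branch) is an assumption consistent with 5 detectors covering 6 classes.

```python
def manner_posteriors(p_speech, p_sonorant, p_vowel, p_stop, p_semivowel):
    """Combine 5 binary detector posteriors along the phonetic
    hierarchy into posteriors for 6 manner classes.
    Each class probability is the product of branch probabilities
    on the path from the root (silence vs. speech) to its leaf."""
    p = {}
    p["silence"] = 1 - p_speech
    p["vowel"] = p_speech * p_sonorant * p_vowel
    p["semivowel"] = p_speech * p_sonorant * (1 - p_vowel) * p_semivowel
    p["nasal"] = p_speech * p_sonorant * (1 - p_vowel) * (1 - p_semivowel)
    p["stop"] = p_speech * (1 - p_sonorant) * p_stop
    p["fricative"] = p_speech * (1 - p_sonorant) * (1 - p_stop)
    return p
```

By construction the 6 class posteriors sum to 1, so a frame can be classified by taking the argmax.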

8. Hierarchical Combinations
1. Reliability of posterior probabilities (right)
  - Plot probabilistic estimates of combinations vs. observed frequencies
  - Hierarchical combinations are much more reliable for SVM than for LS-SVM
2. Classification accuracy (below)
  - Higher classification accuracy for SVMs, especially for fricatives
3. Upper-bound comparison (below)
  - One-vs-all classifiers trained directly for each class
  - Combinations nearly as accurate as one-vs-all classifiers
  - LS-SVM combinations perform poorly for semivowels and nasals
[Tables: classification accuracy (%) per manner class (stop, vowel, fricative, nasal, silence, semivowel/glide) for combined LS-SVM and combined SVM]

9. Naïve Bayes Combinations
- One-vs-all frameworks desired; phonetic hierarchies are cumbersome
- Phone detection
  - Combine phonological attribute scores with a Naïve Bayes product
  - Initial experiments in evaluating probabilities
- Compare accuracy and reliability of probabilistic outputs for ANN, SVM, and LS-SVM
  - Limited training data (LS-SVM is limited to 3000 samples by memory restrictions)
- Detect phones with combinations of relevant phonetic attributes
  - Example: P(/f/ | x) = P(labial | x) · P(fricative | x) · (1 - P(voicing | x))
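The Naïve Bayes product above generalizes to any phone: multiply the posteriors of attributes the phone requires and the complements of attributes it excludes. A minimal sketch, with illustrative probability values:

```python
def phone_posterior(attr_probs, present, absent):
    """Naive Bayes combination of attribute detector posteriors:
    the product of P(attr | x) for required attributes and
    (1 - P(attr | x)) for excluded ones, treating attributes
    as conditionally independent given the frame."""
    p = 1.0
    for a in present:
        p *= attr_probs[a]
    for a in absent:
        p *= 1.0 - attr_probs[a]
    return p

# Slide example: /f/ is a voiceless labial fricative
probs = {"labial": 0.8, "fricative": 0.9, "voicing": 0.1}
p_f = phone_posterior(probs, ["labial", "fricative"], ["voicing"])
```

The independence assumption is exactly what makes the combination "naïve", and is one reason the combined scores can be poorly calibrated even when the individual detectors are reliable.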

10. Naïve Bayes Combinations
1. Phone detection
  - Compare combined attributes with direct training on phones as an upper bound
2. ROC stats (right)
  - SVMs best for attribute detection
  - Mixed results for NB combinations; no clear winner between LS-SVM and SVM
  - Direct training outperforms combinations
3. Reliability
  - Naïve Bayes combinations give poor reliability for all detector types
4. Rare phones & vowels
  - For /v/, /ng/, and /oy/, improvements in EER and AUC across detector types (lower right)
  - Most vowels saw improvements as well
[Tables: ROC stats, direct vs. combined]
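The EER and AUC figures reported above are standard ROC statistics; as a reference for how they are computed from detector scores, here is a minimal sketch (rank-based AUC, and EER as the operating point where false-positive and false-negative rates are closest). It assumes distinct scores and ignores tie handling.

```python
def roc_stats(scores, labels):
    """Compute (AUC, EER) from detector scores and binary labels.
    AUC counts, for each negative, the positives ranked above it;
    EER is taken where |FPR - FNR| is smallest over the sweep."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    auc = 0.0
    eer, best_gap = 1.0, float("inf")
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
            auc += tp  # positives ranked above this negative
        fpr, fnr = fp / neg, 1 - tp / pos
        if abs(fpr - fnr) < best_gap:
            best_gap = abs(fpr - fnr)
            eer = (fpr + fnr) / 2
    return auc / (pos * neg), eer
```

A perfect detector gives AUC = 1.0 and EER = 0.0; random scoring gives AUC near 0.5.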

11. Naïve Bayes Combinations (cont.)
[Plots: reliability of combined attributes (SVM) vs. direct training (SVM)]

12. Genetic Programming
- Evolutionary algorithm for tree-structured feature "creation" (extraction)
- Maximizes a fitness function across a number of generations (iterations)
- Operations like crossover and mutation control the evolution of the algorithm
- Trees are algebraic networks
  - Inputs are multi-dimensional features
  - Tree nodes are unary or binary mathematical operators (+, -, *, (·)², log)
  - Algebraic networks are simpler and more transparent than neural nets
- GPLab package from Universidade de Coimbra, Portugal
  - http://gplab.sourceforge.net

13. Genetic Programming
- Trained GP trees on SVM outputs
- Develop algebraic networks for combining detector outputs
- Produce a 1-D feature from a nonlinear combination of detector outputs
  - Choose the fitness function, set of node operators, tree depth, etc. to maximize separation
[Diagram: attribute detectors (vowel, silence, dental, velar, voicing) feeding a GP tree that maps phone scores (/aa/, /ae/, ..., /zh/) to a 1-D feature]
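The "algebraic network" evaluated by a GP tree can be sketched directly. This is an illustration of the tree-evaluation step only (the evolutionary search itself is done by GPLab); the example tree and detector scores are hypothetical, and the protected log is a common GP convention assumed here.

```python
import math

# Internal nodes are unary or binary operators from the slide's set
# (+, -, *, (.)^2, log); log is "protected" against non-positive input.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sq": lambda a: a * a,
    "log": lambda a: math.log(abs(a) + 1e-9),
}

def evaluate(tree, detector_scores):
    """Recursively evaluate an algebraic network on one frame's
    detector outputs, producing a 1-D feature.
    Leaves are detector names (strings); internal nodes are
    (op, child, ...) tuples."""
    if isinstance(tree, str):
        return detector_scores[tree]
    op, *children = tree
    return OPS[op](*(evaluate(c, detector_scores) for c in children))

# Hypothetical evolved tree: log(P(vowel) * P(voicing)) - P(silence)^2
tree = ("-", ("log", ("*", "vowel", "voicing")), ("sq", "silence"))
scores = {"vowel": 0.8, "voicing": 0.9, "silence": 0.1}
feat = evaluate(tree, scores)
```

Because the tree is a small explicit formula, one can read off which attribute detectors it actually uses, which is the transparency advantage over neural-net fusion noted above.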

14. Genetic Programming
- The system is complex for speech recognition (a tree + classifier for each phone), but GP trees themselves provide insights for combination:
  - Fitness function
  - Tree node operators
  - Important features
- Initial results
  - Mixed: good separation for some phones, but not for most
  - GP trees select attributes of interest and discard others
  - Still in progress
[Plots: 1-D feature distributions for /oy/ and /th/]

15. Summary
- Evaluating posterior probabilities
  - ANNs, SVMs, LS-SVMs
  - SVMs are best for reliability and accuracy
  - With limited training data, rare phones may benefit from overlapping phonetic classes
- Genetic programming for detector fusion
  - Small, transparent algebraic networks for combining attribute detectors
  - GP trees select relevant attributes, but there is much room for improvement
  - Limiting tree node operators and selecting fitness functions should provide insights into detector fusion


17. Extras
[Figures: feature-space correlation matrices (1)-(3); table of training data, the kernel function K, and the range of kernel parameters]

18. Extras
- LS-SVM primal problem: determine w and b by solving
  min_{w,b,e} J(w, e) = (1/2) w^T w + gamma (1/2) sum_k e_k^2
  subject to y_k (w^T phi(x_k) + b) = 1 - e_k, k = 1, ..., N
- e_k is the regression error for training sample k; (1/2) w^T w is the generalization/regularization term
- gamma is a positive scale parameter expressing the trade-off between generalization and training-set error

19. Extras
- Support Vector Machines
  - Good performance, but the majority of training points became support vectors
  - Posterior probabilities

