You’re Not From ‘Round Here, Are You? Naïve Bayes Detection of Non-native Utterance Text
Laura Mayfield Tomokiyo, Rosie Jones
Carnegie Mellon University


1 You’re Not From ‘Round Here, Are You? Naïve Bayes Detection of Non-native Utterance Text
Laura Mayfield Tomokiyo, Rosie Jones
Carnegie Mellon University

2 Overview
- Motivation
- Speech data
- Accent detection as document classification
- Classification performance
- Discriminative tokens
- Conclusions

3 Non-native speech recognition
The warship U.S.S. Jarrett has pulled into port in San Diego, CA after training voyage
Native recognizer (word accuracy = 26.7): Tomorrow CPU a sister at has spilled into port and sandy and afford after a training wage
Non-native recognizer (word accuracy = 73.3): The worst eighty U.S.S. chart has pulled into port in San Diego California after training warrior
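The slide does not say how its word accuracy figures were computed. A common definition is 100 * (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions, and insertions in a minimum-edit alignment. The sketch below assumes that definition and whitespace tokenization; the function name `word_accuracy` is hypothetical, and the result depends on how the reference is tokenized and normalized.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy in percent, assuming 100 * (N - S - D - I) / N.

    Assumes a non-empty reference and whitespace tokenization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the minimum edits to turn the reference prefix
    # processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return 100.0 * (len(ref) - prev[-1]) / len(ref)
```

Because insertions are counted as errors, this measure can drop below zero when the hypothesis is much longer than the reference.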

4 Motivation
- Practical
  - can we detect non-native users with enough accuracy to switch acoustic models?
- Exploratory
  - how well does an algorithm based only on text features work?
  - what tokens are discriminative for non-native speakers?

5 Speech examples
Read speech: Over the next two months, public officials, Native American leaders, businesses and environmental groups will come up with plans for meeting the law’s requirements.
Spontaneous speech: I like to have anything very special in Boston, very native in Boston. (Local specialties)

6 Speech data
                  Read speech                                         Spontaneous speech
Native language   Speaker count  Utterance count  Word count (types)  Speaker count  Utterance count  Word count (types)
Japanese          10             957              15868 (3195)        31             1685             15934 (826)
English           8              756              10237 (2073)        6              320              4117 (418)
Mandarin          -              -                -                   6              374              3490 (391)

7 Transcripts and hypotheses
“A safety net for salmon: environmentalists, the government, and ordinary folks team up to save the Northwest’s wondrous wild salmon”
Transcript: A safety net for the salmons Environment= environmentalists…
Hypothesis: A safety net forced simon Um environmental activists…
Classification based on transcripts:
- Usually gives a good idea of gold standard
- Finds true differences in linguistic usage
Classification based on hypotheses:
- Implicitly models acoustics
- Benefits from amplified difference between native and non-native samples

8 Related work
- Acoustic feature based accent discrimination (e.g. Fung and Liu 1999)
- Competing HMM based accent discrimination (e.g. Teixeira et al. 1996)
- Classification of documents according to style (Argamon-Engelson et al. 1998), author (Mosteller and Wallace 1964)

9 Accent detection as document classification
[Diagram: Native speaker utterances, Non-native speaker utterances → Classifier]

10 Accent detection as document classification
[Diagram: Test speaker utterances → Classifier → Classification decision: native or non-native?]

11 Experimental methodology
- Rainbow naïve Bayes classifier
- Both word and part-of-speech tokens were examined
- Classification based on token unigrams and bigrams
- No feature selection initially
- Stopwords were not excluded from feature set
- Data randomly split into 30% testing, 70% training data for evaluation; evaluation repeated 20 times and classification results averaged
- Utterances from the same speaker never appeared in both training and test sets
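The evaluation protocol above can be sketched in plain Python. This is not the Rainbow toolkit: it is a minimal multinomial naïve Bayes with add-one smoothing (an assumption, since the slides do not state the smoothing used), word unigram and bigram features, and a split that assigns whole speakers to either training or test. Splitting 30% of speakers rather than 30% of utterances is also an assumption, and all function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def tokens_with_bigrams(text):
    """Return word unigrams plus adjacent-word bigrams for one utterance."""
    words = text.lower().split()
    return words + [f"{a}_{b}" for a, b in zip(words, words[1:])]


def train_nb(samples):
    """samples: list of (label, text). Collect counts for naive Bayes."""
    doc_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        doc_counts[label] += 1
        for tok in tokens_with_bigrams(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, token_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    doc_counts, token_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens_with_bigrams(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def speaker_disjoint_split(data, test_fraction, rng):
    """data: list of (speaker, label, text). Whole speakers go to one side."""
    speakers = sorted({spk for spk, _, _ in data})
    rng.shuffle(speakers)
    n_test = max(1, round(test_fraction * len(speakers)))
    test_speakers = set(speakers[:n_test])
    train = [(lab, txt) for spk, lab, txt in data if spk not in test_speakers]
    test = [(lab, txt) for spk, lab, txt in data if spk in test_speakers]
    return train, test


def mean_accuracy(data, trials=20, test_fraction=0.3, seed=0):
    """Average accuracy over repeated random speaker-disjoint splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        train, test = speaker_disjoint_split(data, test_fraction, rng)
        model = train_nb(train)
        correct = sum(classify(model, txt) == lab for lab, txt in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Part-of-speech experiments would use the same code with tag sequences in place of word strings, assuming the utterances have already been tagged.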

12 Classification of spontaneous speech (transcripts only)
[Chart categories: Native/Japanese, Native/Chinese, Japanese/Chinese, Native/Non-native, Native/Japanese/Chinese]

13 Classification of read speech
[Chart legend]
A: train: same texts, test: same texts
baseline

14 Classification of read speech
[Chart legend]
A: train: same texts, test: same texts
B: train: disjoint texts, test: disjoint texts
C: train: disjoint texts, test: same texts
D: train: same texts, test: disjoint texts
baseline

15 Classification of read speech
[Chart legend]
A: train: same texts, test: same texts
B: train: disjoint texts, test: disjoint texts
C: train: disjoint texts, test: same texts
D: train: same texts, test: disjoint texts
baseline

16 Feature Selection
Method              Number of features  Accuracy
None                4087                47
IG-524              524                 69
SMART-524           524                 88
IG-200              200                 74
SMART-524, IG-200   200                 88
IG-70               70
M&W-70              70                  87
IG-48               48                  74
SMART-48            48                  84
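Assuming "IG-k" in the table means keeping the k tokens with the highest information gain (the slide does not define the abbreviation), the selection step can be sketched as follows. Each document is reduced to the set of tokens it contains, and a token is scored by how much knowing its presence or absence reduces the entropy of the class label. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, token):
    """docs: list of (label, set_of_tokens). Gain of the present/absent split."""
    all_labels = Counter(label for label, _ in docs)
    present = Counter(label for label, toks in docs if token in toks)
    absent = all_labels - present  # Counter subtraction keeps positive counts
    n = len(docs)
    gain = entropy(all_labels.values())
    for part in (present, absent):
        size = sum(part.values())
        if size:  # skip an empty side of the split
            gain -= (size / n) * entropy(part.values())
    return gain


def top_k_tokens(docs, k):
    """Return the k tokens with the highest gain (ties broken alphabetically)."""
    vocab = set().union(*(toks for _, toks in docs))
    return sorted(vocab, key=lambda t: (-information_gain(docs, t), t))[:k]
```

A classifier restricted to `top_k_tokens(docs, k)` would then ignore every other token at both training and test time.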

17 Discriminative sequences
Columns: Speech type, Token type, Native, Non-native (transcriptions, hypotheses)
Read, Word: NMFS, the + the, the, that
Read, POS: noun(pl), noun(sing), noun(pl), verb(past)
Spontaneous, Word: Wonderland, the
Spontaneous, POS: TO + verb(base), noun(sing)
Spontaneous, POS: Nounam, noun(sing)

18 Conclusions
- Transcriptions of spontaneous speech can be classified with high accuracy for both 2-way and 3-way distinctions
- Read speech samples, which are simple transformations of native-produced text, can be classified with high accuracy
- Recognizer output is classified more accurately than transcripts

19 Future directions
- Incorporating the classification decision in acoustic model selection
- Minimizing the number of samples from the test speaker needed for classification
- Applying classification to parsing grammar selection, language model construction, writer identification

20 Discriminative POS sequences
Native                  Non-native
Noun(pl)                Noun(sing)
Determiner              Preposition
Noun(pl);preposition    Preposition;preposition
Adjective;noun(Pl)      Noun(sing);noun(sing)
Gerund;particle         Particle;preposition
Noun(s);verb(3s)        Cardinal#;cardinal#
Noun(pl);modal          Verb(past)

21 Discriminative word sequences
Native               Non-native
NMFS                 the;the
the;NMFS             in;in
nineteen;hundreds    the
hundreds;now         in
hundreds             that
habitats;and         habitat;and

22 Phone-based classification
Discriminative tokens, Condition B
Columns: Native, Non-native
Phone identity: /D//D/ /I/
Phone class: CCC V

