1
Towards Separating Trigram-generated and Real Sentences with SVM
Jerry Zhu, CALD KDD Lab
2001/4/20
2
Domain: Speech Recognition
A large portion of recognition errors is due to over-generation by trigram language models. If we can detect trigram-generated sentences, we can improve accuracy. Example trigram-generated sentences:
- when the director of the many of your father
- and so the the monster
- and here is obviously a very very profitable business
- his views on today's seek level
- thanks very much for being with us
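A minimal sketch of why a trigram model over-generates: each step conditions only on the previous two words, so locally fluent but globally incoherent sentences like the examples above fall out naturally. The function names and the `<s>`/`</s>` boundary markers are illustrative, not from the talk.

```python
import random
from collections import defaultdict

def train_trigram(corpus):
    """Map each bigram context (w1, w2) to its observed continuations w3."""
    model = defaultdict(list)
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            model[(w1, w2)].append(w3)  # duplicates encode frequency
    return model

def sample_sentence(model, max_len=30):
    """Generate by repeatedly drawing w3 given the last two words only.

    Every local window is plausible, but nothing enforces long-range
    syntax or meaning -- the over-generation described above.
    """
    w1, w2 = "<s>", "<s>"
    out = []
    for _ in range(max_len):
        w3 = random.choice(model[(w1, w2)])
        if w3 == "</s>":
            break
        out.append(w3)
        w1, w2 = w2, w3
    return " ".join(out)
```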
3
A two-class classification problem: is a sentence 'fake' (trigram-generated) or real?
- Data: 100k fake and 100k real long (> 7 words) sentences.
- 'Fake' sentences don't look right (bad syntax) and don't make sense (bad semantics).
- The problem boils down to finding good features. Semantic coherence has been explored [Eneva et al.], but not syntactic features or the combination of the two.
- Use the SVM margin for probabilities.
4
Previous work: semantic features
Around 70 semantic features, most interestingly:
- Content word co-occurrence statistics
- Content word repetition
Decision tree + boosting: around 80% accuracy. We hope that adding syntactic features will significantly improve accuracy.
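A rough sketch of what the two semantic features named above could look like; the stopword list, function names, and exact definitions are assumptions, not details from [Eneva et al.].

```python
from itertools import combinations

# Assumed, abbreviated stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for", "that"}

def content_words(sentence):
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

def repetition_rate(sentence):
    """Fraction of content-word tokens that repeat an earlier content word."""
    words = content_words(sentence)
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)

def cooccurrence_score(sentence, pair_counts):
    """Average corpus co-occurrence count over the sentence's content-word
    pairs; pair_counts maps frozenset({w1, w2}) -> count in real text.
    Coherent sentences should pair words that actually co-occur."""
    words = sorted(set(content_words(sentence)))
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(pair_counts.get(frozenset(p), 0) for p in pairs) / len(pairs)
```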
5
Exploring syntactic features
- Bag-of-words features (raw counts, frequency, or binary; linear or polynomial kernel): 57%
- Tag with part-of-speech (39 POS tags): when/WRB the/DT director/NN of/IN the/DT many/NN of/IN your/PRP$ father/NN
- Bag-of-POS: 56%
- Sparse sequences of POS: any k POS tags in that order, weighted by their span; 39^k possible features. For example, the feature for the subsequence WRB-IN-DT might take the value 5. A sketch follows.
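A minimal sketch of the sparse POS sequence features, in the spirit of gap-weighted string subsequence kernels; the decay weighting and the parameter names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def sparse_pos_features(pos_tags, k=3, max_span=8, decay=0.5):
    """Gap-weighted sparse subsequences of POS tags.

    Every in-order k-subsequence whose span (last index - first index + 1)
    is at most max_span adds decay**span to the feature named by the tag
    tuple, e.g. 'WRB-IN-DT'. With 39 tags there are 39**k possible
    features, but only a sparse handful fire in any one sentence.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(pos_tags)), k):
        span = idx[-1] - idx[0] + 1
        if span <= max_span:
            feats["-".join(pos_tags[i] for i in idx)] += decay ** span
    return feats

# The tagged example above:
tags = ["WRB", "DT", "NN", "IN", "DT", "NN", "IN", "PRP$", "NN"]
print(sparse_pos_features(tags)["WRB-IN-DT"])  # weight of one feature
```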
6
Exploring syntactic features (cont.)
- Sparse sequences work well on letters for text categorization, but on POS tags: 58% (k=3, max span=8).
- Keep stopwords as words and reduce everything else to POS: WRB the NN of the many of your NN
- Sparse sequences on stopwords&POS: 57%
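A sketch of the stopwords&POS representation, assuming the tagged sentence arrives as (word, tag) pairs; the stopword set here is chosen only to reproduce the slide's example.

```python
def stopword_pos_tokens(tagged, stopwords):
    """Keep stopwords verbatim; replace other words with their POS tag."""
    return [w if w.lower() in stopwords else t for w, t in tagged]

tagged = [("when", "WRB"), ("the", "DT"), ("director", "NN"), ("of", "IN"),
          ("the", "DT"), ("many", "NN"), ("of", "IN"), ("your", "PRP$"),
          ("father", "NN")]
print(" ".join(stopword_pos_tokens(tagged, {"the", "of", "many", "your"})))
# -> WRB the NN of the many of your NN
```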
7
Exploring syntactic features (cont.)
- Stopwords&POS 4-grams: novelty rate; count distribution statistics (likelihood ratio, min, max, median, and mean counts).
- These combined with semantic features: 75%
- Semantic features alone: 77%
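A rough sketch of the 4-gram features (omitting the likelihood ratio); the exact definitions are assumptions. Here train_counts would hold stopwords&POS 4-gram counts gathered from real training text.

```python
from statistics import mean, median

def fourgram_features(tokens, train_counts):
    """Novelty rate and count-distribution statistics of a sentence's
    stopwords&POS 4-grams against counts from real text.
    Trigram-generated sentences should contain more unseen 4-grams."""
    grams = list(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))
    counts = [train_counts.get(g, 0) for g in grams]
    if not counts:  # data here is > 7 words, so this is only a guard
        return {"novelty_rate": 1.0, "min": 0, "max": 0, "median": 0, "mean": 0}
    return {
        "novelty_rate": sum(c == 0 for c in counts) / len(counts),
        "min": min(counts),
        "max": max(counts),
        "median": median(counts),
        "mean": mean(counts),
    }
```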
8
SVM margin
Empirically, the margin distribution has a 'good shape'.
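Turning margins into probabilities (slide 3's "SVM margin for probabilities") is commonly done with Platt-style scaling: a one-dimensional logistic regression fit on the margin. A sketch using scikit-learn, which postdates this 2001 talk; the separate calibration split is an assumption.

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def margin_to_probability(X_train, y_train, X_calib, y_calib):
    """Fit an SVM, then calibrate its margins into P(fake | sentence).

    If the margin distribution has a 'good shape' -- fake and real
    sentences piling up on opposite sides of the hyperplane -- this
    one-dimensional logistic fit yields useful probabilities.
    """
    svm = LinearSVC().fit(X_train, y_train)
    margins = svm.decision_function(X_calib).reshape(-1, 1)
    platt = LogisticRegression().fit(margins, y_calib)

    def predict_proba(X):
        return platt.predict_proba(svm.decision_function(X).reshape(-1, 1))

    return predict_proba
```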
9
Summary
Now we know these features don't work…
SVM wasn't a wise choice given the large amount of data and the high noise level…
10
Future?
- Parsing
- Logistic regression instead of SVM