
Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung.


1 Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung

2 Problem
- Given a Chinese novel, extract the names of people and locations
- Different from English NER: no whitespace within sentences, no capitalization
- Can use other characteristics since the domain is limited

3 System Outline
- Extract bigrams, trigrams, and quadrigrams from text
- Run logistic regression on extracted features to learn feature weights
- Use weights to compute a score for each n-gram
- Apply thresholding to limit the number of guessed names
- Use word lists from word segmenter and dictionary
- Compare output list to correct list for F1 score
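The first step of the outline, collecting candidate n-grams from unsegmented text, can be sketched as below. This is a minimal illustration and not the authors' code: the function name `extract_ngrams`, the `BREAKS` punctuation set, and the choice to never let a candidate cross punctuation or whitespace are all assumptions.

```python
from collections import Counter

# Assumed delimiter set: candidates are not allowed to span these characters.
BREAKS = set(",。!?、;:“”‘’《》()…,.!?;:\"'() \n\t")

def extract_ngrams(text, sizes=(2, 3, 4), breaks=BREAKS):
    """Count character bigrams, trigrams, and 4-grams inside each span."""
    counts = Counter()
    start = 0
    for end in range(len(text) + 1):
        # Close the current span at a delimiter or at the end of the text.
        if end == len(text) or text[end] in breaks:
            span = text[start:end]
            for n in sizes:
                for i in range(len(span) - n + 1):
                    counts[span[i:i + n]] += 1
            start = end + 1
    return counts

if __name__ == "__main__":
    sample = "贾宝玉笑道,贾宝玉来了。"
    for ngram, count in extract_ngrams(sample).most_common(5):
        print(ngram, count)
```

Because Chinese text has no whitespace between words, every character window is a candidate; the later scoring and thresholding steps are what narrow the list down.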

4 Features
- N-gram and segmented word counts
- Ratio of count of n-gram to (n-1)-gram
- Transliterated characters
- Prefixes and suffixes
- Segmented words and dictionary
- Mutual information
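Two of the listed features lend themselves to a short sketch. The slides do not say which (n-1)-gram the ratio uses or which form of mutual information was computed, so the code below assumes the leading (n-1)-gram for the ratio and pointwise mutual information over bigrams; the function names are hypothetical.

```python
import math
from collections import Counter

def count_ratio(ngram, counts):
    """count(n-gram) / count(leading (n-1)-gram); 0.0 if the prefix is unseen."""
    prefix_count = counts.get(ngram[:-1], 0)
    return counts.get(ngram, 0) / prefix_count if prefix_count else 0.0

def pmi(bigram, char_counts, bigram_counts):
    """Pointwise mutual information: log2(p(xy) / (p(x) * p(y)))."""
    if bigram_counts[bigram] == 0:
        return float("-inf")  # unseen bigram: no evidence of association
    n_chars = sum(char_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_xy = bigram_counts[bigram] / n_bigrams
    p_x = char_counts[bigram[0]] / n_chars
    p_y = char_counts[bigram[1]] / n_chars
    return math.log2(p_xy / (p_x * p_y))

if __name__ == "__main__":
    text = "abab"
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    print(pmi("ab", chars, bigrams))
    print(count_ratio("abc", {"ab": 2, "abc": 1}))
```

A ratio near 1 suggests the last character almost always follows the prefix, which is the kind of signal a recurring name would produce.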

5 Thresholding
- Otsu’s method:
  - Often used in image processing
  - Separates data into two classes, minimizing the variance within the classes
  - Does not depend on training data
- F1 Maximization
  - Find the threshold on training data that maximizes F1 score
  - Use same threshold on test data
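Both thresholding strategies can be sketched on a plain list of n-gram scores. This is an illustrative version under assumptions: Otsu's method is applied directly to the sorted scores rather than to a histogram, the F1 search only tries the observed training scores as cutoffs, and an n-gram is guessed to be a name when its score is at or above the cutoff. The function names are hypothetical.

```python
def otsu_threshold(scores):
    """Return the cut that minimizes the summed within-class variance.

    Needs no labels. Returns None if the scores cannot be split in two.
    """
    xs = sorted(scores)
    n = len(xs)
    total, total_sq = sum(xs), sum(x * x for x in xs)
    best_cut, best_within = None, float("inf")
    left, left_sq = 0.0, 0.0
    for k in range(1, n):  # xs[:k] is the low class, xs[k:] the high class
        left += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        if xs[k] == xs[k - 1]:
            continue  # no cutoff can separate two equal scores
        right, right_sq = total - left, total_sq - left_sq
        # Sum of squared deviations per class (class size times variance).
        within = (left_sq - left * left / k) + (right_sq - right * right / (n - k))
        if within < best_within:
            best_within, best_cut = within, (xs[k - 1] + xs[k]) / 2
    return best_cut

def f1_at(threshold, scores, labels):
    """F1 of the guess 'score >= threshold means name' against boolean labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    if tp == 0:
        return 0.0
    precision = tp / sum(predicted)
    recall = tp / sum(1 for y in labels if y)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(scores, labels):
    """Pick the training score that maximizes F1 when used as the cutoff."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))

if __name__ == "__main__":
    train_scores = [0.1, 0.4, 0.35, 0.8]
    train_labels = [False, True, False, True]
    print(otsu_threshold(train_scores))
    print(best_f1_threshold(train_scores, train_labels))
```

The trade-off matches the slide: the Otsu cut can be computed on any text without labels, while the F1-maximizing cut is tuned on training data and then reused unchanged on test data.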

6 Results
- No validation set, so chose a baseline set
- Ablation tests show that the baseline chosen was non-optimal

F1 score   Precision   Recall
0.4809     0.4117      0.5779
0.5714     0.6750      0.4954

- Best individual scores: Recall 0.6605, Precision 0.6750
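As a quick consistency check, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the precision and recall beside it. The helper name `f1_score` is hypothetical; the input pairs are the two precision and recall pairs from the slide, and the recomputed values agree with the reported F1 scores up to rounding in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # (precision, recall) pairs as reported on the slide.
    for p, r in [(0.4117, 0.5779), (0.6750, 0.4954)]:
        print(f"P={p:.4f} R={r:.4f} F1={f1_score(p, r):.4f}")
```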

7 Conclusion
- Most useful features:
  - N-gram counts / frequency ratios (0.46 F1 alone)
  - Varies depending on type of n-gram
- Thresholding
  - Otsu’s method yielded better overall performance
  - Both methods had drawbacks
- Future work
  - More rigorous feature set testing
  - Larger / cleaner data sets

