Name Extraction from Chinese Novels
CS224n, Spring 2008
Jing Chen and Raylene Yung
Problem
  Given a Chinese novel, extract the names of people and locations
  Different from English NER: no whitespace within sentences, no capitalization
  Can use other characteristics since the domain is limited
System Outline
  Extract bigrams, trigrams, and quadrigrams from text
  Run logistic regression on extracted features to learn feature weights
  Use weights to compute a score for each n-gram
  Apply thresholding to limit the number of guessed names
  Use word lists from word segmenter and dictionary
  Compare output list to correct list for F1 score
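The first and last steps of the outline can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `extract_ngrams` and `f1_score`, the punctuation set, and the choice to never let a candidate span punctuation are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: candidate names never cross punctuation or whitespace,
# so the text is split on these characters before counting n-grams.
_BREAKS = re.compile(r"[\s，。！？、；：“”‘’（）《》,.!?;:()]+")


def extract_ngrams(text, sizes=(2, 3, 4)):
    """Count character n-grams of the given sizes inside each punctuation-free span."""
    counts = Counter()
    for span in _BREAKS.split(text):
        for n in sizes:
            for i in range(len(span) - n + 1):
                counts[span[i:i + n]] += 1
    return counts


def f1_score(guessed, gold):
    """F1 of a guessed name list against the correct list, compared as sets."""
    guessed, gold = set(guessed), set(gold)
    true_positives = len(guessed & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(guessed)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    counts = extract_ngrams("贾宝玉笑道，贾宝玉来了。")
    print(counts["贾宝玉"])
    print(f1_score(["贾宝玉", "笑道"], ["贾宝玉", "林黛玉"]))
```

Counting over characters rather than whitespace-delimited tokens reflects the point made on the Problem slide: the text has no word boundaries to rely on.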
Features
  N-gram and segmented word counts
  Ratio of count of n-gram to (n-1)-gram
  Transliterated characters
  Prefixes and suffixes
  Segmented words and dictionary
  Mutual information
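Two of the listed features, the count ratio and mutual information, could be computed roughly as below. The slides do not give exact definitions, so this sketch assumes the ratio is taken against the n-gram's leading (n-1)-gram and that mutual information is the pointwise form between that prefix and the final character, with probabilities estimated as count divided by text length. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngram_counts(text, max_n=4):
    """Count every character n-gram of length 1..max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def count_ratio(ngram, counts):
    """Ratio of the n-gram count to the count of its leading (n-1)-gram (assumed definition)."""
    prefix_count = counts[ngram[:-1]]
    return counts[ngram] / prefix_count if prefix_count else 0.0


def pmi(ngram, counts, total_chars):
    """Pointwise mutual information log(p(xy) / (p(x) p(y))).

    Assumption: x is the leading (n-1)-gram, y is the last character,
    and each probability is estimated as count / total_chars.
    """
    joint = counts[ngram]
    left = counts[ngram[:-1]]
    right = counts[ngram[-1]]
    if not (joint and left and right):
        return float("-inf")
    return math.log(joint * total_chars / (left * right))


if __name__ == "__main__":
    text = "宝玉说宝玉"
    counts = char_ngram_counts(text)
    print(count_ratio("宝玉", counts), count_ratio("玉说", counts))
    print(pmi("宝玉", counts, len(text)))
```

A ratio close to 1 means the prefix is almost always followed by the same character, which is the kind of cohesion one would expect from a name.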
Thresholding
  Otsu’s method
    Often used in image processing
    Separates data into two classes, minimizing the variance within the classes
    Does not depend on training data
  F1 maximization
    Find the threshold on training data that maximizes F1 score
    Use same threshold on test data
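Both thresholding strategies can be illustrated on a plain list of n-gram scores. The sketch below is an assumption-laden simplification: the Otsu variant searches every split of the sorted scores exactly instead of using the histogram bins common in image processing, and `otsu_threshold` and `best_f1_threshold` are hypothetical names, not the authors' functions.

```python
def _sum_squared_dev(values):
    """Sum of squared deviations from the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def otsu_threshold(scores):
    """Pick the cut that minimizes the weighted within-class variance.

    With class weights n_k / n, the weighted within-class variance equals the
    summed squared deviation of both classes divided by n, so minimizing the
    summed squared deviation gives the same cut. No labels are needed.
    """
    xs = sorted(scores)
    best_cut, best_cost = None, float("inf")
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # a cut cannot separate equal scores
        cost = _sum_squared_dev(xs[:k]) + _sum_squared_dev(xs[k:])
        if cost < best_cost:
            best_cost = cost
            best_cut = (xs[k - 1] + xs[k]) / 2
    if best_cut is None:
        raise ValueError("need at least two distinct scores")
    return best_cut


def best_f1_threshold(scores, labels):
    """Return (threshold, F1) maximizing F1 on labeled training scores.

    An n-gram is guessed to be a name when its score is >= threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]
    labels = [False, False, False, True, True, True]
    print(otsu_threshold(scores))
    print(best_f1_threshold(scores, labels))
```

The contrast between the two functions mirrors the slide: `otsu_threshold` looks only at the score distribution, while `best_f1_threshold` needs labeled training data and its result is then reused unchanged on test data.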
Results
  No validation set, so chose a baseline set
  Ablation tests show that the baseline chosen was non-optimal
  Best individual scores:
    [Tables: F1 score, precision, recall; values not recovered]
    [Figure: recall and precision]
Conclusion
  Most useful features
    N-gram counts / frequency ratios (0.46 F1 alone)
    Varies depending on type of n-gram
  Thresholding
    Otsu’s method yielded better overall performance
    Both methods had drawbacks
  Future work
    More rigorous feature set testing
    Larger / cleaner data sets