Document Quality Judgment with Textual Features. Bing Bai, Computer Science Department, Rutgers University, December 2003
Document Qualities: Not relevance. Also important in information retrieval systems. Partially dependent on textual features: document length; “Coward”.
Document Qualities (Continued): Pre-defined qualities: Accuracy, Credibility, Depth, Grammar Correctness, Objectivity, Multi-side, Readability, Source Authority, Verbose-Concise.
Textual Features: Statistics by GATE. Categories of features: Punctuation (number of periods, question marks, exclamation marks, …); Symbol (number of dollar signs, percent signs, plus signs, …); Length (average paragraph length in words; length of title, subtitle, …); Upper Case (number of all-upper-case words, number of words with the first letter capitalized, …).
Textual Features (Continued): Quotation (average quotation length); Key Terms (number of occurrences of the words "say", "seem", and "expert"); Unique Words (number of unique words, excluding stop words, …); POS (number of tokens, proper nouns, personal pronouns, …); Entities (number of persons, locations, organizations, and dates, …).
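A few of the surface features listed above can be sketched in plain Python. This is only an illustration: the slides state that the statistics were produced with GATE, whereas the sketch below uses regular expressions and string counts instead. The function name `textual_features`, the feature keys, and the tiny `STOP_WORDS` set are hypothetical and not taken from the original work.

```python
import re

# Hypothetical stop-word list for illustration; the slides do not say which list was used.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def is_all_upper(word):
    """True for words of two or more characters written entirely in upper case."""
    return len(word) > 1 and word.isupper()


def textual_features(text):
    """Count a few surface features of the kinds listed on the slides."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    return {
        "periods": text.count("."),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "dollar_signs": text.count("$"),
        "percent_signs": text.count("%"),
        "all_upper_words": sum(1 for w in words if is_all_upper(w)),
        # First letter capital, but not an all-upper-case word.
        "capitalized_words": sum(
            1 for w in words if w[0].isupper() and not is_all_upper(w)
        ),
        "key_terms": sum(lowered.count(t) for t in ("say", "seem", "expert")),
        "unique_words": len(set(lowered) - STOP_WORDS),
    }


sample = "The expert did not say much. Is the NASA budget $5 or 5%? It does seem low!"
features = textual_features(sample)
```

POS and entity counts are left out because they need a tagger and a named-entity recognizer, which is the part GATE provides.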
Data Set and Testing Scheme: More than 2000 documents from 3 different article sources: CNS, TREC, and XinHua News Agency. The nine qualities of these documents were judged by faculty, professionals, and students. Three qualities (“Depth”, “Multi-side”, “Objectivity”) showed the strongest correlations with the textual features we defined. 2-fold cross-validation, repeated 5 times; the training set and testing set are generated randomly each time.
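The testing scheme can be sketched as follows, assuming that "2-fold cross-validation, 5 times" means five random half/half splits in which each half serves once for training and once for testing (ten runs in total). The function name `repeated_two_fold` and the fixed seed are assumptions made for the sake of a reproducible example.

```python
import random


def repeated_two_fold(n_docs, repeats=5, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated 2-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    indices = list(range(n_docs))
    half = n_docs // 2
    for _ in range(repeats):
        rng.shuffle(indices)  # a fresh random split on every repeat
        fold_a, fold_b = indices[:half], indices[half:]
        # Each half serves once as the training set and once as the test set.
        yield fold_a, fold_b
        yield fold_b, fold_a


splits = list(repeated_two_fold(n_docs=10))
```

With ten documents this produces ten train/test pairs, and in each pair the two index sets are disjoint and together cover every document.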
Results
Classifier | Depth (1119/894) | Multi-side (1038/975) | Objectivity (995/1018)
J48 | 62.6/52.6/58.2± | /56.1/58.3± | /51.6/51.8±1.1
NB | 81.6/42.4/64.2± | /43.7/61.8± | /59.2/53.4±1.0
SMO | 81.5/45.6/65.5± | /56.6/67.7± | /61.7/56.8±0.66
LR | 74.4/51.1/64.0± | /60.8/65.8± | /59.3/57.3±2.8
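One way to read each result cell, offered here as an assumption rather than something the slides state, is accuracy on the first class / accuracy on the second class / overall accuracy, with the counts in parentheses being the two class sizes. Under that reading the overall figure is the size-weighted average of the two per-class figures, which the NB entry for Depth reproduces:

```python
def overall_accuracy(acc_first, acc_second, n_first, n_second):
    """Size-weighted average of two per-class accuracies (in percent)."""
    return (acc_first * n_first + acc_second * n_second) / (n_first + n_second)


# NB on Depth: 81.6 / 42.4 with class sizes 1119 / 894 (values taken from the table).
nb_depth = overall_accuracy(81.6, 42.4, 1119, 894)
print(round(nb_depth, 1))  # 64.2, matching the third number in that cell
```

The J48 and LR entries for Depth agree with this reading to one decimal place as well. The SMO entry comes out about 0.1 higher than the reported value, which may simply reflect rounding of figures averaged over the runs.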
Factor Analysis: Purpose: viewing 112 variables is hard; data reduction allows us to concentrate on the most important factors of the data. Figure: distributions of two qualities over factor 1 and factor 2; “Depth” on the left, “Multi-side” on the right.
Gaussian-Bayesian Classifier: If P(x|C1)P(C1) > P(x|C2)P(C2), classify x as class I; otherwise classify x as class II. Here P(x|Ci) is the Gaussian class-conditional density of class i and P(Ci) is its prior. Singularity elimination: discard the trivial eigenvalues and their eigenvectors.
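The decision rule and the singularity elimination can be sketched with NumPy as below. Several details are assumptions, since the slides do not give them: the eigen-decomposition is applied to the covariance of the pooled data, the relative cutoff `tol` is hypothetical, the priors are estimated from class frequencies, and the comparison is done in log space for numerical stability. All function names are illustrative.

```python
import numpy as np


def nontrivial_basis(X, tol=1e-8):
    """Eigenvectors of the covariance of X whose eigenvalues are not negligible."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigvecs[:, eigvals > tol * eigvals.max()]  # tol is a hypothetical cutoff


def fit_class(Z):
    """Mean and covariance of one class in the reduced space."""
    return Z.mean(axis=0), np.atleast_2d(np.cov(Z, rowvar=False))


def log_gaussian(z, mean, cov):
    """Log density of a multivariate Gaussian at z."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (maha + logdet + len(z) * np.log(2 * np.pi))


def classify(x, basis, center, params1, prior1, params2, prior2):
    """Return 1 if P(x|C1)P(C1) > P(x|C2)P(C2), else 2 (compared in log space)."""
    z = (x - center) @ basis
    score1 = log_gaussian(z, *params1) + np.log(prior1)
    score2 = log_gaussian(z, *params2) + np.log(prior2)
    return 1 if score1 > score2 else 2


# Toy data: the third feature duplicates the first, so the raw covariance is singular.
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
B = rng.normal([4.0, 4.0], 1.0, size=(200, 2))
X1 = np.column_stack([A, A[:, 0]])
X2 = np.column_stack([B, B[:, 0]])

X = np.vstack([X1, X2])
center = X.mean(axis=0)
basis = nontrivial_basis(X)  # keeps 2 of the 3 eigen-directions
params1 = fit_class((X1 - center) @ basis)
params2 = fit_class((X2 - center) @ basis)
prior1 = len(X1) / len(X)
prior2 = len(X2) / len(X)

label_near_1 = classify(np.array([0.0, 0.0, 0.0]), basis, center, params1, prior1, params2, prior2)
label_near_2 = classify(np.array([4.0, 4.0, 4.0]), basis, center, params1, prior1, params2, prior2)
```

Projecting onto the retained eigenvectors before fitting the per-class Gaussians keeps the covariance matrices invertible in this toy case; a class whose own covariance is still singular in the reduced space would need further treatment that this sketch does not attempt.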
GBC Results
GBC (Continued): The Gaussian boundary is not as good as the linear boundary (Logistic Regression and Support Vector Machine). One reason: the distributions are not Gaussian. Figure: distributions of feature NN; (a) documents with low objectivity, (b) documents with high objectivity.