A Probabilistic Model for Classification of Multiple-Record Web Documents June Tang Yiu-Kai Ng
Overview Probabilistic Model – Bayes decision theory – Document and query representations – Ranking-function construction Multivariant Statistical Analysis
Approach Constructing a rank function for a probabilistic model based on multivariant statistical analysis Minimizing expected cost of misclassification Deriving a classification rule Deriving a linear classification rule Deriving a sample linear classification rule
Application Ontology
Document Representation (Year, Make, Model, Mileage, Price, Feature, PhoneNr) Total records: 60 (Year:62) (Make:58) (Model:48) (Mileage:12) (Price:58) (Feature:49) (PhoneNr:33) (62,58,48,12,58,49,33) (1.03,0.97,0.80,0.20,0.97,0.82,0.55)
Elementary Concepts Variables are things that we measure, control, or manipulate in research Multi-variant analysis considers multiple variables together as a single unit Normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"
Multivariant Statistical Analysis Let A be an application ontology D be a set of Web documents R be a set of relevant documents R be a set of irrelevant document X = (X1, X2, …, Xp) represent a document be the set of all possible values on which X can take = 1 2
Expected Cost of Misclassification (ECM) Here, Two density functions f1 and f2
Classification Rule
Multivariate Normal Density Functions Where Assume that density functions are normal
Document x is classified as relevant if Linear Classification Rule Assume that density functions are normal and 1, 2, and are equal
Linear Discrimination Function Threshold: ?
Parameter Estimations Suppose we have n1 relevant documents and n2 irrelevant documents Such that n1+n2>=p and p is the dimension of vector x
Parameter Estimations (Cont.)
Sample Classification Rule Document x is classified as relevant if
Misclassification Probabilities Lachenbruch’s “holdout” procedure where
Precision Measure
Experimental Result (Relevant)
Experimental Result (Irrelevant)
Conclusion Precision: 85% (VSM: 77.5%) Multivariant Statistical Analysis Extendibility to Multiple Categorization Classification