Download presentation
Presentation is loading. Please wait.
1
A Probabilistic Model for Classification of Multiple-Record Web Documents June Tang Yiu-Kai Ng
2
Overview Probabilistic Model – Bayes decision theory – Document and query representations – Ranking-function construction Multivariant Statistical Analysis
3
Approach Constructing a rank function for a probabilistic model based on multivariant statistical analysis Minimizing expected cost of misclassification Deriving a classification rule Deriving a linear classification rule Deriving a sample linear classification rule
4
Application Ontology
5
Document Representation (Year, Make, Model, Mileage, Price, Feature, PhoneNr) Total records: 60 (Year:62) (Make:58) (Model:48) (Mileage:12) (Price:58) (Feature:49) (PhoneNr:33) (62,58,48,12,58,49,33) (1.03,0.97,0.80,0.20,0.97,0.82,0.55)
6
Elementary Concepts Variables are things that we measure, control, or manipulate in research Multi-variant analysis considers multiple variables together as a single unit Normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"
7
Multivariant Statistical Analysis Let A be an application ontology D be a set of Web documents R be a set of relevant documents R be a set of irrelevant document X = (X1, X2, …, Xp) represent a document be the set of all possible values on which X can take = 1 2
8
Expected Cost of Misclassification (ECM) Here, Two density functions f1 and f2
9
Classification Rule
10
Multivariate Normal Density Functions Where Assume that density functions are normal
11
Document x is classified as relevant if Linear Classification Rule Assume that density functions are normal and 1, 2, and are equal
12
Linear Discrimination Function Threshold: ?
13
Parameter Estimations Suppose we have n1 relevant documents and n2 irrelevant documents Such that n1+n2>=p and p is the dimension of vector x
14
Parameter Estimations (Cont.)
15
Sample Classification Rule Document x is classified as relevant if
16
Misclassification Probabilities Lachenbruch’s “holdout” procedure where
17
Precision Measure
18
Experimental Result (Relevant)
19
Experimental Result (Irrelevant)
20
Conclusion Precision: 85% (VSM: 77.5%) Multivariant Statistical Analysis Extendibility to Multiple Categorization Classification
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.