A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model
Quan Wang
Department of Computer Science, Brigham Young University
November 2001
6/11/2015
Overview
Probabilistic Retrieval Model
–Application ontology
–Document representations
–Ranking documents based on logistic regression analysis
Experimental Results
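Application Ontology
[Figure: car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Participation constraints shown on the diagram: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]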
Document Representation
A set of pairs $A_1 : x_1, \ldots, A_n : x_n$.
A density heuristic value $y$.
A grouping heuristic value $z$.
Document $d$ is represented as $(x_1, \ldots, x_n, y, z)$, or $(V, y, z)$ with $V = (x_1, \ldots, x_n)$.
Independence Assumption
$P(R \mid x_1, \ldots, x_n, y, z)$
Under the independence assumption, this is broken down into the individual factors
$P(R \mid x_1) * \cdots * P(R \mid x_n) * P(R \mid y) * P(R \mid z)$
Logistic Regression
[Figure: two scatter plots, $P(R \mid x)$ against $x$ and $P(R \mid x_i)$ against $x_i$.]
$P(R \mid x) = \frac{1}{1 + \exp(-(C_0 + C_1 x))}$
$\ln O(R \mid x) = C_0 + C_1 x$
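The two formulas on this slide describe the same model: taking the log-odds of the logistic probability gives back the linear term. A minimal Python sketch of that relationship follows. The coefficient values `c0` and `c1` and the input `x` are made up for illustration and are not fitted values from the thesis.

```python
import math


def prob_relevant(x, c0, c1):
    """Logistic model: P(R|x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p):
    """Log-odds of a probability: ln O = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and input, chosen only for illustration.
c0, c1 = -2.0, 4.0
x = 0.8

p = prob_relevant(x, c0, c1)

# The log-odds of the logistic probability recover the linear term c0 + c1 * x.
assert abs(log_odds(p) - (c0 + c1 * x)) < 1e-9
```

Working on the log-odds scale keeps the model linear in `x`, which is what lets the coefficients be estimated by regression.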
Probabilistic Retrieval Based on Logistic Regression Analysis
Data processing
Data analysis
Probabilistic retrieval on car-ads application ontology
Correlation relations
Data Processing
The corresponding normalized vector $V' = (x_1', \ldots, x_n')$ is computed as
$V' = \frac{|V|}{|u|} V$
where $V$ is a document vector and $u$ is an ontology vector.
Data Distributions
[Figure: scatter plots of the data distributions.]
Logistic Regression-1
Logistic Regression-2
[Figure: regression output, with the regression coefficients and p-values labeled.]
Statistical Information: P-Value
A p-value is a significance indicator.
A large p-value indicates either a bad regression model or a statistically insignificant index term.
We should keep only significant index terms.
Select Important Index Terms
The car-ads application ontology

          Features  PhoneN  Density  Grouping
P-value   .001      .034    .052     .012

          Year  Make  Model  Mileage  Price
P-value   .679  .002  .074   .002     .001

Double S-curve
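Selecting index terms by p-value can be sketched as a simple filter. The p-values below are the ones listed on this slide. The 0.05 cutoff is an assumption made for illustration, since the slides do not state which threshold was used.

```python
# P-values as listed on the slide for the car-ads application ontology.
p_values = {
    "Year": 0.679,
    "Make": 0.002,
    "Model": 0.074,
    "Mileage": 0.002,
    "Price": 0.001,
    "Features": 0.001,
    "PhoneN": 0.034,
    "Density": 0.052,
    "Grouping": 0.012,
}


def select_terms(p_values, alpha):
    """Keep the index terms whose p-value is below the cutoff alpha."""
    return [term for term, p in p_values.items() if p < alpha]


# ASSUMPTION: 0.05 is a conventional cutoff, not one stated in the slides.
significant = select_terms(p_values, alpha=0.05)
print(significant)
# ['Make', 'Mileage', 'Price', 'Features', 'PhoneN', 'Grouping']
```

Under that assumed cutoff, Year, Model, and Density would be dropped. A different threshold would change the selection, most visibly for the borderline Density value.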
Probabilistic Retrieval Model
$\ln O(R \mid x_i)$, $\ln O(R \mid y)$, $\ln O(R \mid z)$
> 0: relevant
< 0: irrelevant
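One way to read this slide is as a sign test on the log-odds. The sketch below assumes the individual log-odds are summed into a single score before the test. The slide does not show the combining step, so the summation, the handling of a score of exactly zero, and the example values are all assumptions.

```python
def classify(log_odds_terms):
    """Label a document by the sign of its combined log-odds.

    ASSUMPTION: the per-feature log-odds are combined by summation,
    and a score of exactly zero is treated as irrelevant.
    """
    score = sum(log_odds_terms)
    return "relevant" if score > 0 else "irrelevant"


# Hypothetical values standing in for ln O(R|x_i), ln O(R|y), ln O(R|z).
print(classify([1.2, -0.4, 0.9, 0.3]))    # positive total
print(classify([-1.5, 0.2, -0.6, -0.1]))  # negative total
```

A log-odds of zero corresponds to a probability of 0.5, so the sign test is the same as asking whether relevance is more likely than not.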
Correlation Relations
Correlation: there are strong positive correlations among document properties (e.g., Death Date and Birth Date in obituaries).
Correlations are extra information implicitly contained in a document.
Correlation relations handle "patterns", e.g., the Birth Date-Death Date pair appearing in the obituaries application ontology.
Special Web Documents
Multiple-record Web documents
Similar content and format (e.g., items for sale)
Same lexical object values (e.g., Honda makes cars and motorcycles)
8 documents (motorcycle, boat, snowmobile, bicycle) for car-ads, and 5 documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for obituaries.
Experimental Results

            Car-ads   Obituary
Recall      100%
Precision   83.3%*    83.3%
Accuracy    92.9%     92.0%

*Ten out of eighteen negative documents are specially selected.
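Recall, precision, and accuracy all follow from the four confusion-matrix counts. As a worked check, the sketch below uses counts inferred from the car-ads figures (18 negative documents, 100% recall, 83.3% precision). The individual counts are not given on the slide, so `tp`, `fp`, `fn`, and `tn` are assumptions chosen to be consistent with the reported percentages.

```python
def metrics(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# ASSUMPTION: inferred counts, not reported ones.
# 18 negatives = 2 false positives + 16 true negatives; no positives missed.
recall, precision, accuracy = metrics(tp=10, fp=2, fn=0, tn=16)
print(f"recall={recall:.1%} precision={precision:.1%} accuracy={accuracy:.1%}")
# recall=100.0% precision=83.3% accuracy=92.9%
```

With these counts, 10 of the 12 documents labeled relevant are truly relevant and 26 of 28 documents are labeled correctly, which reproduces the 83.3% and 92.9% figures.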
Conclusions
We propose a probabilistic model that is suitable for classifying multiple-record Web documents.
The model's performance on a randomly chosen test document set could be better than the results we present in the thesis.