7/16/2015 Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science, Brigham Young University. Q Wang, November 2000
Multiple-Record Web Documents-1
Acura Integra 1990 $4,000 (1/27/00) ACURA'90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302)
Acura Integra 1992 $5,900 (1/27/00) ACURA'92 Integra RS, white, excellent condition. $5,
Relevant document: a chunk of Car-sale Ads
Multiple-Record Web Documents-2
'97 HONDA ACE SHADOW 1100cc 4k. Customized. $7.5K/obo
'97 HONDA CR250 Exc. cond. $3300/OBO. (410)
Irrelevant document: a chunk of Motorcycle Ads
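Application Ontology
[Figure: Car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Participation constraints labeled in the figure, in extraction order: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1]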
Document Representation
A set of pairs $A_1{:}x_1, \ldots, A_n{:}x_n$
A density heuristic value
A grouping heuristic value
$P(R \mid d) \rightarrow P(R \mid (x_1, \ldots, x_n)),\ P(R \mid \text{Density}),\ P(R \mid \text{Grouping})$
Independence Assumption
$P(R \mid (\text{Year}, \ldots, \text{Make})) \rightarrow P(R \mid \text{Year}), \ldots, P(R \mid \text{Make})$ under the independence assumption
Logistic Regression
[Figure: Input from training-set data (a Prob. column and an index-term column such as Make) is passed to a logistic regression package. Output: $C_0$, $C_1$, and a p-value.]
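The slide treats the logistic regression package as a black box. As a rough illustration of what such a package estimates, the sketch below fits $C_0$ and $C_1$ for a single index term by plain gradient descent on the log-loss. The function names, learning rate, step count, and the training values are all hypothetical, and unlike a real statistics package this sketch does not compute a p-value.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit P(R|x) = sigmoid(c0 + c1 * x) by batch gradient descent.

    xs: term values for one index term (one value per training document)
    ys: relevance labels (1 = relevant, 0 = irrelevant)
    lr and steps are illustrative defaults, not values from the slides.
    """
    c0, c1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(c0 + c1 * x) - y  # gradient of log-loss w.r.t. the logit
            g0 += err
            g1 += err * x
        c0 -= lr * g0 / n
        c1 -= lr * g1 / n
    return c0, c1


# Hypothetical training data: larger x tends to go with relevant documents.
xs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
c0, c1 = fit_logistic(xs, ys)
print(f"C0 = {c0:.3f}, C1 = {c1:.3f}")
```

In practice a statistics package would be preferable, since it also reports the significance of each coefficient.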
Probability Estimation
For a test document, the term frequency of index term Make is $x_{\text{Make}} = \ldots$
$P(R \mid \text{Make}) = \dfrac{1}{1+\exp(-(C_0 + C_1 x_{\text{Make}}))} = \dfrac{1}{1+\exp(-(8.358 + (\ldots \times \ldots)))} = \ldots$
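The estimation step can be sketched as a small function that evaluates the logistic formula for one index term. Only $C_0 = 8.358$ comes from the slide; the values used here for $C_1$ and $x_{\text{Make}}$ are hypothetical stand-ins for the numbers lost from the slide, and the function name is an assumption.

```python
import math


def relevance_probability(c0: float, c1: float, x: float) -> float:
    """Evaluate P(R|x) = 1 / (1 + exp(-(c0 + c1 * x))) in a numerically stable way."""
    z = c0 + c1 * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


C0_MAKE = 8.358   # from the slide
C1_MAKE = -1.6    # hypothetical: the slide's value is not recoverable
X_MAKE = 0.25     # hypothetical term frequency for a test document

p_make = relevance_probability(C0_MAKE, C1_MAKE, X_MAKE)
print(f"P(R|Make) = {p_make:.6f}")
```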
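Probability Fitting Curve
[Figure: training points plotted with $x$ on the horizontal axis and $P(R \mid x)$ on the vertical axis, a sample point $(x_i, P(R \mid x_i))$ marked, and the fitted curve overlaid.]
$P(R \mid x) = \dfrac{1}{1+\exp(-(C_0 + C_1 x))}$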
Relevance Probability Calculation
For a Car Sale document in a test set, we have
Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr]
$C_0$ = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
$C_1$ = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
$X$ = [.26, .25, .14, .07, .23, .84, .26, .15, .33]
$I$ = [1, 1, 1, 1, 1, 1, 1, 1, 1]
$Y = C_0 I^T + C_1 X^T = \ldots$
$P(R \mid d) = \dfrac{1}{1+\exp(-Y)} \approx 1$
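The calculation on this slide can be reproduced with a few lines of arithmetic. The sketch below makes two assumptions: the eighth entry of $C_1$, which is garbled in the slide text, is read as -10.1 so that all vectors have nine entries, and the final step uses the logistic form $1/(1+\exp(-Y))$ from the Probability Estimation slide.

```python
import math

INDEX = ["Ye", "Ma", "Mo", "Mi", "Pr", "Fe", "Ph", "De", "Gr"]
C0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
# Assumption: the garbled eighth entry is read as -10.1.
C1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
X = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

assert len(INDEX) == len(C0) == len(C1) == len(X)

# Y = C0 * I^T + C1 * X^T, where I is a vector of ones.
y = sum(C0) + sum(c1 * x for c1, x in zip(C1, X))

# Y is large and positive here, so exp(-y) cannot overflow.
p_relevant = 1.0 / (1.0 + math.exp(-y))

print(f"Y = {y:.3f}")
print(f"P(R|d) = {p_relevant:.6f}")
```

Under these assumptions $Y$ comes out large and positive, so the probability is 1 to within floating-point precision, which agrees with the result shown on the slide.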
Statistical Information: P-Value
A p-value is a significance indicator. A large p-value indicates either a bad regression model or a statistically insignificant index term. We should keep only significant index terms.
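Keeping only significant index terms amounts to a filter over the p-values reported by the regression package. The sketch below assumes a cutoff of 0.05, which the slides do not state, and uses made-up p-values; the function name is also hypothetical.

```python
def significant_terms(p_values: dict, alpha: float = 0.05) -> list:
    """Return the index terms whose p-value is below alpha.

    alpha = 0.05 is a conventional default, not a value from the slides.
    """
    return [term for term, p in p_values.items() if p < alpha]


# Hypothetical p-values for three index terms.
p_values = {"Make": 0.001, "Model": 0.003, "PhoneNr": 0.42}
kept = significant_terms(p_values)
print(kept)
```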
Dependent Relations
Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together. Performance can be improved by including significant dependent relations in the relevance probability calculation.
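Estimation of relevance probability-2
[Figure: $P(R \mid \text{Density})$, $P(R \mid \text{Grouping})$, $P(R \mid \text{Year})$, ..., $P(R \mid \text{Feature})$, and $P(R \mid \text{Correlation-1})$, ..., $P(R \mid \text{Correlation-}n)$ are combined by multiplication to give $P(R \mid d)$.]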
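The multiplication step can be sketched as a product over the component probabilities. The component names follow the slide, while the numeric values and the function name are hypothetical.

```python
import math


def combine_relevance(component_probs: dict) -> float:
    """Multiply component probabilities to obtain P(R|d)."""
    for name, p in component_probs.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} is not a probability: {p}")
    return math.prod(component_probs.values())


# Hypothetical component probabilities for one test document.
components = {
    "Density": 0.9,
    "Grouping": 0.8,
    "Year": 0.95,
    "Feature": 0.9,
    "Correlation-1": 0.85,
}
p_rd = combine_relevance(components)
print(f"P(R|d) = {p_rd:.4f}")
```

Because each factor is at most 1, every added component can only lower the product, so a single weak component strongly affects the result.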
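Comparison
Methods evaluated: VSM, VSM & Machine Learning, Probabilistic
Car Sale: Precision 100%; Recall 85.7% (VSM), 91% (VSM & Machine Learning), 100% (Probabilistic)
Obituary: Precision 100% (VSM), 91% (VSM & Machine Learning), 100% (Probabilistic); Recall 100%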
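For reference, precision and recall for a binary relevant/irrelevant classifier can be computed as in the sketch below. The counts are purely illustrative and are not taken from the experiments; they merely show how a figure such as 85.7% recall can arise (6 of 7 relevant documents found).

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. An empty denominator yields 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only: 6 relevant documents found, none wrongly
# accepted, 1 relevant document missed.
precision, recall = precision_recall(tp=6, fp=0, fn=1)
print(f"Precision = {precision:.1%}, Recall = {recall:.1%}")
```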
Contribution
We propose a probabilistic model that can accurately classify multiple-record Web documents. We will study the impact of dependent relations on the performance of our model.