Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Presentation transcript:

7/16/2015, Slide 1: Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science, Brigham Young University. Q Wang, November 2000.

Slide 2: Multiple-Record Web Documents-1
Acura Integra 1990 $4,000 (1/27/00) ACURA'90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302)
Acura Integra 1992 $5,900 (1/27/00) ACURA'92 Integra RS, white, excellent condition. $5,
Relevant document: a chunk of car-sale ads.

Slide 3: Multiple-Record Web Documents-2
'97 HONDA ACE SHADOW 1100cc 4k. Customized. $7.5K/obo
'97 HONDA CR250 Exc. cond. $3300/OBO. (410)
Irrelevant document: a chunk of motorcycle ads.

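Slide 4: Application Ontology
[Figure: Car application ontology with object sets Car, Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Participation-constraint labels on the diagram: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]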

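Slide 5: Document Representation
A document $d$ is represented by a set of pairs $A_1{:}x_1, \ldots, A_n{:}x_n$, a density heuristic value, and a grouping heuristic value. $P(R \mid d)$ is estimated from $P(R \mid (x_1, \ldots, x_n))$, $P(R \mid \text{Density})$, and $P(R \mid \text{Grouping})$.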

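Slide 6: Independence Assumption
Under the independence assumption, $P(R \mid (\text{Year}, \ldots, \text{Make}))$ is computed from the per-term probabilities $P(R \mid \text{Year}), \ldots, P(R \mid \text{Make})$.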

Slide 7: Logistic Regression
[Diagram: input from training set data (a Prob. column with values 1, 1, … and a Make column) is fed to a logistic regression package, whose output is $C_0$, $C_1$, and a P-value.]
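The slide treats the regression package as a black box. As a rough illustration of what such a package estimates, the sketch below fits $C_0$ and $C_1$ for a single index term by plain gradient ascent on the log-likelihood. The training values, the learning rate, and the function name `fit_logistic` are all hypothetical; a real package would typically use a faster solver and would also report the P-value, which this sketch does not compute.

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Estimate (c0, c1) of P(R|x) = 1/(1+exp(-(c0+c1*x))) by gradient ascent."""
    c0, c1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))
            g0 += y - p          # gradient of the log-likelihood w.r.t. c0
            g1 += (y - p) * x    # gradient of the log-likelihood w.r.t. c1
        c0 += lr * g0 / n
        c1 += lr * g1 / n
    return c0, c1

# Hypothetical training data: term-frequency values and relevance labels (1 = relevant).
xs = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

c0, c1 = fit_logistic(xs, ys)
p_low = 1.0 / (1.0 + math.exp(-(c0 + c1 * 0.0)))
p_high = 1.0 / (1.0 + math.exp(-(c0 + c1 * 0.35)))
```

With these made-up labels, relevance becomes more likely as the value grows, so the fitted slope is positive and `p_high` exceeds `p_low`.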

Slide 8: Probability Estimation
For a test document, the term frequency of index term Make is $x_{\text{Make}}$. Then $P(R \mid \text{Make}) = 1/(1+\exp(-(C_0 + C_1 x_{\text{Make}}))) = 1/(1+\exp(-(8.358 + (\ldots \times \ldots))))$.
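The logistic formula on this slide can be evaluated directly. The following is a minimal sketch; the function name and the coefficient and term-frequency values are hypothetical and are not the ones from the slide, whose numbers are incomplete in the transcript.

```python
import math

def relevance_probability(c0, c1, x):
    """P(R | x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Hypothetical coefficients and term frequency, for illustration only.
p = relevance_probability(c0=-2.0, c1=10.0, x=0.25)
print(round(p, 4))
```

When $C_0 + C_1 x = 0$ the function returns 0.5, which is the point where the model is undecided between relevant and irrelevant.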

Slide 9: Probability Fitting Curve
[Plot: observed points and the fitted curve of $P(R \mid x)$ against $x$, with a sample point $(x_i, P(R \mid x_i))$ marked.] The fitted curve is $P(R \mid x) = 1/(1+\exp(-(C_0 + C_1 x)))$.

Slide 10: Relevance Probability Calculation
For a Car Sale document in a test set, we have
$C_0$ = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
$C_1$ = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
$X$ = [.26, .25, .14, .07, .23, .84, .26, .15, .33]
$I$ = [1, 1, 1, 1, 1, 1, 1, 1, 1]
Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr]
$Y = C_0 I^T + C_1 X^T$
$P(R \mid d) = 1/(1+\exp(-Y)) \approx 1$
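The calculation on this slide can be reproduced as a short sketch. Two assumptions are made here: the slide's $C_1$ list shows ten entries for nine index terms, so "-10,1" is read as -10.1, and the final step uses the logistic form $1/(1+\exp(-Y))$ from the earlier slides.

```python
import math

index = ["Ye", "Ma", "Mo", "Mi", "Pr", "Fe", "Ph", "De", "Gr"]
c0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
# Assumption: the slide's "-10,1" is taken to mean -10.1.
c1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
x = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

# Y = C0 * I^T + C1 * X^T, where I is a vector of ones.
y = sum(c0) + sum(c * v for c, v in zip(c1, x))

# Logistic transform of the combined score.
p_relevant = 1.0 / (1.0 + math.exp(-y))
print(round(y, 3), p_relevant)
```

Under these assumptions $Y$ comes out at about 134.019, so $\exp(-Y)$ is vanishingly small and the probability rounds to 1, in line with the slide's final value.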

Slide 11: Statistical Information: P-Value
A p-value is a significance indicator. A large p-value indicates either a bad regression model or a statistically insignificant index term. We should keep only significant index terms.
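Keeping only significant index terms amounts to a simple filter on the reported p-values. The sketch below assumes a cutoff of 0.05, which the slides do not state, and uses made-up p-values.

```python
def significant_terms(p_values, alpha=0.05):
    """Return the index terms whose p-value is at or below the cutoff alpha."""
    return [term for term, p in p_values.items() if p <= alpha]

# Hypothetical p-values for three index terms.
p_values = {"Year": 0.01, "Make": 0.003, "Mileage": 0.40}
kept = significant_terms(p_values)
print(kept)
```

Here Mileage would be dropped because its hypothetical p-value is well above the assumed cutoff.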

Slide 12: Dependent Relations
Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together. Performance can be improved by including significant dependent relations in the relevance probability calculation.

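Slide 13: Estimation of Relevance Probability-2
[Diagram: $P(R \mid \text{Year}), \ldots, P(R \mid \text{Feature})$, $P(R \mid \text{Density})$, $P(R \mid \text{Grouping})$, and $P(R \mid \text{Correlation-1}), \ldots, P(R \mid \text{Correlation-}n)$ are combined by multiplication to give $P(R \mid d)$.]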
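The multiplication step in this diagram can be sketched as a product over the component probabilities. The component names follow the slide, but the probability values are hypothetical, and the sketch does not include any normalization the full model might apply.

```python
def combine_by_multiplication(components):
    """Multiply the component probabilities to obtain an estimate of P(R|d)."""
    result = 1.0
    for value in components.values():
        result *= value
    return result

# Hypothetical component probabilities for one document.
components = {
    "Year": 0.9,
    "Density": 0.8,
    "Correlation-1": 0.5,
}
p_document = combine_by_multiplication(components)
print(round(p_document, 2))
```

Because every factor is at most 1, a single low component pulls the whole estimate down.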

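Slide 14: Comparison
Evaluation of three approaches: VSM, VSM & Machine Learning, and Probabilistic.
Car Sale: Precision 100%; Recall 85.7% (VSM), 91% (VSM & Machine Learning), 100% (Probabilistic).
Obituary: Precision 100% (VSM), 91% (VSM & Machine Learning), 100% (Probabilistic); Recall 100%.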
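Precision and recall, the two measures in this comparison, follow from counts of correct and incorrect classifications. The sketch below uses hypothetical labels for ten documents; they are not the test set behind the slide's figures.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for binary relevance judgments."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical judgments: 7 relevant documents, of which the classifier finds 6.
actual = [True] * 7 + [False] * 3
predicted = [True] * 6 + [False] * 4

precision, recall = precision_recall(predicted, actual)
print(precision, round(recall, 3))
```

In this made-up case nothing irrelevant is accepted, so precision is 1.0, while one missed relevant document out of seven gives a recall of about 0.857.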

Slide 15: Contribution
We propose a probabilistic model that can accurately classify multiple-record Web documents. We will study the impact of dependent relations on the performance of our model.