A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Presentation transcript:

6/11/2015
A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model
Department of Computer Science, Brigham Young University
Quan Wang, November 2001

Overview
Probabilistic Retrieval Model
–Application ontology
–Document representations
–Ranking documents based on logistic regression analysis
Experimental Results

Application Ontology
[Figure: application ontology diagram. Object sets: Car, Year, Price, Make, Model, Mileage, Feature, PhoneNr. Edge labels: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]

Document Representation
A set of pairs $A_1 : x_1, \ldots, A_n : x_n$.
A density heuristic value $y$.
A grouping heuristic value $z$.
Document $d$: $(x_1, \ldots, x_n, y, z)$, written compactly as $(V, y, z)$.

Independence Assumption
$P(R \mid x_1, \ldots, x_n, y, z)$
Under the independence assumption, this is built from the per-property terms:
$P(R \mid x_1) * \cdots * P(R \mid x_n) * P(R \mid y) * P(R \mid z)$

Logistic Regression
[Figure: scatter plots of $P(R \mid x)$ against $x$ and of $P(R \mid x_i)$ against $x_i$.]
$P(R \mid x) = \dfrac{1}{1 + \exp(-(C_0 + C_1 x))}$, $\quad \ln O(R \mid x) = C_0 + C_1 x$.
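The two formulas on this slide describe the same model, once as a probability and once as log-odds. A minimal Python sketch, assuming made-up coefficients `C0` and `C1` (the slides do not give fitted values), checks that the log-odds of the logistic probability recover the linear form:

```python
import math


def logistic_probability(x: float, c0: float, c1: float) -> float:
    """P(R|x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p: float) -> float:
    """ln O = ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
C0, C1 = -2.0, 3.0

for x in (0.0, 0.5, 1.0):
    p = logistic_probability(x, C0, C1)
    # The last two columns should agree up to rounding error.
    print(f"x={x:.1f}  P(R|x)={p:.4f}  ln O(R|x)={log_odds(p):.4f}  C0+C1*x={C0 + C1 * x:.4f}")
```

The sign of `C0 + C1 * x` decides whether the probability lies above or below 0.5, which is why a log-odds value can be compared against zero.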

Probabilistic Retrieval Based on Logistic Regression Analysis
Data processing
Data analysis
Probabilistic retrieval on car-ads application ontology
Correlation relations

Data Processing
The corresponding normalized vector $V' = (X_1', \ldots, X_n')$ is computed as $V' = \dfrac{|V|}{|u|}\, V$, where $V$ is a document vector and $u$ is an ontology vector.

Data Distributions
[Figure: data distribution plots.]

Logistic Regression-1

Logistic Regression-2
Regression coefficients
P-value

Statistical Information: P-Value
A p-value is a significance indicator. A large p-value indicates either a bad regression model or a statistically insignificant index term. We should keep only significant index terms.
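Keeping only significant index terms amounts to a filter on the p-values. The sketch below assumes placeholder term names, invented p-values, and a 0.05 cutoff; none of these come from the slides, and 0.05 is only a common convention.

```python
from typing import Dict, List

# Hypothetical p-values for placeholder index terms; not taken from the slides.
P_VALUES: Dict[str, float] = {
    "term_a": 0.001,
    "term_b": 0.030,
    "term_c": 0.410,
    "term_d": 0.870,
}

# Assumed significance cutoff; the slides do not state which threshold was used.
ALPHA = 0.05


def select_significant_terms(p_values: Dict[str, float], alpha: float) -> List[str]:
    """Return the terms whose p-value is below the cutoff, in input order."""
    return [term for term, p in p_values.items() if p < alpha]


print(select_significant_terms(P_VALUES, ALPHA))
```

With these invented numbers the filter keeps `term_a` and `term_b` and drops the two terms with large p-values.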

Select Important Index Terms
The car-ads application ontology
[Table: p-values for the index terms Year, Make, Model, Mileage, Price, Features, PhoneN, Density, and Grouping; the values are not preserved in the transcript.]
[Figure: double S-curve.]

Probabilistic Retrieval Model
$\ln O(R \mid x_i)$, $\ln O(R \mid y)$, $\ln O(R \mid z)$
$> 0$: relevant
$< 0$: irrelevant
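One way to read this slide is as a sign test on combined log-odds. The sketch below is an illustration under stated assumptions, not the thesis's implementation: it assumes the per-property log-odds are summed, that each follows the linear form $C_0 + C_1 x$ from the logistic regression slide, and that a score of exactly zero counts as irrelevant. All coefficients and property names are invented.

```python
from typing import Dict, Tuple

# Hypothetical (C0, C1) pairs per document property. Real values would come
# from fitting one logistic regression per property.
COEFFICIENTS: Dict[str, Tuple[float, float]] = {
    "x1": (-1.5, 3.0),
    "x2": (-1.0, 2.0),
    "y": (-2.0, 4.0),  # density heuristic value
    "z": (-0.5, 1.0),  # grouping heuristic value
}


def total_log_odds(values: Dict[str, float]) -> float:
    """Sum C0 + C1 * value over all properties (summation is an assumption)."""
    return sum(c0 + c1 * values[name] for name, (c0, c1) in COEFFICIENTS.items())


def classify(values: Dict[str, float]) -> str:
    """Label a document by the sign of its combined log-odds.

    A score of exactly zero is treated as irrelevant here, which is a choice
    made for this sketch; the slide does not cover that case.
    """
    return "relevant" if total_log_odds(values) > 0 else "irrelevant"


strong_match = {"x1": 1.0, "x2": 1.0, "y": 1.0, "z": 1.0}
weak_match = {"x1": 0.0, "x2": 0.0, "y": 0.0, "z": 0.0}
print(classify(strong_match), classify(weak_match))
```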

Correlation Relations
Correlation: There are strong positive correlations among document properties (e.g., Death Date and Birth Date in the obituaries).
Correlations are extra information implicitly contained in a document.
Correlation relations handle "patterns", e.g., the Birth Date-Death Date pair appearing in the obituaries application ontology.
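A positive correlation between two document properties can be measured with the Pearson coefficient. The sketch below assumes invented per-document counts of recognized Birth Date and Death Date values; the slides do not say how the thesis computes or uses its correlation relations, so this only illustrates the statistic itself.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical counts per document, for illustration only.
birth_date_counts = [0, 1, 3, 5, 8]
death_date_counts = [0, 1, 4, 5, 9]

print(f"r = {pearson_r(birth_date_counts, death_date_counts):.3f}")
```

A value of r near 1 means that documents with many Birth Date matches also tend to have many Death Date matches, which is the kind of pattern the slide describes.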

Special Web Documents
Multiple-record Web documents
Similar content, format (e.g., item for sale)
Same lexical object values (e.g., Honda makes cars and motorcycles)
8 documents (motorcycle, boat, snowmobile, bicycle) for the car-ads, and 5 documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for the obituary.

Experimental Results
Recall: 100%
Precision: 83.3%* (car-ads), 83.3% (obituary)
Accuracy: 92.9% (car-ads), 92.0% (obituary)
*Ten out of eighteen negative documents are specially selected.
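Recall, precision, and accuracy all follow from four confusion-matrix counts. The transcript does not report those counts, so the numbers below are hypothetical: they were chosen only because they reproduce the car-ads percentages and include eighteen negative documents, as the footnote mentions. Other counts could give the same percentages.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that were labeled relevant."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of documents labeled relevant that really are relevant."""
    return tp / (tp + fp)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all documents that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts (assumption): 10 positive and 18 negative documents.
TP, FP, FN, TN = 10, 2, 0, 16

print(f"recall    = {recall(TP, FN):.1%}")
print(f"precision = {precision(TP, FP):.1%}")
print(f"accuracy  = {accuracy(TP, FP, FN, TN):.1%}")
```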

Conclusions
We propose a probabilistic model that is suitable for classifying multiple-record Web documents.
The model's performance on a randomly chosen test document set could be better than the results we present in the thesis.