Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.



How is hypertext different?
- Link information (possibly useful but noisy)
- Diverse authorship
- Short text - topic not obvious from the text
- Structure / position within the web graph
- Author-supplied features (meta-tags)
- External sources of information (meta-data)
- Bold, italics, headings, etc.

Goal
- Present several hypotheses about regularities in hypertext classification tasks
- Describe methods to exploit these regularities
- Evaluate the different methods and regularities on real-world hypertext datasets

Regularities in Hypertext
- No Regularity
- "Encyclopedia" Regularity
- "Co-Referencing" Regularity
- Partial "Co-Referencing" Regularity
- Pre-Classified Regularity
- Meta-Data Regularity

No Regularity
- The documents are linked at random, or at least independently of the document class.

"Encyclopedia" Regularity
- The majority of linked documents share the same class as a document.
- Example: encyclopedia articles generally reference other articles which are topically similar.

"Co-Referencing" Regularity
- Documents with the same class tend to link to documents not of that class, but which are topically similar to each other.
- Example: university student index pages, which tend not to link to other student index pages, but do link mostly to home pages of students.
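One way to check how strongly the "Encyclopedia" regularity holds in a labeled collection is to count how many links connect pages of the same class. The sketch below is illustrative only and is not taken from the slides: the function name, the dictionary layout, and the toy pages are all assumptions.

```python
from typing import Dict, List


def same_class_link_fraction(
    links: Dict[str, List[str]], labels: Dict[str, str]
) -> float:
    """Return the fraction of labeled links whose two endpoints share a class.

    `links` maps a page name to the pages it links to (hypothetical layout).
    `labels` maps a page name to its class label.
    """
    same = 0
    total = 0
    for source, targets in links.items():
        if source not in labels:
            continue
        for target in targets:
            if target not in labels:
                continue  # skip links to pages with no known class
            total += 1
            if labels[source] == labels[target]:
                same += 1
    return same / total if total else 0.0


# Toy example (made-up pages and classes).
links = {"a": ["b", "c"], "b": ["a"], "c": ["d"]}
labels = {"a": "sports", "b": "sports", "c": "finance", "d": "finance"}
fraction = same_class_link_fraction(links, labels)
print(fraction)
```

A value near 1 would suggest an "Encyclopedia"-style collection, while a value near the chance level would be closer to the "No Regularity" case.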

Partial "Co-Referencing" Regularity
- "Co-referencing" regularity where we might have more than a few "noisy" links.
- Example: many students may point to pages about their hobbies, but also link to a wide variety of other pages which are less unique to student home pages.

Pre-Classified Regularity
- Either one page, or some small set of pages, may contain lists of hyperlinks to pages that are mostly members of the same class.
- Example: any page from the Yahoo topic hierarchy.

Meta-Data Regularity
- Meta-data available from external sources on the web that can be exploited in the form of additional features.
- Examples: movie reviews for movie classification, online discussion boards for various other topic classification tasks (such as stock market predictions or competitive analysis).

Ignore Links
- Use standard text classifiers on the text of the document itself.
- Also serves as the baseline.

Use All the Text From Neighbors
- Augment the text of each document with the text of its neighbors.
- This adds more topic-related words to the document.

Use All the Text From Neighbors Separately
- Add the words of linked documents, but treat them as if they come from a separate vocabulary.
- A simple way to do this is to prefix the words in the linked documents with a tag, such as linked-word:
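The three document representations described above can be sketched as follows. This is a minimal illustration under stated assumptions: "neighbors" is taken to mean pages reached by outgoing links, documents are already tokenized, and all function names and the toy pages are hypothetical. Only the linked-word: prefix comes from the slides.

```python
from typing import Dict, List


def ignore_links(page: str, text: Dict[str, List[str]]) -> List[str]:
    """Baseline: use only the page's own words."""
    return list(text[page])


def merge_neighbor_text(
    page: str, text: Dict[str, List[str]], links: Dict[str, List[str]]
) -> List[str]:
    """Append the neighbors' words to the page's own words, same vocabulary."""
    tokens = list(text[page])
    for neighbor in links.get(page, []):
        tokens.extend(text.get(neighbor, []))
    return tokens


def tag_neighbor_text(
    page: str,
    text: Dict[str, List[str]],
    links: Dict[str, List[str]],
    prefix: str = "linked-word:",
) -> List[str]:
    """Append the neighbors' words, prefixed so they form a separate vocabulary."""
    tokens = list(text[page])
    for neighbor in links.get(page, []):
        tokens.extend(prefix + word for word in text.get(neighbor, []))
    return tokens


# Toy example (made-up pages).
text = {"home": ["acme", "software"], "products": ["software", "pricing"]}
links = {"home": ["products"]}

own = ignore_links("home", text)
merged = merge_neighbor_text("home", text, links)
tagged = tag_neighbor_text("home", text, links)
print(own, merged, tagged, sep="\n")
```

With the tagged variant, a classifier can weight "software" on the page itself differently from "linked-word:software" on a neighbor, which is the point of keeping the vocabularies separate.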

Look for Linked Document Subsets
- Search for the topically similar linked pages.
- At the top level, this is a clustering problem to find similar documents among all the documents linked to documents in the same class.

Use the Identity of the Linked Documents
- Search for these pages by representing each page with only the names of the pages it links with.

Use External Features / Meta-Data
- Collect features that relate two or more entities/documents being classified using information extraction techniques.
- These extracted features can then be used in a similar fashion by using the identity of the related documents and by using the text of related documents in various ways.
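Representing a page by the identity of the pages it links to can be sketched as below. The feature prefix, the function names, and the Jaccard overlap used to compare two pages are assumptions added for illustration; the slides do not specify a similarity measure.

```python
from typing import Dict, Iterable, List


def link_identity_features(
    page: str, links: Dict[str, List[str]], prefix: str = "links-to:"
) -> List[str]:
    """Represent a page only by the names of the pages it links to."""
    return sorted({prefix + target for target in links.get(page, [])})


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap between two feature sets (illustrative similarity measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy example: two index pages that share one link target (made-up names).
links = {"index1": ["alice", "bob"], "index2": ["bob", "carol"]}
features1 = link_identity_features("index1", links)
features2 = link_identity_features("index2", links)
overlap = jaccard(features1, features2)
print(features1, features2, overlap)
```

Pages of the same class that point to many of the same targets end up with overlapping feature sets, which is the signal a "Co-Referencing" regularity would provide.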


Learning Algorithms Used
- Naïve Bayes (NB)
  - Probabilistic, builds a generative model
- k Nearest Neighbor (kNN)
  - Example-based
- First Order Inductive Learner (FOIL)
  - Relational learner
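As a rough illustration of the first learner, the sketch below implements a small multinomial Naïve Bayes classifier with add-one smoothing. It is a generic textbook formulation, not the implementation used in the experiments; the function names and the training data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_nb(docs: List[Tuple[List[str], str]]):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens: List[str]) -> str:
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in tokens:
            if word not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (made-up documents).
training = [
    (["bank", "loan", "credit"], "finance"),
    (["game", "team", "score"], "sports"),
]
model = train_nb(training)
predicted = predict_nb(model, ["loan", "bank"])
print(predicted)
```

Because the model only needs word counts per class, any of the document representations above (own text, merged neighbor text, or prefixed neighbor text) can be fed to it unchanged.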

Datasets
- A collection of up to 50 web pages from 4285 companies (as used in Ghani et al. 2000)
- Two types of classifications (labels obtained from
  - Coarse-grained classification: 28 classes
  - Fine-grained classification: 255 classes
- Classification is at the level of companies, so the task is to classify the company by collapsing all of the web pages in a corporate website.
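Collapsing all pages of a corporate website into one document per company could look like the sketch below. The input layout (a list of company/token pairs) and the names are assumptions made for illustration, not the format of the original dataset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collapse_pages_by_company(
    pages: List[Tuple[str, List[str]]]
) -> Dict[str, List[str]]:
    """Concatenate the tokens of every page belonging to the same company."""
    documents = defaultdict(list)
    for company, tokens in pages:
        documents[company].extend(tokens)
    return dict(documents)


# Toy example (made-up companies and page contents).
pages = [
    ("acme", ["welcome", "acme"]),
    ("globex", ["energy", "services"]),
    ("acme", ["software", "products"]),
]
company_docs = collapse_pages_by_company(pages)
print(company_docs)
```

Each resulting document then carries a single company-level label, which matches the task as described: one classification decision per company rather than per page.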

Accuracy for 28 Class Task

Accuracy for 255 Class Task

Accuracy Vs. Feature Size

Conclusions
- Hyperlinks can be extremely noisy and harmful for classification.
- Meta-data about websites can be useful, and techniques for automatically finding meta-data should be explored.
- Naïve Bayes and kNN are suitable since they scale well with noise and feature-set size, while FOIL has the power to discover relational regularities that cannot be explicitly identified by the others.