Machine learning for the Web: Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering, IIT Bombay

2 Traditional supervised learning
- Independent variables x: mostly continuous, maybe categorical
- Predicted variable y: discrete (classification) or continuous (regression)
[Diagram: training instances feed a Learner, which builds statistical models, inference rules, or separators; a test instance is then mapped to a prediction y]
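As a concrete illustration of this workflow (a minimal sketch, not from the talk; the synthetic data and logistic-regression learner are assumptions), in scikit-learn:

```python
# Minimal supervised-learning sketch (illustrative, not from the talk):
# continuous features x, discrete label y -> classification.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic training/test instances with continuous independent variables.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learner = LogisticRegression(max_iter=1000)   # a simple linear separator
learner.fit(X_train, y_train)                 # training phase
print("test accuracy:", learner.score(X_test, y_test))  # prediction on test instances
```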

3 Traditional unsupervised learning
- No training / testing phases
- Input is a collection of records with independent attributes alone
- Measure of similarity
- Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
- Hierarchical partitions
[Diagram: two clusters, each with large self-similarity and small cross-similarity between them]
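A minimal clustering sketch along these lines (the two-blob data and the choice of k-means are illustrative assumptions, not from the talk):

```python
# Minimal clustering sketch (illustrative, not from the talk): partition
# unlabeled records into groups with high within-cluster similarity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled records: two blobs of points with independent attributes only.
records = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
print("cluster sizes:", np.bincount(km.labels_))
```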

4 Learning hypertext models
- Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
- Transformed into simple models and relations:
  - Vector space / bag-of-words
  - Hyperlink graph
  - Topic directories
  - Discrete time series
- Relations such as occurs(term, page, cnt), cites(page, page), is-a(topic, topic), example(topic, page)
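As an illustration (with an assumed two-page toy corpus, not from the talk), these relations can be held in simple Python structures:

```python
# Illustrative sketch (assumed toy data): representing a tiny hypertext
# corpus with the relations named on the slide.
from collections import Counter

pages = {
    "p1": "machine learning for the web",
    "p2": "hyperlink graph analysis for web search",
}

# occurs(term, page, cnt): bag-of-words counts per page
occurs = {page: Counter(text.split()) for page, text in pages.items()}

# cites(page, page): the hyperlink graph as a set of directed edges
cites = {("p1", "p2")}

# is-a(topic, topic) and example(topic, page): a small topic directory
is_a = {("machine_learning", "computer_science")}
example = {("machine_learning", "p1"), ("web_search", "p2")}

print(occurs["p1"]["web"], ("p1", "p2") in cites)
```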

5 Challenges
- Large feature space in raw data
  - Structured data sets: 10s to 100s
  - Text (Web): 50 to 100 thousand
- Most features not completely useless
  - Feature elimination / selection not perfect
  - Beyond linear transformations?
- Models used today are simplistic
  - Good accuracy on simple labeling tasks
  - Lose a lot of detail present in hypertext to fit known learning techniques

6 Challenges
- Complex, interrelated objects
  - Not a structured, tuple-like entity
  - Explicit and implicit connections
  - Document markup sub-structure
  - Site boundaries and hyperlinks
  - Placement in popular directories like Yahoo!
- Traditional distance measures are noisy
  - How to combine diverse features? (Or: how many words is a link worth?)
  - Unreliable clustering results

7 This session
- Semi-supervised clustering (Rich Caruana): enhanced clustering via user feedback
- Kernel methods (Nello Cristianini): modular learning systems for text and hypertext
- Reference matching (Andrew McCallum): recovering and cleaning implicit citation graphs from unstructured data

8 This talk: Two examples
- Learning topics of hypertext documents
  - Semi-supervised learning scenario
  - Unified model of text and hyperlinks
  - Enhanced accuracy of topic labeling
- Segmenting hierarchically tagged pages
  - Topic distillation (hubs and authorities)
  - Minimum description length segmentation
  - Better focused topic distillation
  - Extract relevant fragments from pages

9 Classifying interconnected entities
- Early examples:
  - Some diseases have complex lineage dependency
  - Robust edge detection in images
- How are topics interconnected in hypertext?
- Maximum likelihood graph labeling with many classes
[Diagram: finding edge pixels in a differentiated image; unknown labels “?” resolved into class probabilities, e.g. 0.3 red, 0.7 blue]

10 Naïve Bayes classifiers
- Decide the topic: topic c is picked with prior probability π(c), with Σ_c π(c) = 1
- Each c has parameters θ(c,t) for terms t
- Coin with face probabilities θ(c,t), with Σ_t θ(c,t) = 1
- Fix the document length n(d) and toss the coin n(d) times
- Naïve yet effective; other algorithms can also be used
- Given c, the probability of the document is
  Pr(d | c) = ( n(d) choose {n(d,t)} ) · ∏_t θ(c,t)^{n(d,t)},
  where n(d,t) is the number of times term t occurs in d
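A minimal multinomial naive Bayes sketch of this generative view (the four-document toy corpus is an assumption; scikit-learn estimates π(c) and θ(c,t) from the counts n(d,t)):

```python
# Minimal multinomial naive Bayes sketch (illustrative toy corpus):
# estimate class priors pi(c) and per-class term probabilities theta(c, t),
# then score a new document by its term counts n(d, t).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cricket bat ball score", "election vote parliament",
        "ball wicket innings", "minister vote campaign"]
labels = ["sports", "politics", "sports", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # n(d, t): term counts per document
clf = MultinomialNB().fit(X, labels)  # estimates pi(c) and theta(c, t)

print(clf.predict(vec.transform(["vote on the cricket ball"])))
```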

11 Enhanced models for hypertext
- Notation: c = class, d = text, N = neighbors
- Text-only model: Pr(d | c)
- Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
- Better recursive model: Pr(d, c(N) | c)
- Relaxation labeling over Markov random fields
- Or, an EM formulation
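A hedged sketch of relaxation labeling on a small link graph (the graph, the text scores, and the simple averaging update are illustrative assumptions, not the exact model from the talk): each page's class distribution is repeatedly re-estimated from its own text evidence and its neighbors' current class distributions.

```python
# Sketch of relaxation labeling on a link graph (assumed toy data and a
# simplified update rule, not the paper's exact algorithm).
import numpy as np

classes = ["sports", "politics"]
# Pr(c | text of page) from some text-only classifier (assumed numbers).
text_scores = np.array([[0.9, 0.1],    # page 0
                        [0.6, 0.4],    # page 1
                        [0.2, 0.8]])   # page 2
links = [(0, 1), (1, 2)]               # undirected link graph
coupling = 0.5                         # weight given to neighbors (assumed)

neighbors = {i: [] for i in range(len(text_scores))}
for u, v in links:
    neighbors[u].append(v)
    neighbors[v].append(u)

probs = text_scores.copy()
for _ in range(20):                    # iterate toward a fixed point
    new = probs.copy()
    for i, nbrs in neighbors.items():
        if nbrs:
            nbr_avg = probs[nbrs].mean(axis=0)
            new[i] = (1 - coupling) * text_scores[i] + coupling * nbr_avg
    new /= new.sum(axis=1, keepdims=True)
    probs = new

print(np.round(probs, 3))
```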

12 Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior-art links
- Expand the test patent to include its neighborhood
- ‘Forget’ and re-estimate a fraction of the neighbors’ classes
- (Even better for Yahoo!)

13 Hyperlink Induced Topic Search (HITS)
- A keyword query is sent to a search engine; the response is grown into a radius-1 expanded graph
- Pages in the graph act as ‘hubs’ and ‘authorities’
- Scores are computed by the mutual recursion a = E^T h, h = E a, where E is the adjacency matrix of the expanded graph, h the hub scores, and a the authority scores
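A minimal HITS power-iteration sketch (the four-page adjacency matrix is an assumed toy graph, not the expanded graph of a real query):

```python
# Minimal HITS power-iteration sketch (assumed toy graph):
# alternate a = E^T h and h = E a, normalizing each time.
import numpy as np

# Adjacency matrix E: E[i, j] = 1 if page i links to page j.
E = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(E.shape[0])
for _ in range(50):
    a = E.T @ h                 # authority update
    a /= np.linalg.norm(a)
    h = E @ a                   # hub update
    h /= np.linalg.norm(h)

print("authorities:", np.round(a, 3))
print("hubs:       ", np.round(h, 3))
```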

14 “Topic drift” and various fixes
- Some hubs have ‘mixed’ content
- Authority ‘leaks’ through mixed hubs from good pages to bad pages
- Clever: match the query against anchor text to favor some (‘thick’) edges
- B&H (Bharat & Henzinger): eliminate outlier documents using a vector-space document model, a centroid, and a cut-off radius
[Diagram: query term with an activation window; document vectors, centroid, and cut-off radius; ‘thick’ links]

15 Document object model (DOM)
- Hierarchical graph model for semi-structured data
- Can extract a reasonable DOM from HTML
- A fine-grained view of the Web
- Valuable because page boundaries are less meaningful now
[Diagram: DOM tree of a “Portals” page — html → head → title, and html → body → ul → li → a (Yahoo), a (Lycos)]
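A small sketch of extracting and printing a DOM tree from HTML with Python's standard library (the HTML snippet mirrors the slide's portal example and is otherwise assumed):

```python
# Sketch (illustrative HTML): walk the DOM of an HTML page and print its
# tree structure, one indented line per node.
from html.parser import HTMLParser

HTML = """<html><head><title>Portals</title></head>
<body><ul><li><a href="http://yahoo.com">Yahoo</a>
<a href="http://lycos.com">Lycos</a></li></ul></body></html>"""

class DOMPrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)      # one line per element node
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1
    def handle_data(self, data):
        if data.strip():
            print("  " * self.depth + repr(data.strip()))  # text nodes

DOMPrinter().feed(HTML)
```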

16 A model for hub generation
- A global hub score distribution Θ0 with respect to the given query
- Authors use DOM nodes to specialize Θ0 into local distributions ΘI
- At a certain ‘cut’ in the DOM tree, the local distribution directly generates the hub scores
[Diagram: global distribution, progressive ‘distortion’ down the DOM tree, a model frontier, and other pages below]

17 Optimizing a cost measure
- Consider a DOM node v with hub scores H_v in its subtree and the reference (global) distribution Θ0
- Data encoding cost is roughly the negative log-likelihood of the hub scores under the local distribution, -Σ_{h ∈ H_v} log Pr(h | Θv)
- Distribution distortion cost is the divergence between the local and reference distributions, which has a simple closed form when the distributions are Poisson
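A hedged sketch of this trade-off under an assumed Poisson model with toy hub scores (not the paper's exact cost function): specializing the global distribution to a local one pays a distortion cost but can reduce the data encoding cost.

```python
# MDL-style cost sketch (assumed Poisson model and toy numbers): compare the
# cost of encoding a subtree's hub scores under the global model against the
# cost of distorting the model locally plus encoding under the local model.
from math import lgamma, log

hub_scores = [7, 9, 8, 6, 10]                  # integerized hub scores H_v (assumed)
mu_global = 2.0                                # mean of the reference distribution
mu_local = sum(hub_scores) / len(hub_scores)   # specialized local mean

def data_cost_bits(scores, mu):
    """-log2 likelihood of the scores under a Poisson(mu) model."""
    ll = sum(k * log(mu) - mu - lgamma(k + 1) for k in scores)
    return -ll / log(2)

def distortion_cost_bits(mu_loc, mu_glob):
    """KL divergence KL(Poisson(mu_loc) || Poisson(mu_glob)) in bits."""
    return (mu_glob - mu_loc + mu_loc * log(mu_loc / mu_glob)) / log(2)

cost_global = data_cost_bits(hub_scores, mu_global)
cost_local = distortion_cost_bits(mu_local, mu_global) + data_cost_bits(hub_scores, mu_local)
print(f"encode with global model: {cost_global:.1f} bits")
print(f"distort + encode locally: {cost_local:.1f} bits")
```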

18 Modified topic distillation algorithm
- Will this (non-linear) system converge?
- Will segmentation help in reducing drift?

Algorithm outline:
- Initialize the DOM graph; let only root-set authority scores be 1
- Repeat until reasonable convergence:
  - Authority-to-hub score propagation
  - MDL-based hub score smoothing
  - Hub-to-authority score propagation
  - Normalization of authority scores
- Segment and rank micro-hubs
- Present annotated results
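A structural sketch of this loop on a toy micro-link graph (the adjacency matrix and root set are assumptions, and the MDL smoothing step is replaced by a simple damping placeholder, not the algorithm's actual smoother):

```python
# Structural sketch of the modified distillation loop (assumed toy graph;
# the MDL-based smoothing is stubbed out with a simple damping step).
import numpy as np

E = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # assumed micro-link adjacency matrix
root_set = {0, 1}                        # assumed root set

a = np.zeros(E.shape[0])
a[list(root_set)] = 1.0                  # only root-set authority scores start at 1

def mdl_smooth(h):
    # Placeholder for MDL-based hub score smoothing over the DOM tree.
    return 0.5 * h + 0.5 * h.mean()

for _ in range(30):                      # repeat until reasonable convergence
    h = E @ a                            # authority-to-hub propagation
    h = mdl_smooth(h)                    # MDL-based hub score smoothing (stub)
    a = E.T @ h                          # hub-to-authority propagation
    a /= np.linalg.norm(a)               # normalization of authority scores

print("authority ranking:", np.argsort(-a))
```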

19 Convergence
- 28 queries used in Clever and by B&H
- 366k macro-pages, 10M micro-links
- Ranks converge within 15 iterations

20 Effect of micro-hub segmentation
- ‘Expanded’ implies authority diffusion is arrested
- As nodes outside the root set start participating in the distillation:
  - #Expanded increases
  - #Pruned decreases
- Prevents authority leaks via mixed hubs

21 Rank correlation with B&H
- Positively correlated, with some negative deviations
- Pseudo-authorities are downgraded by our algorithm
- These were earlier favored by mixed hubs
- (Axes not to the same scale)

22 Conclusion
- Hypertext and the Web pose new modeling and algorithmic challenges
- Locality exists in many guises
- Diverse sources of information: text, links, markup, usage
- Unifying models needed
- Anecdotes suggest that synergy can be exploited