Machine learning for the Web: Applications and challenges Soumen Chakrabarti Center for Intelligent Internet Research Computer Science and Engineering.

Machine learning for the Web: Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering, IIT Bombay
www.cse.iitb.ernet.in/~soumen

2 Traditional supervised learning
- Training instance
- Test instance
- Independent variables x: mostly continuous, maybe categorical
- Predicted variable y: discrete (classification) or continuous (regression)
[Figure labels: Learner; Statistical models, inference rules, or separators; Prediction y]

3 Traditional unsupervised learning
- No training / testing phases
- Input is a collection of records with independent attributes alone
- Measure of similarity
- Partition or cover instances using clusters with large "self-similarity" and small "cross-similarity"
- Hierarchical partitions
[Figure labels: Large self-similarity; Small cross-similarity]

4 Learning hypertext models
- Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
- Transformed into simple models and relations
  - Models: vector space/bag-of-words, hyperlink graph, topic directories, discrete time series
  - Relations: occurs(term, page, cnt), cites(page, page), is-a(topic, topic), example(topic, page)
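As a minimal sketch of how raw pages might be flattened into the occurs(term, page, cnt) and cites(page, page) relations listed above: the page ids, sample texts, tokenizer, and the helper name `occurs` are illustrative assumptions, not part of the talk.

```python
import re
from collections import Counter

def occurs(pages):
    """Return (term, page, cnt) rows, mirroring the occurs relation."""
    rows = []
    for page_id, text in pages.items():
        # Assumed tokenizer: lowercase alphanumeric runs.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, cnt in sorted(counts.items()):
            rows.append((term, page_id, cnt))
    return rows

# Hypothetical two-page collection with one hyperlink.
pages = {"p1": "Web mining and web search", "p2": "Search engines"}
cites = [("p1", "p2")]  # cites(page, page): p1 links to p2
rows = occurs(pages)
```

Each row is a plain tuple, so the relations could be loaded into a relational store or a sparse term-by-page matrix without further conversion.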

5 Challenges
- Large feature space in raw data
  - Structured data sets: 10s to 100s
  - Text (Web): 50 to 100 thousand
- Most features not completely useless
  - Feature elimination / selection not perfect
  - Beyond linear transformations?
- Models used today are simplistic
  - Good accuracy on simple labeling tasks
  - Lose a lot of detail present in hypertext to fit known learning techniques

6 Challenges
- Complex, interrelated objects
  - Not a structured tuple-like entity
  - Explicit and implicit connections
  - Document markup sub-structure
  - Site boundaries and hyperlinks
  - Placement in popular directories like Yahoo!
- Traditional distance measures are noisy
  - How to combine diverse features? (Or, a link is worth a ? words)
  - Unreliable clustering results

7 This session
- Semi-supervised clustering (Rich Caruana)
  - Enhanced clustering via user feedback
- Kernel methods (Nello Cristianini)
  - Modular learning systems for text and hypertext
- Reference matching (Andrew McCallum)
  - Recovering and cleaning implicit citation graphs from unstructured data

8 This talk: Two examples
- Learning topics of hypertext documents
  - Semi-supervised learning scenario
  - Unified model of text and hyperlinks
  - Enhanced accuracy of topic labeling
- Segmenting hierarchical tagged pages
  - Topic distillation (hubs and authorities)
  - Minimum description length segmentation
  - Better focused topic distillation
  - Extract relevant fragments from pages

9 Classifying interconnected entities
- Early examples:
  - Some diseases have complex lineage dependency
  - Robust edge detection in images
- How are topics interconnected in hypertext?
- Maximum likelihood graph labeling with many classes
[Figure: finding edge pixels in a differentiated image; a graph of unlabeled nodes (?) next to a node labeled 0.3 red, 0.7 blue]

10 Naïve Bayes classifiers
- Decide topic; topic c is picked with prior probability π(c); Σ_c π(c) = 1
- Each c has parameters θ(c,t) for terms t
- Coin with face probabilities: Σ_t θ(c,t) = 1
- Fix document length n(d) and toss coin
- Naïve yet effective; can use other algos
- Given c, probability of document is [equation not preserved in this transcript]
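The formula that closed this slide did not survive extraction. Under the coin-tossing description given (document length n(d) fixed, one toss per term position), the standard multinomial form would be the following. The symbols π and θ are restored by assumption, and the per-term count n(d,t) is notation introduced here for illustration.

```latex
\Pr(d \mid c) \;=\; \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{\,n(d,t)},
\qquad n(d) = \sum_{t} n(d,t),
\qquad \sum_{t} \theta(c,t) = 1 .
```

Classification then picks the topic c that maximizes π(c) Pr(d | c); the multinomial coefficient does not depend on c and can be dropped when comparing topics.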

11 Enhanced models for hypertext
- c = class, d = text, N = neighbors
- Text-only model: Pr(d | c)
- Using neighbors' text to judge my topic: Pr(d, d(N) | c)
- Better recursive model: Pr(d, c(N) | c)
- Relaxation labeling over Markov random fields
- Or, EM formulation?
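The recursive model Pr(d, c(N) | c) can be illustrated with a heavily simplified relaxation-labeling loop. This is a sketch under stated assumptions, not the talk's algorithm: `text_prob` stands in for text-only class scores, `compat[c][k]` is an assumed probability that a neighbor has class k when the page has class c, and every name and number below is hypothetical.

```python
def relax_labels(text_prob, neighbors, compat, iterations=10):
    """Iteratively combine text-only scores with neighbors' current beliefs."""
    belief = {v: list(p) for v, p in text_prob.items()}
    n_classes = len(compat)
    for _ in range(iterations):
        updated = {}
        for v, prior in text_prob.items():
            scores = []
            for c in range(n_classes):
                s = prior[c]
                for u in neighbors.get(v, []):
                    # Expected compatibility with neighbor u's current belief.
                    s *= sum(compat[c][k] * belief[u][k] for k in range(n_classes))
                scores.append(s)
            total = sum(scores)
            updated[v] = [s / total for s in scores]
        belief = updated  # synchronous update of all nodes
    return belief

# Hypothetical example: page x has ambiguous text but links to two
# pages whose text strongly suggests class 0.
text_prob = {"a": [0.9, 0.1], "b": [0.9, 0.1], "x": [0.5, 0.5]}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}
compat = [[0.8, 0.2], [0.2, 0.8]]  # assumed: linked pages tend to share a class
belief = relax_labels(text_prob, neighbors, compat)
```

In this toy graph the ambiguous page x is pulled toward class 0 by its neighbors, which is the qualitative effect the slide attributes to using c(N) rather than text alone.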

12 Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior art links
- Expand test patent to include neighborhood
- 'Forget' and re-estimate fraction of neighbors' classes
(Even better for Yahoo)

13 Hyperlink Induced Topic Search
- a = E^T h
- h = E a
[Figure labels: Keyword; Query; Search engine; Response; Radius-1 expanded graph; 'Hubs' and 'authorities'; nodes marked h and a]
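The two update equations a = E^T h and h = E a can be sketched as a power iteration over an edge list. This is a minimal illustration: the L2 normalization, the iteration count, and the toy graph are assumptions chosen for the example rather than details taken from the slide.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, auth) score dicts for a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E^T h: a page's authority sums the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h = E a: a page's hub score sums the authority of pages it cites.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = new_hub
        # Normalize both vectors so the scores stay bounded (assumed L2 norm).
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: h1 cites a1 and a2, h2 cites only a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(edges)
```

Here a1 ends up with the higher authority score because both hubs cite it, and h1 ends up as the stronger hub because it cites both authorities.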

14 "Topic drift" and various fixes
- Some hubs have 'mixed' content
- Authority 'leaks' through mixed hubs from good to bad pages
- Clever: match query with anchor text to favor some edges
- B&H: eliminate outlier documents
[Figure labels: Vector-space document model; Centroid; Cut-off radius; Query term; Activation window; 'Thick' links]

15 Document object model (DOM)
- Hierarchical graph model for semi-structured data
- Can extract reasonable DOM from HTML
- A fine-grained view of the Web
- Valuable because page boundaries are less meaningful now
[Figure labels: Portals; Yahoo; Lycos; DOM tree with nodes html, head, body, title, ul, li, a, a]

16 A model for hub generation
- Global hub score distribution Θ0 w.r.t. given query
- Authors use DOM nodes to specialize Θ0 into local ΘI
- At a certain 'cut' in the DOM tree, local distribution directly generates hub scores
[Figure labels: Global distribution; Progressive 'distortion'; Model frontier; Other pages]

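17 Optimizing a cost measure
- Data encoding cost is roughly [formula not preserved in this transcript]
- Distribution distortion cost is [formula not preserved in this transcript] (for Poisson distribution)
[Figure labels: node v; hub scores Hv; reference distribution Θ0]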
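Both cost formulas on this slide were lost in extraction. The following sketch shows one conventional way such MDL costs are computed for Poisson models, and it is an assumption rather than the talk's definition: data cost is taken as the negative log-likelihood of integer-valued scores under a local Poisson mean, and distortion as the Kullback-Leibler divergence between two Poisson distributions. The function names, the direction of the divergence, and the use of nats instead of bits are all illustrative choices.

```python
import math

def data_cost(counts, mu):
    """Negative log-likelihood (nats) of integer counts under Poisson(mu)."""
    return -sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)

def kl_poisson(mu_p, mu_q):
    """KL(Poisson(mu_p) || Poisson(mu_q)) in nats (closed form)."""
    return mu_p * math.log(mu_p / mu_q) + mu_q - mu_p

# Hypothetical numbers: a reference mean and a local mean fitted
# to a few discretized hub scores under one DOM node.
mu_ref, mu_local = 2.0, 4.0
scores = [3, 4, 5]
total = data_cost(scores, mu_local) + kl_poisson(mu_local, mu_ref)
baseline = data_cost(scores, mu_ref)
```

Comparing `total` against `baseline` mimics the trade-off the slide describes: a specialized local distribution encodes the data more cheaply but pays a distortion cost for departing from the reference distribution.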

18 Modified topic distillation algorithm
- Will this (non-linear) system converge?
- Will segmentation help in reducing drift?
Algorithm:
1. Initialize DOM graph
2. Let only root set authority scores be 1
3. Repeat until reasonable convergence:
   - Authority-to-hub score propagation
   - MDL-based hub score smoothing
   - Hub-to-authority score propagation
   - Normalization of authority scores
4. Segment and rank micro-hubs
5. Present annotated results

19 Convergence
- 28 queries used in Clever and by B&H
- 366k macro-pages, 10M micro-links
- Rank converges within 15 iterations

20 Effect of micro-hub segmentation
- 'Expanded' implies authority diffusion arrested
- As nodes outside root set start participating in the distillation…
  - #Expanded increases
  - #Pruned decreases
- Prevents authority leaks via mixed hubs

21 Rank correlation with B&H
- Positively correlated
- Some negative deviations
- Pseudo-authorities downgraded by our algorithm
- These were earlier favored by mixed hubs
(Axes not to same scale)

22 Conclusion
- Hypertext and the Web pose new modeling and algorithmic challenges
- Locality exists in many guises
- Diverse sources of information: text, links, markup, usage
- Unifying models needed
- Anecdotes suggest that synergy can be exploited
