CS347 Lecture 10 May 14, 2001 ©Prabhakar Raghavan

Topics du jour Centroid/nearest-neighbor classification; Bayesian classification; link-based classification; document summarization.

Centroid/NN Given training docs for a topic, compute their centroid. Now we have a centroid for each topic. Given a query doc, assign it to the topic whose centroid is nearest.
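A minimal sketch of centroid classification, assuming cosine similarity over term-weight vectors (the toy vectors and topic names below are made up for illustration):

```python
# Sketch of centroid/nearest-neighbor classification over term-weight vectors.
# The toy documents and class names are hypothetical.
import numpy as np
from collections import defaultdict

def train_centroids(doc_vectors, labels):
    """Average the unit-normalized training vectors of each class."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for v, c in zip(doc_vectors, labels):
        v = v / np.linalg.norm(v)
        sums[c] = v if sums[c] is None else sums[c] + v
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def classify(query_vector, centroids):
    """Assign the query doc to the topic whose centroid is nearest (cosine)."""
    q = query_vector / np.linalg.norm(query_vector)
    return max(centroids, key=lambda c: float(q @ centroids[c]))

# Hypothetical 3-dimensional term vectors for two topics.
docs = [np.array([3., 0., 1.]), np.array([2., 1., 0.]),   # "Science"
        np.array([0., 2., 3.]), np.array([1., 3., 2.])]   # "Arts"
labels = ["Science", "Science", "Arts", "Arts"]
centroids = train_centroids(docs, labels)
print(classify(np.array([2., 0., 1.]), centroids))        # -> "Science"
```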

Example [figure: document vectors grouped into three topics, Government, Science, and Arts, each group with its centroid]

Bayesian Classification As before, train classifier on exemplary docs from classes c_1, c_2, …, c_r. Given test doc d, estimate Pr[d belongs to class c_j] = Pr[c_j | d].

Apply Bayes’ Theorem Pr[c_j | d] · Pr[d] = Pr[d | c_j] · Pr[c_j]. So Pr[c_j | d] = Pr[d | c_j] · Pr[c_j] / Pr[d]. Express Pr[d] as Σ_i Pr[d | c_i] · Pr[c_i].
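As a quick numeric check of the inversion (all probabilities here are made-up illustrative values, not estimates from any real corpus):

```python
# Hypothetical two-class example of inverting Pr[d|c] and Pr[c] into Pr[c|d].
priors = {"Physics": 0.6, "Astronomy": 0.4}            # Pr[c_i], assumed
likelihoods = {"Physics": 1e-6, "Astronomy": 4e-6}     # Pr[d|c_i], assumed

evidence = sum(likelihoods[c] * priors[c] for c in priors)           # Pr[d]
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # {'Physics': ~0.273, 'Astronomy': ~0.727}
```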

“Reverse Engineering” To compute Pr[c_j | d], all we need are Pr[d | c_i] and Pr[c_i], for all i. Will get these from training.

Training Given a set of training docs, together with a class label for each training doc. –e.g., these docs belong to Physics, those others to Astronomy, etc.

Estimating Pr[c_i] Pr[c_i] = fraction of training docs that are labeled c_i. In practice, use more sophisticated “smoothing” to boost probabilities of classes under-represented in the sample.

Estimating Pr[d | c_i] Basic assumption: each occurrence of each word in each doc is independent of all others. For a word w (from sample docs), Pr[w | c_i] = frequency of word w amongst all docs labeled c_i.

Example Thus, the probability of a doc consisting of Friends, Romans, Countrymen, given class c_i, is Pr[Friends | c_i] · Pr[Romans | c_i] · Pr[Countrymen | c_i]. In implementations, pay attention to precision/underflow. Extract all probabilities from the term-doc matrix.

To summarize Training: use class frequencies in training data for Pr[c_i]; estimate word frequencies for each word and each class to estimate Pr[w | c_i]. Test doc d: use the Pr[w | c_i] values to estimate Pr[d | c_i] for each class c_i; determine the class c_j for which Pr[c_j | d] is maximized.
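Putting training and testing together, a compact sketch assuming word-count features, add-one smoothing, and log-probabilities to sidestep the underflow issue noted above (the toy corpus is hypothetical):

```python
# Minimal multinomial naive Bayes sketch: add-one smoothing for rare words,
# log-space arithmetic to avoid underflow. Toy training data is made up.
import math
from collections import Counter, defaultdict

def train(docs, labels):
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)          # class -> word -> count
    vocab = set()
    for doc, c in zip(docs, labels):
        words = doc.lower().split()
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: class_counts[c] / len(docs) for c in class_counts}
    return priors, word_counts, vocab

def classify(doc, priors, word_counts, vocab):
    scores = {}
    for c in priors:
        total = sum(word_counts[c].values())
        score = math.log(priors[c])             # log Pr[c]
        for w in doc.lower().split():
            # add-one ("Laplace") smoothing for words unseen in class c
            p = (word_counts[c][w] + 1) / (total + len(vocab))
            score += math.log(p)                # += log Pr[w|c]
        scores[c] = score
    return max(scores, key=scores.get)          # argmax_c Pr[c|d]

docs = ["friends romans countrymen", "romans built aqueducts",
        "stars and planets", "planets orbit stars"]
labels = ["History", "History", "Astronomy", "Astronomy"]
priors, word_counts, vocab = train(docs, labels)
print(classify("romans and countrymen", priors, word_counts, vocab))  # History
```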

Abstract features So far, have relied on word counts as the “features” to train and classify on. In general, could be any statistic. –terms in boldface count for more. –authors of cited docs. –number of equations. –square of the number of commas … “Abstract features”.

Bayesian in practice Many improvements used over “naïve” version discussed above –various models for document generation –varying emphasis on words in different portions of docs –smoothing statistics for infrequent terms –classifying into a hierarchy

Supervised learning deployment issues Uniformity of docs in training/test Quality of authorship Volume of training data

Typical empirical observations Training ~ docs/class. Accuracy up to 90% in the very best circumstances, below 50% in the worst.

SVM vs. Bayesian SVM appears to beat a variety of Bayesian approaches –both beat centroid-based methods SVM needs quadratic programming –more computation than naïve Bayes at training time –less at classification time Bayesian classifiers also partition the vector space, but not using linear decision surfaces

Classifying hypertext Given a set of hyperlinked docs Class labels for some docs available Figure out class labels for remaining docs

Example [figure: a graph of hyperlinked docs, some labeled c1, c2, c3, c4, the rest marked “?” (unlabeled)]

Bayesian hypertext classification Besides the terms in a doc, derive cues from linked docs to assign a class to test doc. Cues could be any abstract features from doc and its neighbors.

Feature selection Attempt 1: –use terms in doc + those in its neighbors. Generally does worse than terms in doc alone. Why? Neighbors’ terms diffuse focus of doc’s terms.

Attempt 2 Use terms in doc, plus tagged terms from neighbors. E.g., car denotes a term occurring in d itself; a tagged copy denotes the same term occurring in a doc with a link into d; another tagged copy denotes it occurring in a doc with a link from d. Generalizations possible.
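A small sketch of this feature construction; the in:/out: tag strings are illustrative placeholders, not the notation from the original slides:

```python
# Hypothetical feature extraction for "Attempt 2": keep the doc's own terms,
# plus copies of neighbor terms tagged by link direction. Tag names are made up.
def tagged_features(doc_terms, in_neighbor_terms, out_neighbor_terms):
    features = list(doc_terms)
    features += ["in:" + t for terms in in_neighbor_terms for t in terms]
    features += ["out:" + t for terms in out_neighbor_terms for t in terms]
    return features

print(tagged_features(["car", "engine"],
                      in_neighbor_terms=[["car", "dealer"]],
                      out_neighbor_terms=[["car", "tires"]]))
# ['car', 'engine', 'in:car', 'in:dealer', 'out:car', 'out:tires']
```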

Attempt 2 also fails Key terms lose density: e.g., occurrences of car get split among the plain term and its in-/out-link tagged copies.

Better attempt Use class labels of (in- and out-) neighbors as features in classifying d. –e.g., docs about physics point to docs about physics. Setting: some neighbors have pre-assigned labels; need to figure out the rest.

Content + neighbors’ classes Naïve Bayes gives Pr[c_j | d] based on the words in d. Now consider Pr[c_j | N] where N is the set of labels of d’s neighbors. (Can separate N into in- and out-neighbors.) Can combine conditional probs for c_j from text- and link-based evidence.

Training As before, use training data to compute Pr[N | c_j] etc. Assume labels of d’s neighbors independent (as we did with word occurrences). (Also continue to assume word occurrences within d are independent.)

Classification Can invert probs using Bayes to derive Pr[c_j | N]. Need to know class labels for all of d’s neighbors.

Unknown neighbor labels What if not all neighbors’ class labels are known? First, use word content alone to assign a tentative class label to each unlabelled doc. Next, iteratively recompute all tentative labels using word content as well as neighbors’ classes (some tentative).
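A rough sketch of this relaxation loop, assuming two hypothetical scorers, text_prob (from word content alone) and neighbor_prob (from neighbors' tentative labels), combined here by simple multiplication; the original work grounds this in Markov random fields:

```python
# Sketch of iterative relabeling for hypertext classification. The scorers
# text_prob(d, c) and neighbor_prob(nbr_labels, c) are assumed, not standard.
def relabel(docs, neighbors, classes, text_prob, neighbor_prob, iters=10):
    # Step 1: tentative labels from word content alone.
    labels = {d: max(classes, key=lambda c: text_prob(d, c)) for d in docs}
    # Step 2: iteratively recompute labels using text + neighbors' tentative classes.
    for _ in range(iters):
        new_labels = {}
        for d in docs:
            nbr_labels = [labels[n] for n in neighbors[d] if n in labels]
            new_labels[d] = max(
                classes,
                key=lambda c: text_prob(d, c) * neighbor_prob(nbr_labels, c))
        if new_labels == labels:          # converged
            break
        labels = new_labels
    return labels

# Toy usage with crude, hypothetical scorers: docs are just bags of words here.
docs = {"d1": {"physics", "quantum"}, "d2": {"physics"}, "d3": {"art", "paint"}}
neighbors = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": ["d2"]}
classes = ["Physics", "Arts"]
keywords = {"Physics": {"physics", "quantum"}, "Arts": {"art", "paint"}}

def text_prob(d, c):
    return 1 + len(docs[d] & keywords[c])            # word-overlap score

def neighbor_prob(nbr_labels, c):
    return 1 + sum(1 for l in nbr_labels if l == c)  # agreement with neighbors

print(relabel(docs, neighbors, classes, text_prob, neighbor_prob))
# {'d1': 'Physics', 'd2': 'Physics', 'd3': 'Arts'}
```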

Convergence This iterative relabeling will converge provided tentative labels “not too far off”. Guarantee requires ideas from Markov random fields, used in computer vision. Error rates significantly below text-alone classification.

End of classification Move on to document summarization

Document summarization Given a doc, produce a short summary. Length of summary a parameter. –Application/form factor. Very sensitive to doc quality. Typically, corpus-independent.

Summarization Simplest algorithm: Output the first 50 (or however many) words of the doc. Hard to beat on high-quality docs. For very short summaries (e.g., 5 words), drop stop words.
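A minimal sketch of this lead baseline (the stop-word list below is a tiny illustrative subset, not a standard list):

```python
# Lead baseline: output the first n words; optionally drop stop words for very
# short summaries. The stop-word set is a small, hypothetical subset.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "that", "will", "in"}

def lead_summary(text, n_words=50, drop_stop_words=False):
    words = text.split()
    if drop_stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return " ".join(words[:n_words])

doc = "Tandon Corp. will introduce a portable hard-disk drive today that ..."
print(lead_summary(doc, n_words=5, drop_stop_words=True))
# Tandon Corp. introduce portable hard-disk
```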

Summarization Slightly more complex heuristics: Compute an “importance score” for each sentence. Summary output contains sentences from original doc, beginning with the most “important”.

Example: Article from WSJ
–1: Tandon Corp. will introduce a portable hard-disk drive today that will enable personal computer owners to put all of their programs and data on a single transportable cartridge that can be plugged into other computers.
–2: Tandon, which has reported big losses in recent quarters as it shifted its product emphasis from disk drives to personal computer systems, asserts that the new device, called Personal Data Pac, could change the way software is sold, and even the way computers are used.
–3: The company, based in Moorpark, Calif., also will unveil a new personal computer compatible with International Business Machines Corp.'s PC-AT advanced personal computer that incorporates the new portable hard-disks.
–4: "It's an idea we've been working on for several years," said Chuck Peddle, president of Tandon's computer systems division.
–5: "As the price of hard disks kept coming down, we realized that if we could make them portable, the floppy disk drive would be a useless accessory.
–6: Later, we realized it could change the way people use their computers."
–7: Each Data Pac cartridge, which will be priced at about $400, is about the size of a thick paperback book and contains a hard-disk drive that can hold 30 million pieces of information, or the equivalent of about four Bibles.
–8: To use the Data Pacs, they must be plugged into a cabinet called the Ad-Pac 2. That device, to be priced at about $500, contains circuitry to connect the cartridge to an IBM-compatible personal computer, and holds the cartridge steadily in place.
–9: The cartridges, which weigh about two pounds, are so durable they can be dropped on the floor without being damaged.
–10: Tandon developed the portable cartridge in conjunction with Xerox Corp., which supplied much of the research and development funding.
–11: Tandon also said it is negotiating with several other personal computer makers, which weren't identified, to incorporate the cartridges into their own designs.
–12: Mr. Peddle, who is credited with inventing Commodore International Ltd.'s Pet personal computer in the late 1970s, one of the first personal computers, said the Data Pac will enable personal computer users to carry sensitive data with them or lock it away.
–…

Example
–1: Tandon Corp. will introduce a portable hard-disk drive today that will enable personal computer owners to put all of their programs and data on a single transportable cartridge that can be plugged into other computers.
–12: Mr. Peddle, who is credited with inventing Commodore International Ltd.'s Pet personal computer in the late 1970s, one of the first personal computers, said the Data Pac will enable personal computer users to carry sensitive data with them or lock it away.

Example: Article from WSJ
–1: Tandon Corp. will introduce a portable hard-disk drive today that will enable personal computer owners to put all of their programs and data on a single transportable cartridge that can be plugged into other computers.
–7: Each Data Pac cartridge, which will be priced at about $400, is about the size of a thick paperback book and contains a hard-disk drive that can hold 30 million pieces of information, or the equivalent of about four Bibles.
–12: Mr. Peddle, who is credited with inventing Commodore International Ltd.'s Pet personal computer in the late 1970s, one of the first personal computers, said the Data Pac will enable personal computer users to carry sensitive data with them or lock it away.

Using Lexical chains Score each word based on lexical chains. Score sentences. Extract hierarchy of key sentences –Better (WAP) summarization

What are Lexical Chains? Dependency relationship between words –reiteration: e.g. tree - tree –superordinate: e.g. tree - plant –systematic semantic relation: e.g. tree - bush –non-systematic semantic relation: e.g. tree - tall

Using lexical chains Look for chains of reiterated words Score chains Use to score sentences –determine how important a sentence is to the content of the doc.

Computing Lexical Chains Quite simple if only dealing with reiteration. –Issues: How far apart can 2 nodes in a chain be? How do we score a chain?

Example: Article from WSJ (sentences 1–12, repeated from the example above)

For each word: How far apart can 2 nodes in a chain be? Will continue chain if within 2 sentences. How to score a chain? Function of word frequency and chain size.
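A rough sketch of reiteration-only chain building with these choices; the scoring function (frequency times span) is an assumed stand-in, not necessarily the one used in the slides:

```python
# Reiteration-only lexical chains: a chain is a run of occurrences of the same
# word in which consecutive occurrences are at most max_gap sentences apart.
# The scoring function (frequency * span) is a hypothetical choice.
def build_chains(sentences, max_gap=2):
    chains = []                      # each chain: {"word", "sentences", "score"}
    open_chains = {}                 # word -> index of its most recent chain
    for i, sent in enumerate(sentences, start=1):
        for w in set(sent.lower().split()):
            if w in open_chains and i - chains[open_chains[w]]["sentences"][-1] <= max_gap:
                chains[open_chains[w]]["sentences"].append(i)
            else:
                chains.append({"word": w, "sentences": [i]})
                open_chains[w] = len(chains) - 1
    for c in chains:
        freq = len(c["sentences"])
        span = c["sentences"][-1] - c["sentences"][0] + 1
        c["score"] = freq * span     # assumed scoring: frequency times span
    return [c for c in chains if len(c["sentences"]) > 1]   # keep only real chains

sentences = ["Tandon will introduce a portable drive",
             "Tandon shifted emphasis from drives to computers",
             "The company will unveil a new computer",
             "The portable cartridge plugs into a computer"]
for chain in build_chains(sentences):
    print(chain["word"], chain["sentences"], chain["score"])
# prints chains such as: tandon [1, 2] 4, will [1, 3] 6, a [1, 3, 4] 12, ...
```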

Compute & Score Chains Chain structure analysis [table: Chain 1: PAC; Chain 2: TANDON; Chain 3: PORTABLE; Chain 4: CARTRIDGE; Chain 5: COMPUTER; Chain 6: TANDON; Chain 7: TANDON; Chain 8: PEDDLE; Chain 9: COMPUTER; Chain 10: DATA; each chain with its score and the sentences it covers]

Sentence scoring For each sentence S in the doc: f(S) = a·h(S) − b·t(S), where h(S) = total score of all chains starting at S, and t(S) = total score of all chains covering S but not starting at S.
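A sketch of this scoring, with chains represented as (score, sentences-covered) pairs; the weights a and b are free parameters and the chain data below are arbitrary illustrative values:

```python
# Score sentences by f(S) = a*h(S) - b*t(S): a chain adds to h(S) of the
# sentence where it starts and to t(S) of every later sentence it covers.
# Chain data and weight values are hypothetical.
def score_sentences(num_sentences, chains, a=1.0, b=0.5):
    scores = {s: 0.0 for s in range(1, num_sentences + 1)}
    for chain_score, sents in chains:
        start, end = sents[0], sents[-1]
        scores[start] += a * chain_score              # h(S): chain starts at S
        for s in range(start + 1, end + 1):
            scores[s] -= b * chain_score              # t(S): chain only covers S
    return scores

# (score, sentences covered) for a few hypothetical chains
chains = [(12.0, [1, 3, 4]), (6.0, [1, 3]), (4.0, [3, 4])]
print(score_sentences(4, chains))
# {1: 18.0, 2: -9.0, 3: -5.0, 4: -8.0}
```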

Semantic Structure Analysis

Semantic Structure Analysis Derive the document structure.

Resources S. Chakrabarti, B. Dom, P. Indyk. Enhanced hypertext categorization using hyperlinks. R. Barzilay. Using lexical chains for text summarization.