Bootstrapping
Goals:
– Utilize a minimal amount of (initial) supervision
– Learn from many unlabeled examples (vs. selective sampling)
General scheme:
1. Initial supervision – seed examples for training an initial model, or the seed model itself (indicative features)
2. Classify the corpus with the seed model
3. Add the most confident classifications to the training set and iterate
Exploits feature redundancy in examples (see the sketch below).
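A minimal sketch of this general scheme, assuming feature-vector examples and a classifier with fit/predict_proba methods; the function name, confidence threshold, and iteration cap are illustrative and not part of the original slides.

```python
# Generic bootstrapping (self-training) sketch, not a specific published algorithm.
# model: any classifier exposing fit(X, y) and predict_proba(X).
# X_seed, y_seed: small labeled seed set; X_pool: unlabeled corpus (feature vectors).
import numpy as np

def bootstrap(model, X_seed, y_seed, X_pool, threshold=0.95, max_iters=10):
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(X_pool)
    for _ in range(max_iters):
        model.fit(np.array(X_train), np.array(y_train))
        if not pool:
            break
        probs = model.predict_proba(np.array(pool))
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing confident left to add
        labels = probs.argmax(axis=1)
        # Move the most confident classifications into the training set and iterate
        X_train += [x for x, keep in zip(pool, confident) if keep]
        y_train += [y for y, keep in zip(labels, confident) if keep]
        pool = [x for x, keep in zip(pool, confident) if not keep]
    return model
```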

Bootstrapping a Decision List for Word Sense Disambiguation
Yarowsky (1995) scheme; relies on “one sense per collocation”
1. Initialize the seed model with all collocations in the sense’s dictionary definition – an initial decision list for each sense
   – Alternatively, train from seed examples
2. Classify all examples with the current model – any example containing a feature for some sense
3. Compute the odds ratio for each feature–sense combination; if it is above a threshold, add the feature to the decision list
4. Optional: apply the “one sense per discourse” constraint to add/filter examples
5. Iterate from step 2
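The odds ratio in step 3 is, in Yarowsky's formulation, a log-likelihood ratio of the two senses given the collocational feature; a reconstruction in assumed notation (f is a feature, sense_A and sense_B the competing senses):

```latex
\mathrm{score}(f) \;=\; \left|\; \log \frac{P(\mathrm{sense}_A \mid f)}{P(\mathrm{sense}_B \mid f)} \;\right|
```

Features are sorted by this score to form the decision list, and only features whose score exceeds the threshold are added.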

“One Sense per Collocation”

Results
Applies the “one sense per discourse” constraint after the final classification – classify all occurrences of a word in a document by the majority sense.
Evaluation on ~37,000 examples:

Co-training for Name Classification
Collins and Singer, EMNLP 1999
– A bootstrapping approach that relies on a (natural) partition of the feature space into two different sub-spaces
– The bootstrapping process iterates by switching sub-space in each iteration
Name classification task: person, organization, location
Features for the decision list:
1. Spelling features (full, contains, cap, non-alpha)
2. Context features (words, syntactic type)

Seed Features Initialized with score for the decision list

DL-CoTrain Algorithm
1. Initialize n = 5 (max. number of rules learned in an iteration)
2. Initialize the spelling decision list with the seed
3. Classify the corpus by spelling rules (where applicable)
4. Induce a contextual rule list from the classified examples; keep at most the n most frequent rules whose score > θ for each class
5. Classify the corpus by contextual rules
6. Induce spelling rules, as in step 4, and add them to the list
7. If n < 2500 then n ← n + 5 and return to step 3. Otherwise, classify the corpus by the combined list and induce a final classifier from these labeled examples.
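A compact sketch of the two-view loop above. The induce_rules and apply_rules helpers are hypothetical stand-ins for decision-list induction and classification; the growth schedule and threshold follow the listed steps, with other details simplified.

```python
# DL-CoTrain sketch: alternate between the spelling view and the context view.
# induce_rules(examples, labels, view, n, theta): learn at most n decision-list
#   rules per class with score > theta (hypothetical helper, returns a list).
# apply_rules(rules, examples, view): label examples covered by the rules
#   (hypothetical helper); uncovered examples stay unlabeled.

def dl_cotrain(examples, seed_spelling_rules, induce_rules, apply_rules,
               theta=0.95, n_max=2500):
    spelling_rules = list(seed_spelling_rules)
    context_rules = []
    n = 5
    while n < n_max:
        # Steps 3-4: label with spelling rules, induce contextual rules
        labels = apply_rules(spelling_rules, examples, view="spelling")
        context_rules = induce_rules(examples, labels, view="context",
                                     n=n, theta=theta)
        # Steps 5-6: label with contextual rules, induce spelling rules
        labels = apply_rules(context_rules, examples, view="context")
        spelling_rules = list(seed_spelling_rules) + induce_rules(
            examples, labels, view="spelling", n=n, theta=theta)
        n += 5  # step 7: grow the rule budget and repeat
    # Final pass: label the corpus with the combined list; a final classifier
    # would then be trained on these labeled examples.
    return apply_rules(spelling_rules + context_rules, examples, view="both")
```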

Co-Train vs. Yarowsky’s Algorithm
– Collins & Singer implemented a cautious version of Yarowsky’s algorithm, in which n rules are added at each iteration
– Both reach about the same accuracy (~91%), as does a boosting-based bootstrapping algorithm
– Abney (ACL 2002) provides some analysis of both algorithms, showing that they rely on different independence assumptions

Clustering and Unsupervised Disambiguation
– Word sense disambiguation without labeled training data or other information sources
– Cannot label with predefined senses (there are none), so try to find “natural” senses
– Use clustering methods to divide the different contexts of a word into “sensible” classes
Other applications of clustering:
– Thesaurus construction, document clustering
– Forming word classes for generalization in language modeling and disambiguation

Clustering

Clustering Techniques
Hierarchical:
– Bottom-up (agglomerative)
– Top-down (divisive)
Flat (non-hierarchical):
– Usually iterative
– Soft (vs. hard) clustering: “degree” of membership

Comparison
Hierarchical:
– Preferable for detailed data analysis
– More information provided
– No clearly preferred algorithm
– Less efficient (at least O(n²))
Non-hierarchical:
– Preferable if efficiency is important or there is a lot of data
– K-means is the simplest method and is often good enough
– If there is no Euclidean space, can use EM instead

Hierarchical Clustering
(Example dendrogram over the function words: and, but, in, on, with, for, at, from, of, to, as.)

Similarity Functions
– sim(c_1, c_2) = similarity between clusters
– Defined over pairs of individual elements, and (possibly inductively) over cluster pairs
– Values typically between 0 and 1
For hierarchical clustering, require monotonicity:
– For all possible clusters, min[sim(c_1, c_2), sim(c_1, c_3)] ≥ sim(c_1, c_2 ∪ c_3)
– Merging two clusters cannot increase similarity!

Bottom-Up Clustering
Input: set X = {x_1, …, x_n} of objects; similarity function sim: 2^X × 2^X → ℝ
for i ← 1 to n do: c_i ← {x_i}
C ← {c_1, …, c_n}
j ← n + 1
while |C| > 1 do:
  (c_a, c_b) ← argmax_(c_u, c_v) sim(c_u, c_v)
  c_j ← c_a ∪ c_b   (trace the merge to construct the cluster hierarchy)
  C ← (C \ {c_a, c_b}) ∪ {c_j}
  j ← j + 1

Types of Cluster Pair Similarity
– single link: similarity of the most similar (nearest) members
– complete link: similarity of the most dissimilar (remote) members – a property of the unified cluster
– group average: average pairwise similarity of members in the unified cluster – a property of the unified cluster
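A sketch of the bottom-up loop above together with the linkage options, assuming a precomputed pairwise similarity matrix S over the objects; function and variable names are illustrative.

```python
# Agglomerative clustering over a precomputed similarity matrix S (n x n).
# linkage: "single" (max pairwise sim), "complete" (min pairwise sim), or
# "group_average" (here: mean pairwise sim between the two clusters' members,
# a common simplification of the unified-cluster average).
import itertools
import numpy as np

def cluster_similarity(S, c1, c2, linkage):
    sims = [S[i, j] for i in c1 for j in c2]
    if linkage == "single":
        return max(sims)
    if linkage == "complete":
        return min(sims)
    return float(np.mean(sims))  # group average

def bottom_up_cluster(S, linkage="single"):
    clusters = [[i] for i in range(S.shape[0])]  # start from singletons
    merges = []                                  # merge trace = cluster hierarchy
    while len(clusters) > 1:
        # Find the most similar pair of clusters
        a, b = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_similarity(S, clusters[p[0]],
                                                    clusters[p[1]], linkage))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```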

Single Link Clustering (“chaining” risk)

Complete Link Clustering “Global view” yields tighter clusters – usually more suitable in NLP

Efficiency of the Merge Step
Maintain pairwise similarities between clusters, and the maximal similarity per cluster:
– Single link update – O(n): sim(c_1 ∪ c_2, c_3) = max(sim(c_1, c_3), sim(c_2, c_3))
– Complete link update – O(n log n): have to compare all pairwise similarities with all points in the merged cluster to find the minimum

Group Average Clustering
– A “mid-point” between single and complete link – produces relatively tight clusters
– Efficiency depends on the choice of similarity metric
– Computing the average similarity for a newly formed cluster from scratch is O(n²)
– Goal: incremental computation
– Represent objects as m-dimensional unit vectors (length-normalized), and use cosine for similarity:

Cosine Similarity 33 22 11 cos  1 > cos  2 > cos  3

Efficient Cosine Similarity
Average cosine similarity within a cluster c (to be maximized after a merge), and the sum vector defined for a cluster:
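A reconstruction of the formulas referred to above, under the stated assumption of length-normalized vectors (so cos(x, y) = x · y); the notation A(c) and s(c) is assumed.

```latex
A(c) \;=\; \frac{1}{|c|\,(|c|-1)} \sum_{x \in c} \sum_{\substack{y \in c \\ y \neq x}} \cos(x, y),
\qquad
s(c) \;=\; \sum_{x \in c} x
```

Because the vectors have unit length, s(c) · s(c) sums all pairwise dot products including each x · x = 1, which gives A(c) = (s(c) · s(c) − |c|) / (|c| (|c| − 1)).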

On a merge, update the sum vector, then update A for all new pairs of clusters.
Complexity: O(n) per iteration (assuming constant dimensionality); O(n²) overall for the algorithm.
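The merge update referred to above, reconstructed in the same assumed notation:

```latex
s(c_1 \cup c_2) \;=\; s(c_1) + s(c_2),
\qquad
A(c_1 \cup c_2) \;=\; \frac{s(c_1 \cup c_2)\cdot s(c_1 \cup c_2) \;-\; \bigl(|c_1|+|c_2|\bigr)}{\bigl(|c_1|+|c_2|\bigr)\,\bigl(|c_1|+|c_2|-1\bigr)}
```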

Top-Down Clustering
Input: set X = {x_1, …, x_n} of objects; coherence measure coh: 2^X → ℝ; splitting function split: 2^X → 2^X × 2^X
C ← {X}
j ← 1
while C contains a cluster with more than one element do:
  c_a ← argmin_(c_u) coh(c_u)
  (c_(j+1), c_(j+2)) ← split(c_a)
  C ← (C \ {c_a}) ∪ {c_(j+1), c_(j+2)}
  j ← j + 2

Top-Down Clustering (cont.)
Coherence measure – can be any type of cluster quality function, including those used for agglomerative clustering:
– Single link – maximal similarity in the cluster
– Complete link – minimal similarity in the cluster
– Group average – average pairwise similarity
Split – can be handled as a clustering problem of its own, aiming to form two clusters:
– Can use hierarchical clustering, but it is often more natural to use non-hierarchical clustering (see next)
May provide a more “global” view of the data (vs. the locally greedy bottom-up procedure)

Non-Hierarchical Clustering
Iterative clustering:
– Start with an initial (random) set of clusters
– Assign each object to a cluster (or clusters)
– Re-compute the cluster parameters, e.g. centroids
– Stop when the clustering is “good”
Q: How many clusters?

K-means Algorithm
Input: set X = {x_1, …, x_n} of objects; distance measure d: X × X → ℝ; mean function μ: 2^X → X
Select k initial cluster centers f_1, …, f_k
while not finished do:
  for all clusters c_j do: c_j ← { x_i | f_j = argmin_f d(x_i, f) }
  for all means f_j do: f_j ← μ(c_j)
Complexity: O(n), assuming a constant number of iterations
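A minimal NumPy sketch of this loop, with Euclidean distance and the mean as the centroid function; initialization and the convergence test are simplified and the names are illustrative.

```python
# Minimal k-means sketch: Euclidean distance, mean re-centering.
# X: (n, d) array of objects; k initial centers are sampled from X.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each object joins the cluster of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # "finished": centers stopped moving
        centers = new_centers
    return centers, assign
```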

K-means Clustering

Example Clustering of words from NY Times using cooccurring words 1. ballot, polls, Gov, seats 2. profit, finance, payments 3. NFL, Reds, Sox, inning, quarterback, score 4. researchers, science 5. Scott, Mary, Barbara, Edward

Buckshot Algorithm
Often, random initialization for K-means works well. If not:
– Randomly choose √(kn) points
– Run hierarchical group-average clustering on them: O((√(kn))²) = O(kn) = O(n) for constant k
– Use those cluster means as starting points for K-means
– O(n) total complexity
Scatter/Gather (Cutting, Karger, Pedersen, 1993): an interactive clustering-based browsing scheme for text collections and search results (constant-time improvements)

The EM Algorithm
A soft clustering method to solve θ* = argmax_θ P(X | m(θ))
– Soft clustering – probabilistic association between clusters and objects (θ – the model parameters)
Note: any occurrence of the data consists of:
– Observable variables: the objects we see
  – Words in each context
  – Word sequence in POS tagging
– Hidden variables: which cluster generated which object
  – Which sense generates each context
  – Underlying POS sequence
Soft clustering can capture ambiguity (words) and multiple topics (documents)

Two Principles
Expectation: if we knew θ, we could compute the expectation of the hidden variables (e.g., the probability of x belonging to some cluster)
Maximization: if we knew the hidden structure, we could compute the maximum-likelihood value of θ

EM for Word-sense “Discrimination”
Cluster the contexts (occurrences) of an ambiguous word, where each cluster constitutes a “sense”:
– Randomly initialize the model parameters P(v_j | s_k) and P(s_k)
– E-step: compute P(s_k | c_i) for each context c_i and sense s_k
– M-step: re-estimate the model parameters P(v_j | s_k) and P(s_k) for context words and senses
– Continue as long as the log-likelihood of all corpus contexts 1 ≤ i ≤ I increases (EM is guaranteed to increase it in each step until convergence to a maximum, possibly a local one):
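The objective referred to above, reconstructed in the usual notation for this model (c_i are the contexts, s_k the senses, θ the parameters):

```latex
l(\theta) \;=\; \sum_{i=1}^{I} \log \sum_{k=1}^{K} P(c_i \mid s_k)\, P(s_k)
```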

E-Step
To compute P(c_i | s_k), use the naive Bayes assumption (the context words are conditionally independent given the sense).
For each context c_i and each sense s_k, estimate the posterior probability h_ik = P(s_k | c_i) (an expected “count” of the sense for c_i), using Bayes’ rule:
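A reconstruction of the two formulas this slide refers to (v_j ranges over the words of context c_i):

```latex
P(c_i \mid s_k) \;=\; \prod_{v_j \in c_i} P(v_j \mid s_k),
\qquad
h_{ik} \;=\; P(s_k \mid c_i) \;=\; \frac{P(c_i \mid s_k)\, P(s_k)}{\sum_{k'=1}^{K} P(c_i \mid s_{k'})\, P(s_{k'})}
```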

M-Step
Re-estimate the parameters using maximum-likelihood estimation:
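The re-estimation formulas, reconstructed under the assumption that each word contributes once per context it appears in, with each distribution normalized to sum to one:

```latex
P(v_j \mid s_k) \;=\; \frac{\sum_{i:\, v_j \in c_i} h_{ik}}{\sum_{j'} \sum_{i:\, v_{j'} \in c_i} h_{ik}},
\qquad
P(s_k) \;=\; \frac{\sum_{i=1}^{I} h_{ik}}{I}
```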

Decision Procedure
Assign senses by (same as naïve Bayes):
– Can adjust the pre-determined number of senses k to get finer or coarser distinctions (e.g. bank as a physical location vs. an abstract corporation)
– If adding more senses doesn’t increase the log-likelihood much, then stop
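The assignment rule referred to above, reconstructed in the same notation (log probabilities are used for numerical convenience):

```latex
s(c_i) \;=\; \arg\max_{s_k} \left[ \log P(s_k) \;+\; \sum_{v_j \in c_i} \log P(v_j \mid s_k) \right]
```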

Results (Schütze 1998)
Word – Sense – Accuracy:
– suit: lawsuit – 95 ± 0
– suit: suit you wear – 96 ± 0
– motion: physical movement – 85 ± 1
– motion: proposal for action – 88 ± 13
– train: line of railroad cars – 79 ± 19
– train: to teach – 55 ± 33
Works better for topic-related senses (given the broad-context features used).
Improved IR performance by 14% – representing both query and document as senses, and combining results with word-based retrieval.