1
A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali Monojit Choudhury Microsoft Research India monojitc@microsoft.com
2
Co-authors Chris Biemann University of Leipzig Joydeep Nath Animesh Mukherjee Niloy Ganguly Indian Institute of Technology Kharagpur
3
Language – A Complex System Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse. Function: communication through recursive syntax and compositional semantics. Dynamics: evolution and language change.
4
Computational Linguistics Study of language using computers; study of language-using computers. Natural Language Processing: speech recognition, machine translation, automatic summarization, spell checkers, information retrieval & extraction, …
5
Labeling of Text Lexical category (POS tags); syntactic category (phrases, chunks); semantic role (agent, theme, …); sense; domain-dependent labeling (genes, proteins, …). How to define the set of labels? How to (learn to) predict them automatically?
6
Distributional Hypothesis "A word is characterized by the company it keeps" – Firth, 1957. Syntax: function words (Harris, 1968). Semantics: content words.
7
Outline Defining Context Syntactic Network of Words Complex Network – Theory & Applications Chinese Whispers: Clustering the Network Experiments Topological Properties of the Networks Evaluation Future work
8
Feature Words Estimate the unigram frequencies. Feature words: the m most frequent words.
9
Feature Vector Example sentence: "From the familiar to the exotic, the collection is a delight". (Figure: for a target word, indicator vectors over the feature words fw_1 … fw_200 are recorded at the four context positions p−2, p−1, p+1, p+2; the slide shows the bits set for the feature words "the", "to", "is", "from".)
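The construction above can be sketched in Python. This is a minimal illustration, not the original system; `feature_words` and `context_vector` are hypothetical names:

```python
# Sketch of the slide's feature-vector construction: for each occurrence of a
# target word, record which feature words appear at positions p-2, p-1, p+1, p+2.
from collections import Counter

def feature_words(corpus, m):
    """Return the m most frequent words of the corpus (the feature words)."""
    return [w for w, _ in Counter(corpus).most_common(m)]

def context_vector(corpus, target, fws):
    """Concatenated indicator counts of feature words at the four context
    positions around every occurrence of `target` (length 4 * len(fws))."""
    index = {w: i for i, w in enumerate(fws)}
    vec = [0] * (4 * len(fws))
    for i, w in enumerate(corpus):
        if w != target:
            continue
        for slot, off in enumerate((-2, -1, 1, 2)):
            j = i + off
            if 0 <= j < len(corpus) and corpus[j] in index:
                vec[slot * len(fws) + index[corpus[j]]] += 1
    return vec
```

With the slide's example sentence, the target "familiar" has "the" at p−1 and p+2 and "to" at p+1, which is exactly what the vector records.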
10
Syntactic Network of Words (Figure: words – light, color, red, blue, blood, sky, heavy, weight – connected by weighted edges, with example co-occurrence counts 100, 20, 1; the edge between two words is weighted by 1 – cos(red, blue), i.e. the cosine between their feature vectors.)
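The edge weights can be computed from the context vectors of the previous slide; a small sketch, where `cosine` and `distance` are illustrative helpers:

```python
# Cosine similarity between two context vectors, and the slide's 1 - cos weight.
import math

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def distance(u, v):
    """Edge weight in the slide's formulation: 1 - cos(u, v)."""
    return 1.0 - cosine(u, v)
```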
11
The Chinese Whispers Algorithm (Figure: the word graph – light, color, red, blue, blood, sky, heavy, weight – with edge weights 0.9, 0.5, 0.9, 0.7, 0.8, −0.5; successive slides animate the label propagation.)
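A minimal Chinese Whispers sketch (after Biemann, 2006): every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbours, in random order, until nothing changes. The graph encoding here is illustrative, not the original implementation:

```python
# Chinese Whispers graph clustering, sketched with a plain dict-of-edges graph.
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """edges: dict mapping (u, v) -> weight, treated as undirected."""
    rng = random.Random(seed)
    nbrs = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    label = {n: n for n in nodes}   # every node starts in its own class
    order = list(nodes)
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for n in order:
            if not nbrs[n]:
                continue
            scores = {}
            for m, w in nbrs[n].items():
                scores[label[m]] = scores.get(label[m], 0.0) + w
            best = max(scores, key=scores.get)
            if best != label[n]:
                label[n] = best
                changed = True
        if not changed:
            break
    return label
```

On a toy graph with two well-separated word groups, the procedure settles on one label per group.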
14
Experiments Corpus: Anandabazaar Patrika (17M words). We build networks G_n,m with n = corpus size ∈ {1M, 2M, 5M, 10M, 17M} and m = number of feature words ∈ {25, 50, 100, 200}. Number of nodes: 5000; number of edges: ~150,000.
15
Topological Properties: Cumulative Degree Distribution CDD: P_k is the probability that a randomly chosen node has degree ≥ k. (Figure: for G_17M,50, P_k is linear in −log(k), so p_k = −dP_k/dk ∝ 1/k – a Zipfian distribution!)
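The cumulative degree distribution from the plot can be computed directly from a degree list; a small sketch with an illustrative helper name:

```python
# P_k = fraction of nodes whose degree is >= k, for each observed degree k.
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Return {k: P_k} for every degree value that occurs."""
    n = len(degrees)
    counts = Counter(degrees)
    pk, remaining = {}, n
    for k in sorted(counts):
        pk[k] = remaining / n     # nodes with degree >= k
        remaining -= counts[k]
    return pk
```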
16
Topological Properties: Clustering Coefficient Measures the transitivity of the network, or equivalently the proportion of triangles. Very small for random graphs, high for social networks. Mean CC for G_17M,50: 0.53. (Figure: CC vs. degree.)
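The local clustering coefficient of the slide – how many pairs of a node's neighbours are themselves linked – can be sketched as follows (adjacency format illustrative):

```python
# Local clustering coefficient on a dict-of-sets adjacency structure.
def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))
```

A node inside a triangle has coefficient 1.0; the middle of a bare path has 0.0, which is why random graphs score low and social networks high.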
17
Topological Properties: Cluster Size Distribution (Figures: cluster size vs. rank – variation with n at m = 50, and variation with m at n = 17M.)
18
Evaluation: Tag Entropy Each word w is represented by a binary vector over the tag set, e.g. w with tags {t1, t6, t9}. For a cluster C = {w1, w2, w3, w4}, TE(C) is computed from the tag vectors of its members. (The slide's worked example gives TE(C) = 2.)
19
Mean Tag Entropy MTE = (1/N) Σ TE(C_i). Weighted MTE = Σ |C_i| TE(C_i) / Σ |C_i|. Caveat: if every word is placed in a separate cluster, MTE and WMTE are both 0. Baseline: every word in a single cluster.
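One plausible reading of these measures, with TE(C) taken as the Shannon entropy of the gold-tag distribution inside a cluster; a sketch with illustrative names, not the original evaluation code:

```python
# Tag entropy of one cluster, plus the mean and size-weighted mean over clusters.
import math
from collections import Counter

def tag_entropy(tags):
    """Shannon entropy (bits) of the tag distribution within one cluster."""
    counts = Counter(tags)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_tag_entropy(clusters):
    """MTE = (1/N) * sum of TE(C_i) over the N clusters."""
    return sum(tag_entropy(c) for c in clusters) / len(clusters)

def weighted_mean_tag_entropy(clusters):
    """WMTE = sum |C_i| * TE(C_i) / sum |C_i|."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * tag_entropy(c) for c in clusters) / total
```

A pure cluster scores 0; a cluster split evenly between two tags scores 1 bit, matching the caveat that singleton clusters trivially reach 0.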
20
Tag Entropy vs. Corpus Size (m = 50) – % reduction in tag entropy:
Corpus size:  1M     2M     5M     10M    17M
MTE:          74.49  75.14  76.09  78.29  74.94
WMTE:         17.46  18.68  24.23  27.56  30.60
21
The Bigger the Worse! (Figure: tag entropy vs. cluster size.)
22
Clusters … Big ones are bad ones – a mix of everything! Medium-sized clusters are good. http://banglaposclusters.googlepages.com/home
Rank  Size  Type
5     596   Proper nouns, titles and posts
6     352   Possessive case of nouns (common, proper, verbal) and pronouns
8     133   Nouns (common, verbal) forming compounds with "do" or "be"
11    44    Number-Classifier (e.g. 1-TA, ekaTA)
12    84    Adjectives
23
More Observations Words are split into: first names vs. surnames; animate noun-possessives vs. inanimate noun-possessives; nouns-acc vs. nouns-poss vs. nouns-loc; verb-finite vs. verb-infinitive. Syntactic or semantic? Nouns related to professions, months, days of the week, stars, players, etc.
24
Advantages No labeled data required: a good solution to resource scarcity. No prior class information: circumvents issues related to tag set definition. A computational definition of word class. Understanding the structure of language (syntax) and its evolution.
25
Danke für Ihre Aufmerksamkeit. Dieses ist „vom Übersetzer übersetzt worden, der" von Phasen Microsoft Beta ist. (Deliberately garbled German machine-translation output.) Thank you for your attention. This has been translated by "Translator Beta" from Microsoft Live.
26
Related Work Harris, 68: distributional hypothesis for syntactic classes. Miller and Charles, 91: function words as features. Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: the general technique. Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging. Dasgupta and Ng, 07: Bengali POS induction through morphological features.
27
Medium and Low Frequency Words Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded by θ. Two words are connected iff they share at least 4 neighbors.
Language:  English  Finnish  German
Nodes:     52857    85627    137951
Edges:     691241   702349   1493571
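The shared-neighbour criterion above can be sketched as follows; the log-likelihood ranking is assumed already done, and the function name is illustrative:

```python
# Connect two words iff their (already ranked) co-occurrence neighbour sets
# overlap in at least k items -- the slide's criterion with k = 4.
def shared_neighbour_edges(neighbours, k=4):
    """neighbours: dict mapping word -> set of its top co-occurrence neighbours."""
    words = sorted(neighbours)
    edges = []
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if len(neighbours[u] & neighbours[v]) >= k:
                edges.append((u, v))
    return edges
```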
28
Construction of Lexicon Each word is assigned a unique tag based on the word class it belongs to. Class 1: sky, color, blood, weight. Class 2: red, blue, light, heavy. Ambiguous words: high- and medium-frequency words that formed singleton clusters receive the possible tags of neighboring clusters.
29
Training and Evaluation Unsupervised training of a trigram HMM using the clusters and lexicon. Evaluation: tag a text for which a gold standard is available; estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C). Final results: English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440).
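The evaluation measure H(T|C) can be estimated from (gold tag, induced cluster) pairs of a tagged text; a sketch under that reading, with illustrative names:

```python
# Conditional entropy H(T|C) in bits, and the associated perplexity 2^H(T|C).
import math
from collections import Counter

def conditional_entropy(pairs):
    """pairs: list of (tag, cluster) observations from a tagged text."""
    joint = Counter(pairs)
    cluster = Counter(c for _, c in pairs)
    n = len(pairs)
    h = 0.0
    for (t, c), nc in joint.items():
        h -= (nc / n) * math.log2(nc / cluster[c])   # -p(t,c) log p(t|c)
    return h

def perplexity(h):
    return 2 ** h
```

If each cluster maps to exactly one tag, H(T|C) = 0 and the perplexity is 1; a cluster split evenly over two tags contributes a full bit.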
30
Example
Word:     From  the  familiar  to    the  exotic  the  collection  is  a    delight
POS tag:  Prep  At   JJ        Prep  At   JJ      At   NN          V   At   NN
Cluster:  C200  C1   C331      C5    C1   C331    C1   C221        C3  C1   C220