1
A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali Monojit Choudhury Microsoft Research India monojitc@microsoft.com
2
Co-authors Chris Biemann University of Leipzig Joydeep Nath Animesh Mukherjee Niloy Ganguly Indian Institute of Technology Kharagpur
3
Language – A Complex System Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse. Function: communication through recursive syntax and compositional semantics. Dynamics: evolution and language change.
4
Computational Linguistics Study of language using computers; study of language-using computers. Natural Language Processing: speech recognition, machine translation, automatic summarization, spell checkers, information retrieval & extraction, …
5
Labeling of Text Lexical category (POS tags); syntactic category (phrases, chunks); semantic role (agent, theme, …); sense; domain-dependent labeling (genes, proteins, …). How to define the set of labels? How to (learn to) predict them automatically?
6
Distributional Hypothesis "A word is characterized by the company it keeps" – Firth, 1957. Syntax: function words (Harris, 1968). Semantics: content words.
7
Outline Defining Context Syntactic Network of Words Complex Network – Theory & Applications Chinese Whispers: Clustering the Network Experiments Topological Properties of the Networks Evaluation Future work
8
Feature Words Estimate the unigram frequencies. Feature words: the m most frequent words.
9
Feature Vector Example sentence: "From the familiar to the exotic, the collection is a delight". (Figure: for a target word, indicator vectors over the feature words fw_1 … fw_200 are recorded at the four context positions p−2, p−1, p+1, p+2; the slide shows the bits set for the feature words "the", "to", "is", "from".)
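The construction above can be sketched in Python. This is a minimal illustration, not the original system; `feature_words` and `context_vector` are hypothetical names:

```python
# Sketch of the slide's feature-vector construction: for each occurrence of a
# target word, record which feature words appear at positions p-2, p-1, p+1, p+2.
from collections import Counter

def feature_words(corpus, m):
    """Return the m most frequent words of the corpus (the feature words)."""
    return [w for w, _ in Counter(corpus).most_common(m)]

def context_vector(corpus, target, fws):
    """Concatenated indicator counts of feature words at the four context
    positions around every occurrence of `target` (length 4 * len(fws))."""
    index = {w: i for i, w in enumerate(fws)}
    vec = [0] * (4 * len(fws))
    for i, w in enumerate(corpus):
        if w != target:
            continue
        for slot, off in enumerate((-2, -1, 1, 2)):
            j = i + off
            if 0 <= j < len(corpus) and corpus[j] in index:
                vec[slot * len(fws) + index[corpus[j]]] += 1
    return vec
```

With the slide's example sentence, the target "familiar" has "the" at p−1 and p+2 and "to" at p+1, which is exactly what the vector records.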
10
Syntactic Network of Words (Figure: words – light, color, red, blue, blood, sky, heavy, weight – connected by weighted edges, with example co-occurrence counts 100, 20, 1; the edge between two words is weighted by 1 – cos(red, blue), i.e. the cosine between their feature vectors.)
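The edge weights can be computed from the context vectors of the previous slide; a small sketch, where `cosine` and `distance` are illustrative helpers:

```python
# Cosine similarity between two context vectors, and the slide's 1 - cos weight.
import math

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def distance(u, v):
    """Edge weight in the slide's formulation: 1 - cos(u, v)."""
    return 1.0 - cosine(u, v)
```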
11
The Chinese Whispers Algorithm (Figure: the word graph – light, color, red, blue, blood, sky, heavy, weight – with edge weights 0.9, 0.5, 0.9, 0.7, 0.8, −0.5; successive slides animate the label propagation.)
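A minimal Chinese Whispers sketch (after Biemann, 2006): every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbours, in random order, until nothing changes. The graph encoding here is illustrative, not the original implementation:

```python
# Chinese Whispers graph clustering, sketched with a plain dict-of-edges graph.
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """edges: dict mapping (u, v) -> weight, treated as undirected."""
    rng = random.Random(seed)
    nbrs = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    label = {n: n for n in nodes}   # every node starts in its own class
    order = list(nodes)
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for n in order:
            if not nbrs[n]:
                continue
            scores = {}
            for m, w in nbrs[n].items():
                scores[label[m]] = scores.get(label[m], 0.0) + w
            best = max(scores, key=scores.get)
            if best != label[n]:
                label[n] = best
                changed = True
        if not changed:
            break
    return label
```

On a toy graph with two well-separated word groups, the procedure settles on one label per group.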
14
Experiments Corpus: Anandabazaar Patrika (17M words). We build networks G_n,m with n = corpus size ∈ {1M, 2M, 5M, 10M, 17M} and m = number of feature words ∈ {25, 50, 100, 200}. Number of nodes: 5000; number of edges: ~150,000.
15
Topological Properties: Cumulative Degree Distribution CDD: P_k is the probability that a randomly chosen node has degree ≥ k. (Figure: for G_17M,50, P_k is linear in −log(k), so p_k = −dP_k/dk ∝ 1/k – a Zipfian distribution!)
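The cumulative degree distribution from the plot can be computed directly from a degree list; a small sketch with an illustrative helper name:

```python
# P_k = fraction of nodes whose degree is >= k, for each observed degree k.
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Return {k: P_k} for every degree value that occurs."""
    n = len(degrees)
    counts = Counter(degrees)
    pk, remaining = {}, n
    for k in sorted(counts):
        pk[k] = remaining / n     # nodes with degree >= k
        remaining -= counts[k]
    return pk
```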
16
Topological Properties: Clustering Coefficient Measures the transitivity of the network, or equivalently the proportion of triangles. Very small for random graphs, high for social networks. Mean CC for G_17M,50: 0.53. (Figure: CC vs. degree.)
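The local clustering coefficient of the slide – how many pairs of a node's neighbours are themselves linked – can be sketched as follows (adjacency format illustrative):

```python
# Local clustering coefficient on a dict-of-sets adjacency structure.
def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))
```

A node inside a triangle has coefficient 1.0; the middle of a bare path has 0.0, which is why random graphs score low and social networks high.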
17
Topological Properties: Cluster Size Distribution (Figures: cluster size vs. rank – variation with n at m = 50, and variation with m at n = 17M.)
18
Evaluation: Tag Entropy Each word w is represented by a binary vector over the tag set, e.g. w with tags {t1, t6, t9}. For a cluster C = {w1, w2, w3, w4}, TE(C) is computed from the tag vectors of its members. (The slide's worked example gives TE(C) = 2.)
19
Mean Tag Entropy MTE = (1/N) Σ TE(C_i). Weighted MTE = Σ |C_i| TE(C_i) / Σ |C_i|. Caveat: if every word is placed in a separate cluster, MTE and WMTE are both 0. Baseline: every word in a single cluster.
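One plausible reading of these measures, with TE(C) taken as the Shannon entropy of the gold-tag distribution inside a cluster; a sketch with illustrative names, not the original evaluation code:

```python
# Tag entropy of one cluster, plus the mean and size-weighted mean over clusters.
import math
from collections import Counter

def tag_entropy(tags):
    """Shannon entropy (bits) of the tag distribution within one cluster."""
    counts = Counter(tags)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_tag_entropy(clusters):
    """MTE = (1/N) * sum of TE(C_i) over the N clusters."""
    return sum(tag_entropy(c) for c in clusters) / len(clusters)

def weighted_mean_tag_entropy(clusters):
    """WMTE = sum |C_i| * TE(C_i) / sum |C_i|."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * tag_entropy(c) for c in clusters) / total
```

A pure cluster scores 0; a cluster split evenly between two tags scores 1 bit, matching the caveat that singleton clusters trivially reach 0.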
20
Tag Entropy vs. Corpus Size (m = 50) – % reduction in tag entropy:
Corpus size:  1M     2M     5M     10M    17M
MTE:          74.49  75.14  76.09  78.29  74.94
WMTE:         17.46  18.68  24.23  27.56  30.60
21
The Bigger the Worse! (Figure: tag entropy vs. cluster size.)
22
Clusters … Big ones are bad ones – a mix of everything! Medium-sized clusters are good. http://banglaposclusters.googlepages.com/home
Rank  Size  Type
5     596   Proper nouns, titles and posts
6     352   Possessive case of nouns (common, proper, verbal) and pronouns
8     133   Nouns (common, verbal) forming compounds with "do" or "be"
11    44    Number-Classifier (e.g. 1-TA, ekaTA)
12    84    Adjectives
23
More Observations Words are split into: first names vs. surnames; animate noun-possessives vs. inanimate noun-possessives; nouns-acc vs. nouns-poss vs. nouns-loc; verb-finite vs. verb-infinitive. Syntactic or semantic? Nouns related to professions, months, days of the week, stars, players, etc.
24
Advantages No labeled data required: a good solution to resource scarcity. No prior class information: circumvents issues related to tag set definition. A computational definition of word class. Understanding the structure of language (syntax) and its evolution.
25
Danke für Ihre Aufmerksamkeit. Dieses ist „vom Übersetzer übersetzt worden, der" von Phasen Microsoft Beta ist. (Deliberately garbled German machine-translation output.) Thank you for your attention. This has been translated by "Translator Beta" from Microsoft Live.
26
Related Work Harris, 68: distributional hypothesis for syntactic classes. Miller and Charles, 91: function words as features. Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: the general technique. Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging. Dasgupta and Ng, 07: Bengali POS induction through morphological features.
27
Medium and Low Frequency Words Neighboring (window 4) co-occurrences, ranked by log-likelihood and thresholded by θ. Two words are connected iff they share at least 4 neighbors.
Language:  English  Finnish  German
Nodes:     52857    85627    137951
Edges:     691241   702349   1493571
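The shared-neighbour criterion above can be sketched as follows; the log-likelihood ranking is assumed already done, and the function name is illustrative:

```python
# Connect two words iff their (already ranked) co-occurrence neighbour sets
# overlap in at least k items -- the slide's criterion with k = 4.
def shared_neighbour_edges(neighbours, k=4):
    """neighbours: dict mapping word -> set of its top co-occurrence neighbours."""
    words = sorted(neighbours)
    edges = []
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if len(neighbours[u] & neighbours[v]) >= k:
                edges.append((u, v))
    return edges
```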
28
Construction of Lexicon Each word is assigned a unique tag based on the word class it belongs to. Class 1: sky, color, blood, weight. Class 2: red, blue, light, heavy. Ambiguous words: high- and medium-frequency words that formed singleton clusters receive the possible tags of neighboring clusters.
29
Training and Evaluation Unsupervised training of a trigram HMM using the clusters and lexicon. Evaluation: tag a text for which a gold standard is available; estimate the conditional entropy H(T|C) and the related perplexity 2^H(T|C). Final results: English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440).
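The evaluation measure H(T|C) can be estimated from (gold tag, induced cluster) pairs of a tagged text; a sketch under that reading, with illustrative names:

```python
# Conditional entropy H(T|C) in bits, and the associated perplexity 2^H(T|C).
import math
from collections import Counter

def conditional_entropy(pairs):
    """pairs: list of (tag, cluster) observations from a tagged text."""
    joint = Counter(pairs)
    cluster = Counter(c for _, c in pairs)
    n = len(pairs)
    h = 0.0
    for (t, c), nc in joint.items():
        h -= (nc / n) * math.log2(nc / cluster[c])   # -p(t,c) log p(t|c)
    return h

def perplexity(h):
    return 2 ** h
```

If each cluster maps to exactly one tag, H(T|C) = 0 and the perplexity is 1; a cluster split evenly over two tags contributes a full bit.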
30
Example
Word:     From  the  familiar  to    the  exotic  the  collection  is  a    delight
POS tag:  Prep  At   JJ        Prep  At   JJ      At   NN          V   At   NN
Cluster:  C200  C1   C331      C5    C1   C331    C1   C221        C3  C1   C220