
Linguistic Processing in Lattice-Based Taxonomy Construction. Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva, State University.


1 Linguistic Processing in Lattice-Based Taxonomy Construction. Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva. State University Higher School of Economics, Moscow, School of Applied Mathematics and Computer Science. CLA 2010, Seville, Spain, October 19-21, 2010.

2 Outline
Motivation in social studies and the data
Building a lattice-based taxonomy over a text corpus
Natural language processing techniques for automatic attribute acquisition
– Keyword extraction
– Probabilistic latent modeling of text
– Named entity recognition

3 Motivation
Represent the structure of a given domain in the form of a lattice-based taxonomy
– Interdisciplinary research project "Discrete mathematical models for political analysis of democratic institutions and human rights"
– Speeches of Western leaders and international organizations
– The context in which Russia is addressed
– The role and importance of the democracy and human rights agenda
Construct a context from the text corpus
– Extract a set of attributes from the texts to describe the documents
– Analyze and develop natural language processing methods

4 The data: 26 full speeches of foreign leaders

5 Constructing a lattice-based taxonomy over a text corpus
– Preliminary text processing
– Attribute extraction for describing the documents
– Building and pruning the lattice

6 Three kinds of taxonomies
Three kinds of taxonomies, depending on the attribute type:
– frequent words
– latent topics
– named entities

7 Building a taxonomy with frequent words
– eliminating stop words
– stemming: collapsing all morphological variants of a term to a single root form
– describing each document with its N most frequent terms
– building and pruning the lattice
Document-term context:
        t1   …   tn
Doc1    X    …   -
Doc2    -    X   …
DocT    X    …   X
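The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the `crude_stem` suffix stripper (a toy stand-in for a real stemmer), the function names and the two sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not give theirs).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "with", "for", "on"}

def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def top_terms(text, n):
    """Return the set of the n most frequent stems in a text, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return {term for term, _ in Counter(stems).most_common(n)}

def build_context(docs, n):
    """Map each document name to its attribute set (its n most frequent stems)."""
    return {name: top_terms(text, n) for name, text in docs.items()}

# Hypothetical sample texts, invented for this sketch.
docs = {
    "Doc1": "Security and energy: energy security is the security of Europe.",
    "Doc2": "Energy markets and energy trade with Russia; trade matters.",
}
context = build_context(docs, n=2)
for name, attrs in context.items():
    print(name, sorted(attrs))
```

Each row of the resulting mapping corresponds to one row of the cross table: a term is marked for a document exactly when it appears in that document's attribute set.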

8 Lattice based on frequent words: 31 formal concepts. Figures in squares show the number of documents in each concept.
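To make the notion of a formal concept and its document count concrete, here is a small sketch that enumerates all concepts of a toy context and then prunes them. The pruning rule used here (keep concepts covering at least `min_docs` documents) is an assumption; the slides do not say which criterion was applied. The function names and the three-document context are likewise hypothetical.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a {object: attribute_set} context."""
    objects = list(context)
    all_attrs = set().union(*context.values())
    # Every concept intent is an intersection of object rows (or the full attribute set),
    # so we close the set of intents under intersection with each row in turn.
    intents = {frozenset(all_attrs)}
    for obj in objects:
        row = frozenset(context[obj])
        intents |= {row & known for known in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o in objects if intent <= context[o])
        concepts.append((extent, intent))
    return concepts

def prune(concepts, min_docs):
    """Assumed pruning rule: drop concepts whose extent has fewer than min_docs objects."""
    return [(e, i) for e, i in concepts if len(e) >= min_docs]

# Hypothetical toy context, invented for this sketch.
context = {
    "Doc1": {"security", "energy"},
    "Doc2": {"energy", "trade"},
    "Doc3": {"security", "energy", "trade"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    # The first number plays the role of the figure in each square of the diagram.
    print(len(extent), sorted(extent), sorted(intent))
```

This naive enumeration is adequate for a corpus of a few dozen documents; larger contexts would call for a dedicated algorithm.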

9 According to the word-frequency taxonomy:
– security issues and Russia's relations with Europe are the most discussed topics, along with some global problems
– democracy and human rights are not included in the presented taxonomy because of pruning
◦ the words "democracy", "human" and "right" appear in concepts that include speeches by Barack Obama and Hillary Clinton

10 Probabilistic latent semantic analysis (pLSA)
P(z): the distribution over topics z in a particular document
P(w | z): the probability distribution over words w given topic z
T: the number of topics
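With these quantities, the usual topic-mixture form of the word probability within a document can be written as below. This is a sketch in standard notation rather than a copy of the slide's formula: the index i over word positions and the index j over topics are assumptions introduced here.

```latex
P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)
```

Each word is thus explained as a mixture over the T topics, weighted by how prominent each topic is in the document.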

11 Building a taxonomy with latent topics
– probabilistic modeling of text: documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words
– 20 topics were derived from the 26 documents
– the 20 topics were used as attributes for describing the documents
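The following sketch shows one way topic weights could be turned into binary attributes for a formal context. It fits pLSA with a plain EM loop and then keeps, for each document, the topics whose weight reaches a threshold. The EM fitting details, the thresholding step, the function names and the toy count matrix are all assumptions; the slides do not describe how topic weights were converted into attributes.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts[d][w] is the count of word w in document d.
    Returns (p_z_d, p_w_z): P(z | d) per document and P(w | z) per topic."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialisation, normalised into probability distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]

    for _ in range(n_iter):
        # Small constants keep every distribution strictly positive.
        new_zd = [[1e-12] * n_topics for _ in range(n_docs)]
        new_wz = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(z | d) * P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # M-step accumulators: expected counts per topic.
                for z in range(n_topics):
                    new_zd[d][z] += counts[d][w] * post[z]
                    new_wz[z][w] += counts[d][w] * post[z]
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z

def topics_as_attributes(p_z_d, threshold):
    """Assumed binarisation: a document has topic z as an attribute if P(z | d) >= threshold."""
    return [{z for z, p in enumerate(row) if p >= threshold} for row in p_z_d]

# Hypothetical toy counts: 4 documents over a 4-word vocabulary.
counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 3, 6]]
p_z_d, p_w_z = plsa(counts, n_topics=2)
attrs = topics_as_attributes(p_z_d, threshold=0.5)
print(attrs)
```

The resulting attribute sets can be fed to any formal-concept construction in the same way as the frequent-word attributes.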

12 Six of the 20 topics obtained from the documents: word distributions over topics
Topics (columns):
1 Economics and financial crisis
2 Democracy and human rights
3 Future of the US and weapon issues
4 France and ecological problems
5 Russian-Georgian conflict
6 Russia and energy issues

1         2          3            4            5           6
crisi     right      nation       franc        georgia     russia
presid    human      unit         summit       russian     [only five stems survive in this row; the column of "russian" is uncertain]
finance   govern     nuclear      responc      intern      interest
econom    peopl      america      final        georgian    energy
system    democraci  american     french       territori   medvedev
govern    work       interest     preapar      south       issu
reform    women      futur        longer       order       rule
propos    democrat   weapon       lead         process     trust
time      protect    alli         choic        ethnic      dialog
market    principl   centuri      environment  feder       area
subject   societi    war          african      direct      agreement
unit      account    common       debat        address     partnership
bank      univers    year         renew        ossetia     trade
septemb   commun     prosper      organ        plan        law
reason    leader     forward      africa       sepatatism  intern
euro      Life       partnership  collect      august      neighbor
war       clinton    great        contribut    absolut     common
promot    independ   goal         ambiti       bomb        gas

13 Lattice based on latent topics: 17 formal concepts


15 According to the latent-topic taxonomy
The most topical issues are those connected with:
– the European Union
– global problems
– security issues
– energy resources
– the Russian-Georgian conflict
– possible ways of solving conflicts and problems
The topic of democracy and human rights is not included in the presented taxonomy because of pruning
◦ the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy

16 Building a taxonomy with named entities
– 38 paragraphs were extracted from the 26 speeches; they cover solely issues concerning Russia
– three types of named entities were used for describing the documents
◦ names of persons
◦ organizations
◦ geographical objects

17 Lattice built from paragraphs and named entities: 21 concepts

18 Concluding remarks
– several techniques have been proposed for building a context over a text corpus
– frequent words made it possible to identify which questions foreign leaders raise most often regarding Russia
– latent topic modeling made it possible to specify and describe these issues more thoroughly
– named entities would be more informative if used in the context of latent topics
– the text corpus should be expanded

19 Thank you!

