Problems of Ontology Development for a Broad Domain Loukachevitch Natalia Leading Researcher of Lomonosov Moscow State University Center.


1 Problems of Ontology Development for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru Leading Researcher of Lomonosov Moscow State University Center for Information Research Lomonosov Moscow State University Research Computing Center

2 Technologies: Ontologies for Natural Language Processing and Information Retrieval Applications. Applications –Conceptual indexing –Query expansion –Text categorization –Document clustering –Question answering –Automatic summarization Linguistic ontologies –RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions) –Ontology on Natural Sciences and Technologies (60 thousand concepts) –Banking Thesaurus for information retrieval applications, and others

3 Projects of Our Research Group-1 State Bodies –Central Bank of the Russian Federation (2006 –..) Development of banking thesaurus, conceptual indexing, text categorization –Central Election Committee of the RF (1999 –..) Information-retrieval system, conceptual indexing, text categorization –State Duma of RF (1999 –..) Information retrieval system on Duma records –Accounting Chamber of RF (2003) Creation of a terminology dictionary –other state bodies Text categorization, clustering, development of domain-specific ontologies

4 Projects of Our Research Group-2 Commercial organizations –Rambler Media company (2007–..) Automatic clustering, categorization, and summarization of the news flow; personalization of news and advertisements; spam detection; information extraction –Garant Legal Information Company (2002 – …) Text categorization of legal documents; summarization of court decisions; learning to rank in information retrieval –etc.

5 Plan of Tutorial Ontologies: general remarks –Main paradigms and their problems –Level of formalization Broad vs. simple domains –Boundaries of a domain –Main source of knowledge - texts Domain-specific texts –Concepts and terms, term extraction –Synonyms and near-synonyms –Ambiguity of terms –Establishing relations Example: Ontology-based text categorization

6 Domains and Tasks Ontology vs. Machine Learning? –Description of domains is difficult –Data may need generalization –Some knowledge may already be described in ontology-based resources Therefore, for many tasks we need Ontology + Machine learning

7 Ontologies: general remarks Ontology - formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts Main components: –Concepts (classes) –Instances (individuals) –Relations –Attributes –Axioms (rules)

8 Taxonomy [Figure: taxonomy diagram with the nodes object, organism, animal, mammal, cat, siamese, frog, and the labels "Classes" and "instances"]

9 Knowledge management domain

10 Ontology development paradigms Formal, logically sound ontologies –Logical inference –Some domains are difficult to formalize –Inconsistency is a huge problem Semantic Web –A lot of specific ontologies –RDF triples, Same_as links –A lot of “messy” data Ontologies for Natural Language Processing –Less formal –Related to language semantics –Formalization is restricted by the current state of natural language processing

11 Ontology-1: Ontology Spectrum (Obrst, 2006) [Figure: spectrum from weak semantics to strong semantics, from less to more expressive. Levels and their relations: Taxonomy (Is Sub-Classification of); Thesaurus (Has Narrower Meaning Than); Conceptual Model (Is Subclass of); Logical Theory (Is Disjoint Subclass of, with transitivity property). Formalisms placed along the spectrum: Relational Model, XML; ER; Extended ER; DB Schemas, XML Schema; XTM; RDF/S; UML; DAML+OIL, OWL; Description Logic; First Order Logic; Modal Logic. Interoperability bands: Syntactic Interoperability, Structural Interoperability, Semantic Interoperability]

12 Expressivity vs. community-size (Hepp, 2007)

13 Ontology-2: Semantic Web. Linking Data Project http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

14 Approach 3. Ontologies for Natural Language Processing –Relations between concepts and lexical meanings are quite complex –How to represent synonyms and near-synonyms –How detailed the representation of lexical senses of ambiguous words should be –Large volume vs. complexity of description –WordNet as a symbol of this approach (!) –For different tasks, different types of ontologies

15 Plan of Tutorial Ontologies: general remarks –Main paradigms and their problems –Level of formalization Broad vs. simple domains –Boundaries of a domain –Main source of knowledge - texts Domain-specific texts –Concepts and terms, term extraction –Synonyms and near-synonyms –Ambiguity of terms –Establishing relations Example: Ontology-based text categorization

16 Complicated vs. simple domains Simple domains (wine ontology) –Explicit boundaries –Boundaries are determined by “physical processes”, e.g. production, services –Clear roles of entities –Small number of classes (which may have many instances) or many uniform classes Complicated domains (terrorism, financial control) –Vague boundaries –The same entities are used in different roles and functions –Knowledge is stored in text documents

17 Wine ontology http://www.w3.org/TR/owl-guide/wine.rdf [Figure: class diagram with the nodes Wine, WhiteWine, RedWine, WhiteBurgundy, WhiteLoire, WhiteBordeaux, TableWine, SweetWine, Region, Grape, Meal course]

18 Complicated domains: vague boundaries Interdisciplinarity –State financial control (economy + law + finances) –Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons…) Two main parts –Center of the domain –Additional concepts from neighbour domains

19

20 Boundaries of domain: Terrorism Center of domain –Terrorist acts, groups, terrorists –Anti-terrorist activity Additional spheres –Geographic places, –Weapons and explosives, –Transport, –Financial payment, –Ideology, Religion etc. Re-use of ontologies?

21 Problem: Distortion of Reality General concepts necessary for domain description are treated as subordinates of domain concepts The name of a concept is general, but its intended sense is domain-specific –Law (= antiterrorist law) –Intelligence (= antiterrorist intelligence) This causes problems in ontology mapping and ontology reuse Thesaurus on Radiological terrorism http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/

22 Example: distortion of reality

23 Plan of Tutorial Ontologies: general remarks –Main paradigms and their problems –Level of formalization Broad vs. simple domains –Boundaries of a domain –Main source of knowledge - texts Domain-specific texts –Concepts and terms, term extraction –Synonyms and near-synonyms –Ambiguity of terms –Establishing relations Example: Ontology-based text categorization

24 Ontology Development and Domain-Specific Texts Knowledge is stored in texts Domain-specific text collection –As many texts as possible –Necessary to find exact boundaries Automatic extraction of terms from texts (term acquisition) –Terms are expressions corresponding to concepts of a specific domain Top-level modeling Use of existing ontologies

25 Automatic Term Acquisition from Texts –Linguistic criteria (noun groups) –Lexical restrictions (e.g., evaluative words such as good and bad are rarely parts of terms) –Statistical criteria (frequency, mutual information, and many others) –Use of machine learning approaches to improve term extraction –Formation of an ordered list of term candidates
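
The statistical and lexical criteria on this slide can be sketched as follows. This is a minimal illustration, not the authors' extraction system: it considers only two-word candidates, skips the linguistic (noun group) filter entirely, and the stop list `EVALUATIVE`, the function names, and the `min_freq` threshold are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed stop list: evaluative words that rarely occur inside terms.
EVALUATIVE = {"good", "bad"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_bigram_candidates(texts, min_freq=2):
    """Return (phrase, frequency, PMI) tuples, most frequent first."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    ranked = []
    for (w1, w2), freq in bigrams.items():
        # Statistical criterion: drop rare pairs.
        # Lexical restriction: drop pairs containing evaluative words.
        if freq < min_freq or w1 in EVALUATIVE or w2 in EVALUATIVE:
            continue
        p_pair = freq / total_bi
        p_w1 = unigrams[w1] / total_uni
        p_w2 = unigrams[w2] / total_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))  # pointwise mutual information
        ranked.append((f"{w1} {w2}", freq, round(pmi, 2)))

    # Ordered list of term candidates: by frequency, then by PMI.
    ranked.sort(key=lambda item: (-item[1], -item[2]))
    return ranked
```

The output is only a candidate list; as the following slides note, deciding which candidates are real domain terms still needs expert review.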

26 The most frequent phrases in documents of financial control domain Translation from Russian –Federal budget –Russian Federation –Accounting Chamber –Federal law –Overall sum (-) –Resources of federal budget (?) –Oblast budget –Financial means –Use of financial means (?) –Wages –Ministry of finance –Budget resources –Tax body

27 Analysis of Term-Candidate List At the beginning of the list there are many evident terms Further down there are many unclear expressions –whether they are terms (domain experts can have different opinions) –whether they are related to the domain –where the boundary of the domain lies A lot of synonymic variants Ambiguity of terms

28 Efficiency of term extraction methods

29 Boundaries of the domain Bottom-up+top-down Term extraction from texts – a bottom-up stage Extracted expressions are necessary to understand what types of entities are needed in the domain – in fact design of top-level taxonomy Top-down analysis Combined approach to concept selection (frequency from the collection+top-level taxonomy restrictions)
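
The combined approach can be sketched as a filter that applies both tests at once. This is a hedged illustration only: the category names in `ALLOWED_TOP_LEVEL`, the function name, the threshold, and the idea that an expert assigns each candidate a top-level category by hand are all assumptions, not details from the slides.

```python
# Assumed top-level categories fixed during top-down analysis.
ALLOWED_TOP_LEVEL = {"organization", "document", "financial resource"}


def select_concepts(candidates, top_level_of, min_freq=5):
    """Keep candidates that are frequent AND fit the top-level taxonomy.

    candidates   -- dict: term -> frequency in the domain collection (bottom-up)
    top_level_of -- dict: term -> top-level category assigned by an expert (top-down)
    """
    selected = []
    for term, freq in candidates.items():
        category = top_level_of.get(term)  # None if the expert found no category
        if freq >= min_freq and category in ALLOWED_TOP_LEVEL:
            selected.append((term, category, freq))
    return sorted(selected, key=lambda item: -item[2])
```

A frequent expression with no suitable top-level category (for instance a generic phrase such as "overall sum") is dropped, and so is a well-categorized but rare one.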

30 Synonyms and variants of “money laundering” CRIMINAL LAUNDERING ILLEGAL LAUNDERING LAUNDERING LAUNDERING ACTIVITIES LAUNDERING OF MONEY LAUNDERING OPERATIONS MONEY LAUNDERING MONEY LAUNDERING ACTIVITIES MONEY LEGALIZATION MONEY WASHING PROFIT LAUNDERING PROFIT WASHING
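
One way to use such a variant list is to map every variant to a single concept when indexing text. The sketch below assumes a plain dictionary from lowercase variant to concept name and a greedy longest-match scan over whitespace-separated tokens; both the data layout and the function name are illustrative assumptions, and punctuation handling is left out.

```python
def find_concepts(text, variant_to_concept):
    """Scan the text left to right, preferring the longest matching variant."""
    tokens = text.lower().split()
    max_len = max(len(variant.split()) for variant in variant_to_concept)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so that "laundering of money"
        # wins over the shorter variant "laundering".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in variant_to_concept:
                found.append(variant_to_concept[phrase])
                i += n
                break
        else:
            i += 1  # no variant starts at this token
    return found


# A few of the variants from the slide, all mapped to one concept.
VARIANTS = {
    "money laundering": "MONEY LAUNDERING",
    "laundering of money": "MONEY LAUNDERING",
    "money washing": "MONEY LAUNDERING",
    "laundering": "MONEY LAUNDERING",
}
```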

31 Lexical ambiguity Homonyms are words that share the same spelling but have different meanings (unrelated in origin) –bank (financial institution vs. river bank) –rarely occur together in one domain, unless the domain is broad –easily recognized by non-linguists –different concepts, different sets of relations Polysemes are words with the same spelling and distinct but related meanings –bank (financial institution vs. building) –occur very often in any domain –regular polysemy (institutions and their buildings) –difficult for non-linguists to recognize –tendency to use the same ontology concept for related senses

32 Lexical ambiguity (homonyms): bow

33 Lexical ambiguity (polysemes) Transport –They have succeeded in stopping the transport of live animals (= moving) –mechanism of contactless payment in public transport (= vehicles) Regular polysemy –Tree vs. wood (material): birch Non-linguists cannot recognize the different senses, but they notice strange deviations in the relations

34 Lexical ambiguity (polysemes) How to help yourself: use unambiguous synonymous phrases –Transport1 = transportation process –Transport2 = transport vehicle –Birch1 = birch tree –Birch2 = birch wood This makes it possible to see the different entities behind closely related senses

35 Relations of an ontology The set of relations of an ontology can be non-evident Main relations –Class-subclass –Instance relation –Role relations Different properties: transitivity and others Old AI books and manuals use the same relation in all cases: “is_a” The diagnostic expression “X is a Y” can sound appropriate in all of these cases, so it does not distinguish between them

36 Class-subclass relation Relation between two sets of entities (classes) (many-to-many): birch – tree Properties: transitivity, inheritance Rules: –If class A is a subclass of class B, then each instance of class A is also an instance of B –Top-level classes (categories) should coincide for A and B –Real example of a mistake: river – water object – water – substance, which implies that the Moscow river is a substance (?)
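
The rule that top-level categories should coincide can be checked mechanically. The sketch below is an illustration under stated assumptions: each class is annotated by hand with a top-level category, and the category labels ("physical object", "substance") are invented for the example rather than taken from any real ontology.

```python
def ancestors(cls, subclass_of):
    """Follow subclass links upward; the relation is transitive."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


def find_category_mismatches(subclass_of, top_category):
    """Return subclass links whose two ends belong to different top-level categories."""
    return [
        (child, parent)
        for child, parent in subclass_of.items()
        if top_category[child] != top_category[parent]
    ]


# The erroneous chain from the slide.
SUBCLASS_OF = {"river": "water object", "water object": "water", "water": "substance"}

# Assumed hand-made annotation of top-level categories.
TOP_CATEGORY = {
    "river": "physical object",
    "water object": "physical object",
    "water": "substance",
    "substance": "substance",
}
```

Here `ancestors("river", SUBCLASS_OF)` reaches "substance", which is how every instance of river, including the Moscow river, ends up being a substance; the mismatch check points at the link "water object" to "water" as the place where the category changes.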

37 Instance relation Relation one-to-many –Moscow river – instance of river –Teacher – instance of profession Not transitive –Rex, Poodle, dog breed, dog: what are the relations? –Rex is an instance of poodle –Poodle is an instance of dog breed –Poodle is a subclass of dog –Rex is not a dog breed –Rex is a dog [Figure: Dog breed, Poodle and Rex connected by Instance and Subclass links, one link crossed out (X)]
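
The Rex example can be written out as a small check. In this sketch (an illustration, with the set-of-pairs layout and function names chosen for the example), an individual is an instance of its asserted class and of that class's superclasses, but instance links are never chained through another instance link.

```python
INSTANCE_OF = {("Rex", "poodle"), ("poodle", "dog breed")}
SUBCLASS_OF = {("poodle", "dog")}


def superclasses(cls):
    """The class itself plus everything reachable through subclass links."""
    result, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for child, parent in SUBCLASS_OF:
            if child == current and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def is_instance(individual, cls):
    """True if an asserted class of the individual is cls or a subclass of cls."""
    return any(
        ind == individual and cls in superclasses(direct)
        for ind, direct in INSTANCE_OF
    )
```

With these definitions `is_instance("Rex", "dog")` holds through the subclass link, while `is_instance("Rex", "dog breed")` does not, because that would require chaining two instance links.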

38 Roles and types Roles: student, employer, terrorist, player Types: Person, animal, building, car Role is a type in some conditions A student is a person in the role of learning Properties of roles: –Roles are created dynamically –Roles can play other roles –A type can play many different roles

39 Confusion of type-role relations with class-subclass relations A frequent mistake of almost every beginner –Not every person is an employer –An organization is not an employer in all situations Problems with inference [Figure: Employer linked to Person and to Organization, both links crossed out (X)]

40 Text-motivated confusion of types and roles “Natural substances such as salt, sugar, vinegar, alcohol,.. are also used as traditional preservatives.” (Wikipedia) “Often salt and other preservatives are added to canned foods.” (http://www.family-health-and-nutrition.com/this-vs-that.html) What is the relation between salt and preservative? –Class-subclass? –Class-instance? –.. In practice, beginners usually try to establish a class-subclass relation; however, this is a type-role relation: preservative is a role of substances.
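
One possible way to keep types and roles apart is to store them in separate fields, so that a role never takes part in subclass inference. The class below is a minimal sketch under that assumption; the entity names (Acme, Ann, Bob) and the attribute names are hypothetical.

```python
class Entity:
    """An entity with one rigid type and a changeable set of roles."""

    def __init__(self, name, entity_type, roles=()):
        self.name = name
        self.entity_type = entity_type  # e.g. "Person", "Organization", "Substance"
        self.roles = set(roles)         # roles can be added and removed over time


acme = Entity("Acme", "Organization", {"Employer"})
ann = Entity("Ann", "Person", {"Employer", "Student"})
bob = Entity("Bob", "Person")
salt = Entity("salt", "Substance", {"Preservative"})

everything = [acme, ann, bob, salt]

# Querying by role does not say anything about the type of the result.
employers = [e.name for e in everything if "Employer" in e.roles]
preservatives = [e.name for e in everything if "Preservative" in e.roles]
```

If Employer were instead declared a subclass of Person, the organization Acme could not be an employer without also being inferred to be a person, and Bob would be an employer simply by being a person's subclass member; keeping roles separate avoids both conclusions.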

41 Automatic extraction of relations from texts There are a lot of scientific publications on the extraction of synonyms, taxonomies, part-whole relations etc. But in a complex domain it is impossible to rely fully on automatic tools In many cases only evident relations are extracted Causes –Multiword expressions –Ambiguity of language expressions –Contextual dependence –Necessity of processing a very large domain text collection

42 Plan of Tutorial Ontologies: general remarks –Main paradigms and their problems –Level of formalization Broad vs. simple domains –Boundaries of a domain –Main source of knowledge - texts Domain-specific texts –Concepts and terms, term extraction –Synonyms and near-synonyms –Ambiguity of terms –Establishing relations Example: Ontology-based text categorization

43 Automatic text categorization Main approaches –Knowledge-based methods (based on rules) –Machine learning methods, very popular at scientific conferences Text categorization in real practice (operational text categorization) –A training collection should exist –Experts should categorize documents in a consistent way –Every category needs a sufficient number of training examples In practice knowledge-based systems are widely used Reuters (the provider of the well-known training collection Reuters-21578) uses a knowledge-based system for text categorization of its own documents

44 Subjectivity of experts Experts’ agreement in manual text categorization is around 60%

45 Our text categorization projects Use of both approaches, depending on the task and the data The knowledge-based approach uses knowledge from our large resource, the RuThes thesaurus Projects –Classifier for Central Election Committee (450 categories, 4 levels) –Classifier of Russian legislation (1169 categories, 3000 categories) –Classifier of English economic research papers (700 categories) –Classifier of public opinion polls (350 categories) –Classifier of banking documents and news (200 categories) –General news classifiers –and others

46 Thesaurus on sociopolitical life Sociopolitical domain: social life of contemporary society Includes: thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts etc. Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports 36 thousand concepts, 100 thousand terms, 140 thousand direct relations Applications: conceptual indexing; automatic text categorization, document clustering, automatic text summarization, question-answering.

47 Socio-Political Domain Levels of Hierarchy Law Accounting Taxation Banking

48 Thesaurus-based text categorization Use of knowledge described in the Thesaurus Manual description of Boolean expressions for categories based on small number of thesaurus concepts Automatic thesaurus-based expansion of Boolean expressions Thesaurus-based thematic representation of the text content independent of the genre and the length of a text (lexical chain technique)

49 Describing a category with supporting concepts Categorization of legal acts 200.020.020. Heads of states summits { ( HEADS OF STATES SUMMIT Y ) OR { ( NEGOTIATIONS N ) ( INTERNATIONAL NEGOTIATIONS Y ) ( INTERNATIONAL CONTACTS N ) ( MEETING N )} AND ( HEAD OF STATE L )}

50 Expanded representation of the category {( HEADS OF STATES SUMMIT Y ) ( summit, summit meeting, top-level meeting, head of states meeting ) OR {( NEGOTIATIONS N ) ( negotiations, talks ) ( INTERNATIONAL NEGOTIATIONS Y ) ( international talks, interstate talks, diplomatic negotiations, international talks, multinational talks, intergovernmental talks, contracting nations, negotiating states …) ( INTERNATIONAL CONTACTS N ) ( international intercourse, transnational contacts… ) ( MEETING N )} AND ( HEAD OF STATE L ) ( leader of country, president, president of country, federal president, RF president, US president, monarch, …, emir, emir of Kuwait … )}
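
The two steps shown on these slides, a small manual Boolean expression followed by automatic thesaurus-based expansion, can be sketched roughly as below. Several things here are assumptions rather than facts from the slides: the meaning of the Y/N/L flags is not given, so the sketch uses a plain boolean "expand through narrower concepts"; the concepts listed side by side inside the inner braces are read as alternatives (OR); and the thesaurus fragment, including EMIR as a narrower concept of HEAD OF STATE, is invented for illustration.

```python
import re

# Assumed thesaurus fragment: concept -> text entries.
ENTRIES = {
    "HEADS OF STATES SUMMIT": ["summit", "summit meeting", "top-level meeting"],
    "NEGOTIATIONS": ["negotiations", "talks"],
    "INTERNATIONAL NEGOTIATIONS": ["international talks", "diplomatic negotiations"],
    "MEETING": ["meeting"],
    "HEAD OF STATE": ["president", "monarch"],
    "EMIR": ["emir"],
}
# Assumed hierarchy: concept -> narrower concepts.
NARROWER = {"HEAD OF STATE": ["EMIR"]}


def expand(concept, use_narrower):
    """Collect text entries of a concept, optionally with all narrower concepts."""
    phrases = list(ENTRIES.get(concept, []))
    if use_narrower:
        for child in NARROWER.get(concept, []):
            phrases.extend(expand(child, True))
    return phrases


def matches(expr, text):
    """Evaluate a nested ("AND" | "OR" | "CONCEPT", ...) expression against a text."""
    op = expr[0]
    if op == "CONCEPT":
        _, concept, use_narrower = expr
        lowered = text.lower()
        return any(
            re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
            for phrase in expand(concept, use_narrower)
        )
    if op == "AND":
        return all(matches(sub, text) for sub in expr[1:])
    if op == "OR":
        return any(matches(sub, text) for sub in expr[1:])
    raise ValueError(f"unknown operator: {op}")


SUMMIT_CATEGORY = (
    "OR",
    ("CONCEPT", "HEADS OF STATES SUMMIT", True),
    ("AND",
     ("OR",
      ("CONCEPT", "NEGOTIATIONS", False),
      ("CONCEPT", "INTERNATIONAL NEGOTIATIONS", True),
      ("CONCEPT", "MEETING", False)),
     ("CONCEPT", "HEAD OF STATE", True)),
)
```

The category author writes only the handful of concept names; the long lists of words and phrases come from the thesaurus, which is what keeps the manual description small.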

51 ROMIP: Russian Seminar on Information Retrieval (Russian TREC) Text categorization task Categories: DMOZ, 247 categories of the 2nd level, Top/World/Russian/*/* Training collection: «DMOZ» (presented by Rambler) –300 000 documents, 2100 sites Testing collection: Belorussian Internet «BY.web» (granted by Yandex company) –1 500 000 documents, 19 000 sites Our task: –Thesaurus-based text categorization –Measuring the time needed to create a categorization system –Evaluation

52 Knowledge-based approach (8 man-hours) –Category 135 «Martial arts» (F1-measure [OR] = 97%, R = 98%, P = 96%) –Boolean expression for the category: MARTIAL ARTS (E), where «E» means full expansion using the thesaurus tree –The expanded description includes: AIKIDO, JIUJUTSU, JUDO, KARATE, JUDOIST, KARATEKA …
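
The reported F1 value follows from the precision and recall on the slide, since F1 is their harmonic mean. A short check (the function name is chosen for this example):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P = 96%, R = 98% from the slide: 2 * 0.96 * 0.98 / 1.94
print(round(f1_score(0.96, 0.98), 2))  # 0.97
```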

53 ROMIP: web-page categorization [or]

54 Benefits from the Use of Large-Scale Linguistic Ontologies in Information Retrieval (Information Retrieval Task: Benefit) –Web Search: 0+ % –Corporate Search / Legal Search: 10 % –Long Queries / Verbose Queries: 15 % –Text Categorization: 15-50 % –News Clustering: 15 % –Summarization, Visualization, Multi Document Summarization: ++ (SUMMAC)

55 Conclusion Complex domains –Broad domains including a lot of heterogeneous entities –Vague boundaries –Knowledge stored in texts Special efforts are needed to find boundaries Knowledge acquisition from texts –Partial automation –Ambiguity and vagueness of natural texts have to be overcome, even by non-linguists

