

Abstracting concepts from text documents by using an ontology
E. Chernyak (1), O. Chugunova (1), J. Askarova (1), S. Nascimento (2), B. Mirkin (1,3)
(1) Division of Applied Mathematics and Informatics, NRU-HSE, Moscow, Russia
(2) Department of Informatics, New University of Lisbon, Caparica, Portugal
(3) Department of Computer Science, Birkbeck University of London, London, UK

Contents
1. Statement of the problem
2. Method
3. Examples of application
4. Future work

Statement of the problem
Interpretation of a text corpus over a taxonomy (the main part of an ontology)

Input
- A collection of abstracts from the Journal of the ACM
- The ACM Computing Classification System (ACM-CCS, 1998)
...

Output
Head subjects and related events (gap, offshoot)

Code  Membership value  ACM-CCS Topic
F                       Complexity Measures and Classes
H                       Languages
F                       Tradeoffs between Complexity Measures
H                       Logical Design
F                       Models of Computation
H                       Systems
D                       Metrics
H                       Database Applications
J                       SOCIAL AND BEHAVIORAL SCIENCES
K                       General
H                       Database Machines
F                       Nonnumerical Algorithms and Problems
I                       Algorithms
...

Head Subjects (Interpretation):
- H.2 DATABASE MANAGEMENT
- F. Theory of Computation

Method
1. Building a profile of the corpus
   A. Annotated suffix tree for abstracts and keywords (Pampapathi, Mirkin, Levene, 2006)
   B. Scoring ACM-CCS leaves, including references between them
   C. Clustering the profiles (if needed)
2. Lifting the profile in the taxonomy tree
   A. Specifying head subject, gap and offshoot penalty weights
   B. Parsimonious lifting (Mirkin, Nascimento, Fenner, Pereira, 2010)

Annotated Suffix Tree (AST)
Used to compute and store the frequencies of all substrings of a string
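The idea of storing substring frequencies in a tree can be illustrated with a minimal sketch. It assumes a naive, uncompressed suffix trie built in quadratic time, which is enough to show the annotation but is not how a production AST would be built. The names `ASTNode`, `build_ast`, `substring_frequency` and `match_score` are hypothetical, and the averaging of conditional frequencies in `match_score` is only one plausible scoring variant, not necessarily the one used in the talk.

```python
class ASTNode:
    """One node of the tree: child links plus a frequency annotation."""

    def __init__(self):
        self.children = {}  # character -> ASTNode
        self.count = 0      # number of suffixes passing through this node


def build_ast(text):
    """Insert every suffix of `text`, incrementing counts along each path."""
    root = ASTNode()
    for start in range(len(text)):
        root.count += 1  # the root counts the number of suffixes
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, ASTNode())
            node.count += 1
    return root


def substring_frequency(root, s):
    """Number of occurrences of `s` in the indexed text (0 if absent)."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def match_score(root, s):
    """Assumed scoring variant: for each suffix of `s`, follow the longest
    match in the tree and average count(child) / count(parent) along it;
    then average these values over all suffixes of `s`."""
    if not s:
        return 0.0
    total = 0.0
    for start in range(len(s)):
        node, probs = root, []
        for ch in s[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            probs.append(child.count / node.count)
            node = child
        if probs:
            total += sum(probs) / len(probs)
    return total / len(s)


root = build_ast("xabxac")
print(substring_frequency(root, "xa"))  # prints 2
print(match_score(root, "xab"))
```

In this sketch an ACM-CCS topic name would play the role of `s`, and an abstract (with its keywords) the role of the indexed text, so that a higher score suggests a closer match between topic and abstract.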

Lifting
Represent the thematic clusters in ACM-CCS by higher, more general nodes, depending on the inconsistencies (gaps and offshoots) that the lift incurs
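A simplified sketch of lifting under a parsimony criterion is given below. It assumes a crisp set of query leaves (no membership values), a taxonomy stored as a dictionary of child lists, and three penalty weights for head subjects, gaps and offshoots. The toy taxonomy, the weight values and the names `gaps_if_covered` and `lift` are all hypothetical; the published algorithm also handles fuzzy memberships, which this sketch does not.

```python
# Toy taxonomy (hypothetical): node -> list of children; leaves have no entry.
TREE = {
    "root": ["A", "B"],
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2", "B3"],
}
QUERY = {"A1", "A2", "A3", "B1"}  # leaves belonging to the thematic cluster


def gaps_if_covered(tree, node, query):
    """Maximal query-free nodes in the subtree of `node`; these become gaps
    if `node` (or an ancestor) is chosen as a head subject."""
    children = tree.get(node, [])
    if not children:
        return [] if node in query else [node]
    child_gaps = [gaps_if_covered(tree, c, query) for c in children]
    if all(g == [c] for g, c in zip(child_gaps, children)):
        return [node]  # the whole subtree is query-free: one maximal gap
    return [g for gaps in child_gaps for g in gaps]


def lift(tree, node, query, head_w=1.0, gap_w=0.5, offshoot_w=0.75):
    """Return (penalty, events) for the cheapest interpretation below `node`."""
    children = tree.get(node, [])
    if not children:
        if node in query:
            return offshoot_w, {"heads": [], "gaps": [], "offshoots": [node]}
        return 0.0, {"heads": [], "gaps": [], "offshoots": []}

    # Option 1: do not lift to this node; combine the children's solutions.
    pass_penalty = 0.0
    pass_events = {"heads": [], "gaps": [], "offshoots": []}
    for child in children:
        penalty, events = lift(tree, child, query, head_w, gap_w, offshoot_w)
        pass_penalty += penalty
        for key in pass_events:
            pass_events[key] += events[key]

    # Option 2: make this node a head subject and pay for the gaps below it.
    node_gaps = gaps_if_covered(tree, node, query)
    gain_penalty = head_w + gap_w * len(node_gaps)

    if gain_penalty < pass_penalty:
        return gain_penalty, {"heads": [node], "gaps": node_gaps, "offshoots": []}
    return pass_penalty, pass_events


penalty, events = lift(TREE, "root", QUERY)
print(penalty, events)
# prints 2.25 {'heads': ['A'], 'gaps': ['A4'], 'offshoots': ['B1']}
```

With these weights the three query leaves under A are lifted to the head subject A at the cost of one gap (A4), while B1 stays an offshoot because lifting to B would create two gaps. Changing the weights changes the outcome, which is why their choice matters.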

Applications I
- Journal of the ACM abstracts and the ACM-CCS
- Course syllabuses of Mathematics and Informatics disciplines and an in-house taxonomy of Mathematics and Informatics, built using documentation of the Supreme Attestation Committee of Russia (in Russian)

Applications II
Two AST-profiles: A – good
Profile A. Two-variable logic on data trees and XML reasoning

AST-profile
ID  TE  ACM-CCS topic
H       Languages
I       Languages and Systems
F       Formal Languages
H       Logical Design
D       Reliability
I       Simulation Languages

ACM-CCS index terms
ID     #   ACM-CCS topic
H.2.3  0   Languages
F.4.3  2   Formal Languages
F.4.1  27  Mathematical Logic
I.7.2  52  Document Preparation

Applications II
Two AST-profiles: B – poor
Profile B. Lower bounds for processing data with few random accesses to external memory

AST-profile
ID  TE  ACM-CCS topic
H       Database Applications
H       Heterogeneous Databases
C       Large and Medium ("Mainframe") Computers
J       ADMINISTRATIVE DATA PROCESSING
I       Natural Language Processing

ACM-CCS index terms
ID  #  ACM-CCS topic
F      Complexity Measures and Classes
H      Systems
F      Models of Computation

Conclusion
Interpretation by producing profiles and lifting them in the taxonomy

Issues
A. AST scoring is slow and noisy
B. The taxonomies are not quite relevant
C. Penalty weights? (Future work: replace the parsimony criterion with the maximum likelihood criterion)