A Robust System Architecture For Mining Semi-structured Data By Aby M Mathew CSE 633111301999.

Introduction A versatile system architecture for text mining that distinguishes between structured and unstructured data components and maintains both.

Motivation A digital library can contain a very large number of document concepts. Using SQL, it is possible to generate quantitative rules based on given criteria. But what about rules related to a subset, such as: –which journal publishes articles associated with an area of interest?

Presentation Organization Overview of the IRIS system. Differences between structured and unstructured data. How the data is stored. Algorithm used for rule generation. Conclusion.

Overview of the IRIS system [Diagram components: GUI, Rule Generator, Concept Library, Database, IDM, Document Collection]

Brief Description Of Individual Components Rule Generator - parses the user request received via the GUI and determines an execution strategy. Database - contains the structured data, with mappings between tuples and documents. Concept library - maintains the unstructured data as concepts, with mappings between concepts and documents.

Contd. IDM (Information Discovery Module) –extracts concepts and structured values from a document collection –updates the database and the concept library.

Components of the Rule Generator Parser - accepts the request data and prepares it for the optimizer. Optimizer - uses the constraints and the rule type to generate an efficient execution plan. Processor - executes the plans laid out by the optimizer. [Diagram labels: Parser, Optimizer, Processor]

Components of the IDM Discoverer - an intelligent agent that determines domains. Extractor - populates the database and the concept library, based on the domain knowledge. Refresher - helps keep the database and the concept library consistent. [Diagram labels: Discoverer, Extractor, Refresher]

Differences between the two data types Structured data type –Features that form key entities, e.g., Author, Publisher, Date. Unstructured data type –Blocks of text that cannot be identified as structured, e.g., abstracts, headings, paragraphs.

How is the data stored? Structured data is stored using a relational schema that is mapped to a database. Unstructured data is stored in a compressed form using an ECH (extended concept hierarchy).
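The two stores can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `documents` table, its columns (taken from the structured examples Author, Publisher, Date), the `concept_to_docs` mapping, and the sample rows are hypothetical and not the IRIS schema.

```python
import sqlite3

# Hypothetical relational schema for the structured side; the column names
# are illustrative, not taken from IRIS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, publisher TEXT, pub_date TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        (1, "A. Author", "Journal X", "1998-05-01"),
        (2, "B. Author", "Journal Y", "1999-02-10"),
        (3, "C. Author", "Journal X", "1999-07-21"),
    ],
)

# Hypothetical concept library: each concept maps to the ids of the
# documents in which it occurs.
concept_to_docs = {
    "data mining": {1, 3},
    "databases": {2, 3},
}


def docs_with_publisher(publisher):
    """Return the ids of documents whose structured publisher value matches."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE publisher = ?", (publisher,)
    )
    return {doc_id for (doc_id,) in rows}


journal_x_docs = docs_with_publisher("Journal X")
```

Both sides resolve to sets of document ids, which is what lets a structured value and an unstructured concept be combined in one rule.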

Extended Concept Hierarchy This is a hierarchical form of representing data. –It is not always constrained to a tree structure. –Relationships maintain additional links between the entities in the hierarchy.

Example University ECH [Diagram nodes: University, Faculty, Admin, Full, Associate, Provost, Dean, Employees]
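A hierarchy with extra labelled links can be sketched as a small graph. The class `ConceptHierarchy`, the labels `"child"` and `"related"`, and the way the nodes are connected are all guesses made for illustration; the original diagram did not survive, so only the node names come from the slide.

```python
class ConceptHierarchy:
    """Toy concept hierarchy whose links carry a relationship label."""

    def __init__(self):
        # Maps a concept to a list of (label, linked concept) pairs.
        self._links = {}

    def add_link(self, source, label, target):
        self._links.setdefault(source, []).append((label, target))

    def related(self, concept, labels=None):
        """Return concepts linked from `concept`, optionally filtered by label."""
        return [
            target
            for label, target in self._links.get(concept, [])
            if labels is None or label in labels
        ]


ech = ConceptHierarchy()
# Assumed tree-like part of the hierarchy.
ech.add_link("University", "child", "Faculty")
ech.add_link("University", "child", "Admin")
ech.add_link("Faculty", "child", "Full")
ech.add_link("Faculty", "child", "Associate")
ech.add_link("Admin", "child", "Provost")
ech.add_link("Admin", "child", "Dean")
# Assumed additional links: "Employees" is reachable from two nodes,
# so the structure is no longer a strict tree.
ech.add_link("Faculty", "related", "Employees")
ech.add_link("Admin", "related", "Employees")
```

For instance, `ech.related("Faculty", {"related"})` returns only the non-tree link from Faculty.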

Calculation of minimum support (min sup) in ECH If C1 and C2 are two concepts found in the documents, then
$$\text{min sup} = \frac{|\,\text{documents}(C1) \cap \text{documents}(C2)\,|}{|\,\text{documents}(C1) \cup \text{documents}(C2)\,|}$$
where documents(c) is the set of documents in which concept c occurs and |·| is the number of documents in a set.

Example for calculating min sup Say concept C1 appears in 500 documents and C2 appears in 600 documents, 100 of which also contain C1. The two concepts together cover 500 + 600 - 100 = 1000 documents, so min sup = 100 / 1000 = 0.1
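The calculation can be checked with a short sketch. It reads the measure as the number of documents containing both concepts divided by the number containing either one, which matches the figures in the example. The function name `min_sup` and the synthetic document ids are assumptions.

```python
def min_sup(docs_c1, docs_c2):
    """Shared documents divided by documents containing either concept."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Synthetic ids chosen to match the example: 500 documents for C1,
# 600 for C2, and 100 documents (ids 400..499) that contain both.
docs_c1 = set(range(0, 500))
docs_c2 = set(range(400, 1000))

value = min_sup(docs_c1, docs_c2)
print(value)
```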

Algorithm used for rule generation
1. Get the document ids of documents containing the structured data value, using SQL statements (set 'A').
2. Get the document ids of documents containing the unstructured concept C1, using the ECH (set 'B').
3. C = A ∩ B.
4. Get the document ids of each concept Cr that is related to C1 via an edge P, C or S, provided the min sup of Cr and C1 is above the min sup threshold (set 'D').
5. E = C ∩ D.
6. confidence = (number of elements in E) / (number of elements in C).
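The steps can be sketched with plain sets, as one possible reading rather than the IRIS implementation. It assumes the combining operator for C and E is set intersection (otherwise the confidence could exceed 1), that set 'D' is built separately for each related concept, and that the related concepts have already been looked up in the ECH. The names `support`, `rule_confidences` and the sample ids are hypothetical.

```python
def support(docs_x, docs_y):
    """Shared documents divided by documents containing either concept."""
    union = docs_x | docs_y
    return len(docs_x & docs_y) / len(union) if union else 0.0


def rule_confidences(set_a, set_b, related_docs, threshold):
    """Return {related concept: confidence} for concepts passing the threshold.

    set_a: ids of documents matching the structured value (from SQL).
    set_b: ids of documents containing concept C1 (from the ECH).
    related_docs: ids of documents per concept related to C1 (assumed given).
    """
    set_c = set_a & set_b
    confidences = {}
    if not set_c:
        return confidences  # no document satisfies both conditions
    for concept, set_d in related_docs.items():
        # Keep only related concepts whose support with C1 is above the threshold.
        if support(set_d, set_b) > threshold:
            set_e = set_c & set_d
            confidences[concept] = len(set_e) / len(set_c)
    return confidences


result = rule_confidences(
    set_a={1, 2, 3, 4, 5, 6},
    set_b={2, 3, 4, 5, 7, 8},
    related_docs={"Cr1": {3, 4, 5, 7, 9}, "Cr2": {8, 10, 11, 12}},
    threshold=0.3,
)
print(result)
```

Here "Cr2" is dropped because it shares only one document with C1, while "Cr1" yields a confidence of 3 / 4.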

Advantages of using this system Distinguishing between structured and unstructured data helps generate more interesting rules. Being domain specific improves accuracy. The system is scalable, as any database can be used as the database component. Only meaningful data is stored, giving a compact representation of each document.

Bibliography
L. Singh, P. Scheurmann & B. Chen, “IRIS: Our prototype rule generation system”.
L. Singh, P. Scheurmann & B. Chen, “Generating Association Rules from Semi-structured Documents Using an Extended Concept Hierarchy”, 1999.