A Robust System Architecture For Mining Semi-structured Data By Aby M Mathew CSE 633111301999.


Introduction A versatile system architecture for text mining that distinguishes between the structured and unstructured components of a document and maintains each separately.

Motivation A digital library can contain a very large number of documents and concepts. With SQL it is possible to generate quantitative rules based on given criteria. But what about rules that relate to a subset of the collection, such as: –which journals publish articles associated with a particular area of interest?

Presentation Organization Overview of the IRIS system. Differences between structured & unstructured data. How the data is stored. Algorithm used for rule generation. Conclusion.

Overview of the IRIS system [Diagram of components: GUI, Rule Generator, Concept Library, Database, IDM, Document Collection]

Brief Description Of Individual Components Rule Generator - parses the user request received via the GUI and determines an execution strategy. Database - contains the structured data, with mappings between tuples and documents. Concept library - maintains the unstructured data as concepts, with mappings between concepts and documents.

Contd.. IDM (Information Discovery Module) –extracts concepts and structured values from a document collection –updates the database and concept library.

Components of the Rule Generator Parser - accepts the request data and reconditions it for the optimizer. Optimizer - uses the constraints and the rule type to generate an efficient execution plan. Processor - executes the plans laid out by the optimizer. [Diagram: Parser, Optimizer, Processor]

Components of the IDM Discoverer - an intelligent agent that determines domains. Extractor - populates the database and concept library, based on the domain knowledge. Refresher - helps keep the database and concept library consistent. [Diagram: Discoverer, Extractor, Refresher]

Differences between the two data types Structured data type –Certain features that form key entities, e.g., Author, Publisher, Date. Unstructured data type –Blocks of text that cannot be identified as structured, e.g., Abstract headings, paragraphs.

How is the data stored? Structured data is stored using a relational schema that is mapped to a database. Unstructured data is stored in a compressed form using an ECH (extended concept hierarchy).
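The two stores described above can be sketched roughly as follows. This is only an illustration: the table name `documents`, its columns, the sample rows, and the dictionary standing in for the concept library are all made up for the example, and a plain dictionary is a simplification of the ECH rather than the compressed representation itself.

```python
import sqlite3

# Hypothetical relational schema for the structured part of each document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, journal TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        (1, "Smith", "Journal A", 1997),  # invented sample rows
        (2, "Jones", "Journal B", 1998),
        (3, "Lee", "Journal A", 1999),
    ],
)

# Hypothetical stand-in for the concept library:
# concept name -> ids of the documents in which the concept occurs.
concept_library = {
    "data mining": {1, 2},
    "databases": {2, 3},
}


def doc_ids_for_journal(journal):
    """Return the ids of all documents whose structured 'journal' value matches."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE journal = ?", (journal,)
    )
    return {row[0] for row in rows}


journal_a_docs = doc_ids_for_journal("Journal A")
# Documents from Journal A that also mention the concept "data mining".
both = journal_a_docs & concept_library["data mining"]
```

Keeping both stores keyed by document id is what lets a structured value and a concept be combined with a simple set intersection.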

Extended Concept Hierarchy A hierarchical form of representing data. It is not always constrained to a tree structure. Relationships maintain additional links between the entities in the hierarchy.

Example University ECH [Diagram; node labels: University, Faculty, Admin, Full, Associate, Provost, Dean, Employees]
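One way to picture such a hierarchy in code is a graph with parent/child edges plus extra non-hierarchical links. The sketch below is an assumption throughout: the class `ConceptHierarchy` and its methods are invented for illustration, and since the original diagram did not survive, the way the University nodes are connected here is a guess, not the arrangement shown on the slide.

```python
from collections import defaultdict


class ConceptHierarchy:
    """Hypothetical ECH sketch: a hierarchy that need not be a strict tree."""

    def __init__(self):
        self.children = defaultdict(set)  # parent -> child concepts
        self.parents = defaultdict(set)   # child -> parent concepts
        self.related = defaultdict(set)   # additional, non-hierarchical links

    def add_child(self, parent, child):
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def add_link(self, a, b):
        # Extra relationship between two entities, stored in both directions.
        self.related[a].add(b)
        self.related[b].add(a)

    def neighbors(self, concept):
        """All concepts directly connected to `concept` by any kind of edge."""
        return (
            self.parents[concept]
            | self.children[concept]
            | self.related[concept]
        )


# Assumed arrangement of the node labels from the example slide.
ech = ConceptHierarchy()
ech.add_child("University", "Faculty")
ech.add_child("University", "Admin")
ech.add_child("Faculty", "Full")
ech.add_child("Faculty", "Associate")
ech.add_child("Admin", "Provost")
ech.add_child("Admin", "Dean")
# A second parent makes the structure a graph rather than a tree.
ech.add_child("Employees", "Faculty")
ech.add_child("Employees", "Admin")
# An additional link between two entities (purely illustrative).
ech.add_link("Provost", "Dean")
```

Here `Faculty` has two parents, which is the kind of structure a plain tree could not express.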

Calculation of minimum support (min sup) in ECH If C1 & C2 are the two concepts found in the document, then min sup = |documents(C1) ∩ documents(C2)| / |documents(C1) ∪ documents(C2)|, where ‘documents(c)’ is the set of documents in which concept ‘c’ occurs and |S| is the number of documents in the set S.

Example for calculating min sup Say concept C1 appears in 500 documents and C2 appears in 600 documents, 100 of which also contain C1. The two concepts then cover 500 + 600 - 100 = 1000 distinct documents, so min sup = 100 / 1000 = 0.1
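The calculation above can be checked with a few lines of Python. This is a minimal sketch that treats each concept as a set of document ids; the function name `min_sup` and the id ranges are assumptions chosen only to reproduce the counts in the example.

```python
def min_sup(docs_c1, docs_c2):
    """Shared documents divided by all documents containing either concept."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Made-up document ids matching the counts in the example:
docs_c1 = set(range(0, 500))     # C1 occurs in 500 documents
docs_c2 = set(range(400, 1000))  # C2 occurs in 600 documents, 100 shared with C1

value = min_sup(docs_c1, docs_c2)
```

With these sets the intersection has 100 documents and the union has 1000, matching the worked example.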

Algorithm used for rule generation
1. Get the document ids of documents containing the structured data value, using SQL statements (set ‘A’).
2. Get the document ids of documents containing the unstructured concept C1, using the ECH (set ‘B’).
3. C = A ∩ B.
4. Get the document ids of each concept Cr that is related to C1 via an edge P, C or S, provided the min sup of Cr and C1 is above the minimum support threshold (set ‘D’).
5. E = C ∩ D.
6. confidence = (number of elements in E) / (number of elements in C).
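The steps above can be sketched with plain set operations. This is one possible reading of the slide, not the IRIS implementation: the function names, the `threshold` parameter, and the sample document ids are all assumptions, and the related concepts are passed in directly instead of being looked up through ECH edges.

```python
def support(docs_a, docs_b):
    """min sup of two concepts: shared documents over all documents with either."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def rule_confidence(structured_docs, concept_docs, related_docs, threshold):
    """
    structured_docs: set A, ids of documents matching the structured value
    concept_docs:    set B, ids of documents containing concept C1
    related_docs:    hypothetical mapping {related concept Cr: ids of its documents}
    threshold:       minimum support a related concept must exceed
    """
    c = structured_docs & concept_docs
    d = set()
    for docs in related_docs.values():
        # Keep only related concepts whose support with C1 is above the threshold.
        if support(docs, concept_docs) > threshold:
            d |= docs
    e = c & d
    return len(e) / len(c) if c else 0.0


# Made-up document ids for illustration.
a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
related = {
    "Cr1": {3, 4, 5, 6},        # support with C1: 3 / 5 = 0.6
    "Cr2": {2, 9, 10, 11, 12},  # support with C1: 1 / 8 = 0.125
}
confidence = rule_confidence(a, b, related, threshold=0.5)
```

In this example C = {2, 3, 4}, only Cr1 passes the threshold so D = {3, 4, 5, 6}, E = {3, 4}, and the confidence is 2 / 3.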

Advantages of Using this system Distinguishing between structured and unstructured data helps generate more interesting rules. Being domain specific improves accuracy. Scalable, as any database can be used as the database component. Only meaningful data is stored, giving a compact representation of each document.

Bibliography L. Singh, P. Scheuermann & B. Chen, “IRIS: Our prototype rule generation system”, 1999. L. Singh, P. Scheuermann & B. Chen, “Generating Association Rules from Semi-structured Documents using an Extended Concept Hierarchy”, 1999.
