Data Mining Applied to Document Imaging Jeff Rekoske.

Presentation transcript:

1 Data Mining Applied to Document Imaging Jeff Rekoske

2 Agenda
- Introduction
- Problem Definition
- Solution and Methodology
- Progress Report
- Tools
- Techniques Applied from CSC-288
- Lessons Learned/Reinforced
- Summary

3 Introduction
- Employed as SW Developer and DBA on document imaging project
- Access to OCR statistics
- Management staff has a few questions that can be answered by analysis of existing data

4 Problem Definition
- Two Parts
  - Management questions
  - Data mining demonstration

5 Management Questions
- Result of interviews
- Fairly basic
  - What forms are processed the most?
  - What are the recognition rates for the top forms?
  - What is the percentage of forms that were presented to an operator for keying?

6 Data Mining Demonstration
- Purpose is to show the usefulness of data mining techniques.
  - Prediction of rates for new forms
  - Characteristics of highly recognized forms
  - Use mined data to develop new forms

7 Solution
- Data mart
  - Answer management questions
  - Provide data for mining activities

8 Data Mart Schema (Snowflake)

9 ETL and Data Mining Dataflow

10 Methodology
- Choose a small timeframe to sample data
  - September – October 2004
- Use ETL to load data
  - Relatively “clean” process due to data location
- Apply SQL statements to data mart to answer management questions
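
The three management questions map onto simple aggregate queries. The sketch below is illustrative only: it uses Python's built-in sqlite3 module as a stand-in for the Oracle data mart, and the table names, column names, and sample rows (form_type, document, ocr_field, sent_to_operator, recognized) are hypothetical, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical miniature data mart; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE form_type (form_id INTEGER PRIMARY KEY, form_name TEXT);
    CREATE TABLE document  (doc_id INTEGER PRIMARY KEY, form_id INTEGER,
                            sent_to_operator INTEGER);
    CREATE TABLE ocr_field (field_id INTEGER PRIMARY KEY, doc_id INTEGER,
                            recognized INTEGER);

    INSERT INTO form_type VALUES (1, 'FORM-A'), (2, 'FORM-B');
    INSERT INTO document  VALUES (1, 1, 0), (2, 1, 1), (3, 1, 0), (4, 2, 1);
    INSERT INTO ocr_field VALUES
        (1, 1, 1), (2, 1, 1), (3, 2, 0), (4, 2, 1),
        (5, 3, 1), (6, 3, 1), (7, 4, 0), (8, 4, 0);
""")

# Q1: which forms are processed the most?
top_forms = conn.execute("""
    SELECT f.form_name, COUNT(*) AS docs
    FROM document d
    JOIN form_type f ON f.form_id = d.form_id
    GROUP BY f.form_name
    ORDER BY docs DESC
""").fetchall()

# Q2: field-level recognition rate per form (recognized is stored as 0 or 1).
rates = conn.execute("""
    SELECT f.form_name, AVG(o.recognized) AS rate
    FROM ocr_field o
    JOIN document d  ON d.doc_id = o.doc_id
    JOIN form_type f ON f.form_id = d.form_id
    GROUP BY f.form_name
    ORDER BY f.form_name
""").fetchall()

# Q3: percentage of documents presented to an operator for keying.
pct_keyed = conn.execute(
    "SELECT 100.0 * SUM(sent_to_operator) / COUNT(*) FROM document"
).fetchone()[0]

print(top_forms, rates, pct_keyed)
```

Running the queries against a separate data mart rather than the production tables is what keeps this kind of analysis from competing with OLTP work.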

11 Methodology (continued)
- Extract data from data mart to create WEKA files
  - Attribute-Relation File Format (ARFF)
- Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition)
- Validate model with 10-fold cross validation
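
An ARFF file is plain text: a relation name, one @attribute line per column, then comma-separated rows under @data. The following is a minimal sketch of a generator, not the project's actual PL/SQL scripts; the attribute names and sample rows are hypothetical, and it assumes names and values contain no spaces or commas (which ARFF would require quoting).

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) where kind is the string 'numeric'
    or a list of allowed nominal values.
    rows: iterable of tuples in attribute order; None becomes '?' (missing).
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"  # nominal attribute
        else:
            spec = kind                        # e.g. 'numeric'
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes for a pass/fail recognition classifier.
attributes = [
    ("field_count", "numeric"),
    ("form_type", ["A", "B"]),
    ("recognition", ["pass", "fail"]),
]
rows = [(42, "A", "pass"), (17, "B", "fail"), (None, "A", "pass")]

arff_text = to_arff("ocr_forms", attributes, rows)
print(arff_text)
```

The class attribute (here recognition) is conventionally listed last, which is what WEKA assumes by default when loading a file for classification.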

12 Progress Report
- First part (management questions) complete
  - 14,210 imaged documents
  - 865,409 OCR fields
- View created that joins tables
- Allows for non-technical personnel to create basic queries
- Management is pleased with results
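
A view that pre-joins the tables lets users query one flat result set without writing joins themselves. This is a sketch under assumptions: sqlite3 stands in for Oracle, and the view name ocr_results plus every table and column name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; the real schema is not shown in the slides.
    CREATE TABLE form_type (form_id INTEGER PRIMARY KEY, form_name TEXT);
    CREATE TABLE document  (doc_id INTEGER PRIMARY KEY, form_id INTEGER);
    CREATE TABLE ocr_field (doc_id INTEGER, field_name TEXT, recognized INTEGER);

    INSERT INTO form_type VALUES (1, 'FORM-A'), (2, 'FORM-B');
    INSERT INTO document  VALUES (10, 1), (11, 2);
    INSERT INTO ocr_field VALUES
        (10, 'name', 1), (10, 'date', 0), (11, 'name', 1);

    -- The view hides the joins behind a single flat relation.
    CREATE VIEW ocr_results AS
    SELECT f.form_name, d.doc_id, o.field_name, o.recognized
    FROM ocr_field o
    JOIN document d  ON d.doc_id = o.doc_id
    JOIN form_type f ON f.form_id = d.form_id;
""")

# A basic query a non-technical user could write against the view.
failed = conn.execute(
    "SELECT form_name, field_name FROM ocr_results WHERE recognized = 0"
).fetchall()
print(failed)
```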

13 Progress Report (continued)
- Part two (WEKA classifier) in progress
  - ARFF generation scripts complete
  - Need to run ARFF files through WEKA
  - Need to cross validate results
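
For readers unfamiliar with 10-fold cross validation, the idea is to split the records into ten folds, train on nine, test on the held-out fold, and repeat so every record is tested exactly once. The sketch below is a plain-Python illustration, not WEKA's implementation (WEKA also stratifies folds by class, which this does not); the majority-class "model" and the 70/30 sample labels are placeholders chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index lands in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validate(records, labels, fit, predict, k=10):
    """Return overall accuracy across the k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, records[i]) == labels[i] for i in test)
    return correct / len(records)


# Placeholder classifier: always predicts the most common training label.
def fit_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, record):
    return model


records = list(range(100))                # stand-in feature rows
labels = ["pass"] * 70 + ["fail"] * 30    # invented class distribution
accuracy = cross_validate(records, labels, fit_majority, predict_majority)
print(accuracy)
```

A majority-class baseline like this is a useful floor: a trained classifier is only informative if its cross-validated accuracy beats it.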

14 Tools
- Oracle 8i RDBMS
- Oracle PL/SQL scripting language
- WEKA implementation of C4.5 classifier
- WEKA cross validation

15 Techniques Applied from CSC-288
- Data Mart
  - Snowflake Schema
  - ETL
  - OLAP Operations

16 Techniques Applied (continued)
- Classification
  - C4.5 Algorithm
  - Supervised Learning
- Credibility
  - Cross-Validation
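
C4.5 chooses which attribute to split on using the gain ratio: information gain divided by the entropy of the split itself. The sketch below shows only that criterion for nominal attributes; a full C4.5 implementation also handles continuous attributes, missing values, and pruning. The attribute values and labels are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (bits) of the value distribution in items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a nominal attribute's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes attributes with many values
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["pass", "pass", "fail", "fail"]
informative = ["typed", "typed", "hand", "hand"]  # separates classes perfectly
uninformative = ["a", "b", "a", "b"]              # tells nothing about the class

print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

The tree builder would evaluate every candidate attribute this way and split on the one with the highest ratio.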

17 Lessons Learned/Reinforced
- Get firm requirements (if possible)
- Data marts can get large quickly
- OLAP operations should be performed offline (from the OLTP system)
- Demonstrations are useful for explaining concepts

18 Summary
- Application of knowledge from CSC-288 to my work
- Data mart can be used to answer multiple questions without affecting OLTP processing
- Hope to demonstrate using the data mart to create a classification model

19 References
- Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, San Francisco, 2001.
- Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," Morgan Kaufmann, San Francisco, 2000.

20 Questions?

