Data Mining Applied to Document Imaging
Jeff Rekoske
Agenda
- Introduction
- Problem Definition
- Solution and Methodology
- Progress Report
- Tools
- Techniques Applied from CSC-288
- Lessons Learned/Reinforced
- Summary
Introduction
- Employed as SW Developer and DBA on document imaging project
- Access to OCR statistics
- Management staff has a few questions that can be answered by analysis of existing data
Problem Definition
- Two Parts
  - Management questions
  - Data mining demonstration
Management Questions
- Result of interviews
- Fairly basic
  - What forms are processed the most?
  - What are the recognition rates for the top forms?
  - What is the percentage of forms that were presented to an operator for keying?
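The three questions above map onto simple aggregate queries. The following is a minimal sketch only: it uses Python's built-in sqlite3 module rather than Oracle, and the table `ocr_field`, its columns, and the sample rows are hypothetical stand-ins, not the project's actual data mart.

```python
import sqlite3

# Hypothetical, simplified stand-in for the data mart: one row per OCR field.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ocr_field ("
    " doc_id INTEGER, form_type TEXT,"
    " recognized INTEGER, keyed INTEGER)"
)
# recognized = 1 if OCR accepted the field; keyed = 1 if an operator keyed it.
rows = [
    (1, "FORM_A", 1, 0), (1, "FORM_A", 0, 1),
    (2, "FORM_A", 1, 0), (2, "FORM_A", 1, 0),
    (3, "FORM_B", 0, 1), (3, "FORM_B", 1, 0),
]
conn.executemany("INSERT INTO ocr_field VALUES (?, ?, ?, ?)", rows)

# Q1: which forms are processed the most (documents per form type)?
top_forms = conn.execute(
    "SELECT form_type, COUNT(DISTINCT doc_id) AS docs"
    " FROM ocr_field GROUP BY form_type ORDER BY docs DESC, form_type"
).fetchall()

# Q2: field-level recognition rate per form type.
recognition = conn.execute(
    "SELECT form_type, AVG(recognized) FROM ocr_field"
    " GROUP BY form_type ORDER BY form_type"
).fetchall()

# Q3: percentage of documents with at least one field sent to an operator.
keyed_pct = conn.execute(
    "SELECT 100.0 * COUNT(DISTINCT CASE WHEN keyed = 1 THEN doc_id END)"
    " / COUNT(DISTINCT doc_id) FROM ocr_field"
).fetchone()[0]

print(top_forms, recognition, round(keyed_pct, 1))
```

Whether "recognition rate" is measured per field or per document is an assumption here; the sketch uses a per-field rate.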
Data Mining Demonstration
- Purpose is to show the usefulness of data mining techniques.
  - Prediction of rates for new forms
  - Characteristics of highly recognized forms
  - Use mined data to develop new forms
Solution
- Data mart
  - Answer management questions
  - Provide data for mining activities
Data Mart Schema (Snowflake)
ETL and Data Mining Dataflow
Methodology
- Choose a small timeframe to sample data
  - September – October 2004
- Use ETL to load data
  - Relatively “clean” process due to data location
- Apply SQL statements to data mart to answer management questions
Methodology (continued)
- Extract data from data mart to create WEKA files
  - Attribute-Relation File Format (ARFF)
- Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition)
- Validate model with 10-fold cross validation
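As an illustration of the ARFF step, here is a minimal sketch of rendering extracted rows as ARFF text. The project's actual generation scripts are PL/SQL; this Python helper `to_arff`, the relation name, and the attributes (`field_width`, `char_count`, `result`) are hypothetical and only show the file layout WEKA expects.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes; the last one is the pass/fail class label.
attributes = [
    ("field_width", "numeric"),
    ("char_count", "numeric"),
    ("result", ["pass", "fail"]),
]
rows = [(40, 10, "pass"), (12, 4, "fail")]
arff_text = to_arff("ocr_fields", attributes, rows)
print(arff_text)
```

The sketch does not quote values, so it assumes attribute names and nominal values contain no spaces or commas.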
Progress Report
- First part (management questions) complete
  - 14,210 imaged documents
  - 865,409 OCR fields
- View created that joins tables
- Allows for non-technical personnel to create basic queries
- Management is pleased with results
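The idea of a view that hides the joins can be sketched as follows. This is an assumption-laden illustration using sqlite3 instead of Oracle; the tables `form`, `document`, and `ocr_field` and the view `v_ocr_results` are hypothetical names, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form (form_id INTEGER PRIMARY KEY, form_name TEXT);
CREATE TABLE document (
    doc_id INTEGER PRIMARY KEY,
    form_id INTEGER REFERENCES form(form_id)
);
CREATE TABLE ocr_field (
    field_id INTEGER PRIMARY KEY,
    doc_id INTEGER REFERENCES document(doc_id),
    recognized INTEGER
);

-- The view flattens the joins so users can query a single object.
CREATE VIEW v_ocr_results AS
SELECT f.form_name, d.doc_id, o.field_id, o.recognized
FROM ocr_field o
JOIN document d ON d.doc_id = o.doc_id
JOIN form f ON f.form_id = d.form_id;

INSERT INTO form VALUES (1, 'FORM_A'), (2, 'FORM_B');
INSERT INTO document VALUES (10, 1), (11, 2);
INSERT INTO ocr_field VALUES (100, 10, 1), (101, 10, 0), (102, 11, 1);
""")

# A basic query against the view needs no knowledge of the joins.
counts = conn.execute(
    "SELECT form_name, COUNT(*) FROM v_ocr_results"
    " GROUP BY form_name ORDER BY form_name"
).fetchall()
print(counts)
```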
Progress Report (continued)
- Part Two (WEKA classifier) in progress
  - ARFF generation scripts complete
  - Need to run ARFF files through WEKA
  - Need to cross validate results
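For readers unfamiliar with the remaining cross-validation step, here is a minimal sketch of plain (unstratified) 10-fold cross-validation. It is not WEKA's implementation; `k_fold_indices`, `cross_validate`, and the majority-class baseline are hypothetical helpers, and the 70/30 pass/fail labels are made-up sample data.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validate(rows, labels, train_fn, k=10):
    """Return overall accuracy of train_fn's models across k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k):
        model = train_fn([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(model(rows[i]) == labels[i] for i in test)
    return correct / len(rows)


def majority_classifier(train_rows, train_labels):
    """Baseline 'model': always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common


# Made-up data: 70 fields that passed recognition and 30 that failed.
labels = ["pass"] * 70 + ["fail"] * 30
rows = list(range(len(labels)))
accuracy = cross_validate(rows, labels, majority_classifier)
print(accuracy)
```

A real classifier would replace `majority_classifier`; the baseline is useful mainly as the accuracy any learned model should beat.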
Tools
- Oracle 8i RDBMS
- Oracle PL/SQL scripting language
- WEKA implementation of C4.5 classifier
- WEKA cross validation
Techniques Applied from CSC-288
- Data Mart
  - Snowflake Schema
  - ETL
  - OLAP Operations
Techniques Applied (continued)
- Classification
  - C4.5 Algorithm
  - Supervised Learning
- Credibility
  - Cross-Validation
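C4.5 chooses splits on nominal attributes by gain ratio (information gain divided by split information). The sketch below shows only that calculation on made-up values; a full C4.5 implementation such as WEKA's also handles continuous attributes, missing values, and pruning, none of which is covered here.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a nominal attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["pass", "pass", "fail", "fail"]
perfect = gain_ratio(["a", "a", "b", "b"], labels)   # separates the classes
useless = gain_ratio(["a", "b", "a", "b"], labels)   # tells us nothing
print(perfect, useless)
```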
Lessons Learned/Reinforced
- Get firm requirements (if possible)
- Data marts can get large quickly
- OLAP operations should be performed offline (from the OLTP system)
- Demonstrations are useful for explaining concepts
Summary
- Application of knowledge from CSC-288 to my work
- Data mart can be used to answer multiple questions without affecting OLTP processing
- Hopefully demonstrate using the data mart to create a classification model
References
- “Data Mining: Concepts and Techniques,” by Jiawei Han and Micheline Kamber, Morgan Kaufmann, San Francisco, 2001.
- “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,” by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.
Questions?