Data Mining Applied to Document Imaging Jeff Rekoske.



Agenda
- Introduction
- Problem Definition
- Solution and Methodology
- Progress Report
- Tools
- Techniques Applied from CSC-288
- Lessons Learned/Reinforced
- Summary

Introduction
- Employed as a software developer and DBA on a document imaging project
- Access to OCR statistics
- Management staff has a few questions that can be answered by analysis of existing data

Problem Definition
- Two parts
  - Management questions
  - Data mining demonstration

Management Questions
- Result of interviews
- Fairly basic
  - What forms are processed the most?
  - What are the recognition rates for the top forms?
  - What is the percentage of forms that were presented to an operator for keying?
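The three management questions map naturally onto aggregate SQL queries. The following is a minimal sketch using Python's built-in `sqlite3` module rather than the Oracle database used in the project. The table `form_stats`, its columns, and the sample rows are hypothetical and exist only for illustration; the real data mart schema is not reproduced in this transcript.

```python
import sqlite3

# Hypothetical, simplified table: one row per imaged document.
mart = sqlite3.connect(":memory:")
mart.executescript("""
CREATE TABLE form_stats (
    doc_id            INTEGER PRIMARY KEY,
    form_type         TEXT,
    fields_total      INTEGER,
    fields_recognized INTEGER,
    keyed_by_operator INTEGER   -- 1 if the form was presented to an operator
);
INSERT INTO form_stats VALUES
    (1, 'A', 10,  9, 0),
    (2, 'A', 10, 10, 0),
    (3, 'A', 10,  6, 1),
    (4, 'B', 20, 10, 1),
    (5, 'B', 20, 18, 0),
    (6, 'C',  5,  5, 0);
""")

# Question 1: which forms are processed the most?
most_processed = mart.execute("""
    SELECT form_type, COUNT(*) AS n
    FROM form_stats
    GROUP BY form_type
    ORDER BY n DESC, form_type
""").fetchall()

# Question 2: field-level recognition rate (percent) for the top two forms.
recognition = mart.execute("""
    SELECT form_type,
           ROUND(100.0 * SUM(fields_recognized) / SUM(fields_total), 1)
    FROM form_stats
    GROUP BY form_type
    ORDER BY COUNT(*) DESC, form_type
    LIMIT 2
""").fetchall()

# Question 3: percentage of forms presented to an operator for keying.
keyed_pct = mart.execute("""
    SELECT ROUND(100.0 * SUM(keyed_by_operator) / COUNT(*), 1)
    FROM form_stats
""").fetchone()[0]

print(most_processed, recognition, keyed_pct)
```

Running the queries against a separate data mart, as the deck proposes, keeps this kind of aggregate scan away from the production OLTP tables.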

Data Mining Demonstration
- Purpose is to show the usefulness of data mining techniques.
  - Prediction of rates for new forms
  - Characteristics of highly recognized forms
  - Use mined data to develop new forms

Solution
- Data mart
  - Answer management questions
  - Provide data for mining activities

Data Mart Schema (Snowflake)

ETL and Data Mining Dataflow

Methodology
- Choose a small timeframe to sample data
  - September to October 2004
- Use ETL to load data
  - Relatively “clean” process due to data location
- Apply SQL statements to data mart to answer management questions

Methodology (continued)
- Extract data from data mart to create WEKA files
  - Attribute-Relation File Format (ARFF)
- Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition)
- Validate model with 10-fold cross validation
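An ARFF file is plain text: a `@RELATION` line, one `@ATTRIBUTE` line per column, and a `@DATA` section of comma-separated rows. The sketch below shows one way such a file could be generated from rows extracted out of the data mart. It is written in Python rather than the PL/SQL used in the project, and the attribute names (`form_type`, `field_count`, `result`) are hypothetical stand-ins, not the project's actual attributes.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the string
    'NUMERIC' or a list of allowed nominal values.
    rows: iterable of tuples, one value per attribute, in attribute order.
    """
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"   # nominal attribute
        else:
            spec = kind                          # e.g. NUMERIC
        lines.append(f"@ATTRIBUTE {name} {spec}")
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes; the class attribute (pass/fail) goes last.
arff_text = to_arff(
    "ocr_recognition",
    [
        ("form_type", ["A", "B"]),
        ("field_count", "NUMERIC"),
        ("result", ["pass", "fail"]),
    ],
    [("A", 10, "pass"), ("B", 20, "fail")],
)
print(arff_text)
```

WEKA treats the last attribute as the class by default, which is why the pass/fail label is placed last here. WEKA's implementation of C4.5 is the `J48` classifier.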

Progress Report
- First part (management questions) complete
  - 14,210 imaged documents
  - 865,409 OCR fields
- View created that joins tables
- Allows for non-technical personnel to create basic queries
- Management is pleased with results
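A view that pre-joins the normalized tables lets non-technical staff query a single flat object without writing joins themselves. The sketch below illustrates the idea with Python's `sqlite3` module. The three tables, the view name `v_ocr_results`, and the sample rows are hypothetical; the project's actual snowflake schema is not shown in this transcript.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Hypothetical normalized tables, loosely in the spirit of a snowflake schema.
CREATE TABLE form_type (
    form_type_id INTEGER PRIMARY KEY,
    form_name    TEXT
);
CREATE TABLE document (
    doc_id       INTEGER PRIMARY KEY,
    form_type_id INTEGER REFERENCES form_type(form_type_id)
);
CREATE TABLE ocr_field (
    field_id   INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES document(doc_id),
    recognized INTEGER          -- 1 if OCR recognized the field
);

INSERT INTO form_type VALUES (1, 'Claim'), (2, 'Invoice');
INSERT INTO document  VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO ocr_field VALUES
    (100, 10, 1), (101, 10, 0), (102, 11, 1), (103, 12, 1);

-- The view hides the joins behind one flat, queryable name.
CREATE VIEW v_ocr_results AS
SELECT d.doc_id, f.form_name, o.field_id, o.recognized
FROM ocr_field o
JOIN document  d ON d.doc_id = o.doc_id
JOIN form_type f ON f.form_type_id = d.form_type_id;
""")

# A basic query against the view: fields and recognized fields per form.
per_form = db.execute("""
    SELECT form_name, COUNT(*), SUM(recognized)
    FROM v_ocr_results
    GROUP BY form_name
    ORDER BY form_name
""").fetchall()
print(per_form)
```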

Progress Report (continued)
- Part two (WEKA classifier) in progress
  - ARFF generation scripts complete
  - Need to run ARFF files through WEKA
  - Need to cross validate results

Tools
- Oracle 8i RDBMS
- Oracle PL/SQL scripting language
- WEKA implementation of C4.5 classifier
- WEKA cross validation
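C4.5 chooses the attribute to split on by gain ratio: the information gain of the split divided by the entropy of the split itself. The following is a minimal sketch of that criterion for a single nominal attribute. It is not WEKA's code, and the sample values are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the nominal attribute `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)      # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0


# Invented example: the first attribute separates pass/fail perfectly,
# the second carries no information about the class.
labels = ["pass", "pass", "fail", "fail"]
informative = gain_ratio(["A", "A", "B", "B"], labels)
uninformative = gain_ratio(["A", "B", "A", "B"], labels)
print(informative, uninformative)
```

A full C4.5 implementation also handles numeric attributes, missing values, and pruning, none of which this sketch attempts.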

Techniques Applied from CSC-288
- Data Mart
  - Snowflake Schema
  - ETL
  - OLAP Operations

Techniques Applied (continued)
- Classification
  - C4.5 Algorithm
  - Supervised Learning
- Credibility
  - Cross-Validation
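In k-fold cross-validation the data is split into k folds; each fold serves once as the test set while the remaining folds are used for training, and the k accuracy figures are averaged. The sketch below shows only the index splitting, assuming a simple shuffled, unstratified split. The function name `k_fold_indices` and the seed are illustrative choices, not part of WEKA.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Simplification: folds are shuffled but not stratified by class.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # Each instance is in exactly one of the two sets for a given fold.
    assert not set(train) & set(test)
print(len(splits), [len(test) for _, test in splits])
```

Every instance appears in exactly one test fold, so each one is used for testing once and for training k - 1 times.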

Lessons Learned/Reinforced
- Get firm requirements (if possible)
- Data marts can get large quickly
- OLAP operations should be performed offline (from the OLTP system)
- Demonstrations are useful for explaining concepts

Summary
- Application of knowledge from CSC-288 to my work
- Data mart can be used to answer multiple questions without affecting OLTP processing
- Hopefully demonstrate using the data mart for creating a classification model

References
- “Data Mining: Concepts and Techniques,” by Jiawei Han and Micheline Kamber, Morgan Kaufmann, San Francisco, 2001.
- “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,” by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.

Questions?