University of Illinois OCR Workshop Loretta Auvil UIUC October 18, 2011.

Presentation transcript:

University of Illinois OCR Workshop Loretta Auvil UIUC October 18, 2011

Correlation-Ngram Viewer: Pearson Correlation Algorithm

Correlation-Ngram Viewer
- New version of the Google ngrams viewer (for 1-grams)
- Addresses case sensitivity, period spellings, past-tense syncope ('d), f/s substitution, as well as other OCR issues
- Searches within already stored correlation results (using Pearson) for the top 10K ngrams
- Computes correlation results (using Pearson) for a given word against the top 1K ngrams
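A minimal sketch of the Pearson step, assuming each ngram is represented by a list of yearly frequencies. The function names and the sample series are made up for illustration and are not taken from the viewer itself.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation; treated as 0 here
    return cov / sqrt(var_x * var_y)

def top_correlated(word, series_by_word, k=3):
    """Rank all other words by their correlation with `word`, highest first."""
    target = series_by_word[word]
    scored = [(other, pearson(target, series))
              for other, series in series_by_word.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up yearly relative frequencies (illustration only).
series = {
    "steam":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "railway": [2.0, 4.0, 6.0, 8.0, 10.0],
    "candle":  [5.0, 4.0, 3.0, 2.0, 1.0],
}
best = top_correlated("steam", series, k=1)
```

Storing the scores for a fixed set of top ngrams, as the slide describes, would let later searches read precomputed results instead of recomputing them.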

OCR Correction: HTRC
Example of one of the worst pages of text, based on the number of corrections per word. Rate =

Worst Page

Corrected Page

Some Stats
Data sets, in column order: Google Ngram | HTRC 250K Books | Laura's

Total number of ngrams: 359,511,583,097 | 20,173,974,251
Total number of ngrams (ignoring punctuation chars): 306,780,490,555
Total number of ngrams (ignoring numbers only & repeating characters, other noise that I could easily identify): 293,760,570,946 | 19,282,108,416 | 593,055
Total number of corrections that we have made: 1,660,948,155 | 131,571,046 | 4,294
Percent of cleaning: 0.57% | 0.68% | 0.72%
Unique ngrams before cleaning: 7,380,256 | 24,545
Unique ngrams after cleaning: 4,977,548 | 22,354
Number of unique misspelled words: 17,906
Number of unique misspelled words with no suggested replacement: 11,143
Number of generated rules: ,763
Number of valid rules: 99,455 | 3,751
Number of rules that are shorter than 5 chars and ignored: 7,076 | 1,674
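The cleaning percentages can be reproduced from the figures above if one assumes they are the number of corrections divided by the noise-filtered ngram total. That reading is an assumption, though it matches all three reported values.

```python
# Figures copied from the slide; the formula is an assumed reading of "Percent of Cleaning".
corrections = {
    "Google Ngram": 1_660_948_155,
    "HTRC 250K Books": 131_571_046,
    "Laura's": 4_294,
}
filtered_totals = {
    "Google Ngram": 293_760_570_946,
    "HTRC 250K Books": 19_282_108_416,
    "Laura's": 593_055,
}

def percent_cleaned(n_corrections, n_ngrams):
    """Corrections as a percentage of all (noise-filtered) ngrams."""
    return 100.0 * n_corrections / n_ngrams

for name in filtered_totals:
    print(f"{name}: {percent_cleaned(corrections[name], filtered_totals[name]):.2f}%")
```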

Spellcheck Component
Wrapped existing spellchecker from com.swabunga.spell
Input
- Dictionary: defines the correct words
- Transformations: a set of rules that should be tried on misspelled words before taking the spell checker's suggestions
- Token counts: a set of counts that can be used to choose a word when the spell checker suggests multiple ones
Output
- Replacement Rules: the transformation rules for misspelled words
- Replacements: suggestions for misspelled words
- Corrected Text: the original text with corrections applied
- Uncorrected Misspellings: the list of words for which a correction/replacement could not be found
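A rough sketch of how these inputs and outputs could fit together. It is written in plain Python and does not use the com.swabunga.spell API; every name here is hypothetical, the `suggest` fallback stands in for the wrapped spell checker, and rules are written as (OCR form, intended form) pairs. Only one rule application per word is tried.

```python
def single_replacements(word, seen, intended):
    """Yield every variant of `word` with exactly one occurrence of `seen` replaced."""
    start = word.find(seen)
    while start != -1:
        yield word[:start] + intended + word[start + len(seen):]
        start = word.find(seen, start + 1)

def correct_text(text, dictionary, transformations, token_counts,
                 suggest=lambda word: []):
    """Return (corrected_text, replacements, replacement_rules, uncorrected)."""
    replacements = {}        # misspelled word -> chosen replacement
    replacement_rules = set()  # rules that produced a chosen replacement
    uncorrected = []         # misspelled words with no usable candidate
    out = []
    for token in text.split():
        if token in dictionary:
            out.append(token)
            continue
        if token not in replacements and token not in uncorrected:
            # 1. Try the transformation rules first.
            candidates = [(variant, (seen, intended))
                          for seen, intended in transformations
                          for variant in single_replacements(token, seen, intended)
                          if variant in dictionary]
            # 2. Fall back to the spell checker's own suggestions.
            if not candidates:
                candidates = [(s, None) for s in suggest(token) if s in dictionary]
            if candidates:
                # 3. Token counts break ties between several candidates.
                best, rule = max(candidates, key=lambda c: token_counts.get(c[0], 0))
                replacements[token] = best
                if rule is not None:
                    replacement_rules.add(rule)
            else:
                uncorrected.append(token)
        out.append(replacements.get(token, token))
    return " ".join(out), replacements, replacement_rules, uncorrected

dictionary = {"the", "was", "not", "hot", "rot", "first"}
transformations = [("ll", "n"), ("ll", "h"), ("ll", "r"), ("f", "s")]
token_counts = {"not": 900, "hot": 40, "rot": 5}
result = correct_text("the firft was llot xqzv", dictionary, transformations, token_counts)
```

In this example "llot" could become "not", "hot" or "rot"; the token counts select "not" because it is the most frequent candidate.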

Adding Levenshtein
- Use the Levenshtein algorithm to filter the list of suggestions considered
- The Levenshtein distance is a metric for measuring the amount of difference between two sequences
- The value of this property is expressed as a percentage that will depend on the length of the misspelled word
Example:
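One way to sketch the filtering step: compute the edit distance between the misspelled word and each suggestion, and keep only suggestions within a limit that scales with word length. The 0.34 fraction and the function names are assumptions for illustration; the transcript does not preserve the setting actually used.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def filter_suggestions(misspelled, suggestions, max_fraction=0.34):
    """Keep suggestions whose distance is within a fraction of the word length."""
    limit = max(1, int(len(misspelled) * max_fraction))  # hypothetical threshold
    return [s for s in suggestions if levenshtein(misspelled, s) <= limit]

kept = filter_suggestions("firft", ["first", "fifth", "forest"])
```

Tying the limit to word length means short words tolerate only a single edit, while longer words can absorb more OCR damage before a suggestion is rejected.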

Transformation Rules
Complete list:
o=0; i=1; l=1; z=2; o=3; e=3; s=3; d=3; t=4; e=4; l=4; s=o; s=5; c=6; e=6; fi=6; o=6; l=7; z=7; y=7; j=8; g=8; s=8; a=9; c=9; g=9; o=9; ti=9;
b={h,o}; c={e,o,q}; cl={ct,d}; ct={cl,d,dl,dt,ft}; d={cl,ct}; dl=ct; dt=ct; e=c; fl={ss,st}; ft=ct; h={li,b,ii,ll}; i=l; j=y; l=i; li=h; m={rn,lll}; n={ll,il,h}; oe=ce; r=ll; rn=m; s=f; sh={fli,ih,jb,jh,m,sb}; ss=fl; st=fl; tb=th; th=tb; v=y; u={ll,n,ti}; y={j,v};
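A small sketch of how a rule list in this notation could be parsed and applied. It assumes the left side of each rule is the intended text and the right side is what the OCR produced (so m={rn,lll} reads "an intended m may appear as rn or lll"); that direction is an inference from rules such as s=f, and the helper names are hypothetical.

```python
def parse_rules(spec):
    """Parse 'a=b; c={d,e};' into a list of (intended, ocr_form) pairs."""
    rules = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue
        intended, forms = item.split("=", 1)
        for form in forms.strip().strip("{}").split(","):
            rules.append((intended.strip(), form.strip()))
    return rules

def candidates(word, rules):
    """All variants of `word` with one OCR form replaced by its intended text."""
    found = set()
    for intended, ocr_form in rules:
        start = word.find(ocr_form)
        while start != -1:
            found.add(word[:start] + intended + word[start + len(ocr_form):])
            start = word.find(ocr_form, start + 1)
    return found

# A subset of the rules listed on the slide.
rules = parse_rules("o=0; l=1; m={rn,lll}; s=f; h={li,b,ii,ll};")
```

A candidate would only be accepted if it is in the dictionary, so candidates("rnodern", rules) yields both "modern" and "rnodem", and only the former survives the dictionary check.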

Mashup Framework Components
Diagram labels: Virtualization Infrastructure Meandre Infrastructure Visualization Component Repository Component Discovery Meandre Data-Intensive Flows Apps Services Plugins Web Apps Analytics Data Developer Tools Repositories Data Analysis Components Flows User Interfaces Computational Resources Visualizations Meandre Workbench

Meandre for Mashups
Major Capabilities
- Dataflow execution
- Semantic technology (using RDF for storing meta info)
- Web-oriented
- Supports publishing services for data, analytics and visualization
- Modular components
- Encapsulation and execution mechanism
- Promotes reuse, sharing, and collaboration
- Cloud-friendly infrastructure
- Implements MapReduce for parallelization
- Open source
Note: Trading off some performance for reuse, flexibility and modular components, with the option to parallelize components to improve performance
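To illustrate the general idea of dataflow execution with modular components and a map/reduce step, here is a toy sketch in plain Python. It does not use the Meandre API; the component names and the form-feed page separator are assumptions for illustration.

```python
from collections import Counter
from functools import reduce

def run_flow(components, data):
    """Push data through a linear chain of components, each feeding the next."""
    for component in components:
        data = component(data)
    return data

def split_pages(text):
    """Component: split a document into pages (form feed assumed as separator)."""
    return text.split("\f")

def map_count(pages):
    """Map step: count tokens on each page independently (parallelizable)."""
    return [Counter(page.lower().split()) for page in pages]

def reduce_counts(partials):
    """Reduce step: merge the per-page counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())

flow = [split_pages, map_count, reduce_counts]
counts = run_flow(flow, "the cat\fthe dog")
```

Because each component only depends on its input, components can be reused in other flows, and the map step could be distributed across workers without changing the rest of the chain.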

Meandre Workbench
Screenshot panels: Locations, Components, Flows
- Web-based UI (GWT)
- Components and flows are retrieved from server
- Additional locations of components and flows can be added to server
- Create flow using a graphical drag and drop interface
- Change property values
- Execute the flow

Spellcheck Flow

Knowledge Discovery Infrastructure Benefits
- Provides access to data management tools
- Selecting/loading data from databases, flat files or repositories
- Integrates data mining algorithms
- Supports an extensible interface for creating one's own algorithms
- Provides means for building and applying models
- Provides integrated visualization components
- Provides capability to build custom applications
- Provides access for local or distributed computation
- Provides the ability to share components and applications

From Silos to Mashups
Definition: A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services.
Why do we want this?
- Enable our services in many applications and on a variety of devices (laptop, high-res display wall, iPad, iPhone or others)
- Sharing and reuse are a good thing
- Reach communities with our tools and their data
What can we do to change this?
- We can think about and create data-driven solutions so that they can be mashed up with other tools.
- We can build web services that can be deployed or accessed.
- We can create APIs to be used.

Components
Analytics
- Unsupervised Learning: Clustering, Frequent Pattern Analysis, Topic Modeling (Mallet), Concept Mapping
- Supervised Learning: Naïve Bayesian, Support Vector Machines (Weka), Decision Trees (C4.5)
- Optimization Approaches: Genetic Algorithm
- Text Analysis (NLP, Entity Extraction): OpenNLP, Stanford NER, Spellcheck, OpenMary (NLP, Text-Speech)
Visualization
- Geographic (Google Maps)
- Temporal (Simile)
- Network Graphs: Link Nodes and Arcs (Protovis)
- Line Charts (D3)
- Parallel Coordinates (Protovis)
- Stacked Area Chart (Flare)
- Tag Cloud Maker
- Decision Tree (Applet D2K)
- Naïve Bayes (Applet D2K)
- Rule Association (Applet)
- Dendrogram (GWT)

Topic Modeling
- Uses Mallet Topic Modeling to cluster nouns from over 4000 documents from the 19th century, with 10 segments per document
- Top 10 topics, showing at most 200 keywords for each topic

Concept Mapping
Sentiment Analysis: six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

Thanks
- Xavier Llora, lead developer, now at Google
- Boris Capitanu, developer of Workbench, and now lead developer
- Other team members

Links