University of Illinois OCR Workshop Loretta Auvil UIUC October 18, 2011.

1 University of Illinois OCR Workshop Loretta Auvil UIUC October 18, 2011

2 University of Illinois Correlation-Ngram Viewer Pearson Correlation Algorithm

3 University of Illinois Correlation-Ngram Viewer
  New version of the Google Ngram Viewer (for 1-grams)
  Addresses:
    case sensitivity
    period spellings
    past-tense syncope ('d)
    f/s substitution
    other OCR issues
  Searches within already stored correlation results (using Pearson) for the top 10K ngrams
  Computes correlation (using Pearson) for a given word against the top 1K ngrams
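As an illustration of the Pearson step named on this slide, here is a minimal sketch that correlates one word's yearly frequency series against others and ranks them. The function names, the `top_correlated` helper, and the sample frequencies are hypothetical and are not taken from the viewer itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


def top_correlated(target, series_by_word, k=3):
    """Rank the other words by correlation with the target word's series."""
    scores = [
        (word, pearson(series_by_word[target], series))
        for word, series in series_by_word.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical per-year frequencies for three 1-grams.
series = {
    "musick": [5, 4, 3, 2, 1],
    "publick": [10, 8, 6, 4, 2],
    "music": [1, 2, 3, 4, 5],
}
print(top_correlated("musick", series, k=2))
```

Precomputing these scores for a fixed vocabulary, as the slide describes for the top 10K ngrams, would turn a lookup into a search over stored results rather than a fresh computation.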

4 University of Illinois OCR Correction HTRC
  Example of one of the worst pages of text, based on the number of corrections per word (rate = 0.1994)

5 University of Illinois Worst Page

6 University of Illinois Corrected Page

7 University of Illinois Some Stats
  Statistic | Google Ngram | HTRC 250K Books | Laura's
  Total number of ngrams | 359,511,583,097 | 20,173,974,251 |
  Total number of ngrams (ignoring punctuation chars) | 306,780,490,555 | |
  Total number of ngrams (ignoring numbers only & repeating characters, other noise that I could easily identify) | 293,760,570,946 | 19,282,108,416 | 593,055
  Total number of corrections that we have made | 1,660,948,155 | 131,571,046 | 4,294
  Percent of cleaning | 0.57% | 0.68% | 0.72%
  Remaining rows (values in slide order; the transcript does not show which columns they belong to):
  Unique ngrams before cleaning: 7,380,256; 24,545
  Unique ngrams after cleaning: 4,977,548; 22,354
  Number of unique misspelled words: 17,906
  Number of unique misspelled words with no suggested replacement: 11,143
  Number of generated rules: 154227; 6,763
  Number of valid rules: 99,455; 3,751
  Number of rules that are shorter than 5 chars and ignored: 7,076; 1,674
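The "Percent of Cleaning" figures on this slide appear to be the number of corrections divided by the noise-filtered ngram total. That reading is an assumption, but it reproduces all three percentages. A short check, with hypothetical names:

```python
def percent_cleaned(corrections, total_ngrams):
    """Corrections as a percentage of the noise-filtered ngram total."""
    return 100.0 * corrections / total_ngrams


# (corrections, total ngrams after ignoring numbers/repeats/noise), from the slide.
stats = {
    "Google Ngram": (1_660_948_155, 293_760_570_946),
    "HTRC 250K Books": (131_571_046, 19_282_108_416),
    "Laura's": (4_294, 593_055),
}

for name, (corrections, total) in stats.items():
    print(f"{name}: {percent_cleaned(corrections, total):.2f}%")
```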

8 University of Illinois Spellcheck Component
  Wrapped existing spellchecker from com.swabunga.spell
  Input
    Dictionary: defines the correct words
    Transformations: a set of rules that should be tried on misspelled words before taking the spell checker's suggestions
    Token counts: a set of counts that can be used to choose a word when the spell checker suggests multiple ones
  Output
    Replacement Rules: the transformation rules for misspelled words
    Replacements: suggestions for misspelled words
    Corrected Text: the original text with corrections applied
    Uncorrected Misspellings: the list of words for which a correction/replacement could not be found
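The order of operations described on this slide (transformation rules first, then spell checker suggestions, with token counts breaking ties) can be sketched as follows. This is not the SEASR component or the com.swabunga.spell API: `correct_text`, `candidates_from_rules`, and the `suggest` callable standing in for the wrapped spell checker are all hypothetical, and rules are assumed to be (misread, intended) pairs.

```python
def candidates_from_rules(word, rules):
    """Apply each (misread, intended) rule at every position it matches."""
    for misread, intended in rules:
        start = word.find(misread)
        while start != -1:
            yield word[:start] + intended + word[start + len(misread):]
            start = word.find(misread, start + 1)


def correct_text(tokens, dictionary, rules, token_counts, suggest):
    """Return (corrected tokens, replacements, uncorrected misspellings)."""
    replacements = {}  # misspelling -> chosen correction
    uncorrected = []   # misspellings for which nothing was found
    corrected = []
    for token in tokens:
        if token in dictionary:
            corrected.append(token)
            continue
        if token not in replacements and token not in uncorrected:
            # 1. Try the transformation rules before asking the spell checker.
            options = [c for c in candidates_from_rules(token, rules)
                       if c in dictionary]
            # 2. Fall back to the spell checker's own suggestions.
            if not options:
                options = list(suggest(token))
            if options:
                # 3. Token counts choose among multiple candidates.
                replacements[token] = max(
                    options, key=lambda c: token_counts.get(c, 0))
            else:
                uncorrected.append(token)
        corrected.append(replacements.get(token, token))
    return corrected, replacements, uncorrected


# Hypothetical inputs for illustration only.
dictionary = {"the", "modern", "first", "cat", "cot"}
rules = [("rn", "m"), ("f", "s")]
token_counts = {"cat": 10, "cot": 2}
suggest = lambda word: {"cst": ["cot", "cat"]}.get(word, [])

print(correct_text(["the", "rnodern", "firft", "cst", "xqz"],
                   dictionary, rules, token_counts, suggest))
```

The three return values correspond loosely to the slide's Corrected Text, Replacements, and Uncorrected Misspellings outputs.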

9 University of Illinois Adding Levenshtein
  Use the Levenshtein algorithm to filter the list of suggestions considered
  The Levenshtein distance is a metric for measuring the amount of difference between two sequences.
  The value of this property is expressed as a percentage that will depend on the length of the misspelled word
  [Example shown as an image on the slide]
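A minimal sketch of this filtering step: compute the Levenshtein distance with the standard dynamic-programming recurrence, then keep only suggestions whose distance is within a percentage of the misspelled word's length. The `max_percent` parameter name and its default are assumptions; the slide does not give the property's name or value.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def filter_suggestions(misspelled, suggestions, max_percent=30):
    """Keep suggestions within max_percent of the misspelled word's length."""
    limit = len(misspelled) * max_percent / 100
    return [s for s in suggestions if levenshtein(misspelled, s) <= limit]


print(levenshtein("firft", "first"))
print(filter_suggestions("rnodern", ["modern", "modem", "northern"]))
```

Scaling the limit by word length means short words tolerate fewer edits than long ones, which matches the slide's remark that the threshold depends on the length of the misspelled word.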

10 University of Illinois Transformation Rules
  Complete List
  o=0; i=1; l=1; z=2; o=3; e=3; s=3; d=3; t=4; e=4; l=4; s=o; s=5; c=6; e=6; fi=6; o=6; l=7; z=7; y=7; j=8; g=8; s=8; a=9; c=9; g=9; o=9; ti=9;
  b={h,o}; c={e,o,q}; cl={ct,d}; ct={cl,d,dl,dt,ft}; d={cl,ct}; dl=ct; dt=ct; e=c; fl={ss,st}; ft=ct; h={li,b,ii,ll}; i=l; j=y; l=i; li=h; m={rn,lll}; n={ll,il,h}; oe=ce; r=ll; rn=m; s=f; sh={fli,ih,jb,jh,m,sb}; ss=fl; st=fl; tb=th; th=tb; v=y; u={ll,n,ti}; y={j,v};
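The rule notation above can be parsed mechanically. The sketch below assumes that the left side of each rule is the intended text and the right side lists what OCR may have produced instead (so `m={rn,lll}` reads "m is sometimes misread as rn or lll"). The slide does not state the direction, and the function names are hypothetical.

```python
def parse_rules(text):
    """Parse 'a=b; c={d,e};' into {intended: [misread, ...]}."""
    rules = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        intended, misreads = entry.split("=", 1)
        alternatives = misreads.strip().strip("{}").split(",")
        # The same left side can appear several times (e.g. o=0; o=3), so merge.
        rules.setdefault(intended.strip(), []).extend(
            alt.strip() for alt in alternatives)
    return rules


def candidates(word, rules):
    """All words obtained by undoing one assumed misreading at one position."""
    found = set()
    for intended, misreads in rules.items():
        for misread in misreads:
            start = word.find(misread)
            while start != -1:
                found.add(word[:start] + intended + word[start + len(misread):])
                start = word.find(misread, start + 1)
    return found


# A few rules copied from the slide's list.
sample = "o=0; o=3; i=1; l=1; m={rn,lll}; s=f; h={li,b,ii,ll};"
rules = parse_rules(sample)
print(rules["o"], rules["m"])
print("modern" in candidates("rnodern", rules))
```

Each candidate would still need to be checked against the dictionary, since most single substitutions do not produce a real word.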

11 University of Illinois Mashup Framework
  [Architecture diagram; labels in extracted order: Components Virtualization Infrastructure Meandre Infrastructure Visualization Component Repository Component Discovery Meandre Data-Intensive Flows Apps Services Plugins Web Apps Analytics Data Developer Tools Repositories Data Analysis Components Flows User Interfaces Computational Resources Visualizations Meandre Workbench]

12 University of Illinois Meandre for Mashups
  Major Capabilities
    Dataflow execution
    Semantic technology (using RDF for storing meta info)
    Web-oriented
    Supports publishing services for data, analytics and visualization
    Modular components
    Encapsulation and execution mechanism
    Promotes reuse, sharing, and collaboration
    Cloud-friendly infrastructure
    Implements MapReduce for parallelization
    Open source
  Note: Trading off some performance for reuse, flexibility and modular components, with the option to parallelize components to improve performance
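To make the idea of modular components wired into a dataflow concrete, here is a toy sketch in plain Python. It does not use or imitate the Meandre API: the component functions, `run_flow`, and `run_flow_parallel` are hypothetical, and the parallel variant only maps a flow over independent inputs rather than implementing MapReduce.

```python
from concurrent.futures import ThreadPoolExecutor


# Each "component" is a small reusable function with one input and one output.
def tokenize(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def count(tokens):
    return len(tokens)


def run_flow(components, data):
    """Push data through the components in order, like a linear dataflow."""
    for component in components:
        data = component(data)
    return data


def run_flow_parallel(components, inputs, workers=4):
    """Run the same flow over many independent inputs concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: run_flow(components, item), inputs))


print(run_flow([tokenize, lowercase], "OCR Workshop"))
print(run_flow_parallel([tokenize, count], ["a b", "c d e"]))
```

Because each component only sees its input and returns its output, the same pieces can be reused in different flows, which is the reuse-for-performance trade-off the slide's note describes.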

13 University of Illinois Meandre Workbench
  [Screenshot labels: Locations, Components, Flows]
  Web-based UI (GWT)
  Components and flows are retrieved from server
  Additional locations of components and flows can be added to server
  Create flow using a graphical drag and drop interface
  Change property values
  Execute the flow

14 University of Illinois Spellcheck Flow

15 University of Illinois Knowledge Discovery Infrastructure Benefits
  Provides access to data management tools
    Selecting/loading data from databases, flat files or repositories
  Integrates data mining algorithms
  Supports an extensible interface for creating one's own algorithms
  Provides means for building and applying models
  Provides integrated visualization components
  Provides capability to build custom applications
  Provides access for local or distributed computation
  Provides the ability to share components and applications

16 University of Illinois From Silos to Mashups
  Definition: A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services
  Why do we want this?
    Enable our services in many applications and on a variety of devices (laptop, high-res display wall, iPad, iPhone or others)
    Sharing and reuse are good things
    Reach communities with our tools and their data!
  What can we do to change this?
    We can think about and create data-driven solutions so that they can be mashed up with other tools.
    We can build web services that can be deployed or accessed.
    We can create APIs to be used.

17 University of Illinois Components
  Analytics
    Unsupervised Learning: Clustering; Frequent Pattern Analysis; Topic Modeling (Mallet); Concept Mapping
    Supervised Learning: Naïve Bayesian; Support Vector Machines (Weka); Decision Trees (C4.5)
    Optimization Approaches: Genetic Algorithm
    Text Analysis (NLP, Entity Extraction): OpenNLP; Stanford NER; Spellcheck; OpenMary (NLP, Text-Speech)
  Visualization
    Geographic (Google Maps)
    Temporal (Simile)
    Network Graphs – Link Nodes and Arcs (Protovis)
    Line Charts (D3)
    Parallel Coordinates (Protovis)
    Stacked Area Chart (Flare)
    Tag Cloud Maker
    Decision Tree (Applet D2K)
    Naïve Bayes (Applet D2K)
    Rule Association (Applet)
    Dendrogram (GWT)

18 University of Illinois Topic Modeling
  Uses Mallet Topic Modeling to cluster nouns from over 4000 documents from the 19th century, with 10 segments per document
  Top 10 topics, showing at most 200 keywords for each topic

19 University of Illinois Concept Mapping Sentiment Analysis six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

20 University of Illinois Thanks
  Xavier Llora, lead developer, now at Google
  Boris Capitanu, developer of Workbench, and now lead developer
  Other team members

21 University of Illinois Links www.seasr.org www.seasr.org/meandre

