Copyright © 2011, Oracle and/or its affiliates. All rights reserved. Oracle Enterprise Data Quality CDEP: Tailoring Parser Configuration.



2 Topic 1: Base Tokenization

3 The Parse Processor has seven sub-processors.

4 The Parse Processor's Tabs: All tabs are available from the Parse processor. A subset of tabs is available from the sub-processors.

5 The Input Sub-Processor (1)
Pipeline diagram: Input data → Input → Map → Tokenize → Classify → Reclassify → Select → Resolve → Output data and metadata
Input: Select the inputs to be parsed.
Map: Map the inputs. Can use as a template.
Tokenize: Syntactic analysis of data. Split data into base tokens.
Classify: Semantic analysis of data. Assign meaning to tokens.
Reclassify: Examine token sequences for new classified tokens.
Select: Select the best description of the data, where possible.
Resolve: Resolve data to its desired structure and give a result.

6 The Input Sub-Processor (2): Define input attribute(s).

7 The Map Sub-Processor (1): [pipeline diagram as on slide 5] Map: Map the inputs. Can use as a template.

8 The Map Sub-Processor (2): The same processor configuration can be used in different processes, regardless of input attribute names.

9 The Tokenize Sub-Processor (1): [pipeline diagram as on slide 5] Tokenize: Syntactic analysis of data. Split data into base tokens.

10 The Tokenize Sub-Processor (2): Defines how base tokens are created.

11 Base Tokenization Map: Tokenization uses Global Reference Data.

12 The Tokenize Sub-Processor (3): Configuration: Character Map reference data; token splitting rules. Tokenization rules yield distinct 'base tokens' and give them a tag based on their character types.
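The idea of tagging base tokens by character type can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the tag letters (A, N, _, P), the `char_tag` function standing in for Character Map reference data, and the rule "split wherever the character type changes" are all assumptions made for the example.

```python
import itertools

def char_tag(ch: str) -> str:
    """Hypothetical character map: one tag per character type."""
    if ch.isalpha():
        return "A"   # alphabetic
    if ch.isdigit():
        return "N"   # numeric
    if ch.isspace():
        return "_"   # whitespace
    return "P"       # punctuation and anything else

def base_tokens(text: str) -> list[tuple[str, str]]:
    """Split text into runs of same-type characters and tag each run."""
    return [
        ("".join(run), tag)
        for tag, run in itertools.groupby(text, key=char_tag)
    ]

for value, tag in base_tokens("23 Arcadia Avenue, Seaford"):
    print(f"{tag}: {value!r}")
```

A real character map would likely distinguish more character types and apply additional splitting rules; the sketch only shows how syntactic tags can be derived before any meaning is assigned.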

13 Lab Overview. Lab 1: Examine Base Tokens. Connect the Unstructured Name Parser to a Data Source. Read the Online Help. Run Parsing and Study Base Tokenization Results.

14 Topic 2: Classification

15 The Classify Sub-Processor (1): [pipeline diagram as on slide 5] Classify: Semantic analysis of data. Assign meaning to tokens.

16 The Classify Sub-Processor (2): Example 1: Create reference data to store invalid values, then classify tokens against these values. Example 2: Use a simple RegEx token check to find numbers in a name field.

17 The Classify Sub-Processor (3): [screenshots: Input Data, Reference Data, After Classification]

18 The Classify Sub-Processor (4): Adds semantic meaning to the data by classifying tokens using rules. Diagram labels: _ DH4 8NG, RegEx Token Check, BS, Pattern Token Check, Volkswagen, List Token Check.
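As a rough illustration of list and RegEx token checks, the following Python sketch classifies a token against a small reference list and two regular expressions. Everything here is assumed for the example: the classification names, the `CAR_MAKES` list, the simplified postcode expression, and the pairing of each sample value with a particular kind of check.

```python
import re

# Hypothetical reference data for a list token check.
CAR_MAKES = {"volkswagen", "ford", "toyota"}

# Simplified UK postcode shape, for illustration only.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")

def classify(token: str) -> str:
    """Return a hypothetical classification name for one token."""
    if token.lower() in CAR_MAKES:          # list token check
        return "CarMake"
    if POSTCODE_RE.match(token.upper()):    # RegEx token check
        return "Postcode"
    if re.search(r"[0-9]", token):          # RegEx check: digits in a name field
        return "ContainsNumber"
    return "Unclassified"

for sample in ["Volkswagen", "DH4 8NG", "J0nes", "Smith"]:
    print(sample, "->", classify(sample))
```

Tokens that no check recognizes stay unclassified, which is what later stages such as reclassification try to reduce.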

19 Lab Overview. Lab 2: Classify Tokens. Examine Parsing Results. Examine Classification Results. Examine Classification Rules and Reference Data. Amend the Reference Data Supplied with a Token Check. Create a New Token Check. Lab 3 (Optional): Classify Suspect Data.

20 Topic 3: Reclassification

21 The Reclassify Sub-Processor (1): [pipeline diagram as on slide 5] Reclassify: Examine token sequences for new classified tokens.

22 The Reclassify Sub-Processor (2): Further reduces the number of unclassified tokens, based on position within a pattern. E.g. a Family Name usually appears after a Given Name.

23 Syntax and Wildcards: Token names are case-sensitive (validName ≠ validname). Use {} to show the exact number of occurrences of the token shown in []. Use * to signify any number of occurrences. Example pattern [token names missing in source]: [ ]{0,1}([ ]* )[.]* Would match both: "Tom Jones Avenue" and "23 Arcadia Avenue, Seaford".

24 Use of Spaces in Token Patterns: The space token is _ (underscore). Patterns for reclassification can contain spaces. If the space is left in, reclassification will only match the pattern with the space. Best to remove spaces completely. Example [token names missing in source]: will match both: _ (1 space) __ (2 spaces); _ will only match: _

25 Classification vs Reclassification: Classify: Identifies semantic meaning of data by classifying tokens based on rules and token checks (e.g. list, RegEx, pattern). Reclassify: Optionally allows the reduction of token patterns by recognizing sequences and reclassifying them as a new token, with a given confidence level. Also allows identification of text by positional reference to an existing token, e.g. extract the number immediately before a recognised unit.
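The positional example (a number immediately before a recognised unit) can be sketched as a pass over already classified tokens. This is a simplified illustration, not the product's rule syntax: the token names `N`, `Unit` and `Quantity` are hypothetical, and whitespace tokens are assumed to have been removed beforehand.

```python
def reclassify_quantities(tokens):
    """Merge a numeric token followed by a Unit token into one Quantity token.

    tokens is a list of (value, token_name) pairs; names are hypothetical.
    """
    out = []
    i = 0
    while i < len(tokens):
        value, name = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if name == "N" and nxt is not None and nxt[1] == "Unit":
            # Positional match: a number directly before a recognised unit.
            out.append((f"{value} {nxt[0]}", "Quantity"))
            i += 2
        else:
            out.append((value, name))
            i += 1
    return out

print(reclassify_quantities([("500", "N"), ("ml", "Unit"), ("bottle", "A")]))
```

Merging two tokens into one shortens the token pattern, which is the "reduction of token patterns" described above.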

26 Lab Overview. Lab 4: Examine Pre-Configured Reclassification Rules. Examine an Existing Rule. Lab 5: Create a New Reclassification Rule. Create Reference Data. Create a New Classification Rule for AccountTypeLiteral. Create a Reclassification Rule for AccountType.

27 Topic 4: Selection

28 The Select Sub-Processor (1): [pipeline diagram as on slide 5] Select: Select the best description of the data, where possible.

29 The Select Sub-Processor (2): All possible token patterns for each record are initially created. The Selection step selects the most likely correct pattern. Resolution rules are created in the Select tab of the Results Browser.

30 The Select Sub-Processor (3): The Select sub-processor uses a tuneable point-scoring algorithm. It subtracts points for each token marked as unclassified or possible. The pattern with the winning score is selected. The default points to subtract can be tuned in the Select dialog box. More details, plus a worked example, are in the online Help.
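A point-scoring selection of this kind might look like the sketch below. The starting score of 100 and both penalty values are invented for illustration (the actual defaults are in the Select dialog box and the online Help), and tie handling is not modelled.

```python
# Hypothetical penalties; the real defaults are tuned in the Select dialog box.
UNCLASSIFIED_PENALTY = 10
POSSIBLE_PENALTY = 5

def score(pattern):
    """pattern: list of (token_name, status); status is 'valid', 'possible' or 'unclassified'."""
    penalties = {"unclassified": UNCLASSIFIED_PENALTY, "possible": POSSIBLE_PENALTY}
    return 100 - sum(penalties.get(status, 0) for _, status in pattern)

def select(patterns):
    """Return the candidate pattern with the highest score (first one wins a tie)."""
    return max(patterns, key=score)

candidate_a = [("Title", "valid"), ("GivenName", "valid"), ("FamilyName", "possible")]
candidate_b = [("Title", "valid"), ("A", "unclassified"), ("FamilyName", "valid")]
print(score(candidate_a), score(candidate_b))
print(select([candidate_b, candidate_a]))
```

Penalising unclassified tokens more heavily than possible ones is a design assumption here; changing the two constants changes which description of the data wins.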

31 Lab Overview. Lab 6: Selection. Investigate Results from the Select Sub-Processor.

32 Topic 5: Resolution

33 The Resolve Sub-Processor (1): [pipeline diagram as on slide 5] Resolve: Resolve data to its desired structure and give a result.

34 The Resolve Sub-Processor (2): Resolution Rules determine whether data is output through the Pass, Review or Fail ports.

35 Resolution Rules: Control whether records are output through the Pass, Review or Fail ports. Can include comments, e.g. "Fail because of suspect data." or "Review because the pattern contains no title." Can be: Exact: Apply to one specific pattern of tokens only. Fuzzy: Include wildcards and options. Can match against multiple different token patterns.

36 Exact vs Fuzzy Rules: Exact Rules are used to match single token patterns precisely. Fuzzy Rules use wildcards to match a number of token patterns. Rules are processed in order: exact rules first, then fuzzy rules, each in order of appearance within its section. [.]* may be used as a fuzzy wildcard to output "everything else remaining"; this should be the final rule in the list.
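The processing order (exact rules first, then fuzzy rules, each in list order, with a catch-all last) can be sketched as follows. Python regular expressions stand in for the parser's own fuzzy pattern syntax, and the token names, rules and results are hypothetical.

```python
import re

# Exact rules: the whole token pattern must equal the rule text.
EXACT_RULES = [
    ("Title GivenName FamilyName", "Pass"),
]

# Fuzzy rules: Python regexes used as a stand-in for the fuzzy pattern syntax.
FUZZY_RULES = [
    (r"(Title )?GivenName( GivenName)* FamilyName", "Pass"),
    (r".*Suspect.*", "Fail"),
    (r".*", "Review"),  # catch-all for "everything else remaining"; kept last
]

def resolve(pattern: str) -> str:
    """Return the result of the first matching rule, exact rules before fuzzy ones."""
    for rule, result in EXACT_RULES:
        if pattern == rule:
            return result
    for rule, result in FUZZY_RULES:
        if re.fullmatch(rule, pattern):
            return result
    return "Unresolved"  # only reachable if there is no catch-all rule

for p in ["Title GivenName FamilyName", "GivenName Suspect FamilyName", "FamilyName"]:
    print(p, "->", resolve(p))
```

If the catch-all rule were placed earlier in the fuzzy list, every later rule would be shadowed, which is why it belongs at the end.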

37 Automatic Extraction: Optional feature that creates output without the need for Resolution Rules. Also deals with records not picked up by Resolution Rules. Switched off in the Unstructured Name Parser. [screenshots: Automatic Extraction ON, Automatic Extraction OFF]

38 Lab Overview. Lab 7: Examine Existing Resolution Rules. Examine a Pre-Configured Resolution Rule. Lab 8: Create a New Resolution Rule. Create a New Output Attribute for the Account Number. Create a New Resolution Rule for the Account Number. Lab 9 (Optional): Create Further Resolution Rules. Resolve Patterns that Include a [token name missing in source]. Amend the Parser's Configuration to Resolve the Remaining Records.

39 Run Profiles Overview: With Run Profiles you can: Set up configuration overrides for a job, e.g. to load a different set of data, load only a sample of data, use different processor options (including reference data), or export to a different target. Apply your overrides at run-time. Run Profiles promote: Configuration reuse. Efficient configuration management.