Published by Eileen Stafford. Modified over 9 years ago.

Oracle Enterprise Data Quality CDEP: Tailoring Parser Configuration
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.

Topic 1: Base Tokenization

The Parse Processor
The Parse processor has seven sub-processors.

The Parse Processor’s Tabs
All tabs are available from the Parse processor.
A subset of the tabs is available from each sub-processor.

The Input Sub-Processor (1)
Input data
Input: Select the inputs to be parsed.
Map: Map the inputs. Can use as a template.
Tokenize: Syntactic analysis of data. Split data into base tokens.
Classify: Semantic analysis of data. Assign meaning to tokens.
Reclassify: Examine token sequences for new classified tokens.
Select: Select the best description of the data, where possible.
Resolve: Resolve data to its desired structure and give a result.
Output data and metadata

The Input Sub-Processor (2)
Define the input attribute(s).

The Map Sub-Processor (1)
Map: Map the inputs. Can use as a template.

The Map Sub-Processor (2)
The same processor configuration can be used in different processes, regardless of input attribute names.

The Tokenize Sub-Processor (1)
Tokenize: Syntactic analysis of data. Split data into base tokens.

The Tokenize Sub-Processor (2)
Defines how base tokens are created.

Base Tokenization Map
Tokenization uses Global Reference Data.

The Tokenize Sub-Processor (3)
Configuration:
– Character Map reference data.
– Token splitting rules.
Tokenization rules yield distinct ‘base tokens’ and give them a tag based on their character types.
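
The idea of tagging base tokens by character type can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the tag letters (A, N, _, P) and the rule "split wherever the character type changes" are assumptions standing in for the Character Map reference data and the token splitting rules.

```python
from itertools import groupby


def char_tag(ch: str) -> str:
    """Map one character to an assumed tag: A = alphabetic, N = numeric,
    _ = whitespace, P = anything else (punctuation)."""
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "N"
    if ch.isspace():
        return "_"
    return "P"


def base_tokenize(text: str) -> list[tuple[str, str]]:
    """Split text into runs of characters that share a tag (assumed splitting rule)."""
    return [(tag, "".join(chars)) for tag, chars in groupby(text, key=char_tag)]


for tag, value in base_tokenize("25-27 Tom Jones Avenue"):
    print(tag, repr(value))
```

A real character map is configurable per character, so a production configuration may split or tag differently from this sketch.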

Lab Overview
Lab 1: Examine Base Tokens
– Connect the Unstructured Name Parser to a Data Source
– Read the Online Help
– Run Parsing and Study Base Tokenization Results

Topic 2: Classification

The Classify Sub-Processor (1)
Classify: Semantic analysis of data. Assign meaning to tokens.

The Classify Sub-Processor (2)
Example 1: Create reference data to store invalid values, then classify tokens against these values.
Example 2: Use a simple RegEx token check to find numbers in a name field.

The Classify Sub-Processor (3)
[Figure: Input Data, Reference Data, After Classification]

The Classify Sub-Processor (4)
Adds semantic meaning to the data by classifying tokens using rules.
DH4 8NG: RegEx Token Check
97-1438-BS: Pattern Token Check
Volkswagen: List Token Check
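
The three kinds of token check can be imitated in Python to show how each one decides. Everything named here is hypothetical: the token names (Make, Postcode, PartNumber), the reference list, the regular expression, and the character-pattern notation are assumptions chosen to fit the three sample values, not the parser's shipped configuration.

```python
import re

# Hypothetical reference data for a list check.
CAR_MAKES = {"volkswagen", "ford"}

# Hypothetical regular expression for a UK-style postcode.
POSTCODE_RE = re.compile(r"[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}")

# Hypothetical character pattern: N = digit, A = letter, other characters literal.
PART_NUMBER_PATTERN = "NN-NNNN-AA"


def char_pattern(value: str) -> str:
    """Reduce a value to its character pattern, e.g. '97-1438-BS' -> 'NN-NNNN-AA'."""
    return "".join("N" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def classify(value: str) -> str:
    """Return the first token name whose check accepts the value."""
    if value.lower() in CAR_MAKES:                 # list token check
        return "Make"
    if POSTCODE_RE.fullmatch(value):               # RegEx token check
        return "Postcode"
    if char_pattern(value) == PART_NUMBER_PATTERN:  # pattern token check
        return "PartNumber"
    return "Unclassified"


for sample in ["DH4 8NG", "97-1438-BS", "Volkswagen", "???"]:
    print(sample, "->", classify(sample))
```

The sketch returns a single classification per value for brevity; a value that satisfied several checks would lose that information here.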

Lab Overview
Lab 2: Classify Tokens
– Examine Parsing Results
– Examine Classification Results
– Examine Classification Rules and Reference Data
– Amend the Reference Data Supplied with a Token Check
– Create a New Token Check
Lab 3 (Optional): Classify Suspect Data

Topic 3: Reclassification

The Reclassify Sub-Processor (1)
Reclassify: Examine token sequences for new classified tokens.

The Reclassify Sub-Processor (2)
Further reduces the number of unclassified tokens.
Based on position within a pattern.
– e.g. a Family Name usually appears after a Given Name.
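
Positional reclassification can be pictured as a rule over the sequence of token tags. The sketch below is an assumption-laden illustration, not the parser's rule syntax: the tags GivenName, FamilyName, A (unclassified alphabetic) and _ (space) are used as hypothetical names, and the rule is hard-coded rather than configured.

```python
def reclassify(tags: list[str]) -> list[str]:
    """Relabel an unclassified alphabetic token as FamilyName when it
    directly follows a GivenName and a single space token."""
    result = list(tags)
    for i in range(2, len(result)):
        if result[i] == "A" and result[i - 1] == "_" and result[i - 2] == "GivenName":
            result[i] = "FamilyName"
    return result


print(reclassify(["GivenName", "_", "A"]))
print(reclassify(["A", "_", "A"]))  # no GivenName in front, so nothing changes
```

Note that the rule only fires when exactly one space token separates the two names, which is the same sensitivity to spaces that token patterns have.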

Syntax and Wildcards
Token names are case-sensitive (validName ≠ validname).
Use {} to show the number of occurrences of the token shown in [].
Use * to signify any number of occurrences.
[ ]{0,1}([ ]* )[.]*
Would match both:
25-27 Tom Jones Avenue
23 Arcadia Avenue, Seaford

Example: Use of Spaces in Token Patterns
Space token is _ (underscore).
Patterns for reclassification can contain spaces.
If the space is left in, reclassification will only match patterns that contain the space.
Best to remove spaces completely.
will match both: _ (1 space) and __ (2 spaces)
_ will only match: _

Classification vs Reclassification
Classify: Identifies semantic meaning of data by classifying tokens based on rules and token checks (e.g. list, RegEx, pattern).
Reclassify: Optionally allows the reduction of token patterns by recognizing sequences and reclassifying them as a new token, with a given confidence level. Also allows identification of text by positional reference to an existing token.
– E.g. extract the number immediately before a recognised unit.

Lab Overview
Lab 4: Examine Pre-Configured Reclassification Rules
– Examine an Existing Rule
Lab 5: Create a New Reclassification Rule
– Create Reference Data
– Create a New Classification Rule for AccountTypeLiteral
– Create a Reclassification Rule for AccountType

Topic 4: Selection

The Select Sub-Processor (1)
Select: Select the best description of the data, where possible.

The Select Sub-Processor (2)
All possible token patterns for each record are created initially.
The selection step selects the most likely correct pattern.
Resolution rules are created in the Select tab of the Results Browser.

The Select Sub-Processor (3)
The Select sub-processor uses a tuneable point-scoring algorithm.
It subtracts points for each token marked as unclassified or possible.
The pattern with the winning score is selected.
The default points to subtract can be tuned in the Select dialog box.
More details, plus a worked example, are in the online Help.
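
A point-scoring selection of this kind can be sketched as follows. The penalty values and the status labels are hypothetical, chosen only to make the arithmetic visible; the real defaults live in the Select dialog box, and tie handling is not modelled here.

```python
# Hypothetical penalties per token status (not the product defaults).
PENALTY = {"valid": 0, "possible": 5, "unclassified": 10}

Pattern = list[tuple[str, str]]  # (token name, status) pairs


def score(pattern: Pattern) -> int:
    """Start from zero and subtract the penalty for every token in the pattern."""
    return -sum(PENALTY[status] for _, status in pattern)


def select(patterns: list[Pattern]) -> Pattern:
    """Pick the candidate pattern with the highest (least negative) score."""
    return max(patterns, key=score)


candidates = [
    [("GivenName", "valid"), ("A", "unclassified")],        # score -10
    [("GivenName", "valid"), ("FamilyName", "possible")],   # score -5
]
print(select(candidates))
```

With these assumed penalties the second candidate wins because a "possible" token costs fewer points than an unclassified one.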

Lab Overview
Lab 6: Selection
– Investigate Results from the Select Sub-Processor

Topic 5: Resolution

The Resolve Sub-Processor (1)
Resolve: Resolve data to its desired structure and give a result.

The Resolve Sub-Processor (2)
Resolution Rules determine whether data is output through the Pass, Review or Fail ports.

Resolution Rules
Resolution Rules:
– Control whether records are output through the Pass, Review or Fail ports.
– Can include comments, e.g. "Fail because of suspect data." or "Review because the pattern contains no title."
Can be:
– Exact: Apply to one specific pattern of tokens only.
– Fuzzy: Include wildcards and options. Can match against multiple different token patterns.

Exact vs Fuzzy Rules
Exact Rules are used to match single token patterns precisely.
Fuzzy Rules use wildcards to match a number of token patterns.
Rules are processed in order: exact rules first, then fuzzy rules, each in order of appearance within its section.
[.]* may be used as a fuzzy wildcard to output “everything else remaining”. This should be the final rule in the list.
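
The processing order can be made concrete with a small sketch. It is not the parser's rule engine: token patterns are written as space-separated strings, Python regular expressions stand in for the fuzzy wildcard syntax, and the token names and rules are hypothetical.

```python
import re

# Hypothetical rules: (pattern, result). Exact rules are compared as-is.
EXACT_RULES = [("GivenName _ FamilyName", "Pass")]

# Fuzzy rules use Python regex as a stand-in for the parser's wildcards.
# The catch-all ".*" plays the role of the final "everything else" rule.
FUZZY_RULES = [(r".*Suspect.*", "Fail"), (r".*", "Review")]


def resolve(pattern: str):
    """Try exact rules in order, then fuzzy rules in order; first match wins."""
    for rule, result in EXACT_RULES:
        if pattern == rule:
            return result
    for rule, result in FUZZY_RULES:
        if re.fullmatch(rule, pattern):
            return result
    return None  # no rule matched; what happens next is not modelled here


print(resolve("GivenName _ FamilyName"))  # Pass
print(resolve("Title _ Suspect"))         # Fail
print(resolve("A _ A"))                   # Review
```

Placing the catch-all last matters: if it came first among the fuzzy rules, the Suspect rule would never be reached.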

Automatic Extraction
Optional feature that creates output without the need for Resolution Rules.
Also deals with records not picked up by Resolution Rules.
Switched off in the Unstructured Name Parser.
[Figure: Automatic Extraction ON vs Automatic Extraction OFF]

Lab Overview
Lab 7: Examine Existing Resolution Rules
– Examine a Pre-Configured Resolution Rule
Lab 8: Create a New Resolution Rule
– Create a New Output Attribute for the Account Number
– Create a New Resolution Rule for the Account Number
Lab 9 (Optional): Create Further Resolution Rules
– Resolve Patterns that Include a
– Amend the Parser’s Configuration to Resolve the Remaining Records

Run Profiles Overview
With Run Profiles you can:
Set up configuration overrides for a job, e.g. to:
– Load a different set of data.
– Load only a sample of data.
– Use different processor options (including reference data).
– Export to a different target.
Apply your overrides at run-time.
Run Profiles promote:
– Configuration reuse.
– Efficient configuration management.