2
Data Mining, Information Theory and Image Interpretation
Sargur N. Srihari
Center of Excellence for Document Analysis and Recognition and Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260 USA
3
Data Mining
Search for Valuable Information in Large Volumes of Data
Knowledge Discovery in Databases (KDD)
Discovery of Hidden Knowledge, Unexpected Patterns and New Rules from Large Databases
4
Information Theory
Definitions of Information:
–Communication Theory: Entropy (Shannon-Weaver), Stochastic Uncertainty, Bits
–Information Science: Data of Value in Decision Making
5
Image Interpretation
Use of knowledge in assigning meaning to an image
Pattern Recognition using Knowledge
Processing Atoms (Physical) as Bits (Information)
6
Address Interpretation Model
[Diagram: the address image (x) enters Address Interpretation (AI), which produces the interpretation I(x)]
Knowledge source (K):
–Mail stream (S)
–Postal address directory (D)
7
Typical American Address
Address Directory Size: 139 million records
8
Assignment Strategies
Typical street address
ZIP Code: 14221
Primary number: 276
Database query
Results
Word Recognizer selects (after lexicon expansion)
Delivery point: 142213557
Address encoding
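The strategy on this slide (query the directory with the recognized ZIP Code and primary number, then let the word recognizer pick a street name from the returned lexicon) can be sketched as follows. This is a minimal illustration, not the system's implementation: the directory rows, street names and add-on values are hypothetical toy data, `difflib.get_close_matches` merely stands in for the word recognizer, and the lexicon expansion step is omitted.

```python
from difflib import get_close_matches

# Hypothetical toy directory rows: (zip5, primary_number, street_name, zip4_addon).
DIRECTORY = [
    ("14221", "276", "MAIN ST", "0001"),
    ("14221", "276", "MAPLE RD", "0002"),
    ("14221", "300", "MAIN ST", "0003"),
]

def street_lexicon(zip5, primary):
    """Database query: street names valid for (ZIP, primary number),
    each mapped to its 9-digit delivery point."""
    return {
        street: zip5 + addon
        for z, p, street, addon in DIRECTORY
        if z == zip5 and p == primary
    }

def assign_delivery_point(zip5, primary, street_text):
    """Pick the lexicon entry closest to the recognized street text.
    String similarity is an assumption standing in for a word recognizer."""
    lexicon = street_lexicon(zip5, primary)
    match = get_close_matches(street_text.upper(), list(lexicon), n=1, cutoff=0.6)
    return lexicon[match[0]] if match else None

print(assign_delivery_point("14221", "276", "Maple Rd"))
```

The smaller the lexicon returned by the query, the easier the recognizer's choice, which is why the later slides measure lexicon size and uncertainty per field.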
9
Australian Address
Delivery Point ID: 66568882
Postal Directory Size: 9.4 million records
10
Canadian Address
Postal code: H1X 3B3
Postal Directory: 12.7 million records
11
United Kingdom Address
Postcode: TN23 1EU (unique postcode)
Delivery Point Suffix: 1A (default)
Address Directory Size: 26 million records
12
Motivation for Information Theoretic Study
Understand information interaction in postal address fields to overcome uncertainty in fields
Compare the efficiency of assignment strategies
Rank processing priority for determining a component value
Select most effective component to help recover an ambiguous component
13
Address Fields in a US Postal Address
Example address:
Sargur N. Srihari
520 Lee Entrance STE 202
Amherst NY 14228-2583
Address fields:
$f_1$ city name: Amherst
$f_2$ state abbr.: NY
$f_3$ 5-digit ZIP Code: 14228
$f_4$ 4-digit ZIP+4 add-on: 2583
$f_5$ primary number: 520
$f_6$ street name: Lee Entrance
$f_7$ secondary designator abbr.: STE
$f_8$ secondary number: 202
Delivery point: 142282583
14
Probability Distribution of Street Name Lexicon Size $|f_6|$
[Plot 1: size of street name lexicon versus log (number of ZIP Codes)]
$|f_3|$ = 41,840
No. of ZIP Codes with $|f_6| = 1$: 6,264 (14.97%), plotted at (3.80, 1)
Mean $|f_6|$ = 95.04
Max $|f_6|$ = 1,513
[Plot 2: size of street name lexicon versus log (number of (ZIP, primary) pairs)]
$|(f_3, f_5)|$ = 49,347,888
No. of (ZIP, primary) pairs with $|f_6| = 1$: 34,102,092 (69.11%), plotted at (7.53, 1)
Mean $|f_6|$ = 2.21
Max $|f_6|$ = 542
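The statistics on this slide (how many street names remain once a ZIP Code, or a ZIP Code plus primary number, is known) could be computed along the following lines. This is a sketch under stated assumptions: the records and the field names `zip`, `primary` and `street` are hypothetical, and the real directory has about 139 million rows rather than four.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; field names are illustrative only.
RECORDS = [
    {"zip": "14221", "primary": "276", "street": "MAIN ST"},
    {"zip": "14221", "primary": "276", "street": "MAPLE RD"},
    {"zip": "14221", "primary": "300", "street": "MAIN ST"},
    {"zip": "14228", "primary": "520", "street": "LEE ENTRANCE"},
]

def lexicon_sizes(records, key_fields):
    """Number of distinct street names for each value of the key fields."""
    streets = defaultdict(set)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        streets[key].add(rec["street"])
    return {key: len(names) for key, names in streets.items()}

def summarize(sizes):
    """Key count, mean and max lexicon size, and share of keys with a single street."""
    values = list(sizes.values())
    singles = sum(1 for v in values if v == 1)
    return {
        "keys": len(values),
        "mean": mean(values),
        "max": max(values),
        "share_size_1": singles / len(values),
    }

by_zip = summarize(lexicon_sizes(RECORDS, ["zip"]))
by_zip_primary = summarize(lexicon_sizes(RECORDS, ["zip", "primary"]))
print(by_zip)
print(by_zip_primary)
```

Adding the primary number to the key shrinks the lexicon, which mirrors the drop in mean $|f_6|$ from 95.04 to 2.21 reported on the slide.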
15
Number of Address Records for Different Countries
16
Definitions
A component c is an address field $f_i$, a portion of $f_i$ (e.g., a digit), or a combination of components.
1. Entropy: information provided by component x (assuming uniform distribution)
$$H(x) = \log_2 |x| \ \text{bits}$$
2. Conditional entropy: uncertainty of component y when component x is known
$$H_x(y) = -\sum_i \sum_j p_{ij} \log_2 \frac{p_{ij}}{\sum_k p_{ik}}$$
where $x_i$ is a value of component x, $y_j$ is a value of component y, and $p_{ij} = p(x_i, y_j)$ is their joint probability.
3. Redundancy of component x to y
$$R_x(y) = \frac{H(x) + H(y) - H(x, y)}{H(y)}, \qquad 0 \le R_x(y) \le 1$$
A higher value of $R_x(y)$ indicates that more of the information in y is shared by x.
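The three measures can be estimated from a table of address records as in the sketch below. It is illustrative only: the four records are hypothetical, and probabilities are taken as empirical frequencies over the records, which is an assumption rather than the uniform-value-set form $\log_2 |x|$ given for $H(x)$ on the slide.

```python
from collections import Counter
from math import log2

# Hypothetical toy records; the field names are illustrative only.
RECORDS = [
    {"city": "Amherst", "state": "NY", "zip": "14228"},
    {"city": "Amherst", "state": "NY", "zip": "14226"},
    {"city": "Buffalo", "state": "NY", "zip": "14260"},
    {"city": "Austin", "state": "TX", "zip": "73301"},
]

def entropy(records, fields):
    """Empirical entropy in bits of the joint value of `fields`."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(records, x, y):
    """H_x(y) = H(x, y) - H(x): uncertainty left in y once x is known."""
    return entropy(records, x + y) - entropy(records, x)

def redundancy(records, x, y):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y)."""
    h_y = entropy(records, y)
    return (entropy(records, x) + h_y - entropy(records, x + y)) / h_y

print(conditional_entropy(RECORDS, ["city"], ["zip"]))
print(conditional_entropy(RECORDS, ["state"], ["zip"]))
print(redundancy(RECORDS, ["city"], ["zip"]))
```

In this toy table the city name leaves less uncertainty about the ZIP Code than the state does, the same qualitative pattern the later slides report for the national file.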
17
Example of Information Measure
Value sets:
field A: (a, b, c, d)
field B: $B_1$ (0, 1, 9), $B_2$ (0, 1)
field C: (e, f)
[Table: address records]
Information measure: $p_{a10} = 1/5$, $p_{ae} = 2/5$, etc.
18
Measure of Information from National City State File, $D_1$ (July 1997)
Value sets ($D_1$ = 79,343):
Field $f_1$ City name: 39,795
Field $f_2$ State abbr.: 62
Field $f_3$ ZIP Code: 42,880 (digits $f_{31}$, $f_{32}$, $f_{33}$, $f_{34}$, $f_{35}$)
Measure:
–$H(x)$; x: any combination of $f_1$, $f_2$, and $f_{3i}$
–$H_x(f_3)$; x: any combination of $f_1$, $f_2$, and $f_{3i}$
19
Measure of Information from Delivery Point Files, $D_2$ (July 1997)
Value sets ($D_2$ = 139,080,291):
$f_4$ (ZIP+4 add-on): 9,999
$f_5$ (Primary No.): 1,155,740
$f_6$ (Street name): 1,220,880
$f_7$ (Secondary Abbr.): 24
$f_8$ (Sec. No.): 123,829
$f_9$ (Building/firm): 946,199
Measure:
–$H(x)$; x: any combination of $f_3$, $f_4$, $f_5$, $f_6$, $f_7$, $f_8$, and $f_9$
–$H_x(f_4)$; x: $f_3$ with any combination of $f_3$ ~ $f_9$
20
Measure of Information from D
Uncertainty in component
Uncertainty in ZIP Code when City, State or a digit is known
To determine $f_3$ (5-digit ZIP) from $f_1$, $f_2$ and $f_{3i}$:
–City name reduces uncertainty the most
21
Propagation of Uncertainty for Assignment Strategies
22
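Ranking Processing Priority for Confirming ZIP Code
$f_1$: City name; $f_2$: State abbreviation; $f_3$: ZIP Code
$H(f_3)$ = 15.39
[Chart: uncertainty of $f_3$ after each additional component is known. Candidate labels and values are listed in the order they appear in the chart.]
Knowing 1 component. Candidates: 1st, 2nd, 3rd, 4th, 5th, state, city. Values: 12.08, 12.07, 12.09, 12.12, 12.07, 9.98, 2.01. Selected: $H_{f_1}(f_3)$
Knowing 2 components. Candidates: 1st, 2nd, 3rd, 4th, 5th, state. Values: 1.02, 1.22, 1.20, 1.17, 0.89, 0.63. Selected: $H_{f_1 f_{35}}(f_3)$
Knowing 3 components. Candidates: 1st, 2nd, 3rd, 4th, state. Values: 0.37, 0.36, 0.33, 0.10, 0.33. Selected: $H_{f_1 f_{35} f_{34}}(f_3)$
Knowing 4 components. Candidates: 1st, 2nd, 3rd, state. Values: 0.03, 0.01, 0.02. Selected: $H_{f_1 f_{35} f_{34} f_{33}}(f_3)$
Knowing 5 components. Candidates: state, 1st, 2nd. Values: 0.002, 0.001, 0.000. Selected: $H_{f_1 f_{35} f_{34} f_{33} f_2}(f_3)$
Processing flow: city, 5th, 4th, 3rd, state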
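A ranking of this kind can be produced greedily: at each step, add the component that leaves the least uncertainty in the ZIP Code given everything already known. The sketch below assumes that reading of the slide; the records, the reduced candidate set (city, state, 5th digit) and the use of empirical frequencies are all hypothetical choices for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical toy records: (city, state, ZIP).
RAW = [
    ("Amherst", "NY", "14228"),
    ("Amherst", "NY", "14226"),
    ("Buffalo", "NY", "14206"),
    ("Buffalo", "NY", "14208"),
    ("Austin", "TX", "73301"),
    ("Albany", "NY", "12208"),
]
# "d5" is the 5th ZIP digit, treated as a component of its own.
RECORDS = [{"city": c, "state": s, "zip": z, "d5": z[4]} for c, s, z in RAW]

def entropy(rows):
    """Empirical entropy in bits of a list of hashable values."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cond_entropy(records, known, target):
    """H_known(target) = H(known, target) - H(known)."""
    joint = [tuple(r[k] for k in known) + (r[target],) for r in records]
    given = [tuple(r[k] for k in known) for r in records]
    return entropy(joint) - entropy(given)

def rank_components(records, candidates, target):
    """Greedy flow: repeatedly pick the component that minimizes H(target | known)."""
    known, flow = [], []
    remaining = list(candidates)
    while remaining:
        best = min(remaining, key=lambda c: cond_entropy(records, known + [c], target))
        flow.append((best, cond_entropy(records, known + [best], target)))
        known.append(best)
        remaining.remove(best)
    return flow

FLOW = rank_components(RECORDS, ["city", "state", "d5"], "zip")
for name, h in FLOW:
    print(f"{name}: {h:.3f} bits left")
```

On this toy table the flow comes out as city, then 5th digit, then state, with the state adding nothing once the other two are known.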
23
Modeling Processing Cost
For component y:
Location rate = $l(y)$, $0 \le l(y) \le 1$
Recognition rate = $r(y)$, $0 \le r(y) \le 1$
Processing speed = $s(y)$, in msec
Existence rate = $e(y)$, $0 \le e(y) \le 1$
Patron rate = $p(y)$, $0 \le p(y) \le 1$
Lexicon size of y, given x:
$$|y_x| = 2^{H(x, y) - H(x)}$$
Cost of processing component y given component x:
$$\mathrm{Cost}_x(y) = H_x(y) \cdot \frac{(1 + \log |y_x|) \cdot s(y)}{l(y) \cdot r(y) \cdot e(y) \cdot p(y)}$$
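The cost model can be written down directly as a function. The sketch below takes the logarithm as base 2, which is an assumption (the slide writes only "log"), and the function and parameter names are illustrative rather than taken from the presentation.

```python
from math import log2

def lexicon_size(h_xy, h_x):
    """|y_x| = 2 ** (H(x, y) - H(x)): effective number of y values once x is known."""
    return 2 ** (h_xy - h_x)

def processing_cost(h_x_y, lex_size, speed_ms, location, recognition, existence, patron):
    """Cost_x(y) = H_x(y) * (1 + log|y_x|) * s(y) / (l(y) * r(y) * e(y) * p(y)).

    The log is taken base 2 here (assumption). All four rates must lie in (0, 1].
    """
    rates = (location, recognition, existence, patron)
    if not all(0 < rate <= 1 for rate in rates):
        raise ValueError("rates must be in (0, 1]")
    success = location * recognition * existence * patron
    return h_x_y * (1 + log2(lex_size)) * speed_ms / success

# Hypothetical numbers: 2 bits of remaining uncertainty, a 4-entry lexicon,
# 10 ms per recognition attempt, and a component that is found and read 90% of the time.
print(processing_cost(2.0, lexicon_size(5.0, 3.0), 10.0, 0.9, 0.9, 1.0, 1.0))
```

A component that is cheap to read but rarely present or rarely recognized ends up with a high cost, because the product of the four rates sits in the denominator.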
24
Example Cost Table
25
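Ranking Processing Priority for Confirming ZIP Code Based on Cost
[Chart: cost of processing each remaining component at each step. Candidate labels and values are listed in the order they appear in the chart.]
Process 1st component. Candidates: 1st, 2nd, 3rd, 4th, 5th, state, city. Values: 318.57, 318.76, 319.63, 318.31, 1027.6, 373.39, 318.31
Process 2nd component. Candidates: 1st, 3rd, 4th, state, city, 5th. Values: 232.01, 231.69, 232.09, 230.87, 692.16, 188.21
Process 3rd component. Candidates: 1st, 3rd, 4th, 5th, state. Values: 26.57, 25.71, 15.82, 9.46, 44.88
Process 4th component. Candidates: 1st, 3rd, 4th, state. Values: 8.56, 7.62, 0.73, 14.08
Process 5th component. Candidates: state, 1st, 3rd. Values: 0.55, 0.896, 0.02
Process 6th component. Candidates: state, 1st. Values: 0, 0
Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
Processing flow based on $H_x(y)$: city, 5th, 4th, 3rd, state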
26
Recovery of 1st ZIP-Code Digit, $f_{31}$, from State Abbr. ($f_2$) and Other ZIP-Code Digits ($f_{32}$-$f_{35}$)
Usage: If recognition of a component (e.g., $f_{31}$) fails, that component is most likely to be recovered from the known component that has the largest redundancy with it (here $f_2$).
There are 62 state abbreviations. In 60 of them, the 1st ZIP digit is unique. For NY and TX, there are two valid 1st ZIP-Code digits.
Example: $f_2$ = NY, $f_{31}$ = ?, $f_{32}$ $f_{33}$ $f_{34}$ $f_{35}$ = 4 2 2 8
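The recovery idea on this slide amounts to a directory lookup: the state narrows the first digit to one or two values, and the remaining four digits can settle any leftover ambiguity. The sketch below is illustrative only; the directory is a hypothetical toy list of (state, ZIP) pairs and the function names are not from the presentation.

```python
from collections import defaultdict

# Hypothetical toy directory of (state abbreviation, 5-digit ZIP) pairs.
DIRECTORY = [
    ("NY", "14228"),
    ("NY", "14260"),
    ("NY", "00501"),
    ("CA", "94043"),
    ("CA", "90210"),
]

def first_digits_by_state(directory):
    """Valid 1st ZIP digits for each state abbreviation."""
    digits = defaultdict(set)
    for state, zip5 in directory:
        digits[state].add(zip5[0])
    return dict(digits)

def recover_first_digit(state, last_four, directory):
    """1st digits consistent with both the state and the other four ZIP digits."""
    return {
        zip5[0]
        for s, zip5 in directory
        if s == state and zip5[1:] == last_four
    }

print(first_digits_by_state(DIRECTORY))
print(recover_first_digit("NY", "4228", DIRECTORY))
```

In the toy data the state alone leaves two candidates for NY, and adding the digits 4 2 2 8 reduces them to one.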
27
Measure of Information from Mail Stream, S
Eighteen sets of mail pieces, each from a mail processing site
We measure
–Information provided by $H(f_2)$, $H(f_{3i})$
–Uncertainty of $f_3$ by $H_{f_2}(f_3)$, $H_{f_{3i}}(f_3)$
Each set is measured separately
The results are shown as the average over these sets
28
Comparison of ZIP-Code Uncertainty from D and S
29
Comparison of Results from D and S
ZIP-Code uncertainty from S < from D
Information from S is more effective for determining a ZIP Code
The most effective processing flow for using $f_{3i}$ and $f_2$ to determine $f_3$ is consistent between S and D:
$f_2 \to f_{35} \to f_{34} \to f_{33} \to f_{32} \to f_{31}$
30
UK Address Interpretation: Field Recognition & Database Query
Fields of interest:
–Locality
–Post town
–County
–Outward postcode
Target: Outward postcode
Control flow: Based on data mining
[Diagram: Locality; Post town/county; Outward postcode]
31
UK Address Interpretation: Last Line Parsing & Resolution
[Flowchart]
Address block image
Chaincode generation
Pre-scan digit recognition
Line segmentation
Word separation
Last line parsing (shape, syntax)
Field recognition & Database query
Field assignment
Decision: Outward postcode assigned? (Y/N)
Decision: Other choices? (Y/N)
Last line resolution
Outputs: Candidate outward postcodes; Assigned outward postcode
32
Discussion (Reliability of information)
When selecting an effective processing flow for address interpretation, the prediction is accurate only if the information is representative of the current processing situation
Using unreliable information to determine a candidate value may cause errors
Using unreliable information to choose a processing flow makes that flow less effective
33
Reliability of information
Measure of information from D
–Does not reflect the current processing situation
–Full coverage of all valid values
Measure of information from S
–Assumes that the site-specific preceding history represents the current processing situation
–Mail distribution could be season-specific
–Should consider the coverage of valid samples
–Should consider the information bias if valid samples come from the AI engine
34
Complexity of collecting mail information (S)
Information from mail streams should be collected automatically, and only high-confidence information should be collected
Address interpretation is not ideal
–Some error cases would be collected
–Address interpretation may always reject certain patterns of mail pieces, resulting in biased collected information
35
Conclusion
Information content of postal addresses can be measured
The efficiency of assignment strategies can be compared
Redundancy of two components can be measured
–An uncertain component has higher probability of recovery when another component with larger redundancy is known
Information measure can suggest most effective processing flow
Information Theory is an effective tool for Data Mining