1
Approaches To Address Matching
Nigel Legg, Knowtext
Bristol Data Scientists, 16 April 2019
2
What is an Address?
Defines a specific geographic location.
Human-understandable.
Sequence of information.
e.g. 1 Canon’s Road, Bristol, BS1 5TX
3
Why are Addresses important?
4
Components of an address
Sub-division name or number (office suite, flat, etc.) *
House name or number
Street name
Super-street name *
Locality name * (x2?)
Post Town
County *
Postcode
* Optional components
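As a rough illustration of the component list above, the sketch below models an address as a Python dataclass in which the optional components default to empty. The class name `Address`, its field names and the `present_fields` helper are all hypothetical, chosen for this example only; the sample values are a hand-split version of an address that could well be split differently.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass(frozen=True)
class Address:
    # Components treated here as always present
    house: str                               # house name or number
    street: str
    post_town: str
    postcode: str
    # Optional components (the ones marked * in the list)
    sub_division: Optional[str] = None       # office suite, flat, etc.
    super_street: Optional[str] = None
    localities: Tuple[str, ...] = ()         # zero, one or possibly two names
    county: Optional[str] = None

    def present_fields(self) -> dict:
        """Return only the components that are actually filled in."""
        return {name: value for name, value in asdict(self).items() if value}


# Illustrative, hand-split example (assumed split, not an official one)
full = Address(
    house="3",
    street="Leonora Tyson Mews",
    post_town="London",
    postcode="SW14 4GB",
    sub_division="Flat 3",
    super_street="Croxted Road",
    localities=("Dulwich",),
)
print(sorted(full.present_fields()))
```

Keeping the optional components explicit makes it easy to see which fields two records have in common before any comparison is attempted.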
5
Address Variability
All the following addresses refer to the same property:
Flat 3, 3, Leonora Tyson Mews, Croxted Road, Dulwich, London SW14 4GB
Flat 3, 3, Leonora Tyson Mews, Dulwich, London SW14 4GB
3, 3, Leonora Tyson Mews, London, SW14 4GB
6
Is this a problem?
An individual could:
Claim benefit from two places
Avoid bills
Be charged council tax twice
Government records could:
Have problems with historical tracking
Be disjointed (Grenfell!)
Complicate management of census forms
7
Solution
Use an address matching algorithm to match addresses entered into multiple systems to a common format.
Use the Unique Property Reference Number (UPRN).
In some cases, the possibility of creating an Address Index is being investigated.
8
Algorithms
Simple – Levenshtein
Machine Learning – ONS / Conditional Random Field
Complex SQL – MHCLG / Hufton
9
Levenshtein – finding the closest match
Matching what?
A simple whole-string match will fail on the Leonora Tyson Mews example (optional components).
Need to split the address and compare like fields.
Can be useful for spell checking parts of an address.
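The point about comparing like fields can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the projects mentioned: `levenshtein` is the standard dynamic-programming edit distance, while `field_distance`, the dictionary keys and the two hand-split records (including the deliberate misspelling "Leanora") are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def field_distance(x: dict, y: dict) -> int:
    """Sum edit distances over the fields present in both records."""
    shared = x.keys() & y.keys()
    return sum(levenshtein(x[k].lower(), y[k].lower()) for k in shared)


# Hypothetical, hand-split records; the second omits the optional parts
# and misspells the street name.
full = {
    "sub_division": "Flat 3", "house": "3", "street": "Leonora Tyson Mews",
    "super_street": "Croxted Road", "locality": "Dulwich",
    "post_town": "London", "postcode": "SW14 4GB",
}
short = {
    "house": "3", "street": "Leanora Tyson Mews",
    "post_town": "London", "postcode": "SW14 4GB",
}

whole = levenshtein(", ".join(full.values()), ", ".join(short.values()))
by_field = field_distance(full, short)
print(whole, by_field)
```

The whole-string distance is dominated by the missing optional components, whereas the field-wise distance only reflects the one-letter spelling difference in the street name.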
10
Machine Learning – Linear Chain Conditional Random Fields
Linear Chain CRF used by ONS in their Address Index Project.
A sequential classifier; labels are given to text elements taking into consideration their neighbours.
Related to Hidden Markov Models.
Used to label the components of the address, and thus compare like for like.
Creating a balanced training set is difficult and time-consuming.
Re-training?
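To make "taking into consideration their neighbours" more concrete, the sketch below builds per-token feature dictionaries that include the previous and next token. Many CRF toolkits accept input of roughly this shape, but the feature names, the tokenisation and the functions `token_features` and `address_to_features` are assumptions for illustration; this is not the ONS feature set, and no model is trained here.

```python
def token_features(tokens, i):
    """Describe token i together with its immediate neighbours."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "has_digit": any(c.isdigit() for c in token),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighbouring tokens give the sequence model its context.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


def address_to_features(address):
    """Split an address on commas and whitespace, then featurise each token."""
    tokens = address.replace(",", " ").split()
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


tokens, feats = address_to_features("1 Canon's Road, Bristol, BS1 5TX")
for tok, f in zip(tokens, feats):
    print(tok, f)
```

A trained linear-chain CRF would take one such list per address and return one label per token (house number, street name, postcode and so on), which is what allows like-for-like comparison afterwards.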
11
Comparison method – CLG / Hufton
Dr Hufton at CLG – 7 step method.
Filtering and string comparison in both directions – target and example.
SQL implementation – up to 99% accuracy, with 1% false positives.
Unoptimised Python implementation – 62% accuracy on first run.
SQL in use on small data sets in CLG.
12
The Future?
Will the ONS method prove more reliable in the long term?
Will the CLG method scale?
How will a Python implementation of the CLG method compare to a Python implementation of the ONS method at scale?
Use algorithm to build an Address Index….
More work required across Government… and elsewhere.