- Darshana Pathak - Dr. Hye-Chung Kum.  Overview  Entity resolution process  About Framework  Configuration file  Class Details  How to …  Future.

1 - Darshana Pathak - Dr. Hye-Chung Kum

2  Overview  Entity resolution process  About Framework  Configuration file  Class Details  How to …  Future Work  Questions?

3 ‘sdlink’  Framework for developing an entity resolution tool, named ‘sdlink’  The idea is to provide a ‘Lab’  For whom? ◦ Research assistants, students  Why? ◦ To contribute towards research

4 Configure: Define link variable; Compare: Similarity metrics, find distance; Decide: Supervised/unsupervised decision model; Search: Reduce space (blocking); Refine: Relationships and deduplication; Analyze: Error propagation; Evaluate: Assess the linked data; Data Management

5

6  Searching Methods ◦ Blocking ◦ Sorting ◦ Hashing ◦ Sorted Neighborhood  Comparison Functions ◦ Hamming Distance ◦ Edit Distance ◦ Jaro’s Algorithm ◦ N-grams ◦ Soundex Code
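Two of the comparison functions listed on this slide can be sketched in a few lines of Java. This is a minimal illustration, not code from the sdlink framework; the class and method names are assumptions made for the example.

```java
// Hypothetical helper class; not part of the sdlink code base.
class EditDistanceSketch {

    /** Levenshtein edit distance: the minimum number of single-character
     *  insertions, deletions, and substitutions needed to turn a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // "" -> b[0..j)
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // a[0..i) -> ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Hamming distance: the number of positions at which two
     *  equal-length strings differ. */
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("Jonathan", "Jonathon"));
        System.out.println(hammingDistance("Smith", "Smyth"));
    }
}
```

Edit distance tolerates insertions and deletions, so it suits free-text fields such as names; Hamming distance only applies to fixed-length codes.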

7  Decision Models ◦ Probabilistic Model ◦ Induction Model ◦ Clustering Model ◦ Hybrid Model  Measurement Tools ◦ Reduction Ratio ◦ Pairs Completeness ◦ Accuracy ◦ Completeness
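The first two measurement tools have simple closed forms as they are commonly defined in the record linkage literature: reduction ratio is one minus the fraction of all possible record pairs that survive blocking, and pairs completeness is the fraction of true matches that survive blocking. The sketch below assumes two files of sizes `sizeA` and `sizeB`; the class and parameter names are illustrative only.

```java
// Hypothetical helper class; not part of the sdlink code base.
class BlockingMetricsSketch {

    /** Reduction ratio: 1 - (candidate pairs after blocking) / (all pairs).
     *  For two files, the number of all possible pairs is sizeA * sizeB. */
    static double reductionRatio(long candidatePairs, long sizeA, long sizeB) {
        return 1.0 - (double) candidatePairs / ((double) sizeA * sizeB);
    }

    /** Pairs completeness: the share of true matching pairs that are still
     *  present among the candidate pairs after blocking. */
    static double pairsCompleteness(long trueMatchesInCandidates, long totalTrueMatches) {
        return (double) trueMatchesInCandidates / totalTrueMatches;
    }

    public static void main(String[] args) {
        // Made-up numbers purely for illustration.
        System.out.println(reductionRatio(100, 100, 100));
        System.out.println(pairsCompleteness(45, 50));
    }
}
```

The two measures pull in opposite directions: tighter blocking raises the reduction ratio but risks lowering pairs completeness.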

8  Basic framework includes: ◦ Configuration file: configure.xml ◦ Main class: SDLink.java ◦ ConfigFile and ConfigReader ◦ CSVFile, CSVReader and CSVWriter ◦ BlockingModel.java ◦ DistanceCalculator.java Everything explained in further slides.

9  Name: configure.xml  Specifies: ◦ 2 CSV Files to be linked ◦ List of attributes ◦ Blocking method ◦ Weight for each attribute ◦ Clustering method

10  SDLink.java – Initializes all classes to ◦ Read configuration file ◦ Read 2 CSV Files ◦ Perform blocking ◦ Calculate distances ◦ Perform clustering ◦ Write output to output files

11  ConfigFile.java and ConfigReader.java ◦ Read configure.xml ◦ Know everything about the CSV files, attributes, blocking methods and clustering method. ◦ Store all this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.
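The reader/holder split described on this slide might look roughly like the sketch below. The slides do not show the layout of configure.xml, so every element and attribute name here (`sdlink`, `file`, `attribute`, `weight`, `blocking`, `field`, `clustering`, `method`) is an assumption, as is the class name; only the JDK's standard DOM parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for ConfigFile/ConfigReader; the XML schema is assumed.
class ConfigSketch {
    final List<String> csvFiles = new ArrayList<>();
    final Map<String, Double> attributeWeights = new LinkedHashMap<>();
    String blockingField;
    String clusteringMethod;

    // Assumed example document; the real configure.xml may look different.
    static final String SAMPLE =
          "<sdlink>"
        + "<file>a.csv</file><file>b.csv</file>"
        + "<attribute name=\"lastName\" weight=\"0.6\"/>"
        + "<attribute name=\"zip\" weight=\"0.4\"/>"
        + "<blocking field=\"zip\"/>"
        + "<clustering method=\"density\"/>"
        + "</sdlink>";

    /** Parses the XML text once and keeps the settings in one object
     *  that other classes can query. */
    static ConfigSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            ConfigSketch cfg = new ConfigSketch();
            NodeList files = root.getElementsByTagName("file");
            for (int i = 0; i < files.getLength(); i++) {
                cfg.csvFiles.add(files.item(i).getTextContent().trim());
            }
            NodeList attrs = root.getElementsByTagName("attribute");
            for (int i = 0; i < attrs.getLength(); i++) {
                Element e = (Element) attrs.item(i);
                cfg.attributeWeights.put(e.getAttribute("name"),
                                         Double.parseDouble(e.getAttribute("weight")));
            }
            cfg.blockingField =
                ((Element) root.getElementsByTagName("blocking").item(0)).getAttribute("field");
            cfg.clusteringMethod =
                ((Element) root.getElementsByTagName("clustering").item(0)).getAttribute("method");
            return cfg;
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not read configuration", ex);
        }
    }

    public static void main(String[] args) {
        ConfigSketch cfg = parse(SAMPLE);
        System.out.println(cfg.csvFiles + " " + cfg.attributeWeights
            + " " + cfg.blockingField + " " + cfg.clusteringMethod);
    }
}
```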

12  CSVFile.java, CSVReader.java & CSVWriter.java ◦ Read both CSV files ◦ Combine the two files into one ◦ Form a 2-D matrix of all attributes in the CSV files ◦ Store all the data from the CSV files in an instance of CSVFile.java
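Combining two CSV files into one 2-D matrix could be sketched as follows. The example assumes both files share an identical header row and contain no quoted or escaped commas; the class and method names are hypothetical, and the input is passed as lists of lines (for real files these could come from `java.nio.file.Files.readAllLines`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the CSVFile/CSVReader pair.
class CsvCombineSketch {

    /** Stacks the data rows of two CSV files under one shared header.
     *  Row 0 of the result is the header; each later row is one record. */
    static String[][] combine(List<String> fileA, List<String> fileB) {
        String header = fileA.get(0);
        if (!header.equals(fileB.get(0))) {
            throw new IllegalArgumentException("the two files have different headers");
        }
        List<String[]> rows = new ArrayList<>();
        rows.add(header.split(",", -1));
        for (List<String> file : Arrays.asList(fileA, fileB)) {
            for (String line : file.subList(1, file.size())) {
                rows.add(line.split(",", -1)); // limit -1 keeps trailing empty fields
            }
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String[][] matrix = combine(
            Arrays.asList("id,name,zip", "1,Ann,77843"),
            Arrays.asList("id,name,zip", "2,Anne,77843", "3,Bob,"));
        for (String[] row : matrix) System.out.println(Arrays.toString(row));
    }
}
```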

13  BlockingModel.java ◦ Performs blocking on the 2-D matrix of data ◦ Knows from configure.xml how to partition rows ◦ Important step, as further clustering is done on each block. ◦ Necessary for handling large data.
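The simplest form of blocking partitions rows by exact equality on one column. The sketch below shows that idea and why it helps with large data: records are only compared within a block. It is an assumption that sdlink blocks this way; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for BlockingModel; exact-match blocking is assumed.
class BlockingSketch {

    /** Groups data rows by the value found in the blocking column. */
    static Map<String, List<String[]>> block(String[][] rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new TreeMap<>();
        for (String[] row : rows) {
            blocks.computeIfAbsent(row[keyColumn], k -> new ArrayList<>()).add(row);
        }
        return blocks;
    }

    /** Number of within-block record pairs that still need comparing. */
    static long candidatePairs(Map<String, List<String[]>> blocks) {
        long total = 0;
        for (List<String[]> b : blocks.values()) {
            long n = b.size();
            total += n * (n - 1) / 2;
        }
        return total;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1", "Ann", "77843"}, {"2", "Anne", "77843"}, {"3", "Bob", "77840"},
            {"4", "Rob", "77840"}, {"5", "Ana", "77843"}
        };
        Map<String, List<String[]>> blocks = block(rows, 2); // column 2 = zip
        System.out.println(blocks.keySet() + " pairs=" + candidatePairs(blocks));
    }
}
```

With five records there are ten possible pairs in total; blocking on the zip column in this toy data leaves four.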

14  DistanceCalculator.java ◦ Operates on each block (formed in the blocking step) separately. ◦ Calculates the distance between two attributes ◦ Compares distances and calculates densities iteratively ◦ Forms many tiny clusters as the process runs for multiple iterations ◦ The process runs until no more clusters can be formed.

15  Everything runs in a big LOOP…  There can be multiple blocking attributes.  The whole process of blocking and clustering runs for each blocking attribute.  The output of every iteration is an input to the next iteration.  Be careful: It should not be an infinitely long process!
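One way to keep such a loop from running forever is to stop when a pass changes nothing and, as a safety net, to cap the number of passes. The sketch below is generic: `T` stands for whatever state the framework passes between iterations, and each `UnaryOperator` stands for one blocking-plus-clustering pass on one blocking attribute. All names are illustrative, and the state type is assumed to implement `equals` sensibly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical driver loop; not taken from SDLink.java.
class IterationGuardSketch {

    /** Repeats one pass until the state stops changing or the cap is hit. */
    static <T> T runUntilStable(T input, UnaryOperator<T> pass, int maxIterations) {
        T current = input;
        for (int i = 0; i < maxIterations; i++) {
            T next = pass.apply(current);
            if (next.equals(current)) return next; // nothing changed: done
            current = next;
        }
        return current; // cap reached: stop rather than loop forever
    }

    /** Runs the passes in order, one per blocking attribute, feeding the
     *  output of each into the next. */
    static <T> T runAll(T input, List<UnaryOperator<T>> passes, int maxIterations) {
        T current = input;
        for (UnaryOperator<T> pass : passes) {
            current = runUntilStable(current, pass, maxIterations);
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy state: a cluster count that each pass halves, never below 1.
        UnaryOperator<Integer> halve = n -> Math.max(1, n / 2);
        System.out.println(runUntilStable(16, halve, 100));
        System.out.println(runAll(16, Arrays.asList(halve, halve), 2));
    }
}
```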

16  Using this basic framework, you can implement your own ideas  E.g. A new clustering algorithm – ◦ Write the code and just plug it into distance calculator class ◦ Make sure not to disturb existing functionality ◦ Be purely object oriented ◦ Check the new algorithm’s output
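A common object-oriented way to make clustering pluggable without disturbing existing code is to hide each algorithm behind an interface. The slides do not say whether sdlink exposes such an interface, so `ClusteringAlgorithm` and `ThresholdClustering` below are hypothetical names; the example algorithm is plain single-link clustering with a distance threshold, chosen only because it is short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical extension point; not an interface defined by sdlink.
interface ClusteringAlgorithm {
    /** Takes a symmetric distance matrix for one block and returns
     *  clusters as lists of record indices. */
    List<List<Integer>> cluster(double[][] distances);
}

// Example plug-in: records closer than the threshold end up together
// (connected components of the "close enough" graph).
class ThresholdClustering implements ClusteringAlgorithm {
    private final double threshold;

    ThresholdClustering(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public List<List<Integer>> cluster(double[][] d) {
        int n = d.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (d[i][j] <= threshold) parent[find(parent, i)] = find(parent, j);
            }
        }
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(groups.values());
    }

    // Union-find lookup with path halving.
    private static int find(int[] p, int x) {
        while (p[x] != x) {
            p[x] = p[p[x]];
            x = p[x];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] d = {{0.0, 0.1, 0.9}, {0.1, 0.0, 0.8}, {0.9, 0.8, 0.0}};
        ClusteringAlgorithm algorithm = new ThresholdClustering(0.2);
        System.out.println(algorithm.cluster(d));
    }
}
```

Code that calls `cluster` only sees the interface, so a new algorithm can be added as a new class and checked against the existing one on the same distance matrix.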

17  This code is available on Macbeth (but with no version control so far…)  We will have a version control system such as SVN, where multiple developers can check code out and in…  To avoid risk, we can add separate methods and classes without touching existing code.

18  Version Control System  Generate proper output files  Implement and test various clustering algorithms  Develop graphical user interface  And much more…

19  TAILOR: A Record Linkage Toolbox (2002). Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid.  A GLASS BOX APPROACH FOR LINKING ADMINISTRATIVE RECORDS. PI: Gale Boyd, Co-PI: Wayne Gray and Hye-Chung Kum

20 Questions ???

