- Darshana Pathak - Dr. Hye-Chung Kum.  Overview  Entity resolution process  About Framework  Configuration file  Class Details  How to …  Future.

Presentation transcript:

- Darshana Pathak - Dr. Hye-Chung Kum

• Overview
• Entity resolution process
• About Framework
• Configuration file
• Class Details
• How to …
• Future Work
• Questions?

‘sdlink’
• Framework for developing Entity Resolution Tool, named ‘sdlink’
• Idea is to provide a ‘Lab’
• For whom?
  ◦ Research assistants, students
• Why?
  ◦ To contribute towards research

Configure: Define link Variable
Compare: Similarity Metrics, Find Distance
Decide: Supervised/Unsupervised Decision Model
Search: Reduce space (Blocking)
Refine: Relationships and Deduplication
Analyze: Error Propagation
Evaluate: Assess the linked data
Data Management

• Searching Methods
  ◦ Blocking
  ◦ Sorting
  ◦ Hashing
  ◦ Sorted Neighborhood
• Comparison Functions
  ◦ Hamming Distance
  ◦ Edit Distance
  ◦ Jaro’s Algorithm
  ◦ N-grams
  ◦ Soundex Code
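
Two of the comparison functions listed above, Hamming distance and edit distance, are short enough to sketch in Java. This is an illustration only, not code from sdlink: the class name `StringDistances` and its method names are hypothetical, and the edit distance shown is the standard Levenshtein variant (insert, delete, substitute, each with cost 1).

```java
/** Illustrative string comparison functions; names are hypothetical, not part of sdlink. */
final class StringDistances {
    private StringDistances() {}

    /** Hamming distance: number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    /** Levenshtein edit distance, computed row by row to keep memory at O(|b|). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn a[0..i) into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `StringDistances.editDistance("kitten", "sitting")` evaluates to 3, and `StringDistances.hamming("karolin", "kathrin")` also evaluates to 3.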

• Decision Models
  ◦ Probabilistic Model
  ◦ Induction Model
  ◦ Clustering Model
  ◦ Hybrid Model
• Measurement Tools
  ◦ Reduction Ratio
  ◦ Pairs Completeness
  ◦ Accuracy
  ◦ Completeness
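
The first two measurement tools can be made concrete with a small sketch. It assumes the definitions commonly used in the record linkage literature: reduction ratio is the fraction of all possible record pairs that blocking removes, and pairs completeness is the fraction of true matches that survive blocking. The class and parameter names are hypothetical.

```java
/** Illustrative blocking-quality measures; names are hypothetical, not part of sdlink. */
final class BlockingMetrics {
    private BlockingMetrics() {}

    /**
     * Reduction ratio: 1 - (candidate pairs kept after blocking) / (all possible pairs).
     * For two files of sizes n and m, totalPairs would be n * m.
     */
    static double reductionRatio(long candidatePairs, long totalPairs) {
        if (totalPairs <= 0) {
            throw new IllegalArgumentException("totalPairs must be positive");
        }
        return 1.0 - (double) candidatePairs / totalPairs;
    }

    /** Pairs completeness: true matches among the candidates / all true matches. */
    static double pairsCompleteness(long trueMatchesInCandidates, long totalTrueMatches) {
        if (totalTrueMatches <= 0) {
            throw new IllegalArgumentException("totalTrueMatches must be positive");
        }
        return (double) trueMatchesInCandidates / totalTrueMatches;
    }
}
```

The two measures pull against each other: tighter blocking raises the reduction ratio but tends to lower pairs completeness, so both are usually reported together.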

• Basic framework includes:
  ◦ Configuration file: configure.xml
  ◦ Main class: SDLink.java
  ◦ ConfigFile and ConfigReader
  ◦ CSVFile, CSVReader and CSVWriter
  ◦ BlockingModel.java
  ◦ DistanceCalculator.java
Each of these is explained in the following slides.

• Name: configure.xml
• Specifies:
  ◦ 2 CSV files to be linked
  ◦ List of attributes
  ◦ Blocking method
  ◦ Weight for each attribute
  ◦ Clustering method

• SDLink.java – Initializes all classes to
  ◦ Read configuration file
  ◦ Read 2 CSV files
  ◦ Perform blocking
  ◦ Calculate distances
  ◦ Perform clustering
  ◦ Write output to output files

• ConfigFile.java and ConfigReader.java
  ◦ Read configure.xml
  ◦ Know everything about the CSV files, attributes, blocking methods and clustering method.
  ◦ Store all this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.
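
A reader of this kind can be sketched with the XML parser that ships with the JDK. The slides do not show the actual layout of configure.xml, so the element and attribute names below (`file`, `attribute` with `name` and `weight`, `blocking` and `clustering` with `method`) are assumptions made for illustration, as is the class name `ConfigSketch`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical holder for the settings the slides say configure.xml specifies. */
final class ConfigSketch {
    final List<String> csvFiles = new ArrayList<>();
    final Map<String, Double> attributeWeights = new LinkedHashMap<>();
    String blockingMethod;
    String clusteringMethod;

    /** Parses an XML string using the assumed (not actual) element names. */
    static ConfigSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            ConfigSketch cfg = new ConfigSketch();

            // The CSV files to be linked.
            NodeList files = root.getElementsByTagName("file");
            for (int i = 0; i < files.getLength(); i++) {
                cfg.csvFiles.add(files.item(i).getTextContent().trim());
            }
            // Attributes and the weight given to each one.
            NodeList attrs = root.getElementsByTagName("attribute");
            for (int i = 0; i < attrs.getLength(); i++) {
                Element e = (Element) attrs.item(i);
                cfg.attributeWeights.put(e.getAttribute("name"),
                        Double.parseDouble(e.getAttribute("weight")));
            }
            cfg.blockingMethod = methodOf(root, "blocking");
            cfg.clusteringMethod = methodOf(root, "clustering");
            return cfg;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    /** Returns the "method" attribute of the first element with this tag, or null if absent. */
    private static String methodOf(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : ((Element) nodes.item(0)).getAttribute("method");
    }
}
```

Keeping the parsed values in one object mirrors the idea on the slide: other classes ask the configuration instance rather than re-reading the file.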

• CSVFile.java, CSVReader.java & CSVWriter.java
  ◦ Read both CSV files
  ◦ Combine two files into one
  ◦ Form a 2-D matrix of all attributes in CSV files
  ◦ Store all the data from CSV file into an instance of CSVFile.java
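
The step of combining two files into one 2-D matrix might look like the sketch below. It makes several assumptions that the slides do not state: both files share the same columns, the first line of each is a header, fields contain no quoted commas, and each row is tagged with its source file in column 0 (the tag is an addition for illustration). The class name is hypothetical, and the lines are passed in as lists so the example does not depend on files on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of merging two CSV inputs into one matrix of strings. */
final class CsvMatrixSketch {
    private CsvMatrixSketch() {}

    /** Returns one row per data line; column 0 holds "A" or "B" for the source file. */
    static String[][] combine(List<String> fileA, List<String> fileB) {
        List<String[]> rows = new ArrayList<>();
        addRows(rows, fileA, "A");
        addRows(rows, fileB, "B");
        return rows.toArray(new String[0][]);
    }

    private static void addRows(List<String[]> rows, List<String> lines, String sourceTag) {
        // Start at 1 to skip the header line.
        for (int i = 1; i < lines.size(); i++) {
            // Naive split: does not handle quoted fields containing commas.
            String[] fields = lines.get(i).split(",", -1);
            String[] row = new String[fields.length + 1];
            row[0] = sourceTag;
            for (int j = 0; j < fields.length; j++) {
                row[j + 1] = fields[j].trim();
            }
            rows.add(row);
        }
    }
}
```

In practice the lines could come from `java.nio.file.Files.readAllLines`, and a real CSV parser would be needed once quoted fields appear.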

• BlockingModel.java
  ◦ Performs blocking on the 2-D matrix of data
  ◦ Knows how to partition rows from configure.xml
  ◦ Important step as further clustering is done on each block.
  ◦ Necessary to handle large data.
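
As a rough illustration of blocking, the sketch below partitions the rows of a matrix by the value of one key column, so that later comparisons only happen inside a block. It shows plain key-based blocking only; the class name, the choice of a single key column, and the lower-casing of keys are assumptions, not details taken from BlockingModel.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of key-based blocking over a matrix of records. */
final class BlockingSketch {
    private BlockingSketch() {}

    /** Groups rows that share the same (trimmed, lower-cased) value in keyColumn. */
    static Map<String, List<String[]>> block(String[][] rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[keyColumn].trim().toLowerCase();
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return blocks;
    }
}
```

With n records split into blocks of roughly equal size b, the number of pairwise comparisons drops from about n²/2 to about n·b/2, which is why the step matters for large data.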

• DistanceCalculator.java
  ◦ Performs operations on each block (formed in blocking step) separately.
  ◦ Calculates distance between two attributes
  ◦ Compares distances and calculates densities iteratively
  ◦ Forms many tiny clusters as the process runs for multiple iterations
  ◦ Process runs until no clusters can be formed.
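
The slides do not give the density calculation, so the sketch below substitutes a simpler technique to show the overall shape of the loop: a weighted record distance (the sum of the weights of attributes that disagree) and single-link merging under a threshold, repeated until a full pass produces no merge. The class name, the threshold parameter, and the exact-match comparison are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: threshold-based merging inside one block (not the density method). */
final class BlockClusteringSketch {
    private BlockClusteringSketch() {}

    /** Weighted distance: sum of the weights of the attributes on which two records disagree. */
    static double distance(String[] a, String[] b, double[] weights) {
        double d = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (!a[i].equalsIgnoreCase(b[i])) {
                d += weights[i];
            }
        }
        return d;
    }

    /** Starts with one cluster per record and merges until no two clusters are close enough. */
    static List<List<String[]>> cluster(List<String[]> block, double[] weights, double threshold) {
        List<List<String[]>> clusters = new ArrayList<>();
        for (String[] record : block) {
            List<String[]> single = new ArrayList<>();
            single.add(record);
            clusters.add(single);
        }
        boolean merged = true;
        while (merged) { // each merge removes one cluster, so this terminates
            merged = false;
            for (int i = 0; i < clusters.size() && !merged; i++) {
                for (int j = i + 1; j < clusters.size() && !merged; j++) {
                    if (closeEnough(clusters.get(i), clusters.get(j), weights, threshold)) {
                        clusters.get(i).addAll(clusters.remove(j));
                        merged = true;
                    }
                }
            }
        }
        return clusters;
    }

    /** Single link: two clusters are close if any pair of their records is within the threshold. */
    private static boolean closeEnough(List<String[]> x, List<String[]> y,
                                       double[] weights, double threshold) {
        for (String[] a : x) {
            for (String[] b : y) {
                if (distance(a, b, weights) <= threshold) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Because every merge reduces the number of clusters by one, the loop is guaranteed to stop after at most one merge per record.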

• Everything runs in a big LOOP…
• There can be multiple blocking attributes.
• The whole process of blocking and clustering runs for each blocking attribute.
• The output of every iteration is an input to the next iteration.
• Be careful: It should not be an infinitely long process!

• Using this basic framework, you can implement your own ideas
• E.g. A new clustering algorithm –
  ◦ Write the code and just plug it into distance calculator class
  ◦ Make sure not to disturb existing functionality
  ◦ Be purely object oriented
  ◦ Check the new algorithm’s output
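
One way to plug in a new algorithm without disturbing existing functionality is to put it behind an interface and hand it to the class that processes each block. The sketch below shows that pattern only; `ClusteringAlgorithm`, `SingletonClustering` and `BlockProcessorSketch` are hypothetical names, and the slides do not say that DistanceCalculator.java is structured this way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical extension point: one method that turns a block into clusters. */
interface ClusteringAlgorithm {
    List<List<String[]>> cluster(List<String[]> block);
}

/** Trivial example algorithm: every record becomes its own cluster. */
final class SingletonClustering implements ClusteringAlgorithm {
    @Override
    public List<List<String[]>> cluster(List<String[]> block) {
        List<List<String[]>> clusters = new ArrayList<>();
        for (String[] record : block) {
            clusters.add(new ArrayList<>(Collections.singletonList(record)));
        }
        return clusters;
    }
}

/** Stand-in for the per-block processing class; the algorithm is passed in, not hard-coded. */
final class BlockProcessorSketch {
    private final ClusteringAlgorithm algorithm;

    BlockProcessorSketch(ClusteringAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    List<List<String[]>> process(List<String[]> block) {
        return algorithm.cluster(block);
    }
}
```

A new algorithm then becomes a new class implementing the interface, and its output can be checked by running it side by side with an existing one on the same block.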

• This code is available on Macbeth (but it is not under version control yet…)
• We will have a version control system such as SVN, where multiple developers can check out and check in code…
• To avoid risk, we can add separate methods and classes without touching existing code.

• Version Control System
• Generate proper output files
• Implement and test various clustering algorithms
• Develop graphical user interface
• And much more…

• TAILOR: A Record Linkage Toolbox (2002). Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid.
• A GLASS BOX APPROACH FOR LINKING ADMINISTRATIVE RECORDS. PI: Gale Boyd; Co-PI: Wayne Gray and Hye-Chung Kum.

Questions ???