Large-Scale Record Linkage Support for Cloud Computing Platforms
Yuan Xue, Bradley Malin, Elizabeth Durham
EECS Department and Biomedical Informatics Department, Vanderbilt University

Background

Record linkage is the process of comparing records from multiple sources to aggregate information about the same real-world entity. It has many applications across industry sectors and government agencies.

Healthcare: It is difficult to track a patient's records across healthcare providers because healthcare information systems are marred by gross fragmentation, which can hinder primary care and limit biomedical research. The negative effects of fragmentation can be mitigated through record linkage systems.

Counter-terrorism: Record linkage is applied to records from multiple data owners to detect aliases, or to combine information about an individual to learn about their actions or co-conspirators.

Cloud computing platforms can enable cost-efficient, high-performance, large-scale record linkage. Cloud platforms provide massive distributed computing resources, and record linkage tasks are usually performed on an infrequent basis; the ability to pay for computing resources on a short-term basis as needed, and to release them after use, can achieve great cost efficiency.

Research Objective

Build an end-to-end solution that enables record linkage as a core service on cloud computing platforms and supports a transition toward cloud-based data management for distributed services:
1) Flexible usage: for both novice users (service users) and expert users (service component developers using high-level programming primitives).
2) Cost and performance awareness: users can estimate the time, the expense, and the linkage quality of a run, and thus choose the appropriate linkage method and configuration.
The service would facilitate information exchange in many domains, most notably a national health information exchange network.

Record linkage is a multi-step process. [Figure: the linkage pipeline. During data preparation, record sets A and B are encoded and uploaded to the Cloud; the encoded sets then pass through blocking, field comparison, record pair comparison, and record pair classification, which outputs matched pairs and non-matched pairs.]

Challenge

Massive amounts of data are now being collected: in 2007, for instance, it was estimated that over 281 exabytes of new data were generated, and the quantity of data is growing at an exponential rate. Real-world data is also dirty, so sophisticated linkage techniques must be applied to cope with noise and semantic errors; expensive detailed comparisons of fields (or attributes) are required between pairs of records, which forms a performance bottleneck. As a result, record linkage over large-scale data sources is extremely time and resource intensive, and the challenge worsens as the quantity of data and the number of sources grow.

Current Research Efforts

Privacy-preserving data encoding protects sensitive fields through a set of well-designed encoding functions. In addition to protecting the confidentiality of the data, the encoded data must still be comparable for similarity so that records for the same entity can be linked. It is essential to develop a model that quantifies these encoding schemes in terms of their linkage accuracy, computational complexity, and security (an illustrative sketch follows the next paragraph).

Blocking (data partitioning) underpins the parallel execution of record linkage. It determines the groups of records that are most likely to match, retrieves those records, and creates partitions of the record set within which records can be compared and linked independently of other partitions (see the second sketch below).
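The poster does not specify the encoding functions themselves, so the following is only a minimal sketch of one widely studied scheme from the privacy-preserving record linkage literature: Bloom-filter encoding of character bigrams, compared with the Dice coefficient. The filter length, hash count, and example values are illustrative assumptions, not this project's parameters.

```python
import hashlib

def bigrams(s):
    """Split a string into its overlapping 2-grams: 'anna' -> {'an', 'nn', 'na'}."""
    s = s.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value, m=100, k=4):
    """Encode one field value as an m-bit Bloom filter using k hash functions.
    m and k are toy choices here; real deployments tune them for security/accuracy."""
    bits = [0] * m
    for gram in bigrams(value):
        for seed in range(k):
            digest = hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1
    return bits

def dice_similarity(a, b):
    """Dice coefficient of two bit vectors: 2*|A & B| / (|A| + |B|)."""
    overlap = sum(x & y for x, y in zip(a, b))
    return 2 * overlap / (sum(a) + sum(b))

# Encoded values can still be compared for similarity without exposing plaintext:
print(dice_similarity(bloom_encode("jonathan"), bloom_encode("jonathon")))   # high (~0.8)
print(dice_similarity(bloom_encode("jonathan"), bloom_encode("elizabeth")))  # low
```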
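Similarly, a minimal sketch of blocking, assuming a hypothetical first-letter-of-surname key; production systems typically use more robust keys (e.g., phonetic codes), and the choice of key is exactly the quality/time/resource trade-off the optimal blocking model below targets.

```python
from collections import defaultdict

def block(records, key_fn):
    """Partition records by a cheap blocking key; detailed pairwise comparison
    then runs only inside each partition, and partitions can be processed
    in parallel on separate cloud workers."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[key_fn(rec)].append(rec)
    return partitions

records = [{"surname": "Smith"}, {"surname": "Smyth"}, {"surname": "Jones"}]
by_initial = block(records, key_fn=lambda r: r["surname"][0].lower())
for key, part in by_initial.items():
    print(key, part)  # 's' -> Smith, Smyth ; 'j' -> Jones
```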
An optimal blocking model can be used to determine optimal data partitions for the parallel execution of record linkage, with linkage quality, execution time, and resource requirements as the optimization objectives.

Future Research Plan

1) A high-level parallel programming model, with run-time support, tailored to the semantics of privacy-preserving record linkage and exploiting the multiple levels of parallelism in record linkage.
2) Estimation and optimization techniques that enable user-aware, cost-optimal record linkage in this multi-dimensional space (a toy cost model is sketched below).
3) A security analysis framework and new cryptographic models and methods for privacy-preserving record linkage in the Cloud, analyzing the security properties of cloud-based record linkage under a wide variety of threats.
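To make item 2 concrete, here is a toy back-of-the-envelope cost model; all parameter names and numbers are hypothetical illustrations, not measurements or methods from this project. Varying the blocking reduction ratio or worker count in such a model is what would let a service user pick a configuration before paying for cloud time.

```python
def estimate_run(n_a, n_b, reduction_ratio, secs_per_comparison,
                 workers, price_per_worker_hour):
    """Rough wall-clock time and cloud cost of one linkage run."""
    candidate_pairs = n_a * n_b * (1 - reduction_ratio)   # pairs surviving blocking
    cpu_seconds = candidate_pairs * secs_per_comparison
    hours = cpu_seconds / workers / 3600                  # assumes perfect scaling
    return hours, hours * workers * price_per_worker_hour

# 1M x 1M records; blocking prunes 99% of pairs; 100 microseconds per comparison:
hours, dollars = estimate_run(1e6, 1e6, reduction_ratio=0.99,
                              secs_per_comparison=1e-4, workers=100,
                              price_per_worker_hour=0.10)
print(f"~{hours:.1f} hours on 100 workers, ~${dollars:.0f}")
```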