VERTICAL DATA INTEGRATION FOR CLINICAL GENOMICS. PhD Thesis, cycle XXIII. Andrea Calabria, CNR - Institute for Biomedical Technologies, Università degli Studi Milano Bicocca.

Presentation transcript:

VERTICAL DATA INTEGRATION FOR CLINICAL GENOMICS
PhD Thesis, cycle XXIII
Andrea Calabria
CNR - Institute for Biomedical Technologies
Università degli Studi Milano Bicocca – DISCO
Saturday, January 30, 2016

Project Context and Domain
• Genetic Studies
  - Genome-Wide Association Studies
  - Family-based and population-based
• Domain
  - Complex diseases, with a focus on brain dysfunctions and neuropathologies: Alzheimer's disease (medium stage), schizophrenia
• Data Types
  - Personal data
  - Phenotypes: clinical data, functional magnetic resonance
  - Genotypes: using SNPs (Single Nucleotide Polymorphisms)
A.Calabria PhD Thesis XXIII cycle

Motivation and Objective
• Data mining and genetic studies (linkage analysis, CNV, etc.) on brain diseases need data integration and high-performance infrastructures in a distributed environment
  - Data integration must be both vertical and horizontal
  - Security and privacy policies are required for experimental data
  - The Grid environment combines security and privacy handling with distributed computing
• Project's objective
  - To design Vertical Integration of experimental data in a Grid environment, for genetic studies and data mining analyses

Why Grid Environment

Application Layer: Data Mining and Genetic Studies
  Requirements: computational resources and parallelization; web-services oriented
  Properties: reliability, availability, robustness, scalability
  Objectives: genetic analyses and discovery of brain dependencies (domain related)
  Grid-enabled infrastructure: parallelization is problem specific; the Grid natively ensures reliability, robustness and scalability; availability depends on sites
  To do: adapt algorithms to distributed environments (e.g. linkage analysis); web-services oriented

Data Layer: Horizontal Integration
  Requirements: security, privacy, replication (space)
  Properties: flexibility, scalability, consistency and quality
  Objectives: integrate experimental data in a global view (filtering, quality, standard schema); OLTP
  Grid-enabled infrastructure: security and privacy are granted natively; consistency and quality are site related
  To do: global schema study, grid database adaptation (AMGA); replica, quality and distribution management

Data Layer: Vertical Integration
  Requirements: computational resources; web-services oriented
  Properties: scalability, updating
  Objectives: integrate gene knowledge data (DW)
  Grid-enabled infrastructure: not necessary; ensures scalability; updating to be configured
  To do: quality control, gene data fusion (conflicts study)

Application Layer – Genetics Studies
• Domain-related algorithms (population genetics studies)
  - Linkage analysis: the problem is computationally intensive (NP-hard); limits are related to the number of markers (<40)
  - Our problem: a chip of 1M SNPs (markers), so linkage analysis must be computed for a population with 1M SNPs
  - Solution: a heuristic for distributing linkage analysis
  - Preliminary results on a cluster: 70% average time improvement with respect to a single CPU
• Work in progress
  - Porting the algorithm to the grid and comparative performance tests
  - Specific monitoring and job control system
• Next steps
  - Release linkage analysis on the grid as web services
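The slides do not describe the distribution heuristic itself. As a minimal sketch of the general idea only, the Python below assumes the marker list can be cut into consecutive windows of fewer than 40 markers, each analysed as an independent job. The names `split_markers`, `run_distributed` and `placeholder_analysis` are hypothetical, the thread pool merely stands in for cluster or grid job submission, and no real linkage computation is performed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Assumption: "fewer than 40 markers" is read as at most 39 markers per job.
MAX_MARKERS = 39


def split_markers(markers: Sequence[str], window: int = MAX_MARKERS) -> List[List[str]]:
    """Cut an ordered marker list into consecutive windows of at most `window` markers."""
    if window < 1:
        raise ValueError("window must be positive")
    return [list(markers[i:i + window]) for i in range(0, len(markers), window)]


def placeholder_analysis(window: List[str]) -> Dict[str, object]:
    """Hypothetical stand-in for a linkage run on one window; it only summarises the window."""
    return {"first": window[0], "last": window[-1], "n_markers": len(window)}


def run_distributed(
    markers: Sequence[str],
    analyse: Callable[[List[str]], Dict[str, object]],
    workers: int = 4,
) -> List[Dict[str, object]]:
    """Run `analyse` on every window in parallel; results keep the window order."""
    windows = split_markers(markers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyse, windows))


if __name__ == "__main__":
    snps = [f"rs{i}" for i in range(100)]  # made-up marker identifiers
    for result in run_distributed(snps, placeholder_analysis):
        print(result)
```

Splitting by consecutive position is only one possible choice; a real heuristic would also have to decide how to handle dependencies between neighbouring windows, which this sketch ignores.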

Data Layer – Horizontal Integration
• Database of genotypes
  - Genotypic database design and creation
  - Analysis of the HL7 standard
• Work in progress
  - Applying HL7 to the global database schema
  - Porting the database to the EGEE grid with AMGA
• Next steps
  - Study of grid database problems related to distribution, federation, and adaptation of the hub-and-spoke paradigm, for extension to a biobank approach
  - Testing of data integration
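The slides do not give the genotypic schema. To make the idea of a genotype database concrete, here is a minimal relational sketch using Python's built-in `sqlite3`. Every table and column name is an assumption made for illustration; it is not derived from HL7 and is not the schema used in the thesis or in AMGA.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per subject, per SNP, and per (subject, SNP) genotype call.
SCHEMA = """
CREATE TABLE subject (
    subject_id TEXT PRIMARY KEY,
    family_id  TEXT
);
CREATE TABLE snp (
    snp_id   TEXT PRIMARY KEY,
    chrom    TEXT NOT NULL,
    position INTEGER NOT NULL
);
CREATE TABLE genotype (
    subject_id TEXT NOT NULL REFERENCES subject(subject_id),
    snp_id     TEXT NOT NULL REFERENCES snp(snp_id),
    allele1    TEXT NOT NULL,
    allele2    TEXT NOT NULL,
    PRIMARY KEY (subject_id, snp_id)
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the references
    conn.executescript(SCHEMA)
    return conn


def genotypes_for(conn: sqlite3.Connection, subject_id: str) -> List[Tuple[str, str, str]]:
    """Return (snp_id, allele1, allele2) for one subject, ordered by SNP identifier."""
    cursor = conn.execute(
        "SELECT snp_id, allele1, allele2 FROM genotype "
        "WHERE subject_id = ? ORDER BY snp_id",
        (subject_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = create_db()
    # Made-up example rows, not real data.
    conn.execute("INSERT INTO subject VALUES ('S1', 'F1')")
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?)",
                     [("M1", "chr1", 100), ("M2", "chr1", 200)])
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)",
                     [("S1", "M1", "A", "G"), ("S1", "M2", "C", "C")])
    conn.commit()
    print(genotypes_for(conn, "S1"))
```

Keeping subjects, markers and genotype calls in separate tables is one conventional way to keep personal data apart from genotypes, which is relevant to the privacy requirements mentioned earlier in the talk.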

Data Layer – Vertical Integration
• Objective
  - Integrate gene knowledge with a data fusion approach
• Gene knowledge quality control
  - Predicted genes can present conflicts among the main databases (NCBI, EnsEMBL, UCSC)
  - Conflicts could affect analyses
  - Need to evaluate the impact of conflicts across the genome
• Work in progress and next steps
  - Data extraction (API, web services, DB access, parsing)
  - Data integration
  - Data fusion: conflict analysis and evaluation
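The conflict analysis step can be illustrated with a small Python sketch. It assumes that each source contributes one record per gene with a chromosome, start, end and strand, and that a conflict is any field on which two sources disagree. The `GeneRecord` type, the `find_conflicts` function and the sample values are all hypothetical; the slides do not specify how conflicts are defined or measured.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

COMPARED_FIELDS = ("chrom", "start", "end", "strand")


@dataclass(frozen=True)
class GeneRecord:
    """One gene annotation as reported by a single source (illustrative fields only)."""
    source: str
    symbol: str
    chrom: str
    start: int
    end: int
    strand: str


Conflict = Tuple[str, str, str, List[str]]  # (symbol, source_a, source_b, differing fields)


def find_conflicts(records: List[GeneRecord]) -> List[Conflict]:
    """Compare every pair of sources for each gene and list the fields that differ."""
    by_symbol: Dict[str, List[GeneRecord]] = {}
    for rec in records:
        by_symbol.setdefault(rec.symbol, []).append(rec)

    conflicts: List[Conflict] = []
    for symbol, recs in sorted(by_symbol.items()):
        for a, b in combinations(recs, 2):
            differing = [f for f in COMPARED_FIELDS if getattr(a, f) != getattr(b, f)]
            if differing:
                conflicts.append((symbol, a.source, b.source, differing))
    return conflicts


if __name__ == "__main__":
    # Invented coordinates for placeholder genes; not taken from any database.
    sample = [
        GeneRecord("NCBI", "GENE_A", "chr1", 1000, 5000, "+"),
        GeneRecord("Ensembl", "GENE_A", "chr1", 1000, 5200, "+"),
        GeneRecord("UCSC", "GENE_A", "chr1", 1000, 5000, "+"),
        GeneRecord("NCBI", "GENE_B", "chr2", 300, 900, "-"),
        GeneRecord("Ensembl", "GENE_B", "chr2", 300, 900, "-"),
    ]
    for conflict in find_conflicts(sample):
        print(conflict)
```

Exact equality is the strictest possible test. Evaluating the impact of a conflict, as the slide proposes, would need a measure of how large the disagreement is (for example the number of bases affected), which this sketch leaves out.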

Project Plan
• Grid enabling of the linkage algorithm (May-September)
  - Grid porting
  - Application testing and performance measurements
• Gene-oriented data quality (September-November)
  - Data extraction
  - Gene knowledge integration
  - Conflict evaluation
• Database design for Grid porting (November-March)
  - HL7 schema design
  - AMGA database creation
  - Query and data management issues, data import and testing

References
• Bibliography
• Publications