Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London.

Presentation transcript:

Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London

1. What is Discovery Net
2. Distributed Data Mining for Compute Intensive Tasks
3. Distributed Data Mining for Sensor Grids
4. Knowledge Discovery from Naturally Distributed Data Sources
5. What Do Scientists Really Want?

1. What is Discovery Net

What is Discovery Net?
- Funding: One of the eight UK national e-Science Pilot Projects funded by EPSRC (£2.2M)
- Start: October 2001; End: March 2005
- Goal: Construct the world’s first infrastructure for global knowledge discovery services
- Key technologies:
  - Open Service Computing
  - High Throughput Devices and Real Time Data Mining
  - Real Time Data Integration & Information Structuring
  - Cross Domain Knowledge Discovery and Management
  - Discovery Workflow and Discovery Planning

Discovery Net Applications
- Life Sciences
  - High throughput genomics and proteomics
  - Distributed Databases and Applications
- Environmental Modelling
  - High throughput dispersed air sensing technology
  - Sensor Grids
- Real time geo-hazard modelling
  - Earthquake modelling through satellite imagery
  - High performance Distributed Computation

Discovery Net Architecture
- Goal: Plug & Play Data Sources, Analysis Components & Knowledge Discovery Processes
- D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
- D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services
- Computation & Data Resources: Distributed databases, compute servers and scientific devices
- Diagram labels: High Performance Communication Protocol (GridFTP, DSTP, ..); Grid Infrastructure (GSI); DPML; Web/Grid Services; OGSA

Discovery Net Data Mining Components
- Generic Data Mining
  - Classification, Clustering, Associations, ..
- Unstructured-Data Mining
  - Text Mining, Image Mining
- Domain-specific Mining
  - Bioinformatics, Cheminformatics, ..

2. Distribution of Compute Intensive Tasks
a. Distributed Data Mining for Geo-hazard Prediction

Grid-based Geo-hazard Data Mining
- Grid-based HPC computation: automatically co-register a stack of imagery layers at high precision and speed.
- Workflow to co-ordinate Grid computation: data warehousing & modelling; co-registration & geo-rectification; image feature extraction; clustering & classification
- Grid-based data access and integration

Normalised cross-correlation (NCC) template algorithm
- Operates on a remotely accessed MPI UNIX parallel computer over a fast network, through the DNet interface.
- Slow but highly accurate: 10 hours on 24 processors for one scene of Landsat-7 ETM+ Pan imagery data.
- The algorithm also runs on the Grid.
- Flowchart labels: Image “before” and Image “after”; Reading data set (once per image); Setting comparing window (once per image); Setting search window; Correlation coefficient; Significant correlation coefficient? (Y/N); Delta X
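The template-matching idea behind the NCC slide can be sketched in a few lines. This is a minimal single-process illustration in pure Python, not the project's MPI implementation: the function names (`ncc`, `best_offset`), the 0.8 significance threshold and the toy 5x5 images are all assumptions made for the example. A comparing window is cut from the "before" image, slid across a search window in the "after" image, and the offset with the highest correlation coefficient is kept if it passes the threshold.

```python
import math


def ncc(template, window):
    """Normalised cross-correlation coefficient of two equal-sized 2-D patches."""
    t = [v for row in template for v in row]
    w = [v for row in window for v in row]
    if len(t) != len(w):
        raise ValueError("patches must have the same size")
    mean_t = sum(t) / len(t)
    mean_w = sum(w) / len(w)
    num = sum((a - mean_t) * (b - mean_w) for a, b in zip(t, w))
    den = math.sqrt(sum((a - mean_t) ** 2 for a in t) *
                    sum((b - mean_w) ** 2 for b in w))
    # A constant patch has zero variance; treat it as uncorrelated.
    return num / den if den else 0.0


def best_offset(before, after, top, left, size, search, threshold=0.8):
    """Slide the comparing window from `before` over a search area in `after`.

    Returns (dx, dy, score) for the best match, or None when no offset
    reaches the (assumed) significance threshold.
    """
    template = [row[left:left + size] for row in before[top:top + size]]
    rows, cols = len(after), len(after[0])
    best = (0, 0, -1.0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > rows or x + size > cols:
                continue  # window would fall outside the image
            window = [row[x:x + size] for row in after[y:y + size]]
            score = ncc(template, window)
            if score > best[2]:
                best = (dx, dy, score)
    return best if best[2] >= threshold else None


# Toy data: a 2x2 feature that moves 2 pixels right and 1 pixel down.
before = [[0.0] * 5 for _ in range(5)]
after = [[0.0] * 5 for _ in range(5)]
patch = [[1.0, 2.0], [3.0, 4.0]]
for r in range(2):
    for c in range(2):
        before[1 + r][1 + c] = patch[r][c]
        after[2 + r][3 + c] = patch[r][c]

match = best_offset(before, after, top=1, left=1, size=2, search=2)
print(match)
```

Each comparing window is independent of the others, which is one reason this kind of search lends itself to being spread over many processors.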

2. Distribution of Compute Intensive Tasks
b. Distributed Clustering

Workflows for Distributed Data Clustering

3. Distributed Mining over Sensor Grid Data
Distributed Spatial Data Mining for Air Pollution Modelling

The GUSTO Project - Update (Generic UV Sensors Technologies & Observations)
Sensor Specification
- High throughput open path spectrometer system
- Robust algorithm for pollutant concentration retrievals
- Measures SO2, NO, NO2, O3 & Benzene to ppb levels every few seconds
- Geared for networking of multiple GUSTO units within a GRID infrastructure
- Can support remote sensing data for (contour) mapping of pollutants

Networking of Multiple GUSTO Units
Diagram labels:
- GRID Infrastructure
- GUSTO unit 1; GUSTO unit 2; GUSTO unit 3; GUSTO unit 4
- Sensor registry & control service; Data upload service; Data access service
- Warehouse; Archived weather data; Archived health data
- Monitoring and control software; Visualisation and Data Mining; Public access Web visualizer
- Wireless connectivity; SensorML; HTTP, SOAP, GSI

Pollution analysis

4. Knowledge Discovery from Naturally Distributed Data Sources
Distributed Data Mining in Life Sciences

Distributed Data Mining for Life Sciences
Diagram labels:
- Sequence fragments: ATGCAAGTCCCT, AAGATTGCATAA, GCTCGCTCAGTT
- Data types: polymorphism; patient records; epidemiology; linkage maps; cytogenetic maps; physical maps; sequences; alignments; expression patterns; physiology; receptors; signals; pathways; secondary structure; tertiary structure

Information Integration
- Given a collection of microarray-generated gene expression data, what kinds of questions do users wish to pose?
- How should an integration schema be designed?
- Diagram labels: Gene Expression Warehouse; Protein; Disease; SNP; Enzyme; Pathway; Known Gene; Sequence Cluster; Affy Fragment; Sequence; LocusLink; MGD; ExPASy SwissProt; PDB; OMIM; NCBI dbSNP; ExPASy Enzyme; KEGG; SPAD; UniGene; Genbank; NMR; Metabolite

From Data Integration to Knowledge Unification
Diagram labels: In Silico Experiment; D-World; I-World; K-World

Life Science Application: SC2002 HPC Challenge
D-Net based Global Collaborative Real-Time Genome Annotation
Diagram labels:
- Genome Annotation; High Throughput Sequencers; Organism; Chromosomes; Organism’s DNA; Proteins
- Nucleotide-level Annotation: blast, genscan, Repeat Masker, grail, genscan, E-PCR
  - Identify: Genes, Gene markers, tRNAs, rRNAs, Non-translated RNAs, Regulatory Regions, Repetitive Elements, Segmental Duplication, SNP Variations, Literature References, ...
- Protein-level Annotation: 3D-PSSM, blast, Motif Search, PFAM, DSC, predator, InterPro, InterPro, SMART, SWISS-PROT
  - Identify: Functional Characterisation, Homologues, Domain, 3-D Structure, Fold Prediction, Secondary structure, Literature References, ...
  - Classify into Protein Families
- Process-level Annotation: Ontologies, Pathway Maps, GeneMaps, AmiGO, GenNav, virtual chip
  - Relate: Cell Cycle, Metabolism, Drugs, Biological Process, ..., Cell death, Embryogenesis, Literature References, ...
- Databases: NCBI, EMBL, TIGR, SNP, GO, CSNDB, GK, KEGG
- 15 DBs, 21 Applications

HPC Challenge SC2002: Nucleotide Annotation Workflows
- 1800 clicks
- 500 Web accesses
- 200 copy/paste operations
- 3 weeks' work in 1 workflow, executed in a few seconds
- Databases shown: NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO, KEGG
- Diagram labels: Real-time sequencing in London; Download sequence from Reference Server; Save to Distributed Annotation Server; Interactive Editor & Visualisation; Execute distributed annotation workflow; Distributed data and computation
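The point of the slide above, that many manual steps collapse into one reusable workflow, can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not Discovery Net's DPML or its API: every function name, the in-memory `ANNOTATION_STORE`, the placeholder sequence and the GC-content "annotation" are hypothetical stand-ins, and the order of steps (download, annotate, save) is assumed rather than taken from the slide.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]

# In-memory stand-in for the "Distributed Annotation Server" named on the slide.
ANNOTATION_STORE: List[Record] = []


def download_sequence(record: Record) -> Record:
    """Stand-in for 'Download sequence from Reference Server' (no network access)."""
    record["sequence"] = "ATGCAAGTCCCT"  # placeholder sequence, not project data
    return record


def annotate_gc_content(record: Record) -> Record:
    """Toy annotation step: fraction of G and C bases in the sequence."""
    seq = str(record["sequence"])
    record["gc_fraction"] = sum(base in "GC" for base in seq) / len(seq)
    return record


def save_annotation(record: Record) -> Record:
    """Stand-in for 'Save to Distributed Annotation Server'."""
    ANNOTATION_STORE.append(dict(record))
    return record


def run_workflow(steps: List[Step], record: Record) -> Record:
    """Run the steps in order and log each step name as simple provenance."""
    provenance: List[str] = []
    for step in steps:
        record = step(dict(record))
        provenance.append(step.__name__)
    record["provenance"] = provenance
    return record


result = run_workflow(
    [download_sequence, annotate_gc_content, save_annotation],
    {"sequence_id": "demo"},
)
print(result)
```

Once the steps are captured as a list, the same workflow can be re-run on a new input with a single call, which is the kind of saving the click and copy/paste counts on the slide refer to.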

Discovery Net in Action: China SARS Virtual Lab
Diagram labels:
- Relationship between SARS and other virus
- Mutual regions identification
- Homology search against viral genome DB
- Annotation using Artemis and GenSense
- Gene prediction
- Phylogenetic analysis
- Exon prediction
- Splice site prediction
- Immunogenetics
- Multiple sequence alignment
- Microarray analysis
- Bibliographic databases
- Key word search
- GeneSense Ontology
- D-Net: Integration, interpretation, and discovery
- Epidemiological analysis
- Predicted genes
- SARS patients diagnosis
- Homology search against protein DB
- Homology search against motif DB
- Protein localization site prediction
- Protein interaction prediction
- Relationship between SARS virus and human receptors prediction
- Classification and secondary structure prediction
- Bibliographic databases
- Genbank
- Annotation using Artemis and GenSense

Discovery Net in Action: SARS Virus Mutation Analysis

5. What Do Scientists Really Want?
Does it really work?

Towards Compositional Grid Services
Diagram labels:
- A compositional GRID
- Workflow Execution; Workflow Management; Collaborative Knowledge Management
- Workflow Deployment: Grid Service and Portal
- Workflow Warehousing; Resource Mapping; Service Abstraction
- Workflow Authoring; Composing services; Service Browsing
- Condor-G; Native MPI; OGSA-service; Web Service; Unicore; Oracle 10g; Web Wrapper; Sun Grid Engine

Discovery Net Service Composition

Full Workflow

Executing Protein Annotation Workflow

Deployment of Node

Deploying Protein Annotation Workflow

Executing Deployed Service

Locating & Executing Deployed Service from Discovery Net

Workflow Provenance

Workflow Warehousing

Slide labels: Using Distributed Resources; Scientific Information; Scientific Discovery; In Real Time; Literature; Databases; Operational Data; Images; Instrument Data; Discovery Net Snapshot; Real Time Data Integration; Dynamic Application Integration; Discovery Services; Integrative Knowledge Management; Service Workflow