Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics


1 Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore
Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011

2 Overview
Motivation
Problem Definition
Objective
Proposed Architecture
A case study in Bio-informatics
Demo
Future works
Summary

3 Motivation
Deluge of biological data
Biomedical data is spread across heterogeneous databases
Data comes in structured and semi-/unstructured formats
Demand for fast, large-scale and cost-effective computing strategies

4 Problem Definition
Data
  – PubMed contains 20+ million abstracts
  – UniProt contains 13.5+ million records
Case study on antiviral proteins
  – Over 70,000 citations in PubMed
  – Over 14,000 proteins in UniProt
Integration and Analysis

5 Related Works
Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
  – No querying or reasoning
  – Not scalable
RDF/OWL-based integration tools (e.g. TopBraid Suite)
  – No NLP
  – Not bio-specific, and not biologist-friendly
Cloud-based bio data mining works (e.g. Kudtarkar P 2010)
  – Still in early stages
  – Challenging to perform semantic integration on the cloud

6 Objective
To provide a framework that enables:
Better data infrastructure
  – Scalability
  – Management of heterogeneity
  – Cost-effectiveness
Better data analytics
  – Integrative data mining
  – Visual query interface

7 Proposed Framework
Our Approach
Data Infrastructure Module
Data Analytics Module

8 Our Approach
Diagram labels: Data Infrastructure module; Data Analytics module; Biomedical sources; Web Crawler; Parser; Cloud-based data store; Knowle Population Service; Ontology; Query & Reasoner; User Interface

9 Our Approach
Data Infrastructure Module
  – Cloud-based: Amazon EC2, Hadoop, Microsoft Azure
  – Parallel processing: MapReduce
  – Distributed storage: Bigtable, HBase, HDFS
Data Analytics Module
  – Non-semantic: database-driven
  – Semantic: ontology-driven (Knowle, AllegroGraph, TopBraid)

10 Data Infrastructure Module (Hadoop)
Software framework for data-intensive, distributed applications
The Hadoop Distributed File System (HDFS) provides a distributed, scalable and portable file system that supports large data sets
Hadoop MapReduce allows parallel programming over large amounts of data
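
The MapReduce model named on this slide can be illustrated with a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the two sample abstracts are assumptions made for illustration, and a real Hadoop job would distribute the map and reduce calls across data nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in text.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one term."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical stand-ins for crawled abstracts.
abstracts = {
    "doc1": "antiviral protein inhibits virus",
    "doc2": "protein binds antiviral protein",
}
counts = run_job(abstracts)
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, both steps can run in parallel, which is the property Hadoop exploits.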

11 Cloud-Based Data Store: Hadoop Distributed File System
Name node
  – Metadata (in memory): data nodes, data blocks, node attributes, names of files, mapping of blocks to nodes
Secondary name node
Data node
  – Stores file contents
  – A file is chunked into blocks
  – Each block is spread to data nodes
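
The division of labour between the name node (block-to-node mapping) and the data nodes (block contents) can be sketched as a toy in-memory model. This is an illustration under simplifying assumptions, not the HDFS API: the `NameNode` class, the round-robin placement and the tiny block size are hypothetical, and real HDFS also replicates each block across several data nodes.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Chunk a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class NameNode:
    """Keeps only metadata: which block of which file lives on which data node."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # file name -> list of (block index, data node)

    def add_file(self, name, data, block_size, storage):
        """Split the file and spread its blocks over the data nodes round-robin."""
        placement = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            node = self.data_nodes[index % len(self.data_nodes)]
            storage[node][(name, index)] = block  # contents go to the data node
            placement.append((index, node))
        self.block_map[name] = placement

    def read_file(self, name, storage):
        """Reassemble a file by fetching its blocks in order."""
        return b"".join(
            storage[node][(name, index)] for index, node in self.block_map[name]
        )


nodes = ["dn1", "dn2", "dn3", "dn4"]     # four data nodes, as in the case study
storage = {node: {} for node in nodes}   # per-node block storage
name_node = NameNode(nodes)
name_node.add_file("records.txt", b"ABCDEFGHIJ", block_size=4, storage=storage)
restored = name_node.read_file("records.txt", storage)
```

The name node never holds file contents here, only the mapping, which mirrors the metadata list on the slide.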

12 Data Analytics Module (Knowle)
Semantic Technology Toolkit
Knowle services used in the Data Analytics Module
  – Data/Text mining
  – Ontology Population
  – Ontology Query
  – Visual Ontology Query
Developed at the Institute for Infocomm Research, Singapore

13 Our Approach
Diagram labels: Data Infrastructure module; Data Analytics module; Biomedical data sources; Web Crawler; Parser; Cloud-based data store; Knowle Population Service; Ontology; Query & Reasoner; User Interface

14 Web Crawler
Diagram labels: Bio-medical data sources (UniProt, PubMed); UniProt Crawler; PubMed Crawler; Cloud-based data store

15 Parser
Diagram labels: Cloud-based data store; Crawled UniProt data; Crawled PubMed data; UniProt Parser; PubMed Parser; Knowle Ontology Population Service

16 Ontology
Protein Ontology
Protein + Literature Ontology

17 Ontology Populator
Diagram labels: Parsed UniProt Data; Parsed PubMed Data; Knowle Ontology Population Service (Populate concepts, Assert Datatype Properties, Assert Object Properties); Knowle Text Mining Service (Entity Detection, Relation Extraction); Protein + Literature ontology; Ontology Triplestore
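
The three population steps named on this slide (populate concepts, assert datatype properties, assert object properties) can be sketched with a toy in-memory triple store. This is not the Knowle API, which the slides do not show: the `TripleStore` class, the property names (`hasName`, `hasTitle`, `mentions`) and the sample identifiers are all hypothetical.

```python
class TripleStore:
    """Toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def populate_concept(self, individual, concept):
        """Declare an individual as an instance of an ontology concept."""
        self.triples.add((individual, "rdf:type", concept))

    def assert_datatype_property(self, individual, prop, value):
        """Attach a literal value (name, title, ...) to an individual."""
        self.triples.add((individual, prop, value))

    def assert_object_property(self, subject, prop, obj):
        """Link two individuals, e.g. a citation to a protein it mentions."""
        self.triples.add((subject, prop, obj))


def populate(store, proteins, citations, mentions):
    """Load parsed protein and citation records plus extracted relations."""
    for protein_id, name in proteins.items():
        individual = f"protein:{protein_id}"
        store.populate_concept(individual, "Protein")
        store.assert_datatype_property(individual, "hasName", name)
    for citation_id, title in citations.items():
        individual = f"citation:{citation_id}"
        store.populate_concept(individual, "Citation")
        store.assert_datatype_property(individual, "hasTitle", title)
    for citation_id, protein_id in mentions:
        store.assert_object_property(
            f"citation:{citation_id}", "mentions", f"protein:{protein_id}"
        )


store = TripleStore()
populate(
    store,
    proteins={"P1": "example antiviral protein"},   # hypothetical parsed record
    citations={"C1": "example citation title"},     # hypothetical parsed record
    mentions=[("C1", "P1")],                         # hypothetical extracted relation
)
```

In this sketch the `mentions` pairs stand in for the output of entity detection and relation extraction, which is what links the literature records to the protein records.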

18 Query & Reasoner
Diagram labels: Ontology Triplestore; OWLIM Reasoner; SAIL; Sesame; Knowle Query Service; User Interface
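
A cross-query over protein and literature triples can be sketched with simple pattern matching in plain Python. This is an illustration only: it does not use Sesame, OWLIM or the Knowle Query Service, it performs no reasoning (for example, no subclass inference), and the `match` helper, the predicate names and the sample triples are hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def citations_for_concept(triples, concept):
    """Titles of citations that mention any individual of the given concept."""
    individuals = {s for s, _, _ in match(triples, predicate="rdf:type", obj=concept)}
    titles = set()
    for citation, _, target in match(triples, predicate="mentions"):
        if target in individuals:
            titles.update(
                o for _, _, o in match(triples, subject=citation, predicate="hasTitle")
            )
    return sorted(titles)


# Hypothetical triples combining protein and literature data.
TRIPLES = {
    ("protein:P1", "rdf:type", "AntiviralProtein"),
    ("protein:P2", "rdf:type", "Protein"),
    ("citation:C1", "mentions", "protein:P1"),
    ("citation:C1", "hasTitle", "title one"),
    ("citation:C2", "mentions", "protein:P2"),
    ("citation:C2", "hasTitle", "title two"),
}
result = citations_for_concept(TRIPLES, "AntiviralProtein")
```

The query joins three patterns (type, mention, title), which is the kind of join a triplestore query engine would evaluate once both sources share one ontology.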

19 User Interface
Diagram labels: Search; KnowleGator; Visual Query; Visual Query Translator; Web Crawler; Parser; Knowle Population Service; Ontology; Ontology Triplestore; Ontology Query & Reasoner

20 A case study in Bio-informatics
Integration and cross-querying of PubMed and UniProt
Data
  – 70,054 citations from PubMed
  – 14,527 proteins in UniProt
Infrastructure (virtual computers)
  – 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)
  – 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)
  – 1 virtual CPU = Intel Xeon 2.4 GHz

21 Demo
Data
  – UniProt: 853 antiviral protein entries
  – PubMed: 2,000 citations

22 Demo Snapshot

23 Summary
We proposed a new framework
  – Data infrastructure module (cloud-based infrastructure)
  – Data analytics module (semantic technologies)
We tested it on a prototype
  – Using our own infrastructure
  – With integration and cross-querying of PubMed and UniProt

24 Future works
Integrated user interface
Explore other cloud-based data stores: HBase, Bigtable
Apply the MapReduce concept to data analytics and crawling
Integrate Knowle into the cloud-based environment


