Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics
Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore
Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A*STAR, Singapore
11th Feb 2011

Overview
Motivation
Problem Definition
Objective
Proposed Architecture
A case study in Bio-informatics
Demo
Future works
Summary

Motivation
Deluge of biological data
Biomedical data is spread across heterogeneous databases
Data comes in structured, semi-structured and unstructured formats
Demand for fast, large-scale and cost-effective computing strategies

Problem Definition
Data
– PubMed contains 20+ million abstracts
– UniProt contains 13.5+ million records
Case study on antiviral proteins
– Over 70,000 citations in PubMed
– Over 14,000 proteins in UniProt
Integration and Analysis

Related Works
Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
– No querying & reasoning
– Not scalable
RDF/OWL based integration tools (e.g. TopBraid Suite)
– No NLP
– Not bio specific; also not biologist friendly
Cloud-based bio data mining works (e.g. Kudtarkar P 2010)
– Still in early stages
– Challenging to perform semantic integration on cloud

Objective
To provide a framework that enables
Better data infrastructure
– Scalability
– Management of heterogeneity
– Cost-effectiveness
Better data analytics
– Integrative data mining
– Visual query interface

Proposed Framework
Our Approach
– Data Infrastructure Module
– Data Analytics Module

Our Approach (architecture diagram)
Modules: Data Infrastructure module, Data Analytics module
Components shown: Biomedical sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface

Our Approach
Data Infrastructure Module
– Cloud based: Amazon EC2, Hadoop, Microsoft Azure
– Parallel processing: MapReduce
– Distributed storage: Bigtable, HBase, HDFS
Data Analytics Module
– Non-semantic: database driven
– Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)

Data Infrastructure Module (Hadoop)
Software framework for data-intensive, distributed applications
The Hadoop Distributed File System (HDFS) provides a distributed, scalable and portable file system that supports large data sets
Hadoop MapReduce allows programs to process large amounts of data in parallel
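The map, shuffle and reduce steps behind the MapReduce model can be sketched in a few lines of plain Python. This is only an illustration running in a single process, not Hadoop code and not the authors' job: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the two sample abstracts are made up for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Emit a (term, 1) pair for every whitespace-separated token of one abstract."""
    return [(token.strip(".,;()").lower(), 1) for token in record.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce sequentially (Hadoop would distribute each step)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical sample input, standing in for crawled abstracts.
abstracts = [
    "Interferon induces antiviral proteins.",
    "Antiviral proteins inhibit viral replication.",
]
counts = run_job(abstracts)
print(counts["antiviral"])
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key, both steps can be spread over many machines, which is the property the slide relies on.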

Cloud Based Data Store
Hadoop Distributed File System
Name node
– Metadata (in memory): names of files, data nodes, data blocks, node attributes, mapping of blocks to nodes
Secondary name node
Data node
– Stores file contents
– Each file is chunked into blocks
– Each block is spread across the data nodes
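The division of labour between name node and data nodes can be illustrated with a toy model: the file is cut into fixed-size blocks, and the name node keeps only the block-to-node mapping. This is a sketch under simplifying assumptions, not HDFS code. The class `ToyNameNode`, the round-robin placement, the node names and the tiny block size are all invented for the example, and real HDFS additionally replicates each block on several data nodes.

```python
def split_into_blocks(content: bytes, block_size: int) -> list:
    """Chunk file contents into consecutive blocks of at most block_size bytes."""
    return [content[i:i + block_size] for i in range(0, len(content), block_size)]

class ToyNameNode:
    """Holds metadata only: which block of which file lives on which data node."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # file name -> list of (block index, data node)

    def add_file(self, name, content, block_size):
        blocks = split_into_blocks(content, block_size)
        placement = []
        for index in range(len(blocks)):
            # Assumption: simple round-robin placement, one copy per block.
            node = self.data_nodes[index % len(self.data_nodes)]
            placement.append((index, node))
        self.block_map[name] = placement
        return blocks

# Four data nodes, matching the size of the case-study cluster.
name_node = ToyNameNode(["datanode-1", "datanode-2", "datanode-3", "datanode-4"])
blocks = name_node.add_file("crawled_uniprot.dat", b"x" * 10, block_size=4)
print(name_node.block_map["crawled_uniprot.dat"])
```

Only the small mapping is kept by the name node; the block contents themselves would be sent to the data nodes.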

Data Analytics Module (Knowle)
Semantic Technology Toolkit
Knowle services used in Data Analytics Module
– Data/Text mining
– Ontology Population
– Ontology Query
– Visual Ontology Query
Developed at the Institute for Infocomm Research, Singapore

Our Approach (architecture diagram, repeated)
Modules: Data Infrastructure module, Data Analytics module
Components shown: Biomedical data sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface

Web Crawler (diagram)
Components shown: Bio-medical data sources (UniProt, PubMed), UniProt Crawler, PubMed Crawler, Cloud-based data store

Parser (diagram)
Components shown: Cloud-based data store, Crawled UniProt data, Crawled PubMed data, UniProt Parser, PubMed Parser, Knowle Ontology Population Service
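The slides do not say which file format the UniProt parser consumes. As one possibility, the sketch below assumes the UniProtKB flat-file layout, where each line starts with a two-letter code (`ID`, `AC`, ...) followed by data from column 6, and `//` ends an entry. The function name `parse_entries` and the sample entry with placeholder identifiers are hypothetical.

```python
def parse_entries(text):
    """Collect the entry name (ID line) and accession numbers (AC lines) per entry."""
    entries = []
    current = {"id": None, "accessions": []}
    for line in text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "ID":
            # The first field of the ID line is the entry name.
            current["id"] = data.split()[0]
        elif code == "AC":
            # Accession numbers are separated by semicolons.
            current["accessions"] += [a.strip() for a in data.split(";") if a.strip()]
        elif code == "//":
            # End of entry: store it and start a fresh one.
            entries.append(current)
            current = {"id": None, "accessions": []}
    return entries

# Placeholder entry, not a real UniProt record.
sample = "\n".join([
    "ID   EXAMPLE_HUMAN   Reviewed;   100 AA.",
    "AC   X00001; X00002;",
    "DE   RecName: Full=Placeholder protein;",
    "//",
])
entries = parse_entries(sample)
print(entries)
```

The resulting dictionaries are the kind of parsed record that could then be handed to a population service.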

Ontology
Protein Ontology
Protein + Literature Ontology

Ontology Populator (diagram)
Inputs: Parsed UniProt data, Parsed PubMed data
Knowle Ontology Population Service
– Populate concepts
– Assert datatype properties
– Assert object properties
Knowle Text Mining Service
– Entity detection
– Relation extraction
Output: Ontology triplestore (Protein + Literature ontology)
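The three population steps named on this slide (populate concepts, assert datatype properties, assert object properties) can be illustrated with plain subject-predicate-object tuples. This is a sketch only, not the Knowle API: the `ex:` names, the `populate` function and the placeholder records are assumptions, and the `mentions` pairs stand in for whatever the text mining service would extract.

```python
RDF_TYPE = "rdf:type"

def populate(proteins, citations, mentions):
    """Turn parsed records into a set of (subject, predicate, object) triples."""
    triples = set()
    for protein in proteins:
        subject = f"ex:protein/{protein['accession']}"
        triples.add((subject, RDF_TYPE, "ex:Protein"))           # populate concept
        triples.add((subject, "ex:hasName", protein["name"]))    # datatype property
    for citation in citations:
        subject = f"ex:citation/{citation['pmid']}"
        triples.add((subject, RDF_TYPE, "ex:Citation"))          # populate concept
        triples.add((subject, "ex:hasTitle", citation["title"])) # datatype property
    for pmid, accession in mentions:
        # Object property: links two individuals instead of attaching a literal.
        triples.add((f"ex:citation/{pmid}", "ex:mentions", f"ex:protein/{accession}"))
    return triples

# Placeholder records, not real UniProt or PubMed data.
triples = populate(
    proteins=[{"accession": "X00001", "name": "Placeholder protein"}],
    citations=[{"pmid": "0000001", "title": "Placeholder title"}],
    mentions=[("0000001", "X00001")],
)
print(len(triples))
```

The object-property triple is what ties the protein data and the literature data together in one graph.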

Query & Reasoner (diagram)
Components shown: User Interface, Knowle Query Service, Sesame, SAIL, OWLIM Reasoner, Ontology Triplestore
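To give a feel for what a cross-query over the integrated protein and literature data looks like, here is a minimal pure-Python stand-in for triple-pattern matching. It does not use Sesame or OWLIM and performs no reasoning; the helper names, the `ex:` vocabulary and the placeholder triples are all assumptions made for the illustration.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def citations_mentioning(triples, protein_name):
    """Join two patterns: protein name -> protein -> citations that mention it."""
    proteins = [s for s, _, _ in match(triples, p="ex:hasName", o=protein_name)]
    return sorted(
        citation
        for protein in proteins
        for citation, _, _ in match(triples, p="ex:mentions", o=protein)
    )

# Placeholder graph, not real data.
graph = [
    ("ex:protein/X00001", "ex:hasName", "Placeholder protein"),
    ("ex:protein/X00002", "ex:hasName", "Another protein"),
    ("ex:citation/0000001", "ex:mentions", "ex:protein/X00001"),
    ("ex:citation/0000002", "ex:mentions", "ex:protein/X00001"),
    ("ex:citation/0000003", "ex:mentions", "ex:protein/X00002"),
]
result = citations_mentioning(graph, "Placeholder protein")
print(result)
```

In a triplestore the same join would normally be written as a single query with two triple patterns sharing a variable, and a reasoner could add inferred triples before matching.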

User Interface (diagram)
Components shown: Search, KnowleGator, Ontology Visual Query, Visual Query Translator, Ontology Query & Reasoner, Web Crawler, Parser, Knowle Population Service, Ontology Triplestore

A case study in Bio-informatics
Integration and cross-querying of PubMed and UniProt
Data
– 70,054 citations from PubMed
– 14,527 proteins in UniProt
Infrastructure (virtual computers)
– 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)
– 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)
– 1 virtual CPU = Intel Xeon 2.4 GHz

Demo
Data
– UniProt: 853 antiviral protein entries
– PubMed: 2000 citations

Demo Snapshot

Summary
We proposed a new framework
– Data infrastructure module (cloud-based infrastructure)
– Data analytics module (semantic technologies)
We tested it on a prototype
– Using our own infrastructure
– With integration and cross-querying of PubMed and UniProt

Future works
Integrated user interface
Explore other cloud-based data stores: HBase, Bigtable
Apply the MapReduce concept to data analytics and crawling
Integrate Knowle into the cloud-based environment
