Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics. Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore. Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore. 11th Feb 2011
Overview: Motivation; Problem Definition; Objective; Proposed Architecture; A case study in Bio-informatics; Demo; Future works; Summary
Motivation Deluge of biological data Biomedical data is available on heterogeneous databases Data: structured and semi/un-structured formats Demand for fast, large-scale and cost-effective computing strategies
Problem Definition Data – PubMed contains 20+ million abstracts – UniProt contains million records Case study on antiviral proteins – Over 70,000 citations in PubMed – Over 14,000 proteins in UniProt Integration and Analysis
Related Works Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso) – No querying & reasoning – Not scalable RDF/OWL based integration tools (e.g. TopBraid Suite) – No NLP – Not bio specific. Also not biologist friendly Cloud-based bio data mining works (e.g. Kudtarkar P 2010) – Still in early stages – Challenging to perform semantic integration on cloud
Objective To provide a framework that enables Better data infrastructure – Scalability – Management of heterogeneity – Cost-effectiveness Better data analytics – Integrative data mining – Visual query interface
Proposed Framework Our Approach Data Infrastructure Module Data Analytics Module
Our Approach (architecture diagram): Data Infrastructure module and Data Analytics module. Components shown: biomedical sources, Web Crawler, Parser, cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface.
Our Approach Data Infrastructure Module – Cloud based: Amazon EC2, Hadoop, Microsoft Azure – Parallel processing: MapReduce – Distributed Storage: Big Table, HBase, HDFS Data Analytics Module – Non-semantic: database driven – Semantic: ontology driven (Knowle, Allegrograph, TopBraid)
Data Infrastructure Module (Hadoop) Software framework for data-intensive, distributed applications. The Hadoop Distributed File System (HDFS) provides a distributed, scalable, and portable file system that supports large data sets. Hadoop MapReduce allows programs to process large amounts of data in parallel.
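The MapReduce model named on this slide can be illustrated without a cluster. The following is a minimal single-process sketch, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the two sample abstracts are made up for illustration, and a real job would run the map and reduce steps on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit a (term, 1) pair for each lower-cased token of one abstract."""
    for token in text.lower().split():
        yield token.strip(".,;:()"), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one term."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce sequentially over all documents."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, not taken from PubMed.
abstracts = {
    "doc1": "Interferon induces antiviral proteins.",
    "doc2": "Antiviral proteins inhibit viral replication.",
}
counts = run_job(abstracts)
```

Because each `map_phase` call depends only on its own document, the map step is the part that a framework can spread across data nodes.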
Cloud Based Data Store: Hadoop Distributed File System. Name node: metadata (in memory) covering data nodes, data blocks, node attributes, names of files, and the mapping of blocks to nodes. Data node: stores file contents; each file is chunked into blocks, and each block is spread across data nodes. Secondary name node.
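The name node bookkeeping described above (files chunked into blocks, blocks mapped to data nodes) might look roughly like the sketch below. It is a toy model, not HDFS code: the `NameNode` class, the round-robin placement, the 64-unit block size, the replication factor, and the file name are all assumptions chosen for illustration, and real HDFS uses its own placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy in-memory metadata store: file -> blocks, block -> data nodes."""
    block_size: int
    data_nodes: list
    replication: int = 1
    file_blocks: dict = field(default_factory=dict)
    block_locations: dict = field(default_factory=dict)

    def add_file(self, name, size):
        """Split a file of `size` units into blocks and assign nodes."""
        n_blocks = -(-size // self.block_size)  # ceiling division
        block_ids = [f"{name}#blk{i}" for i in range(n_blocks)]
        self.file_blocks[name] = block_ids
        for i, block_id in enumerate(block_ids):
            # Round-robin placement (an assumption, not the HDFS policy).
            self.block_locations[block_id] = [
                self.data_nodes[(i + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
        return block_ids


# Four data nodes, as in the case study; other values are illustrative.
name_node = NameNode(block_size=64, data_nodes=["dn1", "dn2", "dn3", "dn4"],
                     replication=2)
blocks = name_node.add_file("pubmed.xml", size=200)
```

Only metadata lives in the `NameNode` object; the block contents themselves would sit on the data nodes listed in `block_locations`.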
Data Analytics Module (Knowle) Semantic Technology Toolkit Knowle services used in Data Analytics Module – Data/Text mining – Ontology Population – Ontology Query – Visual Ontology Query Developed in Institute for Infocomm Research, Singapore
Our Approach (architecture diagram, repeated): Data Infrastructure module and Data Analytics module. Components shown: biomedical data sources, Web Crawler, Parser, cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface.
Web Crawler (diagram): the UniProt Crawler and the PubMed Crawler fetch records from the bio-medical data sources (UniProt, PubMed) into the cloud-based data store.
Parser (diagram): the UniProt Parser and the PubMed Parser read crawled UniProt data and crawled PubMed data from the cloud-based data store and pass their output to the Knowle Ontology Population Service.
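As an illustration of the parsing step, the sketch below reads entries in the UniProt flat-file text layout (two-letter line codes, `//` as the end-of-entry marker). The slides do not say which download format the crawler stored, so the format choice is an assumption, and the function name `parse_flat_file` and the sample entry are made up.

```python
def parse_flat_file(text):
    """Parse UniProt-style flat-file entries into dictionaries."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
            continue
        # Each line is a two-letter code, three spaces, then the data.
        code, _, value = line.partition("   ")
        value = value.strip()
        if code == "ID":
            current["id"] = value.split()[0]
        elif code == "AC":
            current.setdefault("accessions", []).extend(
                a.strip() for a in value.split(";") if a.strip()
            )
        elif code == "DE":
            current.setdefault("description", []).append(value)
    return entries


# Made-up entry; the identifiers are placeholders, not real accessions.
sample = "\n".join([
    "ID   EXAMPLE_HUMAN   Reviewed;   100 AA.",
    "AC   X00001; X00002;",
    "DE   RecName: Full=Example antiviral protein;",
    "//",
])
records = parse_flat_file(sample)
```

The resulting dictionaries are the kind of parsed record that could then be handed to an ontology population step.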
Ontology: Protein Ontology; Protein + Literature Ontology
Ontology Populator (diagram): parsed UniProt data and parsed PubMed data feed the Knowle Ontology Population Service (populate concepts, assert datatype properties, assert object properties) and the Knowle Text Mining Service (entity detection, relation extraction); the result is the Protein + Literature ontology in the ontology triplestore.
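The three population steps on this slide (populate concepts, assert datatype properties, assert object properties) can be sketched with plain tuples as triples. This is not the Knowle API, which the slides do not describe: the `ex:` names, the `Literal` wrapper, and the substring-based `detect_proteins` function are hypothetical stand-ins, with the substring match standing in for the entity detection step.

```python
from collections import namedtuple

# Wrapper that distinguishes literal values from resource identifiers.
Literal = namedtuple("Literal", "value")


def detect_proteins(text, protein_names):
    """Naive entity detection: case-insensitive substring match."""
    lowered = text.lower()
    return [pid for pid, name in protein_names.items()
            if name.lower() in lowered]


def populate(proteins, citations):
    """Build a set of (subject, predicate, object) triples."""
    triples = set()
    names = {p["id"]: p["name"] for p in proteins}
    for p in proteins:
        triples.add((p["id"], "rdf:type", "ex:Protein"))          # concept
        triples.add((p["id"], "ex:hasName", Literal(p["name"])))  # datatype
    for c in citations:
        triples.add((c["id"], "rdf:type", "ex:Citation"))          # concept
        triples.add((c["id"], "ex:hasTitle", Literal(c["title"])))  # datatype
        for pid in detect_proteins(c["title"], names):
            triples.add((c["id"], "ex:mentions", pid))             # object
    return triples


# Made-up records for illustration.
proteins = [{"id": "ex:protein1", "name": "interferon alpha"}]
citations = [
    {"id": "ex:citation1",
     "title": "Interferon alpha restricts viral replication"},
    {"id": "ex:citation2", "title": "A survey of cloud storage"},
]
triples = populate(proteins, citations)
```

The `ex:mentions` triples are what link the literature side to the protein side, which is what makes cross-querying possible later.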
Query & Reasoner (diagram): User Interface; Knowle Query Service; Sesame with the OWLIM reasoner attached through SAIL; ontology triplestore.
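To show what a cross-query over protein and literature triples involves, here is a small triple-pattern matcher. It does not use Sesame or OWLIM and performs no reasoning; the function names, the `ex:` vocabulary, and the sample triples are assumptions made for the example. Terms beginning with `?` are treated as variables.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, else None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # Variable: bind it, or check an earlier binding agrees.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(triples, patterns):
    """Join the patterns left to right and return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Made-up triples for illustration.
store = [
    ("ex:protein1", "rdf:type", "ex:Protein"),
    ("ex:protein1", "ex:hasName", "interferon alpha"),
    ("ex:protein2", "ex:hasName", "protein kinase R"),
    ("ex:citation1", "ex:mentions", "ex:protein1"),
    ("ex:citation2", "ex:mentions", "ex:protein2"),
]
# Which citations mention the protein named "interferon alpha"?
results = query(store, [
    ("?c", "ex:mentions", "?p"),
    ("?p", "ex:hasName", "interferon alpha"),
])
```

The shared variable `?p` is what joins the literature pattern to the protein pattern; a triplestore query engine does the same join with indexes rather than nested loops.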
User Interface (diagram): Search; KnowleGator; Visual Query and Visual Query Translator. Connected components: Web Crawler, Parser, Knowle Population Service, Ontology, Ontology Triplestore, Query & Reasoner.
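A visual query translator turns a graph drawn by the user into a textual query. The slides do not describe how the Visual Query Translator works or which query language it emits, so the following is only a sketch under the assumption that the target is SPARQL; the `to_sparql` function, the `ex` prefix, and its IRI are hypothetical.

```python
def to_sparql(select, edges, prefix=("ex", "http://example.org/onto#")):
    """Render a list of (subject, predicate, object) edges as SPARQL text."""
    name, iri = prefix
    lines = [
        f"PREFIX {name}: <{iri}>",
        f"SELECT {' '.join(select)}",
        "WHERE {",
    ]
    for subject, predicate, obj in edges:
        # Each edge of the visual graph becomes one triple pattern.
        lines.append(f"  {subject} {predicate} {obj} .")
    lines.append("}")
    return "\n".join(lines)


# Visual graph: a citation node linked to a protein node with a given name.
sparql = to_sparql(
    ["?c"],
    [
        ("?c", "ex:mentions", "?p"),
        ("?p", "ex:hasName", '"interferon alpha"'),
    ],
)
```

Nodes left unnamed by the user become variables, and nodes the user fills in become constants or quoted literals.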
A case study in Bio-informatics Integration and cross-querying across PubMed and UniProt Data – 70,054 citations from PubMed – 14,527 proteins in UniProt Infrastructure (virtual computers) – 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz) – 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz) – 1 virtual CPU = Intel Xeon 2.4 GHz
Demo Data – UniProt: 853 antiviral protein entries – PubMed: 2000 citations
Demo Snapshot
Summary We proposed a new framework – Data infrastructure module (cloud-based infrastructure) – Data analytics module (semantic technologies) We tested it on a prototype – Using our own infrastructure – With integration and cross-querying across PubMed and UniProt
Future works Integrated user interface Explore other cloud-based data store: HBase, BigTable Apply map-reduce concept on data analytics and crawling Integrate Knowle into cloud-based environment