Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics. Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore

Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics Tat Thang, Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore Michael Li, Semantic Technology Group, Institute for Infocomm Research (I2R), A*STAR, Singapore 11th Feb 2011

Overview Motivation Problem Definition Objective Proposed Architecture A Case Study in Bioinformatics Demo Future Work Summary

Motivation Deluge of biological data Biomedical data is spread across heterogeneous databases Data comes in structured, semi-structured and unstructured formats Demand for fast, large-scale and cost-effective computing strategies

Problem Definition Data – PubMed contains 20+ million abstracts – UniProt contains millions of records Case study on antiviral proteins – Over 70,000 citations in PubMed – Over 14,000 proteins in UniProt Integration and Analysis

Related Work Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso) – No querying & reasoning – Not scalable RDF/OWL-based integration tools (e.g. TopBraid Suite) – No NLP – Not bio-specific, and not biologist-friendly Cloud-based bio data mining (e.g. Kudtarkar P 2010) – Still in early stages – Challenging to perform semantic integration on the cloud

Objective To provide a framework that enables Better data infrastructure – Scalability – Management of heterogeneity – Cost-effectiveness Better data analytics – Integrative data mining – Visual query interface

Proposed Framework Our approach comprises a Data Infrastructure Module and a Data Analytics Module

Our Approach (architecture diagram): Biomedical sources → Web Crawler → Parser → cloud-based data store → Knowle Population Service → Ontology → Query & Reasoner → User Interface; the crawling, parsing and storage components form the Data Infrastructure module, and the ontology, population and query components form the Data Analytics module

Our Approach Data Infrastructure Module – Cloud-based: Amazon EC2, Hadoop, Microsoft Azure – Parallel processing: MapReduce – Distributed storage: BigTable, HBase, HDFS Data Analytics Module – Non-semantic: database-driven – Semantic: ontology-driven (Knowle, AllegroGraph, TopBraid)

Data Infrastructure Module (Hadoop) A software framework for data-intensive, distributed applications The Hadoop Distributed File System provides a distributed, scalable and portable file system that supports large data sets Hadoop MapReduce allows programs to run in parallel over large amounts of data
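To make the MapReduce idea concrete, here is a minimal single-process sketch of the map/shuffle/reduce pattern, counting words over a tiny hypothetical corpus standing in for PubMed abstracts (on a real Hadoop cluster the mappers and reducers run distributed, e.g. via Hadoop Streaming):

```python
from collections import defaultdict
from itertools import chain

# Hypothetical two-abstract corpus; real input would be millions of abstracts.
ABSTRACTS = [
    "antiviral protein inhibits viral replication",
    "novel antiviral protein identified in plants",
]

def map_phase(doc):
    """Map step: emit (word, 1) pairs, as a Hadoop mapper would."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key after the shuffle groups them."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(docs):
    # The shuffle is implicit here: all mapper outputs are merged before reducing.
    return reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
```

The same mapper/reducer pair, written as two scripts reading stdin and writing stdout, is exactly what Hadoop Streaming would distribute across the cluster.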

Cloud-Based Data Store: Hadoop Distributed File System (diagram) Name node – holds metadata in memory: names of files, mapping of blocks to data nodes, node attributes Secondary name node – assists the name node Data nodes – store file contents; each file is chunked into blocks, and each block is spread across data nodes
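The chunking and placement the name node performs can be sketched as follows; the block size and round-robin placement here are simplifying assumptions (a real name node also accounts for rack topology and node load):

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Chunk a file into HDFS-style fixed-size blocks; the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return blocks

def place_replicas(num_blocks, data_nodes, replication=3):
    """Toy round-robin replica placement across data nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement
```

With the four data nodes used in the case study, a 150 MB file splits into two 64 MB blocks plus a 22 MB tail, each replicated on three nodes.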

Data Analytics Module (Knowle) A semantic technology toolkit Knowle services used in the Data Analytics Module – Data/text mining – Ontology population – Ontology query – Visual ontology query Developed at the Institute for Infocomm Research, Singapore

Our Approach (architecture diagram, recap): Biomedical data sources → Web Crawler → Parser → cloud-based data store → Knowle Population Service → Ontology → Query & Reasoner → User Interface, spanning the Data Infrastructure module and the Data Analytics module

Web Crawler (diagram): the UniProt Crawler and PubMed Crawler fetch records from the biomedical data sources (UniProt, PubMed) into the cloud-based data store

Parser (diagram): the UniProt Parser and PubMed Parser read the crawled UniProt and PubMed data from the cloud-based data store and feed the Knowle Ontology Population Service
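A minimal sketch of the PubMed parsing step, using a heavily trimmed, hypothetical record for illustration only (real MEDLINE XML has many more fields and nesting levels):

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed PubMed-style record; field layout is illustrative.
SAMPLE = """
<PubmedArticle>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>An antiviral protein study</ArticleTitle>
    <Abstract><AbstractText>Example abstract text.</AbstractText></Abstract>
  </Article>
</PubmedArticle>
"""

def parse_pubmed(xml_text):
    """Extract the fields the ontology population service would consume."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "abstract": root.findtext("Article/Abstract/AbstractText"),
    }
```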

Ontology (diagram): a Protein Ontology and a combined Protein + Literature Ontology

Ontology Populator (diagram): the Knowle Ontology Population Service takes parsed UniProt and PubMed data and, with the Knowle Text Mining Service (entity detection, relation extraction), populates concepts and asserts datatype and object properties into the Protein + Literature ontology in the triplestore
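The populate/assert steps can be sketched as producing RDF-style triples; the record fields and predicate names below are illustrative assumptions, not the actual Knowle schema:

```python
# Hypothetical parsed records; field names are illustrative only.
protein = {"id": "P12345", "name": "ExampleProtein"}
paper = {"pmid": "12345678", "mentions": ["P12345"]}

def populate(protein, paper):
    """Emit (subject, predicate, object) triples for one protein/paper pair."""
    triples = []
    # Populate concepts: type assertions for each instance.
    triples.append((protein["id"], "rdf:type", "bio:Protein"))
    triples.append((paper["pmid"], "rdf:type", "bio:Citation"))
    # Assert datatype properties: literal-valued attributes.
    triples.append((protein["id"], "bio:name", protein["name"]))
    # Assert object properties: links found by entity detection /
    # relation extraction in the text-mining service.
    for pid in paper["mentions"]:
        triples.append((paper["pmid"], "bio:mentions", pid))
    return triples
```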

Query & Reasoner (diagram): the Knowle Query Service runs on Sesame's SAIL layer with the OWLIM reasoner, querying the ontology triplestore and serving results to the user interface
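At its core, querying a triplestore is matching triple patterns with variables, as in SPARQL. A minimal in-memory sketch of that idea (the triples are hypothetical examples; the real stack delegates this to Sesame/OWLIM):

```python
def match(triples, pattern):
    """SPARQL-style triple pattern matching: None acts as a variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Hypothetical store contents.
TRIPLES = [
    ("P12345", "rdf:type", "bio:Protein"),
    ("12345678", "rdf:type", "bio:Citation"),
    ("12345678", "bio:mentions", "P12345"),
]
```

For example, "which papers mention which proteins?" is the pattern `(None, "bio:mentions", None)`; a reasoner like OWLIM additionally infers triples entailed by the ontology before matching.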

User Interface (diagram): KnowleGator provides search and visual ontology query; a visual query translator passes queries through the Query & Reasoner to the ontology triplestore, which the Web Crawler, Parser and Knowle Population Service keep populated

A case study in Bioinformatics Integration and cross-querying of PubMed and UniProt Data – 70,054 citations from PubMed – 14,527 proteins from UniProt Infrastructure (virtual machines) – 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz) – 2 master nodes (1 name node, 1 secondary name node; RAM: 512 MB, CPU: Intel Xeon 2.4 GHz) – 1 virtual CPU = Intel Xeon 2.4 GHz

Demo Data – UniProt: 853 antiviral protein entries – PubMed: 2,000 citations

Demo Snapshot

Summary We proposed a new framework – Data infrastructure module (cloud-based infrastructure) – Data analytics module (semantic technologies) We tested a prototype – On our own infrastructure – With integration and cross-querying of PubMed and UniProt

Future Work Integrated user interface Explore other cloud-based data stores: HBase, BigTable Apply the MapReduce concept to data analytics and crawling Integrate Knowle into a cloud-based environment
