A Study in Hadoop Streaming with Matlab for NMR data processing
Kalpa Gunaratna, Paul Anderson, Ajith Ranabahu and Amit Sheth

Presentation transcript:

A Study in Hadoop Streaming with Matlab for NMR data processing
Kalpa Gunaratna (1), Paul Anderson (2), Ajith Ranabahu (1) and Amit Sheth (1)
(1) Ohio Center of Excellence in Knowledge Enabled Computing (Kno.e.sis), Wright State University, Dayton, Ohio
(2) Air Force Research Laboratory, Biosciences & Protection Division, Wright-Patterson AFB, Dayton, Ohio
Speaker: 饒展榕

Outline
Introduction
Motivation and Background
Implementation
Discussion
Conclusion

Introduction
Our approach is to use cloud computing for Nuclear Magnetic Resonance (NMR) data analysis, which normally involves large amounts of data. To extract useful information from these spectral data, they must undergo numerical processing, such as baseline correction and normalization, that involves complex computations.

Analyzing these extremely large datasets all at once is difficult, and sometimes impossible, because of the limited memory and processing power a single computer can provide. Computing clusters and other types of distributed computing systems are generally used to analyze large datasets.

Matlab is a commercial software package that provides specific data structures and modules that biologists need in their routine workflows. However, Matlab typically runs as desktop software and is hence constrained in computational power.

In this paper we present our experience from a preliminary attempt to use Hadoop streaming with Matlab.

Motivation and Background
Our research is motivated by the difficulty scientists encounter in analyzing large data files conveniently. Analyzing NMR spectroscopic data requires a variety of computationally intensive algorithms that range from signal processing to pattern recognition techniques.

Baseline distortion and correction
Distorted baselines result in incorrect metabolite quantification, thus leading to spurious scientific conclusions. The baseline correction we performed using WS is shown in Figure 1(a), and a closer view of part of Figure 1(a) is shown in Figure 1(b).

Implementation
The WS algorithm balances these two goals as the sum: Q = S + λR
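The slide gives only the combined objective and does not define its terms. Assuming WS refers to the Whittaker smoother, the two goals are usually fidelity to the measured data and smoothness of the fitted curve. The symbols below (measured signal y, fitted curve z, weights w_i, difference order d) are not defined in the slides and are assumptions made here for illustration.

```latex
\begin{align}
S &= \sum_{i} w_i \,(y_i - z_i)^2 && \text{(fidelity to the data)}\\
R &= \sum_{i} \bigl(\Delta^{d} z_i\bigr)^2 && \text{(roughness of } z\text{)}\\
Q &= S + \lambda R
\end{align}
```

Under these assumptions, a larger λ favors a smoother z at the cost of a looser fit to y, and the minimizer of Q solves the linear system (W + λDᵀD)z = Wy, where W is the diagonal matrix of weights and D is the d-th order difference matrix.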

NMR Data Streaming for Matlab
To let Hadoop streaming use Matlab, we compiled the Matlab code into a C++ shared library. The mapper driver invokes the Matlab mapper function, in this case the baseline correction implementation.
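The driver's role can be illustrated with a minimal Python sketch. This is not the authors' C++ driver: it only mirrors the contract a streaming mapper has to meet, which is to read text records and write text records. The function `placeholder_baseline_correct` is a hypothetical stand-in for the compiled Matlab routine, and the input format (one spectrum per line, comma-separated) is an assumption.

```python
from typing import Callable, Iterable, List, TextIO


def placeholder_baseline_correct(spectrum: List[float]) -> List[float]:
    """Hypothetical stand-in for the compiled Matlab routine: subtract the minimum."""
    floor = min(spectrum)
    return [value - floor for value in spectrum]


def run_mapper(
    lines: Iterable[str],
    out: TextIO,
    correct: Callable[[List[float]], List[float]] = placeholder_baseline_correct,
) -> None:
    """Read one spectrum per line, apply the correction, write one line per spectrum."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Assumed record format: comma-separated intensity values.
        spectrum = [float(field) for field in line.split(",")]
        corrected = correct(spectrum)
        out.write(",".join(f"{value:.6g}" for value in corrected) + "\n")
```

Used as a streaming mapper, the script would call `run_mapper(sys.stdin, sys.stdout)`, since Hadoop streaming passes records to the mapper on standard input and collects its standard output.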

NMR spectra are usually generated as column-oriented data files, i.e. each spectrum occupies a column, so its values are spread across the rows. However, the Hadoop streaming architecture reads data files line by line.

Hence we collected all the spectra into a single file and transposed the data, i.e. a single row now represents a full spectrum. The wrapper creates the relevant Matlab object for a column and passes it to the Matlab function.
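The transposition step can be sketched as follows. This is an illustrative version only: the function name `columns_to_rows` and the comma separator are assumptions, and the whole table is held in memory, which would not suit files larger than RAM.

```python
from typing import Iterable, List


def columns_to_rows(lines: Iterable[str], sep: str = ",") -> List[str]:
    """Turn a column-per-spectrum table into one text line per spectrum."""
    table = [line.strip().split(sep) for line in lines if line.strip()]
    if not table:
        return []
    width = len(table[0])
    if any(len(row) != width for row in table):
        raise ValueError("all rows must have the same number of columns")
    # zip(*table) pairs up the i-th field of every row, i.e. it yields the columns.
    return [sep.join(column) for column in zip(*table)]
```

For example, the three input rows `1,10`, `2,20` and `3,30` (two spectra stored as columns) become the two lines `1,2,3` and `10,20,30`, so each spectrum reaches the mapper as a single record.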

In this particular case, a reducer was not used, since each spectrum was represented in one line. If a spectrum is spread across multiple lines, a reducer is needed to assemble the results properly.
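For the multi-line case, a reducer might look like the sketch below. The record layout is hypothetical: each mapper output line is assumed to be `spectrum_id<TAB>chunk_index<TAB>values`. It relies on Hadoop streaming's default behavior of treating the text before the first tab as the key and delivering records sorted by key, so all chunks of one spectrum arrive together.

```python
from itertools import groupby
from typing import Iterable, Iterator


def reassemble(lines: Iterable[str]) -> Iterator[str]:
    """Join the chunks of each spectrum, assuming input is grouped by spectrum id."""
    records = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    for spectrum_id, group in groupby(records, key=lambda rec: rec[0]):
        # Chunks of one spectrum may arrive in any order, so sort by chunk index.
        chunks = sorted(group, key=lambda rec: int(rec[1]))
        yield spectrum_id + "\t" + ",".join(rec[2] for rec in chunks)
```

Sorting inside each group keeps the sketch independent of any secondary-sort configuration, at the cost of buffering one spectrum's chunks in memory.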

Discussion
We observed that Matlab had issues accessing the standard input and output provided by the Hadoop streaming mechanism. We therefore implemented the process with a C++ shared library and wrapped the Matlab function.

Results

Conclusion
Our preliminary experiments with NMR datasets show that using Matlab with Hadoop streaming is indeed feasible and could be extended for various requirements.

Thanks.