Big Data Bioinformatics By: Khalifeh Al-Jadda. Is there any thing useful?!

Slides:



Advertisements
Similar presentations
Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic University of Belgrade School of Electrical Engineering.
Advertisements

Pfam(Protein families )
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
EBI is an Outstation of the European Molecular Biology Laboratory. Alex Mitchell InterPro team Using InterPro for functional analysis.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
Profiles for Sequences
©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.
A Grid implementation of the sliding window algorithm for protein similarity searches facilitates whole proteome analysis on continuously updated databases.
Applications of Hidden Markov Models in the Avian/Mammalian Genome Comparison Christine Bloom Animal Science College of Agriculture University of Delaware.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
InterPro/prosite UCSC Genome Browser Exercise 3. Turning information into knowledge  The outcome of a sequencing project is masses of raw data  The.
Readings for this week Gogarten et al Horizontal gene transfer….. Francke et al. Reconstructing metabolic networks….. Sign up for meeting next week for.
Bioinformatics and Phylogenetic Analysis
Phylogenetic Trees Presenter: Michael Tung
Protein Modules An Introduction to Bioinformatics.
Similar Sequence Similar Function Charles Yan Spring 2006.
Protein Structure and Function Prediction. Predicting 3D Structure –Comparative modeling (homology) –Fold recognition (threading) Outstanding difficult.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
A Study of GeneWise with the Drosophila Adh Region Asta Gindulyte CMSC 838 Presentation Authors: Yi Mo, Moira Regelson, and Mike Sievers Paracel Inc.,
Predicting Function (& location & post-tln modifications) from Protein Sequences June 15, 2015.
341: Introduction to Bioinformatics Dr. Natasa Przulj Deaprtment of Computing Imperial College London
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) iPlant: Josh Stein (CSHL) Matt Vaughn.
Human SNPs from short reads in hours using cloud computing Ben Langmead 1, 2, Michael C. Schatz 2, Jimmy Lin 3, Mihai Pop 2, Steven L. Salzberg 2 1 Department.
Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
The Pfam and MEROPS databases EMBO course 2004 Robert Finn
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
CSIU Submission of BLAST jobs via the Galaxy Interface Rob Quick Open Science Grid – Operations Area Coordinator Indiana University.
Multiple Alignments Motifs/Profiles What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile? BIO520.
Module 3 Sequence and Protein Analysis (Using web-based tools) Working with Pathogen Genomes - Uruguay 2008.
11 Overview Paracel GeneMatcher2. 22 GeneMatcher2 The GeneMatcher system comprises of hardware and software components that significantly accelerate a.
Construction of Substitution Matrices
BLOCKS Multiply aligned ungapped segments corresponding to most highly conserved regions of proteins- represented in profile.
BioPerf: A Benchmark Suite to Evaluate High- Performance Computer Architecture on Bioinformatics Applications David A. Bader, Yue Li Tao Li Vipin Sachdeva.
1 Large-Scale Profile-HMM on the Grid Laurent Falquet Swiss Institute of Bioinformatics CH-1015 Lausanne, Switzerland Borrowed from Heinz Stockinger June.
Bioinformatics Core Facility Guglielmo Roma January 2011.
Whole Genome Repeat Analysis Package A Preliminary Analysis of the Caenorhabditis elegans Genome Paul Poole.
Pfam, DAS and the future Rob Finn DAS Workshop 2009.
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Protein and RNA Families
Hidden Markov Models A first-order Hidden Markov Model is completely defined by: A set of states. An alphabet of symbols. A transition probability matrix.
Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
March 28, 2002 NIH Proteomics Workshop Bethesda, MD Lai-Su Yeh, Ph.D. Protein Scientist, National Biomedical Research Foundation Demo: Protein Information.
(H)MMs in gene prediction and similarity searches.
PORTING HMMER AND INTERPROSCAN TO THE GRID Daniel Alberto Burbano Sefair ( ) Michael Angel Pérez Cabarcas.
Protein domain/family db Secondary databases are the fruit of analyses of the sequences found in the primary sequence db Either manually curated (i.e.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
InterPro Sandra Orchard.
Gene models and proteomes for Saccharomyces cerevisiae (Sc), Schizosaccharomyces pombe (Sp), Arabidopsis thaliana (At), Oryza sativa (Os), Drosophila melanogaster.
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
The Biologist’s Wishlist A complete and accurate set of all genes and their genomic positions A set of all the transcripts produced by each gene The location.
Center for Biologisk Sekvensanalyse Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark
Protein families, domains and motifs in functional prediction May 31, 2016.
Bioinformatics Computation in the Cloud A Joint Collaboration Between Microsoft’s External Research and eXtreme Computing Groups
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics Tat Thang Parallel and Distributed Computing Centre,
Sequence similarity, BLAST alignments & multiple sequence alignments
Introduction to Bioinformatics Resources for DNA Barcoding
Sequence based searches:
Genome Annotation Continued
Predicting Active Site Residue Annotations in the Pfam Database
Sequence Based Analysis Tutorial
Sequence Based Analysis Tutorial
InterPro An Introduction
Presentation transcript:

Big Data Bioinformatics By: Khalifeh Al-Jadda

Is there any thing useful?!

What is Big Data? Volume Variety Velocity 800,000 Petabytes 35 Zettabytes 1 ZB = 10 6 PB

Solutions

Motivation The 1000 genomes project will produce 1 petabyte of data per year from multiple sources in multiple countries. The Reciprocal Smallest Distance (RSD) algorithm to generate orthology groups requires: – ((N)(N-1)/2)*M – Where N is the number of genomes and M represents the number of different parameter settings for evalue and divergence. With 1000 genomes and 12 parameters – the total number of processes required for a full complement of results would be 5,994,000. Further assuming that each individual process takes on average 4 hours (generally a lower bound for big genomes), and constant access to 300 cores of computer processing power, the total time to complete this task would be 79,920 hours, or 9.1 years Cloud computing for comparative genomics

Motivation Asian Individual Genome was assembled based on 3.3 billion reads ( 104 GB). African Individual Genome was assembled based on 4 billion reads ( 144 GB).

Motivation InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: are providers of hidden Markov models (HMMs).

Motivation Searching unknown sequences: – 10,000 average proteins – 1 average HMM (239 states) = 1.18 CPU sec BUT, – 17,159,442 proteins – 44,117 HMMs = 1.4 million CPU hrs

Success Crossbow (implemented over Hadoop) condenses over 1000 CPU hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster.

Bioinformatics projects Michael C. Schatz Ben Langmead Dennis P. Wall

CloudBurst Is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes.

Roundup Is a large-scale orthology database. The orthologs are computed using the Reciprocal Smallest Distance (RSD) algorithm. This algorithm detects more (and more accurate) orthologs than reciprocal best blast hits and gives each ortholog a score based on its maximum likelihood evolutionary distance. Every release of Roundup downloads the latest genomes from UniProt and computes orthologs for them.

Bioinformatics Tools ToolPlatformAuthors BLASTHadoopM. Gaggero Gene Set Enrichment Analysis (GSEA) HadoopM. Gaggero GRAMMARHadoopM. Gaggero NCBI BLAST2HadoopAndrea Matsunaga MSAHadoopSadasivam SeqWare query engineHbaseBrian O’Connor

Glad Tiding Hadoop and Hbase have been installed on a cluster at NERSC (40 nodes soon to double). Users interested in using cloud for their research may fill out the Magellan Cloud Computing statement of interest form.

Thanks