BNFO 615 Usman Roshan. Short read alignment

Presentation transcript:

BNFO 615 Usman Roshan

Short read alignment
Input:
– Reads: short DNA sequences (up to a few hundred base pairs (bp)) produced by a sequencing machine
  – Reads are fragments of a longer DNA sequence present in the sample given as input to the machine
  – Usually in the millions
– Genome sequence: a reference DNA sequence much longer than the read length

Short read alignment
[Figure: short reads from a single individual aligned to the human genome reference. Labeled examples: a heterozygous SNP (ACCAG / ACCCG), a heterozygous indel (ATT--A / ATTGA), a homozygous SNP (ATTGA / ATTAA, encoded as 2), and a site where no variant is reported. Example encoding shown: (2, 1, 0, 1).]

Applications
– Genome assembly
– RNA splicing studies
– Gene expression studies
– Discovery of new genes
– Discovery of cancer-causing mutations

Short read alignment
Two approaches
– Hashing-based algorithms: BFAST, SHRIMP, MAQ, STAMPY (statistical alignment)
– Burrows-Wheeler transform: Bowtie, BWA
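The Burrows-Wheeler approach listed above can be illustrated with a toy exact-match search. This is a minimal sketch only, not how Bowtie or BWA are implemented: it builds the suffix array by naive sorting (fine for a short string, far too slow for a genome) and ignores mismatches and gaps. The function names `bwt_index` and `backward_search` are made up for this example.

```python
def bwt_index(text):
    """Build a toy BWT index: suffix array, C table, and occurrence counts."""
    text += "$"  # terminator, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[ch]: number of characters in text that are strictly smaller than ch
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern in the text."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, C, occ = bwt_index("ACGTACGTTAGC")
    print(backward_search("ACG", sa, C, occ))
```

The search consumes the read from its last base to its first, narrowing a range of suffix array rows at each step, so the cost depends on the read length rather than the genome length.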

BFAST overview
[Figure from PLoS ONE 4(11): e7767.]

BFAST algorithm PLoS ONE 4(11): e7767.
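BFAST is one of the hashing-based aligners named in this lecture. The general idea behind that family (index the reference by short keys, look up keys from the read, then verify candidate locations) can be sketched as follows. This is an illustration of seed-and-verify under simplifying assumptions (contiguous k-mers, substitutions only, Hamming distance), not the BFAST algorithm from the cited paper; all names and parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best candidate, or None if none qualifies."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # implied start of the whole read
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        d = hamming(read, reference[start:start + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (start, d)
    return best


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCTA"
    idx = build_kmer_index(ref, 4)
    # The read is ref[8:16] with its last base changed.
    print(align_read("TAGCCGAC", ref, idx, 4))
```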

BFAST masked keys
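A masked key can be read as a lookup key built only from the positions where a mask holds 1, so that a mismatch at a 0 position does not change the key. The sketch below assumes that reading; the mask "1101011" and the function names are invented for illustration and are not taken from BFAST.

```python
def masked_key(seq, mask):
    """Keep only the bases of seq at positions where mask has a '1'."""
    return "".join(base for base, m in zip(seq, mask) if m == "1")


def build_masked_index(reference, mask):
    """Index every reference window of len(mask) by its masked key."""
    index = {}
    width = len(mask)
    for i in range(len(reference) - width + 1):
        key = masked_key(reference[i:i + width], mask)
        index.setdefault(key, []).append(i)
    return index


def lookup(read_window, mask, index):
    """Candidate reference positions whose masked key equals the read's."""
    return index.get(masked_key(read_window, mask), [])


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCTA"
    mask = "1101011"  # hypothetical mask: positions 2 and 4 are ignored
    index = build_masked_index(ref, mask)
    # ref[8:15] is TAGCCGA; the read differs at position 2, which the mask ignores.
    print(lookup("TATCCGA", mask, index))
```

An exact lookup of the full 7-base window would miss this read, while the masked key still finds the true location; using several different masks covers mismatches at different positions.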

Short read alignment
Empirical performance:
Simulated data:
– Extract random substrings of fixed length with random mutations and gaps
– Realign back to the reference genome
Real data:
– Paired reads: two ends of the same sequence
– Count the number of paired reads within 500 to bases of each other
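Both evaluation schemes above can be sketched in a few lines. This is a minimal illustration under stated assumptions: substitutions and single-base deletions stand in for "mutations and gaps", the rates are arbitrary defaults, and the distance bounds for paired reads are parameters because the upper bound is not given in the slide. All names are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, length, sub_rate=0.01, gap_rate=0.0, seed=0):
    """Draw random fixed-length substrings, then apply substitutions and deletions.

    Returns a list of (true_start, read) so realignment can be scored later.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - length + 1)
        out = []
        for base in reference[start:start + length]:
            r = rng.random()
            if r < gap_rate:
                continue  # gap: drop this base from the read
            if r < gap_rate + sub_rate:
                base = rng.choice([b for b in "ACGT" if b != base])  # substitution
            out.append(base)
        reads.append((start, "".join(out)))
    return reads


def count_proper_pairs(pairs, min_dist, max_dist):
    """Count read pairs whose two mapped positions lie min_dist..max_dist apart.

    Each pair is (pos1, pos2); None marks an unmapped end.
    """
    return sum(
        1
        for p1, p2 in pairs
        if p1 is not None and p2 is not None and min_dist <= abs(p2 - p1) <= max_dist
    )


if __name__ == "__main__":
    ref = "ACGT" * 50
    for start, read in simulate_reads(ref, 3, 20, sub_rate=0.05):
        print(start, read)
    print(count_proper_pairs([(100, 700), (100, 5000), (None, 300)], 500, 1000))
```

On simulated data the true start is known, so accuracy is the fraction of reads placed back at `true_start`; on real data the pair distance serves as an indirect check because the true positions are unknown.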

Short read alignment
[Figure courtesy of Genome Res.]

Short read alignment
[Figure courtesy of Genome Res.]

Short read alignment

Metagenomics
Study of DNA sequences (fragments) from environmental samples
Bioinformatics problem: classify the DNA sequence of an organism
Methods:
– Sequence similarity
– Machine learning
Key papers

Metagenomics
Project: use an advanced GPU sequence similarity program to classify simulated reads
Data: available from a published study
Steps:
– Align simulated metagenomic reads to a set of genomes with the programs MaxSSmap and NextGenMap
– Since the reads are simulated, we know the true genome names
– Determine the number of correctly aligned pairs
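The last step, counting correctly aligned pairs, might look like the sketch below. It assumes the aligner output has already been parsed into a dictionary and that a pair counts as correct only when both mates map to the true genome; that criterion and the data layout are assumptions for illustration, not the project's specification.

```python
def count_correct_pairs(true_genome, mapped_genome):
    """Count read pairs whose two mates both map to the genome they came from.

    true_genome:   pair_id -> name of the source genome (known from simulation)
    mapped_genome: pair_id -> (genome of mate 1, genome of mate 2), None if unmapped
    """
    correct = 0
    for pair_id, truth in true_genome.items():
        mate1, mate2 = mapped_genome.get(pair_id, (None, None))
        if mate1 == truth and mate2 == truth:
            correct += 1
    return correct


if __name__ == "__main__":
    truth = {"p1": "E_coli", "p2": "B_subtilis", "p3": "E_coli"}
    mapped = {
        "p1": ("E_coli", "E_coli"),      # both mates correct
        "p2": ("B_subtilis", "E_coli"),  # one mate on the wrong genome
        # p3 was not aligned at all
    }
    n = count_correct_pairs(truth, mapped)
    print(n, "of", len(truth), "pairs correct")
```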

Genome alignment
Comparison of two large genome sequences
Bioinformatics problem: comparison of distantly related genome sequences like chicken and mouse
Methods:
– Mainly hash table based methods
Key papers

Genome alignment
Project: develop a GPU program for accurate genome alignment
Data: simulated genomes available from the published Alignathon study
Steps:
– Make fragments of one genome
– Align each fragment to the other genome (with short read mapping)
– Determine accuracy with the mafcomparator program
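The first step, cutting one genome into fragments that a short read mapper can handle, can be sketched as below. The fragment length and step size are hypothetical parameters; the project description does not specify them, nor whether fragments should overlap.

```python
def make_fragments(genome, length, step):
    """Cut a genome into fixed-length fragments, one every `step` bases.

    Returns (offset, fragment) pairs so each alignment can be translated back
    to coordinates in the original genome. A trailing piece shorter than
    `length` is dropped in this sketch.
    """
    return [
        (start, genome[start:start + length])
        for start in range(0, len(genome) - length + 1, step)
    ]


if __name__ == "__main__":
    for offset, fragment in make_fragments("ACGTACGTAC", 4, 3):
        print(offset, fragment)
```

Choosing a step smaller than the fragment length makes fragments overlap, which gives the mapper more than one chance to place each region of the genome.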

Whole genome phylogeny
Whole genome phylogeny to understand evolution at the genome level
Bioinformatics problem: compare phylogeny from whole genome data vs concatenated data vs multiple gene trees
Methods:
– Concatenate alignment
– Multiple gene trees
Key papers

Whole genome phylogeny
Project: determine accuracy of phylogenies from simulated whole genome data with three different methods
– Gene concatenation
– Tree reconciliation
– Whole genome distance based methods (new)
Data:
– Simulated genome sequences from the Evolver program
Steps:
– Construct trees with the three methods
– Determine their Robinson-Foulds distance
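The Robinson-Foulds distance in the last step counts the leaf bipartitions (splits) found in one tree but not the other. A minimal sketch follows, assuming trees are written as nested Python tuples of leaf names and compared as unrooted topologies without branch lengths; the representation and function names are choices made for this example. Real projects would normally parse Newick files with an established phylogenetics library instead.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaves induced by the internal edges."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # each split is stored as the side without the anchor
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same leaves")
    return len(splits(tree1) ^ splits(tree2))


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", "D"), "E")
    t2 = (("A", "C"), ("B", "D"), "E")
    print(robinson_foulds(t1, t2))
```

A distance of 0 means the two methods recovered the same unrooted topology; larger values mean more internal edges disagree.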

Variant detection
Variants in unmapped reads could contain
Bioinformatics problem: are there variants in unmapped reads?
Methods:
– GATK for variant detection
Key papers

Variant detection
Project: determine variants in unmapped reads
Data: reads from a 1000 Genomes human sample (NA12878) that are unmapped by BWA
Steps:
– Map reads with the MaxSSmap and NextGenMap programs
– Run GATK to determine variants

Projects
– Metagenomics: Clarissa and Komal
– Genome alignment: Kayla and Kathryn
– Risk prediction webserver: Edward and Ittehad
– Variant detection: Shijie and Sruti
– Whole genome phylogeny: Catherine and Samantha