Workshop Schedule Schedule has links to introductory presentations and the FungiDB workshops Tuesday 3rdWednesday.

Slides:



Advertisements
Similar presentations
In Silico Primer Design and Simulation for Targeted High Throughput Sequencing I519 – FALL 2010 Adam Thomas, Kanishka Jain, Tulip Nandu.
Advertisements

MCB Lecture #15 Oct 23/14 De novo assemblies using PacBio.
Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
SCHOOL OF COMPUTING ANDREW MAXWELL 9/11/2013 SEQUENCE ALIGNMENT AND COMPARISON BETWEEN BLAST AND BWA-MEM.
DNAseq analysis Bioinformatics Analysis Team
Next Generation Sequencing, Assembly, and Alignment Methods
Introduction to Short Read Sequencing Analysis
Lecture 14 Genome sequencing projects
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Genome sequencing and assembling
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
Bioinformatics Applications
Mon C222 lecture by Veli Mäkinen Thu C222 study group by VM  Mon C222 exercises by Anna Kuosmanen Algorithms in Molecular Biology, 5.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
Introduction to Short Read Sequencing Analysis
File formats Wrapping your data in the right package Deanna M. Church
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Next Generation DNA Sequencing
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
Quick introduction to genomic file types Preliminary quality control (lab)
The Changing Face of Sequencing
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
Human Genome.
billion-piece genome puzzle
Short read alignment BNFO 601. Short read alignment Input: –Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Accessing and visualizing genomics data
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Short Read Workshop Day 5: Mapping and Visualization
When the next-generation sequencing becomes the now- generation Lisa Zhang November 6th, 2012.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
DAY 2. GETTING FAMILIAR WITH NGS SANGREA SHIM. INDEX  Day 2  Get familiar with NGS  Understanding of NGS raw read file  Quality issue  Alignment/Mapping.
Lesson: Sequence processing
VCF format: variants c.f. S. Brown NYU
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.
Genome sequence assembly
Ssaha_pileup - a SNP/indel detection pipeline from new sequencing data
Introduction to Genome Assembly
Removing Erroneous Connections
Jin Zhang, Jiayin Wang and Yufeng Wu
CS 598AGB Genome Assembly Tandy Warnow.
Maximize read usage through mapping strategies
BIOINFORMATICS Fast Alignment
Introduction to Sequencing
Canadian Bioinformatics Workshops
BF528 - Sequence Analysis Fundamentals
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Fragment Assembly 7/30/2019.
Presentation transcript:

Workshop Schedule Schedule has links to introductory presentations and the FungiDB workshops Tuesday 3rdWednesday 4thThursday 5th AM Session 1Introductory Presentations FungiDB Exercises 3, 6FungiDB Exercises 9, 10 AM Session 2Introductory Presentations FungiDB Exercises 5, 4FungiDB Exercises 13 Lunch/Discussion PM Session 3FungiDB Exercise 1, 2FungiDB Exercises 7, 8FungiDB Exercises 11, 14 EOD/Discussion

FungiDB Exercises FungiDB is a genome database with integrated bioinformatics tools; similar to FlyBase, TAIR, PlantGDB FungiDB is part of EuPathDB and uses same software but is less mature. – Not as much data has been loaded – Not all application are available When possible we will do oomycete genomes for the FungiDB exercises – In one exercise we will use Fungi genomes because not enough oomycete data was available – In three exercises we will use EuPathDB genomes because the applications are not yet available These applications will be available in FungiDB in the future

Gene and Genome Sequencing Brent Kronmiller Center for Genome Research and Biocomputing Oregon State University

From Sequence to Genome We’ve got our sequence off the machines, what is it and what do we do with it? Three versions of sequencers – Gen1: Sanger ~800bp sequence – Gen2: Illumina bp sequence – Gen3: PacBio +5000bp We still can’t sequence a genome from end to end Using these small sequences we need to assemble a 100Mb genome

FASTQ Files FASTQ is the standard flat file sequence read format with integrated quality scores Extension of FASTA, but with two extra lines 1) Header, starts with a instead of ‘>’ 2) Sequence 3) Quality header – usually not used ‘+’ 4) Quality scores, one per base to line up with 1:N:0:CGATGT CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA + >DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGG FASTA FASTQ

Sequence Quality Scores Each base of a sequence is assigned a quality score. Quality is (usually) used by the assembler or aligner to determine the validity of the overlap and the consensus quality. Phred log scale: Q40 is the max score for a sequence base. Depending on the calling software bases in a sequence can be influenced by surrounding bases and show a score higher than Q40. Phred ScoreChance the base was called incorrectly Q101 in 10 Q201 in 100 Q301 in 1,000 Q401 in 10,000

FASTQ Scoring Quality scores characters can be translated to Phred scores by looking up the Ascii value (Decimal value – 33) – Illumina has used 4 version of FASTQ quality scores – be careful that you have told your software which version to 1:N:0:CGATGT CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA +

Alignments vs Assemblies Alignment: align your sequences to a reference genome – Quicker, each sequence is compared to the reference sequence Assembly: de novo reconstruction of genome from sequences – Slower, each sequence is compared to each other Applications: AlignmentAssembly SNP Identification RNAseq ChIPseq Re-Sequence Genome Sequencing Transcriptome Seq

Alignment Programs Alignment programs use one of two algorithms; – Hash table – Burrows Wheeler Transformation (BWT) Hash table is a data structure in programming – Quick lookup of exact matching sequence – Sorting is not necessary – Programs that use Hash MAQ, ELAND, SHRiMP, SOAP BWT – Faster than Hash aligners – Reference genome is pre-processed into a quickly searchable sorted index, subsequent assemblies will not need to reindex – Programs that use BWT BWA, Bowtie, SOAP

Genome Assembly The genome is broken into fragments, these are sequenced and assembled to reconstruct the genome. Hierarchical Shotgun SequencingWhole Genome Shotgun Sequencing BACs Minimal BAC Path Plasmids End Sequenced Typically sequenced at 6-8x coverage before finishing - Sequenced at much higher depth - But locations of sequences are random across whole genome, some areas will be sparse

Hierarchical Genome Assembly Possible Assembly Issues: Simple Repeat Transposable Element Repeat Gene Family Tandem Duplicated Gene TE Terminal Repeat Low Coverage Constrained mate paired ends can alleviate some of these assembly issues. GCGCGC The few gaps can be closed with finishing techniques Relative location on genome is known BAC

Whole Genome Assembly Possible Assembly Issues: Simple Repeat Transposable Element Repeat Gene Family Tandem Duplicated Gene TE Terminal Repeat Low Coverage GCGCGC Chromosome

Solutions for WGS What can we do about these issues? – Paired end sequences – Mate pair sequencing (long range) – Longer sequences – Hybrid sequence types (2 nd Gen and 3 rd Gen) – Long range libraries to span issues Gene region enhancement for WGS – Methyl filtration, HiCot Transcriptome sequencing – As a hybrid assembly RNAseq – With reference – de novo Or, only target the genes

Assembly Programs Hierarchical assemblers can use an Overlap-layout-consensus – Graph is constructed from overlapping reads – Phrap, arachne, etc A WGS assembly, expecially with 2 nd -Gen will have too many sequences Many short read assemblers use de Bruijn graph algorithm – ABySS, Velvet, ALLPATHS, SOAPdenovo – Uses fixed-length K-mer substrings – Assembler doesn’t store sequences, just counts of K-mers Interestingly, with long sequences from Gen3 sequencers, overlap-layout-consensus is making a comeback – We don’t want to chop up the long sequences into k-mers

WGS - de Bruijn Graph Sequences are chopped up into overlapping substrings (k- mers) – K-mer length decided by user, generally determined based on read length along with other factors, like expected depth and genome size Path is created across all k-mers Repeated regions will determine the complexity of the graph Errors or missing sequence will directly affect the ability to find the correct path AGTGTAGATCTGATCCATTT 4-mers AGTGTAGATC GTAGATCTGA TGATCCATTT AGTG-GTGT-TGTA-GTAG-TAGA-AGAT-GATC GTAG-TAGA-AGAT-GATC-ATCT-TCTG-CTGA TGAT-GATC-ATCC-TCCA-CCAT-CATT-ATTT Sequences Genome AGTG-GTGT-TGTA-GTAG-TAGA-AGAT ATCC-TCCA-CCAT-CAAT-ATTT ATCT-TCTG-CTGA-TGAT GATC de Bruijn Graph

N50 for Assembly Assessment To calculate N50 for an assembly: – Order all contigs produced in the assembly by size – Calculate total length of all contigs – Find the contig where 50% of the total length of all contigs are found in that contigs of that size or greater, e.g. 100kb 95kb 80kb 70kb 60kb 50kb 40kb 30kb 25kb 20kb 10kb 5kb 605Kb / 2 = 302.5Kb 100Kb+95Kb+80Kb+70Kb = 345Kb N50 = 70Kb – N50 does not assess quality, either from sequence quality or correctness of sequence overlaps

Assembly/Alignment File Formats Assembly: output is a FASTA file – Multiple FASTA of the contigs – contiguous sequence assemblies – Multiple FASTA of the scaffolds – contigs joined when their order and orientation is known by spanning sequences from paired-end or mate pair sequences Alignment: SAM/BAM has become the standardized output format SAM: Sequence Alignment/Map format BAM: Binary SAM – Binary format allows for quicker retrieval and indexing of information – For each sequence read give information on where it aligns and how it matches

SAM/BAM Format Header section (optional) – Form TAG:VALUE TAG:VALUE … – RECORD and TAG are 2-letter codes – 5 RECORD and 25 TAG categories some are required, some optional Alignment section – One line per sequence in assembly – 11 required fields – 37 possible optional tags: TAG:TYPE:VALUE format found after field 11 Notice header line begins with same as FASTQ header. If your assembler doesn’t remove the from the sequence name the alignment section will get confused for the header VN:1.3 SN:ref – header of the header VN – SAM version SO – sorting – Reference SN – Ref name LN – Ref length r ref M2I4M1D3M = TTAGATAAAGGATACTG * r002 0 ref S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref H6M * 0 0 AGCTAA * r004 0 ref M14N5M * 0 0 ATAGCTTCAGC * 1Seq name7Ref Mate 2Bit Flag8Position Mate 3Ref name9Insert Length 4Position10Sequence (RC) 5Map quality11Phred Quality 6CIGAR string

Bitwise Flag A numerical code to give you the status of the read in the assembly – 163 = Has pair Has alignment Pair is reversed First of the pair to align 2 nd best alignment – 12 = 2+10 Has alignment Reverse-complemented – 83 = Has pair Has alignment Second of the pair to align BitDescription 0x1template having multiple segments in sequencing 0x2each segment properly aligned according to the aligner 0x4segment unmapped 0x8next segment in the template unmapped 0x10SEQ being reverse complemented 0x20SEQ of the next segment in the template being reversed 0x40the first segment in the template 0x80the last segment in the template 0x100secondary alignment 0x200not passing quality controls 0x400PCR or optical duplicate

CIGAR String A compact view of the sequence in the assembly – 8M2I4M1D3M 8 bases match 2 bases inserted in sequence relative to ref 4 bases match 1 base deleted in sequence relative to ref 3 bases match – 6M14N5M 6 bases match 14 bases skipped 5 bases match MAlignment Match IInsertion to the reference DDeletion from the reference NSkipped sequence (intron in RNAseq) SSoft clipped HHard clipped PPadded =Sequence Match XSequence Mismatch ref …ATGTTAGATAA**GATAGCTGTGC… seq TTAGATAAAGGATA*CTG ref …TGATAGCTGTGCTAGTAGGCAGTCAGCGCC… seq ATAGCT TCAGC