Intro to Bioinformatics – Assembling a Transcriptome
Tom Doak, Carrie Ganote
National Center for Genome Analysis Support
June 11, 2013

Summary
- What does the raw data look like?
- What is “sequence quality”?
- What needs to be done before assembly?
- What is assembly anyway?

Hot out of the oven!
The most common sequence format for raw “reads” is called FASTQ. It has 4 lines per sequence: an identifier line beginning with @, the called bases, a separator line beginning with +, and a quality string with one character per base.
There are many methods for getting to this point. Chemistry and technique, machinery, and approach can differ, but all must call bases and qualities.

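To make the four-line layout concrete, here is a minimal Python sketch that reads records from FASTQ text. The record shown is made up for illustration, the function name `read_fastq` is hypothetical, and the parser assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# A made-up record, for illustration only.
example = "@read1\nGATTACA\n+\nIIIIHH#\n"
records = list(read_fastq(io.StringIO(example)))
print(records)
```

For real data, an established parser (for example the one in Biopython) is usually a safer choice than a hand-written reader.
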
What is Sequence Quality?
The quality “score” is assigned by the sequencing machine as it reads a single base. It is a rough estimate of how ambiguous the signal is: how “sure” the machine is that it is labeling the base correctly.

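Quality scores are conventionally reported on the Phred scale, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The sketch below decodes a quality string into scores and error probabilities. It assumes the Phred+33 ASCII encoding (some older Illumina pipelines used an offset of 64, so check your data), and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Estimated probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)

scores = phred_scores("II5#")  # made-up quality string
print(scores)
print([error_probability(q) for q in scores])
```

On this scale a score of 20 corresponds to roughly a 1-in-100 chance of a wrong call, and a score of 40 to roughly 1 in 10,000.
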
What needs to be done before assembly?
Quality control:
- Assess the state of the reads using FastQC. (Demo)
- Trim and shape the reads based on your assessment using Trimmomatic.

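One common trimming strategy is a sliding window: scan the read from its start and cut it where the average quality inside a small window drops below a threshold. The sketch below illustrates the idea only. It is not Trimmomatic's implementation; the window size, threshold, and function name are illustrative assumptions, and Phred+33 encoding is assumed.

```python
def sliding_window_trim(sequence, quality, window=4, min_mean_q=15, offset=33):
    """Cut the read at the start of the first window whose mean Phred score is too low."""
    scores = [ord(ch) - offset for ch in quality]
    if not scores:
        return sequence, quality  # nothing to trim
    # For reads shorter than the window, a single shortened window is checked.
    for start in range(0, max(len(scores) - window + 1, 1)):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_mean_q:
            return sequence[:start], quality[:start]
    return sequence, quality

# Made-up read whose last four bases have very low quality.
trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed)
```

Because the cut is made at the start of the failing window, a good base that sits just before a low-quality run can be removed along with it.
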
What is assembly anyway?
An assembler attempts to create one long string of nucleotides from the millions of short pieces it is given (ideally, one string per mRNA transcript). There are many approaches to this puzzle problem.

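As a toy illustration of the puzzle, the sketch below joins two reads when the end of one matches the start of the other. This is a deliberately simplified picture: real assemblers must cope with sequencing errors and millions of reads and typically rely on graph structures, and nothing here reflects how any particular assembler works internally. The reads and function names are made up.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if they do not overlap enough."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

contig = merge_reads("ATGGCGTAC", "GTACCTGA")  # the reads share "GTAC"
print(contig)
```

The `min_overlap` parameter guards against joining reads that share only a few bases by chance.
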
What is assembly anyway?
We will explore the Trinity de novo assembler. De novo means “from scratch”: assembling without a reference.

What is assembly anyway?
Reference-based approaches instead align raw reads to a reference genome or transcriptome (e.g., with TopHat or Bowtie).

It finished! We’re done, right?
An assembler solves the computational problem of putting together a puzzle from tiny pieces. The output of the assembler is a guess, and we don’t know how accurate it is. We could look at:
- Basic stats of the assembly (“contigs”):
  - Number of contigs vs. expected number
  - N50: a length-weighted median contig length
  - Average length
  - Max length
- Check contigs against known genes with BLAST

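The basic statistics are easy to compute from a list of contig lengths. In the sketch below, N50 is taken to be the contig length such that contigs of that length or longer contain at least half of the assembled bases. The function name and the example lengths are made up for illustration.

```python
def assembly_stats(contig_lengths):
    """Return count, mean length, max length, and N50 for a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs")
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    # Walk from the longest contig down until half of all bases are accounted for.
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(lengths),
        "mean": total / len(lengths),
        "max": lengths[0],
        "n50": n50,
    }

stats = assembly_stats([100, 200, 300, 400, 500])
print(stats)
```

Here the mean is 300 but the N50 is 400, because N50 weights each contig by its length instead of counting every contig equally.
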
What could go wrong?
If the assembly came out poorly, handling the data differently could solve the problem.
- Use a more or less stringent quality cutoff
- Clean the data of “primers”
- Assemble with a different program or different parameters
- Normalize the data: this removes redundant reads from the set, making the dataset much smaller and the job easier for the assembler

What could go wrong?
Sometimes, the problem could be with the biological samples. More sequence will usually help.
- Genetic hiccups for the assembler: repeats, related genes
- Sample prep was incorrect or poorly suited
- Low “coverage”. Coverage is like layers of paint on a stubborn surface: too few can leave holes, or “gaps”

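The paint analogy can be made concrete by counting how many reads cover each position of a transcript; positions with a count of zero are the gaps. The transcript length, read positions, and function names below are made up for illustration, and read starts are assumed to be 0-based and non-negative.

```python
def per_base_depth(transcript_length, read_starts, read_length):
    """Count how many reads cover each position of a transcript (0-based starts)."""
    depth = [0] * transcript_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, transcript_length)):
            depth[pos] += 1
    return depth

def find_gaps(depth):
    """Return half-open (start, end) intervals where no read covers the transcript."""
    gaps = []
    gap_start = None
    for pos, d in enumerate(depth):
        if d == 0 and gap_start is None:
            gap_start = pos
        elif d > 0 and gap_start is not None:
            gaps.append((gap_start, pos))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depth)))
    return gaps

depth = per_base_depth(20, read_starts=[0, 2, 12], read_length=5)
mean_depth = sum(depth) / len(depth)
gaps = find_gaps(depth)
print(mean_depth, gaps)
```

With only three short reads, the average depth is below 1 and two stretches of the transcript are left uncovered, which is the situation where more sequence helps.
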
Fin
Thanks for watching!
Questions and comments: