B3- Olympic High School Bioinformatics


Dr. Jennifer Weller, May 2016
Sequencing Technologies
How do you go from the organism (the American Chestnut tree) to a sample you can collect (the leaf), to the part of the leaf of interest (the chloroplast), to the chloroplast DNA sequence (from sequencing), to putting the sequence together in order (a map) and labeling the parts (annotation), so that you can pick out the one particular region that you want to barcode? Part of this involves the wet-lab process that we do in the Summer Science camp, where we use wet-lab tools to purify the DNA sample and carry out the sequencing. Part of it is a computational process, which still requires tools and knowing how to use them. One of the things you can do is complete the circle and go back to the lab with better (more focused) assays.
9/19/2018 Dr. Weller B3 Olympic HS

Topics
Signal Detection: errors, base calls and quality scores
Sequence Assembly: overlaps, ordering, redundancy

Signal Quality
Types of signals
Types of errors
Quality scores

Primary and interpreted signal
The primary signal is what the instrument measures. For example, with the fluorescent-dye capillary electrophoresis system the dye passes the detector in a cloud, which is converted to a nicer visual form by signal processing software.

Bands are not equally/completely spaced
[Figure not reproduced: bands and detector]

Detector selects wavelength band
The dyes produce photons at wavelengths that overlap. Depending on where the detector is set, the signal might be quite specific, but not very sensitive. Because the dyes overlap, all wavelengths in the set are tracked: the software mathematically filters the fluorescence signal from each dye when there is overlap.

Signal Processing: Basecalling
Each base is labeled with its own dye. There is a background level of fluorescence: you set a level so that only a peak greater than that level is a 'real' band. There should be some minimal separation between peaks to be sure the order is correct. The dyes change the behavior of the fragments, so one type of chemistry behaves a little differently from another; you use training sets to account for this. What makes a good training set? The software will assign ONE base to each peak (A, C, G, T, or N if it really can't tell).


Data Processing: Quality scores
When sequence data was first collected, only the base calls were submitted. This led to bad science: a call that was a mistake could be interpreted as a mutation. Phil Green at the University of Washington first came up with the idea of assigning a quality score to predict the probability of a basecall error. His algorithm is called Phred, and you often hear people talk about 'Phred scores' even when they did not use that algorithm. FinchTV is an electropherogram viewer/editor that implements Phred; if there is time we will try it out. File extensions are .abi or .sff for sequence files and .phd for Phred calls. Different algorithms use different assessment methods: using the same raw data, Phred may assign a score of 20 while KB assigns a score of 15. A QV of 20 predicts an error rate of 1%.

What are some things you notice about these peaks? Is there a characteristic peak height and width for each color? Is there a characteristic separation between peaks at the top? Is there a characteristic separation at the bottom? Do some of the characteristics depend on the neighbors? Use a training set of known sequences, run them many times, look at the frequency of incorrect base calls, turn that into the probability that a given call is correct - basically a weight.

Q = -10 log10(P), where P is the estimated probability that the base call is wrong.
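The relationship between a quality value and an error probability can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the logarithm is base 10, which is what makes a QV of 20 correspond to a 1% error rate.

```python
import math

def phred_from_error_prob(p):
    """Quality value for an error probability p (0 < p <= 1), base-10 log assumed."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Error probability implied by a quality value q."""
    return 10.0 ** (-q / 10.0)

# Print the error probability implied by a few common quality values.
for q in (15, 20, 30):
    print(q, error_prob_from_phred(q))
```

Each step of 10 in Q divides the predicted error rate by 10, so Q30 means about one wrong call in a thousand.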

Using Quality Scores If you have 5 sequences of the ‘same’ gene but there is a sequence difference at several positions in two of the reads, do you conclude you have found a variant, or is there an error? If you have sequenced across a gene that is 5000bp long, using 12 overlapping sequences, which sequence are you less likely to use when assembling the virtual sequence (the contig)?

What about the NGS platforms?
Ion Torrent: the shape of the voltage response, whether there is background, how close the response is to a unit level, and whether there is a response for only one type of base in each set of flows.

Pacific Biosciences: this has a very high base call error rate (15%).
Illumina: the image has to be carefully registered, the dyes can overlap, etc.

Why don’t we have sequence viewer/editors for NGS platforms? It is possible to access the raw files, but each run has hundreds of thousands of reads. Keeping track of the subset you have looked at and the changes made will require a lot of bookkeeping and Data Science skills. The image files are extremely large and there is a huge stack of them – you will need some Data Science skills. Therefore biologists don’t tend to interact with the raw data and edit it – they take the sequence data with quality scores and focus on the downstream operations. Note: research does go on to improve our understanding of what affects the signal, the processing and therefore the data quality! Note: because the signal properties are so different, the quality scores for different platforms may not mean quite the same thing – combining data from different platforms may mean some filtering has to be done in addition to using the Q values.

FASTQ files for encoding base call + Q value
With electropherogram data, the Phred scores were stored in a separate (.phd) file. This is very inconvenient: what if they get separated? What if you use the wrong one? A single file prevents this, but you could do this in several ways: AAA then 202520, or A20A25A20.
Generally you will see 4 lines per read:
A line starting with @, with an identifier and an optional description
The base calls as letters (A, G, T, C or N)
A line starting with '+' (a separator, optionally followed by the same identifier)
The base call quality values, encoded as single ASCII characters (starting at code 33 and adding the Q value, so Q20 is ASCII code 53, which is the character 5)
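The four-line layout and the offset-33 encoding can be illustrated with a small reader. This is a minimal sketch, not a full FASTQ parser: the function name and the example record are made up, and it assumes each read occupies exactly four lines with no line wrapping.

```python
def parse_fastq(lines):
    """Yield (identifier, bases, quality_values) from 4-line FASTQ records.

    Assumes the offset-33 encoding described above and unwrapped records.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the 4-line layout")
        if len(bases) != len(qual):
            raise ValueError("base calls and quality string differ in length")
        # Each quality character encodes Q as (ASCII code - 33).
        yield header[1:], bases, [ord(c) - 33 for c in qual]

# Hypothetical record: the 'AAA' read with qualities 20, 25, 20.
record = ["@read1 example", "AAA", "+", "5:5"]
for identifier, bases, quals in parse_fastq(record):
    print(identifier, bases, quals)
```

Here the character 5 (code 53) decodes to Q20 and the colon (code 58) decodes to Q25, so the single record carries the same information as the separate 'AAA' and '202520' strings.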

ASCII characters, simplified
[Table not reproduced]

NCBI FASTQ
The identifier line is now being loaded with more information, but it is intended to help identify the exact sequence and sample: a database identifier after the @, then Instrument ID:Lane:Tile:X coordinate:Y coordinate. So what are the quality scores here? I = 40, 9 = 24, G = 38, C = 34.

QC Lab The Human Genome Project did not allow data that did not achieve a Phred Score of at least 30 for each base to be included in the final data set. What fastQ characters would you filter for to know what files to exclude? The amount of usable sequence in a read has to be at least 75 nucleotides, and they have to all be of Q30 or higher. If you trim off the ends of the sequences in the file (which is mostly where low values occur) which of the sequences is now usable for the project?
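One way to try the lab rule on a computer is sketched below. It assumes the offset-33 encoding, trims only from the two ends of a read, and then applies the 75-nucleotide and Q30 requirements; the function names and defaults are illustrative, not part of any standard tool.

```python
OFFSET = 33     # ASCII offset of the quality encoding (assumed)
MIN_Q = 30      # minimum quality for every base that is kept
MIN_LEN = 75    # minimum usable read length after trimming

def trim_low_quality_ends(bases, qual_string, min_q=MIN_Q):
    """Remove bases below min_q from both ends; the middle is left untouched."""
    quals = [ord(c) - OFFSET for c in qual_string]
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]

def is_usable(bases, qual_string, min_q=MIN_Q, min_len=MIN_LEN):
    """True if the trimmed read is long enough and every remaining base passes."""
    trimmed_bases, trimmed_quals = trim_low_quality_ends(bases, qual_string, min_q)
    return len(trimmed_bases) >= min_len and all(q >= min_q for q in trimmed_quals)

# With offset 33, Q30 is this character; anything earlier in ASCII order is below Q30.
print(chr(OFFSET + MIN_Q))
```

A read with a single low-quality base in the middle still fails here, because trimming the ends cannot remove it; splitting such a read into two pieces would be a different design choice.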

The Fragment Assembly Problem
Sequencing platforms produce high-quality sequence for only 200-500 nt lengths. Genes and genomes are much longer (a bacterial gene is ~3000 nt on average; a genome might be 3 million bp).
The DNA is broken into pieces you can sequence by one of 2 strategies:
Sequence walking: using restriction enzymes and cloning vectors
Shotgun sequencing: shearing randomly
The pieces have to be put back together:
Method 1: other data provides the order
Method 2: the pieces contain overlaps

Assembly Rules How much overlap (the k-mer where k is the overlap length) should you require? Are there ways you could estimate this length for a known genome? Should the quality score matter? Should you trim the sequence? Should you filter out sequences with low-quality bases in the middle? Is it useful to keep track of how many times a nucleotide has been incorporated (through individual reads) in the longer contig?

Randomness, size filtering and X-fold coverage
We have a 42-bp sequence. In one experiment we had only 2 copies, in one we had 4 copies and in one we had 6 copies. Random fragmentation was performed. Fragments of 8 bp and smaller were discarded; there was no upper limit (that is, all fragments longer than 8 bp were kept). Re-assemble the original sequence and record your strategy.
CATCACG CTTGTCG ATTTACT TGCATCG CATCACT TATTGCC
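A greedy overlap-merge is one simple strategy to compare with your own. The sketch below is only an illustration: the three fragments are hypothetical ones cut from the 42-bp sequence (not the fragments handed out in the exercise), the minimum overlap of 4 is an arbitrary choice, and real assemblers use more careful methods.

```python
ORIGINAL = "CATCACGCTTGTCGATTTACTTGCATCGCATCACTTATTGCC"  # the 42-bp sequence

# Hypothetical fragments, each longer than 8 bp, given in scrambled order.
FRAGMENTS = [ORIGINAL[24:42], ORIGINAL[0:18], ORIGINAL[12:30]]

def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=4):
    """Repeatedly merge the pair with the longest overlap until none is left."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining contigs do not overlap enough to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

print(greedy_assemble(FRAGMENTS) == [ORIGINAL])
```

Note that CATCAC occurs twice in the sequence, so if the required overlap is short and the fragments happen to end inside that repeat, a greedy merge can join the wrong pieces. A fragment lying entirely inside another is also not handled by this sketch.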

Quality Scores and 5X coverage To the previous rules, add that bases used in the actual assembly (the overlaps) must have a quality score of at least Q30, unless there are two fragments with the same base call in that position and neither has a quality score lower than Q25.
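The quality rule can be written as a check on a single position of the alignment. This is one possible reading of the rule, stated as an assumption: a position is accepted if any read covering it has Q30 or better, or if at least two reads make the same call and both are Q25 or better. The function name and the (base, quality) input format are made up for illustration.

```python
def position_is_usable(calls, solo_q=30, paired_q=25):
    """Apply the Q30 / two-agreeing-Q25 rule to one alignment position.

    calls is a list of (base, quality) pairs, one per read covering the position.
    Resolving positions where high-quality reads disagree is outside this sketch.
    """
    # Case 1: a single call of Q30 or better is enough.
    if any(q >= solo_q for _, q in calls):
        return True
    # Case 2: two reads agree on the base and neither is below Q25.
    support = {}
    for base, q in calls:
        if q >= paired_q:
            support[base] = support.get(base, 0) + 1
    return any(count >= 2 for count in support.values())

print(position_is_usable([("A", 26), ("A", 25)]))
print(position_is_usable([("A", 26), ("C", 27)]))
```

An overlap would then be accepted only if every position in it passes this check.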

Why we like longer reads, even at low quality The long sequence can only be used to order shorter contigs when 80% of the nucleotides are in the same order OR when all of the nucleotides in the overlap have a Q30 or better.

Algorithms and Software for the FAP
There are competitions for algorithms, using datasets that include specific types of problems. There are approaches to managing the computer memory problems, that is, how to handle millions of fragments efficiently. So this is not a solved problem. Some examples of software are given here, with reviews and recommendations, although some links are no longer active: http://omictools.com/genome-assembly-category