Applied Bioinformatics Week 5. Topics Cleaning of Nucleotide Sequences Assembly of Nucleotide Reads.


Applied Bioinformatics Week 5

Topics –Cleaning of Nucleotide Sequences –Assembly of Nucleotide Reads

Theoretical Part I –DNA sequencing –Next generation sequencing –Cleaning nucleotide sequences

DNA Sequencing Sanger Method –Please explain Other methods –Too many to discuss

Shotgun Sequencing Many short (~700 nt) sequences Human genome sequencing project –Finished? How can you make sense of these sequences? Contrast: –Genome walking

Next Generation Sequencing Increases the throughput of sequencing –More sequence per unit time –Not more sequence per read (still around 500 nt) Many commercial platforms available –454 pyrosequencing –Illumina (Solexa) sequencing –... Price is dropping –Whole genomes in a day

454 Pyrosequencing

Illumina sequencing

Where is your DNA from? Did you just clone and sequence it? Did you send a sample to a company? Did you find the sequence in a database? Better make sure it is correct and clean.

Vector Contaminations Long DNA pieces are fragmented and cloned into vectors before sequencing. This usually causes some amount of vector to be sequenced along with the insert. image: Wikipedia

Adapter Contaminations Long DNA pieces are fragmented and adapter sequences are ligated to both ends of the fragments before sequencing. This causes adapters to be sequenced along with the desired sequence.

Contaminations Cause Misassembly If contaminations are not removed from genomic sequences, one important consequence is that the sequences are misassembled.

Cleaning Contaminations Several approaches and tools to clean vector contaminations from genomic sequences have been developed. Most of them rely on a reference vector library, including: –LUCY, LUCY2 –SeqTrim –DeconSeq –TagCleaner –cross_match –SeqClean –VecScreen

Problem Definition A vector is a circular DNA sequence. Once vectors have been linearized in reference libraries, vector contaminations around the linearization point can no longer be detected and cleaned by currently available tools.

UniVec A vector library by NCBI Problems: –Has complete sequences for only 8 vectors, although full-length sequences for the rest are also available in public databases. –Only these 8 vectors are appended to themselves by 49 nt to overcome the circularization problem. –Some vectors are divided into partitions for no apparent reason. –Some adapter sequences are appended to themselves as well, whereas others are not.

Previous Solution –Not designed for entire libraries –Proposes cutting the first 60 nucleotides from the start of a vector sequence and pasting them to the end using a simple text editor –No implementation is available any longer Y.-A. Chen, C.-C. Lin, C.-D. Wang, H.-B. Wu, and P.-I. Hwang, “An optimized procedure greatly improves EST vector contamination removal,” 2007.

Our Solution –Appending all vector sequences in a reference library (or a subset filtered by the user) to themselves, either in full or by their first n nucleotides (n chosen by the user) –As customizable as possible, but still efficient with a single click –Has a GUI for target users

Our Solution Possible Customisations –Cleaning appendices that were already introduced in the library –Filtering the sequences by a keyword in their definition lines and/or by length –Virtual Circularization: appending sequences to themselves by their first n nucleotides
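The virtual circularization idea can be illustrated with a minimal Python sketch. It is not the authors' tool: the function name `virtually_circularize`, the toy vector and the contaminant are hypothetical, and a plain substring test stands in for the alignment-based screening that real cleaning tools perform.

```python
from typing import Optional


def virtually_circularize(vector: str, n: Optional[int] = None) -> str:
    """Append the first n nucleotides of a linearized vector to its own end.

    With n=None (or n longer than the vector) the whole sequence is appended.
    """
    if n is None or n > len(vector):
        n = len(vector)
    return vector + vector[:n]


# Hypothetical toy vector, linearized at an arbitrary point.
vector = "GATTACAGGCCTTAACGT"

# A contaminating fragment that spans the linearization point:
# the last 5 nt of the vector followed by its first 5 nt.
contaminant = vector[-5:] + vector[:5]

print(contaminant in vector)                            # False
print(contaminant in virtually_circularize(vector, 5))  # True
```

The fragment is invisible in the linearized entry but is found once the first n nucleotides are appended, which is the effect the appended library is meant to achieve.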

Efficiency of Our Method
Datasets: –Every 600th EST –P. somniferum EST –Artificial Data
Vector Libraries: –rawUV –cleanUV –appUV
[Table: The Percentage of Sequences Cleaned; columns rawUV, cleanUV, appUV; rows Every 600th EST, P. somniferum EST, Artificial Data]
[Table: The Percentage of Nucleotides Cleaned; columns rawUV, cleanUV, appUV; rows Every 600th EST, P. somniferum EST, Artificial Data]

Theoretical Part I Mind Mapping Break 10 min

Practical Part I

Screening for Vector seqs Get the U87251 sequence (FASTA) –What is this number? –Enter the sequence and run the analysis What do you see as a result? –Would you continue with the experiment? –Would you discard the sequence?

Sequencing Since we cannot do any sequencing here, we have to prepare a simulation.
1. Select a nucleotide sequence of about bases
2. Copy and paste that sequence into Word
   1. 3 times
   2. Separated by empty lines

Sequencing
3. Arbitrarily add line breaks into the resulting document
   1. At least 30 (10 per copy min)
   2. Spread out throughout the sequence
4. Add a FASTA definition line after each line break
   –Use >Copy-N-Fragment-X as a template for the definition line
Ensure that the overall number of characters is less than 50000
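For those who prefer scripting to a word processor, the same simulation can be sketched in Python. This is only an illustration of the manual steps: the function names, the fixed random seed and the placeholder input sequence are assumptions, not part of the course material.

```python
import random


def simulate_reads(sequence, copies=3, cuts_per_copy=10, seed=0):
    """Cut each copy of the sequence at random positions into fragments.

    Returns (definition line, fragment) pairs named after the
    Copy-N-Fragment-X template.
    """
    rng = random.Random(seed)
    records = []
    for copy_no in range(1, copies + 1):
        # Distinct cut positions strictly inside the sequence,
        # so that no fragment is empty.
        cuts = sorted(rng.sample(range(1, len(sequence)), cuts_per_copy))
        bounds = [0] + cuts + [len(sequence)]
        for frag_no, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
            name = f"Copy-{copy_no}-Fragment-{frag_no}"
            records.append((name, sequence[start:end]))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in records) + "\n"


# Placeholder input; replace it with the sequence you selected.
records = simulate_reads("ACGT" * 30)
fasta_text = to_fasta(records)
print(fasta_text[:60])
```

Ten cuts per copy yield eleven fragments per copy, so three copies give 33 FASTA records. The total character count of `fasta_text` should still be checked against the 50000 limit.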

Practical Part I 10 min break

Theoretical Part II Sequence Assembly

Assembling Sequences Shotgun sequencing –Sequence fragments –Find overlapping fragments –Build contiguous sequences (contig) –Assemble into whole genomes Genetic and physical maps –Help orient fragments and contigs Problems with repetitive sequences

Sequence Tagged Sites Physical map Up to 200 bp long Unique for a region of the genome STS reference map –Map to assemble BAC/ PAC clones –Repeat process to map contigs to clones

Sequence Tagged Site Chromosome Sequence Tagged Site Endonuclease Site The restriction enzyme should digest the DNA into fragments approximately 200 kb long

Fragments with STS If it fits into a plasmid (Up to 10 kb) Up to 700 kb! Shortest Chromosome (21) 47 Mb -> 250 BACs

1 BAC -> 10 – 50 Plasmids / Cosmids Plasmid / Cosmid

Primer Polymerase Chain Reaction will lead predominantly to: Use several nucleases EcoRI BamHI HindIII Target ~ 1000 nucleotides

Restriction Sequence with degenerate primers? or subclone and sequence
Clone01: ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
Clone02: TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG...
Clone20: GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT...
Sequencing

Clones
Clone01 ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
Clone20 GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT
[Figure: the last 12 nucleotides of Clone20 (ACCGACTACGAT) align with the first 12 nucleotides of Clone01]
>Clone01
ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
>Clone02
TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG...
>Clone20
GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT...
Smith-Waterman or a more specialized algorithm, all vs. all (the comparison matrix is symmetric) Check here as well
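The all-vs-all overlap search can be sketched in a few lines of Python. As a loud simplification, this sketch looks only for exact suffix/prefix matches, whereas the slide names Smith-Waterman, which also tolerates mismatches and gaps. The function names and the `min_length` threshold are assumptions; the two sequences are Clone01 and Clone20 as shown on the slide.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    for k in range(min(len(a), len(b)), max(min_length, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a, b, min_length=3):
    """Join a and b over their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_length)
    return a + b[k:] if k else None


reads = {
    "Clone01": "ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT",
    "Clone20": "GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT",
}

# All vs. all: every ordered pair is tested, because "a before b"
# and "b before a" are different layouts.
for (name_a, a), (name_b, b) in permutations(reads.items(), 2):
    k = overlap(a, b, min_length=8)
    if k:
        print(f"{name_a} -> {name_b}: overlap of {k} nt")
```

The number of comparisons grows with the square of the number of reads, which is one reason assembly becomes computationally expensive.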

[Figure: overlapping clone sequences (Clone20 GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT, Clone01 ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT) tiled along the chromosome; not proportional]
For each plasmid, the BAC and therefore the position on the chromosome is known. Sequencing all plasmids will give the complete sequence of the genome. Caution: highly simplified! Why? What does coverage mean?
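As one possible starting point for the coverage question, average coverage is commonly computed as the total number of sequenced bases divided by the length of the target. The sketch below assumes this definition and uses the Lander-Waterman style approximation that reads start uniformly at random; the function names and example numbers are illustrative only.

```python
import math


def coverage(n_reads, read_length, genome_length):
    """Average number of times each base is sequenced: c = N * L / G."""
    return n_reads * read_length / genome_length


def expected_uncovered_fraction(c):
    """Approximate fraction of the target covered by no read at all,
    assuming read start positions are uniformly random (e ** -c)."""
    return math.exp(-c)


# Example: 1000 reads of ~700 nt from a 100 kb target.
c = coverage(1000, 700, 100_000)
print(c)
print(expected_uncovered_fraction(c))
```

Even at several-fold coverage a small fraction of the target is expected to remain unsequenced, which is part of why the tiling picture is highly simplified.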

Assembling Software As you just saw, assembling sequences is computationally expensive. Therefore most assembly software is not offered as an online service, but it is often freely available for download.

Theoretical Part II Mind mapping 10 min break

Practical Part II

Restriction Maps You sent a sample for sequencing. You might want to check whether the sequence makes sense. What is a restriction map?
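A restriction map lists where given restriction enzymes cut a sequence. The Python sketch below computes one for the three enzymes named earlier (EcoRI, BamHI, HindIII), assuming a linear sequence. The function names and the toy sequence are hypothetical, while the recognition sites and cut offsets are the standard ones for these enzymes.

```python
import re

# Recognition site and cut offset within the site (top strand):
# EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def restriction_map(sequence):
    """Return, per enzyme, the 0-based positions after which it cuts."""
    sequence = sequence.upper()
    result = {}
    for name, (site, cut_offset) in ENZYMES.items():
        # A lookahead is used so that overlapping sites would also be reported.
        result[name] = [m.start() + cut_offset
                        for m in re.finditer(f"(?={site})", sequence)]
    return result


def fragment_lengths(sequence_length, cuts):
    """Lengths of the fragments of a linear molecule cut at the given positions."""
    bounds = [0] + sorted(cuts) + [sequence_length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical toy sequence with one EcoRI and one BamHI site.
seq = "AAAA" + "GAATTC" + "CCCC" + "GGATCC" + "TTTT"
sites = restriction_map(seq)
print(sites)
all_cuts = [pos for cuts in sites.values() for pos in cuts]
print(fragment_lengths(len(seq), all_cuts))
```

Comparing the predicted fragment lengths with the pattern expected for the cloned insert is one way to check whether a returned sequence makes sense.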

CAP3 Assembly GOTO: Use the sequences you prepared earlier to assemble them with cap3 Analyze the results –Did you get a full correct assembly?