MES Genome Informatics I - Lecture V. Short Read Alignment

MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Genome Informatics I (2015 Spring)

Overview
Goal of this lecture
  You will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools
Short Read Alignment Theory
  Why do we need a special algorithm?
  The Burrows-Wheeler Transformation (BWT)
  BWT indexing
  LF search
  Examples
Practice with BWA
  with NA18507 sequences
Understanding alignment information
  Viewing/Converting SAM/BAM format
  Interpreting alignment information

Short read alignment theory

RAW NGS DATA (FASTQ)
@SRR764745.4352210/1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>F
@SRR764746.682219/1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
+
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR764746.2695391/1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
+
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>H
@SRR764746.1063237/1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
+
5FBCC@A*CHDFDDDDEFBDDGADFCBDFFEEGEGADEEAE4DEFFEGBEHE8;ADHD@DGGFCGDEDGFB==B?GNG@FMC@JFF>:FG=DDED=&>@A#
@SRR764746.5506495/1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
+
5HIDDDEEBDEEEFEEEFEFGFFEECFFGFFFFGFFFGDHGGCFGFGGFGGHDEFDFDHGGFGDGGFGFGFDFAEFBCFFFFJDIKCEEFACFBCA?;A@H
@SRR764746.5390417/1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
+
5IHCBE@EEFFDEDGDEDDCFEEGFEEEDFDFGEHEFFFHEBHABHDEDHGDGFFGDFFHEEGGDGHFIFFIEDGFGHGHHCJCIGCEEEHFAB?B@<EDA
@SRR764745.6298885/1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
+
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR764745.944258/1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
+
5FFDEFEFEDIH?CECEHEHCHIJI>BCCCIDFFFFIHIBHBHFAAFEGGFHMM8FDCDGIEHGAGG@BGAAFKH?6>DKDDNIK?9<FHGBICDBG@<<=
@SRR764745.15058086/1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
+
6NMHHFBGGFFEGHEEEIHIDIFGFDFFHFFEFEEGFIJGGGEHHLHIJEFHGHGHFFGGFJKHJJHHFFMHKNBEIFMMGLEIGJHMJCM@CA?FCD;GB


Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!

Brute-force way
Find “GATTCAAA” in the human genome
The reference genome (chr1, start): T G A C G A T C ... This is very long (3 billion bases)
Your query: G A T C, compared against the reference at every position, shifting by one base each time
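The brute-force idea can be sketched in a few lines of Python. This is only an illustration on a toy string; the function name `naive_find` and the example reference are assumptions made here, not part of the lecture material.

```python
def naive_find(reference: str, query: str) -> list[int]:
    """Return every 0-based offset where query occurs exactly in reference."""
    hits = []
    # Try every possible start position in the reference.
    for start in range(len(reference) - len(query) + 1):
        # Compare the query with the window that begins at `start`.
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


# Hypothetical toy reference; a real genome has about 3 billion bases.
toy_reference = "TGACGATCGATC"
print(naive_find(toy_reference, "GATC"))
```

Every read repeats this scan over the whole reference, which is why the per-read cost grows with the genome length.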

How fast should it be?

                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days

based on 200bp read length, 80x single-end WGS

Searching with index
Assume you are searching for “genome” in an English dictionary
You don’t search every line on every page
You first find the page range of “g” in the dictionary
Within that range (of ‘g’), you find the page range of “ge”
Within that range (of ‘ge’), you find the page range of “gen”
... until you find “genome”
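The dictionary analogy can be sketched as repeated range narrowing over a sorted word list. The word list and the helper names `prefix_range` and `narrow` are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_left


def prefix_range(words, prefix, lo, hi):
    """Half-open range [start, end) of entries in words[lo:hi] that begin with prefix."""
    start = bisect_left(words, prefix, lo, hi)
    # Any word starting with `prefix` sorts before prefix + (largest code point).
    end = bisect_left(words, prefix + chr(0x10FFFF), lo, hi)
    return start, end


def narrow(words, target):
    """Narrow the range one letter at a time: 'g', then 'ge', then 'gen', ..."""
    lo, hi = 0, len(words)
    for k in range(1, len(target) + 1):
        lo, hi = prefix_range(words, target[:k], lo, hi)
        if lo == hi:  # no entry has this prefix, stop early
            break
    return lo, hi


# Hypothetical miniature dictionary (must be sorted).
dictionary = sorted(["banana", "gene", "genome", "genus", "giraffe", "zebra"])
print(narrow(dictionary, "genome"))
```

Each step searches only inside the range found by the previous step, so the work per letter shrinks instead of rescanning the whole dictionary.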

Indexing genome
We are going to build an index of the genome so that a read sequence can be searched the way we search a word in an English dictionary

Burrows-Wheeler Transformation
Input string: BANANA

1. Append the end marker “$”, which is lexicographically smallest: BANANA$

2. List all rotations of the string (index = number of rotations)
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

3. Sort the rotations (sorted row, original index, rotation)
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

4. Read the last column of the sorted rotations: ANNB$AA

BWT(“BANANA$”) = “ANNB$AA”

BWT just changes the order of the characters in the string
BWT tends to collect similar characters together
With only the transformed string, we can easily get the original string
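The transformation can be reproduced with a minimal Python sketch that sorts all rotations explicitly. The function name `bwt` is an assumption for illustration, and the sketch relies on “$” sorting before the uppercase letters in ASCII. Real aligners build the transform from a suffix array rather than materializing every rotation.

```python
def bwt(text: str, end_marker: str = "$") -> str:
    """Burrows-Wheeler transform by explicitly sorting all rotations."""
    s = text + end_marker
    # All rotations of the string, rotation i starts at offset i.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()
    # The transform is the last column of the sorted rotation table.
    return "".join(row[-1] for row in rotations)


print(bwt("BANANA"))  # ANNB$AA
```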

Inverse BWT
We are given “ANNB$AA” (the last column of the sorted rotations)

1. Sort the characters to get the first column: $AAABNN
2. Attach the last column in front, row by row: A$ NA NA BA $B AN AN
3. Sort the rows: $B A$ AN AN BA NA NA
4. Attach the last column again and sort again
Repeat until every row is as long as the given string. The result is the full table of sorted rotations:

0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

The row that ends with “$” (BANANA$) is the original string
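The attach-and-sort procedure can be written as a short Python sketch. The function name `inverse_bwt` is hypothetical, and this naive version is meant to mirror the slide steps, not to be efficient.

```python
def inverse_bwt(last_column: str, end_marker: str = "$") -> str:
    """Rebuild the original string by repeatedly attaching the last column and sorting."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Attach the last column in front of every row, then sort the rows.
        # The first pass simply yields the sorted characters (the first column).
        rows = sorted(ch + row for ch, row in zip(last_column, rows))
    # After the final pass the rows are the sorted rotations;
    # the original string is the row that ends with the end marker.
    original = next(row for row in rows if row.endswith(end_marker))
    return original[:-1]


print(inverse_bwt("ANNB$AA"))  # BANANA
```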

LF Search
Question: Find “NAN” in BANANA

Sorted rotations of BANANA$ (sorted row, original index, rotation)
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

The query is matched from its last character backwards: N, then AN, then NAN

Step 1. The range of rows that start with “N” can be calculated from:
  the number of symbols that are lexicographically less than ‘N’, to determine the start point = 5
  the number of ‘N’ = 2, to determine the end point = 5 + 2 - 1 = 6
  range: rows 5 to 6

Step 2. The range of rows that start with “AN”
  Using only the number of symbols less than ‘A’ (= 1) and the number of ‘A’ (= 3) gives rows 1 to 3. This is a range for ‘A’, not ‘AN’!!
  A row “Ax” where x is less than ‘N’ is not “AN” and sorts before “AN”, so it must be skipped. The number of such rows is the count of ‘A’ in the last column before the previous start point (row 5) = 1
  start point = (number of symbols lexicographically less than ‘A’) + (number of ‘A’ in the last column before the previous start point) = 1 + 1 = 2
  end point = (number of symbols lexicographically less than ‘A’) + (number of ‘A’ in the last column up to the previous end point = 3) - 1 = 3
  range: rows 2 to 3

Step 3. The range of rows that start with “NAN”
  start point = (number of symbols lexicographically less than ‘N’) + (number of ‘N’ in the last column before the previous start point) = 5 + 1 = 6
  end point = 5 + (number of ‘N’ in the last column up to the previous end point = 2) - 1 = 6
  range: row 6 only

Result
Sorted row 6 (NANA$BA) has original index 2, which is the number of rotations of the original string, so “NAN” exists at the 3rd position of “BANANA”
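The range updates of the LF search can be sketched in Python as follows. The names `build_index`, `backward_search` and `less_than` are assumptions for this illustration, ranges are half-open, and occurrences are counted naively with `str.count`. Production aligners precompute these counts instead.

```python
def build_index(text: str, end_marker: str = "$"):
    """Return (suffix array, BWT string, table of 'symbols less than c')."""
    s = text + end_marker
    # Suffix array: start offsets of the rotations in sorted order.
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # Last column: the character just before each sorted rotation.
    bwt = "".join(s[i - 1] for i in sa)
    # less_than[c]: number of symbols in s lexicographically smaller than c.
    less_than = {c: sum(1 for x in s if x < c) for c in set(s)}
    return sa, bwt, less_than


def backward_search(pattern: str, sa, bwt, less_than) -> list[int]:
    """Return sorted 0-based offsets of exact matches of pattern."""
    start, end = 0, len(bwt)  # half-open range of sorted rows
    for c in reversed(pattern):  # N, then AN, then NAN
        if c not in less_than:
            return []
        # Count occurrences of c in the last column before each boundary.
        start = less_than[c] + bwt[:start].count(c)
        end = less_than[c] + bwt[:end].count(c)
        if start >= end:
            return []
    # Convert sorted rows back to positions in the original string.
    return sorted(sa[row] for row in range(start, end))


sa, bwt, less_than = build_index("BANANA")
print(backward_search("NAN", sa, bwt, less_than))  # [2]
```

The returned offset 2 is 0-based, which corresponds to the 3rd position of BANANA. The number of update steps depends on the query length, not on the length of the indexed text.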

Genome query
imported from Mike Schatz’s slide http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf


Inexact matching
Example sequence: T G A C G A T
When an exact match does not exist:
  continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
  if another mismatch occurs, branch out again
So the allowed edit distance is critical to alignment speed
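The branching described above can be sketched as a recursive backward search that also tries the non-matching symbols while a mismatch budget remains. This is a simplified illustration that handles substitutions only (no insertions or deletions); the function name `search_with_mismatches` and its parameters are hypothetical, and it is not how any particular aligner implements the idea.

```python
def search_with_mismatches(pattern: str, text: str, max_mismatches: int,
                           end_marker: str = "$"):
    """Return sorted (offset, mismatches) pairs for substitution-only matches."""
    s = text + end_marker
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(text))  # the end marker is never tried
    less_than = {c: sum(1 for x in s if x < c) for c in alphabet}
    hits = set()

    def extend(pos, start, end, mismatches):
        if pos < 0:  # whole pattern consumed: report every row in the range
            hits.update((sa[row], mismatches) for row in range(start, end))
            return
        for c in alphabet:
            # Branch: the expected symbol is free, any other costs one mismatch.
            cost = mismatches + (c != pattern[pos])
            if cost > max_mismatches:
                continue
            new_start = less_than[c] + bwt[:start].count(c)
            new_end = less_than[c] + bwt[:end].count(c)
            if new_start < new_end:
                extend(pos - 1, new_start, new_end, cost)

    extend(len(pattern) - 1, 0, len(bwt), 0)
    return sorted(hits)


print(search_with_mismatches("NAN", "BANANA", 1))
```

Each extra allowed mismatch multiplies the number of branches that have to be explored, which is the reason the mismatch limit has such a strong effect on speed.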

Goal achieved

                     time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days

Practice with BWA

BWA

bwa practice
In the cluster
>bwa

bwa process
bwa index: indexes the reference genome (one-time process), i.e. creates the BWT for the reference genome
bwa aln: calculates the suffix array (SA) coordinates
bwa samse (or bwa sampe for paired-end sequencing): converts the SA coordinates to chromosomal locations
Input for bwa
  reference genome
  fastq file (the raw NGS data)

reference data
“bwa index” will index the reference genome (so the reference is ready)
It has already been done here; do not try to do it again

sequence data
Pick one chromosome for yourself
Copy the fastq file to your directory (use the “cp” command)
example (copying chr8 NGS data to the rachmani directory)
>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/

run bwa aln
>bwa aln reference yourdata.fastq > yourdata.sai
example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai
write a job script runbwaaln.sh
submit to the cluster
>qsub runbwaaln.sh

run bwa samse
>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam
example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam
write a job script runbwasamse.sh
submit to the cluster
>qsub runbwasamse.sh
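The two single-end steps follow a fixed naming pattern (fastq to sai to sam), which a small Python helper can spell out. The function `bwa_single_end_commands` is a hypothetical helper written for this note; it only composes the command lines in the form used above and does not run bwa or quote unusual file names.

```python
def bwa_single_end_commands(reference: str, fastq: str) -> list[str]:
    """Compose the bwa aln and bwa samse command lines for one fastq file."""
    suffix = ".fastq"
    # Derive the .sai and .sam names from the fastq name.
    prefix = fastq[:-len(suffix)] if fastq.endswith(suffix) else fastq
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        f"bwa aln {reference} {fastq} > {sai}",            # SA coordinates
        f"bwa samse {reference} {sai} {fastq} > {sam}",    # chromosomal locations
    ]


for line in bwa_single_end_commands(
        "/data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa",
        "NA18507_chr8.01.fastq"):
    print(line)
```

The printed lines can be pasted into the job scripts that are submitted with qsub.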

the output
>less NA18507_chr8.01.sam
This is your first alignment with real NGS data

break
Please ask us any questions if you have problems (do not give up)
If possible, try mapping in paired-end mode
>bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam

The SAM Format
For more details about the SAM format, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf
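As a rough orientation, each alignment line in a SAM file has 11 mandatory tab-separated fields followed by optional tags. The Python sketch below splits one line into those fields and decodes two FLAG bits. The function name `parse_sam_line` and the example record are made up for illustration and are not output from the course data. Header lines (starting with “@”) are assumed to have been skipped already.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    # These mandatory fields are integers.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["tags"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    # Two commonly inspected FLAG bits.
    record["unmapped"] = bool(record["FLAG"] & 0x4)
    record["reverse_strand"] = bool(record["FLAG"] & 0x10)
    return record


# Hypothetical record, invented for this example.
example = "read1\t16\tchr8\t1000\t37\t8M\t*\t0\t0\tGATTCAAA\tIIIIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["reverse_strand"])
```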

SAM/BAM
SAM and BAM are interconvertible (exactly the same information)
SAM file
  human-readable text file
BAM file
  binary file, not human-readable
  compressed (much smaller size)
  can be indexed (for random access)

Converting SAM to BAM
>samtools view -Sb yourdata.sam > yourdata.bam
-S option means input is SAM format
-b option means output is BAM format

Sorting and Indexing BAM
>samtools sort yourdata.bam yourdata.sorted
will create yourdata.sorted.bam
(this is the samtools 0.1.x syntax; samtools 1.3 and later use: samtools sort -o yourdata.sorted.bam yourdata.bam)
>samtools index yourdata.sorted.bam
will create yourdata.sorted.bam.bai
Now everything’s ready

Visualizing alignment
IGV (Integrative Genomics Viewer)

Visualizing alignment
samtools tview yourdata.bam reference
example:
>samtools tview NA18507_chr8.01.sorted.bam /data/resource/reference/human/UCSC/hg19/BWAIndex/genome.fa
