Published by Aubrey Alexander. Modified over 9 years ago.
1
MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment
Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Genome Informatics I (2015 Spring)
2
Overview
Goal of this lecture: you will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools.
Short read alignment theory
- Why do we need a special algorithm?
- The Burrows-Wheeler Transformation (BWT)
- BWT indexing
- LF search
- Examples
Practice with BWA on NA18507 sequences
Understanding alignment information
- Viewing/converting SAM/BAM format
- Interpreting alignment information
3
Short read alignment theory
4
RAW NGS DATA (FASTQ)
@SRR /1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>F
@SRR /1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR /1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>H
@SRR /1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
@SRR /1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
@SRR /1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
@SRR /1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR /1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
@SRR /1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG
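As a rough illustration of how such raw data can be read programmatically, here is a minimal Python sketch. It assumes a well-formed FASTQ stream with exactly four lines per record (header starting with "@", sequence, separator starting with "+", quality string). The names FastqRecord and read_fastq and the record "read1/1" are hypothetical and not part of the lecture.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str   # base calls
    quality: str    # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical one-record input, shortened for illustration.
example = io.StringIO("@read1/1\nTAACACCTGG\n+\n5FIFEFHFGH\n")
records = list(read_fastq(example))
print(records[0].name, len(records[0].sequence))
```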
5
Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
6
Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!
7
Brute-force way
Find “GATTCAAA” in the human genome
The reference genome (chr1, start): T G A C G A T C ... (this is very long: 3 billion bases)
Your query: G A T C (compared against the reference at each successive position)
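The brute-force idea can be sketched in a few lines of Python. This is only an illustration of sliding the query along the reference one position at a time; the function name naive_find_all is hypothetical, and the toy reference is the eight bases shown on the slide.

```python
from typing import List


def naive_find_all(reference: str, query: str) -> List[int]:
    """Return every 0-based start position where query matches exactly."""
    hits = []
    # Try every possible start position in the reference.
    for start in range(len(reference) - len(query) + 1):
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


reference = "TGACGATC"
hits = naive_find_all(reference, "GATC")
print(hits)
```

Each read costs on the order of (reference length) x (read length) character comparisons in the worst case, which is why this approach does not scale to a 3-billion-base genome.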
8
How fast should it be?
method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
Based on 200 bp read length, 80x single-end WGS.
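The number of reads behind the "80x WGS" column can be derived from the footnote. The sketch below assumes a genome of about 3x10^9 bases (the "3 billion" figure used earlier in the lecture) and a 365-day year. The constant and function names are hypothetical.

```python
GENOME_SIZE_BP = 3e9        # assumption: about 3 billion bases
COVERAGE = 80               # 80x whole-genome sequencing
READ_LENGTH_BP = 200        # single-end reads of 200 bp
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def reads_needed(genome_size: float, coverage: float, read_length: float) -> float:
    """Number of reads required to reach the given average coverage."""
    return genome_size * coverage / read_length


def total_seconds(seconds_per_read: float) -> float:
    """Total mapping time when every read takes seconds_per_read."""
    return seconds_per_read * reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)


n_reads = reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
eyeballing_seconds = total_seconds(3e9)
eyeballing_years = eyeballing_seconds / SECONDS_PER_YEAR
print(f"{n_reads:.2e} reads, {eyeballing_seconds:.2e} s, {eyeballing_years:.2e} years")
```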
9
Searching with index
Assume you’re searching for “genome” in an English dictionary.
You don’t search every line on every page.
- You first find the page range of “g” in the dictionary
- In the above range (of ‘g’), you find the page range of “ge”
- In the above range (of ‘ge’), you find the page range of “gen”
- ... until you find “genome”
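The dictionary lookup above can be sketched as repeated range narrowing over a sorted word list. This is an illustrative analogy only, not how BWA stores its index; the word list and the function name prefix_range are made up for the example.

```python
from bisect import bisect_left
from typing import List, Tuple


def prefix_range(sorted_words: List[str], prefix: str, lo: int, hi: int) -> Tuple[int, int]:
    """Half-open range within sorted_words[lo:hi] of entries starting with prefix."""
    start = bisect_left(sorted_words, prefix, lo, hi)
    # Smallest string that sorts after every word with this prefix:
    # bump the last character of the prefix by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_words, upper, lo, hi)
    return start, end


words = sorted(["apple", "gate", "gene", "genome", "genomics", "genus", "giraffe", "zebra"])
target = "genome"

lo, hi = 0, len(words)
for length in range(1, len(target) + 1):
    # Narrow the previous range using one more character of the target.
    lo, hi = prefix_range(words, target[:length], lo, hi)
    print(target[:length], words[lo:hi])
```

Each step only looks inside the range found by the previous step, so the work depends on the length of the word rather than on the size of the dictionary.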
10
Indexing genome
We are going to build an index of the genome so that a read sequence can be searched the same way we look up a word in an English dictionary.
11
Burrows-Wheeler Transformation
BANANA
12
Burrows-Wheeler Transformation
BANANA$
“$” is appended as an end marker; it is the lexicographically smallest symbol.
13
Burrows-Wheeler Transformation
BANANA$
ANANA$B
14
Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA
15
Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
16
Burrows-Wheeler Transformation
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
17
Burrows-Wheeler Transformation
0 BANANA$             0  6  $BANANA
1 ANANA$B             1  5  A$BANAN
2 NANA$BA             2  3  ANA$BAN
3 ANA$BAN   sort ->   3  1  ANANA$B
4 NA$BANA             4  0  BANANA$
5 A$BANAN             5  4  NA$BANA
6 $BANANA             6  2  NANA$BA
18
Burrows-Wheeler Transformation
0 BANANA$             0  6  $BANANA    A
1 ANANA$B             1  5  A$BANAN    N
2 NANA$BA             2  3  ANA$BAN    N
3 ANA$BAN   sort ->   3  1  ANANA$B    B
4 NA$BANA             4  0  BANANA$    $
5 A$BANAN             5  4  NA$BANA    A
6 $BANANA             6  2  NANA$BA    A
Last column: ANNB$AA
19
Burrows-Wheeler Transformation
0 BANANA$             0  6  $BANANA    A
1 ANANA$B             1  5  A$BANAN    N
2 NANA$BA             2  3  ANA$BAN    N
3 ANA$BAN   sort ->   3  1  ANANA$B    B
4 NA$BANA             4  0  BANANA$    $
5 A$BANAN             5  4  NA$BANA    A
6 $BANANA             6  2  NANA$BA    A
Last column: ANNB$AA
BWT(“BANANA$”) = “ANNB$AA”
20
Burrows-Wheeler Transformation
0 BANANA$             0  6  $BANANA    A
1 ANANA$B             1  5  A$BANAN    N
2 NANA$BA             2  3  ANA$BAN    N
3 ANA$BAN   sort ->   3  1  ANANA$B    B
4 NA$BANA             4  0  BANANA$    $
5 A$BANAN             5  4  NA$BANA    A
6 $BANANA             6  2  NANA$BA    A
Last column: ANNB$AA
BWT(“BANANA$”) = “ANNB$AA”
- BWT just changes the order of the characters in the string
- BWT tends to collect similar characters together
- With only the transformed string, we can easily get the original string
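The construction above (list all rotations, sort them, read off the last column) can be written directly in Python. This is a minimal sketch for small strings; it assumes the text ends with "$" and that "$" sorts before every other symbol. Building every rotation is far too memory-hungry for a real genome, so production tools do not build the index this way. The function name bwt is hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (small inputs only)."""
    if not text.endswith("$"):
        raise ValueError("text must end with the '$' end marker")
    # All rotations of the text: BANANA$, ANANA$B, NANA$BA, ...
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


transformed = bwt("BANANA$")
print(transformed)
```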
21
Inverse BWT
We are given “ANNB$AA”
22
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
Last column: ANNB$AA
23
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
Last column: ANNB$AA
Sort: ANNB$AA -> $AAABNN
25
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
Last column: ANNB$AA
Sorted: $AAABNN
Attach the last column
26
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
After attaching the last column: A$ NA NA BA $B AN AN
Next step: sort
27
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
After attaching the last column: A$ NA NA BA $B AN AN
Sort: $B A$ AN AN BA NA NA
28
Inverse BWT
We are given “ANNB$AA”
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
After attaching the last column: A$ NA NA BA $B AN AN
Sort: $B A$ AN AN BA NA NA
Attach the last column (ANNB$AA) again
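The "attach the last column, then sort" loop can be sketched as follows. This is the simple textbook procedure for small strings, assuming the original text ended with a unique "$" marker; it is not an efficient method for genome-sized input. The function name inverse_bwt is hypothetical.

```python
def inverse_bwt(last_column: str) -> str:
    """Rebuild the original text from its BWT by repeated attach-and-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Attach the last column in front of every row, then sort the rows.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # After n rounds the table holds all sorted rotations;
    # the original text is the rotation that ends with the end marker.
    for row in table:
        if row.endswith("$"):
            return row
    raise ValueError("input does not contain the '$' end marker")


original = inverse_bwt("ANNB$AA")
print(original)
```

After the first round the rows are $, A, A, A, B, N, N; after the second they are $B, A$, AN, AN, BA, NA, NA, matching the steps on the slides.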
30
LF Search
Question: Find “NAN” from BANANA
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
31
LF Search
Question: Find “NAN” from BANANA
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The query NAN is matched from its last character backward: N -> AN -> NAN
32
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The range of strings that start with “N” can be calculated from:
- the number of symbols that are lexicographically less than ‘N’, to determine the start point
- the number of ‘N’, to determine the end point
33
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The range of strings that start with “N” can be calculated from:
- the number of symbols that are lexicographically less than ‘N’, to determine the start point = 5
- the number of ‘N’, to determine the end point = 2
35
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The range of strings that start with “AN” can be calculated from:
- the number of symbols that are lexicographically less than ‘A’, to determine the start point = 1
- the number of ‘A’, to determine the end point = 3
36
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The range of strings that start with “AN” can be calculated from:
- the number of symbols that are lexicographically less than ‘A’, to determine the start point = 1
- the number of ‘A’, to determine the end point = 3
This is a range for ‘A’, not ‘AN’!!
38
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
Count of ‘A’ before the start point = 1
The range of strings that start with “AN” can be calculated from:
- the number of symbols that are lexicographically less than ‘A’, to determine the start point = 1
- the number of ‘A’, to determine the end point = 3
39
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
“Ax” (‘A’ followed by some other symbol x) is not “AN” and sorts before “AN”
Count of ‘A’ before the start point = 1
The range of strings that start with “AN” can be calculated from:
- the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ before the start point, to determine the start point = 1 + 1 = 2
- the number of ‘A’ before the end point, to determine the end point = 3
40
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
The range of strings that start with “NAN” can be calculated from:
- the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ before the start point, to determine the start point = 5 + 1 = 6
- the number of ‘N’ before the end point, to determine the end point = 2
41
LF Search
Question: Find “NAN” from BANANA
Query: NAN
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
Row 6 of the sorted table is row 2 of the original rotation table
= the original string rotated 2 times
= “NAN” exists at the 3rd position of “BANANA”
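The walk-through above can be condensed into a short backward-search sketch. It assumes the text ends with "$", uses half-open row ranges (so the end is one past the last matching row), and counts occurrences by scanning the BWT string, which is fine for a toy example but not for a genome. The names build_index, backward_search, less_than and occ are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_index(text: str) -> Tuple[List[int], str]:
    """Suffix array and BWT of a text that ends with '$' (small inputs only)."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def backward_search(bwt: str, pattern: str) -> Tuple[int, int]:
    """Half-open range of sorted rows that start with pattern, or (0, 0)."""
    # less_than[c]: number of symbols lexicographically less than c.
    counts = Counter(bwt)
    less_than: Dict[str, int] = {}
    total = 0
    for symbol in sorted(counts):
        less_than[symbol] = total
        total += counts[symbol]

    def occ(symbol: str, row: int) -> int:
        # Number of occurrences of symbol in the BWT before the given row.
        return bwt.count(symbol, 0, row)

    start, end = 0, len(bwt)
    for symbol in reversed(pattern):  # N, then A, then N for "NAN"
        if symbol not in less_than:
            return 0, 0
        start = less_than[symbol] + occ(symbol, start)
        end = less_than[symbol] + occ(symbol, end)
        if start >= end:
            return 0, 0
    return start, end


suffix_array, bwt = build_index("BANANA$")
start, end = backward_search(bwt, "NAN")
positions = sorted(suffix_array[start:end])  # 0-based positions in the text
print(bwt, (start, end), positions)
```

For "NAN" the ranges are rows 5 to 6 for "N", rows 2 to 3 for "AN", and row 6 for "NAN"; row 6 maps to 0-based position 2, that is, the 3rd position of "BANANA".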
42
Genome query (imported from Mike Schatz’s slide)
49
Inexact matching
T G A C G A T
When an exact match does not exist:
- continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
- if another mismatch occurs, branch out again
So the allowed edit distance is critical to alignment speed.
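The branching idea can be sketched on top of a BWT backward search. The sketch below is a simplification: it allows substitutions only (no insertions or deletions), rebuilds a toy index for every call, and is not BWA's actual algorithm or data layout. The function name inexact_search and the example strings are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def inexact_search(text: str, pattern: str, max_mismatches: int) -> List[int]:
    """0-based positions where pattern occurs with at most max_mismatches substitutions."""
    # Toy index: suffix array, BWT and "symbols less than" table.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = Counter(bwt)
    less_than: Dict[str, int] = {}
    total = 0
    for symbol in sorted(counts):
        less_than[symbol] = total
        total += counts[symbol]
    alphabet = [symbol for symbol in less_than if symbol != "$"]
    hits: Set[int] = set()

    def extend(position: int, start: int, end: int, mismatches_left: int) -> None:
        if position < 0:
            hits.update(suffix_array[start:end])  # whole pattern consumed
            return
        for symbol in alphabet:
            # Following the query symbol is free; any other branch costs one mismatch.
            cost = 0 if symbol == pattern[position] else 1
            if cost > mismatches_left:
                continue
            new_start = less_than[symbol] + bwt.count(symbol, 0, start)
            new_end = less_than[symbol] + bwt.count(symbol, 0, end)
            if new_start < new_end:  # prune branches that match nothing
                extend(position - 1, new_start, new_end, mismatches_left - cost)

    extend(len(pattern) - 1, 0, len(bwt), max_mismatches)
    return sorted(hits)


reference = "TGACGATC$"
exact_hits = inexact_search(reference, "GACC", 0)
one_mismatch_hits = inexact_search(reference, "GACC", 1)
print(exact_hits, one_mismatch_hits)
```

Every extra allowed mismatch multiplies the number of branches that must be explored, which is why the mismatch budget has such a strong effect on speed.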
50
Goal achieved
method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
51
Practice with BWA
52
BWA
53
bwa practice
In the cluster
>bwa
54
bwa process
- bwa index: indexes the reference genome (one-time process), i.e. creates the BWT for the reference genome
- bwa aln: calculates the suffix array (SA) coordinates
- bwa samse (or bwa sampe for paired-end sequencing): converts the SA coordinates to chromosomal locations
Input for bwa
- reference genome
- fastq file (the raw NGS data)
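For readers who prefer to drive these steps from a script, the three commands can be assembled in Python. This is only a sketch: the helper names bwa_single_end_steps and run_steps and the file name genome.fa are hypothetical, and the code simply mirrors the bwa index / bwa aln / bwa samse command forms described in this lecture. Nothing is executed unless run_steps is called on a machine where bwa is installed.

```python
import subprocess
from typing import List, Optional, Tuple

# (argument list, file that receives standard output or None)
Step = Tuple[List[str], Optional[str]]


def bwa_single_end_steps(reference: str, fastq: str, prefix: str) -> List[Step]:
    """Build the index, aln and samse commands for one single-end FASTQ file."""
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        (["bwa", "index", reference], None),             # one-time indexing
        (["bwa", "aln", reference, fastq], sai),         # SA coordinates
        (["bwa", "samse", reference, sai, fastq], sam),  # chromosomal locations
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each command, redirecting standard output to a file when requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as handle:
                subprocess.run(argv, stdout=handle, check=True)


steps = bwa_single_end_steps("genome.fa", "NA18507_chr8.01.fastq", "NA18507_chr8.01")
for argv, stdout_path in steps:
    print(" ".join(argv), "->", stdout_path)
```

When the reference has already been indexed, the first step can be skipped, for example with run_steps(steps[1:]).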
55
Reference data
56
Reference data
“bwa index” indexes the reference genome (so that the reference is ready).
It has already been done here; do not try to do it again.
57
Sequence data
- Pick one chromosome for yourself
- Copy the fastq file to your directory; use the “cp” command to do it
Example (copying chr8 NGS data to the rachmani directory)
>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/
58
run bwa aln
>bwa aln reference yourdata.fastq > yourdata.sai
example
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai
write a job script runbwaaln.sh
submit to cluster
>qsub runbwaaln.sh
59
run bwa samse
>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam
example
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam
write a job script runbwasamse.sh
submit to cluster
>qsub runbwasamse.sh
60
The output
>less NA18507_chr8.01.sam
This is your first alignment with real NGS data
61
Break
Please ask us if you have any questions or problems (do not give up).
If possible, try mapping in paired-end mode:
bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam
62
The SAM Format
For more details about SAM format please refer to:
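As a quick orientation, each SAM alignment line is tab-separated and begins with 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by tags. The Python sketch below splits one such line; the names SamRecord and parse_sam_line and the example record are made up for illustration and do not come from the NA18507 data.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Hypothetical alignment line, for illustration only.
line = "read1\t16\tchr8\t1000\t37\t10M\t*\t0\t0\tTAACACCTGG\t5FIFEFHFGH\tNM:i:0"
record = parse_sam_line(line)
is_unmapped = bool(record.flag & 0x4)   # flag bit 0x4: read is unmapped
is_reverse = bool(record.flag & 0x10)   # flag bit 0x10: read maps to the reverse strand
print(record.rname, record.pos, record.cigar, is_unmapped, is_reverse)
```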
63
SAM/BAM
SAM and BAM are interconvertible (they hold exactly the same information)
SAM file
- human-readable text file
BAM file
- binary file, not human-readable
- compressed (much smaller size)
- can be indexed (for random access)
64
Converting SAM to BAM
>samtools view -Sb yourdata.sam > yourdata.bam
-S option means input is SAM format
-b option means output is BAM format
65
Sorting and Indexing BAM
samtools sort yourdata.bam yourdata.sorted
will create yourdata.sorted.bam
samtools index yourdata.sorted.bam
will create yourdata.sorted.bam.bai
Now everything’s ready
66
Visualizing alignment
IGV (Integrative Genomics Viewer)
67
Visualizing alignment
samtools tview yourdata.bam reference
example:
>samtools tview NA18507_chr8.01.sorted.bam /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa
68