MES Genome Informatics I - Lecture V. Short Read Alignment


1 MES7594-01 Genome Informatics I - Lecture V. Short Read Alignment
Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Genome Informatics I (2015 Spring)

2 Overview
Goal of this lecture: you will learn the principle of mapping NGS short reads to a reference genome and practice with alignment tools.
Short read alignment theory
  Why do we need a special algorithm?
  The Burrows-Wheeler Transformation (BWT)
  BWT indexing
  LF search
  Examples
Practice with BWA
  With NA18507 sequences
Understanding alignment information
  Viewing/converting the SAM/BAM format
  Interpreting alignment information

3 Short read alignment theory

4 Raw NGS data (FASTQ)
@SRR /1
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
+
5FIFEFHFGHHEFFEEIFFIFHFGGGGKGFJHFEKJJIFKKJGHGGGJFKHGGGLLFGGHLKHJJMGGGJNJKIJJLLIIIKJIHIKJEGFACGEEEDC>F
@SRR /1
ATATATGAAGGAAAGATACAGTCATTTTCAGACAAACAAATGCTGACAGAATTTGCCATTACCAAGCCAGGACTCTAAGAACTGCTAAAAGGAGCTCTAAA
6FFDBDGDEGFEEEGEDBEEFDFEEDEEFFGEEFGFFFGFEHGGHEFFGFGEFFHGGFFFDGGGGHGGGHHGFHGGEGHGHFGIIGCFFFED?ADC>B<>>
@SRR /1
TAAAAGAGACAAAGAGAGACAGTATATCATCTGTCATCTGACAGTCTCATCCAACAGAAAAATATGACAATCCTAAACATATGTGAACCTAACACTGGAGC
6FIEEFDFEEEFEFEFEFEEEFDBECEFFEFFGEFFEFGHEFFGDGGFFEEGFGFFHFGGGGEDFHFFGHFGFHFGGGFFEFIGJFGGIHBDECCCD?;>H
@SRR /1
TTAAATAACCTGCTCCTGAATGAGCATTGGGTGAAAAACGAAATCAAGATGGAAATGTAAAAAATTTCTTCGAACTGGATGACACAACCTATCAAGACCTC
@SRR /1
CACAACCTATCAAGACCTCTGGGATACAGCAAAGGCAGTGCTAAGAGGAAAGTTTATAGCACTAAACACCTACGTCGAAAAGTCTGAAAGAGCACAGACAA
@SRR /1
CCATAGAAAGGAATGAATTAACAGCATTTCCTGTGACCTGGACGAGATTGGAGACTATTGTTCTAAGTGATGTAACCCAGGAATGGAAAACTCAACATTGT
@SRR /1
TGTCCTTTCCAGGGACATGGATGAAGCTGGAAACCATCATTCTCAGCAAACTAACACAAGAAAAGAAAACCAGGCCAGGAGCAGTGGCTCATGCCTGTAGT
5JIAIHEDHHDHGGFFFEIJFFHDCIHHHKFGHIIGGFGGGGHIGDGGIIIIGGJGFGGIIFHHKHIJIJKHLKILGCIIHMHKDKMLKFJBHHHBGFABB
@SRR /1
GAGAACACATGGACACAGGGAGGGGAACATCACACACTGGGGCCTGTCAAAGGGTGGGAGGCTGGGGGAGGAACAGCATTAGGAGAAATACCTAATGTAGA
@SRR /1
TGGGGAAAAAAAACATTCTCTGAAATTTGCTTTTATACCATTAAAGACTTATTTTTTATTACCAGCAATACAGGGCAACTCATTCAGGTTGAATCTTGAAG

5 Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

6 Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!

7 Brute force way
Find “GATTCAAA” in the human genome.
The reference genome (chr1, start) is very long (3 billion bases): T G A C G A T C ...
Your query is slid along the reference one position at a time and compared at every offset: G A T C
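The brute force idea can be sketched in a few lines of Python. This is a minimal illustration, not how any real aligner works; the function name naive_find and the short test strings are made up for this example.

```python
def naive_find(reference: str, query: str) -> list[int]:
    """Return every 0-based offset at which query occurs exactly in reference."""
    hits = []
    n, m = len(reference), len(query)
    # Slide the query along the reference and compare at every offset.
    for start in range(n - m + 1):
        if reference[start:start + m] == query:
            hits.append(start)
    return hits


print(naive_find("TGACGATCGATC", "GATC"))  # [4, 8]
```

Every offset of the reference is visited for every read, which is why this approach does not scale to a 3-billion-base genome and billions of reads.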

8 How fast should it be?
Method               Time per read (sec)   Time per 80x WGS (sec)   Equivalent to
eyeballing           3x10^9                3.6x10^18                1x10^11 yrs
naïve matching       2400                  1.2x10^9                 7,608 yrs
improved algorithm   3                     3.6x10^8                 10 yrs
minimum required     0.01                  1.2x10^7                 11.5 days
desired              0.001                 1.2x10^6                 1.2 days
Based on 200 bp read length, 80x single-end WGS.
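As a rough sketch of where such totals come from: the number of reads follows from genome size, coverage and read length, and the total time is the per-read time multiplied by that count. The 3-billion-base genome size is taken from the brute force slide; the function names are illustrative, and the slide's own figures are not recomputed here.

```python
GENOME_SIZE_BP = 3_000_000_000  # "3 billion" bases, from the brute force slide
COVERAGE = 80                   # 80x whole-genome sequencing
READ_LENGTH_BP = 200            # 200 bp single-end reads


def reads_needed(genome_size=GENOME_SIZE_BP, coverage=COVERAGE,
                 read_length=READ_LENGTH_BP) -> int:
    """Number of reads required to cover the genome at the given depth."""
    return genome_size * coverage // read_length


def total_seconds(seconds_per_read: float) -> float:
    """Total alignment time if every read costs seconds_per_read."""
    return seconds_per_read * reads_needed()


print(reads_needed())        # 1200000000 reads
print(total_seconds(0.001))  # seconds for the whole data set at 1 ms per read
```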

9 Searching with index
Assume you are searching for “genome” in an English dictionary. You don’t search every line on every page:
You first find the page range of “g” in the dictionary
Within that range (of “g”), you find the page range of “ge”
Within that range (of “ge”), you find the page range of “gen”
... until you find “genome”
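The dictionary lookup above can be sketched as repeated range narrowing over a sorted word list. This is only an analogy in code, assuming a small made-up word list; prefix_range is a hypothetical helper built on Python's standard bisect module.

```python
from bisect import bisect_left


def prefix_range(words, prefix, lo=0, hi=None):
    """Half-open range [start, end) of sorted words that begin with prefix."""
    if hi is None:
        hi = len(words)
    start = bisect_left(words, prefix, lo, hi)
    # Smallest string that sorts after every word starting with prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(words, upper, start, hi)
    return start, end


words = sorted(["apple", "gap", "gene", "genome", "genus", "get", "giraffe", "zebra"])
target = "genome"
lo, hi = 0, len(words)
for i in range(1, len(target) + 1):
    # Narrow the previous range using one more character of the target.
    lo, hi = prefix_range(words, target[:i], lo, hi)
    print(target[:i], (lo, hi))
```

Each step searches only inside the range found by the previous step, so the work per character stays small no matter how large the dictionary is.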

10 Indexing genome
We are going to build an index of the genome so that a read sequence can be searched the same way we look up a word in an English dictionary.

11 Burrows-Wheeler Transformation
Input string: BANANA

12 Burrows-Wheeler Transformation
Append “$”, a terminator that is lexicographically smaller than every other symbol: BANANA$

13 Burrows-Wheeler Transformation
Rotate the string one character at a time:
BANANA$
ANANA$B

14 Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA

15 Burrows-Wheeler Transformation
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA

16 Burrows-Wheeler Transformation
Number the rotations:
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

17 Burrows-Wheeler Transformation
Sort the rotations (rank, original index, rotation):
Rotations        Sorted rotations
0 BANANA$        0 6 $BANANA
1 ANANA$B        1 5 A$BANAN
2 NANA$BA        2 3 ANA$BAN
3 ANA$BAN        3 1 ANANA$B
4 NA$BANA        4 0 BANANA$
5 A$BANAN        5 4 NA$BANA
6 $BANANA        6 2 NANA$BA

18 Burrows-Wheeler Transformation
Read off the last column of the sorted rotations: ANNB$AA

19 Burrows-Wheeler Transformation
BWT(“BANANA$”) = “ANNB$AA”

20 Burrows-Wheeler Transformation
BWT(“BANANA$”) = “ANNB$AA”
BWT only reorders the characters of the string
BWT tends to group identical characters together
The original string can be recovered from the transformed string alone
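The construction shown on these slides can be written as a short Python sketch. It sorts all rotations explicitly, which is fine for a toy string but far too slow and memory-hungry for a genome; the function name bwt is chosen here for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text via explicitly sorted rotations."""
    if "$" in text:
        raise ValueError("text must not already contain the terminator '$'")
    s = text + "$"  # '$' sorts before the letters A-Z in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("BANANA"))  # ANNB$AA
```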

21 Inverse BWT
We are given “ANNB$AA”.

22 Inverse BWT
We are given “ANNB$AA”, the last column of the sorted rotation table we want to rebuild (rank, original index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

23-24 Inverse BWT
Sort the characters of the given string to obtain the first column:
ANNB$AA -> sort -> $AAABNN

25 Inverse BWT
Attach the last column (ANNB$AA) in front of the sorted column ($AAABNN), row by row.

26 Inverse BWT
This gives the two-character strings, which are sorted next:
A$ NA NA BA $B AN AN

27 Inverse BWT
After sorting:
$B A$ AN AN BA NA NA

28-29 Inverse BWT
Attach the last column (ANNB$AA) again in front of the sorted strings:
A$B NA$ NAN BAN $BA ANA ANA
Repeat sorting and attaching until every row has the full length; the row that ends with “$” is the original string.
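The sort-and-attach procedure above can be sketched directly in Python. This is the simple quadratic version that mirrors the slides, not an efficient implementation; the function name inverse_bwt is illustrative.

```python
def inverse_bwt(last_column: str) -> str:
    """Recover the original text from its BWT by repeated attach-and-sort."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Attach the last column in front of every row, then sort the rows.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # After n rounds the table holds all sorted rotations;
    # the rotation ending in '$' is the original string plus terminator.
    for row in table:
        if row.endswith("$"):
            return row[:-1]
    raise ValueError("input does not contain the terminator '$'")


print(inverse_bwt("ANNB$AA"))  # BANANA
```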

30 LF Search
Question: find “NAN” in BANANA
Sorted rotations (rank, original index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

31 LF Search
The query is matched backward, one character at a time: N, then AN, then NAN.

32-34 LF Search
The range of strings that start with “N” can be calculated from:
the number of symbols that are lexicographically less than ‘N’, to determine the start point = 5
the number of ‘N’, to determine the end point = 2
So the range is rows 5 to 6.

35-37 LF Search
The range of strings that start with “AN”: a first attempt
the number of symbols that are lexicographically less than ‘A’, to determine the start point = 1
the number of ‘A’, to determine the end point = 3
This gives rows 1 to 3, which is a range for ‘A’, not for ‘AN’!

38 LF Search
Count of ‘A’ in the last column before the current start point (row 5) = 1

39 LF Search
Rows of the form “Ax” that are not “AN” and sort before “AN” must be skipped.
The range of strings that start with “AN” can be calculated from:
the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ before the start point, to determine the start point = 1 + 1 = 2
the number of ‘A’ before the end point, to determine the end point = 3
So the range is rows 2 to 3.

40 LF Search
The range of strings that start with “NAN” can be calculated from:
the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ before the start point, to determine the start point = 5 + 1 = 6
the number of ‘N’ before the end point = 2, so the range ends at row 5 + 2 - 1 = 6
So the range is row 6 only.

41 LF Search
Row 6 (NANA$BA) has original index 2, i.e. it is the original string rotated 2 times.
So “NAN” exists at the 3rd position of “BANANA”.
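The range updates walked through above can be sketched as a backward search over the BWT. This is a toy version under stated assumptions: the index is built by sorting rotations, the count of a symbol before a row is done with a plain string count instead of a precomputed table, ranges are half-open, and the names build_index and backward_search are made up here.

```python
from collections import Counter


def build_index(text: str):
    """Return (suffix array, BWT, C table) for text + '$'."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in sa)  # last column of the sorted rotations
    counts = Counter(s)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of symbols lexicographically less than ch
        total += counts[ch]
    return sa, bwt, c_table


def backward_search(query: str, sa, bwt, c_table) -> list[int]:
    """Return sorted 0-based positions of exact matches of query."""
    lo, hi = 0, len(bwt)  # half-open row range [lo, hi)
    for ch in reversed(query):
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt[:lo].count(ch)  # ch occurrences before start
        hi = c_table[ch] + bwt[:hi].count(ch)  # ch occurrences before end
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt, c_table = build_index("BANANA")
print(backward_search("NAN", sa, bwt, c_table))  # [2]
```

For “NAN” the row range goes from [5, 7) to [2, 4) to [6, 7), matching rows 5 to 6, 2 to 3, and 6 on the slides; position 2 (0-based) is the 3rd position of “BANANA”.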

42-48 Genome query
[Figures imported from Mike Schatz’s slides]

49 Inexact matching
[Figure: T G A C G A T]
When an exact match does not exist:
  continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
  if another mismatch occurs, branch out again
So the allowed edit distance is critical to alignment speed.
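The branching idea can be sketched as a backtracking version of the backward search. This toy example handles substitutions only (no insertions or deletions), uses none of the pruning heuristics a real aligner needs, and the names build_index and inexact_search are assumptions made for illustration.

```python
from collections import Counter


def build_index(text: str):
    """Return (suffix array, BWT, C table) for text + '$'."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in sa)
    counts = Counter(s)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return sa, bwt, c_table


def inexact_search(query: str, text: str, max_mismatches: int) -> dict[int, int]:
    """Map each match position to its mismatch count (substitutions only)."""
    sa, bwt, c_table = build_index(text)
    alphabet = [ch for ch in c_table if ch != "$"]
    hits = {}

    def extend(i, lo, hi, mismatches):
        if i < 0:  # whole query consumed: report every row in the range
            for pos in sa[lo:hi]:
                hits[pos] = mismatches
            return
        for ch in alphabet:  # try the query symbol and every alternative
            cost = mismatches + (ch != query[i])
            if cost > max_mismatches:
                continue
            new_lo = c_table[ch] + bwt[:lo].count(ch)
            new_hi = c_table[ch] + bwt[:hi].count(ch)
            if new_lo < new_hi:  # branch only while the range is non-empty
                extend(i - 1, new_lo, new_hi, cost)

    extend(len(query) - 1, 0, len(bwt), 0)
    return dict(sorted(hits.items()))


print(inexact_search("NAN", "BANANA", 1))  # {0: 1, 2: 0}
```

Every allowed mismatch multiplies the number of branches to explore, which is why the mismatch limit has such a strong effect on speed.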

50 Goal achieved
Method               Time per read (sec)   Time per 80x WGS (sec)   Equivalent to
eyeballing           3x10^9                3.6x10^18                1x10^11 yrs
naïve matching       2400                  1.2x10^9                 7,608 yrs
improved algorithm   3                     3.6x10^8                 10 yrs
minimum required     0.01                  1.2x10^7                 11.5 days
desired              0.001                 1.2x10^6                 1.2 days

51 Practice with BWA

52 BWA

53 bwa practice
On the cluster:
>bwa

54 bwa process
bwa index: indexes the reference genome (a one-time process), i.e. creates the BWT of the reference genome
bwa aln: calculates the suffix array (SA) coordinates
bwa samse (or bwa sampe for paired-end sequencing): converts the SA coordinates to chromosomal locations
Input for bwa:
  reference genome
  fastq file (the raw NGS data)

55 Reference data

56 Reference data
“bwa index” indexes the reference genome (so that the reference is ready).
It has already been done here; do not try to do it again.

57 Sequence data
Pick one chromosome for yourself.
Copy the fastq files to your directory with the “cp” command.
Example (copying chr8 NGS data to the rachmani directory):
>cp NA18507_chr8.* /scratch/2015_GenomeInformatics/rachmani/

58 Run bwa aln
>bwa aln reference yourdata.fastq > yourdata.sai
Example:
>bwa aln /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.fastq > NA18507_chr8.01.sai
Write a job script runbwaaln.sh and submit it to the cluster:
>qsub runbwaaln.sh

59 Run bwa samse
>bwa samse reference yourdata.sai yourdata.fastq > yourdata.sam
Example:
>bwa samse /data/resources/reference/human/UCSC/hg19/BWAIndex/genome.fa NA18507_chr8.01.sai NA18507_chr8.01.fastq > NA18507_chr8.01.sam
Write a job script runbwasamse.sh and submit it to the cluster:
>qsub runbwasamse.sh

60 The output
>less NA18507_chr8.01.sam
This is your first alignment with real NGS data.

61 Break
Please ask us if you have any problems (do not give up).
If possible, try mapping in paired-end mode:
>bwa sampe reference data01.sai data02.sai data01.fastq data02.fastq > output.sam

62 The SAM Format
For more details about the SAM format please refer to:

63 SAM/BAM
SAM and BAM are interconvertible (they hold exactly the same information).
SAM file
  human-readable text file
BAM file
  binary file, not human-readable
  compressed (much smaller size)
  can be indexed (for random access)
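Because SAM is plain text, one alignment line can be inspected with a few lines of Python. The sketch below splits a line into the 11 mandatory tab-separated SAM fields and checks two FLAG bits (0x4 unmapped, 0x10 reverse strand); the example record is made up for illustration and does not come from the NA18507 data.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise FLAG
    rname: str  # reference (chromosome) name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # CIGAR string
    rnext: str  # reference name of the mate
    pnext: int  # position of the mate
    tlen: int   # observed template length
    seq: str    # read sequence
    qual: str   # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


# Hypothetical record, for illustration only.
line = "read1\t16\tchr8\t1000\t37\t5M\t*\t0\t0\tACGTA\tIIIII"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar)
print("unmapped:", bool(rec.flag & 0x4), "reverse strand:", bool(rec.flag & 0x10))
```

Header lines, which start with “@”, are not alignment records and should be skipped before calling parse_sam_line.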

64 Converting SAM to BAM
>samtools view -Sb yourdata.sam > yourdata.bam
-S option means the input is in SAM format
-b option means the output is in BAM format

65 Sorting and Indexing BAM
>samtools sort yourdata.bam yourdata.sorted
will create yourdata.sorted.bam
>samtools index yourdata.sorted.bam
will create yourdata.sorted.bam.bai
Now everything’s ready.

66 Visualizing alignment
IGV (Integrative Genomics Viewer)

67 Visualizing alignment
>samtools tview yourdata.bam reference
Example:
>samtools tview NA18507_chr8.01.sorted.bam /data/resource/reference/human/UCSC/hg19/BWAIndex/genome.fa
