Sequencing Data Analysis Debashis Sahoo Department of Computer Science CSE291 – H00 – Lecture 17
Sanger dideoxy sequencing: basic method. Single-stranded DNA (3’ to 5’) with primer (5’ to 3’). a) Anneal the primer
An automated sequencer The output
Sequence output Computer calls Raw data GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
Amplifying DNA in Vitro: The Polymerase Chain Reaction (PCR) The polymerase chain reaction, PCR, can produce many copies of a specific target segment of DNA A three-step cycle—heating, cooling, and replication—brings about a chain reaction that produces an exponentially growing population of identical DNA molecules
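The exponential growth described on this slide can be sketched numerically. The snippet below is a minimal illustration, assuming an idealized reaction in which a fraction `efficiency` of the molecules is copied in every cycle; the function name and the `efficiency` parameter are hypothetical and not part of the lecture.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the number of target molecules after a number of PCR cycles.

    Assumption: each cycle multiplies the population by (1 + efficiency),
    so efficiency = 1.0 is perfect doubling. Real reactions plateau as
    reagents run out, which this sketch ignores.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, "cycles:", pcr_copies(1, n))
```

With perfect doubling, a single template molecule yields 2^30 (about one billion) copies after 30 cycles, which is why a few dozen cycles are enough in practice.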
The three main steps of PCR
Step 1: Denature DNA. At 95 °C, the DNA is denatured (i.e., the two strands are separated).
Step 2: Primers anneal. At 40 °C to 65 °C, the primers anneal (bind) to their complementary sequences on the single strands of DNA.
Step 3: DNA polymerase extends the DNA chain. At 72 °C, DNA polymerase extends the DNA chain by adding nucleotides to the 3’ ends of the primers.
PCR: Polymerase Chain Reaction Step 1: denaturation Step 2: annealing Step 3: extension
PCR PCR tubes PCR C1000 Thermal Cycler
Step 1 Denaturation of DNA This occurs at 95 °C, mimicking the function of helicase in the cell.
Step 2 Annealing or Primer Binding Reverse Primer Forward Primer Primers bind to the complementary sequence on the target DNA. Primers are chosen such that one is complementary to one strand at one end of the target sequence and the other is complementary to the other strand at the other end of the target sequence.
Step 3 Extension or Primer Extension DNA polymerase catalyzes the extension of the strand in the 5’ to 3’ direction, starting at the primers, attaching the appropriate complementary nucleotide (A with T, C with G).
The next cycle will begin by denaturing the new DNA strands formed in the previous cycle
The Size of the DNA Fragment Produced in PCR Depends on the Primers. PCR amplifies the DNA section between the two primers (primer sequences included). If the DNA sequence is known, primers can be designed to amplify any piece of an organism’s DNA. Figure labels: forward primer, reverse primer, size of fragment that is amplified.
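How the primer pair fixes the fragment size can be illustrated with a small in-silico search. This is a sketch under simplifying assumptions: primers must match exactly, only the first binding site is considered, and the template and primer sequences are made-up toy data. The function names are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def amplicon(template: str, forward: str, reverse: str) -> Optional[str]:
    """Return the segment of the top strand amplified by a primer pair.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its reverse complement is what appears
    in the top-strand sequence. Returns None if either site is missing.
    """
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    site_start = template.find(reverse_site, start + len(forward))
    if site_start == -1:
        return None
    return template[start:site_start + len(reverse_site)]


if __name__ == "__main__":
    toy_template = "TTACGGATCAAAACCCCGGGGTTTTCAGTCCATT"  # invented example
    product = amplicon(toy_template, "ACGGATC", "TGGACTG")
    print(product, len(product) if product else 0)
```

Moving either primer along the template changes the returned fragment and therefore its length, which is the point of the slide.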
FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
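A FASTA file is a series of records, each a `>` header line followed by one or more sequence lines. The sketch below shows one simple way to read that layout; the function name is hypothetical and the sequences are shortened toy data, not the full records from the slide.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    Sequence lines belonging to one record are joined into a single string.
    Lines that appear before the first '>' header are ignored.
    """
    chunks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


if __name__ == "__main__":
    example = """>SEQUENCE_1
MTEITAAMVK
ELRESTGAGM
>SEQUENCE_2
SATVSEINSE
"""
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```

For real work, an established parser (for example Biopython's `SeqIO`) is usually a better choice than a hand-written one.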
FASTQ
@HWUSI-EAS466_0001:1:1:6:1464#0/1
CAAATGTCTATTTTNTCCGTCAATCTGTGAGTGNCA
+HWUSI-EAS466_0001:1:1:6:1464#0/1
abaa`[]aaaaaaaB_aa`_aaaaa^W]\_VV^Ba`
@HWUSI-EAS466_0001:1:1:6:579#0/1
ACCTGGTCCTCTTTNAAGACGCGATGTGTCACGNTG
+HWUSI-EAS466_0001:1:1:6:579#0/1
`aa_YWY`abaa`aBa_T`_O`Y__VYQ[aaBBBBB
@HWUSI-EAS466_0001:1:1:6:1050#0/1
CGAATATCGTGACCNACCGCGGTACAATTGCATNCT
+HWUSI-EAS466_0001:1:1:6:1050#0/1
a``aaaa`Y\T`aaBaa_^_``\`a```[O]__Ba`
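Each FASTQ record has four lines: an `@` header, the read, a `+` separator line, and one quality character per base. The sketch below parses the first record shown on the slide and converts quality characters to Phred scores. The ASCII offset of 64 is an assumption: the slide does not state the encoding, but characters such as `a` and `B` are typical of older Illumina output, whereas current files normally use an offset of 33. The class and function names are hypothetical.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: list[int]  # one Phred score per base


def parse_fastq(text: str, offset: int = 64) -> list[FastqRecord]:
    """Parse four-line FASTQ records; `offset` is the assumed ASCII offset."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    records = []
    # Records are read in fixed groups of four because a quality line
    # may itself start with '@'.
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        scores = [ord(ch) - offset for ch in qual]
        records.append(FastqRecord(header[1:], seq, scores))
    return records


# First record from the slide; a raw string keeps the backslash literal.
EXAMPLE = r"""@HWUSI-EAS466_0001:1:1:6:1464#0/1
CAAATGTCTATTTTNTCCGTCAATCTGTGAGTGNCA
+HWUSI-EAS466_0001:1:1:6:1464#0/1
abaa`[]aaaaaaaB_aa`_aaaaa^W]\_VV^Ba`
"""

if __name__ == "__main__":
    record = parse_fastq(EXAMPLE)[0]
    for base, score in zip(record.sequence, record.quality):
        print(base, score)
```

Under this assumption, the uncalled `N` bases line up with the quality character `B`, which decodes to a score of 2.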
The different types of BLAST
BLAST = Basic Local Alignment Search Tool
“The most popular data mining tool ever”
BLASTN: DNA sequence vs. DNA sequence database
BLASTP: protein sequence vs. protein sequence database
BLASTX: DNA sequence translated in 6 reading frames vs. protein sequence database
TBLASTX: DNA sequence translated in 6 reading frames vs. DNA sequence database translated in 6 frames
Steps to use BLAST
#1) Paste sequence here
#2) Choose search set (either nucleotide collection or Protein Data Bank)
#3) Select program to use
#4) Push BLAST button
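BLASTX and TBLASTX rely on translating a DNA sequence in all six reading frames: three on the given strand and three on its reverse complement. The sketch below illustrates that step only, assuming the standard genetic code; it is not how BLAST itself is implemented, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translations(seq: str) -> dict[str, str]:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are written as `*`, so a frame with long stretches free of `*` is a candidate coding frame.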
An example of aligning text strings

Raw data
    T C A T G
       ???
    C A T T G

2 matches, 0 gaps
    T C A T G
          | |
    C A T T G

3 matches (2 end gaps)
    T C A T G .
      | | |
    . C A T T G

4 matches, 1 insertion
    T C A - T G
      | |   | |
    . C A T T G

    T C A T - G
      | | |   |
    . C A T T G
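The match counts on this slide can be checked programmatically, and the best achievable count can be found by dynamic programming. The sketch below assumes the simplest possible scoring (1 per match, 0 for mismatches and gaps), which makes the optimum equal to the length of the longest common subsequence; real aligners penalize mismatches and gaps. The function names are hypothetical.

```python
GAP_CHARS = "-."  # '-' marks an internal gap, '.' an end gap, as on the slide


def count_matches(top: str, bottom: str) -> int:
    """Count identical, non-gap columns in an already aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    return sum(a == b and a not in GAP_CHARS for a, b in zip(top, bottom))


def max_matches(a: str, b: str) -> int:
    """Largest number of matching columns over all gapped alignments.

    Uses match = 1, mismatch = 0, gap = 0 (an assumption for illustration).
    """
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + (ca == cb)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    alignments = [
        ("TCATG", "CATTG"),
        ("TCATG.", ".CATTG"),
        ("TCA-TG", ".CATTG"),
        ("TCAT-G", ".CATTG"),
    ]
    for top, bottom in alignments:
        print(top, bottom, count_matches(top, bottom))
    print("best possible:", max_matches("TCATG", "CATTG"))
```

The two alignments with one insertion both reach the maximum of four matches, so the optimal alignment is not unique here.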
Terminology of sequence comparison
Sequence identity -- exactly the same amino acid or nucleotide in the same position.
Sequence similarity -- substitutions with similar chemical properties.
Sequence homology -- general term that indicates evolutionary relatedness among sequences; in practice it is usually assessed from percentage sequence identity.
Pairwise alignment -- finds the best-matching local (piecewise) or global alignment of two sequences. Pairwise alignment compares only two sequences at a time.
Multiple sequence alignment -- tries to align all of the sequences in a given query set.
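The difference between identity and similarity can be made concrete on a pair of aligned protein sequences. The sketch below is illustrative only: the grouping of amino acids by chemical property is a simplified assumption made for this example, and the choice to ignore gapped columns in the denominator is one convention among several. Real tools score similarity with substitution matrices such as BLOSUM62.

```python
# Simplified, illustrative property groups (an assumption, not a standard).
SIMILAR_GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]
_GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (percent identity, percent similarity) for an aligned pair.

    Columns containing a gap ('-') are skipped. Similarity counts identical
    residues plus substitutions within the same property group.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if "-" not in (a, b)]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(
        a == b or (a in _GROUP_OF and _GROUP_OF.get(a) == _GROUP_OF.get(b))
        for a, b in columns
    )
    return 100 * identical / len(columns), 100 * similar / len(columns)


if __name__ == "__main__":
    print(identity_and_similarity("KVLDE", "RVIDQ"))
```

In the toy pair, K/R and L/I are counted as similar but not identical, so similarity is higher than identity.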
Where are the coding regions? TCAGCGAAGATGAGATAGTTTTTAAAGGTGGGATTTCCCCACCTTTAAAAAGCGAGAAGTCCCGGTTTTAAAGAGGAGTAAAATCCTCTTTTTCTAGCCCACTCAGGTGGTTTTTTTGGTTTTCGCTCCTTGCCGCATCTTCTGTGCCTTTGATGGCGGCTGGTTGGGGTGAAAGGCTGCATATTCCAGAATTTCAGACAGTAGATTGTTTTTGAAATCTTCCGTTTTATCGTTGACGAACTTAACCATCCTGTTGAAATCATCTTCCTTTGATACACCTTCAGGAAATGCCTTAGGAACTGATGTTTGGCTATCCAAGGCATCTTGCAATATCTGCACGATCTCCGAATTCATTGATCGCCCATTGGCCTTTGCTCTGGCGGCAACTGCGTCACGCATACCGTCAGGCATCCTAACTGTAAATCTCTCAATGAAAGCTGGATCTTCTTTTTCAGTCATCATCTTAAACCATAAAAATTTATACAAAACACACTAGCATCATATTGACATTACCCACAATGACATCATAATGGTGTCAGGCATCAAAATGATGTCATCATGACAAGGGGAAAGTAAATGCAAGATGTTCTCTATACAGGTCGTAAGAACGACAGCTTTCAGCTTCGTCTGCCTGAGCGAATGAAAGAAGAGATCCGTCGCATGGCAGAGATGGACGGCATTTCGATTAATTCTGCAATCGTGCAGCGCCTTGCTAAAAGCTTGCGTGAGGAAAGAGTTAATGGGCAGTAAAAACAGCGAAGCCCGGAAGTGTGGGGACACTAACCGGGCTTCTAATGTCAGTTACCTAGCGGGAAACCAACAATGACCAGTATAGCAATCTTTGAAGCAGTAAACACTATCTCTCTTCCATTCCACGGACAGAAGATCATAACTGCGATGGTGGCGGGTGTGGCGTATGTGGCAATGAAGCCCATCGTGGAAAACATCGGTTTAGACTGGAAGAGCCAGTATGCCAAGCTCGTTAGTCAGCGTGAAAAGTTCGGGTGTGGTGATATCACCATACCTACCAAAGGTGGTGTTCAGCAGATGCTTTGCATCCCTTTGAAGAAACTGAATGGATGGCTCTTCAGCATTAACCCAGCAAAAGTACGTGATGCAGTTCGTGAAGGTTTAATTCGCTATCAAGAAGAGTGTTTTACAGCTTTGCACGATTACTGGAGCAAAGGTGTTGCAACGAATCCCCGGACACCGAAGAAACAGGAAGACAAAAAGTCACGCTATCACGTTCGCGTTATTGTCTATGACAACCTGTTTGGTGGATGCGTTGAATTTCAGGGGCGTGCGGATACGTTTCGGGGGATTGCATCGGGTGTAGCAACCGATATGGGATTTAAGCCAACAGGATTTATCGAGCAGCCTTACGCTGTTGAAAAAATGAGGAAGGTCTACTGATTGGCGTATTGGAAGGCGCAAAAAGAAAAGCCAGCAGATGGGCTGCTGGCATTCATTGGGTATATGAACTTTCGGAGAACATATGAAGTCAATTATCAAGCATTTTGAGTTTAAGTCAAGTGAAGGGCATGTAGTGAGCCTTGAGGCTGCAAGCTTTAAAGGCAAGCCAGTTTTTTTAGCAATTGATTTGGCTAAGGCTCTCGGGTACTCAAATCCGTCA
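A first, naive answer to this question is to scan for open reading frames (ORFs): stretches that begin with ATG and run in frame to a stop codon. The sketch below searches the three forward frames only and is an illustration under stated assumptions; the function name and the `min_codons` threshold are hypothetical, and eukaryotic genes with introns need dedicated gene-prediction tools.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int, int]]:
    """Find ORFs on the forward strand.

    Returns (start, end, frame) tuples with 0-based start, exclusive end
    (stop codon included), and frame 0, 1 or 2. `min_codons` counts codons
    from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # first ATG after the previous stop
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # invented example: ATG AAA TTT GGG TAA
    print(find_orfs(toy, min_codons=3))
```

Running the same scan on the reverse complement would cover the other strand; the length threshold filters out the many short ORFs that occur by chance.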
Exon prediction in eukaryotic DNA using GENSCAN: the net result is a protein sequence. GENSCAN looks for start and stop codons, promoters, splice sites, and polyA signals, and provides statistics for coding potential.
NGS sequencing pipeline http://www.slideshare.net/mkim8/a-comparison-of-ngs-platforms
Sequencing steps Library preparation Library amplification Parallel sequencing Voelkerding KV et al., J Mol Diagn (2010) 12,539-51.
NGS Applications
Whole genome sequencing
Whole exome sequencing
RNA sequencing
ChIP-seq/ChIP-exo
CLIP-seq
GRO-seq/PRO-seq
Bisulfite-Seq
Shyr D, Liu Q. Biol Proced Online. (2013) 15, 4
Patient → Technologies → Data analysis → Integration and interpretation
Genomics (WGS, WES): point mutation, small indels, copy number variation, structural variation
Transcriptomics (RNA-Seq): differential expression, gene fusion, alternative splicing, RNA editing
Epigenomics (Bisulfite-Seq, ChIP-Seq): methylation, histone modification, transcription factor binding
Integration and interpretation: functional effect of mutation, network and pathway analysis, integrative analysis
→ Further understanding of cancer and clinical applications