Basics of high-throughput sequencing Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data.


1 Basics of high-throughput sequencing Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012

2 Plan
1. What high-throughput sequencing is used for
2. Illumina technology
3. Primary data analysis (alignment, QC)
4. Read formats
5. Secondary analysis (mutation calling, transcript level quantification, etc)
6. Read data visualization
7. Useful R/BioC packages
8. Challenges and evolution of sequencing and its analysis

3 1. What high-throughput sequencing is used for

4 Full genome sequencing

5

6

7

8 Targeted sequencing

9 Exome sequencing

10 DNA methylation profiling
Bisulfite treatment: mC → C, C → U
After PCR: C → C, U → T

11 RNA-seq

12 ChIP-seq (figure labels: transcription factor of interest, antibody, DNA)

13 High-throughput mapping of chromatin interactions (HiC) Elemento lab (more on this next week)

14

15 And many others
– Gene fusion detection
– Translational profiling (which mRNAs localize to ribosomes)
– Small/miRNA sequencing
– Bacterial communities
– Protein-RNA interactions (PAR-CLIP, HITS-CLIP)
– …

16 2. Illumina technology

17 Illumina SBS Technology: Reversible Terminator Chemistry Foundation
Workflow (figure): DNA (0.1-1.0 ug) → sample preparation → single molecule array → cluster growth → sequencing → image acquisition → base calling
© Illumina, Inc.
http://www.illumina.com/technology/sequencing_technology.ilmn
http://seqanswers.com/forums/showthread.php?t=21

18 Single-end vs paired-end sequencing

19 What comes out of the machine: short reads in FASTQ format
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
@D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
CTGACCTGGGTCCTGTGGTGCTCAGCCTTTTGAAGATGCCAGAAAAATACGTCG
+D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
\^_cccccg^Y`ega`fg`ebegfhd^egghhghfffhghdhbfffhhhfgfcf
QS to int, in R: as.integer(charToRaw('e')) - 33 for Sanger / Illumina 1.8+ files (Phred+33); the reads above use the older Illumina Phred+64 encoding, so subtract 64 for them
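The quality-score conversion on this slide can also be sketched in Python. The quality characters in the reads shown run from 'B' to 'i', which looks like the older Illumina Phred+64 encoding, so the default offset of 64 below is an assumption about this data set; current FASTQ files normally use an offset of 33. The function names are illustrative, not part of any library.

```python
def phred_scores(quality, offset=64):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    offset=64 is assumed for the older Illumina encoding shown on the slide;
    pass offset=33 for Sanger / Illumina 1.8+ files.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    qual = "ab_eBBBB"  # start of the second quality string on the slide
    scores = phred_scores(qual)
    for ch, q in zip(qual, scores):
        print(ch, q, round(error_probability(q), 4))
```

With the Phred+64 assumption, the long runs of 'B' decode to a score of 2, which that Illumina pipeline version used to flag low-quality read ends.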

20 Paired-end sequencing: two files, s_8_1_sequence.txt.gz and s_8_2_sequence.txt.gz
s_8_1_sequence.txt.gz:
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
s_8_2_sequence.txt.gz:
@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
GGCATATTTAACAGCATTGAACAGAATTCTGTGTCCTGTAAAAAAATTAGCTTA
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
a__aaa`ce`cgcffdf_acda^ea]befffbeged`g[a`e_caaac]cb`gb
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
TTGAGGCTGTTGTCATACTTCTCATGGTTCACACCCATGACGAACATGGGGGCG
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
a__eeeeeggegefhhhiiihhhhhiieghhhghhiiffhiififhhiihegic
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
CGGGGTGCACCTCGTCGTAGAGGAACTCTGCCGTCAGCTCTGCCCCATCGCCAA
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
^__ee__cge`cghghhfgddgfgi]ehhfffff^ec[beegidffhhfhadba
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
CTTAGTCTCAGTTTTCCTCCAGCAGCCTGAGGAAACTCAAAGGCACAGTTCCCA
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
_abeaaacg^g^eghhhhgafghhdfghfedeghfiiicfbgdHYagfeecggf
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
TAGGCTCAAAGTCTAACGCCAATCCCGAACCTGGGCATCTGTACACACACACAC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
abbeceeegggcghiihiihhhhiifhiiiiihiiiiiiihegh`eggfebfhg
……
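Paired-end aligners generally expect the two files to list mates in the same order. The sketch below checks this, assuming the "/1" and "/2" header suffixes seen in the reads on this slide and strict four-line FASTQ records. The helper names are hypothetical.

```python
def read_ids(lines):
    """Yield read IDs (header line without the leading '@') from FASTQ lines.

    Assumes strict 4-line records: header, sequence, '+' line, qualities.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.strip()[1:]


def mates_in_sync(lines1, lines2):
    """Return True if record i of file 1 and record i of file 2 are mates.

    Mates are assumed to share the same ID apart from a trailing /1 or /2.
    """
    ids1 = list(read_ids(lines1))
    ids2 = list(read_ids(lines2))
    if len(ids1) != len(ids2):
        return False
    return all(
        a.endswith("/1") and b.endswith("/2") and a[:-2] == b[:-2]
        for a, b in zip(ids1, ids2)
    )


if __name__ == "__main__":
    f1 = ["@r1#CGATGT/1", "ACGT", "+", "hhhh", "@r2#CGATGT/1", "ACGT", "+", "hhhh"]
    f2 = ["@r1#CGATGT/2", "TTGA", "+", "hhhh", "@r2#CGATGT/2", "GGCA", "+", "hhhh"]
    print(mates_in_sync(f1, f2))
```

For real files, the line lists would come from something like `gzip.open(path, "rt")`.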

21 Illumina sequencing using HiSeq2000
Previously: GAIIx: ~30M reads per lane, 8 lanes (1QC)
Now: HiSeq2000 + TruSeq v3: 200M reads per lane, 8-16 lanes (1-2QC) in parallel with HiSeq2000
Multiplexing: attach barcode, mix samples, sequence, identify and remove barcode

22 Full Genome Sequencing using Illumina technology ~$4-6K reagent with Illumina (storage+analysis costs not included) Exercise: you want to sequence 1 human genome at 100X coverage; how many lanes ?
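One way to approach the exercise is to divide the bases required by the bases one lane yields. The sketch below takes the 200M reads per lane quoted for the HiSeq2000, but the genome size (3 Gb) and the read length (100 bp) are assumptions, not values from the slides. It is also an assumption whether "200M reads" counts one or both mates of a pair.

```python
import math


def lanes_needed(genome_size_bp, target_coverage, reads_per_lane,
                 read_length_bp, paired=False):
    """Number of lanes needed to reach the target average coverage.

    coverage = (reads * read length) / genome size, so
    lanes = ceil(genome size * coverage / bases per lane).
    """
    bases_per_lane = reads_per_lane * read_length_bp * (2 if paired else 1)
    return math.ceil(genome_size_bp * target_coverage / bases_per_lane)


if __name__ == "__main__":
    # Assumed: 3 Gb genome, 100 bp reads; 200M reads per lane from the slides.
    print(lanes_needed(3_000_000_000, 100, 200_000_000, 100))
    print(lanes_needed(3_000_000_000, 100, 200_000_000, 100, paired=True))
```

Under these assumptions the answer is about 15 lanes single-end, or 8 lanes if each of the 200M clusters yields two 100 bp mates. Unmapped and duplicate reads would push the number up.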

23 QC for Illumina (part 1) (figure: sequencing cycles)

24 3. Primary data analysis (alignment, QC)

25 Read alignment programs
BWA (Burrows-Wheeler Aligner)
– http://bio-bwa.sourceforge.net/
– Fast, accurate, can find (short) indels
– Allows 1-3 mismatches by default
– Can also align longer 454 reads
Bowtie
– http://bowtie-bio.sourceforge.net/index.shtml
– Ultrafast, accurate, newest version finds indels too
– Allows 1-3 mismatches by default
– Integrated into TopHat (splice aligner)
Others: Eland, Maq, SOAP, etc

26 BWA tutorial (for aligning single-end reads to genome)
Get genome, e.g., from UCSC
– http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Combine into 1 file
– tar zxvf chromFa.tar.gz
– cat *.fa > wg.fa
Index the genome
– bwa index -p hg19bwaidx -a bwtsw wg.fa
Align
– bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
Convert to SAM format
– bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam

27 Aligning paired-end reads
Align the two files separately
– bwa aln -t 4 hg19bwaidx s_3_1_sequence.txt.gz > s_3_1_sequence.txt.bwa
– bwa aln -t 4 hg19bwaidx s_3_2_sequence.txt.gz > s_3_2_sequence.txt.bwa
Convert to SAM format
– bwa sampe hg19bwaidx s_3_1_sequence.txt.bwa s_3_2_sequence.txt.bwa s_3_1_sequence.txt.gz s_3_2_sequence.txt.gz > s_3_sequence.txt.sam

28 TopHat (spliced alignment), Trapnell et al, 2009
tophat -r 100 -p 4 -o outdir/ hg18 s_1_1_sequence.txt s_1_2_sequence.txt
-r: expected inner distance D between the two mates (D ~ 100 bp in the figure)
Download genome index: ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg18.ebwt.zip

29 Basic QC Fraction of mapped reads How many unique mappers ? Fraction of clonal reads (PCR duplicates)
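These three QC numbers can be computed from a SAM file. The sketch below is a minimal illustration, not a replacement for samtools flagstat. It assumes tab-separated SAM lines, that unique mappers are marked with the BWA tag XT:A:U or the TopHat tag NH:i:1, and that a duplicate-marking tool has already set the 0x400 flag bit; without such a tool the duplicate count will be zero.

```python
def basic_qc(sam_lines):
    """Return the mapped, unique-mapper and duplicate fractions for a SAM stream.

    The unique and duplicate fractions are relative to mapped reads.
    """
    total = mapped = unique = duplicates = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:                  # skip secondary/supplementary records
            continue
        total += 1
        if flag & 0x4:                    # read is unmapped
            continue
        mapped += 1
        if flag & 0x400:                  # marked as PCR/optical duplicate
            duplicates += 1
        tags = fields[11:]
        if "XT:A:U" in tags or "NH:i:1" in tags:
            unique += 1
    return {
        "total": total,
        "mapped_fraction": mapped / total if total else 0.0,
        "unique_fraction": unique / mapped if mapped else 0.0,
        "duplicate_fraction": duplicates / mapped if mapped else 0.0,
    }


if __name__ == "__main__":
    demo = [
        "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(basic_qc(demo))
```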

30 4. Read formats

31 Read formats SAM/BAM Eland/Eland Export

32 SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr1 1249826 25 51M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr1 1246336 25 1M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr1 1246336 37 1M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (T is the reference base), 46 matches
CIGAR string, e.g. 5M3487N46M = 5 bp aligned block, 3487 bp skipped on the reference (e.g. an intron), 46 bp aligned block
XT tag, e.g. XT:A:U = unique mapper; XT:A:R = more than one high-scoring match
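The CIGAR and MD fields can be split into their parts with two regular expressions. The sketch below uses values from the records on this slide; the function names are illustrative. Per the SAM specification, letters in the MD tag are reference bases, and N in a CIGAR string is a region skipped on the reference.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered, counting M, D, N, = and X operations."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def parse_md(md):
    """Split an MD value into match counts (int), mismatched reference
    bases (single letters) and deletions ('^' followed by the deleted bases)."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return [int(t) if t.isdigit() else t for t in tokens]


if __name__ == "__main__":
    print(parse_cigar("1M3487N50M"))
    print(reference_span("1M3487N50M"))
    print(parse_md("4G17G1A26"))
```

For the spliced reads above, the 51 read bases therefore span 3538 bp of chr1.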

33 Paired-end SAM
D3B4KKQ1_0161:8:2206:11080:31374#CTTGTA 83 chr1 4481348 255 51M = 4481165 0 TTAGATGCATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAG hiiiiiiihihhdhghggdiiihihffihhheihihhhgggggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2206:8294:192062#CTTGTA 147 chr1 4481355 255 51M = 4481284 0 CATTTTCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACAC efehffhgfdiihhhhhihghiiihfhihdhiihgghigefggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2204:6985:145082#CTTGTA 147 chr1 4481360 255 51M = 4481202 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA ghfhgihihghgihgiiiifiiiiihhhhfifhihhiigggeeceeeea__ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2205:15014:60805#CTTGTA 83 chr1 4481360 255 51M = 4481238 0 TCTTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTA hihheiihiiiiiiiiiiiiiiiiiifhiefhiiiiiigggggeceeebba NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1105:17802:25847#CTTGTA 83 chr1 4481362 255 51M = 4481198 0 TTACCATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAAT gheiiiihhhiiiiiiiiiihiiiiiihgfiiiiiiiigeggceeeeebb_ NM:i:0 NH:i:1
D3B4KKQ1_0161:8:1208:2232:73719#CTTGTA 147 chr1 4481366 255 51M = 4481277 0 CATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTA fhghiiiiiiiiiiiiiiiiiiihghiihiiiiihgggegfggeeeeebbb NM:i:0 NH:i:1
D3B4KKQ1_0161:8:2104:18142:93861#CTTGTA 83 chr1 4481367 255 51M = 4481198 0 ATTGTAAGAAAAATGAAAATTTTACAATTAAGTATACACTTCTAATTGTAT ihghiiiheiiiiihhihfhifgghhhhfgfhiggge_ggggeeeeee_bb NM:i:0 NH:i:1
NM = edit distance
NH = number of alignments for that read
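The second column of each record (83 or 147 here) is a bitwise FLAG. The sketch below decodes it using the bit meanings from the SAM specification; the short names in the table are my own labels.

```python
# (bit, label) pairs following the SAM specification's FLAG definition.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for flag in (83, 147, 16):
        print(flag, decode_flag(flag))
```

So 83 is a properly paired first mate on the reverse strand, and 147 is a properly paired second mate on the reverse strand.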

34 BAM format Compressed, indexable version of SAM Can be uploaded to UCSC Genome Browser

35 SAMtools http://samtools.sourceforge.net/
Convert SAM to BAM
– samtools view -bS file.sam > file.bam
Sort BAM file
– samtools sort file.bam file.sorted # (will create file.sorted.bam)
Index BAM file
– samtools index file.sorted.bam
Convert BAM to SAM
– samtools view file.bam > file.sam
Rsamtools http://www.bioconductor.org/packages/2.6/bioc/html/Rsamtools.html

36 SAMtools: get alignment statistics
– samtools flagstat pairendfile.bam
149923886 in total
0 QC failure
0 duplicates
124520915 mapped (83.06%)
149923886 paired in sequencing
74961943 read1
74961943 read2
120504218 properly paired (80.38%)
121586068 with itself and mate mapped
2934847 singletons (1.96%)
482748 with mate mapped to a different chr
143256 with mate mapped to a different chr (mapQ>=5)

37 SAMtools: get pileup
– samtools pileup file.sorted.bam (newer SAMtools versions replace pileup with samtools mpileup)
chr1 1156 T 26 tTttTTTtTttTttTtTtTTGTTTTT ggggeggggg^Vgf_fggggJceb_g
chr1 1157 T 26 tTttTTTtTttTttTtTtTTTTTTTT ggggfggggg[RgfNfgfgg`ed^]f
chr1 1158 G 26 g$GggGGGgGggGggGgGgGGGGGGGG gggg_ggggg[Ugfddgggga_eW\c
chr1 1159 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA gggaefggg_Xgf_fggggadd]Zg
chr1 1160 A 25 AaaAAAaAaaAaaAaAaAAAAAAAA ggefggggdNVgbZbgggg`ee[\g
chr1 1161 C 25 C$c$c$CCCcCccCccCcCcCCCCCCCC gfgfggfggYYgeadgggg`ea^\g
chr1 1162 C 23 C$CCcCccCccCcCcCCCCCCCC^FC fgggge_`gf_dgggge_e]_gg
chr1 1163 T 22 T$T$tTttTttTtTtTTTTTTTTT ggffg\Rgf_dggeggde]_cg
chr1 1164 C 20 cCccCccCcCcCCCCCCCCC ggg`[gf_dggggg\d[]fg
chr1 1165 A 22 a$AaaAaaAaAaAAAAAAAAA^FA^FA ged_]ggadffgggecX^ggfg
chr1 1166 G 21 G$g$g$GggGgGgGGGGGGGGGGG ggc`gfWfggfggcaSdggfe
chr1 1167 C 19 CccCcCcCCCCCCCCCCC^FC agg\dgggggbZUdfgfgg
chr1 1168 T 19 TttTtTtTTTTTTTTTTTT eggcbfgfgg_cXdegfgg
chr1 1169 T 19 TttTtTtTTTTTTTTTTTT aggccggdggccZdggfgf
chr1 1170 T 19 TttTtTtTTTTTTTTTTTT `gfcfgggggccUcggcgg
chr1 1171 A 19 AaaAaAaAAAAAAAAAAAA ege_fgggggcc[aggcgg
chr1 1172 A 19 A$aaAaAaAAAAAAAAAAAA XggLfggfggdeM_ggagg
chr1 1173 G 18 g$gGgGgGGGGGGGGGGGG gf\fgggggcfPcggegg
chr1 1174 A 17 a$AaAaAAAAAAAAAAAA fce[gggg_eL]ggfdf
chr1 1175 A 16 A$aAaAAAAAAAAAAAA dfggfggdfS[ggegg
^ = start of read at that position (followed by one mapping-quality character)
$ = end of read at that position
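Counting the bases in the fifth pileup column is the first step toward calling a variant at a position such as chr1:1156, where one read shows G against a reference T. The sketch below is a minimal parser under the usual pileup conventions: "^" plus one mapping-quality character marks a read start, "$" marks a read end, "+n"/"-n" introduce indels of length n, and "." or "," stand for the reference base. The function name is hypothetical.

```python
from collections import Counter


def count_pileup_bases(ref_base, bases):
    """Count base calls in a pileup read-bases string, ignoring strand."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":                     # read start + mapping-quality char
            i += 2
        elif ch == "$":                   # read end marker
            i += 1
        elif ch in "+-":                  # indel: sign, length, then the bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
        elif ch in ".,":                  # match to reference (either strand)
            counts[ref_base.upper()] += 1
            i += 1
        else:                             # explicit base call or '*' placeholder
            counts[ch.upper()] += 1
            i += 1
    return counts


if __name__ == "__main__":
    print(count_pileup_bases("T", "tTttTTTtTttTttTtTtTTGTTTTT"))
```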

38 SAMtools: removing clonal reads
– Multiple reads that map to the same position with the same orientation are usually considered PCR duplicates
– For mutation detection (less important for RNA-seq), need to collapse them into 1 read (e.g. the read with the highest quality score)
– samtools rmdup -s file.bam file_noclonal.bam
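The collapsing rule described here can be sketched as follows. This is a simplified illustration of the idea, not the algorithm samtools rmdup implements: reads are assumed to be plain dictionaries with hypothetical keys chrom, pos, strand and mean_qual, and soft clipping and mate positions are ignored.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (chromosome, position, strand): the best-quality one.

    Earlier reads win ties, because only a strictly higher quality replaces
    the current best read for a key.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["mean_qual"] > best[key]["mean_qual"]:
            best[key] = read
    return list(best.values())


if __name__ == "__main__":
    reads = [
        {"name": "a", "chrom": "chr1", "pos": 1249828, "strand": "-", "mean_qual": 35.1},
        {"name": "b", "chrom": "chr1", "pos": 1249828, "strand": "-", "mean_qual": 22.4},
        {"name": "c", "chrom": "chr1", "pos": 1249826, "strand": "-", "mean_qual": 18.0},
    ]
    print([r["name"] for r in collapse_clonal_reads(reads)])
```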

39 5. Secondary Analysis (transcript level quantification, mutation calling)

40 RPKM: Reads per kilobase of transcript per million reads
R: Count how many reads map to a transcript
K: Divide by (length of transcript / 1,000)
M: Divide by (total number of mapped reads in sample / 1,000,000)
CuffLinks uses FPKM (same as RPKM, F = fragment, for paired-end reads)
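The three steps translate directly into a one-line formula. The sketch below is a plain implementation of that definition; the example counts are made up for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    R: reads mapped to the transcript
    K: divide by transcript length in kilobases
    M: divide by total mapped reads in millions
    """
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions


if __name__ == "__main__":
    # Hypothetical numbers: 500 reads on a 2 kb transcript, 10M mapped reads.
    print(rpkm(500, 2_000, 10_000_000))
```

For FPKM the same function applies with fragments (read pairs) counted instead of individual reads.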

41 CuffLinks, Trapnell et al, 2010
cufflinks -p 4 -o outdir/ s_1_sequence.txt.sorted.bam

42 http://www.broadinstitute.org/software/scripture/ http://genes.mit.edu/burgelab/miso/

43

44 Detecting Single Nucleotide Variations (SNVs)

45 Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Short read: AAAATACGCGTATTCTCCCAAAACAATATC

46 Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Short read: AAAATACGCCTATTCTCCCAAAACAATATC

47 Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Short read: AAAATACGCCTATTCTCCCATAACAATATC

48 Sequencing has high error rate
Mismatch = real variation OR sequencing error
Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Short read: AAAATACGCCTATTCTCCCAAAACAATATC
Typical mismatch rate of entire datasets = 0.5-2% (errors >> real variations)
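The mismatches in these slides can be located programmatically. The sketch below slides the read along the reference without gaps and reports the offset with the fewest mismatches; this brute-force scan only illustrates the idea and is not how BWA or Bowtie work. The sequences are the ones shown on the slides, and offsets are 0-based.

```python
REFERENCE = ("AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATC"
             "TTACAAGATGTAAATATACCCAAGATG")


def best_ungapped_hit(reference, read):
    """Return (offset, mismatches) for the ungapped placement with the
    fewest mismatches; each mismatch is (ref_position, ref_base, read_base)."""
    best_offset, best_mismatches = None, None
    for offset in range(len(reference) - len(read) + 1):
        mismatches = [
            (offset + i, reference[offset + i], base)
            for i, base in enumerate(read)
            if reference[offset + i] != base
        ]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


if __name__ == "__main__":
    for read in ("AAAATACGCGTATTCTCCCAAAACAATATC",
                 "AAAATACGCCTATTCTCCCAAAACAATATC",
                 "AAAATACGCCTATTCTCCCATAACAATATC"):
        offset, mismatches = best_ungapped_hit(REFERENCE, read)
        print(offset, mismatches)
```

A single read cannot tell a G>C variant from a sequencing error at that position, which is why the following slides look at many reads per site.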

49 chr2, pos=85623221 bp Single Nucleotide Variation

50 chr14, pos=35859525 bp Single Nucleotide Variation

51 chr1, pos=220952447 Single Nucleotide Variation

52 Cancer mutations
– All cells in tumor have heterozygous mutation
– A fraction of cells have heterozygous mutation
– Loss of heterozygosity due to loss of genetic material

53 The error/mismatch rate is not uniform across read length (figure: mismatch rate along the read)

54 Popular SNV calling programs
GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
VarScan http://varscan.sourceforge.net/

55 SNVseeqer: Single Nucleotide Variation detection from deep sequencing data
N reads at considered position, k reads with mutation
Is k greater than expected by chance, given error rates p_i?
(figure: reads aligned to the genome, each with its own error rate p_1, p_3, p_5, ...)
The Poisson-Binomial distribution
Wacker et al, 2012; Jiang et al, 2012; Chen & Liu, 1997
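The question on this slide amounts to a tail probability of the Poisson-binomial distribution: the sum of N independent Bernoulli variables with different success probabilities p_i. The sketch below computes P(K >= k) by dynamic programming. It illustrates the distribution only and is not taken from SNVseeqer; how the per-read error rates p_i are obtained (for example from base qualities) is left as an assumption.

```python
def poisson_binomial_tail(error_rates, k):
    """P(K >= k), where K counts reads showing an error and read i
    independently shows an error with probability error_rates[i]."""
    dist = [1.0]                          # dist[j] = P(j errors so far)
    for p in error_rates:
        new = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            new[j] += prob * (1 - p)      # this read is correct
            new[j + 1] += prob * p        # this read shows an error
        dist = new
    return sum(dist[k:])


if __name__ == "__main__":
    # Hypothetical site: 20 reads with 1% error each, 4 reads show the variant.
    print(poisson_binomial_tail([0.01] * 20, 4))
```

A very small tail probability suggests the k mutated reads are unlikely to be sequencing errors alone.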

56 Indel calling
Complicated because indels often occur within microsatellite regions, e.g. CACACACA
– CA--CACACA is as good as CACA--CACA, CACACA--CA
Since reads are aligned independently, local realignment is needed
DINDEL (used in 1000 Genomes Project) http://www.sanger.ac.uk/resources/software/dindel/
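The ambiguity can be made concrete by listing every place a deletion could sit and still reproduce the read. The sketch below assumes a 10 bp CA-repeat reference and an 8 bp read, as in the slide's example; the function name is hypothetical.

```python
def equivalent_deletions(reference, read):
    """Return all 0-based start positions where deleting
    len(reference) - len(read) bases from the reference yields the read."""
    del_len = len(reference) - len(read)
    return [
        start
        for start in range(len(reference) - del_len + 1)
        if reference[:start] + reference[start + del_len:] == read
    ]


if __name__ == "__main__":
    print(equivalent_deletions("CACACACACA", "CACACACA"))  # microsatellite
    print(equivalent_deletions("GATTACA", "GATCA"))        # unique sequence
```

In the repeat every placement is equally good, so two reads carrying the same deletion may be aligned with the gap in different places unless they are realigned together.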

57 Variant annotation
Variants can be either mutation or (more often) polymorphism. dbSNP catalogs all known polymorphisms
Missense, nonsense, intron, 3’UTR, 5’UTR, etc
– SeattleSNP http://pga.gs.washington.edu/
Severity of missense mutations
– PolyPhen http://genetics.bwh.harvard.edu/pph2/
– Mutation Assessor http://mutationassessor.org/
GATK for variant annotation http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Cross-species conservation

58 6. Read data visualization

59 SAMtools samtools tview file.sorted.bam wg.fa

60 UCSC Genome Browser Upload BAM file to genome browser or make it accessible to UCSC from your own web page

61 Integrative Genomics Viewer (IGV)

62 Read densities (figure: read count along the genome)

63 Wiggle files for Genome Browser
variableStep chrom=chr1 span=10
1471 0.3
1481 0.6
1491 0.6
1501 0.6
1511 0.6
1521 0.6
1531 1.1
1541 1.7
1551 1.9
1561 2.1
1571 2.5
1581 2.8
1591 3.2
1601 3.9
1611 3.9
1621 4.5
1631 4.8
1641 4.2
1651 3.9
1661 3.8
1671 3.2
1681 2.4
1691 1.9
1701 1.4
1711 1.3
1721 0.8
1871 1.4
1881 4.9
1891 9.1
1901 9.7
1911 10.7
1921 11.2
1931 12.3
1941 16.5
1951 23.4
1961 29.9
1971 32.6
1981 31.8
1991 28.0
2001 29.6
2011 30.6
2021 32.7
2031 32.7
2041 29.2
http://genome.ucsc.edu/goldenPath/help/bigWig.html
http://genome.ucsc.edu/goldenPath/help/wiggle.html
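A track like this can be produced by binning per-base coverage. The sketch below is a minimal illustration assuming ungapped reads of a fixed length given by their 1-based start positions. It averages coverage in bins of `span` bp and skips empty bins, which a variableStep track allows. The function names are illustrative.

```python
def read_density(read_starts, read_length, region_start, region_end, span=10):
    """Mean per-base coverage in consecutive bins of `span` bp.

    Coordinates are 1-based and inclusive; returns (bin_start, mean) pairs.
    """
    coverage = [0] * (region_end - region_start + 1)
    for start in read_starts:
        first = max(start, region_start)
        last = min(start + read_length - 1, region_end)
        for pos in range(first, last + 1):
            coverage[pos - region_start] += 1
    bins = []
    for offset in range(0, len(coverage), span):
        window = coverage[offset:offset + span]
        bins.append((region_start + offset, sum(window) / len(window)))
    return bins


def to_wiggle(chrom, bins, span=10):
    """Format non-empty bins as a variableStep wiggle track."""
    lines = [f"variableStep chrom={chrom} span={span}"]
    lines += [f"{start} {value:.1f}" for start, value in bins if value > 0]
    return "\n".join(lines)


if __name__ == "__main__":
    bins = read_density([1, 6], read_length=10, region_start=1, region_end=20)
    print(to_wiggle("chr1", bins))
```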

64

65 7. BioConductor packages for high-throughput sequencing

66 BioC packages
IRanges http://bioconductor.org/packages/release/bioc/html/IRanges.html
Rsamtools http://bioconductor.org/packages/2.7/bioc/html/Rsamtools.html
ShortRead http://bioconductor.org/packages/release/bioc/html/ShortRead.html
rtracklayer http://bioconductor.org/packages/2.8/bioc/html/rtracklayer.html
BSgenome http://bioconductor.org/packages/release/bioc/html/BSgenome.html
And many more

67 SAMtools, Unix programs and R/BioC
Rsamtools
Unix commands can be run in R: system("samtools rmdup -s file.bam file_noclonal.bam")

68 http://manuals.bioinformatics.ucr.edu/home/ht-seq

69 8. Challenges and evolution of sequencing and its analysis

70 Storage is becoming a real problem Kahn, 2011, Science

71 Sequencing is becoming faster

72 Reads are becoming longer PacBio

73 How do you interpret sequencing data in a clinical context ?

74

75 Data integration ChIP-seq for BCL6, BCOR, SMRT, H3K79me2, H3K4me1, H3K4me3, H3K27Ac, H3K9Ac, H3K27me3, and DNA methylation (HELP) in LY1 cells Integrative statistical model Predictions / Mechanisms Experiments ChIP-seq / siRNA etc HiC

76 The end ole2001@med.cornell.edu eug2002@med.cornell.edu

