Published by Tyrone Webster. Modified over 9 years ago.
1
26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon Room @IMGC2012 #IMGC2012 Wi-Fi: twgroup / password: group5500
2
IMGS 2012 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory
3
Tutorial Resources
Galaxy: https://main.g2.bx.psu.edu/
Genome Analysis for Biologists: http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/
NCBI 1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
Genome Reference Consortium: http://genomereference.org/
4
Schedule
9-10 am: Intro
Genome Assembly Basics
Alignment Basics
10-11 am: Getting Stuff Done
File formats (sequences, alignments, annotations)
11 am-12 pm: Doing Stuff
Typical RNA-Seq workflow
RNA-Seq in Galaxy
Differential Gene Expression with RNA-Seq data
5
Assembly Basics 19 Oct 2012
6
Some assembly required…
7
WGS: Sanger Reads
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information ("mate-pairs")
Each end sequence is referred to as a read
Find sequence overlaps
Assembly approach: Overlap-Layout-Consensus
Figure labels: WGS contig; tails
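The "find sequence overlaps" step can be illustrated with a minimal sketch. It assumes exact (error-free) suffix-prefix matches and ignores quality scores and mate-pair constraints, both of which real assemblers use; the function names and the two example reads are hypothetical.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Only exact matches of at least `min_len` bases count (min_len must be >= 1).
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    longest_possible = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for n in range(longest_possible, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge_reads(a: str, b: str, min_len: int = 3):
    """Join two reads across their overlap, or return None if they do not overlap."""
    n = suffix_prefix_overlap(a, b, min_len)
    return a + b[n:] if n else None


# Hypothetical reads, invented for illustration only.
read1 = "ACGTTGCA"
read2 = "TGCAGGTC"
print(suffix_prefix_overlap(read1, read2))  # 4
print(merge_reads(read1, read2))            # ACGTTGCAGGTC
```

Comparing every read against every other read this way is quadratic in the number of reads, which is one reason practical overlap detection relies on indexing rather than brute force.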
8
http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf
9
Alignable trace count in frameshift window vs control in Opossum: 51 nt window, >95% identity. 23,894 genes. 452 models with >1 exon, sym. best hit, and one frameshift. 334 cases have 3 or fewer hits. Alexander Souvorov, NCBI
10
Fragmented genomes tend to have fewer frameshifts. Alexander Souvorov, NCBI
11
Fragmented genomes tend to have more partial models. Alexander Souvorov, NCBI
12
Clone based assemblies
Figure labels: BAC insert; BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps: deeper sequence coverage rarely resolves all gaps
"Finishers" go in to manually fill the gaps, often by PCR
13
Scaffold N50 by chromosome
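For readers unfamiliar with the statistic, here is a minimal sketch of one common N50 definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembled length. The function name and the example lengths are made up for illustration and are not taken from the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest scaffold down until half the total length is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Hypothetical scaffold lengths (arbitrary units).
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

Computing it per chromosome, as the slide title suggests, would just mean calling the function once per chromosome's list of scaffold lengths.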
14
7 May 2010 Spanned Gaps by Assembly
15
Church et al., 2011 PLoS Biology http://genomereference.org
16
NCBI36 (hg18) GRCh37 (hg19)
17
NCBI35 (hg17) GRCh37 (hg19) AL139246.20 AL139246.21
18
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Figure labels: Switch point; Consensus sequence
19
NCBI36
20
nsv832911 (nstd68) Submitted on NCBI35 (hg17)
21
NCBI35 (hg17) Tiling Path: NC_000015.8 (chr15)
GRCh37 (hg19) Tiling Path: NC_000015.9 (chr15)
Figure labels: Gap inserted; Moved approximately 2 Mb distal on chr15; Removed from assembly; Added to assembly
HG-24
22
Sequences from haplotype 1; sequences from haplotype 2
Old assembly model: compress into a consensus
New assembly model: represent both haplotypes
23
NCBI36 NC_000004.10 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes: TMPRSS11E, TMPRSS11E2). Xue Y et al, 2008
GRCh37 NC_000004.11 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene: TMPRSS11E)
GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene: TMPRSS11E2)
nsv532126 (nstd37)
24
GRCh37 (hg19) http://genomereference.org
Alternate loci: UGT2B17, MHC, MAPT
7 alternate haplotypes at the MHC
Alternate loci released as: FASTA, AGP, alignment to chromosome
25
Assembly (e.g. GRCh37.p2)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9
PAR
Patches
…
Genomic Regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1
26
MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)
27
Richa Agarwala Eugene Yaschenko
29
GenBank Data Archives Data in a common format Data in a single location (and mirrored) Most quality checked prior to deposition Robust data tracking mechanism (accession.version) Data owned by submitter
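The accession.version tracking scheme can be sketched in a few lines: the accession names the record, and the integer after the final dot increases each time the sequence changes. The helper names below are hypothetical; the two example identifiers are the AL139246 versions shown on an earlier slide.

```python
from typing import NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession_version(text: str) -> AccessionVersion:
    """Split an identifier such as 'AL139246.21' into accession and integer version."""
    accession, dot, version = text.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return AccessionVersion(accession, int(version))


def latest_version(identifiers):
    """Return the highest-versioned record among identifiers sharing one accession."""
    parsed = [parse_accession_version(i) for i in identifiers]
    if len({p.accession for p in parsed}) != 1:
        raise ValueError("identifiers must all share the same accession")
    return max(parsed, key=lambda p: p.version)


print(parse_accession_version("AL139246.20"))
print(latest_version(["AL139246.20", "AL139246.21"]).version)  # 21
```

Comparing versions as integers matters: a plain string comparison would rank version 9 above version 10.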
30
Data tracking: ABC14-1065514J1
Columns: Gaps, Phase, Length, Date
FP565796.1 11 21-Oct-2009
FP565796.2 10 14-Oct-2010
FP565796.3 30 07-Nov-2010
31
Mouse chrX: 35,000,000-36,000,000
32
X MGSCv3 MGSCv36
33
Unique Identification
NC_000086.6: chrX in MGSCv36
List of scaffolds and gaps (AGP)
List of components and gaps (AGP)
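As a rough illustration of what an AGP "list of components and gaps" looks like to a program, the sketch below parses the nine tab-separated columns of an AGP line, treating component types N and U as gaps. The function name and both sample lines (including the component identifier) are invented for illustration and do not come from any real assembly.

```python
GAP_COMPONENT_TYPES = {"N", "U"}  # N: gap of specified size, U: gap of unknown size


def parse_agp_line(line: str) -> dict:
    """Parse one non-comment AGP line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    record = {
        "object": cols[0],
        "object_beg": int(cols[1]),
        "object_end": int(cols[2]),
        "part_number": int(cols[3]),
        "component_type": cols[4],
    }
    if cols[4] in GAP_COMPONENT_TYPES:
        # Gap line: columns 6-8 describe the gap itself.
        record.update(is_gap=True, gap_length=int(cols[5]),
                      gap_type=cols[6], linkage=cols[7])
    else:
        # Component line: columns 6-9 locate the piece of sequence used.
        record.update(is_gap=False, component_id=cols[5],
                      component_beg=int(cols[6]), component_end=int(cols[7]),
                      orientation=cols[8])
    return record


# Hypothetical AGP content; "EXAMPLE0001.1" is a made-up component identifier.
sample = [
    "chrX\t1\t5000\t1\tF\tEXAMPLE0001.1\t1\t5000\t+",
    "chrX\t5001\t5100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
]
records = [parse_agp_line(l) for l in sample if not l.startswith("#")]
print([r["is_gap"] for r in records])  # [False, True]
```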
34
hg19 GRCh37 mm8 MGSCv37 NCBIM37 danRer5 Zv7 What’s in a name?
36
Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964
37
Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
38
hg19 GRCh37 GRCh37.p2 GCA_000001405.1 Assembly Database to the rescue GCA_000001405.3
39
http://www.ncbi.nlm.nih.gov/genome/assembly GRCh37 hg19
41
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
42
GenBank vs RefSeq
GenBank: submitter owned; redundant; updated rarely; INSDC
RefSeq: RefSeq owned; non-redundant; curated; not INSDC
BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records
BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
43
Sequence Alignments Basics
44
Hypothesis
45
The biological basis of sequence alignment is evolution
Sequences that share a common ancestor are homologous
– Sequence similarity is evidence of homology
– Sequences, genes, etc. are either homologous or not; there is no "percent homology"
46
Homology
Orthologous sequences
– Common ancestor; speciation
Paralogous sequences
– Gene duplication within a species (lineage-specific expansion)
http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html
47
Alignment to NR -> Homology Alignment to an Assembly -> Mapping
49
Global and local alignments
Optimal global alignment: Needleman-Wunsch. Sequences align essentially from end to end.
Optimal local alignment: Smith-Waterman. Sequences align only in small, isolated regions.
References
Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.
Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
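The difference between the two methods is easiest to see in the dynamic-programming recurrences. The sketch below computes scores only (no traceback) and assumes a simple linear scheme of match +1, mismatch -1, gap -1; these parameter values and the function names are illustrative assumptions, not values from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score: both sequences are aligned end to end."""
    # First row: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score: the best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere, which makes it local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "ACT"))                    # 2
print(smith_waterman_score("CCCGATTACACCC", "TTGATTACATT"))     # 7
```

The only structural differences are the initialization (gap penalties versus zeros), the floor at zero, and where the final score is read from (last cell versus the maximum over all cells).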
51
http://en.wikipedia.org/wiki/Sequence_alignment
52
Hashing methods
Query sequence: MVRRLPERTSTPACE
Word size = 3 (configurable)
Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
References
Wilbur & Lipman (1983), PNAS 80, 726-730
Lipman & Pearson (1985), Science 227, 1435-1441
Pearson & Lipman (1988), PNAS 85, 2444-2448
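The word list on this slide can be reproduced with a short sketch that slides a window of the chosen word size along the query and records where each word starts. The function names are hypothetical, and this shows only the word-indexing idea, not the full search heuristics of the cited tools.

```python
from collections import defaultdict


def words(sequence: str, word_size: int = 3):
    """Return every overlapping word of length `word_size`, in order."""
    if word_size < 1:
        raise ValueError("word_size must be at least 1")
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def build_word_index(sequence: str, word_size: int = 3):
    """Map each word to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        index[word].append(position)
    return dict(index)


query = "MVRRLPERTSTPACE"
print(" ".join(words(query)))
# MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
print(build_word_index(query)["PER"])  # [5]
```

Looking up a word in the index takes constant time on average, which is what lets hashing methods find candidate match positions without comparing the query against every position of every database sequence.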
56
http://wwwdev.ebi.ac.uk/fg/hts_mappers/ Fonseca et al., 2012
57
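Sensitivity vs. Specificity
Sensitivity = fraction of actual positives that are identified (true positives, TP)
Specificity = fraction of actual negatives that are identified (true negatives, TN)
Confusion matrix (actual vs. predicted):
Actual positives: predicted positive = TP, predicted negative = FN
Actual negatives: predicted positive = FP, predicted negative = TN
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)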
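A small worked example of the two formulas, using made-up counts for a hypothetical read-mapping evaluation (the numbers are not from any benchmark cited in this workshop):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual positives that were predicted positive: TP/(TP+FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives that were predicted negative: TN/(TN+FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
tp, fn = 90, 10    # 100 actual positives
tn, fp = 950, 50   # 1000 actual negatives

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```

Both functions raise ZeroDivisionError when there are no actual positives (or negatives), which is reasonable since the ratio is undefined in that case.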
58
Aligner technology specific? Gapped vs. ungapped alignments? Spliced alignments (cDNAs/RNA-Seq) Can use paired-end data?
59
Ruffalo et al., 2012
60
Li and Homer, 2010
61
Indels have correct and consistent alignment in reads after multiple sequence local realignment. Phase 1: NGS data processing. Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project! DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
62
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes CDC27
64
Richa Agarwala MHC Alternate locus Alignment to chr6
65
Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N
67
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes CEPH: A=1.000 G=0 APOL1
68
YRI: A=0.5852 G=0.4148 Multiple submissions Frequency Data 1000G Suspect Sudmant et al., 2010