2000-2007: from the White House to another celebrated breakthrough 26 Jun 2000: Craig Venter, Bill Clinton, Francis Collins.

open access has been official policy since 1996 (Bermuda Rules)
UCSC’s browser (US) http://genome.ucsc.edu/cgi-bin/hgTracks?org=human
Ensembl’s contigView (UK) http://www.ensembl.org/Homo_sapiens/contigview?chr=X&start=151073054&end=151383976
NCBI’s mapViewer (US) http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&CHR=X&BEG=151073054&END=151383976
Nature’s human genome http://www.nature.com/nature/supplements/collections/humangenome/index.html

example1 of UCSC’s browser

example2 of UCSC’s browser

clone-by-clone versus whole-genome-shotgun sequencing
[figure labels: whole genome shotgun (WGS); clone by clone (hierarchical); 200 kb; 500 bp]

all shotgun sequencing methods randomly over-sample the genome
500-bp sequence read << 3-Gb (3 × 10^9) human genome
Lander ES, Waterman MS. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2: 231-239
LW: random sampling at 1x coverage will be full of gaps
LW: need 4x~8x coverage in order to bridge most of the gaps
redundancy sounds expensive but it is actually cheaper
private investments in technology development determine costs
redundancy of coverage lowers substitutional error rate
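The coverage argument above can be sketched numerically. This is a minimal illustration, assuming the simplest Poisson form of the Lander-Waterman model (reads placed uniformly at random, no minimum overlap required to join two reads); the function names are hypothetical, and the genome size and read length are the 3 Gb and 500 bp quoted on the slide.

```python
import math

GENOME_SIZE = 3e9   # human genome, from the slide
READ_LENGTH = 500   # bp per sequence read, from the slide


def uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read at a given fold coverage."""
    return math.exp(-coverage)


def expected_gaps(coverage, genome_size=GENOME_SIZE, read_length=READ_LENGTH):
    """Expected number of gaps, assuming zero required overlap (a simplification)."""
    n_reads = coverage * genome_size / read_length
    return n_reads * math.exp(-coverage)


for c in (1, 4, 8):
    print(f"{c}x: uncovered = {uncovered_fraction(c):.4%}, "
          f"expected gaps = {expected_gaps(c):,.0f}")
```

Under these assumptions roughly a third of the genome is missed at 1x, while the uncovered fraction drops below 2% at 4x and far lower at 8x, which matches the qualitative claim on the slide.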

repeated sequences cause the shotgun data to be misassembled
4^16 ≈ 4.3 × 10^9 (4.3 Gb) suggests that a 16-bp overlap rule will suffice for the 3-Gb human genome, but this assumes the sequence is random, and it is not
[figure labels: repeat1, repeat2, repeat1] impossible to decide between assemblies
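The arithmetic behind the 16-bp figure can be checked directly. This is a small sketch under the same idealization the slide warns about (a genome of independent, equally likely bases); the function name is hypothetical.

```python
def min_unique_kmer_length(genome_size):
    """Smallest k for which the number of possible k-mers (4**k) exceeds genome_size."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k


genome_size = 3_000_000_000  # 3-Gb human genome, from the slide
k = min_unique_kmer_length(genome_size)
print(k, 4 ** k)
```

The calculation only says that there are more possible 16-mers than positions in the genome. It says nothing about real repeats, which is why the rule fails in practice.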

biological definition of “repeat” overstates misassembly problem
SIMILARITY of two DNA sequences is statistically significant if 50% of the bases match exactly (in contrast, the threshold for similarity of two protein sequences is not well defined, but the criterion normally used is that 30% of the residues match exactly)
shotgun misassemblies occur only when two sequences are repeated more or less exactly, as judged by the mathematical rule programmed into the computer to decide when an overlap should be accepted; that rule is in general much more stringent than 50%
biological nature of the repeated motif (e.g. transposable elements vs gene duplicates) is not directly important
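The gap between a 50% similarity threshold and a stringent overlap rule can be illustrated with a toy check. This is only a sketch: the function names and the 0.97 cutoff are hypothetical, the sequences are assumed to be already aligned and of equal length, and real assemblers use alignment and quality-aware scoring rather than a plain identity fraction.

```python
def percent_identity(a, b):
    """Fraction of positions that match exactly in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def accept_overlap(a, b, min_identity=0.97):
    """Hypothetical overlap rule: accept only near-exact matches."""
    return percent_identity(a, b) >= min_identity


read_a = "ACGTACGTAC"
read_b = "ACGTTCGTAC"  # one mismatch in ten bases
print(percent_identity(read_a, read_b), accept_overlap(read_a, read_b))
```

A pair like this would count as clearly similar in the biological sense, yet it would not be merged under the stricter rule, so it could not cause a misassembly.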

factors that do affect misassembly
age of repeat: two identical sequences will diverge over time, so it is the younger repeats that are more likely to cause problems
length of repeat: the longer a repeat is, in comparison to the sequence read, the more likely it will cause problems
copy number: we can detect high copy number repeats by counting how often every sequence motif occurs in the data set, so it is only the undetectable low copy number repeats that cause problems
inbred vs outbred: the overlap rules must allow for polymorphisms, so using an outbred individual causes problems
Phred quality: one of the most important breakthroughs was the ability to compute an accurate error probability for every base call
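The per-base error probability mentioned under Phred quality is conventionally reported as a quality score Q = -10 log10(P). The conversion can be sketched as follows; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):g}")
```

On this scale Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1000.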

clone-end pairs (read mates) bridge over troublesome repeats
clone-end pairs did not resolve the 2%~3% of the genome attributed to recent segmental duplications >15-kb in length and >97% identical

gene-poor regions are due to large introns or large intergenic regions
animal vs plant comparisons are discussed in the rice genome paper
Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92
Model 1: observed in animals, large intron
Model 2: observed in plants, large intergenic region

many human introns exceed 10 kb in size, but few plant introns ever get up to 1 kb
[figure panels: Arabidopsis; Human; Rice (indica)]

many human introns have a 20-mer copy number over 10, but few plant introns do
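A k-mer copy number of this kind can be computed by counting every k-mer in the data set and looking up the count for each window of a sequence. The sketch below uses a toy sequence and k = 4 so the result is easy to check; the function names are hypothetical, and counting the forward strand only is a simplifying assumption (the slide does not say how the 20-mer copy number was defined).

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer on the forward strand of a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_copy_numbers(query, genome_counts, k):
    """Copy number in the reference counts for each k-mer window of the query."""
    return [genome_counts[query[i:i + k]] for i in range(len(query) - k + 1)]


genome = "ACGTACGTTTGACGT"   # toy data, not real sequence
counts = count_kmers(genome, k=4)
print(counts["ACGT"])                          # copies of ACGT in the toy genome
print(kmer_copy_numbers("ACGTT", counts, k=4)) # copy number per window of the query
```

For real genomes k = 20 and a dedicated k-mer counting tool would be needed, since an in-memory dictionary of all 20-mers in 3 Gb is impractical.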

nearly half of the genome can be attributed to transposable elements
after a transposable element (TE) is inserted into the genome it is under no selective pressure; hence over a period of a few hundred million years it becomes indistinguishable from random sequence

TE histories are species specific
notice the burst of activity for human SINE/Alu (blue) at 7% divergence
notice how mouse has a completely different distribution of TEs
[figure panels: human; mouse]

warning 1: computer-predicted genes are not entirely reliable
warning 2: computer-predicted functional assignments are not entirely reliable
warning 3: the experiments suggest that, like the computers, the cellular machinery is not perfect either

the human gene count betting pool 3 June 2003: Ewan Birney of the European Bioinformatics Institute (EBI) at Cambridge (UK) estimated that there are 24,500 protein coding genes, of which 3000 are most likely pseudogenes that do not count. Apparently only a few bettors placed their money on a number that low. Lee Rowen at the Institute for Systems Biology (Seattle) was the closest, betting in 2001 on a gene count of 25,947. She got half of the $1200 pool money. Sharing the other half were Paul Dear of the UK Medical Research Council, and Olivier Jaillon of Genoscope in France, with the next closest bets.

many gene sequences are conserved even in prokaryotes IHGSC’s predicted genes; transcript data subsequently taken indicated many non-conserved genes even relative to mouse ???

many functions are unknown Celera’s 26,383 genes; outer circle shows Gene Ontology (GO) functions, while inner circle shows Panther functions

many proteins arose by a rearrangement of the domains
proteins are organized into multiple DOMAINS; each is a compact 3D structure that folds, evolves, and functions independently of the rest of the protein chain

lineage-specific expansions of transcription factor architectures protein domains are biology’s answer to LEGO bricks

three distinct modes of evolution
Carroll SB. 2005. Evolution at two levels: on genes and form. PLoS Biol 3: e245
[figure labels: common in plants; common in animals]

changes in the coding sequences play a key role in the evolution of physiology (e.g. ability of organism to digest particular nutrients)
gene duplication changes occur on evolutionary time scales; alternative splicing or isoform changes occur on developmental time scales
changes in the regulatory sequences play a key role in the evolution of form (i.e. anatomical features commonly used in systematics)
alternative splicing is another form of regulation, just not often thought of in that way

known instances of regulatory evolution in animal development many genes are pleiotropic (i.e. involved in multiple processes), and it is difficult to change the coding sequences without deleterious effects; the advantage of evolving through regulatory motifs and/or alternative splicing is they can be changed in a process or tissue specific manner
