Topics The topics: basic concepts of molecular biology more on Perl overview of the field biological databases and database searching sequence alignments phylogenetic trees protein structure prediction microarray data analysis
The Human Genome Project The human genome sequence is complete - almost - approximately 3 billion base pairs. 23 chromosomes, starting from 1990 Some of these slides are adapted from Lecture Notes of Stuart M. Brown at NYU
Whole genome sequencing has now become routine
How does the human genome stack up? Organism Genome Size (Bases) Estimated Genes Human (Homo sapiens) 3.2 billion 25,000 Laboratory mouse (M. musculus) 2.6 billion Mustard weed (A. thaliana) 100 million Roundworm (C. elegans) 97 million 19,000 Fruit fly (D. melanogaster) 137 million 13,000 Yeast (S. cerevisiae) 12.1 million 6,000 Bacterium (E. coli) 4.6 million 3,200 Human immunodeficiency virus (HIV) 9700 9 U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003
The Path Forward How does DNA impact health? What do all the genes do? Identify and understand the difference in DNA sequence (A,T,C,G) among human populations What do all the genes do? Discover the functions of human genes by experimentation and by finding genes with similar funcs in the model organisms What are the functions of nongene areas? Identify important elements in the nongene regions of DNA How does info in the genome enable life? Explore life at the ultimate level of the whole organism instead of single genes/proteins. U.S. Department of Energy, 2005
Diverse applications Medicine – customized treatments, … Microbes for energy and the environment – generate clean energy source, clean up toxic wastes,… Bioanthropology – human lineage Agriculture, livestock breeding, Bioprocessing – crops&animals more resistant to diseases, efficient industrial processes,… DNA identification – implicate people accused of crimes, identify contaminants in air, water, … U.S. Department of Energy, 2005
Genomics: Journey to the Center of Biology Without doubt, the greatest achievement in biology over the past millennium has been the elucidation of the mechanism of heredity. The instructions for assembling every organism on the planet are all specified in DNA sequences that can be translated into digital information and stored in a computer for analysis. As a consequence of this revolution, biology in the 21st century is rapidly becoming an information science. Powerful new types of bioinformatics will clearly be required to assimilate and interpret the data that will issue from various types of genomics research. Eric Lander & Robert Weinberg, Science, 2000
Nucleic Acid Sequence Databases the principal nucleic acid sequence databases are GeneBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported world-wide, and exchange new and updated entries on a daily basis Nucleic acid sequence Databases EMBL (European Molecular Biology Laboratory) GenBank (USA) DDBJ (DNA Data Bank of Japan) ENSEMBL (project between EMBL - EBI and the Sanger Institute, to produce and maintain automatic annotation on selected eukaryotic genomes ) dbEST (division of GenBank) GSDB (Genome Sequence DataBase, division of GenBank)
GenBank Once upon a time, GenBank sent out sequence updates on CD-ROM disks a few times per year. .
Specialised Genomic Resources In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources. These so called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques. Specialised Genomic Resources SGD – Saccharomyces Genome Database UniGene - gene-oriented clusters from GenBank TIGR - Databases of The Institute for Genomic Research ACeDB – A C.elegans DataBase
Protein Information Resources The primary structure of a protein is its amino acid sequence The second structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands). The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold. Levels of protein sequence and structural organisation: primary tertiary secondary
Primary Protein Databases The primary structure of a protein is its amino acid sequence. These are stored in primary databases as linear alphabets that denote the constituent residues. Protein sequence Databases SWISS-PROT - Protein knowledgebase TrEMBL - Computer-annotated supplement to Swiss-Prot PIR – Protein Information Resource MIPS – Munich Information Centre for Protein Sequences NRL-3D - produced by PIR
Structure Classification DBs Contain 3D structures available from crystallographic and spectroscopic studies Structure Classification Databases PDB – Protein Data Bank CATH – Class, Architecture, Topology, Homology SCOP – Structural Classification of Proteins
PDB: Growth (2006)
Databases concerning Mutations dbSNP http://www.ncbi.nlm.nih.gov/SNP HGBASE (Human Genome Variation Database) http://hgbase.cgr.ki.se The SNP Consortium (TSC) http://snp.cshl.org
Literature Databases PubMed http://www.ncbi.nlm.nih.gov/entrez/query Bioinformatics Online http://www.bioinformatics.oupjournals.org Nature http://www.nature.com Science http://www.sciencemag.org
Systems Biology Integrate different levels of information to understand how biological systems function Use computational and mathematical models to analyze, model and simulate cellular networks, interactions and pathways.
Microarray DNA microarray is a new technology to measure the level of the mRNA gene products of a living cell.
Affymetrix GeneChip® Probe Arrays Hybridized Probe Cell * * GeneChip Probe Array * * * * Single stranded, fluorescently labeled cRNA target Oligonucleotide probe 24~50µm 1.28cm Each probe cell or feature contains millions of copies of a specific oligonucleotide probe Image of Hybridized Probe Array BGT108_DukeUniv
Bioinformatics Tools Database & searching Computational algorithms Alignment Similarity Clustering Pattern Searching Structure predictions Statistical methods Data visualization
Bioinformatics Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data; Computational biology is the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.