1 Genomics and Personalized Care in Health Systems
Lecture 1: Introduction
Leming Zhou, PhD
Department of Health Information Management
School of Health and Rehabilitation Sciences
The University of Pittsburgh

2 Department of Health Information Management Textbooks
Jonathan Pevsner, Bioinformatics and Functional Genomics, Second Edition, Wiley-Blackwell, 2009.
Ebook: Genes and Disease, searchable and freely available at http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/gnd/gnd.pdf or http://www.ncbi.nlm.nih.gov/disease/

3 Department of Health Information Management Course Description
This course provides a general introduction to genomics, gene structure and annotation, and gene and disease association. Other topics, such as RNA and protein structure and microarray experiments, will also be covered briefly. Students will learn gene structure and become familiar with various genome analysis tools by working on novel gene annotation projects.

4 Department of Health Information Management Course Objectives (1/2)
Explain eukaryotic gene structure and the central dogma of molecular biology
Demonstrate the skills of annotating eukaryotic genes using online tools
Demonstrate the skills of performing sequence similarity searches using BLAST
Demonstrate the skills of collecting evidence from the UCSC Genome Browser
Describe major DNA and protein databases and the methods of extracting data from them

5 Department of Health Information Management Course Objectives (2/2)
Explain major gene finding methods, their advantages and disadvantages
Describe different types of genetic diseases and the relationship between genetic variations and diseases
Demonstrate the skills of determining protein and RNA secondary structures using online tools
Explain basic ideas behind microarray and DNA sequencing technologies

6 Department of Health Information Management Method of Presentation
Lectures
In-Class Laboratory Sessions
Student Projects and Presentations
Term Paper (graduate students)

7 Department of Health Information Management Course Outline (Tentative)

8 Basic Concepts

9 Department of Health Information Management DNA (1/3)
DNA (Deoxyribonucleic Acid), a helical molecule comprising a sequence of four nucleotides (bases)
  –Adenine (A) – purine; Thymine (T) – pyrimidine
  –Guanine (G) – purine; Cytosine (C) – pyrimidine
(Figure: structures of cytosine, thymine, adenine, and guanine)

10 Department of Health Information Management DNA (2/3)
A is always paired with T, while G is always paired with C

11 Department of Health Information Management DNA (3/3)
A DNA sequence can be either single-stranded or double-stranded
DNA sequences have an orientation: from 5’ to 3’ or from 3’ to 5’ (chemical conventions)
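The pairing rule and the strand orientation described on the DNA slides can be combined into a reverse-complement operation. The following is a minimal Python sketch, not part of the original lecture; the function name and the example sequence are chosen here for illustration, and the input is assumed to contain only A, C, G, and T.

```python
# Watson-Crick pairing from the slides: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in its own 5' to 3' direction."""
    # The two strands run antiparallel, so the sequence is reversed
    # before each base is replaced by its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("ATGGCT"))  # AGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.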

12 Department of Health Information Management Nucleotides

13 Department of Health Information Management RNA
RNA (RiboNucleic Acid), usually a single-stranded molecule
It comprises four nucleotides
  –A, C, G, and U (Uracil)
Produced by copying one of the two strands of a DNA molecule in the 5’ to 3’ direction
Different types of RNAs
  –Messenger RNA (mRNA)
  –Transfer RNA (tRNA)
  –Ribosomal RNA (rRNA)
  –…
(Figure: structure of uracil)

14 Department of Health Information Management Protein
A molecule comprising a long chain of amino acids connected by peptide bonds
There are 20 standard amino acids encoded by the universal genetic code
Molecular Biology of the Cell, Alberts et al. 2002

15 Department of Health Information Management Cell Types
Prokaryotes: organisms that lack a nuclear membrane, such as blue-green algae and common bacteria (Escherichia coli). They comprise two major taxa: Archaea and Bacteria
Eukaryotes: unicellular and multicellular organisms whose cells have a membrane-bound nucleus, such as yeast, fruit fly, mouse, plants, and human

16 Department of Health Information Management Gene
A stretch of DNA containing the information necessary for coding a protein/polypeptide
Promoter region
Transcription Factor Binding Site
Translation Start Site
Exon: coding (informative) regions of the DNA
Intron: noninformative regions between exons
Untranslated region (UTR)
Codons

17 Department of Health Information Management Eukaryotic Gene Structure http://www.nslij-genetics.org/pic/dna-rna-protein.jpg

18 Department of Health Information Management Eukaryotes
In eukaryotes, transcription is complex:
  –Many genes contain alternating exons and introns
  –Introns are spliced out of the primary transcript (pre-mRNA)
  –The mature mRNA then leaves the nucleus to be translated by ribosomes
Genomic DNA: entire gene including exons and introns
  –The same genomic DNA can produce different proteins by alternative splicing of exons
Complementary DNA (cDNA): spliced sequence containing only exons
  –cDNA can be manufactured by capturing mRNA and performing reverse transcription
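The relationship between genomic DNA, exons, and the spliced (cDNA-like) sequence can be illustrated with a few lines of Python. This is a sketch only: the toy gene, its exon coordinates, and the function name `splice` are invented for illustration and do not come from the lecture or from any real annotation.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(genomic[start:end] for start, end in exons)


# Made-up toy gene: uppercase marks exons, lowercase marks introns.
genomic = "ATGGCC" + "gtaagtttcag" + "GATTAC" + "gtgagtccag" + "TGA"
exons = [(0, 6), (17, 23), (33, 36)]

# Using every exon gives the full spliced sequence.
print(splice(genomic, exons))                  # ATGGCCGATTACTGA
# Skipping the middle exon mimics one alternative splicing outcome.
print(splice(genomic, [exons[0], exons[2]]))   # ATGGCCTGA
```

The second call shows how the same genomic DNA can yield a different product when a different set of exons is joined.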

19 Department of Health Information Management Central Dogma of Molecular Biology
DNA → RNA → Protein
DNA → RNA: Transcription
RNA → Protein: Translation

20 Department of Health Information Management DNA Transcription
RNA molecules are synthesized by RNA polymerase
RNA polymerase binds to the promoter region on DNA
The promoter region contains the start site
Transcription ends at the termination signal site
Primary transcript: the RNA copied directly from DNA, before processing
RNA splicing: introns are removed to make the mRNA
mRNA: contains the sequence of codons that code for a protein
Splicing and alternative splicing
Post-transcriptional modification
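At the sequence level, transcription can be sketched in Python in two equivalent ways: starting from the coding strand or starting from the template strand. The function names and the six-base example are assumptions made for illustration, and promoters, termination signals, and splicing are deliberately ignored.

```python
# RNA base that pairs with each DNA base on the template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_coding(coding_5to3: str) -> str:
    """The mRNA matches the coding strand, with U in place of T."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_template(template_5to3: str) -> str:
    """Build the mRNA by pairing RNA bases against the template strand.

    The template is supplied 5' to 3' but is read 3' to 5' during
    synthesis, so it is walked backwards here.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5to3.upper()))


print(transcribe_coding("ATGGCC"))    # AUGGCC
print(transcribe_template("GGCCAT"))  # AUGGCC
```

Both calls give the same mRNA because GGCCAT is the reverse complement of ATGGCC.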

21 Department of Health Information Management DNA Translation
Ribosomes are made of protein and rRNA
mRNA goes through the ribosomes
Initiation factors: proteins that catalyze the start of translation
tRNA brings the different amino acids to the ribosome complex so that they can be attached to the growing amino acid chain
When a STOP codon is encountered, the ribosome releases the mRNA and synthesis ends
An open reading frame (ORF): a contiguous sequence of DNA starting at a start codon and ending at a STOP codon
http://www.youtube.com/watch?v=5bLEDd-PSTQ
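The ORF definition on this slide translates fairly directly into code. The sketch below is an illustration rather than the course's method: it scans only the three forward reading frames, uses the standard genetic code, assumes the input contains only A, C, G, and T, and uses a made-up 19-base example sequence.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(dna: str, min_codons: int = 2) -> list[str]:
    """Return ORFs (ATG through the first in-frame stop) in the forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # remember the start codon
            elif start is not None and CODON_TABLE[codon] == "*":
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append(dna[start:i + 3])
                start = None                    # look for the next ORF
    return orfs


def translate(orf: str) -> str:
    """Translate an ORF codon by codon and drop the trailing stop symbol."""
    protein = "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))
    return protein.rstrip("*")


orfs = find_orfs("CCATGGCCGATTACTGATT")
print(orfs)                 # ['ATGGCCGATTACTGA']
print(translate(orfs[0]))   # MADY
```

A fuller ORF finder would also scan the reverse-complement strand, which adds three more reading frames.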

22 Department of Health Information Management Chromosomes
A chromosome is a long and tightly wound DNA string (visible under a microscope)
Chromosomes can be linear or circular
Prokaryotes usually have a single chromosome, often a circular DNA molecule
Eukaryotic chromosomes appear in pairs (diploid), each inherited from one parent
  –Homologous chromosomes carry the same genes
  –Some genes are the same in both parents
  –Some genes appear in different forms called alleles, e.g., the human ABO blood type has three alleles: A, B, and O
All genes are present in all cells, but a given cell type expresses only a small portion of them

23 Department of Health Information Management Chromosomal Location

24 Department of Health Information Management Genome
The genome is formed by one or more chromosomes
A genome is the entire set of all DNA contained in a cell
A human somatic cell carries 46 chromosomes (23 pairs)
The haploid human genome is about 3 billion base pairs long

25 Department of Health Information Management Genome Sequences
Retrieved on 1/8/2012 from http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

Species      Complete   Draft Assembly (Almost complete)   In process   Total
All          1153       1285                               889          3327
Eukaryotes   36         319                                294          649

26 Department of Health Information Management Genome Sequence Sizes
DNA sequence size is measured in base pairs (bp)

Organism                               Size (bp)
Phage phiX174                          5,368
HIV                                    9,193
SARS                                   29,751
Haemophilus influenzae (bacteria)      1,830,000
Escherichia coli K12                   4,600,000
Saccharomyces cerevisiae (yeast)       12,500,000
Drosophila melanogaster (fruit fly)    180,000,000
Arabidopsis thaliana (thale cress)     125,000,000
Homo sapiens (human)                   3,000,000,000

27 Department of Health Information Management The Whole Picture

28 Department of Health Information Management Genomics
The definition of genomics may differ from person to person
Genomics involves large data sets (whole genome sequences) and high-throughput methods (DNA sequencing technologies)
  –By contrast, genetics research focuses on one gene or a set of genes
Genomics may or may not include other specific research areas, such as proteomics, transcriptomics, variomics, metabolomics, etc.
In this course, genomics includes DNA sequence analysis, genomic variations, gene expression, and proteomics.

29 Department of Health Information Management Topics in This Course
Molecular Biology Databases
Sequence Alignment
BLAST Search
Genome Browser
Gene Finding Methods
Genomic Variations and Disease
Protein and RNA Secondary Structure
High-throughput Technologies

30 Molecular Biology Databases

31 Department of Health Information Management Important Databases
Genome
  –NCBI
  –European Molecular Biology Laboratory (EMBL)
  –DNA Data Bank of Japan (DDBJ)
  –GO (Gene Ontology)
  –Consortium of databases: FlyBase, Mouse Genome Database (MGD)
Protein
  –Protein Data Bank (PDB)
  –EMBL-EBI (European Bioinformatics Institute): UniProt, ExPASy, Swiss-Prot
KEGG: Kyoto Encyclopedia of Genes and Genomes

32 Department of Health Information Management NCBI (www.ncbi.nlm.nih.gov)
NCBI – National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information
NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information
Databases
  –GenBank, dbSNP, RefSeq, etc.
  –PubMed, OMIM, MMDB, UniGene
  –The Taxonomy Browser
Tools
  –BLAST, Cn3D, etc.
  –Entrez is NCBI’s search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data

33 Department of Health Information Management PDB (www.pdb.org)
The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids.
Understanding the shape of a molecule helps to understand how it works.
The PDB was established in 1971 at Brookhaven National Lab and originally contained 7 structures
In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB
PDB provides
  –Sequence, atomic coordinates, derived geometric data, secondary structure, annotations, and literature references for each protein

34 Department of Health Information Management KEGG
KEGG: Kyoto Encyclopedia of Genes and Genomes
Contains pathway information as well as the following (as of 1/10/2011)
  –KEGG PATHWAY: 126,336 pathways generated from 379 reference pathways
  –KEGG GENES: 6,121,933 genes in 139 eukaryotes + 1144 bacteria + 94 archaea
  –KEGG GENOME: 1,508 organisms
  –KEGG DISEASE: 375 diseases
  –KEGG DRUG: 9,316 drugs

35 Sequence Alignment

36 Department of Health Information Management Sequence Similarity
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of a DNA or amino acid sequence that preserve the properties of the original residue.
The distance between two sequences, based on an evolutionary model, estimates how long ago the two sequences shared a common ancestor

37 Department of Health Information Management Sequence Alignment
Sequence alignment is the procedure of comparing two or more DNA or protein sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
Given two sequences A and B, an alignment is a pair of sequences A’ and B’ such that:
1. A’ is obtained from A by inserting gap characters ‘-’
2. B’ is obtained from B by inserting gap characters ‘-’
3. A’ and B’ have the same length: |A’| = |B’|
4. No position has gap characters in both A’ and B’
Example:
A  = ATGGCT
B  = TGCTA
A’ = ATGGCT-
B’ = -TG-CTA
Goal: given two sequences, find the “best” alignment according to some scoring function
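The four conditions in this definition can be checked mechanically, and a scoring function can then rank valid alignments. The Python sketch below is illustrative only: the function names are invented here, and the scoring values (+1 match, -1 mismatch, -1 gap) are an assumption, since the slide does not specify a scoring function.

```python
def is_valid_alignment(a: str, b: str, a_aln: str, b_aln: str) -> bool:
    """Check the four conditions from the alignment definition."""
    return (
        a_aln.replace("-", "") == a           # 1. A' is A plus gaps
        and b_aln.replace("-", "") == b       # 2. B' is B plus gaps
        and len(a_aln) == len(b_aln)          # 3. same length
        and all(x != "-" or y != "-"          # 4. no column with two gaps
                for x, y in zip(a_aln, b_aln))
    )


def alignment_score(a_aln: str, b_aln: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Sum a simple per-column score (assumed values, not from the slides)."""
    total = 0
    for x, y in zip(a_aln, b_aln):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# The example from the slide.
print(is_valid_alignment("ATGGCT", "TGCTA", "ATGGCT-", "-TG-CTA"))  # True
print(alignment_score("ATGGCT-", "-TG-CTA"))                        # 1
```

With these assumed values the slide's example has four matching columns and three gap columns, giving a score of 1.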

38 Department of Health Information Management Types of Sequence Alignment
Pairwise Alignment – compare two sequences
Multiple Alignment – compare three or more sequences
For each of the above we can do
  –Local Alignment – compare similar parts of the sequences
  –Global Alignment – compare the whole sequences
For the different types of alignments there are different assumptions and methods

39 Department of Health Information Management Global Alignment vs. Local Alignment Local alignment: finds continuous or gapped high-scoring regions which do not span the entire length of the sequences being aligned Global alignment: finds the optimal full-length alignment between the two sequences being aligned

40 Department of Health Information Management Pairwise Alignment The process of lining up two sequences to achieve maximal levels of identity/similarity for the purpose of assessing the degree of similarity and the possibility of homology. It is used to decide if two genes are structurally or functionally related It is used to identify domains or motifs that are shared between proteins It is used in the analysis of genomes

41 Department of Health Information Management An Example of Pairwise Alignment

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
    . ||| |. |... | :.||||.:| :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 LAC

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
    : | | | | :: |.|. || |: || |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 LAC

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
    || ||. | :.|||| |..|
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 LAC

Symbols between two sequences (Ssearch format):
Bar: identical; One dot: somewhat similar; Two dots: very similar
Dots in sequences: gaps

42 Department of Health Information Management Multiple Sequence Alignment
Multiple sequence alignment is an alignment of three or more sequences such that each column of the alignment is an attempt to represent the evolutionary changes in one sequence position, including substitutions, insertions, and deletions.
It is believed that over time the functional components embedded within the sequences are conserved in order to retain function
  –One of the most important elements of sequences is the phylogenetic information that similarities represent
  –The sequence similarities give insight into the evolution of families of protein or DNA sequences

43 Department of Health Information Management An Example of Multiple Sequence Alignment

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

44 Department of Health Information Management Evolutionary Basis of Sequence Comparison
The simplest molecular mechanisms of evolution are substitution, insertion, and deletion
If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions
Residues that are aligned with a gap in the other sequence represent insertions or deletions
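Reading an alignment column by column in this way is easy to sketch in Python. The function name `classify_columns` is an assumption for illustration, and the gap character is taken to be '-' as in the alignment definition on slide 37.

```python
def classify_columns(a_aln: str, b_aln: str) -> dict[str, int]:
    """Count matches, substitutions, and indels in a pairwise alignment."""
    if len(a_aln) != len(b_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for x, y in zip(a_aln, b_aln):
        if x == "-" or y == "-":
            counts["indel"] += 1          # residue aligned with a gap
        elif x == y:
            counts["match"] += 1          # identical residues
        else:
            counts["substitution"] += 1   # aligned but different residues
    return counts


print(classify_columns("ATGGCT-", "-TG-CTA"))
# {'match': 4, 'substitution': 0, 'indel': 3}
```

An alignment alone cannot tell an insertion in one sequence from a deletion in the other, which is why the two are counted together as indels.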

45 Department of Health Information Management Homology
Homology: Similarity attributed to descent from a common ancestor.
There are two types of homologs: orthologs and paralogs
Orthologs:
  –Homologous sequences in different species that arose from a common ancestral gene during speciation
  –May or may not be responsible for a similar function
  –Members of a gene family in various organisms
Paralogs:
  –Homologous sequences within a single species that arose by gene duplication
  –Members of a gene family within a species
Genes either are homologous, or they are not. There are no degrees of homology

47 Blast Search

48 Department of Health Information Management Similarity Search
Find statistically significant matches to a protein or DNA sequence of interest.
Obtain information on the inferred function of the gene
Sequence alignment algorithms
  –Dynamic programming
     Needleman-Wunsch global alignment (1970)
     Smith-Waterman local alignment (1981)
     Guaranteed to find the best alignment
     Slow, especially when searching against a large database
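The two dynamic programming methods named on this slide differ mainly in their boundary conditions and in whether scores may drop below zero. The Python sketch below computes only the optimal scores (no traceback) and assumes a simple linear scoring scheme of +1 match, -1 mismatch, -1 gap; real tools use substitution matrices and affine gap penalties, so the numbers here are for illustration only.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]        # row 0: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                               # column 0: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]                                    # bottom-right cell


def smith_waterman_score(a: str, b: str,
                         match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal local alignment score: cells are floored at 0, best cell wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ATGGCT", "TGCTA"))  # 1
print(smith_waterman_score("ATGGCT", "TGCTA"))    # 3
```

Both functions fill a table with len(a) x len(b) cells, which is why exhaustive dynamic programming becomes slow when a query is compared against every sequence in a large database.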

49 Department of Health Information Management FASTA and BLAST
Sequence alignment heuristics
  –FASTA and BLAST: heuristic approximations to Smith-Waterman
     Fast, with results comparable to the Smith-Waterman algorithm
     FASTA and BLAST also calculate the statistical significance of the reported alignments

50 Department of Health Information Management BLAST
Basic Local Alignment Search Tool: a sequence comparison algorithm, optimized for speed, that is used to search sequence databases for optimal local alignments to a query.
Expected Value (E)
  –The number of matches expected to occur randomly with a given score.
  –The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
  –The lower the E value, the more significant the match.
  –The Expect value can be any positive real number.
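The slide does not give a formula for E, but the commonly cited Karlin-Altschul relation is E = K·m·n·e^(−λS), where m and n are the query and database lengths and K and λ depend on the scoring system. The Python sketch below assumes that relation; the function names and the numeric values for λ, K, and the lengths are illustrative placeholders, not parameters taken from the lecture or from a particular BLAST run.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S) for raw score S."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_bits(bits: float, query_len: int, db_len: int) -> float:
    """Equivalent form using the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameter values only (assumed, not from the slides).
LAM, K = 0.267, 0.041
m, n = 147, 1_000_000

for s in (40, 80, 120):
    print(s, expect_value(s, m, n, LAM, K))   # E shrinks as the score grows
```

Because E scales with the database size n, the same alignment score is less significant in a larger database.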

51 Department of Health Information Management BLAST Search
>seq example
GAKKVIISAPSADAPMFVCGVNLDAYKPDMKVVSNASCTTNCLAPLAK
VINDNFEIVEGLMTTVHATTATQKTVDGPSGKLWRDGRGAAQNIIPAST
GAAKAVGKVIPALNGKLTGMAFRVPTPNVSVVDLTVRLGKGASYDEIKAK
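The query on this slide is in FASTA format: a header line starting with '>' followed by one or more sequence lines. Before submitting such a record to a search tool it is often read into memory first; the parser below is a minimal Python sketch written for illustration (the function name is an assumption, and it is not the parser of any particular library).

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header (without '>') to its sequence with line breaks removed."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line       # append sequence lines in order
    return records


# The record from the slide.
example = """>seq example
GAKKVIISAPSADAPMFVCGVNLDAYKPDMKVVSNASCTTNCLAPLAK
VINDNFEIVEGLMTTVHATTATQKTVDGPSGKLWRDGRGAAQNIIPAST
GAAKAVGKVIPALNGKLTGMAFRVPTPNVSVVDLTVRLGKGASYDEIKAK
"""

records = parse_fasta(example)
print(list(records))                  # ['seq example']
print(len(records["seq example"]))    # sequence length in residues
```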

52 Genome Browser

53 Department of Health Information Management Genome Browser
A genome browser is a computer program that helps to display gene maps, browse chromosomes, and align genes or gene models with ESTs, contigs, etc.
UCSC Genome Browser (http://genome.ucsc.edu)

54 Department of Health Information Management NCBI Mapviewer

55 Gene Finding Methods

56 Department of Health Information Management Gene Prediction Methods
Ab initio gene prediction programs
Programs using expressed sequences
Programs using evolutionary conservation

57 Department of Health Information Management Evolution
Evolution works in two ways:
  –Mutation
  –Selection pressure to eliminate random mutations
Mutations that cause frameshifts in the coding exons of important proteins will most likely not survive.
Mutations in introns or in non-gene regions have very little effect on the survival of the species and therefore tend to be kept in the sequence.
When two sequences are aligned and compared, the conserved regions are most likely to be gene regions.

58 Department of Health Information Management Gene Annotation http://www.pggrc.co.nz/Portals/0/Mbb%20ruminantium%20genome%20DIAGRAM.jpg

59 Genomic Variations and Disease

60 Department of Health Information Management DNA Variations
DNA Mutation
  –Synonymous mutations
  –Non-synonymous mutations
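A synonymous mutation changes a codon without changing the encoded amino acid, while a non-synonymous mutation changes the amino acid (or creates a stop). The Python sketch below illustrates the distinction with the standard genetic code; the function name and the two example substitutions are chosen here for illustration and are not taken from the slides.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change at 0-based position within a DNA codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"


# GAA -> GAG: both codons encode glutamate (E).
print(classify_substitution("GAA", 2, "G"))   # synonymous
# GAG -> GTG: glutamate (E) becomes valine (V).
print(classify_substitution("GAG", 1, "T"))   # non-synonymous
```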

61 Department of Health Information Management Genome Sequences and Diseases http://genomics.energy.gov

62 Department of Health Information Management Single Nucleotide Polymorphisms
Genomic sequences from two unrelated individuals are 99.9% identical.
The 0.1% difference is due to genetic variations, mainly one form of variation called single nucleotide polymorphisms (single-base mutations).
Other genetic variations may be produced by nucleotide insertions and deletions (tandem repeat polymorphisms and insertion/deletion polymorphisms)
These polymorphisms are considered one of the key factors that make each of us different, and they can have a major impact on how we respond to diseases; to environmental insults such as bacteria, viruses, and chemicals; and to drugs and other therapies.

63 Department of Health Information Management SNPs and Mutations
Terminology for variation at a single nucleotide position is defined by allele frequency.
  –A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  –When a single base change occurs at <1%, it is considered to be a mutation
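The frequency-based terminology can be expressed as a small Python sketch. The sample data below are made up, the function names are assumptions, and the slide does not say how a frequency of exactly 1% should be labeled, so the handling of that boundary here is an arbitrary choice.

```python
def alt_allele_frequency(alleles: str, reference: str) -> float:
    """Fraction of sampled chromosomes whose base differs from the reference."""
    return sum(base != reference for base in alleles) / len(alleles)


def classify_variant(frequency: float, threshold: float = 0.01) -> str:
    # The slide gives >1% (SNP) and <1% (mutation); labeling exactly 1%
    # as a mutation is an assumption made for this sketch.
    return "SNP" if frequency > threshold else "mutation"


# Made-up samples: one base observed at the same position in 1000 chromosomes.
common = "C" * 970 + "T" * 30    # 3% carry T instead of C
rare = "C" * 995 + "T" * 5       # 0.5% carry T instead of C

print(classify_variant(alt_allele_frequency(common, "C")))  # SNP
print(classify_variant(alt_allele_frequency(rare, "C")))    # mutation
```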

64 Protein/RNA Structure

65 Department of Health Information Management RNA Structure
RNA can have a complicated secondary structure
Genes VIII, Lewin, 2004

66 Department of Health Information Management Protein Structure
Primary structure: amino acid sequence
Secondary structure: local structure such as alpha helices and beta sheets
Tertiary structure: 3D structure of a protein monomer
Quaternary structure: 3D structure of a fully functional protein (protein complexes)

67 Department of Health Information Management Protein Secondary Structure
Proteins can have secondary structure: alpha helix and beta sheet
Molecular Cell Biology, Lodish et al. 2000

68 Department of Health Information Management Protein 3D Structure
Protein structure is closely related to its biological function/activity
One protein may have multiple domains, which are used for functional interactions with different molecules
  –Domains in one protein may interact extensively or simply be connected by the protein sequence
Human P53 core domain, MMDB ID: 69151, PDB ID: 3D0A

69 High-Throughput Technologies

70 Department of Health Information Management DNA Sequencing Technologies
Sanger method, 1977
  –Used in Human Genome Project
  –Slow and expensive ($300m/genome)
Whole Genome Shotgun sequencing (1990s)
  –Break the genome into short pieces
  –Sequence all the pieces in parallel
  –Put all the pieces back together (sequence assembly)
  –Faster and cheaper (~$10m/genome)
Next generation sequencing technologies (2000s)
  –Much faster speed and lower cost (<$5k/genome, 2010)
  –May be used for personal genomics

71 Department of Health Information Management http://beespotter.mste.uiuc.edu/topics/genome/Honey%20bee%20genome.html

72 Department of Health Information Management Microarray http://www.coriell.org/index.php/content/view/93/184/

