Genomics and Personalized Care Lab Session Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management.

Outline Nucleotide, protein, genetic variation, gene and disease association databases –NCBI GenBank; protein structure; dbSNP; OMIM Pairwise sequence alignment BLAST search UCSC genome browser

NCBI Created as part of the National Library of Medicine in 1988 –Establish public databases –Perform research in computational biology –Develop software tools for sequence analysis –Disseminate biomedical information Databases –Sequence, such as GenBank, RefSeq, dbSNP –Literature, such as PubMed, OMIM Tools –Entrez, BLAST, Cn3D, etc.

NCBI Homepage

GenBank Nucleotide-only sequence database GenBank Data –Direct submissions of individual records (BankIt, Sequin) –Batch submissions via email (EST, GSS, STS) –ftp accounts established for sequencing centers Data shared nightly among three collaborating databases: –GenBank –DNA Data Bank of Japan (DDBJ) –European Molecular Biology Laboratory Database (EMBL)

GenBank Record (Header)
LOCUS       NM_001963    4913 bp    mRNA    linear   PRI 20-SEP-2009
DEFINITION  Homo sapiens epidermal growth factor (beta-urogastrone) (EGF), mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.3  GI:166362727
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4913)
  AUTHORS   Hosgood,H.D. III, Menashe,I., He,X., Chanock,S. and Lan,Q.
  TITLE     PTEN identified as important risk factor of chronic obstructive pulmonary disease
  JOURNAL   Respir Med (2009) In press
   PUBMED   19625176
  REMARK    GeneRIF: Observational study of gene-disease association.

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..4913
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..4913
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor (beta-urogastrone)"
                     /db_xref="GeneID:1950"
                     /db_xref="HGNC:3229"
                     /db_xref="HPRD:00578"
                     /db_xref="MIM:131530"
     exon            1..579
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /inference="alignment:Splign"
                     /number=1
     CDS             453..4076
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="beta-urogastrone"
                     /codon_start=1
                     /product="epidermal growth factor precursor"
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="CCDS:CCDS3689.1"
                     /db_xref="GeneID:1950"
                     /db_xref="HGNC:3229"
                     /db_xref="HPRD:00578"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP
                     APFLIFSHGNSIFRIDTEGTNYEQLVVDAGVSVIMDFHYNEKRIYWVDLERQLLQRVF
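Feature locations such as the CDS at 453..4076 are 1-based and include both end positions, so the length is end minus start plus one. The following is a minimal Python sketch of that arithmetic. It assumes plain `start..end` locations only (no `join(...)` or `complement(...)`), and the function names are hypothetical, not part of any NCBI tool.

```python
def parse_simple_location(location):
    """Parse a GenBank 'start..end' location into 1-based inclusive integers.

    Assumption: the location is a plain range, not a join or complement.
    """
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def feature_length(location):
    """Number of bases covered by a 1-based inclusive 'start..end' location."""
    start, end = parse_simple_location(location)
    return end - start + 1


cds_length = feature_length("453..4076")  # CDS of NM_001963 in the record above
codon_count = cds_length // 3             # includes the stop codon for a complete CDS
print(cds_length, codon_count)
```

For this record the CDS spans 3624 bases, that is 1208 codons; if the last codon is a stop codon, the translated precursor would be 1207 residues long.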

GenBank Record (Sequence) ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc 61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt 121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt 181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc 241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga 301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag 361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg 421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc 481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg 541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt 601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg 661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt 721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga 781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag 841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt 901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa 961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

FASTA: Sequence Format

Protein Structure

Crystal Structure of a Protein

Protein Structure Databases Proteins take on 3D structure 3D data for some proteins is available due to techniques such as NMR and X-ray crystallography –PDB http://www.pdb.org/ –SCOP http://scop.mrc-lmb.cam.ac.uk/scop –MMDB http://www.ncbi.nlm.nih.gov/Structure/

Genetic Variations

Polymorphisms Genomic sequences from two unrelated individuals are 99.9% identical. The 0.1% difference is due to genetic variations, most of which (~90%) are one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations).

Importance of Genetic Variations Genetic variations underlie phenotypic differences among individuals Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals Genetic variations reveal clues to ancestral human migration history

Major Types of Genetic Variations Single nucleotide mutation –The majority of SNPs do NOT directly contribute to any phenotype Insertion or deletion of one or more nucleotides –Tandem repeat polymorphisms (genomic regions of variable length in which sequence motifs, usually 1-100 bases long, repeat in tandem with variable copy number) Used as genetic markers for DNA fingerprinting (forensic, parentage testing) Many cause genetic diseases –Insertion/deletion polymorphisms (often result from localized rearrangements between homologous tandem repeats) Gross chromosomal aberration –Deletions, inversions, or translocations of large DNA fragments –Often cause serious genetic diseases

The Effect of SNPs The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene), as well as the nature of the mutation (synonymous or non-synonymous) –No consequence –Affect gene transcription quantitatively or qualitatively –Affect gene translation quantitatively or qualitatively –Change protein structure and functions –Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene –e.g. Huntington’s disease, cystic fibrosis, etc. Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors –e.g. cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.

Genetic Variations Databases dbSNP –http://www.ncbi.nlm.nih.gov/SNP/ Online Mendelian Inheritance in Man (OMIM) –http://www.ncbi.nlm.nih.gov/omim International HapMap Project –http://www.hapmap.org/ Genome Variation Server (Seattle SNPs) –http://gvs.gs.washington.edu/GVS/

dbSNP The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations This collection of polymorphisms includes: –Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs) Roughly 10 million in the human population, or on average 1 per 300 bp Less than half of these SNPs are identified and stored in the database –Microsatellite repeat variations (or short tandem repeats, STRs) In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome –Small-scale multi-base deletions or insertions Short insertions/deletions are difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs
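The figure of roughly 10 million SNPs follows from the density of 1 per 300 bp once a genome size is fixed. The short Python sketch below reproduces that arithmetic, together with the 0.1% difference and ~90% SNP share quoted on the Polymorphisms slide. The genome size of about 3 billion base pairs is an assumption introduced here; the slides do not state it.

```python
# Assumption (not stated in the slides): haploid human genome of ~3 billion bp.
GENOME_BP = 3_000_000_000

# dbSNP slide: on average 1 SNP per 300 bp across the population.
expected_snps = GENOME_BP // 300

# Polymorphisms slide: two unrelated individuals differ at ~0.1% of positions,
# and ~90% of those differences are SNPs.
pairwise_differences = GENOME_BP // 1000
pairwise_snp_differences = pairwise_differences * 9 // 10

print(expected_snps)             # 10000000
print(pairwise_differences)      # 3000000
print(pairwise_snp_differences)  # 2700000
```

Under that assumption, two unrelated genomes would differ at about 3 million positions, of which about 2.7 million would be SNPs.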

A dbSNP Record >gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A
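The header line of this record packs several `key=value` fields between `|` separators, and the sequence uses IUPAC ambiguity codes (R for A/G, Y for C/T) at variable positions. The Python sketch below shows one way to pull those fields apart. The function name and the returned dictionary layout are hypothetical, and the sketch assumes the header follows exactly the pattern shown above.

```python
# Two-base IUPAC ambiguity codes (standard nomenclature).
IUPAC_TWO_BASE = {
    "R": "A/G", "Y": "C/T", "S": "C/G",
    "W": "A/T", "K": "G/T", "M": "A/C",
}


def parse_dbsnp_defline(defline):
    """Split a 'gnl|dbSNP|ss...|key=value|...' header into a dictionary."""
    fields = defline.strip().lstrip(">").split("|")
    record = {"database": fields[1], "id": fields[2]}
    for field in fields[3:]:
        key, _, value = field.partition("=")
        record[key] = value.strip("'")  # alleles are quoted in the header
    return record


header = ">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic"
record = parse_dbsnp_defline(header)
print(record["id"], int(record["allelePos"]), record["alleles"])
print(IUPAC_TWO_BASE["R"] == record["alleles"])  # the A/G allele is written as R
```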

Different Ways to Search SNPs in dbSNP dbSNP web site –Direct search of SS records; batch search; allows SNP record submission; no search limits Entrez SNP –http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp –Search limit options allow precise retrieval Entrez Gene record’s SNP Links Out feature –Direct links to corresponding SNP records; access to genotype and linkage disequilibrium data NCBI’s MapViewer –Visualize SNPs in the genomic context along with other types of genetic data

Search SNPs from dbSNP Web Page http://www.ncbi.nlm.nih.gov/SNP/index.html

Search SNPs from Entrez SNP Web Page http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp dbSNP is part of the Entrez integrated information retrieval system and may be searched using either qualifiers or a combination of search limits from 14 different categories

Gene and Disease

Disease Causing Genes Disease-centric databases: OMIM: http://www.ncbi.nlm.nih.gov/omim/ CDC HuGE Navigator: http://hugenavigator.net/ HGMD: https://portal.biobase-international.com/hgmd/pro/start.php A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

Online Mendelian Inheritance in Man (OMIM) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM OMIM is a human genetic disorders database built and curated using results from published studies Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information: –description and clinical features of a disorder or a gene involved in genetic disorders; –biochemical and other features; –cytogenetics and mapping; –molecular and population genetics; –diagnosis and clinical management; –animal models for the disorder; –allelic variants. OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources.

OMIM: Variant The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities Variants are represented by a 10-digit OMIM number, and can be searched in two ways –By OMIM number –By searching for a gene or a disease and, once the record is retrieved, viewing its variants

Variants in OMIM Records For most genes, only selected mutations are included –Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance. Most of the variants represent disease-producing mutations, NOT polymorphisms. A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders. Few neutral polymorphisms are included in OMIM Some SNPs in the dbSNP records are not linked to the corresponding OMIM records.

Similarity Search Find statistically significant matches to a protein or DNA sequence of interest. Obtain information on inferred function of the gene Sequence identity/similarity is a quantitative measurement of the number of nucleotides / amino acids which are identical /similar in two aligned sequences –Calculated from a sequence alignment –Can be expressed as a percentage –In proteins, some residues are chemically similar but not identical
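Percent identity can be computed directly from two aligned sequences of equal length. The Python sketch below counts identical, non-gap columns and divides by the alignment length. Dividing by the full alignment length is one convention among several (others divide by the shorter sequence or by the number of ungapped columns), so treat it as an assumption; the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same non-gap symbol.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_a)


# Four identical columns out of six: A, C, T and the final A.
print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```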

Sequence Alignment A linear, one-to-one correspondence between some of the symbols in one sequence with some of the symbols in another sequence –Four possible outcomes in aligning two sequences Identity; mismatch; gap in one sequence; gap in the other sequence May be DNA or protein sequences.

Alignment Algorithms Sequences often contain highly conserved regions These regions can be used for an initial alignment

Alignments Two sequences Seq 1: ACGGACT Seq 2: ATCGGATCT There may be multiple ways of creating the alignment. Which alignment is the best?
A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T

A T C G G A T C T
|   | | |     | |
A - C G G - A C T
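One way to answer "which alignment is the best" is to assign each column a score and compare the totals. The Python sketch below scores the two candidate alignments of Seq 1 and Seq 2. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for illustration, not one given in the slides, and real tools use substitution matrices and separate gap-open and gap-extend penalties.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two gapped sequences of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # gap in either sequence
        elif a == b:
            total += match        # identity
        else:
            total += mismatch     # substitution
    return total


# First candidate: Seq 1 on top, Seq 2 below.
first = score_alignment("A-C-GG-ACT", "ATCGGAT-CT")
# Second candidate: Seq 2 on top, Seq 1 below.
second = score_alignment("ATCGGATCT", "A-CGG-ACT")
print(first, second)
```

Under this scheme the first alignment scores 0 (5 identities, 1 mismatch, 4 gaps) and the second scores 3 (6 identities, 1 mismatch, 2 gaps), so the second is preferred. A different scoring scheme could change the ranking.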

BLAST BLAST - Basic Local Alignment Search Tool: A sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query Most widely used and referenced computational biology resource The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T when compared to the query using a substitution matrix Word hits are then extended in both directions to generate an alignment with score exceeding a given threshold S
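The seed-and-extend idea can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: seeds are exact word matches of length W (protein BLAST instead accepts any word pair scoring at least T under a substitution matrix), extension is ungapped, and the scores, the `x_drop` cutoff, and all function names are assumptions made for the example.

```python
def find_word_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, q_pos, s_pos, word_size,
               match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a word hit in both directions.

    Extension in a direction stops once the running score has fallen more
    than x_drop below the best score seen so far.
    Returns (best_score, query_start, query_end_exclusive, subject_start).
    """
    score = best = word_size * match
    q_end = q_pos + word_size
    i, j = q_pos + word_size, s_pos + word_size
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i

    score = best  # restart from the best right-hand extension
    q_start = q_pos
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        i -= 1
        j -= 1
    return best, q_start, q_end, s_pos - (q_pos - q_start)


def blast_like_search(query, subject, word_size=4, min_score=6):
    """Keep extended segment pairs whose score reaches the threshold S."""
    segments = set()
    for q_pos, s_pos in find_word_hits(query, subject, word_size):
        segment = extend_hit(query, subject, q_pos, s_pos, word_size)
        if segment[0] >= min_score:
            segments.add(segment)
    return sorted(segments, reverse=True)


print(blast_like_search("GATTACA", "CCGATTACATT"))
```

In the example, all four word hits extend to the same segment, so a single result is reported. Here `word_size` plays the role of W and `min_score` the role of S; the T threshold has no counterpart because seeds are exact matches.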

Four Steps of a BLAST search Enter query sequence Select one BLAST program Choose the database to search Set optional parameters

Enter Query Sequence The sequence can be pasted into a text field in FASTA format or entered as an accession number The sequence can also be uploaded as a file (FASTA format) Users may specify a range of the query sequence instead of using the whole query sequence The job title is generated automatically from the sequence header
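A FASTA record is a header line starting with `>` followed by one or more lines of sequence. The Python sketch below reads FASTA text and extracts a query subrange. It assumes that range positions are 1-based and inclusive, and the function names and example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def query_subrange(sequence, start, end):
    """Return positions start..end of the sequence (1-based, inclusive)."""
    return sequence[start - 1:end]


fasta_text = ">seq1 demo record\nACGT\nTTGA\n>seq2\nGGCC\n"
records = parse_fasta(fasta_text)
print(records)
print(query_subrange(records[0][1], 2, 4))
```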

Select one BLAST Program BLAST Programs: –BLASTN: DNA query sequence against a DNA database –BLASTP: protein query sequence against a protein database –BLASTX: DNA query sequence, translated into all six reading frames, against a protein database –TBLASTN: protein query sequence against a DNA database, translated into all six reading frames –TBLASTX: DNA query sequence, translated into all six reading frames, against a DNA database, translated into all six reading frames Choose the right one according to the purpose of the search
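The choice among the five programs depends only on the query type, the database type, and whether a DNA-versus-DNA comparison should be made at the protein level. The Python sketch below encodes the list above as a lookup; the function name and the `"dna"`/`"protein"` labels are hypothetical.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST program from query and database types ('dna' or 'protein').

    compare_as_protein only matters for DNA against DNA: it selects TBLASTX,
    which translates both sides in all six reading frames.
    """
    if query_type == "dna" and database_type == "dna":
        return "tblastx" if compare_as_protein else "blastn"
    if query_type == "protein" and database_type == "protein":
        return "blastp"
    if query_type == "dna" and database_type == "protein":
        return "blastx"   # query translated in six frames
    if query_type == "protein" and database_type == "dna":
        return "tblastn"  # database translated in six frames
    raise ValueError("types must be 'dna' or 'protein'")


print(choose_blast_program("dna", "protein"))
```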

Choose the Database to Search BLASTN

Optional Parameters Specify the organism to search or exclude –Common name, taxonomy id, … Exclude certain sequences –Exclude predicted sequences or sequences from metagenomics Use Entrez query to select a subset of the blast database

BLASTN Output (header)

BLASTN Output (Graphic Summary) matches to itself probable homologs distantly related homologs distant homolog with shared domain or motif

BLASTN Output (Descriptions)

BLASTN Output (Sequence Alignments)

Genome Browser A genome browser is a computer program that helps display gene maps, browse chromosomes, and align genes or gene models with ESTs or contigs, etc. Big Three: –UCSC Genome Browser –NCBI Map Viewer –Ensembl

UCSC Genome Browser http://genome.ucsc.edu

Organization of Genomic Data Genome backbone: base position number, sequence Annotation tracks: chromosome band; known genes; predicted genes; evolutionary conservation; SNPs; STS sites; microarray/expression data; repeated regions; more… Links out to more data

UCSC Genome Browser

Annotation Tracks sequence STS sites Known gene SNP Evolutionary conservation Repeated regions Expression

A Sample of the UCSC Genome Browser gene details Annotation Tracks sequence comparisons SNPs

Genome Browser Gateway Use this Gateway to search by: –Gene names, symbols, IDs –Chromosome number: chr7, or region: chr11:1038475-1075482 –Keywords: kinase, receptor See lower part of page for help with format
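The two position formats shown above, a bare chromosome such as `chr7` and a region such as `chr11:1038475-1075482`, are easy to tell apart programmatically. The Python sketch below parses them; it is an illustration only, not the browser's own parser. The function name is hypothetical, and accepting commas inside coordinates is an assumption.

```python
import re

# 'chr' + name, optionally followed by ':start-end' (digits, commas allowed).
POSITION_PATTERN = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_position(text):
    """Return (chromosome, start, end), or None if text is not a position.

    start and end are None when only a chromosome name is given. Anything
    that does not match (gene names, keywords) returns None so the caller
    can treat it as a text search instead.
    """
    match = POSITION_PATTERN.match(text.strip())
    if match is None:
        return None
    chromosome, start_text, end_text = match.groups()
    if start_text is None:
        return chromosome, None, None
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return chromosome, start, end


print(parse_position("chr7"))
print(parse_position("chr11:1038475-1075482"))
print(parse_position("kinase"))
```

Note that a gene symbol beginning with "chr" would be misread as a chromosome name by this simple pattern.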

Genome Browser Gateway Helpful search examples samples provided text/ID searches

The Genome Browser Gateway Make your Gateway choices: 1. Select clade 2. Select genome = species: search 1 species at a time 3. Assembly: the official backbone DNA sequence 4. Position: location in the genome to examine 5. Image width: how many pixels in display window; 5000 max 6. Configure: make fonts bigger + other choices

Different Species, Different Tracks Species may have different data tracks Layout, software, functions are the same

Sample Genome Viewer Image, TP53 base position UCSC genes RefSeq genes mRNAs & ESTs repeats many species compared SNPs single species compared MGC clones

Visual Cues on the Genome Browser Track colors may have meaning, for example in the UCSC Gene track: –Black = there is a corresponding PDB entry –Dark blue = there is a corresponding reviewed/validated sequence –Lightest blue = there is a non-RefSeq sequence Tick marks: a single location (STS, SNP) For some tracks, the height of a bar indicates increased likelihood of an evolutionary relationship (conservation track) Intron and direction of transcription: arrowheads (>>> or <<<) Figure labels: exon, 5' UTR, 3' UTR Alignment indications (Conservation pairs: “chain” or “net” style) Alignments = boxes, Gaps = lines

Options for Changing Images: Upper Section Change your view or location with controls at the top Use “base” to get right down to the nucleotides Configure: to change font, window size, more… –Next item, next exon navigation assistance can be turned on Specify a position Fonts, window, next item, more Walk left or right Zoom in Zoom out Click to zoom 3x and re-center

Annotation Track Display Options Some data is ON or OFF by default Menu links to info about the tracks: content, methods You change the view with pulldown menus After making changes, REFRESH to enforce the change Change track view Links to info and/or filters

Annotation Track Options Defined Hide: removes a track from view Dense: all items collapsed into a single line Squish: each item = separate line, but 50% height Pack: each item separate, but efficiently stacked (full height) Full: each item on separate line

Mid-page Options to Change Settings You control the views Use pulldown menus Configure options page Reset, back to defaults Start from scratch Enforce any changes (hide, full, squish…) Flip display to Genomic 3' to 5'

Base Level and Protein Sequences
