Sequence, SNP and Mutation Databases


Sequence, SNP and Mutation Databases
Mesut Erzurumluoglu epmmee@bristol.ac.uk
I’m Mesut, a PhD student in genetic epidemiology studying the effects of consanguinity on human disorders, in particular a disorder called PCD. I’m here to talk to you about databases. We’ll have two exercises during the lecture which will be very useful to you, and for this reason we have Tom and Denis with us as well. I hated lectures before the coffee break. This is probably the most boring but also the most useful lecture: there are hundreds of databases out there, so we can’t go through all of them. However, not all of them are applicable to you anyway, so within the next 45 minutes I’ll talk about the ones I have found the most useful. An asterisk (*) in front of a slide title marks a very important database, especially Ensembl and GeneCards (the professor asked me to explain this). Feel free to stop me at any time, but there is a lot to go through, so unless it is urgent please leave questions to the end. Also feel free to send me emails.

Before we start Very important lecture Hundreds of databases! Database knowledge is a must for analysis Hundreds of databases! Pay special attention to: Ensembl Genome Browser dbSNP GeneCards Exercise and Q&A time at the end

Two lines of data
Mapping of locations (annotation): genomic landmarks, e.g. genes (exon, intron, splice site), binding sites; known associations, e.g. HBB gene and sickle cell disease, rs9939609 SNP and BMI
Recording of variation (summary stats): frequencies, counts; correlations (e.g. SNP-SNP)
So there are two lines of data in these databases. The first is to do with the mapping of locations. What I mean by this is that it would make no sense to just take the sequence of some bit of DNA and dump it into a database. You’d have to make it clear which species it is from, and whether it is from a healthy individual or one suffering from some disorder. You would also want to know where the sequence lies in the genome, whether it is coding, whether it is functional and what it does, and whether it causes a disease if mutated. The second line of data is the recording of variation: if you have a variant you’re working on, how many times has it been seen in other individuals?

Why is it important?
Mapping of locations: designing experiments (e.g. primers, candidate genes); location of variants (e.g. genes within region, exon, intergenic); interpretation of results (e.g. LD, consequence)
Recording of variation: what’s out there from other projects? Standardisation
There is a standard, so we all know we’re talking about the same locus when we say chr 11, position 1000.

Terminology
Single nucleotide polymorphism (SNP)
Minor allele frequency (MAF)
Difference between genotyping and sequencing
Linkage disequilibrium (LD): will be taught on the next short-course; slide in appendix
Genetic association study: also taught on the next short-course
To benefit fully from this lecture and from the databases, you need to understand these terms in particular. I’ll talk about the first three, but the other two would take a lot of time to explain, so I’ll leave them out. However, I have made two slides available at the end for those who are not going to the next short-course.

SNP
Affects a single nucleotide
Common ones (>1%)
Mostly bi-allelic (e.g. A>C)
Minor allele: the second most common one
Minor allele frequency: frequency of the minor allele
SNV or SNP? (Scottish National Party)
99.5% similarity: 0.5% * 6 billion = 30 million mismatches, of which about 3 million are SNVs; the others are indels, CNVs etc.
More in Africans than Europeans
Why sequence the whole genome, which is very expensive, when you can pinpoint the loci which are polymorphic and just genotype them? Large-scale projects were therefore initiated to find these polymorphic sites, and micro-arrays were then designed to genotype just these loci. Of course you miss out on the unique and the rare mutations, but you still capture most of the variation this way, cheaply and quickly. This rests on the common disease, common variant hypothesis (since 1996).
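The definition of the minor allele and its frequency can be made concrete with a short sketch. It assumes a bi-allelic site and genotypes written as two-letter strings; the function name and the example genotypes are illustrative only and do not come from any database.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, maf) for one site.

    genotypes: iterable of two-character strings such as "AC", one per
    individual. A "0" in a genotype marks a missing call and is skipped.
    """
    counts = Counter()
    for genotype in genotypes:
        if "0" in genotype:
            continue  # missing call, contributes no alleles
        counts.update(genotype)  # counts each of the two alleles
    if len(counts) < 2:
        return None, 0.0  # monomorphic in this sample
    total = sum(counts.values())
    # The minor allele is the second most common allele.
    minor, n_minor = counts.most_common(2)[1]
    return minor, n_minor / total


# Hypothetical sample of six individuals, one with a missing call.
example = ["AA", "AC", "CC", "AA", "AC", "00"]
print(minor_allele_frequency(example))
```

Each individual contributes two alleles, so the denominator is twice the number of individuals with a genotype call.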

Data formats
FASTA: raw sequence data; can be nucleotide or protein
SAM/BAM: sequence alignment format
VCF: only variations from the reference; nucleotide; can contain SNPs, insertions and/or deletions (aka indels), microsatellites
PED/MAP: Plink default format; incorporates familial and phenotypic data with genetic data; slide in appendix
VEP: Ensembl’s variation annotation format

FASTA (and FASTQ)
Parts of the example record: sample ID, DNA sequence (or amino acid), quality scores
Very large file sizes
Quality score characters: ! is the lowest and ~ the highest; numbers (0 to 9): low-mid; capital letters (A to Z): mid-high; small letters (a to z): high-higher
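The ordering of quality characters on this slide (! lowest, ~ highest) matches the common Phred+33 ASCII encoding, so a rough sketch of decoding it is given here. The offset of 33 is an assumption about the file (older platforms used other offsets), and the record below is made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding, where '!' (ASCII 33) means score 0.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (sample_id, sequence, scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, phred_scores(quality)


# Hypothetical record: ID line, sequence, separator, quality string.
record = ["@sample1", "GATTACA", "+", "!5AIa~~"]
sample_id, sequence, scores = parse_fastq_record(record)
print(sample_id, sequence, scores)
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and 30 to 1 in 1000.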

VCF Small file size and widely used Very useful for Mendelian disease studies Software: Vcftools Reference human genome: GRCh37 (release 74) - http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
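To show why VCF files stay small, here is a minimal sketch that reads a single data line using the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not a replacement for Vcftools; the record is hypothetical and the helper names are made up for this example.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        # Flags such as "DB" carry no value; store them as True.
        info_dict[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are allowed
        "qual": qual,
        "filter": flt,
        "info": info_dict,
    }


def variant_type(ref, alt):
    """Very rough classification: single-base change or indel."""
    return "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"


# Hypothetical record; only differences from the reference are stored.
line = "1\t1234555\trs123456\tA\tG\t50\tPASS\tAF=0.25;DB"
record = parse_vcf_line(line)
print(record["chrom"], record["pos"], variant_type(record["ref"], record["alt"][0]))
```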

Raw sequence databases
Ensembl (Europe), NCBI (USA)
DNA: NC_123456 (complete genome), NG_123456 (genomic region)
mRNA: NM_123456 (mRNA), NR_123456 (non-coding RNA transcript)
Protein: NP_123456 (protein)
Also see RefSeq (NCBI) slide in appendix
Ensembl is European, NCBI is USA based. Ensembl is neater but NCBI has more data. XM_, XR_ and XP_ records are predicted models and are not curated.
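The accession prefixes on this slide can be turned into a small lookup. This is only a sketch covering the prefixes mentioned here; the function name is hypothetical and the accession numbers are the placeholder ones from the slide.

```python
REFSEQ_PREFIXES = {
    "NC_": ("DNA", "complete genomic molecule", True),
    "NG_": ("DNA", "genomic region", True),
    "NM_": ("RNA", "mRNA", True),
    "NR_": ("RNA", "non-coding RNA transcript", True),
    "NP_": ("protein", "protein", True),
    "XM_": ("RNA", "predicted mRNA", False),
    "XR_": ("RNA", "predicted non-coding RNA", False),
    "XP_": ("protein", "predicted protein", False),
}


def describe_accession(accession):
    """Return (molecule, description, curated) for a RefSeq accession.

    Raises ValueError for prefixes not covered by this small table.
    """
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unknown RefSeq prefix: {prefix}")
    return REFSEQ_PREFIXES[prefix]


for acc in ["NM_123456", "XP_123456"]:
    print(acc, describe_accession(acc))
```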

*Ensembl www.ensembl.org/info/data/ftp/index.html
DNA: useful for mapping reads, designing primers
cDNA = mRNA (transcripts)
CDS = coding sequence (the protein-coding portion of the exons)
VCF: HapMap, 1000 Genomes, Venter, Watson
VEP
66 species (as of 26/02/14)

Genome Browsers Ensembl Genome Browser UCSC Genome Browser View region Extract region Filter variants in region Prediction MAF Conservation GERP scores UCSC Genome Browser http://genome.ucsc.edu/ 1000 Genomes Browser http://browser.1000genomes.org/index.html VEGA Genome Browser

*Ensembl Genome Browser (EGB) www.ensembl.org/index.html Browse many genomes (>70 vertebrates) Example: BRCA2 Start and end Sequence (exons, introns) Transcripts Orthologues Sequence homology in other species Indicative of similar function Easy sequence extraction: http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=13:32889611,32973805 Other EGBs EnsemblPlants, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblProtists
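The sequence extraction link above is just a chromosome plus start and end coordinates, so the same region can be handled programmatically. The sketch below parses a region string, computes its length, and builds a URL for the Ensembl REST sequence endpoint. Using the REST service rather than the DAS link on the slide, and using the GRCh37 server, are assumptions made for this example; the helper names are hypothetical and nothing is downloaded here.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end).

    Coordinates are treated as 1-based and inclusive; thousands
    separators such as '32,889,611' are tolerated.
    """
    chrom, _, span = region.partition(":")
    start, _, end = span.replace(",", "").partition("-")
    return chrom, int(start), int(end)


def region_length(start, end):
    """Number of bases in a 1-based inclusive interval."""
    return end - start + 1


def rest_sequence_url(chrom, start, end,
                      server="https://grch37.rest.ensembl.org"):
    """Build a FASTA request URL for the forward strand of a human region.

    The server name is an assumption: it targets the GRCh37 assembly
    used in the slides rather than the current assembly.
    """
    return (f"{server}/sequence/region/human/"
            f"{chrom}:{start}..{end}:1?content-type=text/x-fasta")


# BRCA2 coordinates as given in the slide's extraction link.
chrom, start, end = parse_region("13:32889611-32973805")
print(region_length(start, end))
print(rest_sequence_url(chrom, start, end))
```

Coordinates differ between assemblies, so the same numbers should not be sent to a server for a different genome build.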

EGB – Conservation (GERP) Location GERP elements

SNP databases
Ensembl (and NCBI)
dbSNP (largest database for SNPs)
1000 Genomes: whole genome sequencing of 1092 individuals; African, European, Far East
HapMap projects: genotyping of 270 individuals from 4 populations
Exome Variant Server (EVS): whole exomes of 6503 individuals; European American and African American
SNP and phenotype associations: GWAS Catalogue; slide in appendix

*dbSNP http://ncbi.nlm.nih.gov/SNP Search, annotate and submit SNPs Apply filters e.g. human, cited, clinical >70 million SNPs just for humans dbSNP handbook www.ncbi.nlm.nih.gov/books/NBK21088/

[Figure from the GWAS catalogue: one of the most famous figures out there. Label: Y chromosome]

Clinical databases
OMIM (Online Mendelian Inheritance in Man): database of disease-linked genes and associated phenotypes; links to Entrez, GDB and other databases
HGMD: database of sequences and phenotypes of disease-causing mutations; used to train mutation effect prediction algorithms (e.g. FATHMM); slide in appendix
mutDB http://www.mutdb.org
Mitomap (mitochondrial DNA) http://www.mitomap.org

*OMIM http://omim.org/ Online Mendelian Inheritance in Man First address for Mendelian disorders Disease > gene and phenotype Gene > disease and phenotype Phenotypes > disease and genes Example search: Autosomal recessive intellectual disability +autosomal +recessive +intellectual +disability

Protein databases Uniprot (Swiss-Prot + TrEMBL) PDB – Protein data bank Slide in appendix Pfam – Protein families STRING – Protein-Protein Interactions HMPD (mitochondrial DNA) http://bioinfo.nist.gov/

Uniprot http://www.uniprot.org/
Swiss-Prot is manually curated
TrEMBL is automatically annotated (not manually reviewed)
Download protein sequence
Protein sequence BLAST
Align: conserved regions and/or residues; view AA properties
Try example: Gene: DNALI1. Retrieve FASTA sequence, BLAST and align (3 species)

Very useful integrative databases
GeneCards: integrated resource of information on human genes and their products; major emphasis on human disease; links to many kinds of biomedical information (sequence databases; OMIM, HGMD, MDB; Doctors’ Guide to the Internet)
Ensembl (and NCBI) fit in this category

*Genecards http://www.genecards.org/ Graphical view of many things about your gene Links to Ensembl, OMIM and Literature Example: DNALI1 Entrez Gene Summary Associated disorders Orthologues STRING predictions

Google is your friend! PubMed – Bibliographic database Animal models Zfin – Zebrafish knockouts http://zfin.org/ International Mouse Phenotyping Consortium https://www.mousephenotype.org/ New ones coming out all the time! New cohorts and studies UK10K project http://www.uk10k.org/ Human Microbiome project http://commonfund.nih.gov/hmp/index ARIES - Epigenetics http://www.ariesepigenomics.org.uk/ariesexplorer

*PubMed http://www.ncbi.nlm.nih.gov/pubmed/
“>23 million citations for biomedical literature from MEDLINE, life science journals, and online books”
Simple to use: searching; citation manager facility
PubMed help book http://www.ncbi.nlm.nih.gov/books/NBK3827/
Finally, being able to search the literature efficiently and to get to the right papers is very important, and I’ve found PubMed a very reliable and easy-to-use database for bibliographic sources.

Exercise 1 - SNPs Find where rs9939609 is located Is it in an exon or an intron? Minor allele? Global MAF? Which gene(s) are close by? Associated with any disorders? Which population(s) is the minor allele most frequent in? How many other known human SNPs in this gene?

Answers
Chr 16 at position 53,820,527
Intron
A, 0.355
FTO
BMI, Type 2 diabetes, Menarche
Luhya in Webuye, Kenya (0.617) (dbSNP, scroll to bottom)
8099 (as of 21/02/14) (search for FTO in dbSNP and filter for H. sapiens)

Exercise 2 - Genes Find the start and end coordinates of your favourite gene (e.g. DNAH5) How many exons does it have? How many different transcripts does it have? What is the function? Associated with any disorders? Which proteins are predicted to interact with it? Extract the coding sequence – in FASTA

Answers
Chr 5 from 13,690,440 to 13,944,652
79
4
Force generating protein of respiratory cilia (from GeneCards, UniprotKB section)
Primary ciliary dyskinesia
DNAH1, DYNLL1, DNAL4 (from STRING)
Many ways to do this, for example: export data in Ensembl (DNA); Q8TE73 in Uniprot (AA)
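Once a sequence has been exported in FASTA, reading it back takes only a few lines. The sketch below is a minimal reader, assuming the usual layout of a ">" header line followed by one or more sequence lines; the record names and sequences are invented for illustration and are not real exports for DNAH5.

```python
def read_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines are wrapped in the file; join them back together.
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record file held in memory.
example = [
    ">example_transcript_1 coding sequence",
    "ATGGCT",
    "GGTTAA",
    ">example_transcript_2 coding sequence",
    "ATGTAA",
]
for name, seq in read_fasta(example).items():
    print(name, len(seq))
```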

Thank You Any questions? Please look back at the slides again once you complete the short-course(s)

Appendices Two additional terms which must be understood to make full use of the databases 10 useful websites/databases we do not have the time to go through Two additional ‘must know’ data formats

LD
Non-random association of alleles at two or more loci
Simple example: A at chr1:1000 and T at chr3:500; C at chr1:1000 and A at chr3:500; no other haplotypes. Therefore chr1:1000 and chr3:500 are in LD
Thus not all SNPs are independent
Therefore carefully selected SNPs can save money (e.g. tag SNPs)
Different from linkage: proximal loci on the same chromosome inherited together during meiosis
These designed SNP arrays were also used in the HapMap project to discover LD in different populations. If you know that 50 SNPs are inherited together (as a block), then you can genotype just one of them and impute the rest, which saves even more.
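The simple example on this slide can be checked numerically. The sketch below computes the standard LD measures D and r² from phased two-locus haplotypes; the haplotype counts are invented, and the function name is only illustrative.

```python
def ld_stats(haplotypes, allele1, allele2):
    """Compute (D, r_squared) between two bi-allelic loci.

    haplotypes: list of (allele_at_locus1, allele_at_locus2) tuples.
    allele1, allele2: the alleles whose co-occurrence is measured.
    """
    n = len(haplotypes)
    p_a = sum(1 for a, _ in haplotypes if a == allele1) / n
    p_b = sum(1 for _, b in haplotypes if b == allele2) / n
    p_ab = sum(1 for a, b in haplotypes if a == allele1 and b == allele2) / n
    d = p_ab - p_a * p_b  # zero when the loci are independent
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared


# Only two haplotypes exist, as on the slide: A with T, and C with A.
linked = [("A", "T")] * 6 + [("C", "A")] * 4
print(ld_stats(linked, "A", "T"))

# All four combinations equally common: no association.
independent = [("A", "T"), ("A", "A"), ("C", "T"), ("C", "A")]
print(ld_stats(independent, "A", "T"))
```

With only two haplotypes present, knowing the allele at one locus fully determines the allele at the other, which is what makes tag SNPs and imputation possible.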

Genetic association study GWAS SNP arrays - using LD Very cheap (23andme offers for $100) Case v Controls Whole exome sequencing (WES) Whole genome sequencing (WGS) Expensive Candidate gene analyses E.g. ones identified by GWAS Animal models Where thousands of individuals are needed to detect the effect of a SNP on a phenotype, GWAS are used. If very few people are enough then WGS or WES is used. Anything that is in LD with the SNP can be the causal one. That SNP you’ve genotyped just represents the whole LD block

RefSeq (NCBI) http://www.ncbi.nlm.nih.gov/refseq/
“A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein”
>33000 species (and/or strains)
CCDS project

GWAS catalogue www.genome.gov/gwastudies/ Search for SNP-phenotype associations from GWAS Search, view and filter Try example: BMI and P value of 1e-8 Result: 11 papers (as of 21/02/14)

HGMD http://www.hgmd.cf.ac.uk/ac/index.php “HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease” Public version registration required Professional version Purchase a licence

PDB – Protein Data Bank http://www.rcsb.org/pdb/home/home.do Links biochemistry to your study If there is data of course! 3D view of protein (in Jmol) Amino acid sequence Try example: FTO (4IDZ)

Pfam http://pfam.sanger.ac.uk/ Protein families and domains Predicted to have similar functions Domain organisation Phylogenetic tree Links to PDB Try example: Dynein_heavy (PF03028)

STRING http://string-db.org/ Predicts protein-protein interactions Coexpression Literature Experiments Genomic context Try example: DNALI1

KEGG http://www.kegg.jp/
“resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies”
Similar to STRING but manually curated; more reliable
My friends tell me the difference between STRING and KEGG is like the difference between TrEMBL and Swiss-Prot: STRING is automated, whereas KEGG is manually curated, so it is more reliable but holds less information.

Regulatory elements Rfam (RNA family) Noncoding RNA database http://rfam.sanger.ac.uk/ Noncoding RNA database http://biobases.ibch.poznan.pl/ncRNA Bioexplorer.net http://www.bioexplorer.net/Databases

ENCODE http://www.nature.com/encode/
Radical shake-up to the interrogation of genomic “function”
Data available: functional impact of variant sites in multiple tissues; multiple assay types; analysis/visualisation software and scripts for the generation of figures
Massive database: we won’t be going through it, and there is no time for it either. The good news, however, is that Ensembl has direct links to it and annotates some variations in accordance with ENCODE (regulatory feature).

Locus specific or disease specific databases By HGVS Human Genome Variation Society http://www.hgvs.org/dblist/dblist.html Example: Ciliome database http://www.sfu.ca/~leroux/ciliome_home.htm

PED/MAP
Most used format; user friendly software
Plink pngu.mgh.harvard.edu/~purcell/plink/ (one of the most cited software packages)
Ped:
FAM001 1 0 0 1 2 A A G G A C ...
FAM001 2 0 0 1 2 A A A G 0 0 ...
....
Map:
1 rs123456 0 1234555
1 rs234567 0 1237793
1 rs233556 0 1337456
…
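The example rows on this slide can be read with a short sketch. It assumes the standard PLINK column layout: a PED row holds family ID, individual ID, father, mother, sex and phenotype, followed by two alleles per SNP, and a MAP row holds chromosome, SNP ID, genetic distance and base-pair position. The function names are hypothetical, and the trailing "..." from the slide is left out of the input.

```python
def parse_map(lines):
    """Parse MAP rows: chromosome, SNP ID, genetic distance, bp position."""
    snps = []
    for line in lines:
        chrom, snp_id, distance, position = line.split()
        snps.append({"chrom": chrom, "id": snp_id,
                     "distance": float(distance), "bp": int(position)})
    return snps


def parse_ped_line(line, snp_ids):
    """Parse one PED row into sample details plus a genotype per SNP.

    A '0' allele marks a missing genotype, stored here as None.
    """
    fields = line.split()
    family, individual, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) != 2 * len(snp_ids):
        raise ValueError("PED row does not match the number of MAP SNPs")
    genotypes = {}
    for i, snp_id in enumerate(snp_ids):
        pair = (alleles[2 * i], alleles[2 * i + 1])
        genotypes[snp_id] = None if "0" in pair else pair
    return {"family": family, "individual": individual, "father": father,
            "mother": mother, "sex": sex, "phenotype": phenotype,
            "genotypes": genotypes}


map_rows = ["1 rs123456 0 1234555", "1 rs234567 0 1237793", "1 rs233556 0 1337456"]
ped_rows = ["FAM001 1 0 0 1 2 A A G G A C", "FAM001 2 0 0 1 2 A A A G 0 0"]

snp_ids = [snp["id"] for snp in parse_map(map_rows)]
for row in ped_rows:
    sample = parse_ped_line(row, snp_ids)
    print(sample["individual"], sample["genotypes"])
```

The order of SNPs in the MAP file defines the order of allele pairs in every PED row, which is why the two files are always used together.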

Variant effect predictor (VEP) www.ensembl.org/info/docs/tools/vep/index.html Example file: