1
Sequence, SNP and Mutation Databases
I’m Mesut, a PhD student in genetic epidemiology studying the effects of consanguinity on human disorders, in particular a disorder called PCD. I’m here to talk to you about databases. We’ll have two exercises during the lecture, which will be very useful to you, and for this reason we have Tom and Denis with us. I always hated the lecture before the coffee break. This is probably the most boring but also the most useful lecture. There are hundreds of databases out there, so we can’t go through all of them, and not all of them are applicable to you anyway, so in the next 45 minutes I’ll talk about the ones I have found the most useful. An asterisk (*) in front of a name marks a very important database, especially Ensembl and GeneCards; the Professor asked me to explain this. Feel free to stop me at any time, but there is a lot to go through, so unless it is urgent please leave questions to the end. Also feel free to email me.
Mesut Erzurumluoglu
2
Before we start
Very important lecture: database knowledge is a must for analysis
Hundreds of databases!
Pay special attention to: Ensembl Genome Browser, dbSNP, GeneCards
Exercise and Q&A time at the end
3
Two lines of data
Mapping of locations – Annotation
Genomic landmarks, e.g. genes (exon, intron, splice site), binding sites
Known associations, e.g. HBB gene and sickle cell disease; rs SNP and BMI
Recording of variation – Summary stats
Frequencies, counts
Correlations (e.g. SNP-SNP)
There are two lines of data in these databases. The first is to do with the mapping of locations: it would make no sense to dump the sequence of some bit of DNA into a database on its own. You have to make clear which species it is from, and whether it is from a healthy individual or one suffering from some disorder. You also want to know where the sequence lies in the genome, whether it is coding, whether it is functional and what it does, and whether it causes a disease when mutated. The second line of data is the recording of variation: if you are working on a variant, how many times has it been seen in other individuals?
4
Why is it important?
Mapping of locations
Designing experiments, e.g. primers, candidate genes
Location of variants, e.g. genes within region, exon, intergenic
Interpretation of results, e.g. LD, consequence
Recording of variation
What’s out there from other projects?
Standardisation
There is a standard, so we all know we are talking about the same locus when we say chr 11, position 1000.
5
Terminology
Single nucleotide polymorphism (SNP)
Minor allele frequency (MAF)
Difference between genotyping and sequencing
Linkage disequilibrium (LD): will be taught on the next short-course; slide in appendix
Genetic association study: also taught on the next short-course
To benefit fully from this lecture and from the databases, you need to understand these terms in particular. I’ll talk about the first three, but the other two take a lot of time to explain, so I will skip them. However, I have made two slides available at the end for those who are not going to the next short-course.
6
SNP
Affects a single nucleotide
Common ones (>1%)
Mostly bi-allelic (e.g. A>C)
Minor allele: the second most common allele
Minor allele frequency: frequency of the minor allele
SNV or SNP? (Not the Scottish National Party)
99.5% similarity: 0.5% * 6 billion = 30 million mismatched bases, of which about 3 million are SNVs; the others are indels, CNVs etc.
More in Africans than Europeans
Why sequence the whole genome, which is very expensive, when you can pinpoint the loci that are polymorphic and just genotype them? Large-scale projects were initiated to find these polymorphic sites, and micro-arrays were then designed to genotype just these loci. You miss out on the unique and rare mutations, but you still capture most of the variation in a cheap and quick fashion.
Common disease, common variant hypothesis (since 1996)
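The definition of the minor allele and its frequency can be made concrete with a small sketch. The sample genotypes below are made up for illustration, and the function names are hypothetical rather than taken from any particular package.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'AC'."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # each diploid genotype contributes two alleles
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele(genotypes):
    """Return (allele, frequency) for the second most common allele."""
    ranked = sorted(
        allele_frequencies(genotypes).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    if len(ranked) < 2:
        raise ValueError("site is monomorphic in this sample")
    return ranked[1]


# Hypothetical sample: 10 individuals genotyped at one bi-allelic site.
sample = ["AA"] * 6 + ["AC"] * 3 + ["CC"] * 1
allele, maf = minor_allele(sample)
print(allele, maf)
```

With 15 copies of A and 5 copies of C among 20 alleles, C is the minor allele and its frequency is 5/20.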
7
Data formats
FASTA – raw sequence data; can be nucleotide or protein
SAM/BAM – sequence alignment format
VCF – only variations from the reference (nucleotide); can contain SNPs, insertions and/or deletions (aka indels), microsatellites
PED/MAP – Plink default format; incorporates familial and phenotypic data with genetic data (slide in appendix)
VEP – Ensembl’s variation annotation format
8
FASTA (and FASTQ)
Sample ID
DNA sequence (or amino acid)
Quality scores (FASTQ)
Very large file sizes
Quality characters run from ! (lowest) to ~ (highest): numbers (0 to 9) are low-mid, capital letters (A to Z) are mid-high, small letters (a to z) are high-higher
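The ordering of quality characters follows from how FASTQ encodes scores as ASCII. Below is a minimal sketch, assuming the common Phred+33 encoding (consistent with ! being the lowest character and ~ the highest); the record shown is made up, and older files may use a different offset.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding by default, where '!' (ASCII 33) is score 0.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical four-line FASTQ record: ID, sequence, separator, qualities.
record = ["@read1", "GATT", "+", "!5I~"]
scores = phred_scores(record[3])
for base, score in zip(record[1], scores):
    print(base, score, error_probability(score))
```

A digit such as 5 decodes to a score of 20 (1 in 100 chance of error), while the capital letter I decodes to 40 (1 in 10,000).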
9
VCF
Small file size and widely used
Very useful for Mendelian disease studies
Software: Vcftools
Reference human genome: GRCh37 (release 74)
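To show what a VCF record holds, here is a minimal sketch that reads the first five tab-separated columns (CHROM, POS, ID, REF, ALT) and labels each variant by comparing allele lengths. The records are made up, the function names are hypothetical, and real work would normally go through Vcftools or a dedicated parser.

```python
def parse_vcf_line(line):
    """Parse the fixed leading columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
    }


def variant_type(ref, alt):
    """Rough classification based only on allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # same length, more than one base


def read_vcf(lines):
    """Yield parsed records, skipping '#' header lines and blank lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_vcf_line(line)


# Hypothetical VCF content for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tC\t.\tPASS\t.",
    "1\t2000\t.\tA\tAT\t.\tPASS\t.",
    "1\t3000\t.\tATG\tA\t.\tPASS\t.",
]
for rec in read_vcf(example):
    print(rec["chrom"], rec["pos"], [variant_type(rec["ref"], a) for a in rec["alt"]])
```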
10
Raw sequence databases
Ensembl (Europe), NCBI (USA)
DNA: NC_ (complete genomic molecule), NG_ (genomic region)
RNA: NM_ (mRNA), NR_ (non-coding RNA transcript)
Protein: NP_ (protein)
Also see RefSeq (NCBI) slide in appendix
Ensembl is neater but NCBI has more data
XM_, XR_ and XP_ are predicted models and are not curated
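The accession prefixes can be read mechanically. The sketch below is a small lookup covering only the prefixes listed on this slide; the function name is hypothetical, and the accession strings in the usage lines are there only to exercise the prefix logic.

```python
# Prefix -> (molecule type, curated?) for the prefixes named on the slide.
REFSEQ_PREFIXES = {
    "NC": ("complete genomic molecule", True),
    "NG": ("genomic region", True),
    "NM": ("mRNA", True),
    "NR": ("non-coding RNA", True),
    "NP": ("protein", True),
    "XM": ("predicted mRNA model", False),
    "XR": ("predicted non-coding RNA model", False),
    "XP": ("predicted protein model", False),
}


def classify_refseq(accession):
    """Return (molecule type, curated flag) for a RefSeq-style accession."""
    prefix, separator, _rest = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognised RefSeq prefix in {accession!r}")
    return REFSEQ_PREFIXES[prefix]


# Example accession strings; only the prefix matters for this lookup.
print(classify_refseq("NM_000518"))
print(classify_refseq("XP_000000"))
```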
11
*Ensembl www.ensembl.org/info/data/ftp/index.html
DNA: useful for mapping reads, designing primers
cDNA = mRNA (transcripts)
CDS = coding sequence (the protein-coding part of a gene’s exons)
VCF: HapMap, 1000 Genomes, Venter, Watson
VEP
66 species (as of 26/02/14)
12
Genome Browsers
Ensembl Genome Browser
View region
Extract region
Filter variants in region: prediction, MAF, conservation (GERP scores)
UCSC Genome Browser
1000 Genomes Browser
VEGA Genome Browser
13
*Ensembl Genome Browser (EGB)
Browse many genomes (>70 vertebrates)
Example: BRCA2
Start and end
Sequence (exons, introns)
Transcripts
Orthologues: sequence homology in other species, indicative of similar function
Easy sequence extraction
Other EGBs: EnsemblPlants, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblProtists
14
EGB – Conservation (GERP)
Location GERP elements
15
SNP databases
Ensembl (and NCBI)
dbSNP (largest database for SNPs)
1000 Genomes: whole genome sequencing of 1,092 individuals (African, European, Far East)
HapMap projects: genotyping of 270 individuals from 4 populations
Exome Variant Server (EVS): whole exomes of 6,503 individuals (European American and African American)
SNP and phenotype associations: GWAS Catalogue (slide in appendix)
16
*dbSNP http://ncbi.nlm.nih.gov/SNP
Search, annotate and submit SNPs
Apply filters, e.g. human, cited, clinical
>70 million SNPs just for humans
dbSNP handbook
17
From GWAS catalogue
One of the most famous figures out there
[Figure from the GWAS catalogue; visible label: Y chromosome]
18
Clinical databases
OMIM – Online Mendelian Inheritance in Man
Database of disease-linked genes and associated phenotypes
Links to Entrez, GDB and other databases
HGMD
Database of sequences and phenotypes of disease-causing mutations
Used to train mutation effect prediction algorithms (e.g. FATHMM)
Slide in appendix
mutDB
Mitomap (mitochondrial DNA)
19
*OMIM http://omim.org/
Online Mendelian Inheritance in Man
First port of call for Mendelian disorders
Disease > gene and phenotype
Gene > disease and phenotype
Phenotypes > disease and genes
Example search: Autosomal recessive intellectual disability
+autosomal +recessive +intellectual +disability
20
Protein databases
Uniprot (Swiss-Prot + TrEMBL)
PDB – Protein Data Bank (slide in appendix)
Pfam – Protein families
STRING – Protein-protein interactions
HMPD (mitochondrial proteins)
21
Uniprot http://www.uniprot.org/
Swiss-Prot is manually curated
TrEMBL is automatically annotated (not manually reviewed)
Download protein sequence
Protein sequence BLAST
Align: conserved regions and/or residues
View AA properties
Try example: gene DNALI1; retrieve the FASTA sequence, BLAST and align (3 species)
22
Very useful integrative databases
GeneCards
Integrated resource of information on human genes and their products
Major emphasis on human disease
Links to many kinds of biomedical information: sequence databases; OMIM, HGMD, MDB; Doctors’ Guide to the Internet
Ensembl (and NCBI) also fit in this category
23
*Genecards http://www.genecards.org/
Graphical view of many things about your gene Links to Ensembl, OMIM and Literature Example: DNALI1 Entrez Gene Summary Associated disorders Orthologues STRING predictions
24
Google is your friend! PubMed – Bibliographic database Animal models
Zfin – Zebrafish knockouts International Mouse Phenotyping Consortium New ones coming out all the time! New cohorts and studies UK10K project Human Microbiome project ARIES - Epigenetics
25
*PubMed http://www.ncbi.nlm.nih.gov/pubmed/
“>23 million citations for biomedical literature from MEDLINE, life science journals, and online books”
Simple to use
Searching
Citation manager facility
PubMed help book
Finally, being able to search the literature efficiently and find the right papers is very important, and I have found PubMed a very reliable and easy-to-use database for bibliographic sources.
26
Exercise 1 - SNPs Find where rs9939609 is located
Is it in an exon or an intron? Minor allele? Global MAF? Which gene(s) are close by? Associated with any disorders? Which population(s) is the minor allele most frequent in? How many other known human SNPs in this gene?
27
Answers
Chr 16 at position 53,820,527
Intron
A, 0.355
FTO
BMI, Type 2 diabetes, Menarche
Luhya in Webuye, Kenya (0.617): dbSNP, scroll to bottom
8099 (as of 21/02/14): search for FTO in dbSNP and filter for H. sapiens
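The same lookup can be scripted instead of done in the browser. The sketch below assumes the public Ensembl REST service at rest.ensembl.org and its variation endpoint; the helper names are hypothetical, and the field names mentioned afterwards should be checked against the current REST documentation. Nothing is fetched until `fetch_variation` is called.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed public REST server


def variation_url(rsid, species="human"):
    """Build the URL for one variant on the Ensembl REST variation endpoint."""
    return f"{ENSEMBL_REST}/variation/{species}/{rsid}?content-type=application/json"


def fetch_variation(rsid, species="human"):
    """Download and decode the JSON record for one variant (needs network)."""
    with urllib.request.urlopen(variation_url(rsid, species), timeout=30) as response:
        return json.load(response)


print(variation_url("rs9939609"))
```

Calling `fetch_variation("rs9939609")` should return a dictionary; keys such as `minor_allele`, `MAF` and `mappings` are expected but are best read with `.get()` in case the schema differs. The default server reports coordinates on a newer assembly than GRCh37, so the position may not match the answer above.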
28
Exercise 2 - Genes Find the start and end coordinates of your favourite gene (e.g. DNAH5) How many exons does it have? How many different transcripts does it have? What is the function? Associated with any disorders? Which proteins are predicted to interact with it? Extract the coding sequence – in FASTA
29
Answers
Chr 5 from 13,690,440 to 13,944,652
79
4
Force generating protein of respiratory cilia (from GeneCards – UniprotKB section)
Primary ciliary dyskinesia
DNAH1, DYNLL1, DNAL4 (from STRING)
Many ways to do it, for example: export data in Ensembl (DNA); Q8TE73 in Uniprot (AA)
30
Thank You Any questions?
Please look back at the slides again once you complete the short-course(s)
31
Appendices Two additional terms which must be understood to make full use of the databases 10 useful websites/databases we do not have the time to go through Two additional ‘must know’ data formats
32
LD
Non-random association of alleles at two or more loci
Simple example: the only haplotypes observed are A at chr1:1000 with T at chr3:500, and C at chr1:1000 with A at chr3:500. No other haplotypes occur, therefore chr1:1000 and chr3:500 are in LD.
Thus not all SNPs are independent, and carefully selected SNPs can save money (e.g. tag SNPs)
Different from linkage: proximal loci on the same chromosome inherited together during meiosis
These designed SNP arrays were also used in the HapMap project to discover LD in different populations. If you know that 50 SNPs are inherited together (as a block), you can genotype just one of them and impute the rest, which saves even more.
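The simple example can be checked numerically with the usual LD measures, D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1-p(A))p(B)(1-p(B))). The sketch below assumes phased haplotypes are available and uses made-up counts (five of each haplotype); the function name is hypothetical.

```python
def ld_stats(haplotypes, allele1, allele2):
    """Return (D, r_squared) for allele1 at locus 1 and allele2 at locus 2.

    haplotypes is a list of (locus1_allele, locus2_allele) tuples.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h == (allele1, allele2) for h in haplotypes) / n
    d = p12 - p1 * p2  # deviation from independence
    denominator = p1 * (1 - p1) * p2 * (1 - p2)
    if denominator == 0:
        raise ValueError("at least one locus is monomorphic")
    return d, d * d / denominator


# Slide example: only A-T and C-A haplotypes exist (hypothetical counts).
in_ld = [("A", "T")] * 5 + [("C", "A")] * 5
# Contrast: all four combinations equally common, so the loci are independent.
independent = [("A", "T"), ("A", "A"), ("C", "T"), ("C", "A")] * 3

print(ld_stats(in_ld, "A", "T"))
print(ld_stats(independent, "A", "T"))
```

When only two haplotypes exist, r² comes out at 1, meaning the allele at one locus fully predicts the allele at the other; this is what makes tag SNPs and imputation work.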
33
Genetic association study
GWAS: SNP arrays, using LD; very cheap (23andme offers for $100); cases v controls
Whole exome sequencing (WES) and whole genome sequencing (WGS): expensive
Candidate gene analyses, e.g. ones identified by GWAS
Animal models
Where thousands of individuals are needed to detect the effect of a SNP on a phenotype, GWAS are used. If very few people are enough, then WGS or WES is used. Anything that is in LD with the SNP can be the causal variant: the SNP you have genotyped just represents the whole LD block.
34
RefSeq (NCBI) http://www.ncbi.nlm.nih.gov/refseq/
“A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein”
>33000 species (and/or strains)
CCDS project
35
GWAS catalogue www.genome.gov/gwastudies/
Search for SNP-phenotype associations from GWAS Search, view and filter Try example: BMI and P value of 1e-8 Result: 11 papers (as of 21/02/14)
36
HGMD http://www.hgmd.cf.ac.uk/ac/index.php
“HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease” Public version registration required Professional version Purchase a licence
37
PDB – Protein Data Bank http://www.rcsb.org/pdb/home/home.do
Links biochemistry to your study If there is data of course! 3D view of protein (in Jmol) Amino acid sequence Try example: FTO (4IDZ)
39
Pfam http://pfam.sanger.ac.uk/ Protein families and domains
Predicted to have similar functions Domain organisation Phylogenetic tree Links to PDB Try example: Dynein_heavy (PF03028)
40
STRING http://string-db.org/ Predicts protein-protein interactions
Coexpression Literature Experiments Genomic context Try example: DNALI1
42
KEGG http://www.kegg.jp/
“resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies”
Similar to STRING but manually curated
More reliable
My friends tell me the difference between STRING and KEGG is like the difference between Swiss-Prot and TrEMBL: STRING is automated, whereas KEGG is manually curated, so it is more reliable but holds less information.
43
Regulatory elements
Rfam (RNA family)
Noncoding RNA database
Bioexplorer.net
44
ENCODE http://www.nature.com/encode/
Radical shake-up of the interrogation of genomic “function”
Data available: functional impact of variant sites in multiple tissues; multiple assay types
Analysis/visualisation software and scripts for the generation of figures
This is a massive database, so we won’t be going through it, and there is no time for it either. The good news is that Ensembl links directly to it and annotates some variations in accordance with ENCODE (regulatory feature).
46
Locus specific or disease specific databases
By HGVS Human Genome Variation Society Example: Ciliome database
47
PED/MAP
Most used format, with user-friendly software: Plink
pngu.mgh.harvard.edu/~purcell/plink/
One of the most cited software packages
Ped: FAM A A G G A C ... FAM A A A G ....
Map: 1 rs 1 rs 1 rs …
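The layout of the two files is easier to see with a small reader. This sketch assumes the standard Plink text layout: a MAP line holds chromosome, SNP ID, genetic distance and base-pair position, and a PED line holds six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per marker. All IDs and genotypes below are made up, and the function names are hypothetical.

```python
def parse_map(lines):
    """Parse MAP lines: chromosome, SNP ID, genetic distance, bp position."""
    markers = []
    for line in lines:
        chrom, snp_id, distance, position = line.split()
        markers.append(
            {"chrom": chrom, "id": snp_id, "cm": float(distance), "bp": int(position)}
        )
    return markers


def parse_ped(lines, markers):
    """Parse PED lines into per-individual dicts keyed by SNP ID."""
    individuals = []
    for line in lines:
        fields = line.split()
        family, individual, father, mother, sex, phenotype = fields[:6]
        alleles = fields[6:]
        if len(alleles) != 2 * len(markers):
            raise ValueError(f"{individual}: expected two alleles per marker")
        genotypes = {
            marker["id"]: (alleles[2 * i], alleles[2 * i + 1])
            for i, marker in enumerate(markers)
        }
        individuals.append(
            {"family": family, "id": individual, "sex": sex,
             "phenotype": phenotype, "genotypes": genotypes}
        )
    return individuals


# Hypothetical example files (three markers, two individuals).
map_lines = ["1 snp1 0 1000", "1 snp2 0 2000", "1 snp3 0 3000"]
ped_lines = [
    "FAM1 IND1 0 0 1 2 A A G G A C",
    "FAM1 IND2 0 0 2 1 A A A G C C",
]
markers = parse_map(map_lines)
for person in parse_ped(ped_lines, markers):
    print(person["id"], person["genotypes"])
```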
48
Variant effect predictor (VEP)
Example file: