Evolution
- Aristotle: classification of animals; theories on change (change is the actuality of the potential)
- Darwin: descent with modification; natural selection
- There is no evolution without change
Evolving nomenclature
- Change in DNA code = genetic variation
- Change with respect to what? Any consequence?
- Mutations
- Single Nucleotide Polymorphisms (SNPs)
- Deletion/insertion polymorphisms (DIPs)
- Short Nucleotide Polymorphisms (SNPs)
- Short Nucleotide Variants (SNVs)
- Short Genetic Variants
Definitions
pol·y·mor·phism n.
1. Biology: The occurrence of different forms, stages, or types in individual organisms or in organisms of the same species, independent of sexual variations.
2. Chemistry: Crystallization of a compound in at least two distinct forms. Also called pleomorphism.
var·i·ant
adj.
1. Having or exhibiting variation; differing.
2. Tending or liable to vary; variable.
3. Deviating from a standard, usually by only a slight difference.
n. Something that differs in form only slightly from something else, as a different spelling or pronunciation of the same word.
- Human Genome Project
- ENCODE project
- HapMap project
- SNP Consortium
- Individual human genomes: James Watson, Craig Venter, three Asian individuals
Evolving SNV analysis needs
- From a single SNP to millions of SNPs
- How to structure the analysis is based on the same theories; it is a question of scale and heuristics
- Finding SNPs in a single gene sequence
- Finding SNPs in GWAS, exome sequencing, and other large-scale studies
Calling SNPs in NGS
- Polymorphisms are called with respect to a reference genome
- Challenging because of alignment errors and variable depth of coverage
- Accuracy is essential: diagnostics, risk assessment
- False positives and false negatives are both a problem
- Given a 1% sequencing error rate, how many high-quality reads do we need to call a variant?
- Quality scores differ per experiment
- The tools we use should have prior knowledge of known SNPs and their relevance to our question, i.e. whether or not they cause disease
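One rough way to approach the read-count question is a binomial model. The sketch below assumes that each read covering a site independently shows a wrong base with probability 0.01 (the 1% figure from the slide), and asks how many non-reference reads out of n would be unlikely to arise from error alone. The function names and the `alpha` threshold are illustrative assumptions, and real variant callers also use base and mapping qualities, which are ignored here.

```python
from math import comb

def prob_at_least_k_errors(n_reads, k, error_rate=0.01):
    """P(X >= k) for X ~ Binomial(n_reads, error_rate)."""
    p_fewer = sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k)
    )
    return 1.0 - p_fewer

def min_supporting_reads(n_reads, error_rate=0.01, alpha=1e-6):
    """Smallest k for which seeing k or more erroneous reads has probability < alpha.

    alpha is a hypothetical false-positive tolerance, not a standard value.
    Returns None if no k up to n_reads satisfies the threshold.
    """
    for k in range(1, n_reads + 1):
        if prob_at_least_k_errors(n_reads, k, error_rate) < alpha:
            return k
    return None

if __name__ == "__main__":
    for depth in (10, 30, 100):
        print(depth, min_supporting_reads(depth))
```

Under this simplified model, deeper coverage requires more supporting reads before errors can be ruled out, which is one reason variable depth of coverage makes a single fixed threshold unreliable.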
Prioritization of SNPs
- You have millions
- How do you know which are important for your research?
- First, let's look at what SNPs can do…
So you have a SNP
- Is it associated with disease? If so, why?
- Is it to do with protein function, transcriptional regulation, both, or neither?
- If none of the above, then why is it associated with disease?
- How do you begin to imagine its function?
SNP function prediction (summary)
In coding sequence: protein function
- Ligand binding affinity
- Co-factor binding affinity
- Targeting to a different cellular compartment
In coding or non-coding sequence: gene processing
- Transcriptional regulation
- Translational regulation
- Splicing
Assessment of SNP Function
3/14/2018
- Position of SNP: dbSNP or new SNP; first identify location
- In a coding sequence, non-synonymous: Protein Data Bank, PolyPhen, UniProt, PsiPred (secondary structure prediction tool), ProSite, InterPro
- Done individually, or incorporated into software to scale up for high throughput

1. Check the SNP position at dbSNP.
2. If it is in a coding sequence of a gene, is it a synonymous SNP?
   - If yes, it is probably affecting something other than protein function; treat it as a SNP in a UTR (below).
   - If it is a non-synonymous SNP:
     - check the amino acid substitution
     - check conservation of the domain in UniProt and Pfam
     - see if there is a 3D structure (at the Protein Data Bank); if yes, conduct analysis in PolyPhen; if no, conduct secondary structure analysis on the domain as defined by UniProt or InterPro
3. If it is in a UTR or just upstream, check whether it is on a known regulatory element:
   - transcription factor binding site (TRANSFAC)
   - miRNA (miRNA registry)
   - alternative transcriptional start sites (DBTSS)
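The triage described on this slide can be written down as a small decision function, which is one way to start scaling it up for high throughput. This is only a sketch: the `Variant` fields and the returned step descriptions are hypothetical, and nothing here queries dbSNP, PolyPhen, or any other resource; it only decides which checks would come next.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    # All fields are illustrative assumptions; in practice they would be
    # filled in from a dbSNP lookup or an annotation step.
    in_coding_sequence: bool
    synonymous: bool = False
    has_3d_structure: bool = False

# Checks for a SNP in a UTR, just upstream, or synonymous in a coding sequence.
REGULATORY_CHECKS = [
    "transcription factor binding site (TRANSFAC)",
    "miRNA (miRNA registry)",
    "alternative transcriptional start sites (DBTSS)",
]

def next_steps(variant):
    """Return the list of follow-up checks suggested by the slide's workflow."""
    if variant.in_coding_sequence and not variant.synonymous:
        steps = [
            "check amino acid substitution",
            "check domain conservation (UniProt, Pfam)",
        ]
        if variant.has_3d_structure:
            steps.append("structure-based analysis (PolyPhen)")
        else:
            steps.append("secondary structure analysis of the domain")
        return steps
    # Synonymous coding SNPs are treated like SNPs in a UTR.
    return list(REGULATORY_CHECKS)

if __name__ == "__main__":
    print(next_steps(Variant(in_coding_sequence=True, has_3d_structure=True)))
    print(next_steps(Variant(in_coding_sequence=False)))
```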
Example: AGT & Hyperoxaluria
SNP mutation causes disease
- CCA > CTA => Proline > Leucine (P11L)
- [Figure: structures of L (Leu) and P (Pro)]
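The codon change on this slide can be checked by translating both codons with the standard genetic code. The sketch below builds the standard codon table and reports a substitution in the one-letter notation used on the slide; the function name `describe_substitution` is an illustrative choice.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def describe_substitution(ref_codon, alt_codon, residue_number):
    """Return e.g. 'P11L', or 'synonymous' if the amino acid is unchanged."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    return f"{ref_aa}{residue_number}{alt_aa}"

if __name__ == "__main__":
    print(describe_substitution("CCA", "CTA", 11))
```

A single-base change in the third position, such as CCA to CCG, leaves proline unchanged, which is why synonymous SNPs are handled separately from non-synonymous ones.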
Two more in AGT
- Gly82Glu blocks binding to cofactor (G: Gly, E: Glu)
- Gly41Arg disrupts intermonomer interactions (G: Gly, R: Arg)
Assessment of SNP Function - I
- Position of SNP
- In CDS, non-synonymous: Protein Data Bank, PolyPhen, UniProt, PsiPred, ProSite, InterPro
- Upstream of CDS, or in CDS and synonymous: SignalP, ProSite, rate of processing? TRANSFAC, DBTSS, NXSensor
- Is it in a regulatory element?
[Figure: gene structure from 5' to 3': promoter with transcription factor binding sites (TFBSs), transcriptional start site (TSS), 5'UTR, translation initiation site (initiation codon ATG), Exon 1, Exon 2]
SNP in a regulatory element
TFBS:
ACAGTCGTAAGGCTGATTGGCTGGATAGCAGTACG
ACAGTCGTAAGGCTAATTGGCTGGATAGCAGTACG
- Single nucleotide polymorphism
- May disrupt TF binding and therefore functionality
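The difference between the two sequences on this slide is hard to spot by eye. A minimal sketch for locating it is a position-by-position comparison of two equal-length sequences. Treating the first sequence as the reference is an assumption, since the slide does not say which allele is which.

```python
def find_mismatches(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (no indels handled)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

# The two sequences from the slide; which one is the reference is assumed.
SEQ_1 = "ACAGTCGTAAGGCTGATTGGCTGGATAGCAGTACG"
SEQ_2 = "ACAGTCGTAAGGCTAATTGGCTGGATAGCAGTACG"

if __name__ == "__main__":
    print(find_mismatches(SEQ_1, SEQ_2))  # prints [(15, 'G', 'A')]
```

This only works for already aligned sequences of equal length. Insertions and deletions need a proper alignment step first.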
Example: CYP2E1
[Figure: track from DBTSS showing the SNP, the ATG, and the TSS]
Nucleosomes
Assessment of SNP Function - II
- In non-coding sequence
- First, assess conservation
- TRANSFAC
- miRNA registry
- RepeatMasker
- Alternative splicing
- HapMap
- Is it in a regulatory element?
Prioritization of SNPs
- You have millions
- How do you know which are important for your research?
- How do you (can you?) implement this in a pipeline so you can do thousands at once?
- How can you come up with strategies to prioritise?
Statistical genetics
- If a SNV is present in all members of the family, affected and not, then it is probably innocuous
- Some methods are based on how common these variants are in families, i.e. shared ancestral variants, genetic linkage, and co-segregation
- Need pedigree and haplotype information
- Mostly used in GWAS
- Tools: BEAGLE, GERMLINE, PLINK IBD, MERLIN
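The family-based reasoning above can be illustrated with a toy filter. It labels a variant as shared by everyone (and so unlikely to explain the disease), as co-segregating with disease (carried by every affected member and no unaffected member), or as inconsistent. This is a deliberately simplified sketch that assumes a fully penetrant dominant model and ignores haplotypes and linkage; it is not how BEAGLE, GERMLINE, PLINK, or MERLIN work. The family and variant names are hypothetical.

```python
def classify_by_segregation(carriers, affected_status):
    """Label each variant by how it segregates with disease in one family.

    carriers: dict mapping variant id -> set of member ids carrying the variant
    affected_status: dict mapping member id -> True if the member is affected
    """
    members = set(affected_status)
    affected = {m for m, is_affected in affected_status.items() if is_affected}
    unaffected = members - affected
    labels = {}
    for variant, who in carriers.items():
        if who >= members:
            labels[variant] = "shared by all members"
        elif who >= affected and not (who & unaffected):
            labels[variant] = "co-segregates with disease"
        else:
            labels[variant] = "inconsistent"
    return labels

# Hypothetical four-person family and three hypothetical variants.
FAMILY = {"mother": True, "father": False, "child1": True, "child2": False}
CARRIERS = {
    "var_A": {"mother", "father", "child1", "child2"},
    "var_B": {"mother", "child1"},
    "var_C": {"father", "child1"},
}

if __name__ == "__main__":
    for variant, label in classify_by_segregation(CARRIERS, FAMILY).items():
        print(variant, label)
```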
Several Tools Out There
- For example: SeattleSeq, dbNSFP
- Built into other NGS analysis software
- New ideas continue to emerge…
The Plot Thickens…
If you go directly to dbSNP from Google (10 Nov 2015)
The NCBI homepage: if you go to dbSNP from here
You get this: but no worries, both access the same underlying database.
Combining gene expression & variations
- eQTL: expression quantitative trait locus
- Correlation between gene expression and frequency of variation
- Simple linear regression (Matrix eQTL)
- Significance is assessed by p-value
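As a rough illustration of the eQTL idea, the sketch below regresses expression on genotype coded as the number of minor alleles (0, 1, 2) and estimates a p-value for the slope. Matrix eQTL is an R package and is not used here. The permutation test is a stand-in chosen to keep the example dependency-free; it is not the analytical test such tools apply. The data are made up for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for the slope, from shuffling expression values."""
    rng = random.Random(seed)
    observed = abs(fit_line(x, y)[0])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(fit_line(x, shuffled)[0]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 12 samples, genotype = minor allele count.
GENOTYPES = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
EXPRESSION = [5.1, 4.8, 5.3, 5.0, 6.2, 5.9, 6.4, 6.1, 7.0, 7.3, 6.8, 7.1]

if __name__ == "__main__":
    slope, intercept = fit_line(GENOTYPES, EXPRESSION)
    print(f"slope={slope:.2f} intercept={intercept:.2f}")
    print("p-value:", permutation_p_value(GENOTYPES, EXPRESSION))
```

In a real eQTL scan this test is repeated for very many SNP and gene pairs, so the p-values would also need correcting for multiple testing.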