MES Genome Informatics I - Lecture VIII. Interpreting variants Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Genome Informatics I (2015 Spring)
Overview
Goal of this lecture
– You will learn how to interpret discovered variants, filtering and prioritizing them for an associated phenotype (e.g. disease), and you will practice doing so
Predicting functional impact of variants
– Utilizing sequence features
– Utilizing protein features
Popular methods and practice
– PolyPhen-2
– MutationAssessor
– SeattleSeq
FUNCTIONAL IMPACT OF VARIANTS
We usually have too many variants
Source: Saksena et al., “Developing Algorithms to Discover Novel Cancer Genes: A look at the challenges and approaches”
We want to narrow the set of “called” variants down to as few as possible
A simple mutation calling does not give you the final answer
mutation calling (NGS) → a lot of candidate variants
– some from sequencing error
– some from polymorphisms
– some from mapping error
– some are passengers
– a few real pathogenic variants
Gold mining
Bunch of candidate variants → Many variants → A few variants
Strategy I: Do they really exist?
- Any mistakes in sequencing and variant calling?
- Any non-disease-causing polymorphisms?
Strategy II: Are they functional?
- Are they damaging? pathogenic?
- Are they related to phenotypes?
Five ways to narrow down
Strategy I
1. Include control data: eliminate germline variants
2. Use a stricter variant quality threshold: work on only confident variants
3. Filter out polymorphisms: remove non-damaging polymorphisms
Strategy II
4. Predict functional impacts: estimate how damaging each variant is
5. Use disease-specific knowledge: to acquire final candidates
1. Include control data
[Figure labels: germline, somatic, somatic; values: 100,000~, ~500, ~1000]
We should eliminate unwanted germline variants
When controls are unavailable
Single nucleotide polymorphism rate = 1/100~1/1000
Whole Genome Sequencing
– Total DNA length = 3 billion bp
– Expected SNP numbers = 3~30 million
Whole Exome Sequencing
– Total DNA length = 50 million bp
– Expected SNP numbers = 50~500 thousand
Targeted Sequencing (Panel)
– Total DNA length = 100~1000 thousand bp
– Expected SNP numbers = 100~10,000
Hotspot Panel (only for very well known variants)
– Controls can be omitted
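The arithmetic behind these estimates can be sketched in a few lines. This is only an illustration that multiplies each target size by the two ends of the stated SNP rate range; the function and variable names are made up for this example.

```python
# SNP rate range from the slide: one SNP per 1000 bp up to one per 100 bp.
BP_PER_SNP_LOW_RATE = 1000
BP_PER_SNP_HIGH_RATE = 100

# Approximate target sizes in base pairs (min, max), taken from the slide.
TARGET_SIZES = {
    "whole genome": (3_000_000_000, 3_000_000_000),
    "whole exome": (50_000_000, 50_000_000),
    "targeted panel": (100_000, 1_000_000),
}


def expected_snp_range(min_bp, max_bp):
    """Return (lowest, highest) expected SNP count for a target size range."""
    return min_bp // BP_PER_SNP_LOW_RATE, max_bp // BP_PER_SNP_HIGH_RATE


for name, (min_bp, max_bp) in TARGET_SIZES.items():
    low, high = expected_snp_range(min_bp, max_bp)
    print(f"{name}: {low:,} to {high:,} expected SNPs")
```

Even a small panel is expected to carry a noticeable number of polymorphisms, which is why a polymorphism check is still needed when no control sample is sequenced.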
2. Use a stricter quality threshold
Variant quality
Low variant quality
- This variant (although it has been called) can be false
Causes of low quality
- Low read depth (insufficient observation)
- Bad basecall/mapping quality
- Low allele frequency
2. Use a stricter quality threshold
Possible actions
– Cut out variants based on
    Variant quality (e.g. QUAL<10)
    Total read depth (e.g. <20)
    Alt-allele read depth (e.g. <5)
    Allele frequency (e.g. <0.1)
– Prioritize variants
    Sort by variant quality and inspect from the top
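The two actions above (cut, then prioritize) can be sketched as follows. This is a minimal illustration, not the output format of any particular caller: the record keys (`qual`, `depth`, `alt_depth`) and the example records are hypothetical, and the thresholds are just the example values from the slide.

```python
# Example thresholds from the slide; tune them for your own data.
MIN_QUAL = 10
MIN_DEPTH = 20
MIN_ALT_DEPTH = 5
MIN_ALLELE_FREQ = 0.1


def passes_filters(variant):
    """Return True if a variant record clears all four thresholds."""
    depth = variant["depth"]
    alt_depth = variant["alt_depth"]
    # Allele frequency is computed here as alt reads over total reads.
    allele_freq = alt_depth / depth if depth else 0.0
    return (
        variant["qual"] >= MIN_QUAL
        and depth >= MIN_DEPTH
        and alt_depth >= MIN_ALT_DEPTH
        and allele_freq >= MIN_ALLELE_FREQ
    )


def filter_and_rank(variants):
    """Drop low-confidence variants, then sort the rest by quality (best first)."""
    kept = [v for v in variants if passes_filters(v)]
    return sorted(kept, key=lambda v: v["qual"], reverse=True)


# Hypothetical records, for illustration only.
example_variants = [
    {"id": "v1", "qual": 55.0, "depth": 80, "alt_depth": 30},
    {"id": "v2", "qual": 8.0, "depth": 60, "alt_depth": 20},    # fails QUAL
    {"id": "v3", "qual": 40.0, "depth": 12, "alt_depth": 6},    # fails total depth
    {"id": "v4", "qual": 90.0, "depth": 100, "alt_depth": 7},   # fails allele frequency
    {"id": "v5", "qual": 70.0, "depth": 50, "alt_depth": 25},
]

for variant in filter_and_rank(example_variants):
    print(variant["id"], variant["qual"])
```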
3. Filter out polymorphisms
When you had no control data (panel)
– Check whether the variants have been reported as polymorphisms
When you had control data
– You may not have polymorphisms, because somatic mutation callers remove germline calls
– However, in some cases polymorphisms can still be reported as somatic mutations, for example when read depth is low in the control sample
[Figure labels: low depth, bad region, Variant Undetected, Variant Detected]
dbSNP
Database of SNP
[Screenshot: dbSNP lookup for chr7: A>T]
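Checking called variants against a list of reported polymorphisms amounts to a set lookup. The sketch below assumes the known polymorphisms have already been loaded into memory as (chromosome, position, ref, alt) tuples; the positions are hypothetical, and this is not how dbSNP itself is queried.

```python
def remove_known_polymorphisms(variants, known_polymorphisms):
    """Keep only the variants that are absent from the known polymorphism set."""
    return [v for v in variants if v not in known_polymorphisms]


# Hypothetical coordinates, for illustration only.
known = {
    ("chr7", 1000, "A", "T"),
    ("chr7", 2500, "G", "C"),
}
called = [
    ("chr7", 1000, "A", "T"),   # matches a reported polymorphism
    ("chr7", 1000, "A", "G"),   # same position, different alternate allele
    ("chr22", 300, "C", "T"),
]

print(remove_known_polymorphisms(called, known))
```

Matching on the alternate allele as well as the position keeps a novel change at a known polymorphic site from being discarded.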
4. Predict functional impacts
Types of point mutations
– Coding mutations
    Synonymous (silent) – Amino acid unchanged
    Missense – Amino acid changed
    Nonsense – Stop codon gained
    Readthrough – Stop codon lost
– Non-coding mutations
    Intron
    Splice variants
    Variants in regulatory elements
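The four coding categories above follow directly from comparing the reference and altered codons under the standard genetic code. A minimal sketch, assuming the codons are already known, in frame, and on the coding strand (the function name is made up for this example):

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # stop codon gained
    if ref_aa == "*":
        return "readthrough"   # stop codon lost
    return "missense"


print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
print(classify_codon_change("GAA", "GTA"))  # Glu -> Val
print(classify_codon_change("GAA", "TAA"))  # Glu -> stop
print(classify_codon_change("TAA", "CAA"))  # stop -> Gln
```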
Functional impacts
Types of indels
– Inframe
    Insertion or deletion of a multiple of 3 base pairs
– Frameshift
    Insertion or deletion whose length is not a multiple of 3 base pairs
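The inframe/frameshift distinction depends only on the length difference between the two alleles. A short sketch, assuming the indel lies entirely within a coding region and is given as VCF-style REF and ALT strings (the function name is illustrative):

```python
def classify_indel(ref_allele, alt_allele):
    """Classify a coding indel as 'inframe' or 'frameshift' by its length change."""
    length_change = abs(len(alt_allele) - len(ref_allele))
    if length_change == 0:
        raise ValueError("not an indel: alleles have the same length")
    return "inframe" if length_change % 3 == 0 else "frameshift"


print(classify_indel("A", "ACGT"))   # 3 bp insertion
print(classify_indel("ACG", "A"))    # 2 bp deletion
```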
General classification (priority)
[Figure labels: high-impact, low-incidence, low-confidence, High incidence]
Functional impact prediction of missense mutations
How critical is an AA change to its protein function?
– Amino acid conservation
    If the AA is essential, it would be conserved through evolution
– Amino acid in protein conformation
    Substitution of an AA in the active site would be more damaging
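To make the conservation idea concrete, the sketch below scores one column of a multiple sequence alignment as the fraction of sequences that share the most common residue. This is a deliberately simple stand-in, not the scoring used by PolyPhen-2 or MutationAssessor, and the alignment is a made-up toy example.

```python
from collections import Counter


def column_conservation(aligned_sequences, position):
    """Fraction of non-gap residues at `position` that match the most common residue."""
    column = [seq[position] for seq in aligned_sequences if seq[position] != "-"]
    if not column:
        return 0.0
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column)


# Hypothetical aligned orthologs, for illustration only.
alignment = [
    "MKTAY",
    "MKSAY",
    "MRTAY",
    "MKTGY",
]

for pos in range(len(alignment[0])):
    print(pos, column_conservation(alignment, pos))
```

Under this view, a substitution at a fully conserved column (score 1.0) would be prioritized over one at a variable column.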
Amino acid conservation
Protein Structure
5. Use disease-specific knowledge
Your knowledge about the disease
– e.g. cancer
– “Has it been reported in other previous samples?”
– Search for it in COSMIC; if you find that it is recurrent, it is likely to be functional
Five ways to narrow down
1. Include control data: eliminate germline variants
2. Use a stricter variant quality threshold: work on only confident variants
3. Filter out polymorphisms: remove non-damaging polymorphisms
4. Predict functional impacts: estimate how damaging each variant is
5. Use disease-specific knowledge: to acquire final candidates
Many, uncertain variants → A few, reliable variants → Functional study, Mechanism study
SUMMARY OF PART I
- Connect to Linux cluster, job script writing and submission
- NGS technologies, NGS data
- Short read alignment
- Variant calling, CNV, SV calling
- Interpretation of discovered variants
In the remaining classes
Genomic data to expression data
– Gene → mRNA → Protein → Pathways and Networks → Phenotype
Use high-throughput data for your study
Don’t forget your project
PRACTICE - FUNCTIONAL VARIANT ANNOTATION WITH SEATTLESEQ
Today’s data
Somatic variants in chr22 of an anonymous cancer sample, called with Virmid
Data location
– /scratch/2015_GenomeInformatics/{yourdir}/virmid output
– If you did not complete the somatic calling practice, copy it from /scratch/2015_GenomeInformatics/public
Data download to local PC
① Move to your Virmid output directory ② Check your Virmid output ③ Click FTP
④ Double-click
SeattleSeq
Search for it, then click the link marked in the screenshot
SeattleSeq
① write your ② input your VCF file ③ check!! ④ check!!
① Click File > Open... ② Select ‘All files’ ③ Select the annotated file
Filtering phase
accession (column H) – for filtering curated isoforms
    NM: mRNA
    XM: predicted mRNA model → filter out
functionGVS (column I) – for filtering damaging mutation types
    missense, missense-near-splice
    stop-gain, stop-loss
    splice-donor, splice-acceptor
    The others → filter out
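The same two filters can be applied in a script instead of a spreadsheet. The sketch below assumes the annotation has been saved as a tab-delimited table whose header contains `accession` and `functionGVS` columns; the other column names, the accession values, and the example rows are hypothetical, and a real SeattleSeq file may need its header line adjusted before parsing.

```python
import csv
import io

# Mutation types kept as potentially damaging, as listed on the slide.
DAMAGING_TYPES = {
    "missense", "missense-near-splice",
    "stop-gain", "stop-loss",
    "splice-donor", "splice-acceptor",
}


def keep_row(row):
    """Keep curated mRNA isoforms (NM accessions) carrying a damaging mutation type."""
    return row["accession"].startswith("NM") and row["functionGVS"] in DAMAGING_TYPES


def filter_annotations(handle):
    """Read a tab-delimited annotation table and return the rows that pass both filters."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if keep_row(row)]


# Hypothetical annotation rows, for illustration only.
example_tsv = (
    "chromosome\tposition\taccession\tfunctionGVS\n"
    "22\t1000\tNM_EXAMPLE1\tmissense\n"
    "22\t2000\tXM_EXAMPLE2\tstop-gain\n"
    "22\t3000\tNM_EXAMPLE3\tintron\n"
    "22\t4000\tNM_EXAMPLE4\tsplice-donor\n"
)

for row in filter_annotations(io.StringIO(example_tsv)):
    print(row["position"], row["accession"], row["functionGVS"])
```

To run it on a real file, pass an open file handle in place of the `io.StringIO` object.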
IGV download
Search for it, then click the link marked in the screenshot
IGV download
Download, then double-click
IGV view
① Input disease BAM file ② Input normal BAM file ③ Input VCF file