CANDID: A candidate gene identification tool Part 2 Janna Hutz March 26, 2007.


CANDID: A candidate gene identification tool Part 2 Janna Hutz March 26, 2007

Review Literature –Well-characterized genes Protein domains –All genes Cross-species conservation –All genes

Today’s agenda Expression levels Linkage data Association data CANDID performance measures

Candidate lists vs. single candidates Candidate lists –Complex trait or disease –Disease with known heterogeneity Single candidates –Mendelian trait –New disease –Disease with clear, well-defined pathology

Candidate lists vs. single candidates Microarray SNP typing Sequencing Immunocytochemistry Knockout model ACT[A/G]GGA

Example 4 Goiter - thyroid gland problem Iodine deficiency Genetic causes

Example 4 Iodine is not supplied Iodine is present, but is not added to the molecule Which gene is mutated?

Expression data We know what tissue our gene is expressed in (thyroid). How can we use this knowledge to help identify the candidate? Wouldn’t it be nice if we had an expression database?

Expression databases Our ideal expression database would have: –Expression data for the same genes across many different tissues –As many tissues as possible –As many genes as possible –Good documentation Gene Atlas

Genomics Institute of the Novartis Research Foundation 79 human tissues (160 samples) 2 arrays –Affymetrix HG-U133A –GNF1H (custom) 17,809 genes

Measure of gene expression Our thyroid gene: –Gene that is brightest on the thyroid array? –Gene that is brightest on the thyroid array, compared to all the other arrays.
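The second option on this slide can be sketched roughly as follows: divide a gene's signal in the tissue of interest by its highest signal in any tissue, so that a gene scores 1 only when that tissue is where it is brightest. The formula, the function name, and the intensities are illustrative assumptions, not CANDID's documented method or real Gene Atlas values.

```python
def relative_expression_score(expression_by_tissue, tissue):
    """Signal in `tissue` divided by the gene's maximum signal across tissues.

    Assumed scoring rule for illustration; the slides do not give the formula.
    """
    peak = max(expression_by_tissue.values())
    if peak <= 0:
        return 0.0
    return expression_by_tissue[tissue] / peak


# Made-up intensities for two hypothetical genes.
gene_a = {"thyroid": 900.0, "liver": 30.0, "brain": 45.0}
gene_b = {"thyroid": 900.0, "liver": 1800.0, "brain": 1200.0}

# Both genes are equally bright on the thyroid array, but only gene_a
# is brightest there compared to the other arrays.
print(relative_expression_score(gene_a, "thyroid"))
print(relative_expression_score(gene_b, "thyroid"))
```

Under this assumed rule, raw brightness alone does not separate the two genes, while the comparison across arrays does.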

Measures of gene expression Run CANDID, specifying that we’re interested in the thyroid. User name: workshop Password: perl (We’ll need a tissue code for that.)

Example 4 - Results Our favorite genes: TP53 - rank is… –16314th KRAS - rank is… –5229th What genes are ranked most highly?

Example 4 - Results 192 genes with expression score of 1 The TOP gene is actually responsible for the phenotype described earlier –Its expression score = 1

Prior evidence I’m not interested in examining all of the genes in the genome - just some of them. Linkage and association

Linkage CANDID can: –Weight regions with higher LOD scores –Limit analysis to certain regions –How does it do this?

Linkage scoring 3172 Linkage score = gene’s LOD score / maximum genome-wide LOD score
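Reading this slide as a ratio (the gene's LOD score over the maximum genome-wide LOD score), a minimal sketch looks like this. The function name, the clamping of negative LOD scores to zero, and the example values are assumptions made for illustration.

```python
def linkage_score(gene_lod, max_genome_lod):
    """Scale a gene's LOD score by the largest LOD score in the genome scan.

    Negative LOD scores are clamped to 0 here; that choice is an assumption,
    not something the slides specify.
    """
    if max_genome_lod <= 0:
        return 0.0
    return max(gene_lod, 0.0) / max_genome_lod


# Hypothetical LOD scores at three gene positions.
lod_by_gene = {"GENE_A": 3.2, "GENE_B": 1.6, "GENE_C": -0.4}
max_lod = max(lod_by_gene.values())

scores = {gene: linkage_score(lod, max_lod) for gene, lod in lod_by_gene.items()}
print(scores)
```

The gene under the highest peak gets a score of 1, and every other gene is scored relative to that peak.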

Linkage files How does CANDID get this linkage information? CANDID takes two kinds of files –Unformatted output from GENEHUNTER and MERLIN –Custom linkage files

Custom linkage files Simple format Line 1 of the file must contain the word “custom” somewhere Subsequent lines: Chromosome(tab)cM (tab)LOD score But how do I get cM positions?
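As a sketch of the format just described, the following writes and reads a file whose first line contains the word "custom" and whose remaining lines are chromosome, cM position, and LOD score separated by tabs. The function names and the example rows are hypothetical; only the layout comes from the slide.

```python
import io


def write_custom_linkage(rows, handle):
    """Write (chromosome, cM, LOD) rows in the custom linkage layout."""
    handle.write("custom\n")  # line 1 must contain the word "custom"
    for chrom, cm, lod in rows:
        handle.write(f"{chrom}\t{cm}\t{lod}\n")


def read_custom_linkage(handle):
    """Read the layout back into a list of (chromosome, cM, LOD) tuples."""
    first = handle.readline()
    if "custom" not in first:
        raise ValueError('first line must contain the word "custom"')
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate blank lines
        chrom, cm, lod = line.split("\t")
        rows.append((chrom, float(cm), float(lod)))
    return rows


# Made-up rows, written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_custom_linkage([("13", 10.0, 2.5), ("13", 12.5, 3.1)], buffer)
buffer.seek(0)
rows = read_custom_linkage(buffer)
print(rows)
```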

Mapmaker Inputs file as: Chromosome(tab) basepair (tab) LOD score Outputs new file in the format: Chromosome(tab) cM (tab) LOD score Will be available on the CANDID website soon
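The slides do not say how this conversion is done. One common approach, sketched here purely as an assumption, is to interpolate linearly between reference markers whose basepair and cM positions are both known. The marker table and function name below are invented for illustration.

```python
from bisect import bisect_right


def bp_to_cm(position_bp, markers):
    """Estimate a cM position by linear interpolation between markers.

    `markers` is a list of (basepair, cM) pairs sorted by basepair.
    Positions outside the marker range are clamped to the nearest marker.
    """
    bps = [bp for bp, _ in markers]
    if position_bp <= bps[0]:
        return markers[0][1]
    if position_bp >= bps[-1]:
        return markers[-1][1]
    i = bisect_right(bps, position_bp)
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    fraction = (position_bp - bp0) / (bp1 - bp0)
    return cm0 + fraction * (cm1 - cm0)


# Hypothetical reference markers on one chromosome.
markers = [(1_000_000, 0.0), (3_000_000, 4.0), (5_000_000, 6.0)]

print(bp_to_cm(2_000_000, markers))  # halfway between the first two markers
print(bp_to_cm(4_000_000, markers))  # halfway between the last two markers
```

Applying such a function to the second column of a basepair-based file would yield the Chromosome(tab) cM (tab) LOD score layout described above.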

Example 5 Deletion on chromosome 13 between cM and cM. pancreatic cancer

Creating a custom linkage file Example: custom

Running CANDID 1. Try running CANDID using only the linkage criterion. 2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords) Linkage weight = 1000 Literature weight = 1
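To see why a linkage weight of 1000 next to a literature weight of 1 is a useful pairing, here is a small sketch that combines per-criterion scores as a weighted sum. The weighted sum is an assumption (the slides do not state how CANDID combines criteria), and the gene names and scores are made up.

```python
def combined_score(scores, weights):
    """Weighted sum of per-criterion scores (assumed combination rule)."""
    return sum(weights.get(criterion, 0.0) * score
               for criterion, score in scores.items())


weights = {"linkage": 1000, "literature": 1}

# Hypothetical genes with per-criterion scores between 0 and 1.
genes = {
    "GENE_IN_REGION": {"linkage": 1.0, "literature": 0.2},
    "GENE_IN_REGION_WITH_PAPERS": {"linkage": 1.0, "literature": 0.9},
    "GENE_ELSEWHERE": {"linkage": 0.0, "literature": 1.0},
}

ranked = sorted(genes, key=lambda g: combined_score(genes[g], weights), reverse=True)
print(ranked)
```

With these weights, linkage effectively restricts the ranking to the linked region, and literature only breaks ties among the genes inside it.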

Results From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

But linkage is so last season…

Association Increasing numbers of association studies Increasing numbers of SNPs in each study Can CANDID use this information, too?

Association Database –dbSNP million human SNPs –Includes HapMap SNPs –Most comprehensive –Each SNP has a number prefixed with “rs”

Association How does CANDID accept association data? Custom file format - each line is: rs# (tab) p-value

Association scoring For each gene, take the best p-value for that gene’s SNPs Subtract that p-value from 1 Unless you test SNPs in every gene, this can be kind of unfair…
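The scoring rule on this slide (best p-value per gene, subtracted from 1) can be sketched as follows, starting from lines in the rs# (tab) p-value layout. The SNP identifiers are placeholders rather than real dbSNP entries, and the SNP-to-gene table is a hypothetical stand-in for whatever mapping CANDID uses internally.

```python
def parse_association_lines(lines):
    """Parse 'rs# <tab> p-value' lines into a {snp_id: p_value} dict."""
    pvalues = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        snp_id, p_text = line.split("\t")
        pvalues[snp_id] = float(p_text)
    return pvalues


def association_scores(snp_pvalues, snp_to_gene):
    """For each gene, take its best (smallest) SNP p-value and return 1 - p."""
    best = {}
    for snp_id, p_value in snp_pvalues.items():
        gene = snp_to_gene.get(snp_id)
        if gene is None:
            continue  # SNP does not map to any gene in the table
        if gene not in best or p_value < best[gene]:
            best[gene] = p_value
    return {gene: 1.0 - p_value for gene, p_value in best.items()}


# Placeholder SNP IDs and a hypothetical SNP-to-gene mapping.
file_lines = ["rsAAA\t0.25", "rsBBB\t0.5", "rsCCC\t0.5"]
snp_to_gene = {"rsAAA": "GENE_1", "rsBBB": "GENE_1", "rsCCC": "GENE_2"}

scores = association_scores(parse_association_lines(file_lines), snp_to_gene)
print(scores)
```

Genes with no tested SNPs simply do not appear in the result, which is the source of the unfairness the slide mentions.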

Association scoring Tested 10 genes Gene 9 has a best p-value of 0.8 (bad) Gene X was not tested Should Gene 9 get a higher overall score than Gene X?

p-value threshold User defines a p-value threshold Let’s say it’s 0.1. Any SNPs with p-values above 0.1 are not considered. Now Gene 9 and Gene X have the same score (0).
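The effect of the threshold on the Gene 9 and Gene X case can be sketched like this. The function name and the use of `None` for an untested gene are assumptions for illustration; the threshold of 0.1 and the p-value of 0.8 come from the slides.

```python
def thresholded_association_score(best_pvalue, threshold):
    """Return 1 - p for a gene, or 0 if it was untested or above the threshold.

    `best_pvalue` is the gene's smallest SNP p-value, or None if no SNP
    in the gene was tested.
    """
    if best_pvalue is None or best_pvalue > threshold:
        return 0.0
    return 1.0 - best_pvalue


threshold = 0.1
gene_9 = thresholded_association_score(0.8, threshold)    # tested, poor p-value
gene_x = thresholded_association_score(None, threshold)   # not tested
strong = thresholded_association_score(0.05, threshold)   # hypothetical tested gene

print(gene_9, gene_x, strong)
```

Without the threshold, Gene 9 would score 0.2 and outrank the untested Gene X despite showing no evidence of association.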

Example 6 Age-related Eye Disease Study Macular degeneration

Example 6 Make custom association file rs rs rs Run CANDID with this association file

Results rs rs rs } CFH } SLC25A46

So just how well does this work anyway?

Preliminary evidence Online Mendelian Inheritance in Man 154 diseases linked to chromosome 1 Literature, domains - chose keywords Conservation Expression - chose tissue codes

Ideal weights Tested all combinations of weights in those 4 categories –Possible weights: (0, 0.1, …, 0.9, 1) Which weight combination was the best, across all 154 diseases?
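The exhaustive search described here can be sketched as a grid over the four categories, with eleven possible weights each. The evaluation function below is a toy stand-in: in the real analysis it would be the average rank of the known disease gene across the 154 OMIM diseases, which this sketch does not have.

```python
from itertools import product

CATEGORIES = ("literature", "domains", "conservation", "expression")
WEIGHT_VALUES = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1


def weight_grid():
    """Yield every combination of weights as a {category: weight} dict."""
    for combo in product(WEIGHT_VALUES, repeat=len(CATEGORIES)):
        yield dict(zip(CATEGORIES, combo))


def best_weights(average_rank):
    """Return the weight combination with the lowest average rank.

    `average_rank` is any callable mapping a weights dict to a number,
    where lower means the true disease genes were ranked closer to the top.
    """
    return min(weight_grid(), key=average_rank)


def toy_average_rank(weights):
    """Invented stand-in for the real evaluation; not based on CANDID output."""
    return (500
            - 400 * weights["literature"]
            + 50 * weights["domains"]
            + 30 * weights["conservation"]
            + 20 * weights["expression"])


print(sum(1 for _ in weight_grid()))  # number of combinations tested
print(best_weights(toy_average_rank))
```

Eleven weights in four categories give 11^4 = 14,641 combinations, which is small enough to evaluate exhaustively.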

Top 10 weight combinations 1. Literature = 1, everything else = 0 2. Literature = 0.9, everything else = 0 3. Literature = 0.8, everything else = 0 4. Literature = 0.7, everything else = 0 5. … 10. Literature = 0.1, everything else = 0 11. Literature = 1, domains = 0.1

More specifics Literature only: average ranking = 425 –1 − 425/38697 ≈ 0.989, i.e. the 98.9th percentile –44/154 genes ranked #1 for at least one set of weights Chromosome 1: average ranking = 22 –1 − 22/2280 ≈ 0.990, i.e. the 99th percentile –84/154 genes ranked #1 for at least one set of weights
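The percentile figures on this slide follow from the average rank and the size of the list being ranked. A minimal check, with a hypothetical function name:

```python
def rank_percentile(rank, total):
    """Percentile of a rank within a list of `total` items (rank 1 is best)."""
    return 100.0 * (1.0 - rank / total)


genome_wide = round(rank_percentile(425, 38697), 1)
chromosome_1 = round(rank_percentile(22, 2280), 1)
print(genome_wide, chromosome_1)
```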

Analysis of results They make a lot of sense. Genes in OMIM are, by definition, well-characterized. Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

Next steps Separate OMIM analysis into simple and complex traits –Get new ideal weights See how well these ideal weights do in ranking candidates from chromosome 2.

Next steps CANDID’s databases were last compiled in November Find publications that have come out since then. How well does CANDID do in ranking those genes?

Next steps Many new whole-genome studies and microarray studies implicate lists of candidates. If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?
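The slides leave the choice of significance test open. One common option, offered here only as an assumption, is a hypergeometric test: if a study's gene list were drawn at random from the genome, how likely is an overlap at least this large with CANDID's top genes? The list sizes and overlap in the example are invented; the genome size of 38,697 is the denominator used in the ranking slide.

```python
from math import comb


def overlap_pvalue(n_genes, n_candid, n_study, n_overlap):
    """Hypergeometric upper tail: P(overlap >= n_overlap) under random draws.

    n_genes   total genes that could be ranked
    n_candid  size of CANDID's top list
    n_study   size of the study's candidate list
    n_overlap number of genes the two lists share
    """
    upper = min(n_candid, n_study)
    total = comb(n_genes, n_study)
    tail = sum(
        comb(n_candid, k) * comb(n_genes - n_candid, n_study - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical comparison: two top-100 lists sharing 5 genes.
p_value = overlap_pvalue(38697, 100, 100, 5)
print(p_value)
```

A small p-value would indicate that the two lists share more genes than chance alone would explain.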

Next steps Any other suggestions? Any interesting data you have?

Any questions?

Acknowledgments Mike Province Howard McLeod Aldi Kraja Ingrid Borecki Qunyuan Zhang Ryan Christensen John Martin