Whole Genome Polymorphism Analysis of Regulatory Elements in Breast Cancer
  AAGTCGGTGATGATTGGGACTGCTCT[C/T]AACACAAGCGAGATGAAGAAACTGA
  Jacob Biesinger
  Dr. Garry Larson
  City of Hope
Topics Covered Today
  Molecular Cause of Genetic Disease
  Cancer and Gene Regulation
  Combining Data: Bioinformatics
  Progress So Far
Single Nucleotide Polymorphisms and Genetic Disease
  SNPs in coding regions: Sickle Cell Anemia
    Glu-Pro-Phe-Ser-Thr-STOP versus Val-Pro-Phe-Ser-Thr-STOP
    [Figure: example DNA sequences ATGCCGGCTTACCATATCTACCTAAATCCGGTA and ATGCCGGCTTACCATAAT; source: Port/files/SICKLE CELL WEBSITE/whatissickle.htm]
  Genetic disease may also be caused by differential expression of vital proteins
    Promoter binding mechanism
    Micro RNA binding mechanism: chunky sheep from miRNA binding site destruction, Nature Rev. Genet. 5, 202–212 (2004)
    [Figure: gene diagram showing the promoter, protein coding region and untranslated region]
Breast Cancer Expression
  Tumor expression patterns diverge sharply from those of normal cells
  Could SNPs in regulatory regions of genes associated with breast cancer explain their overexpression in tumors?
  [Figure panels: Normal Breast Expression; Breast Tumor Expression]
Statistical Search for Dysregulated Genes
  Expression patterns in cancers give two categories: Estrogen Receptor positive (ER+) and Estrogen Receptor negative (ER-)
  A recent meta-analysis pooled tumor expression data from 9 studies and >15,000 genes
  Selection criteria: normalized expression difference between ER+ and ER-; consistency across studies
  Top 1% ER+ > ER-: 150 genes
  Top 1% ER+ < ER-: 150 genes
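The top and bottom 1% selection can be sketched as a simple ranking step. This is a minimal illustration, not the meta-analysis's actual statistic: the gene names, the random scores, and the helper `select_extreme_genes` are all hypothetical, and the consistency-across-studies criterion is not modeled here.

```python
import random


def select_extreme_genes(scores, fraction=0.01):
    """Return (up, down): the highest- and lowest-scoring `fraction` of genes.

    `scores` maps gene name -> normalized ER+ minus ER- expression difference
    (a hypothetical summary score, assumed to be precomputed).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


# Hypothetical stand-in data: 15,000 genes with random scores.
rng = random.Random(0)
scores = {f"GENE{i}": rng.gauss(0.0, 1.0) for i in range(15000)}

up, down = select_extreme_genes(scores)
print(len(up), "genes with ER+ > ER-;", len(down), "genes with ER+ < ER-")
```

With 15,000 genes, a 1% cutoff yields 150 genes in each direction, which matches the gene counts on the slide.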
Regulation Motifs
  Which TF binding sites exist in our selected genes?
  A recent study identified motifs conserved in regulatory regions across 4 organisms
  Promoter motifs: 123 known motifs; 174 phylogenetically conserved
  Downstream motifs: 273 conserved 3’ UTR; 343 conserved miRNA 6mer; 368 conserved miRNA 7mer
  [Figure: example gene, lymphocyte transmembrane adaptor 1]
Motif Search
  Use Python and the UCSC Genome Browser to:
    Get promoter region DNA: 2kb upstream of the transcription start site (TSS) plus at most 2kb downstream of the TSS, limited by the translation start
    Get 3’ untranslated region RNA
    Search for motifs on the + and – strands
  Results for Top 1% up and down (hit counts not preserved): 3’ UTR hits, miRNA 6mer and 7mer hits, known motif hits, phylo motif hits
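The promoter window and the two-strand search described above can be sketched as follows. This is an illustration under stated assumptions: coordinates are 0-based and half-open, motifs are matched as exact strings (real motif sets are often degenerate consensus patterns), and the function names `promoter_window`, `reverse_complement` and `find_motif_hits` are hypothetical helpers rather than the code used in the study.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_window(tss, translation_start, strand, upstream=2000, downstream=2000):
    """Return (start, end) genomic bounds of the promoter region.

    The window spans `upstream` bases before the TSS and at most `downstream`
    bases after it, cut short where translation begins.
    """
    if strand == "+":
        start = max(0, tss - upstream)
        end = min(tss + downstream, translation_start)
    else:
        # On the minus strand, "upstream" means higher genomic coordinates.
        start = max(0, tss - downstream, translation_start)
        end = tss + upstream
    return start, end


def find_motif_hits(sequence, motif):
    """Return sorted (offset, strand) pairs for exact motif matches on both strands."""
    hits = []
    rc = reverse_complement(motif)
    for strand, pattern in (("+", motif), ("-", rc)):
        if strand == "-" and rc == motif:
            continue  # palindromic motif: avoid reporting each site twice
        pos = sequence.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = sequence.find(pattern, pos + 1)  # allow overlapping matches
    return sorted(hits)


# Toy example: one minus-strand hit and one plus-strand hit.
print(find_motif_hits("AATGACCTTGGTCAAA", "GGTCA"))
print(promoter_window(10000, 10500, "+"))
```

Searching the forward sequence for the motif's reverse complement avoids building a second copy of each promoter, which matters when scanning several hundred genes against several hundred motifs.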
SNP Databases
  SNP information comes from two databases:
    HapMap: four groups (270 people in total) genotyped for the same SNPs
    CGEMS: breast cancer association study, complete with p-values; a late-comer to our study (June 2007)
  SNP counts: HapMap ~4 million; CGEMS ~550k
Mapping SNPs
  Inputs: HapMap (~4 million SNPs) and CGEMS (~550k SNPs); motif matches in gene promoters and 3’ UTRs
  Use MSSQL 2003 and Python (pymssql) to perform a join of dbSNP, HapMap and CGEMS SNPs with regulatory motifs
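The join of SNPs with motif matches is essentially a positional (interval) join: a SNP belongs to a motif hit when both lie on the same chromosome and the SNP position falls inside the hit. The sketch below uses Python's built-in `sqlite3` as a stand-in for the MSSQL and pymssql setup named on the slide; the table names, column names and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE snps (snp_id TEXT, chrom TEXT, pos INTEGER, source TEXT);
    CREATE TABLE motif_hits (
        gene TEXT, motif TEXT, chrom TEXT,
        motif_start INTEGER, motif_end INTEGER  -- half-open interval
    );
    """
)

# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO snps VALUES (?, ?, ?, ?)",
    [
        ("rs_demo1", "chr1", 105, "HapMap"),
        ("rs_demo2", "chr1", 300, "CGEMS"),
        ("rs_demo3", "chr2", 105, "HapMap"),
    ],
)
conn.executemany(
    "INSERT INTO motif_hits VALUES (?, ?, ?, ?, ?)",
    [
        ("GENE_A", "motif_1", "chr1", 100, 110),
        ("GENE_B", "motif_2", "chr2", 200, 210),
    ],
)


def snps_in_motifs(connection):
    """Return (gene, motif, snp_id, source) for every SNP inside a motif hit."""
    query = """
        SELECT m.gene, m.motif, s.snp_id, s.source
        FROM motif_hits AS m
        JOIN snps AS s
          ON s.chrom = m.chrom
         AND s.pos >= m.motif_start
         AND s.pos < m.motif_end
        ORDER BY m.gene, s.snp_id
    """
    return connection.execute(query).fetchall()


print(snps_in_motifs(conn))
```

At the scale on the slide (millions of SNPs), an index on chromosome and position would likely be needed for the join to finish in reasonable time; that tuning is omitted here.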
Verify Motif Significance
  How do we know that these motifs are significant?
  Hypothesis: Due to negative selection, there will be fewer SNPs in motifs than in random areas within the same region.
  Method: Compare the number of motifs that contain at least one SNP with the number of random sequences (100 drawn from the same region) that contain at least one SNP.
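The random-sequence control can be sketched as below. The assumptions are mine: random windows have the same length as the motif, are drawn uniformly from the same region, and the region is longer than the motif. The names `has_snp` and `count_random_windows_with_snp` are hypothetical.

```python
import bisect
import random


def has_snp(sorted_snp_positions, start, length):
    """True if any SNP position falls in the half-open window [start, start + length)."""
    i = bisect.bisect_left(sorted_snp_positions, start)
    return i < len(sorted_snp_positions) and sorted_snp_positions[i] < start + length


def count_random_windows_with_snp(snp_positions, region_start, region_end,
                                  length, n_random=100, rng=None):
    """Count how many of `n_random` random windows in the region contain a SNP."""
    rng = rng or random.Random()
    snps = sorted(snp_positions)
    hits = 0
    for _ in range(n_random):
        # Choose a start so the whole window stays inside the region.
        start = rng.randrange(region_start, region_end - length + 1)
        hits += has_snp(snps, start, length)
    return hits


# Hypothetical region of 4,000 bases with three SNPs and an 8-base motif.
snp_positions = [120, 1875, 3050]
n_hits = count_random_windows_with_snp(snp_positions, 0, 4000, length=8,
                                       rng=random.Random(1))
print(n_hits, "of 100 random windows contain a SNP")
```

Summing these counts over all motif hits gives the "Random" row to set against the "Actual" count of motifs that contain a SNP.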
Motif Counting Results
  Known Top 1%: table with rows Actual, Random, Total and columns Motif with SNP, Motif without SNP, Total, 1-Sided P-Value (cell values not preserved)
  Phylo Top 1%: table with rows Actual, Random, Total and a 1-Sided P-Value (cell values not preserved)
  3’ UTR results not yet available
  There is a significant difference between motifs and random sequences.
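A one-sided p-value for a 2x2 table of this shape (actual versus random, with versus without a SNP) can be computed with Fisher's exact test. The slides do not say which test was used, so this is one plausible choice rather than the study's method, and the counts below are invented because the original cell values were not preserved.

```python
from math import comb


def fisher_one_sided_less(motif_with, motif_without, random_with, random_without):
    """One-sided Fisher's exact test.

    Returns the probability, under the hypergeometric null, of seeing
    `motif_with` or fewer SNP-containing motifs given the table margins.
    A small value supports "motifs contain SNPs less often than random
    sequences".
    """
    total = motif_with + motif_without + random_with + random_without
    with_snp = motif_with + random_with
    n_motifs = motif_with + motif_without
    # Smallest feasible count of SNP-containing motifs given the margins.
    k_min = max(0, n_motifs - (total - with_snp))
    numerator = sum(
        comb(with_snp, k) * comb(total - with_snp, n_motifs - k)
        for k in range(k_min, motif_with + 1)
    )
    return numerator / comb(total, n_motifs)


# Invented counts for illustration only.
p_value = fisher_one_sided_less(motif_with=40, motif_without=960,
                                random_with=700, random_without=9300)
print(f"one-sided p = {p_value:.3g}")
```

Exact integer arithmetic via `math.comb` avoids the floating-point underflow that can occur when the individual table probabilities are very small.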
CGEMS Results
  A number of SNPs that fall within motifs are associated with breast cancer
  The highest-ranking of these SNPs was 1,514th out of 550,000
  Further analysis is required to determine whether this is significant
Thanks!
  SoCalBSI mentors
  City of Hope
  Dr. Garry Larson
  Dr. David Smith
  Dr. Pål Sætrom
  Cathryn Lundberg
  All the SoCalBSI students!
  Funded by: