Understanding genes using mathematical tools Adam Sartiel COMPUGEN
2 Short History Of Compugen 1993: Founded 1994: First Bioccelerator sold (Merck) 1997: LEADS project initiated 1998: Pfizer collaboration 1999: USPTO agreement; LabOnWeb launched 2000: Launch of Z3; IPO 2001: Gencarta and OligoLibraries launched; Novartis collaboration
3 Unique R&D Team Substantial –120 professionals – 32 PhD/MD, 37 M.Sc. Multidisciplinary –Algorithm development, Molecular biology, Software engineering, Statistics, Physics, Chemistry Integrated –Synergy between disciplines and feedback
4 Gene analysis using mathematics Drug discovery and Bioinformatics Principles of sequence alignment The EST opportunity and the Transcriptome Applications (Gencarta and DNA chips)
5 Cellular pathways are highly complex - identified targets
6 $500M The Drug Development Process
7 Some definitions ‘Drug’ – A protein, lipid, antibody, or small organic molecule with a proven effect and an approved safety level. ‘Lead’ – A molecule in development which may one day become a drug. ‘Target’ – A protein (in most cases) whose activity a drug lead would affect in order to create a desirable effect on the body. ‘Validated target’ – A target which has a proven, demonstrated effect on a disease or condition.
8 30,000 GENES? Fewer genes than initially thought? Some complexity due to alternative splicing Gene prediction is problematic Complex genes (interleaved, nested,...) are especially difficult to identify Both HGP and Celera tried to minimize false positives Conclusion: more genes may be found Wright et al., Genome Biology (7): There are 65,000 – 75,000 genes
9 ONE GENE ONE PROTEIN??? Old dogma: Gene → mRNA → Protein. Current understanding: Gene → several mRNAs (including edited mRNA) → several proteins (including modified proteins).
10 Gene identification using sequence comparison
11 Similar sequences → common ancestor. Common ancestor → similar function. Understand genes = know your targets
12 The genetic code is redundant
13 Proteins ‘see’ deeper Unrelated DNA sequences? Highly related proteins! TTACTCCGTCATGATGGGGTG → LeuLeuArgHisAspGlyVal; CTGATAAGGAAAGAAGGCTAT → LeuIleArgLysGluGlyTyr
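The comparison on this slide can be reproduced with a short sketch. It translates the two DNA strings with the standard genetic code and counts matching positions at the DNA and protein level. The similarity groups (`SIMILAR_GROUPS`) and the helper names are illustrative assumptions, not part of the original talk, and any U in the input is read as T.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumption: a coarse grouping of chemically similar amino acids,
# used here only to illustrate "related" residues.
SIMILAR_GROUPS = ["LIVM", "KRH", "DE", "FYW", "ST", "NQ"]


def translate(dna):
    """Translate a DNA string (frame 0) into one-letter amino acids."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical(x, y):
    return x == y


def similar(x, y):
    return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)


def count_matching(a, b, same):
    """Count aligned positions of a and b for which same(x, y) holds."""
    return sum(1 for x, y in zip(a, b) if same(x, y))


dna1 = "TTACTCCGTCATGATGGGGTG"
dna2 = "CTGATAAGGAAAGAAGGCTAT"
prot1, prot2 = translate(dna1), translate(dna2)
print(prot1, prot2)
print("identical DNA positions:", count_matching(dna1, dna2, identical), "of", len(dna1))
print("similar protein positions:", count_matching(prot1, prot2, similar), "of", len(prot1))
```

Under this grouping most protein positions count as similar even though fewer than half of the DNA positions are identical, which is the point of the slide.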
14 How to align proteins? MARQGEFPSILK / M-RHGEFP-LLKWC (‘good’ vs. ‘bad’ alignments) Running a good alignment algorithm against entire databases requires supercomputers
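As an illustration of what "a good algorithm" can mean here, the following is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch). The talk does not say which algorithm Compugen used, so this is a stand-in. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and real protein alignment would use a substitution matrix instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("MARQGEFPSILK", "MRHGEFPLLKWC")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of positions, so the cost of a single comparison grows with the product of the two sequence lengths. Repeating that against every entry in a large database is what makes the computation expensive.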
15 Another direction: find genes by sequence ACGATCGAGCATGCATCATCAGCATCTAGCGATCAGCAGGCATCGAGCAGCTAGCATGCATG TGCTAGCACGTACGTAGTAGTCGTAGATCGCTAGTCGTCCGTAGCTCGTCGATCGTACGTCAC - Gene regions have a different nucleotide composition from non-coding regions. - Introns and exons differ in sequence. - Splice junctions are clearly detectable.
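The first point, that composition differs between region types, can be illustrated with a sliding-window GC fraction. This is only a toy sketch: the window and step sizes are arbitrary assumptions, and practical gene finders rely on much richer statistical models than a single composition measure.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def gc_windows(seq, window=20, step=10):
    """GC fraction of each full-length window, moving along seq by `step`."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


# The first sequence shown on the slide, scanned in overlapping windows.
slide_seq = "ACGATCGAGCATGCATCATCAGCATCTAGCGATCAGCAGGCATCGAGCAGCTAGCATGCATG"
for start, value in zip(range(0, len(slide_seq), 10), gc_windows(slide_seq)):
    print(start, round(value, 2))
```

A region whose windows deviate consistently from the background composition would be a candidate for closer inspection, for example by looking for splice signals at its edges.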
16 One step ahead: the story of the ESTs [Figure: genomic DNA (exon 1, exon 2, exon 3), mRNA, cDNA, cDNA clone, EST] Public-domain ESTs (Expressed Sequence Tags): > 5,000,000 Craig Venter
17 The ESTs: Rough Diamonds? Short, inaccurate, badly annotated Abundant with repeats, alternative splicing Too many… The shredder effect
18 USING ESTS TO GET THE TRANSCRIPTOME Input: GenBank, a pool of ESTs and mRNAs. Process 1: clustering. Process 2: assembly. Output: the transcriptome. [Figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4]
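The clustering step can be sketched as grouping sequences that share exact substrings. The toy version below is not the LEADS method, whose details the talk does not give: it links two sequences when they share at least one k-mer and merges the links with union-find. The function names, `k`, and `min_shared` are assumptions, and the sketch ignores reverse-complement reads, sequencing errors, and repeats, which real EST clustering has to handle.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8, min_shared=1):
    """Group sequence indices into clusters linked by shared k-mers."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kmer_sets = [kmers(s, k) for s in seqs]
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


reads = ["ACGTACGTAAGG", "GTAAGGCCTTAA", "TTTTTTTTTTTT"]
print(cluster_by_shared_kmers(reads, k=6))
```

The all-pairs comparison here is quadratic in the number of sequences, so it would not scale to millions of ESTs. An index from k-mer to sequence would be the usual next step.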
19 The Transcriptome - Definition “The mRNA collection content, present at any given moment in a cell or a tissue, and its behavior over time and cell states”
20 Introducing the Transcriptome The Genome: – Index to the range of possible proteins – Useful as a map and for inter-organism analysis The Proteome: – Describes what actually happens in the cell – Complex tools, partial results The Transcriptome: – “Golden path”: proteome-level information obtained with DNA technology.
21 Transcriptome applications Discovery of new proteins –Which are present in specific tissues –Which have specific cell locations –Which respond to specific cell states Discovery of new variants –Of important genes –Which work to increase/decrease the activity of the ‘native’ protein.
22 Example: Alternative Splicing One Gene - Multiple mRNAs [Figure: pre-mRNA undergoes alternative splicing into various mature mRNA transcripts; exons 3 and 4 shown for tissue A, tissue B, and other tissues]
23 Alternative Splicing vs. “Contiging” “Contiging”: “Assembling”: Contig impossible
24 Extreme example of alternative splicing [Figure labels: Genomic, PSA RNA, PSA precursor, Mature PSA, Alternative splicing, Modified mRNA, LM precursor, Mature LM protein, Signal peptide, Stop codon] Though coded by the same gene, the mature proteins PSA and LM have not one residue in common!
25 Is This The Only Example? [Figure: exon maps of PSA genomic and KLK-2 genomic, with LM and KLM; * = stop codon]
26 Validation: Northern Blot Like PSA, LM expression is restricted to prostate tissue Multiple bands may reflect conserved regions or alternative splicing
27 Example: receptor with DN Dominant Negative
28 Natural Antisense – a regulation mechanism?
29 LEADS Antisense Prediction When analyzing EST data for Antisense: –Use original EST orientation annotation –Check splicing signals on both strands –Examine library description for enzymes used –Mark PolyA signals and PolyA tails (compare to genomic PolyA) –Take into account NotI sites
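The "check splicing signals on both strands" step can be sketched as follows. Given a genomic sequence and the coordinates of a gap implied by an EST alignment, the code asks on which strand the gap looks like a canonical GT...AG intron. Function names and the coordinate convention (0-based, end-exclusive) are assumptions for illustration, and only the canonical GT/AG signal is considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_strand(genomic, start, end):
    """Return '+', '-', or None for the putative intron genomic[start:end].

    '+' means the span reads GT...AG on the given strand, '-' means it reads
    GT...AG on the opposite strand, None means neither.
    """
    intron = genomic[start:end].upper()
    if len(intron) < 4:
        return None
    if intron.startswith("GT") and intron.endswith("AG"):
        return "+"
    flipped = reverse_complement(intron)
    if flipped.startswith("GT") and flipped.endswith("AG"):
        return "-"
    return None


print(intron_strand("AAAGTCCCCAGTTT", 3, 11))  # span is GTCCCCAG
print(intron_strand("AAACTGGGGACTTT", 3, 11))  # span is CTGGGGAC
```

An EST whose annotated orientation disagrees with the strand suggested by its splice signals is one possible indicator of a mis-annotated read rather than a true antisense transcript, which is why the slide lists several independent checks.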
30 Example: A Putative SNP Cluster T07189 Position 347
31 Cluster T07189 Position 347 SNP Verification
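A putative SNP in a cluster is, in essence, an alignment column where two different bases are each supported by several independent sequences. The sketch below applies that idea to one column. The threshold `min_minor_count` and the function name are assumptions, and the example columns are made up rather than taken from cluster T07189.

```python
from collections import Counter


def putative_snp(column, min_minor_count=2):
    """Return (major, minor) bases if the column looks polymorphic, else None.

    `column` holds the base each aligned sequence shows at one position;
    gaps and ambiguous characters are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    if len(counts) < 2:
        return None
    (major, _), (minor, minor_count) = counts.most_common(2)
    if minor_count < min_minor_count:
        # A single disagreeing read is more likely a sequencing error.
        return None
    return major, minor


print(putative_snp("AAAAAGGG"))
print(putative_snp("AAAAAAAG"))
```

Requiring more than one read for the minor base is a crude guard against the sequencing errors that are common in ESTs. Experimental verification, as on this slide, is still needed.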
32 Using Compugen’s Transcriptome Technology Large-scale collaborations: Pfizer, Novartis Co-development of molecules: TNF, Chemokine receptors, kinases, GPCRs Academic research: UCSF, NYU, TAU. Database products DNA chip design Mass-spec analysis Gene Ontology
33 Chip Design on Alternative Splicing Variant-specific or common probes can be designed
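The distinction between variant-specific and common probes can be sketched by treating each splice variant as a set of exons. Exons present in every variant are candidates for common probes, and exons found in only one variant are candidates for variant-specific probes. The variant names and exon numbers below are hypothetical.

```python
def classify_exons(variants):
    """Split exons into those common to all variants and those unique to one.

    `variants` maps a variant name to an iterable of exon identifiers.
    Returns (common_exons, {variant_name: exons_unique_to_that_variant}).
    """
    exon_sets = {name: set(exons) for name, exons in variants.items()}
    common = set.intersection(*exon_sets.values()) if exon_sets else set()
    specific = {}
    for name, exons in exon_sets.items():
        others = set().union(*(e for n, e in exon_sets.items() if n != name))
        specific[name] = exons - others
    return common, specific


# Hypothetical gene with two tissue-specific variants.
variants = {"tissue_A": [1, 2, 3, 5], "tissue_B": [1, 2, 4, 5]}
common, specific = classify_exons(variants)
print(common)
print(specific)
```

A variant that differs only by skipping an exon has no unique exon of its own. In that case a probe spanning the new exon-exon junction would be needed, which this sketch does not cover.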
34 How many ‘genes’ are there really? (H = human, M = mouse, R = rat) Raw data: – 3,770,969 human sequences – 2,061,357 mouse sequences – 297,568 rat sequences Non-singleton ‘clusters’: 120,372 (H), 63,043 (M), 33,396 (R) % with splice variants: 26% (H), 32% (M), 23% (R) Homology (to SwissProt+TrEMBL, InterPro, other GC proteins): 20% (H+M), 27% (R) Total unique proteins: 236,797 (H), 106,119 (M), 32,352 (R)
35 The Novartis Agreement Signed August 2001 Novartis non-exclusively licensed the LEADS platform and related software, and plans to use it for: – In-silico drug target identification and prioritization – Genome-wide chip design Agreement was signed after a detailed pilot study run in November 2000 – Discovered novel genes and splice variants using Incyte and Celera data Genes were subsequently verified in the Novartis laboratory.
36 GENCARTA Result of LEADS applied to: – Public genome information – Published mRNA – ESTs Interface designed in-house, Oracle-based infrastructure. Installed: Kyowa-Hakko, Avalon Pharma, Weizmann Institute, YU Version 2.2 out in October 2001.
37 Let’s go for the real thing… Gencarta Demonstration OligoLibrary Demonstration
38 Conclusion: Advantages of the Transcriptome Identify new drug targets Understand splice variant behavior Isolate “natural” drugs Annotate Proteomics experiments Design better DNA chips Solve the real bottlenecks in drug discovery and development