1 A new theory of gene regulation based on relationships of DNA sequences flanking genes Richard J. Feldmann Global Determinants, Inc. Derwood, Maryland
2 The intellectual property presented in this talk/document is protected by US and PCT Patent Applications dated May 30, 2001
3 Finding the right question to ask is the hard part Answering the question is just a matter of hard work.
4 Have you ever wondered how gene expression is controlled? The TATA box of a gene is 5’ of the start coding Small dimeric proteins bind in and near this area The polymerase assembles around these proteins Enhancer and/or repressor distal to this area can loop back
5 Have you ever wondered how cellular differentiation and development are accomplished? How is gene expression controlled so cells within a tissue are relatively the same? How in a 1,000-cell creature like C. elegans can all the cells have different functions? How is cellular development orchestrated?
6 Simplified Gene Model [Diagram: a gene drawn on the + and - strands, with the beginning of translation and the end of translation marked]
7 Specificity Region The palindromic specificity area around the TATA box is only 6 to 8 bases in length 4^8 = 65,536 is a relatively small number Not every combination can be used My sense is that the enhancer/repressor elements only modulate the level of expression
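The size of this sequence space can be checked with a few lines of Python. This is only a worked calculation, not part of the original talk. The second function assumes that "palindromic" means a reverse-complement palindrome of even length, in which case the first half of the site fixes the second half; that reading is an assumption.

```python
def word_count(length: int) -> int:
    """Number of distinct DNA words of a given length over A, C, G, T."""
    return 4 ** length


def revcomp_palindrome_count(length: int) -> int:
    """Number of reverse-complement palindromes of an even length.

    Assumption: the first half of the word determines the second half,
    so only length // 2 positions are free.
    """
    if length % 2:
        raise ValueError("reverse-complement palindromes have even length")
    return 4 ** (length // 2)


# Sequence space for the 6- to 8-base specificity region.
all_words = {n: word_count(n) for n in (6, 7, 8)}
palindromes = {n: revcomp_palindrome_count(n) for n in (6, 8)}
```

Under that assumption the usable space is far smaller than 4^8, which fits the remark that not every combination can be used.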
8 Promoter Action
9 Range of Gene Numbers Bacteria have 1,000 to 2,500 genes S. cerevisiae has 6,000 genes C. elegans has 19,000 genes A. thaliana has 25,000 genes H. sapiens has 40,000 genes
10 How many genes are exposed for promotion at a given time? If the whole complement of genes is exposed then quantitative regulatory elements have the whole burden of deciding whether a gene is to be expressed or not
11 Is there a binary mechanism that could sequestrate genes from promotion? The promoter regions of sequestrated genes would be hidden from the dimeric initiation proteins The quantitative regulatory elements would have to deal only with the exposed set of genes
12 Six Levels of DNA Structure [Figure: Level 1 through Level 6]
13 30 nm Chromatin Structure
14 Are the level-4 loops random or specific in length? Is there a sequence specificity to the lengths of these loops? Could a zinc-finger DNA Binding Protein (DBP) be used to make the loops be specific in length? Could RNA be used to latch the loops shut?
15 There are sequence-specific loops! A simple Fortran program run on yeast showed there are specific sequences on the left and right sides of the level-4 loops In bacteria, S. cerevisiae and C. elegans there are not enough DBPs to be able to make a whole-genome mechanism There are two sequence elements that could be expressed as RNA
17 Connectron A left flanking sequence element (T1) of at least 15-bases in length A right flanking sequence element (T2) of at least 15-bases in length A pair of sequence elements (C1 and C2) of at least 15-bases in length in the 3’UTR of some gene
18 Sequence Properties of Connectrons T1 and T2 have a separation of 0.5kb to 100kb C1=T1 and C2=T2 The separation of C1 from C2 is less than 100-bases The separation of C1/C2 from the end of the gene is less than 1,000-bases
19 What constraints are placed on the sequences Only that C1=T1 and C2=T2 Otherwise any tetrad of non-trivial sequences of at least 15-bases can be used
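The tetrad relationship on the last three slides can be sketched as a brute-force search. This is a minimal illustration under stated assumptions, not the patented determination algorithm or the Fortran program mentioned earlier. It looks only at exact matches of exactly 15 bases on one strand, takes a single hypothetical `gene_end` coordinate, requires both C1 and C2 to start within 1,000 bases of it, and does not filter out trivial (low-complexity) sequences. All function and constant names are invented for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

K = 15               # minimum element length (from the slides)
MAX_C_GAP = 100      # C1-to-C2 separation must be below this
MAX_UTR = 1_000      # C1/C2 must start within this distance of the gene end
MIN_T_SEP = 500      # T1-to-T2 separation, lower bound (0.5 kb)
MAX_T_SEP = 100_000  # T1-to-T2 separation, upper bound (100 kb)


def kmer_index(genome: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer to the list of its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_tetrads(genome: str, gene_end: int, k: int = K) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (c1, c2, t1, t2) start positions with C1 == T1 and C2 == T2."""
    index = kmer_index(genome, k)
    utr_stop = min(gene_end + MAX_UTR, len(genome) - k + 1)
    for c1 in range(gene_end, utr_stop):
        t1_sites = index[genome[c1:c1 + k]]
        # C2 starts after C1 ends, with a gap of fewer than MAX_C_GAP bases.
        for c2 in range(c1 + k, min(c1 + k + MAX_C_GAP, utr_stop)):
            for t1 in t1_sites:
                for t2 in index[genome[c2:c2 + k]]:
                    # Skip copies that sit inside the source window itself.
                    in_utr = gene_end <= t1 < utr_stop or gene_end <= t2 < utr_stop
                    separation = t2 - (t1 + k)
                    if not in_utr and MIN_T_SEP <= separation <= MAX_T_SEP:
                        yield c1, c2, t1, t2


# Toy genome (made up): random background with one planted tetrad.
random.seed(0)


def background(n: int) -> str:
    return "".join(random.choice("ACGT") for _ in range(n))


C1 = "ACGTTGCAAGGCTTA"
C2 = "TTGACCGGATACGAT"
toy_genome = (background(200) + C1 + background(20) + C2
              + background(1500) + C1 + background(800) + C2 + background(200))
hits = list(find_tetrads(toy_genome, gene_end=150))
```

In the toy genome the planted C1/C2 pair starts at positions 200 and 235, and the matching T1/T2 pair at 1750 and 2565, so that tetrad appears in `hits`. A real search would also have to extend matches beyond 15 bases and handle both strands.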
20 Connectron Convergence and Divergence Connectrons form many-to-one relationships Connectrons form one-to-many relationships
21 Transient Connectrons Gene “A” causes some connectron “B” Some other gene “C” causes a connectron “D” that turns off gene “A” When gene “C” expresses, connectron “B” eventually expires
22 Permanent Connectrons Gene “A” causes some connectron “B” but no other connectron ever turns off gene “A”
23 Hierarchy of Connectrons Gene “A” causes connectron “B” Gene “C” causes connectron “D”
24 Hierarchy of Connectrons Gene “E” causes connectrons “F” and “G” Connectron “F” turns off gene “A” which eventually causes connectron “B” to disappear Connectron “G” turns off gene “C” which eventually causes connectron “D” to disappear
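The hierarchy on these two slides can be written down as two small lookup tables and iterated until nothing changes. This is a sketch only: lifetimes are ignored, the update rule is an assumption, and the slides do not say which genes connectrons “B” and “D” control, so those entries are left empty.

```python
from typing import Dict, List, Set, Tuple

# Which connectrons each gene's 3'UTR gives rise to (from the slides).
CAUSES: Dict[str, List[str]] = {"A": ["B"], "C": ["D"], "E": ["F", "G"]}
# Which genes each connectron turns off. The targets of B and D are not
# given on the slides, so they are left empty here (an assumption).
TURNS_OFF: Dict[str, List[str]] = {"B": [], "D": [], "F": ["A"], "G": ["C"]}


def settle(genes_on: Set[str], max_rounds: int = 20) -> Tuple[Set[str], Set[str]]:
    """Iterate until the set of expressed genes stops changing."""
    promoted = set(genes_on)          # genes that would express if not turned off
    for _ in range(max_rounds):
        connectrons = {c for g in genes_on for c in CAUSES.get(g, [])}
        silenced = {g for c in connectrons for g in TURNS_OFF.get(c, [])}
        next_on = promoted - silenced
        if next_on == genes_on:
            return genes_on, connectrons
        genes_on = next_on
    raise RuntimeError("no steady state reached (the network may oscillate)")


final_genes, final_connectrons = settle({"A", "C", "E"})
```

Starting with genes A, C and E all promoted, the loop settles with only gene E expressed and only connectrons F and G present, which matches the slide: B and D disappear once F and G turn off their source genes.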
25 Alternating Layers of a Hierarchy
26 Full Gene Data for Connectron GN ycfc COG2915 GN ycfb COG0482 GN b1134 COG0494 GN ymfc COG1187 GP icda COG0538 TN GC *-* GN ymfd COG0500 | GP lit - | GN inte - | GN ymfh - | GP ymfi - | GN ymfj - | GN b1145 COG1974 | GP b | GP ymfl - | GP ymfn - | GP ymfr - | GP ycfk - | GP b | GN ycfa - | GP pin COG1961 | GP mcra COG1403 | CN GC * | TN GC *-* CN GC * GN ycgw - GN ycgx - GN ycge COG0789 GN b1163 COG2200 GP ycgz - GP ymga - GP ymgb -
27 Gene Abstraction for One-Shot Connectron Genes to be abstracted into Group0069 Final abstraction Driving C1/C2 NC Non-Controlled-Gene(s) TN *-* GG Group0069 | CNT OS-> | TN *-* CNP > NC Non-Controlled-Gene(s) Group0069 Gene_Name COG_Id Chromosome Direction Start Stop Length ymfd COG negative lit - 1 positive inte - 1 negative ymfh - 1 negative ymfi - 1 positive ymfj - 1 negative b1145 COG negative b positive ymfl - 1 positive ymfn - 1 positive ymfr - 1 positive ycfk - 1 positive b positive ycfa - 1 negative pin COG positive mcra COG positive CNT OS-> |
28 Transient Connectron Driving C1/C2 Transient Connectron Abstracted Groups
29 Verbose Description of Transient Connectron
30 Permanent Connectron Driving C1/C2 Permanent Connectron Abstracted Groups
31 Virtual Connectron - Example 1 Driving C1/C2 Virtual Connectron
32 Virtual Connectron - Example 2 Driving C1/C2 Virtual Connectron
33 Deeply Nested Connectrons
34 Geneless Connectrons There is a class of connectrons that are not associated with any gene - the so-called “geneless connectrons” or more properly “orf-less connectrons” The geneless connectrons occur in the non-genic portion of a genome. There are most probably many hierarchies of geneless connectrons for each cell type.
35 Orf-less Gene Model [Diagram: + and - strands with the beginning of translation and the end of translation marked]
36 Levels of Connectron Structure
37 SNPs Connectrons are resistant to single base mutations. The RNA forming the two Hoogsteen triple-strand helices is often longer than the minimum 15-base length Any distribution of the C1/C2 length over the minimum is usable. Mutations just make weaker X-shaped structure.
38 Loose X Structure Tight X Structure
39 Connectrons versus Genome Size The number of genes in a genome is not particularly correlated with the size of the genome. The size of the genome is linearly correlated with the number of connectrons.
40 Genome Size vs Connectron Number
41 Connectrons occur across chromosomes In a multi-chromosomal genome, C1/C2 sources on one chromosome create connectrons on the same and other chromosomes. S. cerevisiae is a wonderful example.
42 S. cerevisiae cross-chromosome connectron table
43 Duplicated Fragments Connectrons are based on the fact that there are duplicated sequences in a genome. Many fragments have only a few instances A few fragments have many instances.
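The “many fragments with few instances, few fragments with many instances” observation can be tabulated with a k-mer count. This is a minimal sketch using fixed-length exact matches; the function name is made up, and the real fragments may be longer than the 15-base minimum.

```python
from collections import Counter
from typing import Dict


def multiplicity_histogram(genome: str, k: int = 15) -> Dict[int, int]:
    """Return {number of instances: number of distinct k-mers with that many}.

    Only k-mers that occur more than once are reported, since a fragment
    must be duplicated to take part in a connectron.
    """
    instances = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    histogram = Counter(n for n in instances.values() if n > 1)
    return dict(sorted(histogram.items()))


# Toy example with k = 3 so the result can be checked by eye:
# ACG occurs 3 times, CGA and GAC twice each, CGT and GTT once.
toy_histogram = multiplicity_histogram("ACGACGACGTT", k=3)
```

On a real genome the histogram would be expected to fall off quickly as the number of instances grows, if the slide's description holds.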
45 Genes per Group Many groups of genes controlled by connectrons are only one gene. In S. cerevisiae in particular these one-gene groups are the LTR (Long Terminal Repeats) A few groups have many genes The distribution follows an exponential curve
47 Distribution of C1/C2 distance from last exon Many C1/C2 connectron sources occur immediately following the last exon In S. cerevisiae some of the C1/C2s are at extreme distances (e.g. 10 kb) from the last exon with no intervening genes
49 Distribution of C1/C2 lengths Many of the C1/C2 fragments are of the minimum length of 15-bases A few C1/C2s are very long (e.g. over 100-bases in length) The distribution follows an exponential pattern
51 Distribution of T1/T2 lengths Many of the T1/T2 fragments are of the minimum length of 15-bases A few T1/T2s are very long (e.g. over 100-bases in length) The distribution follows an exponential pattern Because of the many-to-one and the one-to-many relationships the C1/C2 distribution and the T1/T2 distribution can be different.
53 Do connectrons occur on both strands? In S. cerevisiae the positive strand is favored when the gap between the last exon and the C1/C2 is short. As this gap gets longer the positive and negative strands have equivalent numbers of connectrons
55 Clusters of Orthologous Genes The COGs as defined by David Lipman and Eugene Koonin in the NCBI specify the relationships of genes across (bacterial) genomes. Genes that are co-linear in one genome are distributed in another genome. There seems to be no conservation of flanking T1 and T2 sequences across any two (bacterial) genomes.
56 Connectrons occur across chromosomes and plasmids In single and multi-chromosome genomes connectrons occur in both directions between the chromosomes and the associated plasmids. In D. radiodurans connectrons occur between the two chromosomes and the two plasmids. In S. meliloti the chromosome is a vestigial thing with most of the connectrons originating in the associated mega-plasmid.
57 Emergent Property of a Genome Connectrons are one of the first properties to emerge as the result of whole-genome sequencing. The connectron paradigm replaces the “one-gene - one-effect” paradigm with a rich gene expression control mechanism. Connectrons can be computed (meaningfully) for any complete, stable genome.
58 Connectrons, iRNA and stRNA The 3’UTR RNA produced by the expression of a gene is used to form connectrons and interference RNAs (iRNAs). The iRNA forms Hoogsteen triple-helices around the cognate double-strand DNAs. The lifetime of these triple-helices is determined by their length. Small temporal RNAs (stRNAs) are distinguished from iRNAs only by their lifetimes. iRNAs and stRNAs block the expression of related RNAs in the 3’UTR of other genes.
59 iRNAs and stRNAs Interference RNAs (iRNAs) and Small Temporal RNAs (stRNAs) are now included in connectron determinations and calculations. stRNAs have short lifetimes iRNAs have longer lifetimes. Connectrons are the same sequences that bind to two widely separated (e.g. 100 kb) targets.
60 Simulation of Connectron Control of Gene Expression Connectrons have lifetimes A C1/C2 connectron source may originate from a gene that is already in a connectron The collection of all the connectrons for a genome forms an abstract state machine
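The state-machine idea can be illustrated with a toy time-stepped model of the transient connectron described earlier (gene “A” causes “B”; gene “C” causes “D”, which turns off “A”). This is a sketch under assumptions, not the simulation program discussed in this talk: the `Connectron` class, the discrete time step, the two-step lifetime, the placeholder target `x1` and the promotion schedule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Connectron:
    source: str         # gene whose 3'UTR carries the C1/C2 pair
    targets: List[str]  # genes inside the T1-T2 loop, off while the loop lasts
    lifetime: int       # steps the loop persists once its source goes silent
    remaining: int = 0  # current countdown; 0 means the loop is not formed


def step(connectrons: Dict[str, Connectron], promoted: Set[str]) -> Set[str]:
    """Advance one time step and return the genes expressed in it."""
    silenced = {g for c in connectrons.values() if c.remaining > 0 for g in c.targets}
    expressed = promoted - silenced
    for c in connectrons.values():
        if c.source in expressed:
            c.remaining = c.lifetime  # source expressing: loop is (re)formed
        elif c.remaining > 0:
            c.remaining -= 1          # source silent: loop decays
    return expressed


# Transient connectron: "x1" is a placeholder for whatever B controls.
network = {
    "B": Connectron(source="A", targets=["x1"], lifetime=2),
    "D": Connectron(source="C", targets=["A"], lifetime=2),
}
b_history = []
expressed = set()
for t in range(7):
    # Hypothetical schedule: gene C only becomes open to promotion at t = 3.
    promoted = {"A"} if t < 3 else {"A", "C"}
    expressed = step(network, promoted)
    b_history.append(network["B"].remaining)
```

In this run connectron B is refreshed while gene A expresses, starts to decay one step after gene C switches on, and has expired by the sixth step, leaving only gene C expressed. The tuple of `remaining` counters is the machine state in this toy model.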
[Simulation display: column headings Avail, %Perm, Trial, Cycle, Count, Ones, %Off, 1Shot; status line State = 120, Changes = 2, Time = (value not recovered)]
63 Simulation of Cellular Behavior The program for the simulation of cellular control of gene expression by connectrons is now at mid-stage of development. First results in the E. coli genome indicate that 60% to 80% of the genes are turned off at any given time. Any gene that is not turned off by connectron control is open to promotion and transcription.
64 An informatic view of the biological world David States (now at Washington University in St. Louis) argued that “All biological systems are essentially informatic systems that happen to be implemented in molecules.”
65 Connectrons do it! In the last two years, I have found that a purely informatic system for the high-level control of gene expression exists above the level of promotional control of gene expression. I call these control elements “Connectrons”
66 Connectrons Exist in all three Kingdoms The four-sequence relationship that prevents sets of genes from being expressed has now been found in all public genomes. In most genomes the percentage of genes controlled by connectrons ranges from 95% to 97%
67 Genomes Covered “The Bad Bug List” Pseudomonas aeruginosa PAO1 Deinococcus radiodurans Streptococcus pneumoniae Saccharomyces cerevisiae Sinorhizobium meliloti Escherichia coli K-12 MG1655 Escherichia coli K-12, Plasmid F & Bacteriophage Caulobacter crescentus Halobacterium sp. NRC-1 Rickettsia conorii Malish 7 Mycobacterium tuberculosis Lactococcus lactis Haemophilus influenzae Helicobacter pylori Methanococcus jannaschii Synechocystis Aquifex aeolicus Bacillus subtilis Aeropyrum pernix Streptococcus pneumoniae - TIGR4 Streptococcus pneumoniae R6 Ureaplasma urealyticum Helicobacter pylori J99 Methanobacterium thermoautotrophicum Mycobacterium leprae Escherichia coli O157:H7 Pasteurella multocida Yersinia pestis Bacillus halodurans Escherichia coli O157:H7:EDL933 Agrobacterium tumefaciens strain C58 Xylella fastidiosa Vibrio cholerae Sulfolobus tokodaii Chlamydia pneumoniae CWL029
68 Genomes Covered (cont.) “The Bad Bug List” Mycoplasma genitalium G37 Thermoplasma acidophilum Chlamydophila pneumoniae J138 Mycoplasma pneumoniae Thermotoga maritima Chlamydophila pneumoniae AR39 Campylobacter jejuni Staphylococcus aureus strain N315 Archaeoglobus fulgidus Listeria monocytogenes strain EGD Staphylococcus aureus strain Mu50 Borrelia burgdorferi Pyrococcus horikoshii Listeria innocua Clip11262 Buchnera sp. APS Salmonella typhimurium LT2 Pyrococcus abyssi Salmonella enterica serovar Typhi Rickettsia prowazekii Chlamydia trachomatis Treponema pallidum
69 Percentage of genes controlled by connectrons There are three parameters that determine the percentage of the genes controlled by connectrons (1) Minimum fragment length (set to 15-bases) (2) Maximum gap between C1 and C2 (set to 100-bases maximum) (3) Maximum distance from last exon to C1/C2 (determined for each genome)
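Once a set of T1-T2 loops has been determined for a given choice of these three parameters, the percentage itself is a simple coverage count. The sketch below assumes that a gene counts as controlled when it lies entirely inside at least one loop on the same chromosome; that criterion, the function name and the toy coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop) coordinates on one chromosome


def percent_controlled(genes: List[Interval], loops: List[Interval]) -> float:
    """Percentage of genes lying entirely inside at least one T1-T2 loop."""
    if not genes:
        return 0.0
    controlled = sum(
        1
        for g_start, g_stop in genes
        if any(l_start <= g_start and g_stop <= l_stop for l_start, l_stop in loops)
    )
    return 100.0 * controlled / len(genes)


# Toy data (made up): one loop spanning the first two of four genes.
toy_genes = [(100, 200), (300, 400), (900, 1000), (1500, 1600)]
toy_loops = [(50, 450)]
toy_percent = percent_controlled(toy_genes, toy_loops)
```

Loosening any of the three parameters admits more loops, so the computed percentage can only stay the same or rise.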
72 Collaboration to show the Physical Existence of Connectrons Drs. Sankar Adhya and Susan Garges in the NCI have designed and implemented physical experiments in E. coli First results show that the deletion of a “one-shot” connectron of 50kb with about 60 flagella genes causes changes in gene expression Paper to be published in PNAS by mid year.
73 Need to broaden the range of physical experimentation Since all genomes have connectrons of the same form, the initial proof of the existence of connectrons in E. coli has great importance. The density of connectrons controlling a particular set of genes is very much genome-dependent Physical experiments should be carried out on a whole range of genomes
74 Basic vs Applied Research Most of the conceptual developments are really basic research. The need for patent priority has hampered broader dissemination of the work. When the physical proofs are ready for publication the balance will change. Most commercial investment is concerned with end-use of connectron developments which is still years away.
75 Processing the Human Genome Processing the human genome to determine the connectron structure will make it possible to investigate many human diseases There are “connectron defect” diseases which are different from “gene defect” diseases
76 Processing the Human Genome Connectrons are determined from a pair of chromosomes. The half-diagonal of 24*24 jobs is 300 jobs Each pair of chromosomes has to be broken up into 50mb chunks. There are 700 such chunks The total number of jobs is 300 * 700 * 700/2 = 73.5 * 10^6
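The job count on this slide can be reproduced directly. The snippet below is only a check of the arithmetic, using the figures quoted on the slide; the variable names are made up.

```python
from itertools import combinations_with_replacement

N_CHROMOSOMES = 24  # 22 autosomes plus X and Y
CHUNKS = 700        # chunk count quoted on the slide

# Half-diagonal of the 24 x 24 grid: every unordered chromosome pair,
# including each chromosome paired with itself.
chromosome_pairs = list(combinations_with_replacement(range(N_CHROMOSOMES), 2))
pair_jobs = len(chromosome_pairs)

# Total number of chunk-against-chunk jobs, using the slide's formula.
total_jobs = pair_jobs * CHUNKS * CHUNKS // 2
```

This gives 300 chromosome-pair jobs and 73,500,000 jobs in total, in agreement with the slide.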
77 Zinc-finger DNA Binding Proteins (DBPs) as therapeutic agents DBPs can block C1/C2, T1 or T2 sites DBPs can bind across T1 and T2 sites forming a DBP connectron
78 Where is the competition? There are lots of papers appearing on iRNA and stRNA None of these people have understood the nature of the tetradic connectron relationship Thomas Werner who is at Genomatix in Munich is studying matrix attachment regions Matrix attachment regions are responsible for bringing the T1 and T2 proximal to each other so connectrons can be formed
79 Genomatix View
80 Patent Status of the Connectron Technology A basic methods US and PCT patent filed May 30th, 2001 USPTO analysis shows that there are 19 inventions 41 Bacterial, Archaeal and Eukaryotic genomes covered by US Provisional Patent Applications
81 Patenting whole genomes People get all bent out of shape when they hear that I have been patenting the connectron structure of many whole genomes My view is that if I don’t do it then someone else will reverse-engineer the connectron determination algorithm and do it themselves The connectrons are both an observation and an invention The utility which is the key to patentability is that a particular C1/C2 when expressed forms a T1-T2 connectron that turns off a particular set of genes
82 Where do we go from here Simulation of E. coli to relate Affymetrix-type gene expression measurement to modeled cell behavior Processing, analysis and simulation of C. elegans as the model for differentiation and development Processing of the human genome Modification of genomic properties using zinc-finger DBPs
83 The High Ground of the 21st Century A patented concept of total, systematic gene expression control Ability to compute all the gene expression control structures from genomic information Ability to patent all computed instances of these control structures based on known composition-of-matter and function
84 The High Ground of the 21st Century Ability to validate all gene expression events through existing measurement techniques Ability to simulate the gene expression control behavior of the complete organism Ability to set biological engineering standards
85 My responsibility as inventor Modification of genomic behavior by changing connectron interactions will be a very powerful force in our global society in a few years I feel a very deep responsibility for the future history of this invention My intention is that everyone should and will have access to this invention But everyone will pay - a small bit here and there
86 Contact Information Richard J. Feldmann (v) Global Determinants, Inc. (f) Mill Creek Dr. (c) Derwood, Maryland