Summarized by Sun Kim SNU Biointelligence Lab.


1 Highly Specific Localization of Promoter Regions in Large Genomic Sequences by PromoterInspector: A Novel Context Analysis Approach
Matthias Scherf et al., Journal of Molecular Biology, 2000
Summarized by Sun Kim, SNU Biointelligence Lab.

2 (C) 2001, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Promoter
DNA sequences near the beginning of genes.
Function: to mediate and control initiation of transcription of that part of a gene that is located immediately downstream of the promoter (3’).

3 (cont’d)
The DNA region required to fulfill the function can be determined by assays for promoter function in a heterologous context.
Often complex regulation involves many more features than just the promoter, e.g. enhancers, locus control regions, etc.
If any of these units, which are functionally completely different from promoters, happens to be located adjacent to the promoter, delineation of the promoter becomes difficult.
This is one of the reasons why promoter prediction programs almost exclusively focus on proximal promoter regions or even just on the core promoter.

4 Transcriptional promoters
Transcription can proceed only after a competent transcription complex, consisting of RNA polymerase II and several general transcription factors, has been recruited to the promoter.

5 Transcription

6 Schematic structure of polymerase II promoter

7 Assembly of the activator/promoter complex on the proximal and core promoter region

8 (cont’d)

9 Aim of Promoter Recognition
Location of an important part of the regulatory region of a gene.
Promoter prediction can be useful in the context of gene prediction.
The promoter marks the beginning of the first exon of a gene.
Promoter regions contain information complementary to the exons and introns, because transcriptional regulation cannot be deduced from the predicted amino acid sequence.
Transcriptional regulation can play an important part in gene function.
The promoter may yield first clues towards the function of a completely anonymous protein.
Prediction of the functionality of a promoter would be welcome for gene therapy approaches, to improve expression of newly created vector constructs.

10 Introduction
PromoterInspector
  Locates eukaryotic polymerase II promoter regions in large genomic sequences with a high degree of specificity.
  Focuses on the genetic context of promoters.
  Based on libraries of IUPAC words extracted from training sequences by an unsupervised learning approach.
Polymerase II promoters
  Do not contain any sequence elements that are consistently shared.
  Usually consist of multiple binding sites for transcription factors that must occur in a specific context, apparently shared only by a small group of promoters.
  The combination and orientation of the transcription factor binding sites is the crucial information.

11 Promoter prediction approaches
Heuristic approaches
Approaches that attempt to recognize core promoter elements such as TATA boxes, CAAT boxes, and INR (transcription initiation sites)
Approaches that attempt to use the whole ensemble of elements (transcription factor binding sites, oligonucleotides) found in a promoter

12 (cont’d)
Heuristic approaches
  Use models that describe the orientation and context of several transcription factor binding sites.
  Have been shown to detect promoters with a very high level of specificity, but with limited coverage.
  Useful to predict specific promoter classes.

13 (cont’d)
Approaches that attempt to recognize promoter elements
  Mostly predict on the order of one promoter per kilobase in human DNA (Fickett & Hatzigeorgiou, 1997).
  The average distance between functional promoters has been estimated to be in the range of 30 to 40 kb, with a very uneven distribution.
  Most of the predicted promoters are therefore false positives.
  Some of the tools use a more restrictive approach to reduce the total number of predictions, but the problem remains.
  The large number of false positive matches precludes experimental verification.
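The gap between these two numbers can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes the best case in which every functional promoter is hit by exactly one prediction, so the result is a lower bound on the false positive fraction.

```python
def min_false_positive_fraction(kb_per_prediction, kb_per_true_promoter):
    """Lower bound on the fraction of predictions that must be false positives.

    Assumption (not from the slides): every true promoter is matched by
    exactly one prediction, which is the most favourable case.
    """
    predictions_per_kb = 1.0 / kb_per_prediction
    true_promoters_per_kb = 1.0 / kb_per_true_promoter
    return max(0.0, 1.0 - true_promoters_per_kb / predictions_per_kb)


# One prediction per kb against one functional promoter per 30 to 40 kb.
for spacing_kb in (30, 40):
    fraction = min_false_positive_fraction(1, spacing_kb)
    print(f"promoter every {spacing_kb} kb: at least {fraction:.1%} false positives")
```

Under these assumptions at least 29 of every 30 predictions (or 39 of every 40) cannot correspond to a functional promoter.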

14 Design of the prediction system
Polymerase II promoters are quite different in terms of individual organization, but are probably embedded into a common genomic context.
Specific features of such a putative context are not yet known.
The prediction system is therefore based on context features extracted from training sequences by an unsupervised learning technique.

15 Definition of context features
Based on an approach using oligonucleotides with one variable mismatch (Wolfertstetter et al., 1996).
Extended the approach by the introduction of wildcards at multiple positions.

16 (cont’d)
Context features are defined by disjunct groups of similar IUPAC words (IUPAC groups).
Each IUPAC group is uniquely defined by a set of oligonucleotides and a number of undefined base-pairs (wildcards, “N”).
The IUPAC words of an IUPAC group contain all elements of the oligonucleotide set in the same order and orientation, and differ in the number of wildcards between them.

17 (cont’d)
Wildcards at the beginning and end of IUPAC words are discarded.
Example
  An IUPAC group which results from two wildcards and the oligonucleotide set (AGC, GCA)
  → (AGCGCA, AGCNGCA, AGCNNGCA)
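The construction of an IUPAC group can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the wildcard count is assumed to be a total budget shared by all gaps between neighbouring oligonucleotides, and matching only handles the letters A, C, G, T and N (other IUPAC ambiguity codes are ignored).

```python
import re
from itertools import product


def iupac_group_words(oligos, max_wildcards):
    """List the IUPAC words of the group defined by an ordered oligonucleotide set.

    Wildcards ("N") are only placed between oligonucleotides, never at the
    beginning or end of a word. Assumption: max_wildcards is the total
    number of N's allowed per word.
    """
    gaps = len(oligos) - 1
    words = []
    for counts in product(range(max_wildcards + 1), repeat=gaps):
        if sum(counts) > max_wildcards:
            continue
        word = oligos[0]
        for n_count, oligo in zip(counts, oligos[1:]):
            word += "N" * n_count + oligo
        words.append(word)
    return words


def group_matches(words, sequence):
    """True if any word of the group occurs in the sequence (N matches any base)."""
    return any(re.search(w.replace("N", "[ACGT]"), sequence) for w in words)


words = iupac_group_words(("AGC", "GCA"), max_wildcards=2)
print(words)  # ['AGCGCA', 'AGCNGCA', 'AGCNNGCA']
print(group_matches(words, "TTAGCTGCAT"))  # True, via AGCNGCA
```

The printed word list reproduces the example group from the slide.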

18 Definition of decision instances
Prediction of the genomic promoter context is based on several decision instances (classifiers).
A classifier is defined by two disjunct sets of IUPAC groups:
  Promoter-related IUPAC groups → “promoter”
  Non-promoter-related IUPAC groups → “non-promoter”
The classification is based on IUPAC group matches.

19 (cont’d)
The IUPAC group candidates of a classifier are directly extracted from a set of training sequences.
All IUPAC groups that match at least once in these sequences are included.
Example
  If IUPAC groups are defined by a set of two oligonucleotides of length two and two wildcards,
  from the training sequence AGCTG
  → Candidates: (AGCT, AGNCT, AGNNCT), (AGTG, AGNTG, AGNNTG), (GCTG, GCNTG, GCNNTG)
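The candidate extraction in this example can be reproduced with a short sketch. It assumes, as in the example, groups built from exactly two oligonucleotides that do not overlap; the function names and defaults are hypothetical and chosen only to match the AGCTG case.

```python
def candidate_groups(sequence, oligo_len=2, max_wildcards=2):
    """Collect every (first, second) oligonucleotide pair whose group matches the sequence.

    A pair qualifies if the second oligonucleotide starts 0..max_wildcards
    bases after the end of the first one.
    """
    groups = set()
    last_start = len(sequence) - oligo_len
    for i in range(last_start + 1):
        first = sequence[i:i + oligo_len]
        for gap in range(max_wildcards + 1):
            j = i + oligo_len + gap
            if j > last_start:
                break
            groups.add((first, sequence[j:j + oligo_len]))
    return groups


def words_of(group, max_wildcards=2):
    """Expand a two-element group into its IUPAC words."""
    first, second = group
    return [first + "N" * n + second for n in range(max_wildcards + 1)]


for group in sorted(candidate_groups("AGCTG")):
    print(group, words_of(group))
# ('AG', 'CT') ['AGCT', 'AGNCT', 'AGNNCT']
# ('AG', 'TG') ['AGTG', 'AGNTG', 'AGNNTG']
# ('GC', 'TG') ['GCTG', 'GCNTG', 'GCNNTG']
```

Each group is listed with all of its words, even though only one word per group (AGCT, AGNTG and GCTG respectively) actually occurs in AGCTG.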

20 Determining candidates
Given a set of promoter and non-promoter sequences (training sequences):
If the ratio between the number of hits in the promoter and non-promoter training sequences exceeds a certain threshold (“assignment threshold”)
→ the candidate is assigned to the class promoter.
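A minimal sketch of this assignment rule follows. Several details are assumptions rather than facts from the slides: a "hit" is counted once per training sequence that contains any word of the group, the two training sets are not normalized for size, candidates that fail the test fall into the non-promoter class, and the function names are hypothetical. The comparison is written as a product to avoid dividing by zero when a candidate never hits a non-promoter sequence.

```python
import re


def count_hits(words, sequences):
    """Number of sequences containing at least one word of the group (N = any base)."""
    patterns = [re.compile(w.replace("N", "[ACGT]")) for w in words]
    return sum(any(p.search(seq) for p in patterns) for seq in sequences)


def assign_class(promoter_hits, non_promoter_hits, threshold=1.0):
    """Assign a candidate to 'promoter' if the hit ratio exceeds the assignment threshold."""
    if promoter_hits > threshold * non_promoter_hits:
        return "promoter"
    return "non-promoter"


# Toy training data, invented for illustration only.
words = ["AGCT", "AGNCT", "AGNNCT"]
promoter_seqs = ["TTAGCTAA", "AGGCTT", "CCCC"]
non_promoter_seqs = ["AGTTCT", "GGGG"]

p_hits = count_hits(words, promoter_seqs)      # 2
n_hits = count_hits(words, non_promoter_seqs)  # 1
print(assign_class(p_hits, n_hits, threshold=1.0))  # promoter
```

With a stricter threshold of 2 the same candidate would no longer qualify, since 2 hits do not exceed 2 × 1.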

21 Architecture of the prediction system
PromoterInspector is based on three classifiers.
Each classifier is specialized to differentiate between promoter sequences and one of the non-promoter sequence sets: exon, intron and 3’-UTR.
A sequence is assigned to the class promoter only if all three classifiers agree.
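The unanimity rule can be sketched as follows. Only the "all three must agree" logic comes from the slide; the toy classifiers, their word lists and the "more promoter words than non-promoter words" rule inside them are hypothetical stand-ins, since the slides do not spell out how a single classifier scores its IUPAC group matches.

```python
def make_toy_classifier(promoter_words, non_promoter_words):
    """Hypothetical stand-in for one trained classifier (plain substring matching)."""
    def classify(sequence):
        p = sum(word in sequence for word in promoter_words)
        n = sum(word in sequence for word in non_promoter_words)
        return "promoter" if p > n else "non-promoter"
    return classify


# One classifier per non-promoter background; the word lists are invented.
CLASSIFIERS = {
    "exon": make_toy_classifier(["CGCG", "GGCG"], ["ATGA"]),
    "intron": make_toy_classifier(["CGCG"], ["TTTT"]),
    "3'-UTR": make_toy_classifier(["GCGC"], ["AATAAA"]),
}


def is_promoter(sequence, classifiers=CLASSIFIERS):
    """Report 'promoter' only if every classifier votes for it."""
    return all(clf(sequence) == "promoter" for clf in classifiers.values())


print(is_promoter("GGCGCGCC"))        # True: all three toy classifiers agree
print(is_promoter("GGCGCGCCAATAAA"))  # False: the 3'-UTR classifier disagrees
```

Requiring agreement trades sensitivity for specificity: a single dissenting classifier is enough to reject a sequence.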

22 Parameter optimization of the prediction system
Parameters
  The number of wildcards
  The number and length of the elements in the oligonucleotide sets which define the IUPAC groups

23 (cont’d)
Optimization
  1. Three cross-validation sets are prepared. From a given set, 90% is used for parameter optimization and 10% for evaluation.
  2. A set of different parameter constellations was generated.
  3. A classifier was built for every parameter constellation and the best classifier was kept.
  4. The classifiers which resulted from step 3 were evaluated on the evaluation set.
→ The optimal assignment threshold is 1.

24 Results of the three classifiers

25 Application technique
Identification of promoter regions in large genomic sequences is performed by a sliding window approach.
A window is moved over the sequence and its content is classified.
A promoter region is reported if a certain number of consecutive windows are identified as members of the promoter class.
→ Parameters that need optimization: the length of the window, the offset between two consecutive windows, the number of consecutive hits
→ window size: 100, offset: 4, number of consecutive hits: 24
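The sliding window procedure can be sketched as below, using the parameter values from the slide as defaults. The sketch makes several assumptions: window size and offset are measured in bases, a reported region runs from the start of the first window in a run to the end of the last one, and `classify_window` is a hypothetical callable standing in for the combined classifiers (here a toy test that calls a window a promoter when it consists only of C).

```python
def promoter_regions(sequence, classify_window, window=100, offset=4, min_hits=24):
    """Return (start, end) spans covered by at least min_hits consecutive promoter windows."""
    regions = []
    run_start = None   # start of the first window in the current run
    last_start = None  # start of the most recent promoter window
    run_len = 0
    for start in range(0, len(sequence) - window + 1, offset):
        if classify_window(sequence[start:start + window]):
            if run_len == 0:
                run_start = start
            run_len += 1
            last_start = start
        else:
            if run_len >= min_hits:
                regions.append((run_start, last_start + window))
            run_len = 0
    if run_len >= min_hits:  # a run may extend to the end of the sequence
        regions.append((run_start, last_start + window))
    return regions


def all_c(window_seq):
    """Toy window classifier, for illustration only."""
    return set(window_seq) == {"C"}


toy_sequence = "A" * 200 + "C" * 300 + "A" * 200
print(promoter_regions(toy_sequence, all_c))  # [(200, 500)]
```

In the toy sequence, 51 consecutive windows (starts 200, 204, ..., 400) are classified as promoter, which exceeds the required 24. A stretch of only 150 C's would yield 13 such windows and no reported region.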

26 Fickett’s evaluation data set
Consists of 24 promoters covering a total of 33,120 bp.

27 Results for the Fickett data set

28 Description of the large genomic sequences


30 Summary of large genomic sequence analysis


