1
Promoter Prediction in E.coli using ANN
A. Krishnamachari, Bioinformatics Centre, JNU
2
Definition of Bioinformatics
The systematic development and application of computing and computational solution techniques to biological data, in order to investigate biological processes and make novel observations.
3
Research in Biology
[diagram: the general approach studies the organism, its functions, the cell and the chromosome; in the bioinformatics era the starting point is DNA sequences]
4
Genome Sequence
[diagram: intergenic region with TF sites, -35 and -10 elements, TSS and RBS, followed by the CDS]
TF -> Transcription Factor sites; TSS -> Transcription Start Site; RBS -> Ribosome Binding Site; CDS -> Coding Sequence (or gene)
7
Statement of the problem
Given a set of known sequences pertaining to a specific biological feature, develop a computational method to search for new members or sequences.
8
Computational Methods
Pattern Recognition
Pattern Classification
Optimisation Methods
9
Sequence Analysis AND Prediction Methods
Consensus
Position Weight Matrix (or) Profiles
Machine Learning Methods: Neural Networks, Markov Models, Support Vector Machines, Decision Trees
Optimization Methods
Statistical Learning
10
Promoter -10 element (TATA box): consensus sequence TATAAT; some positions are conserved at only 49%, 54% and 58%.
Only 14 sites out of 291 sequences match the consensus exactly [Lisser and Margalit]. Most sites carry mismatches, but at which positions?
11
Relative Entropy Plot - Promoters
12
Relative Entropy Plot - Random Sequences
13
Alignment
14
Describing features using frequency matrices
Goal: describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences.
Need: describe how often particular bases are found at particular positions in a sequence feature.
15
Describing features using frequency matrices
Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
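To make the definition concrete, here is a minimal sketch (not from the slides) of building such a frequency matrix in Python; the aligned example sequences are illustrative.

    aligned = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]   # illustrative aligned features
    alphabet = "ACGT"
    m = len(aligned[0])                                  # feature length (columns)

    # Count each base at each position, then normalise counts to frequencies.
    counts = [{c: 0 for c in alphabet} for _ in range(m)]
    for seq in aligned:
        for i, base in enumerate(seq):
            counts[i][base] += 1

    freq = {c: [counts[i][c] / len(aligned) for i in range(m)] for c in alphabet}
    print(freq["T"])   # frequency of T at each of the m positions, e.g. [1.0, 0.0, ...]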
16
Frequency matrices (continued)
Three uses of frequency matrices:
Describe a sequence feature
Calculate the probability of occurrence of the feature in a random sequence
Calculate the degree of match between a new sequence and the feature
17
Frequency Matrices, PSSMs, and Profiles
A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores. PSSMs are also called Position Weight Matrices (PWMs) or Profiles.
18
Methods for converting frequency matrices to PSSMs
Using the log ratio of observed to expected frequency: S(j, i) = log[ m(j, i) / f(j) ], where m(j, i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences). Alternatively, using an amino acid substitution matrix (Dayhoff similarity matrix) [see later].
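Continuing the frequency-matrix sketch above, the log-ratio conversion might look as follows; the uniform background f(j) and the pseudocount are assumptions for illustration.

    import math

    def to_pssm(freq, background, pseudo=0.01):
        # Score S(j, i) = log( m(j, i) / f(j) ); the pseudocount avoids log(0).
        width = len(next(iter(freq.values())))
        return {c: [math.log((freq[c][i] + pseudo) / background[c]) for i in range(width)]
                for c in freq}

    # Uniform background frequencies are assumed here for simplicity;
    # `freq` is the frequency matrix built in the earlier sketch.
    pssm = to_pssm(freq, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})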
19
Finding occurrences of a sequence feature using a Profile
As with finding occurrences of a consensus sequence, we consider every position in the target sequence as a candidate match. For each candidate, we calculate a score by "looking up" the matrix value corresponding to the base observed at each position of the window and summing these values.
20
Alignment
21
Positions (columns in alignment)

Nucleotide    1    2    3    4    5
A            x11  x21  x31  x41  x51
T            x12  x22  x32  x42  x52
G            x13  x23  x33  x43  x53
C            x14  x24  x34  x44  x54

For the candidate sequence TAGCT, the score is V1 = x12 + x21 + x33 + x44 + x52; if V1 is above a threshold, the position is called a site. (AGTGC would be scored analogously.)
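In code, the V1 calculation above is a sum of per-position lookups; a minimal sketch, where the threshold value is an assumption and `pssm` comes from the earlier sketch:

    def score(window, pssm):
        # V = sum of the matrix entry for the observed base at each position.
        return sum(pssm[base][i] for i, base in enumerate(window))

    threshold = 2.0                          # assumed cut-off, chosen per application
    if score("TATCAT", pssm) >= threshold:
        print("candidate site")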
22
Building a PSSM
[flowchart: a set of aligned sequence features, together with the expected frequencies of each sequence element, is fed to the PSSM builder, which outputs the PSSM]
23
Searching for sequences related to a family with a PSSM
[flowchart: the set of aligned sequence features and the expected frequencies of each sequence element go to the PSSM builder, producing a PSSM; the PSSM, a threshold, and the set of sequences to search go to the PSSM search, which outputs the sequences matching above the threshold and the positions and scores of the matches]
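A sketch of the search step described above: slide the PSSM along each target sequence and report the positions and scores of windows above the threshold. The target sequences are illustrative, and score() is the function from the previous sketch.

    def pssm_search(targets, pssm, width, threshold):
        # Scan every window of length `width` in every target sequence and
        # yield (sequence index, position, score) for matches above threshold.
        for t, seq in enumerate(targets):
            for pos in range(len(seq) - width + 1):
                s = score(seq[pos:pos + width], pssm)
                if s >= threshold:
                    yield t, pos, s

    hits = list(pssm_search(["GGTATAATGC", "CCCCCCCC"], pssm, 6, threshold))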
24
Consensus sequences vs. frequency matrices
Consensus sequence or frequency matrix: which one to use?
If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: restriction enzyme recognition sites.
If some allowed characters are "better" than others, use a frequency matrix. Example: promoter sequences.
25
Consensus sequences vs. frequency matrices
Advantages of consensus sequences: smaller description, quicker comparison.
Disadvantage: quantitative information on preferences at certain locations is lost.
26
Linear Classification Problems
[scatter plot: Measure 1 vs. Measure 2]
27
Nonlinear Classification Problem
[scatter plot: Measure 1 vs. Measure 2]
28
FEATURE EXTRACTION
29
FEATURE EXTRACTION
30
(Artificial) Neural Network
31
What Is A Neural Network?
A computational construct based on the biological neuron. A neural network can: learn by adapting its synaptic weights to changes in the surrounding environment; handle imprecise, fuzzy, noisy, and probabilistic information; and generalize from known tasks or examples to unknown ones. Artificial neural networks (ANNs) attempt to mimic some, or all, of these characteristics.
32
Neural Network Characterised by:
- its pattern of connections between the neurons (network architecture)
- its method of determining the weights on the connections (training or learning algorithm)
33
Why Neural Networks: Applications
- Little or incomplete understanding of the problem to be solved (very little theory)
- Abundant data available
34
Neural Networks: Applications
Pattern classification
Speech synthesis and recognition
Image compression
Clustering
Medical diagnosis
Manufacturing
35
Neural Networks: Bioinformatics
Binding site prediction
Protein secondary structure prediction
Protein folds
Microarray data clustering
Gene prediction
36
Neural Networks
Supervised Learning
Unsupervised Learning
37
Perceptron
[diagram: input units 1 and 2 (Layer 1) connect through weights W1,3 and W2,3 to output unit 3 (Layer 2); information flows from inputs to output]
38
Perceptron Summation Operation: Σi xi·wij = x1·w1j + x2·w2j + x3·w3j + … + xn·wnj
Thresholding function: output = 0 if Σ xi·wi < T; output = 1 if Σ xi·wi > T
[plots: step transfer function with threshold = 0, and with threshold = T]
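A minimal sketch of the summation and hard-thresholding steps; the inputs, weights and T below are illustrative.

    def perceptron_output(x, w, T=0.0):
        # Weighted sum x1*w1j + ... + xn*wnj, followed by hard thresholding at T.
        net = sum(xi * wi for xi, wi in zip(x, w))
        return 1 if net > T else 0

    print(perceptron_output([1, 0, 1], [0.5, -0.2, 0.7]))   # net = 1.2 -> output 1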
39
Logistic Transfer Function
[plot: perceptron output vs. input for the logistic transfer function, saturating at 1]
Output = 1 / (1 + e^(-net)), where net is the weighted sum of the inputs.
Weight updates: w(k+1) = w(k) + µ[T(k) - w(k)·x(k)]·x(k) for 0 ≤ k ≤ N-1
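A sketch of the logistic transfer function and the weight-update rule above; the learning rate µ is an assumed value.

    import math

    def logistic(net):
        # Output = 1 / (1 + e^(-net))
        return 1.0 / (1.0 + math.exp(-net))

    def update_weights(w, x, target, mu=0.1):
        # w(k+1) = w(k) + mu * [T(k) - w(k).x(k)] * x(k)
        net = sum(wi * xi for wi, xi in zip(w, x))
        return [wi + mu * (target - net) * xi for wi, xi in zip(w, x)]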
40
Learning Concepts
Generally, the target output is 1 for positive examples and 0 for negative examples; in practice, a (0.9, 0.1) combination is used.
Stopping criterion: based on a fixed number of epochs or cycles, or based on error estimates.
41
Perceptron input encoding
[diagram: each position 1, 2, …, K-1, K in a sequence of K nucleotides is represented by four input units, one per base A, T, G, C]
42
Bit-Coding
Let the following binary values represent each base: A = 0001, C = 0010, G = 0100, T = 1000.
Then G = 4, "A or C" = 0011 = 3, "A, G or T" = 1101 = 13, etc.
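A sketch of this 4-bit coding scheme, where each base sets one bit so that ambiguous symbols become bitwise ORs; the values match the slide (A = 1, G = 4, "A or C" = 3, "A, G or T" = 13).

    # 4-bit coding: one bit per base, so ambiguity codes are bitwise ORs.
    BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

    def encode(symbols):
        # "AC" ("A or C") -> 0b0011 == 3 ; "AGT" ("A, G or T") -> 0b1101 == 13
        code = 0
        for s in symbols:
            code |= BITS[s]
        return code

    assert encode("G") == 4 and encode("AC") == 3 and encode("AGT") == 13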
43
Network
[diagram: one-hot inputs for each of the K sequence positions (e.g., A at position 1, G at position 2, …, G at position K-1, T at position K) connect through weights Wi,j to the network]
44
Learning model
[diagram: the test set is scored against a positive model and a negative model; a perfect classifier gives TP + TN = 100%]
Note: training and test sequences are of fixed length.
45
Learning model
[diagram: training set and test set, each consisting of fixed-length sequences of positions 1 … 50; n = 10, N = 500]
46
Learning model
[diagram: training sequences (e.g., CGTAGCTATAGTGGG…) are fed to the prediction method; a test sequence is then given as input and the method produces an output]
47
Multilayer Perceptron
[diagram: input layer, one hidden layer, output layer]
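A minimal sketch of the corresponding forward pass with logistic units; the layer sizes and weights are illustrative, not the network used in the slides.

    import math

    def mlp_forward(x, W1, b1, W2, b2):
        # Input layer -> hidden layer -> output layer, logistic activation throughout.
        sig = lambda v: 1.0 / (1.0 + math.exp(-v))
        hidden = [sig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
        return [sig(sum(w * h for w, h in zip(row, hidden)) + b) for row, b in zip(W2, b2)]

    # Example: 3 inputs, 2 hidden units, 1 output (all parameters illustrative).
    out = mlp_forward([1, 0, 1], [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]], [0.0, 0.1],
                      [[0.5, -0.5]], [0.0])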
51
Error function
[plot: error vs. weight, with a local minimum at A and the global minimum at B]
52
Prediction
[diagram: NN learning takes examples <x, f(x)>; NN recognition takes a new x and outputs h(x)]
Disease A diagnosis: x = gene expression data (a vector of numbers); f(x) = A-positive / A-negative (boolean 0/1); <x, f(x)> = the set of known values.
53
Evaluation Mechanism
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Correlation coefficient: C = (TP×TN - FP×FN) / √[(TP+FP)(FP+TN)(TN+FN)(FN+TP)]
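These three measures in code (a sketch; the counts are assumed to come from a confusion matrix of the network's predictions on the test set):

    import math

    def evaluate(tp, tn, fp, fn):
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        # Correlation coefficient as defined above.
        denom = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
        correlation = (tp * tn - fp * fn) / denom if denom else 0.0
        return sensitivity, specificity, correlation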
54
Cross-Validation: Benchmarking the Network Performance
Step 1: divide the training set into N partitions
Step 2: train on N-1 partitions and test on the left-out partition
Step 3: evaluate the performance
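A sketch of the N-fold procedure above; train() and evaluate_model() are hypothetical stand-ins for whatever learning and scoring routines are actually used.

    def cross_validate(examples, n_folds, train, evaluate_model):
        # Step 1: divide the data into N partitions.
        folds = [examples[i::n_folds] for i in range(n_folds)]
        scores = []
        for i in range(n_folds):
            # Step 2: train on the other N-1 partitions, test on the left-out one.
            training = [ex for j, f in enumerate(folds) if j != i for ex in f]
            model = train(training)
            # Step 3: evaluate performance on the held-out partition.
            scores.append(evaluate_model(model, folds[i]))
        return sum(scores) / n_folds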
55
References
1. Horton PB, Kanehisa M. An assessment of neural network and statistical approaches for prediction of E. coli promoter sites. Nucleic Acids Res. 1992 Aug 25;20(16):4331-8.
2. Mahadevan I, Ghosh I. Analysis of E. coli promoter structures using neural networks. Nucleic Acids Res. 1994 Jun 11;22(11).
56
3. O'Neill MC. Training back-propagation neural networks to define and detect DNA-binding sites. Nucleic Acids Res. 1991 Jan 25;19(2):313-8.
4. O'Neill MC. Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes. Nucleic Acids Res. 1992 Jul 11;20(13).
5. O'Neill MC. A general procedure for locating and analyzing protein-binding sequence motifs in nucleic acids. Proc Natl Acad Sci U S A. 1998 Sep 1;95(18).
6. Pedersen AG, Engelbrecht J. Investigations of Escherichia coli promoter sequences with artificial neural networks: new signals discovered upstream of the transcriptional startpoint. Proc Int Conf Intell Syst Mol Biol. 1995;3:292-9.
57
7. Leung S-w, Mellish C, Robertson D. Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences.
E. coli promoter data:
8. Hershberg R, Bejerano G, Santos-Zavaleta A, Margalit H. PromEC: an updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites. Nucleic Acids Research 2001;29:277.
58
Earlier Studies
Data set corresponding to the 17-bp spacing class
Very high threshold values
59
Results - O'Neill (BPNN)

Network Model   Performance   pBR322 (CW)                            pBR322 (CCW)
17-1            34/35         -5, 339, 477, 1584, 1657, 1970, 4130   125, 805, 1021, 1226, 4278
17-2            33/35         ….                                     ….
17-3, 17-4      32/35         ….                                     ….
Poll (4/4)                    -5, 1584, 1970, 4130                   125, 807, 1226, 4278
62
Disadvantages of MLP
i) slow convergence;
ii) training relies heavily on the choice of the number of hidden layers; and
iii) prediction is generally based on high threshold values.
63
Improvements in Prediction
Pre-processing of the data based on DNA structure
Clean model
64
Structural Atlas of E. coli
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol. 2000 Jun 16;299(4).
65
Points to remember
1) The size of the positive data set may be increased by incorporating point mutations at non-sensitive positions.
2) Negative data sets are generated in several ways:
a) shuffling or randomising the positive data set (this does not completely destroy the correlations between bases);
b) using random sequences with a biased composition;
c) extracting sequences from gene-coding segments.
3) For the learning phase, the numbers of positive and negative input vectors are generally not proportionate, and there is no standard prescription.
4) Convergence and predictive ability depend on the size and the number of input vectors.
66
GRAIL
67
THANKS