Sequence analysis – an overview A.Krishnamachari


1 Sequence analysis – an overview A.Krishnamachari chari@mail.jnu.ac.in

2 Definition of Bioinformatics: Systematic development and application of computing and computational solution techniques to biological data, in order to investigate biological processes and make novel observations

3 Research in Biology: Organism, Functions, Cell, Chromosome, DNA, Sequences. Figure labels: General approach; Bioinformatics era

4 Information Explosion GENOME PROTEOME TRANSCRIPTOME METABOLOME

5 Databases Literature Sequences Structure Pathways Expression ratios

6 Databases Textual Symbolic (manipulation possible) Numeric (computation possible) Graphs (visualization )

7 January Issue

8

9 Integrated Database Search Engines http://www.genome.ad.jp/dbget/ http://srs.ebi.ac.uk http://www.ncbi.nlm.nih.gov/Entrez/

10

11 COG, LocusLink, UniGene, Human-Mouse Map

12 Primary sequences (DNA, Protein), Structures, Expression data, Pathways. Gene: 1000; Genome: 10^8

13 Analysis Individual sequences Between sequences Within a genome Between genomes

14 Sequence Analysis: Sequence segments that have a functional role show a bias in composition and correlation. Computational methods try to capture these biases, regularities, and correlations. Scale-invariant properties

15 Sequence Analysis: Sequence comparison; Pattern finding (repeats, motifs, restriction sites); Gene prediction; Phylogenetic analysis

16 TF -> Transcription Factor Sites; TSS -> Transcription Start Sites; RBS -> Ribosome Binding Sites; CDS -> Coding Sequence (or Gene). Other figure labels: intergenic, -10, -35

17 Protein-DNA interactions Biological functions Regulation or Modulation Specific binding (Specified DNA pattern)

18 DNA binding sites Promoter Splice site Ribosome binding site Transcription Factor sites Restriction Enzymes sites

19 The dimer is constructed such that it has two-fold symmetry, allowing the recognition helix of the second protein sub-unit to make the same groove-binding interactions as the first. The distance between the recognition helices is 34 angstroms, which corresponds to one turn of the B-DNA double helix. This means that when the recognition helix of one sub-unit binds in the groove of a specific region of DNA, the second sub-unit's helix can also bind in the DNA groove, one turn along from the first helix

20 Odd Even

21 DNA binding sites - Model. Experimental methods: Footprint experiments (DNase); Methylation interference; Immunoprecipitation assay. Compilation and model building

22 [Figure labels: TF1, TF2, TF3, TF1; positions -40, -120, -145] Design oligos covering these regions for studying promoter activity; carry out EMSA; carry out reporter assay; carry out in-vivo experiments; make observations

23

24 [Figure: reporter gene constructs carrying binding sites BS1 and BS2; position labels -15, -30, -56, -105, -150, -100, -50; measure expression of the reporter gene]

25 Statement of the problem Given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur.

26 Reference

27 1. Variability is inherent in biological sequences 2. It manifests at various length scales 3. A statistical and probabilistic framework is ideal for studying these characteristics

28 Sequence Analysis AND Prediction Methods Consensus Position Weight Matrix (or) Profiles Computational Methods –Neural Networks –Markov Models –Support Vector Machines –Decision Tree –Optimization Methods

29 Strict consensus - TATA Loose consensus - (A/T)R(G/C)YG Weight matrix OR profile
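A loose consensus such as (A/T)R(G/C)YG can be searched by expanding each IUB/IUPAC ambiguity code into a regular-expression character class. The following is a minimal sketch; the function names and the test sequences are invented for illustration and do not come from the slides.

```python
import re

# IUB/IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate a consensus written in IUPAC codes into a regex string."""
    return "".join(IUPAC[c] for c in consensus.upper())

def find_consensus(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead matches without consuming, so overlapping hits are reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Strict consensus: every position must match exactly.
print(find_consensus("TATA", "GGTATATACC"))   # [2, 4]

# (A/T)R(G/C)YG written in IUPAC codes is WRSYG.
print(find_consensus("WRSYG", "CCAGCTGAA"))   # [2]
```

A consensus search is all-or-nothing: a candidate either matches every position or is rejected, which is the limitation the weight matrix or profile is meant to address.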

30 Describing features using frequency matrices Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences Need to describe how often particular bases are found in particular positions in a sequence feature

31 Describing features using frequency matrices Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
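The definition above can be sketched in a few lines of Python. The aligned sites below are hypothetical and invented purely for illustration; the matrix is stored as one row per character (n rows) with one entry per position (m columns).

```python
from collections import Counter

ALPHABET = "ACGT"

def frequency_matrix(aligned, alphabet=ALPHABET):
    """Return {character: [frequency at position 0, 1, ...]} for equal-length sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    matrix = {ch: [] for ch in alphabet}
    for i in range(length):
        # Count how often each character occurs in column i of the alignment.
        counts = Counter(s[i] for s in aligned)
        for ch in alphabet:
            matrix[ch].append(counts[ch] / len(aligned))
    return matrix

# Hypothetical aligned sites (not real data).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
fm = frequency_matrix(sites)
for ch in ALPHABET:
    print(ch, fm[ch])
```

Each column of the resulting matrix sums to 1, since every sequence contributes exactly one character per position.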

32 Frequency matrices (continued) Three uses of frequency matrices –Describe a sequence feature –Calculate probability of occurrence of feature in a random sequence –Calculate degree of match between a new sequence and a feature
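One simple reading of the second use, sketched here under the assumption that bases in a random sequence are independent and drawn from fixed background frequencies: the probability of seeing a particular word at a given position is the product of its base frequencies, and the expected number of chance occurrences is that probability times the number of windows. The function names and the uniform background are assumptions made for this example.

```python
from math import prod

def word_probability(word, background):
    """Probability of the word at one position, assuming independent bases."""
    return prod(background[ch] for ch in word)

def expected_occurrences(word, background, seq_length):
    """Expected number of chance matches in a random sequence of the given length."""
    windows = max(seq_length - len(word) + 1, 0)
    return windows * word_probability(word, background)

# Assumed uniform background; real genomes are usually not uniform.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(word_probability("TATA", uniform))
print(expected_occurrences("TATA", uniform, 1000))
```

Short motifs are therefore expected to occur by chance many times in a genome-sized sequence, which is why a match score alone is rarely sufficient evidence for a functional site.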

33 Frequency Matrices, PSSMs, and Profiles: A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores. PSSMs are also called Position Weight Matrices (PWMs) or Profiles

34 Methods for converting frequency matrices to PSSMs: Using the log ratio of observed to expected frequencies, $s(j,i) = \log \frac{m(j,i)}{f(j)}$, where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
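A minimal sketch of the log-ratio conversion follows. Two choices here are assumptions rather than part of the slide: the logarithm is taken to base 2, and a small pseudocount is added so that a character never observed at a position does not produce log(0).

```python
import math

def pssm_from_frequencies(freq, background, pseudo=0.01):
    """Convert frequencies m(j,i) to log2(m(j,i) / f(j)) scores.

    freq:       {character j: [m(j,i) for each position i]}
    background: {character j: f(j)}
    pseudo:     assumed pseudocount added to every frequency (0 disables it)
    """
    n = len(freq)  # alphabet size
    scores = {}
    for ch, column_freqs in freq.items():
        scores[ch] = [
            # Renormalise so each adjusted column still sums to 1.
            math.log2(((m + pseudo) / (1 + n * pseudo)) / background[ch])
            for m in column_freqs
        ]
    return scores

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical two-position frequency matrix.
freq = {"A": [1.0, 0.0], "C": [0.0, 0.5], "G": [0.0, 0.5], "T": [0.0, 0.0]}
pssm = pssm_from_frequencies(freq, background)
for ch in "ACGT":
    print(ch, [round(s, 2) for s in pssm[ch]])
```

Positive scores mark characters seen more often than the background predicts, and negative scores mark characters seen less often.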

35 Finding occurrences of a sequence feature using a Profile As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches For each position, we calculate a score by “looking up” the value corresponding to the base at that position

36

37 Positions (columns in alignment) versus nucleotides:
Nucleotide   1     2     3     4     5
A            x11   x21   x31   x41   x51
T            x12   x22   x32   x42   x52
G            x13   x23   x33   x43   x53
C            x14   x24   x34   x44   x54
Example sequences: TAGCT, AGTGC. Score of TAGCT = x12 + x21 + x33 + x44 + x52; if the score is above a threshold, the sequence is a site.
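The lookup-and-sum scoring can be sketched as a sliding-window scan. The matrix values below are made-up integers chosen only to keep the arithmetic easy to follow, and the function name is an assumption for this example.

```python
def scan(sequence, pssm, threshold):
    """Return (start, score) for every window whose score is above the threshold."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Look up the score of the base found at each position and add them up.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, score))
    return hits

# Toy 3-position matrix (hypothetical scores) that favours the word TAG.
toy_pssm = {
    "A": [-1, 2, -1],
    "C": [-1, -1, -1],
    "G": [-1, -1, 2],
    "T": [2, -1, -1],
}
print(scan("CCTAGTTAG", toy_pssm, threshold=5))   # [(2, 6), (6, 6)]
```

Lowering the threshold reports more candidate sites at the cost of more false positives, so in practice the threshold is usually calibrated against known sites.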

38 Building a PSSM. Inputs: set of aligned sequence features; expected frequencies of each sequence element. Process: PSSM builder. Output: PSSM

39 Searching for sequences related to a family with a PSSM. Inputs: PSSM (from the PSSM builder, given a set of aligned sequence features and the expected frequencies of each sequence element); set of sequences to search; threshold. Process: PSSM search. Outputs: sequences that match above threshold; positions and scores of matches

40 Consensus sequences vs. frequency matrices. Consensus sequence or frequency matrix: which one to use? –If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: restriction enzyme recognition sites –If some allowed characters are "better" than others, use a frequency matrix. Example: promoter sequences

41 Consensus sequences vs. frequency matrices Advantages of consensus sequences: smaller description, quicker comparison Disadvantage: lose quantitative information on preferences at certain locations

42 Shannon Entropy Expected variation per column can be calculated Low entropy means higher conservation Entropy yields amount of information per column

43 Entropy or Uncertainty. The entropy (H) for a column is $H = -\sum_a f_a \log_2 f_a$, where a is a residue and $f_a$ is the frequency of residue a in the column; $f_a \to P_a$ as N becomes large
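As a sketch, the column entropy can be computed directly from observed residue frequencies. Base-2 logarithms are assumed so that the result is in bits; the example columns are invented.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy, in bits, of one alignment column given as a string of residues."""
    total = len(column)
    # Residues that never occur contribute nothing, so only observed ones are summed.
    return sum(-(c / total) * math.log2(c / total) for c in Counter(column).values())

print(column_entropy("AAAA"))   # fully conserved column
print(column_entropy("AACC"))   # two residues, equally frequent
print(column_entropy("ACGT"))   # all four bases, equally frequent
```

A fully conserved column has zero entropy, and a DNA column with all four bases equally frequent reaches the maximum of 2 bits.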

44 Information. Information gain: $I = H_{\text{before}} - H_{\text{after}}$, where $H_{\text{before}}$ is given by the genomic composition

45 Information Content. Maximum uncertainty = $\log_2 n$ –For DNA, $\log_2 4 = 2$ –For protein, $\log_2 20$. Information content: I(x) = Maximum uncertainty – Observed uncertainty. Note: the observed uncertainty is taken after subtracting a small-sample-size correction
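A minimal sketch of per-column information content for DNA follows. It deliberately omits the small-sample-size correction mentioned in the note, and the aligned sites are hypothetical, so the numbers are illustrative only.

```python
import math
from collections import Counter

def information_content(aligned, alphabet_size=4):
    """Return I = log2(alphabet_size) - H for each column of an alignment, in bits.

    No small-sample correction is applied (a simplifying assumption).
    """
    h_max = math.log2(alphabet_size)  # maximum uncertainty, 2 bits for DNA
    result = []
    for column in zip(*aligned):
        total = len(column)
        h = sum(-(c / total) * math.log2(c / total) for c in Counter(column).values())
        result.append(h_max - h)
    return result

# Hypothetical aligned sites (not real data).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
ic = information_content(sites)
print([round(x, 3) for x in ic])
print("total bits:", round(sum(ic), 3))
```

With only a handful of sites the uncorrected values overestimate conservation, which is the reason the correction exists.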

46

47 Shine-Dalgarno Translation start site Spacer

48 Binding site regions comprise both signal (the binding site) and noise (the background). Studies have shown that the information content is above zero at the exact binding site, while in the vicinity it averages to zero. The important question is how to delineate the signal, or binding site, from the background. One possible approach is to treat the binding site (signal) as an outlier from the surrounding (background) sequences.

49 Krishnamachari et al., J. Theor. Biol., 2004

50 Assumption of independence. Prediction models assume independence between positions. Markov models of higher order require large data sets. This requires better data mining approaches
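The data requirement of higher-order Markov models can be made concrete by counting parameters: an order-k chain over a 4-letter alphabet has 4^k contexts, each with 3 free transition probabilities. The sketch below, with function names chosen for this example, shows the parameter count and how transition counts would be collected from a sequence.

```python
from collections import Counter, defaultdict

def markov_parameter_count(order, alphabet_size=4):
    """Number of free transition probabilities in a Markov chain of the given order."""
    contexts = alphabet_size ** order
    return contexts * (alphabet_size - 1)

def transition_counts(sequence, order):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1
    return counts

for k in range(6):
    print("order", k, "->", markov_parameter_count(k), "free parameters")

# Toy sequence: with order 1, the context "A" is followed by "C" twice.
print(dict(transition_counts("ACGACG", 1)))
```

Order 0 corresponds to the independence assumption; each additional order multiplies the number of parameters by four, so a small set of known sites quickly becomes too sparse to estimate them.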

51 Regulatory sequence analysis: Analysis of upstream sequences of co-regulated genes (microarray experiments); Phylogenetic footprinting; Motif discovery

52

