Sequence analysis – an overview A.Krishnamachari



Definition of Bioinformatics: the systematic development and application of computing and computational solution techniques to biological data in order to investigate biological processes and make novel observations.

Research in Biology: Organism, Functions, Cell, Chromosome, DNA, Sequences. General approach; Bioinformatics era.

Information Explosion GENOME PROTEOME TRANSCRIPTOME METABOLOME

Databases Literature Sequences Structure Pathways Expression ratios

Databases: Textual; Symbolic (manipulation possible); Numeric (computation possible); Graphs (visualization)

January Issue

Integrated Database Search Engines

COG, LocusLink, UniGene, Human–Mouse Map

Primary sequences (DNA, Protein); Structures; Expression data; Pathways. Gene: 1000; Genome: 10^8

Analysis Individual sequences Between sequences Within a genome Between genomes

Sequence Analysis: Sequence segments that have a functional role show a bias in composition and correlation. Computational methods try to capture these biases, regularities, and correlations. Scale-invariant properties.

Sequence Analysis: Sequence comparison; Pattern finding – repeats, motifs, restriction sites; Gene prediction; Phylogenetic analysis

Legend: TF -> Transcription Factor sites; TSS -> Transcription Start Site; RBS -> Ribosome Binding Site; CDS -> Coding Sequence (or gene); intergenic

Protein-DNA interactions Biological functions Regulation or Modulation Specific binding (Specified DNA pattern)

DNA binding sites: Promoter; Splice site; Ribosome binding site; Transcription factor sites; Restriction enzyme sites

The dimer is constructed such that it has twofold symmetry, allowing the recognition helix of the second protein subunit to make the same groove-binding interactions as the first. The distance between the recognition helices is 34 angstroms, which corresponds to one turn of the B-DNA double helix. This means that when the recognition helix of one subunit binds in the groove of a specific region of DNA, the second subunit's helix can also bind in the DNA groove, one turn along from the first helix.

Odd Even

DNA binding sites - Model. Experimental methods: –Footprinting experiments (DNase) –Methylation interference –Immunoprecipitation assay –Compilation and model building

TF1, TF2, TF3, TF. Design oligos covering these regions for studying promoter activity; carry out EMSA; carry out reporter assay; carry out in vivo experiments; make observations.

[Figure: reporter gene constructs carrying binding sites BS1 and BS2; measure expression]

Statement of the problem Given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur.

Reference

1. Variability is inherent in biological sequences, 2. manifesting at various length scales. 3. A statistical and probabilistic framework is ideal for studying these characteristics.

Sequence Analysis and Prediction Methods: Consensus; Position Weight Matrix (or Profiles); Computational methods –Neural Networks –Markov Models –Support Vector Machines –Decision Trees –Optimization Methods

Strict consensus - TATA; Loose consensus - (A/T)R(G/C)YG; Weight matrix or profile
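A minimal sketch of consensus searching, assuming the loose consensus is written with IUB/IUPAC codes (W for A/T, R for A/G, S for G/C, Y for C/T), so that (A/T)R(G/C)YG becomes WRSYG. The function names and the test sequences are hypothetical and chosen only for illustration.

```python
import re

# Subset of IUB/IUPAC nucleotide codes; extend as needed.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate a consensus written with IUPAC codes into a regex pattern."""
    return "".join(IUPAC[c] for c in consensus)

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence)]

strict_hits = find_sites("TATA", "GGTATATAGG")        # strict consensus
loose_hits = find_sites("WRSYG", "CCAAGCGTTTGGTGAA")  # loose consensus
```

A strict consensus is simply the special case in which every position allows exactly one base.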

Describing features using frequency matrices Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
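The definition above can be sketched as follows, assuming the aligned sequences are equal-length strings over the DNA alphabet. The function name and the four example sites are made up for illustration; the matrix is stored as one row per character (n rows) with one entry per position (m columns).

```python
from collections import Counter

def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {char: [frequency at each position]} for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    matrix = {ch: [] for ch in alphabet}
    for i in range(length):
        # Count how often each character occurs in column i of the alignment.
        counts = Counter(seq[i] for seq in aligned)
        for ch in alphabet:
            matrix[ch].append(counts[ch] / len(aligned))
    return matrix

sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]  # hypothetical aligned sites
fm = frequency_matrix(sites)
```

Each column of the result sums to 1, so a column can be read as the observed distribution of bases at that position.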

Frequency matrices (continued) Three uses of frequency matrices –Describe a sequence feature –Calculate probability of occurrence of feature in a random sequence –Calculate degree of match between a new sequence and a feature

Frequency Matrices, PSSMs, and Profiles A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores PSSMs also called Position Weight Matrixes (PWMs) or Profiles

Methods for converting frequency matrices to PSSMs: Using the log ratio of observed to expected, $s(j,i) = \log \frac{m(j,i)}{f(j)}$, where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
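A minimal sketch of the log-ratio conversion. Two things here are assumptions rather than part of the slide: the logarithm is taken to base 2, and a small pseudocount is added so that a frequency of zero does not lead to log(0). The function name and example numbers are hypothetical.

```python
import math

def pssm_from_frequencies(freqs, background, pseudo=0.01):
    """Convert {char: [freq per position]} into log2(observed/expected) scores.

    `pseudo` is an assumed smoothing constant (not in the source formula);
    frequencies are renormalised after adding it so each column still sums to 1.
    """
    n = len(freqs)  # alphabet size
    pssm = {}
    for ch, column_freqs in freqs.items():
        pssm[ch] = [
            math.log2((m + pseudo) / (1 + n * pseudo) / background[ch])
            for m in column_freqs
        ]
    return pssm

freqs = {"A": [0.5, 0.25], "C": [0.25, 0.25], "G": [0.125, 0.25], "T": [0.125, 0.25]}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
pssm = pssm_from_frequencies(freqs, background)
```

Positive scores mark bases seen more often than the background predicts, negative scores mark bases seen less often.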

Finding occurrences of a sequence feature using a Profile: As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches. For each candidate position, we calculate a score by “looking up” the value corresponding to the base at each position of the feature and summing these values.

Positions (columns in alignment) by nucleotide, where x_ij is the weight for position i and nucleotide row j:

Nucleotide   1     2     3     4     5
A            x11   x21   x31   x41   x51
T            x12   x22   x32   x42   x52
G            x13   x23   x33   x43   x53
C            x14   x24   x34   x44   x54

Example sequences: TAGCT, AGTGC. Score for TAGCT = x12 + x21 + x33 + x44 + x52; if the score is above a threshold, it is a site.
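The lookup-and-sum scoring can be sketched as below. The toy matrix, its integer scores, the sequence, and the threshold are all made up for illustration; a real matrix would come from aligned sites.

```python
def score_window(pssm, window):
    """Sum the PSSM entries selected by the base at each position of the window."""
    return sum(pssm[base][i] for i, base in enumerate(window))

def scan(pssm, sequence, threshold):
    """Return (start, score) for every window whose score is above the threshold."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_window(pssm, sequence[start:start + width])
        if s > threshold:
            hits.append((start, s))
    return hits

# Toy 3-column matrix (hypothetical scores) that favours the word AGT.
toy = {
    "A": [2, -1, -1],
    "C": [-1, -1, -1],
    "G": [-1, 2, -1],
    "T": [-1, -1, 2],
}
hits = scan(toy, "CCAGTAGTC", threshold=5)
```

The threshold trades sensitivity against false positives, so in practice it would be tuned against known sites.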

Building a PSSM: (Set of aligned sequence features + Expected frequencies of each sequence element) → PSSM builder → PSSM

Searching for sequences related to a family with a PSSM: (Set of aligned sequence features + Expected frequencies of each sequence element) → PSSM builder → PSSM; (PSSM + Set of sequences to search + Threshold) → PSSM search → Sequences that match above threshold, with positions and scores of matches

Consensus sequences vs. frequency matrices: which one to use? –If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: restriction enzyme recognition sites –If some allowed characters are "better" than others, use a frequency matrix. Example: promoter sequences

Consensus sequences vs. frequency matrices Advantages of consensus sequences: smaller description, quicker comparison Disadvantage: lose quantitative information on preferences at certain locations

Shannon Entropy: Expected variation per column can be calculated. Low entropy means higher conservation. Entropy yields the amount of information per column.

Entropy or Uncertainty: The entropy (H) for a column is $H = -\sum_a f_a \log_2 f_a$, where a is a residue and f_a is the frequency of residue a in the column; $f_a \to P_a$ as N becomes large
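A minimal sketch of the per-column entropy, assuming a column is given as a string of residues and that the logarithm is base 2 (so the result is in bits). The function name is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy, in bits, of one alignment column given as a string."""
    n = len(column)
    counts = Counter(column)
    # Residues that never occur contribute nothing, so only observed ones are summed.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

conserved = column_entropy("AAAA")  # one residue only
mixed = column_entropy("AATT")      # two residues, equally frequent
uniform = column_entropy("ACGT")    # all four bases, equally frequent
```

A fully conserved column gives zero entropy, and a column with all four bases equally frequent gives the DNA maximum of 2 bits.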

Information: Information gain $I = H_{\text{before}} - H_{\text{after}}$, where $H_{\text{before}}$ is given by the genomic composition

Information Content: Maximum uncertainty = $\log_2 n$ –For DNA, $\log_2 4 = 2$ –For protein, $\log_2 20 \approx 4.32$. Information content I(x) = maximum uncertainty – observed uncertainty. Note: corrected observed uncertainty = observed uncertainty – small-sample-size correction
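The information content per column can be sketched as maximum uncertainty minus observed uncertainty. This sketch deliberately omits the small-sample-size correction mentioned in the note, so values for small alignments will be somewhat overestimated. Function names and example sites are hypothetical.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Maximum uncertainty (log2 n) minus observed column entropy, in bits.

    No small-sample correction is applied here (a simplifying assumption).
    """
    n = len(column)
    observed = sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - observed

def information_profile(aligned, alphabet_size=4):
    """Information content of every column of a set of equal-length aligned sites."""
    return [information_content("".join(col), alphabet_size) for col in zip(*aligned)]

sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]  # hypothetical aligned sites
profile = information_profile(sites)
```

Plotting such a profile along the site is one way to see which positions carry most of the signal.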

Shine-Dalgarno Translation start site Spacer

Binding site regions comprise both signal (the binding site) and noise (the background). Studies have shown that the information content is above zero at the exact binding site and averages to zero in its vicinity. The important question is how to delineate the signal, or binding site, from the background. One possible approach is to treat the binding site (signal) as an outlier from the surrounding (background) sequences.

Krishnamachari et al., J. Theor. Biol., 2004

Assumption of independence: Prediction models assume independence. Markov models of higher order require large data sets. This requires better data mining approaches.
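One way to see why higher-order Markov models need large data sets is to count their free parameters: a model of order k over an alphabet of size n has n^k contexts, each with n - 1 free transition probabilities. The sketch below only does this counting; the function name is hypothetical.

```python
def markov_free_parameters(order, alphabet_size=4):
    """Free transition parameters of a Markov chain of the given order."""
    contexts = alphabet_size ** order        # possible preceding k-mers
    return contexts * (alphabet_size - 1)    # last probability is fixed by sum-to-one

# Parameter counts for DNA models of order 0 through 5.
dna_counts = [markov_free_parameters(k) for k in range(6)]
```

The count grows by a factor of four with each additional order for DNA, while the number of known binding sites for a given factor is often small.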

Regulatory sequence analysis: Analysis of upstream sequences of co-regulated genes (microarray experiments); Phylogenetic footprinting – Motif discovery