MF workshop 08 © Ron Shamir 1 Motif Finding Workshop Project Chaim Linhart January 2008.



Outline
1. Some background again…
2. The project

1. Background
Slides with Ron Shamir and Adi Akavia

Gene: from DNA to protein
DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein

DNA
– DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
– Resides in chromosomes
– Complementary strands: A-T ; C-G
– Forward/sense strand: 5’ AACTTGCG 3’
– Reverse-complement/anti-sense strand: 3’ TTGAACGC 5’ (written aligned under the forward strand; read 5’ to 3’ it is CGCAAGTT)
– Directional: from 5’ to 3’: (upstream) 5’ end AACTTGCGATACTCCTA 3’ end (downstream)
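The strand relationship above can be sketched in a few lines of Java. This is only an illustration: the class and method names (`DnaStrand`, `reverseComplement`) are hypothetical, and mapping every non-ACGT character to N is an assumption, not something the slides specify.

```java
final class DnaStrand {
    // Complement of one base; N (unknown/masked) and any other character map to N (assumption).
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }

    // Reverse complement: the opposite strand, read in its own 5' to 3' direction.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand on the slide, `DnaStrand.reverseComplement("AACTTGCG")` yields `CGCAAGTT`. A two-strand search can scan each sequence once as given and once through this method.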

Gene structure (eukaryotes)
[Figure: DNA with the promoter, transcription start site (TSS), exons and introns on the coding strand. Transcription (RNA polymerase) yields the pre-mRNA; splicing (spliceosome) yields the mature mRNA with 5’ UTR, start codon, coding region, stop codon and 3’ UTR; translation (ribosome) yields the protein.]

Translation
– Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation
– Stop codons: signal termination of the protein synthesis process

Genome sequences
Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs), only ~3% are genes
– ~25,000 genes
Yeast:
– 16 chromosomes
– 20 Mbps
– 6,500 genes

Regulation of Expression
– Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks
– Most genes are highly regulated: their expression is limited to specific tissues, developmental stages, and physiological conditions
– Main regulatory mechanism: transcriptional regulation

Transcriptional regulation
– Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
– TFBSs are located mainly (not always!) in the gene’s promoter: the DNA sequence upstream of the gene’s transcription start site (TSS)
– BSs of a particular TF share a common pattern, or motif
– Some TFs operate together: TF modules
[Figure: a TF bound to a BS upstream of the TSS of a gene, drawn 5’ to 3’]

TFBS motif models
[Figure: promoters of gene 1 to gene 10 with the binding-site occurrences marked]
– Binding sites: AACTGT, CACTGT, CACTCT, CACTGT, AACTGT
– Consensus (“degenerate”) string: [A/C]ACT[C/G]T
– Statistical models…
– Motif logo representation
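One simple way to go from aligned binding sites to a degenerate consensus is to count bases per position and list every base that occurs. The sketch below assumes equal-length, already aligned sites. The names (`ConsensusSketch`, `countMatrix`, `consensus`) and the bracket notation are illustrative choices, not taken from the slides or from the supplied packages.

```java
final class ConsensusSketch {
    static final String BASES = "ACGT";

    // counts[i][b] = number of sites with base BASES.charAt(b) at position i.
    static int[][] countMatrix(String[] sites) {
        int len = sites[0].length();          // assumes all sites have the same length
        int[][] counts = new int[len][4];
        for (String site : sites) {
            for (int i = 0; i < len; i++) {
                int b = BASES.indexOf(Character.toUpperCase(site.charAt(i)));
                if (b >= 0) counts[i][b]++;   // N and other characters are not counted
            }
        }
        return counts;
    }

    // Degenerate consensus: a single letter when only one base occurs, otherwise [..].
    static String consensus(int[][] counts) {
        StringBuilder sb = new StringBuilder();
        for (int[] column : counts) {
            StringBuilder seen = new StringBuilder();
            for (int b = 0; b < 4; b++) {
                if (column[b] > 0) seen.append(BASES.charAt(b));
            }
            sb.append(seen.length() == 1 ? seen.toString() : "[" + seen + "]");
        }
        return sb.toString();
    }
}
```

For the five sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT this gives `[AC]ACT[CG]T`. The same count matrix is the usual starting point for the statistical models and for drawing a logo.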

Human G2+M cell-cycle genes: The CHR – NF-Y module
TFs: NF-Y, CHR
CDCA3 (trigger of mitotic entry 1)
  CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
  TTGTGATTGGATGTTGTGGGA … [25bp] … TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
  CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG
  GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
  GCTTTCAGTTTGAACCGAGGA … [25bp] … CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
  AGCCGCCAATGGGAAGGGAG … [30bp] … AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
  TTCAGCCAATGAGAGT … [15bp] … GTGTTGGCCAATGAGAAC … [15bp] … GGGC
  CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
BSs are short, non-specific, hiding in both strands and at various locations along the promoters

The computational challenge
– Given a set of co-regulated genes (e.g., from gene expression chips)
– Find a motif that is over-represented (occurs unusually often) in their promoters
– This may be the TF binding site motif
– Find TF modules: over-represented motifs that tend to co-occur

The computational challenge (II)
– Motifs can also be found without a given target-set: “genome-wide”
– Find a motif that is localized: occurs more often near the TSS of genes
– Find a motif with a strand bias: occurs more often on the genes’ coding strand
– Find TF modules with biases in their order / orientation / distance

Motif finding algorithms
More than 100 motif finding algorithms exist. Main differences between them:
– Type of analysis & input:
  – Target-set vs. genome-wide
  – Single vs. multi-species (conservation)
  – Single motifs vs. modules
– Motif model
– Score for evaluating motif
– Motif search technique: combinatorial (enumeration) vs. statistical optimization

Example - Amadeus
Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle: CHR and NF-Y

2. The project

General goals
– Develop software from A-Z:
  – Design
  – Implementation
  – (Optimization)
  – Execution & analysis of real data
– A taste of bioinformatics
– Have fun
– Get credit…

The computational task
– Given a set of DNA sequences
– Find “interesting” pairs of motifs:
  – Order bias
  – Other scores…
– Main challenges:
  – Performance (time, memory)
  – Output redundancy

Input
File with DNA sequences in “fasta” format:
>sequence-name1 [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag

Input (II)
– Ignore the header lines
– A sequence may span multiple lines or one long line
– A sequence contains the characters A, C, G, T, N in upper or lower case
– “N” means unknown or masked base
– Sample input files will be supplied
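The input rules above translate into a small parser. The following is a minimal sketch, not the required design: the class name `FastaSketch` is hypothetical, headers are discarded as the slide asks, and folding any character other than A, C, G, T into N is an assumption made here to keep later code simple.

```java
final class FastaSketch {
    // Returns one upper-case sequence per FASTA record; header lines are ignored.
    static java.util.List<String> parse(Iterable<String> lines) {
        java.util.List<String> sequences = new java.util.ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.charAt(0) == '>') {
                // Header line: close the previous record and start a new one.
                if (current != null) sequences.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                // Sequence line; a record may span several such lines.
                for (char c : line.toUpperCase().toCharArray()) {
                    // Keep A, C, G, T; everything else becomes N (assumption).
                    current.append("ACGT".indexOf(c) >= 0 ? c : 'N');
                }
            }
        }
        if (current != null) sequences.add(current.toString());
        return sequences;
    }
}
```

The lines could come, for example, from `java.nio.file.Files.readAllLines(path)`. For very large inputs a streaming reader would avoid holding the raw file and the parsed sequences in memory at the same time.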

Input (III)
Search parameters:
– Length of motifs (between 5 and 10)
– Min. + max. distance between the motifs:
  ACGGATTGATNNNTGGATGCCAT
  distance=9
– Single vs. two strands search
– Min. number of occurrences (hits) of pair:
  GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA
  [three hits are marked on this sequence in the slide]
– Max. p-value
– Additional parameters…
Note: don’t count overlaps, e.g. AAAAAA

Output
A. A list of the string pairs with the best order-bias score (smallest p-values):

Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT                 E-15
ACGTT     GATTC                 E-13
TTAAC     CAGCC                 E-12

B. A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

Part A: String pairs with order bias
– $n_A$ = # of A → B ; $n_B$ = # of B → A
– WLOG, $n_A > n_B$
– $n = n_A + n_B$
– $H_0$ = random order: $n_A \sim B(n, 0.5)$
– p-value = probability of at least $n_A$ occurrences of A → B = tail of $B(n, 0.5)$
– Normal approximation (central limit theorem)
– Fix for multiple testing: ×2
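The order-bias score can be sketched as follows. Note two deliberate choices that are assumptions of this sketch rather than the slide's prescription: it sums the exact binomial tail instead of using the normal approximation the slide mentions (only the z-score of that approximation is shown, with a continuity correction), and it reads "×2" as doubling the one-sided tail because either order could have been the larger one. All names are hypothetical.

```java
final class OrderBiasSketch {
    // Exact upper tail P(X >= k) for X ~ B(n, 0.5), with terms built up in log space.
    static double binomialTailHalf(int n, int k) {
        if (k <= 0) return 1.0;
        if (k > n) return 0.0;
        double logTerm = -n * Math.log(2.0);   // log of C(n, 0) * 0.5^n
        double tail = 0.0;
        for (int i = 0; i <= n; i++) {
            if (i >= k) tail += Math.exp(logTerm);
            // C(n, i + 1) = C(n, i) * (n - i) / (i + 1)
            if (i < n) logTerm += Math.log(n - i) - Math.log(i + 1);
        }
        return Math.min(1.0, tail);
    }

    // Order-bias p-value: tail at the larger count, doubled (assumed meaning of "x2").
    static double orderBiasPValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * binomialTailHalf(n, larger));
    }

    // z-score for the normal approximation, with a 0.5 continuity correction (assumption).
    static double zScore(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return (larger - 0.5 - n / 2.0) / (Math.sqrt(n) / 2.0);
    }
}
```

For example, a pair seen 10 times as A → B and never as B → A gets $2 \cdot 0.5^{10} \approx 0.00195$. The exact sum costs O(n) per pair, so for a time-critical Part A the normal approximation, or a precomputed table of tails, may be the better trade-off.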

Part B: Non-redundant list of motif pairs
– Collect similar strings into a motif with a better score (motif = consensus):
  String pair (p-value):
    ACGTT, GGATT (4.3E-15)
    ACGAT, GGATT (2.4E-11)
    AGGAT, GGTTT (1.7E-5)
    AGGTT, GGTTT (5.9E-5)
  Motif pair (p-value): [motif logos] (8.1E-31)
– Don’t report similar motif pairs:
  – Motifs that consist of similar strings
  – Motif pairs that are small shifts of one another
  – Palindromes

Part B (cont.): Additional score
Option I: Co-occurrence rate
– $N$ = total # of sequences
– $s_A$ = # of sequences that contain motif A (and likewise $s_B$ for motif B)
– $s_{AB}$ = # of sequences that contain motifs A and B
– $H_0$ = motifs occur independently and randomly
– p-value = probability of at least $s_{AB}$ joint occurrences, given the number of hits of each single motif = tail of the hypergeometric distribution
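A hypergeometric tail of this kind can be computed directly from log-factorials. The sketch below is illustrative only: the names are hypothetical, the count of sequences containing motif B (here `sB`) is an input the slide text does not spell out, and the statistical package supplied for the workshop may already provide an equivalent routine.

```java
final class CoOccurrenceSketch {
    // lf[i] = ln(i!)
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) lf[i] = lf[i - 1] + Math.log(i);
        return lf;
    }

    // ln of the binomial coefficient C(n, k), for 0 <= k <= n.
    static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    // P(X >= sAB) where X counts sequences with both motifs, if the sB sequences
    // containing B were a random subset of all `total` sequences, sA of which contain A.
    static double hypergeometricTail(int total, int sA, int sB, int sAB) {
        double[] lf = logFactorials(total);
        int kMin = Math.max(sAB, Math.max(0, sA + sB - total));
        int kMax = Math.min(sA, sB);
        double logDenominator = logChoose(lf, total, sB);
        double tail = 0.0;
        for (int k = kMin; k <= kMax; k++) {
            tail += Math.exp(logChoose(lf, sA, k)
                    + logChoose(lf, total - sA, sB - k)
                    - logDenominator);
        }
        return Math.min(1.0, tail);
    }
}
```

With 10 sequences, 5 containing A, 5 containing B and all 5 shared, the tail is 1/C(10,5) = 1/252. The sketch assumes `0 <= sA, sB <= total`; computing the log-factorial table once and reusing it across motif pairs would avoid repeated work.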

Part B (cont.): Additional score
Option II: Distance bias
– Is the distance between the two motifs uniform ($H_0$), or are there specific distances that are very common?
Option III: Gap variability
– Are the sequences between the motifs conserved ($H_0$), or are they highly variable?
Other options??

Implementation
– Java (Eclipse) ; Linux
– GUI: simple graphical user interface for supplying the input parameters and reporting the results
– Packages for motif logo and statistical scores will be supplied
– Time performance will be measured only for part A
– Reasonable documentation
– Separate packages for data-structures, scores, GUI, I/O, etc.

Design document
– Due in 3 weeks (Feb 24)
– 3-5 pages (Word), Hebrew/English
– Briefly describe main goal, input and output of program
– Describe main data structures, algorithms, and scores for parts A+B
– Meet with me before submission

Fin