Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics: Pattern Recognition and Gene Finding

Presentation transcript:

Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics Pattern Recognition and Gene Finding Today is the last class. Would you please tell students: 1. Please submit all assignments. The last assignment is due today because your class has no assignment. 2. Please finish the course evaluation.

Apr 4 May 2

An alternative (Through software tools)

Lives of the Scientist

World’s Greatest Explorer

Expect = 4e-98 World’s Greatest Musicologist

World’s Greatest Microbiologist

TACACCAGAT ATTGATGTCG TTTTGATGGA TGTAATGATG CCAGAAATGG ACGGTTACGA AACAACAAGC TTAATCCGCC AAAACGAGCA ATTTAAATCT TTGCCGATTA TTGCACTGAC AGCTAAAGCC ATGCAAGGCG ATCGCGAGAA GTGTATTGAA GCGGGTGCAT CAGACTACAT CACCAAACCC GTAGATACTG AACAACTGCT TTCACTCTTG CGTGTTTGGC TATACCGTTA ATTGGGGCAG GGGGCAGGGA GCCGTTGCAA CTATTTCAAC CCTAATAGGG ATTTTGATGA ATTGCAATTC CTCCTTCCTC TGGCTCTGCC ACCGTTCAGC AACTTGGTTT CAATCCCTGA TAGGGATTTT GATGAATTGC AATATATTAT TTCACAACTG GTAAAAACGC TAAAGGTTTA GTTTCAATCC CTGATAGGGA TTTTGATGAA TTGCAATGTT AAACTGGTCT GCTTTGCCGA TACCCAAATA TTGCTAGGTT TCAATCCCTG ATAGGGATTT TGATGAATTG CAATGAAATC AGAAACATCT TTGATTTTTT TGACCATGTT TCAATCCCTG ATAGGGATTT TGATGAATTG CAATTTTTTG GGGAAGAGGT AATCTGAAAC AGAATTTAGT ATTTGTTTCA ATCCCTGATA GGGATTTTGA TGAATTGCAA TGTTGTTACT TAATCCGTCA AATAGTCCCA TTAGATGTTT CAATCCCTGA TAGGGATTTT GATGAATTGC AATTTTGTGT TACTTGAATT ACTTTGTTGT AATATGCTGG TTTCAATCCC TGATAGGGAT TTTGATGAAT TGCAATCAGC AACGTATGCT GTGGGATGCT GGATATGCAC GTTTCAATCC CTGATAGGGA TTTTGATGAA TTGCAATTTG CATATCTCCA TCCAACTGTA TTCAGCTGAA AAGTTTCAAT CCCTGATAGG GATTTTGATG AATTGCAATC TTCGGCATAA CCATTCTTCC ACCTCCAGTA

AATAAAGCTTTACAAA CCAAACTCTGGCTTCA ATTGTGTAACCCAAGC TTTGATTCTTTCCTCTG TTAAATCGGATTGATT ATCTTCATCAAGGGCA AGACCTACAAATTTAC CATCACGAACAGCTTT AGACTCACTGAATTCA TAACCTTCTGTAGGCC AATAGCCAACTGTTTC ACCACCATTTTCTGAA ATTTTTTCCTCTAGAAT ACCGAGGGCATCTTGA AATGTATCAGGATAAC CAACCTGGTCTCCAGG AGCAAAATAAGCAAC TTTTTTGCCGATGAAGT CAATGTTATCTAACTC ATCATAAAAATTTTCC CAATCACTTTGCAATT CTCCAACATTCCAGGT AGGACAACCAACAAC GATATAATCGTAGTTA TTGAAATCACTTGGTT CAGCTTGTGAAATATC ATATAAAGTTACAACA CTATCACCACCAAACT CCTTCTGAATTATTTCT GATTCAGTTTGGGTATT GCCTGTTTGAGTACCA AAAAATAAACCAATA TTAGACATTTTTACTCC TTTTATGTATTTGCAAA ATTATTTCAATTAAAA TATTTAGTAATAATTA ATTGTTAGCTAGCTAA TAATTAAATTTTTATTA CAATCATTGTAAAAGG CATTGAAAAAGTAAAT AAAAATTTTTATTCTAC GTTATTTCAAAAATAT TTACTTACATATACTTA ACCTTTATAGTGATGT AATATACTCTAATTCC TATTTTACTTATAAATA CCATCTCAGCTTAATG TAACGAATTTTTCTGTT TATCTTTAAATACAAA AAATTCAACAAAACTA CAGAAAATTAATCTTA ATAACACAAAACAAG TATCAATCTGTAATAC AACTAAGCTTAAATAA ATTAATAGAAAGCTTC ATCTATCTAATAGGTT GAGAATAGTTTATGTC TAATGACATAAATTCA TTCGTGTTGATTTCATT TGGGTATATTCATCTG ATTTAGGATTTACTCC ATTAAGTTTGTACTCAT CAATGCCCGCCTGTTG GTATCCACAATTCTCA TACAGTGCGCGAGCAA AGTAATCAATCGTTCG TCGCCATATCTAACTTT GAGTCAAACAAACCA GTTGGATTACCAACCC TCAACTAATCGCTTCTT TAAGGCGAGCGATCGC ACATTTAACTGTTGGTT GTCACAAGAGAACTA ATACTACAGCAGTATA TTTAACAACTAAGGGT GGTTCAACTTTCGCTG CGACTCCTCCAACGCG CTGAAATACACAGGA CTGATGCGATCGCAAA CTCTTTGACTAAATTCC ATACATTATCATGACC ATCTCCCAAACAAACA AGTGGGTTAACCAGAT GCTGACTATTAACATC CCCTGAGTTCGGAGTT GTAGGTCTATTTGACT GGTTCAAAGCGATGAT GGAACGGCTTTGTTGC ATGAATTAAAAAAAG ACACACCATCACCTAC TTCTAGGATAGACACA TCAAACGTCCCACCGC CTAAGTCAAATACCAA GATAATTTCGTTAGTTT TCTTGTCAAGTCCGTA AGCGAGGGCCGCCGC CGTGGGCTAGTTGATA ATTCGCAGAACTTTAA TCCCGGCAATTCTACT GGCATCTTTGGTAGCC TGCCGTTGAGAGTCAT TGAAATAGGCAGGGG TGGTAATTACCGCTTG CCTCACTGGTTCCCCC AGATATGTGCTGGCAT CATCTATCAGCTTGCG GACTACCTCATACCAT TTCACGAAAAACCTGA TACACATGTAAACTCT GAAACCCTTGCTGTAT CAAAGTTTTGTAATTA CGAATTACGAATTACG AATTGATATCAGCCGA GATTTCTTCGGGTGAA AATTCCTTGTTCAGAG CGGGACAGTGTAGCTT GACATTGCCATTACTG 
TCACGTACCACTTTGT AAGTAACTTGTTTTGC CTCTTGCGTAACTTCAT CATACCTGCGCCCGAT GAACCGCTTCACAGAA TAAAAAGTGTTTTCTG GGTTCATTACACCCTG GCGCTT

Expect = 4e-98 TCTACTTATA TTCAATCCAC AGGGCTACAC AAGAGTCTGT TGAATGAACA CATACATGGT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA Globin Blast

Program the computer

Biology researchers do not program Program the computer 10 Biology and Microbiology Depts at major universities

Why hasn't it happened? Programming languages An alternative

Lives of the Scientist (Part II)

Repeated sequences bacterial genomes REP sequences genes Genome of E. coli K12 str MG1655

Algorithm to extract REP sequences
Pattern: "repeat_region...(#...)**(#...)*..'(*..)'"
Special symbols:
...   As many of previous character as possible
#     A single digit
()    Capture what's inside
*     Any character
..    As few of previous character as necessary
'' or ''
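For readers more used to conventional regular expressions, the pattern above can be approximated in Python. This is only a sketch under stated assumptions: the transcript lost the spaces after "repeat_region" and the exact quote characters, so the translation assumes a run of spaces and accepts either quote style. The function name `extract_repeat_regions` and the sample annotation text are invented for illustration and are not taken from the E. coli file.

```python
import re

# Assumed translation of the slide's special symbols:
#   ...  -> greedy repetition of the previous character (here, a space)
#   #    -> \d  (a single digit)
#   ( )  -> a capture group
#   *    -> .   (any character)
#   ..   -> lazy repetition (as few as necessary)
REP_PATTERN = re.compile(
    r"repeat_region +(\d+)..(\d+).*?['\"](.*?)['\"]",
    re.DOTALL,  # let "any character" also cross line breaks
)

def extract_repeat_regions(text):
    """Return (start, end, description) for each repeat_region entry."""
    return [(int(a), int(b), note) for a, b, note in REP_PATTERN.findall(text)]

# Hypothetical GenBank-style excerpt, invented for this example.
sample = '''     repeat_region   100..150
                     /note="REP example one"
     gene            200..400
     repeat_region   500..560
                     /note="REP example two"
'''

for region in extract_repeat_regions(sample):
    print(region)
```

The three capture groups correspond to the three parenthesized parts of the slide's pattern: the two coordinates and the quoted description.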

We start Go to: Click: MICR 653

Click MICR 653 Using Firefox

biobike.csbc.vcu.edu

Function palette Workspace Results window

General Syntax of BioBIKE
[Function box: Function-name | Argument (object) | Keyword object | Flag]
The basic unit of BioBIKE is the function box. It consists of the name of a function, perhaps one or more required arguments, and optional keywords and flags. A function may be thought of as a black box: you feed it information, it produces a product.

General Syntax of BioBIKE
Function boxes contain the following elements:
Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
Argument: Required, acted on by function
Keyword clause: Optional, more information
Flag: Optional, more (yes/no) information

General Syntax of BioBIKE
… and icons to help you work with functions:
Option icon: Brings up a menu of keywords and flags
Clear/Delete icon: Removes information you entered or removes box entirely
Action icon: Brings up a menu enabling you to execute a function, copy and paste information, get help, etc.

Functions Sin Angle Sin (angle)

Functions: Length (Entity)
Examples shown on the slide: "icahLnlna bormA", 14, Abraham Lincoln, "Abraham Lincoln", US-presidents, 44, ( …)
Contrasts illustrated:
variable vs literal
list vs single value
single application of a function vs iteration of a function
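The same three contrasts can be illustrated outside BioBIKE. The sketch below uses Python's built-in `len` as a stand-in for the Length function; the variable value and the shortened list are hypothetical and chosen only for illustration.

```python
# A literal string: the function acts on the characters themselves.
print(len("Abraham Lincoln"))

# A variable: the function acts on whatever value the name refers to.
# (Hypothetical value, chosen only for illustration.)
abraham_lincoln = "16th president of the United States"
print(len(abraham_lincoln))

# A list: the length is the number of items, not the number of characters.
# (Shortened, hypothetical list; the slide's US-presidents list is longer.)
us_presidents = ["George Washington", "John Adams", "Thomas Jefferson"]
print(len(us_presidents))

# Iteration: apply the function to each item of the list in turn.
name_lengths = [len(name) for name in us_presidents]
print(name_lengths)
```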

Functions: Arcsin (Sin (angle))
Nested functions
Evaluated from the inside out
A box is replaced by its value
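Inside-out evaluation can be spelled out step by step. This is a minimal sketch using Python's `math` module rather than BioBIKE boxes; the angle value is arbitrary.

```python
import math

angle = 0.5  # radians; an arbitrary value between -pi/2 and pi/2

# Step by step: the inner box is evaluated first and replaced by its value.
inner_value = math.sin(angle)
outer_value = math.asin(inner_value)

# The same computation written as a single nested expression.
nested_value = math.asin(math.sin(angle))

print(inner_value, outer_value, nested_value)
```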

Gene (npf0076) Functions "transposase"

Gene (npf0076) Functions Nested functions Evaluated from the inside out A box is replaced by its value

Gene (npf0076) Pitfalls (the most common error in the language) CLOSE BOXES BEFORE EXECUTING White is incompatible with execution

Mining files for data
Pattern matching: Works great; Highly flexible; Quick and easy
BUT... Unforgiving (1 mismatch → death)

Conserved motifs of methyltransferases
Pattern: "[DS]PP[YF]"
Special symbols:
[ ]   Character set
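The character-set notation carries over directly to Python's `re` module, so the motif search can be sketched as follows. The helper name `find_motif` and the protein fragments are hypothetical, invented for this example.

```python
import re

# The slide's pattern, unchanged: D or S, then P, then P, then Y or F.
MOTIF = re.compile(r"[DS]PP[YF]")

def find_motif(protein):
    """Return (position, matched text) for every occurrence of the motif."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]

# Hypothetical protein fragments, invented for this example.
print(find_motif("MKVLDPPYGAA"))  # contains DPPY
print(find_motif("MKVLSPPFGAA"))  # contains SPPF
print(find_motif("MKVLNPPYGAA"))  # N is not in [DS], so nothing is found
```

The last call shows how unforgiving exact pattern matching is: a single residue outside the character set and the site is missed entirely.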

Searching for conserved motifs
Pattern matching: Ignores lots of information; Unforgiving (1 mismatch → death); Quick and easy
Position-specific scoring matrices (PSSMs): Needs training set
What if you don’t have one?
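To make the contrast with exact pattern matching concrete, here is a minimal PSSM sketch. It assumes a uniform background of 0.25 per base and a pseudocount of 1, neither of which comes from the slides, and the training sites are invented TATA-like examples rather than real data.

```python
import math

BASES = "ACGT"

def build_pssm(training_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds scoring matrix from equal-length aligned sites."""
    pssm = []
    for pos in range(len(training_sites[0])):
        column = [site[pos] for site in training_sites]
        total = len(column) + pseudocount * len(BASES)
        scores = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_site(pssm, site):
    """Sum the position-specific scores of one candidate site."""
    return sum(column[base] for column, base in zip(pssm, site))

def best_hit(pssm, sequence):
    """Return (score, position) of the best-scoring window in a sequence."""
    width = len(pssm)
    hits = [(score_site(pssm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Hypothetical training set, invented for this example.
training = ["TATAAA", "TATATA", "TATAAG", "TTTAAA", "TATAAT"]
pssm = build_pssm(training)

print(score_site(pssm, "TATAAA"))  # matches the consensus
print(score_site(pssm, "TACAAA"))  # one mismatch: lower score, not "death"
print(score_site(pssm, "GCGCGC"))  # unrelated site
print(best_hit(pssm, "GCCCTACCCTATATAAGGCC"))
```

Unlike the pattern, the matrix degrades gracefully: a one-mismatch site still scores well above an unrelated one. The cost is that the matrix has to be trained on known sites.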

Lives of the Scientist (Part III)

New pattern discovery (MEME, Gibbs sampler, BioProspector)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin           GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E             TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14              GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17              TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19    ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
a'-tubulin ba'1     GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
b'-tubulin b'2      GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
a'-actin skel-m.    CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
a'-cardiac actin    TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
b'-actin            CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
What to do with no training set?
“TATA box”?

How does MEME work?
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
    GACAGGGCAGAA
    GCCCGGGTGTTT
    GCCGGGGACGCG
    GCCCCCGGGCCT
    GCCGCAGAGCTG
    [Frequency table with rows A, C, G, T]
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps
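The steps above can be sketched in a few lines of Python. This is a greedy simplification for illustration only, not the MEME program itself, which is based on expectation maximization. The mismatch-count matching rule, the pseudocount, the uniform background, and all function names are assumptions, and the input sequences are invented with a planted TATAAA site.

```python
import math

BASES = "ACGT"

def windows(seq, width):
    """All substrings of the given width, left to right."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def best_match(candidate, seq):
    # Step 2: the window of seq with the fewest mismatches to the candidate.
    return min(windows(seq, len(candidate)),
               key=lambda w: mismatches(candidate, w))

def frequency_table(matches, pseudocount=0.5):
    # Step 3: per-position base frequencies among the matched sites.
    table = []
    for column in zip(*matches):
        total = len(column) + pseudocount * len(BASES)
        table.append({b: (column.count(b) + pseudocount) / total
                      for b in BASES})
    return table

def log_odds(table, matches, background=0.25):
    # Step 4: how much better the table explains the matches than chance.
    return sum(math.log2(column[base] / background)
               for site in matches
               for column, base in zip(table, site))

def discover_motif(sequences, width):
    best = None
    # Steps 1 and 6: try each window of the first sequence as the candidate.
    for candidate in windows(sequences[0], width):
        matches = [best_match(candidate, s) for s in sequences]
        table = frequency_table(matches)
        score = log_odds(table, matches)
        # Step 5: remember the highest-scoring pattern seen so far.
        if best is None or score > best[0]:
            best = (score, candidate, matches)
    return best

# Hypothetical sequences, invented for this example, each containing TATAAA.
planted = ["GCGCTATAAAGCCG", "CCGGCGTATAAACG",
           "TATAAACGGCGGCC", "GGCCGGCTATAAAG"]
score, candidate, matches = discover_motif(planted, 6)
print(candidate, matches, round(score, 2))
```

Because no training set is supplied, the frequency table is built from the data themselves, which is why the approach can be described as a PSSM in reverse.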

Searching for conserved motifs
Pattern matching: Ignores lots of information; Unforgiving (1 mismatch → death); Quick and easy
Position-specific scoring matrices (PSSMs): Needs training set
MEME, Gibbs sampler, et al. (PSSM in reverse): Relatively unbiased; Can't easily handle variable-length gaps
DETAILS

Moral of the Stories

Are you comfortable using programming in the service of your research?
None… This is beyond my responsibilities in the lab.
I have zero experience in computer programming before this class
I am about 60% confident in using Python
I have experience using Python, Java, Unix & DOS environments, R, MySQL/SQL, and SAS
I have no experience in programming
I have had some R experience… However, I am still a novice

Click MICR 653 Using Firefox

Scientific Questions I. What determines the beginning of a gene?

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? HIV

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs)
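For Question III, one simple way to locate short tandem repeats is a regular expression with a backreference. This is a sketch only: the unit-length range, the minimum repeat count, the function name `find_strs`, and the test sequence are all assumptions made for illustration.

```python
import re

def find_strs(sequence, unit_min=2, unit_max=6, min_repeats=3):
    """Return (start, repeat unit, copy number) for each tandem repeat found."""
    # ([ACGT]{m,n}?) captures the shortest usable unit;
    # \1{k,} then requires at least k further copies immediately after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_repeats - 1)
    )
    return [(m.start(), m.group(1), len(m.group()) // len(m.group(1)))
            for m in pattern.finditer(sequence)]

# Hypothetical sequence with four tandem copies of AGAT, invented for this example.
print(find_strs("GGCATCAGATAGATAGATAGATACCGT"))
```

Like any exact pattern match, this misses repeats whose copies differ by even one base.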

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data

Metabolic correlates to N-deprivation What enzymes of carbon metabolism are affected by N-starvation? Cyanobacteria use primarily the reactions of the Pentose Phosphate Pathway to break down glucose derivatives. They use carbon fixation reactions to build glucose. These sets overlap a great deal. Carbon fixation Pentose Phosphate Pathway Glycogen metabolism

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data RNAseq

Measuring RNA through Microarrays
Spot (image courtesy of Inst. für Hormon- und Fortpflanzungsforschung, Universität Hamburg)
RNA from cell type #1 + RNA from cell type #2
Scan for red fluorescence; scan for green fluorescence; combine images
Type #1 RNA > Type #2 RNA
Type #2 RNA > Type #1 RNA
Type #1 RNA ≈ Type #2 RNA

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data Difference in intensity chip to chip different conditions or different replicates

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data V. CRISPRs in enteric bacteria GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACAATAACTGGATAGCACTAGCAGAAGGGCTAGAAGGTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACGTAT
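The repeat structure in the sequence above can be exposed by indexing every k-mer and keeping those that occur more than once. This is a minimal sketch, not a CRISPR finder: the function name `repeated_kmers` and the choice of k are assumptions, and splitting the string into repeat and spacer pieces is only an annotation of the slide's sequence.

```python
from collections import defaultdict

def repeated_kmers(sequence, k):
    """Map each k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

# The sequence shown on the slide, split here only for readability.
slide_sequence = (
    "GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAAC"  # first copy of the repeat
    "AATAACTGGATAGCACTAGCAGAAGGGCTAGAAG"     # intervening spacer
    "GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAAC"  # second copy of the repeat
    "GTAT"
)

for kmer, pos in repeated_kmers(slide_sequence, 37).items():
    print(kmer, pos)
```

Lowering k reports every overlapping sub-window of the repeat as well, so in practice one would scan from large k downward or merge adjacent hits.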

Scientific Questions VI. Finding targets for DNA-binding proteins

Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data V. CRISPRs in enteric bacteria VI. Finding targets for DNA-binding proteins (targets known) VII. Finding targets for DNA-binding proteins (genes known)