

Some Ideas on Final Project

Feature extraction
Sequences with class labels (p or n):
TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n
CCGATACCCCCCACGCGAAACCCTACACATCAAAT p
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n
AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p
GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n
ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n
CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n
Each sequence becomes one row of k-mer features plus its label, e.g. with 4-mers:
AAAA  AAAC  …  AACG  AACT  …  ACGG  …  TTTT  Label
0     0     …  1     0     …  1     …  0     n
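The feature row above can be produced with a few lines of Python. This is a minimal sketch, assuming the features are occurrence counts of every possible k-mer (the slide does not say whether counts or presence/absence are intended); the function name `kmer_features` is hypothetical.

```python
from itertools import product

def kmer_features(seq, k=4, alphabet="ACGT"):
    """Return (names, counts) for all len(alphabet)**k k-mers, in lexicographic order.

    Assumption: features are overlapping occurrence counts.
    """
    names = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(names, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with characters outside the alphabet are skipped
            counts[kmer] += 1
    return names, [counts[name] for name in names]

names, values = kmer_features("TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT")
row = dict(zip(names, values))
print(row["AAAA"], row["AACG"], row["AACT"], row["ACGG"], row["TTTT"])
```

With k = 4 this yields 256 features per sequence; the number grows as 4^k, which is why larger k quickly becomes expensive.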

K-mer based features
Exact k-mers: k = 1 to 10
Larger k
– More accurate representation: accurate classification
– More features: overfitting => lower accuracy
– Less efficient (both time and memory)
– Feature selection may be useful
Smaller k
– Fewer features: less likely to overfit, shorter running time
– Less accurate representation: inaccurate classification
– More efficient (both time and memory)
K-mers with degenerate characters (wildcards)
K-mers with mismatches
Two-block k-mers: e.g., ACGN{1-3}CGT
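Degenerate and two-block k-mers can be matched with regular expressions. The sketch below assumes that `N{1-3}` means "one to three arbitrary nucleotides" and that degenerate characters follow the standard IUPAC nucleotide codes; `pattern_to_regex` and `count_matches` are hypothetical helper names.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def pattern_to_regex(pattern):
    """Translate e.g. 'ACGN{1-3}CGT' into the regex 'ACG[ACGT]{1,3}CGT'."""
    parts = []
    for token in re.finditer(r"([A-Z])(?:\{(\d+)-(\d+)\})?", pattern):
        base, low, high = token.groups()
        parts.append(IUPAC[base])
        if low is not None:  # optional repeat range such as {1-3}
            parts.append("{%s,%s}" % (low, high))
    return "".join(parts)

def count_matches(pattern, seq):
    """Count start positions in seq where the pattern matches (overlaps allowed)."""
    lookahead = "(?=%s)" % pattern_to_regex(pattern)
    return sum(1 for _ in re.finditer(lookahead, seq))

print(pattern_to_regex("ACGN{1-3}CGT"))
print(count_matches("ACGN{1-3}CGT", "TTACGAACGTTT"))
```

The resulting count can be used as one feature column, in the same way as an exact k-mer count.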

PWM-based features
Can represent all forms in the previous slides in some way
Key: how to get them?
– Cluster top-ranked k-mers
– Use existing tools (e.g., MEME or AlignACE)
    Find motifs from top positive sequences (for efficiency)
    Use MAST (in MEME) or ScanACE (in AlignACE) to find matches in all sequences
– HMMs for learning and classification
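To illustrate the "cluster top-ranked k-mers" route, the sketch below builds a log-odds PWM from a set of equal-length k-mers that are assumed to be already clustered and aligned, then scores a sequence by its best-matching window. The pseudocount, the uniform background of 0.25, and the function names are illustrative assumptions, not something the slides specify.

```python
import math

def build_pwm(kmers, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (list of dicts, one per position) from aligned k-mers."""
    width = len(kmers[0])
    total = len(kmers) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [kmer[pos] for kmer in kmers]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_pwm_score(pwm, seq):
    """Score every window of seq against the PWM and return the maximum.

    Assumes seq contains only A, C, G, T and is at least as long as the PWM.
    """
    width = len(pwm)
    return max(
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    )

pwm = build_pwm(["ATTGCG", "ATTGCG", "GTTGCG"])
print(best_pwm_score(pwm, "CCATTGCGCC"))
```

The best-window score can serve directly as a numeric feature for a classifier, one column per PWM.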

Submission data format
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC  n  0.005
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT  n  0.345
AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT  p  0.985
GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC  n  0.489
CTCATATCCTACGCGGCCCCAACTATTAGCTCATG  p  0.523

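Input format for experiment
Attributes: AAAA, AAAC, AAAG, …, TTTT (numeric), followed by class.
Each data row holds the sequence, its feature values, and the class label:
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC …. p
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT …. n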
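Since the later slides pass .arff files to Weka, a feature table like this one can be written out in ARFF form. The sketch below is one possible writer, assuming numeric features and a nominal class with values p and n; the relation name and the function name `to_arff` are hypothetical, and the sequence string itself is left out so that no string attribute ends up in the file.

```python
def to_arff(relation, feature_names, rows):
    """Render an ARFF document as a string.

    rows is a list of (feature_values, label) pairs, with label 'p' or 'n'.
    """
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in feature_names]
    lines += ["@attribute class {p,n}", "", "@data"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"

text = to_arff("kmer_counts", ["AA", "AC"], [([1, 0], "p"), ([0, 2], "n")])
print(text)
```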

Output format from weka
With option -p 0 -distribution (you could try -p 1 to get the seq printed)
inst#  actual  predicted  error  distribution
1      2:p     1:n        +      *1,0
2      2:p     1:n        +      *0.868,
       :n      1:n               *1,0
…
12     1:n     1:n               *0.996,0.004
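Prediction lines in this layout are easy to parse in a script. The following sketch handles only the `-p 0 -distribution` layout shown above (no extra attribute columns) and assumes that the distribution is listed in class-index order, so that with `1:n` and `2:p` the second value is the probability of class p. Both function names are hypothetical.

```python
def parse_prediction(line):
    """Parse one prediction line into (inst, actual, predicted, distribution)."""
    fields = line.split()
    inst = int(fields[0])
    actual = fields[1].split(":", 1)[1]      # '2:p' -> 'p'
    predicted = fields[2].split(":", 1)[1]   # '1:n' -> 'n'
    # The last field is the distribution; '*' marks the predicted class.
    distribution = [float(x.lstrip("*")) for x in fields[-1].split(",")]
    return inst, actual, predicted, distribution

def positive_probability(distribution, positive_index=1):
    """Assumption: class p is the second class, as in the sample output."""
    return distribution[positive_index]

inst, actual, predicted, dist = parse_prediction("12 1:n 1:n *0.996,0.004")
print(inst, actual, predicted, positive_probability(dist))
```

The predicted label together with the class-p probability gives one label and one score per sequence, which is the kind of pair the submission format asks for.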

More about “-p 1”
With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes.
Solution: use an unsupervised attribute filter to remove it:
    java -cp weka.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1
This has the same effect as calling
    java -cp weka.jar weka.classifiers.trees.J48 -C 0.1 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution
(but without having the seq as the first attribute in the .arff file. And of course you won’t have the seq in your output file.)

About MEME
Input seqs in FASTA format:
>seq1
ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC
>seq2
CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA
>seq3
CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA
>seq4
CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT
>seq5
GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA
>seq6
GATCGTGATGCATTGACTTGCGTAATAGTGTTATG
>seq7
GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC

Running MEME
/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt
tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something different: e.g. randomly split the positive seqs into 10 subsets)
-dna: input is DNA seq
-minw 6: minimum motif length = 6
-maxw 15: maximum motif length = 15
-minsites 10: motif occurs in at least 10 sequences
-mod zoops: each seq contains zero or one copy of the motif
-nmotifs 5: return top 5 motifs
-text: output is in plain text format (as opposed to HTML format)
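The suggestion to randomly split the positive sequences into 10 subsets could look like the sketch below, which also renders each subset as FASTA text for MEME. The function names, the fixed random seed, and the `seq1`, `seq2`, ... naming scheme are assumptions for illustration.

```python
import random

def split_into_subsets(seqs, n_subsets=10, seed=0):
    """Shuffle the sequences and deal them into n_subsets roughly equal groups."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

def to_fasta(seqs, prefix="seq"):
    """Render sequences as FASTA text with headers >seq1, >seq2, ..."""
    return "".join(
        ">%s%d\n%s\n" % (prefix, i, s) for i, s in enumerate(seqs, start=1)
    )

positives = ["ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC",
             "CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA",
             "CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA"]
for subset in split_into_subsets(positives, n_subsets=2):
    print(to_fasta(subset))
```

Each FASTA string can be written to its own file and passed to MEME in place of tf_1_top_100p.fasta.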

Output from MEME
********************************************************************************
MOTIF 1    width = 10    sites = 99    llr = 717    E-value = 3.8e-141
********************************************************************************
Motif 1 Description
Simplified        A  5::2:12991
pos.-specific     C  1:::6:51:3
probability       G  4:16:9:::1
matrix            T  :a924:3::5
[Information content plot: bits per position, rows at 0.2 to 1.5 bits, total 10.3 bits]
Multilevel           ATTGCGCAAT
consensus            G  AT T  C
sequence                   A

Running MAST
/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text -norc -m k -ev 100 -mt 0.01 -b [-stdout > mast_out.txt]
-norc: do not search for the motif on the reverse-complement strand
-m k: find matches to the k-th motif present in meme.out.txt (vary k from 1 to the max number of motifs)
– Easier to use a scripting language to loop through k and parse the output
-ev 100: use an E-value cutoff = 100. This allows sequences with poor matches to be output. (default is -ev 10)
-mt 0.01: similar to -ev 100. (default is -mt )
-b: simplify output
-stdout: redirect the output into a given file instead of the default file (mast... )
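Looping through k from a script might be sketched as follows. The flags are copied from the command on this slide and may differ in other MEME Suite versions; the per-motif output file names (`mast_out.1.txt`, ...) and both function names are hypothetical.

```python
import subprocess

def mast_commands(n_motifs, motif_file="meme.out.txt",
                  seq_file="TF_1_allseq.fasta", mast="/meme/bin/mast"):
    """Build one (argument list, output file name) pair per motif index k."""
    commands = []
    for k in range(1, n_motifs + 1):
        args = [mast, motif_file, "-d", seq_file, "-text", "-norc",
                "-m", str(k), "-ev", "100", "-mt", "0.01", "-b", "-stdout"]
        commands.append((args, "mast_out.%d.txt" % k))  # hypothetical naming
    return commands

def run_all(commands):
    """Run each command, capturing standard output in its own file."""
    for args, out_name in commands:
        with open(out_name, "w") as out:
            subprocess.run(args, stdout=out, check=True)

for args, out_name in mast_commands(5):
    print(" ".join(args), ">", out_name)
```

`run_all` is only defined here, not called, since it needs a local MAST installation; the loop at the end just prints the commands for inspection.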

Output from MAST
Only need Section 1 in output:
SEQUENCE NAME   DESCRIPTION   E-VALUE   LENGTH
seq …
…
seq …
Larger e-values correspond to lower scores (poorer matches). A seq not in the output has the lowest score. Therefore, need to transform e-values to scores. E.g.:
if seq not in output: score = 0
else: score = max_evalue - evalue (max_evalue = 100 was specified by the -ev parameter)
Or score = 1 / evalue.
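Both transformations can be written as small functions. The sketch assumes the e-values have already been parsed into a dict mapping sequence name to e-value; the function names and the `min_evalue` floor (which guards against division by zero when an e-value is reported as 0) are assumptions added for illustration.

```python
def linear_scores(evalues, all_seq_names, max_evalue=100.0):
    """score = max_evalue - evalue for reported sequences, 0 for missing ones."""
    return {
        name: (max_evalue - evalues[name]) if name in evalues else 0.0
        for name in all_seq_names
    }

def inverse_scores(evalues, all_seq_names, min_evalue=1e-300):
    """score = 1 / evalue for reported sequences, 0 for missing ones."""
    return {
        name: 1.0 / max(evalues[name], min_evalue) if name in evalues else 0.0
        for name in all_seq_names
    }

evalues = {"seq1": 0.5, "seq2": 40.0}  # made-up values for illustration
names = ["seq1", "seq2", "seq3"]
print(linear_scores(evalues, names))
print(inverse_scores(evalues, names))
```

Either way, a higher score means a better match, and sequences absent from the MAST output rank last.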

AlignACE and ScanACE
AlignACE:
    AlignACE -i tf_1_top_100p.fasta > tf_1_top_100p.ace
ScanACE:
    ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o
-c: allows more motif matches to be output
Results saved in: _n1.scn, _n2.scn, etc.
Higher scores mean better match. No transformation needed.