Download presentation
Presentation is loading. Please wait.
Published byEmerald Flowers Modified over 9 years ago
1
Some Ideas on Final Project
2
Feature extraction TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT n CCGATACCCCCCACGCGAAACCCTACACATCAAAT p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT n CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n AAAAAAAC…AACGAACT…ACGG…TTTTLabel 00…10…1…0n
3
K-mer based features Exact k-mers: k = 1 to 10 Larger k –More accurate representation: accurate classification –More features: overfitting => lower accuracy –Less efficient (both time and memory) –Feature selection may be useful Smaller k –Less features: less likely to overfit, shorter runing time –Less accurate representation: inaccurate classification –More efficient (both time and memory) K-mers with degenerate chars (wild cards) K-mers with mismatches Two-block kmers: e.g., ACGN{1-3}CGT
4
PWM-based features Can represent all forms in the previous slides in some way Key: how to get them? –Cluster top-ranked k-mers –Use existing tools (e.g., MEME or alignACE) Find motifs from top positive sequences (for efficiency) Use MAST (in MEME) or ScanACE (in alignACE) to find matches in all sequences. –HMMs for learning and classification
5
Submission data format TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC n0.005 AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT n0.345 AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT p0.985 GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC n0.489 CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p0.523
6
Input format for weka @relation experiment @attributeseqstring @attribute AAAA numeric @attribute AAAC numeric @attribute AAAG numeric … @attribute TTTT numeric @attribute class {n,p} @data TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n
7
Output format from weka With option –p 0 –distribution(you could try – p 1 to get the seq printed) inst# actual predicted error distribution 1 2:p 1:n + *1,0 2 2:p 1:n + *0.868,0.132 3 1:n 1:n *1,0 … 12 1:n 1:n *0.996,0.004
8
More about “–p 1” With the seq as the first attribute, you’ll get an UnsupportedAttributeTypeException: Cannot handle string attributes Solution – use an unsupervised attribute filter to remove it: java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.RemoveType -T string" -W weka.classifiers.trees.J48 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution -- -C 0.1 This has the same effect of calling java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1 -t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution (but without having the seq as the first attribute in the.arff file. And of course you won’t have the seq in your output file.)
9
About MEME Input seqs in FASTA format: > seq1 ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC > seq2 CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA > seq3 CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA > seq4 CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT > seq5 GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA > seq6 GATCGTGATGCATTGACTTGCGTAATAGTGTTATG > seq7 GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC
10
Running MEME /meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15 -minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1 (you should try something differently: e.g. randomly split the positive seqs into 10 subsets) -dna: input is dna seq -minw 6: minimum motif length = 6 -maxw 6: maximum motif length = 15 -minsites: motif occurs in at least 10 sequences -mod zoops: each seq contains zero or one copy of the motif -nmotifs 5: return top 5 motifs -text: output is in plain text format (as opposed to HTML format)
11
Output from MEME ******************************************************************************** MOTIF 1 width = 10 sites = 99 llr = 717 E-value = 3.8e-141 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A 5::2:12991 pos.-specific C 1:::6:51:3 probability G 4:16:9:::1 matrix T :a924:3::5 bits 2.2 2.0 1.7 * * 1.5 * * ** Information 1.3 * * ** content 1.1 ** ** ** (10.3 bits) 0.9 ** ** ** 0.7 ***** ** 0.4 ********* 0.2 ********** 0.0 ---------- Multilevel ATTGCGCAAT consensus G AT T C sequence A
12
Running MAST /meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc –m -ev 100 -mt 0.01 –b [-stdout > mast_out.txt] -norc: do not search motif from the reverse-complementary strand -m : find matches to the k-th motif present in meme.out.txt (vary k from 1 to max number of motifs) –Easier to use a scripting language to loop through k and parse the output -ev 100: use an E-value cutoff = 100. this allows sequences with poor matches to be output. (default is -ev 10) -mt 0.01: similar to –ev 100. (default is –mt 0.0001) -b: simplify output -stdout: redirect the output into a given file instead of the default file (mast... )
13
Output from MAST Only need Section 1 in output SEQUENCE NAME DESCRIPTION E-VALUE LENGTH ------------- ----------- -------- ------ seq22 0.11 35 seq13 0.11 35 seq72 0.3 35 seq19 0.3 35 seq29 0.38 35 … seq446 96 35 seq298 96 35 seq264 96 35 seq251 96 35 Larger e-values correspond to lower scores (poorer matches). Seq not in the output has lowest score. Therefore, need to transform e-values to scores. E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue = 100 was specified by the –ev parameter.) Or score = 1 / evalue.
14
AlignACE and ScanACE AlignACE: AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace ScanACE: ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o -c: allows more motif matches to be output Results saved in: _n1.scn, _n2.scn, etc. Higher scores mean better match. No transformation needed.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.