Gene Structure Prediction Using Neural Networks and Hidden Markov Models
June 18, 2001
2000-30460 권동섭
2000-30474 신수용
2000-30478 조동연
Data Sets
  UCSC data
  Preprocessing
    Multiple exon genes
    7-fold cross-validation
  Preprocessing: Multi_exon_GB.dat -> pre-processor -> SNNS pattern file

    SNNS pattern definition file V3.2
    generated at Wed May 16 17:00:00 2001

    No. of patterns : 16
    No. of input units : 48
    No. of output units : 4

    # Input pattern 1:
    0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 ...
    # Output pattern 1:
    1 0 0 0
    # Input pattern 2:
    0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 ...
    # Output pattern 2:
    0 0 0 0
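The pattern file layout shown on this slide can be produced with a few lines of Python. This is a minimal sketch, not the pre-processor used in the project: the function name `format_pattern_file` and its arguments are hypothetical, and the header spacing simply follows the excerpt above, so the exact column alignment that SNNS expects is an assumption.

```python
import time


def format_pattern_file(patterns, timestamp=None):
    """Return pattern-file text for a list of (input_bits, output_bits) pairs.

    Hypothetical helper: the layout mirrors the excerpt on the slide.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    n_in = len(patterns[0][0])
    n_out = len(patterns[0][1])
    # time.asctime() yields the same style as "Wed May 16 17:00:00 2001"
    stamp = timestamp or time.asctime()
    lines = [
        "SNNS pattern definition file V3.2",
        f"generated at {stamp}",
        "",
        f"No. of patterns : {len(patterns)}",
        f"No. of input units : {n_in}",
        f"No. of output units : {n_out}",
        "",
    ]
    for i, (inp, out) in enumerate(patterns, start=1):
        # Every pattern must have the same width as the first one
        if len(inp) != n_in or len(out) != n_out:
            raise ValueError(f"pattern {i} has an inconsistent size")
        lines.append(f"# Input pattern {i}:")
        lines.append(" ".join(str(b) for b in inp))
        lines.append(f"# Output pattern {i}:")
        lines.append(" ".join(str(b) for b in out))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [([0, 0, 1, 0], [1, 0, 0, 0]), ([0, 0, 0, 1], [0, 0, 0, 0])]
    print(format_pattern_file(demo, timestamp="Wed May 16 17:00:00 2001"))
```

Returning a string rather than writing to disk keeps the sketch easy to inspect; the caller decides where the file goes.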
Classification Problem
  5 classes
    1. Start – Exon
    2. Exon – Intron
    3. Intron – Exon
    4. Exon – End
    5. Others
  Imbalanced data problem
    Boundary : Others = 1 : 9
  [Figure: boundary positions labeled 1, 2, 3, 4]
Training Data
  Input data
    Boundary sequences: ATGCGA | GCATGA
    Others: GCAGCCAGCTAC or GA | CATGATTTCA
    Encoding: A: 0001, C: 0010, G: 0100, T: 1000
    (12 bases x 4 bits = 48 input units)
  Output data
    Boundary: 1 – 0001, 2 – 0010, 3 – 0100, 4 – 1000
    Internal: 0000
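The encoding on this slide can be sketched as follows. The bit codes are taken from the slide; the function names, the use of label 0 for the all-zero output, and the fixed window length of 12 (inferred from the 48 input units) are assumptions made for illustration.

```python
# Bit codes as listed on the slide
BASE_CODES = {
    "A": (0, 0, 0, 1),
    "C": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}

# Boundary classes 1..4 from the slide; key 0 is a hypothetical label
# for the all-zero output
CLASS_CODES = {
    0: (0, 0, 0, 0),
    1: (0, 0, 0, 1),
    2: (0, 0, 1, 0),
    3: (0, 1, 0, 0),
    4: (1, 0, 0, 0),
}

WINDOW = 12  # assumed: 48 input units / 4 bits per base


def encode_bases(seq):
    """Concatenate the 4-bit code of every base in seq."""
    bits = []
    for base in seq.upper():
        bits.extend(BASE_CODES[base])  # KeyError on anything other than A/C/G/T
    return bits


def encode_window(seq, label):
    """Return (input_bits, output_bits) for one 12-base window."""
    if len(seq) != WINDOW:
        raise ValueError(f"expected {WINDOW} bases, got {len(seq)}")
    return encode_bases(seq), list(CLASS_CODES[label])


if __name__ == "__main__":
    # Boundary example from the slide with the '|' marker removed
    inputs, outputs = encode_window("ATGCGAGCATGA", label=1)
    print(len(inputs), outputs)
```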
Neural Networks
  SNNS (version 4.2)
  Structure
    Input: 48
    Hidden: 96
    Output: 4
  Learning: standard backpropagation (BP) with momentum
    Learning rate: 0.2
    Momentum: 0.1
    Maximum difference: 0.1
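To illustrate what the three learning parameters do, here is a minimal sketch of a momentum weight update and an error tolerance. It is not SNNS code. It assumes the common momentum rule (new step = -learning rate x gradient + momentum x previous step) and assumes that "maximum difference" means output errors no larger than that value are treated as zero; the function names are hypothetical.

```python
def momentum_step(weights, grads, prev_delta, lr=0.2, momentum=0.1):
    """Apply one gradient step with momentum to a flat list of weights.

    Returns the updated weights and the step just taken, which must be
    passed back in as prev_delta on the next call.
    """
    new_delta = [-lr * g + momentum * d for g, d in zip(grads, prev_delta)]
    new_weights = [w + d for w, d in zip(weights, new_delta)]
    return new_weights, new_delta


def tolerated_error(target, output, d_max=0.1):
    """Return target - output, or 0.0 when the gap is within d_max.

    Assumption: this is how the 'maximum difference' setting is applied.
    """
    diff = target - output
    return 0.0 if abs(diff) <= d_max else diff


if __name__ == "__main__":
    w, delta = [1.0], [0.0]
    for _ in range(2):
        w, delta = momentum_step(w, grads=[0.5], prev_delta=delta)
        print(w, delta)
```

With a momentum of 0.1 the previous step contributes only a tenth of its size to the next one, so the update stays close to plain gradient descent.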
Experimental Setup
  Training
    Test groups 0 ~ 5
    Boundary: 3068, Others: 27612
    Online learning, random order
  Test
    Test group 6
    2 genes: HUMELAFIN and HSCPH70
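The split described on this slide can be sketched as below. How genes were actually assigned to the seven groups is not stated, so the round-robin assignment here is an assumption, and all function names are hypothetical. The shuffle stands in for presenting patterns in random order during online learning.

```python
import random


def split_into_groups(gene_ids, n_groups=7):
    """Assign genes to groups round-robin (assumed assignment scheme)."""
    groups = [[] for _ in range(n_groups)]
    for i, gene in enumerate(gene_ids):
        groups[i % n_groups].append(gene)
    return groups


def train_test_split(groups, test_group=6):
    """Hold out one group for testing; train on all the others."""
    train = [g for i, grp in enumerate(groups) if i != test_group for g in grp]
    return train, list(groups[test_group])


def online_order(patterns, seed=None):
    """Return the patterns in a random presentation order."""
    rng = random.Random(seed)
    order = list(patterns)
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    genes = [f"gene{i}" for i in range(14)]  # placeholder identifiers
    train, test = train_test_split(split_into_groups(genes))
    print(len(train), len(test))
    print(online_order(train, seed=0)[:3])
```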
Results – Training Performance
  Early stopping: 260 (0.85%)
  [Figure: training curve, axis label SSE]
Results – Test Performance
  Gene        Boundaries   Recall (Re)   Precision (Pre)
  HUMELAFIN   6            4/6           4/48
  HSCPH70     8            5/10          5/136
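For readers who want the fractions as decimals, the sketch below computes recall (correct predictions over true boundaries) and precision (correct predictions over all predicted boundaries) from the counts reported on this slide. The function names are hypothetical; the counts are copied from the slide as given.

```python
def recall(true_positives, actual_boundaries):
    """Fraction of true boundaries that were found."""
    return true_positives / actual_boundaries


def precision(true_positives, predicted_boundaries):
    """Fraction of predicted boundaries that were correct."""
    return true_positives / predicted_boundaries


# (gene, correct, recall denominator, precision denominator) from the slide
REPORTED = [
    ("HUMELAFIN", 4, 6, 48),
    ("HSCPH70", 5, 10, 136),
]

if __name__ == "__main__":
    for gene, tp, actual, predicted in REPORTED:
        print(f"{gene}: Re = {recall(tp, actual):.3f}, "
              f"Pre = {precision(tp, predicted):.3f}")
```

The low precision values reflect the large number of predicted boundaries relative to correct ones (48 and 136 predictions for 4 and 5 hits).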
Hidden Markov Models
  Simple structure
  Training
    Construct one HMM for each of the 4 boundary classes
    Input: fixed-size sequences for each class
  Test
    Compare generation probabilities
    Threshold value
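The test procedure above can be sketched as follows: score a sequence under each class HMM with the forward algorithm, pick the class with the highest generation probability, and fall back to "Others" when even the best score is below a threshold. The slide does not give the model topology or how the threshold was applied, so the threshold rule, the model parameters in the demo, and all names are assumptions for illustration only.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(seq, start, trans, emit):
    """Forward algorithm: log P(seq | HMM).

    start[s]    : initial probability of state s
    trans[r][s] : transition probability from state r to state s
    emit[s][x]  : probability that state s emits symbol x
    """
    n = len(start)
    alpha = [safe_log(start[s]) + safe_log(emit[s][seq[0]]) for s in range(n)]
    for sym in seq[1:]:
        alpha = [
            log_sum_exp([alpha[r] + safe_log(trans[r][s]) for r in range(n)])
            + safe_log(emit[s][sym])
            for s in range(n)
        ]
    return log_sum_exp(alpha)


def classify(seq, models, threshold):
    """Return the best-scoring class label, or 'Others' below the threshold."""
    scores = {label: log_likelihood(seq, *m) for label, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "Others"


if __name__ == "__main__":
    # Toy one-state models with made-up parameters, for illustration only
    a_rich = ([1.0], [[1.0]], [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}])
    t_rich = ([1.0], [[1.0]], [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}])
    models = {1: a_rich, 2: t_rich}
    print(classify("AAAA", models, threshold=-5.0))
    print(classify("ACGT", models, threshold=-5.0))
```

Working in log space avoids underflow when the fixed-size input sequences are long.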