Artificial Intelligence Project 1: Neural Networks
Biointelligence Lab
School of Computer Science and Engineering
Seoul National University
(C) SNU CSE BioIntelligence Lab

Outline
  - Classification problems
  - Two data sets
      - Bioinformatics: DNA
      - Medical diagnosis: Diabetes
  - Generalization performance
  - Epochs
  - Number of hidden units
  - Cross validation
  - Confusion matrix
Bioinformatics: Finding Coding Regions of DNA Sequences
Bioinformatics
  What is bioinformatics?
  - Bio: molecular biology
  - Informatics: computer science
  - Bioinformatics: solving problems arising from biology using methodology from computer science
DNA Structure
  - Double helix: base pairs
  - 4 nucleotides
      A: Adenine
      T: Thymine
      G: Guanine
      C: Cytosine

  Example sequence:
    AACCTGCGGAAGGATCATTA
    CCGAGTGCGGGTCCTTTGGG
    CCCAACCTCCCATCCGTGTCT
    ATTGTACCCGTTGCTTCGGCG
    GGCCCGCCGCTTGTCGGCCG
    CCGGGGGGGCGCCTCTGCCC
    CCCGGGCCCGTGCCCGCCGG
    AGACCCCAACACGAACACTG
    TCTGAAAGCGTGCAGTCTGA
    GTTGATTGA
Central Dogma
  Information flow from DNA to protein
  - Proteins are synthesized based on the information in DNA.
  - DNA: information storage
  - RNA: information intermediate
  - Protein: various cellular functions
Finding Coding Regions of DNA Sequences
  RNA synthesis and processing
  - Exon: coding sequence
  - Intron: non-coding sequence
  Given a DNA sequence, recognize the boundaries between exons and introns.
  - Acceptor: intron/exon boundary
  - Donor: exon/intron boundary
Neural Networks (1/2)
  Input (180 units) and output
  - Input: a DNA sequence of length 60 over the nucleotides A, C, G, T
    (180 units = 60 positions x 3 units per position)
  - Output: decide whether the middle of the input sequence is
      1: Donor
      2: Acceptor
      3: Neither
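The slide gives only the totals (60 bases, 180 input units), so each base must occupy 3 input units. The following is a minimal sketch of one way to build such an input vector. The particular 3-bit code in `CODE` and the function name `encode` are assumptions made for illustration; the data files you receive may use a different mapping, so check them before relying on this one.

```python
# Assumed 3-bit code per nucleotide (hypothetical; the slides do not specify it).
CODE = {
    "A": (1, 0, 0),
    "C": (0, 1, 0),
    "G": (0, 0, 1),
    "T": (0, 0, 0),
}


def encode(sequence):
    """Turn a 60-base DNA string into a flat list of 180 binary inputs."""
    if len(sequence) != 60:
        raise ValueError("expected a sequence of length 60")
    inputs = []
    for base in sequence.upper():
        if base not in CODE:
            raise ValueError(f"unexpected symbol: {base!r}")
        inputs.extend(CODE[base])
    return inputs


if __name__ == "__main__":
    window = "ACGT" * 15          # toy sequence, not taken from the data set
    vector = encode(window)
    print(len(vector))            # 60 bases x 3 units = 180
    print(vector[:12])            # codes for A, C, G, T in order
```

A one-of-four code (4 units per base) would be the more common choice in general, but it would give 240 inputs rather than the 180 stated on the slide.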
Neural Networks (2/2)
  Data (3186)
  - Training: 2000
  - Test: 1186
  Class distribution

    Class   Train            Test
    1       464 (23.20%)     303 (25.55%)
    2       485 (24.25%)     280 (23.61%)
    3       1051 (52.55%)    603 (50.84%)
Results (1/3)
  Number of epochs
Results (2/3)
  Number of hidden units
  - At least 10 runs for each setting

                        Train                        Test
    # Hidden units      Average SD   Best   Worst    Average SD   Best   Worst
    Setting 1
    Setting 2
    Setting 3
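Each row of this table summarizes the runs for one hidden-unit setting. The sketch below shows how the four quantities could be computed from a list of per-run scores. It assumes the scores are accuracies (higher is better) and uses the sample standard deviation; both are assumptions, and the name `summarize` is hypothetical. If you report error rates instead, best and worst swap.

```python
import statistics


def summarize(accuracies):
    """Average, SD, best and worst over the per-run accuracies of one setting."""
    return {
        "average": statistics.mean(accuracies),
        "sd": statistics.stdev(accuracies),   # sample SD (n - 1 denominator)
        "best": max(accuracies),              # assumes higher is better
        "worst": min(accuracies),
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the project.
    test_accuracy = [0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.92, 0.91, 0.90, 0.92]
    for name, value in summarize(test_accuracy).items():
        print(f"{name:>8}: {value:.4f}")
```

Call it once on the training scores and once on the test scores of each setting to fill both halves of a row.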
Results (3/3)
Medical Diagnosis: Diabetes
Pima Indian Diabetes
  Data (768)
  - 8 attributes
      1. Number of times pregnant
      2. Plasma glucose concentration in an oral glucose tolerance test
      3. Diastolic blood pressure (mm Hg)
      4. Triceps skin fold thickness (mm)
      5. 2-hour serum insulin (mu U/ml)
      6. Body mass index (kg/m^2)
      7. Diabetes pedigree function
      8. Age (years)
  - Positive: 268, negative: 500
Cross Validation (1/2)
  K-fold cross validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the test set, and the other k-1 subsets are combined to form the training set.

  Illustration with six subsets D1 to D6 of 128 examples each:

    Training              Test
    D1 D2 D3 D4 D5        D6
    D1 D2 D3 D4 D6        D5
    D2 D3 D4 D5 D6        D1
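The procedure described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `k_fold_splits` and the fixed `seed` are assumptions, and it works on example indices so that it can be paired with any network library.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random division of the data
    folds = [indices[i::k] for i in range(k)]     # k subsets of near-equal size
    for i in range(k):
        test = folds[i]                           # one subset is the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 768 examples and k = 6 give 128 test and 640 training examples per fold.
    for train, test in k_fold_splits(768, 6):
        print(len(train), len(test))
```

Every example appears in exactly one test set, so averaging the k test errors uses each example once for testing.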
Cross Validation (2/2)
  - Calculation of the error
  - Confusion matrix

                      True
    Predicted         Positive   Negative
    Positive
    Negative
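As a small worked example of the two items on this slide, the sketch below counts the four cells of the confusion matrix and derives the error from them as the fraction of misclassified examples. The 1/0 label coding and the function names are assumptions made for illustration.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Count the four outcomes; labels are assumed to be 1 (positive) or 0 (negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["FN" if truth == 1 else "TN"] += 1
    return counts


def error_rate(counts):
    """Misclassified examples (FP + FN) divided by all examples."""
    total = sum(counts.values())
    return (counts["FP"] + counts["FN"]) / total


if __name__ == "__main__":
    # Toy labels for illustration only.
    truth = [1, 1, 0, 0, 1, 0]
    guess = [1, 0, 0, 1, 1, 0]
    matrix = confusion_matrix(truth, guess)
    print("              true pos  true neg")
    print(f"predicted pos {matrix['TP']:8d}  {matrix['FP']:8d}")
    print(f"predicted neg {matrix['FN']:8d}  {matrix['TN']:8d}")
    print(f"error = {error_rate(matrix):.4f}")
```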
Results
  Cross validation and confusion matrix
  - At least 10 runs for your k value.
  - Show the confusion matrix for the best result of your experiments.

    Run        Test error
    1
    2
    ...
    10
    Average
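One possible way to produce the run table is to repeat k-fold cross validation with a different random partition per run and average the resulting test errors. The sketch below assumes that reading of "run". `fold_error` is a hypothetical callback that should train a network on the training indices and return its error on the test indices; here a majority-class guess on toy labels stands in for it so that the example runs without a network library.

```python
import random
import statistics


def cross_validation_error(n_samples, k, seed, fold_error):
    """Mean test error over the k folds of one random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        errors.append(fold_error(train, test))
    return statistics.mean(errors)


def run_table(n_samples, k, n_runs, fold_error):
    """Return [(run, test_error), ...] and the average over all runs."""
    rows = [(run, cross_validation_error(n_samples, k, run, fold_error))
            for run in range(1, n_runs + 1)]
    average = statistics.mean(error for _, error in rows)
    return rows, average


# Toy labels (every third example positive); NOT the diabetes data.
labels = [1 if i % 3 == 0 else 0 for i in range(768)]


def majority_guess_error(train, test):
    """Stand-in for training a network: always predict the majority training class."""
    positives = sum(labels[i] for i in train)
    guess = 1 if positives * 2 > len(train) else 0
    return sum(labels[i] != guess for i in test) / len(test)


if __name__ == "__main__":
    rows, average = run_table(768, 6, 10, majority_guess_error)
    for run, error in rows:
        print(f"{run:>7}  {error:.4f}")
    print(f"Average  {average:.4f}")
```

Replacing `majority_guess_error` with a function that trains and tests your own network leaves the rest unchanged.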
References
  - Source code
  - Free software
  - NN libraries (C, C++, Java, …)
  - MATLAB toolbox
  - Web sites
Pay Attention!
  Due (October 7, 2001): by the beginning of class
  Submission
  - Results obtained from your experiments
      - Compress the data
      - Via
  - Report: hardcopy!!
      - Software used and running environment
      - Results of many experiments with various parameter settings
      - Analysis and explanation of the results in your own way
Optional Experiments
  - Various learning rates
  - Number of hidden layers
  - Different k values
  - Output encoding