Project 1: Machine Learning Using Neural Networks Ver 1.1
Outline
- Classification using ANN
- Learn and classify text documents
- Estimate several statistics on the dataset
Network Structure
[Figure: feedforward network with an input layer feeding output units for Class 1, Class 2, and Class 3]
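As a rough illustration of the structure in the figure (not the required implementation), the sketch below builds a small feedforward network in plain numpy: a document vector feeds one hidden layer, and a 3-way softmax output scores the three classes. The layer sizes and the tanh activation are assumptions; training (backpropagation) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 100 input terms (see the document-vector slides), 20 hidden units, 3 classes.
n_input, n_hidden, n_classes = 100, 20, 3

# Randomly initialized weights; learning is not shown here.
W1 = rng.normal(scale=0.1, size=(n_input, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x):
    """One forward pass: input -> hidden (tanh) -> softmax over the 3 classes."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.random(n_input)   # a stand-in document vector
print(forward(x))         # class membership scores for Class 1..3
```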
CLASSIC3 Dataset
CLASSIC3
- Three categories, 3,891 documents in total
- CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information
- CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology
- MED: 1,033 biomedical abstracts from MEDLINE
Text Representation in Vector Space
- Document collection -> preprocessing (stemming, stop-word elimination, feature selection) -> Bag-of-Words / vector space model (VSM) representation
- The result is a term-document matrix: one term vector per document, with terms such as "baseball", "graphics", "hockey", "unix", "space" as dimensions
[Figure: term-document matrix over documents d1, d2, d3, ..., dn]
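To make the pipeline concrete, here is a toy bag-of-words sketch in Python. The stop-word list and the "stemmer" are placeholder assumptions for illustration only, not the preprocessing actually applied to CLASSIC3.

```python
from collections import Counter

# Placeholder resources (assumptions, not the lab's actual lists/tools).
STOP_WORDS = {"the", "a", "of", "on", "and", "is", "in"}

def stem(word):
    # Crude stand-in for a real stemmer (e.g. Porter): strip a trailing 's'.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def to_bag_of_words(text):
    """Tokenize, drop stop-words, stem, and count term frequencies."""
    tokens = [t.lower().strip(".,;:!?") for t in text.split()]
    terms = [stem(t) for t in tokens if t and t not in STOP_WORDS]
    return Counter(terms)

docs = [
    "Retrieval of documents in information systems.",
    "Aerodynamic properties of supersonic wings.",
]
bags = [to_bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))
# Term-document matrix: one row per document, one column per vocabulary term.
matrix = [[bag.get(term, 0) for term in vocabulary] for bag in bags]
print(vocabulary)
print(matrix)
```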
Dataset Format

Dimensionality Reduction
- Feature selection: score each term (feature) individually, sort the terms by score, and pass only the terms with higher values on to the ML algorithm
- Term weighting: TF or TF x IDF on the documents in vector space (sketched below)
  - TF: term frequency
  - IDF: inverse document frequency, IDF_i = log(N / n_i)
  - N: number of documents
  - n_i: number of documents that contain the i-th word
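A minimal numpy sketch of the TF x IDF weighting described above, assuming a (documents x terms) count matrix as input; the log(N / n_i) form follows the definitions on this slide.

```python
import numpy as np

def tfidf(counts):
    """counts: (documents x terms) matrix of raw term frequencies (TF)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                    # number of documents
    n_i = (counts > 0).sum(axis=0)         # number of documents containing term i
    idf = np.log(N / np.maximum(n_i, 1))   # guard against terms that never occur
    return counts * idf                    # TF x IDF, term by term

counts = [[2, 0, 1],
          [0, 3, 1],
          [1, 1, 1]]
print(tfidf(counts))
```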
Construction of Document Vectors
- Controlled vocabulary: stop-words are removed, stemming is applied, and words whose document frequency is less than 5 are removed
- Term size: 3,850
- A document is represented as a 3,850-dimensional vector whose elements are the word frequencies
- Words are sorted by their information gain values and the top 100 terms are selected, giving a 3,830 (examples) x 100 (terms) matrix
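The slide selects the top 100 terms by information gain. The sketch below scores each term by the gain of the event "the term occurs in the document" with respect to the class label; this is one common IG variant and an assumption, since the exact formulation used for the dataset is not specified here.

```python
import numpy as np

def entropy(labels):
    """labels: numpy array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(term_counts, labels):
    """IG of 'term occurs in the document' with respect to the class label."""
    present = term_counts > 0
    ig = entropy(labels)
    for mask in (present, ~present):
        if mask.any():
            ig -= mask.mean() * entropy(labels[mask])
    return ig

def select_top_terms(X, labels, k=100):
    """X: (documents x terms) frequency matrix. Returns indices of the k best terms."""
    scores = np.array([information_gain(X[:, j], labels) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Usage: X_reduced = X[:, select_top_terms(X, y, k=100)]
```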
Experimental Results
Data Setting for the Experiments
- Basically, the training and test sets are given
  - Training: 2,683 examples
  - Test: 1,147 examples
- N-fold cross-validation (optional; see the sketch below)
  - The dataset is divided into N subsets and the holdout method is repeated N times: each time, one of the N subsets is used as the test set and the remaining (N-1) subsets are put together to form the training set
  - The average performance across all N trials is computed
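If you try the optional N-fold cross-validation, the loop can be as simple as the sketch below; `train_and_evaluate` is a hypothetical placeholder for whatever classifier and scoring you use.

```python
import numpy as np

def n_fold_cross_validation(X, y, n_folds, train_and_evaluate, seed=0):
    """Split the data into n_folds parts; each part serves once as the test set.

    X, y: numpy arrays of examples and labels.
    train_and_evaluate(X_tr, y_tr, X_te, y_te) -> test performance (a float).
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```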
Number of Epochs
Number of Hidden Units
- Minimum 10 runs for each setting (one way to fill the table is sketched below)

# Hidden Units | Train (Average / SD / Best / Worst) | Test (Average / SD / Best / Worst)
Setting 1      |                                     |
Setting 2      |                                     |
Setting 3      |                                     |
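One way to collect the statistics above: repeat training at least 10 times per setting and record the average, standard deviation, best, and worst test accuracy. The sketch uses scikit-learn's MLPClassifier as a stand-in; any of the ANN sources listed later would do, and the hidden-unit values are assumed examples.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def evaluate_setting(X_train, y_train, X_test, y_test, n_hidden, n_runs=10):
    """Train n_runs networks with n_hidden hidden units and summarize test accuracy."""
    accs = []
    for run in range(n_runs):
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            max_iter=500, random_state=run)
        net.fit(X_train, y_train)
        accs.append(net.score(X_test, y_test))
    accs = np.array(accs)
    return accs.mean(), accs.std(), accs.max(), accs.min()   # Average, SD, Best, Worst

# for n_hidden in (10, 30, 50):   # three example settings (assumed values)
#     print(n_hidden, evaluate_setting(X_train, y_train, X_test, y_test, n_hidden))
```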
Other Methods/Parameters
- Normalization method for input vectors
- Class decision policy
- Learning rates
- …
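For the first two items, a minimal sketch of two common choices (assumptions, not prescribed methods): L2-normalizing each input vector, and deciding the class by the largest output unit (winner-take-all).

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each document vector (row of X) to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def decide_class(outputs):
    """Winner-take-all: pick the class whose output unit is largest."""
    return int(np.argmax(outputs))

print(decide_class(np.array([0.1, 0.7, 0.2])))   # -> 1 (the second class)
```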
ANN Sources
- Source codes
  - Free software: Weka
  - NN libraries (C, C++, Java, …)
  - MATLAB toolbox
- Web sites
Submission
- Due date: April 18 (Tue)
- Both ‘hardcopy’ and ‘ ’
- Used software and running environments
- Experimental results with various parameter settings
- Analysis and explanation of the results, in your own way
- FYI, it is not important to achieve the best performance