Project 1: Text Classification by Neural Networks


Project 1: Text Classification by Neural Networks Ver 1.1

(C) 2006, SNU Biointelligence Laboratory

Outline
- Classification using an ANN
- Learn and classify text documents
- Estimate several statistics on the dataset

Network Structure
[Figure: a feed-forward network; the input layer takes a document's term vector, and three output units correspond to Class 1, Class 2, and Class 3.]
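A minimal Python sketch of such a network (a toy illustration, not the course code; the single hidden layer, tanh hidden units, and the layer sizes are assumptions):

```python
import math
import random

def softmax(z):
    """Turn raw output scores into class probabilities."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class MLP:
    """One-hidden-layer network: input -> hidden (tanh) -> 3-way softmax."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.W1]
        return softmax([sum(w * hi for w, hi in zip(row, h)) for row in self.W2])

net = MLP(n_in=100, n_hidden=8, n_out=3)   # 100 input terms, 3 classes
probs = net.forward([0.0] * 100)
```

With an all-zero input the softmax output is uniform over the three classes; training (e.g., by backpropagation) would adjust W1 and W2.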

CLASSIC3 Dataset

CLASSIC3
Three categories, 3,891 documents in total:
- CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information.
- CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology.
- MED: 1,033 biomedical abstracts from MEDLINE.

Text Representation in Vector Space
[Figure: a document collection is preprocessed by stemming, stop-word elimination, and feature selection, then mapped to a vector space model (VSM) representation. Each document becomes a bag-of-words term vector; together they form a term-document matrix with terms (e.g., baseball, specs, graphics, hockey, unix, space) as rows and documents d1, d2, ..., dn as columns.]
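The bag-of-words step above can be sketched as follows (the toy documents are hypothetical):

```python
from collections import Counter

docs = [
    "graphics unix graphics",          # toy document d1 (hypothetical)
    "baseball hockey baseball baseball",  # toy document d2 (hypothetical)
]

# Vocabulary: every distinct term in the collection, sorted for a stable order.
vocab = sorted({t for d in docs for t in d.split()})

# Term-document matrix: rows are terms, columns are documents, entries are counts.
counts = [Counter(d.split()) for d in docs]
matrix = [[c[t] for c in counts] for t in vocab]
```

Each column of `matrix` is one document's bag-of-words vector; a missing term simply counts as zero.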

Dimensionality Reduction
- Feature selection: apply a scoring measure to each individual term (feature), sort the terms by score, and choose the terms with higher values; the documents, represented in the reduced vector space, are then passed to the ML algorithm.
- Term weighting: TF or TF x IDF, where TF is the term frequency and IDF is the inverse document frequency,
  idf_j = log(N / n_j),
  with N the number of documents and n_j the number of documents that contain the j-th word.
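A minimal sketch of the TF x IDF weighting defined above, assuming documents are already tokenized into lists of terms:

```python
import math

def tfidf(docs):
    """Weight each term count (TF) by log(N / n_j), i.e., the IDF above."""
    N = len(docs)
    # n_j: number of documents that contain each term
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    # One weight dictionary per document: term -> tf * idf
    return [{t: d.count(t) * math.log(N / df[t]) for t in set(d)} for d in docs]

# Toy tokenized documents (hypothetical)
weights = tfidf([["a", "b", "a"], ["b", "c"], ["a", "c", "c"]])
```

Note that a term occurring in every document gets IDF log(N/N) = 0, so globally common terms are down-weighted.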

Construction of Document Vectors
- Controlled vocabulary: stop words are removed, stemming is applied, and words whose document frequency is less than 5 are removed → term size: 3,850.
- A document is represented as a 3,850-dimensional vector whose elements are word frequencies.
- Words are sorted by their information gain values and the top 100 terms are selected → a 3,830 (examples) x 100 (terms) matrix.
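The information-gain ranking can be sketched as follows (the corpus and class labels are hypothetical toy data):

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) over a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(term, docs, labels):
    """IG(t) = H(C) - [P(t present) H(C|present) + P(t absent) H(C|absent)]."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - cond

# Toy corpus (hypothetical): "retrieval" perfectly separates the two classes.
docs = [{"retrieval", "index"}, {"wing", "flow"}, {"retrieval"}, {"wing"}]
labels = ["CISI", "CRAN", "CISI", "CRAN"]
ig = info_gain("retrieval", docs, labels)
```

Scoring every term this way and keeping the 100 highest-scoring ones yields the reduced 100-dimensional representation described above.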

Experimental Results

Data Setting for the Experiments
- Basically, training and test sets are given:
  - Training: 2,683 examples
  - Test: 1,147 examples
- N-fold cross-validation (optional):
  - The dataset is divided into N subsets.
  - The holdout method is repeated N times; each time, one of the N subsets is used as the test set and the other N-1 subsets are put together to form the training set.
  - The average performance across all N trials is computed.
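The N-fold procedure above amounts to simple index bookkeeping, sketched here:

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_idx, test_idx) pairs; each fold serves once as the test set."""
    idx = list(range(n_examples))
    folds = [idx[i::n_folds] for i in range(n_folds)]  # N roughly equal subsets
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

splits = list(n_fold_splits(10, 5))
```

Every example appears in exactly one test fold, so averaging accuracy over the N trials uses each example for testing exactly once.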

Number of Epochs

Number of Hidden Units
Minimum 10 runs for each setting. For each setting of the number of hidden units, report train and test performance as Average ± SD, Best, and Worst:

  # Hidden Units    Train (Avg ± SD / Best / Worst)    Test (Avg ± SD / Best / Worst)
  Setting 1
  Setting 2
  Setting 3


Other Methods/Parameters
- Normalization method for input vectors
- Class decision policy
- Learning rates
- …
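As one example of an input-normalization choice, here is a sketch of L2 (unit-length) normalization; the slide leaves the method open, so picking L2 is an assumption:

```python
import math

def l2_normalize(vec):
    """Scale a term vector to unit Euclidean length, so documents of
    different lengths become comparable as network inputs."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

unit = l2_normalize([3.0, 4.0])
```

Other reasonable choices include max-frequency scaling or per-feature standardization; whichever is used should be reported with the results.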

ANN Sources
- Source codes
  - Free software: Weka
  - NN libraries (C, C++, Java, …)
  - MATLAB toolbox
- Web sites
  - http://www.cs.waikato.ac.nz/~ml/weka/
  - http://www.faqs.org/faqs/ai-faq/neural-nets/part5/

Submission
- Due date: October 12 (Thu)
- Submit both a hardcopy and an email copy
- Report the software used and the running environment
- Report experimental results with various parameter settings
- Provide your own analysis and explanation of the results
- FYI, achieving the best performance is not the important part