Learning to Classify Documents
Edwin Zhang
Computer Systems Lab, 2009-2010

Introduction
Classifying documents using a Bayesian method
Two parts: Learning and Prediction
Coded in Java

Background
Naïve Bayes classifier (Bayesian method)
Computes the conditional probability P(T|D) of each topic T for a given document D
Assigns document D to the topic with the largest conditional probability
Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
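The slides do not write the rule out; for reference, the standard Naïve Bayes decision rule, under the usual assumption that the words w of a document are conditionally independent given the topic, is:

```latex
% Standard Naive Bayes decision rule (not spelled out in the original slides).
% Bayes' theorem rewrites the posterior, and P(D) is constant across topics.
\hat{T} \;=\; \arg\max_{T} P(T \mid D)
        \;=\; \arg\max_{T} \frac{P(D \mid T)\,P(T)}{P(D)}
        \;=\; \arg\max_{T} P(T)\prod_{w \in D} P(w \mid T)
```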

Background
Small variations on the Naïve Bayes classifier
Uses the number of times each term appears in the training documents
Computes P(D|T) instead of P(T|D); since P(D) is the same for every topic, maximizing P(D|T)·P(T) picks the same topic as maximizing P(T|D)

Background
The program has two steps: Learning and Prediction

Learning
Uses the training documents
Performs feature selection

Prediction
Uses conditional probability
Uses the features that were selected in the Learning step
Assigns the document to the topic that has the highest "score"

Development
Created Category, Document, and Terms classes
The Category class deals with the categories
The Document class deals with the documents
The Terms class deals with the terms that appear in each document
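The slides name these three classes but do not show their code; below is a minimal Java sketch of how they might fit together. All field and constructor names here are assumptions for illustration, not taken from the original program.

```java
// Hypothetical skeleton of the three classes named on this slide.
import java.util.ArrayList;
import java.util.List;

class Document {
    List<String> terms = new ArrayList<>();           // terms appearing in this document
}

class Category {
    String name;
    List<Document> documents = new ArrayList<>();     // the category's training documents
    Category(String name) { this.name = name; }
}

class Terms {
    String term;
    int[] countsPerCategory;      // occurrences of the term in each category's documents
    double[] scorePerCategory;    // ratio-based score per category, used for feature selection
    Terms(String term, int numCategories) {
        this.term = term;
        this.countsPerCategory = new int[numCategories];
        this.scorePerCategory = new double[numCategories];
    }
}
```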

Category Class
Each category contains an array of documents
Started out with 2 categories
Added more categories as my program started working

Document Class
Each document contains an array of terms
Covers both training documents and prediction documents

Terms Class
The Terms class deals with all the terms that appear in the training documents
For each term, keeps an array of counts of how many times the term appears in the documents of each category
Each term is also assigned a score
Score = (number of times in category A + 1) / (number of times in category B + 1), where the +1 in the denominator avoids dividing by 0
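Read literally, the slide's formula is a smoothed ratio of per-category counts. A minimal Java sketch of that computation (the method and parameter names are illustrative):

```java
class Scoring {
    // Score of a term for telling category A apart from category B.
    // The +1 on each count keeps the denominator (and numerator) nonzero.
    static double termScore(int countInCategoryA, int countInCategoryB) {
        return (countInCategoryA + 1.0) / (countInCategoryB + 1.0);
    }
}
```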

Development (continued)
Created an array of categories
Read in all my training documents
Stored all the terms that appeared in an array of Terms
Sorted the array of terms based on the score for each category
Chose the top 25 terms from the sorted array for each category
End of the learning part
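A possible Java rendering of the sort-and-select step described on this slide, building on the hypothetical Terms class sketched earlier (its scorePerCategory field is an assumption):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class FeatureSelection {
    // Sort the terms by their score for one category and keep the best ones
    // (the slides keep the top 25 per category).
    static List<Terms> selectFeatures(List<Terms> allTerms, int categoryIndex, int howMany) {
        return allTerms.stream()
                .sorted(Comparator.comparingDouble(
                        (Terms t) -> t.scorePerCategory[categoryIndex]).reversed())
                .limit(howMany)
                .collect(Collectors.toList());
    }
}
```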

Development (continued)
Read in a prediction document
Looked for terms that had been selected as features
Each category had a running score variable

Development (continued)
For each feature found in the document, multiplied each category's variable by a calculated score
The category with the highest score at the end was the most likely category
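A plausible Java sketch of that prediction loop, again using the hypothetical Document and Terms classes from the earlier sketch; the exact scoring details of the original program are not given, so this only mirrors the multiply-and-compare behavior described on the slide.

```java
import java.util.Arrays;
import java.util.List;

class Prediction {
    // Multiply each category's running score by the score of every selected
    // feature that appears in the document; the largest product wins.
    static int predict(Document doc, List<Terms> features, int numCategories) {
        double[] scores = new double[numCategories];
        Arrays.fill(scores, 1.0);                    // every category starts at 1
        for (Terms feature : features) {
            if (doc.terms.contains(feature.term)) {  // the feature occurs in the document
                for (int c = 0; c < numCategories; c++) {
                    scores[c] *= feature.scorePerCategory[c];
                }
            }
        }
        int best = 0;                                // index of the highest-scoring category
        for (int c = 1; c < numCategories; c++) {
            if (scores[c] > scores[best]) best = c;
        }
        return best;
    }
}
```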

Development (continued)
Initially started with 2 categories
Once the program started working, added 3 more categories

Results
Had some initial problems
With 2 categories, worked flawlessly on 10 test documents
With 5 categories, correctly classified 28 of the 30 documents tested (about 93% accuracy)

Discussion
Worked as well as I expected
Possible areas for future experiments:
A different method for calculating scores for terms
A different method for calculating scores for the categories

Acknowledgements
My dad, Jianping Zhang
My lab director, Randolph Latimer