Document Quality Judgment with Textual Features
Bing Bai
Computer Science Department, Rutgers University
December 2003



Document Qualities
- Not relevance
- Also important in information retrieval systems
- Partially dependent on textual features
  - Document length
  - “Coward”

Document Qualities (Continued)
- Pre-defined qualities
  - Accuracy
  - Credibility
  - Depth
  - Grammar Correctness
  - Objectivity
  - Multi-side
  - Readability
  - Source Authority
  - Verbose-Concise

Textual Features
- Statistics by GATE
- Categories of features
  - Punctuation: number of periods, question marks, exclamation marks, …
  - Symbol: number of dollar signs, percent signs, plus signs, …
  - Length: average paragraph length in words; length of title, subtitle, …
  - Upper Case: number of all-uppercase words, number of words with the first letter capitalized, …

Textual Features (Continued)
  - Quotation: average quotation length
  - Key Terms: number of occurrences of the words "say", "seem", and "expert"
  - Unique words: number of unique words, excluding stop words, …
  - POS: number of tokens, proper nouns, personal pronouns, …
  - Entities: number of persons, locations, organizations, and dates, …
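A few of the surface counts listed above can be sketched in plain Python. This is only an illustration, not the GATE pipeline the slides refer to: the function name `extract_features`, the tiny `STOP_WORDS` set, and the regex tokenizer are all assumptions made for this example, and the POS and entity counts are left out because they need a real tagger.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}
# Key terms named on the slide.
KEY_TERMS = ("say", "seem", "expert")


def extract_features(text):
    """Return a dict of simple surface counts for one document."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    lower = [w.lower() for w in words]
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    features = {
        "periods": text.count("."),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "dollar_signs": text.count("$"),
        "percent_signs": text.count("%"),
        "plus_signs": text.count("+"),
        "all_upper_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "capitalized_words": sum(1 for w in words if w[0].isupper()),
        "unique_words": len(set(lower) - STOP_WORDS),
        "avg_paragraph_words": len(words) / len(paragraphs) if paragraphs else 0.0,
    }
    for term in KEY_TERMS:
        # Exact-match count; inflected forms such as "says" are not counted.
        features["term_" + term] = lower.count(term)
    return features
```

Calling `extract_features(document_text)` on each document would give one fixed-length feature dictionary per document, which is the form a classifier needs.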

Data Set and Testing Scheme
- More than 2000 documents from 3 different article sources: CNS, TREC, and XinHua News Agency.
- The nine qualities of these documents were judged by faculty, professionals, and students.
- Three qualities (“Depth”, “Multi-side”, “Objectivity”) showed the strongest correlations with the textual features we defined.
- 2-fold cross-validation, repeated 5 times. The training set and test set are generated randomly each time.
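The testing scheme above (random 2-fold cross-validation, repeated 5 times) could look roughly like the sketch below. The names `repeated_two_fold`, `evaluate`, and the `train_and_predict` callback are hypothetical, and summarizing the ten fold accuracies as a mean and standard deviation is an assumption; the slides do not say how the scores were aggregated.

```python
import random
import statistics


def repeated_two_fold(n_docs, repeats=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    for _ in range(repeats):
        rng.shuffle(indices)          # new random split on every repeat
        half = n_docs // 2
        first, second = indices[:half], indices[half:]
        yield first, second           # train on the first half, test on the second
        yield second, first           # then swap the roles


def evaluate(labels, train_and_predict, repeats=5, seed=0):
    """Return (mean accuracy, standard deviation) over all folds.

    train_and_predict(train, test) is a caller-supplied function that
    returns one predicted label per index in test.
    """
    accuracies = []
    for train, test in repeated_two_fold(len(labels), repeats, seed):
        predictions = train_and_predict(train, test)
        correct = sum(1 for i, p in zip(test, predictions) if labels[i] == p)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Each repeat yields two folds, so 5 repeats produce 10 accuracy values per classifier and quality.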

Results

        Depth (1119/894)    Multi-side (1038/975)    Objectivity (995/1018)
J48     62.6/52.6/58.2±     /56.1/58.3±              /51.6/51.8±1.1
NB      81.6/42.4/64.2±     /43.7/61.8±              /59.2/53.4±1.0
SMO     81.5/45.6/65.5±     /56.6/67.7±              /61.7/56.8±0.66
LR      74.4/51.1/64.0±     /60.8/65.8±              /59.3/57.3±2.8

Factor Analysis
- Purpose: viewing 112 variables is hard; data reduction lets us concentrate on the most important factors in the data.
- Figure: distribution of two qualities over factor 1 and factor 2. “Depth” is on the left, “Multi-side” is on the right.

Gaussian-Bayesian Classifier
- If P(x|C1)P(C1) > P(x|C2)P(C2), classify x as class I; otherwise classify x as class II.
- Here P(x|Ci) is the Gaussian class-conditional density of class Ci and P(Ci) is the class prior.
- Singularity elimination (discard trivial eigenvalues)
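The decision rule above can be sketched as follows. This is a simplified illustration and not the classifier from the slides: it assumes a diagonal covariance (independent features) so that it runs with the standard library only, and it stands in for the eigen-based singularity elimination by dropping features whose variance is (near) zero in either class. The names `fit_class`, `log_score`, `train`, `classify`, and the `min_var` threshold are hypothetical.

```python
import math


def fit_class(rows):
    """Estimate the per-feature mean and variance for one class."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return means, variances


def log_score(x, means, variances, prior, keep):
    """Return log P(x|C) + log P(C), using only the retained features."""
    total = math.log(prior)
    for j in keep:
        total += -0.5 * math.log(2 * math.pi * variances[j])
        total += -((x[j] - means[j]) ** 2) / (2 * variances[j])
    return total


def train(class1_rows, class2_rows, min_var=1e-9):
    """Fit both classes and record which features are safe to use."""
    m1, v1 = fit_class(class1_rows)
    m2, v2 = fit_class(class2_rows)
    # Assumption: zero-variance features play the role of the singular
    # directions that the slide removes via eigenvalues.
    keep = [j for j in range(len(m1)) if v1[j] > min_var and v2[j] > min_var]
    n1, n2 = len(class1_rows), len(class2_rows)
    return {
        "c1": (m1, v1, n1 / (n1 + n2)),
        "c2": (m2, v2, n2 / (n1 + n2)),
        "keep": keep,
    }


def classify(model, x):
    """Return 1 if P(x|C1)P(C1) > P(x|C2)P(C2), else 2."""
    s1 = log_score(x, *model["c1"], model["keep"])
    s2 = log_score(x, *model["c2"], model["keep"])
    return 1 if s1 > s2 else 2
```

Comparing log scores gives the same decision as comparing the products directly, because the logarithm is monotonic, and it avoids numerical underflow when many features are multiplied together.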

GBC Results

GBC (Continued)
- The Gaussian boundary does not perform as well as a linear boundary (Logistic Regression and Support Vector Machine).
- One reason: the feature distributions are not Gaussian.
- Figure: distributions of feature NN. (a) is the distribution for low objectivity, (b) is the distribution for high objectivity.