Genre as Noise - Noise in Genre
Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz
CIS: University of Munich / AICML: University of Alberta

Motivation
For search applications we often want to narrow down the result set to a certain class of documents.
For corpus construction, excluding certain document classes can be helpful.
Documents with a high rate of errors can be harmful in applications such as computer-aided language learning (CALL) or lexicon construction.
Documents of certain classes may be more erroneous than others.
It therefore makes sense to investigate the implications of document genre for noise reduction.

Definition of Genre
A partition of documents into distinct classes of text with similar function and form.
An independent dimension, ideally orthogonal to topic.
Examples of document genres: blogs, guestbooks, science reports.
Mixed documents are possible: documents whose parts belong to different genres.

Two different views on Genre
A document with the wrong genre will often be noise: Macro-Noise.
In documents of different genres we find different amounts of noise: Micro-Noise.

Outline
Introduction of a new genre hierarchy
Macro-Noise detection
–Feature Space
–Classifiers
–Experiments and applications
Micro-Noise detection
–Error dictionaries
–Experiments on the correlation of genre and noise
–Experiments on classification by noise

A hierarchy of Genres
Demands on a genre classification schema:
–Task-oriented granularity
–Hierarchical
–Logically consistent
–Complete

A hierarchy of Genres
8 container classes with 32 leaf genres

Corpus
Container classes:
–Allow comparison with other classification schemas
–Allow evaluating the seriousness of classification errors
Training and evaluation corpus:
For each of the 32 genres, 20 English HTML web documents for training and 20 documents for testing were collected, leading to a corpus of 1,280 files.

Detection of Macro-Noise
Macro-Noise detection is a classification problem:
–Select candidate features
–Apply a feature selection mechanism
–Build classifiers
–Combine classifiers for classification

Feature Space
Examples of features:
–Form: line length, number of sentences
–Vocabulary: specialized word lists, dictionaries, multi-lexemic expressions
–Structure: POS
–Complex patterns: style
Altogether we got over 200 features for the 32 genres.
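
A minimal sketch of how such surface features might be computed; the feature names and exact definitions here are assumptions for illustration, not the authors' original feature set:

```python
import re

def form_features(text, wordlist=None):
    """Compute a few simple form and vocabulary features."""
    lines = text.splitlines() or [""]
    # Rough sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    features = {
        "avg_line_length": sum(len(l) for l in lines) / len(lines),
        "num_sentences": len(sentences),
        "num_tokens": len(tokens),
    }
    if wordlist is not None and tokens:
        # Vocabulary feature: fraction of tokens drawn from a specialized word list.
        features["wordlist_ratio"] = sum(t in wordlist for t in tokens) / len(tokens)
    return features
```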

Feature Space
Kernel question: selection of features
–Global feature sets for the standard machine learning algorithms
–Specialized feature sets for our specialized classifiers
–A small set of significant and natural features for each genre
–Avoiding accidental similarities between documents

Feature Space
Feature selection for specialized genre classifiers:

  do
    select candidate feature (ordered by classification strength)
    add feature if performance of classification improves
    prune features that have become obsolete
  until Recall > 90/75% && Precision > 90/75%

Rules: constructed as inequations with discriminative ranges
Classifiers: conjunction of single rules
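
A minimal sketch of this greedy selection loop, assuming a scoring function `evaluate(features)` that returns (precision, recall) for a classifier built from a feature set; all names are hypothetical:

```python
def select_features(candidates, evaluate, p_target=0.9, r_target=0.9):
    """Greedy forward feature selection as sketched on the slide.
    `candidates` are assumed pre-ordered by classification strength;
    `evaluate(feats)` returns (precision, recall) on held-out data."""
    selected = []
    best_p, best_r = evaluate(selected)
    for feat in candidates:
        trial = selected + [feat]
        p, r = evaluate(trial)
        if p + r > best_p + best_r:  # keep only features that help
            selected, best_p, best_r = trial, p, r
            # Prune features that have become obsolete.
            for old in list(selected[:-1]):
                without = [f for f in selected if f is not old]
                p2, r2 = evaluate(without)
                if p2 + r2 >= best_p + best_r:
                    selected, best_p, best_r = without, p2, r2
        if best_p > p_target and best_r > r_target:
            break
    return selected
```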

Classifiers
Example: a classifier for reportage as a conjunction of single rules
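
The slide shows the actual rule set as a figure. A minimal sketch of the general form, with invented feature names and ranges, might look like this:

```python
# Each rule is an inequation confining one feature to a discriminative range;
# the classifier fires only if every rule holds (conjunction).
# Feature names and ranges are made up for illustration.
reportage_rules = [
    ("avg_line_length", 40.0, 120.0),
    ("num_sentences",   10.0, float("inf")),
    ("wordlist_ratio",   0.0, 0.05),
]

def matches_genre(features, rules):
    """True iff all rules of the conjunction hold for this feature vector."""
    return all(lo <= features.get(name, 0.0) <= hi for name, lo, hi in rules)
```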

Classifiers
Classifier combination:
–Filtering: one class serves as a disqualification criterion for another class in the case of multiple classifications
–Ordering by F1 value: classifiers that are more likely to yield a correct classification are applied first
–Ordering by dependencies and recall: a graph whose edges represent the number of misclassifications of one class as another controls the sequence of classifier application; edges with smaller values are traversed first, leading to fewer wrong classifications
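
A minimal sketch of the F1-ordered strategy, reusing `matches_genre` from the sketch above; the layout of `classifiers` (genre mapped to an F1 score and a rule list) is an assumption:

```python
def classify(features, classifiers):
    """Apply specialized classifiers in descending order of F1 value and
    return the first genre whose rule conjunction matches."""
    for genre, (f1, rules) in sorted(
        classifiers.items(), key=lambda kv: kv[1][0], reverse=True
    ):
        if matches_genre(features, rules):
            return genre
    return None  # no specialized classifier fired
```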

Experiments on Macro-Noise
Detection of genre:
–On the test corpus the specialized classifiers reach a precision of 72.2% and an overall recall of 54.0%
–Superior to machine learning methods, with SVM as the best of these at 51.9% precision and 47.8% recall
–The superiority holds only for the small training corpora
–Work on incremental classifier improvement and on the behavior with bigger training sets is forthcoming

Experiments on Macro-Noise
Application 1: retrieving scientific articles on fish
–Queries like (cod ∧ habitat) are sent to a search engine to retrieve scientific documents
–Evaluation over the 30 top-ranked documents of a query
–Precision and recall at cut-off points of 5, 10, 15, and 20 documents could be significantly improved by genre recognition, leaving room for further improvement

Experiments on Macro-Noise
Application 2: language models for speech recognition
–Language models built from speech corpora are notoriously sparse
–The standard solution, augmentation with text documents, can be improved by choosing genres similar to spoken text, such as forum, interview, and blog
–The noise in a crawled corpus of ~30,000 documents could be reduced to a residue of 2.5%

Detection of Micro-Noise
Examples of Micro-Noise: typing errors, cognitive errors
Method: detection of errors with specialized error dictionaries

Error Dictionaries
Construction principle: Micro-Noise arises from elucidable channel characteristics. These characteristics can be discovered analytically or by observation in a training corpus.
Transition rules: R_i := lαr → lβr, with l, α, β, r as character sequences. These rules are applied to a vocabulary base that should represent the documents to be processed. Productivity depends on the context l, r.
We get a raw error dictionary D_err-raw with entries [error token | original token | character transition(s)].
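
A minimal sketch of applying such transition rules to a vocabulary base; the rule representation and the tiny example are assumptions for illustration:

```python
def apply_rules(vocabulary, rules):
    """Generate raw error-dictionary entries by applying transition rules
    l+alpha+r -> l+beta+r at every matching position of each word.
    Returns entries of the form (error token, original token, rule)."""
    entries = []
    for word in vocabulary:
        for l, alpha, beta, r in rules:
            pattern = l + alpha + r
            start = word.find(pattern)
            while start != -1:
                erroneous = word[:start] + l + beta + r + word[start + len(pattern):]
                entries.append((erroneous, word, f"{l}{alpha}{r}->{l}{beta}{r}"))
                start = word.find(pattern, start + 1)
    return entries

# Example: a typing-error rule swapping 'ei' for 'ie' between 'c' and 'v'.
print(apply_rules(["receive"], [("c", "ei", "ie", "v")]))
# -> [('recieve', 'receive', 'ceiv->ciev')]
```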

Error Dictionaries
Filtering step: the raw error dictionary D_err-raw is filtered against a collection of relevant positive dictionaries, leading to two error dictionaries:
–D_err: non-word errors
–D_err-ff: word errors, false friends
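
A minimal sketch of this filtering step, assuming the raw entries produced by the generator above and a positive dictionary given as a set of valid words:

```python
def filter_entries(raw_entries, positive_dictionary):
    """Split raw entries into non-word errors (D_err) and word errors /
    false friends (D_err-ff) by checking each error token against a
    positive dictionary of valid words."""
    d_err, d_err_ff = [], []
    for entry in raw_entries:
        error_token = entry[0]
        if error_token in positive_dictionary:
            d_err_ff.append(entry)   # the error is itself a valid word
        else:
            d_err.append(entry)      # non-word error
    return d_err, d_err_ff
```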

Error Dictionaries
Usage of error dictionaries:
–With a base of 100,000 English words, we got a filtered error dictionary for typing errors with 9,427,051 entries
–For cognitive errors, we got a lexicon with 1,202,997 entries
–Recall 60%, precision 85% on a reference corpus
–Error detection: scan the text with the error dictionary and compute the mean error rate per 1,000 tokens
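
A minimal sketch of that scanning step, assuming the error dictionary is available as a set of error tokens:

```python
import re

def error_rate_per_1000(text, error_tokens):
    """Scan a document and return its mean error rate per 1,000 tokens.
    `error_tokens` is assumed to be a set of error strings from D_err."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in error_tokens)
    return 1000.0 * hits / len(tokens)
```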

Experiments on Micro-Noise
Correlation of error rate and genre:
–For each genre in the genre corpus we computed the errors per 1,000 tokens with the help of the two error dictionaries
–We got a strong correlation between genre and mean error rate
–Extreme values are legal texts, with 0.23 errors per 1,000 tokens, and guestbooks, with 6.23 errors per 1,000 tokens

Experiments on Micro-Noise
Stability of the values for training and test corpora: both yield a similar plot of per-genre error rates

Experiments on Micro-Noise
Preliminary experiments on using Micro-Noise for classification:
–Extending the specialized genre classifiers with a filter based on the mean error rate: improved precision for 5 genres, but 1 classifier lost performance, and recall was lower for 3 genres
–SVM classifier with the mean error rate as a new feature: also equivocal results, with improvements for some of the genres
–Problem: high variance of the error rate, with error-free documents even in genres with a high mean error rate

Conclusion
For certain applications, the dimension genre partitions document repositories into noise and wanted documents.
We introduced a new genre hierarchy that allows informed corpus construction.
Our easy-to-implement specialized classifiers reach competitive results for genre recognition, even with small training corpora.
Error dictionaries can be used to estimate the mean error rates of documents.
We found a strong correlation between genre and the error rate.
Classification by noise leads to equivocal results.

Future Work
We will try to convince other researchers to build up a corpus with at least 1,000 documents per genre.
We are working on an incremental learning algorithm that improves our classifiers from user click behavior.
The correlation of genre and error rates will be further investigated on a bigger genre corpus with an exhaustive statistical analysis.
Regarding the effects of errors on IR applications, the repair potential of error dictionaries will be investigated.

Thank you for your attention!