Genre as Noise - Noise in Genre Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz CIS: University of Munich AICML: University of Alberta.


1 Genre as Noise - Noise in Genre
Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz
CIS: University of Munich; AICML: University of Alberta

2 Motivation
For search applications, we often want to narrow the result set down to a certain class of documents
For corpus construction, excluding certain document classes can be helpful
Documents with a high error rate can do harm in applications such as computer-aided language learning (CALL) or lexicon construction
Documents of certain classes may be more erroneous than others
It therefore makes sense to investigate the implications of document genre for noise reduction

3 Definition of Genre
Partition of documents into distinct classes of text with similar function and form
Independent dimension, ideally orthogonal to topic
Examples of document genres: blogs, guestbooks, science reports
Mixed documents are possible, i.e. documents whose parts belong to different genres

8 Two different views on Genre
A document with the wrong genre will often be noise: Macro-Noise
In documents of different genres we find different amounts of noise: Micro-Noise

9 Outline
Introduction of a new genre hierarchy
Macro-Noise detection
– Feature space
– Classifiers
– Experiments and applications
Micro-Noise detection
– Error dictionaries
– Experiments on correlation of genre and noise
– Experiments on classification by noise

10 A hierarchy of Genres
Requirements for a genre classification schema:
– Task-oriented granularity
– Hierarchical
– Logically consistent
– Complete

11 A hierarchy of Genres
8 container classes with 32 leaf genres

12 Corpus
Container classes:
– Allow comparison with other classification schemas
– Allow evaluation of the seriousness of classification errors
Training and evaluation corpus:
– For each of the 32 genres, 20 English HTML web documents for training and 20 documents for testing were collected, giving a corpus of 1,280 files

13 Detection of Macro-Noise
Macro-Noise detection is a classification problem:
– Candidate features
– Feature selection mechanism
– Build classifiers
– Combine classifiers for classification

14 Feature Space
Examples of features:
– Form: line length, number of sentences
– Vocabulary: specialized word lists, dictionaries, multi-lexemic expressions
– Structure: part-of-speech (POS) information
– Complex patterns: style
Altogether we obtained over 200 features for the 32 genres

15 Feature Space
Core question: selection of features
Global feature sets for the standard machine learning algorithms
Specialized feature sets for our specialized classifiers:
– Small set of significant and natural features for each genre
– Avoids accidental similarities between documents

16 Feature Space
Feature selection for specialized genre classifiers:
do
    select candidate feature
    add feature if performance of classification improves
    order by classification strength
    prune features that have become obsolete
until Recall > 90/75% && Precision > 90/75%
Rules: constructed as inequalities with discriminative ranges
Classifiers: conjunction of single rules
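The selection loop on this slide can be sketched as a greedy forward selection with a pruning step. This is only an illustration under stated assumptions, not the authors' implementation: the `evaluate` callback, the use of F1 as the "performance" measure, the single 0.90 threshold, and the toy feature names are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Score = Tuple[float, float]  # (precision, recall)


def f1(score: Score) -> float:
    """Harmonic mean of precision and recall (assumed performance measure)."""
    precision, recall = score
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def select_features(
    candidates: Iterable[str],
    evaluate: Callable[[List[str]], Score],
    min_precision: float = 0.90,
    min_recall: float = 0.90,
) -> List[str]:
    """Greedy forward selection with pruning, stopping at the thresholds."""
    # "Ordering by classification strength": rank candidates by solo F1.
    ranked = sorted(candidates, key=lambda f: f1(evaluate([f])), reverse=True)
    selected: List[str] = []
    best = 0.0
    for feature in ranked:
        trial = selected + [feature]
        score = f1(evaluate(trial))
        if score > best:  # add the feature only if performance improves
            selected, best = trial, score
            # Prune earlier features that have become obsolete.
            for old in list(selected[:-1]):
                reduced = [f for f in selected if f != old]
                reduced_score = f1(evaluate(reduced))
                if reduced_score >= best:
                    selected, best = reduced, reduced_score
        precision, recall = evaluate(selected)
        if precision > min_precision and recall > min_recall:
            break  # "until Recall > ... && Precision > ..."
    return selected


def toy_evaluate(features: List[str]) -> Score:
    """Hypothetical evaluator: only two of the features carry signal."""
    hits = len(set(features) & {"line_length", "quote_count"})
    value = 0.5 + 0.22 * hits
    return (value, value)
```

With the toy evaluator, `select_features(["line_length", "noise_feature", "quote_count", "extra"], toy_evaluate)` keeps only the two informative features and stops as soon as both thresholds are exceeded.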

17 Classifiers Example: Classifier for reportage as a conjunction of single rules
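The example shown on this slide is not preserved in the transcript. As a stand-in, a classifier built as a conjunction of range rules might look like the sketch below. The feature names and ranges are invented for illustration and are not the rules used for the reportage classifier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """One inequality rule: low <= feature value <= high."""
    feature: str
    low: float
    high: float

    def holds(self, features: Dict[str, float]) -> bool:
        value = features.get(self.feature)
        return value is not None and self.low <= value <= self.high


def classify(rules: List[Rule], features: Dict[str, float]) -> bool:
    """A document belongs to the genre only if every rule holds."""
    return all(rule.holds(features) for rule in rules)


# Hypothetical rules; the real discriminative ranges are not in the transcript.
REPORTAGE_RULES = [
    Rule("mean_sentence_length", 15.0, 35.0),
    Rule("quotation_ratio", 0.02, 1.0),
    Rule("first_person_ratio", 0.0, 0.03),
]
```

A document with `mean_sentence_length=22`, `quotation_ratio=0.05`, and `first_person_ratio=0.01` satisfies all three rules; raising `first_person_ratio` to 0.2 makes the conjunction fail.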

18 Classifiers
Classifier combination:
Filtering: one class acts as a disqualification criterion for another class in the case of multiple classifications
Ordering by F1 value: classifiers that are more likely to lead to a correct classification are applied first
Ordering by dependencies and recall: a graph whose edges represent the number of wrong classifications of one class as another controls the sequence of classifier application. Edges with smaller values are traversed first, leading to fewer wrong classifications
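The first two combination strategies, filtering and ordering by F1, can be sketched together as follows. The function name, the data layout, and the example genres are assumptions made for illustration; the graph-based ordering is not covered here.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Features = Dict[str, float]
Classifier = Callable[[Features], bool]


def combine_by_f1(
    classifiers: Dict[str, Tuple[Classifier, float]],
    disqualifies: Dict[str, Set[str]],
    features: Features,
) -> Optional[str]:
    """Return one genre for a document, or None if no classifier fires.

    classifiers maps genre -> (classifier, F1 measured on held-out data).
    disqualifies maps genre -> genres it rules out when both fire.
    """
    # Ordering by F1: the most reliable classifiers are consulted first.
    ordered = sorted(classifiers.items(), key=lambda item: item[1][1], reverse=True)
    matches = [genre for genre, (clf, _score) in ordered if clf(features)]

    # Filtering: a matching class can disqualify other matching classes.
    blocked: Set[str] = set()
    for genre in matches:
        blocked |= disqualifies.get(genre, set())

    for genre in matches:
        if genre not in blocked:
            return genre
    return None
```

For example, if hypothetical `blog` and `forum` classifiers both fire and `forum` is registered as disqualifying `blog`, the function returns `forum` even though `blog` has the higher F1.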

19 Experiments on Macro-Noise
Detection of genre:
On the test corpus, the specialized classifiers reach a precision of 72.2% and an overall recall of 54.0%
This is superior to standard machine learning methods; the best of them, SVM, reaches 51.9% precision and 47.8% recall
The superiority can be claimed only for small training corpora
Work on incremental classifier improvement and on the behavior with bigger training sets is forthcoming
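To compare the two systems with a single number, the reported precision and recall can be combined into F1. The values computed below (roughly 0.62 and 0.50) are derived here from the figures on the slide and are not reported in the presentation itself.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# Precision and recall as given on the slide.
specialized_f1 = f1_score(0.722, 0.540)
svm_f1 = f1_score(0.519, 0.478)
```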

20 Experiments on Macro-Noise
Application 1: Retrieving scientific articles on fish
Queries like (cod ∧ habitat) are sent to a search engine to retrieve scientific documents
Evaluation over the 30 top-ranked documents of each query
Precision and recall at cut-points of 5, 10, 15, and 20 documents could be significantly improved by genre recognition, leaving room for further improvement
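Precision and recall at a cut-point can be computed as in the sketch below. The relevance flags in the usage note are made up for illustration and are not data from the experiment.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant[:k]) / k


def recall_at_k(relevant: Sequence[bool], k: int, total_relevant: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if total_relevant <= 0:
        return 0.0
    return sum(relevant[:k]) / total_relevant
```

With a hypothetical ranking `[True, False, True, True, False, False, True, False, False, False]`, precision at 5 is 0.6 and precision at 10 is 0.4. Removing documents of the wrong genre before ranking moves relevant documents up, which is the effect the slide reports.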

21 Experiments on Macro-Noise
Application 2: Language models for speech recognition
Language models built from speech corpora are notoriously sparse
The standard solution, augmentation with text documents, should be improved by choosing genres similar to spoken text, such as forum, interview, and blog
The noise in a crawled corpus of ~30,000 documents could be reduced to a residue of 2.5%

22 Detection of Micro-Noise
Examples of Micro-Noise: typing errors, cognitive errors
Method: detection of errors with specialized error dictionaries

23 Error Dictionaries
Construction principle: Micro-Noise arises from channel characteristics that can be elucidated. These characteristics can be discovered analytically or by observation in a training corpus.
Transition rules: R_i := lαr → lβr, with l, α, β, r as character sequences
These rules are applied to a vocabulary base that should represent the documents to be processed. Productivity depends on the context l, r.
We get a raw error dictionary D_err-raw with entries of the form [error token | original token | character transition(s)]
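Applying a transition rule lαr → lβr to a vocabulary can be sketched as below. The class and function names are hypothetical, and the example rule (swapping "ie" to "ei" between "l" and "v") is chosen for illustration rather than taken from the authors' rule set.

```python
from typing import Iterable, List, NamedTuple


class TransitionRule(NamedTuple):
    left: str   # context l
    alpha: str  # correct character sequence
    beta: str   # erroneous replacement
    right: str  # context r


class RawEntry(NamedTuple):
    error: str
    original: str
    transition: str


def apply_rule(word: str, rule: TransitionRule) -> List[RawEntry]:
    """Rewrite every occurrence of l+alpha+r in word to l+beta+r."""
    pattern = rule.left + rule.alpha + rule.right
    entries: List[RawEntry] = []
    start = word.find(pattern)
    while start != -1:
        a = start + len(rule.left)   # where alpha begins
        b = a + len(rule.alpha)      # where alpha ends
        error = word[:a] + rule.beta + word[b:]
        entries.append(RawEntry(error, word, f"{rule.alpha}->{rule.beta}"))
        start = word.find(pattern, start + 1)
    return entries


def build_raw_dictionary(
    vocabulary: Iterable[str], rules: Iterable[TransitionRule]
) -> List[RawEntry]:
    """Collect all error candidates: the raw error dictionary."""
    rule_list = list(rules)
    return [e for word in vocabulary for rule in rule_list for e in apply_rule(word, rule)]
```

For instance, `apply_rule("believe", TransitionRule("l", "ie", "ei", "v"))` yields the single entry `("beleive", "believe", "ie->ei")`. Each vocabulary word can produce many candidates, which is consistent with a 100,000-word base growing into millions of entries.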

24 Error Dictionaries
Filtering step: the raw error dictionary D_err-raw is filtered against a collection of relevant positive dictionaries, leading to two error dictionaries:
– D_err: non-word errors
– D_err-ff: word errors, false friends
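One plausible reading of the filtering step is sketched below: an error token that also appears in a positive dictionary is a false friend, and every other error token is a non-word error. This split criterion and the function name are assumptions; the slides do not spell out the exact procedure.

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str, str]  # (error token, original token, transition)


def filter_raw_dictionary(
    raw_entries: Iterable[Entry], positive_words: Set[str]
) -> Tuple[List[Entry], List[Entry]]:
    """Split raw entries into (D_err, D_err-ff)."""
    d_err: List[Entry] = []      # non-word errors
    d_err_ff: List[Entry] = []   # word errors, false friends
    for entry in raw_entries:
        error, original, _transition = entry
        if error == original:
            continue  # the rule did not change the token
        if error in positive_words:
            d_err_ff.append(entry)
        else:
            d_err.append(entry)
    return d_err, d_err_ff
```

Given the raw entries `("beleive", "believe", "ie->ei")` and `("from", "form", "or->ro")` and a positive dictionary containing "believe", "form", and "from", the first entry lands in D_err and the second in D_err-ff.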

25 Error Dictionaries
Usage of error dictionaries:
With a base of 100,000 English words we obtained a filtered error dictionary for typing errors with 9,427,051 entries
For cognitive errors we obtained a lexicon with 1,202,997 entries
Recall 60%, precision 85% on a reference corpus
Error detection: scan the text with the error dictionary and compute the mean error rate per 1,000 tokens
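The error detection step amounts to a dictionary lookup per token. The sketch below assumes a simple regular-expression tokenizer and a set of error tokens; both are simplifications, and the helper names are hypothetical.

```python
import re
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def errors_per_1000(text: str, error_tokens: Set[str]) -> float:
    """Number of tokens found in the error dictionary per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    errors = sum(1 for token in tokens if token in error_tokens)
    return 1000.0 * errors / len(tokens)


def mean_error_rate(documents: Iterable[str], error_tokens: Set[str]) -> float:
    """Mean of the per-document error rates, e.g. over one genre."""
    rates = [errors_per_1000(doc, error_tokens) for doc in documents]
    return sum(rates) / len(rates) if rates else 0.0
```

For the seven-token text "I beleive this is teh right form" and an error set `{"beleive", "teh"}`, the rate is 2 errors in 7 tokens, about 285.7 per 1,000. Because the dictionary has limited recall and precision, the resulting figure is an estimate rather than an exact count.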

26 Experiments on Micro-Noise
Correlation of error rate and genre:
For each genre in the genre corpus we computed the errors per 1,000 tokens using the two error dictionaries
We found a strong correlation between genre and mean error rate
The extremes are legal texts with 0.23 errors per 1,000 tokens and guestbooks with 6.23 errors per 1,000 tokens

27 Experiments on Micro-Noise Stability of the values for Training and Test corpora: similar plot

28 Experiments on Micro-Noise
Preliminary experiments on using Micro-Noise for classification:
Extension of the specialized genre classifiers by a filter based on the mean error rate: precision improved for 5 genres, but 1 classifier lost performance and recall was lower for 3 genres
SVM classifier with the mean error rate as a new feature: also equivocal results, with improvements for some of the genres
Problem: high variance of the error rate, with error-free documents occurring even in genres with a high mean error rate

35 Conclusion
For certain applications, the genre dimension partitions document repositories into noise and wanted documents
We introduced a new genre hierarchy that allows informed corpus construction
Our easy-to-implement specialized classifiers reach competitive results for genre recognition, even with small training corpora
Error dictionaries can be used to estimate the mean error rates of documents
We found a strong correlation between genre and the error rate
Classification by noise leads to equivocal results

36 Future Work
We will try to convince other researchers to build a corpus with at least 1,000 documents per genre
We are working on an incremental learning algorithm that improves our classifiers using user click behavior
The correlation of genre and error rates will be further investigated on a bigger genre corpus with an exhaustive statistical analysis
Regarding the effects of errors on IR applications, the repair potential of error dictionaries will be investigated

37 Thank you for your attention!

