Generic Sentence Classification: Examining the Scenario of Scientific Abstracts and Scrum Protocols
Sebastian Schmidt (presenting), Steffen Schnitzer, Christoph Rensing
Prof. Dr.-Ing. Ralf Steinmetz, KOM – Multimedia Communications Lab, TU Darmstadt
14 September 2014
© the authors of these slides, including research results from the KOM research network and TU Darmstadt; exceptions are noted on the respective slide.
Image source: www.moebellisten.de
Outline
- Introduction
  - Motivation
  - Challenge and concept
- Scenarios
  - Overview
  - Corpora
- Approach used for classification
- Evaluation
  - Setup
  - Results for the scenarios
- Conclusion and future work
Motivation
- Information overload through a flood of textual documents in professional, research, and educational settings
- It is hard for individuals to find the textual documents that match their information need
- String-based filtering can help reduce the number of documents to be read:
  - "Find online tutorials that deal with Java"
  - "I am searching for a job in the pharmaceutical sector"
Challenge & Concept
- String matching suffers from contextual ambiguity:
  - "Cleaning staff wanted! We are a company in the pharmaceutical sector." vs. "We are recruiting people with pharmaceutical training."
  - "For taking this course you should know about Java programming." vs. "After this course you will be an expert in Java programming."
- Pre-filtering of text sections, based on the type of information they contain, can help
- Goal: a generic concept for sentence-type classification
Scenarios: Abstracts of Scientific Articles
- An abstract presents the article's content in condensed form
- Typical queries from researchers:
  - Which other articles face a particular problem?
  - Which other articles use a particular approach?
  - Which approach performs best for a specific problem?
- Types can be assigned to the sentences, e.g. Motivation, Goals, Related Work
- Knowing the sentence type simplifies the execution of such queries
Scenarios: Protocols of Scrum Retrospective Meetings
- Common questions (with variations):
  - What went well?
  - What went wrong?
  - What could be improved?
- Content is often informal: "Testing took too long", "Teamwork was excellent", ...
- Management might be interested in particular questions only
- Automated assignment of sentences to the questions could simplify the creation of the protocols
Image source: commons.wikimedia.org
Corpora: Abstracts of Scientific Articles (Multimedia)
Image source: http://digitalsherpa.com/how-to-use-social-media-to-conduct-market-research/
Corpora: Abstracts of Scientific Articles [1]
- 1,000 abstracts with 8,633 sentences from the biomedical domain
- 7 classes, e.g. Background, Objective, Result, ...
- Each sentence annotated with one label by three annotators
- High inter-annotator agreement (κ = 0.85), so the annotations of only one annotator were used
- → Corpus BioM
Image source: http://www.dmu.ac.uk/research/research-faculties-and-institutes/health-and-life-sciences/biomedical-and-environmental-health/biomedical-and-environmental-health.aspx
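The agreement figure above is Cohen's kappa. As a minimal sketch of how such a score is computed for two annotators (the label sequences below are invented for illustration, not data from the corpus):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = ["Background", "Objective", "Result", "Result", "Background", "Objective"]
b = ["Background", "Objective", "Result", "Background", "Background", "Objective"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is preferred over plain percent agreement for annotation studies like this one.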
Corpora: Protocols of Scrum Retrospective Meetings
- 139 Scrum retrospective protocols from a major software company, 653 sentences in total
- Sentences were clustered into "What went well?", "What went wrong?", and "What could be improved?" → Corpus Scrum
- All sentences that humans could not assign to a cluster were removed, e.g. "Timing", "Collaboration with Peter Smith" → Corpus Scrum_Subset
Approach
Supervised classification with domain-independent features, organized in 10 feature groups:
- Content: all words as features
- Sentiment: positive/negative, based on a word-to-sentiment mapping
- Negation: count of negation words
- Tense: based on the Stanford Lexicalized Parser
- Tense indicator: based on word endings and modal verbs
- Adjectives: based on the Stanford Lexicalized Parser
- Indicative indicator: count of "need", "should", "must"
- Personal pronouns: based on the Stanford Lexicalized Parser
- Position of the sentence: normalized position of the sentence within its context
- Number of words: total number of words
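A few of the shallower feature groups above can be sketched without a parser. A minimal example in Python; the negation word list is illustrative rather than the one used in the work, and the parser-based groups (tense, adjectives, personal pronouns) are omitted:

```python
NEGATION_WORDS = {"not", "no", "never", "without"}  # illustrative list
INDICATIVE_WORDS = {"need", "should", "must"}       # as named on the slide

def sentence_features(sentence, index, total_sentences):
    """Extract a few domain-independent feature groups for one sentence.

    `index` is the sentence's 0-based position within its document,
    used for the normalized-position feature."""
    words = sentence.lower().split()
    return {
        "negation_count": sum(w in NEGATION_WORDS for w in words),
        "indicative_count": sum(w in INDICATIVE_WORDS for w in words),
        # Normalized position within the context (0.0 = first, 1.0 = last).
        "position": index / max(total_sentences - 1, 1),
        "num_words": len(words),
    }

feats = sentence_features("We should not repeat this mistake", index=2, total_sentences=5)
print(feats)  # negation_count=1, indicative_count=1, position=0.5, num_words=6
```

Because none of these groups inspect domain vocabulary beyond small closed word lists, the same extractor applies unchanged to both the abstracts and the Scrum protocols, which is the point of the generic concept.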
Evaluation Setup
- Classifiers: Support Vector Machines, Naïve Bayes, J48
- Weka, 10-fold cross-validation
Image sources: http://www.cs.waikato.ac.nz/ml/weka/, http://scriptslines.com/blog/k-fold-cross-validation/
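The 10-fold cross-validation above was run inside Weka. The fold splitting itself can be sketched in a few lines of Python, here as a plain round-robin split without stratification, for illustration only:

```python
def k_fold_indices(n_items, k=10):
    """Partition item indices into k disjoint folds (round-robin).

    Each fold serves once as the test set while the remaining
    items form the training set, so every item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for test in folds:
        test_set = set(test)
        train = [i for i in range(n_items) if i not in test_set]
        splits.append((train, test))
    return splits

splits = k_fold_indices(25, k=10)
print(len(splits))  # → 10
# Every index lands in exactly one test fold:
print(sorted(i for _, test in splits for i in test) == list(range(25)))  # → True
```

Averaging a metric such as F1 over the k test folds, as Weka does, gives a more stable estimate than a single train/test split, which matters here because the corpora are fairly small.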
Evaluation: Abstracts of Scientific Articles (F1-measure)

                          MM                     BioM
                     SVM    NB     J48      SVM    NB     J48
All features        0.692  0.690  0.640    0.798  0.731  0.739
Single feature only
  Words             0.634  0.668  0.575    0.748  0.683  0.668
  Position          0.489  0.487  0.492    0.557  0.540  0.554
  Tense indicator   0.278  0.279  0.265    0.254  0.319
All features except one
  Words             0.555  0.492  0.510    0.666  0.605  0.648
  Position          0.634  0.656  0.576    0.750  0.670  0.675
  Adjectives        0.699  0.692  0.641    0.799  0.735  0.738

- Best results with SVM
- Words alone already yields acceptable results
- Leaving out individual features can beat using all features
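The F1-measure reported in the table is the harmonic mean of precision and recall; for a multi-class task like this it is typically averaged over the classes. The per-class computation, with invented prediction counts for illustration:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Per-class F1: harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Invented counts for one class, e.g. "Result" sentences:
# precision = 80/100 = 0.8, recall = 80/120 ≈ 0.667
print(round(f1_score(true_positives=80, false_positives=20, false_negatives=40), 3))  # → 0.727
```

Because F1 penalizes an imbalance between precision and recall, it is a stricter summary than accuracy for class distributions as skewed as sentence types in abstracts tend to be.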
Evaluation: Abstracts of Scientific Articles
- Different tag sets for the same kind of corpus seem to have only a minor influence on the results
- → The size of the evaluation data is more relevant
Evaluation: Protocols of Scrum Retrospective Meetings (F1-measure)

                          Scrum                  Scrum_Subset
                     SVM    NB     J48      SVM    NB     J48
All features        0.572  0.562  0.513    0.661  0.669  0.592
Single feature only
  Words             0.552  0.533  0.485    0.647  0.644  0.546
  Sentiment         0.323  0.379  0.425    0.415  0.464  0.458
  Tense indicator   0.357  0.339  0.410    0.366         0.315
All features except one
  Words             0.467  0.484  0.466    0.550  0.570  0.548
  Sentiment         0.558  0.550  0.495    0.656  0.650  0.565
  Adjectives        0.572  0.560  0.520    0.664  0.685  0.606

- Best results with SVM and NB
- In the subset, the Sentiment feature is meaningful
- Leaving out individual features can beat using all features
Conclusion & Future Work
- Results are generally good, even though the training corpora are not very large
- No domain-specific features required
- Worse results for the Scrum scenario: incorrect grammar, many typos, shorter sentences
- Adding contextual information might be helpful
- An implementation within an application is needed to evaluate the usefulness of the filtering concept
Questions & Contact
Image source: http://www.dreifragezeichen.de/
References
[1] Y. Guo, A. Korhonen, M. Liakata, I. Silins, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (BioNLP '10), pages 99–107, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
Backup Slides: Results for Scientific Abstracts
Backup Slides: Results for Scrum