© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.

Presentation transcript:

Slide 1: Generic Sentence Classification: Examining the Scenario of Scientific Abstracts and Scrum Protocols
Authors: Sebastian Schmidt (presenting), Steffen Schnitzer, Christoph Rensing
Prof. Dr.-Ing. Ralf Steinmetz, KOM - Multimedia Communications Lab
14 September 2014

Slide 2: Outline
- Introduction
  - Motivation
  - Challenge and concept
- Scenarios
  - Overview
  - Corpora
- Approach used for classification
- Evaluation
  - Setup
  - Results for the scenarios
- Conclusion and Future Work

Slide 3: Motivation
- Information overload through a flood of textual documents
  - Professional settings
  - Research settings
  - Educational settings
- It is hard for individuals to find the textual documents that are relevant to their information need
- String-based filtering can help to reduce the number of documents to be read
  - “Find online tutorials that deal with Java”
  - “I am searching for a job in the pharmaceutical sector”

Slide 4: Challenge & Concept
- Contextual ambiguity
  - “Cleaning staff wanted! We are a company in the pharmaceutic sector.” vs. “We are acquiring people having pharmaceutic training”
  - “For taking this course you should know about Java programming.” vs. “After this course you will be an expert in Java programming.”
- Pre-filtering of text sections can help!
  - Based on the type of information contained
- Goal: A generic concept for sentence-type classification
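The difference between plain string-based filtering and filtering that also uses the sentence type can be sketched as follows. This is only an illustration under assumptions: the `Sentence` class, the function names, and the type labels "prerequisite" and "outcome" are hypothetical and do not come from the slides, and the sentence types are assigned by hand here rather than by a classifier.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    sentence_type: str  # hypothetical label, assigned by hand in this sketch


def string_filter(sentences, keyword):
    """String-based filtering: keep every sentence that mentions the keyword."""
    return [s for s in sentences if keyword.lower() in s.text.lower()]


def typed_filter(sentences, keyword, wanted_type):
    """Keep only keyword matches whose sentence type fits the information need."""
    return [s for s in string_filter(sentences, keyword)
            if s.sentence_type == wanted_type]


# The two course descriptions from the slide, with assumed type labels.
course_sentences = [
    Sentence("For taking this course you should know about Java programming.",
             "prerequisite"),
    Sentence("After this course you will be an expert in Java programming.",
             "outcome"),
]

keyword_hits = string_filter(course_sentences, "Java")
outcome_hits = typed_filter(course_sentences, "Java", "outcome")
```

Both sentences match the keyword "Java", but only the second one survives once the type "outcome" is required, which is the ambiguity the slide describes.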

Slide 6: Scenarios: Abstracts of Scientific Articles
- An abstract presents the content of an article in condensed form
- Typical queries from researchers
  - Which other articles face a particular problem?
  - Which other articles use a particular approach?
  - Which approach performs best for a specific problem?
- Types can be assigned to the sentences, e.g.
  - Motivation
  - Goals
  - Related Work
- → Knowing the sentence type simplifies the execution of these queries

Slide 7: Scenarios: Protocols of Scrum Retrospective Meetings
- Common questions (with variations)
  - What went well?
  - What went wrong?
  - What could be improved?
- Often informal content
  - “Testing took too long”
  - “Teamwork was excellent”
  - ...
- Management might be interested in particular ones only
- Automated assignment to questions could simplify the creation of the protocols
Image source: commons.wikimedia.org

Slide 8: Corpora: Abstracts of Scientific Articles (Multimedia)

Slide 9: Corpora: Abstracts of Scientific Articles ([1])
- [number missing] abstracts
  - 8,633 sentences
- Biomedical domain
- 7 classes
  - Background
  - Objective
  - Result
  - ...
- Sentences annotated with one label by three annotators
  - High inter-annotator agreement (κ = 0.85)
  - → Annotations of only one annotator were used
- → Corpus BioM
Image source: sciences/biomedical-and-environmental-health/biomedical-and-environmental-health.aspx
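As an illustration of what an agreement value such as κ = 0.85 measures, the sketch below computes Cohen's kappa for two annotators. The slide does not say which kappa variant was used for the three annotators, so the pairwise Cohen's kappa shown here is an assumption, and the toy labels are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty list")
    n = len(labels_a)
    # Observed agreement: share of sentences with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy annotations using three of the class names from the slide.
annotator_1 = ["Background", "Objective", "Result", "Result"]
annotator_2 = ["Background", "Objective", "Result", "Background"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects the raw agreement (3 of 4 sentences in the toy data) for the agreement expected by chance, so it is lower than the raw agreement rate.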

Slide 10: Corpora: Protocols of Scrum Retrospective Meetings
- Scrum retrospective protocols from a major software company
  - 653 sentences
- Sentences were clustered into
  - “What went well?”
  - “What went wrong?”
  - “What could be improved?”
- → Corpus Scrum
- All sentences that could not be assigned to a cluster by humans were removed, e.g.
  - “Timing”
  - “Collaboration with Peter Smith”
- → Corpus Scrum_Subset

Slide 12: Approach
- Supervised classification with domain-independent features
- 10 feature groups
  - Content: all words as features
  - Sentiment: positive/negative, based on a word-to-sentiment mapping
  - Negation: count of negation words
  - Tense: based on the Stanford Lexicalized Parser
  - Tense indicator: based on word endings and modal verbs
  - Adjectives: based on the Stanford Lexicalized Parser
  - Indicative indicator: count of “need”, “should”, “must”
  - Personal pronouns: based on the Stanford Lexicalized Parser
  - Position of the sentence: normalized position of the sentence within its context
  - Number of words: total number of words
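Four of the feature groups on this slide need no parser and can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the tokenizer, the negation word list, and the normalization of the position to the range 0 to 1 are assumptions; only the indicative words “need”, “should”, “must” are taken from the slide.

```python
import re

# Assumed negation list; the slides do not say which words were counted.
NEGATION_WORDS = {"not", "no", "never", "none", "cannot"}
# Indicative words as named on the slide.
INDICATIVE_WORDS = {"need", "should", "must"}


def tokenize(sentence):
    """Very simple lowercase word tokenizer (an assumption of this sketch)."""
    return re.findall(r"[a-z']+", sentence.lower())


def extract_features(sentence, index, total):
    """Compute four parser-free features for the sentence at `index` of `total`."""
    tokens = tokenize(sentence)
    return {
        "negation_count": sum(
            t in NEGATION_WORDS or t.endswith("n't") for t in tokens
        ),
        "indicative_count": sum(t in INDICATIVE_WORDS for t in tokens),
        # Position normalized to [0, 1]; 0.0 for a single-sentence context.
        "position": index / (total - 1) if total > 1 else 0.0,
        "num_words": len(tokens),
    }


features = extract_features("We should not skip testing.", index=1, total=3)
```

Features of this kind depend only on surface properties of a sentence, which is what makes them usable across domains as different as scientific abstracts and Scrum protocols.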

Slide 14: Evaluation Setup
- Different classifiers used
  - Support Vector Machines
  - Naïve Bayes
  - J48
- Weka
- 10-fold cross-validation
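The evaluation itself was run in Weka; the following Python sketch only illustrates how a 10-fold cross-validation partitions a corpus. The function name and the fixed seed are assumptions, and the split is a plain shuffled split without stratification by class.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes into the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 653 is the number of sentences reported for the Scrum corpus.
splits = list(k_fold_indices(653, k=10))
```

Each sentence is used for testing exactly once and for training in the other nine rounds; the reported score is then typically the average over the ten rounds.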

Slide 15: Evaluation: Abstracts of Scientific Articles (F1-Measure)
[Table: F1-measure of SVM, NB and J48 on the corpora MM and BioM. Rows: all features; single feature (Words, Position, Tense Indicator); all except single feature (Words, Position, Adjectives). The values are not preserved in this transcript.]
- Best results for SVM
- Words alone give acceptable results
- Results can be better when not all features are used
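For readers unfamiliar with the metric, the F1-measure is the harmonic mean of precision and recall. The sketch below computes it per class and as an unweighted (macro) average; the slides do not state which averaging was reported, so the macro average is an assumption, and the toy labels are invented.

```python
def f1_per_class(gold, predicted):
    """Return a dict mapping each class label to its F1 score."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        scores[label] = 2 * precision * recall / denominator if denominator else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)


# Invented toy data using class names from the BioM corpus slide.
gold_labels = ["Background", "Background", "Objective", "Result"]
predicted_labels = ["Background", "Objective", "Objective", "Result"]
per_class = f1_per_class(gold_labels, predicted_labels)
average = macro_f1(gold_labels, predicted_labels)
```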

Slide 16: Evaluation: Abstracts of Scientific Articles
- Different tag sets for the same kind of corpus seem to have only a minor influence on the results
- → The size of the evaluation data is more relevant

Slide 17: Evaluation: Protocols of Scrum Retrospective Meetings (F1-Measure)
[Table: F1-measure of SVM, NB and J48 on the corpora Scrum and Scrum_Subset. Rows: all features; single feature (Words, Sentiment, Tense Indicator); all except single feature (Words, Sentiment, Adjectives). The values are not preserved in this transcript.]
- Best results for SVM/NB
- In the subset, Sentiment is meaningful
- Results can be better when not all features are used
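The three kinds of rows in the result tables (all features, a single feature group, all except a single feature group) correspond to a feature-ablation study. The sketch below only enumerates such configurations; the group names are shortened versions of the ten groups from the approach slide, and the function name and configuration labels are assumptions.

```python
# Shortened names for the ten feature groups of the approach slide.
FEATURE_GROUPS = [
    "words", "sentiment", "negation", "tense", "tense indicator",
    "adjectives", "indicative indicator", "personal pronouns",
    "position", "number of words",
]


def ablation_configurations(groups):
    """Map a configuration name to the list of feature groups it uses."""
    configurations = {"all features": list(groups)}
    for group in groups:
        configurations[f"only {group}"] = [group]
        configurations[f"all except {group}"] = [g for g in groups if g != group]
    return configurations


configurations = ablation_configurations(FEATURE_GROUPS)
```

Training and evaluating one classifier per configuration shows both how far a single group gets on its own and how much the full set loses when that group is left out.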

Slide 19: Conclusion & Future Work
- Results generally good
  - Even with training corpora that are not very large
  - No domain-specific features required
- Worse results for the Scrum scenarios
  - Incorrect grammar
  - Many typos
  - Shorter sentences
- Adding contextual information might be helpful
- An implementation in an application is needed to evaluate the usefulness of the filtering concept

Slide 20: Questions & Contact

Slide 21: References
[1] Y. Guo, A. Korhonen, M. Liakata, I. Silins, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, BioNLP ’10, pages 99–107, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

Slide 22: Backup Slides: Results Scientific Abstracts

Slide 23: Backup Slides: Results Scrum