1 Berendt: Advanced databases, first semester 2011, 1 Advanced databases – Inferring new knowledge.

1 Berendt: Advanced databases, first semester 2011
Advanced databases – Inferring new knowledge from data(bases): Text Mining II
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
Last update: 28 December 2011

2 Agenda
- Some advanced forms of text mining (index7.ppt, pp )
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations

5 Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?

6 What makes people happy?

8 News and social media, in particular tweets

9 Recall: CRISP-DM
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a data mining process model that describes the approaches expert data miners commonly use to tackle problems.

10 Business understanding

11 Data understanding

13 What are the relations between these text (parts)?

14 Or these?

15 A list of possible (and interesting) text relations in the News/Blogs/Tweets domain (relation tweet -> news article)
- Repetition (could be more interesting if repeated repetition / retweet -> repetition weights?)
- Repetition of the headline?
- Pointing to interesting links (difficult to identify? Need to process the link; there might be redirection)
- Pointing to the article
  - ... anything becomes more important if it's retweeted (endorsement?)
  - ... that may depend on WHO (re)tweets it, measured e.g. by number of followers
- Comment
- Reference to an event or topic via a hashtag (Obama election)
  - Hashtags can be used to identify a topic that might also be present in news articles (being-about-the-same-topic) -> learn from the words around the texts and from co-occurring hashtags
- Use SentiStrength to determine if a text has a positive or negative relationship with a tweet (endorsement; criticism)
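
Several of the relations above rest on surface cues in the tweet: a retweet marker, a link, a hashtag. A minimal sketch of how such cues could be extracted is given here. The function name `tweet_features` and the regular expressions are assumptions made for illustration; they are simple heuristics, not the procedure used in the Twitter study, and they do not follow link redirections.

```python
import re

# Hypothetical heuristics (assumptions, not taken from the study).
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"\bRT\b")


def tweet_features(text):
    """Return surface cues that hint at a tweet's relation to a news article."""
    return {
        # Lower-cased so that #OilSpill and #oilspill count as the same topic.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        # Raw link strings; shortened or redirected URLs are not resolved here.
        "links": URL_RE.findall(text),
        # "RT" as a separate token is the syntactic indicator of retweeting.
        "is_retweet": bool(RETWEET_RE.search(text)),
    }


# Made-up example tweet, for illustration only.
example = "RT @newsdesk: Oil spill reaches the coast http://example.org/a1 #OilSpill"
features = tweet_features(example)
```

Whether a link actually points to the article in question still has to be checked separately, since the extracted string may be a shortened URL.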

17 What is Content Analysis?
- A form of textual analysis *usually*
- Categorizes chunks of text according to Code
- Blend of qualitative and quantitative
Schwandt, Thomas A. Dictionary of Qualitative Inquiry. 2nd ed. Sage Publications: Thousand Oaks, CA,
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

18 Rough History - 1: Classical Content Analysis
- Used as early as the 1930s in military intelligence
- Analyzed items such as communist propaganda and military speeches for themes
- Created matrices searching for the number of occurrences of particular words/phrases
Roberts, C.W. "Content Analysis." International Encyclopedia of the Social and Behavioral Sciences. Elsevier: Amsterdam,
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

19 Rough History - 2: (New) Content Analysis*
- Moved into social science research
- Used to study trends in media and politics, and provides a method for analyzing open-ended questions
- Can include visual documents as well as texts
- More of a focus on phrasal/categorical entities than simple word counting
*My own terminology, more generally referred to as simply "Content Analysis"
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).

20 Procedure
1. Identify a corpus of texts and sample population
2. Determine the unit of analysis
3. Find themes (inductive or deductive)
4. Build a codebook
5. Mark the texts
6. Analyze the codes from the texts quantitatively
Denzin, Norman K. Handbook of Qualitative Research. Sage Publications: Thousand Oaks, CA,
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6).
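
Step 6 of the procedure can be as simple as counting how often each code was assigned. The following is a minimal sketch under that assumption; the function name `code_frequencies`, the `(unit_id, code)` input format, and the example codes are hypothetical and not prescribed by the sources cited above.

```python
from collections import Counter


def code_frequencies(coded_units):
    """Count codes and their shares.

    coded_units: list of (unit_id, code) pairs produced when marking the texts.
    Returns a dict mapping each code to (absolute count, relative frequency).
    """
    counts = Counter(code for _unit_id, code in coded_units)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in counts.items()}


# Made-up coded units, for illustration only.
coded = [
    ("tweet-1", "endorsement"),
    ("tweet-2", "criticism"),
    ("tweet-3", "endorsement"),
    ("tweet-4", "summary"),
]
frequencies = code_frequencies(coded)
```

Counts like these only become meaningful once the coding itself has been shown to be reliable, which is what intercoder reliability assessment is for.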

21 Coding
Analyzing the archived content. Includes:
1. Identifying units of analysis (e.g., individual user posts, game characters)
2. Creating a codebook
3. Creating coding sheets (may be electronic now)
4. Training, coding, intercoder reliability assessment, etc.
From Paul Skalski (n.d.) Content Analysis of Interactive Media. p. 10

22 Examples
To be shown in class:
- Overview of an example from the Social Web: see pp. 11ff.
Further resources include:
- A detailed example of a codebook for Content Analysis of Stories about Protest Events:
- More examples of codebooks and coding schemes:

23 Thus …

24 A first project plan (for HWs 4 and 6) – HW 4
PHASE Data understanding / initial data collection of the class attribute
- In different teams:
1. Come up with different possible relations between texts
2. Find a small number of examples
   - NB: Sampling strategy?
3. Develop a codebook and coding scheme
4. Have several coders code a larger number of examples
   - NB: Sampling strategy?
5. Measure inter-rater agreement
   - http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha
PHASE Pause – revisit the literature and re-evaluate it (not really a CRISP-DM phase …)
1. Compare your results!
2. In the light of all this, revisit (as an example from the literature) the SentiStrength coding procedure and discuss it critically
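
For step 5, a minimal sketch of Krippendorff's alpha for nominal categories (such as the relation codes) may help. It uses the coincidence-matrix formulation, alpha = 1 - D_o / D_e, and assumes that each item's labels are given as a list and that items with fewer than two labels are skipped. The function name and input format are assumptions made for illustration; for the homework, an established statistics package may be preferable.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the coders
    assigned to one item (missing codings are simply left out).
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # coders of the same item contributes 1 / (m - 1), m = labels on that item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single coding carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal metric: any two different categories count as a full disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected


# Made-up codings of four tweets by three coders, for illustration only.
example_units = [
    ["R2", "R2", "R2"],
    ["R9", "R9", "R1"],
    ["R4", "R4", "R4"],
    ["R12", "R8", "R12"],
]
alpha = krippendorff_alpha_nominal(example_units)
```

An alpha of 1 means perfect agreement, and values near 0 mean agreement no better than chance.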

25 A first project plan (for HWs 4 and 6) – HW 6
PHASE Data preparation
1. You may skip most of this phase. Take the data as prepared by Ilija!
PHASE Modelling
1. Understand / develop [depending on time and interest] formal measures of such text relations
2. Calculate the measures for the corpora
3. Calculate the accuracy of classification
PHASE Evaluation
1. Do an error analysis. Be critical with yourself, the results, and their meaning for the initial question ;-)
PHASE Deployment
1. Produce final report
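
The accuracy calculation and a first step of the error analysis can be sketched as follows. This assumes that the manual annotations serve as ground truth and that each tweet receives exactly one predicted relation; the function names and the example labels are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of items whose predicted relation equals the annotated one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def confusions(gold, predicted):
    """Count (gold, predicted) pairs that disagree, most frequent first."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common()


# Made-up labels, for illustration only.
gold_labels = ["R2", "R2", "R9", "R4", "R12"]
predicted_labels = ["R2", "R1", "R9", "R4", "R9"]
acc = accuracy(gold_labels, predicted_labels)
errors = confusions(gold_labels, predicted_labels)
```

Looking at which pairs of relations get confused most often is one way to see whether errors come from the formal measures or from overlapping categories in the codebook.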

26 First round of relations
R1: summary
R2: repetition of headline
R3: (the tweet is a) link (to the article)
R4: (the tweet is a) link to another article on the same topic
R5: comment on the article's content
R6: comment on a topic related to the article
R7: comment on the article (note: only if there is a link to the article!)
R8: hashtag-about-topic
R9: endorsement of the article
R10: endorsement of its content
R11: criticism of the article
R12: criticism of its content

27 Problems/observations
- Non-English tweets
  - TODO: language classification or different story selection
- Overlapping categories: headline repetition + link to article (this is likely to happen on sites that have an automatic tweet generator)
  - TODO: new category
- Hashtag "#" missing (sometimes; only in the Oil Spill data?)
- Comment on a tweet that commented on an article (in these tweets, there is a syntactic indicator of retweeting: RT) AND most retweets are comments on the retweeted text
  - TODO: new category "indirect comment"
- Use the article as an argument
  - TODO: new category "link works as repetition" – simplify to category "link to the article"
- (a suggestion to someone to read it) = answer; RT = spread in your own network; both may contain commenting (but answer presupposes the recipient knows what this is about) -> exclude answer tweets?!

28 Problems/observations (2)
- Overlapping (see above)
- Found an instance of only-link (not yet clear to what; some links don't work)
- Headline + sentence (as far as the 140 characters allow) + link
- No relation: topic too big (Iraq war vs. Iraqi economy)

29 Second round of relations: "the manually annotated tweet is a … of some news article / other text"
R1: Summary with link
R2: Headline with link
R3: Summary or headline without link
R4: Endorsement with link
R5: Endorsement without link
R6: Criticism with link
R7: Criticism without link
R8: Otherwise emotionally charged text
R9: Just a link
R10: Comment on another tweet [rule: always involves RT]
R11: Enriching another tweet [rule: always involves RT]
R12: OTHER
Rule: if there is a link, try to check it to see whether the tweet text repeats the headline
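
The "with link" / "without link" split and the RT rule can narrow down which codes are possible for a tweet before a coder or classifier decides. The sketch below is one reading of the scheme, not part of the codebook. It assumes that R8 and R12 are possible regardless of links, and that RT is necessary but not sufficient for R10 and R11. All names are hypothetical.

```python
# One possible grouping of the second-round codes (an assumption).
WITH_LINK = {"R1", "R2", "R4", "R6", "R9"}
WITHOUT_LINK = {"R3", "R5", "R7"}
NEEDS_RT = {"R10", "R11"}
UNCONSTRAINED = {"R8", "R12"}


def candidate_relations(has_link, is_retweet):
    """Return the relation codes that remain possible given two surface cues."""
    candidates = set(UNCONSTRAINED)
    candidates |= WITH_LINK if has_link else WITHOUT_LINK
    if is_retweet:
        # R10 and R11 always involve RT, so they are only possible for retweets.
        candidates |= NEEDS_RT
    return candidates


# A tweet that contains a link but no RT marker.
linked_only = candidate_relations(has_link=True, is_retweet=False)
```

Choosing among the remaining candidates, for example deciding between R1 and R2 by following the link and comparing the tweet text with the headline, is left to the coder or to a later modelling step.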

30 Outlook
- Some advanced forms of text mining (index7.ppt, pp )
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
- Notes about language modelling and about inference on/with/for the Semantic Web

31 References / background reading
- Stemler, Steve (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17).
  This describes, among other things, the classic book in the field: Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Newbury Park, CA: Sage.
- The CRISP-DM manual can be found at
- "Our" Twitter study: Subašić, I. & Berendt, B. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. In Proceedings of ECIR 2011 ( ). Berlin etc.: Springer. LNCS endt_2011.pdf