Published by Adrianna Snowden. Modified over 9 years ago.
1
Berendt: Advanced databases, first semester 2011, http://people.cs.kuleuven.be/~bettina.berendt/teaching
Advanced databases – Inferring new knowledge from data(bases): Text Mining II
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://people.cs.kuleuven.be/~bettina.berendt/teaching
Last update: 28 December 2011
2
Agenda
- Some advanced forms of text mining (index7.ppt, pp. 32-47)
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
5
Motivation for association-rule learning/mining: store layout (Amazon; earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?
6
What makes people happy?
8
News and social media, in particular tweets
9
Recall: CRISP-DM
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a data mining process model that describes the approaches expert data miners commonly use to tackle problems.
10
Business understanding
11
Data understanding
13
What are the relations between these texts (or text parts)?
14
Or these?
15
A list of possible (and interesting) text relations in the News/Blogs/Tweets domain (relation: tweet -> news article)
- Repetition (could be more interesting if repeated: repetition/retweet -> repetition weights?)
- Repetition of the headline?
- Pointing to interesting links (difficult to identify? The link must be processed and might involve redirection)
- Pointing to the article
  - … anything becomes more important if it is retweeted (endorsement?)
  - … that may depend on WHO (re)tweets it, measured e.g. by number of followers
- Comment
- Reference to an event or topic via a hashtag (Obama election)
  - Hashtags can be used to identify a topic that might also be present in news articles (NAs): being-about-the-same-topic
  - Learn from the words around the texts, and from co-occurring hashtags
- Use SentiStrength to determine whether a text has a positive or negative relationship with a tweet (endorsement; criticism)
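The hashtag idea above (topics signalled by hashtags, and hashtags that co-occur) can be sketched in a few lines of Python. This is only an illustration: the regular expression, the function names `hashtags` and `cooccurrence`, and the example tweets are assumptions made here, not part of the study.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def hashtags(tweet):
    """Return the set of lowercased hashtags found in one tweet."""
    return {tag.lower() for tag in HASHTAG.findall(tweet)}

def cooccurrence(tweets):
    """Count how often each unordered pair of hashtags shares a tweet."""
    pairs = Counter()
    for tweet in tweets:
        # Sorting makes (a, b) and (b, a) the same key.
        for a, b in combinations(sorted(hashtags(tweet)), 2):
            pairs[(a, b)] += 1
    return pairs

# Invented example tweets, not taken from the study's corpus.
tweets = [
    "Results are in #Obama #election",
    "Watching the #election coverage #Obama #vote",
    "RT @someone: #oilspill update",
]
counts = cooccurrence(tweets)
```

Pairs with high counts are candidates for "being about the same topic"; whether raw counts are a good enough measure is left open here.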
17
What is Content Analysis?
- A form of textual analysis (usually)
- Categorizes chunks of text according to a code
- Blend of qualitative and quantitative methods
Schwandt, Thomas A. Dictionary of Qualitative Inquiry. 2nd ed. Sage Publications: Thousand Oaks, CA, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
18
Rough History - 1
Classical Content Analysis
- Used as early as the 1930s in military intelligence
- Analyzed items such as communist propaganda and military speeches for themes
- Created matrices counting the occurrences of particular words/phrases
Roberts, C.W. "Content Analysis." International Encyclopedia of the Social and Behavioral Sciences. Elsevier: Amsterdam, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
19
Rough History - 2
(New) Content Analysis*
- Moved into social science research
- Used to study trends in media and politics, and provides a method for analyzing open-ended questions
- Can include visual documents as well as texts
- More of a focus on phrasal/categorical entities than on simple word counting
*My own terminology, more generally referred to as simply "Content Analysis"
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
20
Procedure
1. Identify a corpus of texts and a sample population
2. Determine the unit of analysis
3. Find themes (inductive or deductive)
4. Build a codebook
5. Mark the texts
6. Analyze the codes from the texts quantitatively
Denzin, Norman K. Handbook of Qualitative Research. Sage Publications: Thousand Oaks, CA, 2000.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
21
Coding
Analyzing the archived content. Includes:
1. Identifying units of analysis (e.g., individual user posts, game characters)
2. Creating a codebook
3. Creating coding sheets (may be electronic now)
4. Training, coding, intercoder reliability assessment, etc.
From Paul Skalski (n.d.) Content Analysis of Interactive Media. http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt, p. 10
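One way to picture an electronic codebook and coding sheet is as plain data structures. The sketch below is an assumption about how such a sheet could look, not a format prescribed by the sources; the code labels, unit identifiers, and the helpers `undefined_codes` and `code_frequencies` are all hypothetical.

```python
from collections import Counter

# Hypothetical codebook: code label -> short definition for the coders.
codebook = {
    "SUMMARY": "the unit summarizes the article",
    "COMMENT": "the unit comments on the article",
    "OTHER": "none of the above",
}

# Hypothetical coding sheet: one row per decision of one coder on one unit.
coding_sheet = [
    {"unit": "t1", "coder": "A", "code": "SUMMARY"},
    {"unit": "t2", "coder": "A", "code": "COMMENT"},
    {"unit": "t3", "coder": "A", "code": "COMMENT"},
]

def undefined_codes(sheet, codebook):
    """Return the rows whose code is not defined in the codebook."""
    return [row for row in sheet if row["code"] not in codebook]

def code_frequencies(sheet):
    """Tally how often each code was assigned (the quantitative step)."""
    return Counter(row["code"] for row in sheet)

problems = undefined_codes(coding_sheet, codebook)
frequencies = code_frequencies(coding_sheet)
```

Checking the sheet against the codebook before counting catches typos and ad hoc codes early, which matters once several coders fill in sheets independently.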
22
Examples
To be shown in class: overview of an example from the Social Web; see http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt, pp. 11ff.
Further resources include:
- A detailed example of a codebook for Content Analysis of Stories about Protest Events: http://www.ssc.wisc.edu/~oliver/PROTESTS/ArticleCopies/codebook2000.htm
- More examples of codebooks and coding schemes: http://academic.csuohio.edu/kneuendorf/content/hcoding/hcindex.htm
23
Thus …
24
A first project plan (for HWs 4 and 6) – HW 4
PHASE Data understanding / initial data collection of the class attribute
- In different teams:
  1. Come up with different possible relations between texts
  2. Find a small number of examples
     - NB: Sampling strategy?
  3. Develop a codebook and coding scheme
  4. Have several coders code a larger number of examples
     - NB: Sampling strategy?
  5. Measure inter-rater agreement
     - http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha
PHASE Pause – revisit the literature and re-evaluate it (not really a CRISP-DM phase …)
  1. Compare your results!
  2. In the light of all this, revisit (as an example from the literature) the SentiStrength coding procedure and discuss it critically
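For the inter-rater agreement step, Krippendorff's alpha for nominal codes can be computed from a coincidence matrix. The following is a minimal sketch under stated assumptions: codes are nominal (no ordering), each unit is given as the list of codes it received, missing codings are simply left out, and the function name and input layout are choices made here for illustration.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the codes assigned to one
    unit by the coders who coded it (missing codings are omitted).
    """
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit coded by fewer than two coders is not pairable
        # Every ordered pair of codings from different coders, weighted 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed disagreement mass and its chance-expected counterpart.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined: no variation in the codes")
    return 1 - observed / expected

# Invented toy data: three units, two coders each, one disagreement.
alpha = krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"]])
```

For real data with ordinal or interval codes a different distance function is needed; this sketch covers only the nominal case.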
25
A first project plan (for HWs 4 and 6) – HW 6
PHASE Data preparation
  1. You may skip most of this phase. Take the data as prepared by Ilija!
PHASE Modelling
  1. Understand / develop [depending on time and interest] formal measures of such text relations
  2. Calculate the measures for the corpora
  3. Calculate the accuracy of classification
PHASE Evaluation
  1. Do an error analysis. Be critical with yourself, the results, and their meaning for the initial question ;-)
PHASE Deployment
  1. Produce final report
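The accuracy calculation and a starting point for the error analysis can be sketched as follows. The label lists are invented for illustration (they reuse relation identifiers such as R1 purely as strings), and the helper names are assumptions, not part of the assignment.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def confusion(gold, predicted):
    """Counter keyed by (gold_label, predicted_label)."""
    return Counter(zip(gold, predicted))

def errors(gold, predicted):
    """Indices of misclassified items, as a starting point for error analysis."""
    return [i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p]

# Invented example: gold = manual annotation, predicted = classifier output.
gold = ["R1", "R2", "R9", "R9", "R12"]
predicted = ["R1", "R1", "R9", "R12", "R12"]

acc = accuracy(gold, predicted)
matrix = confusion(gold, predicted)
wrong = errors(gold, predicted)
```

Accuracy alone hides which relations get confused with which; the off-diagonal entries of `matrix` and the items listed in `wrong` are what an error analysis would inspect.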
26
First round of relations
R1: summary
R2: repetition of headline
R3: (the tweet is a) link (to the article)
R4: (the tweet is a) link to another article on the same topic
R5: comment on the article's content
R6: comment on a topic related to the article
R7: comment on the article (note: only if there is a link to the article!)
R8: hashtag-about-topic
R9: endorsement of the article
R10: endorsement of its content
R11: criticism of the article
R12: criticism of its content
27
Problems/observations
- Non-English tweets
  - TODO: language classification or different story selection
- Overlapping categories: headline repetition + link to article (this is likely to happen on sites that have an automatic tweet generator)
  - TODO: new category
- Hashtag "#" missing (sometimes; only in the Oil Spill data?)
- Comment on a tweet that commented on an article (in these tweets, there is a syntactic indicator of retweeting: RT), AND most retweets are comments on the retweeted text
  - TODO: new category "indirect comment"
- Use the article as an argument
  - TODO: new category "link works as repetition"; simplify to category "link to the article" (a suggestion to someone to read it)
- @ = answer; RT = spread in your own network; both may contain commenting (but an answer presupposes that the recipient knows what this is about). Exclude answer tweets?!
28
Problems/observations (2)
- Overlapping (see above)
- Found an instance of only-link (not yet clear to what; some links don't work)
- Headline + sentence (as far as the 140-character limit allows) + link
- No relation: topic too big (Iraq war vs. Iraqi economy)
29
Second round of relations: "the manually annotated tweet is a … of some news article / other text"
R1: Summary with link
R2: Headline with link
R3: Summary or headline without link
R4: Endorsement with link
R5: Endorsement without link
R6: Criticism with link
R7: Criticism without link
R8: Otherwise emotionally charged text
R9: Just a link
R10: Comment on another tweet [rule: always involves RT or @]
R11: Enriching another tweet [rule: always involves RT]
R12: OTHER
Rule: if there is a link, try to check it to see whether the tweet text repeats the headline
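The surface rules in this scheme (with or without link, "always involves RT or @") can be checked automatically to narrow down which relations a tweet can still belong to. The sketch below is one possible reading, not the procedure used in the study: the regular expressions, the function names, and the decision to treat R8, R10, R11 and R12 as independent of links are all assumptions.

```python
import re

# Assumed surface patterns; real tweets may need more careful handling.
URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"(?:^|\s)RT\b")
MENTION = re.compile(r"(?:^|\s)@\w+")

def surface_features(tweet):
    """Detect the syntactic indicators the coding rules refer to."""
    return {
        "has_link": bool(URL.search(tweet)),
        "has_rt": bool(RETWEET.search(tweet)),
        "has_mention": bool(MENTION.search(tweet)),
    }

def admissible_relations(tweet):
    """Relations that the surface rules do not exclude for this tweet."""
    f = surface_features(tweet)
    candidates = {"R8", "R12"}  # assumed not to depend on links or RT/@
    if f["has_link"]:
        candidates |= {"R1", "R2", "R4", "R6", "R9"}  # the "with link" codes
    else:
        candidates |= {"R3", "R5", "R7"}  # the "without link" codes
    if f["has_rt"] or f["has_mention"]:
        candidates.add("R10")  # rule: always involves RT or @
    if f["has_rt"]:
        candidates.add("R11")  # rule: always involves RT
    return candidates

# Invented example tweets.
with_rt = admissible_relations("RT @bbc: Oil spill worsens http://example.com/a")
plain = admissible_relations("Great reporting!")
```

Such a filter only excludes codes; choosing among the remaining candidates (for example summary versus endorsement) still needs a human coder or a trained classifier.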
30
Outlook
- Some advanced forms of text mining (index7.ppt, pp. 32-47)
- Recall: The importance of business and data understanding (BU & DU) for knowledge discovery
- The Twitter Study and its questions: BU & DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
- Notes about language modelling and about inference on/with/for the Semantic Web
31
References / background reading
- Stemler, Steve (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17). http://PAREonline.net/getvn.asp?v=7&n=17
  This describes, among other things, the classic book in the field: Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Newbury Park, CA: Sage.
- The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf
- "Our" Twitter study: Subašić, I. & Berendt, B. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. In Proceedings of ECIR 2011 (207-213). Berlin etc.: Springer. LNCS 6611. http://people.cs.kuleuven.be/~bettina.berendt/Papers/subasic_berendt_2011.pdf