Kiran Garimella
News Scientific papers Search Queries Twitter ◦ Gender ◦ Relationships ◦ Migration ◦ Politics
I’m a.. Just kidding!
Link structure
Connected text
Hidden structure/patterns
This talk
◦ Summarizing scientific articles
◦ Political trends from search queries
◦ Romantic relationship breakups on Twitter
Motivation
◦ Not many existing systems
◦ Completely different from news document summarization
◦ Many topics
◦ Strong citation network
◦ Precise structure: Introduction, Related work, Experiments, etc.
[Figure: existing approach. Labels: Papers, Model, New paper, Relevant Sentences, Irrelevant Sentences, Categories (Aim, Own, Background, Contrast, Other), Categorized Sentences, Final Summary.]
◦ Manually annotating is a very tedious and difficult job
◦ Final summary depends on the classification accuracy
◦ Summary might depend on the training data
Make use of the strong citation network
PageRank?
[Figure: citation diagram. Papers A, B and C shown with their citations (X1 to X7), including the paper X.]
[Figure: first approach. Labels: Paper to be Summarized (X), search, Citation 1, Citation 2, Extracted Citation Sentences, classify, Topics, +ve, -ve, Sentences from X, Summary.]
Contains the negative points of a paper too.
Different viewpoints covered.
Can be useful to create a survey.
Did not work:
◦ Not many negative statements made
◦ Difficult to classify as positive or negative
Example:
Split text into sentences? Paragraphs?
Text tiling to the rescue: a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
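The idea behind text tiling can be sketched as follows. This is a simplified variant, not the full algorithm and not necessarily what the system in this talk uses: it compares the vocabulary of the sentence blocks on either side of each gap with cosine similarity and starts a new tile where the similarity drops below a threshold. The names `tile_text` and `cosine`, the block size, and the threshold value are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tile_text(sentences, block_size=2, threshold=0.1):
    """Group sentences into tiles (lists of sentences).

    A new tile starts at a gap when the blocks of `block_size` sentences
    before and after the gap have cosine similarity below `threshold`.
    Both parameters are assumed values for illustration.
    """
    if not sentences:
        return []
    bags = [Counter(tokenize(s)) for s in sentences]
    tiles = []
    current = [sentences[0]]
    for gap in range(1, len(sentences)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        if cosine(left, right) < threshold:
            tiles.append(current)
            current = []
        current.append(sentences[gap])
    tiles.append(current)
    return tiles
```

With a handful of sentences that switch topic halfway, `tile_text` returns one list of sentences per subtopic. A fixed threshold is the simplest possible boundary rule; it would likely need tuning on real papers.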
Various Machine Learning approaches have been proposed for chunking. (a,b,c,d) Chunking is a widely used technique in Natural language processing. Under the same shallow structure..
Step I – Extract text tiles
Step II – Cluster cited papers
Various Machine Learning approaches have been proposed for chunking. (a,b,c,d)
Step III – Extract keywords from text tiles
Step IV – Search for keywords in the clusters obtained in Step II
Step V – Rank relevant sentences and present to the user
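Steps III to V can be sketched as below. The talk does not say how keywords are extracted or how sentences are scored, so this sketch assumes the simplest choices: keywords are the most frequent non-stopwords in a tile, and a sentence's score is the number of keywords it contains. The stopword list and the names `extract_keywords` and `rank_sentences` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the talk).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "is",
             "have", "been", "to", "and", "on"}


def words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]


def extract_keywords(tile, k=3):
    """Step III (sketch): the k most frequent non-stopwords in a text tile."""
    return [w for w, _ in Counter(words(tile)).most_common(k)]


def rank_sentences(keywords, cluster_sentences):
    """Steps IV and V (sketch): keep sentences that contain at least one
    keyword and order them by the number of distinct keywords matched."""
    wanted = set(keywords)
    scored = []
    for sentence in cluster_sentences:
        score = len(wanted & set(words(sentence)))
        if score > 0:
            scored.append((score, sentence))
    # The sort is stable, so equally scored sentences keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

Here `cluster_sentences` stands for the sentences of the cited papers in one cluster from Step II; how that clustering is done is outside this sketch.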
Pipeline
[Figure: system pipeline from user search to summary. Modules: Search, Paper Viewing, Text tiling, Clustering, Ranking, Summary Presentation. Steps shown: generate text tiles, cluster cited papers, extract context, rank sentences, citation sentences.]
Link: bin/summarization/summarizer.html
Left-leaning blogs (387), right-leaning blogs (644)
From Benkler and Shaw “A tale of two blogospheres” (2010) and Wonkosphere Blog Directory
Use self-provided age and gender and ZIP-derived estimates
People clicking on right-leaning blogs:
– Are older (50 vs. 45 years)
– Are more male (63% vs. 55%)
– Are more white (81% vs. 78%)
– More likely to study at La Sapienza (92.3% vs. 11.4%)
All these trends agree with voters' demographics
From Blogs to Queries
“huffingtonpost.com” is left-leaning, so a click on it is a left-leaning vote for the query “pizza is a vegetable”.
Aggregate votes across all clicks on political blogs to compute overall leaning.
v_L = left clicks for the query
V_L = total left clicks
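A minimal sketch of the vote aggregation, under stated assumptions. The slide defines only the left-side counts (left clicks for the query and total left clicks); this sketch assumes matching right-side counts and assumes the leaning of a query is the normalized left share divided by the sum of the normalized left and right shares. That formula, the function name `leaning_by_query`, and the blog name `right.example` in the usage note are illustrative, not taken from the talk.

```python
from collections import Counter


def leaning_by_query(clicks, blog_side):
    """Compute a leaning score in [0, 1] for each query.

    clicks: iterable of (query, blog) pairs from a click log.
    blog_side: dict mapping a blog to "L" or "R"; other blogs are ignored.

    Assumed formula: (v_L / V_L) / (v_L / V_L + v_R / V_R), where v_* are
    the clicks for this query and V_* the total clicks on that side.
    """
    per_query = Counter()  # (query, side) -> clicks
    totals = Counter()     # side -> total clicks
    for query, blog in clicks:
        side = blog_side.get(blog)
        if side is None:
            continue
        per_query[(query, side)] += 1
        totals[side] += 1

    scores = {}
    for query in {q for q, _ in per_query}:
        left = per_query[(query, "L")] / totals["L"] if totals["L"] else 0.0
        right = per_query[(query, "R")] / totals["R"] if totals["R"] else 0.0
        # Every query seen here has at least one click, so left + right > 0.
        scores[query] = left / (left + right)
    return scores
```

For example, with `blog_side = {"huffingtonpost.com": "L", "right.example": "R"}`, a query clicked mostly through to huffingtonpost.com gets a score above 0.5. Dividing by the per-side totals keeps the score from being skewed when one side simply receives more clicks overall.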
Some background first
◦ Largest known knowledge repository
◦ Covers wide range of domains
◦ Manually tagged hierarchical categorization system
◦ Frequently updated
◦ Well built link structure
Categories ◦ Pages
Links
Examples using Wikipedia mapping for 6 months of data, July 4, 2011 – January 8

Queries for Wikipedia entity “Patient Protection & Affordable Care Act”:
obama healthcare bill text (.91) | who pays for obamacare (.04)
obama health care privileges (.83) | obamacare reaches the supreme court (.09)
is affordable care act unconstitutional (.78) | is obamacare constitutional (.16)

Queries for Wikipedia category “Occupy”:
who started occupy wall street (.94) | occupy wall street rape (.09)
we are the 99% (.91) | occupy movement violence (.25)
occupy movement supporters (.78) | crime in occupy movement (.44)
Mapping Queries to Statements
“cost obama trip to india”
364 distinct queries mapped to true facts
574 distinct queries mapped to false facts
Small pieces of text, which may not give a lot of information, can be enhanced using external knowledge sources.
28-hour snapshot of Twitter from July 2013.
* Fake profiles
Tweets, mutual friendships and profile information collected every week.
Data collected for 24 weeks.
[Timeline: Nov 4, 2013; Nov 11, 2013; Nov 25, 2013; ……; Feb 23, 2014 (BREAKUP); Apr 24, 2014]
Before breakup | After breakup
Source: After?
Before breakup | After breakup
Don’t break up and fight publicly
Word clouds as an easy source to get an overview
Use entity extraction on the abstracts. Co-occurring entities might indicate something. Create an entity co-occurrence graph.
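The co-occurrence graph can be sketched as a weighted edge list. This sketch assumes entity extraction has already been run, so each abstract is represented by a list of entity strings; the extraction step itself is not shown, and the function name `cooccurrence_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(entity_lists):
    """Build an undirected, weighted entity co-occurrence graph.

    entity_lists: one list of entity strings per abstract.
    Returns a Counter mapping (entity_a, entity_b), with entity_a < entity_b,
    to the number of abstracts in which both entities appear.
    """
    edges = Counter()
    for entities in entity_lists:
        # Deduplicate within an abstract and sort so each pair has one key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```

Calling `cooccurrence_graph(lists).most_common(10)` then lists the ten most frequently co-occurring entity pairs, which is one simple way to look for the patterns the slide hints at.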
@gvrkiran