Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University.

Similar presentations


Presentation on theme: "1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University."— Presentation transcript:

1 1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University

2 2

3 3 Science and Language  Predicting impact of scientific journal articles  Generating updates on disaster

4 4 Machine learning framework  Data (often labeled)  Extraction of “features” from text data  Prediction of output

5 5 Machine learning framework  Data (often labeled)  Extraction of “features” from text data  Prediction of output What data is available for learning?

6 6 Machine learning framework  Data (often labeled)  Extraction of “features” from text data  Prediction of output What features yield good predictions?

7 7 Science and Language  Predicting impact of scientific journal articles  Generating updates on disaster

8 8

9 9 What can articles yield?

10 10 Alexandrescu and Kirchhoff, 2007 Lee and Ng, 2002 Ando, 2006 Niu et al., 2005 Stevenson and Wilks, 2001 Ng et al., 2003 Rigau et al., 1997Yarowsky, 1995 Stetina et al., 1998 Ng and Lee, 1996 Peh and Ng, 1997 Yarowsky, 1992 Khapra et al., 2011 Chan et al., 2007 Time Qazvinian et al, JAIR 2013 Citation Network

11 11 What can articles yield?  Argumentative zoning (Teufel, 1996)  “Here, we present quantitative estimates of the global biological impacts of climate changes.” [AIM]  Citation sentiment and scope  “The approach of economists takes a broader view. … We argue that this approach misses biologically important phenomena.”

12 12 Prediction of Scientific Impact Climate change, Climate model  Input: term, document  Extract features from full text and metadata  Predict prominence of term, document

13 13 Approach  Metadata features  Citation network  Author collaboration network  Text features  Argumentative zoning  Citation sentiment  Primary concepts  Relations Logistic regression to predict future impact

14 14 Data  4 million full-text Elsevier journal articles  48 million Web of Science metadata records  Fields: medical, chemistry, biology, computer science, …  Ground truth function

15 15 Experiments Experimental comparisons of text features vs metadata features Data: Selected Elsevier articles and counterparts in WOS

16 16 What have We Learned?

17 17 What have we learned?

18 18 Text features alone outperform metadata Elsevier data Text

19 19 Metadata adds value when combined with text Elsevier data Text Text+metadata

20 20 Experiments Experimental comparisons of text features vs metadata features Data: All WOS and Elsevier data

21 21 What have We Learned?

22 22 Text better than metadata on Web of Science Text

23 23 Science and Language  Predicting impact of scientific journal articles  Generating updates on disaster

24 24 It’s October 28 th, 2012

25 25

26 26

27 27

28 28

29 29

30 30 Monitor events over time  Input: streaming data  News, social media, web pages  At every hour, what’s new

31 31 Data  NIST: Trec Temporal Summarization Track  hourly web crawl  October 2011 - February 2013  16.1TB  Natural disasters, industrial disasters, social unrest, …  Training Data drawn from Wikipedia

32 32

33 33 The U.S. Pacific Tsunami Warning Center said there was a possibility of a local tsunami, within 100 or 200 miles of the epicenter,

34 34 Would you like to contribute to this story? Start a discussion.

35 35 Add to Digg. Add to del.icio.us. Add to Facebook.

36 36 System Update

37 37 Temporal Summarization Approach At time t: 1. Predict salience for input sentences  Disaster-specific features for predicting salience 2. Remove redundant sentences 3. Cluster and select exemplar sentences for t  Incorporate salience prediction as a prior Kedzie & al, Bloomberg Social Good Workshop, KDD 2014 Kedzie & al, ACL 2015

38 38 Predicting Salience: Model Features Basic sentence level features  sentence length  punctuation count  number of capitalized words  number of event type synonyms, hypernyms, and hyponyms

39 39 Predicting Salience: Model Features Basic sentence level features  sentence length  punctuation count  number of capitalized words  number of event type synonyms, hypernyms, and hyponyms High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

40 40 Predicting Salience: Model Features Basic sentence level features  sentence length  punctuation count  number of capitalized words  number of event type synonyms, hypernyms, and hyponyms High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

41 41 Predicting Salience: Model Features Basic sentence level features  sentence length  punctuation count  number of capitalized words  number of event type synonyms, hypernyms, and hyponyms High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

42 42 Predicting Salience: Model Features Language Models (5-gram Kneser-Ney model)  generic news corpus (10 years AP and NY Times articles)  domain specific corpus (disaster related Wikipedia articles)

43 43 Predicting Salience: Model Features Language Models (5-gram Kneser-Ney model)  generic news corpus (10 years AP and NY Times articles)  domain specific corpus (disaster related Wikipedia articles) High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

44 44 Predicting Salience: Model Features Language Models (5-gram Kneser-Ney model)  generic news corpus (10 years AP and NY Times articles)  domain specific corpus (disaster related Wikipedia articles) High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

45 45 Predicting Salience: Model Features Language Models (5-gram Kneser-Ney model)  generic news corpus (10 years AP and NY Times articles)  domain specific corpus (disaster related Wikipedia articles) High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

46 46 Predicting Salience: Model Features Geographic Features  tag input with Named-Entity tagger  get coordinates for locations and mean distance to event High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

47 47 Predicting Salience: Model Features Geographic Features  tag input with Named-Entity tagger  get coordinates for locations and mean distance to event High Salience Nicaragua's disaster management said it had issued a local tsunami alert. Medium Salience People streamed out of homes, schools and oce buildings as far north as Mexico City. Low Salience Add to Digg Add to del.icio.us Add to Facebook Add to Myspace

48 48 Predicting Salience: Model Features Temporal Features measuring burstiness of words  average tf-idf at the current time t0  difference of average tf-idf at time t0 and t - 1  difference of average tf-idf at time t0 and t - 5  hours since event start time

49 49 SubEvent Identification ➔ Decompose articles on a main event into related sub-events: Manhattan Blackout Breezy Point fire Public Transit Outage Hurricane Sandy

50 50 What have We Learned?

51 51 What have We Learned? Salience predictions increase precision

52 52 What have We Learned? Salience predictions increase precision Penalizing redundancy removes too many important facts

53 53 Integrate lens of the media and scientific knowledge  To better predict unfolding impact o of disaster  Social media to feed into models of resilience

54 54 In Sum  Can exploit the words on the web  Language analysis can help the world of science

55 55 Thank You!  The research presented here has been supported in part by DARPA BOLT, IARPA FUSE, and NSF.


Download ppt "1 Language, Science and Data Science Kathleen McKeown Department of Computer Science Columbia University."

Similar presentations


Ads by Google