1
Richard Gruss, Daniel Morgado, Nate Craun, Colin Shea-Blymyer
OutbreakSum: Automatic Summarization of Texts Relating to Disease Outbreaks
CS 4984: Computational Linguistics, 12/10/2014
2
Collection Processing
Your Small collection: Encephalitis, 365 files. Your Big collection: Ebola, 15,000 files. Both consist of news articles.
Notes: size and makeup of the collections; encoding issues, different languages, varying file sizes and file counts, blank files, duplicate files, classification. The number of files in the report should be 365. It would be better to have the HTML source and process that directly (login prompts, navigation bars, etc.). "Web stop words" such as "Log in", "Must have JS enabled", and Flash notices. Segue to the small-files problem.
3
Collection Processing
Problems: non-English languages, blank files, irrelevant files, duplicate files.
Solutions: hashing (for duplicate files) and classification (for irrelevant files).
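To make the hashing solution concrete, here is a minimal sketch of duplicate detection by hashing normalized file contents. The function names and the normalization step are assumptions for illustration, not the team's actual code.

```python
import hashlib
import os

def content_hash(text):
    """Hash the normalized article text so trivially different copies collide."""
    normalized = " ".join(text.split()).lower()  # collapse whitespace, ignore case
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def find_duplicates(directory):
    """Return groups of files whose normalized contents hash to the same value."""
    seen = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            digest = content_hash(f.read())
        seen.setdefault(digest, []).append(path)
    return [paths for paths in seen.values() if len(paths) > 1]
```

Hashing normalized text lets byte-identical files and trivial whitespace variants collapse to the same digest, so duplicate groups fall out in a single pass over the collection.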
4
Collection Processing
Classification features: tf-idf, document length, target words.
Future improvements: "web" stop words, HTML parsing.
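As a sketch of how these three feature types could be combined in a relevance classifier: the team's actual model is not shown, so the scikit-learn setup, the target-word list, and the choice of logistic regression below are all assumptions.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed target-word list; the team's actual list is described on the later word-list slide.
TARGET_WORDS = {"outbreak", "ebola", "encephalitis", "infection", "quarantine", "epidemic"}

def extra_features(docs):
    """Document length and target-word count for each document."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()
        rows.append([len(tokens), sum(t in TARGET_WORDS for t in tokens)])
    return csr_matrix(np.array(rows, dtype=float))

def train_relevance_classifier(docs, labels):
    """Combine tf-idf features with length and target-word counts, then fit a classifier."""
    vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
    features = hstack([vectorizer.fit_transform(docs), extra_features(docs)])
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return vectorizer, clf
```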
5
Hadoop: Performance
Small file problem; passing data representations
Small file problem: excessive calls to the Map function; reading 64 MB blocks that contain only a few KB of data. Built the corpus_squeeze.py utility. Classifying 7k documents: 13m27s → 1m18s.
Passing data representations: JSON → pickle (ASCII-serialized Python objects).
Notes: HDFS and Hadoop are designed to work well on large files; with a default block size of 64 MB, Hadoop performs poorly on large quantities of small files, because the Map function is called excessively and there is I/O overhead in reading a 64 MB block that holds only a few KB of data. corpus_squeeze compresses the input directory into k equal-sized files. Another challenge is efficiently passing data between the map and reduce phases: we started with JSON objects, but as our needs grew we switched to pickle, which serializes Python objects and let us load them directly from the mapper into the reducer.
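A minimal sketch of the corpus-squeezing idea: concatenate many small documents into k larger files, one document per line. The team's corpus_squeeze.py is not shown, so the file layout, tab delimiter, and round-robin assignment below are assumptions (round-robin gives roughly equal file counts rather than exactly equal byte sizes).

```python
import os

def squeeze_corpus(input_dir, output_dir, k=20):
    """Pack many small files into k larger files to avoid Hadoop's small-file overhead."""
    os.makedirs(output_dir, exist_ok=True)
    outputs = [open(os.path.join(output_dir, "part-%02d.txt" % i), "w", encoding="utf-8")
               for i in range(k)]
    for i, name in enumerate(sorted(os.listdir(input_dir))):
        with open(os.path.join(input_dir, name), encoding="utf-8", errors="ignore") as f:
            text = " ".join(f.read().split())  # flatten each document to a single line
        outputs[i % k].write(name + "\t" + text + "\n")
    for out in outputs:
        out.close()
```

Keeping one document per line also suits Hadoop streaming, which splits its input on newlines.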
6
Hadoop: Workflow
The streaming interface does not facilitate code reuse.
Framework for a reusable mapper and reducer; shared utilities module. All jobs define the same three functions: Compose → Combine → Compute. Debugging: a MapReduce simulation script.
Notes: Under the Hadoop streaming interface it is easy to write a separate mapper and reducer script for each kind of job, but you repeat a lot of work, and any change in methodology means going back and individually modifying every affected mapper/reducer pair built so far. We built a framework with a shared utilities module for reading/writing functions and the like; all jobs live in the same module and follow the same interface, so any job can be swapped into the generic mapper and reducer files. We also built a MapReduce simulation script that runs jobs locally, identically to the Hadoop process, so all errors could be debugged locally before deployment; being able to write MapReduce jobs in our local IDEs saved a lot of time and headache.
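A minimal sketch of what a local simulation of such jobs might look like: the class layout, function signatures, and word-count example are assumptions, since the team's actual framework and simulation script are not shown.

```python
import itertools

class WordCountJob:
    """Example job following an assumed Compose -> Combine -> Compute interface."""
    def compose(self, line):            # map step: emit (key, value) pairs
        for word in line.split():
            yield word.lower(), 1
    def combine(self, key, values):     # local aggregation before the shuffle
        yield key, sum(values)
    def compute(self, key, values):     # reduce step: final result per key
        yield key, sum(values)

def _group_and_apply(pairs, func):
    """Sort pairs by key, group them, and apply func(key, values) to each group."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    out = []
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        out.extend(func(key, [v for _, v in group]))
    return out

def simulate_mapreduce(job, lines):
    """Run a job locally the way Hadoop streaming would: map, combine, shuffle, reduce."""
    mapped = [pair for line in lines for pair in job.compose(line)]
    combined = _group_and_apply(mapped, job.combine)
    return _group_and_apply(combined, job.compute)

# e.g. simulate_mapreduce(WordCountJob(), ["ebola outbreak", "ebola cases rise"])
```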
7
Language Processing
Find words with the highest tf-idf
Find words with the highest tf-idf. Create a "your words" list of intuitively identifying words. A combination of these lists is used for classification.
Words from these lists: vaccine, cure, antibiotic, hospital, health, contract, transmit, organism, toxic, sanitation, cholera, case, die, patients, children, environment, study, kill, medicine, CDC, WHO, diagnose; disease, outbreak, epidemic, infect, quarantine, sick, airborne, waterborne, pathogen, virus, bacteria, parasite, symptoms, treatment; plus words specific to the diseases in our collection.
Notes: these lists evolved from the first unit to the last, using the suggested approaches: lemmatization, tokenization, NER, and POS tagging (POS did not give us much on its own; it was a means to an end). A good end result was retrieving passages that contained these terms.
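A minimal sketch of ranking collection words by tf-idf, sketched here with scikit-learn; the vectorizer settings and the choice to sum weights across documents are assumptions and may differ from the team's Hadoop-based computation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_tfidf_words(docs, n=25):
    """Rank vocabulary terms by their total tf-idf weight across the collection."""
    vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
    tfidf = vectorizer.fit_transform(docs)
    totals = tfidf.sum(axis=0).A1                 # summed weight per vocabulary term
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, totals), key=lambda pair: -pair[1])
    return [word for word, _ in ranked[:n]]
```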
8
Language Processing
Named Entity Recognition → best results
Stanford NER: very effective, but only when passed a single tokenized sentence.
NLTK ne_chunk: far less accurate than Stanford NER.
OpenNLP: faster and more effective than Stanford NER.
Clustering and topics → bad results.
Tokenization: NLTK's default sentence tokenization was fairly inexact, so we used the Reuters and Brown news collections to train a better sentence tokenizer, which can detect sentence boundaries despite improper spacing before and after sentences.
Notes: we wanted sentence tokenization (and paragraph tokenization), and used the trained tokenizer described above as a preprocessing step.
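A minimal sketch of training a sentence tokenizer on news text with NLTK's Punkt algorithm; the team also used the Brown news collection, and their exact training setup is not shown, so the setup below is an assumption.

```python
# Requires: nltk.download("reuters")
from nltk.corpus import reuters
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Train Punkt directly on raw Reuters news text, as a stand-in for NLTK's default tokenizer.
tokenizer = PunktSentenceTokenizer(reuters.raw())

sentences = tokenizer.tokenize("Ebola has struck Liberia. Officials report 700 deaths.")
```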
9
Summarization: Template
Template: [DISEASE] has struck [LOCATION]. As of [DATE], there have been [NUMBER] cases either killed or hospitalized. Authorities are [AUTHORITIES' MEASURES]. [DISEASE] spreads [HOW IT SPREADS]. [HEALTH ORGANIZATION] is [WHAT THE HEALTH ORG IS DOING]. Symptoms of [DISEASE] include [SYMPTOMS]. Treatments include [TREATMENTS]. The total cost of the outbreak is estimated at [MONEY AMOUNT]. The disease may spread to [OTHER LOCATIONS].
Notes: goals, efforts, and problems faced; grammar, examples, dollar signs; tricky cases included currency strings, date strings, relative dates, buckets, and order of magnitude. We had enough data that we did not need to worry about particular confusions, and disregarded ambiguous scenarios. Disease: n-grams ("_ outbreak"). Date: the most recent date. Cases/deaths: regex with "cases" or "deaths"; at this point we just used one or the other. Authorities: government adjectivization, e.g., "the Liberian government is threatening to close down the market which sits next to the largest township", "the Nigerian government issuing warnings that the treatment isn't effective; the rumor is being spread by text message and social media". Health organization: hand-picked our top organizations, e.g., "the Centers for Disease Control issued guidance to all U.S. hospitals and mortuaries on the safe handling of human remains of Ebola patients", "the World Health Organization is convening a meeting of medical ethicists next week to examine what it calls 'the responsible thing to do' about whatever supplies eventually may become available of a medicine that's never been tested in people". Symptoms: the word "symptoms", WebMD, POS tags, and a regex over a POS grammar, {(ADJ)*(NOUN)+(ADV)*(VERB)*}. No treatments were found.
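A minimal sketch of the symptom-chunking idea, with the slide's {(ADJ)*(NOUN)+(ADV)*(VERB)*} pattern rewritten in NLTK's chunk-grammar syntax over the universal tagset; the surrounding function is an assumption, not the team's code.

```python
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("universal_tagset")
import nltk

# Chunk grammar mirroring the slide's pattern, in NLTK RegexpParser syntax.
GRAMMAR = "SYMPTOM: {<ADJ>*<NOUN>+<ADV>*<VERB>*}"

def symptom_chunks(sentence):
    """Return candidate symptom phrases from a sentence that mentions symptoms."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens, tagset="universal")
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "SYMPTOM"]

# e.g. symptom_chunks("Symptoms include high fever, severe headache and muscle pain.")
```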
10
Summarization: NLG
Report using NLG database
{'date': ' ', 'source': ' ', 'cases': 0, 'location': 'Liberia', 'deaths': '1552'}
{'date': ' ', 'source': ' ', 'cases': 0, 'location': 'West Africa', 'deaths': '700'}
{'date': ' ', 'source': ' ', 'cases': 0, 'location': 'Liberia', 'deaths': '700'}
{'date': ' ', 'source': '1023-5', 'cases': '201', 'location': 'West Africa', 'deaths': '672'}
Notes: parseable dates; tracing records back to their paragraphs/documents; top locations of interest. Numbers are easily found with regex, but how do we choose the right ones?
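The note says numbers are easily found with regex; here is a minimal sketch of how records like the ones above might be pulled from a single sentence. The regexes, location list, and record shape are assumptions modeled on the database entries shown, not the team's extraction code.

```python
import re

DEATHS_RE = re.compile(r"(\d[\d,]*)\s+(?:deaths|dead|killed)", re.IGNORECASE)
CASES_RE = re.compile(r"(\d[\d,]*)\s+(?:cases|infections)", re.IGNORECASE)
LOCATIONS = ["Liberia", "West Africa", "Nigeria", "Guinea", "Sierra Leone"]  # top locations of interest

def extract_record(sentence):
    """Build a record like the database entries above from one sentence, if possible."""
    record = {"date": " ", "source": " ", "cases": 0, "location": None, "deaths": 0}
    for loc in LOCATIONS:
        if loc.lower() in sentence.lower():
            record["location"] = loc
            break
    if m := DEATHS_RE.search(sentence):
        record["deaths"] = m.group(1).replace(",", "")
    if m := CASES_RE.search(sentence):
        record["cases"] = m.group(1).replace(",", "")
    return record if record["location"] else None

# e.g. extract_record("Officials in Liberia confirmed 1,552 deaths and 3,052 cases.")
```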
11
Summarization: NLG
News in the United States of 4000 cases recently reported in West Africa has caused concern.
Constituency parse: (NP (NP (NNP News)) (PP (IN in) (NP (DT the) (NNP United) (NNPS States)) (PP (IN of) (NP (CD 4000) (NNS cases)) (NP (NNP West) (NNP Africa)))))))) (VP (VBZ has) (VP (VBN caused) (NP (NN concern)))
12
Summarization: NLG
summary ::= intro-sentence detail-paragraph history-paragraph closing-sentence
intro-sentence ::= "There has been an outbreak of Ebola in the following locations: " location-list
detail-paragraph ::= detail-sentence detail-sentence-with-transition*
detail-sentence-with-transition ::= transition detail-sentence
transition ::= ("Also" | "In addition" | "Likewise" | "Additionally" | "Furthermore")
detail-sentence ::= "In [month year], there were between [num] and [num] cases of Ebola in [location], with between [num] and [num] deaths."
history-paragraph ::= "There have been earlier cases of Ebola." history-sentence history-sentence-with-transition*
history-sentence-with-transition ::= transition history-sentence
history-sentence ::= "Ebola was found in [location] in [year]."
closing-sentence ::= "[Ebola boilerplate]"
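A minimal sketch of how the detail-sentence and detail-paragraph productions might be realized in Python; the record field names are assumptions modeled on the NLG database entries shown earlier, and the transition cycling is illustrative rather than the team's actual generator.

```python
import itertools

TRANSITIONS = ["Additionally", "Also", "In addition", "Likewise", "Furthermore"]

def detail_sentence(record):
    """Realize one detail-sentence from an NLG-database record (field names assumed)."""
    return ("In {date}, there were between {cases_lo} and {cases_hi} cases of Ebola "
            "in {location}, with between {deaths_lo} and {deaths_hi} deaths.").format(**record)

def detail_paragraph(records):
    """detail-sentence followed by detail-sentences-with-transition, per the grammar."""
    parts = [detail_sentence(records[0])]
    for transition, record in zip(itertools.cycle(TRANSITIONS), records[1:]):
        parts.append(transition + ", " + detail_sentence(record))
    return " ".join(parts)

# e.g. detail_paragraph([
#     {"date": "January 2014", "cases_lo": 425, "cases_hi": 3052,
#      "location": "Liberia", "deaths_lo": 2296, "deaths_hi": 2917},
#     {"date": "January 2014", "cases_lo": 425, "cases_hi": 4500,
#      "location": "West Africa", "deaths_lo": 2296, "deaths_hi": 2917},
# ])
```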
13
Summarization: NLG There has been an outbreak of Ebola reported in the following locations: Liberia, West Africa, Nigeria, Guinea, and Sierra Leone. In January 2014, there were between 425 and 3052 cases of Ebola in Liberia, with between 2296 and 2917 deaths. Additionally, In January 2014, there were between 425 and 4500 cases of Ebola in West Africa, with between 2296 and 2917 deaths. Also... There were previous Ebola outbreaks in the past. Ebola was found in 1989 in Liberia. As well, Ebola was found in 1989 in West Africa. ... Ebola virus disease (EVD; also Ebola hemorrhagic fever, or EHF), or simply Ebola, is a disease of humans and other primates caused by ebolaviruses...
14
Questions?
Special thanks to: Edward Fox, Xuan Zhang, Tarek Kanan, Mohammad Halimi
Sponsors: NSF (DUE, IIS)
15
Additional Slides
16
Hadoop → Local Transition
Task: Named Entity Recognition
Stanford NER: local time > 4 hours, cluster time 61m18.391s
OpenNLP: 8m26.338s
17
MapReduce
18
Compose → Combine → Compute