Richard Gruss, Daniel Morgado, Nate Craun, Colin Shea-Blymyer

Slides:



Advertisements
Similar presentations
Beyond Text Representation Building on Unicode to Implement a Multilingual Text Analysis Framework Thomas Hampp – IBM Germany Content Management Development.
Advertisements

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Ebola – Facts, Myths, and Fiction Dr M. Oladoyin Odubanjo Executive Secretary, The Nigerian Academy of Science (NAS) 1st Vice Chair, Association of Public.
VERMONT EMS EBOLA VIRUS DISEASE EDUCATION Patsy Kelso PhD, Vermont Department of Health State Epidemiologist and Vermont EMS.
Tailoring Needs Chapter 3. Contents This presentation covers the following: – Design considerations for tailored data-entry screens – Design considerations.
Ebola Virus Disease (EVD) Updated 11:30 a.m
1 ADVANCED MICROSOFT POWERPOINT Lesson 5 – Using Advanced Text Features Microsoft Office 2003: Advanced.
Side Bar: Vomiting Larry
McGraw-Hill/Irwin © 2009 The McGraw-Hill Companies, All Rights Reserved.
Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.
WEBQUEST Let’s Begin TITLE AUTHOR:. Let’s continue Return Home Introduction Task Process Conclusion Evaluation Teacher Page Credits This document should.
Hadoop(MapReduce) in the Wild —— Our current understandings & uses of Hadoop Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan Presenter: Le Zhao
Technical Writing II Acknowledgement: –This lecture notes are based on many on-line documents. –I would like to thank these authors who make the documents.
Network modeling of the Ebola Outbreak Ahmet Aksoy.
1 Chapter 20 — Creating Web Projects Microsoft Visual Basic.NET, Introduction to Programming.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.
Ebola Virus Outbreak This presentation has been prepared by Christine H. Herrmann, Ph.D. of the Department of Molecular Virology and Microbiology at Baylor.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
WorkPlace Pro Utilities.
T raining on Read&Write GOLD Dick Powers
© 2003 East Collaborative e ast COLLABORATIVE ® eC SoftwareProducts TrackeCHealth.
Put the Title of the WebQuest Here A WebQuest for xth Grade (Put Subject Here) Designed by (Put Your Name Here) Put Your Address Here Put some interesting.
1 Computing Software. Programming Style Programs that are not documented internally, while they may do what is requested, can be difficult to understand.
An Approach for Processing Large and Non-uniform Media Objects on MapReduce-Based Clusters Rainer Schmidt and Matthias Rella Speaker: Lin-You Wu.
Copyright © 1994 Carnegie Mellon University Disciplined Software Engineering - Lecture 3 1 Software Size Estimation I Material adapted from: Disciplined.
The Range of Disease Activity #31 Mrs. Hunter, Mr. Robertson and Mrs. Kozuch.
Natural language processing tools Lê Đức Trọng 1.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
IR Homework #3 By J. H. Wang May 4, Programming Exercise #3: Text Classification Goal: to classify each document into predefined categories Input:
A PRESENTATION BY: DETTA MOHAMAD ALNAAL JAMES BURGESS BUROOJ MUSHTAQ ANIMAN RANDHAWA Assignment 2: Ebola.
Reading Flash. Training target: Read the following reading materials and use the reading skills mentioned in the passages above. You may also choose some.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Unit #7 Charts Questions? Comments?. MS PPT 2007: Presentations Made Easy; Planning and Preparing PowerPoint allows you to create a professional presentation.
©SoftMooreSlide 1 Introduction to HTML: Forms ©SoftMooreSlide 2 Forms Forms provide a simple mechanism for collecting user data and submitting it to.
Self Learning Project English Presentation 10 % of your English grade!
Teaching Big Data Through Problem-Based Learning Richard Gruss, Business Information Technology, Virginia Tech Tarek Kanan Software Engineering Department.
Ebola By: Alex Kidd, Justice Swain, Brandon Weese, Alexis Hatton, and Taylor Miller.
CS 350: Introduction to Software Engineering Slide Set 3 Estimating with Probe I C. M. Overstreet Old Dominion University Spring 2006.
Ebola Virus BY: HEATHER BRANDSTETTER SAMANTHA LACLAIR JENNA HENSEL DANIELLE GILFUS.
2014 Ebola outbreak in West Africa 2014 Ebola Virus Disease (EVD) Outbreak in West Africa: Status report Sarah L Barber Representative World Health Organization,
Text2PTO: Modernizing Patent Application Filing A Proposal for Submitting Text Applications to the USPTO.
START Application Spencer Johnson Jonathan Barella Cohner Marker.
Office of Global Health and HIV (OGHH) Ebola Community Education and Preparedness Training Materials.
IR Homework #2 By J. H. Wang May 9, Programming Exercise #2: Text Classification Goal: to classify each document into predefined categories Input:
Unit 5: Developing the Training Program 1 © SHRM 2009.
A Simple Approach for Author Profiling in MapReduce
Answers to Your Questions about
Dependency Analysis Use Cases
A Straightforward Author Profiling Approach in MapReduce
Technical & Research Posters: Text, Visuals, & Presentation
An Open Source Project Commonly Used for Processing Big Data Sets
Ebola Virus Table Top Exercise
Computer Aided Software Engineering (CASE)
FORMAL SYSTEM DEVELOPMENT METHODOLOGIES
Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista
Extraction, aggregation and classification at Web Scale
UNH Programming Assistance Center Automation
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ebola Virus Disease (EVD) WHAT IS IT?
EBOLA VIRUS DISEASE Joseph P. Iser, MD, DrPH, MSc Southern Nevada Health District.
Overview of big data tools
Computational Linguistic Analysis of Earthquake Collections
Through the Fire and Flames
Charles Tappert Seidenberg School of CSIS, Pace University
Extracting Recipes from Chemical Academic Papers
Needs Assessment Slides for Module 4
Welcome to HOSA 10/21/14.
Planning, Composing & Revising
Map Reduce, Types, Formats and Features
Presentation transcript:

Richard Gruss, Daniel Morgado, Nate Craun, Colin Shea-Blymyer OutbreakSum Automatic Summarization of Texts Relating to Disease Outbreaks CS 4984: Computational Linguistics 12/10/2014 Richard Gruss, Daniel Morgado, Nate Craun, Colin Shea-Blymyer Note that the presentations will be graded, considering the following: * Including suitable metadata, e.g., on first and last slides * Time management and organization (proper balance of: outline, intro, problem statement, approach, interesting findings, lessons learned, summary) * Clarity of explanation (not reading words from slides, rephrasing to aid understanding, logical flow) * Effective use of visual aids (good use of color and spacing, multimedia) * Informative tables * Insightful figures * Font size sufficient so can see from back of room * Courtesy: brief acknowledgements and references * Clear speech (suitable pace, articulation, eloquence) * Eye contact, looking at audience not back at screen * Handling questions (confidence, listening to the question, addressing the question, giving correct response)

Collection Processing Your Small Encephalitis 365 files Your Big Ebola 15,000 files News Articles Size of collection & makeup Encoding Issues, Different Languages, File Sizes, Numbers of Files, Blank Files, Duplicate files, Classification Number of files in report - should be 365 Be better to have html source and process that: Log in, Nav Bar, etc Web Stop Words - Log in, Must have JS enabled, Flash, etc. Segue Small Files

Collection Processing Problems Non-English Languages Blank Files Irrelevant Files Duplicate Files Solutions Hashing Classification Size of collection & makeup Encoding Issues, Different Languages, File Sizes, Numbers of Files, Blank Files, Duplicate files, Classification Number of files in report - should be 365 Be better to have html source and process that: Log in, Nav Bar, etc Web Stop Words - Log in, Must have JS enabled, Flash, etc. Segue Small Files

Collection Processing Classification Features tfidf Document Length Target Words Future Improvements “Web” Stop Words HTML Parsing Size of collection & makeup Encoding Issues, Different Languages, File Sizes, Numbers of Files, Blank Files, Duplicate files, Classification Number of files in report - should be 365 Be better to have html source and process that: Log in, Nav Bar, etc Web Stop Words - Log in, Must have JS enabled, Flash, etc. Segue Small Files

Hadoop: Performance Small file problem Passing data representations Excessive calls to Map function Reading 64mb blocks containing few kb of data. Built corpus_squeeze.py utility Classifying 7k documents: 13m27s → 1m18s Passing data representations JSON pickle: ASCII serialized Python objects Some Hadoop performance issues and solutions we ran into were With regards to the… Small File Problem: Map function is being called in excess, lots of overhead associated with it. With a default block size of 64mb HDFS & Hadoop are designed to work well on large files. As we all experienced Hadoop performs pretty poorly with large quantities of small files. I/O overhead of reading a 64mb block with only a few kb or less of data. Built a corpus_squeeze utility module which compressed the input directory into k equal-sized files. Passing data representations Another challenge is efficiently passing data between map and reduce phases. Started with JSON objects, but as our needs grew we had to switch to pickle, which is a serialized python object, allowed us to directly load Python objects from our mapper into our reducer.

Hadoop: Workflow Streaming interface does not facilitate code reuse. Framework for reusable mapper and reducer Shared utilities module All jobs define same 3 functions: Compose → Combine → Compute Debugging MapReduce simulation script Some workflow challenges we encountered were… Streaming Interface: Being able to reuse code under the hadoop streaming interface. It’s particularly easy to just create your own mapper and reducer script for each kind of job you want to run but you end up repeating a lot of work and then any modification in your methodology requires you to go back and individual modify every affected (mapper, reducer) pair you’ve built prior. Built a framework for a reuseable Shared utilities module for reading/writing functions and the like All of our jobs reside in the same module and follow the same interface which allowed us to swap out any job in our mapper and reduce files. Debugging: Built a MapReduce simulation script script which ran our jobs locally identical to the Hadoop process such that we could debug all errors locally before deployment. Saved us a lot of time and headache being able to write MapReduce jobs in our local IDEs

Language Processing Find words with highest tf-idf vaccine cure antibiotic hospital health contract transmit organism toxic sanitation cholera case die patients children environment study kill medicine CDC WHO diagnose Find words with highest tf-idf Create “your words” list of intuitively identifying words Combination of these lists used for classification disease outbreak epidemic infect quarantine sick airborne waterborne pathogen virus bacteria parasite symptoms treatment specific to diseases in our collection first unit to last unit w/ approaches suggested POS (did not get much out of it alone, means to the end) - lemmatization, tokenization, NER a good end result was to receive passages that had these terms in it

Language Processing Named Entity Recognition → Best Results Stanford NER - Very effective only when passed a single tokenized sentence NLTK NE Chunk - Far less accurate than SNER Open NLP - Faster and more effective than SNER Clustering & Topics → Bad Results Tokenization: nltk default sentence tokenization was fairly inexact Used Reuters, and Brown news collections to train better sentence tokenizer. Can detect sentences despite improper spacing before/after sentences. we wanted sentence tok, used a trained tokenizer that ^ paragraph tokenization Named Entity Recognition ← Best results Stanford NER - really effective when passed a single tokenized sentence NLTK NE Chunk - not as accurate as SNER Open NLP - Fast and effective ~clustering, topics ← Bad results preprocessing

Summarization: Template [DISEASE] has struck [LOCATION]. As of [DATE], there have been [NUMBER] number of cases either killed or hospitalized. Authorities are [AUTHORITIES MEASURES]. [DISEASE] spread [HOW IT SPREADS]. [HEALTH ORGANIZATION] is [WHAT HEALTH ORG IS DOING]. [DISEASE] include [SYMPTOMS]. Treatments include. [TREATMENTS]. The total cost of the outbreak is estimated at [MONEY AMOUNT]. The disease may spread to [OTHER LOCATIONS]. goals, efforts, problems faced Grammar, examples, dollar signs tricky currency strings, date strings, relative dates buckets , order of magnitude We had enough data to not worry about particular confusions that might occur, disregarded ambiguous scenarios. n-grams (“_ outbreak”) most recent date regex with “cases” or “deaths” at this point, we just used one or the other government adjectivization liberian government is threatening to close down the market which sits next to the largest township, nigerian government issuing warnings that the treatment isn't effective , the rumor is being spread by text message and social media hand-picked our top organizations centers for disease control issued guidance to all u.s. hospitals and mortuaries on the safe handling of human remains of ebola patients world health organization is convening a meeting of medical ethicists next week to examine what it calls "the responsible thing to do" about whatever supplies eventually may become available of a medicine that's never been tested in people symptoms: symptoms, webMD, POS tag, regex : POS grammar: grammer regex {(ADJ)*(NOUN)+(ADV)*(VERB)*} http://www.rightdiagnosis.com/lists/symptoms.htm No treatments

Summarization: NLG Report using NLG Database {'date': '3069-12-02', 'source': '10208-3', 'cases': 0, 'location': 'Liberia', 'deaths': '1552'} {'date': '2014-12-04', 'source': '10219-4', 'cases': 0, 'location': 'West Africa', 'deaths': '700'} {'date': '2014-12-04', 'source': '10219-4', 'cases': 0, 'location': 'Liberia', 'deaths': '700'} {'date': '2014-12-02', 'source': '1023-5', 'cases': '201', 'location': 'West Africa', 'deaths': '672'} parseable dates tacking back to paragraphs/docs top locations of interest Numbers are easily found with regex, but how do we choose the right ones?

Summarization: NLG News in the United States of 4000 cases recently reported in West Africa has caused concern. (NP (NP (NNP News)) (PP (IN in) (NP (DT the) (NNP United) (NNPS States)) (PP (IN of) (NP (CD 4000) (NNS cases)) (NP (NNP West) (NNP Africa)))))))) (VP (VBZ has) (VP (VBN caused) (NP (NN concern)))

Summarization: NLG summary ::= intro-sentence detail-paragraph history-paragraph closing-sentence intro-sentence ::= "There has been an outbreak of Ebola in the following locations: " location-list detail-paragraph ::= detail-sentence detail-sentence-with-transition* detail-sentence-with-transition ::= transition detail-sentence transition ::= ("Also" | "In addition" | "Likewise" | "Additionally" | "Furthermore") detail-sentence "In [month year], there were between [num] and [num] cases of Ebola in location, with between [num] and [num] deaths." history-paragraph ::= "There have been earlier cases of Ebola." history-sentence history-sentence-with-transition* history-sentence-with-transition ::= transition + history-sentence history-sentence ::= "Ebola was found in [location] in [year]." closing_sentence ::= "[Ebola boilerplate]"

Summarization: NLG There has been an outbreak of Ebola reported in the following locations: Liberia, West Africa, Nigeria, Guinea, and Sierra Leone. In January 2014, there were between 425 and 3052 cases of Ebola in Liberia, with between 2296 and 2917 deaths. Additionally, In January 2014, there were between 425 and 4500 cases of Ebola in West Africa, with between 2296 and 2917 deaths. Also... There were previous Ebola outbreaks in the past. Ebola was found in 1989 in Liberia. As well, Ebola was found in 1989 in West Africa. ... Ebola virus disease (EVD; also Ebola hemorrhagic fever, or EHF), or simply Ebola, is a disease of humans and other primates caused by ebolaviruses...

Questions? Special thanks to: Edward Fox Xuan Zhang Tarek Kanan Mohammad Halimi Sponsors: NSF DUE-1141209 IIS-1319578

Additional Slides

Hadoop → Local Transition Task Method Local Time Cluster Time Named Entity Recognition Stanford NER > 4hours 61m18.391s OpenNLP 8m26.338s

MapReduce

Compose → Combine → Compute