Deliverable #2 Goals: Become familiar with shared task summarization data Implement initial base system with all components Focus on content selection.

Slides:



Advertisements
Similar presentations
How To Use TEEL.
Advertisements

Ailab.ijs.si Jasna Škrbec Blaž Fortuna Marko Grobelnik Exploring & Visualization of News Archives.
Refining the structure
The Columbine Massacre The Columbine High School massacre was a school shooting which occurred on April 20, 1999, at Columbine High School in Columbine,
Information Retrieval in Practice
Staff Included in the Career Family Classification System Working Titles.
Jumping Off Points Ideas of possible tasks Examples of possible tasks Categories of possible tasks.
1 I256: Applied Natural Language Processing Marti Hearst October 18, 2006.
Project title Team Members. Project Title Brief description of the project in bullet form.
XML Primer. 2 History: SGML vs. HTML vs. XML SGML (1960) XML(1996) HTML(1990) XHTML(2000)
Trade Shows & Mall Displays January Introduction. The National Member Services Committee has developed a series of National Education Seminars to.
Overview of Search Engines
HDF 1 NCSA HDF XML Activities Robert E. McGrath Mike Folk National Center for Supercomputing Applications.
Preparing Technical Presentations CIVL 1101, Fall 2008 Project 1: Topographic Model.
ULI101 – XHTML Basics (Part II) What is Markup Language? XHTML vs. HTML General XHTML Rules Block Level XHTML Tags XHTML Validation.
Karen Herter (HMG) Mike Langley (DGS) April 15, 2008 Portfolio Manager for California State Buildings Meeting the Requirements of Executive Order S
SCHOOL VIOLENCE. THE BATH SCHOOL DISASTER OF 1927 – 45 VICTIMS 38 schoolchildren, aged 7-14, were killed when a series of bombs went off in Bath Township,
XHTML 1.1  Derived from Standard Generalized Markup Language (SGML) of ISO  XHTML concerned primary with content rather than presentation and style 
K.Furukawa, Nov Database and Simulation Codes 1 Simple thoughts Around Information Repository and Around Simulation Codes K. Furukawa, KEK Nov.
Guide To UNIX Using Linux Third Edition Chapter 8: Exploring the UNIX/Linux Utilities.
1 Construction Chapter Key Concepts Be familiar with the system construction process. Understand different types of tests and when to use Understand.
New Jersey Ask Testing Writing About Quotes. What is a quote/aphorism?  a passage quoted from a book, story, and/or poem  Words of wisdom  Has important.
Project 2 Completing a Three Dimensional Workspace Using Logical and Lookup Functions Jason C. H. Chen, Ph.D. Professor of Management Information Systems.
The Chardon High School shooting occurred on February 27, 2012, at Chardon High School in Chardon, Ohio, United States. At the time of the incident, it.
XML Device Description PP a-MMA_XMLDeviceDescriptionWorkShop.pptx M. Marchhart 1 Workshop.
PROPOSAL Steps to Writing and Revising your Proposal.
IR Homework #2 By J. H. Wang May 9, Programming Exercise #2: Text Classification Goal: to classify each document into predefined categories Input:
TOPSpro Special Topics VI:TOPSpro for Instructors.
Information Retrieval in Practice
PSYCH 625 MENTOR It's Your Life/psych625mentor.com
Main Ideas: Stated and Implied
EET 2259 Unit 13 Strings and File I/O
The E-Rate Program Tips for Success Fall 2011 Applicant Trainings.
Search Engine Architecture
Use Cases Discuss the what and how of use cases: Basics Benefits
English Week 6.
Writing A Character Essay
Systems & Applications: Introduction
Ling573 Systems & Applications April 7, 2016
Basic Microsoft Word 2013.
EL (English Language) Students and WIDA Standards
“The Wizards of Perfil”
Enter Presentation Title
Evergreen Data Systems
Response to Film Essay CAT #2 Year 7 ENH (Term 4).
To understand character & setting in a story
How To Use TEEL.
Response to Film Essay CAT #2 Year 7 ENH (Term 4).
Microsoft Excel 101.
Web Systems Development (CSC-215)
OSEP Initiatives on Early Childhood Outcomes
Book review What? How? When?.
Systems Analysis and Design
Book review What? How? When?.
New Perspectives on XML
Read Chapter in Elie Wiesel’s Night
Short Story Pp Project.
Resource Recommendation for AAN
Enter Presentation Title
ANNOTATED BIBLIOGRAPHY INFORMATION
DB Project Database Systems
Current Events Every Monday!.
Writing the 5-paragraph essay! WOOOOHOOOOOO!!!
Hands-On: FSA Assessments For Foreign Schools
Sports Psychology Coach Doron
CSE591: Data Mining by H. Liu
EET 2259 Unit 13 Strings and File I/O
SNoW & FEX Libraries; Document Classification
Writing Text-based Introductions
Writing Text-based Introductions
Presentation transcript:

Deliverable #2 Goals: Become familiar with shared task summarization data Implement initial base system with all components Focus on content selection Evaluate resulting summaries

TAC 2010 Shared Task Basic data: Test Topic Statements: Document sets: Brief topic description List of associated document identifiers from corpus Document sets: Drawn from AQUAINT/AQUAINT-2 LDC corpora Available on patas Summary results: Model summaries

Topics <topic id = "D0906B" category = "1"> <title> Rains and mudslides in Southern California </title> <docsetA id = "D0906B-A"> <doc id = "AFP_ENG_20050110.0079" /> <doc id = "LTW_ENG_20050110.0006" /> <doc id = "LTW_ENG_20050112.0156" /> <doc id = "NYT_ENG_20050110.0340" /> <doc id = "NYT_ENG_20050111.0349" /> <doc id = "LTW_ENG_20050109.0001" /> <doc id = "LTW_ENG_20050110.0118" /> <doc id = "NYT_ENG_20050110.0009" /> <doc id = "NYT_ENG_20050111.0015" /> <doc id = "NYT_ENG_20050112.0012" /> </docset> <docsetB id = "D0906B-B"> <doc id = "AFP_ENG_20050221.0700" /> ……

Documents <DOC><DOCNO> APW20000817.0002 </DOCNO> <DOCTYPE> NEWS STORY </DOCTYPE><DATE_TIME> 2000-08-17 00:05 </DATE_TIME> <BODY> <HEADLINE> 19 charged with drug trafficking </HEADLINE> <TEXT><P> UTICA, N.Y. (AP) - Nineteen people involved in a drug trafficking ring in the Utica area were arrested early Wednesday, police said. </P><P> Those arrested are linked to 22 others picked up in May and comprise ''a major cocaine, crack cocaine and marijuana distribution organization,'' according to the U.S. Department of Justice. </P>

Notes Topic files: IDs reference documents in AQUAINT corpora Include both docsetA and docsetB Use ONLY *docsetA* “B” used for update task IDs reference documents in AQUAINT corpora

Notes AQUAINT/AQUAINT-2 corpora Subset of Gigaword Format is SGML Used in many NLP shared tasks Format is SGML Not fully XML compliant Includes non-compliant characters: e.g. with &s May not be “rooted” Some differences between subcorpora Span different date ranges

Tips & Tricks Handling SGML with XML tools Non-uniform corpora: Elementtree has recover mode: E.g. parser = etree.XMLParser(recover=True)                    data_tree = etree.parse(f, parser) Consider escaping &-prefixed content Varied paragraph structure: .xpath(".//TEXT//P|.//TEXT") Non-uniform corpora: You may hard-code corpus handling Or create configuration files

Model Summaries Five young Amish girls were killed, shot by a lone gunman. At about 1045, on October 02, 2006, the gunman, Charles Carl Roberts IV, age 32, entered the Georgetown Amish School in Nickel Mines, Pennsylvania, a tiny village about 55 miles west of Philadelphia. He let the boys and the adults go, before he tied up the girls, ages 6 to 13. Police and emergency personnel rushed to the school but the gunman killed himself as they arrived. His motive was unclear but in a cell call to his wife he talked about abusing two family members 20 years ago.

Initial System Implement end-to-end system From reading in topic files to summarization to eval Need at least basic components for: Content selection Information ordering Content realization Focus on content selection for D2: Must be non-trivial (i.e. non-random/lead) Others can be minimal (i.e. “copy” for content real.)

Summaries Basic formatting: 100 word summaries Just ASCII, English sentences No funny formatting (bullets, etc) May output on multiple lines One file per topic summary All topics in single directory

Summarization Evaluation Primarily using ROUGE Standard implementation ROUGE-1, -2, -4: Scores found to have best correlation with responsiveness Primary metric: ROUGE Recall (“R”) Store in results directory

Model & Output Names Topic id=D0901A Summary file name: D0901-A.M.100.A.A 1. Split document id on: id_part1=D0901 and id_part2=A 2. Construct filename as: [id_part1]- [docset].M.[max_token_count].[id_part2].[some_unique_a lphanum]