 LREC 2008, Marrakech Morocco - May 30 2008 New Resources for Document Classification, Analysis and Translation Technologies Stephanie Strassel, Lauren.

Presentation transcript:

LREC 2008, Marrakech Morocco - May 30 2008
New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain
{strassel, lf, safa, david4, maeda,
Linguistic Data Consortium

Presentation Outline
- MADCAT Program Overview
- Technology Challenges
- Roadmap
- Data Creation
- Phase 1 Data Profile
- Processing
- Collection
- Annotation
- Data Format
- Evaluation
- Conclusions and Future Work

MADCAT Overview
- MADCAT: Multilingual Document Classification Analysis and Translation
- A 5-year DARPA program
- MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner
- Multiple input types and domains
  - Hard-copy, PDF, camera-captured
  - Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.

Technology Challenges
- Extract relevant metadata about the document structure
- Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies
- Create end-to-end system for deployment at program's end with over 90% accuracy
  - Current baseline is ~2%
- Primary evaluation metric is edit distance: HTER
  - Same protocols as used in the GALE program
- Limited focus in Phase 1
  - Arabic > English
  - High resolution (600 dpi) images of handwritten newspaper and web text
  - Topics primarily news, current events and commentary
  - Manual segmentation provided

Phase Roadmap
[Figure: roadmap chart with rows Genre, Topic, Medium, Source and Data Quality across four stages: Pre-MADCAT (state of the art); Phase 1 (add handwriting); Phase 2-3 (new data types); Phase 4-5 (new genres, topics, quality conditions). Labels in the chart include Newswire, Broadcast, Talk Shows, Weblogs; Letters, Forms, Ledgers, Diaries, Calendars, Instructions, Books, Training Manuals, Personal Identification, Maps, Poems, Verdicts; News, Commentary, Science, Engineering, Personal, Religious, Military, Other; Printed, Handwritten; Controlled, Uncontrolled.]

Phase 1 Data Profile
- In Phase 1, data drawn from DARPA GALE program
  - New collection to acquire handwritten versions
  - Genres: Formal text (newswire) and informal text (weblogs)
- Benefits
  - Eliminates domain mismatch between GALE state of the art MT models and MADCAT test sets
  - Allows developers to focus on primary challenge: handwriting
  - Data characteristics well understood; cost and time factors are reasonably well known
  - Training data costs controlled since translations exist
  - Production begins immediately, training data available sooner
  - Provides controlled test sets for evaluation across programs
- Subsequent phases will add new data types, genres and other challenge elements

Training and DevTest
- Training
  - Minimum 2000 unique pages
    - Half formal (newswire), half informal (web text)
    - words per page
  - Minimum 100 unique scribes in training pool
  - 5 scribes per page
  - At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set
- DevTest
  - 320 unique pages
    - Half formal (newswire), half informal (web text)
    - 125 words/page
  - 50 scribes in devtest pool
    - 25 from training, 25 previously unseen
  - 2 scribes per page, ~7 pages per scribe
  - Total of 640 manuscripts; 80,000 words

Evaluation Data
- 320 unique pages from GALE P3 Eval set
  - Half formal (newswire), half informal (web text)
  - 125 words/page
- 50 scribes in eval partition
  - 25 from training, 25 previously unseen
- 6 scribes per page, ~40 pages per scribe
- Total of 1920 manuscripts; 240,000 words
- Subset of eval set designated for pilot evaluation in September 2008
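The manuscript and word totals quoted for the training, devtest and evaluation sets follow from pages times scribes per page times words per page. The sketch below simply redoes that arithmetic; the `Partition` class and its field names are illustrative, not part of any LDC tooling. The per-scribe figure assumes manuscripts are spread evenly over the scribe pool, which the slides do not state, so it is printed only for comparison with the approximate figures quoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical container for the figures quoted on the slides."""
    name: str
    unique_pages: int
    scribes_per_page: int
    words_per_page: int
    scribe_pool: int

    @property
    def manuscripts(self) -> int:
        # One manuscript (scribe-page) per scribe per unique page.
        return self.unique_pages * self.scribes_per_page

    @property
    def total_words(self) -> int:
        return self.manuscripts * self.words_per_page

    @property
    def manuscripts_per_scribe(self) -> float:
        # Assumes an even split over the pool (an assumption of this sketch).
        return self.manuscripts / self.scribe_pool


devtest = Partition("devtest", unique_pages=320, scribes_per_page=2,
                    words_per_page=125, scribe_pool=50)
evalset = Partition("eval", unique_pages=320, scribes_per_page=6,
                    words_per_page=125, scribe_pool=50)

# Training: minimum pages times scribes per page gives the manuscript floor.
training_manuscripts = 2000 * 5

for part in (devtest, evalset):
    print(part.name, part.manuscripts, part.total_words,
          round(part.manuscripts_per_scribe, 1))
print("training manuscripts (minimum):", training_manuscripts)
```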

Data Preparation
- Start with electronic text from GALE
  - Whole documents collected from newswire or web
  - Segmented into SUs (semantic/sentence units)
  - Each segment manually translated
- Pre-processing prior to handwriting
  - Tokenization to words for later stages
  - Segments reordered and formatting added to create optimal pages for handwriting assignment
    - Roughly 5 words/line to avoid line wrapping
    - No more than 25 lines/page to avoid page breaks
- After handwriting, images scanned at high resolution (600 dpi, greyscale)
- Images are ground truth annotated at line, word level
- Major challenge is logical storage of many layers of information across multiple versions of the same data
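The page layout constraints (roughly 5 words per line, at most 25 lines per page) can be illustrated with a small packing routine. This is a minimal sketch, not LDC's actual pre-processing: the function names are hypothetical, and it assumes each segment starts on a new line and is moved to a fresh page when it would not fit on the current one.

```python
from typing import List

Line = List[str]
Page = List[Line]


def wrap_segment(tokens: List[str], words_per_line: int = 5) -> List[Line]:
    """Split one tokenized segment into lines of at most words_per_line words."""
    return [tokens[i:i + words_per_line]
            for i in range(0, len(tokens), words_per_line)]


def layout_pages(segments: List[List[str]],
                 words_per_line: int = 5,
                 max_lines: int = 25) -> List[Page]:
    """Pack segments into pages without exceeding max_lines per page."""
    pages: List[Page] = []
    current: Page = []
    for seg in segments:
        lines = wrap_segment(seg, words_per_line)
        # Start a new page if this segment would overflow the current one.
        if current and len(current) + len(lines) > max_lines:
            pages.append(current)
            current = []
        for line in lines:
            # A segment longer than a full page is split across pages.
            if len(current) == max_lines:
                pages.append(current)
                current = []
            current.append(line)
    if current:
        pages.append(current)
    return pages


example_pages = layout_pages([["w"] * 12, ["w"] * 130])
print([len(page) for page in example_pages])
```

With 5 words per line and 25 lines per page, a full page holds 125 words, which matches the words-per-page figure given for the devtest and evaluation sets.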

Collection
- New human subjects collection required to produce handwritten versions of existing data
- Pilot collection currently underway at LDC in Philadelphia
  - LDC Arabic staff and recent Iraqi immigrants in Philly
- Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt
- Regional variety necessary to capture stylistic writing differences
  - E.g. use of Indic vs. Arabic numbers
- Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol
  - Scribe partition (train only, test only, both)
  - Writing conditions
  - Regional variation
  - Genre, topic and source balance

Writing Conditions
- Implement
  - 90% ballpoint pen (I)
  - 10% pencil (P)
- Paper
  - 75% unlined white paper (U)
  - 25% lined paper (L)
- Writing speed
  - 90% normal (N)
  - 5% fast (F)
  - 5% careful (C)
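One way to turn these percentages into per-page assignments is weighted random sampling over the three dimensions. The sketch below is an assumption about how such an assignment could be generated: the slides give only the marginal percentages, so drawing each dimension independently (rather than filling fixed quotas) and the three-letter condition code are choices made for this example.

```python
import random
from typing import Dict

# Marginal proportions and letter codes as listed on the slide.
CONDITIONS: Dict[str, Dict[str, float]] = {
    "implement": {"I": 0.90, "P": 0.10},        # ballpoint pen, pencil
    "paper": {"U": 0.75, "L": 0.25},            # unlined, lined
    "speed": {"N": 0.90, "F": 0.05, "C": 0.05}, # normal, fast, careful
}


def sample_conditions(rng: random.Random) -> Dict[str, str]:
    """Draw one writing condition per dimension, independently (assumption)."""
    return {
        dim: rng.choices(list(options), weights=list(options.values()), k=1)[0]
        for dim, options in CONDITIONS.items()
    }


def condition_code(cond: Dict[str, str]) -> str:
    """Concatenate the letter codes, e.g. 'IUN' (hypothetical label format)."""
    return cond["implement"] + cond["paper"] + cond["speed"]


rng = random.Random(2008)  # fixed seed so the assignment is reproducible
codes = [condition_code(sample_conditions(rng)) for _ in range(50)]
print(codes[:10])
```

Independent sampling only approximates the target proportions for a small kit; a quota-based assignment would hit them exactly and may be closer to what a centralized assignment protocol does.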

Collection Workflow
- LDC selects source data
- LDC generates kits (documents + writing conditions)
- LDC delivers data kits to collection sites
- Sites publicize study and recruit participants
- Scribe visits public URL, contacts site coordinator
- Site coordinator schedules appointment
- Scribe comes in, takes writing sample test
- Site coordinator verifies scribe eligibility
- Site coordinator logs in to secure website via login page
- Scribe completes registration via registration page
- Scribe verifies info via confirmation page
- Site coordinator prints out subject ID and instructions for subject via assignment page
- Coordinator pulls kit for this subject ID
- Scribe leaves with kit and instructions
- Scribe returns completed kit to site
- Coordinator verifies kit completeness and arranges payment
- Coordinator files completed kit for scanning/delivery
- Site scans completed kit(s) as safeguard
- Site uploads image file to LDC
- Site ships completed paper kit(s) to LDC for archiving
- LDC processes completed kits for subsequent tasks

Scribe Demographics
- Scribes register in person at collection site and take writing test
  - To assess literacy and ability to follow instructions
- Enter demographic info on LDC's secure server
  - Name, address (for payment purposes only)
  - Age, gender, level of education, occupation
  - Where born, where raised
  - Primary language of educational instruction
  - Handedness
- After registration, scribes receive brief tutorial
  - No line wrapping, no page breaks
  - Copy text exactly: no omissions or insertions, no corrections to source text

Scribe Assignments
- Assignments are in the form of printed "kits"
  - 50 printed pages to be copied plus assignment table
    - Assignment table specifies page order and writing conditions
  - Multiple scribes/kit, so conditions and order vary
- Printed pages labeled with page and kit ID
- Scribes affix label with scribe, page and kit ID to back of completed manuscript
  - To facilitate data tracking during scanning and post-processing
- Scribes supply paper and writing instrument
  - To sample natural variation
- Payment per completed kit
  - Exhaustive check on first assignment (completeness and accuracy)
  - Spot check on remainder of assignments

Ground Truthing
- Zones created at word level only for Phase 1
  - Lines can be extrapolated from annotation
  - Other zone types possible in future phases
    - Structural elements (e.g. signature block)
- Explicit reading order preserved
- Locations are polygons
  - Restricted to upright rectangles in the first phase
- Each zone contains a unique ID, the contents, location (coordinates)
  - Status tags to accommodate scribe mistakes
    - extra, missing, typo
  - nextZoneID tag to indicate reading order
- In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
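The zone description amounts to a linked list of word boxes: each zone carries an ID, its contents, a rectangle, an optional status tag and a pointer to the next zone in reading order. The sketch below models that under stated assumptions; the `Zone` class, its field names and the `(left, top, right, bottom)` box layout are illustrative and are not the GEDI or MADCAT schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Zone:
    """Hypothetical word-level zone record."""
    zone_id: str
    contents: str
    box: Tuple[int, int, int, int]      # upright rectangle: left, top, right, bottom
    next_zone_id: Optional[str] = None  # plays the role of the nextZoneID tag
    status: Optional[str] = None        # "extra", "missing", "typo", or None


def reading_order(zones: List[Zone], first_id: str) -> List[Zone]:
    """Follow next_zone_id links from first_id to recover the reading order."""
    by_id: Dict[str, Zone] = {z.zone_id: z for z in zones}
    ordered: List[Zone] = []
    seen = set()
    current: Optional[str] = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in reading order at zone {current}")
        seen.add(current)
        zone = by_id[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


# Zones stored in arbitrary order; Arabic reads right to left, so the first
# word has the largest x coordinates in this made-up example.
zones = [
    Zone("z2", "word2", (300, 100, 380, 140), next_zone_id="z3"),
    Zone("z3", "word3", (200, 100, 280, 140), status="typo"),
    Zone("z1", "word1", (400, 100, 480, 140), next_zone_id="z2"),
]
print([z.zone_id for z in reading_order(zones, "z1")])
```

Storing the order explicitly, rather than inferring it from coordinates, keeps the reading order recoverable even when a scribe's lines drift or overlap.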

GEDI Toolkit
- GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)

Data Format
The MADCATUnifier process takes multiple data streams and generates a single XML output file which contains all required information:
1) Text layer
   * Source Text
   * Tokenization
   * SU Segmentation
   * Translation
2) Image layer
   * Zone bounding boxes
3) Scribe demographics
4) Document metadata
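To make the idea of a single unified file concrete, the sketch below merges the four streams listed above into one XML document using Python's standard `xml.etree.ElementTree`. It is not the MADCATUnifier and does not reproduce the real MADCAT schema: every element and attribute name here (`madcat_doc`, `text_layer`, `zone`, and so on) and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_unified_record(doc_id: str,
                         segments: List[dict],
                         zones: List[dict],
                         scribe: Dict[str, str],
                         metadata: Dict[str, str]) -> str:
    """Combine text, image, scribe and document streams into one XML string."""
    root = ET.Element("madcat_doc", attrib={"id": doc_id})

    # 1) Text layer: source text, tokenization, SU segmentation, translation.
    text_layer = ET.SubElement(root, "text_layer")
    for seg in segments:
        seg_el = ET.SubElement(text_layer, "segment", attrib={"id": seg["id"]})
        ET.SubElement(seg_el, "source").text = seg["source"]
        tokens_el = ET.SubElement(seg_el, "tokens")
        for i, tok in enumerate(seg["tokens"]):
            tok_el = ET.SubElement(tokens_el, "token",
                                   attrib={"id": f"{seg['id']}-t{i}"})
            tok_el.text = tok
        ET.SubElement(seg_el, "translation").text = seg["translation"]

    # 2) Image layer: zone bounding boxes.
    image_layer = ET.SubElement(root, "image_layer")
    for zone in zones:
        ET.SubElement(image_layer, "zone",
                      attrib={k: str(v) for k, v in zone.items()})

    # 3) Scribe demographics and 4) document metadata as attribute sets.
    ET.SubElement(root, "scribe", attrib=scribe)
    ET.SubElement(root, "document_metadata", attrib=metadata)

    return ET.tostring(root, encoding="unicode")


record = build_unified_record(
    doc_id="doc001",
    segments=[{"id": "s1", "source": "src1 src2", "tokens": ["src1", "src2"],
               "translation": "Hello world"}],
    zones=[{"id": "z1", "left": 400, "top": 100, "right": 480, "bottom": 140},
           {"id": "z2", "left": 300, "top": 100, "right": 380, "bottom": 140}],
    scribe={"id": "scribe042", "handedness": "right"},
    metadata={"genre": "newswire", "conditions": "IUN"},
)
print(record)
```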

Evaluation
- Input: (segmented) Arabic handwritten image
- Output: segmented English text
- HTER is primary evaluation metric (edit distance)
  - Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible
- NIST-developed MTPostEditor GUI
  - Editors review segment-aligned MT and gold standard translation
  - No access to original Arabic text or handwritten image file
- No official separate evaluation of OCR or processing components
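The edit-distance idea behind HTER can be sketched as the number of word edits separating the MT output from its post-edited version, divided by a reference length. The code below is a simplified illustration only: it counts insertions, deletions and substitutions but not the block shifts that TER-style scoring also allows, and normalizing by the post-edited segment's length is an assumption of this sketch. Official scoring tools may differ on both points.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # delete a hypothesis word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def hter_no_shifts(mt_output: str, post_edited: str) -> float:
    """Edits from MT output to its post-edited version, per post-edited word."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(hyp, ref) / len(ref)


# Made-up segment pair: one substitution plus one insertion over six words.
mt = "the minister visit capital yesterday"
edited = "the minister visited the capital yesterday"
print(round(hter_no_shifts(mt, edited), 3))
```

Because shifts are not modeled, a reordered phrase costs several edits here where a shift-aware scorer would count one, so this sketch can overstate the edit rate.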

Conclusions; Future Work
- LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available
  - Phase 1: Large collection of Arabic handwritten, translated, segmented, ground truthed text
  - Infrastructure for collection, annotation and data management
    - Including a unified, extensible data format
  - Extended to new data types, domains, languages, annotations in future phases
- Resources will be available through LDC

Acknowledgements
- This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements