LREC 2008, Marrakech Morocco - May 30 2008
New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain
{strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu
Linguistic Data Consortium
http://projects.ldc.upenn.edu/MADCAT

Presentation Outline
- MADCAT Program Overview
- Technology Challenges
- Roadmap
- Data Creation
- Phase 1 Data Profile
- Processing
- Collection
- Annotation
- Data Format
- Evaluation
- Conclusions and Future Work

MADCAT Overview
- MADCAT: Multilingual Document Classification Analysis and Translation
- A 5-year DARPA program
- MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner
- Multiple input types and domains
  - Hard-copy, PDF, camera-captured
  - Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.

Technology Challenges
- Extract relevant metadata about the document structure
- Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies
- Create end-to-end system for deployment at program's end with over 90% accuracy
  - Current baseline is ~2%
- Primary evaluation metric is edit distance: HTER
  - Same protocols as used in the GALE program
- Limited focus in Phase 1
  - Arabic > English
  - High resolution (600 dpi) images of handwritten newspaper and web text
  - Topics primarily news, current events and commentary
  - Manual segmentation provided

Phase Roadmap
[Figure: phase roadmap chart]
- Stages
  - Pre-MADCAT: State of the Art
  - Phase 1: Add handwriting
  - Phase 2-3: New data types
  - Phase 4-5: New genres, topics, quality conditions
- Dimensions: Genre, Topic, Medium, Source, Data Quality
- Chart labels
  - Newswire, Broadcast, Talk Shows, Weblogs
  - Printed, Handwritten
  - Personal Identif., Instructns, Books, Training Manuals, Letters, Forms, Ledgers, Diaries, Calendars, Maps, Poems, Verdicts
  - News, Commentary, Science, Engineering, Personal, Religious, Military, Other
  - Controlled, Uncontrolled

Phase 1 Data Profile
- In Phase 1, data drawn from DARPA GALE program
  - New collection to acquire handwritten versions
  - Genres: Formal text (newswire) and informal text (weblogs)
- Benefits
  - Eliminates domain mismatch between GALE state of the art MT models and MADCAT test sets
  - Allows developers to focus on primary challenge: handwriting
  - Data characteristics well understood, cost and time factors are reasonably well known
  - Training data costs controlled since translations exist
  - Production begins immediately, training data available sooner
  - Provides controlled test sets for evaluation across programs
- Subsequent phases will add new data types, genres and other challenge elements

Training and DevTest
- Training
  - Minimum 2000 unique pages
    - Half formal (newswire), half informal (web text)
    - 100-250 words per page
  - Minimum 100 unique scribes in training pool
  - 5 scribes per page
  - At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set
- DevTest
  - 320 unique pages
    - Half formal (newswire), half informal (web text)
    - 125 words/page
  - 50 scribes in devtest pool
    - 25 from training, 25 previously unseen
  - 2 scribes per page, ~7 pages per scribe
  - Total of 640 manuscripts; 80,000 words

Evaluation Data
- 320 unique pages from GALE P3 Eval set
  - Half formal (newswire), half informal (web text)
  - 125 words/page
- 50 scribes in eval partition
  - 25 from training, 25 previously unseen
- 6 scribes per page, ~40 pages per scribe
- Total of 1920 manuscripts, 240,000 words
- Subset of eval set designated for pilot evaluation in September 2008
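The manuscript and word totals quoted for the training, devtest and evaluation sets follow from pages x scribes per page x words per page. A minimal Python sketch of that arithmetic is given here; the `Partition` class and its field names are illustrative assumptions, not part of any LDC tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Partition:
    """One data partition as described on the slides (names are hypothetical)."""
    name: str
    pages: int                      # unique source pages
    scribes_per_page: int           # handwritten copies made of each page
    words_per_page: Optional[int]   # None when the slides give only a range

    @property
    def manuscripts(self) -> int:
        # A manuscript is one scribe's copy of one page (a "scribe-page").
        return self.pages * self.scribes_per_page

    @property
    def words(self) -> Optional[int]:
        if self.words_per_page is None:
            return None
        return self.manuscripts * self.words_per_page


# Training pages run 100-250 words, so no single word total is computed.
training = Partition("training", pages=2000, scribes_per_page=5, words_per_page=None)
devtest = Partition("devtest", pages=320, scribes_per_page=2, words_per_page=125)
evaluation = Partition("eval", pages=320, scribes_per_page=6, words_per_page=125)

for part in (training, devtest, evaluation):
    print(part.name, part.manuscripts, part.words)
```

The training figure is a minimum, since the slides give 2000 pages as a lower bound.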

Data Preparation
- Start with electronic text from GALE
  - Whole documents collected from newswire or web
  - Segmented into SUs (semantic/sentence units)
  - Each segment manually translated
- Pre-processing prior to handwriting
  - Tokenization to words for later stages
  - Segments reordered and formatting added to create optimal pages for handwriting assignment
    - Roughly 5 words/line to avoid line wrapping
    - No more than 25 lines/page to avoid page breaks
- After handwriting, images scanned at high resolution (600 dpi, greyscale)
- Images are ground truth annotated at line, word level
- Major challenge is logical storage of many layers of information across multiple versions of the same data
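The page layout constraints (roughly 5 words per line, at most 25 lines per page) can be illustrated with a short Python sketch. This is a simplification under stated assumptions: the function name `layout_pages` is hypothetical, and the sketch ignores the segment reordering and formatting steps that the actual pre-processing performs.

```python
def layout_pages(tokens, words_per_line=5, lines_per_page=25):
    """Group tokens into lines, then lines into pages.

    Assumption: tokens are packed greedily; the real pipeline also
    reorders segments, which is not modeled here.
    """
    lines = [
        tokens[i:i + words_per_line]
        for i in range(0, len(tokens), words_per_line)
    ]
    pages = [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]
    return pages


# 130 placeholder tokens: 26 lines, so one full page plus one extra line.
tokens = [f"w{i}" for i in range(130)]
pages = layout_pages(tokens)
print(len(pages), len(pages[0]), len(pages[1]))
```

With these limits a full page holds at most 5 x 25 = 125 words, which equals the 125 words/page figure given for the devtest and evaluation pages.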

Collection
- New human subjects collection required to produce handwritten versions of existing data
- Pilot collection currently underway at LDC in Philadelphia
  - LDC Arabic staff and recent Iraqi immigrants in Philly
- Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt
  - Regional variety necessary to capture stylistic writing differences
    - E.g. use of Indic vs. Arabic numbers
- Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol
  - Scribe partition (train only, test only, both)
  - Writing conditions
  - Regional variation
  - Genre, topic and source balance

Writing Conditions
- Implement
  - 90% ballpoint pen (I)
  - 10% pencil (P)
- Paper
  - 75% unlined white paper (U)
  - 25% lined paper (L)
- Writing speed
  - 90% normal (N)
  - 5% fast (F)
  - 5% careful (C)
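One way to realize these target proportions when generating assignments is weighted random sampling. The Python sketch below assumes the three factors are drawn independently, which the slides do not state; the names `CONDITIONS` and `sample_conditions` are hypothetical.

```python
import random

# Target proportions from the slide, keyed by the one-letter codes.
CONDITIONS = {
    "implement": {"I": 0.90, "P": 0.10},          # ballpoint pen, pencil
    "paper": {"U": 0.75, "L": 0.25},              # unlined, lined
    "speed": {"N": 0.90, "F": 0.05, "C": 0.05},   # normal, fast, careful
}


def sample_conditions(rng):
    """Draw one writing condition per factor (independence is an assumption)."""
    return {
        factor: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for factor, dist in CONDITIONS.items()
    }


rng = random.Random(2008)  # fixed seed so the draw is repeatable
print(sample_conditions(rng))
```

Random draws only approximate the proportions on small samples; a production assignment protocol might instead allocate fixed quotas per kit.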

Collection Workflow
[Figure: workflow flowchart]
- LDC selects source data
- LDC generates kits (documents + writing conditions)
- LDC delivers data kits to collection sites
- Sites publicize study and recruit participants
- Scribe visits public URL, contacts site coordinator
- Site coordinator schedules appointment
- Scribe comes in, takes writing sample test
- Site coordinator verifies scribe eligibility
- Site coordinator logs in to secure website via login page
- Scribe completes registration via registration page
- Scribe verifies info via confirmation page
- Site coordinator prints out subject ID and instructions for subject via assignment page
- Coordinator pulls kit for this subject ID
- Scribe leaves with kit and instructions
- Scribe returns completed kit to site
- Coordinator verifies kit completeness and arranges payment
- Coordinator files completed kit for scanning/delivery
- Site scans completed kit(s) as safeguard
- Site ships completed paper kit(s) to LDC for archiving
- Site uploads image file to LDC
- LDC processes completed kits for subsequent tasks

Scribe Demographics
- Scribes register in person at collection site and take writing test
  - To assess literacy and ability to follow instructions
- Enter demographic info on LDC's secure server
  - Name, address (for payment purposes only)
  - Age, gender, level of education, occupation
  - Where born, where raised
  - Primary language of educational instruction
  - Handedness
- After registration, scribes receive brief tutorial
  - No line wrapping, no page breaks
  - Copy text exactly: no omissions or insertions, no corrections to source text

Scribe Assignments
- Assignments are in the form of printed "kits"
  - 50 printed pages to be copied plus assignment table
    - Assignment table specifies page order and writing conditions
  - Multiple scribes/kit, so conditions and order vary
- Printed pages labeled with page and kit ID
- Scribes affix label with scribe, page and kit ID to back of completed manuscript
  - To facilitate data tracking during scanning and post-processing
- Scribes supply paper and writing instrument
  - To sample natural variation
- Payment per completed kit
  - Exhaustive check on first assignment (completeness and accuracy)
  - Spot check on remainder of assignments

Ground Truthing
- Zones created at word level only for Phase 1
  - Lines can be extrapolated from annotation
  - Other zone types possible in future phases
    - Structural elements (e.g. signature block)
- Explicit reading order preserved
- Locations are polygons
  - Restricted to upright rectangles in the first phase
- Each zone contains a unique ID, the contents, location (coordinates)
  - Status tags to accommodate scribe mistakes
    - extra, missing, typo
  - nextZoneID tag to indicate reading order
- In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
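The zone description above (unique ID, contents, rectangular location, optional status tag, and a nextZoneID link) maps naturally onto a linked record. The Python sketch below is illustrative only: the `Zone` class, its field names and the `reading_order` helper are assumptions, not the GEDI or MADCAT schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Zone:
    """A word-level zone (hypothetical field names)."""
    zone_id: str
    contents: str
    box: Tuple[int, int, int, int]       # upright rectangle: left, top, right, bottom
    status: Optional[str] = None         # "extra", "missing" or "typo" for scribe mistakes
    next_zone_id: Optional[str] = None   # link to the next zone in reading order


def reading_order(zones, first_id):
    """Follow next_zone_id links from first_id and return zones in reading order."""
    by_id = {zone.zone_id: zone for zone in zones}
    ordered, seen = [], set()
    current = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in reading order at zone {current}")
        seen.add(current)
        zone = by_id[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


# Zones stored out of order; the links recover the intended sequence.
zones = [
    Zone("z3", "word3", (10, 10, 90, 40)),
    Zone("z1", "word1", (210, 10, 290, 40), next_zone_id="z2"),
    Zone("z2", "word2", (110, 10, 190, 40), status="typo", next_zone_id="z3"),
]
print([zone.contents for zone in reading_order(zones, "z1")])
```

Storing the order as explicit links, rather than inferring it from coordinates, keeps the reading order independent of where the scribe happened to place each word on the page.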

GEDI Toolkit
- GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)

Data Format
- MADCATUnifier process takes multiple data streams and generates a single XML output file which contains all required information
  1) Text layer
     - Source Text
     - Tokenization
     - SU Segmentation
     - Translation
  2) Image layer
     - Zone bounding boxes
  3) Scribe demographics
  4) Document metadata
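To make the idea of a unified file concrete, the Python sketch below merges the four streams into one XML document using the standard library. Every element and attribute name here (`doc`, `text_layer`, `segment`, `zone` and so on) is invented for illustration; the slides do not show the actual MADCATUnifier schema, and the function `unify` is hypothetical.

```python
import xml.etree.ElementTree as ET


def unify(doc_id, segments, zones, scribe, metadata):
    """Combine text, image, scribe and document streams into one XML string.

    All tag and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("doc", attrib={"id": doc_id})

    # 1) Text layer: source text, tokenization, SU segmentation, translation.
    text_layer = ET.SubElement(root, "text_layer")
    for seg in segments:
        seg_el = ET.SubElement(text_layer, "segment", attrib={"id": seg["id"]})
        ET.SubElement(seg_el, "source").text = seg["source"]
        ET.SubElement(seg_el, "translation").text = seg["translation"]
        for index, token in enumerate(seg["tokens"]):
            token_id = f"{seg['id']}-t{index}"
            ET.SubElement(seg_el, "token", attrib={"id": token_id}).text = token

    # 2) Image layer: zone bounding boxes.
    image_layer = ET.SubElement(root, "image_layer")
    for zone in zones:
        attrs = {key: str(value) for key, value in zone.items()}
        ET.SubElement(image_layer, "zone", attrib=attrs)

    # 3) Scribe demographics and 4) document metadata as attribute sets.
    ET.SubElement(root, "scribe", attrib={k: str(v) for k, v in scribe.items()})
    ET.SubElement(root, "metadata", attrib={k: str(v) for k, v in metadata.items()})

    return ET.tostring(root, encoding="unicode")


xml_text = unify(
    doc_id="doc001",
    segments=[{"id": "s1", "source": "SRC", "translation": "TRANS", "tokens": ["SRC"]}],
    zones=[{"id": "z1", "left": 10, "top": 10, "right": 90, "bottom": 40}],
    scribe={"id": "scribe01", "handedness": "right"},
    metadata={"genre": "newswire"},
)
print(xml_text)
```

Keeping all layers in one file, cross-referenced by IDs, is one way to address the storage challenge of many annotation layers over multiple versions of the same data.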

Evaluation
- Input: (segmented) Arabic handwritten image
- Output: segmented English text
- HTER is primary evaluation metric (edit distance)
  - Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible
  - NIST-developed MTPostEditor GUI
    - Editors review segment-aligned MT and gold standard translation
    - No access to original Arabic text or handwritten image file
- No official separate evaluation of OCR or processing components
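The edit-distance idea behind HTER can be sketched in Python as a word-level edit distance between the MT output and its post-edited version, divided by a reference length. This is a simplification under stated assumptions: it counts only insertions, deletions and substitutions (TER-style scoring also treats block shifts as single edits), it normalizes by the post-edited length (the official protocol may normalize differently), and the function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def hter_sketch(mt_output, post_edited):
    """Edits needed to turn MT output into the post-edited text, per post-edited word."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(hyp, ref) / len(ref)


mt = "the minister visit the capital yesterday"
edited = "the minister visited the capital yesterday"
print(hter_sketch(mt, edited))  # one substitution over six words
```

Lower scores are better: a score of 0 means the post-editor left the MT output unchanged.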

Conclusions; Future Work
- LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available
- Phase 1: Large collection of Arabic handwritten, translated, segmented, ground truthed text
- Infrastructure for collection, annotation and data management
  - Including a unified, extensible data format
- Extended to new data types, domains, languages, annotations in future phases
- Resources will be available through LDC

Acknowledgements
- This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
- Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements
