Published by Nathaniel McLaughlin. Modified over 9 years ago.
1
LREC 2008, Marrakech Morocco - May 30 2008
New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain
{strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu
Linguistic Data Consortium
http://projects.ldc.upenn.edu/MADCAT
2
Presentation Outline
MADCAT Program
    Overview
    Technology Challenges
    Roadmap
Data Creation
    Phase 1 Data Profile
    Processing
    Collection
    Annotation
    Data Format
Evaluation
Conclusions and Future Work
3
MADCAT Overview
MADCAT: Multilingual Document Classification Analysis and Translation
A 5-year DARPA program
MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner
Multiple input types and domains
    Hard-copy, PDF, camera-captured
    Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.
4
Technology Challenges
Extract relevant metadata about the document structure
Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies
Create end-to-end system for deployment at program's end with over 90% accuracy
    Current baseline is ~2%
Primary evaluation metric is edit distance: HTER
    Same protocols as used in the GALE program
Limited focus in Phase 1
    Arabic > English
    High resolution (600 dpi) images of handwritten newspaper and web text
    Topics primarily news, current events and commentary
    Manual segmentation provided
5
Phase Roadmap
[Figure: chart showing how data coverage expands by phase along five dimensions: Genre, Topic, Medium, Source, Data Quality]
Pre-MADCAT: State of the Art
Phase 1: Add handwriting
Phase 2-3: New data types
Phase 4-5: New genres, topics, quality conditions
Genre and data-type labels in the chart: Newswire, Broadcast, Talk Shows, Weblogs, Letters, Forms, Ledgers, Diaries, Books, Training Manuals, Personal Identif., Instructns, Calendars, Maps, Poems, Verdicts
Topic labels in the chart: News, Commentary, Science, Engineering, Personal, Religious, Military, Other
Medium labels in the chart: Printed, Handwritten
Other labels in the chart: Controlled, Uncontrolled
6
Phase 1 Data Profile
In Phase 1, data drawn from DARPA GALE program
    New collection to acquire handwritten versions
    Genres: Formal text (newswire) and informal text (weblogs)
Benefits
    Eliminates domain mismatch between GALE state-of-the-art MT models and MADCAT test sets
    Allows developers to focus on primary challenge: handwriting
    Data characteristics well understood; cost and time factors are reasonably well known
    Training data costs controlled since translations exist
    Production begins immediately, training data available sooner
    Provides controlled test sets for evaluation across programs
Subsequent phases will add new data types, genres and other challenge elements
7
Training and DevTest
Training
    Minimum 2000 unique pages
    Half formal (newswire), half informal (web text)
    100-250 words per page
    Minimum 100 unique scribes in training pool
    5 scribes per page
    At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set
DevTest
    320 unique pages
    Half formal (newswire), half informal (web text)
    125 words/page
    50 scribes in devtest pool
    25 from training, 25 previously unseen
    2 scribes per page, ~7 pages per scribe
    Total of 640 manuscripts; 80,000 words
8
Evaluation Data
320 unique pages from GALE P3 Eval set
    Half formal (newswire), half informal (web text)
    125 words/page
50 scribes in eval partition
    25 from training, 25 previously unseen
    6 scribes per page, ~40 pages per scribe
Total of 1920 manuscripts; 240,000 words
Subset of eval set designated for pilot evaluation in September 2008
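The manuscript and word totals quoted for the training, devtest and evaluation partitions follow from multiplying unique pages by scribes per page and then by words per page. A minimal sketch of that arithmetic is below; the `Partition` class and its field names are illustrative assumptions, not part of any LDC tooling, and the training figures use the stated minimums (2000 pages, 100 words per page), so they are lower bounds only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical container for the per-partition figures on the slides."""
    name: str
    unique_pages: int
    scribes_per_page: int
    words_per_page: int  # nominal value; training pages range from 100 to 250 words

    @property
    def manuscripts(self) -> int:
        # One manuscript (scribe-page) per scribe copy of each unique page.
        return self.unique_pages * self.scribes_per_page

    @property
    def words(self) -> int:
        # Handwritten word count across all manuscripts.
        return self.manuscripts * self.words_per_page


TRAINING_MIN = Partition("training (minimum)", 2000, 5, 100)
DEVTEST = Partition("devtest", 320, 2, 125)
EVAL = Partition("eval", 320, 6, 125)

for p in (TRAINING_MIN, DEVTEST, EVAL):
    print(f"{p.name}: {p.manuscripts} manuscripts, {p.words} words")
```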
9
Data Preparation
Start with electronic text from GALE
    Whole documents collected from newswire or web
    Segmented into SUs (semantic/sentence units)
    Each segment manually translated
Pre-processing prior to handwriting
    Tokenization to words for later stages
    Segments reordered and formatting added to create optimal pages for handwriting assignment
    Roughly 5 words/line to avoid line wrapping
    No more than 25 lines/page to avoid page breaks
After handwriting, images scanned at high resolution (600 dpi, greyscale)
Images are ground truth annotated at line, word level
Major challenge is logical storage of many layers of information across multiple versions of the same data
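The page-formatting constraints (roughly 5 words per line, no more than 25 lines per page) can be illustrated with a small packing routine. This is only a sketch under stated assumptions: the greedy strategy, the rule that a segment starts on a new line and never straddles a page, and the function names are all hypothetical, since the slides do not describe how LDC actually reorders segments.

```python
from typing import List

WORDS_PER_LINE = 5        # "roughly 5 words/line" from the slide
MAX_LINES_PER_PAGE = 25   # "no more than 25 lines/page" from the slide


def wrap_segment(tokens: List[str], words_per_line: int = WORDS_PER_LINE) -> List[str]:
    """Break one tokenized segment into short lines."""
    return [
        " ".join(tokens[i:i + words_per_line])
        for i in range(0, len(tokens), words_per_line)
    ]


def layout_pages(
    segments: List[List[str]],
    words_per_line: int = WORDS_PER_LINE,
    max_lines: int = MAX_LINES_PER_PAGE,
) -> List[List[str]]:
    """Greedily pack segments into pages; each page is a list of lines."""
    pages: List[List[str]] = []
    current: List[str] = []
    for tokens in segments:
        lines = wrap_segment(tokens, words_per_line)
        if len(lines) > max_lines:
            # Assumption: oversized segments are rejected rather than split.
            raise ValueError("segment does not fit on a single page")
        if len(current) + len(lines) > max_lines:
            pages.append(current)  # close the page before it would overflow
            current = []
        current.extend(lines)
    if current:
        pages.append(current)
    return pages


# Three segments of 60, 60 and 12 tokens need 12, 12 and 3 lines.
demo_segments = [[f"w{i}" for i in range(n)] for n in (60, 60, 12)]
demo_pages = layout_pages(demo_segments)
print([len(page) for page in demo_pages])
```

Keeping each segment whole on one page is a design choice of this sketch; it makes it easier to align a scanned page with its segment-level translations later.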
10
Collection
New human subjects collection required to produce handwritten versions of existing data
Pilot collection currently underway at LDC in Philadelphia
    LDC Arabic staff and recent Iraqi immigrants in Philly
Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt
    Regional variety necessary to capture stylistic writing differences
    E.g. use of Indic vs. Arabic numbers
Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol
    Scribe partition (train only, test only, both)
    Writing conditions
    Regional variation
    Genre, topic and source balance
11
Writing Conditions
Implement
    90% ballpoint pen (I)
    10% pencil (P)
Paper
    75% unlined white paper (U)
    25% lined paper (L)
Writing speed
    90% normal (N)
    5% fast (F)
    5% careful (C)
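One way to read the target proportions for implement, paper and writing speed is as weights for drawing a condition for each assigned page. The sketch below does exactly that with independent weighted draws; the independence of the three factors, the dictionary layout and the function name are assumptions, and the actual LDC assignment protocol may balance conditions differently.

```python
import random
from typing import Dict

# Codes and percentages as listed on the slide.
CONDITIONS = {
    "implement": (["I", "P"], [90, 10]),        # ballpoint pen, pencil
    "paper": (["U", "L"], [75, 25]),            # unlined, lined
    "speed": (["N", "F", "C"], [90, 5, 5]),     # normal, fast, careful
}


def sample_conditions(rng: random.Random) -> Dict[str, str]:
    """Draw one code per factor, weighted by the target percentages."""
    return {
        factor: rng.choices(codes, weights=weights, k=1)[0]
        for factor, (codes, weights) in CONDITIONS.items()
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
samples = [sample_conditions(rng) for _ in range(1000)]
pen_share = sum(s["implement"] == "I" for s in samples) / len(samples)
print(f"share of ballpoint pages in this draw: {pen_share:.2f}")
```

Random draws only approximate the targets; a production assignment table would more likely fix the counts per kit so that the proportions hold exactly.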
12
Collection Workflow
LDC selects source data
LDC generates kits (documents + writing conditions)
Sites publicize study and recruit participants
LDC delivers data kits to collection sites
Scribe visits public URL, contacts site coordinator
Site coordinator schedules appointment
Scribe comes in, takes writing sample test
Site coordinator verifies scribe eligibility
Site coordinator logs in to secure website via login page
Scribe completes registration via registration page
Scribe verifies info via confirmation page
Site coordinator prints out subject ID and instructions for subject via assignment page
Coordinator pulls kit for this subject ID
Scribe leaves with kit and instructions
Scribe returns completed kit to site
Coordinator verifies kit completeness and arranges payment
Coordinator files completed kit for scanning/delivery
Site scans completed kit(s) as safeguard
Site uploads image file to LDC
Site ships completed paper kit(s) to LDC for archiving
LDC processes completed kits for subsequent tasks
13
Scribe Demographics
Scribes register in person at collection site and take writing test
    To assess literacy and ability to follow instructions
Enter demographic info on LDC's secure server
    Name, address (for payment purposes only)
    Age, gender, level of education, occupation
    Where born, where raised
    Primary language of educational instruction
    Handedness
After registration, scribes receive brief tutorial
    No line wrapping, no page breaks
    Copy text exactly: no omissions or insertions, no corrections to source text
14
Scribe Assignments
Assignments are in the form of printed "kits"
    50 printed pages to be copied plus assignment table
    Assignment table specifies page order and writing conditions
    Multiple scribes/kit, so conditions and order vary
Printed pages labeled with page and kit ID
Scribes affix label with scribe, page and kit ID to back of completed manuscript
    To facilitate data tracking during scanning and post-processing
Scribes supply paper and writing instrument
    To sample natural variation
Payment per completed kit
    Exhaustive check on first assignment (completeness and accuracy)
    Spot check on remainder of assignments
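Because several scribes copy the same kit in different page orders, each completed manuscript needs a label combining scribe, page and kit ID. The sketch below generates a per-scribe page order and the matching labels; the ID formats, the underscore-joined label layout and the function name are hypothetical, chosen only to show how such a table might be derived.

```python
import random
from typing import Dict, List


def build_assignment_table(
    kit_id: str,
    page_ids: List[str],
    scribe_ids: List[str],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return, for each scribe, manuscript labels in that scribe's page order."""
    rng = random.Random(seed)  # seeded so the table can be regenerated
    table: Dict[str, List[str]] = {}
    for scribe_id in scribe_ids:
        order = list(page_ids)
        rng.shuffle(order)  # each scribe copies the kit in a different order
        # Hypothetical label layout: scribe, kit and page IDs joined by "_".
        table[scribe_id] = [f"{scribe_id}_{kit_id}_{page_id}" for page_id in order]
    return table


kit_pages = [f"P{i:02d}" for i in range(50)]  # a kit holds 50 printed pages
assignment_table = build_assignment_table("K001", kit_pages, ["S01", "S02"])
print(assignment_table["S01"][:3])
```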
15
Ground Truthing
Zones created at word level only for Phase 1
    Lines can be extrapolated from annotation
Other zone types possible in future phases
    Structural elements (e.g. signature block)
Explicit reading order preserved
Locations are polygons
    Restricted to upright rectangles in the first phase
Each zone contains a unique ID, the contents, location (coordinates)
    Status tags to accommodate scribe mistakes: extra, missing, typo
    nextZoneID tag to indicate reading order
In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
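The zone description (unique ID, contents, rectangular location, optional status tag, and a nextZoneID pointer) maps naturally onto a linked record. The sketch below models a zone and recovers reading order by following the pointers; the class, field names and coordinate convention are assumptions for illustration and do not reproduce the GEDI file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Zone:
    """Hypothetical word-level zone record."""
    zone_id: str
    contents: str
    box: Tuple[int, int, int, int]       # upright rectangle: left, top, right, bottom
    status: Optional[str] = None         # "extra", "missing" or "typo" for scribe mistakes
    next_zone_id: Optional[str] = None   # next zone in reading order, None at the end


def reading_order(zones: List[Zone], first_id: str) -> List[Zone]:
    """Follow next_zone_id pointers from the first zone to the last."""
    by_id = {zone.zone_id: zone for zone in zones}
    ordered: List[Zone] = []
    seen = set()
    current: Optional[str] = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"reading order loops back to zone {current}")
        seen.add(current)
        zone = by_id[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


# Zones stored out of order; Arabic reads right to left, so z1 sits rightmost.
demo_zones = [
    Zone("z2", "word2", (300, 50, 420, 110), next_zone_id="z3"),
    Zone("z3", "word3", (150, 50, 280, 110), status="typo"),
    Zone("z1", "word1", (440, 50, 560, 110), next_zone_id="z2"),
]
print([zone.zone_id for zone in reading_order(demo_zones, "z1")])
```

Storing reading order explicitly, rather than inferring it from coordinates, keeps the sequence unambiguous even when a scribe's lines slant or overlap.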
16
GEDI Toolkit
GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)
17
Data Format
MADCATUnifier
Process takes multiple data streams and generates a single XML output file which contains all required information
1) Text layer
    * Source Text
    * Tokenization
    * SU Segmentation
    * Translation
2) Image layer
    * Zone bounding boxes
3) Scribe demographics
4) Document metadata
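To make the idea of unifying four data streams into one XML file concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Every element name, attribute name and input dictionary key is a hypothetical stand-in; the slide does not show the actual MADCATUnifier schema, so this should not be read as that format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def unify(
    doc_id: str,
    segments: List[Dict],
    zones: List[Dict],
    scribe: Dict[str, str],
    metadata: Dict[str, str],
) -> str:
    """Merge text, image, scribe and document streams into one XML string."""
    root = ET.Element("doc", {"id": doc_id})

    # 1) Text layer: source text, tokenization, SU segmentation, translation.
    text_layer = ET.SubElement(root, "text_layer")
    for seg in segments:
        seg_el = ET.SubElement(text_layer, "segment", {"id": seg["id"]})
        ET.SubElement(seg_el, "source").text = seg["source"]
        ET.SubElement(seg_el, "translation").text = seg["translation"]
        for i, token in enumerate(seg["tokens"]):
            ET.SubElement(seg_el, "token", {"id": f"{seg['id']}.{i}"}).text = token

    # 2) Image layer: zone bounding boxes, stored as string attributes.
    image_layer = ET.SubElement(root, "image_layer")
    for zone in zones:
        ET.SubElement(image_layer, "zone", {k: str(v) for k, v in zone.items()})

    # 3) Scribe demographics and 4) document metadata.
    ET.SubElement(root, "scribe", scribe)
    ET.SubElement(root, "metadata", metadata)
    return ET.tostring(root, encoding="unicode")


xml_text = unify(
    "doc001",
    segments=[{"id": "s1", "source": "SOURCE TEXT", "translation": "source text",
               "tokens": ["SOURCE", "TEXT"]}],
    zones=[{"id": "z1", "token": "s1.0", "left": 10, "top": 20, "right": 90, "bottom": 60}],
    scribe={"id": "scribe042", "handedness": "right"},
    metadata={"genre": "newswire"},
)
print(xml_text)
```

Linking each zone to a token ID, as the hypothetical `token` attribute does here, is one way to tie the image layer back to the text layer within a single file.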
18
Evaluation
Input: (segmented) Arabic handwritten image
Output: segmented English text
HTER is primary evaluation metric (edit distance)
    Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible
    NIST-developed MTPostEditor GUI
    Editors review segment-aligned MT and gold standard translation
    No access to original Arabic text or handwritten image file
No official separate evaluation of OCR or processing components
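HTER scores a system by counting the edits a human post-editor made to the MT output, normalized by the length of the post-edited text. The sketch below is a simplified approximation: it uses plain word-level Levenshtein distance (insertions, deletions, substitutions) and omits the block-shift edit that full TER scoring also counts, so its values can be higher than those from the official scoring tools. The function names are illustrative.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance between hypothesis and reference."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete a hypothesis word
                cur[j - 1] + 1,          # insert a reference word
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def hter(mt_output: str, post_edited: str) -> float:
    """Edits needed to turn the MT output into its post-edited version,
    divided by the number of words in the post-edited version."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(hyp, ref) / len(ref)


score = hter("the cat sat on mat", "the cat sat on the mat")
print(f"{score:.3f}")  # one inserted word over six reference words
```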
19
Conclusions; Future Work
LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available
Phase 1:
    Large collection of Arabic handwritten, translated, segmented, ground truthed text
    Infrastructure for collection, annotation and data management
    Including a unified, extensible data format
Extended to new data types, domains, languages, annotations in future phases
Resources will be available through LDC
20
Acknowledgements
This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements.