EARS STT Workshop at ICASSP, March 2005
Christopher Cieri, Mohamed Maamouri, Shudong Huang, James Fiumara, Stephanie Strassel, David Graff, Kevin Walker, Mark Liberman
{ccieri,maamouri,shudong,jfiumara,strassel,graff,walkerk,myl}@ldc.upenn.edu

What Happens Next?
Collect feedback here
Check feasibility of new ideas
  – e.g. availability of BN (tran)scripts
Estimate cost, timeline for wish list
Sponsors allocate funds
EARS Board revise priorities
Re-estimate cost, timeline for task list
Communicate final plan
“Start”

What Happened Next?
Feedback was generally favorable
Next day learned of 3-month projects
Received 25% funding
Preparation of utility thresholds
Learned of TIDES/EARS end
Learned that GALE <> TIDES+EARS
Completed existing commitments
  – STT Test Sets (MT Test Set)
  – CTS Collections
Adjusted focus to GALE preparation

Broadcast News
Continue 2004 collection
  – >2000h English: VOA, NBC/MSNBC, CNN, ABC, PBS, PRI, WB17
  – >1000h Chinese: VOA, CCTV, Radio Free Asia (RFA), NTDTV, Tai Yuan
  – >1000h Arabic: VOA, Al Hurra, Al Jazeera, Dubai, Jordan TV, LBC, Nile
Select 2005 evaluation set then distribute 2004 data (February 2005)
  – delivery made after eval set picked
2005 Collection same sources, volumes
  – add semi-automatic language, source, program ID to QC process
  – harvest (tran)scripts where possible
  – 100 hours of transcribed Chinese BN (commercial, QTr)
  – 100 hours of transcribed Arabic BN (commercial, QTr)
  – collect broadcast conversations: audio and (tran)scripts
Continue IPR negotiations
Contribute to Experiments
  – Utility of Careful vs. Commercial vs. QTr. vs. CC. vs. Roverized ASR
Update pronouncing Lexicons with vocab from English, Chinese, Arabic
Continue collection with sources adjusted for GALE
  – Greater focus on broadcast conversation
  – Total: 62.5 hrs/week of Arabic, 60 hrs/week of Chinese, 75 hrs/week of English
  – BC: 2.5 hrs/week Arabic, 15 hrs/week Chinese, 25 hrs/week English
  – Acquired IPR for several new programs: 100% of English, 50% of Arabic, Chinese
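
The weekly totals and broadcast conversation (BC) figures on this slide can be compared directly. The sketch below computes the BC share per language from the numbers quoted above; the variable and function names are hypothetical, and it assumes the BC hours are a subset of each language's weekly total, which the slide suggests but does not state.

```python
# Weekly collection rates quoted on the slide (hours per week).
TOTAL_HOURS = {"Arabic": 62.5, "Chinese": 60.0, "English": 75.0}
# Broadcast conversation hours per week; assumed to be included in the totals.
BC_HOURS = {"Arabic": 2.5, "Chinese": 15.0, "English": 25.0}


def bc_share(language: str) -> float:
    """Fraction of a language's weekly collection that is broadcast conversation."""
    return BC_HOURS[language] / TOTAL_HOURS[language]


if __name__ == "__main__":
    for language in TOTAL_HOURS:
        print(f"{language}: {bc_share(language):.1%} broadcast conversation")
```

Under that assumption, broadcast conversation makes up about 4% of the Arabic collection, 25% of the Chinese, and a third of the English.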

English CTS
Volume: complement 2003 collection to provide another 1400 hours (was 850) with subjects making 1-20 10-minute calls
Used November 2003 Topics
BBNT/WordWave doing transcription
Complete collection of 1400 hours
Finalize evaluation set
Distribute beginning in December as transcripts are ready
1400 hours sent to BBN/WordWave for transcription
450 hours distributed to sites February 17

Chinese CTS
New Collection at HKUST
  – Target 200 hours transcribed, gender balance, regions represented
Transcription based upon RT03
150 hours delivered to LDC so far
  – regions not balanced across delivery increments
Select 2005 evaluation & dev/test sets
  – to control demographics across train/test sets
Deliver training data once final increment has arrived and evaluation data extracted
Repeat collection in 2005
  – require gender, age, regional balance across collection epoch
  – require word segmentation?
Build portable platform?
HKUST finished Collection of 150 hours of CTS
  – ready for release once test set extracted
  – will deliver 50 more hours at end of March
  – will collect & transcribe another 50 hours through June

Arabic CTS
Fisher Protocol, platform in US
Select 2005 evaluation set from current collection
Continue collection until current pool sapped
Complete audit and transcription; deliver in December
Add ‘yellow’ tier (surface phonemic) transcription
Build portable platform?
Begin new dialect?
Demographics changed since last test sets created
  – new Dev/Test as well as Eval set required
Finished 50 hours of Levantine Arabic CTS
Released on 01/15/2005 as LDC2005S07 & LDC2005T03
50 more hours of Levantine due March 31, 2005
85 hours scheduled June 30, 2005 ???
Yellow layer transcription of 15h underway
RT rates improving: 8-10xRT on green, 15xRT yellow (assuming green)
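
One common reading of the RT rates above is that N xRT means roughly N hours of transcriber effort per hour of audio. The sketch below applies that reading to volumes mentioned on this slide. The function name is hypothetical, and pairing the green rate with the finished 50 hours and the yellow rate with the 15 hours underway is an assumption made for illustration, not an LDC estimate.

```python
def effort_hours(audio_hours: float, rt_factor: float) -> float:
    """Labor hours needed to transcribe audio at a given real-time factor.

    Assumes "N xRT" means N hours of work per hour of audio.
    """
    return audio_hours * rt_factor


if __name__ == "__main__":
    # Green tier: 8-10xRT, applied here to 50 hours of audio.
    green_low = effort_hours(50, 8)
    green_high = effort_hours(50, 10)
    # Yellow tier: 15xRT on top of an existing green transcript, 15 hours of audio.
    yellow = effort_hours(15, 15)
    print(f"green, 50h audio: {green_low:.0f}-{green_high:.0f} labor hours")
    print(f"yellow, 15h audio: {yellow:.0f} labor hours")
```

On this reading, the yellow tier for 15 hours of audio costs about 225 labor hours beyond the green transcription it builds on.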

STT Test Sets
None

MDE
Ported English specification v6.2 to Chinese, Arabic
Created MDE v7 specification, tool for English
Created Chinese and Arabic tools
Created small pilot data set in each language
Distributed as: LDC2004E47

GALE Preparation
Created 13 new Fisher English topics designed to elicit ACE-worthy conversations
Collected 500 conversations; manually selected 25% for transcription.
ACE transcribed; are in ACE annotation pipeline
LDC Staff read DLI DLPT material in Arabic
LDC Staff read WSJ articles
In preparation for GALE, adding new source types: e-lists, blogs, chat, technical reports, GovDocs
Built general purpose speech annotation toolkit; ready April 1.

Distribution Rules
Most EARS sites are LDC members
Those who are not have data under evaluation agreement
  – Require return at end of program
  – LDC will offer extension; sites not part of GALE by June 2005 must return data then
  – Or non-member, non-GALE sites can keep data by becoming LDC members
Exception: drive arrays of BN data. These must be returned by both members and non-members not involved in GALE

GALE-related efforts
Data scouting in English, Chinese, Arabic
  – Exploring new domains
    Broadcast conversation (roundtable, talk shows, call-ins)
    Web text (blogs, newsgroups, chat, discussion forums)
  – Defining best practices
    Identifying, Harvesting, Formatting, Licensing
  – Researching more economical sources, methods
    Transcripts, story segmentation
    Annotation efficiencies
Local infrastructure in place
  – Annotation toolkit
  – Annotation guidelines & web resources guide
  – Scouting teams for English, Chinese; Arabic lagging
Sharable version of tools, docs in progress
To date,
  – English: 270 sites identified (16 topics)
  – Chinese: 57 sites identified (10 topics)
  – Arabic: 10 sites identified (3 topics)
  – All of these now/soon in ACE annotation pipeline
  – IPR secured under “fair use”

Documentation

Process
Use search engine to find sites for each data type
  – Minimum thresholds for each data type/subject
Tool tallies good/bad sites identified; logs URLs/judgments to DB
Categorize URLs as good or bad for TIDES-type annotation
  – “Bad” URLs are not revisited for a topic
The left side of the web scouting tool shows a tally of the data types found for the annotator’s topic.
The bottom pane of the tool is a window where the annotator inputs information, including data type, title, and URL, for each site found.
The top pane of the tool is occupied by a web browser.
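
The logging behavior described on this slide (judgments stored in a DB, a per-topic tally by data type, bad URLs not revisited, minimum thresholds per data type) could be sketched as follows. This is a minimal illustration, not the LDC tool: the table layout, function names, and example URLs are all hypothetical, and it uses an in-memory SQLite database only because that keeps the example self-contained.

```python
import sqlite3


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical scouting log with one row per topic/URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS judgments (
               topic TEXT NOT NULL,
               url TEXT NOT NULL,
               data_type TEXT,
               title TEXT,
               good INTEGER NOT NULL,
               PRIMARY KEY (topic, url)
           )"""
    )
    return conn


def already_rejected(conn, topic, url) -> bool:
    """True if this URL was already judged bad for this topic."""
    row = conn.execute(
        "SELECT 1 FROM judgments WHERE topic = ? AND url = ? AND good = 0",
        (topic, url),
    ).fetchone()
    return row is not None


def record(conn, topic, url, data_type, title, good) -> bool:
    """Log a judgment; bad URLs are not revisited for the same topic."""
    if already_rejected(conn, topic, url):
        return False
    conn.execute(
        "INSERT OR REPLACE INTO judgments VALUES (?, ?, ?, ?, ?)",
        (topic, url, data_type, title, int(good)),
    )
    conn.commit()
    return True


def tally(conn, topic) -> dict:
    """Count good sites per data type for one topic."""
    rows = conn.execute(
        "SELECT data_type, COUNT(*) FROM judgments "
        "WHERE topic = ? AND good = 1 GROUP BY data_type",
        (topic,),
    )
    return dict(rows.fetchall())


def types_below_threshold(conn, topic, thresholds) -> list:
    """Data types that have not yet reached their minimum number of good sites."""
    counts = tally(conn, topic)
    return [t for t, minimum in thresholds.items() if counts.get(t, 0) < minimum]


if __name__ == "__main__":
    conn = open_log()
    record(conn, "elections", "http://example.org/blog", "blog", "Example blog", True)
    record(conn, "elections", "http://example.org/spam", "blog", "Spam page", False)
    print(tally(conn, "elections"))
    print(types_below_threshold(conn, "elections", {"blog": 1, "newsgroup": 2}))
```

Keying the table on topic and URL together reflects the slide's rule that a bad URL is skipped per topic, so the same URL can still be judged for a different topic.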

Up-to-minute updates
http://www.ldc.upenn.edu/Projects/GALE/Annotation/DataScouting/status.php
