
John J. Kovarik, NSA/CSS Senior Language Technology Authority 1 Building the Federal Multilingual Infrastructure in Unicode Foreign Language Dictionary.


1 John J. Kovarik, NSA/CSS Senior Language Technology Authority 1 Building the Federal Multilingual Infrastructure in Unicode Foreign Language Dictionary Tools.

2 Project Goals
Unite federal foreign language analysts in communities of interest by language to increase the speed and accuracy of multilingual work
Outgrowth of NSA legacy individual foreign language dictionary tools
Share Next Generation tool suite across the federal government in 90 languages

3 Foreign Language Work, 1970s
Manual tools
–Hardcopy dictionaries (2-10 per person)
–3x5 card files for specialized vocabulary
–Pen and paper only
Work environment
–Career analysts, revered as subject matter experts, rule the workplace.
–College graduates hired right out of school, some with military experience, enter the job.

4 Foreign Language Challenge I
The classic sparse data problem
Never enough vocabulary
Never enough grammar training
Never enough cultural knowledge

5 Foreign Language Challenge II
Why it’s a sparse data problem.
Communication is usually spontaneous between 2 or more people who share a great deal of special knowledge in common
Ultimate goals often not explicit
Ambiguity reigns for outsiders
No simple rules for filling in the blanks

6 An example
女人 去 打敲 竹鋼的 密醫 來 解決 她的 問題 。
All glossed (4 min/chr, 17 chrs), meaning obscure: “Female people go hit knock bamboo curtain’s secret doctor come untie decide her ask issue.”
All phrases verified (longest string match: 9), clearer: “A woman goes and knocks on the bamboo curtain’s secret doctor to come resolve her problem.” ...but still uncertain
Check for neologism: go to FBIS recent translations, look to clarify meaning of new term “knock bamboo curtain”.
“Knock on the bamboo curtain for a secret doctor” = “seek out an illegal quack”
“A woman (must) go seek out an illegal quack to resolve her problem.”
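The "longest string match" step on this slide can be illustrated with a greedy forward segmenter. This is only a minimal sketch, not the tool's actual algorithm: the function name, the `max_len` limit, and the five-entry toy lexicon are all assumptions made for illustration. Characters that start no lexicon entry fall back to single-character tokens, which corresponds to the character-by-character glossing described above.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon has it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real dictionary would hold far more entries.
LEXICON = {"女人", "密醫", "解決", "她的", "問題"}

# The slide's sentence with spaces and the final full stop removed.
SENTENCE = "女人去打敲竹鋼的密醫來解決她的問題"

tokens = longest_match_segment(SENTENCE, LEXICON)
print(" ".join(tokens))
```

With a lexicon this small, most of the middle of the sentence stays as single characters, which is exactly where the analyst would need to look for a multi-character term or a neologism.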

7 People say, “What’s the big deal with just an on-line dictionary?”
“I never/seldom use a dictionary!”
–Native speaker syndrome
–Vast majority of people must use a dictionary in a second/third language
“Hardcopy dictionaries are better.”
–Can’t do wild-card searches by hand
–Not engineered for 10 sec. avg. response
–Humans tire; machines do not.

8 1991: First Generation Dictionary DB Tool
200,000 entries from 3x5 cards collected over 20 years
Wild card searchable
Cross referenced 4 ways in accordance with user requirements
Displayed in native script
Can cut and paste queries/responses
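As a rough illustration of what "wild card searchable" means for a native-script dictionary, the sketch below matches shell-style patterns against headwords using Python's standard `fnmatch` module. The entries, glosses, and the `wildcard_search` helper are hypothetical and say nothing about how the 1991 tool was built.

```python
from fnmatch import fnmatchcase

# Hypothetical entries: headword -> English gloss.
ENTRIES = {
    "問題": "problem; question",
    "話題": "topic of conversation",
    "解決": "to resolve",
    "密醫": "unlicensed doctor",
}


def wildcard_search(pattern, entries):
    """Return (headword, gloss) pairs whose headword matches the pattern.

    '*' matches any run of characters and '?' matches exactly one.
    """
    return sorted(
        (word, gloss)
        for word, gloss in entries.items()
        if fnmatchcase(word, pattern)
    )


# Leading wild card: every headword ending in 題.
hits = wildcard_search("*題", ENTRIES)
for word, gloss in hits:
    print(word, gloss)
```

A linear scan like this is easy to read but touches every entry on each query, which is one reason leading-wild-card queries tend to be the slow case in a large dictionary.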

9 Reactions to 1st Generation Tool
Younger analysts used it; liked it; made great suggestions to improve it
Senior analysts usually would not use it

10 1995: 2nd Generation Dictionary DB Tool
Responses faster on queries with leading wild card
GUI customized per user input
Candidate entry system established
Usership doubled!
Senior analysts start to use it
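The slide does not say how leading-wild-card queries were made faster. One common technique, shown here purely as an assumption, is to keep a sorted list of reversed headwords so that a suffix query becomes a prefix lookup that binary search can answer. The `SuffixIndex` class and its sample headwords are hypothetical.

```python
import bisect


class SuffixIndex:
    """Answer 'ends with' queries via a sorted list of reversed headwords."""

    def __init__(self, headwords):
        # Reversing each headword turns a shared suffix into a shared prefix.
        self._reversed = sorted(word[::-1] for word in headwords)

    def ends_with(self, suffix):
        key = suffix[::-1]
        # All strings sharing the prefix `key` sit together, starting here.
        pos = bisect.bisect_left(self._reversed, key)
        matches = []
        while pos < len(self._reversed) and self._reversed[pos].startswith(key):
            matches.append(self._reversed[pos][::-1])
            pos += 1
        return sorted(matches)


index = SuffixIndex(["問題", "話題", "解決", "密醫"])
print(index.ends_with("題"))
```

The lookup cost depends on the number of matches plus a logarithmic search, rather than on the total number of entries.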

11 1998: 3rd Generation Dictionary DB Tool
Database re-encoded in UTF-8
Simultaneous simplified and traditional Chinese display enabled
Average 1,000-3,000 candidate entries approved annually ’98-’02
Usership again doubled!
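To show why re-encoding in UTF-8 makes simultaneous simplified and traditional display possible, the sketch below converts text from two legacy encodings into UTF-8 and joins the results in one string. The slide does not name the encodings the database used before 1998; Big5 and GB2312 are assumed here only because they were common legacy encodings for traditional and simplified Chinese, and `reencode_to_utf8` is a hypothetical helper.

```python
def reencode_to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(legacy_encoding).encode("utf-8")


# Simulated legacy data (assumed encodings, for illustration only).
traditional_legacy = "問題".encode("big5")
simplified_legacy = "问题".encode("gb2312")

trad_utf8 = reencode_to_utf8(traditional_legacy, "big5")
simp_utf8 = reencode_to_utf8(simplified_legacy, "gb2312")

# Once both are UTF-8, they can share a single field or display line.
combined = (trad_utf8 + b" / " + simp_utf8).decode("utf-8")
print(combined)
```

Before conversion the two byte strings cannot be mixed safely, because the same byte values mean different characters in each legacy encoding.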

12 Today: Wordscape, the Next Generation Dictionary Tool
Retains all Chinese capabilities
Expands to all language collections
Neologism newswire research tools
Over 90 languages represented in one Unicode DB unified under one XML schema and one suite of tools
Under LASER ACTD funding, extending all across the federal government!

13 Technology and Standards
New technology being used
–Benefits of scale from use of UTF-8, XML
Standards adopted: leading change
–Participating in ISO standards group Technical Committee 37 on terminology and language resources (developing standardized formats for foreign language lexical resources and data exchange)

14 When do Unicode standards fail?
When Unicode standards are not standard!
3rd World languages less commonly taught in the United States
Hindi (many different script rendering implementations)
Mongolian (no standardized spelling, many newswire web sites employ non-standard fonts)

15 Language Knowledge Services Team/Resources
John L. George, Program Manager, (301) 688-9133
Over 20 computer scientists/techs
Currently deploying Beta version
Learning from testing with earlier version instantiations at FBI and NSA on JWICS now, SIPRnet/NIPRnet next

16 Contact Information
John J. Kovarik
Senior Language Technology Authority
NSA Representative to LASER ACTD
National Security Agency
9800 Savage Road
Suite 6486 S2
Phone: (301) 688-7198
Kovarik@afterlife.ncsc.mil

