Multilingual, Multi-script Catalog Requirements (An Arcadia Project) ________________________ January 29, 2010
Jan 2010 Outline _____________________________________________________
Background about the Arcadia non-Roman script project
Introductions
Orbis vs. YUFind and systems like YUFind
Requirements discussion
Wrap-up
Project Goals _____________________________________________________
Gap analysis of multilingual, multi-script functionality in Lucene-Solr-Solrmarc discovery applications (e.g., YUFind)
Identification of desirable functionality
Collaboration opportunities, community interest
Recommendations with level-of-effort analysis
Orbis vs. YUFind _____________________________________________________
Chinese example: “中日韩经济合作的新起点”
N-gram tokens, where N=2:
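The N=2 tokenization named on this slide can be sketched in a few lines of Python. This is only an illustration of overlapping character bigrams; the function name `char_ngrams` is hypothetical, and an index-time analyzer would also have to deal with punctuation, whitespace, and mixed-script text.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


title = "中日韩经济合作的新起点"
bigrams = char_ngrams(title)

# Each bigram becomes one index token; a query is split the same way,
# so any two adjacent characters in the query can match the title.
print(" ".join(bigrams))
```

An 11-character title produces 10 overlapping bigrams, which is why bigram indexing tends to raise recall for unsegmented Chinese text at the cost of a larger index.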
Background: Non-Roman (NR) Scripts in Catalog Records _____________________________________________________
JACKPHY _____________________________________________________
Japanese
Arabic
Chinese
Korean
Persian
Hebrew
Yiddish
One-to-Many (CJK) _____________________________________________________
Example: “Mao Zedong”
毛泽东 Simplified
毛澤東 Traditional
毛沢東 Kanji (Modern)
One-to-Many (CJK) _____________________________________________________ “Mao Zedong” in simplified Chinese characters retrieves 527 results
One-to-Many (CJK) _____________________________________________________ The same search in traditional Chinese characters yields 154 hits. Also note the paired fields.
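One way to make the simplified, traditional, and kanji spellings retrieve the same records is to fold character variants to a single search key at both index time and query time. The sketch below is a minimal illustration, not a description of how Orbis or YUFind work: the table `VARIANT_TO_CANONICAL` is hypothetical, is hand-built for this one example, and arbitrarily picks the simplified form as canonical. A real deployment would need a complete variant data source.

```python
# Hypothetical, hand-built variant table covering only this example.
# Every variant maps to one canonical form (simplified Chinese here,
# which is an arbitrary choice made for the sketch).
VARIANT_TO_CANONICAL = {
    "澤": "泽",  # traditional Chinese form
    "沢": "泽",  # modern Japanese kanji form
    "東": "东",  # traditional form, also used in Japanese
}
FOLD_TABLE = str.maketrans(VARIANT_TO_CANONICAL)


def fold_variants(text):
    """Map known variant characters to their canonical form."""
    return text.translate(FOLD_TABLE)


forms = ["毛泽东", "毛澤東", "毛沢東"]
search_keys = {fold_variants(form) for form in forms}

# All three spellings collapse to a single search key.
print(search_keys)
```

Folding only the search key leaves the stored display text untouched, so a record can still be shown in the script in which it was cataloged.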
One-to-Many (Digraphs) _____________________________________________________ ווירטשאפט The Yiddish word “Virtshaft” is entered here with two separate vavs (i.e., keystroke ‘u’ in Microsoft’s Hebrew IME): U+05D5 + U+05D5
One-to-Many (Digraphs) _____________________________________________________ N = 49 results
One-to-Many (Digraphs) _____________________________________________________ װירטשאפט The same word is entered this time as a double-vav digraph = U+05F0 (via MS Hebrew IME key combo right-alt+u)
One-to-Many (Digraphs) _____________________________________________________ N = 11 results
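The two result counts differ because the two inputs are different code point sequences. A possible remedy, sketched below under assumptions, is to expand the Yiddish digraph code points into their component letters before indexing and before searching. The table `YIDDISH_DIGRAPHS` and the function `fold_digraphs` are hypothetical names, and the sketch uses an explicit mapping rather than relying on any built-in normalization.

```python
# Assumed folding table: each Yiddish digraph code point is expanded
# to the sequence of letters it stands for.
YIDDISH_DIGRAPHS = {
    "\u05F0": "\u05D5\u05D5",  # double vav -> vav + vav
    "\u05F1": "\u05D5\u05D9",  # vav yod    -> vav + yod
    "\u05F2": "\u05D9\u05D9",  # double yod -> yod + yod
}
FOLD_TABLE = str.maketrans(YIDDISH_DIGRAPHS)


def fold_digraphs(text):
    """Expand Yiddish digraph code points into separate letters."""
    return text.translate(FOLD_TABLE)


# Letters after the initial "v" sound: yod, resh, tet, shin, alef, pe, tet.
rest = "\u05D9\u05E8\u05D8\u05E9\u05D0\u05E4\u05D8"
two_vavs = "\u05D5\u05D5" + rest   # typed as two separate vavs
digraph = "\u05F0" + rest          # typed as one double-vav digraph

print(two_vavs == digraph)                                  # raw strings differ
print(fold_digraphs(two_vavs) == fold_digraphs(digraph))    # folded keys match
```

With the same folding applied to queries and to indexed text, either keyboard input would retrieve the union of the two result sets.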
NR Spelling Suggestions _____________________________________________________ Unhelpful suggestion?
Labels and Facets _____________________________________________________ Should the script/language of the query determine the script/language of the facets?
Labels and Facets _____________________________________________________
Better would be:
杉本つとむ, (11)
高橋幹夫, (11)
野口武彦. (8)
渡辺信一郎, (7)
OR:
Sugimoto, Tsutomu, (11)
Takahashi, Mikio, (11)
Noguchi, Takehiko. (8)
Watanabe, Shin’ichirō, (7)
But not both mixed together. Let the end user decide?
Labels and Facets _____________________________________________________
We would like to choose our preferred display script here. For example:

江戸
By: 野村兼太郎, Published: 1942
Format: Book, Electronic Resource

江戶の翻訳家たち
By: 杉本つとむ, Published: 1995
Format: Book, Electronic Resource

We would like to ask library users which option is best for displaying parallel field data:

江戶 / 田中優子編.
Contributors: 田中優子,
Format: Book
Language: Japanese
Published: 東京 : 作品社,
Series: 日本の名随筆. 03 别卷 ; 94

江戶 / 田中優子編. Edo / Tanaka Yūko hen.
Contributors: 田中優子, Tanaka, Yūko,
Format: Book
Language: Japanese
Published: 東京 : 作品社, Tōkyō : Sakuhinsha,
Series: 日本の名随筆. 03 别卷 ; 94 Nihon no meizuihitsu. 03 Bekkan ; 94
Language/Script of Interface _____________________________________________________
OCLC’s brief record display
The interface is easily switched to one of several languages.
Language/Script of Interface _____________________________________________________ OCLC’s detailed record display with Japanese language interface
Language/Script of Interface _____________________________________________________ OCLC WorldCat.org localizes labels and instructions as well as mapped facet values. The examples here are in Chinese.
Language/Script of Interface _____________________________________________________
Language/Script of Interface & Text Directionality _____________________________________________________
Sorting of Results _____________________________________________________
江戸文学俗信辞典 Edo bungaku zokushin jiten
江戸文学地名辞典 Edo bungaku chimei jiten
江戸文学辞典 Edo bungaku jiten
江戸文様辞典 Edo mon’yo jiten
Sorting of Results _____________________________________________________ Also note the bidirectional text
Sorting within Result Sets: Options to Consider _____________________________________________________ For multiple languages sharing a script, e.g. Chinese ideographs, Arabic, Hebrew, or Latin, how would users prefer to see result sets sorted? We consider here the Chinese and Arabic cases…
Sorting within Result Sets: Options to Consider _____________________________________________________
Sorting of results returned in Chinese script. Three sort strategies:
(a) sort by Romanized equivalents;
(b) sort by pronunciation; or
(c) sort by radical-stroke?
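Strategy (a) can be illustrated with a short Python sketch. It assumes each record carries its original-script title paired with a Romanized equivalent, as in the paired fields shown earlier, and it reuses the four Japanese titles from the "Sorting of Results" slide as sample data. The name `romanized_sort_key` is hypothetical, and the plain comparison of case-folded strings is a simplification; a production system would more likely use locale-aware collation. Strategies (b) and (c) need pronunciation or radical-stroke data and are not shown.

```python
# Sample records: (original-script title, Romanized equivalent).
records = [
    ("江戸文学俗信辞典", "Edo bungaku zokushin jiten"),
    ("江戸文学地名辞典", "Edo bungaku chimei jiten"),
    ("江戸文学辞典", "Edo bungaku jiten"),
    ("江戸文様辞典", "Edo mon’yo jiten"),
]


def romanized_sort_key(record):
    """Sort on the Romanized field, ignoring letter case."""
    _original, romanized = record
    return romanized.casefold()


by_romanization = sorted(records, key=romanized_sort_key)

# The original-script title is displayed, but the order comes from
# the Romanized field.
for original, romanized in by_romanization:
    print(original, romanized)
```

Sorting on the Romanized field gives an order that readers of the Romanization can predict, but it may look arbitrary to a user who reads only the original script, which is the trade-off the three options are meant to surface.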
Sorting within Result Sets: Arabic Script _____________________________________________________
How should the additional Arabic-script characters used for languages such as Persian, Kurdish, and/or Urdu be handled?
ڤ (vah, derived from ﻑ, fah)
پ (pah)
ﭺ (chah, derived from ج, ǧim)
گ (gaf)
ژ (zāī, derived from ز, zayin)
Discussion User Needs and Expectations