
2 All the Word (ATW) Introduction: The software developed at ATW is a fully functional, large-scale, multilingual natural language generator designed and developed entirely from a linguist’s perspective. The system incorporates extensive typological, discourse, semantic, and syntactic research into its semantic representational system and its transfer and synthesizing grammars. The semantic representations consist of a controlled, English-based metalanguage augmented by a feature system that was designed to accommodate a wide variety of languages. During the development phase of this project, the system was tested with English, Spanish, Korean, Kewa (Papua New Guinea), Jula (Côte d’Ivoire), North Tanna (Vanuatu), Chinantec (Mexico), and Angas (Nigeria). The system is presently being used to translate the narrative portions of the Old Testament into several languages of the Philippines. Hi Everyone, since I won’t be able to attend the meetings, I’ll provide a lot of text rather than bullet points so that you can see what we’re working on at ATW. I wish I could join you.

3 ATW’s Goal The goal of ATW is to provide high-quality translations of the entire Bible, Bible stories, commentaries, devotional materials, pastoral training materials, Christian classics, and community development texts in a wide variety of languages, particularly minority and endangered languages. The drafts produced by our software always require mother-tongue editing to improve the naturalness, information flow, lexical selection, etc., but experiments indicate that editing the computer-generated drafts significantly improves the productivity of experienced mother-tongue translators without any loss of quality. The generated texts are always easily understandable, grammatically perfect, and convey essentially the same information as the original documents. We hope to eventually have semantic representations for the entire Bible, but Psalms and Proverbs will certainly be problematic.

4 Model of ATW’s System Our system consists of five components: 1) the ontology which contains all of the concepts that are used in the semantic representations, 2) the semantic representations which are the thoroughly annotated and disambiguated source texts used during the translation process, 3) a lexicon which contains all of the target language words and their associated features and forms, 4) a transfer grammar which restructures the semantic representations into a new underlying deep structure representation that is appropriate for a particular target language, and 5) the synthesizing grammar which synthesizes the final surface forms of the target text.
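The data flow between the five components can be pictured with a small Python sketch. It is illustrative only: the names `Pipeline`, `transfer_rules`, `synthesis_rules`, and `generate` are hypothetical and are not taken from ATW's software, and the word lookup is simplified to a single dictionary access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical, greatly simplified stand-ins for the real data structures.
Concept = str
SemanticRep = List[Concept]


@dataclass
class Pipeline:
    ontology: set                       # 1) concepts allowed in the representations
    lexicon: Dict[Concept, str]         # 3) concept -> target-language word
    transfer_rules: List[Callable[[SemanticRep], SemanticRep]] = field(default_factory=list)   # 4)
    synthesis_rules: List[Callable[[List[str]], List[str]]] = field(default_factory=list)      # 5)

    def generate(self, rep: SemanticRep) -> str:
        """Turn a semantic representation (component 2) into target text."""
        unknown = [c for c in rep if c not in self.ontology]
        if unknown:
            raise ValueError(f"concepts not in ontology: {unknown}")
        deep = rep
        for rule in self.transfer_rules:        # restructure for the target language
            deep = rule(deep)
        words = [self.lexicon[c] for c in deep]  # select target words
        for rule in self.synthesis_rules:       # produce the final surface forms
            words = rule(words)
        return " ".join(words)


# Toy target language that puts the verb first (an invented example).
toy = Pipeline(
    ontology={"PAULUS", "WALK"},
    lexicon={"PAULUS": "Paulus", "WALK": "walked"},
    transfer_rules=[lambda rep: list(reversed(rep))],
)
print(toy.generate(["PAULUS", "WALK"]))
```

The point of the sketch is the ordering: the same semantic representation is reused for every target language, while the lexicon and the two grammars are swapped per language.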

5 The Ontology The ontology consists of approximately 2,000 semantically simple English concepts. These concepts include the universal semantic primitives identified by the Natural Semantic Metalanguage theorists Anna Wierzbicka and Cliff Goddard; the defining vocabulary of the Longman Dictionary serves as the semantic molecules. The concepts are organized into seven semantic categories. Each concept is very precisely defined and used in consistent environments throughout all of the semantic representations. Many of the concepts have multiple senses. For example, the ontology includes 25 senses of ‘be’: an existential sense, a locative sense, a predicative sense, a class membership sense, etc.

6 A Sample of the Events (verbs) in ATW’s Ontology
The colors in the ‘Senses’ column indicate the semantic complexity level of each event. Purple indicates a universal semantic primitive, light yellow indicates a semantic molecule, and light green indicates a complex concept that never occurs in the semantic representations, but will be inserted by a rule into the semantic representations if the user indicates his target language has a lexical equivalent. Each event also has a Theta Grid which indicates its obligatory and optional arguments throughout the semantic representations.
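A Theta Grid can be thought of as a pair of role sets that a clause is checked against. The sketch below is a minimal illustration under that assumption; the `ThetaGrid` class and the grid given for ‘walk’ are hypothetical and are not copied from ATW's ontology.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class ThetaGrid:
    obligatory: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()

    def problems(self, roles: Set[str]) -> List[str]:
        """Report roles that are required but absent, or present but not licensed."""
        missing = sorted(self.obligatory - roles)
        extra = sorted(roles - self.obligatory - self.optional)
        return ([f"missing obligatory role: {r}" for r in missing]
                + [f"role not licensed: {r}" for r in extra])


# Hypothetical grid for one sense of 'walk'.
WALK = ThetaGrid(
    obligatory=frozenset({"Agent"}),
    optional=frozenset({"Source", "Destination"}),
)
```

With this grid, a clause containing an Agent, a Source, and a Destination passes the check, while a clause that lacks an Agent is reported.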

7 The Semantic Representational System
We considered formal semantics, conceptual semantics, and generative semantics, but they were unsuitable because they don’t contain sufficient information for minority languages. Therefore we developed a new semantic representational system which consists of semantically simple concepts in structurally simple propositions, and each concept, phrase, and proposition includes numerous features. For example, each object (noun) is marked for Number, and the possible values are Singular, Dual, Trial, Quadrial, Plural, and Paucal because some languages morphologically distinguish each of those values. Other features include Participant Tracking, Proximity, Semantic Role, Time, Aspect, Mood, Reflexivity, Polarity, Illocutionary Force, Discourse Genre, Salience Band, etc. When the biblical scholars are uncertain about the intended meaning of a particular passage, we include alternates which capture the major views.
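The Number feature described above maps naturally onto an enumeration. In the sketch below, the six values come from the text, while the `ObjectConcept` class and the `english_number` function are hypothetical; the function only illustrates how a language with fewer distinctions might collapse the values.

```python
from dataclasses import dataclass
from enum import Enum


class Number(Enum):
    SINGULAR = "Singular"
    DUAL = "Dual"
    TRIAL = "Trial"
    QUADRIAL = "Quadrial"
    PLURAL = "Plural"
    PAUCAL = "Paucal"


@dataclass(frozen=True)
class ObjectConcept:
    concept: str
    number: Number


def english_number(n: Number) -> str:
    """Collapse the six-way distinction to the two values English marks."""
    return "singular" if n is Number.SINGULAR else "plural"


two_birds = ObjectConcept("bird", Number.DUAL)
print(english_number(two_birds.number))
```

A target language with a morphological dual would keep `DUAL` distinct instead of folding it into the plural, which is why the representation carries the finer-grained value.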

8 An Example of ATW’s Semantic Representations
Semantic Representation of “Paulus started walking from the market to a village named Terpen.” As seen in the figure above, each phrase and clause is clearly marked, the relationships between the constituents are marked, and each concept, phrase, and clause has numerous features. For example, the second feature associated with each noun phrase indicates its semantic role. The NP containing “Paulus” is marked with an “A” indicating “Agent,” the NP containing “market” is marked with “s” indicating “Source,” and the NP containing “village” is marked with “d” indicating “Destination.” The fifth feature associated with each verb indicates its aspect. The verb “walk” is marked with “I” indicating “Inceptive Aspect.” Rules use these features in order to supply the appropriate morphology, put the constituents in their proper order, etc. The final result for English is “Paulus started walking from the market to a village named Terpen.”
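The feature codes in this example can be mimicked with a small dictionary-based sketch. The codes "A", "s", "d", and "I" are the ones named above; everything else is an assumption made for illustration, including the use of named keys instead of ATW's positional features and the English preposition table.

```python
# Codes taken from the example above.
ROLE_CODES = {"A": "Agent", "s": "Source", "d": "Destination"}
ASPECT_CODES = {"I": "Inceptive"}

# Assumed toy rule table: which English preposition marks which role.
ENGLISH_PREPOSITION = {"Source": "from", "Destination": "to"}

# Hypothetical encoding of the example clause with named (not positional) features.
clause = {
    "verb": {"concept": "walk", "aspect": "I"},
    "noun_phrases": [
        {"head": "Paulus", "role": "A"},
        {"head": "market", "role": "s"},
        {"head": "village", "role": "d"},
    ],
}


def decode(c: dict) -> dict:
    """Expand the one-letter feature codes into readable names."""
    return {
        "event": c["verb"]["concept"],
        "aspect": ASPECT_CODES[c["verb"]["aspect"]],
        "roles": {ROLE_CODES[np["role"]]: np["head"] for np in c["noun_phrases"]},
    }


def english_marker(role_code: str) -> str:
    """Return the preposition a toy English rule would supply for a role."""
    return ENGLISH_PREPOSITION.get(ROLE_CODES[role_code], "")
```

Here `english_marker("s")` yields "from" and `english_marker("d")` yields "to", matching the prepositions in the English output, while the Agent receives no marker.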

9 The Target Lexicon The lexicon is a repository of all the target language words and their associated features and forms. The lexicon has seven syntactic categories: nouns, verbs, adjectives, adverbs, adpositions, conjunctions, and particles. For each syntactic category, linguists are able to define the features and forms that are pertinent to a particular target language. So each target noun may be assigned a gender value, an honorifics value, a class value, etc. The various forms of each target word are generated by lexical spellout rules which are able to add inflectional affixes and perform morphophonemic operations. All cases of suppletion are entered directly into the lexicon.
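The division of labor between spellout rules and directly entered suppletive forms can be sketched as follows. This is a toy English example under stated assumptions: `LexEntry`, `past_spellout`, and `form` are hypothetical names, and the single rule shown is far simpler than a real lexical spellout rule.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LexEntry:
    stem: str
    features: Dict[str, str] = field(default_factory=dict)  # e.g. gender, class
    forms: Dict[str, str] = field(default_factory=dict)     # suppletive forms, entered directly


def past_spellout(stem: str) -> str:
    """Toy rule: add the suffix -ed, dropping its 'e' after a stem-final 'e'."""
    return stem + "d" if stem.endswith("e") else stem + "ed"


def form(entry: LexEntry, name: str) -> str:
    """Prefer a form stored in the lexicon; otherwise generate it by rule."""
    if name in entry.forms:
        return entry.forms[name]
    if name == "past":
        return past_spellout(entry.stem)
    raise KeyError(name)


walk = LexEntry("walk")
live = LexEntry("live")
go = LexEntry("go", forms={"past": "went"})   # suppletion
```

Regular forms such as "walked" and "lived" come from the rule, whereas "went" is looked up because it was entered directly.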

10 An Example of ATW’s Target Lexicon
Tagalog Verbs with their Glosses and Forms

11 The Transfer Grammar It’s impossible to develop a language neutral semantic representation of a text. Therefore the transfer grammar is responsible for restructuring the semantic representations in any way necessary to produce new underlying deep structure representations that are appropriate for each particular target language. So the transfer grammar contains rules that generate grammatical relations from semantic roles, perform theta grid adjustments for the events (verbs), handle the relativization strategies, mark noun-noun relationships, build clause chains, correct collocational clashes, etc. After the transfer grammar has been executed for a particular text, the result is a new underlying deep structure representation that is appropriate for the target language.

12 Model of ATW’s Transfer Grammar

13 The Tagalog Structural Adjustment Rule that Aggregates Possessor NPs
A Transfer Rule Most of the rules in the transfer grammar consist of an input structure and an output structure. Users build the input and output structures as shown above. When the rule is executed, the software searches the semantic representation of a verse for constituents that match the input structure. When a match is found, the changes specified in the output structure will be performed on the current semantic representation.
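The match-and-rewrite behavior of such a rule can be approximated in a few lines. The sketch below assumes a flat list of constituents represented as feature dictionaries, which is much simpler than the nested structures a real transfer rule would match; `matches` and `apply_rule` are hypothetical names, and the sample rule (Agent becomes subject) is an invented illustration of generating a grammatical relation from a semantic role.

```python
from typing import Dict, List

Constituent = Dict[str, str]


def matches(constituent: Constituent, input_structure: Constituent) -> bool:
    """A constituent matches when it carries every feature in the input structure."""
    return all(constituent.get(k) == v for k, v in input_structure.items())


def apply_rule(
    constituents: List[Constituent],
    input_structure: Constituent,
    output_changes: Constituent,
) -> List[Constituent]:
    """Return a new list in which every matching constituent has the changes applied."""
    result = []
    for c in constituents:
        if matches(c, input_structure):
            c = {**c, **output_changes}   # copy, so the original is left untouched
        result.append(c)
    return result


verse = [
    {"type": "NP", "head": "Paulus", "role": "Agent"},
    {"type": "NP", "head": "market", "role": "Source"},
]
restructured = apply_rule(verse, {"type": "NP", "role": "Agent"}, {"relation": "subject"})
```

Only the first noun phrase matches the input structure, so only it receives the new `relation` feature.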

14 The Synthesizing Grammar
The synthesizing grammar is responsible for synthesizing the final surface forms of the target language text. This grammar was designed to resemble as closely as possible the descriptive grammars that field linguists routinely write. Therefore the synthesizing grammar contains rules that mark agreement, select lexical forms, add all the contextual affixation (prefixes, suffixes, infixes, circumfixes, suprafixes, and clitics), order the constituents appropriately, perform morphophonemic operations, identify where pronouns and/or switch reference markers may be used, and insert punctuation. The input to the synthesizing grammar is the deep structure representation produced by the transfer grammar, and the output of the synthesizing grammar is target language text.

15 Model of ATW’s Synthesizing Grammar
Note that the Pronoun Identification rules cannot be executed until after the Phrase Structure rules have ordered all the constituents appropriately. So after the Phrase Structure rules have been executed, the Pronoun Identification and Spellout rules are executed. But then the Phrase Structure rules must be executed again because pronouns may be positioned differently than nouns (e.g., possessive nouns may precede their possessums, but possessive pronouns may follow their possessums). After the Phrase Structure rules have been executed the second time, all of the constituents are in their final positions, and the Word Morphophonemic rules are executed.
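The need for a second Phrase Structure pass can be demonstrated with a toy possessive phrase. The sketch assumes an invented language in which possessive nouns precede the possessum and possessive pronouns follow it; the function names and the `already_mentioned` feature are hypothetical, and only the four passes named above are modeled.

```python
def phrase_structure(state: dict) -> dict:
    """Order the possessor before the possessum if it is a noun, after it if a pronoun."""
    poss, head = state["possessor"], state["possessum"]
    state["order"] = [poss, head] if poss["kind"] == "noun" else [head, poss]
    return state


def pronoun_identification_and_spellout(state: dict) -> dict:
    """Replace an already mentioned possessor with a pronoun (toy criterion)."""
    poss = state["possessor"]
    if poss.get("already_mentioned"):
        poss["kind"], poss["text"] = "pronoun", "his"
    return state


def word_morphophonemics(state: dict) -> dict:
    """Toy final pass: join the ordered constituents into surface text."""
    state["surface"] = " ".join(c["text"] for c in state["order"])
    return state


PASSES = [
    phrase_structure,
    pronoun_identification_and_spellout,
    phrase_structure,              # run again: pronouns may be positioned differently
    word_morphophonemics,
]


def synthesize(state: dict) -> str:
    for rule_pass in PASSES:
        state = rule_pass(state)
    return state["surface"]


def make_state(already_mentioned: bool) -> dict:
    return {
        "possessor": {"text": "Paulus", "kind": "noun", "already_mentioned": already_mentioned},
        "possessum": {"text": "village", "kind": "noun"},
    }
```

If the second `phrase_structure` entry were removed from `PASSES`, the pronoun would stay in the position chosen for the noun, which is the ordering problem the note above describes.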

16 A Synthesizing Spellout Rule
The Tagalog Rule that inserts Case Markers

17 A Synthesizing Spellout Rule
The Kewa Rule that supplies Tense Morphemes

18 A Synthesizing Clitic Rule
A Kewa Clitic Rule that marks the Objects of Reciprocal Actions Typological research indicates that languages employ three types of clitics: 1) Pre-clitics which attach to the beginning of the first word in the phrase or clause, 2) Second Position clitics which attach to the end of the first word in the phrase or clause, and 3) Post-clitics which attach to the end of the last word in the phrase or clause. So our clitic rules permit users to specify these three types of clitics.
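The three attachment positions can be captured in one small function. This is a sketch, not ATW's rule format: `attach_clitic` and the `kind` labels "pre", "second", and "post" are hypothetical, and the words are placeholders rather than Kewa data.

```python
from typing import List


def attach_clitic(words: List[str], clitic: str, kind: str) -> List[str]:
    """Attach a clitic to a phrase or clause given as a list of words.

    kind: "pre"    -> beginning of the first word
          "second" -> end of the first word
          "post"   -> end of the last word
    """
    if not words:
        raise ValueError("cannot attach a clitic to an empty phrase")
    out = list(words)
    if kind == "pre":
        out[0] = clitic + out[0]
    elif kind == "second":
        out[0] = out[0] + clitic
    elif kind == "post":
        out[-1] = out[-1] + clitic
    else:
        raise ValueError(f"unknown clitic type: {kind}")
    return out
```

For a three-word phrase, a second-position clitic lands on the first word and a post-clitic on the third, regardless of the words' categories.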

19 A Synthesizing Phrase Structure Rule
A Phrase Structure Rule for Tagalog Clauses

20 A Morphophonemic Rule for Tagalog’s Ergative Common Case Marker
The Ergative Common Case Marker ‘ng’ changes to a suffix ‘-ng’ whenever it follows a verb that ends with a vowel.
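The conditioning environment of this rule (a preceding verb that ends in a vowel) can be expressed as a short function over tagged tokens. The sketch rests on several assumptions: the token format, the category labels, and the decision to write the suffix directly onto the verb are all hypothetical, and the word strings in the usage note are invented placeholders, not Tagalog vocabulary.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (text, category)
VOWELS = ("a", "e", "i", "o", "u")


def apply_ng_rule(tokens: List[Token]) -> List[Token]:
    """Attach the case marker 'ng' as a suffix to a preceding vowel-final verb."""
    out: List[Token] = []
    for text, category in tokens:
        follows_vowel_final_verb = (
            bool(out)
            and out[-1][1] == "verb"
            and out[-1][0].lower().endswith(VOWELS)
        )
        if category == "case_marker" and text == "ng" and follows_vowel_final_verb:
            verb_text, verb_category = out.pop()
            out.append((verb_text + "ng", verb_category))
        else:
            out.append((text, category))
    return out
```

Given the placeholder tokens `verba` (verb), `ng` (case marker), and `nouna` (noun), the function returns `verbang` followed by `nouna`; after the consonant-final placeholder `verbat`, the marker stays a separate word.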

21 Unedited English and Korean Texts Generated by ATW
One day a doctor named Paulus returned from the market to his village named Terpen. While Paulus had been at the market, some people had told him about a certain disease. So when Paulus returned to his village, he said to Isak, who was the village chief, and the other people who lived in Terpen, "A new disease named Avian Influenza has killed most of the birds that are at the market. This disease has killed many chickens and many ducks.
어느 날 팔러스라는 의사가 시장에서 터펜이라는 자기 마을로 돌아왔다. 팔러스가 시장에 있는 동안 사람들이 팔러스에게 어떤 병에 대해서 말하였다. 그래서 팔러스는 자기 마을로 돌아왔을 때 마을 이장인 아이작과 터펜에 사는 다른 사람들에게 말하였다. "조류 인플루엔자라는 새 병이 시장에 있는 대부분 새들을 죽였습니다. 이 병은 닭들과 오리들을 많이 죽였습니다.
The two texts shown above have not been edited; they were both produced directly by our software. Those texts give you a sense of both the quality and content of the texts generated by our software. As mentioned earlier, the texts are always easily understandable, grammatically perfect, and convey essentially the same information as the original documents. The texts always require mother-tongue editing to improve the naturalness, the information flow, and the lexical selections, and to make the texts more culturally relevant, but the editing takes only a fraction of the time required for manual translation. The goal of our system is to help the translators get their first presentable draft very quickly.

22 Testimonies from Two Teachers in Manila
A teacher at Lyncrest Christian Academy in Manila wrote: "The students were captivated by the books and couldn't wait to read them. I requested they read the first chapter before I collected them, but many of the students finished reading to the end! The principal is interested in using the books in the Christian Living class. Kids love comic books, and this is a great way to teach them truth from God's Word! Thank you for sharing these books with our students."
A teacher at Shining Stars in Manila wrote, “Thank you ATW! The students enjoyed the story; it was easy to read and easy to understand. They’d like the entire Bible study series to use the same format as this book. We would certainly be interested in getting copies of Genesis when it is ready! God bless!”
After we finished translating the book of Ruth into Tagalog, we put the text into the pictures provided by Free Illustrated Bible, and printed 500 copies in color. We then distributed those books at churches, schools, and orphanages throughout the Manila area. The books have been very popular with adults, children, teachers, and students. We have Esther and the first half of Daniel ready to be printed, and we’ll soon have Genesis ready also. One missionary who prepares Tagalog Sunday school materials saw the texts that we produce, and immediately said he wants to include our translations in his future videos. We’re thrilled that our Tagalog translations will be used throughout Manila and the rest of the country to help train Bible school teachers and Sunday school teachers.

23 Remaining Work Our approach relies entirely upon the manual development of semantic representations. Manually developing these semantic representations is a time-consuming task, but after a book has been thoroughly analyzed, our software can quickly generate translations of that book in a wide variety of languages. At the present time we have semantic representations for Genesis, Ruth, Esther, Daniel, Nahum, Luke, six Pauline epistles, and three community development texts. Exodus chapters 1 through 20, Judges, and a set of Bible stories should be finished this summer. We’re also making progress on Joshua and 1 and 2 Samuel. At the present time we’re focusing on the Old Testament because many missionaries are already working on the New Testament, and they don’t want help with that work. But we’ve found that they enthusiastically welcome computational assistance with the Old Testament. After we have semantic representations for the entire Bible, we’ll begin developing semantic representations for commentaries, devotional materials, pastoral training materials, etc.

24 Summary We work with missionaries, linguists, and mother-tongue speakers to build computational lexicons and grammars for languages, particularly minority and endangered languages that don’t yet have the Bible. After we’ve developed a lexicon and grammar for a language, our software is able to quickly produce initial draft translations of all the texts that we’ve developed semantic representations for. In the Philippines we’ve produced Tagalog translations of Genesis, Ruth, Esther, half of Daniel, and the first five chapters of Luke. As we develop more semantic representations of biblical books, we’ll produce Tagalog translations of those books as well. We’ve also found that it’s very efficient to modify the Tagalog lexicon and grammar to accommodate other Malayo-Polynesian languages in the Philippines, so we’re producing translations of the same biblical books in other languages also. Our Tagalog translations are currently being used in churches, schools, and orphanages throughout the Manila area.

25 For More Information ... For more information, please visit our website. We have a demo video on our home page, and you can download the current version of our software, complete with English and Tagalog lexicons and grammars. We have numerous tutorials, documents, and videos describing the details of our system, the experiments we’ve done to test the quality of the generated texts, summaries of the projects we did during the development phase of this system, and testimonies from pastors, missionaries, teachers, and students in the Manila area who are using our Tagalog translations. If you’d like to hear about our progress, you can sign up for our newsletters. We welcome all suggestions and comments.

