Published by Lenard Rice. Modified over 8 years ago.
Big Data: Every Word

Topics: Managing Data, Data Mining, Terminology, Data Collection, Crowdsourcing, Security & Validation, Universal Translation, Monolingual Dictionaries, Free Public Resource, Comprehensive Data, Expert Input, Students, Linking All Languages

The Task
- Obtain each expression from every language: roughly 7,000 languages globally, amounting to hundreds of millions of terms
- Requires a robust platform with a complex architecture, simple and attractive for millions of users, and a broad approach to data collection

Universal Translation
- One dictionary from your language to any language
- A unique data model that accounts for complexities within and between languages; each entry is a container for extensive data
- Rich data for Human Language Technology applications: high-precision machine translation, computer-assisted translation, voice recognition and synthesis, live localization

Free Public Resource
- Growing a well-reputed website, expanding the user base from Africa to languages worldwide
- Many thousands of top-5 Google search results
- Mobile services: big data on small devices, for cheap phones on expensive networks (the African context)
- APIs and XML for external machine applications

Comprehensive Data
- Every word has a definition in its own language
- "Talking Dictionaries" for non-written languages
- "Living Dictionaries": data grows over time
- Will include geo-tagging of terms and pronunciations, historical information, and relationships within a language: a four-dimensional tapestry of human linguistic expression across time and space

Linking All Languages
- Transitivity: a concept that is linked to another language acquires that language's links
- Degrees of separation: tracking the distance between links as a confidence index
- Degrees of equivalence: charting how closely concepts correspond
- Core data design principle: translation is mapping ideas, not letter strings, across languages

Data Collection
- A structured but flexible online Edit Engine
- The Fidget Widget: a mobile app for targeted data collection from the crowd
- A merging engine to bring in data from existing data sets
- http://terms.kamusi.org: a participatory platform for expert-led community terminology development
- Specialized terms possible for specific domains, e.g. science and medicine, development and human rights, emergency response
- Potential to integrate with other projects, e.g. MOOCs and government forms

Data Mining
- Existing linguistic data is extremely variable; mapping fields is a major challenge, especially from older scanned sources
- Data in each set must be validated by experts or crowds
- Data must be aligned to specific senses; automation is not possible

Crowdsourcing
- The Fidget Widget for simple tasks in idle time
- Gamification: competition within and across languages
- Social recognition for contributions

Security & Validation
- Validation before publication: building confidence into the system
- Authoritative knowledge for long-term data reliability

Managing Data
- A data model that satisfies all technical needs identified by linguists and accounts for all language variables
- Simple to configure and use

Expert Input and Students
- Paying for expert labor: the need to build a system for the public to "buy" words
- A training opportunity for students in translation and linguistics: Kamusi gives stipend support for students to develop data in their language
- A pilot program is in place at the University of Ngozi in Burundi, with plans to expand to other African universities

Challenges: Data Quality
- How to recognize good data in thousands of languages we cannot read, and how to detect and correct good users who give bad data
- How to prevent bad data in thousands of languages: detecting and eliminating malicious users and their submissions
- Preventing spam registrations and comments with millions of users and millions of pages

Challenges: Crowdsourcing
- How to get good data from non-experts, including non-literate speakers of endangered languages
- How to sharply focus tasks on a user's skill and knowledge set
- What incentives will motivate participation, e.g. recognition and social media integration
- Creating iterative processes to validate data with statistical confidence
- Games with a purpose: how to make lexicography fun

Challenges: Data Mining
- How to make old data available in new structures: techniques for capturing data from inconsistent and unstructured sources, and tools to facilitate human review of imported data
- How to make diverse data commensurate: matching concepts across languages in the absence of indexes or sense disambiguation
- How to collect enhanced data, beyond basic translations and definitions
- How to link with other data projects: incorporating data from other sources (e.g. WordNet) and exposing complex data in simple ways for external use cases
- How to integrate with translation technologies: harvesting terms and usage examples from translation software, giving translators and machines near-perfect vocabulary
- How to present data in numerous languages for diverse public needs, from schoolchildren to research scholars
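The "Linking All Languages" ideas of transitivity and degrees of separation can be sketched as a short graph walk: a concept inherits the links of the concepts it is linked to, and every extra hop lowers a confidence score. This is a minimal illustration only; the decay factor, the directed-link layout, and the sample entries are assumptions, not Kamusi's actual data model.

```python
from collections import deque

# Direct concept links asserted by editors (illustrative sample data).
links = {
    "en:tree": {"sw:mti", "fr:arbre"},
    "fr:arbre": {"de:Baum"},
    "de:Baum": {"ru:дерево"},
}

def inferred_links(concept, decay=0.8):
    """Breadth-first walk over concept links.

    Transitivity: a concept linked to another language acquires that
    language's links. Degrees of separation: each additional hop
    multiplies the confidence by an assumed decay factor, so distance
    doubles as a confidence index.
    """
    seen = {concept}
    queue = deque([(concept, 1.0)])
    results = {}
    while queue:
        node, conf = queue.popleft()
        for neighbor in links.get(node, ()):  # links are directed in this sketch
            if neighbor not in seen:
                seen.add(neighbor)
                results[neighbor] = conf * decay
                queue.append((neighbor, conf * decay))
    return results

print(inferred_links("en:tree"))  # confidence falls with each degree of separation
```

With the sample data, the direct links score 0.8, the two-hop German link 0.64, and the three-hop Russian link roughly 0.51, which is the sense in which distance between links serves as a confidence index.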
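One crude way to approach the challenge of matching concepts across languages in the absence of indexes or sense disambiguation is to score the overlap between definition glosses. The Jaccard measure, the field names, and the sample entries below are illustrative assumptions; as the deck notes, real sense alignment still needs human review.

```python
def jaccard(a, b):
    """Similarity between two token sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_match(entry, candidates):
    """Pick the candidate sense whose gloss best overlaps the entry's gloss.

    A stand-in for sense alignment when imported data carries no sense
    indexes; results would feed a human review queue, not go straight in.
    """
    tokens = entry["gloss"].lower().split()
    scored = [(jaccard(tokens, c["gloss"].lower().split()), c) for c in candidates]
    score, match = max(scored, key=lambda s: s[0])
    return match, score

# Illustrative entries: a Swahili headword against two English senses of "bank".
entry = {"headword": "benki", "gloss": "institution that holds money"}
senses = [
    {"id": "bank.n.1", "gloss": "financial institution that accepts and holds money"},
    {"id": "bank.n.2", "gloss": "sloping land beside a river"},
]
match, score = best_match(entry, senses)
print(match["id"], round(score, 2))  # bank.n.1 wins on gloss overlap
```

Token overlap is deliberately simple here; making diverse data commensurate in practice would also need normalization, stemming, and pivot-language handling.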