Big Data: Every Word. Managing Data; Data Mining; Terminology; Data Collection; Crowdsourcing; Security & Validation; Universal Translation; Monolingual Dictionaries.

Presentation transcript:

Big Data: Every Word
Managing Data; Data Mining; Terminology; Data Collection; Crowdsourcing; Security & Validation; Universal Translation; Monolingual Dictionaries
Free Public Resource; Comprehensive Data; Expert Input; Students; Crowdsourcing; Data Mining; Linking All Languages

Obtain each expression from every language
7000 languages globally = hundreds of millions of terms
Requires:
Robust platform
Complex architecture
Simple and attractive for millions of users
Broad approach to data collection

One dictionary from your language to any language
Unique data model that accounts for complexities within and between languages
Each entry is a container for extensive data

Rich data for use in Human Language Technology applications
High precision machine translation
Computer assisted translation
Voice recognition and synthesis
Live localization

Growing a well-reputed website
Expand user base from Africa to languages worldwide
Many thousands of top 5 Google search results
Mobile services: big data on small devices, cheap phones with expensive networks (African context)
APIs and XML for external machine applications

Every word has a definition in its own language
“Talking Dictionaries” for non-written languages
“Living Dictionaries” – data grows over time
Will include geo-tagging of terms and pronunciations, historical information, relationships within a language
4-dimensional tapestry of human linguistic expression across time and space

Transitivity: a concept that is linked to another language acquires that language’s links
Degrees of separation: tracking distance between links as a confidence index
Degrees of equivalence: charting how closely concepts correspond
Core data design principle: translation is mapping ideas, not letter strings, across languages

Structured but flexible online Edit Engine
The Fidget Widget: mobile app for targeted data collection from the crowd
Merging engine to bring in data from existing data sets

Participatory platform for expert-led community terminology development
Specialized terms possible for specific domains, e.g. Science and Medicine, Development and Human Rights, Emergency Response
Potential to integrate with other projects, e.g. MOOCs, Government forms

Existing linguistic data is extremely variable – mapping fields is a major challenge, especially from older scanned sources
Data in each set must be validated by experts or crowds
Data must be aligned to specific senses – automation is not possible

Fidget Widget for simple tasks in idle time
Gamification: competition within and across languages
Social recognition for contributions
Validation before publication – building confidence into the system

Authoritative knowledge for long-term data reliability
Data model that satisfies all technical needs identified by linguists
Accounts for all language variables
Simple to configure and use
Paying for expert labor: need to build a system for the public to “buy” words

Training opportunity for students in translation and linguistics
Kamusi gives stipend support for students to develop data in their language
Pilot program in place at University of Ngozi in Burundi, with plans to expand to other African universities

How to recognize good data in thousands of languages we can’t read
Detecting and correcting good users who give bad data
How to prevent bad data in thousands of languages
Detecting and eliminating malicious users and their submissions
Preventing spam registration and comments with millions of users and millions of pages

How to get good data from non-experts, including non-literate speakers of endangered languages
How to sharply focus tasks to a user’s skill and knowledge set
What incentives will motivate participation, e.g. recognition, social media integration
Creating iterative processes to validate data with statistical confidence
Games with a purpose: how to make lexicography fun

How to make old data available in new structures
Techniques for capturing data from inconsistent and unstructured sources
Tools to facilitate human review of imported data
How to make diverse data commensurate
Matching concepts across languages in the absence of indexes or sense disambiguation

How to collect enhanced data, beyond basic translations and definitions
How to link with other data projects
Incorporating data from other sources, e.g. WordNet
Exposing complex data in simple ways for external use cases
How to integrate with translation technologies
Harvesting terms and usage examples from translation software
Giving translators and machines near-perfect vocabulary
How to present data in numerous languages for diverse public needs, from schoolchildren to research scholars
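
The transitivity and degrees-of-separation points in the transcript can be pictured as a graph problem. The following is only a minimal sketch under stated assumptions, not the project's actual data model: the sample links, the `(language, sense)` node shape, the function names, and the halving `decay` score are all hypothetical choices made for illustration. Concepts are nodes, human-made translation links are edges, and a breadth-first search gives the number of hops between two concepts, which is then turned into a rough confidence index.

```python
from collections import deque

# Hypothetical sense-level links; each node is a (language code, sense label) pair.
SAMPLE_LINKS = [
    (("swa", "nyumba#1"), ("eng", "house#1")),
    (("eng", "house#1"), ("fra", "maison#1")),
    (("fra", "maison#1"), ("deu", "Haus#1")),
]


def build_graph(links):
    """Build an undirected adjacency map from pairs of linked concepts."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degrees_of_separation(graph, start):
    """Breadth-first search: map every reachable concept to its hop count from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def confidence(hops, decay=0.5):
    """Assumed scoring rule: a direct link scores 1.0, each extra hop multiplies by decay."""
    if hops <= 1:
        return 1.0
    return decay ** (hops - 1)


graph = build_graph(SAMPLE_LINKS)
hops_from_swahili = degrees_of_separation(graph, ("swa", "nyumba#1"))
```

In this sketch the Swahili concept reaches the German one only through English and French, so it sits three hops away and would be shown with a lower confidence than the directly linked English term. A real system would presumably also weigh degrees of equivalence, which this example leaves out.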