DATA INTEGRATION FOR LANGUAGE DOCUMENTATION
Under the guidance of: Dr. Jan Chomicki & Dr. Jeff Good
Presented by: Sumit Agrawal
INTRODUCTION
This project aims to integrate a large amount of data that is spread across many files and folders and stored in different formats. The data covers 7-9 languages and comes from a linguistics project under way in Cameroon. The data also contains metadata about the files.
DATA FORMATS
Data is available in different formats:
- Audiovisual: audio recordings, video recordings, photographs, scanned images
- Textual: transcriptions (some time-aligned, XML), unstructured text (various formats), questionnaire data, lexical data (e.g., vocabulary items in a database)
- Metadata
CHALLENGES
- Each file should have associated metadata, but some files have none.
- Each researcher writes files in a different format.
- Different researchers sometimes interacted with the same people.
- There are more than 200 different file types.
AIM
Build a system that can query the data by:
- Author name
- Speaker name
- Date, language name, etc.
E.g.:
- Records pertaining to the language ‘Naki’
- All records dated ‘2011-08-09’
Clean the data, remove duplicates, and build a database.
AIM
- Link each file to its metadata.
- Query the RDF data using SPARQL.
- Integrate the database and the file system.
- Develop a user interface for queries.
- Determine the density of the data.
- Database management.
ORIGINAL DATA - FOLDERS
ORIGINAL DATA - FILES
PARSING
The files were parsed using Python scripts.
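The project's own scripts are not shown here. As a minimal sketch of what a first parsing pass could look like, the function below walks a folder tree and records the folder, name, and extension of every file. The function name `scan_tree` and the record keys are illustrative assumptions, not the project's code.

```python
import os
from pathlib import Path


def scan_tree(root):
    """Walk `root` and return one record (a dict) per file found."""
    records = []
    for folder, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            records.append({
                "folder": folder,
                "name": name,
                # Lower-case the extension so ".WAV" and ".wav" are treated alike.
                "extension": Path(name).suffix.lower(),
            })
    return records
```

Files without an extension get an empty string in `extension`, which makes them easy to set aside for later inspection.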
INITIAL RESULT
CLEANING & LINKING
- The different data formats were identified.
- The identified files were grouped by file extension.
- The related metadata for each file (e.g., language, date, and extension) was extracted.
- Duplicate files were identified.
- The unidentified files were grouped in a separate file.
- The identified files were linked to the existing metadata.
- There are two types of metadata: the metadata we extracted and the metadata that was provided.
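The grouping and duplicate-detection steps above can be sketched as follows. The slides do not say how duplicates were identified, so comparing SHA-256 digests of file contents is an assumption here, and all function names are illustrative.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_extension(paths):
    """Map each lower-cased file extension to the list of paths that have it."""
    groups = defaultdict(list)
    for p in paths:
        groups[Path(p).suffix.lower()].append(p)
    return dict(groups)


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of paths whose contents are byte-for-byte identical (assumed criterion)."""
    by_digest = defaultdict(list)
    for p in paths:
        by_digest[file_digest(p)].append(p)
    return [group for group in by_digest.values() if len(group) > 1]
```

Hashing contents rather than comparing names catches copies that were renamed between backups, at the cost of reading every file once.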
AFTER CLEANING - RESULT
A sample of the data constructed after cleaning and linking the data with metadata (one record per line):
Naki | 12-11-05 | .wav | F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 | Naki-12-11-05-1-JCG.wav | George Ngong | NAKI-NOTEBOOK-2005-1 | Jeff Good
Naki | 12-11-05 | .wav | F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 | Naki-12-11-05-2-JCG.wav | George Ngong | NAKI-NOTEBOOK-2005-1:26 | Jeff Good
Naki | 14-11-05 | .wav | F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 | Naki-14-11-05-1-JCG.wav | George Ngong | NAKI-NOTEBOOK-2005-1:78 | Jeff Good
Naki | 15-11-05 | .wav | F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 | Naki-15-11-05-1-JCG.wav | George Ngong | NAKI-NOTEBOOK-2005-1:914 | Jeff Good
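File names such as `Naki-12-11-05-1-JCG.wav` in the sample appear to carry the language, date, and extension directly. The sketch below shows one way such a name could be split into fields. It assumes the pattern language-DD-MM-YY-sequence-initials, that two-digit years belong to the 2000s, and that the trailing letters are researcher initials. None of this is stated in the slides, and the function name `parse_filename` is hypothetical.

```python
import re
from datetime import date

# Assumed pattern: <language>-<DD>-<MM>-<YY>-<sequence>-<initials><extension>
FILENAME_RE = re.compile(
    r"(?P<language>[A-Za-z]+)-"
    r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{2})-"
    r"(?P<sequence>\d+)-"
    r"(?P<initials>[A-Za-z]+)"
    r"(?P<extension>\.[A-Za-z0-9]+)"
)


def parse_filename(name):
    """Return a dict of fields extracted from `name`, or None if it does not fit the pattern."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        return None
    try:
        # Assumption: two-digit years are in the 2000s (e.g. "05" means 2005).
        recorded = date(2000 + int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:
        return None  # digits present but not a valid calendar date
    return {
        "language": m["language"],
        "date": recorded,
        "sequence": int(m["sequence"]),
        "initials": m["initials"],
        "extension": m["extension"].lower(),
    }
```

Names that return `None` can be routed to the group of unidentified files instead of being guessed at.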
XML SCHEMA
RDF DATA
BUILD AN RDF DATABASE USING SESAME
TRIPLES OF THE RDF MODEL
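To illustrate how one cleaned record could become triples of the RDF model, the sketch below writes each field as a subject-predicate-literal line in N-Triples syntax. The namespace `http://example.org/langdoc/`, the predicate names, and the function names are placeholders chosen for illustration. The project's actual vocabulary and the way the triples were loaded into Sesame are not shown in the slides.

```python
from urllib.parse import quote

BASE = "http://example.org/langdoc/"  # hypothetical namespace, not the project's


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(file_id, record):
    """Return one N-Triples line per field of `record`, all sharing one subject IRI."""
    subject = f"<{BASE}file/{quote(file_id, safe='')}>"
    lines = []
    for key, value in record.items():
        predicate = f"<{BASE}{quote(key, safe='')}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines
```

Escaping matters here because Windows-style folder paths contain backslashes, which are escape characters inside N-Triples literals.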
RDF GRAPH
CURRENT GOALS
- Provide SPARQL querying ability for the RDF data.
- Link the remaining metadata to the parsed metadata.
- Build a database for the unidentified files.
LONG-TERM GOALS
- Create a multimedia server to store all the data along with the metadata and the RDF data.
- Automate the dumping of data into the repository.
- Build a user interface.
- Provide Linked Data for the Semantic Web.
THANK YOU!
REFERENCES
- http://www.w3.org/TR/rdf-schema/
- http://www.delaman.org/docs/meeting06/good-metadata.pdf
- http://www.acsu.buffalo.edu/~jcgood/jcgood-CUPHEL.pdf
- http://www.w3.org/RDF/Validator/