1
DATA INTEGRATION FOR LANGUAGE DOCUMENTATION
Under the guidance of: Dr. Jan Chomicki & Dr. Jeff Good. Presented by: Sumit Agrawal
2
INTRODUCTION This project aims to integrate a large amount of data that is spread across various files and folders and stored in different formats. The data covers 7-9 languages from a linguistics project under way in Cameroon. The data also contains metadata about the files.
4
DATA FORMATS Data is available in different formats:
- Audiovisual: audio recordings, video recordings, photographs, scanned images
- Textual: transcriptions (some time-aligned, XML), unstructured text (various formats), questionnaire data, lexical data (e.g., vocabulary items in a database)
- Metadata
5
CHALLENGES Each file should have metadata, but some files have no associated metadata. Each researcher uses a different format for writing files. Different researchers sometimes interacted with the same people. There are more than 200 different file types.
6
AIM A system that can query the data by:
- Author name
- Speaker name
- Date, language name, etc.
E.g., records pertaining to the language ‘Naki’, or all the records of the date ‘ ’. Clean the data. Remove duplicates and build a database.
7
AIM Each file is to be linked to its metadata.
Query the RDF data using SPARQL. Integrate the database and the file system. Develop a user interface for queries. Know the density of the data. Database management.
8
ORIGINAL DATA - FOLDERS
9
ORIGINAL DATA - FILES
10
PARSING The files were parsed using Python scripts.
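The slides do not show the parsing scripts themselves. As a minimal sketch of the idea, the snippet below reads metadata out of a file path, assuming a hypothetical folder layout of `<root>\<language>\<researcher>\...\<file>`. The function name `describe_file`, the field names, and the file name `example.wav` are illustrative assumptions, not the project's actual code.

```python
from pathlib import PureWindowsPath


def describe_file(path_str, root):
    """Guess metadata for one file from its location under `root`.

    Assumes (hypothetically) that the first folder below `root` names the
    language and the second folder names the researcher.
    """
    path = PureWindowsPath(path_str)
    parts = path.relative_to(root).parts  # folders below root, then the file name
    return {
        "path": path_str,
        "extension": path.suffix.lstrip(".").lower(),
        # Only treat a component as a folder if something follows it.
        "language": parts[0] if len(parts) > 1 else None,
        "author": parts[1].replace("_", " ") if len(parts) > 2 else None,
    }


info = describe_file(
    r"F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\example.wav",
    r"F:\DataIntegration\GoodBackup1-Obang",
)
print(info["language"], info["author"], info["extension"])
```

A full scan would call a function like this for every file found with `os.walk`; files that do not follow the assumed layout simply get `None` for the missing fields.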
11
INITIAL RESULT
12
CLEANING & LINKING The different data formats were identified.
The identified files were grouped by file extension. The related metadata for each file (e.g., language, date, and extension) was extracted. Duplicate files were identified. The unidentified files were grouped in a separate file. The identified files were linked to the existing metadata. There are two types of metadata: the metadata we extracted and the metadata that was provided.
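The grouping and duplicate-detection steps above can be sketched as follows. The slides do not say how duplicates were detected, so comparing SHA-256 digests of file contents is an assumption here, as are the function names and the choice to label files without an extension as "unidentified".

```python
import hashlib
import os
from collections import defaultdict


def group_by_extension(names):
    """Group file names by lower-cased extension.

    Files with no extension go into an "unidentified" group (an assumption
    about what "unidentified" means in the slides).
    """
    groups = defaultdict(list)
    for name in names:
        ext = os.path.splitext(name)[1].lstrip(".").lower() or "unidentified"
        groups[ext].append(name)
    return dict(groups)


def find_duplicates(files):
    """Return groups of names whose contents are byte-for-byte identical.

    `files` maps a file name to its contents as bytes.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


print(group_by_extension(["a.wav", "b.WAV", "notes"]))
print(find_duplicates({"a.wav": b"x", "copy.wav": b"x", "c.wav": b"y"}))
```

Hashing contents rather than comparing names catches copies that were renamed, which seems likely when several researchers keep separate backups of the same recordings.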
13
AFTER CLEANING - RESULT
A sample of the data constructed after cleaning and linking the data with metadata:
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :26 Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :78 Jeff Good
Naki wav F:\DataIntegration\GoodBackup1-Obang\Naki\Jeff_Good\Cameroon2005 Naki JCG.wav George Ngong NAKI-NOTEBOOK :914 Jeff Good
14
XML SCHEMA
15
RDF DATA
16
BUILD AN RDF DATABASE USING SESAME
TRIPLES OF THE RDF MODEL
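The triples themselves are not reproduced in this transcript. As a rough illustration of what a triple for one cleaned record might look like, the sketch below turns a record into (subject, predicate, object) triples and writes them as N-Triples, a plain-text format that RDF stores such as Sesame can load. The namespace `http://example.org/langdoc/`, the record id `rec1`, and the property names are hypothetical; the project's real vocabulary is not given in the slides.

```python
from urllib.parse import quote

BASE = "http://example.org/langdoc/"  # hypothetical namespace, not from the slides


def record_to_triples(record):
    """Turn one metadata record into (subject, predicate, object) triples.

    The record's "id" becomes the subject URI; every other key becomes a
    predicate with a plain string literal as its object.
    """
    subject = BASE + "file/" + quote(record["id"])
    return [
        (subject, BASE + key, str(value))
        for key, value in record.items()
        if key != "id"
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, escaping backslashes and quotes."""
    lines = []
    for s, p, o in triples:
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


triples = record_to_triples({"id": "rec1", "language": "Naki", "author": "Jeff Good"})
print(to_ntriples(triples))
```

Loading the resulting file into a Sesame repository is a separate step that is not shown here.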
17
RDF GRAPH
18
CURRENT GOALS Providing SPARQL querying ability for the RDF data.
Linking the remaining metadata to the parsed metadata. Building a database for the unidentified files.
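As a sketch of what the SPARQL querying goal could look like in practice, the helper below builds a query that selects files by metadata values such as language or author. The prefix `ld:`, the namespace, and the property names are hypothetical placeholders, since the slides do not give the project's RDF vocabulary, and the query is only constructed here, not run against a store.

```python
BASE = "http://example.org/langdoc/"  # hypothetical namespace, not from the slides


def build_query(**filters):
    """Build a SPARQL SELECT query matching files on exact metadata values.

    Each keyword argument becomes one triple pattern; the keyword is assumed
    to be a valid property name in the (hypothetical) `ld:` vocabulary.
    """
    lines = [f"PREFIX ld: <{BASE}>", "SELECT ?file WHERE {"]
    for field, value in filters.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?file ld:{field} "{escaped}" .')
    lines.append("}")
    return "\n".join(lines)


print(build_query(language="Naki", author="Jeff Good"))
```

The resulting text could then be submitted to whatever SPARQL endpoint or query interface the RDF store exposes.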
19
LONG-TERM GOALS Create a multimedia server to store all the data along with the metadata and the RDF data. Automated dumping of data into the repository. Building a user interface. Provide Linked Data for the Semantic Web.
20
THANK YOU!
22
REFERENCES http://www.w3.org/TR/rdf-schema/