Assigning Schema Labels Using Ontology and Heuristics
Xuan Zhang, Ruoming Jin, Gagan Agrawal


Problem Statement
- Setting: An online integration system finds a new data file…
- Question: Can it be integrated into the system on the fly? How?
- Sub-tasks:
  - Understand the data
    - Talk to data host
    - Consult field expert
  - Process the data
    - Database administrator
    - Programmer
- Can we automate the process?

On-the-fly Integration Overview
[Figure: pipeline diagram]
Pipeline: Dataset (flat file) -> Layout learning -> Layout descriptor -> Parser generation -> Parser -> Parsing -> Raw attribute values -> Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Labeling (with cutoff values from an expert or clustering algorithm) -> Labels
Steps:
1. Delimiter Identification (Ref [25], [26])
2. Wrapper Generation (Ref [32])
3. Schema Mining

Schema Mining
- Assign meaning (labels or names) to attributes in a data set
- Challenges
  - What
    - Delimiters
    - Values
  - How
    - Top-down
    - Bottom-up

Our Approach
- Summarize attribute values from the bottom up
- Similarity between ontology and schema
  - An attribute a with label att, a value v
  - Schema: "v is-a att"
  - Ontology: "Node(v) is a child of Node(att)"
  - E.g., protein is-a molecule type
- Common ancestor of values in ontology ~ attribute label in schema
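The "common ancestor ~ attribute label" idea can be illustrated with a minimal sketch. It assumes a tree-shaped ontology stored as a child-to-parent map; the `PARENT` entries and the function names are hypothetical and are not taken from the authors' system.

```python
from typing import Dict, List, Optional

# Toy child -> parent ontology; entries are illustrative assumptions only.
PARENT: Dict[str, str] = {
    "protein": "molecule type",
    "nucleic acid": "molecule type",
    "DNA": "nucleic acid",
    "mRNA": "nucleic acid",
}

def strict_ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Return the ancestors of node, nearest first (node itself excluded)."""
    chain = []
    while node in parent:  # assumes the ontology has no cycles
        node = parent[node]
        chain.append(node)
    return chain

def common_ancestor(values: List[str], parent: Dict[str, str]) -> Optional[str]:
    """Return the nearest ancestor shared by all values, or None."""
    chains = [strict_ancestors(v, parent) for v in values]
    if not chains:
        return None
    shared = set(chains[0])
    for chain in chains[1:]:
        shared &= set(chain)
    # Walk the first chain nearest-first so the lowest shared node wins.
    for node in chains[0]:
        if node in shared:
            return node
    return None

print(common_ancestor(["protein", "DNA", "mRNA"], PARENT))  # molecule type
```

Here the values protein, DNA and mRNA share the ancestor "molecule type", which would be the candidate label for the attribute.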

Real-world Complications
- Complete comprehensive ontology database
  - Selective sampling
- Error-free dataset
  - Adjustable sensitivity and fault tolerance
- Time
  - Data mining + statistical analysis
Remark: attribute label ≠ attribute name, e.g., date: {creation date, last modification date}

Outline
- Motivation
- System
- Mining algorithm
- Experiment

Schema Mining System
[Figure: system diagram]
Components: Data cleaning and summarization, Score function, Ontology database
Flow: Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Labeling (with cutoff values from an expert or clustering algorithm) -> Attribute labels

Data Summarization
- Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  - E.g., Profile("polyA_site") = A_A
- Token category: word, number or else
- Frequent tokens
  - Approximate frequent token mining algorithm
  - Assumption: tokens are distributed evenly
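A token profile as described above can be sketched in a few lines. This is a minimal illustration, assuming that each run of digits collapses to a single N, each run of letters to a single A, and special characters are kept verbatim; `token_profile` and `token_category` are hypothetical names, and the exact category rules are an assumption.

```python
def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep other characters."""
    profile = []
    for ch in token:
        if ch.isdigit():
            sym = "N"
        elif ch.isalpha():
            sym = "A"
        else:
            sym = ch  # special characters are copied as-is
        # Merge consecutive characters of the same class into one symbol.
        if sym in ("N", "A") and profile and profile[-1] == sym:
            continue
        profile.append(sym)
    return "".join(profile)

def token_category(token: str) -> str:
    """Classify a token as 'word', 'number' or 'else' (assumed rules)."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"

print(token_profile("polyA_site"))   # A_A
print(token_profile("01-JAN-2005"))  # N-A-N
print(token_category("3.14"))        # number
```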

Template Scoring Function
- Desired properties
  - Simple
  - Adjustable trade-off between sensitivity and error tolerance

Ontology Database
- Goal: to approximate a complete comprehensive ontology database
- Approach
  - "Complete": sample popular terms
  - "Comprehensive": public ontology databases + common facts
- Result
  - 6 major categories, 386 terms

Ontology Contents
- Sample existing databases
  - Organism name: NCBI Taxonomy
  - Cellular component: Gene Ontology
  - Publication method: NCBI Entrez Journals
- New categories
  - Biology database: popular database names
  - Molecular type: biology fact
  - Free text: common words in natural language
- Enhancement
  - + taxonomy hierarchy
  - + direct submission

Outline
- Motivation
- System
- Mining algorithm
  - Using ontology
  - Using heuristics
- Experiment

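Mining With Ontology
1. Occurrence of a term:
\[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[i] \\
\min_{i \in [0,t]} \mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[0] \mid \dots \mid \mathrm{Frequent\_Tokens}[t] \\
0, & \text{otherwise}
\end{cases}
\]
2. Strength of a term:
\[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum_{child\_term} \mathrm{Strength}(child\_term)
\]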
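The two definitions above can be traced on a toy example. The sketch below assumes that the second case of Occurrence covers a multi-word term whose words are all frequent tokens, and that the ontology is a tree given as a parent-to-children map. The data and function names are hypothetical.

```python
from typing import Dict, List

def occurrence(term: str, frequent_counts: Dict[str, int]) -> int:
    """Count of the term if it is a frequent token; the minimum count over its
    words if every word is a frequent token (assumed reading); otherwise 0."""
    if term in frequent_counts:
        return frequent_counts[term]
    parts = term.split()
    if len(parts) > 1 and all(p in frequent_counts for p in parts):
        return min(frequent_counts[p] for p in parts)
    return 0

def strength(term: str, children: Dict[str, List[str]],
             frequent_counts: Dict[str, int]) -> int:
    """Occurrence of the term plus the strength of all its child terms."""
    return occurrence(term, frequent_counts) + sum(
        strength(child, children, frequent_counts)
        for child in children.get(term, [])
    )

# Illustrative ontology fragment and frequent-token counts (made up).
CHILDREN = {
    "molecule type": ["protein", "nucleic acid"],
    "nucleic acid": ["DNA", "mRNA"],
}
COUNTS = {"protein": 40, "DNA": 25, "mRNA": 10, "nucleic": 5, "acid": 7}

print(occurrence("nucleic acid", COUNTS))         # 5
print(strength("molecule type", CHILDREN, COUNTS))  # 80
```

In this example "molecule type" never appears among the values, yet it accumulates strength 80 from its descendants (40 + 5 + 25 + 10), which is what lets an ancestor term emerge as a label candidate.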

Mining With Ontology
- Likelihood of an attribute being labeled with l
- Factors:
  - Relative strength of term l compared with that of other terms
  - Completeness of ontology
- Score = product of the two factors, modulated by the template scoring function

Mining With Heuristics
- Use token profile
  - "number": {N, N.N}
  - "date": {N-A-N, N/N/N}
- Use frequent token counts
  - "identification": Frequent_Counts[]=1
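A rough sketch of these profile-based and count-based heuristics follows. It is a simplification: the slides describe scores compared against cutoff values, whereas this version returns a hard label. It also reads "Frequent_Counts[]=1" as "every value occurs exactly once", which is an assumption. All names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}

def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep other characters."""
    profile = []
    for ch in token:
        sym = "N" if ch.isdigit() else "A" if ch.isalpha() else ch
        if sym in ("N", "A") and profile and profile[-1] == sym:
            continue
        profile.append(sym)
    return "".join(profile)

def heuristic_label(values: List[str]) -> Optional[str]:
    """Return 'number', 'date', 'identification', or None if no rule fires."""
    if not values:
        return None
    profiles = {token_profile(v) for v in values}
    if profiles <= NUMBER_PROFILES:
        return "number"
    if profiles <= DATE_PROFILES:
        return "date"
    # Identifier-like attribute: no value repeats.
    if all(count == 1 for count in Counter(values).values()):
        return "identification"
    return None

print(heuristic_label(["12", "3.5"]))               # number
print(heuristic_label(["01-JAN-2005", "3/4/1999"]))  # date
print(heuristic_label(["P12345", "Q9XYZ1"]))         # identification
```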

Mining With Heuristics
- Use other token information
  - "biological sequence": length >45, or in 10's
- Use token sequence information
  - "people name": length (2~3), separator ("," or "and"), profile (not number, date)
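The two rules above are terse, so the following sketch fixes one possible reading of each. It assumes "in 10's" means alphabetic tokens of exactly 10 characters, and "length (2~3)" means each name consists of 2 to 3 tokens. Both readings and the function names are assumptions, not the authors' definitions.

```python
import re

def looks_like_sequence(value: str) -> bool:
    """True if every token is alphabetic and either longer than 45 characters
    or exactly 10 characters long (assumed meaning of "in 10's")."""
    tokens = value.split()
    return bool(tokens) and all(
        t.isalpha() and (len(t) > 45 or len(t) == 10) for t in tokens
    )

def looks_like_people_names(value: str) -> bool:
    """True if the value splits on ',' or 'and' into names of 2-3 tokens
    that contain no digits (so they cannot be numbers or dates)."""
    names = [n.strip() for n in re.split(r",|\band\b", value) if n.strip()]
    if not names:
        return False
    for name in names:
        if not 2 <= len(name.split()) <= 3:
            return False
        if any(ch.isdigit() for ch in name):
            return False
    return True

print(looks_like_sequence("acgtacgtac ggtacgtacg"))                          # True
print(looks_like_people_names("Ada Lovelace, Alan Turing and Grace Hopper"))  # True
print(looks_like_people_names("01-JAN-2005"))                                 # False
```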

Experimental Results
- Datasets
  - GenBank, UniProt SWISSPROT and Pfam
- Cutoff values
  - Cluster scores into groups most, middle and little by minimizing standard deviation
- Evaluation
  - Weighted Cohen's Kappa: compare groups most, middle and little with true labels Y (yes), P (partial) and N (no)
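Both the cutoff selection and the evaluation can be sketched briefly. The slides do not give the exact clustering objective or the kappa weighting scheme. The code below assumes a brute-force split of the sorted scores that minimizes the sum of within-group standard deviations, and linear disagreement weights for kappa. Function names and data are hypothetical.

```python
from itertools import combinations
from statistics import pstdev
from typing import List, Sequence, Tuple

def three_way_cutoffs(scores: Sequence[float]) -> Tuple[List[float], ...]:
    """Split sorted scores into (little, middle, most), minimizing the sum of
    within-group population standard deviations (assumed objective)."""
    s = sorted(scores)
    if len(s) < 3:
        raise ValueError("need at least three scores")
    best_cost, best_groups = None, None
    for i, j in combinations(range(1, len(s)), 2):
        groups = (s[:i], s[i:j], s[j:])
        cost = sum(pstdev(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups

def weighted_kappa(pred: Sequence[int], true: Sequence[int], k: int = 3) -> float:
    """Linear-weighted Cohen's kappa for ordinal categories 0..k-1.
    Raises ZeroDivisionError if the expected disagreement is zero."""
    n = len(pred)
    obs = [[0] * k for _ in range(k)]
    for p, t in zip(pred, true):
        obs[p][t] += 1
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    observed = sum(abs(i - j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(abs(i - j) * row[i] * col[j] / n
                   for i in range(k) for j in range(k))
    return 1.0 - observed / expected

little, middle, most = three_way_cutoffs([0.9, 0.05, 0.5, 0.95, 0.1, 0.55])
print(little, middle, most)

# Encode little/middle/most and N/P/Y as 0/1/2 before comparing.
print(weighted_kappa([2, 2, 1, 0, 0], [2, 2, 1, 0, 0]))  # 1.0
```

Summing raw standard deviations favors small groups on some inputs, so a size-weighted variance may behave better in practice. That choice is left open here.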

Results: Summary
[Table not recovered; rating levels shown: Very good, Good, Moderate]
Categories: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence

Results: Cellular Component (Ontology)

Results: Biology Database (Ontology)

Results: Free Text (Ontology)

Results: Molecule Type (Ontology)

Results: Organism Name (Ontology)

Results: Publication Method (Ontology)

Results: Date (Heuristics)

Results: ID (Heuristics)

Results: People Name (Heuristics)

Results: Number (Heuristics)

Results: Bio. Sequence (Heuristics)

Discussion: Hits and Misses
- According to Kappa tests, good or very good
- Possible improvements
  - Better clustering method
  - Bigger ontology database
  - More involved language analysis
  - Hybrid of bottom-up and top-down approaches
