Published by Irma Chandler. Modified over 9 years ago.
1
Assigning Schema Labels Using Ontology and Heuristics
Xuan Zhang, Ruoming Jin, Gagan Agrawal
2
Problem Statement
- Setting: An online integration system finds a new data file…
- Question: Can it be integrated into the system on the fly? How?
- Sub-tasks:
  - Understand the data
    - Talk to data host
    - Consult field expert
  - Process the data
    - Database administrator
    - Programmer
- Can we automate the process?
3
On-the-fly Integration Overview
- Steps
  1. Delimiter Identification (Ref [25], [26])
  2. Wrapper Generation (Ref [32])
  3. Schema Mining
- Pipeline: Dataset (flat file) -> Layout learning -> Layout descriptor -> Parser generation -> Parser -> Parsing -> Raw attribute values -> Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Labeling -> Labels
- Cutoff values used in labeling come from an expert or a clustering algorithm
4
Schema Mining
- Assign meaning (labels or names) to attributes in a data set
- Challenges
  - What
    - Delimiters
    - Values
  - How
    - Top-down
    - Bottom-up
5
Our Approach
- Summarize attribute values from the bottom up
- Similarity between ontology and schema
  - An attribute a with label att, a value v
  - Schema: “v is-a att”
  - Ontology: “Node(v) is a child of Node(att)”
  - E.g., protein is-a molecule type
- Common ancestor of values in ontology ~ attribute label in schema
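The common-ancestor idea on this slide can be illustrated with a minimal sketch. The toy ontology, the `PARENT` mapping and the function names below are hypothetical and chosen only for illustration; the slides do not show how the ontology is actually stored or traversed.

```python
# Hypothetical toy ontology stored as child -> parent (illustrative terms only).
PARENT = {
    "protein": "molecule type",
    "DNA": "molecule type",
    "mRNA": "RNA",
    "RNA": "molecule type",
    "molecule type": "root",
}


def ancestors(term):
    """Return the ancestors of `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def common_ancestor(values):
    """Return the nearest ancestor shared by all values, or None if there is none."""
    if not values:
        return None
    first, *rest = values
    other_ancestor_sets = [set(ancestors(v)) for v in rest]
    for candidate in ancestors(first):
        if all(candidate in s for s in other_ancestor_sets):
            return candidate
    return None


# Candidate label for an attribute whose values are these three terms.
label = common_ancestor(["protein", "DNA", "mRNA"])
```

Under these assumptions, the nearest shared ancestor of the observed values plays the role of the attribute label.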
6
Real-world Complications
- Complete comprehensive ontology database
  - Selective sampling
- Error-free dataset
  - Adjustable sensitivity and fault tolerance
- Time
  - Data mining + statistical analysis
- Remark: attribute label ≠ attribute name, e.g., date: {creation date, last modification date}
7
Outline
- Motivation
- System
- Mining algorithm
- Experiment
8
Schema Mining System
- System components: data cleaning and summarization, score function, ontology database
- Flow: Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Labeling -> Attribute labels
- Cutoff values used in labeling come from an expert or a clustering algorithm
9
Data Summarization
- Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  - E.g., Profile(“polyA_site”) = A_A
- Token category: word, number or other
- Frequent tokens
  - Approximate frequent token mining algorithm
  - Assumption: tokens are distributed evenly
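A minimal sketch of the token profile, consistent with the example Profile(“polyA_site”) = A_A: each run of letters collapses to A, each run of digits collapses to N, and special characters are kept as they are. The function names and the exact rule for the token category are assumptions; the slide gives only the three category names.

```python
import re


def token_profile(token):
    """Collapse letter runs to 'A' and digit runs to 'N'; keep other characters."""
    def collapse(match):
        return "A" if match.group(0)[0].isalpha() else "N"
    return re.sub(r"[A-Za-z]+|[0-9]+", collapse, token)


def token_category(token):
    """Assumed mapping from profile to category: word, number, or other."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in {"N", "N.N"}:
        return "number"
    return "other"


profile = token_profile("polyA_site")
```

Matching letters and digits in a single pass avoids a second substitution rewriting the `N` or `A` produced by the first.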
10
Template Scoring Function
- Desired properties
  - Simple
  - Adjustable trade-off between sensitivity and error tolerance
11
Ontology Database
- Goal: to approximate a complete, comprehensive ontology database
- Approach
  - “Complete”: sample popular terms
  - “Comprehensive”: public ontology databases + common facts
- Result
  - 6 major categories, 386 terms
12
Ontology Contents
- Sample existing databases
  - Organism name: NCBI Taxonomy
  - Cellular component: Gene Ontology
  - Publication method: NCBI Entrez Journals
- New categories
  - Biology database: popular database names
  - Molecule type: biology facts
  - Free text: common words in natural language
- Enhancement
  - + taxonomy hierarchy
  - + direct submission
13
Outline
- Motivation
- System
- Mining algorithm
  - Using ontology
  - Using heuristics
- Experiment
14
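Mining With Ontology
1. Occurrence of a term:
\[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[i] \\
\min_{i \in [0, t]} \mathrm{Frequent\_Counts}[i], & \text{if } term = \mathrm{Frequent\_Tokens}[0] \mid \dots \mid \mathrm{Frequent\_Tokens}[t] \\
0, & \text{otherwise}
\end{cases}
\]
2. Strength of a term, accumulated over its child terms in the ontology:
\[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum_{child\_term} \mathrm{Strength}(child\_term)
\]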
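The two definitions on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the frequent tokens and their counts are held in a dictionary, a multi-token term is read as a whitespace-separated sequence of frequent tokens, and the ontology is an acyclic mapping from a term to its child terms. All names and data below are hypothetical.

```python
def occurrence(term, frequent_counts):
    """Occurrence(term): count of a single frequent token, the minimum count
    over the tokens of a multi-token term, or 0 if any token is not frequent."""
    tokens = term.split()
    if not tokens or any(t not in frequent_counts for t in tokens):
        return 0
    return min(frequent_counts[t] for t in tokens)


def strength(term, children, frequent_counts):
    """Strength(term): own occurrence plus the strength of every child term.
    Assumes the ontology has no cycles."""
    total = occurrence(term, frequent_counts)
    for child in children.get(term, []):
        total += strength(child, children, frequent_counts)
    return total


# Hypothetical frequent-token counts and a two-level toy ontology.
counts = {"homo": 5, "sapiens": 3, "mus": 2}
ontology = {"organism": ["homo sapiens", "mus musculus"]}
organism_strength = strength("organism", ontology, counts)
```

In this toy case "homo sapiens" contributes min(5, 3) = 3, while "mus musculus" contributes 0 because "musculus" is not a frequent token.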
15
Mining With Ontology
- Likelihood of an attribute being labeled with l
- Factors:
  - Relative strength of term l compared with that of other terms
  - Completeness of the ontology
- Score = product of the two factors, modulated by the template scoring function
16
Mining With Heuristics
- Use token profile
  - “number”: {N, N.N}
  - “date”: {N-A-N, N/N/N}
- Use frequent token counts
  - “identification”: Frequent_Counts[] = 1
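These profile and count heuristics can be sketched as below. The profile sets are taken from the slide; the function names are hypothetical, and reading Frequent_Counts[] = 1 as "every value in the column occurs exactly once" is an assumption.

```python
import re
from collections import Counter

NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}


def token_profile(token):
    """Collapse letter runs to 'A' and digit runs to 'N'; keep other characters."""
    def collapse(match):
        return "A" if match.group(0)[0].isalpha() else "N"
    return re.sub(r"[A-Za-z]+|[0-9]+", collapse, token)


def looks_like_number(value):
    return token_profile(value) in NUMBER_PROFILES


def looks_like_date(value):
    return token_profile(value) in DATE_PROFILES


def looks_like_identification(values):
    """Assumed reading: no value in the column repeats."""
    counts = Counter(values)
    return bool(counts) and all(c == 1 for c in counts.values())
```

For example, `looks_like_date("12-JAN-2005")` and `looks_like_date("01/02/2005")` both hold because their profiles are N-A-N and N/N/N.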
17
Mining With Heuristics
- Use other token information
  - “biological sequence”: length > 45, or in 10’s
- Use token sequence information
  - “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
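One possible reading of these two heuristics is sketched below. Several details are assumptions rather than statements from the slides: "in 10’s" is read as every whitespace-separated block being exactly 10 characters long, a people-name value is required to contain at least one separator, and each name must have 2 to 3 tokens, none of which has a number or date profile. The function names are hypothetical.

```python
import re

# Number and date profiles, which a name token must not match.
NON_NAME_PROFILES = {"N", "N.N", "N-A-N", "N/N/N"}


def token_profile(token):
    """Collapse letter runs to 'A' and digit runs to 'N'; keep other characters."""
    def collapse(match):
        return "A" if match.group(0)[0].isalpha() else "N"
    return re.sub(r"[A-Za-z]+|[0-9]+", collapse, token)


def looks_like_sequence(value):
    """True if any block is longer than 45 characters or all blocks have length 10."""
    blocks = value.split()
    if not blocks:
        return False
    if any(len(b) > 45 for b in blocks):
        return True
    return all(len(b) == 10 for b in blocks)


def looks_like_people_names(value):
    """True if the value is a ',' / 'and' separated list of 2-3 token names."""
    names = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", value.strip())
    if len(names) < 2:
        return False
    for name in names:
        tokens = name.split()
        if not 2 <= len(tokens) <= 3:
            return False
        if any(token_profile(t) in NON_NAME_PROFILES for t in tokens):
            return False
    return True
```

A strict equality check on block length is deliberately simple; a real flat file may end a sequence with a shorter final block, which this sketch would reject.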
18
Experimental Results
- Datasets
  - GenBank, UniProt SWISSPROT and Pfam
- Cutoff values
  - Cluster scores into the groups most, middle and little by minimizing standard deviation
- Evaluation
  - Weighted Cohen’s Kappa: compare the groups most, middle and little with the true labels Y (yes), P (partial) and N (no)
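The cutoff selection and the evaluation measure can be sketched as follows. The slides do not state the exact clustering objective or the kappa weighting scheme, so both are assumptions here: the split of the sorted scores into three groups minimizes the sum of within-group population standard deviations, and the kappa uses linear disagreement weights over ordinal category indices (0, 1, 2 for little/middle/most or N/P/Y). All function names are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def three_way_groups(scores):
    """Split sorted scores into (little, middle, most) by exhaustive search,
    minimizing the summed within-group standard deviation (assumed objective)."""
    s = sorted(scores)
    if len(s) < 3:
        raise ValueError("need at least three scores")
    best_cost, best_groups = None, None
    for i, j in combinations(range(1, len(s)), 2):
        groups = (s[:i], s[i:j], s[j:])
        cost = sum(pstdev(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


def weighted_kappa(predicted, actual, k=3):
    """Linearly weighted Cohen's kappa for ordinal categories 0..k-1."""
    n = len(predicted)
    observed = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)          # linear disagreement weight
            disagreement += w * observed[i][j]
            expected += w * row[i] * col[j] / n
    if expected == 0:                          # only one category in use
        return 1.0
    return 1.0 - disagreement / expected


little, middle, most = three_way_groups([0.9, 0.1, 0.55, 0.12, 0.95, 0.5])
```

The exhaustive search is quadratic in the number of scores, which is acceptable for the handful of attributes in a single data file.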
19
Results: Summary
- Categories: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence
- Rating scale: Very good, Good, Moderate
20
Results: Cellular Component (Ontology)
21
Results: Biology Database (Ontology)
22
Results: Free Text (Ontology)
23
Results: Molecule Type (Ontology)
24
Results: Organism Name (Ontology)
25
Results: Publication Method (Ontology)
26
Results: Date (Heuristics)
27
Results: ID (Heuristics)
28
Results: People Name (Heuristics)
29
Results: Number (Heuristics)
30
Results: Bio. Sequence (Heuristics)
31
Discussion: Hits and Misses
- According to Kappa tests, results are good or very good
- Possible improvements
  - Better clustering method
  - Bigger ontology database
  - More involved language analysis
  - Hybrid of bottom-up and top-down approaches
32
Assigning Schema Labels Using Ontology and Heuristics