Published by Chloe Holmes. Modified over 9 years ago.
1
Controlled Vocabulary Working Group Activities 2005-2006
2
The Problem
► Inconsistent, disjunct and sparse keywords negatively impact data discovery
    72.2% of all keywords are used at only a single LTER site
    90% of all keywords are used at 4 or fewer LTER sites
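The percentages above can be reproduced from a list of (site, keyword) pairs. The following is a minimal sketch, not the working group's actual script: the input shape, the function names and the sample records are assumptions made for illustration, and keywords are compared case-insensitively.

```python
from collections import defaultdict


def site_coverage(records):
    """Map each keyword to the set of sites that use it.

    `records` is an iterable of (site, keyword) pairs (assumed input shape).
    """
    sites_by_keyword = defaultdict(set)
    for site, keyword in records:
        # Normalize case and whitespace so "Temperature" and "temperature" match.
        sites_by_keyword[keyword.strip().lower()].add(site)
    return sites_by_keyword


def share_used_at_most(sites_by_keyword, max_sites):
    """Fraction of distinct keywords used at `max_sites` or fewer sites."""
    if not sites_by_keyword:
        return 0.0
    hits = sum(1 for sites in sites_by_keyword.values() if len(sites) <= max_sites)
    return hits / len(sites_by_keyword)


# Illustrative records only; these are not real LTER data.
records = [
    ("AND", "temperature"),
    ("ARC", "Temperature"),
    ("AND", "oak"),
    ("BES", "stream chemistry"),
]
coverage = site_coverage(records)
single_site_share = share_used_at_most(coverage, 1)
four_or_fewer_share = share_used_at_most(coverage, 4)
```

Calling `share_used_at_most(coverage, 1)` and `share_used_at_most(coverage, 4)` on the full keyword set would give the two figures quoted on the slide.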
3
The Problem
► Good “Browse” interfaces require some organization of keywords
► E.g.
    BIOSPHERE
        PLANTS
            ► VASCULAR PLANTS
                OAK
            ► NON-VASCULAR PLANTS
        ANIMALS
            ► VERTEBRATES
            ► INVERTEBRATES
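A browse hierarchy like this one can be held as a nested mapping. The sketch below is illustrative only: the nesting reflects one reading of the slide's example, and `render` and `path_to` are hypothetical helper names.

```python
# Each term maps to a dict of its narrower terms; an empty dict marks a leaf.
HIERARCHY = {
    "BIOSPHERE": {
        "PLANTS": {
            "VASCULAR PLANTS": {"OAK": {}},
            "NON-VASCULAR PLANTS": {},
        },
        "ANIMALS": {
            "VERTEBRATES": {},
            "INVERTEBRATES": {},
        },
    }
}


def render(tree, depth=0):
    """Return the hierarchy as indented lines, one term per line."""
    lines = []
    for term, children in tree.items():
        lines.append("  " * depth + term)
        lines.extend(render(children, depth + 1))
    return lines


def path_to(tree, target, trail=()):
    """Return the list of terms from the root down to `target`, or None."""
    for term, children in tree.items():
        here = trail + (term,)
        if term == target:
            return list(here)
        found = path_to(children, target, here)
        if found is not None:
            return found
    return None


browse_lines = render(HIERARCHY)
oak_path = path_to(HIERARCHY, "OAK")
```

A browse interface could show `render` output as an expandable tree and use `path_to` to display the broader terms of any keyword a user lands on.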
4
Possible Solutions
1. Create an LTER Controlled Vocabulary or Thesaurus or Ontology
    Advantages:
    ► Absolute control on contents
    ► Ability to customize to meet LTER needs
    Disadvantages:
    ► Development will be expensive in time and resources
    ► Such development can be a highly technical field requiring specialists
5
Possible Solutions
2. Adopt an existing controlled vocabulary, thesaurus or ontology
    Advantages:
    ► Minimal cost to LTER
    ► Aids in linking LTER to a larger world of data systems
    Disadvantages:
    ► Lack of control
    ► Existing systems may not be suitable for LTER use
        They may lack desirable terms
6
2005 LTER IM Meeting
► At the 2005 IM meeting we decided that the best option to explore was Option 2 (use an existing resource)
    Rationale:
    ► Could potentially save lots of time, trouble and money!
    ► Helps forge links with other groups
    ► Could make LTER systems interact better with other similar systems
7
Plan of Action
8
General Steps
► Identify existing resources that LTER could use
    NBII Thesaurus
    GEMET (GEneral Multilingual Environmental Thesaurus)
    Global Change Master Directory (GCMD)
    SEEK Ontology
► Evaluate the usability of existing systems
► Develop tools and relationships needed to exploit and improve the system(s) of choice
9
Assembling Resources
► Assemble list of existing keywords
    EML
    ► Keywords
    ► Title words
    ► Attribute definition words
    ► Taxonomy keywords
    ITIS
    SPIRE web service from UMD.BaltCo....
    DTOC publications: titles, keywords and abstracts
    Site keyword lists, e.g., AND-LTER
    Need to count word and site frequency and number of keywords per document
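The counting step named above (word frequency, site frequency, and keywords per document) could be sketched as follows. This is an assumption-laden illustration rather than the group's actual tooling: the `(site, keywords)` input shape, the `tally` name and the sample documents are all hypothetical.

```python
from collections import Counter, defaultdict


def tally(documents):
    """Count uses per keyword, sites per keyword, and keywords per document.

    `documents` is an iterable of (site, [keywords]) pairs (assumed shape).
    """
    uses = Counter()
    sites = defaultdict(set)
    per_document = []
    for site, keywords in documents:
        # Lowercase and drop empty entries before counting.
        cleaned = [k.strip().lower() for k in keywords if k.strip()]
        per_document.append(len(cleaned))
        for keyword in cleaned:
            uses[keyword] += 1
            sites[keyword].add(site)
    site_counts = {keyword: len(s) for keyword, s in sites.items()}
    return uses, site_counts, per_document


# Illustrative documents only; not real LTER metadata.
documents = [
    ("AND", ["Temperature", "LTER"]),
    ("ARC", ["temperature"]),
    ("AND", ["LTER", "carbon", "temperature"]),
]
uses, site_counts, per_document = tally(documents)
```

Here `uses` gives the word frequency, `site_counts` the number of distinct sites per keyword, and `per_document` the number of keywords attached to each document.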
10
Some Statistics

Source                 Number of Terms   Number used at 5 or more sites   Most frequently used
EML Keywords           2,711             86                               LTER (1,002), Temperature (701)
EML Titles             2,480             921                              And (768), Data (394), LTER (350)
EML Attributes         6,318             436                              The (4,207), Data (1,621), Carbon (328)
DTOC Keywords          2,774             103                              ARC (1,645), Temperature (732)
Bibliography Titles    13,538            1,855                            Of (12,611), Forest (2,050)
11
Consolidated List
► The consolidated list includes 21,153 words or terms along with
    Number of “lists” on which it appeared (max 5)
    Number of sites and uses from each list
    Max and Min number of sites using (0-26)
    Max and Min number of uses (0-12,611)
    Is it a multi-word term?
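One way to build such a consolidated record per term is sketched below. It assumes each source list maps a term to a `(sites, uses)` pair and that a term missing from a list counts as 0 sites and 0 uses there, which would be consistent with the minimum of 0 reported above. The field names and the sample numbers are hypothetical.

```python
def consolidate(lists):
    """Merge per-list statistics into one summary row per term.

    `lists` maps a list name to {term: (n_sites, n_uses)} (assumed shape).
    A term absent from a list is treated as (0, 0) for that list.
    """
    terms = set()
    for entries in lists.values():
        terms.update(entries)

    rows = {}
    for term in terms:
        stats = [entries.get(term, (0, 0)) for entries in lists.values()]
        site_counts = [n_sites for n_sites, _ in stats]
        use_counts = [n_uses for _, n_uses in stats]
        rows[term] = {
            "n_lists": sum(1 for entries in lists.values() if term in entries),
            "max_sites": max(site_counts),
            "min_sites": min(site_counts),
            "max_uses": max(use_counts),
            "min_uses": min(use_counts),
            "multi_word": len(term.split()) > 1,
        }
    return rows


# Made-up numbers for illustration; not taken from the LTER lists.
lists = {
    "eml_keywords": {"temperature": (20, 70), "primary production": (6, 40)},
    "dtoc_keywords": {"temperature": (18, 73)},
}
rows = consolidate(lists)
```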
12
Ranking/Rating Words
► Terms were sorted by:
    Number of lists
    Max. number of sites on any single list
    Min. number of sites on any single list
    Number of uses
► The top 1010 terms were then rated as “useful” (U), “marginal/not sure” (M) or “not useful” (N) by volunteers
    Rating was needed for abbreviations (e.g., CO2) and words that are too general (e.g., “Above”, “Total”)
    The resulting list was then additionally sorted by a term score, where U, M and N are the number of ratings of each kind:
        T = ((U*1) + (M*0) + (N*-1)) / (U+M+N)
    Always “Useful” = 1.00, always “Not Useful” = -1.00
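The term score can be computed directly from the rating counts. The sketch below assumes U, M and N are the numbers of volunteers who gave each rating; the function names and the sample ratings are illustrative, not taken from the working group's data.

```python
def term_score(useful, marginal, not_useful):
    """T = ((U*1) + (M*0) + (N*-1)) / (U+M+N), ranging from -1.0 to 1.0."""
    total = useful + marginal + not_useful
    if total == 0:
        raise ValueError("term has no ratings")
    # Marginal ratings contribute 0 to the numerator but still count in the total.
    return (useful - not_useful) / total


def rank(ratings):
    """Return terms ordered from highest to lowest score.

    `ratings` maps a term to its (U, M, N) counts (assumed shape).
    """
    return sorted(ratings, key=lambda term: term_score(*ratings[term]), reverse=True)


# Hypothetical rating counts for illustration.
ratings = {
    "temperature": (3, 0, 0),
    "total": (0, 0, 3),
    "co2": (1, 2, 0),
}
ordered = rank(ratings)
```

Because marginal votes dilute the score without changing its sign, a term rated mostly “marginal” lands near 0, between the clearly useful and clearly not useful terms.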
13
Top of the list
14
Bottom of the list
15
Preliminary Evaluation
► Volunteers have used highly ranked words from the “list of 1000” to test retrieval from various thesauri
    So far NBII seems to be preferred, but we need additional testers
► Inigo San Gil has been working on automated queries of the NBII Thesaurus
16
Tasks for this meeting
► Once we have a controlled vocabulary, how shall we use it?
    What tools do we need to develop?
► What additional testing/evaluation is required (bring in PIs?)?
    What institutional relationships need to be pursued?
    What actions do we need to take to improve the usability of resources for LTER use?