Published by Bethany Armstrong. Modified over 9 years ago.

BUILDING NANOBANK
Data Structure and Selection Criteria
Jason Fong and Emre Uyar
University of California, Los Angeles

What is Nanobank?
Nanobank is a collection of observations from various sources (scientific articles, patents, and government grants) determined to be related to the nanotechnology field, either by probabilistic information retrieval (IR) methods or by being declared nano by a source authority.

Data Sources - Articles
580,711 scientific articles from peer-reviewed journals.
Source: Science Citation Index, Arts & Humanities Citation Index, and Social Sciences Citation Index of the Institute for Scientific Information Inc. (ISI®).
Together, these indexes contain more than 24,250,000 entries from over 8,700 peer-reviewed scientific journals.

Data Sources – Patents and Grants
240,437 patents from the U.S. Patent and Trademark Office's (USPTO) online database of more than 4,000,000 patents granted from 1976 to 2006.
52,831 grants from NIH and NSF databases.

Data Contents
Articles
◦ Titles
◦ Journal volume and issue numbers
◦ Publication years
◦ Author names
◦ Names and addresses of organizations affiliated with authors

Data Contents
Patents
◦ Titles and abstracts
◦ Application and grant dates
◦ Names and addresses of inventors and assignees
◦ U.S. and international patent classifications

Data Contents
Grants
◦ Titles and abstracts
◦ Receiving organization names and addresses
◦ PI and co-PI names
◦ Grant amounts

Nanobank Data Structure
Internal database
– Stored in a relational database
– Separate tables for various data items
– ID numbers for each item link the tables
Version posted on Nanobank.org
– Denormalized form of the internal database
– Storing redundant data is less space-efficient, but lessens the need to join multiple tables
– The Nanobank Codebook contains detailed information on the tables and fields available in each
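
The trade-off between the normalized internal form and the denormalized public form can be illustrated with a small sketch. The table and column names below (article, organization, article_org, article_flat) are hypothetical and are not taken from the Nanobank Codebook; the sketch only shows how ID-linked tables can be joined once into a flat table that repeats article data for every affiliated organization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized form: separate tables linked by ID numbers (hypothetical schema).
    CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT, pub_year INTEGER);
    CREATE TABLE organization (org_id INTEGER PRIMARY KEY, org_name TEXT, city TEXT);
    CREATE TABLE article_org (article_id INTEGER, org_id INTEGER);

    INSERT INTO article VALUES (1, 'Example title A', 2001), (2, 'Example title B', 2003);
    INSERT INTO organization VALUES (10, 'Example Univ', 'Los Angeles'),
                                    (11, 'Example Corp', 'San Jose');
    INSERT INTO article_org VALUES (1, 10), (1, 11), (2, 10);

    -- Denormalized form: one wide table, article fields repeated per organization.
    CREATE TABLE article_flat AS
    SELECT a.article_id, a.title, a.pub_year, o.org_name, o.city
    FROM article a
    JOIN article_org ao ON ao.article_id = a.article_id
    JOIN organization o ON o.org_id = ao.org_id;
""")

# Users of the flat table can query it without writing any joins.
flat_rows = conn.execute(
    "SELECT article_id, title, org_name FROM article_flat "
    "ORDER BY article_id, org_name"
).fetchall()
for row in flat_rows:
    print(row)
```

Article 1 appears twice in article_flat, once per organization, which is the redundancy the slide refers to.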

Document Selection
Document selection methods
◦ Keywords
◦ Probabilistic
◦ Authority-selected
Tables include fields to indicate the selection method:
◦ “nanobank_flag” = 1 if selected by Keywords or Probabilistic; 0 otherwise
◦ “authority_flag” = 1 if Authority-selected; 0 otherwise

Document Selection: Keywords
Search for text patterns matching words or phrases related to nanotechnology
Words and phrases chosen by subject specialists
Less effective for identifying very early or recent documents
– Early documents were written before the terms were in common usage
– Recent documents use terms that are too new to be included in the search patterns
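
A minimal sketch of keyword selection, assuming regular-expression patterns. The three patterns below are invented for illustration; the actual words and phrases chosen by the subject specialists are not listed in these slides.

```python
import re

# Hypothetical patterns, not the specialists' actual list.
KEYWORD_PATTERNS = [
    r"\bnano\w*",
    r"\bquantum dots?\b",
    r"\batomic force microscop\w*",
]
KEYWORD_RE = re.compile("|".join(KEYWORD_PATTERNS), re.IGNORECASE)


def matches_keywords(text: str) -> bool:
    """Return True if any keyword pattern occurs in the text."""
    return KEYWORD_RE.search(text) is not None


print(matches_keywords("Carbon nanotube growth"))          # matches the first pattern
print(matches_keywords("Growth of quantum dots on GaAs"))  # matches the second pattern
print(matches_keywords("Optical properties of colloidal gold sols"))  # no pattern matches
```

The last title stands in for an early document: it may well be relevant, but it predates the vocabulary in the pattern list, so a fixed keyword search misses it.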

Document Selection: Probabilistic
Incorporates new terms as they come into common usage
Uses the Xapian search engine library to perform ranking calculations
Analyzes document text and ranks against a set of query terms

Document Selection: Probabilistic
Initial query terms from the Virtual Journal of Nanoscale Science & Technology (VJNano):
◦ All articles in VJNano assumed to be relevant
◦ Select highest ranked terms
Document selection process:
◦ Use initial query terms to select relevant documents from all journal articles
◦ Select additional terms from those relevant documents and add to query
◦ Repeat selection with expanded query terms
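
The select-expand-repeat loop above can be sketched in plain Python. This is a simplified stand-in, not the Nanobank procedure: it does not call Xapian, and the term score used here (document frequency in the relevant set, squared, divided by overall document frequency) is an assumption chosen for brevity rather than Xapian's probabilistic weighting. All function names and the toy documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; returns the set of distinct terms."""
    return set(text.lower().split())


def top_terms(relevant_docs, all_docs, exclude, k):
    """Rank terms that are frequent in relevant_docs and rare elsewhere.

    Assumes relevant_docs is a subset of all_docs, so all_df[t] >= 1.
    """
    rel_df = Counter(t for d in relevant_docs for t in tokenize(d))
    all_df = Counter(t for d in all_docs for t in tokenize(d))
    scored = {t: rel_df[t] ** 2 / all_df[t] for t in rel_df if t not in exclude}
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def select(docs, query, min_hits=1):
    """Keep documents that contain at least min_hits query terms."""
    return [d for d in docs if len(tokenize(d) & query) >= min_hits]


def expand_and_select(seed_docs, corpus, k=1, rounds=1):
    # Initial query: highest ranked terms from the seed set (all assumed relevant).
    query = set(top_terms(seed_docs, seed_docs + corpus, set(), k))
    selected = select(corpus, query)
    for _ in range(rounds):
        # Add the best new terms from the selected documents, then reselect.
        new_terms = top_terms(selected, corpus, query, k)
        if not new_terms:
            break
        query |= set(new_terms)
        selected = select(corpus, query)
    return query, selected


seed_docs = ["nanotube synthesis", "nanotube transistor", "nanotube film"]
corpus = [
    "nanotube fullerene growth",
    "nanotube fullerene chemistry",
    "fullerene derivatives",
    "medieval poetry",
]
query, selected = expand_and_select(seed_docs, corpus)
print(sorted(query))
print(selected)
```

In this toy run the seed set contributes "nanotube", the first selection surfaces "fullerene" as an additional term, and the repeated selection then picks up "fullerene derivatives", a document that the initial query alone would have missed.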

Document Selection: Authority Set
Articles – Listed in the Virtual Journal of Nanoscale Science & Technology
Patents – Listed under United States Patent Classification Class 977 (Nanotechnology)
NSF Grants – Program name contains “nano”
NIH Grants – NIH descriptive tag contains “nano”
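
The four authority rules can be written as one function that produces the authority_flag value. The record layout and field names (source, in_vjnano, uspc_classes, program_name, descriptive_tag) are assumptions made for this sketch, and the case-insensitive substring match is also an assumption; the slides do not say how the "nano" test was implemented.

```python
def authority_flag(record: dict) -> int:
    """Return 1 if a source authority declares the record nano, else 0."""
    source = record.get("source")
    if source == "article":
        # Listed in the Virtual Journal of Nanoscale Science & Technology.
        return int(bool(record.get("in_vjnano", False)))
    if source == "patent":
        # Classified under USPC Class 977 (Nanotechnology).
        return int("977" in record.get("uspc_classes", []))
    if source == "nsf_grant":
        return int("nano" in record.get("program_name", "").lower())
    if source == "nih_grant":
        return int("nano" in record.get("descriptive_tag", "").lower())
    return 0


print(authority_flag({"source": "patent", "uspc_classes": ["257", "977"]}))
print(authority_flag({"source": "nsf_grant", "program_name": "Nanoscale Science"}))
print(authority_flag({"source": "nih_grant", "descriptive_tag": "oncology"}))
```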

GEOCODING
Reconciling the differing naming conventions used in different sources.
Reconciling non-uniform ways of recording observations.
Correcting common mistakes.
For US observations: providing grouping units (other than city and state) that are not available in the original data sources, such as counties and BEA areas.

COUNTRY GEOCODING
Country names in all observations are cleaned, standardized, and assigned a two-letter ISO code.
The current ISO list of countries is taken as the basis; historical entries are assigned to the closest current country to the extent possible.

US GEOCODING
US observations are those in the 50 US states, DC, and 7 US associated areas.
Cities, states, counties, and BEA economic areas are coded using “Populated Places” data obtained from the FIPS 55 database and BEA.
The basis is the city-state combination.
City names are standardized and matched to the names in the FIPS database on a state-by-state basis.
In articles, 99.98% of US observations have been assigned a definite city-state code.

US GEOCODING: Variables Created
1. Standard_city_name: Standardized name as it appears in the FIPS database (corrected for misspellings, abbreviations, etc.)
2. State_code: 2-digit numeric code.
3. City_code: 5-digit numeric code, unique within a state.
4. County_code: 5-digit numeric code.
5. County_name
City code + state code uniquely determine a populated place.
Numeric codes are the same as the codes used by FIPS.
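
Because City_code is unique only within a state, a lookup needs both codes. A minimal sketch of building such a composite key follows; the function name place_key and the sample entries are illustrative and are not taken from the Nanobank tables.

```python
def place_key(state_code: int, city_code: int) -> str:
    """Combine a 2-digit state code and a 5-digit city code into one key."""
    if not (0 <= state_code <= 99 and 0 <= city_code <= 99999):
        raise ValueError("code out of range")
    # Zero-pad so that, for example, state 6 becomes "06".
    return f"{state_code:02d}{city_code:05d}"


# Illustrative entries only: the same city code under two different
# state codes yields two distinct populated places.
places = {
    place_key(6, 44000): "Los Angeles, CA",
    place_key(48, 44000): "Hypothetical place in another state",
}
print(place_key(6, 44000))
print(len(places))
```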

GEOCODING – US BEA Areas
The Bureau of Economic Analysis (BEA) defined 179 Economic Areas in the US; each county is assigned to exactly one BEA Economic Area.
BEA_code: 3-digit numeric code that identifies the associated BEA Economic Area for each observation.
"BEA's economic areas define the relevant regional markets surrounding metropolitan or micropolitan statistical areas. They consist of one or more economic nodes - metropolitan or micropolitan statistical areas that serve as regional centers of economic activity and the surrounding counties that are economically related to the nodes. The economic areas were redefined on November 17, 2004, and are based on commuting data from the 2000 decennial population census, on redefined statistical areas from OMB (February 2004), and on newspaper circulation data from the Audit Bureau of Circulations for 2001."

ORGANIZATION CODES
Each observation is assigned an alphanumeric code.
The two-letter alphabetic part determines the organization type.
The numeric part groups names that are the same after standardization and hand cleaning.

First 2 letters    Organization type
FI                 Firm
UN                 University
NL                 National Lab
RI                 Research Inst
UG                 US Government
HO                 Hospital
AS                 Academy of Sciences
NO                 No Organization
SC                 School
OT                 Other
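
A sketch of splitting an organization code into its type and group number. The slides do not show a sample code, so the layout assumed here (two uppercase letters immediately followed by digits, such as "UN00042") is a guess made for illustration; the type table itself is taken from the slide.

```python
import re

ORG_TYPES = {
    "FI": "Firm",
    "UN": "University",
    "NL": "National Lab",
    "RI": "Research Inst",
    "UG": "US Government",
    "HO": "Hospital",
    "AS": "Academy of Sciences",
    "NO": "No Organization",
    "SC": "School",
    "OT": "Other",
}

# Assumed layout: two letters, then one or more digits.
ORG_CODE_RE = re.compile(r"^([A-Z]{2})(\d+)$")


def parse_org_code(code: str):
    """Return (organization type, group number) for a code like 'UN00042'."""
    match = ORG_CODE_RE.match(code.strip().upper())
    if match is None or match.group(1) not in ORG_TYPES:
        raise ValueError(f"unrecognized organization code: {code!r}")
    return ORG_TYPES[match.group(1)], int(match.group(2))


print(parse_org_code("UN00042"))
print(parse_org_code("fi17"))
```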

Organization Codes: Types of Cleaning
1. Standardization of common identifiers:
◦ IBM = IBM Corp. = IBM Corporation
◦ Univ = University = University of = Universidade = Universidad = Univerzitet = Universita = Universitat = Universiti = Universite = Universitet = Universiteit
2. Using lookup tables and hand cleaning to identify common variants (and misspellings) of names used by the same organization:
◦ IBM = Int Buisness Machines = International Business Machines Corporation = Int Business Machines Operation
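
The two cleaning steps can be sketched as a rule-based pass followed by a lookup table. The regular expressions and the contents of VARIANTS below are built only from the examples on this slide; the real lookup tables and hand-cleaning rules are not shown, so this is an illustration of the approach rather than the Nanobank implementation.

```python
import re

# Step 1 rules: drop a trailing corporate identifier and map the listed
# spellings of "university" to one token. "university of" comes first so
# that the "of" is consumed along with the word.
CORP_RE = re.compile(r"\s+(?:corporation|corp)\.?$", re.IGNORECASE)
UNIV_RE = re.compile(
    r"\b(?:university of|university|universidade|universidad|univerzitet|"
    r"universita|universitat|universiti|universiteit|universitet|"
    r"universite|univ)\b\.?",
    re.IGNORECASE,
)

# Step 2 lookup table: known variants and misspellings (from the slide).
VARIANTS = {
    "INT BUISNESS MACHINES": "IBM",
    "INTERNATIONAL BUSINESS MACHINES": "IBM",
    "INT BUSINESS MACHINES OPERATION": "IBM",
}


def standardize(name: str) -> str:
    """Apply identifier standardization, then the variant lookup."""
    cleaned = CORP_RE.sub("", name.strip())
    cleaned = UNIV_RE.sub("UNIV", cleaned)
    cleaned = " ".join(cleaned.upper().split())  # uppercase, collapse spaces
    return VARIANTS.get(cleaned, cleaned)


for raw in ["IBM Corp.", "Int Buisness Machines",
            "International Business Machines Corporation",
            "University of California", "Univ. California"]:
    print(raw, "->", standardize(raw))
```

Names that reduce to the same string after both steps would then share the numeric part of the organization code.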