LIS 7450, Searching Electronic Databases
Basic: Database Structure & Database Construction
Dialog: Database Construction for Dialog (FYI)
Deborah A. Torres
Database Structure
Organization of data elements and records
Database Record
Record – the basic unit of information in a database (file).
Example: a bibliographic record contains descriptive information, e.g. author, title, publisher.
Fields
Field – a distinct part or section of a record (a unit of information within the record).
Example of personnel record fields: employee’s name, special identifier number, address, date of hire, etc.
Field Design Decisions
For each field, decide:
What information is placed within that field, and in what format (text, numeric)?
Should there be subfields within a field?
What should the fields be called? Field codes (abbreviations, numbering)
In what order should the fields appear?
Example: MARC Record (a type of record you should be familiar with)
Record Fields & Codes
The 100 field contains author information.
The 245 field contains main title information.
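As a rough illustration of field codes, a record can be modeled as a mapping from code to value. This is a simplified sketch, not a real MARC parser: only the codes 100 and 245 come from the slide, while the sample values, the `FIELD_LABELS` table, and the `describe` helper are hypothetical. Actual MARC fields also carry indicators and subfields, which are omitted here.

```python
# Only the codes "100" (author) and "245" (main title) come from the slide.
# FIELD_LABELS, the sample values, and describe() are made up for illustration.
FIELD_LABELS = {"100": "author", "245": "main title"}

record = {
    "245": "An example title",
    "100": "Doe, Jane",
}

def describe(rec):
    """Return (label, value) pairs ordered by field code."""
    return [(FIELD_LABELS.get(code, "unknown"), value)
            for code, value in sorted(rec.items())]

for label, value in describe(record):
    print(f"{label}: {value}")
```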
Other Design Decisions
Hyphenated words: home-school
Stop words: high-frequency words not useful for searching
Single words and phrases: library, library science, color of money
Alternative spellings of words: color, colour
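To make two of these decisions concrete, the sketch below shows one possible policy, among many, for hyphenated words and alternative spellings. The policy itself, the `SPELLING_VARIANTS` table, and the `index_forms` function are assumptions for illustration; real databases differ in how they handle each case.

```python
# Hypothetical variant table: maps an alternative spelling to the preferred one.
SPELLING_VARIANTS = {"colour": "color"}

def index_forms(word):
    """Return the forms under which a word would be indexed.

    Assumed policy: a hyphenated word is indexed as written, as its
    separate parts, and as one closed-up word; alternative spellings
    are mapped to a single preferred spelling.
    """
    word = word.lower()
    forms = {word}
    if "-" in word:
        parts = word.split("-")
        forms.update(parts)
        forms.add("".join(parts))
    return sorted({SPELLING_VARIANTS.get(form, form) for form in forms})

print(index_forms("Home-school"))
print(index_forms("Colour"))
```

Under this policy a search for home, school, homeschool, or home-school would all reach the same record; a database that indexes only the hyphenated form would require the searcher to type it exactly.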
Types of Databases
Bibliographic – references and abstracts of published documents.
Full-text – the complete text of articles, dictionary entries, codes of law, or other such documents.
Directory – factual information about organizations, companies, products, people, or materials.
Types of Databases
Numeric – data in a tabular or statistically manipulated form, often with some added text.
Hybrid – a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.
Database Construction
Basic steps for automatic indexing of text documents
Six Basic Steps
Step 1: Parse text into words
Step 2: Compare to stoplist and eliminate stop words
Step 3: Stem content words (reduce to root words); skip this step if you decide not to stem
Step 4: Count stemmed word occurrences
Step 5: Create union list of terms
Step 6: Create a data structure for a specific retrieval technique (e.g. an inverted file)
Example: Simple Set of 5 One-Sentence Documents (“D” stands for document)
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
Step 1: Parse Text into Words
D1: it is a dog eat dog world
D2: while the world sleeps
D3: let sleeping dogs lie
D4: I will eat my hat
D5: my dog wears a hat
Note: Some databases remove punctuation from words, such as the apostrophe in possessives; others preserve it. What difference would this make?
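A minimal sketch of the parsing step, assuming words are runs of letters and everything is lowercased (so "I" becomes "i", unlike the slide). The function names are hypothetical. The second function keeps an apostrophe inside a word, which illustrates the question in the note: the two choices index a possessive differently.

```python
import re

DOCS = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}

def parse(text):
    """Lowercase the text and keep runs of letters; all punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())

def parse_keeping_apostrophes(text):
    """Same as parse(), but an apostrophe inside a word is preserved."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

PARSED = {doc_id: parse(text) for doc_id, text in DOCS.items()}

print(PARSED["D1"])
print(parse("the dog's hat"))                      # possessive split in two
print(parse_keeping_apostrophes("the dog's hat"))  # possessive kept whole
```

With punctuation removed, a search for dog matches "dog's" but the index also gains a stray "s"; with it preserved, "dog's" is a separate term that a search for dog will miss unless stemming handles it.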
Step 2: Eliminate Stop Words
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat
Stop words are content-free words: those not useful in determining the content of the document. Examples: pronouns (I, my), prepositions (of, by, on), articles and determiners (a, the, this)
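Stop word removal can be sketched as a simple filter against a stoplist. The stoplist below is an assumption: it contains just the words needed to reproduce the slide's result, and a real database would use its own, usually longer, list.

```python
# Hypothetical stoplist, chosen only to reproduce the slide's output.
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}

PARSED = {
    "D1": ["it", "is", "a", "dog", "eat", "dog", "world"],
    "D2": ["while", "the", "world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["i", "will", "eat", "my", "hat"],
    "D5": ["my", "dog", "wears", "a", "hat"],
}

def remove_stop_words(words, stoplist=STOPLIST):
    """Keep only the words that are not on the stoplist, preserving order."""
    return [word for word in words if word not in stoplist]

FILTERED = {doc_id: remove_stop_words(words) for doc_id, words in PARSED.items()}

for doc_id, words in FILTERED.items():
    print(doc_id, " ".join(words))
```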
Step 3: Stemming (remember, not all databases stem words)
Before stemming:
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat
After stemming:
D1: dog eat dog world
D2: world sleep
D3: let sleep dog lie
D4: eat hat
D5: dog wear hat
Types of Stemming Decisions
No stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture (each form is kept as a separate term)
Weak stemming: inflections (-s, -es, -ed, -ing, -’s)
Strong stemming: derivations (-tion, -ly, -ally)
Stemming reduces words to a root variant; there are different stemming algorithms.
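Weak stemming can be approximated by stripping the inflectional suffixes listed above. The sketch below is deliberately naive and is not any particular published stemming algorithm: the `weak_stem` name, the suffix order, and the minimum stem length of 3 are assumptions, and it does not repair spelling changes (for example, "running" would become "runn").

```python
# Inflectional suffixes from the slide, tried in this (assumed) order.
SUFFIXES = ("ing", "'s", "es", "ed", "s")
MIN_STEM_LENGTH = 3  # assumed guard so very short words are left alone

def weak_stem(word):
    """Strip the first matching inflectional suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

family = ["contract", "contracts", "contracted", "contracting",
          "contractor", "contraction", "contractual", "contracture"]
print([weak_stem(word) for word in family])
print([weak_stem(word) for word in ["sleeps", "sleeping", "dogs", "wears", "lie"]])
```

The inflected forms collapse to "contract", while the derived forms (contractor, contraction, contractual, contracture) stay distinct; merging those would take strong stemming.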
A bit more about stemming for searching…
Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered.
Example: if you entered computer, the database would also search for computing, computers, computation, etc.
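The behavior described above can be sketched as query expansion over the database's vocabulary. Everything here is hypothetical: the `STEM_TABLE` lookup stands in for whatever stemming algorithm a given database uses, and the `exact` flag stands in for whatever syntax the database provides to switch stemming off.

```python
# Hypothetical stem lookup standing in for a real stemming algorithm.
STEM_TABLE = {
    "computer": "comput",
    "computers": "comput",
    "computing": "comput",
    "computation": "comput",
}

def stem(word):
    """Look up the stem; words not in the table are their own stem."""
    return STEM_TABLE.get(word, word)

def expand(query, vocabulary, exact=False):
    """Return the indexed words a query would match, with or without stemming."""
    if exact:
        return [query] if query in vocabulary else []
    target = stem(query)
    return sorted(word for word in vocabulary if stem(word) == target)

vocabulary = {"computer", "computers", "computing", "computation", "compiler"}
print(expand("computer", vocabulary))
print(expand("computer", vocabulary, exact=True))
```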
Step 4: Sort Words, Count Duplicates
Sort into alphabetical order:
D1: dog dog eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
Count any duplicates:
D1: dog(2) eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
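Sorting and counting can be sketched with Python's `collections.Counter`. The `term_counts` function name is an assumption; the input is the stemmed output of Step 3.

```python
from collections import Counter

STEMMED = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

def term_counts(words):
    """Return (term, count) pairs in alphabetical order."""
    return sorted(Counter(words).items())

for doc_id, words in STEMMED.items():
    # Show a count only when a term occurs more than once, as on the slide.
    shown = [term if n == 1 else f"{term}({n})" for term, n in term_counts(words)]
    print(doc_id, " ".join(shown))
```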
Step 5: Create Union List of Unique Terms
Unsorted list: dog eat world sleep world dog let lie sleep eat hat dog hat wear
Sorted list: dog dog dog eat eat hat hat let lie sleep sleep wear world world
Sorted, unique list: dog eat hat let lie sleep wear world
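The union list is the set union of every document's terms, sorted. A minimal sketch, with `union_list` as an assumed name:

```python
DOC_TERMS = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}

def union_list(docs):
    """Combine the terms of all documents, drop duplicates, and sort."""
    return sorted(set().union(*docs.values()))

print(union_list(DOC_TERMS))
```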
Step 6: Create Inverted Index (inverted file)
Union list (unique terms): dog eat hat let lie sleep wear world
Inverted index (pointers to the documents in which each word occurs):
dog: D1 D3 D5
eat: D1 D4
hat: D4 D5
let: D3
lie: D3
sleep: D2 D3
wear: D5
world: D1 D2
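An inverted index can be sketched as a dictionary from each term to the list of documents containing it. The function names are assumptions, and `search_all` is an added illustration of why the structure is useful: a Boolean AND search becomes an intersection of posting lists.

```python
STEMMED = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs in which it occurs."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):  # set(): list each document once per term
            index.setdefault(term, []).append(doc_id)
    return dict(sorted(index.items()))

def search_all(index, *terms):
    """Return the documents that contain every one of the given terms."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

INDEX = build_inverted_index(STEMMED)
for term, doc_ids in INDEX.items():
    print(f"{term}: {' '.join(doc_ids)}")
print(search_all(INDEX, "dog", "hat"))
```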
Dialog Database Construction
FYI: for those interested in Dialog
Dialog Database Construction
Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records.
Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH
Dialog Database Construction
Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms.
Step 4: Create the Additional Indexes: all terms from all remaining fields.
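The four steps can be sketched roughly as follows. This illustrates the idea only and is not Dialog's actual implementation: the sample records, the choice of TI as the only subject-related field, the word-position labels, and the `build_indexes` function are all assumptions. The field codes AU and TI and the stop word list are taken from the slides.

```python
# Stop words as listed on the slide.
STOP_WORDS = {"an", "for", "the", "and", "from", "to", "by", "with"}
# Assumption: only the title field counts as subject-related here.
SUBJECT_FIELDS = {"TI"}

# Hypothetical records "received from the Information Provider".
received = [
    {"AU": "Doe, Jane", "TI": "Indexing for the Web"},
    {"AU": "Roe, Richard", "TI": "Searching with Indexes"},
]

def build_indexes(records):
    """Return (basic_index, additional_indexes) for a linear file of records."""
    basic, additional = {}, {}
    # Step 1: sequential accession numbers, starting at 1.
    for accession, record in enumerate(records, start=1):
        # Step 2: each value is already labeled with its field code.
        for field, value in record.items():
            if field in SUBJECT_FIELDS:
                # Step 3: word-index subject fields, labeling each word with
                # its field and position, and skipping stop words.
                for position, word in enumerate(value.lower().split(), start=1):
                    if word not in STOP_WORDS:
                        basic.setdefault(word, []).append((accession, field, position))
            else:
                # Step 4: remaining fields go into the additional indexes.
                additional.setdefault((field, value), []).append(accession)
    return basic, additional

BASIC, ADDITIONAL = build_indexes(received)
print(BASIC)
print(ADDITIONAL)
```

Note that stop words still occupy a position (the word "web" is labeled as word 4 of the title), which is one way a system could keep positional information intact for phrase or proximity searching.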