LIS618 lecture 4 Thomas Krichel 2004-02-28. Structure Document preprocessing Practice: Nexis.


LIS618 lecture 4 Thomas Krichel

Structure
–Document preprocessing
–Practice: Nexis

document preprocessing
There are some operations that may be done to the documents before indexing:
–lexical analysis
–stemming of words
–elimination of stop words
–selection of index terms
–construction of term categorization structures
We will look at these in turn. In many cases, document preprocessing is not well documented by the provider, but searchers need to be aware of it.

lexical analysis
Lexical analysis divides a stream of characters into a stream of words. This seems easy enough, but:
–should we keep numbers?
–hyphens: compare "state-of-the-art" with "b-52"
–removal of punctuation, but consider "The battle of Cannae took place in 216 B.C. It was a bad defeat for the Romans."
–casing: compare "bank" and "Bank"
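
The choices listed above can be sketched as a small tokenizer. This is only an illustration, not how any particular provider does it; the function name `tokenize` and its two switches are assumptions made for the example.

```python
import re

def tokenize(text, keep_numbers=True, lowercase=True):
    """Split a character stream into words (a deliberately simple sketch)."""
    if lowercase:
        text = text.lower()  # folds "Bank" and "bank" into the same word
    # A word is a run of letters or digits, optionally joined by internal hyphens,
    # so "state-of-the-art" and "b-52" each stay one word.
    tokens = re.findall(r"[^\W_]+(?:-[^\W_]+)*", text)
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("The state-of-the-art B-52 flew in 1955."))
print(tokenize("It took place in 216 B.C. It was a bad defeat."))
```

The second call shows the punctuation problem: once the dots are removed, "B.C." falls apart into two one-letter words and the sentence boundary is lost.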

stemming
–In general, users search for the occurrence of a term irrespective of grammar.
–Plurals, gerund forms and past tenses can be subject to stemming.
–An important algorithm is by Porter; it is applicable to English only.
–Evidence about the effect of stemming on information retrieval is mixed.
–Stemming is relatively rare these days.
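
To make the idea concrete, here is a naive suffix stripper. It is explicitly not the Porter algorithm, which has many more rules; the suffix list and the minimum stem length of three characters are assumptions chosen for the example.

```python
def naive_stem(word):
    """Strip a few common English suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three remaining characters so that
        # short words such as "is" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("searching", "searched", "searches", "documents", "red"):
    print(w, "->", naive_stem(w))
```

Even this toy version maps plural, gerund and past tense forms of "search" to one index term, which is the effect the slide describes.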

elimination of stop words
Some words carry no meaning and should be eliminated. In fact, any word that appears in 80% of all documents is pretty much useless. But consider a searcher for "to be or not to be". It is better to reduce the index weight of terms that appear very frequently.
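
One common way to reduce the weight of frequent terms, rather than dropping them, is inverse document frequency. The sketch below assumes the plain formula log(N / df); the slide does not say which weighting any given system uses, and the sample documents are made up.

```python
import math

def idf_weights(documents):
    """Return {term: log(N / document frequency)} for tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Made-up toy collection.
docs = [
    "to be or not to be".split(),
    "the cost of telecommunication".split(),
    "to index or not to index".split(),
    "the vector model".split(),
]
weights = idf_weights(docs)
print(weights["to"], weights["be"])
```

A term that occurs in every document gets weight 0, so it no longer influences ranking, yet it is still in the index and a phrase like "to be or not to be" remains searchable.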

index term selection
–In printed indexes, we use nouns only.
–Some nouns that appear heavily together can be considered to be one index term, such as "computer science".
–Dialog deals with this through phrase indexing.
–Nexis has the smart indexing feature that groups terms into concepts.
–Most web engines index all words, and index each of them individually.

thesauri
A thesaurus is a list of words and, for each word, a list of related words:
–synonyms
–broader terms
–narrower terms
Thesauri are used
–to provide a consistent vocabulary for indexing and searching
–to assist users with locating terms for query formulation
–to allow users to broaden or narrow a query
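
A thesaurus of this kind can be pictured as a small lookup table used to expand a query. The entries, the relation names used as keys, and the helper names `expand` and `or_query` are all hypothetical.

```python
# Hypothetical thesaurus data: one entry with the three relation types.
THESAURUS = {
    "car": {
        "synonyms": ["automobile"],
        "broader": ["vehicle"],
        "narrower": ["sedan"],
    },
}

def expand(term, relation="synonyms"):
    """Return the term followed by its related terms for one relation."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get(relation, [])

def or_query(term, relation="synonyms"):
    """Build an OR query from a term and its related terms."""
    return " OR ".join(expand(term, relation))

print(or_query("car"))             # same vocabulary, more matches
print(or_query("car", "broader"))  # broadens the query
```

Choosing the "narrower" relation instead would narrow the query in the same way.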

use of thesauri
Thesauri are limited to experimental systems, or some high-quality systems. Examples are
– bin/thesaurus.pl
–Nexis
Thesauri can be confusing to users. There is very little free thesaurus data available.

Back to Nexis: customization
You can customize the search results in the top right corner of the screen.
–You should set your default result list to the expanded list, so you can see your keywords in context of the documents.
–You may wish to increase the size of the default page to 99.
–You can hide the subject directory.

test sources
The ones that I have used are
–News 60 days (the top source)
–Two foreign sources: Le Monde and Der Spiegel

dating
Dates can be entered in any of the following forms
–07/24/ /24/00
–July 24, Jul 24, 2000
–July 2000
You can set the dates in the menu, but they will not go beyond the stated frame of the source.

Document preprocessing in Nexis
The following are always considered word limits:
–hyphens -
–slashes /
–parentheses ()
–spaces
Examples
–“state of the art” will also find “state-of-the-art”
–“co-operative” will not find “cooperative”
The documentation says you have to leave them out of the query. I think this is wrong: you can use slash and hyphen.
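
The two examples follow directly from the word-limit rule, as the sketch below shows. It only models the splitting described on this slide; the function name and the lowercasing are assumptions, and Nexis itself is not documented at this level.

```python
import re

def nexis_like_words(text):
    """Split on hyphens, slashes, parentheses and whitespace (a model, not Nexis code)."""
    return [w for w in re.split(r"[-/()\s]+", text.lower()) if w]

print(nexis_like_words("state-of-the-art"))  # same words as "state of the art"
print(nexis_like_words("co-operative"))      # two words
print(nexis_like_words("cooperative"))       # one word, so no match
```

Because the hyphen is a word limit, "co-operative" is indexed as the two words "co" and "operative", which is why it cannot match the single word "cooperative".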

query preprocessing in Nexis
apostrophe:
–if followed by "s", it is a possessive. Singular, plural and plural possessive are also found. Thus “company’s” also finds company, companies and companies’
–if not, it counts like a character in a word
at-sign: contrary to the documentation, it does not appear to be preprocessed. But you can leave it out when you search for an address: the query with the at-sign gives the same results as “president whitehouse.gov”.

query preprocessing in Nexis
ampersand: if it is surrounded by blanks, most of the time Nexis treats it as "and". If it is not, Nexis treats it as a normal character. But the & is not always equivalent to “and”. Example: search in Le Monde for
–“cable & wireless”: same as “cable wireless”
–“cable and wireless”: gives a different result

query preprocessing in Nexis
–Colon and comma are read as a space unless adjacent characters are numbers.
–Percent and pound sign mean themselves and are not equivalent to anything.
–? $ ; in the query are all ignored, according to the documentation, but that is not true: compare “$ 4711” in Der Spiegel with “4711” in the same source.
–® is replaced by the word "R" and ™ is replaced by the word "TM", according to the documentation.

query preprocessing in Nexis
–If you enter a dot, it is interpreted as a decimal point if surrounded by numbers.
–If the dot is followed by just one letter, you have to keep it: “10:14 a.m.” will get some results, “10:14 am” will get different results.
–But if there are several letters, you can leave it out: “ebay com” gives the same results as “ebay.com”.

query preprocessing in Nexis
The double quote is ignored in a query. According to the documentation, it can be used to make the reserved word "not" searchable. Example
–“to be or not to be”, Der Spiegel, all dates
This method cannot be applied to other noise words. It is not even possible to get lists of noise words.

effort to be easy
Overall, Nexis makes an effort to be easy at the expense of precise query semantics. One positive example is long quotes:
–who said: “For the past three years there has been no growth. Sooner or later they have to figure out that the melody has changed.”
–Such long quotes work most of the time; they can be entered without surrounding quotation marks.

noise and reserved words
Noise words are common words
–in power search, noise words are ignored and replaced by a space
–in quick search, you can use phrases
–there is no list of noise words
Reserved words are
–and
–or
–not
They are used in Boolean expressions. They are not indexed.

plurals
Nexis indexes plural and possessive as the singular. But in power search, you can use the following
–PLURAL (term): only the plural of term
–SINGULAR (term): only the singular of term
–ALLCAPS (term): only capitals of term
–NOCAPS (term): no capitals of term
–CAPS (term): capitalized term only
Note that term can be a sequence of words. This feature does not work properly on certain databases. Example: allcaps(RFA) in Le Monde.

plural, singular etc functions
Such functions are not reliable. Example: search in Der Spiegel
–national plural(archives): no results
–national archives
–singular(haus): same results as “haus”
–plural(haus): no results
–upper(USA): no results

entering phrases in power search
If you just put one word after the other, Nexis looks for the words one after the other, subject to
–latin chars only: Путин does not work
–no hits (opera)
–erroneous hits (explorer)
–removal of noise words: “Thomas Bishoff” finds Thomas of Bishoff
–plural ORing: “José Carreras” will also find José Carrera

searching for phrases
If your phrase contains a word that is a reserved word, you are best off using connectors. Example, current news, today only
–“black white”: 221 hits, matching “black and white”, “black or white”, “black/white”
–“black and white”: 444 hits; Nexis interpreted “and” as a reserved word.

searching for phrases
If your phrase contains a word that is likely to be a noise word, you are best off using connectors. Otherwise, your search seems to be translated with the noise word removed and then replaced by a connector. Example, current news, 60 days
–“cream of white”: you get “cream or white”, “cream/white”, “cream into the white”

treatment of accents
I have done a lot of research, but found no systematic treatment of accents. It seems source dependent. Therefore search several variations, e.g.
–Müntefering
–Muentefering
–Muntefering
The same goes for plurals. Some foreign language sources are not searched for plurals. Example: search “archive” in Der Spiegel.

order of connectors
The order is
–OR
–W/n, PRE/n
–W/S (cannot be combined with W/n or PRE/n)
–W/P (cannot be combined with W/n or PRE/n)
–AND
–AND NOT
When you use two or more of the same connectors in a search, they normally operate from left to right. When a search contains multiple W/n or PRE/n connectors, the connectors operate in numerical order with the smallest number first.
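
The order above can be written down as a sort key, which makes it easy to check which connector in a query is evaluated first. This is a sketch of the stated rules only; the function name `precedence` is an assumption, and the left-to-right rule for identical connectors is not modeled.

```python
import re

def precedence(connector):
    """Return a sort key for a connector: smaller keys are evaluated first."""
    c = connector.upper()
    if c == "OR":
        return (0, 0)
    match = re.fullmatch(r"(?:W|PRE)/(\d+)", c)
    if match:
        # W/n and PRE/n share one level; the smaller n goes first.
        return (1, int(match.group(1)))
    order = {"W/S": 2, "W/P": 3, "AND": 4, "AND NOT": 5}
    return (order[c], 0)  # raises KeyError for an unknown connector

print(sorted(["and", "w/3", "or", "pre/2", "w/s"], key=precedence))
```

With this key, pre/2 sorts before w/3, so in a query containing both the pre/2 part is evaluated first.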

examples
Use news, most current 60 days
–“Harry pre/1 Miller”: 18 hits; the 5th has the sentence “HAWTHORN has discovered another Harry Miller”.
–“Harry pre/2 Miller”: 93 hits
–“another pre/2 Miller w/3 Hawthorn” matches “HAWTHORN has discovered another Harry Miller”, because the pre/2 is evaluated first.

problem formats
Sometimes Nexis gets confused!
Problem format:        Corrected format:
–A w/n (B and C)       A w/n B and A w/n C
–A w/s (B and C)       A w/s B and A w/s C
–A w/p (B and C)       A w/p B and A w/p C
But of course you can do something like “another pre/2 Miller and Miller w/s Hawthorn”.
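
The correction is a mechanical rewrite: the proximity connector is distributed over the two terms in the parentheses. The sketch below handles only the simple case of single-word operands; the pattern and the function name `distribute` are assumptions made for the example.

```python
import re

# Matches "A w/5 (B and C)", "A w/s (B and C)" or "A w/p (B and C)"
# where A, B and C are single words.
PATTERN = re.compile(
    r"^(\S+)\s+(w/\d+|w/s|w/p)\s+\((\S+)\s+and\s+(\S+)\)$",
    re.IGNORECASE,
)

def distribute(query):
    """Rewrite a problem format into its corrected format, if it matches."""
    match = PATTERN.match(query.strip())
    if not match:
        return query  # leave anything else untouched
    a, connector, b, c = match.groups()
    return f"{a} {connector} {b} and {a} {connector} {c}"

print(distribute("abuse w/10 (drug and substance)"))
print(distribute("Miller w/s (Harry and Hawthorn)"))
```

Queries that do not have this exact shape are returned unchanged, so the helper can be applied to any query before it is submitted.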

Nexis segments
Nexis does some document preprocessing for characters, discussed in a later slide. The processed document has a number of field/value pairs that are called segments. Not every source has every segment. Searches using segments will return no hits for sources that don’t have the segment.

segment types
I make a distinction between
–native segments
–smart-indexed segments
Smart-indexed segments contain the result of smart indexing. Thus, they are not native to the source.

segment types
Some segments can be sorted. These are the ones that have numbers or dates as values. These segments use the following arithmetic operators:
–= or is: equal to
–> or aft: greater than or after
–< or bef: less than or before
Example for Der Spiegel: atleast3(foster) and date( 1000)

how to know about segments
This is virtually impossible, because there is no comprehensive documentation. Look at the example of Le Monde, where the list of segments seems incomplete. For Der Spiegel
–“rubrik(hausmitteilung) and text(foster)” does find the example given.
–“section(hausmitteilung) and body(foster)”

typical segments in news
–BYLINE
–CORRECTION
–CORRECTION-DATE (date)
–DATE (date)
–DATELINE (not a date)
–GRAPHIC
–HEADLINE
–HIGHLIGHT
–LEAD
–HLEAD is HEADLINE, HIGHLIGHT, & LEAD

typical segments in news
–PUBLICATION name and copyright
–SECTION
–SERIES
–SOURCE
–TICKER
–TYPE

typical smart-indexed segments
–CITY
–COMPANY
–COUNTRY
–GEOGRAPHIC
–INDUSTRY
–KEYWORD
–ORGANIZATION
–PERSON
–PRODUCT
–SUBJECT
–TICKER
–TYPE
–TERMS includes all these

terms example
Compare, in recent news, person(osama bin laden) with “osama bin laden”. Index terms can be useful to home in on complex topics that can have many names. Example
–start with “sex” as an index term
–collect terms related to gay marriage and civil unions

segment search
You can place query terms and connectors in a segment and then search for it. Example: hlead((drug or substance) w/10 abuse)

using segments for news
This uses power search expressions, plus
–hlead (expression) ?
–headline (expression)
–company (expression) for a company
–byline (expression) for the author
–show (expression) for a television show transcript
Here, expression is a simple keyword or an expression using several words, possibly combined with connectors.

segments for legal data
–name (expression) for the name of a party
–cite (expression) for a citation expression for case law
–title (expression) for the title of a law article
Here, expression is a simple keyword or an expression using several words, possibly combined with connectors.

Search forms
There are special forms for
–News
–Company reports
–Market indicators
–Portfolio
–News and quotes about companies

Personal news alert
Do a search, then click on “track in personal news” to get to a screen where you can enter
–periodicity
–what documents to be sent
–subject
This works for real estate for me.

Real time news
This uses a different query language
–terms are implicitly ANDed
–explicit AND and OR are allowed
–phrases have to be put in quotes
–* stands for any number of characters, not just one as in power search
–parentheses can be used
I have poor experience with this.

Summary on Nexis
Nexis has a rich set of resources. It can be searched by inexperienced users, but they are likely to get poor results. Clever learning about its features can get you quite far; however, the features are not well documented online. Nexis seems frequently to violate its stated rules.

Thank you for your attention!