Information Retrieval (IR) on the Internet

Contents
Definition of IR
Performance Indicators of IR systems
Basics of an IR system
Some IR Techniques
Search Engines
Challenges faced by IR on the Internet
Conclusion

What is IR
IR refers to searching through documents on the Internet and presenting the documents relevant to the search terms
Presenting ONLY relevant documents poses a challenge
Hence IR systems are measured according to certain indicators

Performance Indicators
Response time
  - Time taken to present results
  - Not really an issue these days
Precision
  - Percentage of the results that are relevant
Recall
  - Percentage of ALL relevant documents on the Internet that were presented

Performance Indicators contd.
More on Recall
  - Not possible to calculate this on the Internet
  - Doing so requires knowing ALL relevant documents
  - If they were all known, it would be possible to return ONLY relevant documents during a search
These indicators do not take the user into account, and they should
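
The two indicators can be sketched in a few lines of Python. This is only an illustration: it assumes a small test collection in which the full set of relevant documents is already known, which is exactly what is missing on the Internet. The function name and the document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both returned and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents returned, 5 documents actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints: precision = 50%, recall = 40%
```

Two of the four results are relevant (precision), but only two of the five relevant documents were found (recall).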

Basics of an IR system
An IR system has three main concerns
  - Create an abstract view of the search terms
  - Create an abstract view of the documents
  - Match both views
Once all three are achieved, the IR system is working properly

Basics of an IR system contd.
[Diagram; labels: Search terms, Documents, Keywords, Matching, Abstraction, Feedback, Resulting docs]

Basics of an IR system contd.
The process of arriving at a successful abstracted view of the search terms is the Query formulation process
The process of arriving at keywords that represent a document and point to it is the Indexing process
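
The three concerns (abstract the search terms, abstract the documents, match both views) might look roughly like this in Python. It is a minimal sketch under simple assumptions: the "abstract view" is just a set of lowercase keywords, the stop-word list is a tiny hand-picked one, and a document matches if it shares at least one keyword with the query. Real systems are far more elaborate.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "is", "for"}


def abstract_view(text):
    """Reduce text to a set of lowercase keywords, dropping stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def match(query, documents):
    """Return the names of documents sharing at least one keyword with the query."""
    query_view = abstract_view(query)
    return [
        name
        for name, text in documents.items()
        if query_view & abstract_view(text)
    ]


documents = {
    "doc1": "Indexing the documents on the Internet",
    "doc2": "A recipe for tomato soup",
}
print(match("internet indexing", documents))
# prints: ['doc1']
```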

Some Techniques used for IR
Indexing
Ranking

Indexing (IR Technique)
Stripping a document to keywords/search terms
Using these keywords as pointers to the document
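
One common way to make keywords "point to" documents is an inverted index: a mapping from each keyword to the set of documents that contain it. The sketch below assumes that structure, keeps every word that is not in a small hand-picked stop-word list, and uses made-up page names.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; words in it are stripped from the document.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "to"}


def build_index(documents):
    """Map each keyword to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(name)
    return index


documents = {
    "page1": "Information retrieval on the Internet",
    "page2": "Retrieval of indexed documents",
}
index = build_index(documents)

# Looking up a keyword returns pointers to documents, not the documents themselves.
print(sorted(index["retrieval"]))
# prints: ['page1', 'page2']
```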

Some Approaches to Indexing
Manual Indexing
  - As the name suggests
  - Impossible due to the size of the Internet
Metadata
  - Is an invisible file tied to a web page that holds data about the contents of the page
  - e.g. the Dublin Core Metadata Element Set, which proposes a set of 15 elements holding data such as creator, title, subject and so on
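
As a rough illustration of metadata-based indexing, the 15 Dublin Core elements can be listed and used to filter a page's metadata record. The element names are those of the Dublin Core Metadata Element Set; the record, its values and the helper function are hypothetical and only show the idea.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_fields(record):
    """Keep only the fields of a metadata record that are Dublin Core elements."""
    return {key: record[key] for key in DUBLIN_CORE_ELEMENTS if key in record}


# Hypothetical metadata record for a web page.
record = {
    "title": "Information Retrieval on the Internet",
    "creator": "A. N. Author",
    "subject": "information retrieval; search engines",
    "layout": "two-column",  # not a Dublin Core element, so it is ignored
}

print(sorted(dublin_core_fields(record)))
# prints: ['creator', 'subject', 'title']
```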

An Indexing technique
Term weighting
  - Keywords do not all have the same strength
  - Numerical values are assigned; the higher the value, the more relevant the keyword
  - The value is referred to as the weight
  - Weights can be assigned based on term frequency or on inverse document frequency
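
A small sketch of term weighting follows. It assumes one common choice, multiplying the two measures (often called TF-IDF): term frequency is the share of a document's words taken up by the term, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. Other weighting schemes exist; the documents and names here are made up.

```python
import math
from collections import Counter


def term_weights(documents):
    """Return {document name: {term: tf * idf}} for a small collection."""
    tokenised = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenised.values():
        doc_freq.update(set(words))

    weights = {}
    for name, words in tokenised.items():
        counts = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


documents = {
    "d1": "web search index",
    "d2": "web search ranking",
    "d3": "soup recipe",
}
weights = term_weights(documents)

# "index" appears in one document, "web" in two, so "index" gets the higher weight in d1.
print(weights["d1"]["index"] > weights["d1"]["web"])
# prints: True
```

A term that appears in every document would get a weight of zero under this formula, which reflects the idea that it does not help tell documents apart.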

Ranking (IR Technique)
Uses the term weighting of a document to give priority
  - The sum of the weights of the matching keywords is used to order results in descending order
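
Ranking by summed weights can be sketched as below. The weights are hypothetical numbers chosen for the example, and documents that match none of the search terms are left out of the results, which is an assumption of this sketch.

```python
def rank(query_terms, weights):
    """Order documents by the sum of the weights of the query terms they contain."""
    scores = {}
    for name, doc_weights in weights.items():
        score = sum(doc_weights.get(term, 0.0) for term in query_terms)
        if score > 0:  # drop documents that match none of the terms
            scores[name] = score
    # Highest total weight first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical keyword weights per document.
weights = {
    "d1": {"web": 0.2, "search": 0.1, "index": 0.5},
    "d2": {"web": 0.2, "ranking": 0.5},
    "d3": {"soup": 0.6},
}

results = rank(["web", "index"], weights)
print([name for name, _ in results])
# prints: ['d1', 'd2']
```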

Search Engines
These are an integral part of IR on the Internet
They receive search terms and match them with relevant documents
They only have access to indices, as accessing the entire document would degrade performance and be too costly

Some Challenges
The size of the Internet
  - Research shows that only 60% of the Internet is indexed by search engines
  - Any one search engine only indexes 3-34% of the Internet
Kobayashi, M. & Takeda, K. (2000). “Information Retrieval on the Web”, [online] in ACM Computing Surveys, Vol. 32, No. 2, June 2000

Some Challenges contd.
Even indexed documents are amended, replaced or removed altogether, which sometimes makes the indexing structure inaccurate
Impossible to enforce Metadata proposals
  - Academic journals are sometimes not dated
The user may not be clear about the information need, which could affect the search terms provided

Conclusion
IR is important as we thrive on information
In spite of the challenges faced by IR, it still returns a decent level of success, sometimes with the initial set of search terms, sometimes after a few attempts
There is a lot of work going on to improve IR techniques and it is my belief that a breakthrough will be achieved soon

Thank you
Any questions?