
1 Information Retrieval (IR) on the Internet

2 Contents  Definition of IR  Performance Indicators of IR systems  Basics of an IR system  Some IR Techniques  Search Engines  Challenges faced by IR on the Internet  Conclusion

3 What is IR  IR refers to searching through documents on the Internet  Presenting the documents that are relevant to the search terms  Presenting ONLY relevant documents poses a challenge  Hence IR systems are measured against certain performance indicators

4 Performance Indicators  Response time - Time taken to present results - Not really an issue these days  Precision - Percentage of the results that are relevant  Recall - Percentage of ALL relevant documents on the Internet that were presented

5 Performance Indicators contd.  More on Recall - Not possible to calculate this on the Internet, because the full set of relevant documents is unknown - If ALL relevant documents were known - Then it would be possible to return ONLY relevant documents during a search  None of these indicators takes the user into account, although they should
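The two indicators defined on the previous slides can be sketched in a few lines of Python. This is a minimal illustration, assuming a small test collection in which the relevant documents for a query are already known (which, as noted above, is not the case for the Internet as a whole). The function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system presented
    relevant:  ids of all documents judged relevant (assumed to be known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were presented
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents presented, 5 relevant documents exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four presented documents are relevant, so precision is 0.50, and two of the five relevant documents were found, so recall is 0.40.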

6 Basics of an IR system  An IR system has three main concerns - Create abstract view of search terms - Create abstract view of documents - Match both views  Once all three are achieved then the IR system is working properly

7 Basics of an IR system contd. [Diagram with the labels: Search terms, Documents, Abstraction, Keywords, Matching, Resulting docs, Feedback]

8 Basics of an IR system contd.  The process of arriving at a successful abstracted view of the search terms is referred to as the Query formulation process  The process of arriving at keywords that represent a document and point to it is referred to as the Indexing process

9 Some Techniques used for IR  Indexing  Ranking

10 Indexing (IR Technique)  Stripping a document to keywords/search terms  Using these keywords as pointers to the document
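One common way to use keywords as pointers to documents is an inverted index. The sketch below is an illustration only, not the method of any particular search engine. The stopword list, the function names and the three sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict

# Small, hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "to", "in", "is", "for"}


def keywords(text):
    """Strip a text down to lowercase keywords, dropping stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_index(docs):
    """Map each keyword to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword of the query."""
    terms = keywords(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "d1": "Information retrieval on the Internet",
    "d2": "Indexing the Internet for search engines",
    "d3": "Retrieval of information in libraries",
}
index = build_index(docs)
print(sorted(search(index, "information retrieval")))
```

The search terms go through the same `keywords` function as the documents, so both sides are reduced to the same abstract view before they are matched.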

11 Some Approaches to Indexing  Manual Indexing - As the name suggests, keywords are assigned by hand - Impractical due to the size of the Internet  Metadata - Is invisible data tied to a web page that describes the contents of the page - e.g. the Dublin Core Metadata Element Set, which proposes a 15-element set that holds data such as creator, title, subject and so on
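As an illustration of the metadata approach, the sketch below lists the 15 Dublin Core element names and reports which of them a page's metadata record leaves empty. The record and the helper function are hypothetical and only show the idea.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = [
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
]


def missing_elements(record):
    """Return the Dublin Core elements that are absent or empty in a record."""
    return [name for name in DUBLIN_CORE_ELEMENTS if not record.get(name)]


# Hypothetical metadata record for a web page.
record = {
    "creator": "A. Author",
    "title": "Information Retrieval (IR) on the Internet",
    "subject": "information retrieval",
}
print(missing_elements(record))
```

A check of this kind can report gaps such as a missing date, but it cannot force page authors to fill them in.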

12 An Indexing technique  Term weighting - Keywords do not have the same strength - Numerical values are assigned, the higher the value, the more relevant the keyword - The value is referred to as the weight - Weights can be assigned based on term frequency or on inverse document frequency
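The slide names term frequency and inverse document frequency as two bases for a weight. A common choice, assumed here, is to multiply them (TF-IDF). The sketch below uses one simple variant, relative term frequency times log(N / df), and the three tiny documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    docs maps a document id to its list of keywords.
    tf is the share of the document's keywords equal to the term;
    df is the number of documents that contain the term.
    """
    n_docs = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # count each term once per document

    weights = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        weights[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


docs = {
    "d1": "internet search internet index".split(),
    "d2": "internet library catalogue".split(),
    "d3": "search engine ranking search".split(),
}
weights = tf_idf(docs)
for term, weight in sorted(weights["d1"].items()):
    print(f"{term}: {weight:.3f}")
```

In document d1 the keyword "index" appears once but in no other document, so it receives a higher weight than "search", which also appears once but is shared with d3.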

13 Ranking (IR Technique)  Uses term weighting of a document to give priority - The sum of the weights of keywords is used to order results in descending order
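The ranking rule on this slide can be sketched as follows: for each document, sum the weights of the search terms it contains, then sort in descending order. The weights below are hypothetical values chosen for the example, and the function name is an assumption.

```python
def rank(weights, query_terms):
    """Order documents by the summed weights of the query terms they contain.

    weights maps a document id to a {keyword: weight} dictionary.
    Documents that contain none of the query terms are left out.
    """
    scores = {}
    for doc_id, term_weights in weights.items():
        score = sum(term_weights.get(term, 0.0) for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest total weight first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical keyword weights for three documents.
weights = {
    "d1": {"internet": 0.20, "search": 0.10, "index": 0.27},
    "d2": {"internet": 0.14, "library": 0.37},
    "d3": {"search": 0.20, "engine": 0.27},
}
for doc_id, score in rank(weights, ["internet", "search"]):
    print(doc_id, round(score, 2))
```

For the query terms "internet" and "search", d1 is listed first because it contains both keywords, followed by d3 and then d2.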

14 Search Engines  These are an integral part of IR on the Internet  They receive search terms and match them with relevant documents  They only have access to indices, as accessing the entire document would degrade performance and be too costly

15 Some Challenges  The size of the Internet - Research shows that only 60% of the Internet is indexed by search engines - Any one search engine only indexes 3-34% of the Internet  Kobayashi, M. & Takeda, K. (2000). “Information Retrieval on the Web”, [online] in ACM Computing Surveys, Vol. 32, No. 2, June 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.5978

16 Some Challenges contd.  Even indexed documents are amended, replaced or removed altogether, which sometimes makes the index inaccurate  Impossible to enforce Metadata proposals - e.g. academic journal articles are sometimes not dated  The user may not be clear about the information need, which could affect the search terms provided

17 Conclusion  IR is important as we thrive on information  In spite of the challenges faced by IR, it still achieves a decent level of success, sometimes with the initial set of search terms and sometimes after a few attempts  There is a lot of work going on to improve IR techniques and it is my belief that a breakthrough will be achieved soon

18 Thank you Any questions?

