IR. SI 650/EECS 549 Information Retrieval People search the Web daily Search engines –Google –Bing –Baidu –Yandex Information Retrieval is about search.

SI 650/EECS 549 Information Retrieval

People search the Web daily
Search engines
–Google
–Bing
–Baidu
–Yandex
Information Retrieval is about search engines

Author: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Hardcover: 496 pages
Publisher: Cambridge University Press; 1 edition (July 7, 2008)
Language: English
ISBN: 978-0521865715

Information need: structured query
Goal: find “matched” information
Structured data
Data is small

Information need: free text query
Goal: find relevant information
Unstructured data: e.g., text documents
Data is large

https://www.google.com/trends/topcharts

Conventional (library catalog)
–Search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, Yahoo!)
–Search by keywords. Limited search using queries in natural language.
Image-based
–shapes, colors, keywords
Question answering systems (ask.com)
–Search in (restricted) natural language
Clustering systems (Vivísimo, Clusty)
Research systems (Lemur, Nutch)

Content Type                 Amount / day
Published content            3-4G
Professional web content     ~2G
User generated content       8-10G
Private text content         ~3T
- Ramakrishnan and Tomkins 2007

1996 / 2009 - Slide from Manning et al.

Size of the indexed World Wide Web (as of Nov 2015)
–Indexed by Google: about 48 billion pages
–Indexed by Bing: about 13 billion pages
http://www.worldwidewebsize.com/

Twitter hits 400 million tweets per day
–June 2012. Dick Costolo, CEO at Twitter
Over 2.5 billion photos uploaded to Facebook each month (2010)
–blog.facebook.com
Google’s clusters process a total of more than 20 petabytes of data per day.
–2008. Jeff Dean from Google

~750k /day, ~3M /day, ~150k /day, 1M, 10B, 2.5M, ~100B
Where to Start? Where to Go? Gold?

Dynamically generated content
New pages get added all the time
The size of the blogosphere doubles every 6 months

Narrow Definition: Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). –Manning, Raghavan, and Schütze

There is an information need -- query
Data is unstructured -- usually text documents
Data is large
The goal is to find (relevant) information
Best Examples: Library Systems; Search Engines!

Diagram labels: User, Query, Documents, Query Rep, Doc Rep, Ranking, results, judgments, Feedback
Stages: INDEXING, SEARCHING, QUERY MODIFICATION, INTERFACE
Believe me! Google is as simple as this!
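The loop in the diagram (represent documents and the query, rank, collect judgments, modify the query) can be sketched in a few lines. This is a minimal illustration, not how any production engine works: the function names `represent`, `rank`, and `modify_query`, the toy documents, and the weight `beta` are all assumptions made for the example, and the additive query expansion is a simplified form of relevance feedback chosen for brevity.

```python
from collections import Counter

def represent(text):
    """Doc Rep / Query Rep: a bag of words (term -> count)."""
    return Counter(text.lower().split())

def rank(query_rep, doc_reps):
    """Ranking: score each document by weighted term overlap with the query."""
    scores = {
        doc_id: sum(weight * rep[term] for term, weight in query_rep.items())
        for doc_id, rep in doc_reps.items()
    }
    matched = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matched, key=lambda doc_id: -scores[doc_id])

def modify_query(query_rep, doc_reps, relevant_ids, beta=0.5):
    """Query modification: add terms of judged-relevant documents to the query.

    beta is a hypothetical weight for the added terms.
    """
    expanded = dict(query_rep)
    for doc_id in relevant_ids:
        for term, count in doc_reps[doc_id].items():
            expanded[term] = expanded.get(term, 0) + beta * count
    return expanded

# Toy collection (made up for illustration).
documents = {
    "d1": "the jaguar is a big cat and the jaguar lives in wild habitat",
    "d2": "jaguar car dealer prices",
    "d3": "big cat habitat and wild prey",
}
doc_reps = {doc_id: represent(text) for doc_id, text in documents.items()}

query = represent("jaguar")
first_results = rank(query, doc_reps)

# The user judges d1 relevant; the system modifies the query and searches again.
expanded = modify_query(query, doc_reps, relevant_ids=["d1"])
second_results = rank(expanded, doc_reps)

print(first_results)   # ['d1', 'd2']
print(second_results)  # ['d1', 'd3', 'd2']
```

After feedback, d3 is retrieved even though it never mentions the original query term, and it moves ahead of the car-dealer page. That is the point of the Feedback arrow: judgments reshape the query representation.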

Queries
–Boolean? Free text? Structured? …
Documents
–Free text? Semi-structured? How to index? …
Retrieval Algorithm
–Relevance? Ranking? Personalized? …

QUERY: a representation of what the user is looking for
–Boolean? Free text? Structured? …
DOCUMENT: a (text) information entity that the user wants to retrieve
–Free text? Semi-structured? How to index? …
COLLECTION: a set of documents
INDEX: a representation of information that makes querying easier
TERM: a word or concept that appears in a document/query
Retrieval Algorithm
–Relevance? Ranking? Personalized? …
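To make the vocabulary above concrete, here is a small sketch of an inverted index answering a Boolean AND query. The whitespace tokenizer, the helper names (`tokenize`, `build_index`, `boolean_and`), and the three-document collection are assumptions made for illustration; real systems normalize terms far more carefully.

```python
from collections import defaultdict

def tokenize(text):
    """Split text into TERMs (naively: lowercase, whitespace-separated)."""
    return text.lower().split()

def build_index(collection):
    """INDEX: map each term to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Boolean QUERY: return IDs of documents that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)

# COLLECTION: a set of DOCUMENTs, keyed by a made-up document ID.
collection = {
    1: "information retrieval is about search",
    2: "people search the web daily",
    3: "databases answer structured queries",
}
index = build_index(collection)

print(sorted(boolean_and(index, "search web")))  # [2]
```

The index trades storage for speed: the query touches only the posting sets of its own terms instead of scanning every document.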

Decide what to index (documents).
Collect and process them.
Index them (efficiently).
–Keep the index up to date.
Develop the retrieval algorithm.
Provide user-friendly query facilities.
That’s it! We’ll learn how to make this happen.
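The steps above can be sketched as a tiny engine that processes documents, indexes them incrementally, and ranks free-text queries. This is only one possible reading: the class name `TinySearchEngine`, the TF-IDF style score with `log((n + 1) / df)` smoothing, and the sample documents are assumptions for the example, and the sketch assumes each document ID is added only once.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Process a document or query into lowercase whitespace-separated terms."""
    return text.lower().split()

class TinySearchEngine:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_ids = set()

    def add(self, doc_id, text):
        """Index one document; calling this for new documents keeps the index up to date."""
        self.doc_ids.add(doc_id)
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query, k=3):
        """Retrieval algorithm: rank documents by summed tf * idf over query terms."""
        n = len(self.doc_ids)
        scores = Counter()
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log((n + 1) / len(docs))  # rarer terms weigh more
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common(k)]

engine = TinySearchEngine()
engine.add("a", "web search engines crawl and index the web")
engine.add("b", "the library catalog supports search by title and author")
engine.add("c", "search engines rank web pages for a free text query")

print(engine.search("web search engines"))  # ['a', 'c', 'b']

# A new page appears; indexing it makes it searchable immediately.
engine.add("d", "gold is hidden in large collections")
print(engine.search("gold"))  # ['d']
```

Unlike a Boolean match, every document sharing at least one term with the query gets a score, so the result is an ordered list rather than a set.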

And This?

Data can be semi-structured, multi-modal, abstract.
–Image, video, opinion, expert, …
–Although in this class, we will only talk about text
Information need can be implicit, dynamic, inaccurate
–Information filtering, recommender systems
–Sometimes there isn’t a query!
Find information → knowledge acquisition
–Give me what you have → tell me what you know
Relevance isn’t the only criterion.
–Novelty, diversity, personalization, …

Diagram labels: Text, Natural Language Content Analysis, Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization, Retrieval Applications, Mining Applications, Information Access, Knowledge Acquisition, Information Organization

Document Representation and Content Analysis (e.g., text representation, document structure, linguistic analysis, NLP for IR, cross- and multi-lingual IR, information extraction, sentiment analysis, clustering, classification, topic models, facets, text streams)
Queries and Query Analysis (e.g., query intent, query suggestion and prediction, query representation and reformulation, query log analysis, conversational search and dialogue, spoken queries, summarization, question answering)
Retrieval Models and Ranking (e.g., IR theory, language models, probabilistic retrieval models, learning to rank, combining searches, diversity and aggregated search)
Search Engine Architectures and Scalability (e.g., indexing, compression, distributed IR, P2P IR, mobile IR, cloud IR)
Users and Interactive IR (e.g., user studies, user and task models, interaction analysis, session analysis, exploratory search, personalized search, social and collaborative search, search interface, whole session support)
Filtering and Recommending (e.g., content-based filtering, collaborative filtering, recommender systems)
Evaluation (e.g., test collections, experimental design, effectiveness measures, session-based evaluation, simulation)
Web IR and Social Media Search (e.g., link analysis, click models/behavioral modeling, social tagging, social network analysis, blog and microblog search, forum search, community-based QA, adversarial IR and spam, vertical and local search)
IR and Structured Data (e.g., XML search, ranking in databases, desktop search, entity search)
Multimedia IR (e.g., image search, video search, speech/audio search, music search)
Other Applications (e.g., digital libraries, enterprise search, genomics IR, legal IR, patent search, text reuse, new retrieval problems)

How can we
- store and manage large scale text data?
- find useful information?
- organize information automatically?
- extract useful patterns?
- …
How can we manage text information effectively and efficiently?

Diagram labels: TM Algorithms, TM Applications, User, Text, Storage, Compression, Probabilistic inference, Machine learning, Natural language processing, Human-computer interaction, Software engineering, Web, Computer science, Information Science

Diagram labels: Information Retrieval, Databases, Library & Info Science, Machine Learning, Pattern Recognition, Data Mining, Natural Language Processing, Web, Social Computing, Bioinformatics, Health Info…, Statistics, Optimization, Software engineering, Computer systems
Layers: Models, Algorithms, Applications, Systems

Areas: Info Retrieval, Databases, Info. Science, Learning/Mining, NLP, Applications, Statistics, Software/systems
Venues: ACM SIGIR, ACM CIKM, TREC, WSDM, VLDB, PODS, ICDE, ACM SIGMOD, ASIS, JCDL, ACL, COLING, EMNLP, ANLP, HLT, ICML, NIPS, UAI, AAAI, ACM SIGKDD, RECOMB, PSB, IHI, ISMB, WWW, ICWSM, SOSP, OSDI

Difficulty in natural language understanding
Unlimited domain
Inherently vague user information need
Effective & efficient
Must deal with reasoning under uncertainty based on incomplete information
