Advanced Information Retrieval


1 Advanced Information Retrieval
Data Mining Lab., Univ. of Seoul

2 Course Objectives
Course topics: Information retrieval: intelligent information retrieval. Text mining: data mining, text mining, machine learning. Use of analysis tools: R, a language and environment for statistical computing and graphics.

3 Textbook: Introduction to Information Retrieval
Course homepage

4 R Textbook

5 Course Homepage

6 Examples of (Intelligent) Information Retrieval Systems

7 Evaluation
Grading: Presentation: 25%. Assignment report: 25% (applying text mining and recent retrieval techniques). Final exam: 50%.

8 Information Retrieval (IR)
The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent “killer app.” Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

9 Typical IR Task
Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

10 IR System
[Diagram: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents.]

11 Relevance Relevance is a subjective judgment and may include:
Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

12 Keyword Search
The simplest notion of relevance is that the query string appears verbatim in the document. A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
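The two notions of relevance above can be contrasted with a minimal sketch. The function names, the tokenizer, and the sample document are assumptions made for illustration; they are not part of the original slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: count occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

# Hypothetical sample document.
doc = "Retrieval of information from text: the retrieval system ranks text documents."

print(verbatim_match("information retrieval", doc))      # False
print(bag_of_words_score("information retrieval", doc))  # 3
```

The document never contains the exact phrase "information retrieval", so the strict test fails, while the bag-of-words score still credits it for the three occurrences of the query words.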

13 Problems with Keywords
May not retrieve relevant documents that include synonymous terms: “restaurant” vs. “café”, “PRC” vs. “China”. May retrieve irrelevant documents that include ambiguous terms: “bat” (baseball vs. mammal), “Apple” (company vs. fruit), “bit” (unit of data vs. act of eating).

14 Beyond Keywords
We will cover the basics of keyword-based IR, but we will focus on extensions and recent developments that go beyond keywords. We will cover the basics of building an efficient IR system, but we will focus on basic capabilities and algorithms rather than the systems issues that allow scaling to industrial-size databases.

15 Intelligent IR
Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

16 IR System Architecture
[Diagram: User Interface, Text Operations, Query Operations, Indexing, Searching, and Ranking components, connected to a Database Manager, an Index (inverted file), and a Text Database. Labeled flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]

17 IR System Components
Text Operations forms index words (tokens): stopword removal, stemming. Indexing constructs an inverted index of word-to-document pointers. Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric.
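The four components above can be sketched end to end on a toy corpus. This is a minimal illustration under stated assumptions: the stopword list, the crude suffix-stripping `stem` (a stand-in for a real stemmer such as Porter's), the term-frequency ranking, and the three sample documents are all hypothetical choices, not the system described in the slides.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # assumed toy list

def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, remove stopwords, and stem to form index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text_operations(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Searching and ranking: score documents by summed term frequency."""
    scores = Counter()
    for term in text_operations(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical three-document corpus.
docs = {
    1: "Indexing the documents",
    2: "Searching documents and ranking documents",
    3: "The user interface",
}
index = build_index(docs)
results = search(index, "ranked documents")
print(results)  # [(2, 3), (1, 1)]
```

Summed term frequency is only one possible relevance metric; it is used here because it keeps the ranking step easy to follow.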

18 IR System Components (continued)
User Interface manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results. Query Operations transform the query to improve retrieval: Query expansion using a thesaurus. Query transformation using relevance feedback.
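Query expansion using a thesaurus can be illustrated with a short sketch. The thesaurus here is a hypothetical hand-written dictionary built from the synonym pairs mentioned earlier in the deck; a real system would draw on a much larger resource.

```python
# Hypothetical toy thesaurus: term -> set of synonymous terms.
THESAURUS = {
    "restaurant": {"café"},
    "china": {"prc"},
}

def expand_query(query_terms, thesaurus):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query_terms:
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["restaurant", "china"], THESAURUS))
# ['restaurant', 'café', 'china', 'prc']
```

The expanded term list can then be passed to the search component in place of the original query, so that documents using a synonym are also retrieved.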

19 Web Search Application of IR to HTML documents on the World Wide Web.
Differences: Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

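20 Web Search System
[Diagram: a spider crawls the Web to assemble the document corpus; the corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]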

21 Modern IR-Related Tasks
Automated document categorization. Information filtering (spam filtering). Information routing. Automated document clustering. Recommending information or products. Information extraction. Information integration. Question answering.

22 History of IR
1960-70’s: Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area.

23 IR History Continued
1980’s: Large document database systems, many run by companies: Lexis-Nexis, Dialog, MEDLINE.

24 IR History Continued
1990’s: Searching FTPable documents on the Internet: Archie, WAIS. Searching the World Wide Web: Lycos, Yahoo, Altavista.

25 IR History Continued
1990’s continued: Organized competitions: NIST TREC. Recommender systems: Ringo, Amazon, NetPerceptions. Automated text categorization and clustering.

26 Recent IR History
2000’s: Link analysis for Web search: Google. Automated information extraction: Whizbang, Fetch, Burning Glass. Question answering: TREC Q/A track.

27 Recent IR History
2000’s continued: Multimedia IR: image, video, audio and music. Cross-language IR: DARPA Tides. Document summarization.

28 Related Areas
Database Management. Library and Information Science. Artificial Intelligence. Natural Language Processing. Machine Learning.

29 Database Management
Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.

30 Library and Information Science
Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.

31 Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: first-order predicate logic, Bayesian networks. Recent work on web ontologies and intelligent information agents brings it closer to IR.

32 Natural Language Processing
Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

33 Natural Language Processing: IR Directions
Methods for determining the sense of an ambiguous word based on context (word sense disambiguation). Methods for identifying specific pieces of information in a document (information extraction). Methods for answering specific NL questions from document corpora.

34 Machine Learning
Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

35 Machine Learning: IR Directions
Text categorization: automatic hierarchical classification (Yahoo); adaptive filtering/routing/recommending; automated spam filtering. Text clustering: clustering of IR query results; automatic formation of hierarchies (Yahoo). Learning for information extraction. Text mining.

36 Machine Learning

37 Machine Learning
Supervised learning: classification; prediction model. Unsupervised learning: clustering, association; description model. Semi-supervised learning: for less supervision. Semi-supervised clustering: for a little supervision.

38 Supervised Learning
Basic concept: classification (prediction) model.

39 Supervised Learning
The classification model differs depending on the learning algorithm.

40 Supervised Learning Algorithms
Decision tree. Neural networks. Bayesian statistics: Naïve Bayes, Bayesian networks. Instance-based learning: k-nearest neighbor. Support vector machine. Rough set theory. Meta-learning: ensemble (committee): bagging, boosting. Expectation-Maximization (EM).

41 Supervised Learning: Decision Tree
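[Diagram: credit analysis example. Training data with labels (label = class) is passed to learning, which produces the classification model: a decision tree that tests “salary < 20000” and “education in graduate” (yes/no branches) and ends in accept or reject leaves.]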

42 Supervised Learning: Support Vector Machine
The hyperplane that separates positive and negative training data is w · x + b = 0.
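Once a separating hyperplane is known, a new example is classified by the side of the hyperplane on which it falls, that is, by the sign of w · x + b. The sketch below shows only this decision step; the weight vector and bias are hypothetical values chosen by hand, not the result of SVM training.

```python
def decision_value(w, x, b):
    """Compute w . x + b for weight vector w, example x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Return +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0 (not learned from data).
w = [1.0, -1.0]
b = 0.0

print(classify(w, [2.0, 1.0], b))  # 1
print(classify(w, [1.0, 3.0], b))  # -1
```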

43 Supervised Learning: Naïve Bayes
Estimated probability distribution

44 Supervised Learning: k-Nearest Neighbors
[Diagram: training data points labeled + and -, with one unknown data point (?). The unknown point is classified by voting among its nearest neighbors.]
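The voting step of k-nearest neighbors can be sketched as follows. The training points, the Euclidean distance, and k = 3 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (point, label).
training = [
    ((1.0, 1.0), "+"),
    ((1.5, 2.0), "+"),
    ((5.0, 5.0), "-"),
    ((6.0, 5.5), "-"),
    ((5.5, 6.5), "-"),
]

print(knn_predict(training, (1.2, 1.4)))  # +
print(knn_predict(training, (5.4, 5.6)))  # -
```

With k = 3, the first unknown point has two "+" neighbors and one "-" neighbor, so the vote assigns it to "+".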

45 Unsupervised Learning Algorithms : Clustering
Partitioning method: K-means, K-medoids, … Hierarchical method: agglomerative and divisive hierarchical clustering; complete-linkage, single-linkage, group-average, Ward’s method; BIRCH, CURE, … Density-based method: DBSCAN, OPTICS, … Grid-based method: STING, WaveCluster, CLIQUE, … Model-based method: statistical approach, …

46 Clustering: k-means
[Diagram: cluster assignments and centroids at step i, and the updated centroids at step i+1.]
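The alternation between step i (assign points to the nearest centroid) and step i+1 (recompute the centroids) can be sketched as below. The sample points and the hand-picked initial centroids are assumptions; practical implementations usually choose the initial centroids randomly or with a seeding heuristic.

```python
import math

def assign(points, centroids):
    """Step i: assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    """Step i+1: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append(
            tuple(sum(coord) / len(members) for coord in zip(*members))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Repeat the two steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points and initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```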

47 Clustering: hierarchical clustering
[Diagram: hierarchical clustering of the points a, b, c, d, e shown over steps 0 through 4, from five single-point clusters to one cluster containing all points.]
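An agglomerative (bottom-up) pass over five points named a to e can be sketched as follows. The 1-D coordinates and the choice of single linkage are assumptions made so that the merge order is easy to follow; they are not taken from the slide.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the clusterings from step 0 onward."""
    clusters = [frozenset([name]) for name in sorted(points)]
    history = [list(clusters)]

    def distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(points[p] - points[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        pairs = [
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)  # closest pair of clusters
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append(list(clusters))
    return history

# Hypothetical 1-D coordinates for the five points.
points = {"a": 0.0, "b": 0.5, "c": 4.0, "d": 6.0, "e": 7.0}
history = single_linkage(points)
for step, clustering in enumerate(history):
    print(step, sorted("".join(sorted(c)) for c in clustering))
```

With these coordinates the merges are a+b, then d+e, then c with {d, e}, and finally everything into one cluster, giving steps 0 through 4. Reading the history in reverse corresponds to the divisive (top-down) direction.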

48 Clustering: density-based clustering
Clusters: density-connected sets. DBSCAN algorithm.

49 Association (Rule) Mining
Basket analysis. Apriori algorithm. Sequential pattern mining: associations + time (or order).

50 Text Mining

51 Text Mining
Data mining targets structured data. Text mining targets unstructured/semi-structured text documents. Underlying technologies of text mining: machine learning, information retrieval, natural language processing, statistical learning.

52 Text Mining
Differences from data mining: Curse of dimensionality: feature selection techniques are important (Zipf’s Law, document frequency, χ² statistics, mutual information). Combined with natural language processing techniques: linguistic, lexical, and semantic techniques.
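As one illustration of feature selection, the χ² statistic scores how strongly a term is associated with a class using a 2×2 contingency table of document counts. The sketch below uses the standard formula χ² = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)); the labeled documents are hypothetical.

```python
def contingency(docs, term, target):
    """Count documents by (contains term?, belongs to target class?)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term, in_class = term in tokens, label == target
        if has_term and in_class:
            a += 1  # term present, in class
        elif has_term:
            b += 1  # term present, not in class
        elif in_class:
            c += 1  # term absent, in class
        else:
            d += 1  # term absent, not in class
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denominator if denominator else 0.0

# Hypothetical labeled documents: (set of terms, class label).
docs = [
    ({"ball", "the"}, "sports"), ({"ball", "the"}, "sports"),
    ({"ball"}, "sports"), ({"team"}, "sports"),
    ({"ball", "the"}, "other"), ({"the"}, "other"),
    ({"vote"}, "other"), ({"tax"}, "other"),
]

print(chi_square(*contingency(docs, "ball", "sports")))  # 2.0
print(chi_square(*contingency(docs, "the", "sports")))   # 0.0
```

A term spread evenly across classes, like "the" here, scores 0 and is a candidate for removal, while class-specific terms score higher and are kept as features.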

53 Text Mining Application
Document retrieval (search). Document recommendation. Text classification. Text clustering. Text summarization. Text (information) extraction. Text association rule mining. Topic detection.

54 Text Classification
Web page indexing: automatic classification of web documents in web directory-based search engines.

55 Text Clustering
Document clustering: provides an overview of big text data; generates a description model for each document cluster by extracting the key words from the documents in the cluster; the distance function is important. Word clustering: used for thesaurus construction, word sense disambiguation, etc.; two approaches: corpus-based approach and taxonomy-based approach.
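Because the distance function matters for document clustering, one commonly used choice is cosine distance over term-frequency vectors. The sketch below is an illustration under that assumption; the slides do not say which distance function was used, and the whitespace tokenizer is a simplification.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def cosine_distance(doc_a, doc_b):
    """Distance derived from cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(doc_a, doc_b)

print(cosine_distance("text mining", "text clustering"))
print(cosine_distance("text mining", "image search"))
```

Documents that share no terms are at the maximum distance of 1, and documents with identical term distributions are at distance 0 regardless of length.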

56 Text Clustering: Search Engines
Clusty.com (Vivisimo Inc.)

57 Word Associations Example: Association Rules
w1 => w3 with 50% (2/4) support and 67% (2/3) confidence. w3 => w1 with 50% (2/4) support and 100% (2/2) confidence.
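The support and confidence figures above can be reproduced with a short sketch. The slide's underlying document table did not survive in the transcript, so the four documents below are a hypothetical data set constructed to be consistent with the stated fractions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical documents, each represented as a set of words.
docs = [{"w1", "w2", "w3"}, {"w1", "w3"}, {"w1", "w2"}, {"w2"}]

print(support(docs, {"w1", "w3"}))       # 0.5  (2/4)
print(confidence(docs, {"w1"}, {"w3"}))  # 2/3, about 0.67
print(confidence(docs, {"w3"}, {"w1"}))  # 1.0  (2/2)
```

Support is symmetric in w1 and w3, while confidence is not: w1 occurs in three documents but w3 in only two, which is why the two rules differ.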

58 Text Summarization
Abstraction: using IE (information extraction). Important sentence 1, important sentence 2, …: using machine learning.

