Web-based Information Architectures Jian Zhang

Today’s Topics
Term Weighting Scheme
Vector Space Model & GVSM
Evaluation of IR
Rocchio Feedback
Web Spider Algorithm
Text Mining: Named Entity Identification
Data Mining
Text Categorization (kNN)

Term Weighting Scheme
TW = TF * IDF
–TF part = f1(tf(term, doc))
–IDF part = f2(idf(term)) = f2(N / df(term))
–E.g., f1(tf) = normalized_tf = tf / max_tf; f2(idf) = log2(idf)
–E.g., f1(tf) = tf; f2(idf) = 1
NOTE the definition of DF: df(term) counts the number of documents that contain the term, not the term's total number of occurrences, and N is the number of documents in the collection.
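As a concrete illustration, here is a minimal sketch of the scheme in Python, using the normalized-tf and log2 choices for f1 and f2 above; the toy corpus and function name are illustrative, not from the slides.

import math
from collections import Counter

def term_weights(doc_tokens, df, n_docs):
    """TW = f1(tf) * f2(idf), with f1(tf) = tf / max_tf and f2(idf) = log2(N / df)."""
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    return {t: (tf[t] / max_tf) * math.log2(n_docs / df[t]) for t in tf}

# df counts documents containing each term (the DF definition above).
docs = [["rain", "cold", "rain"], ["sport", "win"], ["rain", "sport"]]
df = Counter(t for d in docs for t in set(d))
print(term_weights(docs[0], df, len(docs)))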

Document & Query Representation
Bag of words: the Vector Space Model (VSM)
Word normalization
–Stopword removal
–Stemming
Proximity phrases
Each element of the vector is the term weight of that term w.r.t. the document/query.

Similarity Measure
Dot product: Sim(Q, D) = Q · D = Σi qi * di

Similarity Measure
Cosine similarity: Sim(Q, D) = (Q · D) / (|Q| |D|)

Information Retrieval
Basic assumption: documents relevant to a query share words with it.
Similarity measures
–Dot product
–Cosine similarity (dot product normalized by vector lengths)
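A minimal sketch of both measures over term-weight vectors, assuming dense vectors indexed by a shared vocabulary (the data is illustrative):

import math

def dot(q, d):
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Dot product normalized by vector lengths; 0 if either vector is all zeros."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

q = [1.0, 0.0, 0.585]
d = [0.5, 0.792, 0.0]
print(dot(q, d), cosine(q, d))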

Evaluation
Using the standard contingency counts (a = relevant documents retrieved, b = non-relevant documents retrieved, c = relevant documents not retrieved):
Recall = a / (a + c)
Precision = a / (a + b)
F1 = 2 * recall * precision / (recall + precision)
Accuracy is a bad measure for IR: relevant documents are a tiny fraction of any realistic collection, so a system that retrieves nothing still scores near-perfect accuracy.
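A small worked example with made-up counts: suppose a query has 10 relevant documents in the collection and the system retrieves 8 documents, 6 of them relevant. Then a = 6, b = 2, c = 4:

def evaluate(a, b, c):
    recall = a / (a + c)
    precision = a / (a + b)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1

print(evaluate(a=6, b=2, c=4))  # recall = 0.6, precision = 0.75, F1 ≈ 0.667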

Refinement of VSM
Query expansion
Relevance feedback
–Rocchio formula: Q' = α·Q + β·(1/|Drel|)·Σ(d in Drel) d - γ·(1/|Dnonrel|)·Σ(d in Dnonrel) d
–α weights the original query, β weights the centroid of the documents judged relevant (Drel), and γ weights the centroid of the documents judged non-relevant (Dnonrel).
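A minimal sketch of one Rocchio update in Python, assuming dense vectors; the default α, β, γ values are illustrative, since the slides only name the parameters:

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio relevance-feedback update over dense term-weight vectors."""
    n = len(query)
    rel_centroid = [sum(d[i] for d in rel_docs) / len(rel_docs) for i in range(n)]
    nonrel_centroid = [sum(d[i] for d in nonrel_docs) / len(nonrel_docs) for i in range(n)]
    # Negative term weights are usually clipped to zero.
    return [max(0.0, alpha * query[i] + beta * rel_centroid[i] - gamma * nonrel_centroid[i])
            for i in range(n)]

print(rocchio([1.0, 0.0], [[0.8, 0.4]], [[0.0, 0.9]]))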

Generalized Vector Space Model
Given a collection of n training documents, represent each term as an n-dimensional vector of its weights across those documents:

        D1    D2   ...  Dj   ...  Dn
  T1    w11   w12  ...  w1j  ...  w1n
  T2    w21   w22  ...  w2j  ...  w2n
  ...   ...   ...  ...  ...  ...  ...
  Ti    wi1   wi2  ...  wij  ...  win
  ...   ...   ...  ...  ...  ...  ...
  Tm    wm1   wm2  ...  wmj  ...  wmn

GVSM (2)
Define the similarity between terms ti and tj as Sim(ti, tj) = cos(ti, tj), using the term vectors above.
The similarity between a query and a document is then based on term-term similarity:
–For each query term qi, find the term t in the document D that is most similar to qi. This value, viD, can be considered the similarity between the single-word query qi and the document D.
–Sum the similarities between each query term and the document D. The sum is the similarity between the query and the document D.

GVSM (3)
Sim(Q, D) = Σi [maxj sim(qi, dj)]
or, normalizing for document and query length:
Simnorm(Q, D) = Σi [maxj sim(qi, dj)] / (|Q| |D|)
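A minimal sketch of the unnormalized GVSM score above, assuming the term vectors (rows of the term-document matrix) are already available; names and data are illustrative:

import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def gvsm_sim(query_terms, doc_terms, term_vectors):
    """Sum over query terms of the best term-term cosine found in the document."""
    return sum(max(cos(term_vectors[q], term_vectors[d]) for d in doc_terms)
               for q in query_terms)

# Term vectors over a 3-document collection (rows of the term-document matrix).
term_vectors = {"rain": [1.0, 0.0, 0.5], "storm": [0.8, 0.1, 0.6], "sport": [0.0, 1.0, 0.4]}
print(gvsm_sim(["rain"], ["storm", "sport"], term_vectors))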

Maximal Marginal Relevance
Reduces redundancy among results while keeping them relevant: each pick favors items that are novel with respect to what has already been selected.
Formula (applied iteratively to select k items):
MMR(Q, C, R) = argmax(di in C) [λ·S(Q, di) - (1 - λ)·max(dj in R) S(di, dj)]
where C is the candidate set, R is the set of already-selected items, and λ trades off relevance against novelty.
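A minimal sketch of iterative MMR selection in Python, assuming cosine as the similarity S and illustrative vectors:

import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def mmr_select(query, candidates, k, sim=cos, lam=0.7):
    """Iteratively pick k items, trading query relevance against novelty."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            novelty_penalty = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(query, d) - (1 - lam) * novelty_penalty
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

sentences = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mmr_select([1.0, 0.2], sentences, k=2))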

MMR Example (Summarization)
The slides walk through summarizing a six-sentence document {S1, ..., S6} with λ = 0.7 and Sim(Q, S) = Q · S / (|Q| |S|), selecting one sentence per MMR step: S3 is selected first, then S1, then S4, giving the summary {S3, S1, S4}. [The original slides show this as a sequence of diagrams connecting the query, the full text, and the growing summary.]

Text Categorization Task
You want to classify a document into some categories automatically; for example, the categories might be "weather" and "sport". To do that, you can use the kNN algorithm. kNN requires a collection of documents, each labeled with its categories by a human.

Text Categorization Procedure
Using the VSM, represent each document in the training data.
Using the VSM, represent the document to be categorized (the new document).
Using cosine similarity (or another measure; cosine works well here because it normalizes away document length), find the top k documents in the training data that are most similar to the new document: the k nearest neighbors.
Decide the categories for the new document from those of its k nearest neighbors, e.g. by majority vote.
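A minimal kNN categorizer following this procedure, assuming documents are already VSM vectors; majority voting over neighbor labels is one common decision rule, not fixed by the slides:

import math
from collections import Counter

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn_categorize(new_doc, training, k=3):
    """training: list of (vector, category) pairs. Majority vote over the k nearest."""
    neighbors = sorted(training, key=lambda tc: cos(new_doc, tc[0]), reverse=True)[:k]
    votes = Counter(cat for _, cat in neighbors)
    return votes.most_common(1)[0][0]

training = [([1.0, 0.1], "weather"), ([0.9, 0.2], "weather"), ([0.1, 1.0], "sport")]
print(knn_categorize([0.8, 0.3], training, k=3))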

Web Spider
The web graph at any instant of time consists of some number of connected subgraphs that are not linked to one another.
The spider algorithm given in class is a depth-first search through one such web subgraph.
Respidering the same page is avoided by keeping a hash table of visited URLs.
Completeness is not guaranteed, since a spider can only reach pages connected to its seeds; a partial solution is to choose seed URLs that are as diverse as possible.

Web Spider
PROCEDURE SPIDER4(G, {SEEDS})
  Initialize COLLECTION
  Initialize VISITED
  For every ROOT in SEEDS
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    While STACK is not empty,
      Do URL_curr := pop(STACK)
      Until URL_curr is not in VISITED
      insert-hash(URL_curr, VISITED)
      PAGE := look-up(URL_curr)
      STORE(<URL_curr, PAGE>, COLLECTION)
      For every URL_i in PAGE,
        push(URL_i, STACK)
  Return COLLECTION
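A runnable Python sketch of the same depth-first procedure; the fetch_links helper and the toy web graph are illustrative stand-ins for real fetching and link extraction, which the slides do not specify:

def spider4(seeds, fetch_links, max_pages=100):
    """Depth-first crawl. fetch_links(url) -> (page, [links])."""
    collection = {}    # URL -> page (the COLLECTION)
    visited = set()    # hash table of URLs already spidered
    for root in seeds:
        stack = [root]
        while stack and len(collection) < max_pages:
            url = stack.pop()          # depth-first: LIFO order
            if url in visited:
                continue               # "until URL_curr is not in VISITED"
            visited.add(url)
            page, links = fetch_links(url)
            collection[url] = page     # STORE(<URL_curr, PAGE>, COLLECTION)
            stack.extend(links)
    return collection

# Toy web graph standing in for real fetching.
web = {"a": ("page a", ["b", "c"]), "b": ("page b", ["a"]), "c": ("page c", [])}
print(spider4(["a"], lambda u: web[u]).keys())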

Text Mining
Components of text mining:
–Categorization by topic or genre
–Fact extraction from text
–Data mining from DBs or extracted facts

Fact Extraction from Text
Named entity identification
–FSA/FST, HMM
Role-situated named entities
–Apply context information
Information extraction
–Template matching

Named Entity Identification
Definition of a Finite State Acceptor (FSA):
–Takes an input source (e.g. a string of words)
–Outputs "YES" or "NO"
Definition of a Finite State Transducer (FST):
–An FSA with variable binding
–Outputs "NO", or "YES" plus variable bindings
–The variable bindings encode the recognized entity, e.g. "YES" with a binding such as <number = 3.14>

Named Entity Identification
Example: identify numbers such as 1, 2.0, -3.22, +3e2, 4e-5, over the digit set D = {0,1,2,3,4,5,6,7,8,9}.
[The slide draws the FSA as a state diagram: from Start, an optional sign (+/-) leads into a run of digits D; a '.' may introduce a fractional run of digits; an 'e', optionally followed by a sign, introduces an exponent run of digits.]
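A minimal sketch of such an acceptor in Python, written as an explicit state machine rather than a regular expression so the states mirror the diagram; the state names are illustrative:

def accept_number(s):
    """FSA for numbers like 1, 2.0, -3.22, +3e2, 4e-5. Returns True/False ("YES"/"NO")."""
    DIGITS = set("0123456789")
    state = "start"
    for ch in s:
        if state == "start":
            state = "int" if ch in DIGITS else ("sign" if ch in "+-" else None)
        elif state in ("sign", "int"):
            state = "int" if ch in DIGITS else (
                "dot" if ch == "." and state == "int" else (
                "exp" if ch == "e" and state == "int" else None))
        elif state in ("dot", "frac"):
            state = "frac" if ch in DIGITS else (
                "exp" if ch == "e" and state == "frac" else None)
        elif state in ("exp", "expsign", "expint"):
            state = "expint" if ch in DIGITS else (
                "expsign" if ch in "+-" and state == "exp" else None)
        if state is None:
            return False
    return state in ("int", "frac", "expint")   # accepting states

for s in ["1", "2.0", "-3.22", "+3e2", "4e-5", "e5", "3."]:
    print(s, accept_number(s))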

Data Mining
Learning by caching
–What/when to cache
–When to use/invalidate/update the cache
Learning from examples (a.k.a. "supervised" learning)
–Labeled examples for training
–Learn the mapping from examples to labels
–E.g.: Naive Bayes, decision trees, ...
–Text categorization (using kNN or other means) is a learning-from-examples task

Data Mining "Speedup" Learning –Tuning search heuristics from experience –Inducing explicit control knowledge –Analogical learning (generalized instances) Optimization "policy" learning –Predicting continuous objective function –E.g. Regression, Reinforcement,... New Pattern Discovery (aka "Unsupervised" Learning) –Finding meaningful correlations in data –E.g. association rules, clustering,...

Generalize vs. Specialize
Generalize: first, treat each record in your database as a rule; then generalize (how? when to stop?).
Specialize: first, start from a very general rule (almost useless); then specialize (how? when to stop?).

Methods for Supervised DM
Classifiers:
–Linear separators (regression)
–Naive Bayes (NB)
–Decision trees (DTs)
–k-nearest neighbor (kNN)
–Decision rule induction
–Support vector machines (SVMs)
–Neural networks (NNs)
–...