Information Retrieval (7)


Information Retrieval (7)
Prof. Dragomir R. Radev
radev@umich.edu

11. Lexical semantics and WordNet (IR Winter 2010)

Lexical Networks
- Used to represent relationships between words
- Example: WordNet, created by George Miller's team at Princeton
- Based on synsets (sets of synonyms: interchangeable words) and lexical matrices

Lexical matrix
[Figure not preserved in the transcript: a table with word forms as columns and word meanings as rows; each cell marks that a given form can express a given meaning.]

Synsets
- A synset disambiguates a word form: {board, plank} vs. {board, committee}
- Synonymy is tested by substitution in context (often only weak substitution holds)
- Synonyms must be of the same part of speech

$ ./wn board -hypen

Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board

Sense 1
board
   => committee, commission
      => administrative unit
         => unit, social unit
            => organization, organisation
               => social group
                  => group, grouping

Sense 2
board
   => sheet, flat solid
      => artifact, artefact
         => object, physical object
            => entity, something

Sense 3
board, plank
   => lumber, timber
      => building material

Sense 4
display panel, display board, board
   => display
      => electronic device
         => device
            => instrumentality, instrumentation
               => artifact, artefact
                  => object, physical object
                     => entity, something

Sense 5
board, gameboard
   => surface

Sense 6
board, table
   => fare
      => food, nutrient
         => substance, matter

Sense 7
control panel, instrument panel, control board, board, panel
   => electrical device
      => device
         => instrumentality, instrumentation
            => artifact, artefact
               => object, physical object
                  => entity, something

Sense 8
circuit board, circuit card, board, card
   => printed circuit
      => computer circuit
         => circuit, electrical circuit, electric circuit

Sense 9
dining table, board
   => table
      => furniture, piece of furniture, article of furniture
         => furnishings

Antonymy
- Not simply "x" vs. "not-x": are "rich" and "poor" just negations of each other? (Someone who is not rich is not necessarily poor.)
- A relation between word forms rather than concepts: {rise, ascend} vs. {fall, descend}, where "rise"/"fall" and "ascend"/"descend" are the antonym pairs

Other relations
- Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to "X is a part of Y" or "X is a member of Y"
- Hyponymy: {tree} is a hyponym of {plant}
- WordNet's hierarchical structure is based on hyponymy (and its inverse, hypernymy)

Other features of WordNet
- Index of familiarity
- Polysemy counts

Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)

Compound nouns containing "board": advisory board, appeals board, backboard, backgammon board, baseboard, basketball backboard, big board, billboard, binder's board, binder board, blackboard, board game, board measure, board meeting, board member, board of appeals, board of directors, board of education, board of regents, board of trustees

Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

Top-level concepts
{act, action, activity}  {animal, fauna}  {artifact}  {attribute, property}  {body, corpus}
{cognition, knowledge}  {communication}  {event, happening}  {feeling, emotion}  {food}
{group, collection}  {location, place}  {motive}  {natural object}  {natural phenomenon}
{person, human being}  {plant, flora}  {possession}  {process}  {quantity, amount}
{relation}  {shape}  {state, condition}  {substance}  {time}

WordNet parameters
wn reason -hypen   hypernyms
wn reason -synsn   synsets
wn reason -simsn   synonyms
wn reason -over    overview of senses
wn reason -famln   familiarity/polysemy
wn reason -grepn   compound nouns

12. Latent semantic indexing and singular value decomposition (IR Winter 2010)

Problems with lexical semantics
- Polysemy (sim < cos, cosine overestimates true similarity): bar, bank, jaguar, hot
- Synonymy (sim > cos, cosine underestimates true similarity): building/edifice, large/big, spicy/hot
- Relatedness: doctor/patient/nurse/treatment
- The term-document matrix is sparse
- Need: dimensionality reduction

Techniques for dimensionality reduction
- Based on matrix decomposition (goal: preserve clusters, explain away variance)
- A quick review of matrices: vectors, matrices, matrix multiplication

Eigenvectors and eigenvalues
An eigenvector is an implicit "direction" for a matrix: Av = λv, where v (the eigenvector) is non-zero, though λ (the eigenvalue) can be any complex number in principle.
Computing eigenvalues: λ is an eigenvalue of A iff det(A - λI) = 0 (the characteristic equation).

Eigenvectors and eigenvalues
Example, for A = [-1 3; 2 0] (the matrix implied by the determinant below):
det(A - λI) = (-1 - λ)(-λ) - 3·2 = 0
so λ² + λ - 6 = 0, giving λ1 = 2 and λ2 = -3.
For λ1 = 2, solving (A - 2I)v = 0 yields x1 = x2, e.g. the eigenvector (1, 1).
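
A quick numerical check of the example in Matlab/Octave; the 2x2 matrix A is my reconstruction from the determinant expression above (an assumption, since the slide image is lost):

    % Check of the eigenvalue example; A is reconstructed from
    % det(A - lambda*I) = (-1-lambda)(-lambda) - 3*2 (an assumption).
    A = [-1 3; 2 0];
    [V, D] = eig(A);        % columns of V are eigenvectors, diag(D) the eigenvalues
    disp(diag(D)')          % 2 and -3 (possibly in the other order)
    [~, i] = max(diag(D));  % pick lambda = 2
    disp(V(:,i)' / V(1,i))  % [1 1]: the eigenvector satisfies x1 = x2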

Matrix decomposition
If S is a square matrix with a full set of linearly independent eigenvectors, it can be decomposed as S = UΛU⁻¹, where
U = the matrix whose columns are the eigenvectors of S
Λ = the diagonal matrix of the corresponding eigenvalues
Since SU = UΛ, we get U⁻¹SU = Λ, and hence S = UΛU⁻¹.
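
A small sanity check of the decomposition in the same Matlab/Octave style as the later slides (the example matrix here is mine, not from the slides):

    % S = U*L*inv(U) for a diagonalizable matrix (illustrative example)
    S = [2 1; 1 2];         % symmetric, eigenvalues 1 and 3
    [U, L] = eig(S);        % S*U = U*L
    disp(norm(S - U*L/U))   % ~0, confirming S = U*L*inv(U)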

Example
[Matrix not preserved in the transcript.] Its eigenvalues are 3, 2, and 0. x is an arbitrary vector, yet Sx depends only on the eigenvalues and eigenvectors: writing x in the eigenvector basis, each component is scaled by its eigenvalue.

SVD: Singular Value Decomposition
A = USVᵀ
- U is the matrix of orthonormal eigenvectors of AAᵀ
- V is the matrix of orthonormal eigenvectors of AᵀA
- The diagonal entries of S (the singular values) are the square roots of the eigenvalues of AᵀA
- This decomposition exists for all matrices, dense or sparse
- If A has 3 rows and 5 columns, then U will be 3×3 and V will be 5×5
- In Matlab, use [U,S,V] = svd(A)
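
A sketch tying these claims together in Matlab/Octave (the random matrix is illustrative):

    % SVD of a 3x5 matrix: U is 3x3 (eigenvectors of A*A'), V is 5x5
    A = rand(3, 5);
    [U, S, V] = svd(A);
    disp(size(U))                        % 3 3
    disp(size(V))                        % 5 5
    disp(norm(A - U*S*V'))               % ~0
    % singular values = square roots of the top eigenvalues of A'*A
    disp(diag(S)')
    ev = sort(eig(A'*A), 'descend');
    disp(sqrt(ev(1:3))')                 % matches diag(S)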

Term matrix normalization
[Figure not preserved in the transcript: term-document counts for documents D1-D5, shown before and after normalization.]

Example (Berry and Browne)
Terms:
T1: baby   T2: child   T3: guide   T4: health   T5: home
T6: infant   T7: proofing   T8: safety   T9: toddler
Documents:
D1: infant & toddler first aid
D2: babies & children's room (for your home)
D3: child safety at home
D4: your baby's health and safety: from infant to toddler
D5: baby proofing basics
D6: your guide to easy rust proofing
D7: beanie babies collector's guide

Document term matrix
[Figure not preserved in the transcript: the 9×7 matrix A with terms T1-T9 as rows and documents D1-D7 as columns.]
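
Since the matrix itself did not survive, here is one plausible reconstruction in Matlab/Octave: binary term occurrences with each document column scaled to unit length. The slides' exact weighting is an assumption (their printed values suggest a slightly different scheme):

    % Term-document matrix for the Berry & Browne corpus (terms as rows).
    % Binary occurrences, columns normalized to unit length -- an assumed
    % weighting; the slides' own weights are not recoverable.
    %     D1 D2 D3 D4 D5 D6 D7
    A = [  0  1  0  1  1  0  1;   % T1 baby
           0  1  1  0  0  0  0;   % T2 child
           0  0  0  0  0  1  1;   % T3 guide
           0  0  0  1  0  0  0;   % T4 health
           0  1  1  0  0  0  0;   % T5 home
           1  0  0  1  0  0  0;   % T6 infant
           0  0  0  0  1  1  0;   % T7 proofing
           0  0  1  1  0  0  0;   % T8 safety
           1  0  0  1  0  0  0 ]; % T9 toddler
    A = A * diag(1 ./ sqrt(sum(A.^2, 1)));   % unit-length document columns
    [u, s, v] = svd(A);                      % u: 9x9, s: 9x7, v: 7x7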

Decomposition
u =
  -0.6976  -0.0945   0.0174  -0.6950   0.0000   0.0153   0.1442  -0.0000        0
  -0.2622   0.2946   0.4693   0.1968  -0.0000  -0.2467  -0.1571  -0.6356   0.3098
  -0.3519  -0.4495  -0.1026   0.4014   0.7071  -0.0065  -0.0493  -0.0000   0.0000
  -0.1127   0.1416  -0.1478  -0.0734   0.0000   0.4842  -0.8400   0.0000  -0.0000
  -0.2622   0.2946   0.4693   0.1968   0.0000  -0.2467  -0.1571   0.6356  -0.3098
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339  -0.3098  -0.6356
  -0.3519  -0.4495  -0.1026   0.4014  -0.7071  -0.0065  -0.0493   0.0000  -0.0000
  -0.2112   0.3334   0.0962   0.2819  -0.0000   0.7338   0.4659  -0.0000   0.0000
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339   0.3098   0.6356

v =
  -0.1687   0.4192  -0.5986   0.2261        0  -0.5720   0.2433
  -0.4472   0.2255   0.4641  -0.2187   0.0000  -0.4871  -0.4987
  -0.2692   0.4206   0.5024   0.4900  -0.0000   0.2450   0.4451
  -0.3970   0.4003  -0.3923  -0.1305        0   0.6124  -0.3690
  -0.4702  -0.3037  -0.0507  -0.2607  -0.7071   0.0110   0.3407
  -0.3153  -0.5018  -0.1220   0.7128  -0.0000  -0.0162  -0.3544
  -0.4702  -0.3037  -0.0507  -0.2607   0.7071   0.0110   0.3407

Decomposition
The spread is largest along the first axis (v1): the singular values appear in decreasing order.
s =
  1.5849        0        0        0        0        0        0
       0   1.2721        0        0        0        0        0
       0        0   1.1946        0        0        0        0
       0        0        0   0.7996        0        0        0
       0        0        0        0   0.7100        0        0
       0        0        0        0        0   0.5692        0
       0        0        0        0        0        0   0.1977
  (rows 8 and 9 are zero)

Rank-4 approximation
s4 (s with all but the four largest singular values zeroed) =
  1.5849        0        0        0        0        0        0
       0   1.2721        0        0        0        0        0
       0        0   1.1946        0        0        0        0
       0        0        0   0.7996        0        0        0
  (rows 5-9 are zero)
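
In code, this truncation amounts to zeroing the trailing singular values (a sketch continuing the session above):

    % Rank-k approximation: zero all but the k largest singular values
    k = 4;
    sk = s;
    sk(k+1:end, :) = 0;                      % s4 in the slides
    Ak = u * sk * v';                        % rank-4 approximation of A
    % equivalent "thin" form using only the first k columns
    disp(norm(Ak - u(:,1:k) * s(1:k,1:k) * v(:,1:k)'))   % ~0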

Rank-4 approximation
u*s4*v' =
  -0.0019   0.5985  -0.0148   0.4552   0.7002   0.0102   0.7002
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.1980   0.0514   0.0064   0.2199   0.0535  -0.0544   0.0535
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.2165   0.2494   0.4367   0.2282  -0.0360   0.0394  -0.0360
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008

Rank-4 approximation
u*s4 =
  -1.1056  -0.1203   0.0207  -0.5558        0        0        0
  -0.4155   0.3748   0.5606   0.1573        0        0        0
  -0.5576  -0.5719  -0.1226   0.3210        0        0        0
  -0.1786   0.1801  -0.1765  -0.0587        0        0        0
  -0.4155   0.3748   0.5606   0.1573        0        0        0
  -0.2984   0.4778  -0.6015   0.1018        0        0        0
  -0.5576  -0.5719  -0.1226   0.3210        0        0        0
  -0.3348   0.4241   0.1149   0.2255        0        0        0
  -0.2984   0.4778  -0.6015   0.1018        0        0        0

Rank-4 approximation
s4*v' =
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
  -0.7150   0.5544   0.6001  -0.4686  -0.0605  -0.1457  -0.0605
   0.1808  -0.1749   0.3918  -0.1043  -0.2085   0.5700  -0.2085
  (rows 5-9 are zero)

Rank-2 approximation
s2 (s with all but the two largest singular values zeroed) =
  1.5849        0        0        0        0        0        0
       0   1.2721        0        0        0        0        0
  (rows 3-9 are zero)

Rank-2 approximation
u*s2*v' =
   0.1361   0.4673   0.2470   0.3908   0.5563   0.4089   0.5563
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.1057   0.1205   0.1239   0.1430   0.0293  -0.0341   0.0293
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.2343   0.2454   0.2685   0.3027   0.0286  -0.1073   0.0286
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
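
By the Eckart-Young theorem, the truncated SVD gives the best rank-2 approximation of A in the 2-norm, and the approximation error equals the first discarded singular value. A one-line check (continuing the session; exact values depend on the assumed weighting of A):

    % Approximation error of the rank-2 truncation equals sigma_3
    A2 = u(:,1:2) * s(1:2,1:2) * v(:,1:2)';
    disp(norm(A - A2))    % equals s(3,3): 1.1946 in the slides' decomposition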

Rank-2 approximation
u*s2 =
  -1.1056  -0.1203        0        0        0        0        0
  -0.4155   0.3748        0        0        0        0        0
  -0.5576  -0.5719        0        0        0        0        0
  -0.1786   0.1801        0        0        0        0        0
  -0.4155   0.3748        0        0        0        0        0
  -0.2984   0.4778        0        0        0        0        0
  -0.5576  -0.5719        0        0        0        0        0
  -0.3348   0.4241        0        0        0        0        0
  -0.2984   0.4778        0        0        0        0        0

Rank-2 approximation
s2*v' =
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
  (rows 3-9 are zero)

Documents to concepts and terms to concepts
Projecting documents (columns of A) into concept space:

>> A(:,1)'*u*s
  -0.4238   0.6784  -0.8541   0.1446  -0.0000  -0.1853   0.0095
>> A(:,1)'*u*s4
  -0.4238   0.6784  -0.8541   0.1446        0        0        0
>> A(:,1)'*u*s2
  -0.4238   0.6784        0        0        0        0        0
>> A(:,2)'*u*s2
  -1.1233   0.3650        0        0        0        0        0
>> A(:,3)'*u*s2
  -0.6762   0.6807        0        0        0        0        0

Documents to concepts and terms to concepts (cont'd)
>> A(:,4)'*u*s2
  -0.9972   0.6478        0        0        0        0        0
>> A(:,5)'*u*s2
  -1.1809  -0.4914        0        0        0        0        0
>> A(:,6)'*u*s2
  -0.7918  -0.8121        0        0        0        0        0
>> A(:,7)'*u*s2

Cont'd
Projecting terms (rows of A) into concept space:

>> (s2*v'*A(1,:)')'
  -1.7523  -0.1530   0   0   0   0   0   0   0
>> (s2*v'*A(2,:)')'
  -0.6585   0.4768   0   0   0   0   0   0   0
>> (s2*v'*A(3,:)')'
  -0.8838  -0.7275   0   0   0   0   0   0   0
>> (s2*v'*A(4,:)')'
  -0.2831   0.2291   0   0   0   0   0   0   0
>> (s2*v'*A(5,:)')'

Cont'd
>> (s2*v'*A(6,:)')'
  -0.4730   0.6078   0   0   0   0   0   0   0
>> (s2*v'*A(7,:)')'
  -0.8838  -0.7275   0   0   0   0   0   0   0
>> (s2*v'*A(8,:)')'
  -0.5306   0.5395   0   0   0   0   0   0   0
>> (s2*v'*A(9,:)')'

Properties
A is a term-document matrix (9 terms as rows, 7 documents as columns). What are A*A' and A'*A?
A*A' is the 9×9 term-term similarity (co-occurrence) matrix; A'*A is the 7×7 document-document similarity matrix.

A*A' =
  1.5471   0.3364   0.5041   0.2025   0.3364   0.2025   0.5041   0.2025   0.2025
  0.3364   0.6728        0        0   0.6728        0        0   0.3364        0
  0.5041        0   1.0082        0        0        0   0.5041        0        0
  0.2025        0        0   0.2025        0   0.2025        0   0.2025   0.2025
  0.3364   0.6728        0        0   0.6728        0        0   0.3364        0
  0.2025        0        0   0.2025        0   0.7066        0   0.2025   0.7066
  0.5041        0   0.5041        0        0        0   1.0082        0        0
  0.2025   0.3364        0   0.2025   0.3364   0.2025        0   0.5389   0.2025
  0.2025        0        0   0.2025        0   0.7066        0   0.2025   0.7066

A'*A =
  1.0082        0        0   0.6390        0        0        0
       0   1.0092   0.6728   0.2610   0.4118        0   0.4118
       0   0.6728   1.0092   0.2610        0        0        0
  0.6390   0.2610   0.2610   1.0125   0.3195        0   0.3195
       0   0.4118        0   0.3195   1.0082   0.5041   0.5041
       0        0        0        0   0.5041   1.0082   0.5041
       0   0.4118        0   0.3195   0.5041   0.5041   1.0082
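
A quick check of the connection to the SVD slide, continuing the session above (a sketch):

    % Columns of u are eigenvectors of A*A'; the squared singular values
    % are its eigenvalues (padded with zeros for the extra dimensions).
    disp(norm(A*A' - u * (s*s') * u'))    % ~0
    disp(sort(eig(A*A'), 'descend')')     % diag(s).^2, then two zeros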

Latent semantic indexing (LSI)
- Dimensionality reduction = identification of hidden (latent) concepts
- Query matching in latent space (a sketch follows below)
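
A minimal end-to-end sketch of query matching in the rank-2 latent space, following the slides' convention of projecting with u and s2 (the query itself is hypothetical; cosine ranking is unaffected by the choice of scaling as long as query and documents use the same map):

    % Query matching in LSI space (k = 2), continuing the session above.
    k = 2;
    q = zeros(9, 1);
    q([1 6 9]) = 1;                      % hypothetical query: baby, infant, toddler
    q = q / norm(q);
    qc = q' * u(:,1:k) * s(1:k,1:k);     % query in concept space (1 x k)
    Dc = A' * u(:,1:k) * s(1:k,1:k);     % documents in concept space (7 x k)
    % rank documents by cosine similarity to the query in latent space
    sims = (Dc * qc') ./ (sqrt(sum(Dc.^2, 2)) * norm(qc));
    [~, order] = sort(sims, 'descend');
    disp(order')                         % D4 and D1 should rank near the top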

Useful pointers
http://lsa.colorado.edu
http://lsi.research.telcordia.com
http://www.cs.utk.edu/~lsi

Readings
MRS18; MRS17, MRS19, MRS20