Text & Web Mining 9/22/2018.

Structured Data
So far we have focused on mining from structured data:

Attribute    Value
Outlook      Sunny
Temperature  Hot
Windy        Yes
Humidity     High
Play         Yes

Most data mining involves such data.

Complex Data Types
Increased importance of complex data:
- Spatial data: includes geographic data and medical & satellite images
- Multimedia data: images, audio, & video
- Time-series data: for example, banking data and stock exchange data
- Text data: word descriptions for objects (focus)
- World-Wide-Web: highly unstructured text and multimedia data (focus)

Text Databases
Many text databases exist in practice:
- News articles
- Research papers
- Books
- Digital libraries
- E-mail messages
- Web pages
Growing rapidly in size and importance

Semi-Structured Data
Text databases are often semi-structured. Example:
- Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
- Unstructured: Abstract, Content

Handling Text Data
- Modeling semi-structured data
- Information Retrieval (IR) from unstructured documents
- Text mining
  - Compare documents
  - Rank importance & relevance
  - Find patterns or trends across documents

Information Retrieval
- IR locates relevant documents
  - Key words
  - Similar documents
- IR systems
  - On-line library catalogs
  - On-line document management systems

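Performance Measure
Two basic measures:
- Precision: the fraction of retrieved documents that are relevant, $|\text{Relevant} \cap \text{Retrieved}| / |\text{Retrieved}|$
- Recall: the fraction of relevant documents that are retrieved, $|\text{Relevant} \cap \text{Retrieved}| / |\text{Relevant}|$
[Figure: Venn diagram of all documents, showing the relevant documents, the retrieved documents, and their overlap (relevant & retrieved)]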
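The two basic measures are commonly taken to be precision and recall, both computed from the overlap between the relevant and the retrieved documents. A minimal sketch, assuming documents are identified by ids held in Python sets; the function name and the sample ids are hypothetical:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 relevant documents, 5 retrieved, 3 in common.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d4", "d7", "d9"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero.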

Retrieval Methods
- Keyword-based IR
  - E.g., “data and mining”
  - Synonymy problem: a document may talk about “knowledge discovery” instead
  - Polysemy problem: mining can mean different things
- Similarity-based IR
  - Set of common keywords
  - Return the degree of relevance
  - Problem: what is the similarity of “data mining” and “data analysis”?

Modeling a Document
- Set of n documents and m terms
- Each document is a vector v in $\mathbb{R}^m$
- The j-th coordinate of v measures the association of the j-th term with the document
- It is defined in terms of r and R, where r is the number of occurrences of the j-th term in the document and R is the number of occurrences of any term
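The vector model can be sketched in a few lines. The weighting below is an assumption: coordinate j is taken to be the relative frequency r / R, with R counted over every word token in the document. The function name, the term list and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def term_vector(text, terms):
    """Map a document to a vector with one coordinate per term in `terms`.

    Assumption: coordinate j is the relative frequency r / R.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)  # R: occurrences of any term in the document
    if total == 0:
        return [0.0] * len(terms)
    return [counts[t] / total for t in terms]  # r / R for each term


terms = ["data", "mining", "web"]  # m = 3 hypothetical terms
doc = "Web mining extends data mining to Web data and Web logs"
vector = term_vector(doc, terms)
print(vector)
```

Other weightings (raw counts, logarithmic damping) fit the same interface; only the last line of `term_vector` would change.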

Frequency Matrix

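Similarity Measures
- Cosine measure: $\text{sim}(v_1, v_2) = \dfrac{v_1 \cdot v_2}{|v_1|\,|v_2|}$
- Dot product: $v_1 \cdot v_2 = \sum_{j=1}^{m} v_{1j} v_{2j}$
- Norm of the vectors: $|v| = \sqrt{v \cdot v}$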

Example
Google search for “association mining”
Two of the documents retrieved:
- Idaho Mining Association: mining in Idaho (doc 1)
- Scalable Algorithms for Association mining (doc 2)
Using only the two terms “association” and “mining”
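To make the comparison concrete, here is a minimal sketch of the cosine measure applied to these two results. It rests on a loud assumption: the term counts are taken from the two titles only, not from the full documents, so the numbers are illustrative and are not the figures from the original slide. The helper names are hypothetical.

```python
import math
import re
from collections import Counter


def count_vector(text, terms):
    """Raw occurrence counts of each term in `terms` within `text`."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[t] for t in terms]


def cosine(u, v):
    """Cosine measure: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


terms = ["association", "mining"]
# Assumption: titles stand in for the documents.
doc1 = count_vector("Idaho Mining Association: mining in Idaho", terms)
doc2 = count_vector("Scalable Algorithms for Association mining", terms)
similarity = cosine(doc1, doc2)
print(doc1, doc2, round(similarity, 3))
```

With only these two terms the documents look nearly identical even though they are about different kinds of mining, which is the polysemy problem in miniature.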

New Model
Add the term “data” to the document model

Frequency Matrix
- Will quickly become large
- Singular value decomposition can be used to reduce it
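The reduction can be written down compactly. The notation here is an assumption introduced for illustration: F is the m-by-n frequency matrix (terms by documents) and k is the number of dimensions kept.

```latex
F = U \Sigma V^{T},
\qquad
F \approx F_k = U_k \Sigma_k V_k^{T},
\quad k \ll \min(m, n)
```

Here $\Sigma_k$ holds the k largest singular values and $U_k$, $V_k$ the matching singular vectors, so each document is described by k coordinates instead of m.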

Association Analysis
- Collect sets of keywords frequently used together and find associations among them
- Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}
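A minimal sketch of the first step, finding keyword pairs that occur together often enough. It is a brute-force pair count, not a full association rule algorithm such as Apriori, and the database contents, function name and support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(database, min_support):
    """Return {pair: support} for keyword pairs found in enough documents.

    `database` maps document_id -> set of keywords; support is the
    fraction of documents that contain both keywords.
    """
    pair_counts = Counter()
    for keywords in database.values():
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    n_docs = len(database)
    return {
        pair: count / n_docs
        for pair, count in pair_counts.items()
        if count / n_docs >= min_support
    }


# Hypothetical database in the {document_id, a_set_of_keywords} format.
database = {
    1: {"data", "mining", "association"},
    2: {"data", "mining", "web"},
    3: {"web", "hub", "authority"},
    4: {"data", "mining", "text"},
}
pairs = frequent_keyword_pairs(database, min_support=0.5)
print(pairs)
```

Sorting each keyword set before pairing keeps ("data", "mining") and ("mining", "data") from being counted as different pairs.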

Document Classification
- Need already classified documents as training set
- Induce a classification model
- Any difference from before?
  - A set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification
Classify documents based on associated, frequently occurring text patterns:
- Extract keywords and terms with IR and simple association analysis
- Create a concept hierarchy of terms
- Classify training documents into class hierarchies
- Use association mining to discover associated terms to distinguish one class from another

Remember Generalized Association Rules
Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear (ancestor of Shoes and Hiking Boots)
    Shoes
    Hiking Boots
Generalized association rule: $X \to Y$, where no item in Y is an ancestor of an item in X

Classifiers
- Let X be a set of terms
- Let Anc(X) be those terms and their ancestor terms
- Consider a rule $X \to C$ and a document d
- If $X \subseteq \text{Anc}(d)$ then $X \to C$ covers d
- A rule that covers d may be used to classify d (but only one can be used)

Procedure
- Step 1: Generate all generalized association rules $X \to C$, where X is a set of terms and C is a class, that satisfy minimum support
- Step 2: Rank the rules according to some rule ranking criterion
- Step 3: Select rules from the list
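The cover test and the ranking step can be sketched together. Everything concrete below is an assumption made for illustration: the taxonomy, the hand-written rules with their confidence and support values, and the choice of ranking by confidence and then support. Rule generation (Step 1) and any pruning in Step 3 are not shown; the sketch applies the highest-ranked rule that covers the document.

```python
# Hypothetical term taxonomy: child term -> parent term.
PARENT = {
    "hits": "link analysis",
    "pagerank": "link analysis",
    "link analysis": "web mining",
    "tf-idf": "text mining",
}


def anc(terms):
    """Anc(X): the terms in X together with all of their ancestor terms."""
    closed = set()
    for term in terms:
        while term is not None and term not in closed:
            closed.add(term)
            term = PARENT.get(term)
    return closed


def classify(doc_terms, ranked_rules, default=None):
    """Apply the first rule X -> C, in rank order, with X a subset of Anc(d)."""
    doc_closure = anc(doc_terms)
    for antecedent, label in ranked_rules:
        if set(antecedent) <= doc_closure:  # the rule covers the document
            return label
    return default


# Hypothetical rules as (X, C, confidence, support).
rules = [
    ({"text mining"}, "Text", 0.80, 0.30),
    ({"link analysis"}, "Web", 0.90, 0.20),
    ({"web mining", "tf-idf"}, "Web", 0.90, 0.05),
]
# Step 2: rank by confidence, then support (one possible criterion).
ranked = sorted(rules, key=lambda r: (r[2], r[3]), reverse=True)
ranked_rules = [(x, c) for x, c, _, _ in ranked]

label = classify({"hits", "tf-idf"}, ranked_rules)
print(label)
```

Stopping at the first covering rule reflects the constraint that only one rule may be used per document.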

Web Mining
The World Wide Web may have more opportunities for data mining than any other area. However, there are serious challenges:
- It is huge
- Complexity of Web pages is greater than that of any traditional text document collection
- It is highly dynamic
- It has a broad diversity of users
- Only a tiny portion of the information is truly useful

Search Engines → Web Mining
- Current technology: search engines
  - Keyword-based indices
  - Too many relevant pages
  - Synonymy and polysemy problems
- More challenging: web mining
  - Web content mining
  - Web structure mining
  - Web usage mining

Web Content Mining

Example: Classification of Web Documents
- Assign a class to each document based on predefined topic categories
- E.g., use Yahoo!’s taxonomy and associated documents for training
- Keyword-based document classification
- Keyword-based association analysis

Web Structure Mining

Authoritative Web Pages
- High-quality, relevant Web pages are termed authoritative
- Explore linkages (hyperlinks)
  - Linking to a Web page can be considered an endorsement of that page
  - Pages that are linked to frequently are considered authoritative
  - (This has its roots in IR methods based on journal citations)

Structure via Hubs
- A hub is a set of Web pages containing collections of links to authorities
- There is a wide variety of hubs:
  - Simple list of recommended links on a person’s home page
  - Professional resource lists on commercial sites

HITS
Hyperlink-Induced Topic Search (HITS):
- Form a root set of pages using the query terms in an index-based search (200 pages)
- Expand into a base set by including all pages the root set links to ( pages)
- Go into an iterative process to determine hubs and authorities

Calculating Weights
- Authority weight: $a_p = \sum_{q \to p} h_q$
- Hub weight: $h_p = \sum_{p \to q} a_q$
Here $q \to p$ means that page p is pointed to by page q.

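Adjacency Matrix
- Let's number the pages {1,2,…,n}
- The adjacency matrix A is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise
- By writing the authority and hub weights as vectors a and h we have $h = A a$ and $a = A^{T} h$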

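Recursive Calculations
- With A the adjacency matrix and a, h the authority and hub weight vectors, we now have $h = A A^{T} h$ and $a = A^{T} A a$
- By linear algebra theory, the iteration (with normalization) converges to the principal eigenvectors of the two matrices $A A^{T}$ and $A^{T} A$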
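The iterative process can be sketched without any matrix library. This is a minimal illustration, assuming the base set is already given as a dictionary from each page to the pages it links to; the function names, the fixed iteration count and the toy graph are hypothetical, and normalizing to unit length after each round is a common choice rather than something stated on the slides.

```python
import math


def _normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: (w / norm if norm else 0.0) for p, w in weights.items()}


def hits(links, iterations=50):
    """Iterate the HITS updates on a graph {page: [pages it links to]}.

    Returns (authority, hub) dictionaries, each normalized to unit length.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight of p: sum of hub weights of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub weight of p: sum of authority weights of pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub


# Hypothetical base set: three hub-like pages linking to two authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(links)
top_authority = max(auth, key=auth.get)
print(top_authority, sorted(hub, key=hub.get, reverse=True)[:2])
```

A production version would stop when the weights change by less than a tolerance instead of running a fixed number of rounds.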

Output
The HITS algorithm finally outputs:
- Short list of pages with high hub weights
- Short list of pages with high authority weights
Have not accounted for context

Applications
- The Clever Project at IBM’s Almaden Labs
  - Developed the HITS algorithm
- Google
  - Developed at Stanford
  - Uses algorithms similar to HITS (PageRank)
  - On-line version

Web Usage Mining

Complex Data Types Summary
Emerging areas of mining complex data types:
- Text mining can be done quite effectively, especially if the documents are semi-structured
- Web mining is more difficult due to lack of such structure
  - Data includes text documents, hypertext documents, link structure, and logs
  - Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
