Published by Melvyn Marshall. Modified over 9 years ago.
1
Web Mining
By: Pawan Singh, Piyush Arora, Pooja Mansharamani, Pramod Singh, Praveen Kumar
2
Outline
- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions
3
Four Problems
1. Finding relevant information
   - Low precision: many of the search results are irrelevant, which makes the relevant information hard to find.
   - Low recall: not all the information available on the web can be indexed, which makes relevant but unindexed information hard to find.
2. Creating new knowledge out of the information available on the web
   - While the problem above is a query-triggered (retrieval-oriented) process, this problem is a data-triggered process.
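The precision and recall figures mentioned above can be computed for a single query as follows. This is a minimal sketch: the page identifiers and the relevance judgments are made up for illustration and do not come from the presentation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of page ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned pages that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant pages that were actually returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns five pages, two of which are relevant,
# while four relevant pages exist (two of them were never indexed).
retrieved = {"p1", "p2", "p3", "p4", "p5"}
relevant = {"p1", "p2", "p8", "p9"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the three irrelevant results lower precision, and the two unindexed pages lower recall.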
4
Four Problems (continued)
3. Personalizing the information
   - Catering to personal preferences in content and presentation (the type and presentation of the information)
4. Learning about the consumers
   - What does the customer want to do?
   - Using web data to market products and/or services effectively
5
Other Approaches
Web mining is NOT the only approach:
- Database approach (DB)
- Information retrieval (IR)
- Natural language processing (NLP)
  - In-depth syntactic and semantic analysis
- Web document community
  - Standards, manually appended meta-information, maintained directories, etc.
6
Direct vs. Indirect Web Mining
Web mining techniques can be used to solve the information overload problems:
- Directly: attack the problem with web mining techniques
  - E.g. a newsgroup agent classifies news as relevant
- Indirectly: used as part of a bigger application that addresses the problems
  - E.g. used to create index terms for a web search service
7
The Research
- Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
- Focus: research from the machine learning point of view
8
Web Mining: Definition
“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
- Can be viewed as four subtasks
- Not the same as Information Retrieval
- Not the same as Information Extraction
9
Web Mining: Subtasks
- Resource finding: retrieving the intended documents
- Information selection/pre-processing: selecting and pre-processing specific information from the retrieved web resources
- Generalization: discovering general patterns within and across web sites
- Analysis: validating and/or interpreting the mined patterns
10
Web Mining: Not IR
- Information retrieval (IR) is the automatic retrieval of all relevant documents while retrieving as few non-relevant documents as possible
- Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
11
Web Mining: Not IE
- Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select the relevant documents
- IE systems for the general Web are not feasible
- Most focus on specific Web sites or content
12
IE - IR
Information Retrieval
- Automatic retrieval of relevant documents
- Primary goals:
  - Indexing text
  - Searching for useful documents in a collection
- Views a document as a “bag of unordered words”
- The “Web document classification” task is an instance of IR
Information Extraction
- Extract relevant facts from documents
- Primary goals:
  - Transform a collection of retrieved documents into information
  - Focus on the structure or representation of a document
- IE has a higher level of granularity
- Result:
  - Structured database
  - Compression or summary of text or documents
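To make the IR side concrete, the sketch below indexes a few documents as bags of unordered words and searches them with a simple inverted index. The documents, the tokenizer, and the names `bag_of_words`, `build_index`, and `search` are illustrative assumptions, not part of the presentation.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a text as term counts, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bag_of_words(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = list(bag_of_words(query))
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection.
docs = {
    "d1": "Web mining discovers knowledge from web data",
    "d2": "Information retrieval selects relevant documents",
    "d3": "Web usage mining analyses server logs",
}
index = build_index(docs)
hits = search(index, "web mining")
print(sorted(hits))
```

The search returns whole documents, which is the IR view; turning those documents into structured facts is the IE step.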
13
Types of IE
IE from unstructured texts (classical)
- Unstructured: free texts, e.g. news stories
- Basic to deep linguistic pre-processing
IE from semi-structured texts (structural)
- Semi-structured: HTML
- Uses meta-information, e.g. HTML tags
- Wrapper induction: machine learning is used to build systems (semi-)automatically
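As an illustration of how HTML tags serve as meta-information for extraction, here is a small hand-written extractor built on Python's standard `html.parser` module. Note that it is written by hand rather than learned, so it shows what a wrapper does, not wrapper induction itself. The page fragment, the `TableRowExtractor` class, and the field names are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            # Rows without <td> cells (e.g. header rows) are skipped.
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical semi-structured page fragment.
page = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Data Mining</td><td>45.00</td></tr>
  <tr><td>Web Search</td><td>30.50</td></tr>
</table>
"""

extractor = TableRowExtractor()
extractor.feed(page)
# Turn the extracted rows into structured records.
records = [{"title": title, "price": float(price)} for title, price in extractor.rows]
print(records)
```

An extractor like this only works for pages that share the same layout, which is consistent with IE systems focusing on specific sites or content.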
14
Web Mining and Machine Learning
- Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
- Web mining is NOT the same as learning from the Web.
- Some applications of machine learning on the web are NOT Web Mining.
- Methods used for Web Mining are NOT limited to machine learning.
- There is a close relationship between web mining and machine learning.
15
Web Mining and Machine Learning
- Machine learning techniques support web mining because they can be applied to the processes in web mining.
- For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.
- In short, web mining intersects with the application of machine learning on the web.
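One common machine learning technique for text classification is a multinomial naive Bayes classifier. The presentation does not name a specific algorithm, so the choice of naive Bayes, the tiny training set, and the `NaiveBayes` class below are all assumptions for illustration. The sketch shows the technique only and makes no claim about how it compares with IR methods.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled pages.
texts = [
    "cheap flights and hotel deals",
    "football match score tonight",
    "hotel booking discount flights",
    "league football transfer news",
]
labels = ["travel", "sport", "travel", "sport"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("discount hotel deals"))
print(model.predict("football score"))
```

Words never seen in training are ignored at prediction time, which keeps the sketch short; a real system would need far more training data.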
16
Web Mining Categories
- Web Content Mining: discovering useful information from web contents/data/documents
- Web Structure Mining: discovering the model underlying link structures (topology) on the Web
  - E.g. discovering authorities and hubs
- Web Usage Mining: making sense of data generated by surfers
  - Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
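Authorities and hubs are usually computed with the HITS algorithm, which the slide alludes to but does not name. The sketch below is a simplified version under that assumption: a good hub links to good authorities, and a good authority is linked to by good hubs. The link graph and the `hits` function name are hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) scores for a graph given as page -> linked pages."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical link graph: a and b both point to c and d, and d points to c.
graph = {"a": ["c", "d"], "b": ["c", "d"], "c": [], "d": ["c"]}
hub, auth = hits(graph)
print("top authority:", max(auth, key=auth.get))
print("hub scores:", {n: round(s, 3) for n, s in sorted(hub.items())})
```

In this toy graph page c collects the most inbound links from strong hubs, so it ends up as the top authority, while a and b share the top hub score.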
17
Web Content Data Structure
- Unstructured: free text
- Semi-structured: HTML
- More structured: table or database-generated HTML pages
- Multimedia data: receives less attention than text or hypertext
18
Web Structure Mining
- Interested in the structure between Web documents (not within a document)
- Example: PageRank (Google)
- Applications:
  - Discovering micro-communities in the Web
  - Measuring the “completeness” of a Web site
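The PageRank example can be sketched with the textbook power-iteration formulation. This is a simplified illustration and not Google's production system. The damping factor of 0.85 is the commonly cited default, and the four-page site is hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return a rank per page for a graph given as page -> linked pages."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in nodes}
        for page in nodes:
            targets = graph.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical site structure.
graph = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
rank = pagerank(graph)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{page:7s} {score:.3f}")
```

Only the links between pages are used here, never the page contents, which matches the focus on structure between documents rather than within them.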
19
Web Usage Mining
- Tries to predict user behavior from interaction with the Web
- Wide range of data (logs):
  - Web client data
  - Proxy server data
  - Web server data
- Two common approaches:
  - Map usage data into relational tables before using adapted data mining techniques
  - Use log data directly by utilizing special pre-processing techniques
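The first approach, mapping usage data into relational tables, might look like the sketch below. It assumes web server logs in Common Log Format and uses an in-memory SQLite database. The log lines, the `hits` table, and its column names are invented for illustration.

```python
import re
import sqlite3

# Pattern for Common Log Format lines (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical web server log entries.
log_lines = [
    '10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2024:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [10/Oct/2024:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [10/Oct/2024:13:58:40 +0000] "GET /missing.html HTTP/1.1" 404 -',
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, "
    "status INTEGER, size INTEGER)"
)

for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not follow the expected format
    size = None if match["size"] == "-" else int(match["size"])
    conn.execute(
        "INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)",
        (match["host"], match["time"], match["method"], match["path"],
         int(match["status"]), size),
    )

# Once the data is relational, ordinary queries can serve as a first mining step.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

Storing the parsed fields in a table lets standard SQL and adapted data mining tools operate on the usage data; the second approach would instead analyse the log lines directly after specialised pre-processing.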
20
Thank you!