1
Web Information Retrieval and Extraction
Chia-Hui Chang, Associate Professor
National Central University, Taiwan
chia@csie.ncu.edu.tw
Sep. 16, 2005
2
Sep. 21, 2004
Course Content
  Web Information Integration
  Web Information Retrieval
  Traditional IR systems
  Web Mining
3
Topic I: Web Information Integration
  Search Interface Integration
  Web page collection
  Web data extraction
  Search result integration
  Web Service
4
Web Page Collection
  Metacrawler http://www.metacrawler.com/
    Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat
  eBay http://www.ebay.com/
    Information asymmetry between buyers and sellers
  Technology
    Program generators: WNDL, W4F, XWrap, Robomaker
5
Web Data Extraction
  Example
  Technology
    Information Extraction Systems: WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.
    Data Annotation
  Wrapper induction is an excellent exercise in applying machine learning technologies
6
Topic II: Web Information Retrieval
  From User Perspective
    Browsing via categories
    Searching via search engines
    Query answering
  From System Perspective
    Web crawling
    Indexing and querying
    Link-based ranking
    Query answering
    Semantic Web, XML retrieval, etc.
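The slide lists link-based ranking without naming an algorithm. As one well-known instance, the sketch below runs a PageRank-style power iteration over a tiny hand-made link graph. The graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all assumptions chosen for illustration, not material from the course.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    # Collect every page that appears as a source or as a link target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: most links point at page "c".
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(GRAPH)
```

In this toy graph page "c" collects links from three pages and ends up with the highest score, while "d", which nobody links to, keeps only the random-jump share.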
7
Web Categories
  Yahoo http://www.yahoo.com
    Fourteen categories and ninety subcategories
    Categorization by humans
  Technology
    Document classification
  Pros and Cons
    Overview of the content in the database
    Browsing without specific targets
8
Search Engines
  Google http://www.google.com
    Search by keyword matching
    Business model
  Technology
    Web Crawling
    Indexing for fast search
    Ranking for good results
  Pros and Cons
    Search engines locate the documents, not the answers
9
Question Answering
  Ask Jeeves http://www.ask.com
    Input a question or keywords
    Relevance feedback from users to clarify the targets
  ExtrAns (Molla et al., 2003)
  Technology
    Text information extraction
    Natural Language Processing
10
Topic III: Techniques from Traditional IR
  Text Operations
    Lexical analysis of the text
    Elimination of stop words
    Index term selection
  Indexing and Searching
    Inverted files
    Suffix trees and suffix arrays
    Signature files
  IR Model and Ranking Technique
  Query Operations
    Relevance feedback
    Query expansion
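The text operations and the inverted file listed on this slide can be sketched together in a few lines. This is a minimal illustration under stated assumptions: the stop-word list, the sample documents, and the helper names (`tokenize`, `index_terms`, `build_inverted_index`, `search_and`) are invented for the example, and a real system would add stemming, term weighting, and on-disk postings.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Stop-word elimination: keep only tokens that are not stop words."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, query):
    """Return ids of documents that contain every non-stop-word query term."""
    terms = index_terms(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical three-document collection.
DOCS = {
    1: "Web crawling and indexing",
    2: "Indexing the Web with inverted files",
    3: "Suffix trees and suffix arrays",
}
INDEX = build_inverted_index(DOCS)
```

Calling `search_and(INDEX, "web indexing")` intersects the postings lists of the two terms, which is the basic conjunctive lookup an inverted file supports.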
11
Topic IV: Web Mining
  Usage Analysis
  Focused Crawling
  Clustering of Web search results
  Text classification
12
Available Techniques
  Artificial Intelligence
    Search and Logic programming
  Machine Learning
    Supervised learning (classification)
    Unsupervised learning (clustering)
  Database and Warehousing
    OLAP and Iceberg queries
  Data Mining
    Pattern mining from large data sets
  Other Disciplines
    Statistics, neural networks, genetic algorithms, etc.
13
Classical Tasks
  Classification
    Artificial Intelligence, Machine Learning
  Clustering
    Pattern recognition, neural networks
  Pattern Mining
    Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.
14
Classification Methods
  Supervised Learning (Concept Learning)
    General-to-specific ordering
    Decision tree learning
    Bayesian learning
    Instance-based learning
    Sequential covering algorithms
    Artificial neural networks
    Genetic algorithms
  Reference: Mitchell, 1997
15
Clustering Algorithms
  Unsupervised learning (comparative analysis)
    Partition Methods
    Hierarchical Methods
    Model-based Clustering Methods
    Density-based Methods
    Grid-based Methods
  Reference: Han and Kamber (Chapter 8)
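As a concrete instance of the partition methods named on this slide, the sketch below implements plain k-means on 2-D points. The slide does not single out k-means, so the algorithm choice is an assumption. The sample points, the fixed iteration count, and the deterministic "first k points" initialization are also assumptions made to keep the example reproducible; practical implementations usually use randomized or k-means++ seeding.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Partition points into k clusters; returns (centroids, cluster label per point)."""
    # Assumed deterministic seeding: the first k points act as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, assignment) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment


# Hypothetical data: two visually separated groups of three points each.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
CENTROIDS, LABELS = kmeans(POINTS, k=2)
```

Even though both initial centroids fall in the lower-left group, the alternating assignment and update steps pull the second centroid toward the upper-right points within a few iterations.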
16
Pattern Mining
  Various kinds of patterns
    Association Rules
      Closed itemsets, maximal itemsets, non-redundant rules, etc.
    Sequential patterns
    Episodes mining
    Periodic patterns
    Frequent continuities
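To make the association-rule vocabulary concrete, the sketch below computes support and confidence over a small basket data set and enumerates frequent itemsets by brute force. The transactions, the 0.6 support threshold, and the function names are assumptions for illustration. The exhaustive enumeration is deliberately naive and is not Apriori or any other pruning algorithm; it is only workable on toy data.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration of itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


FREQUENT = frequent_itemsets(TRANSACTIONS, min_support=0.6)
```

Here the rule {beer} -> {diapers} holds in every transaction containing beer, while the reverse rule {diapers} -> {beer} holds in only three of the four transactions containing diapers, which shows that confidence is not symmetric.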
17
Applications
  Relational Data
    E.g. Northern Group Retail (Business Intelligence)
    Banking, Insurance, Health, others
  Web Information Retrieval and Extraction
  Bioinformatics
  Multimedia Mining
  Spatial Data Mining
  Time-series Data Mining
18
Course Schedule
  Web Data Extraction (3 weeks)
  Web Interface Integration (1 week)
  Web Page Collection (1 week)
  Techniques from Traditional IR (2 weeks)
  Query Answering (1 week)
  Link Based Analysis (1 week)
  Focused Crawling (1 week)
  Web Usage Mining (1 week)
  Clustering Search Result (1 week)
  Text Classification (1 week)
19
Grading
  Project I: 30%
    Implementation of the chosen paper (W10)
  Project II: 30%
    Topic can be chosen freely (W16)
  Paper reading: 20%
    Presentation
  Homework: 10%
  Involvement in the Class: 10%
20
References
  Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley.
  Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
  Mitchell, T. M. 1997. Machine Learning, McGraw-Hill.
  Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.