Web Information Retrieval and Extraction. Chia-Hui Chang, Associate Professor, National Central University, Taiwan
Sep. 21, Course Content. Web information retrieval: browsing via categories; searching via search engines; query answering. Web information integration: Web page collection; data extraction from semi-structured Web pages; data integration.
Web Categories. Yahoo: fourteen categories and ninety subcategories; categorization by humans. Technology: document classification. Pros and cons: gives an overview of the content in the database; supports browsing without specific targets.
Search Engines. Google: search by keyword matching; business model. Technology: Web crawling; indexing for fast search; ranking for good results. Pros and cons: search engines locate the documents, not the answers.
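The Web crawling step listed above can be illustrated with a minimal breadth-first sketch. It assumes a toy in-memory link graph (`TOY_GRAPH`) and a hypothetical `get_links` callback instead of real HTTP fetching, so it shows only the frontier and visited-set bookkeeping, not how any particular search engine crawls.

```python
from collections import deque

# Toy link graph standing in for the Web (assumption, not real data).
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from seed, never visiting a page twice."""
    visited = {seed}          # pages already queued or fetched
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order

crawl_order = crawl("a", lambda url: TOY_GRAPH.get(url, []))
print(crawl_order)
```

A real crawler would also need politeness delays, URL normalization, and robots.txt handling, none of which are modeled here.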
Question Answering. Ask Jeeves: input a question or keywords; relevance feedback from users to clarify the targets. ExtrAns (Molla et al., 2003). Technology: text information extraction; natural language processing.
Web Page Collection. Metacrawler: Google, Yahoo, Ask Jeeves, About, LookSmart, Overture, FindWhat. Ebay: information asymmetry between buyers and sellers. Technology: program generators (WNDL, W4F, XWrap, Robomaker).
Data Extraction from Semi-structured Documents. Example. Technology: information extraction systems (WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.); data annotation. Wrapper induction is an excellent exercise in applying machine learning technologies.
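To make the idea of a wrapper concrete, here is a minimal sketch of a left-right delimiter wrapper, in the spirit of the simplest wrapper class studied in the wrapper induction literature. The page string and the delimiter pairs are made up for illustration and are written by hand; a wrapper induction system would instead learn the delimiters from labeled example pages.

```python
def lr_extract(page, delimiters):
    """Extract records using one (left, right) delimiter pair per attribute.

    The page is scanned left to right; each attribute value is the text
    between its left delimiter and the next occurrence of its right delimiter.
    """
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # malformed tail, stop extracting
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

# Hypothetical semi-structured page: country name in <b>, code in <i>.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
records = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
print(records)
```

The sketch breaks as soon as a delimiter also appears inside a value or in the page header, which is one reason more expressive wrapper classes exist.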
Data Integration. Technology: template-based interface design; Microsoft visual programming tools.
Available Techniques. Artificial intelligence: search and logic programming. Machine learning: supervised learning (classification); unsupervised learning (clustering). Database and warehousing: OLAP and iceberg queries. Data mining: pattern mining from large data sets. Other disciplines: statistics, neural networks, genetic algorithms, etc.
Classical Tasks. Classification: artificial intelligence, machine learning. Clustering: pattern recognition, neural networks. Pattern mining: association rules, sequential patterns, episode mining, periodic patterns, frequent continuities, etc.
Classification Methods. Supervised learning (concept learning): general-to-specific ordering; decision tree learning; Bayesian learning; instance-based learning; sequential covering algorithms; artificial neural networks; genetic algorithms. Reference: Mitchell, 1997.
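As one concrete instance of the methods listed above, the following sketch shows instance-based learning in its simplest form, a k-nearest-neighbor classifier. The training points and the labels "spam" and "ham" are invented for illustration, and Euclidean distance with majority voting is an assumed choice, not something prescribed by the slides.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label query by majority vote among its k nearest training examples."""
    # Sort labeled examples by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up two-dimensional training set: (feature vector, class label).
train = [
    ((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"), ((0.9, 1.1), "spam"),
    ((5.0, 5.0), "ham"), ((5.2, 4.9), "ham"), ((4.8, 5.1), "ham"),
]

print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.1, 5.0)))
```

No model is built ahead of time; all work happens at query time, which is the defining trait of instance-based methods.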
Clustering Algorithms. Unsupervised learning (comparative analysis): partition methods; hierarchical methods; model-based clustering methods; density-based methods; grid-based methods. Reference: Han and Kamber (Chapter 8).
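A partition method can be sketched with k-means, probably the best-known algorithm of that family. The version below is a minimal illustration under stated assumptions: the data points are invented, and the centroids are initialized naively from the first k points so the run is deterministic (practical implementations usually use randomized or smarter seeding).

```python
import math

def kmeans(points, k, max_iterations=20):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = [tuple(p) for p in points[:k]]  # naive init (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters

points = [(1.0, 1.0), (5.0, 5.0), (1.5, 2.0), (6.0, 5.0), (1.0, 0.5), (5.5, 6.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the initial centroids, so different seeds can yield different partitions on less cleanly separated data.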
Pattern Mining. Various kinds of patterns: association rules (closed itemsets, maximal itemsets, non-redundant rules, etc.); sequential patterns; episode mining; periodic patterns; frequent continuities.
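Association rules rest on two measures, support and confidence, and on finding frequent itemsets. The sketch below uses a small level-wise (Apriori-style) search over an invented market-basket data set; the function names, the transactions, and the 0.5 support threshold are all illustrative assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        # Count how many transactions contain each candidate itemset.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join step: merge frequent itemsets into candidates one item larger.
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(A and C) / support(A)."""
    return frequent[antecedent | consequent] / frequent[antecedent]

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rule_confidence = confidence(frequent, frozenset({"diapers"}), frozenset({"beer"}))
print(len(frequent), rule_confidence)
```

The pruning relies on the fact that no superset of an infrequent itemset can be frequent, which keeps the candidate space small.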
Applications. Relational data: e.g. Northern Group Retail (business intelligence); banking, insurance, health, others. Web information retrieval and extraction. Bioinformatics. Multimedia mining. Spatial data mining. Time-series data mining.
Techniques from Information Retrieval (IR). Text operations: lexical analysis of the text; elimination of stop words; index term selection. Indexing and searching: inverted files; suffix trees and suffix arrays; signature files. Ranking models. Query operations: relevance feedback; query expansion.
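The text operations, inverted file, and ranking steps listed above fit together as in the following minimal sketch. The stop-word list, the three documents, and the simple tf-idf scoring are illustrative assumptions; the ranking models covered in the course may differ.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny stop-word list for illustration only (assumption).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def tokenize(text):
    """Lexical analysis: lowercase, split into terms, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs):
    """Inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

docs = {
    1: "The crawler collects Web pages",
    2: "Indexing of Web pages for fast search",
    3: "Ranking models for search engines",
}
index = build_index(docs)
ranking = search(index, len(docs), "web search")
print(ranking)
```

Only the postings of the query terms are touched at search time, which is why an inverted file answers keyword queries much faster than scanning every document.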
Course Schedule. Techniques from information retrieval: text operations; indexing and searching; ranking models; query operations. Text information extraction for query answering: AutoSlog, SRV, Rapier, etc. Data extraction from semi-structured Web pages: WIEN, Softmealy, Stalker, IEPAD, DeLA, Roadrunner, EXALG, OLERA, etc. Web page collection: XWrap, W4F, Robomaker, etc.
Grading. Two projects (by groups), 50%: chosen from the topics covered in the course; presentation and reports. Paper reading (by yourself), 20%: presentation. Information integration projects, 30%: chosen freely; presentation and reports.
References
Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley.
Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Mitchell, T. M. Machine Learning. McGraw-Hill.
Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003.