Mining the Semantic Web: Requirements for Machine Learning
Fabio Ciravegna, Sam Chapman
Presented by Steve Hookway, 10/20/05

What is the Semantic Web?
  - A way to automate reasoning with web data
  - RDF
      - A uniform way to describe resources
      - (subject, predicate, object)
  - Ontology
      - Hierarchical structure of data
      - Property restrictions
      - Implicit typing

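As an illustration of the (subject, predicate, object) model, here is a minimal sketch in plain Python. The `Triple` type, the `match` helper, and every `ex:` identifier are hypothetical names chosen for this example; they do not come from the talk or from any RDF library.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example data; the "ex:" names are made up for illustration.
TRIPLES = [
    Triple("ex:Armadillo", "rdf:type", "ex:Tool"),
    Triple("ex:Armadillo", "ex:uses", "ex:LP2"),
    Triple("ex:LP2", "rdf:type", "ex:Algorithm"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples whose fields equal every argument that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]
```

Calling `match(TRIPLES, subject="ex:Armadillo")` returns both statements about that resource. Because every fact has the same three-part shape, one generic query function is enough for all of them.
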
Adding Meta-Data
  - A prerequisite for the Semantic Web (SW) is structured knowledge
  - Manual approach
      - Too much data
      - Trust issues
      - Noise
  - This process needs to be automated

Armadillo
  - Automatically annotate web pages
  - Validity based on a number of weak techniques
      - Redundant information
      - Rating of sources
      - Context around a capture
  - (LP)² - Extraction of knowledge
      - Makes use of Natural Language Processing (NLP)

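One way to picture how several weak, redundant pieces of evidence can add up to a usable validity score is a noisy-OR combination of source ratings. This is only a sketch under stated assumptions: the slides do not say how Armadillo actually combines evidence, and `combined_confidence` and its ratings in [0, 1] are hypothetical.

```python
from math import prod
from typing import Iterable


def combined_confidence(source_ratings: Iterable[float]) -> float:
    """Noisy-OR over independent sources (an assumption, not Armadillo's method).

    Each rating is the assumed probability that one source is correct.
    The result is the probability that at least one supporting source is
    correct, so every extra source can only raise the score.
    """
    return 1.0 - prod(1.0 - r for r in source_ratings)
```

Under this model a fact found on two sources rated 0.5 each scores 0.75, while a fact with no supporting source scores 0.0. That matches the intuition behind using redundancy: no single weak source is trusted, but agreement among several is.
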
(LP)²
  - Induce tagging rules
      - Generalize NLP and keep best rules
      - Remove covered instances from pool
      - High precision, low recall
  - Contextual tagging
      - Recovers rules and constrains their application
  - Correction and validation
      - Shifts tags to correct position (within d spaces)
      - Validation

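The "keep best rules, remove covered instances from pool" step follows the general shape of a sequential covering loop, sketched below. This is a generic illustration, not the (LP)² implementation: the candidate rules, the `min_precision` threshold, and the toy data are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple


def sequential_covering(
    candidates: Dict[str, Callable[[str], bool]],
    positives: Set[str],
    negatives: Set[str],
    min_precision: float = 0.9,
) -> Tuple[List[str], Set[str]]:
    """Greedily pick precise rules until no acceptable rule is left.

    Returns the names of the kept rules and the positives left uncovered.
    """
    remaining = set(positives)
    learned: List[str] = []
    while remaining:
        best_name, best_covered = None, set()
        for name, rule in candidates.items():
            covered = {x for x in remaining if rule(x)}
            if not covered:
                continue
            false_pos = sum(1 for x in negatives if rule(x))
            precision = len(covered) / (len(covered) + false_pos)
            # Keep only precise rules; among those, prefer wider coverage.
            if precision >= min_precision and len(covered) > len(best_covered):
                best_name, best_covered = name, covered
        if best_name is None:
            break  # uncovered positives remain: high precision, low recall
        learned.append(best_name)
        remaining -= best_covered  # remove covered instances from the pool
    return learned, remaining


# Hypothetical toy task: tag strings that are person names.
RULES = {
    "starts_with_Dr.": lambda s: s.startswith("Dr."),
    "starts_with_Prof.": lambda s: s.startswith("Prof."),
    "capitalized": lambda s: s[:1].isupper(),
}
POSITIVES = {"Dr. Smith", "Dr. Jones", "Prof. Lee", "Ann"}
NEGATIVES = {"Drive", "Sheffield"}

learned, uncovered = sequential_covering(RULES, POSITIVES, NEGATIVES)
```

On this toy data the broad "capitalized" rule is rejected because it also fires on the negatives, so "Ann" stays uncovered. That is the high precision, low recall behaviour noted on the slide, which the later contextual tagging and correction steps are meant to offset.
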
Heterogeneity
  - Armadillo
      - Uses weak NLP
      - Uses intra-document relation recognition
  - Requirements
      - Must adapt to different document types
      - Relation extraction

Bootstrapping Learning
  - Armadillo
      - Unsupervised approach: user only validates
      - User cannot drive system towards interesting documents and facts
  - Requirements
      - Identify triples
      - Goal: Bootstrap learning on a large scale
      - User needs a role to guide learning

Content Cleaning and Normalization
  - Armadillo
      - Noise added during unsupervised (LP)²
      - Use the multiple weak evidence to help avoid poor seeds
  - Requirements
      - Handle noisy training data

Conclusion
  - Semantic Web
      - Meta-Data
  - Armadillo: a tool for IE
      - Evidence building and validation
      - Extraction of knowledge (LP)²
  - A survey of requirements in mining web content for SW meta-data