Machine Learning and the Semantic Web

Hendrik Blockeel
Katholieke Universiteit Leuven, Department of Computer Science
Thanks: Raymond Kosala, Nico Jacobs

Overview
- Machine learning and data mining
- Relationship with the Semantic Web
  - Synergy between both
- Some concrete examples
  - Document classification
  - Information integration
- Conclusions

Machine Learning & Data Mining
- Related technology, different focus
- Machine learning: programs that improve their performance on certain tasks
  - Focus on adaptive behaviour
- Data mining: discovering implicit knowledge (regularities) in large amounts of data
  - Focus on handling large amounts of data
- Very useful technology in the context of the Web

Learning Agents
- Programs that
  - Learn the user’s preferences
  - Make life for the user as simple as possible
    - E.g., intelligent mail reader
    - E.g., adaptive web pages: move links, create “direct” links, ...
    - Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
  - Learn how to find reliable information
    - E.g., learn which other people have similar preferences to this user, and use their opinions to make suggestions
- (Other applications: learning to play games, ...)

Mining the Web
- Analyze data that are available on the Web
- Distinguish 3 types:
  - Web content mining: look in contents of documents (text, ...)
  - Web structure mining: look at links between documents
  - Web usage mining: look at user logs (e.g., who accessed a web page, which links are often used, ...)

Web Content Mining
- Relies on information extraction
  - E.g., in a text: find keywords, ...
  - Techniques from machine learning, statistics, ... are used to guess from context
    - what a word means
    - what its function in the text is
    - ...
  - Fill a schema with specific slots, based on analysis of the text
  - Even more complicated: recognise objects in pictures, ...
- Information extraction is a complex matter

Mining for Genes
- Jenssen et al. (2001), Nature Genetics 28, “A literature network of human genes”
- Mining the MEDLINE database of abstracts
  - Find names of genes occurring together
  - Construct similarity graph
  - Construct a database with this information
- Database contains knowledge no single individual has, or could obtain without data mining
- Similar techniques could be used on the Web
  - One extra problem: uncertainty about reliability
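The co-occurrence step above can be sketched in a few lines. This is only an illustration of the idea, not the pipeline of Jenssen et al.: the gene names (`GENEA`, `GENEB`, `GENEC`), the toy abstracts, and the `min_count` threshold are all invented for the example.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_graph(abstracts, gene_names, min_count=1):
    """Count how often two gene names appear in the same abstract.

    Returns a dict mapping an (alphabetically ordered) gene pair to the
    number of abstracts that mention both genes.
    """
    gene_names = set(gene_names)
    edges = Counter()
    for text in abstracts:
        # Crude tokenisation: runs of letters and digits.
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        present = sorted(tokens & gene_names)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return {pair: n for pair, n in edges.items() if n >= min_count}


# Hypothetical toy data, not real MEDLINE content.
toy_abstracts = [
    "GENEA interacts with GENEB in cultured cells.",
    "GENEB and GENEA are expressed together.",
    "GENEC is studied alongside GENEB.",
]
graph = cooccurrence_graph(toy_abstracts, {"GENEA", "GENEB", "GENEC"})
```

Raising `min_count` keeps only pairs that co-occur repeatedly, which is one simple way to cope with noisy mentions.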

Web Structure Mining
- Analyse structure of the Web
  - Which sites have many incoming / outgoing links? Identify “hubs”
  - Find clusters of sites that are strongly interconnected: Web communities
  - ...
- E.g., Google
  - Identifies important pages based on the links that point to them (rather than the contents of the page itself)
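As a minimal sketch of the first question above, incoming and outgoing links can be counted directly from a list of (source, target) pairs. Raw link counts are only a crude proxy for importance and are not claimed to be what Google computes; the page names are hypothetical.

```python
from collections import Counter


def link_degrees(links):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in links:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical link structure between four pages.
toy_links = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("a", "d")]
in_degree, out_degree = link_degrees(toy_links)
most_linked_to = in_degree.most_common(1)[0]     # page with most incoming links
most_linking_out = out_degree.most_common(1)[0]  # page with most outgoing links
```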

Web Usage Mining
- Log user behaviour
  - Which links are often followed, in which order, how long a page is looked at, ...
- Possible at several levels:
  - General usage statistics
  - User-specific statistics
  - Relating behaviour to properties of the user, insofar as available
- E.g., adaptive web sites
  - Adaplix project: automatic index page creation

Web Mining As It Currently Is
- Machine learning / data mining strongly rely on
  - Data quantity
  - Data quality
- Quantity is usually not a problem on the Web
- Quality is!
  - Much data not in easily processable format
    - E.g., inside text documents: need information extraction
  - Unstructured, poorly structured, heterogeneously structured
  - Lots of noise
  - ...

How Is All This Related to the Semantic Web?
- There can be a synergy:
  - Machine learning can help with building the Semantic Web
  - The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent

What Machine Learning Can Do for the Semantic Web
- Upgrading the current web to a semantic web involves a lot of work
- Can partially be automated!
- Examples:
  - Learning ontologies
  - Automatic document classification
  - Information integration
  - ...

Learning Ontologies
- Maedche & Staab (2001), “Ontology learning for the semantic web”
- View:
  - Manual creation of ontologies is very labour-intensive
  - Fully automating the creation of ontologies is not feasible
  - Hence: develop a tool that helps with building ontologies
- Basic components:
  - Good graphical interface (man-machine interaction)
  - Powerful underlying machine learning techniques

Text-To-Onto
- Framework:
  - Import / reuse existing ontologies
  - Extract ontology from documents
    - Identify new terms, map onto existing concepts or define new ones
    - Identify relationships between concepts
    - ...
    - Many opportunities for general machine learning techniques
  - Prune ontology
  - Refine ontology

Some Useful Techniques for Learning Ontologies
- Term extraction from texts
- Identification of concepts
- Hierarchical clustering
  - Clustering: finding groups of “similar” things
  - Hierarchical clustering: clusters of clusters
  - Taxonomy can be constructed through hierarchical clustering of concepts
- Association rules
  - Find sets of terms that often occur together
  - May indicate important relations
  - E.g., events in texts often co-occur with locations
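To make "clusters of clusters" concrete, here is a small sketch of agglomerative (bottom-up) hierarchical clustering. It assumes each concept is described by the set of words it co-occurs with, uses Jaccard distance and single linkage, and returns the taxonomy as nested tuples. The concepts, context words, and the choice of distance and linkage are assumptions made for illustration, not the method of any particular ontology-learning tool.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of context words that two concepts have in common."""
    return 1 - len(a & b) / len(a | b)


def hierarchical_clustering(contexts):
    """Build a taxonomy (nested tuples) from a dict: concept -> set of context words."""
    # Each cluster is a pair (subtree, list of member concepts).
    clusters = [(name, [name]) for name in sorted(contexts)]
    while len(clusters) > 1:

        def linkage(pair):
            # Single linkage: distance between the two closest members.
            (_, members1), (_, members2) = pair
            return min(
                jaccard_distance(contexts[x], contexts[y])
                for x in members1
                for y in members2
            )

        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical concepts with hypothetical context words.
toy_contexts = {
    "hotel": {"room", "book", "stay", "night"},
    "hostel": {"room", "book", "stay", "cheap"},
    "concert": {"ticket", "music", "stage"},
    "festival": {"ticket", "music", "stage", "outdoor"},
}
taxonomy = hierarchical_clustering(toy_contexts)
# taxonomy == (("concert", "festival"), ("hostel", "hotel"))
```

The two inner pairs could be read as candidate sub-concepts of "event" and "accommodation"; naming such clusters is left to the ontology engineer.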

Information Integration
- Doan, Domingos, Halevy: “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD 2001
- Context: given databases with different schemas:
  - Find similarities in schemas, guess how concepts map onto each other
  - Integrate the schemas
- Essentially the same as mapping ontologies onto each other
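One very simple way to "guess how concepts map onto each other" is to compare the values stored under each column. The sketch below is a plain value-overlap heuristic and is explicitly not the learning approach of Doan et al.; the column names and sample values are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two collections of values, from 0 (disjoint) to 1 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_schemas(source, target):
    """Map each source column to the target column whose values overlap most.

    source, target: dict mapping column name -> list of sample values.
    Returns dict mapping source column -> (best target column, overlap score).
    """
    mapping = {}
    for s_col, s_values in source.items():
        best = max(target, key=lambda t_col: jaccard(s_values, target[t_col]))
        mapping[s_col] = (best, jaccard(s_values, target[best]))
    return mapping


# Hypothetical schemas of two databases describing the same kind of items.
db1 = {"location": ["Leuven", "Ghent", "Brussels"], "cost": [100, 250, 300]}
db2 = {"price": [250, 300, 400], "city": ["Leuven", "Brussels", "Antwerp"]}
mapping = match_schemas(db1, db2)
# mapping == {"location": ("city", 0.5), "cost": ("price", 0.5)}
```

A low overlap score is a hint that the proposed mapping should be checked by hand rather than trusted.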

Automated Document Classification
- Mitchell et al.
- Based on examples of web pages + what kind of page they are (course page, student page, ...), learn to classify new pages
- Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g., universities), ...
- Note: helps to relate objects to ontology
- Problem: how to get labelled examples
  - Unlimited amount of unlabelled pages available
  - But labelling them manually is labour-intensive!

Exploiting Unlabelled Data
- A solution: co-training (Blum & Mitchell 1998)
- Learn separate (imperfect) classifiers from disjoint sets of sufficient information
  - E.g., learn to classify pages from
    - Content of page (“Home page of CS 101”)
    - Links pointing to page (“CS 101”)
- Take the classifications that classifier A is most certain of, add these labels to the training set for B (and vice versa)
- Repeat multiple times (a kind of bootstrapping process)
- Co-training makes it possible to exploit large amounts of unlabelled data!
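The loop described above can be sketched as follows. This is a simplified illustration, not the exact procedure of Blum & Mitchell: the `WordCountClassifier` (a small naive-Bayes-style word counter), the choice of one new example per classifier per round, and the toy pages are all assumptions. Each page has two views: the words on the page (view 0) and the words of links pointing to it (view 1).

```python
import math
from collections import Counter, defaultdict


class WordCountClassifier:
    """Small naive-Bayes-style classifier over one view (a list of words)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        """Return (most likely label, its posterior probability)."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab) + 1  # add-one smoothing
            log_scores[label] = math.log(n_docs / total) + sum(
                math.log((counts[w] + 1) / denom) for w in words
            )
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def fit_view(train_set, view):
    """Train a classifier that only looks at one view of each example."""
    return WordCountClassifier().fit(
        [x[view] for x, _ in train_set], [y for _, y in train_set]
    )


def co_train(labelled, unlabelled, rounds=5):
    """labelled: list of ((content_words, link_words), label);
    unlabelled: list of (content_words, link_words).
    Returns the two final classifiers [content classifier, link classifier]."""
    train = [list(labelled), list(labelled)]  # one training set per view
    pool = list(unlabelled)
    for _ in range(rounds):
        classifiers = [fit_view(train[v], v) for v in (0, 1)]
        for v in (0, 1):
            if not pool:
                break
            # Classifier v labels the pooled example it is most certain of ...
            predictions = [classifiers[v].predict(x[v]) for x in pool]
            idx = max(range(len(pool)), key=lambda i: predictions[i][1])
            label = predictions[idx][0]
            # ... and that label is added to the training set of the OTHER view.
            train[1 - v].append((pool.pop(idx), label))
    return [fit_view(train[v], v) for v in (0, 1)]


# Hypothetical toy pages: (words on the page, words in links pointing to it).
toy_labelled = [
    (("home page of cs 101 course".split(), "cs 101".split()), "course"),
    (("i am a phd student".split(), "my student".split()), "student"),
]
toy_unlabelled = [
    ("course syllabus and homework".split(), "cs 202".split()),
    ("student of computer science hobbies".split(), "student anna".split()),
]
content_clf, link_clf = co_train(toy_labelled, toy_unlabelled)
```

With only two labelled pages, the link classifier would have no basis for treating "cs" as a repeated signal; the page labelled by the content classifier supplies that extra evidence.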

What the Semantic Web Can Do for Machine Learning
- Will make mining the Web much easier
- Reason 1: removal of ambiguity
  - More precise knowledge of what is meant by certain terms
- Reason 2: structured vs. unstructured data
  - Learning from structured data is much easier than from unstructured data
- Reason 3: availability of background knowledge
  - Can be used to make better decisions when learning

Removal of Ambiguity
- Example: text document classification
  - E.g., given a text, tell to which newsgroups it belongs
- Typical approach: “bag of words”
  - Look only at which words occur in the text, and how often
  - Each time a word occurs that occurs mainly in one particular class, increase the probability for that class
- But words are ambiguous!
  - Increased classification accuracy can be expected by removing ambiguity
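A toy version of the bag-of-words idea shows how an ambiguous word can mislead the classifier. The voting scheme below is a deliberately crude stand-in for "increase the probability for that class", and the sense-tagged tokens (`bank_money`, `bank_shore`) are a hypothetical annotation standing in for what semantic markup might provide.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text only by which words occur and how often."""
    return Counter(text.lower().split())


def train(labelled_texts):
    """Return per-class word counts from (text, label) pairs."""
    per_class = {}
    for text, label in labelled_texts:
        per_class.setdefault(label, Counter()).update(bag_of_words(text))
    return per_class


def classify(per_class, text):
    """Each word occurrence votes for the class in which that word is most frequent."""
    votes = Counter()
    for word, n in bag_of_words(text).items():
        seen = {label: c[word] for label, c in per_class.items() if c[word] > 0}
        if seen:
            votes[max(seen, key=seen.get)] += n
    return votes.most_common(1)[0][0] if votes else None


# Plain words: "bank" is ambiguous and occurs mostly in the finance class.
plain = train([
    ("bank raised interest rates", "finance"),
    ("bank approved loan", "finance"),
    ("fished from river bank", "outdoors"),
    ("hiking along river trail", "outdoors"),
])
misled = classify(plain, "muddy bank")            # "finance"

# Hypothetical sense-tagged tokens remove the ambiguity.
tagged = train([
    ("bank_money raised interest rates", "finance"),
    ("bank_money approved loan", "finance"),
    ("fished from river bank_shore", "outdoors"),
    ("hiking along river trail", "outdoors"),
])
resolved = classify(tagged, "muddy bank_shore")   # "outdoors"
```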

Mining From (Un)structured Data
- Mining data = intensively querying data
- Answering a query is
  - Easy in structured data: relational database, XML, ...
  - Harder in semi-structured data (e.g., HTML)
  - Hard in unstructured data
    - Information extraction needed
    - Could do this by learning a “wrapper”
    - This involves one extra layer of learning
- Relating this to our text example: taking into account the function of words in the text

Availability of Background Knowledge
- Learning = finding relevant patterns in behaviour
- Important to have the right context to describe these patterns
- Example: making interesting offers to clients
  - “People who bought this book also bought ...”
  - = “Instance-based” learning
    - Estimate profile of user
    - Find users with similar profile
    - Look at behaviour of those users to help current user

Availability of Background Knowledge
- Can work better if more background knowledge is available, e.g., type of book, author, ...
- For instance, for books:
  - “similar profile” = users that up till now bought the same books as this user
    - May not be many people
  - “similar” = often bought books by the same author
    - Probably many more people, allows for a more reasonable guess
  - “similar” = often bought books of the same genre (fiction, ...)
    - May work even better
- Ontologies (among others) provide such background knowledge
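The effect of background knowledge on "similar users" can be sketched with a tiny example. The users, book identifiers, and catalogue (author and genre per book) are hypothetical; the catalogue plays the role of the background knowledge an ontology might supply.

```python
def similar_users(purchases, user, key=lambda book: book):
    """Return {other user: overlap size} for users sharing at least one
    purchased item with `user`, after mapping each book through `key`."""
    mine = {key(book) for book in purchases[user]}
    result = {}
    for other, books in purchases.items():
        if other == user:
            continue
        overlap = len(mine & {key(book) for book in books})
        if overlap:
            result[other] = overlap
    return result


# Hypothetical purchase histories and background knowledge about the books.
purchases = {
    "ann": {"A1"},
    "bob": {"A2"},
    "cat": {"B1"},
    "dan": {"A1", "C1"},
}
catalogue = {  # book -> (author, genre)
    "A1": ("author_a", "fiction"),
    "A2": ("author_a", "fiction"),
    "B1": ("author_b", "fiction"),
    "C1": ("author_c", "science"),
}

same_books = similar_users(purchases, "ann")
same_author = similar_users(purchases, "ann", key=lambda b: catalogue[b][0])
same_genre = similar_users(purchases, "ann", key=lambda b: catalogue[b][1])
# 1, 2 and 3 similar users respectively
```

Coarser notions of similarity find more neighbours; whether they also give better suggestions depends on the data, which is why the slide says only that this "may" work better.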

Web Mining Revisited
- Semantic Web will change
  - Content mining: clearer view on contents and meaning of documents
  - Structure mining: more relevant structure
  - Usage mining: more relevant information on actions of user
- Will in general improve intelligence of systems
  - E.g., mail filter gets a better view of contents of mails

Promising Learning Techniques
- Many different learning techniques exist
  - Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
- Not all equally suitable for every task
  - E.g., SVM for document classification works well
  - E.g., instance-based learning: find other users with the same profile as this user to make predictions
- Intelligent agents will use a mix of them
- Relational learners seem interesting
  - Can handle explicit information on objects and relations between them
  - Classic example: inductive logic programming

Inductive Logic Programming
- Induces rules in first-order logic from examples or other rules
- Such rules can be used to reason with
  - The reasoning can be explained
  - Cf. example of mail program
- Can use existing background knowledge
  - “Knowledge-intensive learning”
  - Currently: good background knowledge has to be engineered manually
  - Will become more easily available with the Semantic Web
- Example: mining in chemical domains

Mining in Chemical Domains
- Example problem: relate activity of a molecule to its properties
  - Useful for, e.g., drug development
- Which properties are important?
  - Chemically relevant properties: functional groups, 3D structure, ...?
  - Has to be encoded manually
- Ideally: get relevant information from some trustworthy data source as and when needed
- Intelligent agents will exploit (“tap”) the common intelligence of the Web

Conclusions
- Machine learning is a promising tool for the Semantic Web
  - For building it
  - For exploiting it
- Clear synergy between Semantic Web efforts and machine learning efforts

Some References
- Maedche, “A Machine Learning Perspective for the Semantic Web”, position paper
- Maedche & Staab (2001), “Ontology Learning for the Semantic Web”, IEEE Intelligent Systems 16(2)
- Jenssen et al. (2001), “A literature network of human genes”, Nature Genetics 28
- Doan et al. (2001), “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD conf.
- Kosala & Blockeel (2000), SIGKDD Explorations 2(1)
- Mitchell (1996), Machine Learning
