Published by Elwin Franklin. Modified over 8 years ago.
1
COMP3410 DB32: Technologies for Knowledge Management
10: Introduction to Knowledge Discovery
By Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)
2
What has Machine Learning got to do with Computing / Information Systems? “Most international organizations produce more information in a week than many people could read in a lifetime” Adriaans and Zantinge
3
Objectives of knowledge discovery or data mining
Data mining is about discovering patterns in data. For this we need:
–KD/DM techniques, algorithms and tools, e.g. BootCat, WEKA
–A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM
4
Data Mining, Knowledge Discovery, Text Mining
Data Mining was originally about “learning” patterns from databases: data structured as records and fields.
Knowledge Discovery: an “exotic term” for DM???
Increasingly, data is unstructured text (WWW), so Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data.
5
define: data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
en.wikipedia.org/wiki/Data_mining
6
define: text mining
Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics...
en.wikipedia.org/wiki/Text_mining
7
define: knowledge discovery
Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test.
www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
en.wikipedia.org/wiki/Knowledge_discovery
8
Data Mining: Overview
Concepts, Instances or examples, Attributes
Data Mining
Concept Descriptions
Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.
9
Instances
Input to a data mining algorithm is in the form of a set of examples, or instances. Each instance is represented as a set of features or attributes.
Usually in DB data mining this set takes the form of a flat file; each instance is a record in the file, and each attribute is a field in the record.
In text mining, an instance is a word or term in a corpus.
The concepts to be learned are formed from patterns discovered within the set of instances.
10
Concepts
The types of concepts we try to ‘learn’ include:
Key “differences”: terms specific to our domain corpus
Clusters or ‘natural’ partitions
–E.g. we might cluster customers according to their shopping habits.
Rules for classifying examples into pre-defined classes
–E.g. “Mature students studying information systems with a high grade for General Studies A level are likely to get a 1st class degree”
General associations
–E.g. “People who buy nappies are in general likely also to buy beer”
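As a rough illustration of the "general associations" concept, the sketch below scores the nappies-and-beer rule on a handful of baskets. The slides do not say how an association is scored; support and confidence are the usual measures and are used here as an assumption. The baskets are invented toy data, not real sales records.

```python
# Toy baskets, invented purely for illustration (not real sales data).
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# How often nappies and beer occur together, and how often
# a basket with nappies also contains beer.
print(support({"nappies", "beer"}, baskets))
print(confidence({"nappies"}, {"beer"}, baskets))
```

A rule is normally only reported when both numbers exceed thresholds chosen by the analyst; those thresholds are a business decision rather than something the data dictates.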
11
More concepts
The types of concepts we try to ‘learn’ include:
Numerical prediction
–E.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result. This may give us an equation:
Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
(but are Gender, Programme really numbers???)
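One common answer to the question of whether Gender and Programme are really numbers is to replace each nominal attribute with a set of 0/1 indicator values before fitting a linear equation. The sketch below shows only that encoding step plus the evaluation of a linear equation. The field names, category lists and weights are hypothetical placeholders, nothing is fitted from data, and treating the degree result as a plain number is itself an assumption.

```python
def one_hot(value, categories):
    """Encode a nominal value as a list of 0/1 indicators, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode(graduate, programmes, genders):
    """Build a purely numeric feature vector from a graduate record.

    Numeric attributes are copied; nominal ones become indicator values.
    """
    return (
        [float(graduate["a_level"]), float(graduate["age"]), float(graduate["degree"])]
        + one_hot(graduate["gender"], genders)
        + one_hot(graduate["programme"], programmes)
    )


def predict(features, weights, intercept=0.0):
    """Linear prediction: intercept + sum of weight * feature."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical record and category lists, for illustration only.
graduate = {"a_level": 300, "age": 21, "degree": 2, "gender": "F", "programme": "IS"}
features = encode(graduate, programmes=["CS", "IS"], genders=["F", "M"])
print(features)

# Placeholder weights (NOT learned from data); a regression tool would estimate these.
weights = [10.0, 100.0, 500.0, 0.0, 0.0, 0.0, 1000.0]
print(predict(features, weights))
```

With this encoding each category gets its own coefficient, so the equation no longer pretends that one programme is "twice" another.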
12
DB Example: weather to play?
14
/usr/local/weka-3-4-5/data/weather.arff
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
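To make the link between this file and the earlier "instances and attributes" vocabulary concrete, here is a minimal sketch that reads ARFF-style text into a list of instances, each a mapping from attribute name to value. It is a hand-written reader for this simple layout only, not WEKA's own loader, and it embeds just the first four data rows of the file.

```python
# First four data rows of weather.arff, embedded so the sketch is self-contained.
ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""


def parse_arff(text):
    """Return (attribute_names, instances) for a simple ARFF-style text.

    Handles only unquoted names and comma-separated data rows.
    """
    names, numeric, instances = [], set(), []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            names.append(name)
            if kind.strip().lower() in ("real", "numeric", "integer"):
                numeric.add(name)
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            # Numeric attributes become floats; nominal ones stay as strings.
            instances.append(
                {n: (float(v) if n in numeric else v) for n, v in zip(names, values)}
            )
    return names, instances


names, instances = parse_arff(ARFF_TEXT)
print(names)
print(instances[0])
```

Each dictionary is one instance (a record in the flat file), and each key is one attribute (a field in the record); `play` is the concept to be learned.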
15
Text mining example: discovering terms in a domain, using WWW-BootCat
“First catch your rabbit” (Mrs Beeton’s cookbook)
Other tools are possible, but WWW-BootCat *should* be easier to use …
First: sign up for a Domain, a SketchEngine account and a Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html (see coursework spec for URLs)
16
First collect your corpus
Advanced Search option with parameter settings:
–using Serge Sharoff's seeds-en list of typical medium-frequency English words as seed-words: http://corpus.leeds.ac.uk/internet/seeds-en
–Google key set to the Key which I set up beforehand at https://www.google.com/accounts/NewAccount
–Language set to English
–Select URLs ticked, so I can cut-and-paste the list of URLs to a text file (TO HAND IN WITH CW)
–Corpus name set to EnglishUK (in my case), or English?? (change ?? to your Domain)
–email address set to USERNAME@comp.leeds.ac.uk
–Query Extension set to site:.uk (in my case), or site:.?? (change ?? to your Domain)
–other Advanced Options left at default values...???
–... then click on Build a corpus!, follow instructions as they appear, and (after some wait) download the corpus in raw and vertical formats (either direct from URL or wait for email to tell you URL…)
17
Problems?
WWW-BootCat: log in; in Advanced options, upload seeds-en, tick Select URLs, set site:.??; then Build a corpus!
If it crashes (bad HTML in a website?), try again
Download your corpus, because…
500,000-word quota – room for 2 corpora (only), so you can only compare 2 at a time in WWW-BootCat
Or compare on your Linux account… /home/www/db32/cw/EnglishUS, EnglishUK
18
Comparing text corpora
Aim: to find terms in C1 not in C2, and terms in C2 not in C1
Sort C1, C2 in vertical format (1 word per line) to give C1termlist, C2termlist:
–sort C1 > C1termlist; sort C2 > C2termlist
–diff C1termlist C2termlist
BUT this shows LOTS of differences, many of them “not significant”: terms with only 1 example (hapax legomena)
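The same "in C1 but not in C2" comparison can be sketched with sets instead of sort and diff. This is only an illustrative alternative, not part of the coursework tools; the two tiny corpora are invented, and the helper name `term_set` is hypothetical.

```python
def term_set(vertical_text):
    """Distinct terms in a vertical-format corpus (one word per line)."""
    return {line.strip() for line in vertical_text.splitlines() if line.strip()}


# Two invented mini-corpora in vertical format.
c1 = "the\ncat\nsat\non\nthe\nmat\n"
c2 = "the\ndog\nsat\non\nthe\nlog\n"

only_in_c1 = sorted(term_set(c1) - term_set(c2))
only_in_c2 = sorted(term_set(c2) - term_set(c1))
print(only_in_c1)
print(only_in_c2)
```

Like the diff approach, this treats a word seen once the same as a word seen a thousand times, so it inherits the same problem with hapax legomena.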
19
Comparing “significant” terms
Better: to find “significant” terms in C1 not in C2
sort C1 | uniq -c | sort -n -r > C1termlist
Terms with frequencies – most common first
Can be compared “OLAP-style” – you can spot high-freq words in one list but not the other
? No need for further processing?
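A minimal sketch of the same idea in Python: count term frequencies, then keep terms that are frequent in the first corpus and absent from the second, most common first. The `min_count` threshold and the toy corpora are assumptions made for illustration; the slides leave the spotting of high-frequency words to the reader's eye.

```python
from collections import Counter


def term_frequencies(vertical_text):
    """Frequency of each term in a vertical-format corpus (one word per line)."""
    return Counter(line.strip() for line in vertical_text.splitlines() if line.strip())


def frequent_only_in_first(freq1, freq2, min_count=2):
    """Terms seen at least min_count times in corpus 1 and never in corpus 2.

    Sorted by descending frequency, then alphabetically.
    """
    hits = [(t, n) for t, n in freq1.items() if n >= min_count and t not in freq2]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Invented mini-corpora in vertical format.
c1 = "tea\nbiscuit\ntea\nqueue\ntea\nbiscuit\nthe\n"
c2 = "the\ncoffee\ncookie\ncoffee\nthe\n"

f1, f2 = term_frequencies(c1), term_frequencies(c2)
print(frequent_only_in_first(f1, f2))
```

Setting `min_count=2` drops hapax legomena such as "queue" here, which is the filtering that the plain diff of term lists cannot do.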
20
Comparing word-frequencies
BootCat (and others, e.g. Paul Rayson) offer tools to compare frequencies of words, to find words used MUCH MORE in one corpus than another
Several different metrics are available, e.g. “mutual information”, “normalised frequency difference”, …
Not necessary for DB32 coursework (probably) … BUT I will be impressed if you do use these advanced metrics!
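The slides name “normalised frequency difference” without defining it. One plausible reading, sketched below as an assumption rather than as BootCat's actual formula, is to scale each count to occurrences per million words, so that corpora of different sizes are comparable, and then subtract. The counts are invented.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus must contain at least one word")
    return count * 1_000_000 / corpus_size


def normalised_difference(freq1, freq2):
    """Per-million frequency in corpus 1 minus per-million frequency in corpus 2.

    Positive scores mark terms used relatively more in corpus 1.
    """
    size1, size2 = sum(freq1.values()), sum(freq2.values())
    return {
        term: per_million(freq1.get(term, 0), size1) - per_million(freq2.get(term, 0), size2)
        for term in set(freq1) | set(freq2)
    }


# Invented counts: corpus 1 has 10 words, corpus 2 has 20.
freq1 = Counter({"tea": 3, "the": 7})
freq2 = Counter({"tea": 1, "the": 19})

scores = normalised_difference(freq1, freq2)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(term, score)
```

A raw difference like this favours very common words; the published tools use statistical measures that also account for how reliable a difference is at low counts.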
21
Knowledge Discovery: Key points
Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules)…
… and in terms of the output they provide (e.g. clustering algorithms provide a set of subclasses)
Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered?
22
Self-test
You should be able to:
–Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.
–Decide which is the most appropriate form of output.