Knowledge and Information Retrieval
Session 8: Data Mining, Information Extraction, and XML
Agenda
- Data Mining and Network Mining
- XML
- Information Extraction Architecture
Data Mining
Data mining (the analysis step of "Knowledge Discovery in Databases") is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Loadingdata.py:

import urllib2
from pylab import plot, show
from numpy import genfromtxt

url = '...'  # URL of the iris CSV dataset; the address was lost from the original slide
u = urllib2.urlopen(url)
localFile = open('iris.csv', 'w')
localFile.write(u.read())
localFile.close()

data = genfromtxt('iris.csv', delimiter=',', usecols=(0, 1, 2, 3))
target = genfromtxt('iris.csv', delimiter=',', usecols=(4), dtype=str)

# plot sepal length against petal length, one colour per species
plot(data[target == 'setosa', 0], data[target == 'setosa', 2], 'bo')
plot(data[target == 'versicolor', 0], data[target == 'versicolor', 2], 'ro')
plot(data[target == 'virginica', 0], data[target == 'virginica', 2], 'go')
show()
Regression
Linear regression is an approach for modeling the relationship between a dependent variable y and one or more explanatory (independent) variables denoted X.

Regresi.py:

from pylab import plot, show
from numpy import linspace, matrix
from numpy.random import rand
from sklearn.linear_model import LinearRegression

x = rand(40, 1)                  # explanatory variable
y = x * x * x + rand(40, 1) / 5  # dependent variable

linreg = LinearRegression()
linreg.fit(x, y)

xx = linspace(0, 1, 40)
plot(x, y, 'o', xx, linreg.predict(matrix(xx).T), '--r')
show()
Network Mining
Often, the data we have to analyze is structured in the form of a network; for example, the data could describe the friendships between a group of Facebook users or the co-authorships of papers between scientists.

import networkx as nx
G = nx.read_gml('lesmiserables.gml', relabel=True)
nx.draw(G, node_size=0, edge_color='b', alpha=.2, font_size=7)
Cliques
It is also interesting to study the network through the identification of its cliques. A clique is a group of nodes in which each node is connected to all the others, and a maximal clique is a clique that is not a subset of any other clique in the network.

cliques = list(nx.find_cliques(G))  # enumerate the maximal cliques
print max(cliques, key=len)         # the largest one

[u'Joly', u'Gavroche', u'Bahorel', u'Enjolras', u'Courfeyrac', u'Bossuet', u'Combeferre', u'Feuilly', u'Prouvaire', u'Grantaire']
Information Extraction Architecture
Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships.

Relasilokasi.py:

locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
query = [e1 for (e1, rel, e2) in locs if e2 == 'Atlanta']
print(query)
>>> ['BBDO South', 'Georgia-Pacific']

OrgName             LocationName
Omnicom             New York
DDB Needham         New York
Kaplan Thaler Group New York
BBDO South          Atlanta
Georgia-Pacific     Atlanta
Simple Information Extraction Systems
Raw text is processed through a sequence of steps: sentence segmentation, tokenization, part-of-speech tagging, entity detection, and relation detection (see the sketch below).
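The first three steps can be sketched as in the NLTK book's ie_preprocess function:

import nltk

def ie_preprocess(document):
    # split the raw text into sentences
    sentences = nltk.sent_tokenize(document)
    # tokenize each sentence into words
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    # tag each token with its part of speech
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences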
Categorizing and Tagging Words
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging. Here, And is CC, a coordinating conjunction; now and completely are RB (adverbs); for is IN (a preposition); something is NN (a noun); and different is JJ (an adjective).

>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Several of the corpora included with NLTK have been tagged for their part of speech.
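For example, the Brown corpus ships pre-tagged (note it uses the Brown tagset, which differs from the Penn tags shown above):

from nltk.corpus import brown
# each word comes paired with its part-of-speech tag
print(brown.tagged_words()[:4])
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')]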
Chunking
The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences. In a chunk diagram, the smaller boxes show the word-level tokenization and part-of-speech tagging, while each of the larger boxes spanning several words is called a chunk.
Noun Phrase Chunking
We will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked.

NPchunkparser.py:

import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

>>> (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
Exploring Text Corpora
We saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags, as in the sketch below.
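For example, following the NLTK book, a chunk grammar can pull verb-to-verb sequences such as "combined to achieve" out of the tagged Brown corpus (the 500-sentence limit is just to keep the run short):

import nltk
from nltk.corpus import brown

# match a verb, then "to", then another verb
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
for sent in brown.tagged_sents()[:500]:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)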
Chinking
We can define a chink as a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
Chinking
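The NLTK book's chinking grammar for the sentence above chunks everything first, then chinks the VBD and IN sequence back out:

import nltk

grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))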
Developing and Evaluating Chunkers

import nltk

text = '''
he PRP B-NP
accepted VBD B-VP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
Conll2000 NP Chunks

from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

>>> (S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)

The CoNLL 2000 corpus contains three chunk types: NP chunks, which we have already seen; VP chunks such as "has already delivered"; and PP chunks such as "because of".
Conll2000 NP Chunks

import nltk
from nltk.corpus import conll2000

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

>>> ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%

The IOB format (short for Inside, Outside, Beginning) is a common format for tagging tokens in a chunking task in computational linguistics (e.g., named entity recognition). The B- prefix indicates that a tag is the beginning of a chunk, the I- prefix indicates that a tag is inside a chunk, and an O tag indicates that a token belongs to no chunk.
Training it using the CoNLL 2000 corpus
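The UnigramChunker class used below is not defined on the slide; here is a sketch of the NLTK book's implementation, which tags each part-of-speech tag with its most likely IOB chunk tag:

import nltk

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # convert chunk trees into (POS tag, IOB chunk tag) training pairs
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)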
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%
Training a classifier-based chunker

a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, while the corresponding material in the second sentence, the computer monitor, is a single chunk.
Adding features

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%
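ConsecutiveNPChunker, used above, is not defined on the slides. A sketch of the NLTK book's implementation follows; it feeds npchunk_features (defined above) into a classifier. The book trains a maxent model via megam, swapped here for NLTK's built-in NaiveBayesClassifier to avoid the external dependency:

import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # the NLTK book uses a maxent classifier here
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)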
Named Entity Recognition
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on.
Named Entity Recognition
NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

>>> sent = nltk.corpus.treebank.tagged_sents()[22]  # the NLTK book's example sentence
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
Relation Extraction
Once named entities have been identified in a text, we then want to extract the relations that exist between them.

Relationentity.py:

# Natural Language Toolkit: code_cascaded_chunker
import nltk, re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
# the IEER document ID was truncated on the original slide
# (the NLTK book iterates over 'NYT_19980315')
for doc in nltk.corpus.ieer.parsed_docs('NYT_'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))

>>> [ORG: u'WHYY'] u'in' [LOC: u'Philadelphia']
...
[ORG: u'BBDO South'] u'in' [LOC: u'Atlanta']
[ORG: u'Georgia-Pacific'] u'in' [LOC: u'Atlanta']
XML
The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium and available as an open standard. XML is extremely useful for keeping track of small to medium amounts of data without requiring a SQL-based backbone, and it is the language used for communication between Web service components.
XML Structure
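The structure diagram from this slide is not reproduced; a minimal (hypothetical) XML document showing the pieces such a diagram labels:

<?xml version="1.0" encoding="UTF-8"?>
<library>                                        <!-- root element -->
  <book id="b1">                                 <!-- nested element with an attribute -->
    <title>Modern Information Retrieval</title>  <!-- element containing text -->
  </book>
</library>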
XML, SOAP and WSDL
Standard technologies:
- XML: the language (data format) used by the Web service.
- Simple Object Access Protocol (SOAP): provides a way to communicate between applications running on different operating systems.
- Web Services Description Language (WSDL): an XML file that specifies the location of the service and the operations (or methods) the service exposes.
XML for IR
The adoption of XML as a standard for the publication and interchange of structured data creates a great opportunity for better information retrieval, because XML can represent the semantics of data in a structured form. It has become crucial to address the question of how we can efficiently query and search large corpora of XML documents. This is useful for enterprise search, library systems, and digital libraries.
XML-Based Enterprise Search
Enterprise search is the application of IR technology to information finding within organizations. It may be interpreted as search of the digital textual materials owned by an organization, including:
- search of their external Web site,
- the company intranet, and
- any other electronic text that they hold.
APIs in XML (SAX and DOM)
The two most basic and broadly used APIs to XML data are the SAX and DOM interfaces.
- Simple API for XML (SAX): parses the file as it reads it from disk; the entire file is never stored in memory.
- Document Object Model (DOM): a World Wide Web Consortium recommendation wherein the entire file is read into memory and stored in a hierarchical (tree-based) form representing all the features of an XML document.
SAX vs DOM
SAX obviously cannot process information as fast as DOM can when working with large files. On the other hand, using DOM exclusively can really exhaust your resources, especially when it is used on many small files. SAX is read-only, while DOM allows changes to the XML file. Since these two APIs literally complement each other, there is no reason why you cannot use them both for large projects, as the sketch below illustrates.
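A side-by-side sketch of the two APIs using Python's standard-library bindings (the sample document is hypothetical):

import xml.sax
from xml.dom.minidom import parseString

xml_data = '<books><book title="Analytics and Big-Data"/></books>'

# DOM: the whole document is held in memory as a tree and can be modified
doc = parseString(xml_data)
for node in doc.getElementsByTagName('book'):
    print(node.getAttribute('title'))

# SAX: read-only and event-driven; elements are reported as the parser streams the input
class BookHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == 'book':
            print(attrs.getValue('title'))

xml.sax.parseString(xml_data.encode('utf-8'), BookHandler())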
Similarity and ranking
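The figure from this slide is not reproduced. In the vector space model typically used here, a query q and a document d are represented as TF-IDF weight vectors and ranked by cosine similarity:

cos(q, d) = (q · d) / (|q| × |d|)

The dot product rewards shared terms, and the length normalization keeps long documents from dominating the ranking.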
Case study
We have more than 10,000 books, and we need to search for a book according to the query entered by the customer. In addition, we need to create an XML information retrieval system that can retrieve all the books that resemble the customer's query. Here are a few of the book titles:
- Analytics and Big-Data
- The Hanging Tree
- Broken Dreams
- Blessed Kid
- Girl with a Dragon Tattoo
The query entered by the customer is: "Book for Analytics newbie."
Term Frequency (TF) Matrix
A technique for finding out the relevance of a word. Here is a frequency count of a set of words in the five books (the frequency table itself was an image on the original slide and is not reproduced).
Calculation
The weighted term frequency is:

TF weight = 1 + log(TF)   if TF > 0
          = 0             if TF = 0

Summing the weights of the query terms in each book (some terms were lost along with the original table, marked "…"):

Document 1: … = 8.6
Document 2: … + 0 + 2 = 7.3
Document 3: 2.5 + 3.0 + 0 + 2 = 7.5
Document 4: 2.6 + 2.3 + … = 7.9
Document 5: 2.3 + … = 7.8
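A one-line helper expressing the weighting above (a sketch; a base-10 logarithm is assumed, as is conventional):

import math

def log_tf(tf):
    # 1 + log10(tf) for tf > 0, otherwise 0
    return 1 + math.log10(tf) if tf > 0 else 0

print(log_tf(100))  # 3.0
print(log_tf(0))    # 0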
Calculation
The result shows that Document 1 will be the most relevant to display for the query, but we still cannot draw a concrete conclusion, since Documents 4 and 5 are not far behind Document 1 and might turn out to be relevant too.
IDF
IDF is another parameter which helps us find out the relevance of words. It is based on the principle that less frequent words are generally more informative.

IDF = log(N / DF)

where N is the total number of documents and DF is the number of documents in which the word occurs. For example, with N = 5 books, a word appearing in all five has IDF = log(5/5) = 0, while a word appearing in just one has IDF = log(5/1) ≈ 0.7.
IDF
We can now clearly see that words like "the" and "for" are not really relevant, as they occur in almost every document, whereas words like "honest", "Analytics", and "Big-Data" are niche words that should be kept in the analysis.
TF-IDF Matrix
As we now know the relevance of words (IDF) and the occurrence of words in the documents (TF), we can multiply the two, find the subject of each document, and then measure the similarity of the query with each document. It now comes out clearly that Document 1 is the most relevant to the query "Book for Analytics newbie".
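A minimal scikit-learn sketch of this pipeline on the case-study titles, comparing the query with each book by cosine similarity (variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

books = ["Analytics and Big-Data", "The Hanging Tree", "Broken Dreams",
         "Blessed Kid", "Girl with a Dragon Tattoo"]
query = "Book for Analytics newbie"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(books)  # TF-IDF matrix of the five titles
q = vectorizer.transform([query])    # the query in the same vector space
print(cosine_similarity(q, X))       # only "Analytics and Big-Data" scores above zero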
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange", "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_  # the fitted IDF weights (public attribute)
print(dict(zip(vectorizer.get_feature_names(), idf)))
XRANK Architecture
The ElemRank computation module computes the ElemRanks of XML elements. The ElemRanks are then combined with ancestor information to generate an index structure called HDIL (Hybrid Dewey Inverted List).

Source: Lin Guo et al., "XRANK: Ranked Keyword Search over XML Documents", Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 16-27, 2003.
LXML
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML), as in the sketch below.
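A small sketch of the step-by-step, event-driven side using etree.iterparse (the sample XML string is hypothetical):

from io import BytesIO
from lxml import etree

xml = b'<root><item>1</item><item>2</item></root>'
# elements are delivered one by one as their closing tags are parsed
for event, elem in etree.iterparse(BytesIO(xml), events=('end',)):
    print(event, elem.tag, elem.text)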
Example using LXML

from lxml import etree

xml = '<a xmlns="test"><b xmlns="test"/></a>'
root = etree.fromstring(xml)
print(etree.tostring(root))

>>> <a xmlns="test"><b xmlns="test"/></a>
Case study (90 minutes)
Use samples of an XML corpus, and:
1. Propose a model for ranked keyword search over XML documents
2. Display a similarity based on TF-IDF
Enterprise Search Architecture
Problems in Enterprise Search
- The gathering process may take a long time and generate telecommunications charges.
- Filtering large collections of binary-format documents may be very time consuming.
- Formats such as MS Word, PDF, and JPEG are capable of storing metadata such as title, author, subject, and date; in practice, however, these metadata are usually missing.
Environment in the Library
OPAC (Online Public Access Catalog)
Exercise
Read one document and display the relations it contains using relation extraction, for example PERSON in ORGANIZATION.
References
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto (2011), Modern Information Retrieval, 2nd edition, ACM Press.
- Stefan Buttcher, Charles L. A. Clarke, and Gordon V. Cormack (2010), Information Retrieval: Implementing and Evaluating Search Engines, MIT Press.
- NLTK Book (nltk.org).
- Raymond J. Mooney, "Mining Knowledge from Text Using Information Extraction", SIGKDD Explorations, vol. 7(1).
- Conference on Computational Natural Language Learning (CoNLL-2000).