Name : Emad Zargoun Id number :135042 EASTERN MEDITERRANEAN UNIVERSITY DEPARTMENT OF Computing and technology “ITEC547- text mining“ Prof.Dr. Nazife Dimiriler.

Outline
1. Overview of clustering
2. Web service clustering
3. Ontologies improve text document clustering
4. Heterarchy and Core Ontology
5. Compiling Background Knowledge into the Text Document Representation
6. Conclusion

Overview of clustering
Definition of clustering
- Clustering in text mining can be defined in several ways:
- The procedure of dividing texts into several clusters, where each cluster contains related texts and differs from the other clusters.
- A grouping of data objects such that the objects within a group are similar to one another and different from the objects in other groups.

Web service clustering
- Web services are distributed, autonomous software components that are self-describing and designed by different vendors to provide business functions to other applications over an internet connection.
- Some major providers have even decided to advertise their services through their human-readable websites, for example Google's and Amazon's web services.

Web service clustering
- Clustering web services is a mechanism for bootstrapping a service search engine.
- The input to the clustering is the web service description files, i.e. WSDL (Web Services Description Language) files.

Web service clustering
- The clustering of web service files differs from the traditional web service discovery problem because there are no queries to match against.
- The idea of representing a web service using document vectors is still relevant.
- Gathering the features for a WSDL file is not as simple as collecting description documents when no central UDDI registries can be assumed.
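The document-vector idea mentioned above can be illustrated with a minimal sketch. It assumes plain term counts and cosine similarity; the term lists are hypothetical stand-ins for terms mined from WSDL files, not data from the cited work.

```python
import math
from collections import Counter

def term_vector(text):
    """Build a bag-of-words vector (term -> count) from whitespace-separated text."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term lists standing in for the terms of three web services.
weather_a = term_vector("weather forecast temperature city")
weather_b = term_vector("weather temperature humidity city")
currency = term_vector("currency exchange rate convert")

print(cosine_similarity(weather_a, weather_b))  # 0.75: three of four terms shared
print(cosine_similarity(weather_a, currency))   # 0.0: no shared terms
```

Services whose vectors point in similar directions are candidates for the same cluster, even though no query is involved.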

Web service clustering
- The goal is a system that can automatically cluster a group of WSDL files obtained by querying a search engine (e.g., Google).
- The process mines four types of features from a WSDL file:
1. The content of the web service is characterized by the application-specific terms located in the WSDL file.

Web service clustering
2. The context of the web service is represented by the application-specific terms appearing in all index web pages of publicly accessible parent directories of the current directory containing the WSDL file.
3. The service host is the second- and top-level portion of the domain name (i.e., a segment of the authority part of the URI) of the host containing the WSDL file.
4. The service name is the name of the WSDL file.
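Features 3 and 4 can be sketched as follows. This is a minimal illustration, not the original modules: the function names, the example URL, and the rule "keep the last two host labels" are assumptions, and that rule is too naive for hosts under multi-part suffixes such as co.uk.

```python
import re
from urllib.parse import urlparse

def identify_service_host(wsdl_url):
    """Return the second- and top-level labels of the host name, e.g. 'example.com'."""
    host = urlparse(wsdl_url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])

def identify_service_name(wsdl_url):
    """Return the last path segment without a .wsdl or .asmx extension."""
    path = urlparse(wsdl_url).path
    match = re.search(r"([^/]+?)(?:\.wsdl|\.asmx)?$", path, flags=re.IGNORECASE)
    return match.group(1) if match else ""

# Hypothetical URL used only for illustration.
url = "http://api.example.com/services/weather/Forecast.wsdl"
print(identify_service_host(url))  # example.com
print(identify_service_name(url))  # Forecast
```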

Web service clustering

From the previous figure:
- The word analyzer begins by tokenizing the WSDL or HTML files to construct the initial sets C and X.
- Non-words are removed from these sets.
- The words in the two sets are conflated and analyzed for their content-bearing property in order to remove function words.
- The remaining content words in the two sets are then clustered to identify application-specific terms and general computing terms.
- Regular expressions are used to extract the service name, s_name, and the service host address, s_host. These steps are implemented as the modules identifyServiceHost and identifyServiceName. As an example, consider the service name of this WSDL file: http://weather.terrapin.com/
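The first word-analyzer steps can be sketched as below. This is a simplified illustration under stated assumptions: a small hand-written stop list replaces the content-bearing analysis, a crude plural-stripping rule replaces a real stemmer, and the final clustering of content words into application-specific and general computing terms is left out. All names are hypothetical.

```python
import re

# Assumption: a tiny stop list stands in for the content-bearing analysis.
FUNCTION_WORDS = {"the", "a", "an", "of", "for", "to", "and", "in", "by"}

def tokenize(text):
    """Split on non-alphanumerics, then split camelCase identifiers."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return tokens

def is_word(token):
    """Treat purely alphabetic tokens of length two or more as words."""
    return token.isalpha() and len(token) > 1

def conflate(token):
    """Crude conflation: lowercase and strip a plural 's' (stand-in for a stemmer)."""
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    """Tokenize, drop non-words, conflate, then drop function words."""
    words = [conflate(t) for t in tokenize(text) if is_word(t)]
    return [w for w in words if w not in FUNCTION_WORDS]

print(analyze("getWeatherByCity returns the Forecasts for 3 days"))
# ['get', 'weather', 'city', 'return', 'forecast', 'day']
```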

Web service clustering

The web service clusters produced based on the four types of features.

Ontologies improve text document clustering
- Text document clustering can benefit from integrating an explicit conceptual account of terms, as found in ontologies like WordNet.
- The clustering is then performed with Bi-Section-KMeans, which has been shown to perform as well as other text clustering algorithms.
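The general idea of bisecting k-means can be sketched as follows: start with one cluster and repeatedly split a cluster in two with ordinary 2-means until k clusters exist. This sketch makes several assumptions that the slides do not state: it always splits the largest cluster, uses squared Euclidean distance on small numeric vectors, and uses a fixed random seed.

```python
import random

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, rng, iterations=20):
    """Split one cluster into two with plain k-means (k = 2); None if no valid split."""
    centers = rng.sample(points, 2)
    best = None
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; keep the last valid one, if any
        best = halves
        centers = [mean(halves[0]), mean(halves[1])]
    return best

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        split = two_means(largest, rng) if len(largest) >= 2 else None
        if split is None:  # cannot split further (e.g. duplicate points)
            clusters.append(largest)
            break
        clusters.extend(split)
    return clusters

# Hypothetical 2-D points; real input would be document vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0), (20, 1)]
for cluster in bisecting_kmeans(points, k=3):
    print(cluster)
```

The resulting partition depends on the random initial centers, so different seeds can give different clusters.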

Heterarchy and Core Ontology
Definition 1 (Core Ontology): A core ontology is a sign system which consists of:
- A lexicon: the lexicon L contains a set of natural language terms.
- A set of concepts C*.
- The reference function F, which links terms of the lexicon to the concepts they refer to.
- A heterarchy H: concepts are taxonomically related by the directed, acyclic, transitive, reflexive relation H.

Example
- Lexicon L = {Hotel, Grand Hotel, Hotel Schwarzer Adler, Accommodation, …}
- Concepts C* = {ROOT, HOTEL, ACCOMMODATION, …}
- Reference function F = {(Hotel, HOTEL), (Grand Hotel, HOTEL), (Hotel Schwarzer Adler, HOTEL), …}, i.e. "Hotel", "Grand Hotel" and "Hotel Schwarzer Adler" refer to the concept HOTEL.
- Heterarchy H = {(HOTEL, ACCOMMODATION), (ACCOMMODATION, ROOT), …}
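The example ontology can be written down as plain data structures. The sketch below encodes only the pairs listed on the slide; the variable names and the helper superconcepts are hypothetical, and the helper derives the reflexive, transitive closure from the direct pairs.

```python
LEXICON = {"Hotel", "Grand Hotel", "Hotel Schwarzer Adler", "Accommodation"}
CONCEPTS = {"ROOT", "HOTEL", "ACCOMMODATION"}

# Reference function F, restricted to the pairs given in the example.
REF = {
    "Hotel": {"HOTEL"},
    "Grand Hotel": {"HOTEL"},
    "Hotel Schwarzer Adler": {"HOTEL"},
}

# Heterarchy H as direct (subconcept, superconcept) pairs.
H = {("HOTEL", "ACCOMMODATION"), ("ACCOMMODATION", "ROOT")}

def superconcepts(concept):
    """All concepts reachable from `concept` via H, including the concept itself."""
    seen = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in H:
            if sub == current and sup not in seen:
                seen.add(sup)
                frontier.append(sup)
    return seen

print(REF["Grand Hotel"])               # {'HOTEL'}
print(sorted(superconcepts("HOTEL")))   # ['ACCOMMODATION', 'HOTEL', 'ROOT']
```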

Compiling Background Knowledge into the Text Document Representation
There are three sets of strategies for compiling background knowledge into the text document representation.
Term vs. Concept Vector Strategies
- Enriching the term vectors with concepts from the core ontology has two benefits:
- First, it resolves synonyms.
- Second, it introduces more general concepts, which help to identify related topics.
- For instance, a document about beef may not be recognized as related to a document about another kind of meat when the two share no terms; adding the more general concept ‘meat’ to both reveals the relationship.
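The synonym benefit can be illustrated with a small sketch that adds concept counts to a term vector. It reuses the Hotel example from the core ontology slide; the documents, the "concept:" prefix, and the function names are assumptions made for illustration, and each term is mapped to a single concept to keep ambiguity out of the picture.

```python
from collections import Counter

# Reference function restricted to unambiguous terms (taken from the Hotel example).
REF = {"Hotel": "HOTEL", "Grand Hotel": "HOTEL", "Hotel Schwarzer Adler": "HOTEL"}

def enrich(term_counts):
    """Return a copy of the term vector with concept frequencies added as extra dimensions."""
    enriched = Counter(term_counts)
    for term, count in term_counts.items():
        concept = REF.get(term)
        if concept is not None:
            enriched["concept:" + concept] += count
    return enriched

def shared_dimensions(u, v):
    """Dimensions that are non-zero in both vectors."""
    return set(u) & set(v)

# Hypothetical documents that use different terms for the same concept.
doc_a = Counter({"Hotel": 2, "booking": 1})
doc_b = Counter({"Grand Hotel": 1, "breakfast": 1})

print(shared_dimensions(doc_a, doc_b))                   # set(): no overlap on terms alone
print(shared_dimensions(enrich(doc_a), enrich(doc_b)))   # {'concept:HOTEL'}
```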

Compiling Background Knowledge into the Text Document Representation
Strategies for Disambiguation
- The assignment of terms to concepts in WordNet is ambiguous.
- Adding concepts, or replacing terms by concepts, may therefore add noise to the representation and may induce a loss of information.
- Three disambiguation strategies are considered.

Compiling Background Knowledge into the Text Document Representation
- All Concepts (“all”). The baseline strategy is to do nothing about disambiguation and to consider all concepts for augmenting the text document representation.
- First Concept (“first”). WordNet returns an ordered list of concepts when applying Ref_C to a set of terms. The ordering is supposed to reflect how common it is that a term refers to a concept in “standard” English: more common term meanings are listed before less common ones. This strategy keeps only the first, most common concept.
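The difference between the two strategies can be sketched as below. The ordered sense list is hypothetical and only mimics the way WordNet ranks senses; the concept labels and function names are assumptions.

```python
from collections import Counter

# Hypothetical ordered sense lists, most common sense first.
REF_C = {
    "beef": ["BEEF_MEAT", "COMPLAINT"],
}

def concepts_all(term):
    """'all' strategy: keep every concept the term may refer to."""
    return list(REF_C.get(term, []))

def concepts_first(term):
    """'first' strategy: keep only the first-ranked concept."""
    return REF_C.get(term, [])[:1]

def concept_vector(terms, strategy):
    """Count concept frequencies for a list of terms under the given strategy."""
    counts = Counter()
    for term in terms:
        counts.update(strategy(term))
    return counts

terms = ["beef", "beef"]
print(concept_vector(terms, concepts_all))    # both senses counted twice
print(concept_vector(terms, concepts_first))  # only BEEF_MEAT counted twice
```

The "all" vector carries the unintended sense as noise, which is the motivation for the disambiguation strategies.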

Compiling Background Knowledge into the Text Document Representation
- Disambiguation by Context (“context”). The sense of a term t that refers to several different concepts Ref_C(t) := {b, c, ...} may be disambiguated by a simplified version of the strategy of Agirre and Rigau.

Compiling Background Knowledge into the Text Document Representation
Strategies for considering the concept hierarchy
- The third set of strategies varies the amount of background knowledge.
- The principal idea is that if a term like ‘beef’ appears, the document is represented not only by the concept corresponding to ‘beef’, but also by the concepts corresponding to ‘meat’, ‘food’, etc., up to a certain level of generality.
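The hierarchy strategy can be sketched as adding each concept's count to its superconcepts up to a chosen number of levels. The sketch assumes a single direct superconcept per concept, which simplifies a heterarchy where a concept may have several; the mapping and names are hypothetical.

```python
from collections import Counter

# Assumed chain of direct superconcepts for the beef example.
HYPERNYM = {"BEEF": "MEAT", "MEAT": "FOOD"}

def with_hypernyms(concept_counts, max_levels):
    """Add each concept's count to its superconcepts, at most max_levels steps up."""
    expanded = Counter(concept_counts)
    for concept, count in concept_counts.items():
        current = concept
        for _ in range(max_levels):
            current = HYPERNYM.get(current)
            if current is None:
                break
            expanded[current] += count
    return expanded

doc = Counter({"BEEF": 3})
print(with_hypernyms(doc, 0))  # Counter({'BEEF': 3})
print(with_hypernyms(doc, 1))  # BEEF and MEAT, each with count 3
print(with_hypernyms(doc, 2))  # BEEF, MEAT and FOOD, each with count 3
```

Raising max_levels makes documents about related topics overlap more, at the cost of very general concepts that appear in almost every document.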

Conclusion
- Clustering web services into functionally similar groups can greatly reduce the search space of a service discovery task. It can therefore be seen as a predecessor of web service discovery or as an important functionality provided by future service search engines.
- Clustering based on all three document vectors (word vector, concept vector, category vector) also gets significantly better results than the baseline, but does not outperform clustering based only on the word vector and the category vector.

References
- Wei Liu and Wilson Wong. Web service clustering using text mining techniques. Int. J. Agent-Oriented Software Engineering, Vol. X, No. Y.
- Andreas Hotho, Steffen Staab, Gerd Stumme. Ontologies Improve Text Document Clustering. Institute AIFB, University of Karlsruhe, Germany.
- E. Agirre and G. Rigau. Word sense disambiguation using conceptual density. In Proc. of COLING’96, 1996.

Thanks for your attention
