1 Unsupervised Learning from URL Corpora
Deepak P*, IBM Research, Bangalore
Deepak Khemani, Dept. of CS&E, IIT Madras
*Work done while at IIT Madras.

2 URLs as an Information Source
URLs are compact, semi-structured, ordered text.
Main motivations for a webmaster when choosing URLs: understandability and brevity.

3 Why Use URLs for Learning?
URLs are small entities: techniques working with them would be orders of magnitude faster than those working on whole web pages.
URLs are ubiquitous: we can obtain URLs from web search engines, hub pages, etc., without going to the server where the page is hosted.

4 Differential Information Content of URL Segments
Need for differential weighting of URL segments. Example: http://www.cs.abc.edu/courses/current/cs511/assignments
The weighting of segments should decrease as we go down the URL.
Level 0 (the part before the first '/') is an exception: there the weighting should increase as we go down the URL.

5 URL-Sim Computations (1/4)
Preprocessing: remove stopwords and the scheme tag, then tokenize and level-tag.
Each delimiter separates a token; each '/' is a level separator.
Example:
  http://www.iitm.ac.in/students
  www.iitm.ac.in/students        (scheme tag removed)
  iitm.ac.in/students            (stopword removed)
  iitm ac in / students          (tokenized)
  0    0  0    1                 (level tags)

6 URL-Sim Computations (2/4)
Differential weighting of URL segments: http://www.cs.abc.edu/courses/current/cs511/assignments
[Figure: arrows over the URL depicting the direction of weight decrease]

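7 URL-Sim Computations (3/4)
Weight tagging: arrange the segments in order of decreasing weighting, then attach weights to segments based on level and order (within the level).

Level  Weight
0      10
1      8
2      6
3      4
4      2

Example: http://www.iitm.ac.in/students

Segment   Level  Order  Share  Weight
in        0      1      3      5
ac        0      2      2      3.33
iitm      0      3      1      1.67
students  1      1      1      8

The weight of a level is divided among its segments in proportion to their shares, e.g. 10 * 3/6 = 5 for "in".
Each segment thus has a weight associated with it.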
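
The preprocessing and weight-tagging steps can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the stopword list, the handling of levels deeper than 4, and the ordering of segments within levels other than 0 are assumptions, and the names `tokenize_levels` and `weight_segments` are hypothetical.

```python
import re

LEVEL_WEIGHTS = [10.0, 8.0, 6.0, 4.0, 2.0]  # level weights from the slide
STOPWORDS = {"www", "com"}                   # assumed stopword list

def tokenize_levels(url):
    """Strip the scheme tag, split on '/', tokenize each level on delimiters."""
    url = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    levels = []
    for part in url.split("/"):
        tokens = [t for t in re.split(r"[^A-Za-z0-9]+", part.lower())
                  if t and t not in STOPWORDS]
        levels.append(tokens)
    return levels

def weight_segments(url):
    """Return (segment, level, weight) triples for one URL."""
    weighted = []
    for level, tokens in enumerate(tokenize_levels(url)):
        if not tokens:
            continue
        # Assumption: levels deeper than 4 reuse the last listed weight.
        level_weight = LEVEL_WEIGHTS[min(level, len(LEVEL_WEIGHTS) - 1)]
        n = len(tokens)
        total_shares = n * (n + 1) / 2
        for pos, token in enumerate(tokens):
            # Level 0: weight increases left to right. Other levels:
            # weight decreases left to right (assumed within-level order).
            share = pos + 1 if level == 0 else n - pos
            weighted.append((token, level, level_weight * share / total_shares))
    return weighted

example = weight_segments("http://www.iitm.ac.in/students")
for segment, level, weight in example:
    print(segment, level, round(weight, 2))
```

With these assumptions the sketch gives the same segment weights as the iitm.ac.in example on this slide.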

8 URL-Sim Computations (4/4)
URL pairwise similarity weight computations (|string| = length of the string)
URL 1: http://java.sun.com (sun 6.67, java 3.33)
URL 2: http://www.java.com/getjava (java 10.0, getjava 8.00)
Initial similarity: 0.00
sun: NO MATCH
java / java: whole segment match. Similarity increment = (3.33 + 10.0)/2 = 6.66. Similarity: 6.66
java / getjava: largest matched substring = "java".
  Similarity increment contribution by "java" = 3.33/2 = 1.66
  Similarity increment contribution by "getjava" = 8.0 * |java| / (|getjava| * 2) = 2.28
  Total similarity increment = 1.66 + 2.28 = 3.94. Similarity: 10.60
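
The pairwise computation can be sketched in Python as below, taking the segment weights as given. This is an illustrative reading of the worked example, not the authors' code. It assumes that every segment of one URL is compared with every segment of the other, and the `min_len` cutoff for a match is a hypothetical parameter that the slides do not specify. A whole segment match is treated as the special case where the matched substring equals both segments.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def url_sim(segments_a, segments_b, min_len=2):
    """segments_*: lists of (segment, weight) pairs for the two URLs."""
    similarity = 0.0
    for seg_a, w_a in segments_a:
        for seg_b, w_b in segments_b:
            common = longest_common_substring(seg_a, seg_b)
            if len(common) < min_len:  # assumed cutoff: treat as NO MATCH
                continue
            # Each side contributes half its weight, scaled by the
            # fraction of the segment covered by the match.
            similarity += w_a * len(common) / (2 * len(seg_a))
            similarity += w_b * len(common) / (2 * len(seg_b))
    return similarity

url1 = [("sun", 6.67), ("java", 3.33)]       # http://java.sun.com
url2 = [("java", 10.0), ("getjava", 8.0)]    # http://www.java.com/getjava
print(round(url_sim(url1, url2), 2))
```

Without intermediate rounding the sketch yields about 10.62; the slide's 10.60 comes from truncating each intermediate value to two decimals.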

9 Agglomerative Clustering (Willett, 1988)
[Figure: items A, B, C, D and E merged bottom-up (BA, EC, D) into the single cluster ABCDE]

10 Hierarchical Agglomerative Clustering Using URL-Sim
Hierarchical agglomerative clustering starts with each URL in its own cluster and works by merging the closest pair of clusters.
Purity of a cluster: max over labels i of (number of elements with label i)/(total number of elements in the cluster)
Merging mistake: a merge is a mistake if it results in a decrease of total purity.
Merging accuracy: (number of non-mistake merges)/(total number of merges)
Results: 94.82% after 30% of the merges, 92.74% after 50% of the merges
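
The evaluation loop can be sketched in Python as follows. This is a hedged sketch under stated assumptions: the slides do not say which linkage is used or how "total purity" is aggregated, so average linkage and size-weighted purity are assumed here. The similarity function `sim` is pluggable, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def purity(cluster_labels):
    """Fraction of a cluster's elements that carry its most common label."""
    return max(Counter(cluster_labels).values()) / len(cluster_labels)

def total_purity(clusters, labels):
    """Size-weighted purity over all clusters (assumed aggregation)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) * purity([labels[i] for i in c]) for c in clusters) / n

def hac_merge_accuracy(items, labels, sim, n_merges):
    """Run n_merges merges (at most len(items) - 1); return merging accuracy."""
    clusters = [[i] for i in range(len(items))]

    def linkage(pair):
        # Average pairwise similarity between two clusters (assumed linkage).
        a, b = clusters[pair[0]], clusters[pair[1]]
        return sum(sim(items[i], items[j]) for i in a for j in b) / (len(a) * len(b))

    good = 0
    for _ in range(n_merges):
        a, b = max(combinations(range(len(clusters)), 2), key=linkage)
        before = total_purity(clusters, labels)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        if total_purity(clusters, labels) >= before - 1e-12:
            good += 1  # the merge did not decrease total purity
    return good / n_merges

# Toy data with a toy similarity (number of shared characters).
toy_items = ["aa", "ab", "zz", "zy"]
toy_labels = [0, 0, 1, 1]
shared_chars = lambda x, y: len(set(x) & set(y))
print(hac_merge_accuracy(toy_items, toy_labels, shared_chars, 3))
```

In the toy run the first two merges join same-label items and the final merge mixes the two labels, so two of the three merges are non-mistakes.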

11 HAC Using URL-Sim - Results Hierarchical Agglomerative Clustering Results

12 Keyword Identification Using URL-Sim
Keyword identification reuses most parts of the similarity computation routine.
URL 1: http://www.javadevtalk.com (javadevtalk 10.00)
URL 2: http://www.developer.com/java (developer 10.00, java 8.00)
javadevtalk / developer: largest matching substring "dev", similarity increment = 3.03. Add "dev" to the candidate list and set its current score to 3.03.
javadevtalk / java: largest matching substring "java", similarity increment = 5.82. Add "java" to the candidate list and set its current score to 5.82.

Keyword  Score
dev      3.03
java     5.82

This is done for each URL pair in the corpus. The final scores are used to determine the keywords for the corpus.
Useful for determining the topic that a topical corpus deals with.
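
A possible Python sketch of the keyword scoring follows. It is an assumption-laden illustration: it credits each matched substring with a similarity increment of the form weight * |match| / (2 * |segment|) per side, which reproduces the 3.03 and 5.82 on this slide. The `min_len` cutoff, the accumulation of increments across pairs, and the function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def keyword_scores(url_pairs, min_len=2):
    """url_pairs: iterable of (segments_a, segments_b), where each side is a
    list of (segment, weight). Returns (keyword, score), highest score first."""
    scores = defaultdict(float)
    for segments_a, segments_b in url_pairs:
        for seg_a, w_a in segments_a:
            for seg_b, w_b in segments_b:
                common = longest_common_substring(seg_a, seg_b)
                if len(common) < min_len:  # assumed cutoff for a match
                    continue
                # Credit the matched substring with the similarity increment.
                scores[common] += (w_a * len(common) / (2 * len(seg_a))
                                   + w_b * len(common) / (2 * len(seg_b)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

pair = ([("javadevtalk", 10.0)],                 # http://www.javadevtalk.com
        [("developer", 10.0), ("java", 8.0)])    # http://www.developer.com/java
ranked = keyword_scores([pair])
for keyword, score in ranked:
    print(keyword, round(score, 2))
```

Passing every URL pair of a corpus to `keyword_scores` would give a ranked candidate list for the whole corpus.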

13 Keyword Identification: Results
Get topical corpora with corresponding topics.
How high in the ranked list the corresponding keyword occurs is a measure of accuracy for the topic identification technique.
Results: accuracy measure = 1.933

14 Topic Identification: Results
Table columns: Topical Corpus Name, List of Sorted Keyword Score Tuples
Topical corpora: Cricket, Computer Science, Government, IIT, Jobs, London, Sports, Tennis, Cornell (WebKB), Texas (WebKB), Washington (WebKB)

15 Character N-Gram Vectors for Clustering
Why character N-grams?
URLs are chosen as a trade-off between expressivity and brevity.
Noise induced by these requirements includes abbreviations (gov for government) and homophones (2 for to, 4 for for).
Character N-grams are very tolerant to such noise.
Character N-gram vectors give a vector space embedding of the data, so that well-known linear partitional clustering algorithms can be used.

16 Character N-Gram Vectors
Monogram vector example: http://iitm.ac.in
  Tokens: iitm ac in
  iitm: i:2, t:1, m:1   ac: a:1, c:1   in: i:1, n:1
  Total: i:3, t:1, m:1, a:1, c:1, n:1
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  1 0 1 0 0 0 0 0 3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0
Bigram vector example:
  Tokens: iitm ac in
  iitm: ii:1, it:1, tm:1   ac: ac:1   in: in:1
  Total: ii:1, it:1, tm:1, ac:1, in:1
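
The counting on this slide can be sketched in Python as below. It is a minimal sketch that assumes the URL has already been tokenized and that N-grams do not span token boundaries, as the bigram example suggests. The function names are hypothetical.

```python
from collections import Counter
import string

def ngram_counts(tokens, n):
    """Count character n-grams inside each token (not across tokens)."""
    counts = Counter()
    for token in tokens:
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts

def monogram_vector(tokens):
    """Dense 26-dimensional vector of letter counts, ordered a to z."""
    counts = ngram_counts(tokens, 1)
    return [counts[ch] for ch in string.ascii_lowercase]

tokens = ["iitm", "ac", "in"]  # tokens of http://iitm.ac.in
print(dict(ngram_counts(tokens, 1)))
print(monogram_vector(tokens))
print(dict(ngram_counts(tokens, 2)))
```

A dense bigram vector would need a fixed ordering of all possible bigrams; keeping the sparse `Counter` is one reasonable alternative.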

17 The K-Means Clustering Method (MacQueen, 1967)
Example with K=2:
Arbitrarily choose K objects as the initial cluster centers.
Assign each object to the most similar center.
Update the cluster means.
Reassign the objects and update the cluster means again.
[Figure: scatter plots illustrating each step]
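
The loop on this slide can be sketched in plain Python as follows. This is a minimal sketch with Euclidean distance. Taking the first K points as the initial centers is an assumption made to keep the example deterministic (the slide only says the choice is arbitrary), and the `iterations` cap is a hypothetical safeguard.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic K-means on a list of equal-length numeric tuples."""
    centers = [list(p) for p in points[:k]]  # assumed initialization
    assignment = None
    for _ in range(iterations):
        # Assign each object to the nearest (most similar) center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centers = kmeans(points, 2)
print(assignment, centers)
```

For URL clustering, the points would be the character N-gram vectors of the URLs.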

18 Performance for Varying N
Distance measure: Euclidean distance

19 Comparative Analysis
Why compare?
Both techniques (URL-Sim and N-gram vectors) perform well.
The techniques are very different and are based on different assumptions.
If they are different, they can possibly be combined to generate a better technique.

20 Comparison Methodology
Generate two lists of URL pairs, each in ascending order of the distance between the elements of the pair: one according to Euclidean distances between bigram vectors, the other according to URL-Sim.
Compare the two lists using the Spearman rank correlation coefficient.
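
The comparison step can be sketched in Python as below. It assumes the two measures have already produced one distance per URL pair, in the same pair order, and computes the Spearman coefficient as the Pearson correlation of the ranks, with tied values given their average rank. How URL-Sim scores are turned into distances is not stated in the slides and is left outside this sketch. The function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation of two equal-length lists.
    Undefined (division by zero) if either list is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)

# Hypothetical distances for the same four URL pairs under two measures.
bigram_distances = [0.2, 1.5, 0.9, 3.0]
urlsim_distances = [0.1, 0.8, 0.4, 0.9]
print(spearman(bigram_distances, urlsim_distances))
```

A value near 1 means the two measures order the URL pairs almost identically; values near 0 or below mean they largely disagree.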

21 Correlation Analysis Results

Corpus      Correlation
Corpus1     0.2528
Corpus2     0.3295
Corpus3     0.0563
BankSearch  0.4673
WebKB       -0.14

Observations:
The techniques tend to agree more when the clusters or classes involved are very distinct.
They tend to disagree more when the different clusters are related in some sense (some parameters cause URLs in different clusters to be similar).
There seems to be no trivial way of combining the two techniques.

22 Contributions
URL-Sim as a similarity measure for URL pairs
Topic identification using URL-Sim
Bigram vectors as representations of URLs
Feasibility of unsupervised learning from URL corpora

23 Screenshot 1: Clustered Web Search

24 Screenshot 2: Clustered Web Search
