Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings Baoning Wu and Brian D. Davison Lehigh University Symposium on Applied Computing 2006
Motivation Link-based ranking algorithms are important to today's popular search engines (e.g., HITS for Teoma). Link farms degrade the performance of link-based ranking algorithms.
HITS algorithm Each page has two measures: the authority score a shows how good the page is for a query, and the hub score h shows how likely the page is to point to good authority pages. E is the adjacency matrix. $a = E^{T} h$, $h = E a$
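The two update rules above can be sketched as a short power iteration. This is a minimal illustration, not the authors' implementation: the function name `hits`, the fixed iteration count, and the L2 normalisation after each round are assumptions added here so that the scores stay bounded.

```python
def hits(adj, iterations=50):
    """Run HITS on a 0/1 adjacency matrix E, where adj[i][j] == 1
    means page i links to page j. Returns (authority, hub) lists."""
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # a = E^T h: authority of page j sums the hub scores of pages linking to j
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # h = E a: hub score of page i sums the authority of pages that i links to
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # Assumption: L2-normalise each vector so the values do not grow unbounded
        a_norm = sum(x * x for x in auth) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub) ** 0.5 or 1.0
        auth = [x / a_norm for x in auth]
        hub = [x / h_norm for x in hub]
    return auth, hub


# Toy graph (hypothetical): pages 0 and 1 both link to page 2.
E = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority, hub = hits(E)
```

In this toy graph page 2 collects all of the authority, and pages 0 and 1 share the hub score equally.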
Example: for query “weather” calculator.html
Factors that degrade HITS: mutually reinforcing relationships; duplicate pages; link farms
Complete hyperlink Definition: a link's target together with its anchor text, treated as a unit. A duplicated complete link is a much stronger sign of copying behavior on the Web than a duplicated link target alone.
Document - Complete link Matrix
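A document - complete link matrix can be sketched as follows: one row per document, one column per distinct (target, anchor text) pair, and a 1 wherever the document contains that complete link. The function name, the input shape, the exact-match comparison of anchor text, and the example URLs are all assumptions for illustration.

```python
def complete_link_matrix(pages):
    """pages maps a document URL to a list of (target_url, anchor_text) pairs.
    Returns (docs, links, matrix), where matrix[i][j] == 1 when
    docs[i] contains the complete link links[j]."""
    docs = sorted(pages)
    # A complete link is the (target, anchor text) pair taken as one unit.
    # Assumption: anchor texts are compared by exact string match.
    links = sorted({link for out_links in pages.values() for link in out_links})
    column = {link: j for j, link in enumerate(links)}
    matrix = [[0] * len(links) for _ in docs]
    for i, doc in enumerate(docs):
        for link in pages[doc]:
            matrix[i][column[link]] = 1
    return docs, links, matrix


# Hypothetical pages: p1 and p2 share two complete links; p3 points to the
# same target as they do, but with different anchor text.
pages = {
    "a.example/p1": [("t.example", "cheap cars"), ("u.example", "rent now")],
    "b.example/p2": [("t.example", "cheap cars"), ("u.example", "rent now")],
    "c.example/p3": [("t.example", "car rental reviews")],
}
docs, links, matrix = complete_link_matrix(pages)
```

Here p3 shares a link target with p1 and p2 but no complete link, so it gets its own column and does not look like a copy.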
Bipartite Graph Two disjoint sets X and Y; each edge starts at an element of X and ends at an element of Y.
Link farms Link farms are usually densely connected via multiple overlapping small bipartite cores. Task: detect densely connected bipartite components in the “document - complete link” matrix.
Algorithm for finding bipartite components
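The slides do not spell out the procedure, so the following is only one plausible sketch, using a simple degree-pruning heuristic rather than the authors' stated algorithm. It takes a 0/1 document-by-complete-link matrix and repeatedly clears rows and columns that are too sparse to belong to a dense bipartite component. Reading k as the minimum number of surviving complete links per document and l as the minimum number of documents per complete link is an assumption, as is the function name.

```python
def prune_to_dense_core(matrix, k=2, l=2):
    """Iteratively zero rows with fewer than k ones and columns with fewer
    than l ones, until nothing changes. Returns a pruned copy of matrix."""
    m = [row[:] for row in matrix]
    n_cols = len(m[0]) if m else 0
    changed = True
    while changed:
        changed = False
        # Drop documents that keep fewer than k complete links (assumed meaning of k).
        for i, row in enumerate(m):
            if 0 < sum(row) < k:
                m[i] = [0] * n_cols
                changed = True
        # Drop complete links that appear in fewer than l documents (assumed meaning of l).
        for j in range(n_cols):
            if 0 < sum(row[j] for row in m) < l:
                for row in m:
                    row[j] = 0
                changed = True
    return m


# Hypothetical input: documents 0 and 1 share complete links 1 and 2;
# document 2 has a single complete link that nobody else uses.
example = [
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
]
core = prune_to_dense_core(example, k=2, l=2)
```

With k=2 and l=2, only the 2-by-2 block shared by documents 0 and 1 survives; the unshared link of document 2 is cleared.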
Result: k=2 and l=2
Adjustment: document-document matrix
Final matrix
Weighted adjacency matrix
Experiment: HITS result of “rental car”
Experiment: B&H HITS result of “rental car” about_travelguides/addlisting.html
Experiment: CL-HITS result of “rental car”
Experiment: B&H HITS result of “translation online”
Experiment: CL-HITS result of “translation online” /worldlingo_translator.html
Duplicate example: BH-HITS result of “maps”
Duplicate example: CL-HITS result of “maps”
User evaluation
Category            HITS    BHITS   CL-HITS  CL-POP
Quite relevant      12.9%   24.5%   48.4%    46.3%
Relevant            10.7%   18.3%   28.8%    26.2%
Not sure             6.6%   10.5%    6.7%     6.4%
Irrelevant          26.8%   14.8%   11.3%    12.7%
Totally irrelevant  42.8%   31.9%    4.6%     8.1%
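To compare the four columns at a glance, the two favourable categories can be summed per algorithm. Treating "Quite relevant" plus "Relevant" as a single figure is an assumption made here for illustration; the slide does not state how the ratings are aggregated. The numbers are copied from the user evaluation table.

```python
# Percentages of user ratings per algorithm, copied from the table.
ratings = {
    "HITS":    {"Quite relevant": 12.9, "Relevant": 10.7},
    "BHITS":   {"Quite relevant": 24.5, "Relevant": 18.3},
    "CL-HITS": {"Quite relevant": 48.4, "Relevant": 28.8},
    "CL-POP":  {"Quite relevant": 46.3, "Relevant": 26.2},
}


def relevant_share(scores):
    """Assumed aggregate: share of results rated 'Quite relevant' or 'Relevant'."""
    return round(scores["Quite relevant"] + scores["Relevant"], 1)


shares = {name: relevant_share(scores) for name, scores in ratings.items()}
best = max(shares, key=shares.get)
```

Under this reading, CL-HITS has the largest share of favourably rated results, followed by CL-POP.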
Discussion Using the link target alone (without anchor text), precision at 10 is 66.4%, much lower than with the “complete link”. Random anchor texts.
Questions?