A Kosher Source of Ham
Nathan Friess, John Aycock
Department of Computer Science, University of Calgary, Canada



Mar 26,

Building A Ham Corpus
- Hard To Publish Samples
  - Copyright Issues
  - Privacy Issues
- Some Goals
  - Realistic Samples
    - Ex: English words, grammar
  - Variety of Contexts
    - Technical writing, conversational

Related Work
- Garcia, Hoepman, van Nieuwenhuizen (2004)
  - Simulation of legitimate users, spammers, mailing lists
  - Gather ham text from Usenet
  - But what about Usenet spam?

Usenet
- Public discussions
  - Contributing and viewing
  - Organized into newsgroups
- "Typical" interaction:
  - One person posts
  - Many people reply to form a thread
- Underlying protocols: NNTP, RFC 1036 messages

From: Alice
Newsgroups: rec.food.cooking
Subject: Re: Tasty chicken
Date: Sun, 13 Apr :47: (PDT)
Message-ID:
References:

On Apr 12, 7:27 pm, Bob wrote:
>
>BBQ chicken is the best!
>
>Bob
>
I agree, especially on a hot summer day.

Alice


Hypothesis
- Do not harvest all Usenet articles!
- Harvest REPLIES in threads
- A reply
  - Isn't just "Re: " in subject
  - Has a "References" header
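The reply test described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' harvesting code: the function name `is_reply` is hypothetical, and the article is assumed to be available as a raw RFC 1036-style text with headers, a blank line, and then the body.

```python
from email import message_from_string


def is_reply(raw_article: str) -> bool:
    """Return True if the article carries a non-empty References header.

    A "Re: " prefix in the Subject is deliberately ignored, because
    anyone can type it; only the References header counts here.
    """
    msg = message_from_string(raw_article)
    references = msg.get("References", "")
    return bool(references.strip())
```

An article whose subject starts with "Re: " but which has no References header is treated as a non-reply by this check.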

From: Alice
Newsgroups: rec.food.cooking
Subject: Re: Tasty chicken
Date: Sun, 13 Apr :47: (PDT)
Message-ID:
References:

On Apr 12, 7:27 pm, Bob wrote:
>
>BBQ chicken is the best!
>
>Bob
>
I agree, especially on a hot summer day.

Alice

Experiments
- Gather articles from several newsgroups
- Train DSPAM on TREC 05 corpus
- Use DSPAM to classify replies, non-replies
- Manually verify some of DSPAM's results

Newsgroup Lists
- Custom list (38 groups)
  - Google "high traffic"
  - Hand-picked for variety of contexts
  - English
- NewsAdmin top-100 text list (77 groups)
  - Removed testing groups, job postings, spamtrap, net-abuse

Example Newsgroups
- alt.gossip.celebrities
- alt.religion.scientology
- comp.lang.python
- linux.kernel
- misc.invest.stocks
- rec.food.cooking
- rec.games.pinball
- rec.motorcycles
- sci.electronics.repair
- sci.math
- soc.retirement

Pre-processing
- Discard headers
  - Simulations only interested in bodies
- Replies: Remove quoted text
  - Quoted text will be classified separately
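A minimal sketch of this pre-processing step follows, assuming the common Usenet convention that quoted lines begin with ">" and that headers end at the first blank line. The function name `split_body` is hypothetical, and the attribution line ("On ... wrote:") is left in the new text because the slides do not say how it was handled.

```python
from typing import Tuple


def split_body(raw_article: str) -> Tuple[str, str]:
    """Drop the headers and split the body into (new_text, quoted_text)."""
    # Headers end at the first blank line; everything after it is the body.
    _, _, body = raw_article.partition("\n\n")

    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # Assumption: a line is quoted if its first non-space character is ">".
        if line.lstrip().startswith(">"):
            quoted_lines.append(line)
        else:
            new_lines.append(line)

    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()
```

Returning the quoted text instead of discarding it keeps it available for the separate classification pass mentioned above.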

Results (Custom)

              Spam             Ham
Non-Replies   15,323 (16.6%)   76,996
Replies       5,299 (1.2%)     420,956

- Non-replies have 10x more spam

Manual Classification
- Definition of spam: conservative
- E.g.
  - Repeated postings, similar text / templates
  - Repeated words
  - Only a few random words and a URL
- Off-topic is not spam
- Selling goods is not spam
- If in doubt, not spam

Results (Manual, Custom)

Replies
  DSPAM: Spam, Ham
  Manual Spam: 335
  Manual Ham:
  Underestimated HAM

Non-Replies
  DSPAM: Spam, Ham
  Manual Spam:
  Manual Ham: 81345
  Underestimated SPAM

Results (Top 100)

              Spam             Ham
Non-Replies   27,282 (16.8%)   134,909
Replies       55,421 (5.8%)    895,618

- Non-replies have 3x more spam
- Non-English groups problematic
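The percentages in the two results tables can be rechecked from the raw counts, taking the spam rate as spam / (spam + ham). The sketch below uses only the counts shown on the slides; the names `COUNTS` and `spam_rate` are introduced here for illustration. Computed this way, the non-reply to reply ratio comes out somewhat above the rounded "10x" for the custom list and close to "3x" for the top-100 list.

```python
# (spam, ham) counts as classified by DSPAM, copied from the slides.
COUNTS = {
    "custom": {"non_replies": (15323, 76996), "replies": (5299, 420956)},
    "top100": {"non_replies": (27282, 134909), "replies": (55421, 895618)},
}


def spam_rate(spam: int, ham: int) -> float:
    """Fraction of articles in a group that were classified as spam."""
    return spam / (spam + ham)


for list_name, groups in COUNTS.items():
    non_reply_rate = spam_rate(*groups["non_replies"])
    reply_rate = spam_rate(*groups["replies"])
    print(
        f"{list_name}: non-replies {non_reply_rate:.1%}, "
        f"replies {reply_rate:.1%}, "
        f"ratio {non_reply_rate / reply_rate:.1f}x"
    )
```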

A Reply: False Positive

Newsgroups: sci.math
Subject: Re: WHY HAS THIS SITE BECOME SUCH A SPAM TARGET???
Message-ID:
Date: Sun, 27 Apr :03:

> (quoted text removed)

I don't know what ICP is. Google works hard on detecting "click fraud". If John and Jane sell floral arrangements through e-commerce and compete with each other, John can hire people to click on Jane's ads. That's assuming Jane has ads displayed on Web pages (search engines, blogs, etc.). One common arrangement is that Jane pays a few pennies when someone clicks on an ad for her products. The money is divided between the blogger (for example) and those who send "good" ads for the blogger (targeted to the blogger's readers). …

A Non-Reply: False Negative

Newsgroups: alt.gossip.celebrities
Subject: SEXY SOFIE
Date: Wed, 11 Jun :46: (PDT)
Message-ID:

SEXY SOFIE

- And many more like it…

Limitations
- Spammers can use References header?
- Requires either:
  - Generate fake Message-IDs
    - Easy to correlate in Usenet client
  - Harvest real Message-IDs
    - Requires additional bandwidth
- Spam will be buried in threads, not as visible
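One way to picture the "easy to correlate" point is a check that at least one referenced Message-ID belongs to an article that has actually been seen. The sketch below is an assumption about how such a check might look, not a description of any particular Usenet client; `referenced_ids`, `has_known_parent`, and the example IDs in the usage lines are all hypothetical.

```python
import re
from email import message_from_string
from typing import List, Set

# Message-IDs are written in angle brackets, e.g. <id@host>.
MESSAGE_ID_RE = re.compile(r"<[^<>\s]+>")


def referenced_ids(raw_article: str) -> List[str]:
    """Extract every Message-ID listed in the References header."""
    msg = message_from_string(raw_article)
    return MESSAGE_ID_RE.findall(msg.get("References", ""))


def has_known_parent(raw_article: str, known_ids: Set[str]) -> bool:
    """True if at least one referenced Message-ID was seen in the group."""
    return any(ref in known_ids for ref in referenced_ids(raw_article))


# Usage with made-up IDs: a fabricated reference matches nothing seen so far.
seen = {"<real-parent@example.invalid>"}
fake = "Subject: Re: hi\nReferences: <made-up@example.invalid>\n\nBuy now.\n"
print(has_known_parent(fake, seen))
```

A spammer who harvests real Message-IDs would pass this check, which is the second branch on the slide and the one that costs extra bandwidth.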

Conclusions
- Harvesting replies in Usenet is a good source of ham
  - If you can tolerate some noise
- Replies: 1-6% spam
- Non-replies: 3x, 10x worse
