CSC 556 – DBMS II, Spring 2013, Week 7 Bayesian Inference: Paul Graham's Plan for Spam, + A Patent Application for Learning Mobile Preferences, + some text related matters

Bayes and Incremental (Lazy) Learning Bayes' formula: P(selected | K) = (P(K | selected) x P(selected)) / P(K). My collection of terms on the right-hand side: – A priori probability of events: P(selected) = size of SelectedSet / size of InspectedSet; P(K) = count of K in InspectedSet / size of InspectedSet. – A posteriori (conditional) probability: P(K | selected) = count of K in SelectedSet / size of SelectedSet. InspectedSet holds the available alternatives that the mobile user inspected; they correspond to all emails seen in Graham. SelectedSet holds the available alternatives contacted by the mobile user; they correspond to the spam vs. non-spam choices in Graham.
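
A minimal sketch of these ratio estimates in Python; the event logs 'inspected' and 'selected' and all names are illustrative, not taken from the patent application:

    # Sketch: estimate P(selected | K) from inspection/selection logs.
    # Each alternative is modeled as the set of its keywords.
    def p_selected_given_k(k, inspected, selected):
        p_selected = len(selected) / len(inspected)                             # P(selected)
        p_k = sum(1 for alt in inspected if k in alt) / len(inspected)          # P(K)
        p_k_given_sel = sum(1 for alt in selected if k in alt) / len(selected)  # P(K | selected)
        return (p_k_given_sel * p_selected) / p_k                               # Bayes' formula

    inspected = [{"pizza", "delivery"}, {"pizza"}, {"sushi"}, {"coffee"}]
    selected = [{"pizza", "delivery"}, {"pizza"}]
    print(p_selected_given_k("pizza", inspected, selected))  # 1.0: selected every time it was seen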

Graham’s Formula (Hackers and Painters) r_g = min(1, 2 * good(w) / G); r_b = min(1, bad(w) / B); P(spam|w) = max(0.01, min(0.99, r_b / (r_g + r_b))). w is a token from an email; good(w) and bad(w) are occurrence counts of w in good and bad emails; G and B are the numbers of good and bad emails. The factor of 2 biases against false positives (don’t over-discard), because false positives are non-linearly undesirable. The min and max clamps relate to the Laplace Estimator (textbook).
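
A direct transcription of the per-token formula as a Python sketch; the counts are assumed inputs, and a real filter would guard against tokens absent from both corpora, which Graham handles with the 0.4 unknown-token default on the next slide:

    # Sketch of Graham's per-token spam probability.
    # good_w, bad_w: occurrence counts of token w in good/bad email;
    # g, b: numbers of good/bad emails in the training corpora.
    def p_spam_given_w(good_w, bad_w, g, b):
        r_g = min(1.0, 2.0 * good_w / g)   # the 2* biases against false positives
        r_b = min(1.0, bad_w / b)
        return max(0.01, min(0.99, r_b / (r_g + r_b)))

    print(p_spam_given_w(good_w=3, bad_w=30, g=1000, b=1000))  # ~0.83, seen mostly in spam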

Graham continued Graham uses the 15 most interesting tokens: tokens occurring more than 5 times, ranked by maximum distance from 0.5. P(spam) = (∏ i=1..15 P(spam|w_i)) / (∏ i=1..15 P(spam|w_i) + ∏ i=1..15 (1 - P(spam|w_i))). Assign 0.4 spam probability to an unknown token. Treat the email as spam if P(spam) > 0.9.
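
A sketch of the combination step, assuming the per-token probabilities are already computed; selection of the 15 tokens farthest from 0.5 is shown, while the >5 occurrence filter would happen earlier, during counting:

    # Sketch: combine the most interesting token probabilities into P(spam).
    from math import prod

    def p_spam(token_probs, n=15):
        probs = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
        num = prod(probs)
        return num / (num + prod(1.0 - p for p in probs))

    msg = [0.99, 0.95, 0.9, 0.2, 0.4, 0.4]  # per-token P(spam|w); unknowns get 0.4
    print(p_spam(msg), p_spam(msg) > 0.9)   # ~0.999, classified as spam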

Spam of the Future (is now) “Assuming they could solve the problem of the headers, the spam of the future will probably look something like this: Hey there. Thought you should check out the following: http://www.27meg.com/foo because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)” See my example from March 12.

Better Bayesian Filtering (1/2003) Train on lots and lots of data. Don’t ignore headers; don’t stem tokens. Related to textbook section 3.5, n-grams for small n (≈3). Use only the most significant entropy-reducing tokens (Graham’s 15). Bias against false positives. Graham spends a lot of time talking about tokenization: data cleaning, what data to include, and data format are all key and work-intensive.
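
A sketch of tokenization in this spirit: header-qualified tokens, no stemming, plus character 3-grams. The regular expression and naming rules here are illustrative, not Graham's exact ones:

    # Sketch: tokenize one header/body field without stemming, keeping the
    # field name as context and adding character 3-grams per token.
    import re

    def tokenize(field, text, n=3):
        words = re.findall(r"[A-Za-z0-9$.'-]+", text)
        tokens = [f"{field}:{w}" for w in words]          # e.g. "Subject:FREE"
        for w in words:
            tokens += [f"{field}:{w[i:i+n]}" for i in range(len(w) - n + 1)]
        return tokens

    print(tokenize("Subject", "FREE money!!"))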

Parson, Glossner & Jinturkar Mobile users have direct associations to their “friends and relatives” (and establishments, etc.). These directional associations have strengths 0.0 through 1.0 borrowed from spreading activation. When arriving / departing, weighted associations scale * transitive strength * proximity. The InspectedSet consists of the number of alternatives inspected by the user on the handset.

Centile Ranking For learning Associations and Preferences, and for adjusting their weights. Raw Score = NumberOfTimesSelected / NumberOfTimesInspected. Sort by Raw Score; form equivalence classes; position in the sort gives the percentile rank (CentileWeight). NewStrength = ((OldStrength * DragFactor) + CentileWeight) / (DragFactor + 1). New Preference Learning ranks Keywords according to the Association weights that reach those keywords, applies Centile Ranking to the top “selected” keywords, then forms a single Keyword query formula for the top-ranked Keywords.
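
A sketch of the ranking and update, where tied raw scores share one equivalence class and ranks are normalized to [0.0, 1.0]; this is one plausible reading of CentileWeight, not necessarily the patent's exact definition:

    # Sketch: percentile rank with ties as equivalence classes, then the
    # drag-factor update; a larger DragFactor makes learning slower.
    def centile_weights(raw_scores):
        classes = sorted(set(raw_scores))            # equivalence classes
        rank = {s: i for i, s in enumerate(classes)}
        top = max(1, len(classes) - 1)
        return [rank[s] / top for s in raw_scores]

    def update_strength(old, centile, drag=3.0):
        return (old * drag + centile) / (drag + 1.0)

    raws = [2/10, 5/10, 5/10, 9/10]      # NumberOfTimesSelected / NumberOfTimesInspected
    print(centile_weights(raws))         # [0.0, 0.5, 0.5, 1.0]
    print(update_strength(0.6, 1.0))     # 0.7, drifting toward the new evidence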

Bayesian Preference Rules P(selected | K) = (P(K | selected) x P(selected)) / P(K). P(selected) = size of SelectedSet / size of InspectedSet; P(K) = count of K in InspectedSet / size of InspectedSet; P(K | selected) = count of K in SelectedSet / size of SelectedSet. InspectedSet holds the available, inspected alternatives; SelectedSet holds the available alternatives contacted by the mobile user. NewStrength = ((OldStrength x DragFactor) + P(selected | K)) / (DragFactor + 1). Use only Keywords reached via Associations, i.e., contacts that were actually made; the top percent are treated as “selected.” This avoids storing data for unselected keywords. Multiple keywords connected by and, or, and minus are possible; and is implemented. Garbage-collect weak Associations and Preferences (via a Threshold).
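
The same drag-factor update, driven by the Bayesian estimate instead of a centile weight, plus threshold-based garbage collection; a minimal sketch with illustrative names and threshold value:

    # Sketch: Bayesian strength update plus garbage collection.
    def update_strength_bayes(old, p_sel_given_k, drag=3.0):
        return (old * drag + p_sel_given_k) / (drag + 1.0)

    def garbage_collect(strengths, threshold=0.05):
        # drop Associations/Preferences whose strength fell below threshold
        return {k: v for k, v in strengths.items() if v >= threshold}

    print(update_strength_bayes(0.5, 1.0))                # 0.625
    print(garbage_collect({"pizza": 0.62, "fax": 0.01}))  # {'pizza': 0.62}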

Deficiencies of Weka’s relational approach Some attributes are naturally sets of values from a domain; some are counted multisets; some are naturally maps K → V. Multi-instance classifiers in Weka are only a small subset, and they do not address multi-value attributes. Text mining is also outside Weka’s scope.
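
The usual workaround, sketched below, is to flatten a set or counted multiset into one numeric column per domain value so a relational learner can consume it; this is a representation choice on our side, not a Weka feature:

    # Sketch: flatten a counted-multiset attribute into fixed count columns.
    from collections import Counter

    DOMAIN = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # e.g., Scrabble letters

    def flatten(multiset_values):
        counts = Counter(multiset_values)
        return [counts.get(v, 0) for v in DOMAIN]

    print(flatten(["Q", "U", "I", "Z", "Z"]))    # 26 count columns, Z = 2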

Data Mining Scrabble It requires text processing, and the text is not conventional prose. It poses problems for data representation / normalization. It will support multi-instance classification: each “play” in a game is an instance of attributes, and plays across games have relationships. It may support analysis of social interactions.

What to collect? Each instance contains gameID, moveNumber, playerID, row or column of play, pointGain, pointSum, remainderInTileBag, isWinner/rank. These are conventional scalar attribute values. Multi-value attributes are harder: there are zero or more PLAYED, LEAVED, CROSSED, BORROWED (from the sides), and SWAPPED letters per play. Bindings for blank, 3-word, 2-word, 3-letter, and 2-letter bonus locations vary in how many are available and in what they bind (word vs. letter). How do we deal with strings, fragments, and letter blends? Contiguous consonants, also vowels?
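
A sketch of one play instance as a record: the scalar columns above plus the multi-value letter attributes, which would still need flattening (as in the earlier sketch) before a relational learner could use them. Field names are illustrative:

    # Sketch: one Scrabble "play" instance.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Play:
        gameID: int
        moveNumber: int
        playerID: str
        rowOrColumn: str                                   # e.g. "H" or "8"
        pointGain: int
        pointSum: int
        remainderInTileBag: int
        isWinner: bool
        played: List[str] = field(default_factory=list)    # PLAYED letters
        leave: List[str] = field(default_factory=list)     # LEAVED (rack leave)
        crossed: List[str] = field(default_factory=list)   # CROSSED letters
        borrowed: List[str] = field(default_factory=list)  # BORROWED from sides
        swapped: List[str] = field(default_factory=list)   # SWAPPED letters

    p = Play(1, 7, "p1", "H", 33, 158, 42, False, played=list("QUIZ"))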