
It is possible to extract a global “beauty” ranking within a large collection of images after a person has played the game on a small number of pairs.





4 After a person has played the game on a small number of pairs of images, it is possible to extract a global “beauty” ranking within a large collection of images, as well as the person’s general image preferences. We use the players’ preferences between images to create a simple gender model. Wider implications of such a two-player game.

5 The judges – the people who give ratings. The judgments/decisions – the ratings the judges give.

6 Absolute judgment – assigning a score to an item (e.g., a star rating from 1 to 5). Disadvantages of absolute judgment: lack of calibration; limited resolution. Relative judgment – a comparison between items (“A is better than B”). Advantages: easy to make; does not change after new information is received.

7 Total judgments – the judges are required to make judgments about all images. Disadvantage: requires comparing every image with every other image, on the order of n² comparisons; therefore infeasible for large datasets. Partial judgments – make judgments about only part of the database. Disadvantage: must deal with incomplete data. Advantages: eases the effort of the user; does not limit the size of the database.

8 Selected access – the judges are allowed to search for particular items and then rate them. Advantage: judges can focus on rating the things in which they are most interested. Disadvantages: easy to maliciously manipulate the results; imbalance in the number of ratings each picture receives; weighting the ratings becomes extremely difficult. Predefined access – the judges are given images to rate in a random predefined sequence and cannot influence which images they rate. Advantage: the possibility of cheating decreases as the amount of data increases.

9 “What do you like?” – consider only your own opinion. “What do you think others will like?” – also consider the opinions of your friends and acquaintances, in combination with external information.

10 Indirect methods – infer “beauty” from meta-information. Disadvantage: once the methods are known, their ratings can quickly be subjected to cheating. Direct methods – ask the judges directly about the “beauty” of an image.

11 Flickr has developed its own algorithm, “interestingness”, to rank images partly based on meta-data. Ultimately, it is not clear whether “interestingness” measures the interestingness of the image or the popularity of its author. http://www.flickr.com/

12 Users vote on images, using either approval/disapproval or a rating scale (e.g., 1 to 5 stars). Users can search for particular items and vote on them (selected access). This is probably the most frequently used method on the Web, e.g., Digg, YouTube, IMDB.

13 Ranks pictures (of people only) using a voting scale from 1 to 10. The images are presented in random order (harder to cheat). http://www.hotornot.com/

14 Sites and their methods:

Method             | Flickr | Voting | Hot or Not | Matchin
Partial            | Yes    | Yes    | Yes        | Yes
Direct             | No     | Yes    | Yes        | Yes
Predefined access  | No     | No     | Partly     | Yes
Others like        | Partly | No     | No         | Yes
Relative           | No     | No     | No         | Yes

15 A 2-player game played over the Internet. The player is matched randomly with another person who is visiting the game’s Web site at the same time. The players play several rounds. In each round, they see the same two images and both are asked to click on the image that they think their partner likes better. If they agree, they both receive points.

16 To score more points, players have to consider not only their own preferences but also their partner’s. Every game takes 2 minutes. One pair of images, or “one round”, takes two to five seconds, so a typical game consists of 30–60 rounds. At the end of the game, the two players can review their decisions and chat with each other. All clicks are recorded and stored in the database.

17 Matchin follows the spirit of GWAP. The game has been played by tens of thousands of players for long periods of time. Scoring functions tried: a linear scoring function; an exponential scoring function (more excitement among the players, but the rewards became too high); a sigmoid function with a maximum at 1,000 points (creates an artificial ladder from which players can fall due to mistakes).

18 Players are given more points for consecutive agreements. Matchin uses a sigmoid function for scoring: the first matches earn only a few points, but the score rises roughly exponentially until the seventh consecutive match, after which the growth of the function slows.
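The streak-based sigmoid can be sketched as follows; the midpoint, steepness, and placement of the 1,000-point cap are illustrative assumptions, not the exact constants used in Matchin:

```python
import math

def round_score(streak, max_points=1000.0, midpoint=7.0, steepness=1.0):
    """Points for the streak-th consecutive agreement (a sketch).

    Early matches earn little; growth is fastest around the seventh
    consecutive match; the reward levels off near max_points. The
    midpoint, steepness and cap are illustrative parameters.
    """
    return max_points / (1.0 + math.exp(-steepness * (streak - midpoint)))
```

With these parameters, the first match is worth only a few points while long streaks approach, but never exceed, the cap.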

19 http://www.gwap.com/gwap/gamesPreview/ matchin/

20 The game was launched on the GWAP Web site on May 15, 2008. Within only four months, 86,686 games had been played by 14,993 players. In total, there have been 3,562,856 individual decisions (clicks) on images. Since the release of the game, there has been on average a click every three seconds. The game is both very enjoyable and works well for collecting large amounts of data.


22 An individual decision/record is stored in the form: id – a number assigned to identify the decision; game_id – the ID of the game; player – the ID of the player who made the decision in this record; better – the ID of the image the player considered better; worse – the ID of the other image; time – the date and time when the decision was made; waiting_time – the amount of time the player looked at the two images before making a decision.
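As a sketch, the record layout above maps naturally onto a small data class; the slide only names the fields, so the field types here are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Decision:
    """One recorded click; field types are assumptions."""
    id: int              # number assigned to identify the decision
    game_id: int         # ID of the game
    player: int          # ID of the player who made the decision
    better: int          # ID of the image the player considered better
    worse: int           # ID of the other image
    time: datetime       # date and time of the decision
    waiting_time: float  # seconds spent looking before deciding
```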

23 We examine several methods to combine the relative judgments into a global ranking. For the global ranking, we consider the data as a multi-digraph: the nodes are the images, and a decision to prefer an image i ∈ V over an image j ∈ V is represented by a directed edge (i, j). The goal: to produce a global ranking over all of the n images. The methods use a ranking function r that maps every image to a real number, called its rank value, and then apply sorting. An image i is ahead of a different image j if its rank value is larger: r(i) > r(j).

24 We will compare three different ranking functions: 1. Empirical Winning Rate (EWR) 2. ELO rating 3. TrueSkill rating Then we will present a new algorithm: Relative SVD

25 EWR is the number of times an image was preferred over a different image, divided by the total number of comparisons in which it was included. In graph terms, the EWR of an image is just its out-degree divided by its degree: EWR(i) = outdeg(i) / deg(i). Problems: images with low degree might get an artificially high or low EWR; it does not take the quality of the competing image into account.
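A minimal computation of EWR from the recorded (better, worse) decisions, as a sketch of the out-degree/degree formula above:

```python
from collections import Counter

def ewr(decisions):
    """Empirical Winning Rate for every image.

    decisions: iterable of (better, worse) pairs, one per recorded click.
    EWR(i) = out-degree(i) / degree(i) in the preference multigraph.
    """
    wins, games = Counter(), Counter()
    for better, worse in decisions:
        wins[better] += 1   # out-degree: comparisons the image won
        games[better] += 1  # degree: all comparisons it took part in
        games[worse] += 1
    return {image: wins[image] / games[image] for image in games}
```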

26 In this model, each player’s rating is first initialized to a certain number on a scale that should reflect the player’s true skill. If a player wins, his/her ELO rating goes up; otherwise it goes down, according to how surprising the win or loss is. Initialize each image’s ELO rating to the same starting value. Before each comparison, we compute the expected scores: E_i = 1 / (1 + 10^((R_j − R_i)/400)) and E_j = 1 − E_i.

27 After we know which image won, we update both pictures’ ELO ratings according to the rule R_i ← R_i + K(S_i − E_i), where S_i = 1 if image i won and 0 otherwise. A large value of K makes the scores more sensitive to “winning” or “losing” a single comparison. For all the following experiments, K = 16. The new ELO rating of the image is used as the ranking function. Problem: the ELO rating system assumes that all players have the same variance in their performance.
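One ELO comparison with the standard expected-score formula and K = 16 can be sketched as follows; the 1,500 starting rating in the usage below is a conventional ELO choice, not stated on the slides:

```python
def elo_update(r_i, r_j, i_won, k=16.0):
    """One ELO update for a single comparison between images i and j.

    Expected score of i: E_i = 1 / (1 + 10 ** ((r_j - r_i) / 400)).
    The winner gains k * (1 - E); the loser falls symmetrically.
    K = 16, as stated on the slide.
    """
    e_i = 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))
    s_i = 1.0 if i_won else 0.0
    new_r_i = r_i + k * (s_i - e_i)
    new_r_j = r_j + k * ((1.0 - s_i) - (1.0 - e_i))
    return new_r_i, new_r_j
```

For two equally rated images, E_i = 0.5, so the winner gains exactly K/2 = 8 points and the loser drops 8.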

28 Every player has 2 variables: a skill and a variance around that skill. Every player’s skill is modeled as a normally distributed random variable centered around a mean μ_i with a per-player variance σ_i². A player’s particular performance in a game is then drawn from a normal distribution with mean μ_i and a per-game variance β², where β is a constant.

29 When image i wins over image j, the following updates are made: μ_i ← μ_i + (σ_i²/c)·v(t), μ_j ← μ_j − (σ_j²/c)·v(t), σ_i² ← σ_i²·(1 − (σ_i²/c²)·w(t)), σ_j² ← σ_j²·(1 − (σ_j²/c²)·w(t)), where c² = 2β² + σ_i² + σ_j², t = (μ_i − μ_j)/c, v(t) = φ(t)/Φ(t) with φ the standard normal probability density and Φ the cumulative distribution function, and w(t) = v(t)·(v(t) + t). As our ranking function, we use the conservative skill estimate r(i) = μ_i − 3σ_i, which is approximately the first percentile of the image’s quality. Thus, with very high probability the image’s quality is above the conservative skill estimate.
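A sketch of this simplified two-player TrueSkill update (no draws); the default β and the initial μ, σ in the usage are the conventional TrueSkill choices, assumed here rather than taken from the slides:

```python
import math

def _pdf(t):
    # standard normal probability density phi(t)
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def _cdf(t):
    # standard normal cumulative distribution Phi(t)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def trueskill_update(mu_i, sigma_i, mu_j, sigma_j, beta=25.0 / 6.0):
    """Update skills after image i wins over image j.

    c^2 = 2*beta^2 + sigma_i^2 + sigma_j^2, t = (mu_i - mu_j) / c,
    v(t) = phi(t) / Phi(t), w(t) = v(t) * (v(t) + t).
    """
    c = math.sqrt(2.0 * beta ** 2 + sigma_i ** 2 + sigma_j ** 2)
    t = (mu_i - mu_j) / c
    v = _pdf(t) / _cdf(t)
    w = v * (v + t)
    new_mu_i = mu_i + (sigma_i ** 2 / c) * v
    new_mu_j = mu_j - (sigma_j ** 2 / c) * v
    new_sigma_i = sigma_i * math.sqrt(1.0 - (sigma_i ** 2 / c ** 2) * w)
    new_sigma_j = sigma_j * math.sqrt(1.0 - (sigma_j ** 2 / c ** 2) * w)
    return new_mu_i, new_sigma_i, new_mu_j, new_sigma_j

def conservative_rank(mu, sigma):
    """Conservative skill estimate mu - 3*sigma used as the rank value."""
    return mu - 3.0 * sigma
```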

30 The new goal: finding out not only general preferences but also each individual’s preferences. This will allow us to: recommend images to each user based on his/her preferences; compare users (which users are similar?); compare images (which images are similar?). A new collaborative-filtering algorithm called “Relative SVD” based on comparative judgments.

31 Each user p is described by a feature vector u_p of length f; the user feature vectors are stored in a matrix U. Each image i is described by a feature vector v_i of length f; the image feature vectors are stored in a matrix V. The amount by which user p “likes” image i is equal to the dot product of their feature vectors: s(p, i) = u_p · v_i.

32 We interpret the data gathered from our game as a set of triplets (p, b, w): player p preferred image b over image w. We set the target value to 1 for all triplets. The error for a particular decision between a sample from the training data and our model can be written as e_{p,b,w} = 1 − u_p · (v_b − v_w), and the sum of squared errors is E = Σ_{(p,b,w)} e_{p,b,w}². The goal is to find the feature matrices U, V that minimize the total sum of squared errors.

33 In order to minimize this error, we compute the partial derivatives of E with respect to each feature vector: ∂E/∂u_p = −2·e_{p,b,w}·(v_b − v_w), ∂E/∂v_b = −2·e_{p,b,w}·u_p, ∂E/∂v_w = 2·e_{p,b,w}·u_p. Applying ordinary gradient descent with a step size η while adding a regularization penalty with parameter λ, we obtain the following update equations for the feature vectors: u_p ← u_p + η·(e_{p,b,w}·(v_b − v_w) − λ·u_p), v_b ← v_b + η·(e_{p,b,w}·u_p − λ·v_b), v_w ← v_w − η·(e_{p,b,w}·u_p + λ·v_w).

34 1. Initialize the image feature vectors and the user feature vectors with random values. Set η, λ to small positive values. 2. Loop until converged: a. Iterate through all training examples (p, b, w) ∈ T. i. Compute the error e_{p,b,w}. ii. Update u_p. iii. Update v_b. iv. Update v_w. b. Compute the model error. Experimentation showed that f = 60, η = 0.02 and λ = 0.01 are good values. After the user and image vectors have been computed, predicting a user’s preference for an image is easy: simply compute the dot product between their corresponding feature vectors.
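The training loop above can be sketched in plain Python; the defaults follow the slide’s f = 60, η = 0.02, λ = 0.01, while the epoch count, initialization range, and random seed are assumptions:

```python
import random

def train_relative_svd(triplets, n_users, n_images,
                       f=60, eta=0.02, lam=0.01, epochs=50, seed=0):
    """Gradient descent for Relative SVD (a sketch).

    triplets: (p, b, w) records - player p preferred image b over image w.
    For every triplet the target is 1, i.e. u_p . (v_b - v_w) -> 1.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_images)]
    for _ in range(epochs):
        for p, b, w in triplets:
            u, vb, vw = U[p], V[b], V[w]
            error = 1.0 - sum(u[k] * (vb[k] - vw[k]) for k in range(f))
            for k in range(f):
                uk = u[k]  # keep the old value for the image updates
                u[k] += eta * (error * (vb[k] - vw[k]) - lam * uk)
                vb[k] += eta * (error * uk - lam * vb[k])
                vw[k] -= eta * (error * uk + lam * vw[k])
    return U, V

def preference(U, V, p, i):
    """Predicted amount by which player p likes image i (dot product)."""
    return sum(U[p][k] * V[i][k] for k in range(len(U[p])))
```

After training, ranking two images for a known player amounts to comparing the two dot products.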

35 Comparison of the 3 models and the new Relative SVD. Split our data into a training set and a test set. Train all four models on the training data. Use the learned models to predict users’ behavior on the test data.

36 For all models, the error decreases as we use more training data. With fewer data points, we find that ELO works best, while EWR and Relative SVD perform worst. As we increase the amount of training data, Relative SVD beats all the other models. (Figure: testing error of the different ranking algorithms as the number of judgments in the training data increases.)

37 The learning curve for several learning algorithms.

38 Do the players learn which types of images are generally preferred? If they adapt too much, it could have unwanted effects. Test: agreement rate of first-time players vs. experienced players. First-time players agree 69.0% of the time with their partner; experienced players agree 71.8% of the time. Conclusion: the players only marginally adapt to the game. Additional check: do people learn during a single game? We measure the agreement rate in the first half compared to the second half of the game. Analysis of 100 games showed the agreement rate goes down from 67% to 64%.

39 http://www.gwap.com/gwap/

40 The gender of 2,594 players was known from their profile settings; of these players, 68% are male and 32% female. To find a pair of images (i, j) that has a strong gender bias, we compute the conditional entropy H[G | D_{i,j}], where G is the gender and D_{i,j} is the player’s decision (i > j means that image i was considered better than image j). A pair (i, j) has a large gender bias (and is therefore good for gender determination) if the conditional entropy H[G | D_{i,j}] is small (learning the decision tells us a lot about the gender). For the class conditionals, two ELO predictors were trained, one with male players only and one with female players only. We then compute H[G | D_{i,j}] for many pairs of images and select pairs for which H[G | D_{i,j}] is smaller than a fixed threshold value. To predict the gender of a new user, we sample 10 edges from those with a strong gender bias and ask the user to choose the image they prefer for each pair; then we choose the gender that maximizes the likelihood of the data. (Recall: if H(Y | X = x) is the entropy of the variable Y conditional on the variable X taking a certain value x, then H(Y | X) is the result of averaging H(Y | X = x) over all possible values x that X may take.)
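The conditional entropy for a single pair can be sketched from the class-conditional choice probabilities (which the slides obtain from the two per-gender ELO predictors); the probability values in the usage below are made up for illustration:

```python
import math

def conditional_entropy(p_male, p_i_male, p_i_female):
    """H[G | D] for one image pair (i, j).

    p_male:     prior probability that a player is male (0.68 in the data)
    p_i_male:   probability that a male player prefers image i over j
    p_i_female: probability that a female player prefers image i over j
    """
    h = 0.0
    # D has two outcomes: the player chose image i, or chose image j.
    for d_male, d_female in ((p_i_male, p_i_female),
                             (1.0 - p_i_male, 1.0 - p_i_female)):
        p_d = p_male * d_male + (1.0 - p_male) * d_female   # P(D = d)
        for p_g, p_d_given_g in ((p_male, d_male),
                                 (1.0 - p_male, d_female)):
            p_gd = p_g * p_d_given_g                        # P(G = g, D = d)
            if p_gd > 0.0:
                h -= p_gd * math.log2(p_gd / p_d)           # -P(g,d) log2 P(g|d)
    return h
```

If male and female players choose identically, the decision reveals nothing and H[G | D] equals the prior entropy H[G]; the more the class conditionals differ, the smaller H[G | D] gets.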


43 Top-ranked images by the different global ranking algorithms: Nature pictures are ranked higher than pictures depicting humans; among the 100 highest-ranked pictures there is not a single picture in which a human is the dominant element. Animal pictures are also preferred over pictures depicting humans: exotic animals like pandas, tigers, chameleons, fish and butterflies. Pets are also ranked high, but usually below the exotic animals. Pictures of flowers, churches, bridges and famous tourist attractions also rank high. The worst pictures: almost all were taken indoors and include a person; many of these pictures are blurry or too dark. Some of the worst pictures are screenshots or pictures of documents or text. The pictures that made it into the top 100 are neither provocative nor offensive. Why?

44 Collaborative filtering indicates that there are substantial differences among players in judging images, and taking those differences into account can greatly help in predicting users’ behavior on new images. We can predict with a probability of 83% which of two images a known player will prefer, compared to only 70% if we do not know the player beforehand. Players do not learn much about their partners’ preferences by playing the game: more experienced players had about the same error rate as new players.

45 The paper provides a new method to elicit user preferences: for two images, we ask users not which one they prefer, but rather which one a random person will prefer. We compared several algorithms for combining these relative judgments into a total ordering and found that they can correctly predict a user’s behavior in 70% of the cases. We described a new algorithm called Relative SVD that performs collaborative filtering on pair-wise relative judgments; it outperforms other ordering algorithms that do not distinguish among individual players in predicting a known player’s behavior. We also saw a gender test that asks users to make some relative judgments and, based only on these judgments, predicts a random user’s gender with an accuracy of 80%.

46 Generalize the game to other kinds of media: short videos or songs. Generalize the game to other types of questions: “Which image do you think your partner thinks is more interesting?” or “Given that your partner is female, which image do you think your partner prefers?” It remains to be investigated how much other personal information can be gathered in the same way as our gender test does.

47 Can I play again? Pleassssse!!!

