1
Learning to Win by Reading Manuals in a Monte-Carlo Framework
2
The problem
Leveraging automatically extracted textual knowledge could greatly improve performance.
No prior method has attempted to incorporate textual information into these control applications.
Written documentation for a task can provide insight that a model would otherwise have to learn through trial and error.

Control applications are widespread, and automating them is an important task. This paper proposes a method to incorporate textual information into this automation, with the hope that it can improve performance.
3
Why does it matter?
In an environment with a large state space and a high branching factor, it is prohibitively slow to optimize a model solely through trial-and-error learning.
By incorporating written instructions, much of the guesswork is eliminated.
This paper proposes a method of learning to win the game Civilization II by extracting text from the game manual and utilizing it in a Monte-Carlo framework.

In an environment with a large state space and a high branching factor, training a model by searching through all possible states is implausible. If we incorporate written guides, we can remove much of the trial and error that would otherwise be necessary, simply by making the recommended choices. In particular, this paper proposes utilizing the game manual for "Civilization II" to improve the win rate of an automated Monte-Carlo search.
4
The Game
Civilization II is a multiplayer game set on a grid-based map of a randomly generated world.
Each grid location represents either land or sea, and has various resources and terrain attributes.
A game ends when only one civilization remains on the map.

For background, Civilization II is a multiplayer game set on a grid-based map. Each square in the grid represents either land or sea, and has various resources and terrain associated with it. The objective of the game is to be the only remaining civilization on the map.
5
Proposed Approach
The baseline method learns an action-value function using a Monte-Carlo search framework.
The proposed model augments the baseline by adding linguistic features via a neural network.
The training approach is most similar to reinforcement learning, where the main source of supervision is a utility function based on the in-game score.

In this paper, the baseline method uses a Monte-Carlo search framework to learn an action-value function. This action-value function estimates the in-game score given the current state and a candidate action. By maximizing the action-value function, the algorithm aims to take the best possible action. The proposed method improves on this model by adding linguistic features that are selected through a neural network. Finally, learning occurs by improving the estimated action-value function to match the actual utility function, which in this case is the in-game score. This approach is most similar to reinforcement learning.
6
Monte-Carlo Algorithm
The Monte-Carlo algorithm is fairly simple. We start in the PlayGame procedure, which first initializes the game state. Then, at each timestep, we simulate N games with our current state as the starting point. In each of these simulated games, we calculate the action-value Q and choose the best action according to Q. Notice that there is a small probability of picking a random action instead; this is called an epsilon-greedy approach. Each simulated game continues this way until the rollout ends, then returns the first action it chose as well as the game score in its final state. The PlayGame procedure then chooses the action that maximized the final game score and applies it in the real game.
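A minimal Python sketch of this loop under the setup described later in the deck (500 rollouts per real step, 20-step rollouts, 100 real steps); the `game` interface, the `estimate_q` function, and the epsilon value are assumptions standing in for the real simulator and the learned action-value function, not the authors' implementation.

```python
import random

def simulate_game(state, game, estimate_q, depth=20, epsilon=0.05):
    """One rollout: follow an epsilon-greedy policy for `depth` steps.

    `game` is a hypothetical interface with actions(state), step(state, action),
    and score(state); `estimate_q(state, action)` is the learned action-value.
    Returns the first action taken and the final in-game score.
    """
    first_action = None
    for _ in range(depth):
        actions = game.actions(state)
        if random.random() < epsilon:
            action = random.choice(actions)          # occasional random exploration
        else:
            action = max(actions, key=lambda a: estimate_q(state, a))
        if first_action is None:
            first_action = action
        state = game.step(state, action)
    return first_action, game.score(state)

def play_game(game, estimate_q, steps=100, rollouts=500):
    """Outer loop: at each real step, run many rollouts and apply the action
    whose rollout reached the best final score."""
    state = game.initial_state()
    for _ in range(steps):
        results = [simulate_game(state, game, estimate_q) for _ in range(rollouts)]
        best_action, _ = max(results, key=lambda r: r[1])
        state = game.step(state, best_action)
    return state
```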
7
Neural Network Model
To leverage textual information, the document is passed into a four-layer neural network.

The full model uses a four-layer neural network to incorporate textual data into the algorithm from the previous slide.
8
Neural Network Layers
The 1st layer is a representation of the current state, a candidate action, and the input document.
The 2nd layer consists of two parts, which encode sentence relevance and predicate labeling.
The 3rd layer is a fixed feature layer, which deterministically computes a vector from the outputs of the previous two layers.
The 4th layer encodes the action-value function as a weighted linear combination of the units of the feature layer.

The 1st layer of the network is a representation of the current state, a candidate action, and the input document. In the 2nd layer, two modules are used to determine how relevant each sentence is to the current state, and then to label the predicates of the chosen sentence. The 3rd layer is a fixed feature layer, which deterministically computes a vector from the outputs of the previous two layers. The 4th layer encodes the action-value function as a weighted combination of the feature layer's active features.
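A structural sketch in Python/NumPy of how these layers could fit together, treating the state, the action, and the sentences as precomputed vectors; the feature construction, dimensions, and weight vectors `u` and `w` are assumptions, and the predicate-labeling module (sketched on a later slide) is omitted here.

```python
import numpy as np

def forward(state_vec, action_vec, sentence_vecs, u, w):
    """Layer 1: raw vectors for the state, the action, and the document sentences.
    Layer 2: log-linear sentence relevance (predicate labeling omitted in this sketch).
    Layer 3: deterministic feature vector built from layers 1 and 2.
    Layer 4: Q-value as a weighted linear combination of the feature layer."""
    # Layer 2a: softmax relevance over sentences
    scores = np.array([u @ np.concatenate([state_vec, action_vec, s]) for s in sentence_vecs])
    relevance = np.exp(scores - scores.max())
    relevance /= relevance.sum()
    best_sentence = sentence_vecs[int(relevance.argmax())]

    # Layer 3: fixed feature layer from the state, the action, and the chosen sentence
    features = np.concatenate([state_vec, action_vec, best_sentence])

    # Layer 4: weighted linear combination gives the action-value estimate
    return float(w @ features)
```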
9
Modeling Sentence Relevance
We wish to identify the sentence y_i that is most relevant to the current game state s_t and action a_t.
We model this as a log-linear distribution over sentences:
p(y_i | s_t, a_t, d) ∝ exp(u · φ(y_i, s_t, a_t, d)),
where φ is a feature function and u is a vector of weights.

One half of the second layer is used to determine whether a sentence y_i is relevant to the given state s_t and action a_t. This is modeled as a log-linear distribution across the sentences of the document d.
10
Sentence Parsing for Predicate Labeling
Each word of a sentence is labeled as one of three options: action-description, state-description, or background.
Sentences are parsed using the Stanford parser.
State-description words tend to be descendants of action-description words in these trees.
Example sentence: "Build your city on a plains or grassland square with a river running through it."
[Figure: partial dependency parse of the example sentence, with nodes such as "build", "city", "plains", "square", "river", and "grassland".]

The second part of this layer is used to label the predicates in a sentence. It first uses the Stanford dependency parser to create a dependency tree, then labels each word as either action-description, state-description, or background. The figure on the right is a partial parse of an example sentence from the manual.
11
Modeling Predicate Structure
We model the distribution over predicate labels as:
p(e_j | y_i, q_i, e_{1:j-1}) ∝ exp(v · φ(e_j, j, y_i, q_i, e_{1:j-1})),
where y_i is sentence i, q_i is its dependency tree, e_j is the predicate label for the jth word, and e_{1:j-1} is the partial predicate labeling up to the jth word.
In practice, predicate labeling is performed only on the sentence determined to be most relevant to the current action and state.

State-description words tend to be descendants of action-description words, so including the dependency tree in the distribution is intended to reflect this. The formula above shows how the predicate labels are determined. Notice that the probability of a particular label for a word is influenced by the labels of preceding words in the dependency tree. In theory, this labeling would be done for every sentence/state pair, but in practice it is only performed on the sentence chosen as most relevant to the current state.
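A minimal Python sketch of how such a word-by-word labeler could be decoded, assuming a feature function phi(label, j, words, dep_tree, partial_labels) and a weight vector v; the greedy left-to-right decoding is an illustrative choice, not necessarily what the authors use.

```python
import numpy as np

LABELS = ["action-description", "state-description", "background"]

def label_predicates(words, dep_tree, v, phi):
    """Greedy decoding of the predicate-label distribution above.

    `dep_tree` maps each word index to its parent index, `phi` is an assumed
    feature function returning a vector, and `v` is the matching weight vector.
    """
    labels = []
    for j in range(len(words)):
        scores = np.array([v @ phi(lab, j, words, dep_tree, labels) for lab in LABELS])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        labels.append(LABELS[int(probs.argmax())])  # greedy choice at each word
    return labels
```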
12
Training
Learning is performed online.
At each game state the algorithm simulates the next states, observes the outcome of the game, and updates the weight vectors.
The goal is to minimize the mean-squared error between the action-value Q(s, a, d) and the final utility R(s_{t+n}) through stochastic gradient descent.

At each game state, the algorithm simulates the next states, observes the outcome of the game, and then updates the weight vectors. The goal is to minimize the mean-squared error between the estimated action-value and the final game score using stochastic gradient descent (SGD).
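A minimal sketch of one online update for the final linear layer, assuming Q(s, a, d) is the dot product of a weight vector with the fixed feature layer; the learning rate is an assumption, and the updates to the lower (sentence-relevance and predicate-labeling) layers of the full model are omitted.

```python
import numpy as np

def sgd_update(w, features, final_utility, alpha=1e-4):
    """One step of stochastic gradient descent on 0.5 * (Q - R)^2.

    With Q = w . features, the gradient with respect to w is (Q - R) * features.
    `final_utility` is the game score observed at the end of the rollout."""
    q = w @ features
    return w - alpha * (q - final_utility) * features
```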
13
Experimental Framework
Results are averaged across 200 games.
Each game starts in the same initial state and runs for 100 steps.
Each step runs 500 Monte-Carlo rollouts.
Each rollout simulates the next 20 steps.

This image shows the process of running the experiment: 200 games were run up to step 100; at each state in these games, the Monte-Carlo framework simulates 500 rollouts to determine the best next action, and each of these rollouts simulates the next 20 actions.
14
Evaluation
Two aspects of the method are evaluated:
How much is gameplay improved by leveraging textual information?
How accurate is the linguistic analysis produced by the method?
The method is compared against the baselines, with every method playing against the built-in game AI.
The AI is a challenging opponent designed using extensive knowledge of the game.

The authors of this paper wanted to evaluate two aspects of their method. First, they wanted to see whether gameplay was improved by leveraging textual information from the manual. Second, they wanted to determine the accuracy of the linguistic analysis. For the first goal, they compare the method's performance against the built-in AI with that of the baselines. The AI in this game is known to be challenging even for human players.
15
Game Results
Full games can continue for multiple days, so the primary evaluation is the percentage of games won or lost after 100 steps.
The top two baselines represent random choices and the built-in AI playing against itself, respectively.
The game-only baseline shows the result of playing with no textual information, and therefore does not utilize the neural network.
The sentence-relevance model ignores the predicate-labeling aspect of the input text.

The primary method of evaluation was to run each game until step 100, then determine whether the game had been won, lost, or was still in progress. The table to the right shows the results of the full model against several baselines. The top two baselines in the table have a zero percent win rate; these represent random selection of actions and the built-in AI playing against itself. The game-only baseline does not use any textual information and has a poor win rate of 17.3%. Next, the sentence-relevance model ignores the predicate-labeling aspect of the input text, and performs worse than the full model.
16
Game Results
The remaining baselines are used to show that the non-linear model is not the sole cause of the performance improvement.
The random-text baseline uses an identical model, but rearranges the word order in each sentence of the input document.
The latent-variable baseline provides latent variables related to the game state in place of text input to the neural network.
The conclusion drawn from these results is that textual information is a valuable tool for increasing performance.

The remaining baselines are used to show that the non-linear model is not the sole cause of the performance improvement. The random-text baseline uses an identical model, but rearranges the order of words in each sentence of the input document. Finally, the latent-variable baseline provides only latent variables related to the game state as input to the neural network.
17
Game Results
50 distinct games are run from start to finish.
The model wins 78.8% of these games.
This shows that the previous result of 53.7% wins is an underestimate.

A secondary evaluation was performed by running 50 distinct games from start to finish and recording the total percentage of wins. By this metric the model performed much better than the previous 53.7% win rate.
18
Linguistic Results
The first box shows sentences that were selected for their relevance, and those that were not.
The second box shows predicate-labeled sentences:
Orange words represent states.
Blue words represent actions.
Red Xs indicate words that were improperly labeled.

The second aspect being evaluated is the effectiveness of the linguistic analysis. This slide shows examples of sentences that were chosen as relevant compared to those that were not, and also provides examples of predicate labeling within sentences.
19
Linguistic Results
There is no ground-truth annotation of relevant sentences.
The authors insert randomly selected Wall Street Journal sentences into the document and judge accuracy by the percentage of chosen sentences that come from the game manual.
The model achieves an average accuracy of 71.8%.
The accuracy drops steeply after the first 25 steps.

Since there is no ground truth for which sentences are relevant to a given game state, the authors settled on a different means of evaluation. They inserted randomly selected Wall Street Journal sentences into the input text, with the idea that these new sentences are not at all relevant. Accuracy is determined by how often the model chooses a sentence from the game manual as most relevant, and averages only 71.8%. The authors note that the accuracy drops sharply after the 25th step of the game.
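A sketch of how this distractor-based accuracy could be computed; `pick_most_relevant` stands in for the trained sentence-relevance module, and shuffling the manual and Wall Street Journal sentences into one document is an assumption about the setup.

```python
import random

def relevance_eval(manual_sentences, wsj_sentences, game_states, pick_most_relevant):
    """Distractor evaluation: fraction of game states for which the relevance
    module picks a manual sentence rather than an inserted WSJ sentence."""
    document = manual_sentences + wsj_sentences
    random.shuffle(document)                      # mix distractors into the document
    manual = set(manual_sentences)
    hits = sum(1 for state in game_states
               if pick_most_relevant(state, document) in manual)
    return hits / len(game_states)
```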
20
Linguistic Results
Findings show that text features dominate in the early game states but are less useful as the game progresses.
This is validated by the ratio of activated text features vs. game features in the neural network.
A hybrid method in which no text data is used after step 50 achieves a 53.3% win rate, on par with the original model.

Following this observation, the authors investigate the relevance of the manual as the game progresses. Findings show that text features are very important early on, but decrease in relevance as the game continues. To validate this, the authors look at the ratio of activated text features vs. game features in the neural network at each step of the game; this is plotted in the figure on the right. The observation is further verified by a model that only uses text data during the first 50 steps, which achieves a win rate on par with the original model.
21
Critique
The utility function R(s) acts as the only supervision during training, but it is a noisy function.
The effectiveness of the linguistic analysis performed by the model could be evaluated better.
The organization of ideas in the paper could be more streamlined: concepts are introduced and then forgotten about until several pages later, and information such as attribute-to-word mappings is tacked on at the end.

On the linguistic analysis: the authors constructed a weak test to show that their model picks "relevant" sentences, by inserting randomized sentences with zero relevance into the document. This doesn't prove much, because most of the sentences in the document are not directly relevant to any actual game state, so even when manual sentences are chosen over the distractors, the neural network may still be performing poorly.
22
Critique
The proposed method optimizes for the immediately following state rather than the eventual end goal.
Rollouts in the Monte-Carlo search primarily choose the action with the highest probability of maximizing the in-game score at the next state.
The algorithm has only a small chance of picking an action uniformly at random: if 1% of the rollouts choose one of 15 actions at random, then only about a third of the possible actions will be represented.
Choosing the first action of each rollout uniformly, and then maximizing the action-value for the remaining 19 steps, would give more diversity in the results.
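A back-of-the-envelope check of the "about a third" figure, under the numbers stated on the slides (500 rollouts per step, a 1% exploration rate, 15 candidate actions).

```python
# Expected number of distinct actions explored by the random (epsilon) picks,
# assuming 500 rollouts, a 1% exploration rate, and 15 candidate actions.
rollouts, epsilon, num_actions = 500, 0.01, 15

random_picks = rollouts * epsilon  # about 5 rollouts explore randomly
expected_distinct = num_actions * (1 - (1 - 1 / num_actions) ** random_picks)

print(round(expected_distinct, 1))  # ~4.4 distinct actions; adding the greedy
# action, roughly a third of the 15 possible actions are ever tried.
```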
23
Future Work
Apply the Monte-Carlo method to fan-written documents that are intended for late-game strategies.
Learn to play the game using modern neural networks, to determine whether textual input is still valuable.
24
Q/A