1
Multi-agent Systems & Reinforcement Learning
A presentation that didn't get vetted by Dr. Taylor because I'm a slacker and didn't finish until this morning. We've also already seen a bit of this topic; I'm presenting it again, no regrets. Let's keep our expectations low and take things slow from here, people.
2
What? Artificial Intelligence -> Distributed Artificial Intelligence
Concerned with information management issues and distributed/parallel problem solving. Distributed Artificial Intelligence -> Multi-agent Systems: different problem-solving agents with their own interests and goals. It's pretty intuitive: multi-agent systems are systems composed of multiple agents. The field originates out of Distributed Artificial Intelligence. Distributed Artificial Intelligence is a subfield of AI concerned with how to decompose a task into several semi-independent subproblems that can be solved on different processors or different machines, and then how to take those partial solutions and synthesize a solution to the original problem. The obvious point is to leverage many computers and multicore processors for a speedup over traditional AI. The next logical step is multi-agent systems, a subfield closely related to DAI. There are many different types of multi-agent systems depending on the problem domain, the complexity of the agents, the environment, and other characteristics. But let's first address why this is even necessary.
3
Why? Some of the trends in computing
Ubiquity, interconnection, intelligence, delegation. The Internet of Things, self-steering cars, home automation devices. What advantages does it offer over the alternatives? In what circumstances is it useful? The history of computing has been marked by several trends. The cost of computing falls every year, which makes computing power available in places and devices that were previously uneconomic, so processing ability is becoming more ubiquitous. Computing devices of the early years were mostly isolated and only had to deal with their human operators. Nowadays they're usually interconnected, networked into large distributed systems; among all your devices there are probably only a few that can't access the internet, and a lot of applications just straight up depend on it. One example that comes to mind is the game Titanfall. I was watching a video about how they offloaded the AI routines from the game console to Microsoft's Azure cloud so the console can focus on game performance, so it pretty much sounds like you can't play unless your device is hooked up to quite a few others. Interconnection is pretty much assumed now. We're also getting more intelligent systems: the complexity of the tasks we can automate is steadily growing thanks to advances in AI and the falling cost of computing. Consequently, we're delegating more and more things to the computer. So think the Internet of Things, self-steering cars, home automation devices. The world is trending towards multiple autonomous intelligent agents interacting with each other, and while each may function well on its own, it is important to understand how their behaviors interact. This is why we have the subfield of AI we call multi-agent systems. Because if there's one self-steering car on the road, you can bet there will be others and they will be working together to kill us all. Two obvious questions about any type of technology, the ones at the top of this slide, are: what advantages does it offer over the alternatives, and in what circumstances is it useful?
4
Answers: Parallelism, Robustness, Fault-tolerance, Scalability
Simpler programming. Not for situations where parallel action is not possible and there is no action uncertainty. There are a number of reasons why a multi-agent system can be better than a single-agent system. If you can already subdivide the logic for solving the problem among multiple agents, then multiple agents lend themselves to parallel computation quite easily; you can see how ideas in multi-agent systems borrow heavily from DAI. You can also get more robust systems through multiple agents, and you can get fault-tolerant systems by spreading control and responsibility among different agents and then distributing those across different machines. Multi-agent systems are inherently modular: it is easier to add new agents to a multi-agent system than it is to add new capabilities to a monolithic system, so scaling is easier. Modularity also means that rather than tackling the whole task with a centralized agent, programmers can identify subtasks and assign control of those subtasks to different agents. So actually, between multi-agent and single-agent systems, the multi-agent option is often the simpler one to program. On the other hand, there are domains where a global view and centralized control are the more natural approach: in cases where parallel action is not possible and there is no action uncertainty, single-agent systems should be used. Everything cool so far? Just want to note that robustness and fault-tolerance are two different things. On action uncertainty: if there's no action uncertainty you can just build a single system to model everything, because it means you know the outcome of every action and how to handle all possible interactions. Otherwise you're going to have to create agents, with possibly differing goals, priorities, capabilities, or proprietary information, to simulate these interactions, and thus you'll have a multi-agent system.
5
Multi-Agent Systems Two main dimensions: Multi-agent scenarios
Agent heterogeneity; amount of communication among agents. Multi-agent scenarios: homogeneous non-communicating agents; heterogeneous non-communicating agents; homogeneous communicating agents; heterogeneous communicating agents. There are many ways to categorize these systems, but the survey I looked at uses these two aspects to organize itself; they're borrowed from one of the taxonomies of Distributed Artificial Intelligence. Others have defined the important characteristics of multi-agent systems as system function, agent architecture (degree of heterogeneity, reactive vs. deliberative), and system architecture (communication, human involvement), and Wikipedia has yet another categorization. The author of the survey even wrote that the issue of benevolence vs. competitiveness comes up repeatedly in the systems described, so were a third dimension to be added to the categorization of MAS, this issue would be it. But anyway, for the purposes of this presentation we care about the two above. I'm going to go through the scenarios from the simplest, homogeneous non-communicating agents, to the more complex ones and try to highlight the important issues and techniques in multi-agent systems.
6
The Predator/Prey (“Pursuit”) Domain
But before we start going into MAS scenarios, we should cover the pursuit domain, since it is the primary example used to illustrate MAS issues. I actually make a point of bringing up other real-world examples too, so it's not terribly important, but it's worth knowing. This domain, or some spin-off of it, has been used in a wide variety of approaches to illustrate different multi-agent scenarios. Kind of like the n-armed bandit or the grid-world problem for RL, the pursuit domain is the toy domain of MAS for learning about the concepts. There are many ways you can vary the problem: world size and shape; number of predators; number of prey; limits on sensor vision; limits on communication; whether prey behavior is intelligent; whether moves happen in turns or in real time; whether two pieces can occupy the same space; the capture condition; and the set of legal moves. Since we're only focused on agent capabilities, we don't really care about most of these, so I'm not going to go into which domain characteristics matter when designing a MAS; one concrete variant is sketched below.
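To make the domain concrete, here is a minimal Python sketch of one variant. The specific choices are my assumptions for illustration, not the canonical formulation: a toroidal grid, turn-based moves, a randomly moving prey, and capture defined as all four orthogonally adjacent cells being occupied by predators.

```python
# A minimal sketch of one pursuit-domain variant (assumed details: toroidal
# grid, turn-based moves, random prey, capture = all four neighbours occupied).
import random

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # N, S, E, W, stay

class PursuitDomain:
    def __init__(self, size=10, n_predators=4, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        cells = [(x, y) for x in range(size) for y in range(size)]
        picks = self.rng.sample(cells, n_predators + 1)
        self.prey, self.predators = picks[0], picks[1:]

    def _wrap(self, pos, move):
        return ((pos[0] + move[0]) % self.size, (pos[1] + move[1]) % self.size)

    def step(self, predator_moves):
        # Predators move first (turn-based variant), then the prey moves randomly.
        self.predators = [self._wrap(p, m) for p, m in zip(self.predators, predator_moves)]
        self.prey = self._wrap(self.prey, self.rng.choice(MOVES))
        return self.captured()

    def captured(self):
        # Capture condition: every cell orthogonally adjacent to the prey is occupied.
        neighbours = {self._wrap(self.prey, m) for m in MOVES[:4]}
        return neighbours.issubset(set(self.predators))

# Example: four predators acting randomly will rarely capture the prey.
world = PursuitDomain()
for _ in range(100):
    if world.step([world.rng.choice(MOVES) for _ in range(4)]):
        print("prey captured")
        break
```

The point of the sketch is just how many of the listed knobs (capture condition, legal moves, simultaneity) are baked into a few lines, which is why results in this domain are so sensitive to the chosen variant.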
7
Homogeneous Non-Communicating
Issues: reactive vs. deliberative agents; local vs. global perspective; modeling other agents' states; how to affect others. Techniques: reactive behaviors for formation maintenance; local knowledge is sometimes better; the Recursive Modeling Method (RMM); don't model others, just pay attention to reward; stigmergy. Even in the simplest scenario there are a number of issues worth covering. In the homogeneous non-communicating version of the pursuit domain we have four predators, each controlled by an identical agent. This means the same capabilities, decision procedures, sensors, effectors, and domain knowledge, but limited information about each other's internal state and sensory inputs, so they may not be able to predict each other's actions. Basically their only difference is that they start at different locations; differing sensor input then results in different effector output, which is a necessary condition of a MAS, because if they all acted as one unit it would essentially be a single-agent system. A lot of these issues carry over into the more complex scenarios as well, the first of which is the reactive vs. deliberative agent issue. When designing any agent-based system it is important to determine how sophisticated the agents' reasoning will be. You can go with reactive agents, which simply follow pre-set behaviors similar to reflexes without maintaining any internal state (at this point this is just a multi-agent system, no reinforcement learning concepts yet), or you can use deliberative agents that behave more like they are thinking. And it turns out you don't always need a deliberative agent. People have had great success with reactive non-communicating agents in studies involving maintaining formation (a column, wedge, or diamond) while encountering and then passing obstacles. At the same time, deliberative agents can be powerful, leveraging game-theoretic techniques to find equilibrium points and searching for policies to decide how to act rather than just reacting. Sometimes the difference can be a bit blurry too: there is also work on systems that mix the two, reasoning about when to be reactive and when to follow goal-directed plans. Also, fun fact, reactive deliberation was not explicitly designed for MAS, but is taken from real-time control in dynamic environments. Another issue to consider when building a multi-agent system is how much sensor information should be available to the agents. Even if it is feasible within the domain to give the agents global views of the world, it may be more effective to limit them to local views. An example might be resource usage policies: a lot of agents may all rush an underutilized resource. Better performance by agents with less knowledge is occasionally summarized by the cliché "Ignorance is bliss." This leads into the agents' ability to model each other. If you have a global perspective then it's much easier to model other agents' states, because they're homogeneous. How and whether to model other agents is a ubiquitous issue in MAS, and it gets harder in the more complex multi-agent scenarios because then you're not only modeling internal states but also goals, actions, roles, and abilities. If these things can be known or conveyed, the Recursive Modeling Method (RMM) can be used to predict the future actions of other agents. If you don't have a global perspective, meaning the information isn't directly available, it still might be useful for the agent to try and learn it.
This would involve learning a function that maps agent X's sensor data (which may overlap only partially with agent Y's view) to an estimate of agent Y's sensor data, which then enables X to use a limited RMM to predict Y's future actions. We can also see the "ignorance is bliss" point again here with modeling other agents' states. With the recursive modeling method you can recurse indefinitely: even if you can get more information by reasoning about what agent A thinks agent B thinks agent A thinks, and so on, endless reasoning can lead to inaction. For coordination to be possible, some potential knowledge must be ignored. You can counter this by restricting the recursion to, say, three levels, or by setting a timer. And just because you can build models of other agents doesn't mean you have to, or even should. One approach is to just treat other agents as part of the environment and indirectly learn policies that handle them as sensed objects. It is entirely possible to get agents that learn to cooperate without modeling each other; the method mentioned in the survey involved evaluating and checkpointing policies so agents can return to successful ones. When no communication is possible, system designers must decide how the agents will affect one another. They can do so simply by existing and being sensed by other agents, or by effecting change, whether in the environment or to the other agents: think pushing a ball around a field or knocking over other agents. For indirectly affecting other agents, there is stigmergy. I'm going to regurgitate what Wikipedia has to say about it: stigmergy is a mechanism of indirect coordination between agents or actions. The principle is that the trace left in the environment by an action stimulates the performance of a next action, by the same or a different agent. In that way, subsequent actions tend to reinforce and build on each other, leading to the spontaneous emergence of coherent, apparently systematic activity. Stigmergy is a form of self-organization: it produces complex, seemingly intelligent structures without any need for planning, control, or even direct communication between the agents. As such it supports efficient collaboration between extremely simple agents, who lack any memory, intelligence, or even individual awareness of each other. Active stigmergy occurs when an agent alters the environment so as to affect the sensory input of another agent, for example leaving a marker behind for other agents to observe. Passive stigmergy involves altering the environment so that the effects of another agent's actions change. We steal the term from biology, so one of the examples used to illustrate passive stigmergy is an ant colony with lots of dead ants around it. An ant from the colony periodically picks up a dead ant, carries it a short distance, and then drops it. While the behavior appears to be random, after several hours the dead ants are clustered in a small number of heaps, and over time there are fewer and fewer large piles until all the dead ants end up in one pile (a toy simulation of this clustering appears after these notes). We assume the ants behave homogeneously and don't communicate explicitly, but even so they manage to cooperate and achieve a task. The robotics equivalent is an example where robots use Q-learning to learn how to forage for pucks. The next logical extension, inspired by the concept of stigmergy, is for agents to try and learn not only actions that directly help their immediate situation, but also actions that allow other agents to be more effective in the future.
Unfortunately, typical reinforcement learning with delayed rewards only encourages agents to achieve their own goals directly, because actions that lead to higher rewards for other agents do nothing to reinforce the agent that performed the supporting action.
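Here is a rough Python sketch of the dead-ant clustering idea. It is my own toy simplification, not the original biology study or the survey's robot experiment: memoryless ants pick up isolated corpses with high probability and drop what they carry where corpses are already dense, so heaps emerge without any communication or planning.

```python
# Toy stigmergy demo (assumed rules: pick-up probability falls, and drop
# probability rises, with the local corpse density around the ant).
import random

SIZE, N_CORPSES, N_ANTS, STEPS = 30, 150, 20, 200_000
rng = random.Random(1)
corpses = set()
while len(corpses) < N_CORPSES:
    corpses.add((rng.randrange(SIZE), rng.randrange(SIZE)))
ants = [{"pos": (rng.randrange(SIZE), rng.randrange(SIZE)), "carrying": False}
        for _ in range(N_ANTS)]

def local_density(pos):
    # Fraction of the 8 neighbouring cells (toroidal grid) that hold a corpse.
    x, y = pos
    neigh = [((x + dx) % SIZE, (y + dy) % SIZE)
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    return sum(c in corpses for c in neigh) / 8.0

for _ in range(STEPS):
    ant = rng.choice(ants)
    x, y = ant["pos"]
    ant["pos"] = ((x + rng.choice((-1, 0, 1))) % SIZE,
                  (y + rng.choice((-1, 0, 1))) % SIZE)
    d = local_density(ant["pos"])
    if not ant["carrying"] and ant["pos"] in corpses and rng.random() < (1 - d):
        corpses.remove(ant["pos"])   # pick up an isolated corpse
        ant["carrying"] = True
    elif ant["carrying"] and ant["pos"] not in corpses and rng.random() < d:
        corpses.add(ant["pos"])      # drop it where others already lie
        ant["carrying"] = False

print(len(corpses), "corpses on the ground; the rest are being carried")
```

The only "communication" here is the changing arrangement of corpses in the environment, which is exactly the stigmergy point: coordination emerges from traces left behind, not from messages or models of other ants.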
8
Heterogeneous Non-Communicating
Issues: benevolence vs. competitiveness; stable vs. evolving agents (arms races, credit/blame); modeling of others' goals, actions, and knowledge; social conventions; roles. Techniques: game theory and iterative play; minimax-Q; competitive co-evolution; deducing intentions through observation; autoepistemic reasoning (reasoning about ignorance); modeling others as a team whose individuals follow roles; focal points and emergent conventions; designing agents to play different roles. So we still can't communicate, or at least we consider it prohibitively expensive, but now we have multiple types of agents, illustrated above in different shades. This means their goals, actions, and domain knowledge can differ. Following that line of thinking, the prey, which inherently has goals opposed to those of the predators, can now be modeled as an agent. Even without communication we still see lots of issues. At this stage we can consider one of the most important issues in multi-agent systems: whether agents will be benevolent or competitive. Even if they have different goals, agents can be benevolent and help others achieve their respective goals. Or agents can focus strictly on themselves, and in the zero-sum extreme they must actively oppose other agents to succeed. We already talked about the prisoner's dilemma, so that covers game theory and iterative play; basically it's about serving your own self-interest by establishing a reputation for being cooperative, so that's one way of addressing the benevolence vs. competitiveness issue in a non-communicating scenario. Another is to use Q-learning to try to figure out whether the opponent has a deterministic policy that can be countered (a small sketch of this appears after these notes). Like I said, it's an important issue, but one that already got covered. A different issue in these systems is whether or not the agents are evolving. For dynamic environments that's great: you want them to adapt to changes. In competitive play, though, this can lead to problems. Systems that use competitive evolving agents are doing competitive co-evolution; if the agents are cooperative, it's cooperative co-evolution. The survey talks about competitive co-evolution with genetic algorithms where each individual in the population is actually a team of four agents/predators. I would imagine an RL approach would be similar, just training a set of four agents together. This co-evolution of teammates is one possible way to circumvent the absence of communication in a domain: instead of planning future events together, predators can evolve to know, or at least act like they know, each other's future actions. One study illustrating this point had agents that started off identically but kept track of deadlocks, like when both tried to move to the same location, remembering them so they could try other actions in the future. Allowing each predator to build its own set of experiences leads to predators learning to stay out of each other's way. In general, though, one problem with competitive rather than cooperative co-evolution is the issue of an escalating "arms race" where each side adapts to the other, constantly specializing and never stabilizing at a good behavior. Then again, that may be what you want: you don't want a stable behavior if your environment is dynamic, and going back to the iterated prisoner's dilemma, if you apply RL against a fixed opponent you can then learn an optimal response. Instability also has the added problem that you can't confidently assign credit or blame.
Basically, the agent can't be too sure whether it got better or the opponent got worse. To deal with that you pretty much have to hold one agent stable while the other evolves, or checkpoint agents and periodically run old versions for comparison, though there are various concerns with both. A reliable measurement method, if one doesn't already exist, would be a valuable contribution to MAS. With heterogeneous agents, the modeling of other agents becomes more complex. Now not only do you not know what they want, you also don't know what they can do or how they reason. Since you can't talk, you can only deduce what they want strictly through observation of their actions. Plan recognition is quite useful in competitive domains, since knowing what the opponent is going to try to do makes it much easier to beat them. To that end, when reasoning about other agents, it can be useful to reason not only about what is true and what is false, but also about what is unknown. Reasoning about ignorance in this way is called autoepistemic reasoning. A concrete example: you are a fighter jet and you can see an enemy jet, but not necessarily any projectiles or missiles. Observing a preparatory maneuver commonly used before firing could indicate that a missile has been launched, which is a useful thing to infer. Extending the idea of the recursive modeling method, an agent can try to reason over models of teams of agents. When looking at a single opponent, its actions may not make any sense except in the context of a team maneuver, so really its role within the team needs to be modeled. Even though there's no communication, there has still been interesting work on how heterogeneous agents can reach "agreements" or make the same decisions. This was illustrated with 40 humans at the AAAI-95 Fall Symposium on Active Learning. The scenario goes like this: imagine you and a friend need to meet today. You both arrived in Paris yesterday, but you were unable to get in touch to set a time and place. Nevertheless, it is essential that you meet today. Where will you go, and when? 75% of the people wrote down, with no prior communication, that they would go to the Eiffel Tower at noon. It speaks to the point that even without communication, people can sometimes coordinate actions, and that features that have been seen or used often are obvious candidates for coordination. This leads to the focal point method: if agents need to "meet", then all else being equal they should select rare or extreme options. In a similar but different way, conventions might emerge over time, so instead of the rarest or most extreme option, maybe choose the most recent or the most frequently chosen option from the past. We already talked about modeling teams of agents in roles; obviously that implies that the other end is organizing agents into teams. When agents have similar goals, organization is a logical extension. Role assignment might be obvious if the agents are very specialized and can each only do one thing, but sometimes a domain allows a little flexibility to change roles. One opportunity for RL is to learn which role is best to take on in different situations and to switch dynamically when appropriate.
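Here is a small Python sketch of the "counter a deterministic opponent" idea. It is a toy example of my own, not the survey's setup: a tabular Q-learner plays the iterated prisoner's dilemma against a fixed tit-for-tat opponent, with the state being the opponent's previous move, so the opponent's deterministic policy can be learned and exploited.

```python
# Toy Q-learning against a fixed deterministic opponent (tit-for-tat) in the
# iterated prisoner's dilemma.  Payoffs and hyperparameters are assumptions.
import random

C, D = 0, 1
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}  # learner's reward
rng = random.Random(0)
Q = {(s, a): 0.0 for s in (C, D) for a in (C, D)}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

state = C          # pretend the opponent cooperated before the first round
opp_action = C     # tit-for-tat starts by cooperating
for _ in range(50_000):
    # epsilon-greedy action selection over the learner's two moves
    if rng.random() < epsilon:
        action = rng.choice((C, D))
    else:
        action = max((C, D), key=lambda a: Q[(state, a)])
    reward = PAYOFF[(action, opp_action)]
    next_state = opp_action = action   # tit-for-tat copies our last move
    best_next = max(Q[(next_state, C)], Q[(next_state, D)])
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

for s in (C, D):
    print("opponent last played", "C" if s == C else "D",
          "-> learner prefers", "C" if Q[(s, C)] >= Q[(s, D)] else "D")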
9
Heterogeneous Communicating
Issues: understanding each other; planning communicative acts; benevolence vs. competitiveness; commitment/decommitment; truth in communication. Techniques: language protocols (CL, ACL, KQML); speech acts; learning social behaviors; multi-agent Q-learning; training other agents' Q-functions; contract nets for electronic commerce and market-based systems; Belief/Desire/Intention (BDI) models; coalitions; reasoning about truthfulness. Heterogeneous systems can be very complex and powerful, and greater power is realized when you let them talk. Depending on how much state information gets shared and how control is distributed, systems with communication approximate single-agent systems. So in our example pursuit domain we still have agents that can be fully heterogeneous, but now the predators can communicate with each other. Since heterogeneous communicating agents can choose not to communicate, and in some cases can also choose to be homogeneous, or at least minimize their heterogeneity, most of the issues discussed in the previous two scenarios apply here as well. However, the ability to talk raises new issues, two of the most studied being communication protocols and theories of commitment. Communication protocols are a relatively easy problem: your agents need to be able to understand each other, so set the language/protocol your agents will use to interact by adopting a standard. By fixing a message format and content you can transfer knowledge, relay abilities, and convey intent. Done. Now, assuming your agents have a capacity for communication, you can treat it in two different ways. In one sense, communication can be viewed as simply part of an agent's interaction with the environment; it's then possible to view it as an "action" no different from any other. Send a message, cause an effect, albeit by proxy of another agent, and you can plan around it in this framework. The theory of treating communication as action is called speech acts. Now that agents can communicate, we also have a method for reinforcing social behaviors. You'll remember from before that you couldn't improve assisting behavior because there was no means to reinforce it. With communication, Q-learning can be extended so that an agent takes not only its direct rewards but also rewards passed along by other agents. Benevolence vs. competitiveness also comes up again, because there are many more examples of competitive agents in this scenario, particularly involving negotiation. One example of multi-agent RL in the pursuit domain is work on competing Q-learners: the agents compete with each other for the right to control a single system, the highest bidder pays to be allowed to act, and that bidder then receives any reward that results from the action. Another Q-learning approach with benevolent agents is a kind of transfer learning where one agent teaches another through communication: the trainer has some expertise, the learner can be rewarded for mimicking the trainer, and the trainer can also send recommendations to help guide the learner to reward states. One of the takeaways from that study is that training can speed up learning, but too much advice leads to worse performance because the learner does not experience enough negative examples while training. Drawing on economics, several researchers have designed systems based on the law of supply and demand. In the contract nets framework, agents all have their own goals, are self-interested, and have limited reasoning resources.
They bid to accept tasks from other agents and then can either perform the tasks or subcontract them out (a minimal sketch of this announce/bid/award loop appears below). Agents must pay to subcontract their tasks, which leads to optimization as everyone seeks out low bidders. A related example is a multi-agent system that controls the air temperature in different rooms of a building. A person can set the thermostat in a room to any temperature; then, depending on the actual air temperature, the agent for that room tries to "buy" either hot or cold air from another room that has an excess, and at the same time it can sell its own excess air at the current temperature to other rooms. In this way you can model the loss of heat in the transfer from one room to another and have agents try to buy and sell at the best possible prices; the market regulates itself to provide equitable usage of shared resources. Now let's say we've established a semi-cooperative situation and our communicating agents have decided to cooperate on a given task for a given amount of time. This is what is called making commitments to each other. Committing to another agent involves agreeing to pursue a given goal, possibly in a given manner, regardless of how much it serves one's own interests. It helps a system run more smoothly as an extension of trust, but it's not clear how to make self-interested agents commit in a reasonable way. There are different types of commitment and different commitment states that agents can use to analyze or plan for commitment opportunities. This work is conducted with a model called Belief/Desire/Intention, or BDI, which is popular for modeling other agents. Belief corresponds to domain knowledge and desire to goals; the model also captures intentions, the immediate goals an agent is trying to achieve and the methods by which it is trying to achieve them. A real-life example is air-traffic control, which was implemented and tested in Australia: each aircraft is represented by a controlling agent which deals with a global sequencing agent, and the system actually mixes reactive and deliberative actions so that agents can break out of planned sequences to address emergency situations. Here, since agents cannot control their beliefs or desires, they can only make commitments to each other regarding their intentions. Coalitions are groups of agents that decide to commit to each other; this work is conducted in a game-theoretic framework. Finally, truthfulness: an agent that wants something to happen may lie to other agents to get what it wants, so agents may also need to reason about whether what they are told is true.
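Going back to the contract-net idea, here is a minimal Python sketch of the announce/bid/award loop. The agent names, cost model, and pricing rule are hypothetical choices of mine for illustration; it does not show subcontracting or the full market-based thermostat system.

```python
# Minimal contract-net style task allocation sketch (assumed names and costs):
# a manager announces a task, self-interested contractors bid, and the task is
# awarded to the lowest bidder at that bidder's price.
import random

class ContractorAgent:
    def __init__(self, name, efficiency, rng):
        self.name = name
        self.efficiency = efficiency   # lower means cheaper at doing tasks
        self.rng = rng

    def bid(self, task):
        # Each self-interested agent quotes the price it wants for the task.
        return task["effort"] * self.efficiency * self.rng.uniform(0.9, 1.1)

def award_task(task, contractors):
    # Manager side: announce the task, collect bids, award to the lowest bidder.
    bids = {c.name: c.bid(task) for c in contractors}
    winner = min(bids, key=bids.get)
    return winner, bids[winner]

rng = random.Random(42)
contractors = [ContractorAgent("agent_a", 1.0, rng),
               ContractorAgent("agent_b", 0.7, rng),
               ContractorAgent("agent_c", 1.3, rng)]
for effort in (5, 10, 20):
    winner, price = award_task({"effort": effort}, contractors)
    print(f"task with effort {effort} awarded to {winner} at price {price:.2f}")
```

The design point is that nobody needs a global planner: prices carry the coordination information, which is the same reason the room-temperature market above regulates itself.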
10
The End. Thanks for listening. I skipped a lot of material.
Reference: Peter Stone and Manuela Veloso, "Multiagent Systems: A Survey from a Machine Learning Perspective", December 4, 1997. No programming segment. Question time.