CSCE 496/896
Topic Summary Assignment 5: Learning in Multiagent Systems
Questions and Answers
October 31, 2002
Q1: In decentralized learning it says that several agents are engaged in the learning process. Am I correct in assuming that they do not necessarily learn the same things? And would you want them to learn the same things?
A1: The answer to the first part of the question is yes. The second part is more complicated. It depends on (1) the problem that the multiagent system is trying to solve, (2) the environment that the agents are in, and (3) the learning mechanism and what the agents are supposed to learn. For example, if the problem were resource sharing and allocation, then you would probably want them to learn the same things. If the problem were task decomposition and allocation, then you would probably want them to learn different things. If the environment is highly dynamic and communication is costly and noisy, then you would probably want the agents to learn different things. If the agents were to learn some expertise, to become better at something unique, then you would want them to learn different things. If the agents were to learn how to coordinate activities among themselves, then you would probably want them to learn the same things. So, when you design learning in a MAS, look at these three things. Once you have defined them, you can decide whether you want the agents to learn the same things or different things.
Q2: Do you think it is possible for a multiagent system to mimic a human society? Do you know of any work or simulations that are currently doing this sort of thing?
A2: Yes, to some extent, it is possible. And there is work, and there are simulations. Later this semester, during the Advanced Topics sessions, I will talk about the SWARMS technology and the ANT system. In some of these simulations, they have simplistic agents that mimic humans. For example, a human may be risk averse, risk neutral, or risk prone, and every human tries to maximize its own net utility. The environment can be a market environment (as we have mentioned in class) that has goods to be bought and sold. The researchers first set a list of prices and needs/demands/resources, and then let the simulation start. One can find out how the prices behave if all agents are risk averse, or risk neutral, or have a mixture of different behaviors. When we talk about SWARMS and ANT, I will give you some references to this particular area of MAS research.
Q3: Is it possible that a multiagent system could be more adept at solving society’s problems (e.g., the Israeli/Palestinian conflict, or more generally, achieving world peace)? If so, would it be possible to convince the parties involved that a solution exists?
A3: This is a very good question. Philosophically speaking, we are agents. The society is already a multiagent system! So, the question above becomes: “is it possible that a software multiagent system could be more adept at solving society’s problems?” First, we need to find out why we are not good at solving, for example, the Israeli/Palestinian conflict. Second, we have to identify what a software multiagent system can do better than a human multiagent system. Well, a software multiagent system can compute things faster, can crunch a lot more numbers, can access a lot of information faster, can work 24 hours a day, can be emotionless, etc. Are these advantages beneficial to solving the Israeli/Palestinian conflict? Probably not. So, in this case, a multiagent system would probably not be helpful.
There is one other aspect that is actually important in AI, not just multiagent systems. As a designer, when you model a problem (that is, when you specify the problem so that you can implement programs to solve it), you may need to figure out a set of utilities. Utilities are subjective values. For example, $100 to a kid has a higher utility than does $100 to a millionaire. In conflicts such as the Israeli/Palestinian one, there are many issues involved and many different utilities, some of them compounded or complex. Thus, it is very difficult for a designer to model that kind of problem accurately.
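As a rough illustration of why the same $100 carries different utilities, here is a minimal Python sketch. It assumes a logarithmic utility of wealth, which is a common textbook model and not something specified in this course; the function name and the wealth figures are hypothetical.

```python
import math

def utility_gain(wealth, amount):
    """Gain in utility from receiving `amount`, assuming utility = log(wealth).

    The log model is an assumption made only for illustration.
    """
    return math.log(wealth + amount) - math.log(wealth)

# Hypothetical starting wealth: a kid with $10 and a millionaire with $1,000,000.
kid_gain = utility_gain(10.0, 100.0)
millionaire_gain = utility_gain(1_000_000.0, 100.0)

print(kid_gain > millionaire_gain)  # the same $100 is worth far more to the kid
```

Under this assumed model the kid's gain is thousands of times larger than the millionaire's. A designer modeling a real conflict would need a utility function like this for every issue and every party, which is where the difficulty lies.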
For the last question about the existence of a solution: In general, in a human society, most humans agree that a solution exists; they only disagree on which solution it should be. That is why a lot of people negotiate and communicate. For us, as MAS designers, we want to design the agents so that they are able to find out whether a solution exists, or whether a better solution exists. The key is to design the system such that each agent is able to determine this through its interaction with other agents and its observation of the world, so that the process is local and distributed. If you do not do that, then a centralized solution would be sufficient.
Finally, look at the problem. Is it inherently distributed, dynamic, and so on? Consider the issues that we have talked about in Chapter 1 and Chapter 2. If those properties are present in the problem that you want to solve, then a multiagent system can help.
Q4: In the formula of Q-learning, Q(s,a) is defined from V, and V is a reward value of a policy. How do we distinguish this value from R? Which one is the reinforced object?
A4: V is a reward value of a policy. A policy may consist of many actions. R is the reward specifically for a state at a certain time t. Your action may result in a state different from the one you expected. The reinforced object is Q(s,a), the worth of choosing action a when the agent encounters state s.
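To make the roles of R, V, and Q concrete, here is a minimal Python sketch of a single tabular Q-learning update. It assumes the standard one-step update rule. The names q_update and state_value, the parameters alpha (learning rate) and gamma (discount factor), and the toy states are illustrative assumptions, not taken from the book.

```python
from collections import defaultdict

def state_value(Q, state, actions):
    """V(s): the value of acting greedily from s, i.e. the max over a of Q(s, a)."""
    return max(Q[(state, a)] for a in actions)

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning step to Q(s, a) and return its new value.

    r is the immediate reward R observed for this single step.
    state_value(...) is the longer-term estimate V of the state actually reached.
    """
    target = r + gamma * state_value(Q, s_next, actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = ["left", "right"]
Q = defaultdict(float)        # every Q-value starts at 0.0
Q[("s1", "right")] = 2.0      # pretend s1 is already known to be promising

# The agent takes "right" in s0, receives R = 1.0, and actually lands in s1.
new_q = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
print(new_q)
```

Only the entry Q(s0, right) changes. R enters once, as the reward for this step, while V summarizes what the agent already believes about the state it reached.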
Q5: In decentralized learning, is the “learning” confined to the system of agents or to individual agents?
A5: It depends on your design. It depends on the involvement of the agents in the learning activity, as discussed in the book. If every agent has to participate in the learning process just for one agent to learn something (for example, about how to form a coalition better and faster), then the learning is by the system of agents. If an agent can learn to do things better by itself and become an expert in what it does, then the learning can be individualized. Still, an agent in this situation needs to find out what other agents are capable of in order to become specialized.
Q6: How is credit assignment different from reinforcement learning?
A6: Reinforcement learning is a learning mechanism, a learning strategy. In reinforcement learning, you have to assign reward, or credit, or blame. How to assign credit is the credit assignment problem. For example, if a group of agents performs well in a task, do you assign the credit to the entire group, evenly distributed among the agents? If a group of agents performs poorly in a task, do you assign the blame to the entire group, penalizing each one evenly? If you punish the group, you may force the group to learn to cooperate better. If you punish a member of the group, you may force that particular member to become better. So, depending on your MAS design and your problem domain and application, you may have different credit assignment schemes. But the underlying learning mechanism is still reinforcement learning.
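The two schemes mentioned above, credit to the group versus credit to individual members, can be sketched in Python as follows. This is only an illustration: the function names, the agent names, and the idea of a numeric per-agent contribution measure are assumptions, and a real system would have to decide how to measure contribution.

```python
def equal_credit(group_reward, contributions):
    """Split the group reward (or blame, if negative) evenly among all agents."""
    share = group_reward / len(contributions)
    return {agent: share for agent in contributions}

def proportional_credit(group_reward, contributions):
    """Split the group reward in proportion to each agent's measured contribution."""
    total = sum(contributions.values())
    if total == 0:
        # No measurable contribution from anyone: fall back to an even split.
        return equal_credit(group_reward, contributions)
    return {agent: group_reward * c / total for agent, c in contributions.items()}

# Hypothetical contribution measures for three agents.
contributions = {"a1": 3.0, "a2": 1.0, "a3": 0.0}

even = equal_credit(12.0, contributions)
weighted = proportional_credit(12.0, contributions)
print(even)
print(weighted)
```

Either result can then be fed to the same reinforcement learning update as the reward signal. The credit assignment scheme changes what each agent is rewarded for, not how it learns.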
Q7: In Q-learning, what happens when no Q returns a positive reward?
A7: First, if no Q returns a positive reward, then the agent will learn that nothing is worth doing, once the learning has converged. From the standpoint of the designer, if you need the agent to learn that something is worth doing in order to solve the problems that you want to solve, then that is not a good sign. You need to redesign your learning mechanisms and their details to make sure that you point the agent toward learning that something is worth doing, and that this something is what you want the agent to do to solve your problems.
Q8: Is interactive vs. isolated reinforcement dependent on the number of agents in the system?
A8: No, at least not totally. It depends on (1) the communication cost, (2) the reliability/quality of the communication, (3) the roles of the agents, (4) the observability of the environment, and many other issues. For example, if the communication is too costly, then you may want to do isolated reinforcement. If the communication is unreliable or noisy, then you may want to minimize interactive reinforcement. If an agent can derive reinforcement from its observation of the environment, then isolated reinforcement may be suitable. If the agents can afford to be isolated and do not have to count on each other to learn, then they may choose to learn on their own.
Q9: Is learning based on low-level communication just a kind of information exchange? If so, it seems to conflict with learning to communicate, where learning is viewed as a method for reducing the load of communication among individual agents.
A9: I raised this question or issue in class, so it does not actually qualify as a question from the student. However, the student who wrote the above question also wrote down the second sentence above. No, learning to communicate does not conflict with low-level communication as a kind of information exchange. This is important to understand. What do we do when we “learn to communicate”? We learn when to send out messages, when to check for incoming messages, what to send, how to send, and to whom to send the messages. If we learn when to send out our messages (for example, I send a message out right before you need it, or I request a service right after you have the service available), then we can cut out repetitive messages (polling) and communication costs. If we learn what to send, then we do not have to send all we know to other agents. So we do not spam other agents! That way, everybody saves processing time. Also, if we know whom to send our messages to, then we will not send messages to somebody who does not want them. And so on. This is learning to communicate. Learning based on low-level communication can be seen as a kind of information exchange in which no learning actually takes place. But that does not conflict with learning to communicate. An agent can learn to exchange data only, information only, knowledge only, or something else.
Q10: Does an agent limit its learning curve according to its current goals or should an agent broaden the scope to include any seemingly useful piece of information?
A10: There are two questions here. First, does an agent limit its learning curve according to its current goals? Sometimes yes. Why? Because learning is costly. If an agent is performing very well, then the agent can turn off its learning mechanism or slow down its learning factor. But if the learning is dependent on more than one agent, then you need to design your agents such that even if agent A slows down its learning, it still provides help to other agents’ learning processes.
Second, should an agent broaden the scope to include any seemingly useful piece of information? The key here is “seemingly useful”. How does an agent know whether a piece of information is useful? You as a designer have to program that into the agent. Or, you can design a learning mechanism such that the agent will learn what information is useful. That is the first concern. Now, how does an agent measure the “usefulness” of a piece of information? Do you use utility? To even tell that something is “seemingly useful”, you as a designer have to provide some utility measures for evaluating that piece of information. If that is the case, then the agent no longer broadens its scope, since that provision has already been programmed in from the start. Now, the issue becomes this: should an agent try to learn about task B after it has learned to do task A well? This assumes that the agent has several tasks to learn to do well. In that case, we are back to the generalist vs. specialist tradeoff we have discussed in class.
See also Q11 and Q15.
Q11: Making every agent intelligent involves a cost factor. Can there be an environment where a few agents are intelligent agents and can learn while the remaining agents do not have the ability to learn?
A11: Yes. Some agents may turn off or slow down their learning processes when those agents have become very good at what they do. Also, you could design a hierarchical multiagent system such that super-agents learn and sub-agents do not learn. You could also design a multiagent system with specialist agents so that only agents with important expertise at a given time learn.
See also Q10 and Q15.
Q12: What is the learning rate? How do we compute the rate?
A12: The learning rate that we use in a learning algorithm, such as one in reinforcement learning, is the degree of influence that a learning cue exerts on the decision making process. For example, suppose you perform a task A and you are given a reward R for doing task A. Does that mean that you will ignore all other tasks and concentrate on doing task A from now on? If you choose to do that, then you have let the reward change your decision making process significantly. When that happens, we say that your learning rate is high: you are easily affected by the feedback, by the reinforcement. A high learning rate may lead you to learn the wrong things, and as a result, your knowledge will oscillate instead of converging. A low learning rate allows you to explore more options first, to investigate which options are good or poor, and gradually settle on a set of good options.
Now, how do you compute the rate? There are several ways. First, you may want the learning to be conservative and start the system with a low rate. Then you run your system and observe its performance. You may set the rate higher if you think your system’s performance improves too slowly. You may set the rate lower if your system’s performance does not converge to a stable state (for example, the system does not stop changing the Q-values of the actions). Second, you may let the system learn the learning rate itself. Once the system sees an oscillation, it can automatically set the learning rate lower, for example.
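The second approach, in which the system lowers its own learning rate when it sees an oscillation, could look something like the Python sketch below. The oscillation test used here (every successive change to a Q-value flips sign) is a crude assumption chosen for illustration, and the function name and the shrink and min_alpha parameters are hypothetical.

```python
def adapt_learning_rate(alpha, deltas, shrink=0.5, min_alpha=0.01):
    """Return a lower learning rate if recent updates look like an oscillation.

    deltas is a list of recent changes to one Q-value, oldest first.
    The test is deliberately simple: at least three changes, each with the
    opposite sign of the one before it.
    """
    flips = sum(1 for prev, cur in zip(deltas, deltas[1:]) if prev * cur < 0)
    oscillating = len(deltas) >= 3 and flips == len(deltas) - 1
    if oscillating:
        return max(min_alpha, alpha * shrink)
    return alpha

steady = adapt_learning_rate(0.8, [0.5, 0.3, 0.1])           # shrinking changes
damped = adapt_learning_rate(0.8, [0.5, -0.4, 0.45, -0.5])   # sign keeps flipping
print(steady, damped)
```

In the first call the changes are shrinking toward zero, so the rate is left alone. In the second the Q-value keeps swinging back and forth, so the rate is halved.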
Q13: In Q-learning, can we implement reinforcement by dynamic programming and trace back?
A13: You could. But the idea of Q-learning, or of any type of reinforcement learning, is to take into consideration every single experience that the learner sees. The reinforcement scheme itself does not have to trace back. That is the beauty of it. It accumulates all the experiences that the learner encounters. Imagine what humans do. For example, when you were a kid, while you were growing up, now that you are in college, and so on, do you trace back? Not really. All the things that you have encountered are stored in your memory. You learn to place more value, and different values, on different things, and you learn that in many ways. One of them is reinforcement learning. So, if you use trace back, then you defeat the underlying principle of reinforcement learning.
Q14: Humans can learn good things, but they can also learn bad things, and even faster! What type of mechanisms in multiagent systems can keep this in control, especially for a system of self-interested agents?
A14: First of all, as I have emphasized in class many times, ultimately, you want to design a multiagent system to solve a set of problems. So, you as the designer have control over the multiagent system. Now, how do you control self-interested agents? That is quite straightforward. You design the system such that the agents are motivated to learn what you want them to learn. If you want them to help other agents, then you design a reward system that benefits them when they help other agents. If you do not want them to cheat, then you design a reward system that penalizes them when they cheat. In Chapter 5, when we discussed distributed rational decision making, we discussed protocols and rules that prevent agents from cheating. So yes, if we design the system so that not learning bad things is in the best interest of a self-interested agent, then that agent will not learn bad things.
Q15: Learning has some costs. For a multiagent system, this refers to the computing resources. Is there a mechanism to decide the optimal amount of learning effort?
A15: I have not encountered a formal, direct mechanism that decides the optimal amount of learning effort. Usually, a learning process stops when the performance of the learner converges. For example, the training of a neural network stops after the classification accuracy (the task) no longer improves. But in a multiagent system, learning could be quite costly, especially when communication is involved. Thus, when an agent learns, it can first weigh the potential utility of the learned knowledge against the cost of learning. If it is too costly, then it may decide not to learn. There is also work in machine learning that considers search and storage vs. learning. Every time you learn something, you have to store that piece of knowledge. After that, every time you need to make a decision, you have one more piece of knowledge to search. The storage space is not unlimited and search time may be too costly. If those two things outweigh the benefits that the learned knowledge brings to the performance of the learner, then it is not worth learning that piece of knowledge. In a simple problem, it is relatively easy to decide the optimal amount of learning effort: we can associate each action-situation pair directly with a simple reward and associate a cost with each learning step. But it is very difficult to determine the optimal amount of learning effort in a complex problem where one cannot find an exact mapping. However, such a mechanism is definitely feasible.
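For the simple case, the comparison of benefit against learning, storage, and search costs can be sketched in Python as below. This is not a formal mechanism from the literature. The function name, the parameters, and the assumption that all quantities can be expressed in the same utility units are made up for illustration.

```python
def worth_learning(expected_benefit, learning_cost, storage_cost,
                   search_cost_per_use, expected_uses):
    """Decide whether a piece of knowledge is worth learning.

    All arguments are hypothetical estimates in the same utility units.
    The search cost is paid again on every later decision that has to
    look through the stored knowledge.
    """
    total_cost = learning_cost + storage_cost + search_cost_per_use * expected_uses
    return expected_benefit > total_cost

# Cheap to search: total cost is 2 + 1 + 0.1 * 20 = 5, below the benefit of 10.
cheap_search = worth_learning(10.0, 2.0, 1.0, 0.1, 20)

# Expensive to search: total cost is 2 + 1 + 0.5 * 20 = 13, above the benefit.
costly_search = worth_learning(10.0, 2.0, 1.0, 0.5, 20)
print(cheap_search, costly_search)
```

The hard part in a complex problem is obtaining these estimates at all, which is why no exact mapping is available there.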
See also Q10 and Q11.