CSCE 475/875
Game Day 1: Learning Day
Assigned: September 15, 2011 Game Day: September 22, 2011
Introduction
On Learning Day, students practice reinforcement learning by acting as learning agents
themselves. The key is to learn as accurately as possible while earning as
much reward as possible. These two objectives might be at odds with each
other, so it is important for the students to choose the right parameters for
learning.
Note also that learning in multiagent systems involves the exploitation vs.
exploration tradeoff. Does one explore in order to learn the best alternatives
accurately, or does one exploit whatever one has already learned to gain
rewards, even though the learned information might not be optimal?
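One common way to manage this tradeoff is an ε-greedy rule: explore with some small probability, exploit otherwise. This rule is not prescribed by the Game Day; the sketch below is only an illustration, and the function and parameter names are invented.

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    # With probability epsilon, explore: pick an action at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: q_values[a])

# Illustrative usage with made-up Q-values:
q = {"a1": 0.5, "a2": 0.7}
print(epsilon_greedy(q, ["a1", "a2"]))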
The objectives of Learning Day are to become familiar with reinforcement
learning and multiagent learning mechanisms, and to learn how to observe the
environment in order to adjust the parameters of these mechanisms to make them
more effective and efficient. More specifically, you will learn about
Q-learning, its learning rate (α), and its discount factor (γ).
Recall the following definition from our lectures:

Definition 7.4.1 (Q-learning) Q-learning is the following procedure:

Initialize the Q-function and V values (arbitrarily, for example).
Repeat until convergence:
    Observe the current state s.
    Select action a and take it.
    Observe the reward r(s, a).
    Perform the following updates (and do not update any other Q-values):
        Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ V(s′) ]
        V(s) ← max_a Q(s, a)
where s′ is the state reached after performing action a in state s.
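As a concrete illustration, below is a minimal Python sketch of one such update, using dictionary-based Q and V tables. The states, action, reward, and the values of α and γ are made up; choosing α and γ well is precisely your task on Game Day.

def q_update(Q, V, s, a, r, s_next, alpha, gamma):
    # Update only Q(s, a), as the definition requires.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * V.get(s_next, 0.0))
    # V(s) is the maximum Q-value over the actions available in s.
    V[s] = max(q for (state, _), q in Q.items() if state == s)

# Illustrative usage with made-up values:
Q = {("s1", "a1"): 0.5, ("s1", "a2"): 0.5}   # all pairs start at 0.5 per the setup
V = {"s1": 0.5}
q_update(Q, V, s="s1", a="a1", r=10.0, s_next="s2", alpha=0.5, gamma=0.9)
print(Q[("s1", "a1")])   # 5.25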
Setup
Each team will be provided with the
following:
1.
A set of actions
that the team can perform. Each action comes with an input state (the
situation) in which it can be performed and an output state (the outcome)
reached after it is performed. You will not be given the value of each
state. There is no cost for
performing an action. These actions are
repeatable. That is, you can perform the same action again without exhausting
the resources.
2.
A set of states
that the team currently has. Each team
will be given a set of initial states before the Game Day starts. A state, once transitioned out of, is "used"
and no longer available to the team. However,
the team may arrive at this state again at a later time by
transitioning to it through other situation-action pairs.
3.
The assumption that Q(s, a) for all situation-action pairs is set to 0.5
initially.
Your goal is to learn to identify the relative order of all situation-action
pairs in terms of their Q-values. That is, at the end of the Game Day, you
should submit an ordered list of all situation-action pairs with their respective Q-values.
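As a sketch of what producing that submission might look like, assuming a dictionary-based Q-table (the pairs below are made up for illustration):

# Made-up situation-action pairs; every Q-value starts at 0.5 per the setup.
pairs = [("s1", "a1"), ("s1", "a2"), ("s2", "a1")]
Q = {pair: 0.5 for pair in pairs}

# ... Q-learning updates would happen here during the round ...

# The submission: all pairs ordered by their learned Q-values.
for (state, action), q in sorted(Q.items(), key=lambda item: item[1], reverse=True):
    print(f"{state}, {action}: Q = {q:.3f}")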
You may choose to ask another team to perform an action for you
for a particular state. To do so, you
must provide a written request to that team.
If that team agrees to perform the action, it must share the reward with you
50-50. We will talk about how
the rewards are determined later.
In terms of what will be tallied to win the Game Day, we look at
two aspects. First, we will compare your submitted ordering to the predefined
ordering; the better your ordering matches the predefined one, the better your
score will be. Second, we will look at the total rewards you have collected; a
higher total amount of rewards also gives you a better score.
There will be two rounds.
You will submit your values for the learning rate and discount factor
prior to each round. After Round 1, each
team will e-mail its ordering to the Game Day Monitor, and we will e-mail every
team's ordering to all teams. Each team will then be allowed to change its
values of α and γ based on what it discerns from that information, and
Round 2 will begin.
The scores of the two rounds will be combined using a weighted sum:
0.25 for Round 1 and 0.75 for Round 2. This allows you to learn from the
environment what better values to assign to the learning rate and discount
factor after experiencing Round 1, which is why your Round 2 performance is
weighted more heavily in this Game Day.
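For illustration with made-up scores: a team scoring 80 in Round 1 and 90 in Round 2 would receive 0.25 × 80 + 0.75 × 90 = 87.5.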
Setup: How to Obtain Rewards
To obtain your reward for each
situation-action pair, you will need to hand the pair (written on a paper
token) to the Game Day monitor(s). The
Game Day monitor(s) will consult a “distribution list” and “state transition
map” and give you the “reward” (in paper Monopoly money) and the set of states
as a result of performing the situation-action pair.
The “distribution list” is generated
prior to the Game Day to model each reward function as a Gaussian distribution
with a specific mean and a standard deviation.
That is, the reward for each situation-action pair will follow a
distribution, and is not just a single number.
The information on the distribution will not be revealed to the
teams. But each team should assume that
some situation-action pairs will have a larger standard deviation than others,
and some situation-action pairs will have a larger mean than others.
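To make this concrete, here is a hypothetical Python sketch of how such a reward draw might be simulated. The pairs, means, and standard deviations are invented for illustration; the actual distribution list is hidden from the teams.

import random

# Hypothetical "distribution list": each situation-action pair has its own
# (mean, standard deviation), unknown to the teams.
distribution_list = {
    ("s1", "a1"): (10.0, 2.0),   # higher mean, small spread
    ("s1", "a2"): (8.0, 6.0),    # lower mean, large spread
}

def draw_reward(pair):
    mean, std = distribution_list[pair]
    # Each draw is a fresh sample, so the reward is not a single fixed number.
    return random.gauss(mean, std)

print(draw_reward(("s1", "a1")))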
Note that each paper token will have the
State ID, a blank for the team name, and a blank for you to write down the
action to be performed.
Requirements
Each student group is required to turn
in three reports: pre-game strategies, mid-game strategies, and post-game
lessons learned.
Some ideas on what should be included in
the reports: your strategies for each round of multiagent learning, your
rationale behind the values of the learning rate and the discount factor, how
you divide the group members among different tasks, your total rewards for
each round and the final grand total, your ordering of the situation-action pairs,
and finally your conclusions.
Your participation on Learning Day will be graded
based on:
50% Game Day Report (pre-game and mid-game strategies,
worksheets)
50% Learning
The Learning Score will be graded based on your
in-class participation on Learning Day, and on your team’s performance.