CSCE 475/875
Game Day 1: Learning Day
Assigned: September 15, 2011 Game Day: September 22, 2011
Introduction
On Learning Day, students practice reinforcement learning by acting as learning agents
themselves. The key is to learn as accurately as possible while earning as
much reward as possible. These two objectives might be at odds with each
other, so it is important for the students to choose the right parameters for
learning.
Note also that learning in multiagent systems involves the exploitation vs.
exploration tradeoff. Does one explore in order to learn the best alternatives
accurately, or does one exploit whatever one has already learned to gain
rewards, even though the learned information might not be optimal?
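One common way to manage this tradeoff is an ε-greedy rule: explore with some small probability, exploit otherwise. This rule is not prescribed by the Game Day; the sketch below is only an illustration, and the function and parameter names are invented.

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    # With probability epsilon, explore: pick an action at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: q_values[a])

# Illustrative usage with made-up Q-values:
q = {"a1": 0.5, "a2": 0.7}
print(epsilon_greedy(q, ["a1", "a2"]))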
The objectives of Learning Day are to become familiar with reinforcement
learning and multiagent learning mechanisms, and to learn how to observe the
environment in order to adjust the parameters of these mechanisms to make them
more effective and efficient. More specifically, you will learn about
Q-learning, its learning rate (α), and its discount factor (γ).
Recall the following definition from our lectures:

Definition 7.4.1 (Q-learning) Q-learning is the following procedure:

Initialize the Q-function and V values (arbitrarily, for example).
Repeat until convergence:
    Observe the current state s.
    Select action a and take it.
    Observe the reward r(s, a).
    Perform the following updates (and do not update any other Q-values):
        Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ V(s′) ]
        V(s) ← max_a Q(s, a)
where s′ is the state reached after performing action a in state s.
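As a concrete illustration, below is a minimal Python sketch of one such update, using dictionary-based Q and V tables. The states, action, reward, and the values of α and γ are made up; choosing α and γ well is precisely your task on Game Day.

def q_update(Q, V, s, a, r, s_next, alpha, gamma):
    # Update only Q(s, a), as the definition requires.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * V.get(s_next, 0.0))
    # V(s) is the maximum Q-value over the actions available in s.
    V[s] = max(q for (state, _), q in Q.items() if state == s)

# Illustrative usage with made-up values:
Q = {("s1", "a1"): 0.5, ("s1", "a2"): 0.5}   # all pairs start at 0.5 per the setup
V = {"s1": 0.5}
q_update(Q, V, s="s1", a="a1", r=10.0, s_next="s2", alpha=0.5, gamma=0.9)
print(Q[("s1", "a1")])   # 5.25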
Setup
Each team will be provided with the
following:
1.
A set of actions
that the team can perform. Each action comes with an input state (the
situation) in which it can be performed and an output state (the outcome)
reached after it is performed. You will not be given the value of each
state. There is no cost for
performing an action. These actions are
repeatable. That is, you can perform the same action again without exhausting
the resources.
2.
A set of states
that the team currently has. Each team
will be given a set of initial states before the Game Day starts. A state, once transitioned out of, is "used"
and no longer available to the team. However,
the team may arrive at this state again at a later time by
transitioning to it through other situation-action pairs.
3.
The assumption that Q(s, a) for all situation-action pairs is set to 0.5
initially.
Your goal is to learn to identify the relative order of all situation-action
pairs in terms of their Q-values. That is, at the end of the Game Day, you
should submit an ordered list of all situation-action pairs with their respective Q-values.
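As a sketch of what producing that submission might look like, assuming a dictionary-based Q-table (the pairs below are made up for illustration):

# Made-up situation-action pairs; every Q-value starts at 0.5 per the setup.
pairs = [("s1", "a1"), ("s1", "a2"), ("s2", "a1")]
Q = {pair: 0.5 for pair in pairs}

# ... Q-learning updates would happen here during the round ...

# The submission: all pairs ordered by their learned Q-values.
for (state, action), q in sorted(Q.items(), key=lambda item: item[1], reverse=True):
    print(f"{state}, {action}: Q = {q:.3f}")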
You may choose to ask another team to perform an action for you
for a particular state. To do so, you
must provide a written request to that team.
If that team agrees to perform the action, it must share the reward with you
50-50. We will talk about how
the rewards are determined later.
In terms of what will be tallied to win the Game Day, we look at
two aspects. First, we will compare your submitted ordering to the predefined
ordering; the better your ordering matches the predefined one, the better your
score will be. Second, we will look at the total rewards you have collected; a
higher total amount of rewards also gives you a better score.
There will be two rounds.
You will submit your values for the learning rate and discount factor
prior to each round. After Round 1, each
team will e-mail its ordering to the Game Day Monitor, and we will e-mail every
team's ordering to all teams. Each team will then be allowed to change its
values of α and γ based on what it discerns from that information, and
Round 2 will begin.
The scores of the two rounds will be combined using a weighted sum:
0.25 for Round 1 and 0.75 for Round 2. This allows you to learn from the
environment what better values to assign to the learning rate and discount
factor after experiencing Round 1, which is why your Round 2 performance is
weighted more heavily in this Game Day.
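For illustration with made-up scores: a team scoring 80 in Round 1 and 90 in Round 2 would receive 0.25 × 80 + 0.75 × 90 = 87.5.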
Setup: How to Obtain Rewards
To obtain your reward for each
situation-action pair, you will need to hand the pair (written on a paper
token) to the Game Day monitor(s). The
Game Day monitor(s) will consult a “distribution list” and “state transition
map” and give you the “reward” (in paper Monopoly money) and the set of states
as a result of performing the situation-action pair.
The “distribution list” is generated
prior to the Game Day to model each reward function as a Gaussian distribution
with a specific mean and a standard deviation.
That is, the reward for each situation-action pair will follow a
distribution, and is not just a single number.
The information on the distribution will not be revealed to the
teams. But each team should assume that
some situation-action pairs will have a larger standard deviation than others,
and some situation-action pairs will have a larger mean than others.
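To make this concrete, here is a hypothetical Python sketch of how such a reward draw might be simulated. The pairs, means, and standard deviations are invented for illustration; the actual distribution list is hidden from the teams.

import random

# Hypothetical "distribution list": each situation-action pair has its own
# (mean, standard deviation), unknown to the teams.
distribution_list = {
    ("s1", "a1"): (10.0, 2.0),   # higher mean, small spread
    ("s1", "a2"): (8.0, 6.0),    # lower mean, large spread
}

def draw_reward(pair):
    mean, std = distribution_list[pair]
    # Each draw is a fresh sample, so the reward is not a single fixed number.
    return random.gauss(mean, std)

print(draw_reward(("s1", "a1")))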
Note that each paper token will have the
State ID, a blank for the team name, and a blank for you to write down the
action to be performed.
Requirements
Each student group is required to turn
in three reports: pre-game strategies, mid-game strategies, and post-game
lessons learned.
Some ideas on what should be included in
the reports: your strategies for each round of multiagent learning, your
rationale behind the values of the learning rate and the discount factor, how
you divide the group members among different tasks, your total rewards for
each round and the final grand total, your ordering of the situation-action pairs,
and finally your conclusions.
Your participation on Learning Day will be graded
based on:
50% Game Day Report (pre-game and mid-game strategies,
worksheets)
50% Learning
The Learning Score will be graded based on your
in-class participation on Learning Day, and on your team’s performance.