CSCE475/875 Multiagent Systems
Handout 7: Game Day 1 Learning Day Analysis
September 27, 2011
State Transition Map and Rewards
There were six states (S1-S6) and three actions (A1-A3). Five teams started in S1 and five teams started in S6. Each team was capable of performing all three actions. The transition map and the rewards are shown in Table 1 below. For example, performing A1 in state S1 yields either S1 or S4 and earns a reward of $30. Wherever there is more than one resultant state, our program picked one of them at random with equal probability.
For each reward, we actually drew from a Gaussian distribution with a mean equal to the value shown in parentheses and a standard deviation of 0.8.
|    | A1           | A2           | A3              |
| S1 | S1, S4 ($30) | S2, S4 ($30) | S3, S4 ($30)    |
| S2 | S5 ($1)      | S2 ($1)      | S1, S2, S5 ($1) |
| S3 | S2 ($50)     | S3, S4 ($10) | S1 ($1)         |
| S4 | S6 ($1)      | S3, S4 ($10) | S5 ($50)        |
| S5 | S2 ($1)      | S5 ($1)      | S2, S5, S6 ($1) |
| S6 | S3, S6 ($30) | S3, S5 ($30) | S3, S4 ($30)    |
Table 1. State transition map and rewards.
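As a concrete illustration of how a single transaction was resolved, here is a minimal Python sketch of the transition-and-reward draw described above (this is not the actual Game Day program; the names `TRANSITIONS` and `step` are ours):

```python
import random

# Transition map from Table 1: (state, action) -> (possible next states, mean reward)
TRANSITIONS = {
    ("S1", "A1"): (["S1", "S4"], 30), ("S1", "A2"): (["S2", "S4"], 30), ("S1", "A3"): (["S3", "S4"], 30),
    ("S2", "A1"): (["S5"], 1),        ("S2", "A2"): (["S2"], 1),        ("S2", "A3"): (["S1", "S2", "S5"], 1),
    ("S3", "A1"): (["S2"], 50),       ("S3", "A2"): (["S3", "S4"], 10), ("S3", "A3"): (["S1"], 1),
    ("S4", "A1"): (["S6"], 1),        ("S4", "A2"): (["S3", "S4"], 10), ("S4", "A3"): (["S5"], 50),
    ("S5", "A1"): (["S2"], 1),        ("S5", "A2"): (["S5"], 1),        ("S5", "A3"): (["S2", "S5", "S6"], 1),
    ("S6", "A1"): (["S3", "S6"], 30), ("S6", "A2"): (["S3", "S5"], 30), ("S6", "A3"): (["S3", "S4"], 30),
}

def step(state, action):
    """Resolve one transaction: pick a successor uniformly at random and
    draw the reward from a Gaussian with the tabulated mean and sigma = 0.8."""
    next_states, mean_reward = TRANSITIONS[(state, action)]
    next_state = random.choice(next_states)   # equal probability over successors
    reward = random.gauss(mean_reward, 0.8)   # Gaussian reward as described above
    return next_state, reward

# Example: performing A1 in S1 yields S1 or S4 and roughly $30.
print(step("S1", "A1"))
```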
Team Statistics
Note that the Order Accuracy is computed as follows. First, we rank the state-action pairs based on the ground truth generated by our simulation[1] using the Q-learning equation. The resulting ordering of the state-action pairs by Q(s,a) is:
1. S1-A1, S1-A3, S6-A1, S6-A3 (roughly the same)
5. S1-A2, S3-A1, S4-A3, S6-A2 (roughly the same)
9. S3-A2, S3-A3, S4-A1, S4-A2 (roughly the same)
13. S2-A3, S5-A3 (roughly the same)
15. S2-A1, S5-A1 (roughly the same)
17. S2-A2, S5-A2 (roughly the same)
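For reference, the ground-truth Q-values were generated with the standard one-step Q-learning update (learning rate α, discount factor γ, which footnote [1] calls beta):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],$$

where r is the observed reward and s' the observed next state.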
To compute the accuracy of each team's ordering, we use the two series of rankings shown in Table 1b below. We compute the absolute difference between a team's ordering and Series 1, do the same with Series 2, and then average the differences.
| Pair     | Series 1 | Series 2 |
| S1 -- A1 | 1        | 4        |
| S1 -- A2 | 5        | 8        |
| S1 -- A3 | 1        | 4        |
| S2 -- A1 | 15       | 16       |
| S2 -- A2 | 17       | 18       |
| S2 -- A3 | 13       | 14       |
| S3 -- A1 | 5        | 8        |
| S3 -- A2 | 9        | 12       |
| S3 -- A3 | 9        | 12       |
| S4 -- A1 | 9        | 12       |
| S4 -- A2 | 9        | 12       |
| S4 -- A3 | 5        | 8        |
| S5 -- A1 | 15       | 16       |
| S5 -- A2 | 17       | 18       |
| S5 -- A3 | 13       | 14       |
| S6 -- A1 | 1        | 4        |
| S6 -- A2 | 5        | 8        |
| S6 -- A3 | 1        | 4        |
Table 1b. Two series of ranking numbers used in the computation of order accuracy.
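A minimal sketch of the order-accuracy computation described above, assuming each ordering is represented as a dict that maps a state-action pair to its rank (the representation and names are ours):

```python
def order_accuracy(team_ranks, series1, series2):
    """Mean absolute rank difference against Series 1 and Series 2, averaged.

    All three arguments map each state-action pair, e.g. ("S1", "A1"),
    to its rank.  Lower values mean a more accurate ordering.
    """
    pairs = list(series1)
    diff1 = sum(abs(team_ranks[p] - series1[p]) for p in pairs) / len(pairs)
    diff2 = sum(abs(team_ranks[p] - series2[p]) for p in pairs) / len(pairs)
    return (diff1 + diff2) / 2.0
```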
Tables 2 and 3 show each team's ordering of the state-action pairs after Round 1 and Round 2, respectively.
| Pair     | Reagent | DJ Carpet | Free Agents | Power Agent | Wolfpack | Split Second | ULM | Triple Threat | JRL | SIB |
| S1 -- A1 | 11 | 3 | 3  | 7  | 7 | 2  | 2  | 4 | 3  | 6 |
| S1 -- A2 | 3  | 1 | 6  | 8  | 7 | 4  | 7  | 4 | 4  | 6 |
| S1 -- A3 | 5  | 2 | 6  | 9  | 7 | 6  | 8  | 4 | 5  | 3 |
| S2 -- A1 | 12 | 7 | 18 | 17 | 5 | 16 | 9  | 5 | 6  | 6 |
| S2 -- A2 | 9  | 8 | 5  | 18 | 7 | 17 | 10 | 5 | 7  | 6 |
| S2 -- A3 | 13 | 8 | 6  | 5  | 7 | 5  | 6  | 3 | 8  | 6 |
| S3 -- A1 | 2  | 8 | 1  | 2  | 7 | 1  | 11 | 1 | 1  | 6 |
| S3 -- A2 | 6  | 8 | 6  | 10 | 7 | 6  | 12 | 4 | 9  | 6 |
| S3 -- A3 | 10 | 6 | 6  | 11 | 7 | 6  | 13 | 4 | 10 | 5 |
| S4 -- A1 | 8  | 8 | 6  | 12 | 7 | 6  | 14 | 4 | 11 | 4 |
| S4 -- A2 | 7  | 8 | 6  | 13 | 7 | 6  | 15 | 4 | 12 | 6 |
| S4 -- A3 | 14 | 8 | 6  | 14 | 7 | 6  | 1  | 4 | 13 | 6 |
| S5 -- A1 | 15 | 8 | 4  | 16 | 6 | 6  | 16 | 3 | 14 | 6 |
| S5 -- A2 | 16 | 8 | 6  | 6  | 3 | 6  | 17 | 3 | 15 | 6 |
| S5 -- A3 | 17 | 5 | 6  | 4  | 4 | 18 | 5  | 3 | 16 | 6 |
| S6 -- A1 | 1  | 3 | 2  | 1  | 1 | 2  | 3  | 2 | 17 | 3 |
| S6 -- A2 | 18 | 8 | 6  | 3  | 2 | 6  | 4  | 4 | 2  | 2 |
| S6 -- A3 | 4  | 8 | 6  | 15 | 7 | 6  | 18 | 4 | 18 | 1 |
Table 2. The ordering of state-action pairs from each team after Round 1.
| Pair     | Reagent | DJ Carpet | Free Agents | Power Agent | Wolfpack | Split Second | ULM | Triple Threat | JRL | SIB |
| S1 -- A1 | 16 | 6  | 9  | 7  | 3  | 4  | 7  | 5  | 8 | 15 |
| S1 -- A2 | 3  | 3  | 8  | 8  | 2  | 11 | 11 | 4  | 8 | 15 |
| S1 -- A3 | 5  | 5  | 3  | 9  | 13 | 17 | 12 | 12 | 8 | 10 |
| S2 -- A1 | 7  | 12 | 18 | 17 | 10 | 13 | 5  | 14 | 4 | 11 |
| S2 -- A2 | 14 | 16 | 17 | 18 | 12 | 18 | 13 | 13 | 5 | 13 |
| S2 -- A3 | 12 | 13 | 13 | 5  | 8  | 16 | 9  | 9  | 6 | 12 |
| S3 -- A1 | 1  | 4  | 4  | 2  | 1  | 5  | 2  | 1  | 8 | 6  |
| S3 -- A2 | 6  | 10 | 12 | 10 | 13 | 7  | 14 | 12 | 2 | 8  |
| S3 -- A3 | 15 | 11 | 6  | 11 | 14 | 10 | 15 | 12 | 8 | 15 |
| S4 -- A1 | 11 | 17 | 15 | 12 | 7  | 12 | 16 | 7  | 8 | 14 |
| S4 -- A2 | 10 | 17 | 10 | 13 | 13 | 9  | 17 | 12 | 8 | 4  |
| S4 -- A3 | 17 | 2  | 1  | 14 | 13 | 3  | 6  | 12 | 8 | 1  |
| S5 -- A1 | 13 | 15 | 14 | 16 | 15 | 15 | 10 | 10 | 7 | 14 |
| S5 -- A2 | 8  | 14 | 16 | 6  | 9  | 14 | 18 | 8  | 9 | 3  |
| S5 -- A3 | 9  | 8  | 11 | 4  | 11 | 6  | 4  | 6  | 3 | 2  |
| S6 -- A1 | 2  | 1  | 5  | 1  | 4  | 1  | 1  | 2  | 8 | 9  |
| S6 -- A2 | 18 | 7  | 7  | 3  | 5  | 8  | 8  | 11 | 1 | 7  |
| S6 -- A3 | 4  | 9  | 2  | 15 | 6  | 2  | 3  | 3  | 8 | 5  |
Table 3. The ordering of state-action pairs from each team after Round 2.
Now we present more detailed team statistics in Tables 4-6. The numbers of transactions and rewards were tallied from the log that our program captured during the Game Day.
As shown in Table 4, Team Reagent was ranked #1 after Round 1. Indeed, the team earned the highest amount of rewards ($188) and had the best order accuracy for its ordering of the state-action pairs (3.94; lower values are better). Team Power Agent was a close second, followed by Teams Split Second and DJ Carpet. Teams Triple Threat and Wolfpack tied for last. Team JRL managed only three transactions but still obtained high rewards. The TOTAL RANK is a weighted sum of the two RANK values: 0.5*RANK(Rewards) + 0.5*RANK(OrderAccuracy). (Note: Efficiency is simply Rewards divided by #trans; this will be used later in our Game Day analysis.)
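For example, using Team Power Agent's entries in Table 4, RANK(Rewards) = 2 and RANK(OrderAccuracy) = 3, so

$$\text{TOTAL RANK} = 0.5 \times 2 + 0.5 \times 3 = 2.5.$$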
| Team Name     | #trans | Rewards | Efficiency | RANK (Rewards) | Order Accuracy | RANK (Accuracy) | TOTAL RANK |
| Free Agents   | 8    | 140    | 17.50 | 5  | 5.17 | 7  | 6   |
| Split Second  | 9    | 141    | 15.67 | 4  | 4.11 | 2  | 3   |
| Power Agent   | 11   | 174    | 15.82 | 2  | 4.50 | 3  | 2.5 |
| DJ Carpet     | 9    | 150    | 16.67 | 3  | 4.50 | 3  | 3   |
| ULM           | 7    | 140    | 20.00 | 5  | 4.50 | 3  | 4.5 |
| Reagent       | 10   | 188    | 18.80 | 1  | 3.94 | 1  | 1   |
| Triple Threat | 8    | 81     | 10.16 | 9  | 6.44 | 10 | 9.5 |
| Wolfpack      | 6    | 63     | 10.50 | 10 | 5.50 | 9  | 9.5 |
| SIB           | 6    | 117    | 19.50 | 8  | 5.28 | 8  | 8   |
| JRL           | 3    | 129    | 43.00 | 6  | 5.06 | 6  | 6.5 |
| AVERAGE       | 7.70 | 132.30 | 18.76 |    | 4.90 |    |     |
| TOTAL         | 77   | 1323   |       |    |      |    |     |
Table 4. Statistics after Round 1. Team Reagent was ranked #1: the team obtained the highest amount of reward ($188) and the best order accuracy (3.94).
Table 5 shows the statistics for Round 2 only, not cumulative totals. There were on average more transactions in Round 2 than in Round 1 (20.10 vs. 7.70). In terms of rewards, as expected, Round 2 yielded a higher average than Round 1 ($259.60 vs. $132.30). This was due to two factors. First, Round 2 lasted about 20 minutes while Round 1 lasted about 15 minutes. Second, the operation was smoother in Round 2: the Game Day monitors processed the transactions faster, and the teams also submitted their state tokens faster.
An interesting observation concerns the Efficiency measure in Round 1 vs. Round 2. Unexpectedly, Efficiency in Round 2 was on average lower than in Round 1 (12.55 vs. 18.76). In fact, only two teams were more efficient in Round 2: (1) Team Power Agent (from 15.82 to 22.85) and (2) Team ULM (from 20.00 to 21.22). We will look to each team's strategy for clues in the individual team discussions below.
| Team Name     | #trans | Rewards | Efficiency |
| Free Agents   | 23    | 399    | 17.35 |
| Split Second  | 23    | 280    | 12.17 |
| Power Agent   | 26    | 594    | 22.85 |
| DJ Carpet     | 22    | 275    | 12.50 |
| ULM           | 18    | 382    | 21.22 |
| Reagent       | 27    | 95     | 3.52  |
| Triple Threat | 18    | 125    | 6.94  |
| Wolfpack      | 14    | 147    | 10.50 |
| SIB           | 17    | 250    | 14.71 |
| JRL           | 13    | 49     | 3.77  |
| AVERAGE       | 20.10 | 259.60 | 12.55 |
| TOTAL         | 201   | 2596   |       |
Table 5. Statistics for Round 2 only. The number of transactions (#trans) and rewards do not include those from Round 1.
Table 6 shows the cumulative statistics after Round 2. The order accuracy, in particular, improved over Round 1 (4.28 vs. 4.90). Once again, TOTAL RANK = 0.5*RANK(Rewards) + 0.5*RANK(OrderAccuracy). Overall, Team Power Agent was ranked #1 in terms of rewards earned ($768), while Team Free Agents was ranked #1 in terms of order accuracy (2.44). Surprisingly, Team Reagent dropped from #1 after Round 1 to #7.5 after Round 2. We will see some possible reasons for this drop in the individual team discussions below.
| Team Name     | #trans | Rewards | Efficiency | RANK (Rewards) | Order Accuracy | RANK (Accuracy) | TOTAL RANK |
| Free Agents   | 31    | 539    | 17.39 | 2  | 2.44 | 1  | 1.5 |
| Split Second  | 32    | 421    | 13.16 | 5  | 3.06 | 2  | 3.5 |
| Power Agent   | 37    | 768    | 20.76 | 1  | 4.50 | 6  | 3.5 |
| DJ Carpet     | 31    | 425    | 13.71 | 4  | 3.17 | 3  | 3.5 |
| ULM           | 25    | 522    | 20.88 | 3  | 4.67 | 7  | 5   |
| Reagent       | 37    | 283    | 7.65  | 7  | 5.11 | 8  | 7.5 |
| Triple Threat | 26    | 206    | 7.92  | 9  | 4.11 | 4  | 6.5 |
| Wolfpack      | 20    | 210    | 10.50 | 8  | 4.17 | 5  | 6.5 |
| SIB           | 23    | 367    | 15.96 | 6  | 5.61 | 9  | 7.5 |
| JRL           | 16    | 178    | 11.13 | 10 | 5.94 | 10 | 10  |
| AVERAGE       | 27.80 | 391.90 | 14.10 |    | 4.28 |    |     |
| TOTAL         | 278   | 3919   |       |    |      |    |     |
Table 6. Statistics after Round 2. All numbers are cumulative. Team Free Agents was ranked #1 after Round 2.
To compute the final score for the Learning Day (50% of the Game Day), we compute TOTALRANK(Combined) = 0.25*TOTALRANK(Round 1) + 0.75*TOTALRANK(Round 2) for each team. The weights follow the specification in the Game Day handout. Table 7 shows the result. Team Free Agents finished first and thus won Game Day 1. They were followed closely by Team Power Agent and then Teams Split Second and DJ Carpet; these three teams were closely bunched. Team ULM finished fifth and Team Reagent sixth. Then came another cluster of three teams: Triple Threat, Wolfpack, and SIB. Finally, Team JRL finished a distant tenth.
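For example, Team Free Agents had TOTAL RANKs of 6 and 1.5 in Rounds 1 and 2, so

$$\text{TOTALRANK(Combined)} = 0.25 \times 6 + 0.75 \times 1.5 = 1.5 + 1.125 = 2.625,$$

the best combined score, and hence the final rank of #1 in Table 7.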
| Team Name     | TOTAL RANK (Round 1) | TOTAL RANK (Round 2) | TOTAL RANK (Combined) | FINAL RANK |
| Free Agents   | 6   | 1.5 | 2.625 | 1  |
| Split Second  | 3   | 3.5 | 3.375 | 3  |
| Power Agent   | 2.5 | 3.5 | 3.25  | 2  |
| DJ Carpet     | 3   | 3.5 | 3.375 | 3  |
| ULM           | 4.5 | 5   | 4.875 | 5  |
| Reagent       | 1   | 7.5 | 5.875 | 6  |
| Triple Threat | 9.5 | 6.5 | 7.25  | 7  |
| Wolfpack      | 9.5 | 6.5 | 7.25  | 7  |
| SIB           | 8   | 7.5 | 7.625 | 9  |
| JRL           | 6.5 | 10  | 9.125 | 10 |
Table 7. The final ranking of teams. Team Free Agents finished #1. Team JRL finished last.
Individual Team Analysis
First, Table 8 shows the learning rate and discount factor used in Round 1 and Round 2 by each team. Most teams either lowered their learning rate from Round 1 to Round 2 or kept it the same (average 0.63 vs. 0.53); the exceptions were Team Reagent and Team JRL, who raised theirs. Incidentally, referring back to Tables 4 and 6, Team Reagent's order accuracy worsened from 3.94 to 5.11, and Team JRL's worsened from 5.06 to 5.94. Correspondingly, most teams used a higher or the same discount factor in Round 2 than in Round 1 (average 0.375 vs. 0.475), except for Team SIB. Once again, referring back to Tables 4 and 6, Team SIB's order accuracy worsened from 5.28 to 5.61. Indeed, out of the ten teams, four had worse order accuracy in Round 2 than in Round 1. Three of them have been accounted for above; the remaining one is Team ULM, which went from 4.50 to 4.67. The learning rate and discount factor thus appear to be important factors in determining the learning performance of the agents.
| Team Name     | Learning Rate (Round 1) | Discount Factor (Round 1) | Learning Rate (Round 2) | Discount Factor (Round 2) |
| Free Agents   | 0.7  | 0.3   | 0.7  | 0.5   |
| Split Second  | 0.85 | 0.4   | 0.85 | 0.70  |
| Power Agent   | 0.7  | 0.3   | 0.5  | 0.6   |
| DJ Carpet     | 0.8  | 0.3   | 0.5  | 0.6   |
| ULM           | 0.9  | 0.3   | 0.7  | 0.5   |
| Reagent       | 0.25 | 0.75  | 0.6  | 0.75  |
| Triple Threat | 0.5  | 0.2   | 0.3  | 0.4   |
| Wolfpack      | 0.75 | 0.1   | 0.75 | 0.1   |
| SIB           | 0.7  | 0.7   | 0.2  | 0.2   |
| JRL           | 0.15 | 0.4   | 0.2  | 0.4   |
| AVERAGE       | 0.63 | 0.375 | 0.53 | 0.475 |
Table 8. Learning rates and discount factors used by each team for Round 1 and Round 2.
Before we start looking at teams individually, here is a general sense of the two rounds and the role of the intermission’s information sharing.
In general, Round 1 is for exploration, and Round 2 is for somewhat more exploitation. That is, Round 1 should be used to explore different state-action pairs, and accordingly one should use a higher learning rate so that each new transaction and its reward carry more weight. The intermission's information sharing should give each team a sense of how its ordering compares to the others'. If your team's ordering is very different from the others', perhaps your Q-values for those state-action pairs have not converged; if it is very similar, perhaps they have. Given that logic, Round 2 should lean toward exploitation if you are confident that your Q-values have converged, and a lower learning rate and a larger discount factor help toward that.
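To see why a higher learning rate emphasizes the current transaction, rewrite the Q-learning update from above as a weighted average:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right].$$

With α close to 1 the new sample dominates the old estimate; with α close to 0 the old estimate dominates.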
But one critical issue remains: what if other teams' orderings are less accurate than yours? Since your confidence in your own Q-values depends on how they match up against others', what should you do? This is where agent observation comes into play. For example, your team may observe what the other teams are doing. If a team seldom approaches the Game Day Monitors to submit a transaction, then that team's learning results should not be trusted. Given your observations of other teams' behaviors, you should be able to disregard untrustworthy orderings and thereby make better use of the intermission's information sharing to set your learning rate and discount factor more appropriately.
There are also other factors. Note that for any learning approach to work, and for reinforcement learning in particular, there must be sufficient learning episodes. In this Game Day, that means each team should secure as many transactions as possible in order to better model the stochastic nature of the environment.
Table 9 below shows the correlations among the number of transactions, the rewards, and the order accuracy values. As expected, the number of transactions and the rewards received by each team were highly correlated (greater than 0.55 after each round). Further, the number of transactions and order accuracy were also fairly strongly correlated (magnitude greater than 0.42 after each round[2]). So our intuition was, in general, correct. Also, rewards and order accuracy were more strongly correlated after Round 1 than after Round 2. This is also expected: as more teams turned to exploiting what they had learned in Round 1, their focus shifted toward earning rewards rather than refining the Q-values.
| Correlations  | #Trans – Rewards | #Trans – Accuracy | Rewards – Accuracy |
| After Round 1 | 0.556964 | -0.42779 | -0.83065 |
| After Round 2 | 0.616655 | -0.43117 | -0.33744 |
Table 9. Correlations between the number of transactions, rewards, and order accuracy.
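These correlations can be reproduced with a few lines of NumPy; the sketch below uses the per-team columns of Table 6 (in the row order of that table) and should reproduce the "After Round 2" row up to rounding:

```python
import numpy as np

# Per-team values from Table 6 (Free Agents, Split Second, Power Agent, DJ Carpet,
# ULM, Reagent, Triple Threat, Wolfpack, SIB, JRL).
n_trans  = np.array([31, 32, 37, 31, 25, 37, 26, 20, 23, 16])
rewards  = np.array([539, 421, 768, 425, 522, 283, 206, 210, 367, 178])
accuracy = np.array([2.44, 3.06, 4.50, 3.17, 4.67, 5.11, 4.11, 4.17, 5.61, 5.94])

print(np.corrcoef(n_trans, rewards)[0, 1])    # #Trans - Rewards
print(np.corrcoef(n_trans, accuracy)[0, 1])   # #Trans - Accuracy (negative: smaller value = better)
print(np.corrcoef(rewards, accuracy)[0, 1])   # Rewards - Accuracy
```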
Table 10 documents my comments on each team's worksheets and reports, placed in the context of the discussions above. Each entry summarizes the team's pre-game strategy, the tracking worksheets after Rounds 1 and 2, the mid-game and post-game notes, and my observation.
Free Agents
  Pre-Game: Had strategies for both rounds: exploration in Round 1 and exploitation in Round 2; no contingency planning.
  Tracking (Round 1): Not properly recorded; a missing ranking for (S3, A1).
  Mid-Game: Pointed out that they would focus on obtaining new Q-values for each pair to start Round 2.
  Tracking (Round 2): Correctly recorded.
  Post-Game: Pointed out that it was important to work efficiently and go through as many transactions as possible; willing to trade off accuracy for more rewards; noted that the learning rate in Round 2 could have been lower.
  My Observation: This team executed exceedingly well during the Game Day and was able to balance earning rewards with keeping the ordering accurate. They were able to perform 2-3 transitions, particularly involving action A3, on all six states. That greatly improved their order accuracy.

Split Second
  Pre-Game: Had good strategies for both rounds: exploration in Round 1 and exploitation in Round 2; comprehensive discussion of myopic vs. long-term approaches; but no mention of utilizing the intermission's information sharing; contingency planning; distribution of tasks.
  Tracking (Round 1): Correctly recorded.
  Mid-Game: Pointed out that they did not explore the space as much as they would have liked; found a "trap" state (S2) with low rewards; raised the discount factor to be able to see trouble states; retained 0.85 as the learning rate to keep learning.
  Tracking (Round 2): Almost all correctly recorded, with a couple of missing state-action pairs.
  Post-Game: Assumed that the state transitions were deterministic; made use of the "stuck" state to improve Q-values; made use of others' orderings cautiously; observed the potential advantages and disadvantages of a high discount factor; explained why they chose a high learning rate in Round 2 and countered it with a high discount factor; noted that contracting was not time-cost efficient and rejected contract offers correctly; saw the importance of developing a strong, accurate set of Q-values quickly: "The sooner it is done, the sooner the agent can begin exploiting the environment in a manner close to its true potential."
  My Observation: This team was very well prepared. Though they used a high learning rate in both rounds, their choice of actions helped them explore the space rather well in Round 2. They were also able to exploit the rewarding state-action pairs in Round 2 to keep gaining while exploring. The high learning rate and high discount factor combination was a bold move, but it appeared to work, as the future term was able to temper the local rewards.

Power Agent
  Pre-Game: Had the most thought-out set of strategies for both rounds; contingency planning; making use of information sharing during the intermission; making use of contracting; comprehensive.
  Tracking (Round 1): Not properly recorded. The Q-value ordering was incorrect in several places.
  Mid-Game: Good notes; pointed out the lack of time to make full use of information sharing during the intermission; did compare their ordering against the average ordering.
  Tracking (Round 2): Correctly recorded.
  Post-Game: Pointed out that submitting transactions in Round 1 was a bottleneck that slowed down the learning process; that they adopted a faster approach in Round 2; that agents should be open to dynamically adapting their strategies if needed (GOOD!); and that contracting was not very inviting.
  My Observation: This team was very well prepared. They were able to home in on very rewarding state-action pairs and thus gained the highest amount of rewards after Round 2. However, because of that focus, they somewhat neglected improving the Q-values of the other state-action pairs: 9 of the 18 state-action pairs received only one learning episode. They should probably have balanced this out; it is a typical tradeoff problem.

DJ Carpet
  Pre-Game: Had good strategies for both rounds: exploration in Round 1 and exploitation in Round 2; but no mention of utilizing the intermission's information sharing; distribution of tasks; no contingency planning.
  Tracking (Round 1): Correctly recorded, except for the Q-value for (S1, A2).
  Mid-Game: Pointed out a lot of unvisited states and thus did not change the learning rate and discount factor much.
  Tracking (Round 2): —
  Post-Game: Pointed out that they had aimed to hit a productive transition order but were foiled by an "unexpected state transition."
  My Observation: Strategies were appropriate for both rounds, but the team assumed that the state transitions were deterministic. Their strategies were quite opportunistic: if they arrived at a state that would generate good rewards with a certain action, they would pursue it; otherwise, they would explore a bit more.

ULM
  Pre-Game: Had good strategies for both rounds: exploration in Round 1 and exploitation in Round 2; but no mention of utilizing the intermission's information sharing; no contingency planning.
  Tracking (Round 1): Not properly recorded. The Q-value ordering was incorrect in several places.
  Mid-Game: Pointed out that, due to time constraints, they were not able to map the entire space.
  Tracking (Round 2): Not properly recorded. The Q-value ordering was incorrect in several places.
  Post-Game: Pointed out that their Round 2 strategy led them to prefer "known" paths over "unknown" paths, leading to poor choices with low rewards for some states; that the lack of a sufficient number of transactions hurt the performance of Q-learning; that they assumed the state transitions were deterministic; argued for a new Q-learning variant to address such a problem.
  My Observation: The learning rates used were quite high: 0.9 in Round 1 and 0.7 in Round 2. As a result, the learning was not stable, leading to poorer order accuracy after Round 2. The choice of discount factor was more appropriate. The team's assumption that the state transitions were deterministic proved to be critical. Also, Q-learning should be able to address the stochastic nature of the environment: it will converge given enough learning episodes. Finally, the team should have made use of the intermission's information sharing.

Reagent
  Pre-Game: The strategies were not quite right: the goal is not just to maximize the total reward but also to achieve a highly accurate ordering of the state-action pairs. No contingency planning.
  Tracking (Round 1): Not recorded. The ordering is incorrect: the state-action pairs with a Q-value of 0.5 (8 of them) should be ranked at the same position. (However, I corrected the ordering and the team still ranked #1 in order accuracy in Round 1.)
  Mid-Game: No notes.
  Tracking (Round 2): Not recorded.
  Post-Game: Concluded that Q-learning falls short in stochastic environments; pointed out a problem with entering the wrong state for a transaction into their code, which led them to use a higher learning rate in Round 2 to try to correct the error.
  My Observation: Using a low learning rate in Round 1 was not appropriate, as the initial values of 0.5 were not to be trusted; the 0.75 discount factor used in Round 1 was also probably too high because it was too far-seeing, which is not appropriate when the Q-values are far from accurate. Did not exploit information sharing. The higher learning rate was the main factor in this team's drop from #1 in Round 1 to #8 in order accuracy in Round 2. They should have used a lower learning rate; the system would then have gradually corrected the Round 1 error while stabilizing the other Q-values.

Triple Threat
  Pre-Game: Had strategies for Round 1 but not exactly for Round 2's learning rate and discount factor (no explanation); an interesting approach of learning the transition probabilities; distribution of tasks; no contingency planning.
  Tracking (Round 1): Not properly recorded. The Q-values do not correspond to the ordering submitted.
  Mid-Game: The team pointed out that they were stuck in a loop between S2 and S5.
  Tracking (Round 2): Not properly recorded. Missing Q-values.
  Post-Game: Pointed out that "brute forcing our way out" of a loop of low rewards took time, and that they learned they need to work on their organization.
  My Observation: The interesting strategy of trying to model the transition probabilities would require a sufficient number of learning episodes; their cooperation strategies did not consider other teams' motivations; they should have made use of the intermission's information sharing to get out of a loop (see the comments on Wolfpack).

Wolfpack
  Pre-Game: Had good strategies for both rounds: exploration in Round 1 and exploitation in Round 2; but no mention of utilizing the intermission's information sharing; division of tasks.
  Tracking (Round 1): Correctly recorded.
  Mid-Game: Good notes on (S3, A1) learned from other teams during the intermission. And the team used the information in Round 2!
  Tracking (Round 2): Correctly recorded.
  Post-Game: Good notes. Tried to contract with other teams, to no avail; pointed out that exploitation was not as successful in Round 2 because of the lack of exploration in Round 1; that they should have created an Excel sheet to compute the Q and V values faster; and that they forgot to consider uncertainty in the environment. However, they incorrectly concluded that the environment changed in Round 2; the environment did not change, it was simply stochastic.
  My Observation: The team was quite well prepared in terms of strategies. But tactically they were not sufficiently prepared: they could not compute by hand as fast as the other teams and, as a result, did not submit enough transactions. On the other hand, this team made use of the intermission's information sharing to immediately choose A1 as soon as they observed S3. Good move.

SIB
  Pre-Game: Had strategies for both rounds: exploration in Round 1 and exploitation in Round 2; but no mention of utilizing the intermission's information sharing.
  Tracking (Round 1): Correctly recorded.
  Mid-Game: No notes.
  Tracking (Round 2): Not properly recorded; the ordering was not correctly reported. (S4, A1) should be #15, and then the last three should be #16.
  Post-Game: No notes.
  My Observation: The team was prepared in terms of overall strategy, but did not have any "real-time" strategy to make use of information sharing. The team also did not report its ordering properly (for one state-action pair in Round 2).

JRL
  Pre-Game: A simple pre-game strategy, with no contingency plan and no strategy for exploiting the information sharing during the intermission; no task allocation among the team members.
  Tracking (Round 1): Poorly recorded. Further, given only three transactions, affecting only S6, S3, and S2, it was impossible for JRL to turn in a high-resolution ordering of the state-action pairs as shown in Table 2.
  Mid-Game: Pointed out that they increased the learning rate from 0.15 to 0.2, anticipating insufficient transactions again in Round 2.
  Tracking (Round 2): Correctly recorded.
  Post-Game: The team realized that they were not well prepared for the Game Day; they also pointed out that they were stuck in a loop for 11 iterations and received poor rewards.
  My Observation: The team was not well prepared; the pre-game strategy was lacking; and there were simply too few transactions. For reinforcement learning to work, there must be sufficient learning episodes. The team did not make use of the intermission's information sharing.

Table 10. My comments and observations of team strategies, worksheets, and reports.
Lessons Learned
Here are some overall lessons learned.
1. There was no motivation for the teams to cooperate via the contracting process. In this case, the design of the MAS environment did not provide any benefits for cooperation that could offset the time cost. Besides, the information sharing during the intermission, if used, should have provided sufficient help.
2. Several teams pointed out that they got stuck in a loop of state transitions yielding no or very low rewards. That is true. However, when a team observed this, it could still gain by using the opportunity to refine the Q-values of the state-action pairs involved in the loop.
3. More transactions led to better learning, as shown in the correlation numbers above (Table 9). Thus, acting quickly and efficiently was critical. Teams that were slow in submitting their state tokens completed fewer transactions, leading to poorer performance.
4. Lowering the learning rate, or keeping it the same, appeared to work better than increasing it from Round 1 to Round 2 in this MAS environment. In general, increasing the learning rate as time progresses tends to unlearn what has already been learned.
5. Using a high discount factor could have a clamping (tempering) effect on the volatility brought on by a high learning rate. This is because the future term essentially brings previously learned Q-values back into the update.
6. The information sharing during the intermission was exploited by only a handful of teams. As alluded to earlier, comparing your ordering with the others' could help you decide on your learning rate and discount factor. It could also help you decide which actions to perform in Round 2 for a given state.
7. Why did the rewards per transaction (Efficiency) drop? There were two factors. First, because there were more than twice as many transactions in Round 2 as in Round 1 (201 vs. 77), the law of averages came into play: the simulated Gaussian distribution of the rewards asserted itself. Second, most teams chose a smaller learning rate in Round 2 and, as a result, observed fewer new states, which could also add to the law-of-averages effect.
8. Several teams pointed out a tradeoff at play: trying to maximize rewards while also trying to maximize order accuracy. These two objectives are in a tug-of-war: maximizing rewards reduces exploration and increases exploitation, and vice versa for maximizing order accuracy. Several teams adopted an opportunistic balancing act: if they encountered a "rewarding" state, they would keep acting on it until it transitioned away; if they encountered a new state, they would consult the shared information (the Excel file of all orderings from Round 1) to pick the likely useful action.
9. Teams that were prepared were ranked higher. As an agent, each team should be observant, adaptive, responsive, and reflective.
Game Day League
Here are the League Standings.
| Team Name     | Learning Day | Voting Day | Auction Day | League Standings |
| Free Agents   | 1  |  |  | 1  |
| Power Agent   | 2  |  |  | 2  |
| DJ Carpet     | 3  |  |  | 3  |
| Split Second  | 3  |  |  | 3  |
| ULM           | 5  |  |  | 5  |
| Reagent       | 6  |  |  | 6  |
| Triple Threat | 7  |  |  | 7  |
| Wolfpack      | 7  |  |  | 7  |
| SIB           | 9  |  |  | 9  |
| JRL           | 10 |  |  | 10 |
[1] In our simulation, we varied the learning rate (alpha) from 0.1 to 1.0 in 0.1 increments, and the discount factor (beta) in the same way, giving 100 different configurations. For each configuration we ran 10,000 actions, and we repeated each configuration 100 times to obtain a reasonable set of averages. So, all in all, we performed 100 x 100 runs of 10,000 actions each, for a total of 100 million actions.
[2] The correlation values are negative because a more accurate ordering corresponds to a smaller order-accuracy value.