CSCE475/875 Multiagent Systems

Handout 7: Game Day 1 Learning Day Analysis

September 27, 2011

State Transition Map and Rewards

There were six states (S1-S6) and three actions (A1-A3).  Five teams started in state S1 and five teams started in state S6.  Each team was capable of performing all three actions.  The transition map and the rewards are shown in Table 1 below.  For example, performing A1 in state S1 yields either state S1 or S4 and earns a reward of $30.  Wherever there is more than one resultant state, our program randomly picked one of them with equal probability.

For each reward, we actually generated a Gaussian distribution with a mean equal to the value shown in parentheses and a standard deviation of 0.8.

| State | A1           | A2           | A3              |
| S1    | S1, S4 ($30) | S2, S4 ($30) | S3, S4 ($30)    |
| S2    | S5 ($1)      | S2 ($1)      | S1, S2, S5 ($1) |
| S3    | S2 ($50)     | S3, S4 ($10) | S1 ($1)         |
| S4    | S6 ($1)      | S3, S4 ($10) | S5 ($50)        |
| S5    | S2 ($1)      | S5 ($1)      | S2, S5, S6 ($1) |
| S6    | S3, S6 ($30) | S3, S5 ($30) | S3, S4 ($30)    |

Table 1.  State transition map and rewards. 
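To make the mechanics concrete, here is a minimal sketch of how a single transaction could be simulated from Table 1.  The dictionary encoding and function name are illustrative assumptions, not the actual Game Day program.

```python
import random

# Transition map from Table 1: (state, action) -> (possible next states, mean reward).
# This encoding is an assumption for illustration; the actual program may differ.
TRANSITIONS = {
    ("S1", "A1"): (["S1", "S4"], 30), ("S1", "A2"): (["S2", "S4"], 30), ("S1", "A3"): (["S3", "S4"], 30),
    ("S2", "A1"): (["S5"], 1),        ("S2", "A2"): (["S2"], 1),        ("S2", "A3"): (["S1", "S2", "S5"], 1),
    ("S3", "A1"): (["S2"], 50),       ("S3", "A2"): (["S3", "S4"], 10), ("S3", "A3"): (["S1"], 1),
    ("S4", "A1"): (["S6"], 1),        ("S4", "A2"): (["S3", "S4"], 10), ("S4", "A3"): (["S5"], 50),
    ("S5", "A1"): (["S2"], 1),        ("S5", "A2"): (["S5"], 1),        ("S5", "A3"): (["S2", "S5", "S6"], 1),
    ("S6", "A1"): (["S3", "S6"], 30), ("S6", "A2"): (["S3", "S5"], 30), ("S6", "A3"): (["S3", "S4"], 30),
}

def step(state, action):
    """Simulate one transaction: pick a successor state uniformly at random and
    draw the reward from a Gaussian with the tabled mean and standard deviation 0.8."""
    next_states, mean_reward = TRANSITIONS[(state, action)]
    next_state = random.choice(next_states)   # equal probability among possible successors
    reward = random.gauss(mean_reward, 0.8)   # Gaussian reward around the tabled mean
    return next_state, reward

# Example: one transaction performing A1 in S1.
print(step("S1", "A1"))
```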

Team Statistics

Note that the Order Accuracy is computed as follows.  First, we rank the state-action pairs based on the ground truth generated by our simulation[1], using the Q-learning equation.  The resulting ordering of the state-action pairs by Q(s,a) is as follows (the leading number is the rank position; pairs listed together are roughly tied):

1.  S1-A1, S1-A3, S6-A1, S6-A3 (roughly the same)

5.  S1-A2, S3-A1, S4-A3, S6-A2 (roughly the same)

9.  S3-A2, S3-A3, S4-A1, S4-A2 (roughly the same)

13. S2-A3, S5-A3 (roughly the same)

15. S2-A1, S5-A1 (roughly the same)

17. S2-A2, S5-A2 (roughly the same)

To compute the accuracy of each team’s ordering, we use the two series of ranking numbers shown in Table 1 below.  For each state-action pair, we compute the absolute difference between the team’s rank and the Series 1 rank, and likewise against Series 2.  The order accuracy is the average of these differences, so a lower value means a more accurate ordering.  (A sketch of this computation follows the table.)

| Pair  | Series 1 | Series 2 |
| S1-A1 | 1        | 4        |
| S1-A2 | 5        | 8        |
| S1-A3 | 1        | 4        |
| S2-A1 | 15       | 16       |
| S2-A2 | 17       | 18       |
| S2-A3 | 13       | 14       |
| S3-A1 | 5        | 8        |
| S3-A2 | 9        | 12       |
| S3-A3 | 9        | 12       |
| S4-A1 | 9        | 12       |
| S4-A2 | 9        | 12       |
| S4-A3 | 5        | 8        |
| S5-A1 | 15       | 16       |
| S5-A2 | 17       | 18       |
| S5-A3 | 13       | 14       |
| S6-A1 | 1        | 4        |
| S6-A2 | 5        | 8        |
| S6-A3 | 1        | 4        |

Table 1.  Two series of ranking numbers used in the computation of order accuracy.
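As an illustration, here is a minimal sketch of the order-accuracy computation described above.  The dictionary encoding and function name are assumptions for illustration, not the actual grading script; applying it to Team Reagent’s Round 1 ordering in Table 2 reproduces the 3.94 reported in Table 4.

```python
# Ground-truth ranking series from Table 1: pair -> (Series 1 rank, Series 2 rank).
SERIES = {
    ("S1", "A1"): (1, 4),   ("S1", "A2"): (5, 8),   ("S1", "A3"): (1, 4),
    ("S2", "A1"): (15, 16), ("S2", "A2"): (17, 18), ("S2", "A3"): (13, 14),
    ("S3", "A1"): (5, 8),   ("S3", "A2"): (9, 12),  ("S3", "A3"): (9, 12),
    ("S4", "A1"): (9, 12),  ("S4", "A2"): (9, 12),  ("S4", "A3"): (5, 8),
    ("S5", "A1"): (15, 16), ("S5", "A2"): (17, 18), ("S5", "A3"): (13, 14),
    ("S6", "A1"): (1, 4),   ("S6", "A2"): (5, 8),   ("S6", "A3"): (1, 4),
}

def order_accuracy(team_ranks):
    """Average absolute difference between a team's ranks and the two
    ground-truth series; a lower value means a more accurate ordering."""
    diffs = []
    for pair, team_rank in team_ranks.items():
        series1, series2 = SERIES[pair]
        diffs.append(abs(team_rank - series1))
        diffs.append(abs(team_rank - series2))
    return sum(diffs) / len(diffs)
```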

Tables 2 and 3 show each team’s ordering of the state-action pairs after Round 1 and Round 2, respectively.


 

| Pair  | Reagent | DJ Carpet | Free Agents | Power Agent | Wolfpack | Split Second | ULM | Triple Threat | JRL | SIB |
| S1-A1 | 11 | 3 | 3  | 7  | 7 | 2  | 2  | 4 | 3  | 6 |
| S1-A2 | 3  | 1 | 6  | 8  | 7 | 4  | 7  | 4 | 4  | 6 |
| S1-A3 | 5  | 2 | 6  | 9  | 7 | 6  | 8  | 4 | 5  | 3 |
| S2-A1 | 12 | 7 | 18 | 17 | 5 | 16 | 9  | 5 | 6  | 6 |
| S2-A2 | 9  | 8 | 5  | 18 | 7 | 17 | 10 | 5 | 7  | 6 |
| S2-A3 | 13 | 8 | 6  | 5  | 7 | 5  | 6  | 3 | 8  | 6 |
| S3-A1 | 2  | 8 | 1  | 2  | 7 | 1  | 11 | 1 | 1  | 6 |
| S3-A2 | 6  | 8 | 6  | 10 | 7 | 6  | 12 | 4 | 9  | 6 |
| S3-A3 | 10 | 6 | 6  | 11 | 7 | 6  | 13 | 4 | 10 | 5 |
| S4-A1 | 8  | 8 | 6  | 12 | 7 | 6  | 14 | 4 | 11 | 4 |
| S4-A2 | 7  | 8 | 6  | 13 | 7 | 6  | 15 | 4 | 12 | 6 |
| S4-A3 | 14 | 8 | 6  | 14 | 7 | 6  | 1  | 4 | 13 | 6 |
| S5-A1 | 15 | 8 | 4  | 16 | 6 | 6  | 16 | 3 | 14 | 6 |
| S5-A2 | 16 | 8 | 6  | 6  | 3 | 6  | 17 | 3 | 15 | 6 |
| S5-A3 | 17 | 5 | 6  | 4  | 4 | 18 | 5  | 3 | 16 | 6 |
| S6-A1 | 1  | 3 | 2  | 1  | 1 | 2  | 3  | 2 | 17 | 3 |
| S6-A2 | 18 | 8 | 6  | 3  | 2 | 6  | 4  | 4 | 2  | 2 |
| S6-A3 | 4  | 8 | 6  | 15 | 7 | 6  | 18 | 4 | 18 | 1 |

Table 2.  The ordering of state-action pairs from each team after Round 1. 

| Pair  | Reagent | DJ Carpet | Free Agents | Power Agent | Wolfpack | Split Second | ULM | Triple Threat | JRL | SIB |
| S1-A1 | 16 | 6  | 9  | 7  | 3  | 4  | 7  | 5  | 8 | 15 |
| S1-A2 | 3  | 3  | 8  | 8  | 2  | 11 | 11 | 4  | 8 | 15 |
| S1-A3 | 5  | 5  | 3  | 9  | 13 | 17 | 12 | 12 | 8 | 10 |
| S2-A1 | 7  | 12 | 18 | 17 | 10 | 13 | 5  | 14 | 4 | 11 |
| S2-A2 | 14 | 16 | 17 | 18 | 12 | 18 | 13 | 13 | 5 | 13 |
| S2-A3 | 12 | 13 | 13 | 5  | 8  | 16 | 9  | 9  | 6 | 12 |
| S3-A1 | 1  | 4  | 4  | 2  | 1  | 5  | 2  | 1  | 8 | 6  |
| S3-A2 | 6  | 10 | 12 | 10 | 13 | 7  | 14 | 12 | 2 | 8  |
| S3-A3 | 15 | 11 | 6  | 11 | 14 | 10 | 15 | 12 | 8 | 15 |
| S4-A1 | 11 | 17 | 15 | 12 | 7  | 12 | 16 | 7  | 8 | 14 |
| S4-A2 | 10 | 17 | 10 | 13 | 13 | 9  | 17 | 12 | 8 | 4  |
| S4-A3 | 17 | 2  | 1  | 14 | 13 | 3  | 6  | 12 | 8 | 1  |
| S5-A1 | 13 | 15 | 14 | 16 | 15 | 15 | 10 | 10 | 7 | 14 |
| S5-A2 | 8  | 14 | 16 | 6  | 9  | 14 | 18 | 8  | 9 | 3  |
| S5-A3 | 9  | 8  | 11 | 4  | 11 | 6  | 4  | 6  | 3 | 2  |
| S6-A1 | 2  | 1  | 5  | 1  | 4  | 1  | 1  | 2  | 8 | 9  |
| S6-A2 | 18 | 7  | 7  | 3  | 5  | 8  | 8  | 11 | 1 | 7  |
| S6-A3 | 4  | 9  | 2  | 15 | 6  | 2  | 3  | 3  | 8 | 5  |

Table 3.  The ordering of state-action pairs from each team after Round 2. 

Now we present more detailed team statistics in Tables 4-6.  The number of transactions and the rewards were tallied from the log that our program captured during the Game Day.

As shown in Table 4, after Round 1, Team Reagent was ranked #1.  Indeed, the team scored the highest amount of rewards ($188) and the best order accuracy in its ordering of the state-action pairs (3.94).  Team Power Agent was a close second, followed by Teams Split Second and DJ Carpet.  Teams Triple Threat and Wolfpack tied for last.  Team JRL managed only three transactions, but obtained high rewards.  The TOTAL RANK is a weighted sum of the two RANK values: 0.5*RANK(Rewards) + 0.5*RANK(OrderAccuracy).  (Note: Efficiency is simply Rewards divided by #trans.  This will be used later in our Game Day analysis.)

| Team Name     | #trans | Rewards | Efficiency | RANK (Rewards) | Order Accuracy | RANK (Order Accuracy) | TOTAL RANK |
| Free Agents   | 8    | 140    | 17.50 | 5  | 5.17 | 7  | 6   |
| Split Second  | 9    | 141    | 15.67 | 4  | 4.11 | 2  | 3   |
| Power Agent   | 11   | 174    | 15.82 | 2  | 4.50 | 3  | 2.5 |
| DJ Carpet     | 9    | 150    | 16.67 | 3  | 4.50 | 3  | 3   |
| ULM           | 7    | 140    | 20.00 | 5  | 4.50 | 3  | 4.5 |
| Reagent       | 10   | 188    | 18.80 | 1  | 3.94 | 1  | 1   |
| Triple Threat | 8    | 81     | 10.16 | 9  | 6.44 | 10 | 9.5 |
| Wolfpack      | 6    | 63     | 10.50 | 10 | 5.50 | 9  | 9.5 |
| SIB           | 6    | 117    | 19.50 | 8  | 5.28 | 8  | 8   |
| JRL           | 3    | 129    | 43.00 | 6  | 5.06 | 6  | 6.5 |
| AVERAGE       | 7.70 | 132.30 | 18.76 |    | 4.90 |    |     |
| TOTAL         | 77   | 1323   |       |    |      |    |     |

Table 4.  Statistics after Round 1.  Team Reagent was ranked #1.  The team obtained the highest amount of reward ($188) and also the highest accuracy in its ordering of the state-action pairs (3.94).
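For reference, here is a minimal sketch of the Efficiency and TOTAL RANK formulas used in Tables 4 and 6.  The function names are illustrative; the RANK columns themselves come from the tables above (tie handling in the original ranking is not reproduced here).

```python
def efficiency(rewards, num_transactions):
    """Efficiency is simply Rewards divided by #trans."""
    return rewards / num_transactions

def total_rank(rewards_rank, order_accuracy_rank):
    """TOTAL RANK puts equal weight on the two RANK columns."""
    return 0.5 * rewards_rank + 0.5 * order_accuracy_rank

# Example with Team Reagent's Round 1 numbers from Table 4.
print(efficiency(188, 10))   # 18.8
print(total_rank(1, 1))      # 1.0
```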

Table 5 shows the statistics for Round 2 only; the numbers are not cumulative.  There were on average more transactions in Round 2 than in Round 1 (20.10 vs. 7.70).  In terms of rewards, as expected, Round 2 yielded a higher average than Round 1 ($259.60 vs. $132.30).  This was due to two factors.  First, Round 2 lasted about 20 minutes while Round 1 lasted about 15 minutes.  Second, the operation was smoother in Round 2: the Game Day monitors processed the transactions faster, and the teams also submitted their state tokens faster.

An interesting observation concerns the Efficiency measure in Round 1 versus Round 2.  Unexpectedly, the average Efficiency in Round 2 was lower than that in Round 1 (12.55 vs. 18.76).  In fact, only two teams were more efficient in Round 2: (1) Team Power Agent (from 15.82 to 22.85), and (2) Team ULM (from 20.00 to 21.22).  We will look into each team’s strategy for clues in the later sections discussing individual teams.

 

 

 

| Team Name     | #trans | Rewards | Efficiency |
| Free Agents   | 23    | 399    | 17.35 |
| Split Second  | 23    | 280    | 12.17 |
| Power Agent   | 26    | 594    | 22.85 |
| DJ Carpet     | 22    | 275    | 12.50 |
| ULM           | 18    | 382    | 21.22 |
| Reagent       | 27    | 95     | 3.52  |
| Triple Threat | 18    | 125    | 6.94  |
| Wolfpack      | 14    | 147    | 10.50 |
| SIB           | 17    | 250    | 14.71 |
| JRL           | 13    | 49     | 3.77  |
| AVERAGE       | 20.10 | 259.60 | 12.55 |
| TOTAL         | 201   | 2596   |       |

Table 5.  Statistics for Round 2 only.  The number of transactions (#trans) and rewards do not include those from Round 1.

Table 6 shows the cumulative statistics after Round 2.  The average order accuracy, in particular, improved over that in Round 1 (4.28 vs. 4.90).  Once again, TOTAL RANK = 0.5*RANK(Rewards) + 0.5*RANK(OrderAccuracy).  Overall, Team Power Agent was ranked #1 in terms of the amount of rewards earned ($768), while Team Free Agents was ranked #1 in terms of order accuracy (2.44).  Surprisingly, Team Reagent dropped from #1 after Round 1 to #7.5 after Round 2.  We will see some clues as to the reason for this drop in later sections.

| Team Name     | #trans | Rewards | Efficiency | RANK (Rewards) | Order Accuracy | RANK (Order Accuracy) | TOTAL RANK |
| Free Agents   | 31   | 539   | 17.39 | 2  | 2.44 | 1  | 1.5 |
| Split Second  | 32   | 421   | 13.16 | 5  | 3.06 | 2  | 3.5 |
| Power Agent   | 37   | 768   | 20.76 | 1  | 4.50 | 6  | 3.5 |
| DJ Carpet     | 31   | 425   | 13.71 | 4  | 3.17 | 3  | 3.5 |
| ULM           | 25   | 522   | 20.88 | 3  | 4.67 | 7  | 5   |
| Reagent       | 37   | 283   | 7.65  | 7  | 5.11 | 8  | 7.5 |
| Triple Threat | 26   | 206   | 7.92  | 9  | 4.11 | 4  | 6.5 |
| Wolfpack      | 20   | 210   | 10.50 | 8  | 4.17 | 5  | 6.5 |
| SIB           | 23   | 367   | 15.96 | 6  | 5.61 | 9  | 7.5 |
| JRL           | 16   | 178   | 11.13 | 10 | 5.94 | 10 | 10  |
| AVERAGE       | 27.8 | 391.9 | 14.10 |    | 4.28 |    |     |
| TOTAL         | 278  | 3919  |       |    |      |    |     |

Table 6.  Statistics after Round 2.  All numbers are cumulative.  Team Free Agents was ranked #1 after Round 2.

To compute the final score for the Learning Day (50% of the Game Day), we compute TOTALRANK(Combined) = 0.25*TOTALRANK(Round 1) + 0.75*TOTALRANK(Round 2) for each team.  The different weights are used according to the specification outlined in the Game Day handout.  Table 7 shows the result.  Team Free Agents finished first and thus won Game Day 1.  They were followed closely by Team Power Agent, and then Teams Split Second and DJ Carpet; these three teams were closely bundled.  Team ULM finished fifth and Team Reagent sixth.  Then there was another cluster of three teams: Triple Threat, Wolfpack, and SIB.  Finally, Team JRL finished a distant tenth.

| Team Name     | TOTAL RANK (Round 1) | TOTAL RANK (Round 2) | TOTAL RANK (Combined) | FINAL RANK |
| Free Agents   | 6   | 1.5 | 2.625 | 1  |
| Split Second  | 3   | 3.5 | 3.375 | 3  |
| Power Agent   | 2.5 | 3.5 | 3.25  | 2  |
| DJ Carpet     | 3   | 3.5 | 3.375 | 3  |
| ULM           | 4.5 | 5   | 4.875 | 5  |
| Reagent       | 1   | 7.5 | 5.875 | 6  |
| Triple Threat | 9.5 | 6.5 | 7.25  | 7  |
| Wolfpack      | 9.5 | 6.5 | 7.25  | 7  |
| SIB           | 8   | 7.5 | 7.625 | 9  |
| JRL           | 6.5 | 10  | 9.125 | 10 |

Table 7.  The final ranking of teams.  Team Free Agents finished #1.  Team JRL finished last.
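A minimal sketch of the combined-rank formula above; the function name is illustrative.

```python
def combined_total_rank(round1_total_rank, round2_total_rank):
    """Learning Day combined score: Round 2 carries three times the weight of Round 1."""
    return 0.25 * round1_total_rank + 0.75 * round2_total_rank

# Example with Team Free Agents' numbers from Table 7.
print(combined_total_rank(6, 1.5))   # 2.625
```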

Individual Team Analysis

First, Table 8 shows the learning rate and discount factor used in Round 1 and Round 2 by each team.  Most teams used a learning rate in Round 2 that was lower than or equal to their Round 1 value (average 0.63 in Round 1 vs. 0.53 in Round 2); the exceptions were Team Reagent and Team JRL, who increased it.  Incidentally, referring back to Tables 4 and 6, Team Reagent’s order accuracy worsened from 3.94 to 5.11, while Team JRL’s worsened from 5.06 to 5.94.  Correspondingly, most teams used a discount factor in Round 2 that was higher than or equal to their Round 1 value (average 0.375 in Round 1 vs. 0.475 in Round 2), except for Team SIB.  Once again, referring back to Tables 4 and 6, Team SIB’s order accuracy worsened from 5.28 to 5.61.  Indeed, out of ten teams, four had worse order accuracy in Round 2 than in Round 1.  Three of them have been accounted for above; the remaining team is Team ULM, going from 4.50 to 4.67.  The learning rate and discount factor values thus seem to be important factors in determining the learning performance of the agents.

| Team Name     | Learning Rate (Round 1) | Discount Factor (Round 1) | Learning Rate (Round 2) | Discount Factor (Round 2) |
| Free Agents   | 0.7  | 0.3   | 0.7  | 0.5   |
| Split Second  | 0.85 | 0.4   | 0.85 | 0.70  |
| Power Agent   | 0.7  | 0.3   | 0.5  | 0.6   |
| DJ Carpet     | 0.8  | 0.3   | 0.5  | 0.6   |
| ULM           | 0.9  | 0.3   | 0.7  | 0.5   |
| Reagent       | 0.25 | 0.75  | 0.6  | 0.75  |
| Triple Threat | 0.5  | 0.2   | 0.3  | 0.4   |
| Wolfpack      | 0.75 | 0.1   | 0.75 | 0.1   |
| SIB           | 0.7  | 0.7   | 0.2  | 0.2   |
| JRL           | 0.15 | 0.4   | 0.2  | 0.4   |
| AVERAGE       | 0.63 | 0.375 | 0.53 | 0.475 |

Table 8.  Learning rates and discount factors used by each team for Round 1 and Round 2.

Before we start looking at teams individually, here is a general sense of the two rounds and the role of the intermission’s information sharing.

In general, Round 1 is for exploration, and Round 2 is for a bit more exploitation.  That is, Round 1 should be used to explore different state-action pairs, and as a result one should use a higher learning rate to put more weight on each current transaction and its reward.  The intermission’s information sharing should give each team some idea of how its ordering compares to others’.  If your team’s ordering is very different from others’, perhaps your Q-values for those state-action pairs have not converged; if it is very similar, then perhaps they have.  Given that logic, Round 2 should be more for exploitation if you are confident that your Q-values have converged.  In that scenario, using a lower learning rate and a larger discount factor helps.
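To make the roles of the two parameters concrete, here is the standard one-step Q-learning update written out as a minimal sketch; the variable names, the initialization of the Q-table, and the sample numbers are illustrative assumptions.

```python
# Standard one-step Q-learning update; alpha is the learning rate and
# gamma (called beta in the footnote of this handout) is the discount factor.
def q_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# A larger alpha puts more weight on the current transaction and its reward;
# a larger gamma puts more weight on the (previously learned) future term.
states = ["S1", "S2", "S3", "S4", "S5", "S6"]
actions = ["A1", "A2", "A3"]
Q = {(s, a): 0.0 for s in states for a in actions}   # arbitrary initialization
q_update(Q, "S1", "A1", 30.2, "S4", actions, alpha=0.7, gamma=0.3)
```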

One critical issue, however: what if other teams’ orderings are less accurate than yours?  Since your confidence in your own Q-values depends on how they match up, what should one do?  This is where agent observation comes into play.  For example, your team may observe what other teams are doing.  If a team seldom approaches the Game Day Monitors to submit a transaction, then that team’s learning results are not to be trusted.  Given your observations of other teams’ behaviors, you should be able to disregard untrustworthy orderings, thereby better utilizing the intermission’s information sharing to set your learning rate and discount factor more appropriately.

There are also other factors.  Note that for any learning approach to work, in particular for reinforcement learning to work, there must be sufficient learning episodes.  In this Game Day, that means each team should secure a lot of transactions in order to better model the stochastic nature of the environment. 

Table 9 below shows the correlations among the number of transactions, rewards, and order accuracy values.  As expected, the number of transactions and the rewards received by each team were highly correlated (greater than 0.55 after each round).  Further, the number of transactions and the order accuracy were also rather highly correlated (magnitude greater than 0.42 after each round[2]).  So our intuition is, in general, correct.  Also, rewards and order accuracy were more strongly correlated in Round 1 than in Round 2.  This is also expected: as more teams turned to exploiting what they had learned in Round 1, their focus shifted toward earning rewards rather than refining the Q-values.

| Correlations  | #Trans vs. Rewards | #Trans vs. Accuracy | Rewards vs. Accuracy |
| After Round 1 | 0.556964 | -0.42779 | -0.83065 |
| After Round 2 | 0.616655 | -0.43117 | -0.33744 |

Table 9.  Correlations between number of transactions, rewards, and order accuracy.
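For reference, here is a minimal sketch of how these correlations can be computed, assuming the per-team statistics are collected into lists; the numbers shown are the Round 1 columns from Table 4, and numpy's corrcoef (Pearson correlation) is used.

```python
import numpy as np

# Round 1 per-team statistics from Table 4 (same team order as the table).
trans    = [8, 9, 11, 9, 7, 10, 8, 6, 6, 3]
rewards  = [140, 141, 174, 150, 140, 188, 81, 63, 117, 129]
accuracy = [5.17, 4.11, 4.50, 4.50, 4.50, 3.94, 6.44, 5.50, 5.28, 5.06]

# Pearson correlation coefficients (the off-diagonal entry of np.corrcoef).
print(np.corrcoef(trans, rewards)[0, 1])     # approx.  0.56
print(np.corrcoef(trans, accuracy)[0, 1])    # approx. -0.43
print(np.corrcoef(rewards, accuracy)[0, 1])  # approx. -0.83
```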

Table 10 documents my comments on each team’s worksheets and reports.  My observations are contextualized by the discussions above.

For each team, the entries below cover the pre-game strategy worksheet, the Round 1 tracking worksheet, the mid-game (intermission) notes, the Round 2 tracking worksheet, the post-game report, and my observation.

Free Agents
Pre-Game: Had strategies for both rounds (exploration in Round 1 and exploitation in Round 2); no contingency planning.
Round 1 Tracking: Not properly recorded; a missing ranking for (S3, A1).
Mid-Game: Pointed out that they would focus on obtaining new Q-values for each pair to start Round 2.
Round 2 Tracking: Correctly recorded.
Post-Game: Pointed out that it was important to work efficiently and go through as many transactions as possible; willing to trade off accuracy for more rewards; noted that the learning rate in Round 2 could have been lower.
My Observation: This team executed exceedingly well during the Game Day and was able to balance earning rewards and keeping the ordering accurate.  They were able to perform 2-3 transitions, particularly involving action A3, on all six states, which greatly improved their order accuracy.

Split Second
Pre-Game: Had good strategies for both rounds (exploration in Round 1 and exploitation in Round 2); comprehensive discussion of myopic vs. long-term approaches; contingency planning; distribution of tasks; but no mention of utilizing the intermission’s information sharing.
Round 1 Tracking: Correctly recorded.
Mid-Game: Pointed out that they did not explore the space as much as they would have liked; found a “trap” state (S2) with low rewards; raised the discount factor to be able to see trouble states; retained 0.85 as the learning rate to learn more.
Round 2 Tracking: Almost all correctly recorded, with a couple of missing state-action pairs.
Post-Game: Assumed that the state transitions were deterministic; made use of the “stuck” state to improve Q-values; made use of others’ orderings cautiously; observed the potential advantages and disadvantages of a high discount factor; explained why they chose a high learning rate in Round 2 and countered it with a high discount factor; noted that contracting was not time-cost efficient and rejected contract offers correctly; saw the importance of developing a strong, accurate set of Q-values quickly: “The sooner it is done, the sooner the agent can begin exploiting the environment in a manner close to its true potential.”
My Observation: This team was very well prepared.  Though they used a high learning rate in both rounds, their choice of actions helped them explore the space rather well in Round 2.  They were also able to exploit the rewarding state-action pairs in Round 2 to still gain while exploring.  The high learning rate and high discount factor combination was a bold move, but it appeared to work, as the future term was able to temper the local rewards.

Power Agent
Pre-Game: Had the most thought-out set of strategies for both rounds; contingency planning; making use of information sharing during the intermission; making use of contracting; comprehensive.
Round 1 Tracking: Not properly recorded; the Q-value ordering was incorrect in several places.
Mid-Game: Good notes; pointed out the lack of time to make full use of the information sharing during the intermission; did compare their ordering against the average ordering.
Round 2 Tracking: Correctly recorded.
Post-Game: Pointed out that submitting transactions in Round 1 was a bottleneck that slowed down the learning process; that they adopted a faster way in Round 2; that an agent should be open to dynamically adapting its strategies if needed (GOOD!); and that contracting was not very inviting.
My Observation: This team was very well prepared.  They were also able to home in on very rewarding state-action pairs and thus gained the highest amount of rewards after Round 2.  However, because of that focus, they neglected, to a certain degree, improving the Q-values of the other state-action pairs: 9 of the 18 state-action pairs received only one learning episode.  They should probably have balanced this out.  This is a typical tradeoff problem.

DJ Carpet
Pre-Game: Had good strategies for both rounds (exploration in Round 1 and exploitation in Round 2); distribution of tasks; but no mention of utilizing the intermission’s information sharing; no contingency planning.
Round 1 Tracking: Correctly recorded, except for the Q-value for (S1, A2).
Mid-Game: Pointed out a lot of unvisited states and thus did not change the learning rate and discount factor much.
Round 2 Tracking: (no entry)
Post-Game: Pointed out that they had aimed to hit a productive transition order but were foiled by an “unexpected state transition”.
My Observation: Strategies were appropriate for both rounds, but the team assumed that the state transitions were deterministic.  Strategies were quite opportunistic: if arriving at a state that would generate good rewards with a certain action, the team would pursue it; otherwise, it explored a bit more.

ULM
Pre-Game: Had good strategies for both rounds (exploration in Round 1 and exploitation in Round 2); but no mention of utilizing the intermission’s information sharing; no contingency planning.
Round 1 Tracking: Not properly recorded; the Q-value ordering was incorrect in several places.
Mid-Game: Pointed out that, due to time constraints, they were not able to map the entirety of the space.
Round 2 Tracking: Not properly recorded; the Q-value ordering was incorrect in several places.
Post-Game: Pointed out that their Round 2 strategy led them to prefer “known” paths over “unknown” paths, leading to poor choices with low rewards for some states; that the lack of a sufficient number of transactions hurt the performance of Q-learning; that they assumed the state transitions were deterministic; and argued for a new Q-learning variant to address such a problem.
My Observation: The learning rates used were quite high: 0.9 in Round 1 and 0.7 in Round 2.  As a result, the learning was not stable, leading to the poorer order accuracy after Round 2.  The choice of discount factor was more appropriate.  The team’s assumption that the state transitions were deterministic proved to be critical.  Also, Q-learning should be able to address the stochastic nature of the environment: it will converge given enough learning episodes.  Finally, the team should have made use of the intermission’s information sharing.

Reagent
Pre-Game: The strategies were not quite right: the goal is not just to maximize the total reward, but also to attain high order accuracy for the state-action pairs.  No contingency planning.
Round 1 Tracking: Not recorded.  The ordering was incorrect: the eight state-action pairs with a Q-value of 0.5 should have been ranked at the same position.  (However, I corrected the ordering, and the team still ranked #1 in order accuracy in Round 1.)
Mid-Game: No notes.
Round 2 Tracking: Not recorded.
Post-Game: Concluded that Q-learning falls short in stochastic environments; pointed out a problem with entering the wrong state for a transaction into their code, which led them to use a higher learning rate in Round 2 to try to correct the error.
My Observation: Using a low learning rate in Round 1 was not appropriate, as the initial values of 0.5 were not to be trusted; the discount factor of 0.75 used in Round 1 was also probably too high because it was too far-seeing, which is not appropriate when the Q-values are far from accurate.  Did not exploit information sharing.  The higher learning rate was the main factor in this team’s drop from #1 in Round 1 to #8 in order accuracy in Round 2.  They should have used a lower learning rate; the system would then have gradually corrected the error made in Round 1 while stabilizing the other Q-values.

Triple Threat
Pre-Game: Had strategies for Round 1 but not exactly for Round 2’s learning rate and discount factor (no explanation); an interesting approach of learning the transition probabilities; distribution of tasks; no contingency planning.
Round 1 Tracking: Not properly recorded; the Q-values do not correspond to the ordering submitted.
Mid-Game: The team pointed out that they were stuck in a loop between S2 and S5.
Round 2 Tracking: Not properly recorded; missing Q-values.
Post-Game: Pointed out that “brute forcing our way out” of a loop of low rewards took time, and that they learned that they need to work on their organization.
My Observation: The interesting strategy of trying to model the transition probabilities would require a sufficient number of learning episodes; the cooperation strategies did not consider other teams’ motivations; the team should have made use of the intermission’s information sharing to get out of a loop (see the comments on Wolfpack).

Wolfpack
Pre-Game: Had good strategies for both rounds (exploration in Round 1 and exploitation in Round 2); division of tasks; but no mention of utilizing the intermission’s information sharing.
Round 1 Tracking: Correctly recorded.
Mid-Game: Good notes on (S3, A1) learned from other teams during the intermission.  And the team used the information in Round 2!
Round 2 Tracking: Correctly recorded.
Post-Game: Good notes.  Tried to contract other teams, to no avail; pointed out that exploitation was not as successful in Round 2 because of the lack of exploration in Round 1; that they should have created an Excel sheet to compute the Qs and Vs faster; and that they forgot to consider uncertainty in the environment.  However, they incorrectly concluded that the environment changed in Round 2: the environment did not change; it was simply stochastic.
My Observation: The team was quite well prepared in terms of strategies.  But tactically they were not sufficiently prepared, as they could not compute by hand as fast as other teams, and as a result they did not submit enough transactions.  On the other hand, this team made use of the intermission’s information sharing to immediately choose to perform A1 as soon as they observed S3.  Good move.

SIB
Pre-Game: Had strategies for both rounds (exploration in Round 1 and exploitation in Round 2); but no mention of utilizing the intermission’s information sharing.
Round 1 Tracking: Correctly recorded.
Mid-Game: No notes.
Round 2 Tracking: Not properly recorded; the ordering was not correctly reported.  S4-A1 should be #15, and then the last three should be #16.
Post-Game: No notes.
My Observation: The team was prepared in terms of overall strategy, but it did not have any “real-time” strategy to make use of the information sharing.  The team also did not report its ordering properly (for one state-action pair in Round 2).

JRL
Pre-Game: A simple pre-game strategy, with no contingency plan and no strategy for exploiting the information sharing during the intermission; no task allocation among the team members.
Round 1 Tracking: Poorly recorded.  Further, given only three transactions affecting only S6, S3, and S2, it was impossible for JRL to turn in a high-resolution ordering of the state-action pairs as shown in Table 2.
Mid-Game: Pointed out that they increased the learning rate from 0.15 to 0.2, anticipating insufficient transactions again in Round 2.
Round 2 Tracking: Correctly recorded.
Post-Game: The team realized that they were not well prepared for the Game Day; they also pointed out that they were stuck in a loop for 11 iterations and received poor rewards.
My Observation: The team was not well prepared; the pre-game strategy was lacking; and there were simply too few transactions.  For reinforcement learning to work, there must be sufficient learning episodes.  Did not make use of the intermission’s information sharing.

Table 10.  My comments and observations of team strategies, worksheets, and reports.

Lessons Learned

Here are some overall lessons learned.

1.       There was no motivation for the teams to cooperate via the contracting process.  In this case, the design of the MAS environment did not provide any benefits for the agents to cooperate that could offset the time-cost.  Besides, the information sharing during the intermission, if it were to be used, should provide sufficient help.

2.       Several teams pointed out that they got stuck in a loop of state transitions yielding no or very low rewards.  That is true.  However, as an agent, when this was observed, each team could still gain by using the opportunity to refine its Q-values for the state-action pairs involved in the loop.

3.       More transactions led to better learning, as shown in the above correlation numbers (Table 9).  Thus, acting quickly and efficiently was critical. Teams that were slow in submitting their state tokens received fewer transactions, leading to poorer performances.

4.       Lowering the learning rate or keeping it the same appeared to work better than increasing the learning rate from Round 1 to Round 2 for this MAS environment.  In general, increasing the learning rate as time progresses would tend to unlearn what has been learned.

5.       Using a high discount factor could have a clamping effect on the volatility brought on by a high learning rate.  This is because the future term of the update essentially brings previously learned Q-values back into the fray.

6.       The information sharing during the intermission was exploited by only a handful of teams.  As alluded to earlier, comparing your ordering with others’ could help you decide on your learning rate and discount factor.  It could also help you decide which action to perform in Round 2 for a given state.

7.       Why did the rewards per transaction (efficiency) drop?  There were two factors.  First, because there were more than twice as many transactions in Round 2 as in Round 1 (201 vs. 77), the law of averages came into play: with more samples, the rewards drawn from the simulated Gaussian distributions tended toward their means.  Second, most teams chose a smaller learning rate in Round 2 and, as a result, observed fewer new states, which could also add to the law-of-averages effect.

8.       Several teams pointed out the nature of a tradeoff at play: trying to maximize rewards while trying to maximize the order accuracy.  These two objectives are in a tug-of-war: maximizing rewards reduces exploration and increases exploitation, and vice versa for maximizing the order accuracy.  Several teams adopted an opportunistic balancing act: if they encountered a “rewarding” state, they would keep acting on it until it transitioned out; and if they encountered a new state, they would consult the shared information (the Excel file of all orderings from Round 1) to pick the likely useful action.

9.       Teams that were prepared were ranked higher.  As an agent, each team should be observant, adaptive, responsive, and reflective.

Game Day League

Here are the League Standings.

| Team Name     | Learning Day | Voting Day | Auction Day | League Standings |
| Free Agents   | 1  |  |  | 1  |
| Power Agent   | 2  |  |  | 2  |
| DJ Carpet     | 3  |  |  | 3  |
| Split Second  | 3  |  |  | 3  |
| ULM           | 5  |  |  | 5  |
| Reagent       | 6  |  |  | 6  |
| Triple Threat | 7  |  |  | 7  |
| Wolfpack      | 7  |  |  | 7  |
| SIB           | 9  |  |  | 9  |
| JRL           | 10 |  |  | 10 |

 



[1] In our simulation, we varied the learning rate (alpha) from 0.1 to 1.0 in increments of 0.1, and the discount factor (beta) in the same way, giving 100 different configurations.  Each run consisted of 10,000 actions, and we ran each configuration 100 times to obtain a reasonable set of averages.  So, all in all, we ran 100 x 100 runs of 10,000 actions each, for a total of 100 million actions.
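A minimal sketch of that parameter sweep, reusing the step and q_update functions (and the states/actions lists) from the earlier sketches; the epsilon-greedy action selection is an illustrative assumption, since the handout does not say how actions were chosen in the ground-truth runs.

```python
import random

alphas = [round(0.1 * i, 1) for i in range(1, 11)]   # learning rates 0.1, 0.2, ..., 1.0
betas  = [round(0.1 * i, 1) for i in range(1, 11)]   # discount factors 0.1, 0.2, ..., 1.0

def run_once(alpha, beta, num_actions=10_000):
    """One run: num_actions Q-learning steps from a random start state."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    state = random.choice(states)
    for _ in range(num_actions):
        # Action selection is assumed (epsilon-greedy); the handout does not specify it.
        if random.random() < 0.1:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        q_update(Q, state, action, reward, next_state, actions, alpha, beta)
        state = next_state
    return Q

# 100 configurations x 100 runs x 10,000 actions = 100 million actions in total.
for alpha in alphas:
    for beta in betas:
        q_tables = [run_once(alpha, beta) for _ in range(100)]
```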

 

[2] The correlation is negative because a smaller order-accuracy value indicates a more accurate ordering.