In the multi-armed bandit problem, a gambler must decide which arm
of K non-identical slot machines to play in a sequence of trials
so as to maximize his reward.
This classical problem has received much attention because of the
simple model it provides of the trade-off between
exploration (trying out each arm to find the best one) and
exploitation (playing the arm believed to give the best payoff).
Past solutions for the bandit problem have almost always relied on
assumptions about the statistics of the slot machines.
In this work, we make no statistical assumptions whatsoever about the
nature of the process generating the payoffs of the slot machines.
We give a solution to the bandit problem in which an adversary,
rather than a benign stochastic process, has complete control over
the payoffs. In a sequence of T plays, we prove that the expected
per-round payoff of our algorithm approaches that of the best arm
at the rate T^(-1/3), and we give an improved rate of convergence
when the best arm has fairly low payoff. We also consider a setting
in which the player has a team of ``experts'' advising him on which
arm to play; here, we give a strategy that will guarantee expected
payoff close to that of the best expert.
Finally, we apply our result to the problem of learning to play an
unknown repeated matrix game against an all-powerful adversary.