Bandits with switching cost

weixin_33912445, posted on 2015-02-05

Recently I have been studying the bandit problem with switching cost. Recall that a bandit problem is an online game played against an adversary. In every round t, you choose one of K arms to play and then suffer a loss determined by the adversary.

For the stochastic bandit, every loss/reward from the same arm is drawn from the same distribution. To control the regret, we need to estimate the mean of each arm from the sampled rewards and keep track of the number of times each arm has been pulled. The adversarial bandit has no such condition. In every round, the adversary picks a loss vector, the player chooses one arm and suffers the corresponding loss. To control the regret here, one usually maintains a probability distribution over all arms, chooses an arm according to that distribution, and then updates the distribution according to the received loss.
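The "maintain a distribution, sample, update" loop described above can be sketched as an Exp3-style procedure. This is only a minimal illustration, not something taken from the post: the function name `exp3`, the callback `loss_fn(t, arm)`, the learning rate `eta`, and the assumption that losses lie in [0, 1] are all choices made for the example.

```python
import math
import random

def exp3(loss_fn, n_arms, horizon, eta, rng=None):
    """Exp3-style loop. loss_fn(t, arm) is assumed to return a loss in [0, 1]."""
    rng = rng or random.Random(0)
    cum_est = [0.0] * n_arms  # importance-weighted cumulative loss estimates
    pulls = [0] * n_arms
    total_loss = 0.0
    for t in range(horizon):
        # Exponential weights, shifted by the minimum for numerical stability.
        low = min(cum_est)
        weights = [math.exp(-eta * (c - low)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        # Sample an arm from the current distribution.
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(t, arm)
        total_loss += loss
        pulls[arm] += 1
        # Only the played arm's loss is observed; dividing by its
        # probability keeps the loss estimate unbiased.
        cum_est[arm] += loss / probs[arm]
    return total_loss, pulls

if __name__ == "__main__":
    # Toy oblivious adversary: arm 0 is always cheaper than the others.
    total, pulls = exp3(lambda t, a: 0.1 if a == 0 else 0.9,
                        n_arms=3, horizon=2000, eta=0.02)
    print(total, pulls)
```

Note that this plain loop may change arms in almost every round, which is exactly what a switching cost penalizes.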

There is a variant of the bandit problem in which the player suffers an additional switching cost whenever the arm chosen in this round differs from the arm chosen in the previous round. This pushes the player to prefer staying on the same arm. The problem becomes hard if the adversary is non-oblivious, that is, if it remembers the player's past choices. If the player has stuck to one arm in the past, the adversary will probably put more loss on that arm in the next round. Even though sticking to one arm can then perform very badly, it can still serve as a baseline: we can compare the performance of a designed strategy against that of a constant arm.
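To make the cost accounting concrete, here is a small sketch that adds a switching cost to the played losses and compares the result with the best constant arm. It assumes an oblivious adversary, so that the whole loss table is fixed in advance and the comparison is meaningful. The names `total_cost_with_switching` and `best_constant_arm_cost` and the unit `switch_cost` are hypothetical.

```python
def total_cost_with_switching(arms, losses, switch_cost=1.0):
    """arms[t] is the arm played in round t; losses[t][k] is the loss of arm k in round t."""
    play_loss = sum(losses[t][arm] for t, arm in enumerate(arms))
    # One switch is counted each time consecutive rounds use different arms.
    n_switches = sum(1 for prev, cur in zip(arms, arms[1:]) if prev != cur)
    return play_loss + switch_cost * n_switches

def best_constant_arm_cost(losses):
    """Cost of the best single arm in hindsight; a constant arm never switches."""
    n_arms = len(losses[0])
    return min(sum(row[k] for row in losses) for k in range(n_arms))

if __name__ == "__main__":
    losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    arms = [0, 1, 0]  # always plays the zero-loss arm, but switches twice
    regret = total_cost_with_switching(arms, losses) - best_constant_arm_cost(losses)
    print(regret)
```

In this toy table the player who chases the zero-loss arm pays only for switching, yet still ends up worse than the best constant arm.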

For the adversarial bandit problem with switching cost against an oblivious adversary, the policy regret, measured against the best constant strategy, has an upper bound of O(T^{2/3}). Last year Dekel et al. proved a matching lower bound on the minimax regret (up to logarithmic factors). Since the lower bound meets the upper bound, there is little left to explore in this setting. For the stochastic setting, where every loss/reward comes from a fixed distribution with different parameters for different arms, an upper bound of O(log T) has been proved.
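One standard way to see where a T^{2/3} rate can come from is mini-batching: split the T rounds into batches of length τ, play a single arm throughout each batch, and run an ordinary adversarial bandit algorithm over the T/τ batches. The following is only a heuristic sketch under assumptions not stated in the post (unit switching cost, a square-root regret rate for the underlying algorithm, with K and logarithmic factors ignored), and it is not necessarily the argument used in the cited results.

```latex
% R(\tau): regret bound as a function of the batch length \tau
R(\tau) \;\lesssim\;
  \underbrace{\tau \sqrt{T/\tau}}_{\text{bandit regret over } T/\tau \text{ batches}}
  \;+\;
  \underbrace{T/\tau}_{\text{at most one switch per batch}}
  \;=\; \sqrt{\tau T} + \frac{T}{\tau},
\qquad
\tau = T^{1/3} \;\Longrightarrow\; R = O\!\left(T^{2/3}\right).
```

The two terms balance when τ^{3/2} = T^{1/2}, which gives τ = T^{1/3}.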

A natural next step might be to extend this line of research to m-memory-bounded adversaries, rather than restricting attention to switching cost only.
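As I understand the usual definition, an m-memory-bounded adversary is one whose loss in round t may depend on the player's last m+1 actions, and the switching cost is then the special case m = 1. The sketch below wraps an oblivious loss into such a loss over the last two actions. The names `with_switching_memory` and `base_loss` are hypothetical, and `prev_arm=None` is an assumed convention for the first round.

```python
def with_switching_memory(base_loss, switch_cost=1.0):
    """Turn an oblivious loss base_loss(t, arm) into a loss over the last two actions."""
    def loss(t, prev_arm, arm):
        # The penalty looks only one step back, so the memory is m = 1.
        switched = prev_arm is not None and prev_arm != arm
        return base_loss(t, arm) + (switch_cost if switched else 0.0)
    return loss

if __name__ == "__main__":
    loss = with_switching_memory(lambda t, arm: 0.5)
    print(loss(0, None, 1), loss(1, 1, 1), loss(2, 1, 0))
```

A general m-memory-bounded loss would take the last m+1 actions as arguments instead of only the previous one.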
