Abstract
We study the linear contextual bandit problem.
When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT \log T \log n})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated-logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ factors compared to previous analyses, and our information-theoretic lower bound also improves on previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from that of classical multi-armed bandits, in which no logarithmic $T$ term is present in the minimax regret (Audibert and Bubeck, 2009).
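To fix ideas, here is a minimal sketch of the standard LinUCB score on which SupLinUCB is built: each action's index is a ridge-regression reward estimate plus a confidence width, and the VCL idea, as the abstract describes it, is to vary the confidence-level multiplier rather than fixing it at a worst-case level. This is not the talk's VCL SupLinUCB algorithm; the function name and interface are illustrative.

```python
import numpy as np

def linucb_scores(V, b, actions, alpha):
    """Upper confidence bounds for candidate actions (illustrative sketch).

    V      : (d, d) regularized Gram matrix, lambda * I + sum_s x_s x_s^T
    b      : (d,)   reward-weighted feature sum, sum_s r_s x_s
    actions: (n, d) feature vectors of the n candidate actions
    alpha  : confidence-level multiplier (the quantity VCL varies)
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                # ridge-regression estimate of theta
    means = actions @ theta_hat          # estimated mean rewards
    # widths[i] = sqrt(x_i^T V^{-1} x_i), the elliptical confidence width
    widths = np.sqrt(np.einsum("nd,de,ne->n", actions, V_inv, actions))
    return means + alpha * widths        # optimism: estimate plus width
```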
When the action set is infinitely large, we prove a regret upper bound of $O(\sqrt{d^2 T \log T}) \times \mathrm{poly}(\log\log T)$. This upper bound matches the lower bound of $\Omega(\sqrt{d^2 T \log T})$ (a corollary of our lower bound for finite action sets) up to iterated-logarithmic terms.
On the upper bound side, our proof techniques include variable confidence levels and a careful analysis of the layer sizes of SupLinUCB (Chu et al., 2011) for finite action sets, and sharp tail bounds for self-normalized empirical processes for the infinite action set case. On the lower bound side, we delicately construct adversarial sequences showing the tightness of the elliptical potential lemma.
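For reference, a standard form of the elliptical potential lemma whose tightness the lower-bound construction targets (constants vary across sources; this statement is not taken from the talk itself):

```latex
% Elliptical potential lemma (standard form).
% For x_1, ..., x_T in R^d with ||x_t|| <= 1 and
% V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top:
\[
  \sum_{t=1}^{T} \min\bigl( 1, \, \|x_t\|_{V_{t-1}^{-1}}^{2} \bigr)
  \;\le\; 2 \log \frac{\det V_T}{\det V_0}
  \;\le\; 2 d \log\!\Bigl( 1 + \frac{T}{\lambda d} \Bigr),
  \qquad \text{where } \|x\|_{A}^{2} := x^\top A x .
\]
```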
Time
2019-06-18 14:00 ~ 14:45

Speaker
Yuan Zhou, Indiana University at Bloomington

Room
Room 102, School of Information Management & Engineering, Shanghai University of Finance & Economics