UCB: Finite-time Analysis of the Multiarmed Bandit Problem. Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002. Read Sections 1, 2, 4, and 5, and the proof of Theorem 1 in Section 3. The proof of Theorem 3 and the appendices are optional.
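A minimal sketch of the UCB1 rule analyzed in this paper, run on a toy Bernoulli bandit. The arm means, horizon, and seed are illustrative choices, not from the paper; the index is the one from Auer et al.: empirical mean plus sqrt(2 ln t / n_j).

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given arm means (toy demo)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # n_j: number of times arm j was pulled
    sums = [0.0] * k      # cumulative reward of arm j
    for t in range(horizon):
        if t < k:
            arm = t       # initialization: pull each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_j)
            arm = max(range(k),
                      key=lambda j: sums[j] / counts[j]
                      + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

With this horizon the best arm dominates the pull counts, while each suboptimal arm is pulled only O(log n) times, matching the finite-time regret bound of Theorem 1.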
A paper that addresses the relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For some theoretical relationships, see the section starting at Section 3.3 (and the referenced appendices). The equivalence of MC and first-visit TD(1) is proven starting at Section 2.4.
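The first-visit/every-visit distinction the paper studies can be seen in a few lines. This toy estimator (episode format and example are illustrative, not from the paper) averages returns over either only the first visit to a state per episode, or every visit:

```python
from collections import defaultdict

def mc_estimates(episodes, gamma=1.0, first_visit=True):
    """Monte Carlo state-value estimates from a list of episodes.
    Each episode is a list of (state, reward) pairs, with the reward
    received on leaving the state."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute the return G_t backwards through the episode
        g, gs = 0.0, []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            gs.append((state, g))
        gs.reverse()
        seen = set()
        for state, g in gs:
            if first_visit and state in seen:
                continue          # skip repeat visits within this episode
            seen.add(state)
            returns[state].append(g)
    return {s: sum(v) / len(v) for s, v in returns.items()}

# One episode that visits state 'A' twice: A (+1) -> A (+0) -> terminal
ep = [('A', 1.0), ('A', 0.0)]
fv = mc_estimates([ep], first_visit=True)   # only the first visit's return
ev = mc_estimates([ep], first_visit=False)  # averages both visits' returns
```

On this episode the first-visit estimate of 'A' is 1.0 (the return from the first visit only), while the every-visit estimate averages returns 1.0 and 0.0 to get 0.5, which is the bias the paper's analysis quantifies.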
Safe Exploration in Markov Decision Processes Moldovan and Abbeel, ICML 2012 (safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state)
Some useful slides (Part C) from Michael Bowling on game theory, stochastic games, and correlated equilibria; and (Part D) from Michael Littman with more on stochastic games.
Autonomous helicopter flight via reinforcement learning. Andrew Ng, H. Jin Kim, Michael Jordan, and Shankar Sastry. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 16, 2004.
This work improves on finetuning by adding new columns to a deep net while never removing the previously learned weights, which avoids catastrophic forgetting. Progressive Neural Networks
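The core mechanism can be sketched in a few lines: the old column's weights stay frozen, and the new column receives its features through a lateral connection. The weights, shapes, and input below are made-up toy values, and real progressive nets use trained multi-layer columns rather than this single layer:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Column 1: weights learned on task 1, then frozen forever (toy values).
W1 = [[0.5, -0.2], [0.1, 0.3]]

# Column 2: new trainable weights W2, plus lateral weights U that read
# column 1's features (only W2 and U would be trained on task 2).
W2 = [[0.2, 0.4], [-0.1, 0.6]]
U  = [[0.3, 0.0], [0.0, 0.3]]

def forward(x):
    h1 = relu(matvec(W1, x))                    # frozen task-1 features
    lateral = matvec(U, h1)                     # lateral connection from column 1
    h2 = relu([a + b for a, b in zip(matvec(W2, x), lateral)])
    return h2

out = forward([1.0, 1.0])
```

Because gradients for task 2 never touch W1, task-1 performance is preserved exactly, while the lateral path lets the new column reuse old features.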