Chennai Mathematical Institute

Seminars

Computer Science Seminar
Date & Time: Monday, 19 September 2022, 14:00 IST
Venue: CMI Seminar Hall
Towards a Better Theoretical Understanding of Policy Iteration

Shivaram Kalyanakrishnan
IIT Bombay.
19-09-22


Abstract

The Markov Decision Process (MDP) is a well-studied abstraction of sequential decision making, with applications ranging from scheduling and control to game-playing and financial trading. An MDP models an agent's environment. Given an MDP, the foremost question is how the agent should choose actions from different states so as to maximise long-term reward. Formally, the MDP PLANNING problem is that of computing an optimal POLICY for a given MDP.
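
A minimal sketch (in Python, purely for concreteness; the array names T and R, the discount factor gamma, and the discounted-reward criterion are illustrative assumptions, not details from the talk) of how such an MDP can be represented:

    import numpy as np

    # A toy discounted MDP: n states, k actions per state (illustrative sizes).
    n, k = 3, 2
    rng = np.random.default_rng(0)

    # T[s, a] is a probability distribution over next states.
    T = rng.random((n, k, n))
    T /= T.sum(axis=2, keepdims=True)

    # R[s, a] is the expected immediate reward for taking action a in state s.
    R = rng.random((n, k))

    gamma = 0.9  # discount factor defining "long-term reward"

    # A deterministic policy maps each state to an action.
    policy = np.zeros(n, dtype=int)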

Introduced by Howard in 1960, Policy Iteration (PI) is an elegant, practically effective, and widely used family of algorithms for MDP planning. Given a policy, it is straightforward to evaluate whether it is optimal, and if it is not, to identify a set of improving policies. The idea behind PI is to repeatedly update to some policy from the improving set until an optimal policy is obtained. For an MDP with n states and k actions per state, the total number of policies, k^{n}, serves as a trivial upper bound on the number of iterations that PI can take to terminate. Does PI enjoy a tighter upper bound? To date, only marginal progress has been made on this relatively basic question, even though PI itself has been used for over half a century.
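
The loop itself is short. The following sketch (again an illustration, assuming the discounted setting and the T, R, gamma of the sketch above) implements Howard's variant, which switches every improvable state to a greedy action at once; other members of the PI family differ only in which improving policy they pick.

    import numpy as np

    def evaluate(policy, T, R, gamma):
        # Solve (I - gamma * T_pi) V = R_pi for the policy's value function.
        n = T.shape[0]
        T_pi = T[np.arange(n), policy]   # n x n transitions under the policy
        R_pi = R[np.arange(n), policy]   # rewards under the policy
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

    def policy_iteration(T, R, gamma):
        n, k, _ = T.shape
        policy = np.zeros(n, dtype=int)
        while True:
            V = evaluate(policy, T, R, gamma)
            # Q[s, a]: value of taking a in s, then following the current policy.
            Q = R + gamma * (T @ V)
            greedy = Q.argmax(axis=1)
            if np.allclose(Q[np.arange(n), greedy], V):
                return policy, V         # no state can be improved: optimal
            policy = greedy              # Howard's variant: switch all improvable states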

In 1999, Mansour and Singh showed an upper bound of k^{n}/n iterations for a deterministic variant of PI and roughly (k/2)^{n} expected iterations for a randomised variant. In a recent line of work, we have tightened the bound for deterministic variants to k^{0.7207 n}, and for randomised variants to (2 + ln(k - 1))^{n}. Are these bounds tight? Based on trends seen in our analysis, as well as experimental results, we conjecture that PI must enjoy even tighter upper bounds. In fact, the tightest LOWER bound shown to date for the most common variant of PI is only linear in n, leaving open the possibility of the true complexity being anywhere between linear and exponential!
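
For a sense of scale, an illustrative evaluation of the deterministic-case bounds at n = 100 and k = 2 (these specific numbers play no role in the analysis):

    n, k = 100, 2
    print(k ** n)             # trivial bound k^n: ~1.27e+30 policies
    print(k ** n / n)         # Mansour & Singh (1999): k^n / n, ~1.27e+28
    print(k ** (0.7207 * n))  # tightened bound k^{0.7207 n}: ~4.96e+21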

I will present the MDP planning problem, describe PI, outline our latest analysis, and conclude with a list of open problems.