Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability

By Yuchen Hu and Stefan Wager

The Annals of Statistics

August 2023, Vol. 51, Issue 4, Pages 1561–1585.

Economics

Operations, Information & Technology

We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy, given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length) with an exponent that depends on the overlap of the target and behavior policies as well as the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax, given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes but strictly easier than model-free off-policy evaluation.
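To make the idea of importance weighting over a partial history concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's exact estimator: it assumes each reward is reweighted by the product of target-to-behavior action probability ratios over only the most recent `k` steps, and that the weighted rewards are then averaged without self-normalization. The function name `partial_history_iw`, the array layout, and the window parameter `k` are hypothetical choices made for this example; the published estimator may differ in indexing and normalization.

```python
import numpy as np


def partial_history_iw(rewards, target_probs, behavior_probs, k):
    """Illustrative partial-history importance-weighted mean reward.

    rewards, target_probs, behavior_probs: array-likes of shape (n, T), where
    n is the number of trajectories and T their length. Entry [i, t] of each
    probability array is the probability that policy assigns to the action
    actually taken at step t of trajectory i.
    k: number of most recent steps whose ratios enter each weight (assumed
    tuning parameter, 1 <= k <= T).
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = (np.asarray(target_probs, dtype=float)
              / np.asarray(behavior_probs, dtype=float))
    n, T = rewards.shape
    if not 1 <= k <= T:
        raise ValueError("k must satisfy 1 <= k <= T")

    weighted = []
    for t in range(k - 1, T):
        # Weight uses only the last k policy ratios, not the full history.
        w = np.prod(ratios[:, t - k + 1 : t + 1], axis=1)
        weighted.append(w * rewards[:, t])

    # Average over trajectories and over all steps with a full k-step window.
    return float(np.mean(weighted))
```

In this sketch a larger `k` uses more of the history, which should reduce bias from ignoring earlier actions at the cost of higher-variance weights; the abstract's dependence of the error rate on overlap and mixing time is consistent with that kind of trade-off. When the target and behavior policies coincide, every ratio is 1 and the function reduces to a plain average of the rewards from step `k` onward.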
