Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach