Structured policies in dynamic programming: solving a toy example
I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days and every night need to pick a restaurant to get dinner at. The qualities of the restaurants are iid according to distribution $F$ (assume [0,1]). The goal is to maximize the sum of the qualities of the restaurants that you get dinner at over the $N$ days. Every day you need to choose whether you go to a new restaurant and obtain a utility drawn at random from distribution $F$, or you go to the best restaurant you have attended so far, and obtain a known utility.
Here is my formulation:
- Let $m_t$, the state at time $t$, denote the quality of the best restaurant visited so far.
- Let $a_t \in \{Q, E\}$ be the action at time $t$; $E$ for EXPLORE, $Q$ for quit.
- Given the action set, the rewards are ($W_t$ denotes the random variable of the quality of the restaurant drawn at random at time $t$): $$r_t(m_t | E) = W_t \sim F $$ $$r_t(m_t | Q) = m_t $$
- The transition probabilities at time $t$ are: $$ p_t\big ( m_t | m_t, Q \big ) =1 $$ $$ p_t\big ( m_t | m_t, E \big ) =F(m_t), \text{ with } W_{t} \in [0, m_t]$$ $$ p_t\big ( W_t | m_t, E \big ) =1-F(m_t), \text{ with } W_{t} \in ( m_t, 1] $$
First I write down the Bellman optimality equations: For $t=N$, I get: $$ u^*_N (m_N) = \max_{a_t \in \{Q, E\}} \big\{ m_N, \ \ E(W_N) \big\} $$ For $tN$, I get: $$ u^*_t (m_t) = \max_{a_t \in \{Q, E\}} \bigg\{ m_t+ u^*_{t+1} (m_t) , \ \ \ E(W_t)+ u^*_{t+1} (m_t) F(m_t) + \int_{m_t}^1 u^*_{t+1} (x) dF(x) \bigg\} $$
What I am trying to show is that, if it is optimal to quit exploration in period $\tau$, then it is optimal to not explore in every period $t=\tau+1, ..., N$. In agreement with the equations above, I assume: $$m_\tau- E(W_\tau) \geq \int_{m_\tau}^1 (u^*_{\tau+1} (x)-u^*_{\tau+1} (m_\tau)) dF(x) $$ and I want to show: $$m_\tau - E(W_{\tau+1}) = m_\tau - E(W_{\tau})\geq \int_{m_\tau}^1 (u^*_{\tau+2} (x)-u^*_{\tau+2} (m_t)) dF(x) $$
I have been reading the structured policies chapters on Puterman and Bertsekas (Vol II) and did not find anything that could get me rolling. Any help or pointer to literature is more than welcome!
After I show this I can argue that there will exist a sequence of thresholds $c_t$ so that is is optimum to stop exploration at time $t$ it $m_t \geq c_t$.
EDIT: Distribution $F$ is know to the tourist.
Topic dynamic-programming mathematics optimization
Category Data Science