Thank you, Alejandra!
Indeed, most of the literature seems to focus on multi-armed bandits. The idea can be extended to full state spaces more generically, e.g., by selecting actions via max_a [R(s,a) + upper_bound(V(s'))], where s' is the successor state reached from (s,a), so that new states are sampled in line with their associated uncertainty. In infinite-horizon problems you could also pick states with high upper bounds directly, sidestepping the dependency on stochastic transitions.
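To make that concrete, here is a minimal sketch of such optimistic action selection in a tabular setting, assuming a count-based bonus on successor-state values; the class, the one-step model interface, and the parameter names are illustrative assumptions, not a specific published algorithm:

```python
import math
from collections import defaultdict

# Sketch: choose argmax_a [ R(s,a) + upper_bound(V(s')) ], where the
# upper bound is the current value estimate plus a count-based bonus.

class OptimisticAgent:
    def __init__(self, actions, c=1.0, v_max=1.0):
        self.actions = actions
        self.c = c                                    # exploration scale (assumed)
        self.v_max = v_max                            # optimistic initial value
        self.value = defaultdict(lambda: v_max)       # V_hat(s)
        self.n_visits = defaultdict(int)              # visit counts N(s)

    def upper_bound(self, state):
        """Optimistic value: mean estimate plus a bonus shrinking with visits."""
        n = self.n_visits[state]
        if n == 0:
            return self.v_max                         # unvisited states stay maximally optimistic
        return self.value[state] + self.c / math.sqrt(n)

    def select_action(self, state, model):
        """Pick argmax_a R(s,a) + upper_bound(s').

        `model(state, action)` is assumed to return (reward, next_state);
        with stochastic transitions you would average over sampled successors.
        """
        def score(a):
            reward, next_state = model(state, a)
            return reward + self.upper_bound(next_state)
        return max(self.actions, key=score)
```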
In finite-horizon problems that would be trickier. Perhaps the literature on Upper Confidence bounds applied to Trees (UCT) is of interest to you? It combines Monte Carlo Tree Search with the notion of confidence bounds, and has been deployed quite successfully in game-playing applications with sequential decisions.
https://www.jair.org/index.php/jair/article/view/11099/26289
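For intuition, here is a minimal sketch of UCT's selection step, i.e., descending the search tree by the child with the highest UCB1-style score; the node layout and attribute names are illustrative assumptions rather than the exact API from the linked paper:

```python
import math

# Each node tracks its visit count and the sum of simulation returns
# that passed through it; selection balances average return and an
# exploration bonus based on the parent's visit count.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []       # expanded child nodes
        self.visits = 0          # N(node)
        self.total_return = 0.0  # sum of simulation returns through this node

    def ucb_score(self, c=1.4):
        """Average return plus an exploration term that grows with the
        parent's visits and shrinks with this node's own visits."""
        if self.visits == 0:
            return float("inf")  # always try unvisited children first
        exploit = self.total_return / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select(node):
    """Walk down the tree, following the best-scoring child at each level,
    until reaching a leaf (which would then be expanded and simulated)."""
    while node.children:
        node = max(node.children, key=lambda child: child.ucb_score())
    return node
```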