Título:
|
Action selection for MDPs: anytime AO* vs. UCT
|
Autor/a:
|
Bonet, Blai; Geffner, Héctor
|
Abstract:
|
Comunicació presentada a: the 26th AAAI Conference on Artificial Intelligence, celebrada a Toronto, Canadà, del 22 al 26 de juliol de 2012 |
Abstract:
|
In the presence of non-admissible heuristics, A* and
other best-first algorithms can be converted into anytime
optimal algorithms over OR graphs, by simply continuing
the search after the first solution is found. The same
trick, however, does not work for best-first algorithms
over AND/OR graphs, that must be able to expand leaf
nodes of the explicit graph that are not necessarily part
of the best partial solution. Anytime optimal variants
of AO* must thus address an exploration-exploitation
tradeoff: they cannot just ”exploit”, they must keep exploring
as well. In this work, we develop one such variant
of AO* and apply it to finite-horizon MDPs. This
Anytime AO* algorithm eventually delivers an optimal
policy while using non-admissible random heuristics
that can be sampled, as when the heuristic is the cost
of a base policy that can be sampled with rollouts. We
then test Anytime AO* for action selection over large
infinite-horizon MDPs that cannot be solved with existing
off-line heuristic search and dynamic programming
algorithms, and compare it with UCT. |
Abstract:
|
H. Geffner is partially supported by grants TIN2009-10232, MICINN, Spain, and EC-7PM SpaceBook. |
Derechos:
|
© 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org)
|
Tipo de documento:
|
Objeto de conferencia Artículo - Versión aceptada |
Editor:
|
Association for the Advancement of Artificial Intelligence (AAAI)
|
Compartir:
|
|