Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, Bo Wu 
PosterID: 58

Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem in which the high-level knowledge takes the form of reward machines, a type of Mealy machine that encodes non-Markovian reward functions. We focus on a setting in which this knowledge is not available to the learning agent a priori. We develop an iterative algorithm that jointly infers reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It uses a separate q-function for each state of the current hypothesis reward machine to determine the policy, and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards predicted by the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation that we define between states of reward machines, we transfer the q-functions between the hypothesis reward machines of consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit. The experiments show that learning high-level knowledge in the form of reward machines leads to fast convergence to optimal policies in RL, whereas the baseline RL methods fail to converge to optimal policies after a substantial number of training steps.
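To make the ingredients of the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows a reward machine as a Mealy machine over event labels, the consistency check that decides whether an episode should be added to the sample, and a q-learning update that keeps one q-table per reward-machine state. All names (`RewardMachine`, `is_counterexample`, `q_update`), the labels `"a"` and `"b"`, and the convention that unlisted transitions self-loop with reward 0 are assumptions made for illustration. The inference of a new hypothesis machine from the sample and the transfer of q-functions between hypotheses are not shown.

```python
from collections import defaultdict


class RewardMachine:
    """Mealy machine: (state, label) -> (next_state, reward)."""

    def __init__(self, initial, transitions):
        self.initial = initial
        # transitions: dict mapping (state, label) to (next_state, reward)
        self.transitions = transitions

    def step(self, state, label):
        # Assumption: an unlisted (state, label) pair self-loops with reward 0.
        return self.transitions.get((state, label), (state, 0.0))

    def rewards(self, labels):
        """Reward sequence the machine outputs for a sequence of labels."""
        state, out = self.initial, []
        for label in labels:
            state, reward = self.step(state, label)
            out.append(reward)
        return out


def is_counterexample(hypothesis, labels, observed_rewards):
    """True if the observed rewards disagree with the hypothesis machine."""
    return hypothesis.rewards(labels) != list(observed_rewards)


def q_update(q, u, s, a, r, u_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One q-learning step on the q-table of reward-machine state u."""
    best_next = max(q[u_next][(s_next, b)] for b in actions)
    q[u][(s, a)] += alpha * (r + gamma * best_next - q[u][(s, a)])


# Hypothetical task: reward 1 only when "b" is observed after "a".
# The reward depends on history, so it is non-Markovian in the labels.
true_rm = RewardMachine(0, {(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)})

# Initial hypothesis: a single state that always outputs reward 0.
hypothesis = RewardMachine(0, {})

labels = ["a", "b"]
observed = true_rm.rewards(labels)  # stands in for environment rewards

# The episode contradicts the hypothesis, so it would join the sample.
inconsistent = is_counterexample(hypothesis, labels, observed)

# One q-table per reward-machine state, keyed by (env_state, action).
q = defaultdict(lambda: defaultdict(float))
q_update(q, u=1, s="s0", a="go", r=1.0, u_next=2, s_next="s1",
         actions=["go", "stay"])
```

Indexing the q-tables by reward-machine state is what lets an ordinary q-learning update handle a history-dependent reward: the machine state summarizes the part of the history that the reward depends on.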
Canb: 10/29/2020, 00:00 – 01:00; 10/29/2020, 18:00 – 19:00
Paris: 10/28/2020, 14:00 – 15:00; 10/29/2020, 08:00 – 09:00
NYC: 10/28/2020, 09:00 – 10:00; 10/29/2020, 03:00 – 04:00
LA: 10/28/2020, 06:00 – 07:00; 10/29/2020, 00:00 – 01:00
Learning Sequences of Approximations for Hierarchical Motion Planning
Martim Brandão, Ioannis Havoutis
Joint Inference of Reward Machines and Policies for Reinforcement Learning
Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, Bo Wu
Imitation Learning over Heterogeneous Agents with Restraining Bolts
Giuseppe De Giacomo, Marco Favorito, Luca Iocchi, Fabio Patrizi
Refining Process Descriptions from Execution Data in Hybrid Planning Domain Models
Alan Lindsay, Santiago Franco, Rubiya Reba, Thomas L. McCluskey