PARKING RL TRIALS
We explore the parking environment from highway-env, a collection of environments for autonomous driving and tactical decision-making tasks. Parking is a goal-conditioned continuous control task in which the ego-vehicle must park in a given space with the appropriate heading. We train policy-based, Q-learning-based, and model-based reinforcement learning (RL) agents and compare their quantitative and qualitative results. We then modify the model-based agent's observation space and reward function to improve its learning and shape its behavior.
The agent's observation is a six-element vector: its x and y position, its velocity, and its heading h encoded as sin h and cos h. The episode ends when the vehicle is close enough to the goal, crashes, or reaches the maximum time limit. The action space is continuous: the agent controls its acceleration and steering. The reward is based on the agent's proximity to the current parking goal, with the distance computed as a weighted p-norm.
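The reward described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code used in these trials: the function name parking_reward, the feature order, the weight values, and the exponent p = 0.5 are all assumptions chosen for the example. It takes the weighted sum of absolute errors between the achieved and desired state, raises it to the power p, and negates it so that a closer state earns a higher reward.

```python
import numpy as np

# Illustrative values only (assumed, not taken from the trials above):
# one weight per observation feature, in the assumed order
# [x, y, vx, vy, cos_h, sin_h].
REWARD_WEIGHTS = np.array([1.0, 0.3, 0.0, 0.0, 0.02, 0.02])


def parking_reward(achieved, desired, weights=REWARD_WEIGHTS, p=0.5):
    """Negative weighted distance between the achieved and desired state."""
    achieved = np.asarray(achieved, dtype=float)
    desired = np.asarray(desired, dtype=float)
    # Weighted sum of per-feature absolute errors.
    weighted_distance = np.dot(np.abs(achieved - desired), weights)
    # Raise to the power p and negate: zero at the goal, negative elsewhere.
    return -np.power(weighted_distance, p)
```

With these assumed weights, position errors dominate the reward and heading errors contribute only slightly, which is the balance the reward-weight modification later changes.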
We use this environment to test three types of RL algorithms and compare their results given approximately the same number of samples from the environment. The on-policy algorithm is proximal policy optimization (PPO), the off-policy algorithm is soft actor-critic (SAC), and the model-based algorithm uses the cross-entropy method (CEM).
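For readers unfamiliar with CEM, the core loop can be sketched as below. This is a generic, minimal version under stated assumptions, not the planner used in these trials: the function name cem_minimize, the population size, the elite count, and the iteration count are all hypothetical. In a model-based agent, the vector being optimized would typically be a flattened sequence of actions, and the cost would be the negative return predicted by the learned dynamics model.

```python
import numpy as np


def cem_minimize(cost_fn, dim, n_iters=30, pop_size=100, n_elite=10, seed=0):
    """Minimize cost_fn over vectors of length dim with the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    std = np.ones(dim)
    for _ in range(n_iters):
        # Sample candidate solutions from the current Gaussian.
        samples = rng.normal(mean, std, size=(pop_size, dim))
        costs = np.array([cost_fn(s) for s in samples])
        # Keep the lowest-cost candidates (the elites).
        elite = samples[np.argsort(costs)[:n_elite]]
        # Refit the Gaussian to the elites; the small constant avoids a zero std.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean
```

As a quick check, calling cem_minimize on a simple quadratic cost centered on a known target should return a vector close to that target.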
We also explore whether the agent can learn the goal-based task more efficiently by modifying its observation and reward weights. Given the nature of this environment, model-based methods are the most efficient, so we use the model-based agent for these tests.
For the first modification, we change the agent's input. By default, the agent observes a six-element vector that includes its position, velocity, and rotation. We add the position of the goal to this observation, extending it to an eight-element vector. For the second modification, we change the agent's reward weights to see whether giving more weight to rotation helps the agent align with the parking spot.
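Both modifications can be illustrated with two small helpers. These are sketches under stated assumptions rather than the code used in the trials: the function names add_goal_to_observation and emphasize_heading are hypothetical, the goal is assumed to store its x and y position in its first two entries, and the heading terms are assumed to sit at indices 4 and 5 of the weight vector.

```python
import numpy as np


def add_goal_to_observation(observation, goal):
    """Append the goal's (x, y) position to a six-element observation."""
    observation = np.asarray(observation, dtype=float)
    goal = np.asarray(goal, dtype=float)
    # Assumption: the first two entries of the goal are its x and y position.
    return np.concatenate([observation, goal[:2]])


def emphasize_heading(weights, factor):
    """Return a copy of the reward weights with the two heading terms scaled."""
    new_weights = np.array(weights, dtype=float)
    # Assumption: indices 4 and 5 hold the heading (cos h, sin h) weights.
    new_weights[4:6] *= factor
    return new_weights
```

The first helper turns a six-element observation into an eight-element one; the second leaves the position and velocity weights unchanged and only raises the relative importance of heading alignment.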
CEM Agent with Observation Modification
CEM Agent with Reward Weight Modification
For our first modification, we found that expanding the agent's observation to include the goal itself allowed it to learn the task in fewer CEM iterations. This matched our hypothesis that the change would help the agent train more efficiently. However, it did not improve the overall returns.
For our second modification, we found that the agent converged one iteration faster than the original model-based agent and exhibited qualitative behaviors distinct from those of the other two agents. We increased the reward weights on the agent's rotational alignment with the goal. Although the agent did not always park in an aligned fashion as we had hoped, an interesting behavior emerged: it would turn in front of the parking spot and then reverse into it. This suggests that the reward function and its weights can be useful for encouraging a particular behavior in an agent.
For a more detailed and expansive conclusion, you may view the full report by clicking here.