rl-deep-deterministic-policy-gradient

Description: This reinforcement learning project centers on optimizing a sanding robot's behavior using Deep Deterministic Policy Gradient (DDPG), maximizing sanding efficiency while avoiding painted areas.

File list:
Visuals/
LICENSE

# Reinforcement Learning - Deep Deterministic Policy Gradient

## Table of Contents

1. [Project Overview](#1-project-overview)
2. [Development Tools](#2-development-tools)
3. [Visual Journey](#3-visual-journey)
4. [Implementation Highlights](#4-implementation-highlights)
5. [Demo](#5-demo)
6. [Bibliography](#6-bibliography)
7. [Authors](#7-authors)

## 1. Project Overview

This project was developed for submission in the **Reinforcement Learning** course at **[Aalto University](https://www.aalto.fi/en/)**. The main goal was to optimize the robot's behavior through deep reinforcement learning, ensuring that it efficiently sands designated spots while avoiding no-sanding areas.

The project involves two main tasks:

1. **Basic DDPG Implementation**: Implement the Deep Deterministic Policy Gradient (DDPG) algorithm and run it in three sanding environments: easy, middle, and difficult.
2. **DDPG Extension**: Extend the DDPG algorithm with Random Network Distillation (RND) to improve performance in the middle and difficult environments.

### 1.1 System Overview

**Robot Characteristics**: The robot is represented as a purple circle with a radius of 10, operating on a 2D plane with x and y coordinates ranging from -50 to 50.

**Sanding & No-Sanding Areas**: Environments consist of sanding (green) and no-sanding (red) areas, each with a radius of 10. The configurations vary based on the task.

### 1.2 State Representation

The state is a flat list comprising the robot's current location, the locations of the sanding areas, and the locations of the no-sanding areas.

### 1.3 Action Space

Actions are target coordinates for the robot arm, which moves from its current position toward the desired target. The movement is executed by a PD controller, which carries a risk of overshooting.

### 1.4 Reward Definition

The reward is the number of sanded sanding locations minus the number of sanded no-sanding locations, encouraging the robot to sand designated spots while avoiding no-sanding areas.

### 1.5 Difficulty Levels

Three difficulty levels are provided, each with a different number of sanding and no-sanding areas:
| Easy | Middle | Difficult |
|------|--------|-----------|
| 1 sanding spot, 1 no-sanding spot | 2 sanding spots, 2 no-sanding spots | 4 sanding spots, 4 no-sanding spots |
| ![Simulation - Easy Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/simulation_easy.gif) | ![Simulation - Middle Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/simulation_middle.gif) | ![Simulation - Difficult Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/simulation_difficult.gif) |
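
To make Sections 1.2-1.4 concrete, here is a minimal sketch of how the flat state vector and the sanding reward could be computed. The helper name `sanding_reward`, the circle-overlap test, and the example coordinates are illustrative assumptions; only the radii, the coordinate range, and the "sanded minus wrongly sanded" reward definition come from the project description above.

```python
import numpy as np

# The radius comes from Section 1.1; everything else in this snippet is an
# illustrative assumption, not part of the released project code.
AREA_RADIUS = 10.0  # radius of the robot, sanding, and no-sanding circles


def sanding_reward(robot_path, sanding_areas, no_sanding_areas):
    """Sanded sanding spots minus sanded no-sanding spots (Section 1.4)."""
    def touched(center):
        # A circle counts as sanded if the robot circle overlapped it at any
        # point of the trajectory, i.e. center distance <= sum of the radii.
        return any(np.hypot(x - center[0], y - center[1]) <= 2 * AREA_RADIUS
                   for x, y in robot_path)

    sanded = sum(touched(c) for c in sanding_areas)
    violated = sum(touched(c) for c in no_sanding_areas)
    return sanded - violated


# Section 1.2: the state is a flat list of coordinates, e.g. for the easy level
# [robot_x, robot_y, sanding_x, sanding_y, no_sanding_x, no_sanding_y]
state = [0.0, 0.0, 25.0, 25.0, -25.0, -25.0]

# Section 1.3: an action is simply a target coordinate pair within [-50, 50]
action = [-30.0, 10.0]
```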
## 2. Development Tools

- **Development Environment**: Jupyter Notebook
- **Language**: Python 3.10
- **Libraries**:
  - **[PyTorch](https://pytorch.org/)**: An open-source deep learning framework that facilitates the development and training of neural networks.
  - **[Gymnasium](https://gymnasium.farama.org/index.html)**: An API standard for reinforcement learning with a diverse collection of reference environments.
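
For reference, the sanding task is driven through the standard Gymnasium interaction loop. The environment id `SandingEnv-easy-v0` below is a placeholder assumption (the actual registration name lives in the non-public project code); the `reset`/`step` calls are standard Gymnasium API, with a random action standing in for the DDPG policy.

```python
import gymnasium as gym

# "SandingEnv-easy-v0" is a placeholder id used for illustration only.
env = gym.make("SandingEnv-easy-v0")

state, info = env.reset(seed=0)
episode_reward, done = 0.0, False

while not done:
    # The agent outputs target (x, y) coordinates for the robot arm; here a
    # random sample stands in for the DDPG actor.
    action = env.action_space.sample()
    state, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    done = terminated or truncated

print(f"Episode return: {episode_reward}")
```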
## 3. Visual Journey

This section presents the achieved performance in terms of average reward using both DDPG and its extension across the different environments.

| DDPG - Easy Environment |
|-------------------------|
| ![DDPG - Easy Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_easy.PNG) |

| DDPG - Middle Environment | DDPG - Difficult Environment |
|---------------------------|------------------------------|
| ![DDPG - Middle Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_middle.PNG) | ![DDPG - Difficult Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_difficult.PNG) |

| DDPG Extension - Middle Environment | DDPG Extension - Difficult Environment |
|-------------------------------------|----------------------------------------|
| ![DDPG Extension - Middle Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_extension_middle.PNG) | ![DDPG Extension - Difficult Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_extension_difficult.PNG) |

| DDPG vs. DDPG Extension - Middle Environment | DDPG vs. DDPG Extension - Difficult Environment |
|----------------------------------------------|-------------------------------------------------|
| ![DDPG vs. DDPG Extension - Middle Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_DDPG_extension_middle.PNG) | ![DDPG vs. DDPG Extension - Difficult Environment](https://github.com/frenkx/rl-deep-deterministic-policy-gradient/blob/master/Visuals/DDPG_DDPG_extension_difficult.PNG) |
## 4. Implementation Highlights

**Highlight 1/3**: The core update mechanism of the basic DDPG algorithm. It computes the critic and actor losses from the current and target Q-values to optimize the corresponding current networks, and performs soft updates on the target networks (θ_target ← τ·θ + (1 - τ)·θ_target) for improved training stability and convergence.

```python
def _update(self):
    # Sample a batch of transitions from the replay buffer
    batch = self.buffer.sample(self.batch_size, device=self.device)

    # The batch contains:
    state = batch.state            # shape [batch, state_dim]
    action = batch.action          # shape [batch, action_dim]
    next_state = batch.next_state  # shape [batch, state_dim]
    not_done = batch.not_done      # shape [batch, 1]
    reward = self.reward_manipulation(batch)

    # Compute current Q-values
    qs = self.q(state, action)

    # Compute target Q-values
    qs_target = self.q_target(next_state, self.pi_target(next_state))
    qs_target = reward + self.gamma * (qs_target * not_done)

    # Compute the critic loss
    critic_loss = F.mse_loss(qs, qs_target)

    # Optimize the critic network
    self.q_optim.zero_grad()
    critic_loss.backward()
    self.q_optim.step()

    # Compute the actor loss
    actor_loss = -self.q(state, self.pi(state)).mean()

    # Optimize the actor network
    self.pi_optim.zero_grad()
    actor_loss.backward()
    self.pi_optim.step()

    # Soft-update the target q and target pi networks via cu.soft_update_params()
    cu.soft_update_params(self.q, self.q_target, self.tau)
    cu.soft_update_params(self.pi, self.pi_target, self.tau)

    return {}
```

**Highlight 2/3**: A custom RND network class for the DDPG extension, built from three fully connected layers with ReLU activations. The architecture comprises two hidden layers of 64 units each, followed by an output layer whose size matches the action dimension.

```python
# Custom RNDNetwork class: three fully connected layers with ReLU activations
class RNDNetwork(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Linear(input_dim + output_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, output_dim)
        )

    def forward(self, state, action):
        # Concatenate state and action and map them to the feature space
        x = torch.cat([state, action], 1)
        x = self.feature(x)
        return x
```

**Highlight 3/3**: The computation of the internal reward using the Random Network Distillation (RND) approach. It calculates per-sample prediction errors between the predictor and target features, updates the predictor network with their mean-squared error (MSE), and returns the scaled, mean-centered, and clamped internal reward.

```python
def calculate_internal_reward(self, batch):
    # Query predictor and target features for the current state-action pairs
    predictor_features = self.rnd_predictor(batch.state, batch.action)
    target_features = self.rnd_target(batch.state, batch.action)

    # Per-sample prediction errors; their mean is the RND loss
    rnd_errors = F.mse_loss(predictor_features, target_features,
                            reduction='none').mean(dim=1, keepdim=True)
    rnd_loss = rnd_errors.mean()

    # Optimize the predictor network
    self.rnd_optimizer.zero_grad()
    rnd_loss.backward()
    self.rnd_optimizer.step()

    # Calculate the mean-centered and clamped internal reward
    with torch.no_grad():
        internal_reward = (rnd_errors - rnd_errors.mean()).clamp(-2, 2)

    # Return the scaled internal reward
    return self.rnd_coef * internal_reward
```
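
Highlight 1 obtains its training reward through `self.reward_manipulation(batch)`, which is not shown above. The following is a minimal sketch, assuming the DDPG extension simply adds the scaled RND internal reward to the external environment reward and that a `use_rnd` flag distinguishes the extension from basic DDPG; it is not the project's actual implementation.

```python
def reward_manipulation(self, batch):
    # Basic DDPG: use the environment reward unchanged (shape [batch, 1]).
    if not self.use_rnd:  # assumed flag separating DDPG from its RND extension
        return batch.reward

    # DDPG extension: augment the external reward with the RND novelty bonus
    # (already scaled by self.rnd_coef inside calculate_internal_reward).
    internal_reward = self.calculate_internal_reward(batch)
    return batch.reward + internal_reward
```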
## 5. Demo

The project code is not uploaded; access to its contents is provided upon request.

## 6. Bibliography

The following academic paper was used to gather sufficient information and complete the project:

1. Burda, Y., Edwards, H., Storkey, A. J., & Klimov, O. (2018). Exploration by Random Network Distillation. CoRR, abs/1810.12894. [Read the paper](http://arxiv.org/abs/1810.12894).

## 7. Authors

- Ferenc Szendrei

[Back to Top](#reinforcement-learning---deep-deterministic-policy-gradient)
