Reinforcement Learning: Deep Q-Learning Algorithm Implementation in Python
This code implements the Deep Q-Learning algorithm, a popular method for reinforcement learning, in Python. The code includes key features like experience replay, target network updates, and the use of the smooth L1 loss function for optimization.
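At its core, the policy network is trained to match the standard Q-learning (Bellman) target for each sampled transition. Roughly, for a transition (state, action, reward, next_state, done), the target is

    target = reward + gamma * max_a Q_target(next_state, a) * (1 - done)

where Q_target is the periodically refreshed copy of the policy network; the smooth L1 loss then penalizes the gap between the policy network's estimate Q(state, action) and this target.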
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


# Define the environment (replace with your specific environment)
class Environment:
    # ... your environment code here
    pass


# Define the neural network (replace with your desired architecture)
class QNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


def train():
    # Initialize the environment, networks, and optimizer
    env = Environment()
    policy_net = QNetwork(env.observation_space, 128, env.action_space)
    target_net = QNetwork(env.observation_space, 128, env.action_space)
    optimizer = optim.Adam(policy_net.parameters())

    # Hyperparameters
    batch_size = 32
    buffer_size = 1000
    gamma = 0.99
    epsilon = 1.0
    epsilon_decay = 0.995
    epsilon_min = 0.01
    target_update = 10

    # Create the experience replay buffer (oldest transitions are dropped once it is full)
    buffer = deque(maxlen=buffer_size)

    # Train the agent
    rewards = []
    for episode in range(1000):
        # Reset the environment and convert the state to a tensor
        state = torch.tensor(env.reset(), dtype=torch.float32)

        # Initialize the total reward for this episode
        total_reward = 0

        # Loop through the episode
        for step in range(1000):
            # Choose an action based on an epsilon-greedy policy
            if random.random() < epsilon:
                action = random.randint(0, env.action_space - 1)
            else:
                with torch.no_grad():
                    action = policy_net(state).argmax().item()

            # Take the action and observe the next state, reward, and done flag
            next_state, reward, done, _ = env.step(action)

            # Round the observed values and convert the next state to a tensor
            next_state = [round(num, 1) for num in next_state]
            next_state = torch.tensor(next_state, dtype=torch.float32)

            # Store the transition in the replay buffer
            buffer.append((state, action, reward, next_state, done))

            # Update the current state and the total reward
            state = next_state
            total_reward += reward

            # If the episode is done, break the loop
            if done:
                break

        # Save the total reward for this episode
        rewards.append(total_reward)

        # Decay the exploration rate
        epsilon = max(epsilon * epsilon_decay, epsilon_min)

        # Update the target network every few episodes
        if episode % target_update == 0:
            target_net.load_state_dict(policy_net.state_dict())

        # If the buffer is not yet full, continue collecting experience
        if len(buffer) < buffer_size:
            continue

        # Sample a batch of transitions from the buffer and convert it to tensors
        batch = random.sample(buffer, batch_size)
        states, actions, batch_rewards, next_states, dones = zip(*batch)
        states = torch.stack(states)
        actions = torch.tensor(actions, dtype=torch.int64)
        batch_rewards = torch.tensor(batch_rewards, dtype=torch.float32)
        next_states = torch.stack(next_states)
        dones = torch.tensor(dones, dtype=torch.float32)

        # Q-values of the actions actually taken in the current states
        q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Greedy Q-values for the next states, estimated by the target network
        with torch.no_grad():
            next_q_values = target_net(next_states).max(1)[0]

        # Target Q-values: reward plus discounted next-state value (zero for terminal states)
        target_q_values = batch_rewards + gamma * next_q_values * (1 - dones)

        # Calculate the smooth L1 (Huber) loss between predicted and target Q-values
        loss = F.smooth_l1_loss(q_values, target_q_values)

        # Optimize the policy network
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Return the learned policy and the list of per-episode rewards
    return policy_net, rewards
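Once the Environment placeholder is filled in, running the trainer and then acting greedily with the learned policy could look like the following minimal sketch (it only relies on the train function, Environment class, and torch import defined above):

if __name__ == "__main__":
    # Train the agent and inspect the most recent episode rewards
    policy_net, rewards = train()
    print("Last 10 episode rewards:", rewards[-10:])

    # Run one evaluation episode, acting greedily (no exploration)
    env = Environment()
    state = torch.tensor(env.reset(), dtype=torch.float32)
    done = False
    while not done:
        with torch.no_grad():
            action = policy_net(state).argmax().item()
        next_state, reward, done, _ = env.step(action)
        state = torch.tensor(next_state, dtype=torch.float32)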
Explanation:
- Environment: Define the environment you're working with; replace the placeholder Environment class with your specific implementation (a hypothetical sketch is shown after this list).
- QNetwork: Define the neural network architecture for estimating Q-values; replace the example two-layer network with your desired structure.
- Initialization: Initialize the environment, networks, optimizer, and hyperparameters.
- Experience Replay: The replay buffer stores past transitions (state, action, reward, next state, done) so that training batches can be sampled from them.
- Training Loop:
- Epsilon-Greedy Policy: The agent chooses actions using an epsilon-greedy policy, balancing exploration and exploitation.
- Update Buffer: Add the current transition to the experience replay buffer.
- Target Network: Update the target network periodically with the policy network's weights for stability.
- Batch Sampling: Randomly sample a batch of transitions from the buffer for training.
- Q-Value Calculation: Calculate Q-values for current and next states using the policy and target networks.
- Target Q-Value: Calculate the target Q-values using the reward, discount factor, and estimated Q-value for the next state.
- Loss Calculation: Use the smooth L1 loss function to minimize the difference between predicted Q-values and target Q-values.
- Optimization: Optimize the policy network using the Adam optimizer and backpropagation.
- Return: Return the learned policy network and the list of rewards for each episode.
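As a point of reference for the Environment placeholder, here is a minimal hypothetical sketch: a toy 1-D task where the agent moves left or right and is rewarded for reaching position 1.0. The task, reward values, and state layout are made up purely for illustration; the only contract the training loop relies on is reset() returning an initial state, step(action) returning (next_state, reward, done, info), and integer observation_space / action_space attributes.

# A hypothetical stand-in environment: a 1-D walk where the agent tries to reach position 1.0
class Environment:
    def __init__(self):
        self.observation_space = 1  # one state feature: the current position
        self.action_space = 2       # two actions: step left or step right
        self.position = 0.0

    def reset(self):
        self.position = 0.0
        return [self.position]

    def step(self, action):
        # Action 1 steps right, action 0 steps left
        self.position += 0.1 if action == 1 else -0.1
        done = self.position >= 1.0
        reward = 1.0 if done else -0.01  # small step cost, bonus at the goal
        return [self.position], reward, done, {}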
This code provides a basic implementation of the Deep Q-Learning algorithm. You can adapt it to your specific environment and customize the network architecture, hyperparameters, and training process to achieve better performance.
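One common customization of the target-network step, shown here only as a sketch: replace the periodic hard copy with a soft (Polyak) update, where the target weights drift slowly toward the policy weights after each training step. The soft_update helper and the blending factor tau below are hypothetical additions, not part of the listing above.

def soft_update(target_net, policy_net, tau=0.005):
    # Blend the policy weights into the target weights: target <- (1 - tau) * target + tau * policy
    with torch.no_grad():
        for target_param, policy_param in zip(target_net.parameters(), policy_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * policy_param)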