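Game-over reward calculation function: code walkthrough
This snippet is a reward function that handles the end of a game. When the game ends, the reward is computed from the current health of the player and of the opponent.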
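- If the current player's health is below 0, the player has lost, and the opponent's remaining health is used as the penalty: the reward is the negative of full_hp raised to the power (curr_oppont_health + 1) / (full_hp + 1). For example, if the opponent's health is also -1, the exponent is 0 and the reward is exactly -1.
- If the opponent's health is below 0, the player has won, and the player's remaining health is used as the reward in the same exponential form. The result is multiplied by a reward coefficient ('reward_coeff') so that it is larger than the penalty, which keeps the agent from becoming a coward.
- If the fight is still going on, the step reward is 'reward_coeff' times the damage dealt to the opponent minus the damage taken by the player. The current health values are then stored as the previous values for the next step.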
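Finally, if the round is never to be reset ('reset_round' is False), the custom done flag is forced to False so that the session keeps going. In the return value, the reward is normalized by multiplying it by 0.001, which the code comment calls 'norm_coefficient'.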
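The three cases above can be summarized as one piecewise formula. This is only a restatement of the snippet, using shorthand symbols that do not appear in the original: H for full_hp, c for reward_coeff, h_p and h_o for the current health of the player and the opponent, and the superscript "prev" for the values stored from the previous step.

```latex
r =
\begin{cases}
-\,H^{(h_o + 1)/(H + 1)} & \text{if } h_p < 0 \\
c \, H^{(h_p + 1)/(H + 1)} & \text{if } h_p \ge 0 \text{ and } h_o < 0 \\
c \,(h_o^{\text{prev}} - h_o) - (h_p^{\text{prev}} - h_p) & \text{otherwise}
\end{cases}
\qquad
r_{\text{returned}} = 0.001 \, r
```

Because the exponent runs from 0 (health of -1) to 1 (health equal to H), the terminal term ranges from 1 to H before the coefficient is applied.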
# Game is over and player loses.
if curr_player_health < 0:
    custom_reward = -math.pow(self.full_hp, (curr_oppont_health + 1) / (self.full_hp + 1))  # Use the remaining health points of opponent as penalty.
    # If the opponent's health is also -1, it's an even game: the exponent is 0 and the reward is -1.
    custom_done = True
# Game is over and player wins.
elif curr_oppont_health < 0:
    # custom_reward = curr_player_health * self.reward_coeff  # Use the remaining health points of player as reward.
    # Multiply by reward_coeff to make the reward larger than the penalty to avoid cowardice of agent.
    # custom_reward = math.pow(self.full_hp, (5940 - self.total_timesteps) / 5940) * self.reward_coeff  # Use the remaining time steps as reward.
    custom_reward = math.pow(self.full_hp, (curr_player_health + 1) / (self.full_hp + 1)) * self.reward_coeff
    custom_done = True
# While the fighting is still going on
else:
    custom_reward = self.reward_coeff * (self.prev_oppont_health - curr_oppont_health) - (self.prev_player_health - curr_player_health)
    self.prev_player_health = curr_player_health
    self.prev_oppont_health = curr_oppont_health
    custom_done = False
# When reset_round flag is set to False (never reset), the session should always keep going.
if not self.reset_round:
    custom_done = False
# Max reward is 6 * full_hp = 1054 (damage * 3 + winning_reward * 3); norm_coefficient = 0.001
return self._stack_observation(), 0.001 * custom_reward, custom_done, info  # reward normalization
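To try the reward logic outside the environment, the same branches can be sketched as a standalone function. This is a minimal sketch, not the author's implementation: the function name compute_reward, the constants FULL_HP = 176 and REWARD_COEFF = 3.0, and passing the previous health values as arguments instead of storing them on self are all assumptions made for illustration.

```python
import math

FULL_HP = 176       # assumed full health value; not stated in the snippet
REWARD_COEFF = 3.0  # assumed value of reward_coeff
NORM_COEFF = 0.001  # normalization factor hard-coded in the return statement


def compute_reward(prev_player, prev_oppont, curr_player, curr_oppont,
                   full_hp=FULL_HP, reward_coeff=REWARD_COEFF, reset_round=True):
    """Return (normalized_reward, done) for one step (hypothetical helper)."""
    if curr_player < 0:
        # Loss: the penalty grows with the opponent's remaining health.
        reward = -math.pow(full_hp, (curr_oppont + 1) / (full_hp + 1))
        done = True
    elif curr_oppont < 0:
        # Win: the reward grows with the player's remaining health.
        reward = math.pow(full_hp, (curr_player + 1) / (full_hp + 1)) * reward_coeff
        done = True
    else:
        # Fight in progress: weighted damage dealt minus damage taken.
        reward = reward_coeff * (prev_oppont - curr_oppont) - (prev_player - curr_player)
        done = False
    if not reset_round:
        # Never-reset mode: the session keeps going regardless of the outcome.
        done = False
    return NORM_COEFF * reward, done


if __name__ == "__main__":
    print(compute_reward(100, 100, 90, 80))   # mid-fight exchange
    print(compute_reward(176, 10, 176, -1))   # win at full health
    print(compute_reward(10, 176, -1, 176))   # loss to an untouched opponent
```

Under these assumed constants, a win at full health yields about 0.528 and a loss to an untouched opponent about -0.176, so the best win outweighs the worst loss by the factor reward_coeff.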
Original post: https://www.cveoy.top/t/topic/7PV. Copyright belongs to the author. Do not repost or scrape.