Walkthrough of the End-of-Game Reward Calculation Code

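This code is the reward calculation at the end of each step of a fighting game. It computes a reward from the current health of the player and the opponent, with separate handling for the end of the game and for a round that is still in progress.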

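If the current player's health is below 0, the player has lost, and the opponent's remaining health determines the penalty through the term math.pow(full_hp, (curr_oppont_health + 1) / (full_hp + 1)), negated. If the opponent's health is -1 as well (an even game), the exponent is 0 and the reward is -1 before normalization.
If the opponent's health is below 0, the player has won, and the player's remaining health determines the reward through the same exponential term. The result is multiplied by a reward coefficient ('reward_coeff') so that the reward is larger than the penalty, which keeps the agent from becoming a coward.
If the game is still in progress, the reward is 'reward_coeff' times the damage dealt to the opponent in this step, minus the damage the player took. The current health values are then stored as the previous health values for the next step.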
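
The three cases above can be sketched as a standalone function. This is a minimal illustration, not the original method: the function name `compute_custom_reward` and the default values `full_hp=176` and `reward_coeff=3.0` are assumptions chosen for the example, and the previous health values are passed in explicitly instead of being stored on `self`.

```python
import math


def compute_custom_reward(curr_player_health, curr_oppont_health,
                          prev_player_health, prev_oppont_health,
                          full_hp=176, reward_coeff=3.0):
    """Return (custom_reward, custom_done) for one step.

    Hypothetical standalone version of the three branches; the full_hp and
    reward_coeff defaults are illustrative assumptions.
    """
    if curr_player_health < 0:
        # Player lost: the penalty grows with the opponent's remaining health.
        reward = -math.pow(full_hp, (curr_oppont_health + 1) / (full_hp + 1))
        return reward, True
    if curr_oppont_health < 0:
        # Player won: the bonus grows with the player's remaining health.
        reward = math.pow(full_hp, (curr_player_health + 1) / (full_hp + 1)) * reward_coeff
        return reward, True
    # Round continues: weighted damage dealt minus damage taken.
    damage_dealt = prev_oppont_health - curr_oppont_health
    damage_taken = prev_player_health - curr_player_health
    return reward_coeff * damage_dealt - damage_taken, False


# Mid-round step: the player dealt 10 damage and took 4.
print(compute_custom_reward(100, 90, 104, 100))
```

Passing the previous health values as arguments keeps the sketch free of state; the original code instead updates `self.prev_player_health` and `self.prev_oppont_health` after each in-progress step.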

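Finally, if the 'reset_round' flag is False (the game is never reset), the custom done flag is forced to False so that the session keeps going. In the return value, the reward is normalized by multiplying it by the normalization coefficient ('norm_coefficient' = 0.001).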
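
The final step can be sketched in isolation as follows. The helper name `finalize_step` and the constant name `NORM_COEFFICIENT` are hypothetical; the code being explained applies the literal 0.001 directly in its return statement.

```python
NORM_COEFFICIENT = 0.001  # hypothetical name for the literal 0.001 in the return statement


def finalize_step(custom_reward, custom_done, reset_round,
                  norm_coefficient=NORM_COEFFICIENT):
    """Apply the done-flag override and the reward normalization."""
    if not reset_round:
        # Never reset: the session keeps going even after a win or a loss.
        custom_done = False
    return norm_coefficient * custom_reward, custom_done


# A terminal step with reset_round disabled is reported as not done.
reward, done = finalize_step(528.0, True, reset_round=False)
print(reward, done)
```

Scaling by a small constant keeps the per-step reward in a narrow range around zero, which is presumably why the normalization is applied at the very end rather than inside each branch.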

        # Game is over and player loses.
        if curr_player_health < 0:
            # Use the remaining health points of the opponent as the penalty.
            # If the opponent's health is -1 as well, it's an even game: the exponent is 0 and the reward is -1.
            custom_reward = -math.pow(self.full_hp, (curr_oppont_health + 1) / (self.full_hp + 1))
            custom_done = True

        # Game is over and player wins.
        elif curr_oppont_health < 0:
            # custom_reward = curr_player_health * self.reward_coeff # Use the remaining health points of player as reward.
                                                                   # Multiply by reward_coeff to make the reward larger than the penalty to avoid cowardice of agent.

            # custom_reward = math.pow(self.full_hp, (5940 - self.total_timesteps) / 5940) * self.reward_coeff # Use the remaining time steps as reward.
            custom_reward = math.pow(self.full_hp, (curr_player_health + 1) / (self.full_hp + 1)) * self.reward_coeff
            custom_done = True

        # While the fighting is still going on
        else:
            custom_reward = self.reward_coeff * (self.prev_oppont_health - curr_oppont_health) - (self.prev_player_health - curr_player_health)
            self.prev_player_health = curr_player_health
            self.prev_oppont_health = curr_oppont_health
            custom_done = False

        # When reset_round flag is set to False (never reset), the session should always keep going.
        if not self.reset_round:
            custom_done = False
             
        # Max reward is 6 * full_hp = 1054 (damage * 3 + winning_reward * 3) norm_coefficient = 0.001
        return self._stack_observation(), 0.001 * custom_reward, custom_done, info # reward normalization
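
To see how the exponential terminal term in the code above behaves, the sketch below evaluates it at a few health values. The helper name `terminal_magnitude` and the value `full_hp = 176` are assumptions made for illustration. The term equals 1 when the remaining health is -1 (exponent 0) and equals `full_hp` when the remaining health is `full_hp` (exponent 1), growing monotonically in between.

```python
import math


def terminal_magnitude(health, full_hp):
    """Magnitude of the win/loss term before the sign and reward_coeff are applied."""
    return math.pow(full_hp, (health + 1) / (full_hp + 1))


full_hp = 176  # assumed value, for illustration only
for health in (-1, 0, full_hp // 2, full_hp):
    print(health, round(terminal_magnitude(health, full_hp), 2))
```

Because the curve is exponential in the remaining health, most of the difference in terminal reward is concentrated near full health, so narrow wins and narrow losses receive terminal rewards of small magnitude.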