翻译：The foreach and fused implementations are typically faster than the for-loop single-tensor implementation Thus if the user has not specified BOTH flags ie when foreach = fused = None we will attemp

foreach和fused实现通常比for循环、单张量实现更快。因此，如果用户没有指定BOTH标志（即foreach = fused = None），当张量全部在CUDA上时，我们将尝试默认为foreach实现。例如，如果用户为fused指定了True但没有为foreach指定任何内容，我们将运行fused实现。如果用户为foreach指定了False，但没有为fused指定任何内容（或者为fused指定了False，但没有为foreach指定任何内容），我们将运行for-loop实现。如果用户同时为foreach和fused指定了True，我们将优先考虑fused，因为它通常更快。我们尝试使用最快的，因此层次结构为fused->foreach->for-loop。然而，由于fused实现相对较新，我们希望给它充分的烘烤时间，因此当用户没有指定任何标志时，我们默认为foreach而不是fused。