Multi-GPU distributed training with TensorFlow

Author: fchollet
Date created: 2020/04/28
Last modified: 2023/06/29
Description: Guide to multi-GPU training for Keras models with TensorFlow.

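Introduction

There are generally two ways to distribute computation across multiple devices:

Data parallelism, where a single model is replicated on multiple devices or multiple machines. Each replica processes different batches of data, and the results are then merged. There are many variants of this setup, which differ in how the model replicas merge their results and in whether they stay in sync at every batch or are more loosely coupled.

Model parallelism, where different parts of a single model run on different devices and process a single batch of data together. This works best with models that have a naturally parallel architecture, such as models with multiple branches.

This guide focuses on data parallelism, in particular synchronous data parallelism, where the different replicas of the model stay in sync after each batch they process. Synchronicity keeps the model's convergence behavior identical to what you would see with single-device training.

Specifically, this guide teaches you how to use the tf.distribute API to train Keras models, with minimal changes to your code, on multiple GPUs (typically 2 to 16) installed on a single machine (single-host, multi-device training). This is the most common setup for researchers and small-scale industry workflows.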

Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import keras

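Single-host, multi-device synchronous training

In this setup, you have one machine with several GPUs on it (typically 2 to 16). Each device runs a copy of your model, called a replica. For simplicity, the rest of this guide assumes 8 GPUs, without loss of generality.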

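How it works

At each step of training:

The current batch of data (called the global batch) is split into 8 different sub-batches (called local batches). For instance, if the global batch has 512 samples, each of the 8 local batches will have 64 samples.
Each of the 8 replicas independently processes one local batch: it runs a forward pass, then a backward pass, and outputs the gradient of the weights with respect to the loss of the model on that local batch.
The weight updates originating from the local gradients are efficiently merged across the 8 replicas. Because this is done at the end of every step, the replicas always stay in sync.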
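The three steps above can be sketched without TensorFlow. This is a toy illustration, not how tf.distribute is implemented: it assumes a scalar model `y = w * x`, a mean squared error loss, and averaging as the merging rule. The helper names `mse_grad` and `split` are made up for this sketch.

```python
# Pure-Python sketch of one synchronous data-parallel step (no TensorFlow).
# Illustrative assumptions: a scalar model y = w * x, mean squared error,
# and 8 replicas with equally sized local batches.

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(seq, num_replicas):
    """Split a global batch into equally sized local batches."""
    local_size = len(seq) // num_replicas
    return [seq[i * local_size:(i + 1) * local_size] for i in range(num_replicas)]


num_replicas = 8
global_batch_size = 512
w = 0.5

# A toy global batch whose targets follow y = 3 * x.
xs = [i / global_batch_size for i in range(global_batch_size)]
ys = [3.0 * x for x in xs]

# Each replica computes a gradient on its own local batch of 64 samples.
local_grads = [
    mse_grad(w, local_xs, local_ys)
    for local_xs, local_ys in zip(split(xs, num_replicas), split(ys, num_replicas))
]

# Merging step (assumed here to be a plain average of the local gradients).
merged_grad = sum(local_grads) / num_replicas

# The merged gradient matches the gradient over the full global batch,
# so every replica can apply the same update and stay in sync.
full_grad = mse_grad(w, xs, ys)
print(len(split(xs, num_replicas)[0]), abs(merged_grad - full_grad) < 1e-9)
```

Because the local batches have equal size, the average of the local gradients equals the gradient over the global batch, which is why synchronous training behaves like single-device training on the same global batch.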

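In practice, the process of synchronously updating the weights of the model replicas is handled at the level of each individual weight variable. This is done through a mirrored variable object.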
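To make the mirrored variable idea a little more concrete, the following minimal sketch creates a single variable inside a strategy scope and asks the strategy whether it owns it. It assumes a standard TensorFlow 2 installation; `demo_weight` is an illustrative name, and the exact class name printed may vary between TensorFlow versions.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are managed by the strategy,
    # which keeps one synchronized copy per replica.
    v = tf.Variable(1.0, name="demo_weight")

# Inspect the Python type of the variable object.
print(type(v).__name__)

# Ask the strategy whether this variable was created under its scope.
print(strategy.extended.variable_created_in_scope(v))
```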

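How to use it

To do single-host, multi-device synchronous training with a Keras model, you use the tf.distribute.MirroredStrategy API. Here's how it works:

Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all available GPUs).
Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating and compiling the model inside the distribution scope. In some cases, the first call to fit() may also create variables, so it is a good idea to put your fit() call in the scope as well.
Train the model via fit() as usual.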
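The optional device configuration from the first step could look like the sketch below. It is hedged: picking the first two GPUs is an arbitrary choice for illustration, and the code falls back to the default strategy on machines with fewer than two GPUs so that it still runs there.

```python
import tensorflow as tf

# Discover the GPUs that TensorFlow can see on this machine.
gpus = tf.config.list_logical_devices("GPU")

if len(gpus) >= 2:
    # Restrict training to the first two GPUs (arbitrary choice for illustration).
    strategy = tf.distribute.MirroredStrategy(devices=[gpu.name for gpu in gpus[:2]])
else:
    # Default behavior: let the strategy pick the available devices.
    strategy = tf.distribute.MirroredStrategy()

print("Number of devices:", strategy.num_replicas_in_sync)
```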

Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow.
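As a rough sketch of what such a dataset might look like, the snippet below batches by the global batch size, derived from an assumed per-replica batch size of 64. The random arrays are synthetic stand-ins for real data, and the shuffle and prefetch settings are illustrative choices, not recommendations from this guide.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Assumed per-replica batch size; the dataset is batched with the global
# batch size, which is then split across the replicas at each step.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Synthetic stand-in data with the same shape as flattened MNIST images.
num_samples = 1024
x = np.random.rand(num_samples, 784).astype("float32")
y = np.random.randint(0, 10, size=(num_samples,)).astype("float32")

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(buffer_size=num_samples)
    .batch(global_batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

# Peek at the first batch to check its shape.
for batch_x, batch_y in dataset.take(1):
    print(batch_x.shape, batch_y.shape)
```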

Schematically, the overall workflow looks like this:

# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = Model(...)
    model.compile(...)

    # Train the model on all available devices.
    model.fit(train_dataset, validation_data=val_dataset, ...)

    # Test the model on all available devices.
    model.evaluate(test_dataset)

Here's a simple end-to-end runnable example:

def get_compiled_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model


def get_dataset():
    batch_size = 32
    num_val_samples = 10000

    # Return the MNIST dataset in the form of a `tf.data.Dataset`.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")

    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size),
    )


# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = get_compiled_model()

    # Train the model on all available devices.
    train_dataset, val_dataset, test_dataset = get_dataset()
    model.fit(train_dataset, epochs=2, validation_data=val_dataset)

    # Test the model on all available devices.
    model.evaluate(test_dataset)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Number of devices: 1
Epoch 1/2
 1563/1563 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - loss: 0.3830 - sparse_categorical_accuracy: 0.8884 - val_loss: 0.1361 - val_sparse_categorical_accuracy: 0.9574
Epoch 2/2
 1563/1563 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - loss: 0.1068 - sparse_categorical_accuracy: 0.9671 - val_loss: 0.0894 - val_sparse_categorical_accuracy: 0.9724
 313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0988 - sparse_categorical_accuracy: 0.9673

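Using callbacks to ensure fault tolerance

When using distributed training, you should always make sure you have a strategy to recover from failure (fault tolerance). The simplest way to handle this is to pass a ModelCheckpoint callback to fit(), to save your model at regular intervals (e.g. every 100 batches or every epoch). You can then restart training from your saved model.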
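For the "every 100 batches" case mentioned above, a minimal sketch might configure the callback with an integer `save_freq`. The directory name `./ckpt_batches` is a hypothetical choice for this sketch; because the file name only contains the epoch, later saves within the same epoch overwrite earlier ones.

```python
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras

checkpoint_dir = "./ckpt_batches"  # hypothetical directory name

# With an integer `save_freq`, the callback saves after that many batches
# instead of at the end of each epoch.
batch_checkpoint = keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "ckpt-{epoch}.keras"),
    save_freq=100,
)
```

The callback is then passed to fit() through the `callbacks` argument, in the same way as an epoch-based checkpoint.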

Here's a simple example:

# Prepare a directory to store all the checkpoints.
checkpoint_dir = "./ckpt"
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)


def make_or_restore_model():
    # Either restore the latest model, or create a fresh one
    # if there is no checkpoint available.
    checkpoints = [checkpoint_dir + "/" + name for name in os.listdir(checkpoint_dir)]
    if checkpoints:
        latest_checkpoint = max(checkpoints, key=os.path.getctime)
        print("Restoring from", latest_checkpoint)
        return keras.models.load_model(latest_checkpoint)
    print("Creating a new model")
    return get_compiled_model()


def run_training(epochs=1):
    # Create a MirroredStrategy.
    strategy = tf.distribute.MirroredStrategy()

    # Open a strategy scope and create/restore the model
    with strategy.scope():
        model = make_or_restore_model()

        callbacks = [
            # This callback saves the model in the `.keras` format every epoch.
            # We include the current epoch in the file name.
            keras.callbacks.ModelCheckpoint(
                filepath=checkpoint_dir + "/ckpt-{epoch}.keras",
                save_freq="epoch",
            )
        ]
        model.fit(
            train_dataset,
            epochs=epochs,
            callbacks=callbacks,
            validation_data=val_dataset,
            verbose=2,
        )


# Running the first time creates the model
run_training(epochs=1)

# Calling the same function again will resume from where we left off
run_training(epochs=1)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Creating a new model
1563/1563 - 7s - 4ms/step - loss: 0.2275 - sparse_categorical_accuracy: 0.9320 - val_loss: 0.1373 - val_sparse_categorical_accuracy: 0.9571
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Restoring from ./ckpt/ckpt-1.keras
1563/1563 - 6s - 4ms/step - loss: 0.0944 - sparse_categorical_accuracy: 0.9717 - val_loss: 0.0972 - val_sparse_categorical_accuracy: 0.9710
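The example above picks the most recent checkpoint by file creation time. As an alternative sketch, the checkpoint can be chosen by the epoch number embedded in the `ckpt-{epoch}.keras` file name. This assumes that epoch numbers keep increasing across runs (for example, when `initial_epoch` is passed to fit()), which is not the case in the example above. The helper name `latest_checkpoint_by_epoch` is made up for this sketch.

```python
import os
import re
import tempfile


def latest_checkpoint_by_epoch(checkpoint_dir):
    """Return the path of the `ckpt-{epoch}.keras` file with the highest epoch, or None."""
    pattern = re.compile(r"^ckpt-(\d+)\.keras$")
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for epoch in (1, 2, 10):
        open(os.path.join(tmp_dir, f"ckpt-{epoch}.keras"), "w").close()
    latest = latest_checkpoint_by_epoch(tmp_dir)
    print(os.path.basename(latest))
```

Parsing the epoch as an integer avoids the pitfall of string ordering, where `ckpt-10` would sort before `ckpt-2`.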