.. raw:: html

.. code:: python

    #@save
    def resnet18(num_classes):
        """A slightly modified ResNet-18 model."""
        def resnet_block(num_channels, num_residuals, first_block=False):
            blk = nn.Sequential()
            for i in range(num_residuals):
                if i == 0 and not first_block:
                    blk.add(d2l.Residual(
                        num_channels, use_1x1conv=True, strides=2))
                else:
                    blk.add(d2l.Residual(num_channels))
            return blk

        net = nn.Sequential()
        # This model uses a smaller convolution kernel, stride, and padding and
        # removes the maximum pooling layer
        net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
                nn.BatchNorm(), nn.Activation('relu'))
        net.add(resnet_block(64, 2, first_block=True),
                resnet_block(128, 2),
                resnet_block(256, 2),
                resnet_block(512, 2))
        net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
        return net

.. raw:: html

.. raw:: html

.. code:: python

    #@save
    def resnet18(num_classes, in_channels=1):
        """A slightly modified ResNet-18 model."""
        def resnet_block(in_channels, out_channels, num_residuals,
                         first_block=False):
            blk = []
            for i in range(num_residuals):
                if i == 0 and not first_block:
                    blk.append(d2l.Residual(in_channels, out_channels,
                                            use_1x1conv=True, strides=2))
                else:
                    blk.append(d2l.Residual(out_channels, out_channels))
            return nn.Sequential(*blk)

        # This model uses a smaller convolution kernel, stride, and padding and
        # removes the maximum pooling layer
        net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU())
        net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
        net.add_module("resnet_block2", resnet_block(64, 128, 2))
        net.add_module("resnet_block3", resnet_block(128, 256, 2))
        net.add_module("resnet_block4", resnet_block(256, 512, 2))
        net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
        net.add_module("fc", nn.Sequential(nn.Flatten(),
                                           nn.Linear(512, num_classes)))
        return net

.. raw:: html
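
To see what the smaller stem and the removed max-pooling layer mean for the feature-map sizes, the following framework-free sketch traces the channel count and spatial size through the network for a 28×28 input. It assumes that the first residual block of every stage except the first downsamples with a 3×3 convolution, stride 2, and padding 1, as ``d2l.Residual`` is defined elsewhere in the book. The helper names ``conv_out_size`` and ``trace_resnet18`` are hypothetical and not part of ``d2l``.

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1


def trace_resnet18(input_size=28):
    """Return (stage name, channels, spatial size) for each stage."""
    stages = []
    # Stem: 3x3 convolution, stride 1, padding 1; no max pooling follows
    size = conv_out_size(input_size, kernel_size=3, stride=1, padding=1)
    stages.append(('stem', 64, size))
    for i, channels in enumerate([64, 128, 256, 512]):
        # Assumption: every stage except the first halves the resolution
        # with a 3x3, stride-2, padding-1 convolution in its first block
        stride = 1 if i == 0 else 2
        size = conv_out_size(size, kernel_size=3, stride=stride, padding=1)
        stages.append((f'resnet_block{i + 1}', channels, size))
    # Global average pooling collapses each channel to a single value
    stages.append(('global_avg_pool', 512, 1))
    return stages


for name, channels, size in trace_resnet18(28):
    print(f'{name}: {channels} channels, {size}x{size}')
```

Under these assumptions a Fashion-MNIST image keeps its 28×28 resolution through the stem and the first stage and is then reduced to 14, 7, and 4 pixels per side, which is why the final fully connected layer receives 512 features.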

.. raw:: html

mxnet pytorch

.. raw:: html

The ``initialize`` function allows us to initialize parameters on a device of our choice. For a refresher on initialization methods see :numref:`sec_numerical_stability`. What is particularly convenient is that it also allows us to initialize the network on *multiple* devices simultaneously. Let us try how this works in practice.

.. code:: python

    net = resnet18(10)
    # Get a list of GPUs
    devices = d2l.try_all_gpus()
    # Initialize all the parameters of the network
    net.initialize(init=init.Normal(sigma=0.01), ctx=devices)

Using the ``split_and_load`` function introduced in :numref:`sec_multi_gpu`, we can divide a minibatch of data and copy portions to the list of devices provided by the ``devices`` variable. The network instance *automatically* uses the appropriate GPU to compute the value of the forward propagation. Here we generate 4 observations and split them over the GPUs.

.. code:: python

    x = np.random.uniform(size=(4, 1, 28, 28))
    x_shards = gluon.utils.split_and_load(x, devices)
    net(x_shards[0]), net(x_shards[1])

.. parsed-literal::
    :class: output

    [11:23:07] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.. parsed-literal::
    :class: output

    (array([[ 2.2610195e-06,  2.2045988e-06, -5.4046795e-06,  1.2869961e-06,
              5.1373149e-06, -3.8298003e-06,  1.4338968e-07,  5.4683442e-06,
             -2.8279201e-06, -3.9651122e-06],
            [ 2.0698672e-06,  2.0084667e-06, -5.6382496e-06,  1.0498482e-06,
              5.5506434e-06, -4.1065477e-06,  6.0830178e-07,  5.4521761e-06,
             -3.7365016e-06, -4.1891649e-06]], ctx=gpu(0)),
     array([[ 2.4629790e-06,  2.6015525e-06, -5.4362617e-06,  1.2938226e-06,
              5.6387885e-06, -4.1360108e-06,  3.5758899e-07,  5.5125261e-06,
             -3.1957350e-06, -4.2976326e-06],
            [ 1.9431686e-06,  2.2600429e-06, -5.2698201e-06,  1.4807408e-06,
              5.4830934e-06, -3.9678903e-06,  7.5750904e-08,  5.6764356e-06,
             -3.2530229e-06, -4.0943960e-06]], ctx=gpu(1)))

Once data passes through the network, the corresponding parameters are initialized *on the device the data passed through*. This means that initialization happens on a per-device basis. Since we picked GPU 0 and GPU 1 for initialization, the network is initialized only there, and not on the CPU. In fact, the parameters do not even exist on the CPU. We can verify this by printing out the parameters and observing any errors that might arise.

.. code:: python

    weight = net[0].params.get('weight')

    try:
        weight.data()
    except RuntimeError:
        print('not initialized on cpu')
    weight.data(devices[0])[0], weight.data(devices[1])[0]

.. parsed-literal::
    :class: output

    not initialized on cpu

.. parsed-literal::
    :class: output

    (array([[[ 0.01382882, -0.01183044,  0.01417865],
             [-0.00319718,  0.00439528,  0.02562625],
             [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(0)),
     array([[[ 0.01382882, -0.01183044,  0.01417865],
             [-0.00319718,  0.00439528,  0.02562625],
             [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(1)))

Next, let us replace the code that evaluates the accuracy with one that works in parallel across multiple devices. This serves as a replacement of the ``evaluate_accuracy_gpu`` function from :numref:`sec_lenet`. The main difference is that we split a minibatch before invoking the network. All else is essentially identical.

.. code:: python

    #@save
    def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
        """Compute the accuracy for a model on a dataset using multiple GPUs."""
        # Query the list of devices
        devices = list(net.collect_params().values())[0].list_ctx()
        # No. of correct predictions, no. of predictions
        metric = d2l.Accumulator(2)
        for features, labels in data_iter:
            X_shards, y_shards = split_f(features, labels, devices)
            # Run in parallel
            pred_shards = [net(X_shard) for X_shard in X_shards]
            metric.add(sum(float(d2l.accuracy(pred_shard, y_shard)) for
                           pred_shard, y_shard in zip(
                               pred_shards, y_shards)), labels.size)
        return metric[0] / metric[1]

.. raw:: html
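
The idea behind the multi-device evaluation can be sketched without any GPU: split each minibatch into shards, count the correct predictions per shard, and add up the counts. The pure-Python sketch below uses plain lists in place of tensors and a toy ``predict`` callable in place of the network. The names ``split_evenly`` and ``sharded_accuracy`` are illustrative and are not ``d2l`` functions.

```python
def split_evenly(items, num_shards):
    """Split a list into num_shards contiguous, nearly equal parts."""
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def sharded_accuracy(predict, batches, num_shards):
    """Accuracy accumulated shard by shard over all minibatches."""
    num_correct, num_total = 0, 0
    for features, labels in batches:
        X_shards = split_evenly(features, num_shards)
        y_shards = split_evenly(labels, num_shards)
        for X_shard, y_shard in zip(X_shards, y_shards):
            # On real hardware each shard would be evaluated on its own device
            preds = [predict(x) for x in X_shard]
            num_correct += sum(int(p == y) for p, y in zip(preds, y_shard))
        num_total += len(labels)
    return num_correct / num_total


# Toy "model": predict 1 for positive inputs and 0 otherwise
predict = lambda x: int(x > 0)
batches = [([-2.0, 1.5, 3.0, -0.5], [0, 1, 0, 0]),
           ([0.7, -1.2, 2.2, 4.0], [1, 0, 1, 0])]
print(sharded_accuracy(predict, batches, num_shards=2))
```

Because the per-shard counts are simply added, the resulting accuracy does not depend on how many shards the minibatch is split into.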

.. raw:: html

We will initialize the network inside the training loop. For a refresher on initialization methods see :numref:`sec_numerical_stability`.

.. code:: python

    net = resnet18(10)
    # Get a list of GPUs
    devices = d2l.try_all_gpus()
    # We will initialize the network inside the training loop

.. raw:: html

.. raw:: html

mxnet pytorch

.. raw:: html

.. code:: python

    def train(num_gpus, batch_size, lr):
        train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        ctx = [d2l.try_gpu(i) for i in range(num_gpus)]
        net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
        trainer = gluon.Trainer(net.collect_params(), 'sgd',
                                {'learning_rate': lr})
        loss = gluon.loss.SoftmaxCrossEntropyLoss()
        timer, num_epochs = d2l.Timer(), 10
        animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
        for epoch in range(num_epochs):
            timer.start()
            for features, labels in train_iter:
                X_shards, y_shards = d2l.split_batch(features, labels, ctx)
                with autograd.record():
                    ls = [loss(net(X_shard), y_shard) for X_shard, y_shard
                          in zip(X_shards, y_shards)]
                for l in ls:
                    l.backward()
                trainer.step(batch_size)
            npx.waitall()
            timer.stop()
            animator.add(epoch + 1, (evaluate_accuracy_gpus(net, test_iter),))
        print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
              f'on {str(ctx)}')

.. raw:: html

.. raw:: html

.. code:: python

    def train(net, num_gpus, batch_size, lr):
        train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        devices = [d2l.try_gpu(i) for i in range(num_gpus)]
        def init_weights(m):
            if type(m) in [nn.Linear, nn.Conv2d]:
                nn.init.normal_(m.weight, std=0.01)
        net.apply(init_weights)
        # Set the model on multiple GPUs
        net = nn.DataParallel(net, device_ids=devices)
        trainer = torch.optim.SGD(net.parameters(), lr)
        loss = nn.CrossEntropyLoss()
        timer, num_epochs = d2l.Timer(), 10
        animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
        for epoch in range(num_epochs):
            net.train()
            timer.start()
            for X, y in train_iter:
                trainer.zero_grad()
                X, y = X.to(devices[0]), y.to(devices[0])
                l = loss(net(X), y)
                l.backward()
                trainer.step()
            timer.stop()
            animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
        print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
              f'on {str(devices)}')

.. raw:: html
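
Why does splitting a minibatch across devices leave the optimization step unchanged? The sketch below checks this on a one-parameter least-squares model in pure Python: summing the per-shard gradients and dividing by the full batch size gives the same value as the gradient of the mean loss over the whole minibatch. The model, the data, and the function names are made up for illustration, and an even split across devices is assumed.

```python
def grad_sum(w, xs, ys):
    """Sum over examples of d/dw of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


def full_batch_grad(w, xs, ys):
    """Gradient of the mean loss over the whole minibatch."""
    return grad_sum(w, xs, ys) / len(xs)


def data_parallel_grad(w, xs, ys, num_devices):
    """Sum per-shard gradients, then normalize by the full batch size."""
    shard_size = len(xs) // num_devices  # assumes an even split
    total = 0.0
    for d in range(num_devices):
        lo, hi = d * shard_size, (d + 1) * shard_size
        # Each simulated device sees only its own shard of the minibatch
        total += grad_sum(w, xs[lo:hi], ys[lo:hi])
    return total / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
print(full_batch_grad(w, xs, ys), data_parallel_grad(w, xs, ys, 2))
```

Both calls print the same gradient, so in this toy setting the parameter update is the same whether the minibatch is processed on one device or split across two.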

.. raw:: html

mxnet pytorch

.. raw:: html

.. code:: python

    train(num_gpus=1, batch_size=256, lr=0.1)

.. parsed-literal::
    :class: output

    test acc: 0.93, 13.3 sec/epoch on [gpu(0)]

.. figure:: output_multiple-gpus-concise_2e111f_47_1.svg

.. raw:: html

.. raw:: html

.. code:: python

    train(net, num_gpus=1, batch_size=256, lr=0.1)

.. parsed-literal::
    :class: output

    test acc: 0.90, 13.8 sec/epoch on [device(type='cuda', index=0)]

.. figure:: output_multiple-gpus-concise_2e111f_50_1.svg

.. raw:: html

.. raw:: html

mxnet pytorch

.. raw:: html

.. code:: python

    train(num_gpus=2, batch_size=512, lr=0.2)

.. parsed-literal::
    :class: output

    test acc: 0.92, 6.9 sec/epoch on [gpu(0), gpu(1)]

.. figure:: output_multiple-gpus-concise_2e111f_56_1.svg

.. raw:: html

.. raw:: html

.. code:: python

    train(net, num_gpus=2, batch_size=512, lr=0.2)

.. parsed-literal::
    :class: output

    test acc: 0.77, 8.2 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]

.. figure:: output_multiple-gpus-concise_2e111f_59_1.svg

.. raw:: html
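
The two runs above double both the minibatch size and the learning rate when moving from one GPU to two, so that each GPU still processes 256 examples per step. A small helper makes that convention explicit. This is only a sketch of the heuristic used in these experiments: the function name ``scale_hyperparams`` and its defaults are hypothetical, and scaling the learning rate linearly is not guaranteed to work for every model or batch size.

```python
def scale_hyperparams(num_gpus, per_gpu_batch_size=256, base_lr=0.1):
    """Scale batch size and learning rate linearly with the GPU count."""
    if num_gpus < 1:
        raise ValueError('num_gpus must be at least 1')
    batch_size = per_gpu_batch_size * num_gpus
    lr = base_lr * num_gpus
    return batch_size, lr


for num_gpus in (1, 2):
    batch_size, lr = scale_hyperparams(num_gpus)
    print(f'num_gpus={num_gpus}: batch_size={batch_size}, lr={lr}')
```

With these defaults the helper reproduces the settings used above: a batch size of 256 with learning rate 0.1 for one GPU, and 512 with 0.2 for two.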

.. raw:: html

mxnet

.. raw:: html

- Gluon provides primitives for model initialization across multiple devices by providing a context list.

.. raw:: html

.. raw:: html

mxnet pytorch

.. raw:: html

1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation. What happens if you try this with 16 GPUs (e.g., on an AWS p2.16xlarge instance)?
2. Sometimes, different devices provide different computing power. We could use the GPUs and the CPU at the same time. How should we divide the work? Is it worth the effort? Why? Why not?
3. What happens if we drop ``npx.waitall()``? How would you modify training such that you have an overlap of up to two steps for parallelism?

`Discussions `__

.. raw:: html

.. raw:: html