A Worked Tutorial for NNI, Microsoft's Automatic Hyperparameter Tuning Tool

Step 1: Installation

NNI can be installed with a single pip command, and the repository ships with examples for reference and learning.

Requirements: TensorFlow, Python >= 3.5


    python3 -m pip install --upgrade nni

    git clone https://github.com/Microsoft/nni.git

    python3 -m pip install tensorflow

Step 2: Define the hyperparameter search space

The NNI example used here lives in the cloned repository:

cd ./nni/examples/trials/mnist/

It contains three files:

  • config.yml
  • mnist.py
  • search_space.json

These three files hold the NNI experiment configuration, the main training script, and the hyperparameter search space, respectively.

1. Open search_space.json

{
    "batch_size": {"_type":"choice", "_value": [16, 32, 64, 128]},
    "hidden_size":{"_type":"choice","_value":[128, 256, 512, 1024]},
    "lr":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]},
    "momentum":{"_type":"uniform","_value":[0, 1]}
}

Here we define our hyperparameters and their search ranges; adjust them to your own needs.
There are several sampling types; choice and uniform are the most common.
This example uses only choice and uniform; according to the official repository, other available types include:

{"_type": "choice", "_value": options}

{"_type": "uniform", "_value": [low, high]}

{"_type": "quniform", "_value": [low, high, q]}

{"_type": "normal", "_value": [mu, sigma]}

{"_type": "randint", "_value": [lower, upper]}

2. Configure config.yml

Open config.yml:

authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 20
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: .
  gpuNum: 0


For comparison, a customized configuration (here for a multi-GPU local run) might look like this:

authorName: az
experimentName: demo
trialConcurrency: 5
maxExecDuration: 24h
maxTrialNum: 10
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
logDir: ./log
logLevel: info
tuner:
  builtinTunerName: TPE
trial:
  command: python3 run_demo.py
  codeDir: .
  gpuNum: 1
localConfig:
  gpuIndices: 0,3
  maxTrialNumPerGpu: 2
  useActiveGpu: false

Apart from command, maxExecDuration, trialConcurrency, gpuNum, and optimize_mode, the parameters here usually do not need to be changed.

  • command is the command NNI executes for each trial; replace mnist.py with your own entry script (main.py, train.py, etc.).
  • maxExecDuration is the time budget for the whole NNI tuning experiment, not for a single training run.
  • trialConcurrency is the number of trials run in parallel. Set it according to how many GPUs you have, not according to gpuNum below. A trial is one tuning run, i.e. training your train.py with one set of hyperparameters; with concurrency x, x trainers run at the same time.
  • gpuNum is the number of GPUs each trial needs, not the number of GPUs for the whole experiment. If a single training run needs N GPUs, set it to N; if one GPU is enough, set it to 1.
  • The total number of GPUs required is trialConcurrency * gpuNum, i.e. the number of concurrent trials times the GPUs per trial. For example, trialConcurrency: 5 with gpuNum: 1 needs 5 GPUs in total.
  • optimize_mode is the optimization direction, either maximize or minimize; how to choose it is covered in the next step.
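
As a quick sketch of that relationship (the actual call appears in the full script in Step 3), optimize_mode simply has to match the metric the trial reports to NNI; the value below is a placeholder for illustration.

import nni

# optimize_mode in config.yml must match the reported metric:
#   accuracy (higher is better) -> optimize_mode: maximize
#   loss     (lower is better)  -> optimize_mode: minimize
test_acc = 0.97  # placeholder; in practice this comes from your evaluation loop
nni.report_final_result(test_acc)  # reporting an accuracy, so use maximize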

Step 3: Modify our code

Compared with a plain training script, only a few NNI-specific additions are needed: read the tuned hyperparameters with nni.get_next_parameter(), report a metric each epoch with nni.report_intermediate_result(), and report the final metric with nni.report_final_result(). The full example:

"""
A deep MNIST classifier using convolutional layers.

This file is a modification of the official pytorch mnist example:
https://github.com/pytorch/examples/blob/master/mnist/main.py
"""

import os
import argparse
import logging
import nni
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from nni.utils import merge_parameter
from torchvision import datasets, transforms

logger = logging.getLogger('mnist_AutoML')

class Net(nn.Module):
    def __init__(self, hidden_size):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
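        # two 5x5 convs and two 2x2 max-pools reduce a 28x28 input to 50 feature maps of size 4x4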
        self.fc1 = nn.Linear(4*4*50, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

def train(args, model, device, train_loader, optimizer, epoch):
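    # args is a plain dict here: the argparse namespace is converted with vars() in __main__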
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if (args['batch_num'] is not None) and batch_idx >= args['batch_num']:
            break
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args['log_interval'] == 0:
            logger.info('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)

            test_loss += F.nll_loss(output, target, reduction='sum').item()

            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    accuracy = 100. * correct / len(test_loader.dataset)

    logger.info('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset), accuracy))

    return accuracy

def main(args):
    use_cuda = not args['no_cuda'] and torch.cuda.is_available()

    torch.manual_seed(args['seed'])

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

    data_dir = args['data_dir']

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(data_dir, train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args['batch_size'], shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(data_dir, train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=1000, shuffle=True, **kwargs)

    hidden_size = args['hidden_size']

    model = Net(hidden_size=hidden_size).to(device)
    optimizer = optim.SGD(model.parameters(), lr=args['lr'],
                          momentum=args['momentum'])

    for epoch in range(1, args['epochs'] + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test_acc = test(args, model, device, test_loader)

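        # report this epoch's accuracy to NNI as an intermediate result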
        nni.report_intermediate_result(test_acc)
        logger.debug('test accuracy %g', test_acc)
        logger.debug('Pipe send intermediate result done.')

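    # report the final accuracy; this is the metric that optimize_mode maximizes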
    nni.report_final_result(test_acc)
    logger.debug('Final result is %g', test_acc)
    logger.debug('Send final result done.')

def get_params():

    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument("--data_dir", type=str,
                        default='./data', help="data directory")
    parser.add_argument('--batch_size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument("--batch_num", type=int, default=None)
    parser.add_argument("--hidden_size", type=int, default=512, metavar='N',
                        help='hidden layer size (default: 512)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--no_cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--log_interval', type=int, default=1000, metavar='N',
                        help='how many batches to wait before logging training status')

    args, _ = parser.parse_known_args()
    return args

if __name__ == '__main__':
    try:

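        # fetch the hyperparameters chosen by the tuner for this trial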
        tuner_params = nni.get_next_parameter()
        logger.debug(tuner_params)
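        # merge the tuner's parameters into the argparse defaults, then convert to a dict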
        params = vars(merge_parameter(get_params(), tuner_params))
        print(params)
        main(params)
    except Exception as exception:
        logger.exception(exception)
        raise

Step 4: Run the experiment

nnictl create --config examples\trials\mnist-pytorch\config_windows.yml --port 8088

Change into the repository directory and run the command above (the command shown uses the Windows config; on Linux, point --config at the config.yml edited in Step 2).
--port (short form -p) is the port the Web UI will listen on. Note that if your code runs in a conda virtual environment, activate that environment first.

Step 5: Monitor the training process

Open the Web UI URL printed in the terminal to view the experiment.


In the top-left corner of the overview page, clicking the Search space, Config, and Log files entries displays the settings of the experiment.


The Hyper-parameter panel shows the training results for the different hyperparameter combinations.


The Trial jobs panel shows the test result and metric plot of each individual trial.


Step 6: Stop the experiment

nnictl stop

Common basic operations
Reference: https://nni.readthedocs.io/en/latest/Tutorial/WebUI.html

When the experiment starts, nnictl prints the Web UI address, for example:

The Web UI urls are: http://223.255.255.1:8080 http://127.0.0.1:8080

Useful nnictl commands:

  1. nnictl experiment show: show the information of experiments
  2. nnictl trial ls: list all of trial jobs
  3. nnictl top: monitor the status of running experiments
  4. nnictl log stderr: show stderr log content
  5. nnictl log stdout: show stdout log content
  6. nnictl stop: stop an experiment
  7. nnictl trial kill: kill a trial job by id
  8. nnictl --help: get help information about nnictl

Original: https://blog.csdn.net/weixin_38353277/article/details/121250088
Author: 中科哥哥
Title: 微软自动调参工具 NNI 使用事例教程 (A Worked Tutorial for NNI, Microsoft's Automatic Hyperparameter Tuning Tool)
