Step 1: Installation
NNI installs with a single pip command, and the repository ships examples for reference and learning.
Requirements: TensorFlow, Python >= 3.5.
```bash
python3 -m pip install --upgrade nni
git clone https://github.com/Microsoft/nni.git
python3 -m pip install tensorflow
```
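To confirm the CLI landed on your PATH before going further, you can print the installed version (the output depends on which release you installed):

```bash
# sanity check: prints the installed NNI version
nnictl --version
```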
Step 2: Define the hyperparameter search space
The NNI example used in this tutorial lives here:

```bash
cd ./nni/examples/trials/mnist-pytorch/
```
The directory contains three files:
- config.yml
- mnist.py
- search_space.json
These three files define the NNI configuration, the training entry point (mnist.py), and the hyperparameter search space.
Open search_space.json:
```json
{
    "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
    "hidden_size": {"_type": "choice", "_value": [128, 256, 512, 1024]},
    "lr": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]},
    "momentum": {"_type": "uniform", "_value": [0, 1]}
}
```
Here you define your hyperparameters and their search ranges; adjust them freely to fit your task.
NNI supports many sampling types; uniform and choice are the most common.
This example uses only uniform and choice; the other types shown in the repository are as follows:
{"_type": "choice", "_value": options}
{"_type": "uniform", "_value": [low, high]}
{"_type": "quniform", "_value": [low, high, q]}
{"_type": "normal", "_value": [mu, sigma]}
{"_type": "randint", "_value": [lower, upper]}
Step 3: Configure config.yml
Open config.yml:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 20
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: .
  gpuNum: 0
```
For comparison, a fuller configuration with logging options and local GPU scheduling looks like this:

```yaml
authorName: az
experimentName: demo
trialConcurrency: 5
maxExecDuration: 24h
maxTrialNum: 10
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
logDir: ./log
logLevel: info
tuner:
  builtinTunerName: TPE
trial:
  command: python3 run_demo.py
  codeDir: .
  gpuNum: 1
localConfig:
  gpuIndices: 0,3
  maxTrialNumPerGpu: 2
  useActiveGpu: false
```
Apart from command, maxExecDuration, trialConcurrency, gpuNum, and optimize_mode, these parameters generally do not need to be changed.
command is what NNI executes to launch each trial; replace mnist.py with your own entry script (main.py, train.py, and so on).
maxExecDuration is the time budget for the entire NNI tuning experiment, not for a single training run.
trialConcurrency is the number of trials run in parallel. Set it according to how many GPUs you have; it is not the same thing as gpuNum below. A trial is one tuning run, i.e. your train.py executing with one sampled set of hyperparameters, so a concurrency of x means x trainers training at once.
gpuNum is the number of GPUs each trial needs, not the total for the whole experiment. If a single training run needs N GPUs, set it to N; if one GPU is enough, set it to 1.
The total number of GPUs required is trialConcurrency * gpuNum, i.e. the number of concurrent trials times the GPUs each trial needs; for example, trialConcurrency: 5 with gpuNum: 1 occupies 5 GPUs.
optimize_mode sets the optimization direction, either maximize or minimize; how to choose it is covered in the next step.
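optimize_mode must agree with the metric your trial reports. A hedged sketch of the two cases (the metric values are made up for illustration):

```python
import nni

# With optimize_mode: maximize, report a metric where higher is better,
# e.g. validation accuracy.
nni.report_final_result(0.97)

# With optimize_mode: minimize, report a metric where lower is better,
# e.g. validation loss (commented out; a trial reports one final result).
# nni.report_final_result(0.08)
```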
Step 4: Modify our code
The trial script needs only three NNI-specific touch points: nni.get_next_parameter() to receive a sampled configuration, nni.report_intermediate_result() after each epoch, and nni.report_final_result() at the end. The full example:
"""
A deep MNIST classifier using convolutional layers.
This file is a modification of the official pytorch mnist example:
https://github.com/pytorch/examples/blob/master/mnist/main.py
"""
import os
import argparse
import logging
import nni
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from nni.utils import merge_parameter
from torchvision import datasets, transforms
logger = logging.getLogger('mnist_AutoML')
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 20, 5, 1)
self.conv2 = nn.Conv2d(20, 50, 5, 1)
self.fc1 = nn.Linear(4*4*50, hidden_size)
self.fc2 = nn.Linear(hidden_size, 10)
def forward(self, x):
x = F.relu(self.conv1(x))
x = F.max_pool2d(x, 2, 2)
x = F.relu(self.conv2(x))
x = F.max_pool2d(x, 2, 2)
x = x.view(-1, 4*4*50)
x = F.relu(self.fc1(x))
x = self.fc2(x)
return F.log_softmax(x, dim=1)
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
if (args['batch_num'] is not None) and batch_idx >= args['batch_num']:
break
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % args['log_interval'] == 0:
logger.info('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
def test(args, model, device, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item()
pred = output.argmax(dim=1, keepdim=True)
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
accuracy = 100. * correct / len(test_loader.dataset)
logger.info('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset), accuracy))
return accuracy
def main(args):
use_cuda = not args['no_cuda'] and torch.cuda.is_available()
torch.manual_seed(args['seed'])
device = torch.device("cuda" if use_cuda else "cpu")
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
data_dir = args['data_dir']
train_loader = torch.utils.data.DataLoader(
datasets.MNIST(data_dir, train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args['batch_size'], shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
datasets.MNIST(data_dir, train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=1000, shuffle=True, **kwargs)
hidden_size = args['hidden_size']
model = Net(hidden_size=hidden_size).to(device)
optimizer = optim.SGD(model.parameters(), lr=args['lr'],
momentum=args['momentum'])
for epoch in range(1, args['epochs'] + 1):
train(args, model, device, train_loader, optimizer, epoch)
test_acc = test(args, model, device, test_loader)
nni.report_intermediate_result(test_acc)
logger.debug('test accuracy %g', test_acc)
logger.debug('Pipe send intermediate result done.')
nni.report_final_result(test_acc)
logger.debug('Final result is %g', test_acc)
logger.debug('Send final result done.')
def get_params():
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument("--data_dir", type=str,
default='./data', help="data directory")
parser.add_argument('--batch_size', type=int, default=64, metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument("--batch_num", type=int, default=None)
parser.add_argument("--hidden_size", type=int, default=512, metavar='N',
help='hidden layer size (default: 512)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
help='SGD momentum (default: 0.5)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
help='number of epochs to train (default: 10)')
parser.add_argument('--seed', type=int, default=1, metavar='S',
help='random seed (default: 1)')
parser.add_argument('--no_cuda', action='store_true', default=False,
help='disables CUDA training')
parser.add_argument('--log_interval', type=int, default=1000, metavar='N',
help='how many batches to wait before logging training status')
args, _ = parser.parse_known_args()
return args
if __name__ == '__main__':
try:
tuner_params = nni.get_next_parameter()
logger.debug(tuner_params)
params = vars(merge_parameter(get_params(), tuner_params))
print(params)
main(params)
except Exception as exception:
logger.exception(exception)
raise
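Stripped of the MNIST specifics, the pattern for adapting your own project is small: fetch parameters, report an intermediate metric each epoch, report the final metric once. A minimal self-contained sketch (the dummy objective below is purely illustrative; swap in your real training and evaluation loop):

```python
import nni

def run_trial():
    # 1. Receive one hyperparameter sample from the tuner
    #    (falls back to defaults when run outside an NNI experiment).
    params = nni.get_next_parameter() or {'lr': 0.01, 'momentum': 0.5}

    # 2. Stand-in for real training: score a made-up objective so the
    #    sketch stays runnable (replace with your train/eval loop).
    for epoch in range(5):
        metric = 1.0 - abs(params['lr'] - 0.001) \
                 - 0.1 * (1 - params['momentum']) / (epoch + 1)
        # 3a. Per-epoch metric, plotted as a curve in the Web UI
        nni.report_intermediate_result(metric)

    # 3b. The single number the tuner actually optimizes
    nni.report_final_result(metric)

if __name__ == '__main__':
    run_trial()
```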
Step 5: Run the code
Switch to the directory containing your code and launch the experiment. If your code relies on a conda virtual environment, activate it first.

```bash
nnictl create --config examples\trials\mnist-pytorch\config_windows.yml --port 8088
```

--port (short form -p) sets the port the Web UI listens on.
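The command above points at the Windows config; on Linux or macOS the equivalent call, assuming you launch from the repository root, would be:

```bash
# same experiment with the cross-platform config, Web UI on port 8088
nnictl create --config examples/trials/mnist-pytorch/config.yml --port 8088
```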
Step 6: Monitor training
Open the Web UI URL printed in the terminal.
In the top-left corner of the page, clicking Search space, Config, and Log files displays the parameters you configured.
The Hyper-parameter tab visualizes how each sampled hyperparameter combination performed.
The Trial jobs tab lists every tuning run with its test result and metric curve.
Step 7: Stop
```bash
nnictl stop
```
Common basic operations
Reference: https://nni.readthedocs.io/en/latest/Tutorial/WebUI.html
When an experiment starts, nnictl prints the Web UI address:

```
The Web UI urls are: http://223.255.255.1:8080   http://127.0.0.1:8080
```
| Command | Description |
| --- | --- |
| nnictl experiment show | show the information of experiments |
| nnictl trial ls | list all of trial jobs |
| nnictl top | monitor the status of running experiments |
| nnictl log stderr | show stderr log content |
| nnictl log stdout | show stdout log content |
| nnictl stop | stop an experiment |
| nnictl trial kill | kill a trial job by id |
| nnictl --help | get help information about nnictl |
Original: https://blog.csdn.net/weixin_38353277/article/details/121250088
Author: 中科哥哥
Title: A Worked Tutorial on Microsoft's Auto-Tuning Tool NNI