Table of Contents
Using a TimeSformer pretrained model to extract video features (hands-on, Linux)
Understanding TimeSformer
Notes on the paper "Is Space-Time Attention All You Need for Video Understanding?"
Using a TimeSformer pretrained model to extract video features
(hands-on, Linux)
1. Download the official code:
git clone https://github.com/facebookresearch/TimeSformer
cd TimeSformer  # enter the repo folder
2. Create the environment:
Create the conda environment
conda create -n TimeSformer python=3.7 -y
Activate it
conda activate TimeSformer
Install PyTorch and related packages
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
Install the remaining packages following the official steps
pip install 'git+https://github.com/facebookresearch/fvcore'
pip install simplejson
pip install einops
pip install timm
conda install av -c conda-forge
pip install psutil
pip install scikit-learn
pip install opencv-python
pip install tensorboard
pip install matplotlib
pip install scipy
3. Prepare the dataset to pretrain on:
(Here I downloaded the untrimmed videos from THUMOS14.)
Download the TH14 dataset archive
wget -c https://storage.googleapis.com/thumos14_files/TH14_validation_set_mp4.zip
Unzip the dataset
unzip TH14_validation_set_mp4.zip
1) Generate train.csv
The csv format required by the official documentation:
Construct the Kinetics video loader with a given csv file. The format of
the csv file is:
path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N
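A short helper can produce a file in that format. A minimal sketch, assuming the videos sit directly in the unzipped folder and using a placeholder label of 0 (the untrimmed THUMOS14 validation videos carry no Kinetics class label, and the label is irrelevant for feature extraction):

```python
import os

def make_csv(video_dir, csv_path, label=0):
    """Write one 'path_to_video label' line per .mp4, as the loader expects."""
    with open(csv_path, 'w') as f:
        for name in sorted(os.listdir(video_dir)):
            if name.endswith('.mp4'):
                f.write('{} {}\n'.format(os.path.join(video_dir, name), label))

# e.g. make_csv('TH14_validation_set_mp4', 'train.csv')
```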
4. Pretraining
1) Choose a model config:
Here I use TimeSformer_divST_16x16_448.yaml; change line 9 (the dataset path) and line 42 (the GPU count, to match your machine).
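The exact line numbers vary between repo versions; the relevant keys (sketched from the SlowFast-style config layout TimeSformer uses — verify against your copy of the file) look roughly like:

```yaml
DATA:
  PATH_TO_DATA_DIR: /Video_feature_extraction/TH14_validation_set_mp4  # dataset path (around line 9)
NUM_GPUS: 1  # set to the number of GPUs you actually have (around line 42)
```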
2) Run the program:
First copy run_net.py from the TimeSformer/tools/ folder into the TimeSformer/ folder, then run:
python run_net.py --cfg configs/Kinetics/TimeSformer_divST_16x16_448.yaml
If downloading the initial weights fails, copy the URL into a browser, download the file there, and upload it to the corresponding folder on the server.
Extracting video features with the pretrained model and saving them as .npy files
1) First, create the file Video_frame_lift.py inside the TimeSformer folder
The model takes images as input, so each video must first be decoded into frames and saved (depending on the model variant, 8, 16, or 32 frames are ultimately fed to the model; the default strategy samples frames uniformly by segment, and you can change it).
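The uniform-segment strategy amounts to splitting the frame sequence into equal chunks and taking the middle frame of each; a small sketch with a hypothetical 40-frame video and 8 segments:

```python
# Uniform segment sampling: split the video's frames into num_segments
# equal chunks and take the middle frame of each (hypothetical 40 frames).
num_frames = 40
num_segments = 8
average_duration = num_frames // num_segments  # 5 frames per segment
offsets = [int(average_duration / 2.0 + average_duration * x) + 1  # 1-based: ffmpeg names frames from 00001.jpg
           for x in range(num_segments)]
print(offsets)  # [3, 8, 13, 18, 23, 28, 33, 38]
```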
First, prepare a file listing the videos so they can be processed in batch. Here I generate a txt file with one line per video in the format video_name + '\t' + video_path; the generation code is:
Video_frame_lift.py
import os

path = '/Video_feature_extraction/TH14_validation_set_mp4'  # directory to walk
txt_path = '/Video_feature_extraction/video_validation.txt'  # path of the generated txt file

with open(txt_path, 'w') as f:
    for root, dirs, names in os.walk(path):
        for name in names:
            ext = os.path.splitext(name)[1]  # file extension
            if ext == '.mp4':
                video_path = os.path.join(root, name)  # original path of the mp4 file
                video_name = name.split('.')[0]
                f.write(video_name + '\t' + video_path + '\n')
2) Then extract frames from the videos with ffmpeg; create the file ffmpeg.py
ffmpeg.py
import os
import subprocess

OUT_DATA_DIR = "/Video_feature_extraction/validation_pics"  # folder for the output frames
txt_path = "/Video_feature_extraction/video_validation.txt"

i = 1
with open(txt_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        video_name = line.split('\t')[0].split('.')[0]
        video_path = line.split('\t')[1]
        dst_path = os.path.join(OUT_DATA_DIR, video_name)
        if not os.path.exists(dst_path):
            os.makedirs(dst_path)
        print(i)
        i += 1
        # one frame per second, quality 2, files named 00001.jpg, 00002.jpg, ...
        cmd = 'ffmpeg -i {} -r 1 -q:v 2 -f image2 {}/%05d.jpg'.format(video_path, dst_path)
        print(cmd)
        subprocess.call(cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Note: if the per-video output folders contain no extracted frames, the installed ffmpeg version may be the problem.
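A quick way to catch that case is to scan the output directory for per-video folders that ended up without any frames; a small helper sketch (the directory name in the example call is an assumption matching the paths used above):

```python
import os

def empty_frame_dirs(out_dir):
    """Return the per-video subfolders of out_dir that contain no .jpg frames."""
    empty = []
    for d in sorted(os.listdir(out_dir)):
        sub = os.path.join(out_dir, d)
        if os.path.isdir(sub) and not any(f.endswith('.jpg') for f in os.listdir(sub)):
            empty.append(d)
    return empty

# e.g. print(empty_frame_dirs('/Video_feature_extraction/validation_pics'))
```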
3) Create a models folder inside the TimeSformer folder, then create transforms.py in it
(i.e. TimeSformer/models/transforms.py)
transforms.py
import torchvision
import random
from PIL import Image, ImageOps
import numpy as np
import numbers
import torch


class GroupRandomCrop(object):
    def __init__(self, size):
        if isinstance(size, numbers.Number):
            self.size = (int(size), int(size))
        else:
            self.size = size

    def __call__(self, img_group):
        w, h = img_group[0].size
        th, tw = self.size
        out_images = list()
        x1 = random.randint(0, w - tw)
        y1 = random.randint(0, h - th)
        for img in img_group:
            assert img.size[0] == w and img.size[1] == h
            if w == tw and h == th:
                out_images.append(img)
            else:
                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))
        return out_images


class GroupCenterCrop(object):
    def __init__(self, size):
        self.worker = torchvision.transforms.CenterCrop(size)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


class GroupRandomHorizontalFlip(object):
    """Randomly horizontally flips the given PIL.Images with a probability of 0.5."""
    def __init__(self, is_flow=False):
        self.is_flow = is_flow

    def __call__(self, img_group, is_flow=False):
        v = random.random()
        if v < 0.5:
            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
            if self.is_flow:
                for i in range(0, len(ret), 2):
                    ret[i] = ImageOps.invert(ret[i])  # invert flow pixel values when flipping
            return ret
        else:
            return img_group  # the group stays a plain list of images, not stacked


class GroupNormalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        rep_mean = self.mean * (tensor.size()[0] // len(self.mean))
        rep_std = self.std * (tensor.size()[0] // len(self.std))
        # TODO: make efficient
        for t, m, s in zip(tensor, rep_mean, rep_std):
            t.sub_(m).div_(s)
        return tensor


class GroupScale(object):
    """Rescales the input PIL.Images to the given 'size'.
    'size' is the size of the smaller edge: if height > width, the image
    is rescaled to (size * height / width, size).
    interpolation: default PIL.Image.BILINEAR
    """
    def __init__(self, size, interpolation=Image.BILINEAR):
        self.worker = torchvision.transforms.Resize(size, interpolation)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


class GroupOverSample(object):
    def __init__(self, crop_size, scale_size=None):
        self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size)
        if scale_size is not None:
            self.scale_worker = GroupScale(scale_size)
        else:
            self.scale_worker = None

    def __call__(self, img_group):
        if self.scale_worker is not None:
            img_group = self.scale_worker(img_group)
        image_w, image_h = img_group[0].size
        crop_w, crop_h = self.crop_size
        offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h)
        oversample_group = list()
        for o_w, o_h in offsets:
            normal_group = list()
            flip_group = list()
            for i, img in enumerate(img_group):
                crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h))
                normal_group.append(crop)
                flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT)
                if img.mode == 'L' and i % 2 == 0:
                    flip_group.append(ImageOps.invert(flip_crop))
                else:
                    flip_group.append(flip_crop)
            oversample_group.extend(normal_group)
            oversample_group.extend(flip_group)
        return oversample_group


class GroupMultiScaleCrop(object):
    # Randomly crops the whole image group at one of several scales, then
    # resizes the crop to input_size (224 x 224 here).
    def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True):
        self.scales = scales if scales is not None else [1, .875, .75, .66]
        self.max_distort = max_distort
        self.fix_crop = fix_crop
        self.more_fix_crop = more_fix_crop
        self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size]
        self.interpolation = Image.BILINEAR  # bilinear interpolation

    def __call__(self, img_group):  # in practice the Dataset class passes img_group in automatically
        im_size = img_group[0].size  # size of the first image
        crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
        crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
        ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation)
                         for img in crop_img_group]
        return ret_img_group  # again a plain list of images, not stacked

    def _sample_crop_size(self, im_size):
        image_w, image_h = im_size[0], im_size[1]  # width, height
        # find a crop size
        base_size = min(image_w, image_h)  # shorter edge
        crop_sizes = [int(base_size * x) for x in self.scales]  # shorter edge times [1, .875, .75, .66]
        # note: 2 * [2, 4, 6, 8] == [2, 4, 6, 8, 2, 4, 6, 8], hence the comprehension above
        # candidate sizes within 3 px of input_size snap to input_size exactly
        crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes]
        crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes]
        pairs = []
        for i, h in enumerate(crop_h):
            for j, w in enumerate(crop_w):
                # only allow (w, h) pairs whose aspect-ratio distortion is limited
                if abs(i - j) <= self.max_distort:
                    pairs.append((w, h))
        crop_pair = random.choice(pairs)
        if not self.fix_crop:
            w_offset = random.randint(0, image_w - crop_pair[0])
            h_offset = random.randint(0, image_h - crop_pair[1])
        else:
            w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1])
        return crop_pair[0], crop_pair[1], w_offset, h_offset

    def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
        offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h)
        return random.choice(offsets)

    @staticmethod
    def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
        w_step = (image_w - crop_w) // 4
        h_step = (image_h - crop_h) // 4
        ret = list()
        ret.append((0, 0))                    # upper left
        ret.append((4 * w_step, 0))           # upper right
        ret.append((0, 4 * h_step))           # lower left
        ret.append((4 * w_step, 4 * h_step))  # lower right
        ret.append((2 * w_step, 2 * h_step))  # center
        if more_fix_crop:
            ret.append((0, 2 * h_step))           # center left
            ret.append((4 * w_step, 2 * h_step))  # center right
            ret.append((2 * w_step, 4 * h_step))  # lower center
            ret.append((2 * w_step, 0 * h_step))  # upper center
            ret.append((1 * w_step, 1 * h_step))  # upper left quarter
            ret.append((3 * w_step, 1 * h_step))  # upper right quarter
            ret.append((1 * w_step, 3 * h_step))  # lower left quarter
            ret.append((3 * w_step, 3 * h_step))  # lower right quarter
        return ret


class Stack(object):
    """Stacks a group of PIL.Images into one (H, W, T*C) numpy array
    (frame-major channel order)."""
    def __init__(self, roll=False):
        self.roll = roll

    def __call__(self, img_group):
        if img_group[0].mode == 'L':
            return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2)
        elif img_group[0].mode == 'RGB':
            if self.roll:
                return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2)
            else:
                return np.concatenate(img_group, axis=2)


class ToTorchFormatTensor(object):
    """Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range
    [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]."""
    def __init__(self, div=True):
        self.div = div

    def __call__(self, pic):
        if isinstance(pic, np.ndarray):
            # handle a numpy array
            img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
        else:
            # handle a PIL Image
            img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
            img = img.view(pic.size[1], pic.size[0], len(pic.mode))
            img = img.transpose(0, 1).transpose(0, 2).contiguous()
        return img.float().div(255) if self.div else img.float()
4) Next, create dataloader.py inside the TimeSformer folder
import os
import numpy as np
import torch
import torchvision
from torch.utils.data import Dataset
from PIL import Image

from models.transforms import *


class VideoClassificationDataset(Dataset):
    def __init__(self, opt, mode):
        super(VideoClassificationDataset, self).__init__()
        self.mode = mode  # which split to load: train/val/test
        self.feats_dir = opt['feats_dir']
        if self.mode != 'inference':
            print(f'load feats from {self.feats_dir}')
            with open(self.feats_dir) as f:
                feat_class_list = f.readlines()
            self.feat_class_list = feat_class_list
        if self.mode == 'val':
            self.n = len(self.feat_class_list)  # number of videos to extract

        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
        model_transform_params = {
            "side_size": 256,
            "crop_size": 224,
            "num_segments": 8,
            "sampling_rate": 5
        }
        # Get transform parameters based on the model
        transform_params = model_transform_params
        transform_train = torchvision.transforms.Compose([
            GroupMultiScaleCrop(transform_params["crop_size"], [1, .875, .75, .66]),
            GroupRandomHorizontalFlip(is_flow=False),
            Stack(roll=False),
            ToTorchFormatTensor(div=True),
            GroupNormalize(mean, std),
        ])
        transform_val = torchvision.transforms.Compose([
            GroupScale(int(transform_params["side_size"])),
            GroupCenterCrop(transform_params["crop_size"]),
            Stack(roll=False),
            ToTorchFormatTensor(div=True),
            GroupNormalize(mean, std),
        ])
        self.transform_params = transform_params
        self.transform_train = transform_train
        self.transform_val = transform_val
        print("Finished initializing dataloader.")

    def __getitem__(self, ix):
        """Returns a dict that is further passed to collate_fn."""
        ix = ix % self.n
        fc_feat = self._load_video(ix)
        data = {
            'fc_feats': fc_feat,
            'video_id': ix,
        }
        return data

    def __len__(self):
        return self.n

    def _load_video(self, idx):
        prefix = '{:05d}.jpg'
        feat_path_list = []
        for i in range(len(self.feat_class_list)):
            feat_path = self.feat_class_list[i].rstrip('\n').split('\t')[1]
            feat_path_list.append(feat_path)
        video_data = {}
        if self.mode == 'val':
            images = []
            frame_list = os.listdir(feat_path_list[idx])
            average_duration = len(frame_list) // self.transform_params["num_segments"]
            # offsets are the sampled frame indices: the middle frame of each
            # segment, shifted by 1 because the ffmpeg frame names are 1-based
            offsets = np.array([int(average_duration / 2.0 + average_duration * x)
                                for x in range(self.transform_params["num_segments"])])
            offsets = offsets + 1
            for seg_ind in offsets:
                p = int(seg_ind)
                seg_imgs = Image.open(os.path.join(feat_path_list[idx], prefix.format(p))).convert('RGB')
                images.append(seg_imgs)
            video_data = self.transform_val(images)
            # the stacked tensor is (T*C, H, W) in frame-major order; split the
            # leading axis as (T, C), then swap to the (C, T, H, W) layout
            # TimeSformer expects
            video_data = video_data.view((self.transform_params["num_segments"], 3)
                                         + video_data.size()[1:]).permute(1, 0, 2, 3)
        return video_data
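One detail worth checking here: Stack concatenates the frames channel-wise in frame-major order, so the tensor entering the final reshape is (T*C, H, W). Getting the (C, T, H, W) layout TimeSformer expects means splitting the leading axis as (T, C) and then swapping those two axes; a numpy sketch with tiny dummy frames makes the ordering visible:

```python
import numpy as np

# 8 fake RGB "frames", each filled with its own frame index so we can
# track where every frame ends up after reshaping.
T, C, H, W = 8, 3, 4, 4
frames = [np.full((H, W, C), t, dtype=np.uint8) for t in range(T)]

# Stack(roll=False): concatenate along the channel axis -> (H, W, T*C),
# frame-major channel order.
stacked = np.concatenate(frames, axis=2)

# ToTorchFormatTensor: (H, W, T*C) -> (T*C, H, W)
tensor = stacked.transpose(2, 0, 1)

# Split the leading axis as (T, C), then swap frame and channel axes.
ctHW = tensor.reshape(T, C, H, W).transpose(1, 0, 2, 3)
assert ctHW.shape == (C, T, H, W)
# every slice along the time axis is exactly one original frame
assert all((ctHW[:, t] == t).all() for t in range(T))
```

A DataLoader with batch_size=1 then yields (1, 3, 8, 224, 224), i.e. (B, C, T, H, W).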
5) Extract the video features and save them as .npy files
First choose the model to use. Here I use the pretrained models provided by the authors (downloading them may require a VPN).
I downloaded these two:
TimeSformer_divST_8x32_224_K400.pyth
TimeSformer_divST_16x16_448_K600.pyth
For consistency at extraction time, load the model in eval() mode, so that the features extracted from a given video are identical on every run. Create extract.py inside the TimeSformer folder:
import argparse
import os
import torch
import numpy as np
from torch.utils.data import DataLoader

from dataloader import VideoClassificationDataset
from timesformer.models.vit import TimeSformer

device = torch.device("cuda:6")  # pick the GPU to run on

if __name__ == '__main__':
    opt = argparse.ArgumentParser()
    opt.add_argument('test_list_dir', help="Path to the txt file listing the frame directories.")
    opt = vars(opt.parse_args())
    test_opts = {'feats_dir': opt['test_list_dir']}

    # ================= build the model ======================
    # note: img_size=224 / num_frames=8 matches the 8x32_224 checkpoint;
    # adjust these values if loading the 16x16_448 model instead
    model = TimeSformer(img_size=224, num_classes=20, num_frames=8, attention_type='divided_space_time',
                        pretrained_model='checkpoints/TimeSformer_divST_16x16_448_K600.pyth')
    model = model.eval().to(device)
    print(model)

    # ================= load the data ========================
    print("Use", torch.cuda.device_count(), 'gpus')
    test_dataset = VideoClassificationDataset(test_opts, 'val')
    test_loader = DataLoader(test_dataset, batch_size=1, num_workers=6, shuffle=False)

    # ============ feature extraction and saving =============
    os.makedirs('video_feature', exist_ok=True)
    with open("./video_validation.txt") as file1:
        file1_list = file1.readlines()
    i = 0
    for data in test_loader:
        model_input = data['fc_feats'].to(device)
        name_feature = file1_list[i].rstrip().split('\t')[0].split('.')[0]
        i = i + 1
        out = model(model_input)
        out = out.squeeze(0)
        out = out.cpu().detach().numpy()
        print("out.shape:", out.shape)
        np.save('video_feature/' + name_feature + '.npy', out)
        print(i)
Then run the program from the terminal:
python extract.py ./video_validation.txt
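Afterwards, video_feature/ should contain one .npy file per video. A quick sanity check can confirm each saved array loads correctly (the file name in the example call is hypothetical; with the 20-class head used above, each array is a length-20 vector):

```python
import numpy as np

def check_feature(path):
    """Load one saved feature file and return its shape and dtype."""
    feat = np.load(path)
    return feat.shape, feat.dtype

# e.g. check_feature('video_feature/video_validation_0000051.npy')
```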
Original: https://blog.csdn.net/LiRongLu_/article/details/126528217
Author: 六个核桃Lu
Title: Video Transformer | Understanding TimeSformer + Hands-on Code