Using and Training the CLIP Model: Zero-Shot Classification with CLIP


1 Introduction to the Paper


1.1 Training Stage

[Figure: CLIP training stage (from the paper) — a batch of image-text pairs is encoded and matched contrastively]

The model architecture has two parts, an image encoder and a text encoder. The image encoder can be, for example, a ResNet-50, while the text encoder is a Transformer.
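In the released openai/clip package this choice of encoder shows up directly: clip.available_models() lists the trained encoder variants and clip.load picks one. A short sketch (the exact model list depends on the version you have installed):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(clip.available_models())  # ResNet and ViT image-encoder variants, e.g. 'RN50', 'ViT-B/32'
model, preprocess = clip.load('ViT-B/32', device)  # returns the model plus its image preprocessing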

The training data consists of image-text pairs collected from the web and social media. During training, for one batch of data, the text encoder and image encoder first produce text and image features; taking the inner products between all text and image features then yields a matrix. Viewed from the image side, each row of this matrix acts as a classifier over the texts; viewed from the text side, each column acts as a classifier over the images.

Since the matching between texts and images within a batch is already known, the objective is to maximize the inner products of matched image-text pairs, i.e., the diagonal elements of the matrix, while minimizing the inner products with unrelated features. The authors collected roughly 400 million such pairs from the web.
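This objective can be written down compactly. The sketch below is my reading of the pseudocode in the paper, not the authors' released training code; img_emb and txt_emb stand for the encoder outputs of one batch of N matched pairs, and the fixed temperature is a simplification of the learned temperature parameter:

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)      # L2-normalize image features
    txt_emb = F.normalize(txt_emb, dim=-1)      # L2-normalize text features
    logits = img_emb @ txt_emb.T / temperature  # N x N cosine-similarity matrix
    labels = torch.arange(len(logits), device=logits.device)  # diagonal = matched pairs
    loss_i = F.cross_entropy(logits, labels)    # rows: pick the right text for each image
    loss_t = F.cross_entropy(logits.T, labels)  # columns: pick the right image for each text
    return (loss_i + loss_t) / 2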

1.2 Testing Stage

[Figure: CLIP zero-shot inference (from the paper) — labels are turned into prompts such as "A photo of a dog" and matched against the image embedding]

At test time, a trained CLIP can be applied directly to other datasets without fine-tuning. Similar to training, the image to classify is first passed through the encoder to get its feature. Then, for every label of the target dataset, or any labels you define yourself, a corresponding piece of text is constructed; e.g., dog in the figure above becomes "A photo of a dog", and so on. The texts and the image are encoded, the text features are compared with the image feature by inner product, and the label whose text gives the largest inner product is the classification result. This achieves zero-shot classification on the target task.
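The wording of the prompt matters: the paper reports that ensembling several templates per class and averaging the normalized text embeddings beats a single "A photo of a {}" prompt. A minimal sketch (the template strings here are illustrative, not the paper's full template set):

import clip
import torch

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def zeroshot_classifier(model, classes, device):
    # build one averaged, renormalized text embedding per class
    weights = []
    with torch.no_grad():
        for c in classes:
            tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb /= emb.norm(dim=-1, keepdim=True)  # normalize each template embedding
            mean = emb.mean(dim=0)
            weights.append(mean / mean.norm())     # renormalize the average
    return torch.stack(weights, dim=1)             # shape: [dim, n_classes]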

1.3 Strengths and Weaknesses

  • Don't be overawed by its zero-shot ability; this is not true zero-shot! Trained on 400M image-text pairs, the model has surely seen plenty of images tagged with relevant text, and those images cover a far wider range than ImageNet, which is why the method can easily beat ImageNet-pretrained models in unusual domains such as clipart. But claiming that it crushes supervised methods smacks of clickbait sensationalism.
  • Another intriguing point is that the method trains the image and text features jointly (thanks to @llll in the comments for pointing this out; at first I thought only the image side was trained). My intuition is that pretrained text features are more reliable than pretrained visual features, yet the authors abandoned OpenAI's trademark giant pretrained language models, which is slightly surprising. In particular, NLP pretrained models far exceed vision models in scale, so freezing the text model might be the more practical approach?
  • The question that interests me most is how image and text should interact. Using the text encoding directly as the supervision signal for images is clearly very noisy; could one borrow from directions such as captioning and let image and text interact repeatedly during encoding to improve results? Of course, this again runs into the problem that language models are too large to train efficiently. Then again, OpenAI could go for brute force and train a large cross-modal pretraining model from scratch, in which case the 400M dataset might be too small.
  • Going deeper, NLP pretraining works well chiefly because its pretext tasks are good, whereas CV is still struggling to find suitable pretext tasks. My biggest hope for cross-modal work is that, with help from NLP, we can define pretext tasks for CV. CLIP takes the first step; there is still a long way to go.

1.4 Official Experimental Results

[Figure: experimental results reported by the authors]

2 Classification with CLIP

2.1 A Binary Cup-Recognition Task

import clip
import torch
from PIL import Image

img_path = 'cup3.jpg'
classes = ['cup', 'not_cup']

# load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# prepare the inputs
image = Image.open(img_path)
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in classes]).to(device) # build a text prompt for each class

# encode image and text features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# pick the label with the highest score
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1) # scaled cosine similarities over the classes
values, indices = similarity[0].topk(1)

# print the result
print("\nTop predictions:\n")
print('class: {} score: {:.2f}'.format(classes[indices.item()], values.item()))

For any other classification task, you only need to change `classes`; see the sketch below.
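As one illustration, the class names can come straight from an existing dataset. A minimal sketch, assuming torchvision's CIFAR-100 download works in your environment, that reuses the CIFAR-100 label set:

from torchvision.datasets import CIFAR100

# use the 100 CIFAR class names as the zero-shot label set
cifar100 = CIFAR100(root='data', train=False, download=True)
classes = cifar100.classes  # e.g. ['apple', 'aquarium_fish', ...]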

2.2 Face Attribute Classification (CelebA)

import clip
import torch
import torchvision
import time

device = "cuda" if torch.cuda.is_available() else "cpu"

def model_load(model_name):
    # load the model
    model, preprocess = clip.load(model_name, device)  # e.g. ViT-B/32 or RN50x16
    return model, preprocess

def data_load(data_path):
    # load the dataset and build one text prompt per attribute name
    celeba = torchvision.datasets.CelebA(root=data_path, split='test', download=True)
    text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in celeba.attr_names]).to(device)
    return celeba, text_inputs

def test_model(start, end, celeba, text_inputs, model, preprocess):
    # evaluate zero-shot attribute prediction on samples start..end (inclusive)
    length = end - start + 1
    face_accuracy = 0
    face_score = 0

    # the text features do not depend on the image, so encode them once
    with torch.no_grad():
        text_features = model.encode_text(text_inputs)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    for i, data in enumerate(celeba):
        face_result = 0
        if i < start:
            continue
        image, target = data
        image_input = preprocess(image).unsqueeze(0).to(device)

        with torch.no_grad():
            image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)

        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        top_score, top_label = text_probs.topk(6, dim=-1)
        for k, score in zip(top_label[0], top_score[0]):
            # CelebA has 40 attributes; count a hit if a predicted attribute is actually present
            if k.item() < 40 and target[k.item()] == 1:
                face_result = 1
                face_score += score.item()
                print('Predicted right! The prediction is {}'.format(celeba.attr_names[k.item()]))
            else:
                print('Predicted wrong! The prediction is {}'.format(celeba.attr_names[k.item()]))
        face_accuracy += face_result

        if i == end:
            break
    face_score = face_score / length
    face_accuracy = face_accuracy / length

    return face_score, face_accuracy

def main():
    start = 0
    end = 1000
    model_name = 'ViT-B/32' #ViT-B/32 RN50x16
    data_path = 'CELEBA'

    time_start = time.time()
    model, preprocess = model_load(model_name)
    celeba, text_inputs = data_load(data_path)
    face_score, face_accuracy = test_model(start, end, celeba, text_inputs, model, preprocess)
    time_end = time.time()

    print('The prediction:')
    print('face_accuracy: {:.2f}% face_score: {:.2f}%'.format(face_accuracy * 100, face_score * 100))
    print('running time: %.4f s' % (time_end - time_start))

if __name__ == '__main__':
    main()
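A caveat on the prompts: CelebA attribute names such as "Arched_Eyebrows" contain underscores and describe properties rather than objects, so "a photo of a Arched_Eyebrows" is awkward English. A hedged tweak (my own suggestion, not part of the original script) is to clean the names before tokenizing:

# hypothetical prompt cleanup: skip empty names, replace underscores,
# and phrase each attribute as a property of a person
prompts = [f"a photo of a person with {name.replace('_', ' ').lower()}"
           for name in celeba.attr_names if name]
text_inputs = torch.cat([clip.tokenize(p) for p in prompts]).to(device)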

3 Fine-Tuning CLIP

from torch.utils.data import Dataset, DataLoader
import torch
import clip
from torch import nn, optim
from PIL import Image
import os

device = 'cuda' if torch.cuda.is_available() else 'cpu'

class image_caption_dataset(Dataset):
    def __init__(self, df, preprocess):
        self.images = df["image"]
        self.caption = df["caption"]
        self.preprocess = preprocess

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        images = self.preprocess(Image.open(self.images[idx]))
        caption = self.caption[idx]
        return images, caption

def load_data(cup_path, cupnot_path, batch_size, preprocess):
    df = {'image': [], 'caption':[]}
    cup_list = os.listdir(cup_path)
    cupnot_list = os.listdir(cupnot_path)

    caption = cup_path.split('/')[-1]
    for img in cup_list:
        img_path = os.path.join(cup_path, img)
        df['image'].append(img_path)
        df['caption'].append(caption)

    caption = cupnot_path.split('/')[-1]
    for img in cupnot_list:
        img_path = os.path.join(cupnot_path, img)
        df['image'].append(img_path)
        df['caption'].append(caption)

    dataset = image_caption_dataset(df, preprocess)
    train_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # shuffle for training
    return train_dataloader

def convert_models_to_fp32(model):
    # cast parameters (and existing gradients) back to fp32 so the optimizer can update them
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None:
            p.grad.data = p.grad.data.float()

def load_pretrained_model(model_path):
    model, preprocess = clip.load(model_path, device=device, jit=False)  # jit must be False for training
    if device == "cpu":
        model.float()
    else:
        clip.model.convert_weights(model)  # fp16 weights for GPU training
    return model, preprocess

def train(epoch, batch_size, learning_rate, cup_path, cupnot_path):
    # load the model
    model, preprocess = load_pretrained_model('ViT-B/32')

    # load the dataset
    train_dataloader = load_data(cup_path, cupnot_path, batch_size, preprocess)

    # losses and optimizer
    loss_img = nn.CrossEntropyLoss().to(device)
    loss_txt = nn.CrossEntropyLoss().to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)

    for i in range(epoch):
        for batch in train_dataloader:
            list_image, list_txt = batch  # list_image: batch of preprocessed image tensors; list_txt: caption strings

            texts = clip.tokenize(list_txt).to(device)
            images = list_image.to(device)

            logits_per_image, logits_per_text = model(images, texts)
            # the i-th image matches the i-th text, so the targets are the diagonal;
            # use the actual batch length in case the last batch is smaller.
            # note: with only two distinct captions, items in a batch can share
            # a caption, which makes these contrastive targets somewhat noisy
            ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

            # backpropagation
            total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2
            optimizer.zero_grad()
            total_loss.backward()
            if device == "cpu":
                optimizer.step()
            else:
                # take the optimizer step in fp32, then convert back to fp16
                convert_models_to_fp32(model)
                optimizer.step()
                clip.model.convert_weights(model)

        print('[%d] loss: %.3f' % (i + 1, total_loss.item()))
    os.makedirs('./model', exist_ok=True)
    torch.save(model, './model/model1.pkl')

def main():
    epoch = 100
    batch_size = 6
    learning_rate = 5e-5
    cup_path = './data/It is photo with cup'
    cupnot_path = './data/It is photo without cup'
    train(epoch, batch_size, learning_rate, cup_path, cupnot_path)

if __name__ == '__main__':
    main()
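To use the fine-tuned weights afterwards, the saved model can be loaded back directly. A minimal sketch: since the whole module was saved with torch.save above, torch.load returns it as-is, and './model/model1.pkl' is the path used in train():

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.load('./model/model1.pkl', map_location=device)  # the full module saved above
model.eval()  # switch to inference mode

Saving model.state_dict() and restoring it with load_state_dict is the more robust convention, but loading the whole module also works as long as the clip package is importable.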

Updated project files:

「CLIP」https://www.aliyundrive.com/s/mM8n836Km5M (extraction code: te40)

Original: https://blog.csdn.net/weixin_42772394/article/details/120688085
Author: 浅草夏洛洛
Title: CLIP模型的使用和训练-利用CLIP实现zero-shot的分类任务
