Automatic Classification of Genetic Variants

Source: kaggle.com/elemento/personalizedmedicine-rct

Thousands of genetic mutations can occur over the life cycle of a cancer cell. Some of these mutations are "bad" (they contribute to tumor growth), while others are "good". Accurately identifying these mutations is therefore of great importance when designing treatment plans.

Traditionally, genetic mutations are interpreted by hand. This is an extremely time-consuming task: a clinical pathologist must manually review and classify every single mutation based on evidence from text-based clinical literature.

A machine learning algorithm that uses the mutations already annotated by researchers as a training set and classifies genetic variants automatically would therefore save an enormous amount of manual effort.

This task predefines 9 classes, and every genetic mutation belongs to exactly one of them.

Both the training set and the test set are provided as two separate files. One file contains the variant information (in training_variants / test_variants); the other contains the clinical evidence that human experts used to classify the variants (text data, in training_text / test_text). The two are linked by the ID field.
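As a rough sketch of how the two tables fit together (the column names follow the loading code in Section 4.2; the concrete values below are invented purely for illustration):

import pandas as pd

# Hypothetical miniature versions of the two files (illustrative values only).
variants = pd.DataFrame({'ID': [0], 'Gene': ['FAM58A'],
                         'Variation': ['Truncating Mutations'], 'Class': [1]})
texts = pd.DataFrame({'ID': [0], 'TEXT': ['Cyclin-dependent kinases (CDKs) ...']})

# The two tables are linked by the shared ID field.
print(variants.merge(texts, on='ID', how='left'))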

3.1 Installing Dependencies

Even if the system's Python environment was installed via Anaconda, the following third-party dependencies still need to be installed:

!pip install mlxtend
!pip install imbalanced-learn
!pip install nltk
!pip install seaborn

If any of the commands above (usually the second one) fails with the following error:

ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: 'COPYING' Consider using the --user option or check the permissions.

then simply change the install command to !pip install --user imbalanced-learn.

3.2 Importing Dependencies

Run the following code to import all dependencies:

import re
import nltk
import math
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from mlxtend.classifier import StackingClassifier
from tqdm import tqdm
from collections import defaultdict
from scipy.sparse import hstack
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import log_loss
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

warnings.filterwarnings('ignore')

nltk.download('stopwords')

4.1 Utility Functions

During data exploration, code with shared logic is factored out into utility functions to improve efficiency.

def plot_confusion_matrix(y_true, y_predict):
    """
    Given the true and predicted labels, plot the confusion matrix
    together with the precision and recall matrices.
    """
    C = confusion_matrix(y_true, y_predict)
    # Row-normalized counts: recall of each true class.
    A = C / C.sum(axis=1, keepdims=True)
    # Column-normalized counts: precision of each predicted class.
    B = C / C.sum(axis=0, keepdims=True)

    labels = [1, 2, 3, 4, 5, 6, 7, 8, 9]

    print('-'*20, 'Confusion matrix', '-'*20)
    plt.figure(figsize=(20, 7))
    sns.heatmap(C, annot=True, cmap='YlGnBu', fmt='.3f', xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted class")
    plt.ylabel('True class')
    plt.show()

    print('-'*20, 'Precision matrix', '-'*20)
    plt.figure(figsize=(20, 7))
    sns.heatmap(B, annot=True, cmap='YlGnBu', fmt='.3f', xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted class")
    plt.ylabel('True class')
    plt.show()

    print('-'*20, 'Recall matrix', '-'*20)
    plt.figure(figsize=(20, 7))
    sns.heatmap(A, annot=True, cmap='YlGnBu', fmt='.3f', xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted class")
    plt.ylabel('True class')
    plt.show()

def get_feature_ratio_dict(alpha, feature_name, df):
    """
    Used for Response encoding.

    For the given feature, compute the ratio of each feature value within
    every variant class. The feature can be the gene name ('Gene') or the
    variation name ('Variation').

    alpha: Laplace smoothing coefficient
    feature_name: name of the feature column
    df: data frame holding the raw data
    """
    value_count = df[feature_name].value_counts()
    fea_dict = dict()

    for i, denominator in value_count.items():
        vec = []
        for k in range(1, 10):
            sub_df = df.loc[(df['Class']==k) & (df[feature_name]==i)]
            vec.append((sub_df.shape[0]+alpha*10) / (denominator+90*alpha))
        fea_dict[i] = vec

    return fea_dict

def get_response_code(alpha, feature_name, df):
    """
    Used for Response encoding.

    Apply Response encoding to the given column of every instance in df.

    alpha: Laplace smoothing coefficient
    feature_name: name of the feature column
    df: data frame holding the raw data
    """
    fea_dict = get_feature_ratio_dict(alpha, feature_name, df)
    fea = []

    for _, row in df.iterrows():
        if row[feature_name] in fea_dict:
            fea.append(fea_dict[row[feature_name]])
        else:
            # Feature values never seen during counting fall back to a uniform distribution.
            fea.append([1/9] * 9)

    return fea

def get_word_freq(df):
    """对于输入的df,统计其"TEXT"字段下所有内容的各词词频"""
    freq_dict = defaultdict(int)

    for _, row in df.iterrows():
        for word in row['TEXT'].split():
            freq_dict[word] += 1
    return freq_dict

def get_word_freq_list(df):
    """统计每个变异类别所对应的所有出现过的描述文本的各词词频"""
    freq_dict_list = []

    for i in range(1, 10):
        sub_df = df[df['Class'] == i]
        freq_dict_list.append(get_word_freq(sub_df))
    return freq_dict_list

def get_text_response(df):
    """对于描述文本,提取其Response编码"""
    text_feature_responsecoding = np.zeros((df.shape[0], 9))
    freq_dict_list = get_word_freq_list(df)
    all_freq_dict = get_word_freq(df)

    for i in range(0, 9):
        row_index = 0
        for _, row in df.iterrows():
            sum_prob = 0
            for word in row['TEXT'].split():
                sum_prob += math.log((freq_dict_list[i].get(word, 0)+10) / (all_freq_dict.get(word, 0)+90))
            text_feature_responsecoding[row_index][i] = math.exp(sum_prob / len(row['TEXT'].split()))
            row_index += 1
    return text_feature_responsecoding

4.2 Loading the Raw Data


cls_data = pd.read_csv('./my_data/training_variants.zip')

text_data = pd.read_csv('./my_data/training_text.zip', sep='\|\|', engine='python', names=['ID', 'TEXT'], skiprows=1, encoding='utf-8')

4.3 Stopword Removal and Dataset Merging

Filtering out stopwords helps increase feature variance and thereby improves model performance.

Here we directly use the stopword list that ships with nltk:


stop_words = set(stopwords.words('english'))

def nlp_preprocessing(total_text, index, column):
    """Remove stopwords and other invalid characters."""
    if type(total_text) is not int:
        string = ''
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        total_text = re.sub(r'\s+', ' ', total_text)
        total_text = total_text.lower()

        for word in total_text.split():
            if word not in stop_words:
                string += word + ' '
        text_data.loc[index, column] = string

pbar = tqdm(total=len(text_data))
for index, row in text_data.iterrows():
    if type(row['TEXT']) is str:
        nlp_preprocessing(row['TEXT'], index, 'TEXT')
    else:
        print("The gene variant with ID", index, "has no text description")
    pbar.update(1)
pbar.close()

all_data = pd.merge(cls_data, text_data, on='ID', how='left')

all_data.loc[all_data['TEXT'].isnull(), 'TEXT'] = all_data['Gene'] + ' ' + all_data['Variation']

4.4 Feature Extraction Methods

Since all the raw features are text, they must be encoded first. Two encoding schemes are tried: OneHot encoding and Response encoding.

In text processing, OneHot encoding provides a way to convert text into numbers. In brief, the procedure is:

1. Scan the corpus and build a vocabulary of all distinct words.
2. Represent each sample as a vector with one component per vocabulary word, recording whether (and how often) that word occurs in the sample.

After these steps, we obtain the OneHot-encoded feature vectors.
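A minimal sketch of this idea, using the same CountVectorizer that the later sections rely on (the two-document corpus here is purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['brca1 truncating mutation', 'brca1 amplification']  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Vocabulary learned from the corpus (use get_feature_names() on older scikit-learn).
print(vectorizer.get_feature_names_out())
# One count vector per document, one column per vocabulary word.
print(X.toarray())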

Response encoding is another way of handling text data: by counting word frequencies per class, each word is mapped to a feature vector that replaces the original word, completing the conversion from text to numbers. Its procedure, in outline:

Given a column of text data $[char_1, char_2, \ldots, char_n]$ and the corresponding class labels [1, 2, 1, ..., 9] (assuming 9 classes in total), where $char_i$ denotes a single word, Response encoding proceeds as follows:

1. For each distinct value $char$, count its total number of occurrences and its number of occurrences within each class $k$.
2. Encode $char$ as a 9-dimensional vector whose $k$-th component is the Laplace-smoothed fraction of its occurrences that fall into class $k$ (this is what get_feature_ratio_dict computes).
3. Values never seen during counting are encoded as the uniform vector $[1/9, \ldots, 1/9]$ (see get_response_code).
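Concretely, with $\alpha$ the Laplace smoothing coefficient, get_feature_ratio_dict computes for each feature value $char$ and each class $k$:

$$\mathrm{vec}(char)_k = \frac{N_k(char) + 10\alpha}{N(char) + 90\alpha}, \qquad k = 1, \dots, 9$$

where $N_k(char)$ is the number of occurrences of $char$ within class $k$ and $N(char)$ its total number of occurrences. The nine components sum to 1, so each feature value is effectively replaced by a smoothed class distribution.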

If $char_i$ is a long sentence rather than a single word, it must first be tokenized; each word is then encoded as above and the per-word scores are combined into a single 9-dimensional vector for the whole text. This is what get_text_response does.
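In the code above, get_text_response combines the per-word scores as a geometric mean of smoothed class-conditional word frequencies:

$$\mathrm{score}_k(T) = \exp\Bigl(\frac{1}{|T|} \sum_{w \in T} \log \frac{n_k(w) + 10}{n(w) + 90}\Bigr), \qquad k = 1, \dots, 9$$

where $T$ is the text, $|T|$ its number of words, $n_k(w)$ the frequency of word $w$ in the texts of class $k$, and $n(w)$ its frequency over all texts. The resulting nine scores per text are then row-normalized (see Section 4.5).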

4.5 Data Exploration


y_true = all_data['Class'].values
train_val_x, X_test, train_val_y, y_test = train_test_split(all_data, y_true, stratify=y_true, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(train_val_x, train_val_y, stratify=train_val_y, test_size=0.2)
print('Number of data points in train data            :', X_train.shape[0])
print('Number of data points in test data             :', X_test.shape[0])
print('Number of data points in cross-validation data :', X_val.shape[0])

train_class_distr = X_train['Class'].value_counts().sort_index()
train_class_distr.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Data points per class')
plt.title('Distribution of y in train data')
plt.grid()
plt.show()

val_class_distr = X_val['Class'].value_counts().sort_index()
val_class_distr.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Data points per class')
plt.title('Distribution of y in validation data')
plt.grid()
plt.show()

test_class_distr = X_test['Class'].value_counts().sort_index()
test_class_distr.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Data points per class')
plt.title('Distribution of y in test data')
plt.grid()
plt.show()

unique_genes = X_train['Gene'].value_counts()
print(f"Number of unique genes in train data is: {len(unique_genes)}. The distribution of genes in train data is as follows:")
s = sum(unique_genes.values)
h = unique_genes.values / s
plt.plot(h, label="Histogram of Genes")
plt.xlabel('Index of a Gene')
plt.ylabel('Frequency of Occurrences')
plt.legend()
plt.grid()
plt.show()

train_gene_response = np.array(get_response_code(1, 'Gene', X_train))
test_gene_response = np.array(get_response_code(1, 'Gene', X_test))
val_gene_response = np.array(get_response_code(1, 'Gene', X_val))

print(f"The shape of response coded feature is {train_gene_response.shape[1]}")

gene_vectorizer = CountVectorizer()
train_gene_onehot = gene_vectorizer.fit_transform(X_train['Gene'])
test_gene_onehot = gene_vectorizer.transform(X_test['Gene'])
val_gene_onehot = gene_vectorizer.transform(X_val['Gene'])

print(f"The shape of onehot coded feature is {train_gene_onehot.shape[1]}")

Next, we build a machine learning model using only the Gene feature to evaluate its usefulness:


alpha = [10**x for x in range(-5, 1)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(alpha=i, loss='log')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_gene_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_gene_onehot)
    loss_ = log_loss(y_val, y_pred)
    print(f"For alpha={i}, loss is {loss_}")
    loss_list.append(loss_)

fig, ax = plt.subplots()
ax.plot(alpha, loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], loss_list[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(alpha=best_alpha, loss='log')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_gene_onehot, y_train)

y_pred = sig_clf.predict_proba(train_gene_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_gene_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_gene_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

unique_variations = X_train['Variation'].value_counts()
print(f"Number of unique variations in train data is: {len(unique_variations)}. The distribution of variations in train data is as follows:")
s = sum(unique_variations.values)
h = unique_variations.values / s
plt.plot(h, label="Histogram of Variations")
plt.xlabel('Index of a Variation')
plt.ylabel('Frequency of Occurrences')
plt.legend()
plt.grid()
plt.show()

train_varin_response = np.array(get_response_code(1, 'Variation', X_train))
test_varin_response = np.array(get_response_code(1, 'Variation', X_test))
val_varin_response = np.array(get_response_code(1, 'Variation', X_val))

print(f"The shape of response coded feature is {train_varin_response.shape[1]}")

varin_vectorizer = CountVectorizer()
train_varin_onehot = varin_vectorizer.fit_transform(X_train['Variation'])
test_varin_onehot = varin_vectorizer.transform(X_test['Variation'])
val_varin_onehot = varin_vectorizer.transform(X_val['Variation'])

print(f"The shape of onehot coded feature is {train_varin_onehot.shape[1]}")

Next, we build a machine learning model using only the Variation feature to evaluate its usefulness:


alpha = [10**x for x in range(-5, 1)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(alpha=i, loss='log')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_varin_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_varin_onehot)
    loss_ = log_loss(y_val, y_pred)
    print(f"For alpha={i}, loss is {loss_}")
    loss_list.append(loss_)

fig, ax = plt.subplots()
ax.plot(alpha, loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], loss_list[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(alpha=best_alpha, loss='log')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_varin_onehot, y_train)

y_pred = sig_clf.predict_proba(train_varin_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_varin_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_varin_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

train_text_response = get_text_response(X_train)
test_text_response = get_text_response(X_test)
val_text_response = get_text_response(X_val)

train_text_response = (train_text_response.T / train_text_response.sum(axis=1)).T
test_text_response = (test_text_response.T / test_text_response.sum(axis=1)).T
val_text_response = (val_text_response.T / val_text_response.sum(axis=1)).T

text_vectorizer = CountVectorizer(min_df=3)
train_text_onehot = text_vectorizer.fit_transform(X_train['TEXT'])
test_text_onehot = text_vectorizer.transform(X_test['TEXT'])
val_text_onehot = text_vectorizer.transform(X_val['TEXT'])

train_text_onehot = normalize(train_text_onehot, axis=0)
test_text_onehot = normalize(test_text_onehot, axis=0)
val_text_onehot = normalize(val_text_onehot, axis=0)

alpha = [10**x for x in range(-5, 1)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(alpha=i, loss='log')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_text_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_text_onehot)
    loss_ = log_loss(y_val, y_pred)
    print(f"For alpha={i}, loss is {loss_}")
    loss_list.append(loss_)

fig, ax = plt.subplots()
ax.plot(alpha, loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], loss_list[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(alpha=best_alpha, loss='log')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_text_onehot, y_train)

y_pred = sig_clf.predict_proba(train_text_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_text_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_text_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

5.1 Preliminaries

train_onehot = hstack((hstack((train_gene_onehot, train_varin_onehot)), train_text_onehot)).tocsr()
test_onehot = hstack((hstack((test_gene_onehot, test_varin_onehot)), test_text_onehot)).tocsr()
val_onehot = hstack((hstack((val_gene_onehot, val_varin_onehot)), val_text_onehot)).tocsr()

print("OneHot encoding feature:")
print(f"Train data: {train_onehot.shape}")
print(f"Test data: {test_onehot.shape}")
print(f"Val data: {val_onehot.shape}")
train_response = np.hstack((np.hstack((train_gene_response, train_varin_response)), train_text_response))
test_response = np.hstack((np.hstack((test_gene_response, test_varin_response)), test_text_response))
val_response = np.hstack((np.hstack((val_gene_response, val_varin_response)), val_text_response))

print("Response encoding feature:")
print(f"Train data: {train_response.shape}")
print(f"Test data: {test_response.shape}")
print(f"Val data: {val_response.shape}")

5.2 Naive Bayes Model

alpha = [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000]
loss_list = []

for i in alpha:
    clf = MultinomialNB(alpha=i)

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_onehot)
    loss_ = log_loss(y_val, y_pred)
    loss_list.append(loss_)
    print(f"For alpha={i}, log Loss:", loss_)
fig, ax = plt.subplots()
ax.plot(np.log10(alpha), loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]), loss_list[i]))

plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = MultinomialNB(alpha=best_alpha)
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

5.3 KNN Model

alpha = [5, 11, 15, 21, 31, 41, 51, 99]
loss_list = []

for i in alpha:
    clf = KNeighborsClassifier(n_neighbors=i)

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_response, y_train)
    y_pred = sig_clf.predict_proba(val_response)
    loss_ = log_loss(y_val, y_pred)
    loss_list.append(loss_)
    print(f"For alpha={i}, log loss is: {loss_}")
fig, ax = plt.subplots()
ax.plot(np.log10(alpha), loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]), loss_list[i]))

plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = KNeighborsClassifier(n_neighbors=best_alpha)
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_response, y_train)

y_pred = sig_clf.predict_proba(train_response)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_response)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_response)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_response))

5.4 Logistic Regression (LR) Model

alpha = [10**x for x in range(-6, 3)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(class_weight='balanced', alpha=i, loss='log')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_onehot)
    loss_ = log_loss(y_val, y_pred)
    loss_list.append(loss_)
    print(f"For alpha={i}, log loss is: {loss_}")

fig, ax = plt.subplots()
ax.plot(np.log10(alpha), loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]), loss_list[i]))

plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(class_weight='balanced', alpha=best_alpha, loss='log')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))
plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

# Repeat the experiment without class balancing for comparison.
alpha = [10**x for x in range(-6, 3)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(alpha=i, loss='log')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_onehot)
    loss_ = log_loss(y_val, y_pred)
    loss_list.append(loss_)
    print(f"For alpha={i}, log loss is: {loss_}")
fig, ax = plt.subplots()
ax.plot(np.log10(alpha), loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]), loss_list[i]))

plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(alpha=best_alpha, loss='log')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

5.5 SVM Model

alpha = [10**x for x in range(-5, 3)]
loss_list = []

for i in alpha:
    clf = SGDClassifier(class_weight='balanced', alpha=i, loss='hinge')

    sig_clf = CalibratedClassifierCV(clf)
    sig_clf.fit(train_onehot, y_train)
    y_pred = sig_clf.predict_proba(val_onehot)
    loss_ = log_loss(y_val, y_pred)
    loss_list.append(loss_)
    print(f"For alpha={i}, log loss is: {loss_}")
fig, ax = plt.subplots()
ax.plot(np.log10(alpha), loss_list, c='g')
for i, txt in enumerate(np.round(loss_list, 3)):
    ax.annotate((alpha[i],str(txt)), (np.log10(alpha[i]), loss_list[i]))

plt.grid()
plt.xticks(np.log10(alpha))
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = alpha[np.argmin(loss_list)]
clf = SGDClassifier(class_weight='balanced', alpha=best_alpha, loss='hinge')
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

5.6 Random Forest (RF) Model

alpha = [50, 100, 200, 300, 500]
max_depth = [5, 10]
loss_list = []

for i in alpha:
    for j in max_depth:
        clf = RandomForestClassifier(n_estimators=i, max_depth=j, n_jobs=-1)

        sig_clf = CalibratedClassifierCV(clf)
        sig_clf.fit(train_onehot, y_train)
        y_pred = sig_clf.predict_proba(val_onehot)
        loss_ = log_loss(y_val, y_pred)
        loss_list.append(loss_)
        print(f"For estimators={i}, depth={j}, log loss is: {loss_}")

best_index = np.argmin(loss_list)
best_alpha = alpha[best_index // 2]      # 2 = len(max_depth): depth varies fastest in the grid
best_depth = max_depth[best_index % 2]
clf = RandomForestClassifier(n_estimators=best_alpha, max_depth=best_depth, n_jobs=-1)
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

# Repeat the grid search with a wider range of tree depths.
alpha = [50, 100, 200, 300, 500]
max_depth = [2, 3, 5, 10]
loss_list = []

for i in alpha:
    for j in max_depth:
        clf = RandomForestClassifier(n_estimators=i, max_depth=j, n_jobs=-1)

        sig_clf = CalibratedClassifierCV(clf)
        sig_clf.fit(train_onehot, y_train)
        y_pred = sig_clf.predict_proba(val_onehot)
        loss_ = log_loss(y_val, y_pred)
        loss_list.append(loss_)
        print(f"For estimators={i}, depth={j}, log loss is: {loss_}")
best_index = np.argmin(loss_list)
best_alpha = alpha[int(best_index/4)]
best_depth = max_depth[int(best_index%4)]
clf = RandomForestClassifier(n_estimators=best_alpha, max_depth=best_depth, n_jobs=-1)
sig_clf = CalibratedClassifierCV(clf)
sig_clf.fit(train_onehot, y_train)

y_pred = sig_clf.predict_proba(train_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The train log loss is:", log_loss(y_train, y_pred))

y_pred = sig_clf.predict_proba(val_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The cross val log loss is:", log_loss(y_val, y_pred))

y_pred = sig_clf.predict_proba(test_onehot)
print('For values of best alpha:', best_alpha, ', best max-depth:', best_depth, "The test log loss is:", log_loss(y_test, y_pred,))

plot_confusion_matrix(y_test, sig_clf.predict(test_onehot))

5.7 Model Ensembling


clf1 = MultinomialNB(alpha=0.1)
sig_clf1 = CalibratedClassifierCV(clf1)

clf2 = KNeighborsClassifier(n_neighbors=5)
sig_clf2 = CalibratedClassifierCV(clf2)

clf3 = SGDClassifier(alpha=0.001, loss='log', class_weight='balanced')
sig_clf3 = CalibratedClassifierCV(clf3)

clf4 = SGDClassifier(class_weight='balanced', alpha=0.001, loss='hinge')
sig_clf4 = CalibratedClassifierCV(clf4)

clf5 = RandomForestClassifier(n_estimators=500, max_depth=10, n_jobs=-1)
sig_clf5 = CalibratedClassifierCV(clf5)

alpha = [0.0001, 0.001, 0.01, 0.1, 1]
best_loss = 999
best_alpha = 999

for i in alpha:
    lr = LogisticRegression(C=i)
    sclf = StackingClassifier(classifiers=[sig_clf1, sig_clf2, sig_clf3, sig_clf4, sig_clf5],
                              meta_classifier=lr, use_probas=True)
    sclf.fit(train_onehot, y_train)
    loss_ = log_loss(y_val, sclf.predict_proba(val_onehot))
    print(f"Stacking classifiers for alpha={i}, log loss is: {loss_}")
    if best_loss > loss_:
        best_loss = loss_
        best_alpha = i

lr = LogisticRegression(C=best_alpha)
sclf = StackingClassifier(classifiers=[sig_clf1, sig_clf2, sig_clf3, sig_clf4, sig_clf5],
                          meta_classifier=lr, use_probas=True)
sclf.fit(train_onehot, y_train)

log_error = log_loss(y_train, sclf.predict_proba(train_onehot))
print("Log loss (train) on the stacking classifier:", log_error)

log_error = log_loss(y_val, sclf.predict_proba(val_onehot))
print("Log loss (CV) on the stacking classifier:", log_error)

log_error = log_loss(y_test, sclf.predict_proba(test_onehot))
print("Log loss (test) on the stacking classifier:", log_error)

plot_confusion_matrix(y_test, sclf.predict(test_onehot))

Original: https://blog.csdn.net/SunJW_2017/article/details/123252964
Author: 芳樽里的歌
Title: Automatic Classification of Genetic Variants
