Scanpy(六)空间转录组数据的分析与可视化

2023年6月15日上午6:56 • 人工智能 • 阅读 102

目前空间转录组技术基本上还未到单细胞水平，一般一个空间位置spots上有多个细胞，10X Genomics Visum一般为1-10个细胞，空间转录组学能够提供空间位置，但是分辨率达不到单细胞水平，单细胞转录组学分辨率能够达到单细胞分辨率，但是没有空间位置信息，因此将空间转录组与单细胞转录组结合的话，那就既有空间位置信息，又达到了单细胞分辨率，这为科学界提供了有力的工具，大大的提高了生物体的组织、器官等研究的分辨率和准确性。

我们导入scanpy工具：

import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sc.logging.print_versions()
sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3

"""
anndata     0.8.0
scanpy      1.9.1
"""

Reading data

我们将使用人类淋巴结（human lymphnode）的Visium空间转录组学数据集，该数据集发布在10x genomics website：link；

函数 datasets.visium_sge() 从10x Genomics下载数据集，并返回一个包含计数矩阵counts和空间坐标 spatital coordinates 的AnnData对象。

adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata.var_names_make_unique()

adata
"""
AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
"""

我们将使用 pp.calculate_qc_metrics 并基于线粒体 read counts 计算QC指标。


adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["mt"]
"""
MIR1302-2HG    False
               ...

AC007325.2     False
Name: mt, Length: 36601, dtype: bool
"""

sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

adata
"""
AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'spatial'
    obsm: 'spatial'
"""

回顾 Scanpy(三)处理3k PBMCs并聚类中的内容，我们可以了解到某些质量的含义：

n_genes_by_counts：每个细胞中，有表达的基因的个数；
total_counts：每个细胞的基因总计数（总表达量）；
pct_counts_mt：每个细胞中，线粒体基因表达量占该细胞所有基因表达量的百分比；

QC and preprocessing

我们根据 total_counts 和 n_genes_by_counts 进行一些基本过滤。我们先将这些质量指标可视化为直方图：

fig, axs = plt.subplots(1, 4, figsize=(15, 4))
sns.distplot(adata.obs["total_counts"], kde=False, ax=axs[0])
sns.distplot(adata.obs["total_counts"][adata.obs["total_counts"] < 10000], kde=False, bins=40, ax=axs[1])
sns.distplot(adata.obs["n_genes_by_counts"], kde=False, bins=60, ax=axs[2])
sns.distplot(adata.obs["n_genes_by_counts"][adata.obs["n_genes_by_counts"] < 4000], kde=False, bins=60, ax=axs[3])

第二个图是第一个图在0到10000区间的可视化， total_counts 代表一个细胞的基因总表达量。第四个图是第三个图在0到4000的可视化， n_genes_by_counts 代表一个细胞中，有表达的基因的个数。 注意，计数矩阵为4035×36601。

通过上面的可视化结果，我们保留 total_counts 在5000到35000的细胞，下一步，我们保留线粒体基因 pct_counts_mt 占比小于20%的细胞。最后，做基因上的过滤：

sc.pp.filter_cells(adata, min_counts=5000)
sc.pp.filter_cells(adata, max_counts=35000)
adata = adata[adata.obs["pct_counts_mt"] < 20]
print(f"#cells after MT filter: {adata.n_obs}")
sc.pp.filter_genes(adata, min_cells=10)

"""
filtered out 44 cells that have less than 5000 counts
filtered out 130 cells that have more than 35000 counts
#cells after MT filter: 3861
filtered out 16916 genes that are detected in less than 10 cells
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
    uns: 'spatial'
    obsm: 'spatial'
"""

我们继续使用scanpy的内置函数 normaliza_total标准化Visium counts data（包含counts和空间转录信息的数据），并在此之后检测高变基因。

sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

"""
normalizing counts per cell
    finished (0:00:00)
If you pass n_top_genes, all cutoffs are ignored.

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'spatial', 'log1p', 'hvg'
    obsm: 'spatial'
"""

Manifold embedding and clustering based on transcriptional similarity

首先，我们按照标准聚类步骤进行（pca降维，计算neighbors graph，umap降维，leiden聚类）：

sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")

"""
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:01)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to .uns['neighbors']
    .obsp['distances'], distances for each pair of neighbors
    .obsp['connectivities'], weighted adjacency matrix (0:00:06)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:06)
running Leiden clustering
    finished: found 10 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'clusters'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden'
    obsm: 'spatial', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
"""

我们基于umap可视化一些相关变量，以检查是否存在与总计数（ total_counts）和检测到的基因（ n_genes_by_counts）相关的特定数据分布：

plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.umap(adata, color=["total_counts", "n_genes_by_counts", "clusters"], wspace=0.4)

Visualization in spatial coordinates

我们可以先看看空间坐标：

adata.obsm['spatial'].shape

adata.obsm['spatial']
"""
array([[8346, 6982],
       [4270, 1363],
       ...,
       [4822, 1840]], dtype=int64)
"""

现在让我们来看看 total_counts 和 n_genes_by_counts 在空间坐标中的分布。我们将使用 sc.pl.spatial 函数将圆形点（circular spots）覆盖在提供的 Hematoxylin and eosin stain（H＆E）图像上。

plt.rcParams["figure.figsize"] = (8, 8)
sc.pl.spatial(adata, img_key="hires", color=["total_counts", "n_genes_by_counts"])

函数 sc.pl.spatial接受4个附加参数：

img_key：存储在 adata.uns中 images里的 key；
crop_coord：用于裁剪的坐标（left, right, top, bottom）；
alpha_img：透明度；
bw：将图像转换为灰度图的标志；

此外，在 sc.pl.spatial中， size参数会改变其行为：它会成为spots大小的比例因子。

之前，我们在基因表达空间中进行聚类，并使用UMAP将结果可视化。现在我们通过在空间维度上可视化样本，我们可以深入了解组织的结构，并可能深入了解细胞间的通信：

sc.pl.spatial(adata, img_key="hires", color="clusters", size=1.5)

在基因表达空间中属于同一簇的spots通常在空间维度上同时出现。例如，属于簇4的点通常被属于簇0的点包围。

我们可以放大特定的感兴趣区域，以获得特定的结论。此外，通过改变spot的alpha值，我们可以更好地从H&E图像中看到潜在的组织形态。

sc.pl.spatial(adata, img_key="hires", color="clusters", groups=["0", "4"], alpha=0.5, size=1.3)

簇的marker基因

我们进一步检查簇4。我们计算标记基因并绘制一个热图，图中显示了簇中前10个标记基因的表达水平。

sc.tl.rank_genes_groups(adata, "clusters", method="t-test")
sc.pl.rank_genes_groups_heatmap(adata, groups="4", n_genes=10, groupby="clusters")

"""
ranking genes
    finished: added to .uns['rank_genes_groups']
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:02)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running sc.tl.dendrogram with default parameters. For fine tuning it is recommended to run sc.tl.dendrogram independently.

    using 'X_pca' with n_pcs = 50
Storing dendrogram info using .uns['dendrogram_clusters']
WARNING: Groups are not reordered because the groupby categories and the var_group_labels are different.

categories: 0, 1, 2, etc.

var_group_labels: 4
"""

我们对基因CR2进行分析：

sc.pl.spatial(adata, img_key="hires", color=["clusters", "CR2"])

空间数据的高变基因

空间转录组学使研究人员能够研究基因表达趋势在空间中的变化，从而确定基因表达的空间模式。为此，我们使用SpatialDE，这是一种基于高斯过程的统计框架，旨在识别空间转录组的可变基因：

pip install spatialde

最近，还提出了其他用于识别空间变异基因的工具，例如：SPARK，trendsceek，HMRF

首先，我们将标准化的counts和coordinates转换为pandas dataframe，这是输入spatialDE所需的：

import SpatialDE

%%time
counts = pd.DataFrame(adata.X.todense(), columns=adata.var_names, index=adata.obs_names)
coord = pd.DataFrame(adata.obsm['spatial'], columns=['x_coord', 'y_coord'], index=adata.obs_names)

"""
CPU times: total: 93.8 ms
Wall time: 94.7 ms
"""

查看counts：

查看coord：

运行SpatialDE需要相当长的时间（本示例接近2小时）。

results = SpatialDE.run(coord, counts)

results中会保存基于空间转录组数据计算得到的可变基因。

MERFISH示例

如果我们使用基于FISH的技术生成空间数据，只需读cordinate表并将其分配给 adata.obsm即可。

让我们看关于Xia等人2019年的例子。

首先，我们需要下载坐标和counts数据：

import urllib.request

url_coord = "https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/15/pnas.1912459116.sd15.xlsx"
filename_coord = "pnas.1912459116.sd15.xlsx"
urllib.request.urlretrieve(url_coord, filename_coord)

url_counts = "https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/12/pnas.1912459116.sd12.csv"
filename_counts = "pnas.1912459116.sd12.csv"
urllib.request.urlretrieve(url_counts, filename_counts)

如果不能访问以上链接，可以手动到github下载：spatial datasets，这是一个空间转录数据集的整理。

然后读取到adata：

coordinates = pd.read_excel("./pnas.1912459116.sd15.xlsx", index_col=0)
counts = sc.read_csv("./pnas.1912459116.sd12.csv").transpose()

adata_merfish = counts[coordinates.index, :]
adata_merfish.obsm["spatial"] = coordinates.to_numpy()

adata_merfish
"""
AnnData object with n_obs × n_vars = 645 × 12903
    obsm: 'spatial'
"""

我们将执行标准的预处理和降维：

sc.pp.normalize_per_cell(adata_merfish, counts_per_cell_after=1e6)
sc.pp.log1p(adata_merfish)
sc.pp.pca(adata_merfish, n_comps=15)
sc.pp.neighbors(adata_merfish)
sc.tl.umap(adata_merfish)
sc.tl.leiden(adata_merfish, key_added="clusters", resolution=0.5)

"""
normalizing by total count per cell
    finished (0:00:00): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
computing PCA
    with n_comps=15
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 15
    finished: added to .uns['neighbors']
    .obsp['distances'], distances for each pair of neighbors
    .obsp['connectivities'], weighted adjacency matrix (0:00:05)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:01)
running Leiden clustering
    finished: found 7 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
"""

adata_merfish
"""
AnnData object with n_obs × n_vars = 645 × 12903
    obs: 'n_counts', 'clusters'
    uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden'
    obsm: 'spatial', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
"""

我们可以在UMAP空间和spatial坐标中可视化Leiden得到的簇。

sc.pl.umap(adata_merfish, color="clusters")
sc.pl.embedding(adata_merfish, basis="spatial", color="clusters")

Original: https://blog.csdn.net/qq_40943760/article/details/125201948
Author: tzc_fly
Title: Scanpy(六)空间转录组数据的分析与可视化

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/614034/

转载文章受原作者版权保护。转载请注明原作者出处！

人工智能

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

利用混淆矩阵及Kappa系数评价土地利用分类精度

文章目录一、准备工作二、处理步骤 * 2.1 Arcmap：栅格转点+多值提取到点 – 栅格转点多值提取至点 2.2 Excel：计算混淆矩阵三、Excel计算…

人工智能 2023年6月30日
0084
MSCOCO数据集转VOC数据集训练目标检测模型

MSCOCO数据集转VOC数据集训练目标检测模型 Images2014 Train images [83K/13GB]2014 Val images [41K/6GB]2014 T…

人工智能 2023年7月12日
0077
【微信小程序 | 实战开发】开发环境、测试环境和生产环境

; 写在前面：你是否想要掌握人工智能的最新技术和应用？你是否想要成为未来社会的创新者和领导者？你是否想要和全球的优秀导师和同学一起学习和交流？如果你的答案是肯定的，那么欢迎来到床…

人工智能 2023年5月31日
00105
聚类算法手册–原理+评估方法

聚类算法在无监督学习 (unsupervised learning)中，训练样本的标记信息是未知的。无监督学习的目标：通过对无标记训练样本的学习来揭露数据的内在性质以及规律。 …

人工智能 2023年6月2日
0075
工业相机飞拍模式介绍及相机曝光值计算

1.检测原理 (一)原理飞拍就是使用硬件比较输出或精准输出端口在极短时间内触发相机拍照，而被测物品在拍照过程中仍处于运动状态，与此同时被测物品通过图像处理软件计算出其位置的偏移量…

人工智能 2023年7月27日
0090
java rdf_知识图谱之认识RDF图

图的分类图数据库从存储结构上来说，只有两种结构：一种是属性图；另一种是RDF图，又称作三元组图。 RDF基础知识 RDF定义 RDF(Resource-Description-F…

人工智能 2023年6月1日
00103
python+opencv分类器训练模型，运动物体识别检测，无人机识别（源码直接下载可用）

一、简介使用opencv_traincascade 分类器的训练模型包括两个主要阶段：模型的训练阶段和检测阶段。本文档概述了训练自己的弱分类器的级联所需的功能。当前指南将逐步完成…

人工智能 2023年7月9日
0090
《大数据》第七章聚类 K-means算法 BFR算法 CURE算法

1.欧氏空间与非欧空间 [TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service…

人工智能 2023年6月3日
0079
岭回归（Ridge）和Lasso 回归（笔记）

最近在自学图灵教材《Python机器学习基础教程》，在csdn以博客的形式做些笔记。对于回归问题，线性模型预测的一般公式如下： ŷ = w[0] * x[0] + w[1] * …

人工智能 2023年6月16日
0086
【GAN超分辨】Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

0.前言本文为ESRGAN的扩展，提出一个能够完全在合成数据上训练的面向真实场景的超分模型——Real-ESRGAN。本文的模型生成器采用ESRGAN为骨干结构，而判别器却采用…

人工智能 2023年7月13日
0054
无监督特征对齐的迁移学习理论框架

0、前言文献《基于无监督特征对齐的变负载下滚动轴承故障诊断方法》引入迁移学习中能够实现无监督领域适应的子空间对齐(subspace alignment， SA)算法并进行改进，…

人工智能 2023年7月17日
0073
神经网络模型分类总结

卷积神经网络总结：卷积神经网络完整总结_AntheLinZ的博客-CSDN博客_典型的cnn结构一、神经网络类别一般的，神经网络模型基本结构按信息输入是否反馈，可以分为两种：…

人工智能 2023年6月30日
00120
Logistic回归——二分类 —— matlab

目录 1.简介 2.应用范围 3.分类 3.应用条件 4.原理详解 4.1 sigmod分类函数 4.2 建立目标函数 4.3 求解相关参数 5.实列分析 5.1 读取数据（exc…

人工智能 2023年6月16日
0082
SPSS：数据分析

目录 1、数据采集 2、数据是否服从正态分布 3、T检验（T Test） 4、方差分析（ANOVA） 5、卡方检验（Chi-square Test） 6、灰色关联度分析（Grey …

人工智能 2023年7月18日
0078
pandas中DataFrame数据合并及分组

划分之后可使用count计数： df_1 = df.groupby(‘key1’) print(df_1.count()) 输出结果为： key2 data1 data2 key1…

人工智能 2023年6月11日
0075
NeurIPS2022 | SegNeXt，重新思考卷积注意力设计

rTopDiv: null, cor rRightDiv: null, cor rBottomDiv: null, cor rLeftDiv: null, mapPosition:…

人工智能 2023年6月17日
0073

2024 年 5 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Scanpy(六)空间转录组数据的分析与可视化

目录

大家都在看