Scanpy(六)空间转录组数据的分析与可视化

目录

本篇内容演示如何在Scanpy中使用空间转录组数据(spatial transcriptomics data),首先,我们专注于10x Genomics Visium data,最后,我们为MERFISH技术的空间转录数据分析提供了示例。

目前空间转录组技术基本上还未到单细胞水平,一般一个空间位置spots上有多个细胞,10X Genomics Visum一般为1-10个细胞,空间转录组学能够提供空间位置,但是分辨率达不到单细胞水平,单细胞转录组学分辨率能够达到单细胞分辨率,但是没有空间位置信息,因此将空间转录组与单细胞转录组结合的话,那就既有空间位置信息,又达到了单细胞分辨率,这为科学界提供了有力的工具,大大的提高了生物体的组织、器官等研究的分辨率和准确性。

我们导入scanpy工具:

import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sc.logging.print_versions()
sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3

"""
anndata     0.8.0
scanpy      1.9.1
"""

Reading data

我们将使用人类淋巴结(human lymphnode)的Visium空间转录组学数据集,该数据集发布在10x genomics website:link

函数 datasets.visium_sge() 从10x Genomics下载数据集,并返回一个包含计数矩阵counts和空间坐标 spatital coordinates 的AnnData对象。

adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata.var_names_make_unique()

adata
"""
AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
"""

我们将使用 pp.calculate_qc_metrics 并基于线粒体 read counts 计算QC指标。


adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["mt"]
"""
MIR1302-2HG    False
               ...

AC007325.2     False
Name: mt, Length: 36601, dtype: bool
"""

sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

adata
"""
AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'spatial'
    obsm: 'spatial'
"""

回顾 Scanpy(三)处理3k PBMCs并聚类中的内容,我们可以了解到某些质量的含义:

  • n_genes_by_counts:每个细胞中,有表达的基因的个数;
  • total_counts:每个细胞的基因总计数(总表达量);
  • pct_counts_mt:每个细胞中,线粒体基因表达量占该细胞所有基因表达量的百分比;

QC and preprocessing

我们根据 total_countsn_genes_by_counts 进行一些基本过滤。我们先将这些质量指标可视化为直方图:

fig, axs = plt.subplots(1, 4, figsize=(15, 4))
sns.distplot(adata.obs["total_counts"], kde=False, ax=axs[0])
sns.distplot(adata.obs["total_counts"][adata.obs["total_counts"] < 10000], kde=False, bins=40, ax=axs[1])
sns.distplot(adata.obs["n_genes_by_counts"], kde=False, bins=60, ax=axs[2])
sns.distplot(adata.obs["n_genes_by_counts"][adata.obs["n_genes_by_counts"] < 4000], kde=False, bins=60, ax=axs[3])

Scanpy(六)空间转录组数据的分析与可视化
第二个图是第一个图在0到10000区间的可视化, total_counts 代表一个细胞的基因总表达量。第四个图是第三个图在0到4000的可视化, n_genes_by_counts 代表一个细胞中,有表达的基因的个数。 注意,计数矩阵为4035×36601

通过上面的可视化结果,我们保留 total_counts 在5000到35000的细胞,下一步,我们保留线粒体基因 pct_counts_mt 占比小于20%的细胞。最后,做基因上的过滤:

sc.pp.filter_cells(adata, min_counts=5000)
sc.pp.filter_cells(adata, max_counts=35000)
adata = adata[adata.obs["pct_counts_mt"] < 20]
print(f"#cells after MT filter: {adata.n_obs}")
sc.pp.filter_genes(adata, min_cells=10)

"""
filtered out 44 cells that have less than 5000 counts
filtered out 130 cells that have more than 35000 counts
#cells after MT filter: 3861
filtered out 16916 genes that are detected in less than 10 cells
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
    uns: 'spatial'
    obsm: 'spatial'
"""

我们继续使用scanpy的内置函数 normaliza_total标准化Visium counts data(包含counts和空间转录信息的数据),并在此之后检测高变基因。

sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

"""
normalizing counts per cell
    finished (0:00:00)
If you pass n_top_genes, all cutoffs are ignored.

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'spatial', 'log1p', 'hvg'
    obsm: 'spatial'
"""

Manifold embedding and clustering based on transcriptional similarity

首先,我们按照标准聚类步骤进行(pca降维,计算neighbors graph,umap降维,leiden聚类):

sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")

"""
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:01)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to .uns['neighbors']
    .obsp['distances'], distances for each pair of neighbors
    .obsp['connectivities'], weighted adjacency matrix (0:00:06)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:06)
running Leiden clustering
    finished: found 10 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
"""

adata
"""
AnnData object with n_obs × n_vars = 3861 × 19685
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'clusters'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden'
    obsm: 'spatial', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
"""

我们基于umap可视化一些相关变量,以检查是否存在与总计数( total_counts)和检测到的基因( n_genes_by_counts)相关的特定数据分布:

plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.umap(adata, color=["total_counts", "n_genes_by_counts", "clusters"], wspace=0.4)

Scanpy(六)空间转录组数据的分析与可视化

Visualization in spatial coordinates

我们可以先看看空间坐标:

adata.obsm['spatial'].shape

adata.obsm['spatial']
"""
array([[8346, 6982],
       [4270, 1363],
       ...,
       [4822, 1840]], dtype=int64)
"""

现在让我们来看看 total_countsn_genes_by_counts 在空间坐标中的分布。我们将使用 sc.pl.spatial 函数将圆形点(circular spots)覆盖在提供的 Hematoxylin and eosin stain(H&E)图像上。

plt.rcParams["figure.figsize"] = (8, 8)
sc.pl.spatial(adata, img_key="hires", color=["total_counts", "n_genes_by_counts"])

Scanpy(六)空间转录组数据的分析与可视化

函数 sc.pl.spatial接受4个附加参数:

  • img_key:存储在 adata.unsimages里的 key
  • crop_coord:用于裁剪的坐标(left, right, top, bottom);
  • alpha_img:透明度;
  • bw:将图像转换为灰度图的标志;

此外,在 sc.pl.spatial中, size参数会改变其行为:它会成为spots大小的比例因子。

之前,我们在基因表达空间中进行聚类,并使用UMAP将结果可视化。现在我们通过在空间维度上可视化样本,我们可以深入了解组织的结构,并可能深入了解细胞间的通信:

sc.pl.spatial(adata, img_key="hires", color="clusters", size=1.5)

Scanpy(六)空间转录组数据的分析与可视化
在基因表达空间中属于同一簇的spots通常在空间维度上同时出现。例如,属于簇4的点通常被属于簇0的点包围。

我们可以放大特定的感兴趣区域,以获得特定的结论。此外,通过改变spot的alpha值,我们可以更好地从H&E图像中看到潜在的组织形态。

sc.pl.spatial(adata, img_key="hires", color="clusters", groups=["0", "4"], alpha=0.5, size=1.3)

Scanpy(六)空间转录组数据的分析与可视化

簇的marker基因

我们进一步检查簇4。我们计算标记基因并绘制一个热图,图中显示了簇中前10个标记基因的表达水平。

sc.tl.rank_genes_groups(adata, "clusters", method="t-test")
sc.pl.rank_genes_groups_heatmap(adata, groups="4", n_genes=10, groupby="clusters")

"""
ranking genes
    finished: added to .uns['rank_genes_groups']
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:02)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running sc.tl.dendrogram with default parameters. For fine tuning it is recommended to run sc.tl.dendrogram independently.

    using 'X_pca' with n_pcs = 50
Storing dendrogram info using .uns['dendrogram_clusters']
WARNING: Groups are not reordered because the groupby categories and the var_group_labels are different.

categories: 0, 1, 2, etc.

var_group_labels: 4
"""

Scanpy(六)空间转录组数据的分析与可视化
我们对基因CR2进行分析:
sc.pl.spatial(adata, img_key="hires", color=["clusters", "CR2"])

Scanpy(六)空间转录组数据的分析与可视化

空间数据的高变基因

空间转录组学使研究人员能够研究基因表达趋势在空间中的变化,从而确定基因表达的空间模式。为此,我们使用SpatialDE,这是一种基于高斯过程的统计框架,旨在识别空间转录组的可变基因:

pip install spatialde

最近,还提出了其他用于识别空间变异基因的工具,例如:SPARK,trendsceek,HMRF

首先,我们将标准化的counts和coordinates转换为pandas dataframe,这是输入spatialDE所需的:

import SpatialDE

%%time
counts = pd.DataFrame(adata.X.todense(), columns=adata.var_names, index=adata.obs_names)
coord = pd.DataFrame(adata.obsm['spatial'], columns=['x_coord', 'y_coord'], index=adata.obs_names)

"""
CPU times: total: 93.8 ms
Wall time: 94.7 ms
"""

查看counts:

Scanpy(六)空间转录组数据的分析与可视化
查看coord:
Scanpy(六)空间转录组数据的分析与可视化

运行SpatialDE需要相当长的时间(本示例接近2小时)。

results = SpatialDE.run(coord, counts)

results中会保存基于空间转录组数据计算得到的可变基因。

MERFISH示例

如果我们使用基于FISH的技术生成空间数据,只需读cordinate表并将其分配给 adata.obsm即可。

让我们看关于Xia等人2019年的例子。

首先,我们需要下载坐标和counts数据:

import urllib.request

url_coord = "https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/15/pnas.1912459116.sd15.xlsx"
filename_coord = "pnas.1912459116.sd15.xlsx"
urllib.request.urlretrieve(url_coord, filename_coord)

url_counts = "https://www.pnas.org/highwire/filestream/887973/field_highwire_adjunct_files/12/pnas.1912459116.sd12.csv"
filename_counts = "pnas.1912459116.sd12.csv"
urllib.request.urlretrieve(url_counts, filename_counts)

如果不能访问以上链接,可以手动到github下载:spatial datasets,这是一个空间转录数据集的整理。

然后读取到adata:

coordinates = pd.read_excel("./pnas.1912459116.sd15.xlsx", index_col=0)
counts = sc.read_csv("./pnas.1912459116.sd12.csv").transpose()

adata_merfish = counts[coordinates.index, :]
adata_merfish.obsm["spatial"] = coordinates.to_numpy()

adata_merfish
"""
AnnData object with n_obs × n_vars = 645 × 12903
    obsm: 'spatial'
"""

我们将执行标准的预处理和降维:

sc.pp.normalize_per_cell(adata_merfish, counts_per_cell_after=1e6)
sc.pp.log1p(adata_merfish)
sc.pp.pca(adata_merfish, n_comps=15)
sc.pp.neighbors(adata_merfish)
sc.tl.umap(adata_merfish)
sc.tl.leiden(adata_merfish, key_added="clusters", resolution=0.5)

"""
normalizing by total count per cell
    finished (0:00:00): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
computing PCA
    with n_comps=15
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 15
    finished: added to .uns['neighbors']
    .obsp['distances'], distances for each pair of neighbors
    .obsp['connectivities'], weighted adjacency matrix (0:00:05)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:01)
running Leiden clustering
    finished: found 7 clusters and added
    'clusters', the cluster labels (adata.obs, categorical) (0:00:00)
"""

adata_merfish
"""
AnnData object with n_obs × n_vars = 645 × 12903
    obs: 'n_counts', 'clusters'
    uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden'
    obsm: 'spatial', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
"""

我们可以在UMAP空间和spatial坐标中可视化Leiden得到的簇。

sc.pl.umap(adata_merfish, color="clusters")
sc.pl.embedding(adata_merfish, basis="spatial", color="clusters")

Scanpy(六)空间转录组数据的分析与可视化

Original: https://blog.csdn.net/qq_40943760/article/details/125201948
Author: tzc_fly
Title: Scanpy(六)空间转录组数据的分析与可视化

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/614034/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球