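This post runs a trajectory analysis on the Paul et al. (2015) dataset. Before reading on, it helps to review the earlier post "PAGA: a tool combining trajectory inference and clustering".
First, import the tools. Note that the fa2 package must be installed beforehand so that the force-directed graphs below are drawn properly: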
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
results_file = './write/paul15.h5ad'
# low dpi (dots per inch) yields small inline figures
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')
"""
anndata 0.8.0
scanpy 1.9.1
"""
adata = sc.datasets.paul15()
adata
"""
AnnData object with n_obs × n_vars = 2730 × 3451
obs: 'paul15_clusters'
uns: 'iroot'
"""
Explore the data further:
adata.obs['paul15_clusters']
"""
0 7MEP
1 15Mo
2 3Ery
3 15Mo
4 3Ery
...
2725 2Ery
2726 13Baso
2727 7MEP
2728 15Mo
2729 3Ery
Name: paul15_clusters, Length: 2730, dtype: category
Categories (19, object): ['1Ery', '2Ery', '3Ery', '4Ery', ..., '16Neu', '17Neu', '18Eos', '19Lymph']
"""
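Note the labels: Ery is an abbreviation for erythrocytes (the abbreviations are explained in the PAGA paper).
Below we work with a higher precision than the default 'float32' so that the results are the same across computational platforms.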
adata.X = adata.X.astype('float64') # this is not required and results will be comparable without it
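To see why the precision can matter, here is a minimal NumPy sketch (toy numbers, not taken from the dataset) showing that a small increment that float32 drops is still preserved by float64:

```python
import numpy as np

# float32 carries roughly 7 significant decimal digits, float64 roughly 16.
small = 1e-8
sum32 = np.float32(1.0) + np.float32(small)   # the increment is rounded away
sum64 = np.float64(1.0) + np.float64(small)   # the increment survives

print(sum32 == np.float32(1.0))   # True
print(sum64 == np.float64(1.0))   # False
print(np.finfo(np.float32).eps, np.finfo(np.float64).eps)
```

Rounding differences of this kind can accumulate differently on different platforms, which is the motivation for the cast above.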
Preprocessing and visualization
Here we use a simple preprocessing recipe that ships with scanpy (the recipe follows Zheng et al., 2017):
sc.pp.recipe_zheng17(adata)
"""
running recipe zheng17
normalizing counts per cell
finished (0:00:00)
extracting highly variable genes
finished (0:00:00)
normalizing counts per cell
finished (0:00:00)
finished (0:00:00)
"""
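The log above lists the stages of the recipe. As a rough illustration of the per-cell normalization, log transform, and scaling involved, here is a NumPy sketch on toy counts. It is an assumption-laden simplification, not scanpy's implementation, and it leaves out the highly variable gene selection:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(6, 4)).astype(np.float64)  # toy cells x genes
counts[:, 0] += 1.0  # make sure every cell has at least one count

# 1) Normalize every cell to the same total count (here: the median total).
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * np.median(totals)

# 2) Log-transform to compress the dynamic range.
logged = np.log1p(normalized)

# 3) Scale every gene to zero mean and unit variance.
std = logged.std(axis=0, ddof=1)
std[std == 0] = 1.0  # leave constant genes unscaled
scaled = (logged - logged.mean(axis=0)) / std
```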
Then run principal component analysis:
sc.tl.pca(adata, svd_solver='arpack')
"""
computing PCA
with n_comps=50
finished (0:00:01)
"""
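For intuition, PCA can be sketched with a plain SVD in NumPy. The toy matrix and the names `X_pca` and `n_comps` below are illustrative assumptions; scanpy's own solver and defaults may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # toy cells x genes
X[:, 1] = 3.0 * X[:, 0] + 0.1 * X[:, 1]     # make two genes strongly correlated

Xc = X - X.mean(axis=0)                     # center every gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_comps = 2
X_pca = U[:, :n_comps] * S[:n_comps]        # cell coordinates, equals Xc @ Vt[:n_comps].T
explained_variance = S**2 / (X.shape[0] - 1)
```

The rows of `Vt` play the role of the loadings, and `X_pca` corresponds to the kind of matrix stored under `obsm` after the PCA step.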
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs'
"""
Next, compute the neighborhood graph and visualize the data with a force-directed graph layout (a 2D embedding, like t-SNE and UMAP):
sc.pp.neighbors(adata, n_neighbors=4, n_pcs=20)
sc.tl.draw_graph(adata)
"""
computing neighbors
using 'X_pca' with n_pcs = 20
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:05)
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:24)
"""
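The neighborhood graph links every cell to its nearest neighbors in PCA space. A brute-force NumPy sketch of a k-nearest-neighbor graph follows; it builds an unweighted, symmetrized adjacency matrix, which is a simplification of the weighted connectivities scanpy stores:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 3))              # stand-in for cells in PCA space
k = 4                                          # neighbors per cell (excluding the cell itself)

diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))         # pairwise Euclidean distances
np.fill_diagonal(dist, np.inf)                 # a cell is not its own neighbor
knn_idx = np.argsort(dist, axis=1)[:, :k]      # indices of the k closest cells

adjacency = np.zeros((10, 10), dtype=bool)
rows = np.repeat(np.arange(10), k)
adjacency[rows, knn_idx.ravel()] = True
adjacency = adjacency | adjacency.T            # symmetrize: keep an edge if either end chose it
```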
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph'
obsm: 'X_pca', 'X_draw_graph_fa'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
sc.pl.draw_graph(adata, color='paul15_clusters', legend_loc='on data')

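Force-directed graph
A force-directed layout is a simple graph visualization method built on a physical analogy: all nodes repel each other like charged particles (the Coulomb force idea), while edges pull connected nodes together. The layout itself does not assign cluster labels; densely connected groups of cells simply end up close together, so the cluster structure of the graph becomes visible.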
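The idea can be sketched in a few lines of NumPy. This is a toy illustration with an assumed step size and force law, not the ForceAtlas2 algorithm that the fa2 package implements:

```python
import numpy as np

# Toy graph: two triangles joined by a single edge (2-3).
edges = np.array([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
n_nodes = 6

rng = np.random.default_rng(0)
pos = rng.normal(size=(n_nodes, 2))                 # random initial positions

step = 0.05
for _ in range(200):
    delta = pos[:, None, :] - pos[None, :, :]       # offsets pos[i] - pos[j]
    dist2 = (delta**2).sum(axis=-1) + 1e-9          # squared distances (avoid division by zero)
    # Repulsion between all pairs of nodes, decaying with distance.
    force = (delta / dist2[..., None]).sum(axis=1)
    # Attraction along edges, growing with distance (spring-like).
    for i, j in edges:
        pull = pos[j] - pos[i]
        force[i] += pull
        force[j] -= pull
    pos = pos + step * force
```

After the iterations, `pos` holds one 2D coordinate per node, analogous to the layout coordinates that the plotting function reads.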
Denoising the graph
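To denoise the graph, we represent it in diffusion map space (not in PCA space). Computing the diffusion components and keeping only the first few of them amounts to denoising the graph. This is similar to denoising a data matrix with PCA. The approach has been used in several papers, for example Schiebinger et al. (2017) and Tabaka et al. (2018).
This step is not required for PAGA, clustering, or pseudotime estimation. We could also continue with the non-denoised graph. In many cases, denoising gives better results.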
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
"""
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
finished (0:00:00)
eigenvalues of transition matrix
[1. 1. 0.9989278 0.99671 0.99430376 0.98939794
0.9883687 0.98731077 0.98398703 0.983007 0.9790806 0.9762548
0.9744365 0.9729161 0.9652972 ]
finished: added
'X_diffmap', diffmap coordinates (adata.obsm)
'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
"""
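The eigenvalues in the log above belong to a transition matrix built from the graph, and the leading one equals 1. A simplified NumPy sketch of a diffusion map on toy points shows the same property. It assumes a dense Gaussian kernel and skips the density normalization that real implementations typically apply:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))                 # toy cells

# Gaussian kernel on pairwise squared distances.
sq_dist = ((points[:, None, :] - points[None, :, :])**2).sum(axis=-1)
sigma2 = np.median(sq_dist)
K = np.exp(-sq_dist / sigma2)

# Symmetrically normalized transition matrix D^{-1/2} K D^{-1/2}.
d = K.sum(axis=1)
T = K / np.sqrt(np.outer(d, d))

evals, evecs = np.linalg.eigh(T)                  # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]                   # reorder to descending
evals, evecs = evals[order], evecs[:, order]

n_comps = 5
diffmap = evecs[:, :n_comps]                      # keep only the leading components
```

Discarding the trailing components is what acts as the denoising step; neighbors are then recomputed in the reduced space.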
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph', 'paul15_clusters_colors', 'diffmap_evals'
obsm: 'X_pca', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
adata.obsm['X_diffmap'].shape
# (2730, 15)
Visualize the data again with a force-directed graph:
sc.tl.draw_graph(adata)
"""
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:20)
"""
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph', 'paul15_clusters_colors', 'diffmap_evals'
obsm: 'X_pca', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
sc.pl.draw_graph(adata, color='paul15_clusters', legend_loc='on data')
Note that sc.pl.draw_graph plots the coordinates stored in adata.obsm['X_draw_graph_fa'].
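Clustering and PAGA
Note that we previously clustered with sc.tl.leiden. Here we use sc.tl.louvain instead, purely to reproduce the results of the paper (see the earlier post "PAGA: a tool combining trajectory inference and clustering").
The louvain package must be installed before use; the command below installs it as an optional scanpy dependency if it is missing: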
pip install scanpy[louvain]
Then we cluster the cells and obtain 25 clusters:
sc.tl.louvain(adata, resolution=1.0)
"""
running Louvain clustering
using the "louvain" package of Traag (2017)
finished: found 25 clusters and added
'louvain', the cluster labels (adata.obs, categorical) (0:00:00)
"""
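Louvain clustering searches for a partition with high modularity. As a hedged illustration, the sketch below scores two partitions of a toy graph with the standard modularity formula; the `resolution` factor on the null-model term is an assumption modeled on resolution-aware variants, and the greedy optimization itself is not shown:

```python
import numpy as np

# Toy graph: two triangles joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def modularity(A, labels, resolution=1.0):
    """Modularity Q of a hard partition of an undirected graph."""
    labels = np.asarray(labels)
    two_m = A.sum()                          # twice the number of edges
    k = A.sum(axis=1)                        # node degrees
    same = labels[:, None] == labels[None, :]
    return ((A - resolution * np.outer(k, k) / two_m) * same).sum() / two_m

q_good = modularity(A, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_bad = modularity(A, [0, 1, 0, 1, 0, 1])    # arbitrary split
print(q_good, q_bad)
```

The partition that follows the triangles scores higher, which is the kind of partition a modularity optimizer would prefer.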
From the literature, we have the following cell types and marker genes:
| Cell type | Marker genes |
|---|---|
| HSCs | Procr |
| Erythroids | Gata1, Klf1, Epor, Gypa, Hba-a2, Hba-a1, Spi1 |
| Neutrophils | Elane, Cebpe, Ctsg, Mpo, Gfi1 |
| Monocytes | Irf8, Csf1r, Ctsg, Mpo |
| Megakaryocytes | Itga2b (encodes protein CD41), Pbx1, Sdpr, Vwf |
| Basophils | Mcpt8, Prss34 |
| B cells | Cd19, Vpreb2, Cd79a |
| Mast cells | Cma1, Gzmb, CD117/C-Kit |
| Mast cells & Basophils | Ms4a2, Fcer1a, Cpa3, CD203c (human) |
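For a simple coarse-grained visualization, we compute the PAGA graph, a coarse-grained and simplified (abstracted) graph. Non-significant edges in the coarse-grained graph (those below a threshold) are removed.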
sc.tl.paga(adata, groups='louvain')
"""
running PAGA
finished: added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
"""
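To make the coarse-graining concrete, the sketch below counts the edges that run between clusters of a toy cell graph and drops weak connections. It is a simplified stand-in: PAGA itself weighs the observed connections against a statistical model, and the cutoff used here is a hypothetical value:

```python
import numpy as np

# Toy cell graph: three triangles (clusters 0, 1, 2) with a few edges between them.
edges = [(0, 1), (1, 2), (0, 2),      # cluster 0
         (3, 4), (4, 5), (3, 5),      # cluster 1
         (6, 7), (7, 8), (6, 8),      # cluster 2
         (2, 3), (1, 4),              # two edges between clusters 0 and 1
         (5, 6)]                      # one edge between clusters 1 and 2
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

A = np.zeros((9, 9))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Z = np.eye(labels.max() + 1)[labels]   # one-hot membership, cells x clusters
inter = Z.T @ A @ Z                    # entry [a, b]: number of edges between clusters a and b
np.fill_diagonal(inter, 0)             # ignore edges inside a cluster

min_edges = 2                          # hypothetical cutoff
coarse_edges = inter >= min_edges      # only the 0-1 connection survives
```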
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all', 'louvain'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph', 'paul15_clusters_colors', 'diffmap_evals', 'louvain', 'paga', 'louvain_sizes'
obsm: 'X_pca', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
Visualize PAGA, colored by cluster label and by the expression of three selected genes:
sc.pl.paga(adata, color=['louvain', 'Hba-a2', 'Elane', 'Irf8'])
"""
--> added 'pos', the PAGA positions (adata.uns['paga'])
"""
Visualize the expression of a few other genes:
sc.pl.paga(adata, color=['louvain', 'Itga2b', 'Prss34', 'Cma1'])
"""
--> added 'pos', the PAGA positions (adata.uns['paga'])
"""
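Cma1 is a mast cell marker and is expressed only in a small fraction of the cells in the progenitor/stem cell cluster 8; the single-cell resolved plot further below examines this in more detail.
We now label the clusters whose cell type can be tentatively assigned by hand: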
adata.obs['louvain'].cat.categories
"""
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24'],
dtype='object')
"""
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all', 'louvain'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph', 'paul15_clusters_colors', 'diffmap_evals', 'louvain', 'paga', 'louvain_sizes', 'louvain_colors'
obsm: 'X_pca', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
adata.obs['louvain_anno'] = adata.obs['louvain']
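# Label the clusters whose cell type can be tentatively assigned by hand.
# rename_categories is used because recent pandas versions no longer allow assigning to .cat.categories.
adata.obs['louvain_anno'] = adata.obs['louvain_anno'].cat.rename_categories([
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10/Ery', '11', '12',
    '13', '14', '15', '16/Stem', '17', '18', '19/Neu', '20/Mk', '21', '22/Baso', '23', '24/Mo'])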
Recompute and visualize PAGA with the preliminary annotations; because a threshold is set, some edges are removed:
sc.tl.paga(adata, groups='louvain_anno')
"""
running PAGA
finished: added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
"""
sc.pl.paga(adata, threshold=0.03, show=False)
adata
"""
AnnData object with n_obs × n_vars = 2730 × 1000
obs: 'paul15_clusters', 'n_counts_all', 'louvain', 'louvain_anno'
var: 'n_counts', 'mean', 'std'
uns: 'iroot', 'log1p', 'pca', 'neighbors', 'draw_graph', 'paul15_clusters_colors', 'diffmap_evals', 'louvain', 'paga', 'louvain_sizes', 'louvain_colors', 'louvain_anno_sizes', 'louvain_anno_colors'
obsm: 'X_pca', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"""
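Recomputing the embedding using PAGA-initialization
We can again visualize the single-cell data with a force-directed graph or UMAP, except that this time we give the layout algorithm the initial 2D position of every cell through init_pos: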
sc.tl.draw_graph(adata, init_pos='paga')
"""
drawing single-cell graph using layout 'fa'
finished: added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:00:20)
"""
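Besides sc.tl.draw_graph, we can also use sc.tl.umap; both accept an init_pos parameter. For sc.tl.draw_graph, init_pos can be 'paga'/True, None/False, or any valid 2D key of .obsm, and it initializes the layout from those precomputed 2D coordinates. If it is False/None (the default), the layout is initialized randomly. sc.tl.umap also accepts init_pos='paga', but its default is 'spectral' rather than random.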
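One way to picture PAGA initialization: every cell starts near the position of its cluster's node in the coarse graph, and the layout algorithm then refines those positions. The NumPy sketch below illustrates that idea with hypothetical node positions and jitter; it is not scanpy's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D positions of three coarse-graph nodes (one per cluster).
cluster_pos = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = rng.integers(0, 3, size=20)            # toy cluster label for each cell

# Start every cell at its cluster's node, plus a little noise so cells do not overlap.
jitter = rng.normal(scale=0.1, size=(labels.size, 2))
init_coords = cluster_pos[labels] + jitter
```

Starting from such coordinates, rather than from random ones, tends to keep the final embedding consistent with the coarse-grained topology.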
Now we can look at all marker genes at single-cell resolution in a meaningful layout:
sc.pl.draw_graph(adata, color=['louvain_anno', 'Itga2b', 'Prss34', 'Cma1'], legend_loc='on data')
Choose more consistent colors for the clusters of each cell type:
pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200)
pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_28)
new_colors = np.array(adata.uns['louvain_anno_colors'])
new_colors[[16]] = zeileis_colors[[12]] # Stem colors / green
new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]] # Ery colors / red
new_colors[[20, 8]] = zeileis_colors[[17, 16]] # Mk early Ery colors / yellow
new_colors[[4, 0]] = zeileis_colors[[2, 8]] # lymph progenitors / grey
new_colors[[22]] = zeileis_colors[[18]] # Baso / turquoise
new_colors[[19, 14, 2]] = zeileis_colors[[6, 6, 6]] # Neu / light blue
new_colors[[24, 9, 1, 11]] = zeileis_colors[[0, 0, 0, 0]] # Mo / dark blue
new_colors[[21, 23]] = zeileis_colors[[25, 25]] # outliers / grey
adata.uns['louvain_anno_colors'] = new_colors
Visualize the fine-grained (single-cell) and coarse-grained (PAGA) graphs:
sc.pl.paga_compare(
    adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
    legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)
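Note that the PAGA graph on the right looks different from the first PAGA plot, but the topology is the same.
About scanpy.pl.paga_compare:
- It shows a scatter plot and the PAGA graph side by side: the single-cell embedding (fine-grained view) and the abstracted graph.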
Computing gene expression changes: PAGA paths
First, choose a root cell for diffusion pseudotime:
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '16/Stem')[0]
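numpy.flatnonzero() takes an array and returns the indices of the non-zero elements of its flattened version. Applied to the boolean mask above, it returns the positions of all cells in cluster '16/Stem', and [0] picks the first one.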
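A small standalone example with made-up labels shows the behavior:

```python
import numpy as np

labels = np.array(['3', '16/Stem', '7', '16/Stem', '10/Ery'])  # toy cluster labels
mask = labels == '16/Stem'               # boolean array, True where the label matches
stem_idx = np.flatnonzero(mask)          # positions where the mask is True
root = stem_idx[0]                       # first matching cell
print(stem_idx, root)                    # [1 3] 1
```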
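Now compute pseudotime. A root cell must be specified via adata.uns['iroot'] beforehand: sc.tl.dpt does not pick a root at random, and without a root it cannot compute dpt_pseudotime. (The paul15 dataset already ships with an iroot entry, which the line above overrides.)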
sc.tl.dpt(adata)
"""
computing Diffusion Pseudotime using n_dcs=10
finished: added
'dpt_pseudotime', the pseudotime (adata.obs) (0:00:00)
"""
Collect some marker genes in a list:
gene_names = ['Gata2', 'Gata1', 'Klf1', 'Epor', 'Hba-a2', # erythroid
'Elane', 'Cebpe', 'Gfi1', # neutrophil
'Irf8', 'Csf1r', 'Ctsg'] # monocyte
Load the raw data for visualization:
adata_raw = sc.datasets.paul15()
sc.pp.log1p(adata_raw)
sc.pp.scale(adata_raw)
adata.raw = adata_raw
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'], legend_loc='on data')
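From the plot above, we manually derive three differentiation paths (this presupposes that the stem cells were identified correctly; otherwise the downstream results are not reliable):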
paths = [('erythrocytes', [16, 12, 7, 13, 18, 6, 5, 10]),
('neutrophils', [16, 0, 4, 2, 14, 19]),
('monocytes', [16, 0, 4, 11, 1, 9, 24])]
Clearly, pseudotime analysis of this kind requires biological background knowledge.
Add the following annotations:
adata.obs['distance'] = adata.obs['dpt_pseudotime']
adata.obs['clusters'] = adata.obs['louvain_anno']
adata.uns['clusters_colors'] = adata.uns['louvain_anno_colors']
Create the write directory:
!mkdir write
Compute and save the PAGA paths:
_, axs = pl.subplots(ncols=3, figsize=(6, 2.5), gridspec_kw={
    'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
    _, data = sc.pl.paga_path(
        adata, path, gene_names,
        show_node_names=False,
        ax=axs[ipath],
        ytick_fontsize=12,
        left_margin=0.15,
        n_avg=50,
        annotations=['distance'],
        show_yticks=(ipath == 0),  # gene names only on the leftmost panel
        show_colorbar=False,
        color_map='Greys',
        groups_key='clusters',
        color_maps_annotations={
            'distance': 'viridis'},
        title='{} path'.format(descr),
        return_data=True,
        show=False)
    data.to_csv('./write/paga_path_{}.csv'.format(descr))
pl.savefig('./figures/paga_path_paul15.pdf')
pl.show()
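- The three panels (columns of the figure) are the three differentiation branches.
- The upper rows are genes: each row shows how the gene's expression changes along the differentiation path.
- The clusters row marks the clusters traversed along the branch.
- The distance row records, along each path, the pseudotime distance of the cells from the stem cell, which is the root cell we chose via iroot.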
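The n_avg argument in the call above controls a running average over neighboring cells along the path. The NumPy sketch below shows the general idea on toy values: order cells by pseudotime, then smooth the expression with a moving average. The window handling at the ends is an assumption and may differ from what scanpy does:

```python
import numpy as np

# Toy data: pseudotime and expression of one gene for six cells.
pseudotime = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.0])
expression = np.array([9.0, 1.0, 5.0, 3.0, 7.0, 0.0])

order = np.argsort(pseudotime)                 # cells ordered from root to tip
ordered = expression[order]                    # [0, 1, 3, 5, 7, 9]

n_avg = 3                                      # window size of the running average
kernel = np.ones(n_avg) / n_avg
smoothed = np.convolve(ordered, kernel, mode='valid')   # only full windows are kept
print(smoothed)
```

Larger windows give smoother expression trends along the path at the cost of blurring sharp transitions.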