
Seurat Package Illustrated Guide | Single-Cell Transcriptome (scRNA-seq) Analysis 02

Date: 2022-07-19 03:44:45

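Contents

I. Creating a Seurat object
II. Standard preprocessing workflow
1. Selecting cells by QC metrics
2. Normalizing the data
3. Identifying highly variable features
4. Scaling the data
5. Linear dimensionality reduction (PCA)
VizDimLoadings
DimPlot
DimHeatmap
5. Determining the dimensionality of the dataset
Method 1: JackStrawPlot
Method 2: ElbowPlot
6. Clustering the cells
7. Non-linear dimensionality reduction (UMAP/tSNE)
8. Finding differentially expressed features (cluster biomarkers)
9. Identifying cell types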

I. Creating a Seurat object

The example dataset consists of Peripheral Blood Mononuclear Cells (PBMC) sequenced on the 10x Genomics platform.

Download link: https://s3-us-west-/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

library(dplyr)
library(Seurat)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized) data
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

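II. Standard preprocessing workflow

The workflow covers:

Selecting cells based on QC metrics
Normalizing and scaling the data
Detecting highly variable genes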

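1. Selecting cells by QC metrics

QC metrics:

The number of genes detected in each cell
    Low-quality cells and empty droplets often have very few genes.
    Doublets or multiplets (two or more cells captured together) may show an aberrantly high gene count.

The total number of UMIs in each cell (this behaves much like the gene count above)

The percentage of reads that map to the mitochondrial genome
    Low-quality or dying cells often show a high percentage of mitochondrial reads.

Use the PercentageFeatureSet function to compute the mitochondrial QC metric.

Genes whose names start with MT- are mitochondrial genes.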

# Compute the percentage of reads that map to mitochondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# Show the QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
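To make the mitochondrial metric concrete, here is a minimal base R sketch of the quantity that percent.mt represents: the share of each cell's counts that come from genes whose names start with MT-. The toy count matrix and the variable names are made up for illustration and are not part of the PBMC data.

```r
# Toy count matrix: genes in rows, cells in columns (values are invented).
counts <- matrix(
  c(5, 0, 2,
    3, 1, 0,
    2, 4, 8,
    0, 5, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "CD3E", "LYZ"),
                  c("cell1", "cell2", "cell3"))
)

# Rows whose names start with "MT-" are treated as mitochondrial genes.
is_mt <- grepl("^MT-", rownames(counts))

# Percentage of each cell's total counts that come from mitochondrial genes.
percent_mt <- 100 * colSums(counts[is_mt, , drop = FALSE]) / colSums(counts)
print(percent_mt)
```

In this toy example cell1 would be flagged by a 5% cutoff, since 8 of its 10 counts are mitochondrial.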

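Based on the violin plots, the filtering criteria are:

Remove cells with more than 2,500 or fewer than 200 detected genes (nFeature_RNA)
Remove cells with more than 5% mitochondrial reads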

Check the feature-feature correlations

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

Filter

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
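The subset() call combines three conditions with a logical AND. The same logic can be sketched in base R on a small, hypothetical metadata table (the four cells and their values are invented for illustration):

```r
# Hypothetical per-cell QC metadata, mirroring the columns used above.
meta <- data.frame(
  nFeature_RNA = c(150, 800, 2600, 1200),
  percent.mt   = c(2.0, 3.5, 1.0, 7.2),
  row.names    = c("cell1", "cell2", "cell3", "cell4")
)

# A cell is kept only if it passes all three thresholds.
keep <- meta$nFeature_RNA > 200 &
  meta$nFeature_RNA < 2500 &
  meta$percent.mt < 5

kept_cells <- rownames(meta)[keep]
print(kept_cells)
```

Here cell1 fails the lower gene-count bound, cell3 fails the upper bound, and cell4 fails the mitochondrial cutoff, so only cell2 remains.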

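Check the correlations again after filtering

p1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
p2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
# CombinePlots is deprecated in newer Seurat releases, which use p1 + p2 (patchwork) instead
CombinePlots(plots = list(p1, p2))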

2. Normalizing the data

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

LogNormalize normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data.
Because these are the default values, the call above can be shortened to: pbmc <- NormalizeData(pbmc)
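The arithmetic behind this normalization can be sketched in base R as follows. The toy matrix is invented, and the sketch assumes the log transform is log1p (the natural log of 1 + x), so that zero counts stay at zero.

```r
# Toy count matrix: 3 genes x 2 cells (values are invented).
counts <- matrix(
  c(1, 0, 9,     # cell1, total = 10
    4, 6, 10),   # cell2, total = 20
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

scale_factor <- 10000

# Divide each column (cell) by its total, multiply by the scale factor,
# then apply log1p.
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
print(round(normalized, 3))
```

Dividing by the per-cell total removes differences in sequencing depth, so geneC, which takes 90% of cell1's counts but only 50% of cell2's, ends up with a higher normalized value in cell1 despite the lower raw count.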

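3. Identifying highly variable features

Highly variable: these features are highly expressed in some cells and lowly expressed in others. Focusing on these genes in downstream analysis helps to highlight the biological signal in single-cell datasets [/articles/nmeth.2645 ]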

# Identify the 2,000 most variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Take the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# Plot the variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

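4. Scaling the data

This is a standard step before dimensionality reduction techniques such as PCA. The ScaleData function:

Shifts the expression of each gene so that its mean expression across cells is 0
Scales the expression of each gene so that its variance across cells is 1

This step gives every gene equal weight in downstream analyses, so that highly expressed genes do not dominate.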

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
head(pbmc[["RNA"]]@scale.data, 5)
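The centering and scaling described above amount to a per-gene z-score. Below is a minimal base R sketch on random toy data. The clipping at 10 mirrors ScaleData's scale.max default as I understand it, so treat that detail as an assumption.

```r
set.seed(1)

# Toy normalized expression: 4 genes x 5 cells (random values).
expr <- matrix(
  rnorm(20, mean = 5, sd = 2), nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)

# scale() works on columns, so transpose to z-score each gene across cells,
# then transpose back to genes x cells.
scaled <- t(scale(t(expr)))

# Clip extreme values (assumed to correspond to scale.max = 10).
scaled[scaled > 10] <- 10

print(round(rowMeans(scaled), 10))      # per-gene mean across cells
print(round(apply(scaled, 1, sd), 10))  # per-gene standard deviation
```

Every gene ends up with mean 0 and standard deviation 1 across cells, whatever its original expression level was.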

5. Linear dimensionality reduction (PCA)

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
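For readers who want to see what a PCA produces, here is a small base R sketch that uses prcomp on random toy data. This is a stand-in for illustration and not how RunPCA is implemented. The names embeddings and loadings are my own labels for the per-cell coordinates and per-gene weights.

```r
set.seed(42)

# Toy scaled expression matrix: 6 genes x 30 cells (random values).
scaled <- matrix(
  rnorm(6 * 30), nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("cell", 1:30))
)

# prcomp expects observations (cells) in rows, so transpose first.
pca <- prcomp(t(scaled), center = TRUE, scale. = FALSE)

embeddings <- pca$x[, 1:2]        # coordinates of each cell on PC1 and PC2
loadings   <- pca$rotation[, 1:2] # weight of each gene in PC1 and PC2

print(head(embeddings, 3))
print(loadings)
```

The cell coordinates are the kind of values DimPlot displays, while the gene weights are the kind of values VizDimLoadings and DimHeatmap rank features by.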

There are three ways to visualize the cells and features that define the PCA:

VizDimLoadings

print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
# Plot the feature loadings of the first two PCs
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot

DimPlot(pbmc, reduction = "pca")

DimHeatmap

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

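DimHeatmap is mainly used to explore the primary sources of heterogeneity in the dataset, and it helps decide which PCs to include in downstream analyses.

Cells and features are ordered by their PCA scores.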

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

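5. Determining the dimensionality of the dataset

To overcome the technical noise in any single feature of single-cell data, Seurat clusters cells based on their PCA scores. Each PC represents a 'metafeature' that combines information across a correlated feature set, so the top principal components give a compressed representation of the dataset. The question is how many PCs to include.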

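Method 1: JackStrawPlot

The Seurat authors use a resampling test inspired by the JackStraw procedure: randomly permute a subset of the data (1% by default), rerun PCA to build a 'null distribution' of feature scores, and repeat. Significant PCs are those that show a strong enrichment of features with low p-values.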

pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# Plot the p-value distribution for each PC
JackStrawPlot(pbmc, dims = 1:15)

In this case there appears to be a sharp drop-off in significance after the first 10-12 PCs.

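Method 2: ElbowPlot

ElbowPlot ranks the principal components by the percentage of variance each one explains. In this example an 'elbow' is visible around PC 9-10, suggesting that most of the true signal is captured in the first 10 PCs.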

ElbowPlot(pbmc)
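The quantity behind an elbow plot can be sketched in base R. The example below builds toy data with one strong source of variation (two groups of cells that differ in three genes) and then lists the standard deviation and percentage of variance for each PC. It uses prcomp as a stand-in, and all data and names are invented.

```r
set.seed(7)

# Toy data: 40 cells x 8 genes. Cells in group 1 have higher expression
# of the first three genes; the rest is random noise.
group  <- rep(c(0, 1), each = 20)
signal <- outer(group, c(4, 4, 4, 0, 0, 0, 0, 0))
expr   <- signal + matrix(rnorm(40 * 8), nrow = 40)

pca <- prcomp(expr, center = TRUE)

# Percentage of total variance explained by each PC.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

print(data.frame(
  PC      = seq_along(pct_var),
  sdev    = round(pca$sdev, 2),
  pct_var = round(pct_var, 1)
))
```

In this toy case the drop after PC1 is the 'elbow': one component carries the group difference and the remaining ones mostly carry noise.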

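To identify the true dimensionality of a dataset, there are three approaches:

The first is more supervised: explore the PCs to determine relevant sources of heterogeneity, for example in conjunction with GSEA.
The second implements a statistical test based on a random null model, but is time-consuming for large datasets and may not return a clear PC cutoff.
The third is a commonly used heuristic that can be calculated instantly.

In this example all three approaches gave similar results, and any cutoff between PC 7 and PC 12 would have been justifiable.

The authors chose 10 here, but in practice also consider the following:

Genes strongly associated with PCs 12 and 13 may define rare dendritic cell and NK subsets (for example, MZB1 is a marker for plasmacytoid DCs). Without prior knowledge, however, such rare groups are hard to tell apart from background noise.
Users can repeat the downstream analysis with different numbers of PCs, such as 10, 15, or 50. In the original Seurat tutorial the results often do not differ dramatically.
When choosing this parameter, err on the higher side. Using only 5 PCs, for example, adversely affects the downstream results.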

6. Clustering the cells

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
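As a rough illustration of the neighbor-graph idea behind these two calls, the base R sketch below finds each cell's k nearest neighbors in a toy 2-PC space and scores every pair of cells by the Jaccard overlap of their neighbor sets. This is a simplified sketch under my own assumptions (tiny k, no pruning, no community detection), not Seurat's implementation. The data and helper names are hypothetical.

```r
set.seed(3)

# Toy PCA coordinates: two well-separated blobs of 5 cells each.
pcs <- rbind(
  matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2)
)
rownames(pcs) <- paste0("cell", 1:10)

k <- 3
d <- as.matrix(dist(pcs))  # Euclidean distances between cells

# For each cell, the indices of its k nearest cells (itself included).
knn <- t(apply(d, 1, function(row) order(row)[1:k]))

# Jaccard overlap between two neighbor sets.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- nrow(pcs)
snn <- matrix(0, n, n, dimnames = list(rownames(pcs), rownames(pcs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    snn[i, j] <- jaccard(knn[i, ], knn[j, ])
  }
}

print(round(snn[1:5, 1:5], 2))
```

Cells in the same blob share neighbors and get non-zero weights, while cells in different blobs share none. A community detection step would then group cells along these weights.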

7. Non-linear dimensionality reduction (UMAP/tSNE)

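# Run UMAP on the first 10 PCs (this is an embedding for visualization, not a clustering step)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
# Show the cluster labels on the plot
DimPlot(pbmc, reduction = "umap", label = TRUE)

# Run tSNE on the first 10 PCs
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")
# Show the cluster labels on the plot
DimPlot(pbmc, reduction = "tsne", label = TRUE)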

8. Finding differentially expressed features (cluster biomarkers)

# Find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
# Find all markers that distinguish cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
# Find markers for every cluster compared to all remaining cells, reporting only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# Seurat v4 and later name this column avg_log2FC
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
# Alternative test: ROC analysis for cluster 0
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
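To show what a marker test measures for a single gene, here is a base R sketch on simulated data. It computes the fraction of expressing cells in each group (the idea behind min.pct), a simple log2 fold change, and a Wilcoxon rank-sum p-value. The fold-change formula with a pseudocount of 1 is my simplification and may not match Seurat's exact calculation. All data and names are invented.

```r
set.seed(11)

# Simulated log-normalized expression of one gene in two clusters.
cluster <- rep(c("A", "B"), each = 25)
expr <- log1p(c(rpois(25, lambda = 5), rpois(25, lambda = 0.3)))

in_a <- cluster == "A"

# Fraction of cells with detectable expression in each group.
pct_1 <- mean(expr[in_a] > 0)
pct_2 <- mean(expr[!in_a] > 0)

# Simplified log2 fold change of mean expression (pseudocount of 1).
log2_fc <- log2((mean(expm1(expr[in_a])) + 1) / (mean(expm1(expr[!in_a])) + 1))

# Wilcoxon rank-sum test; exact = FALSE avoids the ties warning.
p_val <- wilcox.test(expr[in_a], expr[!in_a], exact = FALSE)$p.value

print(c(pct.1 = pct_1, pct.2 = pct_2, log2FC = log2_fc, p_val = p_val))
```

A gene like this one, detected in most cells of cluster A and few cells of cluster B, is the kind of candidate that passes a min.pct = 0.25 filter.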

Visualization

# Violin plots of marker expression per cluster
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# Plot the raw counts instead
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

RidgePlot(pbmc, features = c("MS4A1", "CD79A"))

DotPlot(pbmc, features = c("MS4A1", "CD79A"))

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

9. Identifying cell types

For this dataset, canonical markers make it easy to match the unbiased clusters to known cell types.

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
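The renaming step relies on a named vector that maps each old cluster level to a new label. The same lookup can be sketched in base R with a plain factor. The six cells and four clusters below are invented for illustration.

```r
# Hypothetical cluster assignments for six cells.
clusters <- factor(c("0", "2", "1", "0", "3", "2"),
                   levels = c("0", "1", "2", "3"))

# New labels, in the same order as the cluster levels.
new_ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B")
names(new_ids) <- levels(clusters)

# Look up each cell's label by its old cluster name.
cell_types <- factor(unname(new_ids[as.character(clusters)]),
                     levels = unname(new_ids))

print(table(cell_types))
```

The order of the new labels matters: the first label is assigned to the first level, the second to the second, and so on, so the vector must line up with the output of levels().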
