Directly selecting differentially expressed genes for single-cell clustering analyses

Abstract

In single-cell RNA sequencing (scRNA-seq) studies, cell-types and their associated marker genes are often identified by clustering and differentially expressed gene (DEG) analysis. scRNA-seq data contain many genes that are not relevant to cell-types, so gene selection procedures are needed for more accurate clustering. An ideal gene selection procedure would select all DEGs between cell-types, since these are the genes that best identify cell-types. However, because cell-types are unknown in advance, gene selection and DEG analysis are performed separately using different methods. Genes are selected using surrogate criteria that are not directly related to clustering, and these criteria often miss important genes or select unimportant ones. Such inferior gene selection can seriously reduce clustering accuracy. DEGs, in turn, are often detected by comparing the resulting clusters, which leads to many false DEGs because of selection bias: the same data are used both to define the clusters and to test for differences between them. In this paper, we present Festem, a unified method for gene selection and DEG analysis in scRNA-seq studies. Festem assesses each gene’s clustering information based on the observation that the marginal distribution of a DEG is a mixture of its different cell-type-conditional distributions. It can therefore directly select the clustering-informative DEGs and avoid the selection bias problem. Extensive simulations and real data analyses show that Festem achieves high precision and recall for DEG detection, and enables more accurate clustering and cell-type identification. Applications to several scRNA-seq datasets demonstrate that Festem can identify cell-types that are often missed by other methods. In a large intrahepatic cholangiocarcinoma dataset, we identify CD8+ T cell-types and find that their marker genes are novel prognostic biomarkers.
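
The mixture observation in the abstract can be illustrated with a small simulation. The sketch below is not Festem's test statistic or implementation; it is a toy example that assumes Poisson-distributed counts and exactly two cell-types, and all names in it (`sample_poisson`, `mixture_lr_statistic`, and so on) are hypothetical. For each gene it compares a single Poisson fit with a two-component Poisson mixture fitted by EM, without using any cluster labels. A gene whose expression differs between cell-types should show a much larger likelihood-ratio statistic than a gene with the same distribution in every cell.

```python
import math
import random


def poisson_logpmf(k, lam):
    # log P(X = k) for X ~ Poisson(lam)
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def sample_poisson(rng, lam):
    # Knuth's multiplication method; adequate for the small means used here
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def loglik_single(counts):
    # Maximized log-likelihood of one Poisson (no cell-type structure)
    lam = max(sum(counts) / len(counts), 1e-8)
    return sum(poisson_logpmf(k, lam) for k in counts)


def component_logs(k, w, lam):
    # Log of each weighted mixture component evaluated at count k
    return (math.log(w) + poisson_logpmf(k, lam[0]),
            math.log(1.0 - w) + poisson_logpmf(k, lam[1]))


def loglik_mixture(counts, n_iter=200):
    # EM for a two-component Poisson mixture (an assumed toy model)
    n = len(counts)
    mean = sum(counts) / n
    lam = [max(0.5 * mean, 1e-3), 1.5 * mean + 1e-3]
    w = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each cell is in component 0
        resp = []
        for k in counts:
            a, b = component_logs(k, w, lam)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: update mixing weight and component means
        n0 = sum(resp)
        n1 = n - n0
        w = min(max(n0 / n, 1e-6), 1.0 - 1e-6)
        lam[0] = max(sum(r * k for r, k in zip(resp, counts)) / max(n0, 1e-8), 1e-8)
        lam[1] = max(sum((1.0 - r) * k for r, k in zip(resp, counts)) / max(n1, 1e-8), 1e-8)
    total = 0.0
    for k in counts:
        a, b = component_logs(k, w, lam)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def mixture_lr_statistic(counts):
    # Larger values suggest the marginal distribution is a mixture
    return 2.0 * (loglik_mixture(counts) - loglik_single(counts))


rng = random.Random(0)
n_cells = 500
# Non-DE gene: same mean in every cell
null_gene = [sample_poisson(rng, 4.5) for _ in range(n_cells)]
# DE gene: two hidden cell-types with different means (labels never used below)
de_gene = [sample_poisson(rng, 1.0 if i % 2 == 0 else 8.0) for i in range(n_cells)]

lr_null = mixture_lr_statistic(null_gene)
lr_de = mixture_lr_statistic(de_gene)
print("non-DE gene LR statistic:", round(lr_null, 2))
print("DE gene LR statistic:", round(lr_de, 2))
```

Because the statistic depends only on each gene's marginal counts, no clustering step is needed before ranking genes, which is how this kind of approach can sidestep the selection bias described above. A real analysis would need a count model suited to scRNA-seq data and a calibrated null distribution, neither of which this sketch attempts.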

Publication
bioRxiv
Zihao Chen
Graduate student

My research interests include biostatistics, bioinformatics and omics data analysis, especially scRNA-seq and spatial transcriptomic data analysis.