Development of the in-drop barcoding technique for single cells has opened the floodgates for production of large-scale scRNA-seq data. With the growing popularity of the assay and the availability of affordable commercial platforms (Chromium™ by 10x Genomics, ICELL8 by WaferGen Biosystems, etc.), average sample sizes in recent single-cell studies have increased sharply. A typical Drop-seq experiment profiles several tens of thousands of cells in a single run. Existing rare cell finding algorithms, including GiniClust [PMID: 27368803] and RaceID [PMID: 26287467], are computationally demanding and, in practice, do not scale beyond 5-15K transcriptomes. We have developed FiRE, a patent-pending, linear-time, monolithic algorithm that assigns rareness scores to a massive number of cell transcriptomes in a matter of seconds.


Jindal, A., Gupta, P., Jayadeva, and Sengupta, D., 2018. Discovery of rare cells from voluminous single cell expression data.

Cancer cells, after detaching from solid tumors, migrate through the bloodstream to colonize distant organs, leading to the development of cancer metastases. Cancer cells in circulation are called circulating tumor cells (CTCs). As a blood-based biomarker, CTCs offer a real-time snapshot of tumor evolution and therapeutic response. Despite this promise, the acute rareness of CTCs in blood hinders their isolation and characterization. Existing marker-based sorting techniques fail to detect CTCs with atypical, non-epithelial phenotypes. Through a rigorous, data-driven approach, we have come up with a panel of a few hundred genes that perfectly distinguishes single-cell transcriptomes of CTCs from those of common blood cell types.

An explosion in the production of single-cell expression data has triggered the need for a search engine. To meet this need, we developed CellAtlasSearch, a novel search architecture for high-dimensional expression data that is massively parallel and lightweight, and therefore highly scalable. CellAtlasSearch uses a Graphics Processing Unit (GPU) friendly version of Locality Sensitive Hashing (LSH) to speed up data processing and querying. Currently, CellAtlasSearch features over 300,000 reference expression profiles, including both bulk and single-cell data. The web server is accessible at this link.


Srivastava, D., Iyer, A., Kumar, V., and Sengupta, D., 2018. CellAtlasSearch: A scalable search engine for single cells.
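
To illustrate the general idea of hashing-based expression search mentioned in the CellAtlasSearch summary, here is a minimal sketch of random-hyperplane LSH, one common LSH family for cosine similarity. It is not the CellAtlasSearch implementation: the function names (make_hyperplanes, signature, hamming, search), the 64-bit signature length, and the toy profiles are all assumptions made for illustration, and the code runs on the CPU in pure Python.

```python
import random

def make_hyperplanes(n_bits, n_genes, seed=0):
    """Draw n_bits random Gaussian hyperplanes in gene space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_bits)]

def signature(profile, hyperplanes):
    """Pack the sign of each projection into one integer bit signature."""
    sig = 0
    for plane in hyperplanes:
        dot = sum(w * x for w, x in zip(plane, profile))
        sig = (sig << 1) | int(dot >= 0.0)
    return sig

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def search(query, ref_signatures, hyperplanes, top_k=1):
    """Indices of the top_k references closest to the query in Hamming distance."""
    q = signature(query, hyperplanes)
    ranked = sorted(range(len(ref_signatures)),
                    key=lambda i: hamming(q, ref_signatures[i]))
    return ranked[:top_k]

# Toy reference "atlas": three made-up profiles over six genes (illustrative only).
references = [
    [9.0, 8.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 7.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 8.0, 6.0],
]
planes = make_hyperplanes(n_bits=64, n_genes=6)
ref_sigs = [signature(r, planes) for r in references]

# A rescaled copy of reference 1 gets the same signature, because positive
# scaling does not change the sign of any projection.
query = [2.0 * x for x in references[1]]
best = search(query, ref_sigs, planes)
print(best)
```

Comparing compact bit signatures with XOR and a bit count is cheap and independent across references, which is the kind of workload that parallelizes well; the sketch above simply scans all references in sequence.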

Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale to such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop dropClust, a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed existing best-practice methods, including Seurat, in terms of speed, clustering accuracy, and detectability of minor cell sub-types. Moreover, dropClust, for the first time, helps discern the transcriptomic signature of the regulatory T cell population in blood.


Sinha, D., Kumar, A., Kumar, H., Bandyopadhyay, S., and Sengupta, D., 2018. dropClust: Efficient clustering of ultra-large scRNA-seq data.
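
As a rough illustration of how approximate nearest neighbor search can feed a clustering step, the sketch below hashes cells into buckets across several LSH tables, treats cells that share a bucket as candidate neighbors, and re-ranks those candidates by exact cosine similarity to form a k-nearest-neighbor graph. This is a toy under stated assumptions, not the dropClust pipeline: the function names (cosine, approx_knn), the table and bit counts, and the four example cells are hypothetical, and the graph would still need a separate community detection or clustering step.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two non-zero expression profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def approx_knn(cells, k=2, n_tables=4, n_bits=8, seed=0):
    """Approximate kNN graph: LSH buckets propose candidates, cosine re-ranks them."""
    rng = random.Random(seed)
    n_genes = len(cells[0])
    candidates = [set() for _ in cells]
    for _ in range(n_tables):
        # Each table uses its own set of random hyperplanes.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for idx, cell in enumerate(cells):
            key = tuple(sum(w * x for w, x in zip(p, cell)) >= 0.0 for p in planes)
            buckets[key].append(idx)
        # Cells that land in the same bucket become candidate neighbors
        # (quadratic in bucket size, which is acceptable for a toy example).
        for members in buckets.values():
            for idx in members:
                candidates[idx].update(members)
    graph = {}
    for idx, cand in enumerate(candidates):
        cand.discard(idx)  # a cell is not its own neighbor
        ranked = sorted(cand, key=lambda j: cosine(cells[idx], cells[j]), reverse=True)
        graph[idx] = ranked[:k]
    return graph

# Four made-up cells over four genes: cells 0/1 and cells 2/3 are rescaled copies.
cells = [
    [5.0, 4.0, 0.0, 0.0],
    [10.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 6.0],
    [0.0, 0.0, 6.0, 12.0],
]
graph = approx_knn(cells, k=2)
print(graph)
```

Because only cells that collide in at least one table are compared exactly, the number of similarity computations depends on bucket sizes rather than on all pairs of cells, at the cost of occasionally missing a true neighbor.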