C2S-Scale 27B Turns Single-Cell RNA into 'Cell Sentences' for LLM-Powered Biology
What C2S-Scale 27B does
Google Research, Google DeepMind and Yale University released C2S-Scale 27B, a 27-billion-parameter foundation model for single-cell analysis built on Gemma-2. The model represents single-cell RNA-seq (scRNA-seq) profiles as ranked lists of gene symbols — called 'cell sentences' — so a language model can directly parse and reason about cellular states.
To build a cell sentence, C2S-Scale rank-orders a cell's expression values and emits the top-K gene symbols, converting a high-dimensional expression vector into plain text. This text-native representation aligns single-cell data with standard LLM toolchains and makes tasks such as cell-type prediction, tissue classification, cluster captioning, perturbation prediction and biological question-answering accessible as prompt/completion workflows.
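To make the transformation concrete, here is a minimal sketch of the rank-ordering step in Python. The gene symbols, counts and K below are toy values, and the real C2S pipeline applies additional preprocessing (normalization, gene filtering) that is not shown here.

```python
import numpy as np

def cell_sentence(expression: np.ndarray, gene_symbols: list[str], k: int = 100) -> str:
    """Convert one cell's expression vector into a 'cell sentence':
    the top-k gene symbols, ordered from highest to lowest expression."""
    top_idx = np.argsort(expression)[::-1][:k]           # descending by expression
    top_idx = [i for i in top_idx if expression[i] > 0]  # keep expressed genes only
    return " ".join(gene_symbols[i] for i in top_idx)

# Toy example: five genes with made-up normalized counts.
genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"]
counts = np.array([0.0, 12.0, 3.0, 25.0, 7.0])
print(cell_sentence(counts, genes, k=3))  # -> "LYZ MS4A1 GNLY"
```

Because only gene identity and rank survive the conversion, absolute expression magnitudes are deliberately discarded; rank order is what the language model sees.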
Training, architecture and release
C2S-Scale-Gemma-2-27B is a decoder-only Transformer built on Gemma-2 27B, trained on Google TPU v5 hardware and released under a CC-BY-4.0 license. The pretraining corpus aggregates more than 800 public scRNA-seq datasets, covering over 57 million human and mouse cells with associated metadata and textual context. Pretraining unifies transcriptomic tokens and biological text into a single multimodal corpus so the model can reason across both gene-rank sequences and natural-language context.
Open weights and documentation for both the 27B and 2B Gemma variants are available for research use on Hugging Face and in the project's GitHub repository.
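As a rough illustration of what a prompt/completion workflow could look like with the released weights, here is a loading sketch using the Hugging Face transformers API. The repository ID and the prompt wording are assumptions for illustration; consult the model card for the canonical names and templates.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID is an assumption for illustration; confirm the exact name
# on the project's Hugging Face page before use.
MODEL_ID = "vandijklab/C2S-Scale-Gemma-2-27B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",        # place weights on the available accelerator(s)
    torch_dtype="bfloat16",   # half the memory of fp32; a 27B model is still large
)

# Prompt wording is illustrative; the model card documents the exact templates.
prompt = (
    "Given the following cell sentence, predict the cell type.\n"
    "Cell sentence: MALAT1 B2M TMSB4X RPL13 CD74 HLA-DRA\n"
    "Cell type:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```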
Discovery: an interferon-conditional amplifier
Using a dual-context virtual screen across more than 4,000 compounds, the research team searched for drugs that selectively boost antigen presentation (the MHC-I program) only in immune-context-positive samples — specifically primary patient samples with low interferon tone — while having little or no effect in immune-context-neutral cell-line data.
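The selection criterion amounts to ranking compounds by the gap between their predicted effect in the two contexts. The toy sketch below captures that scoring logic only, not the authors' pipeline: the scores here are invented stand-ins for prompting the model and reading MHC-I program activity (for example, the ranks of HLA-A/B/C and B2M) out of its generations, and the compound names other than silmitasertib are hypothetical.

```python
# Toy scores keyed by (compound, context). In the real screen these would come
# from prompting C2S-Scale with a perturbation query and scoring MHC-I program
# activity in the generated cell sentence.
TOY_SCORES = {
    ("silmitasertib", "immune_positive"): 0.92,
    ("silmitasertib", "immune_neutral"): 0.08,
    ("compound_x", "immune_positive"): 0.40,
    ("compound_x", "immune_neutral"): 0.35,
}

def predict_mhc1_score(compound: str, context: str) -> float:
    return TOY_SCORES[(compound, context)]

def dual_context_split(compound: str) -> float:
    # Positive context: primary patient samples with low interferon tone.
    # Neutral context: immune-context-neutral cell-line data.
    pos = predict_mhc1_score(compound, "immune_positive")
    neu = predict_mhc1_score(compound, "immune_neutral")
    return pos - neu  # large positive split = interferon-conditional amplifier

compounds = ["silmitasertib", "compound_x"]
print(sorted(compounds, key=dual_context_split, reverse=True))
# -> ['silmitasertib', 'compound_x']
```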
The model highlighted silmitasertib, a CK2 inhibitor, with a striking context-dependent split: strong MHC-I upregulation in the presence of low-dose interferon and minimal effect without interferon. The team validated this prediction experimentally in human neuroendocrine cell models that were not part of the training data. In their assays the combined treatment (silmitasertib plus low-dose interferon) produced a marked, synergistic increase in antigen presentation — roughly a 50% increase in the reported readouts — compared with either treatment alone.
Flow-cytometry results indicate that the combination lowers the threshold for interferon responsiveness rather than initiating antigen presentation de novo: HLA-A,B,C upregulation appears only under combined treatment, with both IFN-β and IFN-γ serving as the interferon partner, across two neuroendocrine models. Illustrative mean fluorescence intensity gains include 13.6% at 10 nM and 34.9% at 1000 nM silmitasertib in one model.
Practical implications and caveats
C2S-Scale 27B demonstrates a practical workflow for LLM-native single-cell analysis: by converting expression profiles into text, researchers can run programmatic screens and context-aware queries across thousands of perturbations. The identification of a CK2 inhibitor as an interferon-conditional amplifier shows the approach can generate experimentally testable hypotheses that are then validated in vitro.
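In practice, such a context-aware query is just a text prompt. The template below is hypothetical (the exact formats ship with the released documentation), but it shows the shape of a perturbation query over cell sentences.

```python
def perturbation_prompt(compound: str, context_sentence: str) -> str:
    """Hypothetical template for a context-aware perturbation query;
    the exact prompt formats are documented with the released model."""
    return (
        f"Context cell sentence: {context_sentence}\n"
        f"Perturbation: treat with {compound}.\n"
        "Predict the post-perturbation cell sentence:"
    )

print(perturbation_prompt("silmitasertib", "B2M HLA-A HLA-B TAP1 STAT1"))
```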
However, all reported evidence is preclinical and bench-scale. The proper interpretation is ‘hypothesis-generating AI’ — a tool to propose mechanisms and prioritize experiments, not to make clinical claims. The authors provide open weights and usage docs to enable replication, stress-testing and further research.
Resources
Model weights, usage documentation and relevant code are available on the project’s Hugging Face and GitHub pages for researchers who want to reproduce the results or try the cell-sentence workflow on new data.