DeepSomatic: Google’s AI Spots Hidden Cancer Mutations Across Sequencing Platforms

A cross-platform somatic variant caller

Google Research and UC Santa Cruz have released DeepSomatic, an AI model designed to detect small somatic variants in cancer genomes across multiple sequencing technologies. In collaboration with Children’s Mercy, the team showed that the tool found 10 variants in pediatric leukemia samples that other callers had missed. DeepSomatic builds on the DeepVariant framework and targets single nucleotide variants (SNVs) and small insertions and deletions (indels) in whole-genome sequencing (WGS) and whole-exome sequencing (WES) data.

How DeepSomatic works

DeepSomatic converts aligned reads into image-like tensors that encode read pileups, base qualities, and alignment context. These tensors summarize local haplotype and error patterns across technologies, making the approach platform-agnostic. A convolutional neural network classifies candidate sites as somatic or not, and the pipeline outputs standard VCF or gVCF files. The model supports both tumor-normal and tumor-only workflows and includes models tuned for FFPE samples.
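To make the idea of an image-like pileup tensor concrete, here is a minimal toy sketch. It is not DeepSomatic's actual encoding: the three channels, the scaling constants, and the function name `encode_pileup` are all assumptions chosen for illustration, and indels, strand, and mapping quality are ignored.

```python
# Toy illustration only: the channel layout and scaling below are assumptions,
# not DeepSomatic's real input format.
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reads, ref, window_start, width, max_reads=4):
    """Build a [channel][row][column] tensor from gapless read alignments.

    reads: list of (start_position, sequence, base_qualities) tuples.
    ref:   reference sequence covering exactly the window.
    Channels: 0 = base identity, 1 = base quality scaled to [0, 1],
              2 = 1.0 where the read base differs from the reference.
    """
    tensor = [[[0.0] * width for _ in range(max_reads)] for _ in range(3)]
    for row, (start, seq, quals) in enumerate(reads[:max_reads]):
        for offset, (base, qual) in enumerate(zip(seq, quals)):
            col = start + offset - window_start
            if 0 <= col < width:
                tensor[0][row][col] = BASE_CODE.get(base, 0.0)
                tensor[1][row][col] = min(qual, 40) / 40.0
                tensor[2][row][col] = 1.0 if base != ref[col] else 0.0
    return tensor


ref = "ACGTAC"
reads = [
    (0, "ACGTAC", [40, 40, 40, 40, 40, 40]),  # matches the reference
    (2, "GAAC", [30, 20, 40, 40]),            # low-quality mismatch at column 3
]
tensor = encode_pileup(reads, ref, window_start=0, width=len(ref))
```

Each row is one read and each column one reference position, so a classifier sees both the mismatch and the low base quality supporting it at the same pixel. A real encoder would carry many more signals per position.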

Datasets and benchmarking

Training and evaluation relied on CASTLE (Cancer Standards Long-read Evaluation), a dataset of six matched tumor-normal cell line pairs sequenced with Illumina short reads, PacBio HiFi, and Oxford Nanopore long reads. The research team released the benchmark sets and accessions to enable reuse, filling a gap in multi-technology resources for training and testing somatic callers.

Performance highlights

Compared with established baselines, DeepSomatic demonstrated consistent gains for both SNVs and indels. The baseline tools were SomaticSniper, MuTect2, and Strelka2 for short reads and ClairS for long reads. The team highlights particular strength in indel detection, a historically challenging problem.
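Comparisons like these are commonly summarized with precision, recall, and F1 against a truth set, reported separately for SNVs and indels. The sketch below assumes that setup; it is not the authors' evaluation code. The function names and the example variants are hypothetical, and exact tuple matching stands in for the variant normalization that real benchmarking tools perform.

```python
def variant_type(ref, alt):
    """Classify by allele length: SNV, INDEL, or OTHER (e.g. multi-base substitutions)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"


def score(calls, truth):
    """Compare sets of (chrom, pos, ref, alt) tuples by exact match, per variant type."""
    results = {}
    for vtype in ("SNV", "INDEL"):
        called = {v for v in calls if variant_type(v[2], v[3]) == vtype}
        expected = {v for v in truth if variant_type(v[2], v[3]) == vtype}
        tp = len(called & expected)   # called and in the truth set
        fp = len(called - expected)   # called but not in the truth set
        fn = len(expected - called)   # in the truth set but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[vtype] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Hypothetical truth set and call set, for illustration only.
truth = {("chr1", 100, "A", "T"), ("chr2", 75, "C", "G"),
         ("chr1", 200, "AT", "A"), ("chr2", 50, "G", "GCC")}
calls = {("chr1", 100, "A", "T"), ("chr1", 300, "G", "C"),
         ("chr1", 200, "AT", "A")}
results = score(calls, truth)
```

Scoring indels separately matters because a caller can look strong overall while missing most indels, which are far rarer than SNVs in a typical call set.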

Generalization to real samples

The authors evaluated DeepSomatic on cases outside the training set. In a glioblastoma sample, the model recovered known driver mutations. For pediatric leukemia, they ran the model in tumor-only mode (no matched normal) and recovered known calls while reporting additional variants. These results suggest that the tensor representation and training scheme generalize to new disease contexts and to settings without matched normals.

Practical implications

DeepSomatic provides a pragmatic, unified approach for somatic variant calling across sequencing platforms. By retaining DeepVariant’s image-tensor representation and a CNN classifier, the pipeline maintains consistent preprocessing and outputs from Illumina to PacBio HiFi to Oxford Nanopore. The CASTLE dataset strengthens reproducibility by supplying matched tumor-normal cell lines across three technologies. With support for WGS and WES, tumor-normal and tumor-only workflows, and FFPE models, DeepSomatic addresses several real-world laboratory constraints and improves indel detection where prior methods struggled.
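Because the output is standard VCF regardless of sequencing platform, downstream code does not need to know which technology produced the reads. The sketch below reads VCF text and keeps records whose FILTER column is `PASS`. It relies only on the VCF column layout; the `LowQual` label and the example records are placeholders, since the filter labels a given caller writes are tool-specific and are not described here.

```python
import io


def read_pass_variants(handle):
    """Yield (chrom, pos, ref, alt) for VCF records whose FILTER column is PASS."""
    for line in handle:
        if line.startswith("#"):        # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:             # VCF requires eight fixed columns
            continue
        chrom, pos, _id, ref, alts, _qual, filt = fields[:7]
        if filt != "PASS":
            continue
        for alt in alts.split(","):     # one tuple per alternate allele
            yield chrom, int(pos), ref, alt


# Hypothetical records; "LowQual" is a placeholder filter label.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\n"
    "chr1\t200\t.\tAT\tA\t12\tLowQual\t.\n"
    "chr2\t50\t.\tG\tGCC,GC\t45\tPASS\t.\n"
)
variants = list(read_pass_variants(io.StringIO(example_vcf)))
```

For real files, an established VCF parsing library is a safer choice than hand-rolled splitting, since it also handles compressed input and the INFO and FORMAT fields.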

Availability

The pipeline, benchmarks, dataset accessions, and code are released publicly on GitHub. The research post and technical paper provide additional details and links to the repository and tutorials.