OpenAI Releases Circuit-Sparsity Tools for Weight-Sparse Transformers

Overview

OpenAI has released the openai/circuit-sparsity model on Hugging Face and the openai/circuit_sparsity toolkit on GitHub. The release packages the models and circuits from the paper ‘Weight-sparse transformers have interpretable circuits’.

What is a Weight Sparse Transformer?

The models are GPT-2 style decoder-only transformers trained on Python code. Sparsity is not added after training; it is enforced during optimization. After each AdamW step, the training loop keeps only the largest-magnitude entries in every weight matrix and bias, including the token embeddings, and sets the rest to zero. Every matrix keeps the same fraction of non-zero elements.
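The pruning step described above can be sketched in a few lines. This is a minimal illustration in plain Python, with nested lists standing in for tensors; the function name and the density parameter are assumptions made for this example and are not taken from the released toolkit.

```python
def keep_top_k_by_magnitude(matrix, density):
    """Zero every entry except the largest-magnitude fraction `density`.

    `matrix` is a list of equal-length rows of floats. The name and the
    `density` parameter are hypothetical, chosen for this sketch.
    """
    n_cols = len(matrix[0])
    flat = [v for row in matrix for v in row]
    k = max(1, round(len(flat) * density))
    # Rank positions by absolute value, largest first, and keep the top k.
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    keep = set(order[:k])
    pruned = [v if i in keep else 0.0 for i, v in enumerate(flat)]
    # Restore the original row structure.
    return [pruned[r:r + n_cols] for r in range(0, len(pruned), n_cols)]


weights = [[0.5, -0.1, 0.02],
           [-2.0, 0.3, 0.0]]
# Keep one third of the six entries, i.e. the two largest in magnitude.
sparse_weights = keep_top_k_by_magnitude(weights, density=1 / 3)
print(sparse_weights)
```

In a training loop, a projection like this would run on every weight matrix and bias right after the optimizer update, so the weights stay sparse throughout training rather than being pruned once at the end.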

In the sparsest models, roughly 1 in 1000 weights is non-zero. The OpenAI team also enforced mild activation sparsity, so that about 1 in 4 node activations is non-zero.
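To get a feel for these ratios, the short calculation below works out what they imply for a single layer. The 2048 by 2048 matrix size is an assumption picked purely for illustration; the source does not state layer dimensions.

```python
def sparsity_budget(rows, cols, weight_density=1 / 1000, activation_density=1 / 4):
    """Rough counts implied by the stated densities for one weight matrix.

    `rows` is the input width and `cols` the output width; both are
    hypothetical values supplied by the caller.
    """
    nonzero_weights = round(rows * cols * weight_density)
    return {
        "nonzero_weights": nonzero_weights,
        # Average number of surviving connections per input row.
        "avg_nonzero_per_row": nonzero_weights / rows,
        # Output nodes that would be non-zero at about 1 in 4.
        "active_nodes": round(cols * activation_density),
    }


budget = sparsity_budget(2048, 2048)
print(budget)
```

Under these assumed dimensions, a matrix with more than four million entries would keep only a few thousand weights, which works out to about two connections per row.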

Definition of Sparse Circuits

The central object is a sparse circuit. Each node is defined at a fine granularity: a single neuron, attention channel, or residual channel. An edge represents a single non-zero entry in a weight matrix connecting two nodes.
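The node and edge definitions above map naturally onto a small graph structure. The sketch below is one possible representation, not the toolkit's own data model; the Node type, the layer labels, and the edges_from_weights helper are all hypothetical names introduced for this example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    layer: str  # hypothetical label, e.g. "resid0" or "mlp0"
    index: int  # position of the neuron or channel within that layer


def edges_from_weights(weights, src_layer, dst_layer):
    """Return one (source, destination, weight) edge per non-zero entry.

    `weights[i][j]` is assumed to connect source node j to destination node i.
    """
    return [
        (Node(src_layer, j), Node(dst_layer, i), w)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if w != 0.0
    ]


weights = [[0.0, 1.5, 0.0],
           [0.0, 0.0, -0.7]]
edges = edges_from_weights(weights, src_layer="resid0", dst_layer="mlp0")
for src, dst, w in edges:
    print(src, "->", dst, w)
```

Because almost all weights are zero, the edge list stays short, which is what makes circuits at this granularity small enough to inspect by hand.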

The research team built 20 simple binary next-token prediction tasks on Python code to probe the models. Each task forces the model to choose between two completions that differ by one token. Examples include:

single_double_quote: predict whether to close with a single or double quote.
bracket_counting: decide between ] and ]] based on list nesting depth.
set_or_string: track whether a variable was initialized as a set or a string.
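A binary task of this kind can be scored by comparing the model's logits for the two candidate tokens. The sketch below shows the idea on a quote-closing example. The BinaryExample type and the toy_logits function are hypothetical stand-ins written for this illustration; toy_logits is a trivial heuristic, not a model, and the prompts are not taken from the released task definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryExample:
    prompt: str      # Python source up to the point of prediction
    correct: str     # the completion that matches the context
    distractor: str  # the alternative, differing by one token


def toy_logits(prompt):
    # Stand-in for a real model: score each quote character by how
    # often it already appears in the prompt.
    return {'"': float(prompt.count('"')), "'": float(prompt.count("'"))}


def accuracy(examples, logit_fn):
    """Fraction of examples where the correct token outscores the distractor."""
    hits = 0
    for ex in examples:
        logits = logit_fn(ex.prompt)
        if logits[ex.correct] > logits[ex.distractor]:
            hits += 1
    return hits / len(examples)


examples = [
    BinaryExample(prompt='name = "alice', correct='"', distractor="'"),
    BinaryExample(prompt="name = 'bob", correct="'", distractor='"'),
]
print(accuracy(examples, toy_logits))
```

Restricting the comparison to two tokens keeps the task signal clean: the only thing being measured is whether the model tracked the one piece of context that separates the two completions.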

Bridging Sparse and Dense Models

The research team introduces bridges that connect a sparse model to an already trained dense model. Each bridge is an encoder-decoder pair, applied once per sublayer, that maps dense activations into sparse activations and back. This allows perturbations to interpretable sparse features to affect the dense model's behavior in a controlled manner.
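The flow through a bridge can be illustrated with a toy example. The sketch below assumes, for simplicity, that the encoder and decoder are plain linear maps; the source does not specify their form, and the Bridge class, the perturb_via_bridge helper, and the matrices are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Bridge:
    """Toy encoder-decoder pair for one sublayer (linear maps assumed)."""

    def __init__(self, encoder, decoder):
        self.encoder = encoder  # maps dense activations to sparse ones
        self.decoder = decoder  # maps sparse activations back to dense

    def to_sparse(self, dense_act):
        return matvec(self.encoder, dense_act)

    def to_dense(self, sparse_act):
        return matvec(self.decoder, sparse_act)


def perturb_via_bridge(bridge, dense_act, feature, delta):
    """Nudge one sparse feature and return the resulting dense activation."""
    sparse_act = bridge.to_sparse(dense_act)
    sparse_act[feature] += delta
    return bridge.to_dense(sparse_act)


bridge = Bridge(
    encoder=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # dense dim 2 -> sparse dim 3
    decoder=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],    # sparse dim 3 -> dense dim 2
)
print(perturb_via_bridge(bridge, [1.0, 2.0], feature=0, delta=0.5))
```

The point of the sketch is the direction of travel: an edit is made in the sparse space, where individual features are easier to interpret, and the decoder carries its effect back into the dense model's activations.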

OpenAI Releases

The OpenAI team released the openai/circuit-sparsity model on Hugging Face, a 0.4 billion parameter model licensed under Apache 2.0. The release also includes the model checkpoints, task definitions, and a circuit visualization UI.

Key Takeaways

Weight-Sparse Training: Circuit sparsity trains models with extreme weight sparsity that is enforced during optimization rather than added afterward.
Small, Task-Specific Circuits: Circuits are defined at the level of individual neurons and channels, and the circuits recovered for specific tasks contain only tens of nodes.
Concrete Instantiations: For tasks such as quote detection and bracket counting, the underlying algorithm is recovered as a fully instantiated circuit.
Bridge Mechanism: Encoder-decoder bridges map between sparse and dense activations, so that perturbations to interpretable sparse features can influence an already trained dense model.