Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, Jacob Steinhardt
3/14/2023
cs.LG

Abstract

We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
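
To make the decoding step in the abstract concrete, here is a minimal sketch in plain Python. It is not the code from the linked repository. It assumes the per-block probe is an affine map (A, b) applied to the hidden state before the model's own frozen unembedding matrix, and that the probe would be trained to minimize the KL divergence from the model's final output distribution. It leaves out details such as layer normalization. All function names, matrices, and values are hypothetical and chosen only for illustration.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(h, W_U):
    """Decode a hidden state directly with the frozen unembedding W_U."""
    return softmax(matvec(W_U, h))


def tuned_lens(h, A, b, W_U):
    """Apply a per-block affine probe (A, b), then the frozen unembedding."""
    translated = [y + bias for y, bias in zip(matvec(A, h), b)]
    return softmax(matvec(W_U, translated))


def kl_divergence(p, q):
    """KL(p || q); assumed here as the loss a probe would be trained on."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy setup (all values hypothetical): hidden size 2, vocabulary size 3.
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h = [0.5, -0.25]              # hidden state from some intermediate block
A = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix
b = [0.0, 0.0]                # zero bias

p_logit = logit_lens(h, W_U)
p_tuned = tuned_lens(h, A, b, W_U)  # identical to p_logit for identity A, zero b
```

With an identity matrix and zero bias, this sketch reduces to the logit lens, which is one way to see the tuned lens as a refinement of it. In a real setting, A and b would be fit separately for each block by gradient descent while the pretrained model's weights stay frozen.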


Code Implementations (2)

PyTorch | MIT

Tools for understanding how transformer predictions are built layer-by-layer

58567 | Python, Dockerfile | Oct 3, 2022 | 10 months ago | MIT
machine-learning, pytorch, transformers

MIT

A library for mechanistic interpretability of GPT-style language models

3,383 | 562 | Python, Makefile | Aug 26, 2022 | 1 month ago | MIT

Cite this paper

@article{belrose2023eliciting,
  title  = {Eliciting Latent Predictions from Transformers with the Tuned Lens},
  author = {Nora Belrose and Igor Ostrovsky and Lev McKinney and Zach Furman and Logan Smith and Danny Halawi and Stella Biderman and Jacob Steinhardt},
  year   = {2023},
  eprint = {2303.08112v6},
  archivePrefix = {arXiv},
  url    = {http://arxiv.org/abs/2303.08112v6}
}
