Test Time Training
A cursory analysis of a test-time training paper and the corresponding Python code.
Introduction
The core idea of test-time training (TTT) is finetuning an LM upon encountering previously unseen input data.
Definition
- Test time training: updating model parameters temporarily during inference, using a loss derived from the input data.
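To make the definition concrete, here is a minimal PyTorch sketch of the idea under my own simplifying assumptions (a classification-style model and a hypothetical `make_ttt_dataset` helper that derives labeled examples from the test input); this is not the paper's implementation:

```python
import copy
import torch
import torch.nn.functional as F

def test_time_predict(model, make_ttt_dataset, test_input, steps=5, lr=1e-4):
    """Temporarily update a copy of `model` on a loss derived from the test input."""
    tuned = copy.deepcopy(model)          # leave the base model untouched
    tuned.train()
    optimizer = torch.optim.AdamW(tuned.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in make_ttt_dataset(test_input):  # hypothetical helper
            loss = F.cross_entropy(tuned(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    tuned.eval()
    with torch.no_grad():
        return tuned(test_input)          # the tuned copy is discarded afterwards
```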
Results
- Over double the accuracy of fully-neural approaches (from 25% to 53%).
- When combined with program synthesis, this increases to 62%.
For context, 62% accuracy is roughly average human performance; expert human performance is about 93%.
Crucial parts
- Initial finetuning on similar tasks generated at test time.
- Leave-one-out generation strategy for constructing the test-time dataset.
- Per-instance adapter training (see the sketch after this list).
- Self-consistency approach under invertible transformations.
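To illustrate the per-instance adapter step, here is a hedged sketch using the peft library; the base model name, LoRA hyperparameters, target modules, and the `ttt_examples` input are illustrative assumptions rather than values from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

def train_instance_adapter(base_name, ttt_examples, epochs=2, lr=1e-4):
    """Train a small LoRA adapter on one test task's TTT dataset (list of prompt strings)."""
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    model = AutoModelForCausalLM.from_pretrained(base_name)
    config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                        target_modules=["q_proj", "v_proj"])  # assumed LLaMA-style names
    model = get_peft_model(model, config)  # only the adapter weights are trainable
    optimizer = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in ttt_examples:
            batch = tokenizer(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # used to answer this single task, then discarded
```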
Two methods
The paper presents two main categories for ARC solvers:
Program Synthesis
- Find the transformation.
- Apply it to the test example.
Fully Neural
- Directly predict the test output, only implicitly reasoning about the underlying transformation.
Data augmentation
These are methods used to increase the amount of training data from about 400 tasks to 60,000 tasks.
Leave-one-out
Given the N grid pairs that make up a task's training examples, the leave-one-out strategy generates an additional N synthetic tasks, each of which holds out one pair as the test example and uses the remaining N-1 pairs as training examples.
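A hypothetical sketch of that construction (the task dictionary layout is my own assumption, not the paper's data format):

```python
def leave_one_out(train_pairs):
    """train_pairs: list of (input_grid, output_grid) tuples of length N."""
    synthetic_tasks = []
    for i, held_out in enumerate(train_pairs):
        synthetic_tasks.append({
            "train": train_pairs[:i] + train_pairs[i + 1:],  # the N-1 remaining pairs
            "test": [held_out],                              # the held-out pair
        })
    return synthetic_tasks
```

With N training pairs this yields N synthetic tasks, which the invertible transformations described below then multiply further.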
Invertible transformations
Their TTT model uses invertible transformations such as
- rotation
- scaling
- flip
- color permutation
- example permutation
to increase the relatively small set of given and synthetic tasks to over \(N^2\) tasks.
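Below is a hedged sketch of how such invertible transformations can be combined with the self-consistency vote mentioned earlier: apply each transform, solve the transformed task with a `solver` function (a stand-in for the model, not the paper's code), map the prediction back through the inverse, and keep the most common answer. Grids are assumed to be 2D numpy arrays.

```python
import numpy as np
from collections import Counter

# (forward, inverse) pairs; composing the two must return the original grid
TRANSFORMS = [
    (lambda g: g,               lambda g: g),                # identity
    (lambda g: np.rot90(g, 1),  lambda g: np.rot90(g, -1)),  # 90-degree rotation
    (lambda g: np.fliplr(g),    lambda g: np.fliplr(g)),     # horizontal flip (self-inverse)
]

def self_consistent_predict(solver, task):
    """`solver`: hypothetical function from a transformed task to an output grid."""
    votes, grids = Counter(), {}
    for fwd, inv in TRANSFORMS:
        transformed = {
            "train": [(fwd(x), fwd(y)) for x, y in task["train"]],
            "test_input": fwd(task["test_input"]),
        }
        pred = np.asarray(inv(solver(transformed)))  # map the answer back to original space
        key = (pred.shape, pred.tobytes())           # hashable key for voting
        votes[key] += 1
        grids[key] = pred
    return grids[votes.most_common(1)[0][0]]
```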
Prompts
Much of the pipeline works by sending the LM strategically worded prompts that instruct it to do the heavy lifting. The prompts look something like this:
Describing a task
You are an intelligent agent that can induce task descriptions from examples. For Category, please
*do not* use generic terms like Transformation, Pattern Recognition.
—————-
Task: {stringified task inputs and outputs}
LARC Description: {description of the task-1 from LARC dataset}
Good Description: {hierarchical description}
—————-
[truncated]
—————-
Task: {stringified task inputs and outputs for task-K}
LARC Description: {description of the task-K from LARC dataset}
Good Description: {hierarchical description}
—————-
Task: {stringified task inputs and outputs for query task}
LARC Description: {description of the query task from LARC dataset}
Generating additional tasks
You are a problem generator on 2D grids of colors. Here are some examples of such transformations,
please follow the format:
—————-
Example: {description of the generator function-1}
Script: {generator function-1}
—————-
[truncated]
—————-
Example: {description of the generator function-K}
Script: {generator function-K}
Please generate more and make sure they are different:
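For concreteness, a few-shot prompt like the one above might be assembled roughly as follows; the function, field names, and separator string are illustrative assumptions, not the paper's code:

```python
SEPARATOR = "-" * 14

def build_generation_prompt(examples):
    """examples: list of dicts with 'description' and 'script' keys."""
    header = (
        "You are a problem generator on 2D grids of colors. "
        "Here are some examples of such transformations, please follow the format:"
    )
    blocks = [
        f"Example: {ex['description']}\nScript: {ex['script']}" for ex in examples
    ]
    body = f"\n{SEPARATOR}\n".join(blocks)
    footer = "Please generate more and make sure they are different:"
    return f"{header}\n{SEPARATOR}\n{body}\n{SEPARATOR}\n{footer}"
```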
Limitations
Time
Even if the current TTT implementation worked as intended, its large computational requirements would preclude it from participating in the ARC challenge.
Leakage
It is not clear whether the public availability of the ARC dataset has artificially inflated this model's performance.
Conclusion
I don't claim to fully understand the mechanisms that underlie the TTT model, but it presents some interesting ideas regarding data augmentation (the leave-one-out strategy and invertible transformations) and novel uses of LMs, and I found its codebase to be mostly straightforward and illuminating. The paper also introduces some useful terminology, such as "fully-neural" and "program synthesis."