Test Time Training
A cursory analysis of a test-time training paper and the corresponding Python code.
Introduction
The core idea of test-time training (TTT) is finetuning an LM upon encountering previously unseen input data.
Definition
- Test time training: updating model parameters temporarily during inference, using a loss derived from the input data.
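To make the definition concrete, here is a minimal PyTorch sketch of the idea under my own simplifying assumptions (a classification-style model and a hypothetical `make_ttt_dataset` helper that derives labeled examples from the test input); this is not the paper's implementation:

```python
import copy
import torch
import torch.nn.functional as F

def test_time_predict(model, make_ttt_dataset, test_input, steps=5, lr=1e-4):
    """Temporarily update a copy of `model` on a loss derived from the test input."""
    tuned = copy.deepcopy(model)          # leave the base model untouched
    tuned.train()
    optimizer = torch.optim.AdamW(tuned.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in make_ttt_dataset(test_input):  # hypothetical helper
            loss = F.cross_entropy(tuned(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    tuned.eval()
    with torch.no_grad():
        return tuned(test_input)          # the tuned copy is discarded afterwards
```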
Results
- Over double the accuracy of fully-neural approaches (from 25% to 53%).
- When combined with program synthesis, this increases to 62%.
For context, 62% accuracy is roughly average human performance; expert human performance is about 93%.
Crucial parts
- Initial finetuning on similar tasks generated at test time.
- Leave-one-out generation strategy for constructing the test-time dataset.
- Per-instance adapter training (see the sketch after this list).
- Self-consistency approach under invertible transformations.
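To illustrate the per-instance adapter step, here is a hedged sketch using the peft library; the base model name, LoRA hyperparameters, target modules, and the `ttt_examples` input are illustrative assumptions rather than values from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

def train_instance_adapter(base_name, ttt_examples, epochs=2, lr=1e-4):
    """Train a small LoRA adapter on one test task's TTT dataset (list of prompt strings)."""
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    model = AutoModelForCausalLM.from_pretrained(base_name)
    config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                        target_modules=["q_proj", "v_proj"])  # assumed LLaMA-style names
    model = get_peft_model(model, config)  # only the adapter weights are trainable
    optimizer = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in ttt_examples:
            batch = tokenizer(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # used to answer this single task, then discarded
```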
Two methods
The paper presents two main categories for ARC solvers:
Program Synthesis
- Find the transformation.
- Apply it to the test example.
Fully Neural
- Directly predict the test output, only implicitly reasoning about the underlying transformation.
Data augmentation
These are methods used to increase the amount of training data from about 400 tasks to 60,000 tasks.
Leave-one-out
Given the N grid pairs that make up a task's training examples, the leave-one-out strategy generates an additional N synthetic tasks, each of which holds out one pair as the test example and uses the remaining N-1 pairs as training examples.
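A hypothetical sketch of that construction (the task dictionary layout is my own assumption, not the paper's data format):

```python
def leave_one_out(train_pairs):
    """train_pairs: list of (input_grid, output_grid) tuples of length N."""
    synthetic_tasks = []
    for i, held_out in enumerate(train_pairs):
        synthetic_tasks.append({
            "train": train_pairs[:i] + train_pairs[i + 1:],  # the N-1 remaining pairs
            "test": [held_out],                              # the held-out pair
        })
    return synthetic_tasks
```

With N training pairs this yields N synthetic tasks, which the invertible transformations described below then multiply further.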
Invertible transformations
Their TTT model uses invertible transformations such as
- rotation
- scaling
- flip
- color permutation
- example permutation
to increase the relatively small set of given and synthetic tasks to over \(N^2\) tasks.
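Below is a hedged sketch of how such invertible transformations can be combined with the self-consistency vote mentioned earlier: apply each transform, solve the transformed task with a `solver` function (a stand-in for the model, not the paper's code), map the prediction back through the inverse, and keep the most common answer. Grids are assumed to be 2D numpy arrays.

```python
import numpy as np
from collections import Counter

# (forward, inverse) pairs; composing the two must return the original grid
TRANSFORMS = [
    (lambda g: g,               lambda g: g),                # identity
    (lambda g: np.rot90(g, 1),  lambda g: np.rot90(g, -1)),  # 90-degree rotation
    (lambda g: np.fliplr(g),    lambda g: np.fliplr(g)),     # horizontal flip (self-inverse)
]

def self_consistent_predict(solver, task):
    """`solver`: hypothetical function from a transformed task to an output grid."""
    votes, grids = Counter(), {}
    for fwd, inv in TRANSFORMS:
        transformed = {
            "train": [(fwd(x), fwd(y)) for x, y in task["train"]],
            "test_input": fwd(task["test_input"]),
        }
        pred = np.asarray(inv(solver(transformed)))  # map the answer back to original space
        key = (pred.shape, pred.tobytes())           # hashable key for voting
        votes[key] += 1
        grids[key] = pred
    return grids[votes.most_common(1)[0][0]]
```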
Prompts
Much of the pipeline works by sending the LM strategically worded prompts that instruct it to do the heavy lifting. The prompts look something like this:
Describing a task
You are an intelligent agent that can induce task descriptions from examples. For Category, please
*do not* use generic terms like Transformation, Pattern Recognition.
—————-
Task: {stringified task inputs and outputs}
LARC Description: {description of the task-1 from LARC dataset}
Good Description: {hierarchical description}
—————-
[truncated]
—————-
Task: {stringified task inputs and outputs for task-K}
LARC Description: {description of the task-K from LARC dataset}
Good Description: {hierarchical description}
—————-
Task: {stringified task inputs and outputs for query task}
LARC Description: {description of the query task from LARC dataset}
Generating additional tasks
You are a problem generator on 2D grids of colors. Here are some examples of such transformations,
please follow the format:
—————-
Example: {description of the generator function-1}
Script: {generator function-1}
—————-
[truncated]
—————-
Example: {description of the generator function-K}
Script: {generator function-K}
Please generate more and make sure they are different:
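For concreteness, a few-shot prompt like the one above might be assembled roughly as follows; the function, field names, and separator string are illustrative assumptions, not the paper's code:

```python
SEPARATOR = "-" * 14

def build_generation_prompt(examples):
    """examples: list of dicts with 'description' and 'script' keys."""
    header = (
        "You are a problem generator on 2D grids of colors. "
        "Here are some examples of such transformations, please follow the format:"
    )
    blocks = [
        f"Example: {ex['description']}\nScript: {ex['script']}" for ex in examples
    ]
    body = f"\n{SEPARATOR}\n".join(blocks)
    footer = "Please generate more and make sure they are different:"
    return f"{header}\n{SEPARATOR}\n{body}\n{SEPARATOR}\n{footer}"
```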
Limitations
Time
Even if the current TTT implementation worked as intended, its large computational requirements would preclude it from participating in the ARC challenge.
Leakage
It is not clear whether the public availability of the ARC dataset has artificially inflated this model's performance.
Conclusion
I don't claim to fully understand the mechanisms that underlie the TTT model, but it presents some interesting ideas regarding data augmentation (the leave-one-out strategy and invertible transformations) and novel uses of LMs, and I found its codebase to be mostly straightforward and illuminating. The paper also introduces some useful terminology, such as "fully-neural" and "program synthesis."