The motivation behind this experiment is to check if and how statistical models can generate patterns beyond those appearing in their training set, while remaining consistent with it.

We define a "quiz" as a series of four images representing a functional transformation f. The four images are of the form A, f(A), B, f(B), where A, B, and f are all random, albeit very structured. "Solving" a quiz means predicting correctly any one of the four images given the three others, that is:

  f(A), B, f(B) -> A
  A, B, f(B)    -> f(A)
  A, f(A), f(B) -> B
  A, f(A), B    -> f(B)

The image culture_original.png shows examples of the initial quizzes, which were generated with hand-designed algorithmic procedures. As can be seen in these examples, they are very structured: there are always three rectangles, non-overlapping in A and B (and generally non-overlapping in f(A) and f(B)), and the rectangle colors are the same in the four images.

From these samples, we train 32 models which are both masked predictors and generative models. They are 133M-parameter attention-based deep models, and they easily reach 95%+ accuracy according to the definition of "solving" above (the first sketch below spells out these four prediction tasks).

Once these 32 models get above 90% accuracy, we use them to generate new quizzes. For every new quiz they generate, we take 12 of the 32 models at random and count how many of them can solve it. If 4 models or more solve it perfectly and 4 or more fail severely (that is, are more than 5 pixels off), we record this quiz (second sketch below).

Once we have recorded 50k+ such new quizzes, we delete the models and retrain them from scratch with batches composed of 50% samples from the original distribution and 50% samples from the generated "culture" (third sketch below). We then repeat the whole procedure.

The images culture_valid_N.png show examples of the recorded generated quizzes.
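Below is a minimal sketch of the "solving" criterion. It assumes a quiz is a tuple of four integer image arrays and that a model exposes a predict-the-masked-image callable; both interfaces are hypothetical and only meant to make the four prediction tasks concrete.

  import numpy as np

  def prediction_errors(model, quiz):
      """Pixel-error count for each of the four masked-prediction tasks.

      quiz  -- tuple (A, fA, B, fB) of integer image arrays
      model -- callable(visible_images, masked_index) -> predicted image
      """
      errors = []
      for masked_index in range(4):
          # Hide one image, show the three others, ask for the missing one
          visible = [img for i, img in enumerate(quiz) if i != masked_index]
          prediction = model(visible, masked_index)
          errors.append(int(np.sum(prediction != quiz[masked_index])))
      return errors

  def solves(model, quiz):
      """A quiz counts as solved when all four predictions are exact."""
      return max(prediction_errors(model, quiz)) == 0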
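The recording rule for generated quizzes can then be sketched as follows. The thresholds (12 graders drawn from the 32 models, 4 or more perfect, 4 or more off by more than 5 pixels) come from the text above; reading "fail severely" as "worst of the four tasks is more than 5 pixels off" is an assumption.

  import random

  def keep_quiz(models, quiz, prediction_errors,
                n_graders=12, n_solve=4, n_fail=4, fail_margin=5):
      """Record a generated quiz if enough graders solve it perfectly and
      enough fail it severely; prediction_errors is the function sketched
      above, passed in to keep this snippet self-contained."""
      graders = random.sample(models, n_graders)
      worst = [max(prediction_errors(m, quiz)) for m in graders]
      n_perfect = sum(e == 0 for e in worst)
      n_severe = sum(e > fail_margin for e in worst)
      return n_perfect >= n_solve and n_severe >= n_fail

Requiring both strong solvers and strong failers presumably selects quizzes that are non-trivial yet still consistent, in line with the stated motivation of generating patterns beyond the training set without drifting into noise.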
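Finally, a sketch of the 50/50 batch composition used when retraining from scratch; sample_original and sample_culture are hypothetical stand-ins for the two data sources named above, each returning a batch of quizzes as an array.

  import numpy as np

  def mixed_batch(sample_original, sample_culture, batch_size):
      """Half a batch from the hand-designed generator, half from the
      recorded culture, shuffled together."""
      half = batch_size // 2
      batch = np.concatenate([sample_original(half),
                              sample_culture(batch_size - half)])
      np.random.shuffle(batch)  # in-place shuffle along the sample axis
      return batch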