1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
3 %% Any copyright is dedicated to the Public Domain.
4 %% https://creativecommons.org/publicdomain/zero/1.0/
5 %% Written by Francois Fleuret <francois@fleuret.org>
7 \documentclass[11pt,a4paper,oneside]{article}
8 \usepackage[paperheight=15cm,paperwidth=8cm,top=2mm,bottom=15mm,right=2mm,left=2mm]{geometry}
9 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
10 \usepackage[utf8]{inputenc}
11 \usepackage{amsmath,amssymb,dsfont}
12 \usepackage[pdftex]{graphicx}
13 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
16 \usetikzlibrary{arrows,arrows.meta,calc}
17 \usetikzlibrary{patterns,backgrounds}
18 \usetikzlibrary{positioning,fit}
19 \usetikzlibrary{shapes.geometric,shapes.multipart}
20 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
21 \usetikzlibrary{tikzmark}
22 \usetikzlibrary{decorations.pathmorphing}
23 \usepackage[round]{natbib}
24 \usepackage[osf]{libertine}
25 \usepackage{microtype}
27 \usepackage{mleftright}
30 \setlist[itemize]{leftmargin=0pt,itemindent=1em,itemsep=2ex}
31 \setlist{nosep} % or \setlist{noitemsep} to leave space around whole list
33 \newcommand{\setmuskip}[2]{#1=#2\relax}
34 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
35 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
36 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
38 \setlength{\parindent}{0cm}
39 \setlength{\parskip}{1ex}
40 %\renewcommand{\baselinestretch}{1.3}
41 %\setlength{\tabcolsep}{0pt}
42 %\renewcommand{\arraystretch}{1.0}
44 \def\argmax{\operatornamewithlimits{argmax}}
45 \def\argmin{\operatornamewithlimits{argmin}}
47 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
49 \def\given{\,\middle\vert\,}
50 \def\proba{\operatorname{P}}
51 \newcommand{\seq}{{S}}
52 \newcommand{\expect}{\mathds{E}}
53 \newcommand{\variance}{\mathds{V}}
54 \newcommand{\empexpect}{\hat{\mathds{E}}}
55 \newcommand{\mutinf}{\mathds{I}}
56 \newcommand{\empmutinf}{\hat{\mathds{I}}}
57 \newcommand{\entropy}{\mathds{H}}
58 \newcommand{\empentropy}{\hat{\mathds{H}}}
59 \newcommand{\ganG}{\mathbf{G}}
60 \newcommand{\ganD}{\mathbf{D}}
61 \newcommand{\ganF}{\mathbf{F}}
63 \newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
64 \newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
66 \newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
67 \newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
69 \def\positionalencoding{\operatorname{pos-enc}}
70 \def\concat{\operatorname{concat}}
71 \def\crossentropy{\LL_{\operatorname{ce}}}
78 {\Large Self-Generated Culture}
86 \centerline{\color{red}(work in progress, to be updated)}
90 \centerline{\url{https://fleuret.org/public/culture/culture.pdf}}
94 \section{Introduction}
96 The hypothesis behind this experiment is that high-level abstract
97 thinking is fueled by social competition. A group of communicating
98 agents that try to demonstrate their cognitive superiority would end
99 up developing a rich and consistent culture.
The experiment is designed with a group of GPTs that alternately
learn to solve quizzes and generate new ones.
104 A ``quiz'' is a triplet of the form $(A, d, B)$ where $A$ and $B$ are
105 two sequences and $d$ is a token indicating if the direction is
106 forward or backward. Given $(A, d)$, the challenge is to generate $B$.
The experiment starts with a set of quizzes that is
progressively enriched.
The initial set of quizzes consists of predicting the dynamics of a
114 very simple world: A $6 \times 8$ grid with three colored ``birds'' moving in
115 a straight line, possibly bouncing on the grid's borders. There are
116 ten different colors.
119 \includegraphics[scale=0.35]{pics/examples_train.png}
In each of these quizzes, $A$ is the left image serialized in
126 raster-scan order as a sequence of $6 \times 8 = 48$ tokens, $d$ is
127 either the token ``forward'' or the token ``backward'', and $B$ is the
128 right image, also serialized. The direction of prediction is chosen at
131 \section{Generating Quizzes}
133 Given a set of $N$ GPTs, we can generate new quizzes as follows:
134 Select one of the models, and use it to generate the $97$ tokens of a
137 Then with each one of the $N-1$ other models, predict $B$ from $(A,
138 d)$, and $A$ from $(B, d')$ where $d'$ is the direction token opposite
141 A quiz is validated if \textbf{all the other GPTs but one predict it
142 deterministically correctly in both directions.}
This criterion ensures that the new quizzes are both solvable and
sophisticated, and that they incrementally complexify the culture. Requiring both
directions prevents the generation of quizzes that are non-trivial
only because the prompt has been randomly degraded.
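
As an illustration, this criterion can be written formally as follows.
The notation is introduced here for convenience and is not taken from
the code: $g$ is the index of the generating model, $\hat{B}_i$ and
$\hat{A}_i$ are the sequences produced deterministically by model $i$
from $(A, d)$ and from $(B, d')$ respectively, and ``all but one'' is
read as ``exactly one failure''.

\[
c_i = \mathds{1}\bigl\{ \hat{B}_i = B \ \text{and}\ \hat{A}_i = A \bigr\},
\qquad
\text{validated} \iff \sum_{i \neq g} c_i = N - 2.
\]

With $N = 5$ models for instance (a value chosen only for the
example), a quiz is kept under this reading if exactly three of the
four non-generating models solve it in both directions.
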
149 \section{Overall Process}
151 The overall process consists of training the GPTs from scratch by
152 iterating the following steps:
\item select the GPT with the lowest recorded test accuracy, train it for one epoch,
158 \item if its test accuracy gets above $97.5\%$, generate $1'000$ new
159 quizzes, add them to the training set, re-compute the accuracy of
166 This procedure results in the discovery of patterns which are not
167 present in the original quizzes:
172 \includegraphics[scale=0.35]{pics/4_birds_1.png}
173 \includegraphics[scale=0.35]{pics/5_birds_1.png}
175 \includegraphics[scale=0.35]{pics/6_birds_1.png}
178 \textbf{New bird shapes}
182 \includegraphics[scale=0.35]{pics/other_shapes_2.png}
183 \includegraphics[scale=0.35]{pics/other_shapes_3.png}
189 \includegraphics[scale=0.35]{pics/other_shapes_1.png}
190 \includegraphics[scale=0.35]{pics/occlusions_1.png}
193 \section{Various thoughts}
197 \item The whole process can be envisioned as natural selection of
198 quizzes in the representation landscape of GPTs. There probably is a
199 subtle relation between the temperature (mutation rate) and the
200 number of models used to validate with the ``all but one'' criterion
201 (survival criterion).
\item The ``all but one'' criterion could be generalized to ``all but
$K$'', and there may be an information-theoretic interpretation in
which the goal is to maximize mutual information: $K=N$ corresponds
to total randomness, hence high entropy but no structure, and $K=0$
to total determinism, hence no information to share.
\item The setup does not push toward any specific invariance or
property in the generated quizzes. Their consistency is entirely due
to the statistics of the ``world quizzes'' that remain in the
training set, and to the GPTs' inductive biases.
\item The GPTs obviously get a sense of objectness and 2D topology
early on, since they rapidly increase the number of birds and
``discover'' occlusion even though it was never present in the world
219 \item There may not be so many problems that can be cast as pairs of
220 patterns that are each a deterministic function of the other, which
221 is probably critical here.
\item This overall process probably fights the ``simplicity bias'': If
224 a model is lacking a ``cue'' that the others have, there will
225 rapidly be quizzes that require this cue, they will be added to the
226 training data, and that model will catch up.
\item The randomness of the process probably allows it to go even beyond
229 just synchronizing the abilities of the models. There may be some
230 additional complexification of quizzes that get accepted by chance.
\item The process can be parallelized by dispatching the GPTs across
multiple nodes, and the quadratic cost can be avoided by validating
each quiz with only a subset of the GPTs.
\item The current process to generate new quizzes, which simply
samples them at random, is very rudimentary and probably not
sufficient in a real-data setup. It can probably be supplemented
with an MCTS-type search.
\item There may already be some structure in the generated quizzes
242 that \emph{we} do not pick up (e.g. certain color or motion
249 The code is available at
253 \centerline{\url{https://fleuret.org/git/culture}}
The experiments are run on an RTX 4090.
257 The GPT used has 37M parameters and the following structure:
261 \texttt{dim\_model} & 512 \\
262 \texttt{dim\_keys} & 64 \\
263 \texttt{dim\_hidden} & 2048 \\
264 \texttt{nb\_heads} & 8 \\
265 \texttt{nb\_blocks} & 12
Training uses Adam with a learning rate $\eta = 10^{-4}$ and no scheduling.
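
As a sanity check, the 37M figure is consistent with the
hyper-parameters above if we assume a standard transformer block,
which the text does not spell out: four $d \times d$ attention
projections (queries, keys, values, and output, with
\texttt{nb\_heads} $\times$ \texttt{dim\_keys} $= 8 \times 64 = 512 = d$)
and a two-layer MLP of hidden size $h = 2048$. Ignoring biases, layer
normalizations, and embeddings, the count over the $12$ blocks is

\[
12 \times \left( 4 d^2 + 2 d h \right)
= 12 \times \left( 4 \cdot 2^{18} + 2 \cdot 2^{20} \right)
= 36 \cdot 2^{20}
\approx 37.7 \times 10^6.
\]
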
271 There are $N_{\text{train}}=250'000$ original quizzes for training and
272 $N_{\text{test}} = 10'000$ for test.
274 At each epoch, for both train and test samples, we mix original
275 quizzes and the generated ones.
For training for instance, if there are fewer than $N_{\text{train}}/2$
278 new quizzes, we take all of them, otherwise we sample
279 $N_{\text{train}}/2$ of them without replacement, and then we sample
280 without replacement enough original quizzes to get $N_{\text{train}}$
283 We proceed similarly to get $N_{\text{test}}$ samples for test.
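
Written out for the training set, with $M$ denoting the number of
generated quizzes available (a symbol introduced here for
illustration), the mix at each epoch contains

\[
n_{\text{new}} = \min\left(M, \frac{N_{\text{train}}}{2}\right),
\qquad
n_{\text{orig}} = N_{\text{train}} - n_{\text{new}}.
\]

For example, with $M = 1'000$ the mix is $1'000$ generated quizzes and
$249'000$ original ones, while with $M = 200'000$ it is $125'000$ of
each. The test mix follows the same formulas with $N_{\text{test}}$ in
place of $N_{\text{train}}$.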