%% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-

%% Any copyright is dedicated to the Public Domain.
%% https://creativecommons.org/publicdomain/zero/1.0/
%% Written by Francois Fleuret <francois@fleuret.org>

\documentclass[11pt,a4paper,oneside]{article}
\usepackage[paperheight=15cm,paperwidth=8cm,top=2mm,bottom=15mm,right=2mm,left=2mm]{geometry}
%\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb,dsfont}
\usepackage[pdftex]{graphicx}
\usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
\urlstyle{same}
\usepackage{tikz}
\usetikzlibrary{arrows,arrows.meta,calc}
\usetikzlibrary{patterns,backgrounds}
\usetikzlibrary{positioning,fit}
\usetikzlibrary{shapes.geometric,shapes.multipart}
\usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
\usetikzlibrary{tikzmark}
\usetikzlibrary{decorations.pathmorphing}
\usepackage[round]{natbib}
\usepackage[osf]{libertine}
\usepackage{microtype}

\usepackage{mleftright}

\usepackage{enumitem}
\setlist[itemize]{leftmargin=0pt,itemindent=1em,itemsep=2ex}
\setlist{nosep} % or \setlist{noitemsep} to leave space around whole list

\newcommand{\setmuskip}[2]{#1=#2\relax}
\setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
\setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
\setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu

\setlength{\parindent}{0cm}
\setlength{\parskip}{1ex}
%\renewcommand{\baselinestretch}{1.3}
%\setlength{\tabcolsep}{0pt}
%\renewcommand{\arraystretch}{1.0}

\def\argmax{\operatornamewithlimits{argmax}}
\def\argmin{\operatornamewithlimits{argmin}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\def\given{\,\middle\vert\,}
\def\proba{\operatorname{P}}
\newcommand{\seq}{{S}}
\newcommand{\expect}{\mathds{E}}
\newcommand{\variance}{\mathds{V}}
\newcommand{\empexpect}{\hat{\mathds{E}}}
\newcommand{\mutinf}{\mathds{I}}
\newcommand{\empmutinf}{\hat{\mathds{I}}}
\newcommand{\entropy}{\mathds{H}}
\newcommand{\empentropy}{\hat{\mathds{H}}}
\newcommand{\ganG}{\mathbf{G}}
\newcommand{\ganD}{\mathbf{D}}
\newcommand{\ganF}{\mathbf{F}}

\newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
\newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}

\newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
\newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}

\def\positionalencoding{\operatorname{pos-enc}}
\def\concat{\operatorname{concat}}
\def\crossentropy{\LL_{\operatorname{ce}}}

\begin{document}

\vspace*{-3ex}

\begin{center}
{\Large Self-Generated Culture}

Fran\c cois Fleuret

\today

\vspace*{2ex}

\centerline{\color{red}(work in progress, to be updated)}

\medskip

\centerline{\url{https://fleuret.org/public/culture/culture.pdf}}

\end{center}

\section{Introduction}

The hypothesis behind this experiment is that high-level abstract
thinking is fueled by social competition. A group of communicating
agents that try to demonstrate their cognitive superiority would end
up developing a rich and consistent culture.

The experiment is designed with a group of GPTs that alternately
learn to solve quizzes and generate new ones.

A ``quiz'' is a triplet of the form $(A, d, B)$ where $A$ and $B$ are
two sequences and $d$ is a token indicating whether the direction is
forward or backward. Given $(A, d)$, the challenge is to generate $B$.

The experiment starts with a set of quizzes that is
progressively enriched.

\section{Bird World}

The initial set of quizzes consists of predicting the dynamics of a
very simple world: a $6 \times 8$ grid with three colored ``birds'', each moving in
a straight line and possibly bouncing off the grid's borders. There are
ten different colors.
%
\begin{center}
\includegraphics[scale=0.35]{pics/examples_train.png}
\end{center}
%

\vspace*{-2ex}

In each of these quizzes, $A$ is the left image serialized in
raster-scan order as a sequence of $6 \times 8 = 48$ tokens, $d$ is
either the token ``forward'' or the token ``backward'', and $B$ is the
right image, also serialized. The direction of prediction is chosen at
random.

\section{Generating Quizzes}

Given a set of $N$ GPTs, we can generate new quizzes as follows:
Select one of the models, and use it to generate the $97$ tokens of a
triplet $(A, d, B)$.
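
The count of $97$ tokens follows from the serialization described in
the previous section, with one token for the direction:
\[
\underbrace{48}_{A} + \underbrace{1}_{d} + \underbrace{48}_{B} = 97.
\]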

Then with each one of the $N-1$ other models, predict $B$ from $(A,
d)$, and $A$ from $(B, d')$ where $d'$ is the direction token opposite
to $d$.

A quiz is validated if \textbf{all the other GPTs but one predict it
  deterministically correctly in both directions.}
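
One possible way to write this criterion down, as a reading of the
rule above and not a transcription of the code: let $m$ be the model
that generated the quiz, and for each other model $n$ let
$\hat{B}_n$ be its prediction of $B$ from $(A, d)$ and $\hat{A}_n$
its prediction of $A$ from $(B, d')$. This notation is introduced
here for illustration, and ``deterministically'' is assumed to mean
decoding without sampling. With
\[
c_n = \mathds{1}\{\hat{B}_n = B\} \, \mathds{1}\{\hat{A}_n = A\},
\]
the quiz is validated when
\[
\sum_{n \neq m} c_n = N - 2,
\]
that is, when exactly one of the $N-1$ other models fails. Whether a
quiz solved by all $N-1$ other models should also be rejected is part
of this reading, not something stated explicitly above.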

This criterion ensures that the new quizzes are both solvable and
sophisticated, and that they incrementally increase the complexity of
the culture. Requiring both
directions prevents the generation of quizzes that are non-trivial
only because the prompt has been randomly degraded.

\section{Overall Process}

The overall process consists of training the GPTs from scratch by
iterating the following steps:
%
\begin{itemize}

\item select the GPT with the lowest recorded test accuracy and train it for one epoch,

\item if its test accuracy gets above $97.5\%$, generate $1'000$ new
  quizzes, add them to the training set, and recompute the accuracy of
  all the models.

\end{itemize}
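
In symbols, and only as a sketch of the loop above: writing $a_n$ for
the most recently recorded test accuracy of model $n$ (a notation
introduced here for illustration), each iteration trains the model
\[
n^\star = \argmin_{1 \le n \le N} a_n
\]
for one epoch, and new quizzes are generated only if the updated
accuracy satisfies $a_{n^\star} > 0.975$.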

\section{Results}

This procedure results in the discovery of patterns which are not
present in the original quizzes:

\textbf{More birds}

\begin{center}
\includegraphics[scale=0.35]{pics/4_birds_1.png}
\includegraphics[scale=0.35]{pics/5_birds_1.png}

\includegraphics[scale=0.35]{pics/6_birds_1.png}
\end{center}

\textbf{New bird shapes}

\begin{center}

\includegraphics[scale=0.35]{pics/other_shapes_2.png}
\includegraphics[scale=0.35]{pics/other_shapes_3.png}
\end{center}

\textbf{Occlusions}

\begin{center}
\includegraphics[scale=0.35]{pics/other_shapes_1.png}
\includegraphics[scale=0.35]{pics/occlusions_1.png}
\end{center}

\section{Various thoughts}

\begin{itemize}

\item The whole process can be envisioned as natural selection of
  quizzes in the representation landscape of GPTs. There probably is a
  subtle relation between the temperature (mutation rate) and the
  number of models used to validate with the ``all but one'' criterion
  (survival criterion).

\item The ``all but one'' criterion could be ``all but $K$'', and there may be
  an information-theoretic interpretation, where the goal is to maximize
  mutual information, with $K=N$ being total randomness, so high
  entropy but no structure, and $K=0$ being total determinism, so no
  information to share.

\item The setup does not push toward any specific invariance or
  property in the generated quizzes; their consistency is entirely due
  to the statistics of the ``world quizzes'' that remain in the
  training set, and to the GPTs' inductive biases.

\item The GPTs obviously get a sense of objectness and 2D topology
  early on, since they rapidly increase the number of birds and
  ``discover'' occlusion even though it was never present in the world
  quizzes.

\item There may not be so many problems that can be cast as pairs of
  patterns that are each a deterministic function of the other, which
  is probably critical here.

\item This overall process probably fights the ``simplicity bias'': if
  a model is lacking a ``cue'' that the others have, there will
  rapidly be quizzes that require this cue, they will be added to the
  training data, and that model will catch up.

\item The randomness of the process probably allows it to go even beyond
  just synchronizing the abilities of the models. There may be some
  additional complexification of quizzes that get accepted by chance.

\item The process can be parallelized by dispatching the GPTs across multiple
  nodes, and a quadratic cost can be avoided by limiting the validation of
  the quizzes to a subset of the models.

\item The current process to generate new quizzes, which simply
  samples them at random, is very rudimentary and probably not
  sufficient in a real-data setup. It can probably be supplemented
  with an MCTS-type search.

\item The generated quizzes may already contain some structure
  that \emph{we} do not pick up (e.g., certain color or motion
  patterns).

\end{itemize}

\section*{Appendix}

The code is available at

\medskip

\centerline{\url{https://fleuret.org/git/culture}}

The experiments are done with an RTX 4090.

The GPT used has 37M parameters and the following structure:

\begin{center}
\begin{tabular}{lc}
    \texttt{dim\_model}  & 512  \\
    \texttt{dim\_keys}   & 64   \\
    \texttt{dim\_hidden} & 2048 \\
    \texttt{nb\_heads}   & 8    \\
    \texttt{nb\_blocks}  & 12
\end{tabular}
\end{center}
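
As a rough consistency check of the parameter count, assuming a
standard transformer block in which the value dimension equals
\texttt{dim\_keys}, and ignoring biases, layer normalizations, and
embeddings (none of this is taken from the code), each block has four
$512 \times (8 \cdot 64)$ attention matrices and two $512 \times 2048$
matrices in its fully connected part, hence
\[
12 \times \left( 4 \cdot 512^2 + 2 \cdot 512 \cdot 2048 \right)
= 12 \times 3\,145\,728 \approx 37.7 \times 10^6
\]
parameters.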

The optimizer is Adam with learning rate $\eta = 10^{-4}$ and no scheduling.

There are $N_{\text{train}}=250'000$ original quizzes for training and
$N_{\text{test}} = 10'000$ for testing.

At each epoch, for both train and test samples, we mix original
quizzes and the generated ones.

For training, for instance, if there are fewer than $N_{\text{train}}/2$
new quizzes, we take all of them; otherwise we sample
$N_{\text{train}}/2$ of them without replacement. We then sample
without replacement enough original quizzes to get $N_{\text{train}}$
samples in total.
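
Equivalently, with $G$ denoting the current number of generated
quizzes (a symbol introduced here for illustration), one training
epoch uses
\[
n_{\text{new}} = \min\left(G, \frac{N_{\text{train}}}{2}\right)
\quad \text{and} \quad
n_{\text{orig}} = N_{\text{train}} - n_{\text{new}}
\]
generated and original quizzes respectively. For example, $G = 1'000$
gives $n_{\text{new}} = 1'000$ and $n_{\text{orig}} = 249'000$, while any
$G \geq 125'000$ gives $n_{\text{new}} = n_{\text{orig}} = 125'000$.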

We proceed similarly to get $N_{\text{test}}$ samples for test.

\end{document}