+Given a query sequence $Q$, a key sequence $K$, and a value sequence
+$V$, compute an attention matrix $A$ by matching each query to the
+keys, and use it to weight $V$ to get the output $Y$.
+\[
+ A_i = \softmax \left( \frac{Q_i \, K\transpose}{\sqrt{d}} \right)
+\quad \quad \quad
+ Y_i = A_i V
+\]
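As a minimal sketch of the two formulas above, the following plain-Python code computes $A$ and $Y$ row by row. It assumes each sequence is stored as a list of token vectors (one row per token) and that $d$ is the key dimension; the function names `softmax` and `attention` and the toy values are illustrative, not taken from the source.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Q: list of query vectors, K: list of key vectors, V: list of value vectors.
    # Assumption: d is the dimension of the keys (and of the queries).
    d = len(K[0])
    A = []
    Y = []
    for Qi in Q:
        # A_i = softmax(Q_i K^T / sqrt(d)): one score per key.
        scores = [sum(q * k for q, k in zip(Qi, Kj)) / math.sqrt(d) for Kj in K]
        Ai = softmax(scores)
        # Y_i = A_i V: attention-weighted average of the value vectors.
        Yi = [sum(a * Vj[c] for a, Vj in zip(Ai, V)) for c in range(len(V[0]))]
        A.append(Ai)
        Y.append(Yi)
    return A, Y

# Toy example: 2 queries, 3 keys/values, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A, Y = attention(Q, K, V)
```

Each row of `A` is a probability distribution over the keys, so each `Y[i]` is a convex combination of the rows of `V`.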
+A standard attention layer takes as input two sequences $X$ and $X'$
+and computes
+\begin{align*}
+K & = W^K X \\
+V & = W^V X \\
+Q & = W^Q X' \\
+Y & = \underbrace{\softmax_{row} \left( \frac{Q K\transpose}{\sqrt{d}} \right)}_{A} V
+\end{align*}
+When $X = X'$, this is \textbf{self attention}, otherwise \textbf{cross
+ attention.}
+Several such processes can be combined, in which case $Y$ is the
+concatenation of the separate results. This is \textbf{multi-head
+ attention}.
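The layer and its multi-head combination can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: sequences are stored as lists of token vectors, the projections are applied to each token separately ($K_j = W^K X_j$, and likewise for $V$ and $Q$), and the helper names `project`, `attention_layer`, and `multi_head`, as well as the toy weight matrices, are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, X):
    # Apply the matrix W to every token vector of the sequence X.
    return [[sum(w * x for w, x in zip(Wr, Xt)) for Wr in W] for Xt in X]

def attention_layer(WQ, WK, WV, X, Xp):
    # Keys and values come from X, queries from X' (here Xp).
    K = project(WK, X)
    V = project(WV, X)
    Q = project(WQ, Xp)
    d = len(K[0])
    Y = []
    for Qi in Q:
        Ai = softmax([sum(q * k for q, k in zip(Qi, Kj)) / math.sqrt(d) for Kj in K])
        Y.append([sum(a * Vj[c] for a, Vj in zip(Ai, V)) for c in range(len(V[0]))])
    return Y

def multi_head(heads, X, Xp):
    # heads: list of (WQ, WK, WV) triples, one per head.
    outs = [attention_layer(WQ, WK, WV, X, Xp) for (WQ, WK, WV) in heads]
    # Concatenate the per-head results along the feature axis, token by token.
    return [[y for out in outs for y in out[t]] for t in range(len(Xp))]

# Toy weights (assumed for illustration): identity and a coordinate swap.
I2 = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
heads = [(I2, I2, I2), (I2, I2, SWAP)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y_self = multi_head(heads, X, X)              # self attention: X = X'
Y_cross = multi_head(heads, X, [[1.0, 0.0]])  # cross attention: X != X'
```

With two heads that each output 2 features, every token of `Y_self` has 4 features. The output has one row per query token, so `Y_cross` has a single row even though `X` has three tokens.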