Best resources to learn about the new AI (ChatGPT and similar)?
May 5, 2023 3:15 PM

What are the best resources to explain how these new language model AIs work? The intended audience (me!) has familiarity with computer science and programming, but no domain expertise in machine learning. Any materials for general audiences would also be appreciated.
posted by leotrotsky to Computers & Internet (9 answers total) 36 users marked this as a favorite
 
The best detailed explanation I've seen is GPT-3 Architecture, on a Napkin. It explains pretty much everything going on in GPT-3 (the architecture of GPT-4 is unknown, but probably similar), but does assume some prior understanding. There are other resources, many of which build off of the paper which sort of started it all, Attention is All You Need. One important difference between GPT-3 and the model described in that paper is that there's no encoder stack (the paper was primarily interested in translation).

It helps if you already know how a neural network works. The YouTuber 3Blue1Brown has an excellent series on the topic.

I have taken my own notes from various sources that explain my current understanding of how GPT-3 works. I'll include them here if they might be helpful, but do not assume they are correct (there are some parts that I don't 100% understand):


How GPT Works (my current understanding; I'm skipping over some stuff that I don't think is necessary for a conceptual understanding, like the one-hot encoding of tokens)

The simplified version: GPT is a deep neural network with a hundred or so layers that's trained on words and their positions to predict the next word, where each layer has a sort of preprocessor that squelches unimportant relationships.

The longer version:

When GPT is run, either during training or inference (how an end-user interacts with it), a big chunk of text is fed into it. During training this might be a document, a wiki article, a reddit post, etc. When you interact with it, the input is your entire conversation so far.

The text is broken down into "tokens", which are often whole words but can be letters, combinations of letters, punctuation, or something else. When GPT sees the word "car" it does not process it as the sequence of letters C-A-R, but as a single integer token; this may be part of the reason GPT-3 struggles with tasks that ask it to parse or manipulate words, like "count the number of words in this sentence" or "replace all the vowels with 'e'". You can check exactly how GPT tokenizes words here: https://platform.openai.com/tokenizer. Most common words get their own token.
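
If you'd rather poke at tokenization from code than from that web page, OpenAI's open-source tiktoken library exposes the same tokenizers. A minimal sketch (I believe "r50k_base" is the encoding the original GPT-3 models used; newer models use different ones):

    # Inspecting GPT tokenization with OpenAI's tiktoken library
    # (pip install tiktoken). "r50k_base" is, I believe, the encoding
    # used by the original GPT-3 models; newer models use other encodings.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")

    print(enc.encode("car"))            # one common word -> one integer token
    print(enc.encode(" carburetor"))    # a rarer word splits into pieces
    print([enc.decode([t]) for t in enc.encode(" carburetor")])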

These tokens are then each encoded as a large list of numbers (a vector). This list can be imagined as a set of coordinates in a high-dimensional space, where each dimension captures some learned aspect of the word. Tokens with similar aspects are located near each other in this space, and they are near each other along the dimensions that they have in common.

The canonical example is that the token for "King" shares some features with "Man". If you subtract the coordinates for "Man" from "King", then add the coordinates for "Woman", the next nearest word (token) is "Queen". This really works, at least with some encodings like Word2Vec, but more often than not the learned dimensions do not translate well into concepts we understand. So the takeaway is not that every attribute you can think of for a word will be a dimension, but that the dimensions do carry meaning, at least about how words relate to each other.
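
Here's a toy illustration of that arithmetic, with made-up 3-dimensional vectors (real GPT-3 embeddings have 12,288 dimensions, and the axes are learned, not hand-labeled like these):

    # Toy embedding arithmetic with invented 3-d vectors.
    import numpy as np

    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),   # pretend dims: royalty, maleness, misc
        "queen": np.array([0.9, 0.1, 0.1]),
        "man":   np.array([0.1, 0.8, 0.3]),
        "woman": np.array([0.1, 0.1, 0.3]),
        "apple": np.array([0.0, 0.0, 0.9]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    target = emb["king"] - emb["man"] + emb["woman"]   # = [0.9, 0.1, 0.1]

    # The nearest remaining token to king - man + woman is "queen"
    best = max((w for w in emb if w not in ("king", "man", "woman")),
               key=lambda w: cosine(emb[w], target))
    print(best)   # queen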

So GPT is told what words occur in the text as a bunch of coordinates. But it also needs to know something about where they occur: it is not enough to know that "bites", "dog" and "man" are somewhere in there. This happens by adding positional encoding to the vectors. I don't really understand the details of positional encoding, but a key feature is that the encoding method is designed to emphasize relative position. E.g., it is probably more important that "car" is immediately before "drove" than that "car" is exactly the 783rd word in the text. In any case, this may be another reason GPT gets tripped up by some literal text manipulation tasks.
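
For the curious, here is the sinusoidal positional encoding from the original paper. (GPT-3 itself reportedly learns its position embeddings rather than using this fixed formula, but the sinusoidal version is where the "relative position" intuition comes from.)

    # The sinusoidal positional encoding from "Attention Is All You Need".
    # Each position becomes a fixed pattern of sine/cosine waves at different
    # frequencies; positions k steps apart are related in a position-independent
    # way, which is the "relative position" property.
    import numpy as np

    def positional_encoding(num_positions, d_model):
        pos = np.arange(num_positions)[:, None]          # shape (positions, 1)
        i = np.arange(d_model // 2)[None, :]             # shape (1, d_model/2)
        angles = pos / (10000 ** (2 * i / d_model))
        pe = np.zeros((num_positions, d_model))
        pe[:, 0::2] = np.sin(angles)                     # even dims: sine
        pe[:, 1::2] = np.cos(angles)                     # odd dims: cosine
        return pe

    # This gets *added* to the token embeddings, so each vector then encodes
    # both what the token is and (roughly) where it sits.
    pe = positional_encoding(2048, 512)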

Now the "text" is a big list of coordinates in hyperspace representing some encoding of word meaning and position.

Next, the text goes into the first attention block. The attention block pairs up every word in the text against every other word, and during training it learns the relative importance of those pairs. For example, consider these sentences:

"My car couldn't drive up the mountain because it was too old."
"My car couldn't drive up the mountain because it was too steep."

What does "it" refer to? In the first sentence "it" refers to "car", but in the second "it" refers to "mountain". If GPT hasn't seen this exact sentence before, how could it know that? Through training on many, many, many sentences, it has learned that in this context, the attention for "it" is greater for "car" than for "mountain". (I believe what is actually measured is the dot product of learned projections of the "it" and "car" vectors, both of which carry relative position and inferred meaning; see the sketch below.)
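
Here's a minimal numpy sketch of single-head scaled dot-product attention, with toy shapes and random weights standing in for learned ones. Note the causal mask: GPT is decoder-only, so each token can only attend to tokens before it.

    # Minimal single-head scaled dot-product attention in numpy.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # three learned views of each token
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token dotted with every other
        # Causal mask: a token may only attend to earlier tokens,
        # never to ones that come after it.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
        weights = softmax(scores)                 # e.g. "it" weights "car" heavily
        return weights @ V                        # weighted mix of value vectors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(11, 64))                 # 11 tokens, 64-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
    out = attention(X, Wq, Wk, Wv)                # shape (11, 64)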

(GPT has multi-headed attention, which is usually described as allowing it to learn multiple associations per word.)
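
A sketch of the multi-head version, reusing the attention() function from the sketch above: each head gets its own smaller projections, and the head outputs are concatenated.

    # Multi-head attention is (roughly) the same computation run several
    # times in parallel on smaller projections, then concatenated.
    import numpy as np

    def multi_head(X, heads):
        # heads: a list of (Wq, Wk, Wv) triples, one per head
        outs = [attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
        return np.concatenate(outs, axis=-1)      # usually followed by one more linear layer

    rng = np.random.default_rng(1)
    X = rng.normal(size=(11, 64))
    heads = [tuple(rng.normal(size=(64, 8)) for _ in range(3)) for _ in range(8)]
    out = multi_head(X, heads)                    # (11, 64): 8 heads x 8 dims each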

Next it goes through a fully-connected feed-forward layer. This is what I think of as a "normal" neural network layer: good at learning arbitrary input-to-output mappings.
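
A toy sketch of that feed-forward sublayer: two linear layers with a nonlinearity between them, applied to every token position independently. The weights here are random placeholders; the 4x width expansion matches what the GPT papers describe.

    import numpy as np

    def gelu(x):
        # GELU, the activation the GPT papers use (tanh approximation)
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def feed_forward(X, W1, b1, W2, b2):
        return gelu(X @ W1 + b1) @ W2 + b2        # widen, squash, project back

    rng = np.random.default_rng(2)
    X = rng.normal(size=(11, 64))
    W1, b1 = rng.normal(size=(64, 256)), np.zeros(256)   # 64 -> 256 (4x)
    W2, b2 = rng.normal(size=(256, 64)), np.zeros(64)    # 256 -> 64
    out = feed_forward(X, W1, b1, W2, b2)         # shape (11, 64)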

Now the first attention block is done. Its result is added back to the block's input (a "residual connection"), normalized, and passed to the next block. GPT-3 has 96 of these blocks.
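
Putting the pieces together, one block is wired roughly like this. (One caveat: I described the norm as happening after the add, as in the original paper; GPT-2 and GPT-3 reportedly apply the layer norm before each sublayer instead, but the residual "add" path is the important idea either way.)

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)           # learned scale/shift omitted

    def transformer_block(x, attn, mlp):
        # attn and mlp stand in for the attention and feed-forward sketches above
        x = x + attn(layer_norm(x))               # residual around attention
        x = x + mlp(layer_norm(x))                # residual around the feed-forward
        return x

    rng = np.random.default_rng(3)
    x = rng.normal(size=(11, 64))
    identity = lambda v: v                        # placeholder sublayers
    for _ in range(96):                           # 96 blocks in the largest GPT-3
        x = transformer_block(x, identity, identity)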

At the very end, the result of that last block is passed through a softmax function, which produces a vector of values that add up to 1. These are the probabilities, for each token in the vocabulary, of being the next token. (I think you actually get such a vector for every position, because the blocks are identical, so each position's output must also have everything needed to go into the next block.)
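
In code, that last step looks something like this (toy vocabulary and random weights; GPT-3's real vocabulary has about 50,000 tokens):

    # Project the last position's vector onto the vocabulary and softmax
    # it into next-token probabilities.
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(4)
    vocab = ["the", "car", "drove", "mountain", "it", "steep"]
    W_out = rng.normal(size=(64, len(vocab)))     # model dim -> vocab logits

    h_last = rng.normal(size=64)                  # final vector at the last position
    probs = softmax(h_last @ W_out)               # non-negative, sums to 1.0
    print(max(zip(vocab, probs), key=lambda p: p[1]))   # most likely next token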

Finally, I've used the word "learn" multiple times above, in a way that might seem like the "and then a miracle occurs" step ("draw the rest of the owl" for millennials). These are the neural network training steps. Deep neural networks are incredibly effective at approximating some function that expresses a relationship between input and output. I'm not sure there's a good non-technical explanation of how they work that doesn't involve at least some math; the 3Blue1Brown series mentioned above has math, but the visuals carry a lot of it.
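
To make "learn" slightly less miraculous, here is gradient descent in its smallest possible form: a single weight fit to y = 2x by repeatedly nudging it downhill on the squared error. Conceptually, GPT-3's training does the same thing with 175 billion weights.

    w = 0.0
    lr = 0.1
    data = [(1, 2), (2, 4), (3, 6)]
    for epoch in range(50):
        for x, y in data:
            pred = w * x
            grad = 2 * (pred - y) * x   # derivative of (pred - y)**2 w.r.t. w
            w -= lr * grad              # step in the direction that reduces error
    print(w)                            # ~2.0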

posted by justkevin at 3:50 PM on May 5, 2023 [12 favorites]


Google has some nice self-paced courses online using Colab notebooks to introduce programmers to ML fundamentals. It still feels a bit magical in the sense that the heavy lifting is done by libraries and your job is basically writing ten lines of code to train a model and 500 to shovel data into the right format. If you want to go really deep on theoretical fundamentals, the Deep Learning book covers a lot.

Andrej Karpathy recently gave a 2-hour tutorial on building a nanoGPT in Python.

And don't feel too bad about being behind; the pace of innovation has really picked up in the last few months. The leak of Facebook's LLaMA weights really turbocharged the hobbyist and startup communities, which have been customizing models, tweaking them to run on consumer-grade equipment, and building tooling on top.
posted by pwnguin at 4:37 PM on May 5, 2023 [4 favorites]


I've started Hugging Face's Natural Language Processing course, though so far I've only poked around with their Python libraries.
posted by mmascolino at 7:02 PM on May 5, 2023 [3 favorites]


Stephen Wolfram (of "Mathematica" fame) has a long, detailed article about how ChatGPT works. I confess that I've only skimmed the article, but it looks very good.
posted by alex1965 at 4:03 AM on May 6, 2023 [2 favorites]


One of my very favorite things about GPT is that you can ask it to explain things in whatever way you learn best. So if metaphors work for you, ask it to explain its own workings using a metaphor and it will. Ask it to explain it like you’re six and it will.
posted by missjenny at 8:27 AM on May 6, 2023 [1 favorite]


GPT should not be treated as a reliable source on anything, including itself.

Q: Does GPT-3 use an encoder-decoder transformer model?

A: Yes, GPT-3 (Generative Pre-trained Transformer 3) uses an encoder-decoder transformer model, specifically the autoregressive transformer model architecture. This architecture is based on the original transformer model introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017)....

It goes on to give a lengthy and plausible-sounding answer, and even got the reference right. But GPT doesn't use an encoder stack; it is decoder-only.
posted by justkevin at 8:58 AM on May 6, 2023 [4 favorites]


I came in to recommend the Wolfram article. I found it illuminating as a non-computer science person, despite some bits going over my head.
posted by lookoutbelow at 10:46 AM on May 6, 2023 [1 favorite]


Yes, a friend working actively in the field recommended the Wolfram page.
posted by sammyo at 4:58 PM on May 6, 2023


The YouTube channel AI Explained is great; he releases a video every week or so synthesizing the latest news and developments, and occasionally has one that goes more in depth on a particular topic.
posted by EarnestDeer at 9:31 PM on May 16, 2023

