



Scaling laws in AI generally relate the performance of a model to its inputs: training data, model parameters, and compute.

The performance of a model describes its accuracy in choosing the "right" answer on known data. A large language model is trained to predict text completions; the more often it correctly predicts how to complete a text, the better its performance. A close antonym of performance is "loss," which is a measure of how far off a model's predictions were from reality; lower loss means better performance.

Training data is the size of the dataset a model is trained on. The number of parameters in the model is a measure of its complexity, roughly equivalent to the number of connections between nodes in the neural network. Finally, the amount of "compute" used for a model, measured in floating point operations, or flops, is simply the number of computer operations (typically matrix multiplications) that must be performed throughout the model's training. Compute is therefore influenced by both the amount of data and the number of parameters.

The scaling relationship between loss and compute found by OpenAI in 2020 is a power law: if a model has 10 times the compute, its loss will be about 11% lower. This tells us how much "better" models can get from scaling compute alone.

It's difficult to say exactly what "11% lower loss" means in terms of how powerful or accurate a model is, but we can use existing models for context. GPT-2, which OpenAI released in 2019, was trained on 300 million tokens of text data and had 1.5 billion parameters. GPT-3 - the model behind ChatGPT - was trained on 300 billion to 400 billion tokens of text data and had 175 billion parameters.
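To make the "10 times the compute, about 11% lower loss" arithmetic concrete, here is a minimal sketch. It assumes the power-law form and a compute exponent of roughly 0.050, the approximate value reported in OpenAI's 2020 scaling-law work (Kaplan et al.); the exponent, constant, and function names below are illustrative assumptions, not figures taken from this article.

```python
# Minimal sketch of a compute scaling law of the form L(C) ~ C ** (-alpha),
# where C is training compute in flops and alpha is the scaling exponent.
# ALPHA is an assumed value (~0.050, per Kaplan et al., 2020), used here
# only to illustrate the arithmetic behind the "11% lower loss" claim.

ALPHA = 0.050  # assumed compute-scaling exponent


def loss_ratio(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Return new_loss / old_loss when training compute is scaled by `compute_multiplier`."""
    return compute_multiplier ** (-alpha)


if __name__ == "__main__":
    ratio = loss_ratio(10.0)  # scale compute up by 10x
    print(f"10x compute -> loss falls to {ratio:.3f} of its previous value "
          f"(about {(1 - ratio) * 100:.0f}% lower)")
    # Prints roughly 0.891, i.e. about an 11% reduction in loss,
    # matching the figure quoted in the text.
```

Because the relationship is a power law, each further 10x increase in compute buys about the same proportional reduction in loss, which is why improvements from scaling alone arrive more and more slowly in absolute terms.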
