LLM on Zhiim's Blog

CS336 Note 02: PyTorch, Resource Counting

Thu, 14 May 2026 13:47:19 +0800

1. Memory accounting

1.1 tensor basics

Tensors are the basic building block for storing data: model parameters, gradients, optimizer state, activations, and so on.


x = torch.tensor([[1., 2, 3], [4, 5, 6]])
x = torch.zeros(4, 8) # 4x8 matrix of all zeros
x = torch.ones(4, 8) # 4x8 matrix of all ones
x = torch.randn(4, 8) # 4x8 matrix of iid Normal(0, 1) samples

You can also allocate the memory without initializing its values:

x = torch.empty(4, 8)

1.2 tensor memory

Almost all of this data is stored as floating-point numbers.

1.2.1 float32

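[Figure: float32 data]

float32 is the default data type in most cases.

A tensor's memory usage is determined by its number of elements and the data type of each element.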


x = torch.zeros(4, 8)
assert x.dtype == torch.float32 # Default type
assert x.numel() == 4 * 8
assert x.element_size() == 4 # Float is 4 bytes
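# get_memory_usage(x) is not defined in this note; it is taken to be x.numel() * x.element_size()
assert x.numel() * x.element_size() == 4 * 8 * 4 # 128 bytes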
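As a worked example of this accounting, here is a minimal sketch in plain Python that computes the footprint from a shape and a per-element size. The helper name `tensor_bytes` and the large matrix shape are illustrative assumptions, not something defined in these notes.

```python
import math

# Bytes per element for the data types discussed in this section.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}

def tensor_bytes(shape: tuple[int, ...], dtype: str) -> int:
    """Memory in bytes = number of elements * bytes per element."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

small = tensor_bytes((4, 8), "float32")            # the 4x8 example above
big = tensor_bytes((12288 * 4, 12288), "float32")  # hypothetical large weight matrix

print(small)        # 128
print(big / 2**30)  # 2.25 (GiB)
```

Halving the element size halves the footprint, which is why the lower-precision types in the following sections matter for large models.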

1.2.2 float16

[Figure: float16 data]

float16 halves the size of each element, but the range of values it can represent is much smaller.


x = torch.zeros(4, 8, dtype=torch.float16)
assert x.element_size() == 2
x = torch.tensor([1e-8], dtype=torch.float16)
assert x == 0 # Underflow to 0

1.2.3 bfloat16

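[Figure: bfloat16 data]

bfloat16 addresses float16's limited dynamic range. It is the same size as float16 but has the same dynamic range as float32, at the cost of lower resolution. For deep learning, dynamic range matters more than resolution.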


x = torch.tensor([1e-8], dtype=torch.bfloat16)
assert x != 0 # No underflow!
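The trade-off between dynamic range and resolution can be checked without PyTorch. The sketch below uses the standard-library `struct` module; the helper names are hypothetical, and the bfloat16 emulation truncates the float32 bit pattern, which is a simplifying assumption (real hardware typically rounds to nearest).

```python
import struct

def to_float16(value: float) -> float:
    """Round a Python float to the nearest float16 and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: truncation for simplicity, not round-to-nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Dynamic range: 1e-8 underflows in float16 but survives in bfloat16.
assert to_float16(1e-8) == 0.0
assert to_bfloat16(1e-8) > 0.0

# Resolution: near 1.0, float16 steps by 2**-10 while bfloat16 steps by 2**-7.
assert to_float16(1.001) == 1.0009765625
assert to_bfloat16(1.001) == 1.0
```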

1.2.4 fp8

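[Figure: fp8 data]

fp8 is another data type designed for machine learning.

Training in float32 works well, but requires a lot of memory.
Training in fp8, float16, or bfloat16 is risky and can make the model unstable.
Mixed precision training is the usual compromise.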

2. Compute accounting

2.1 tensor on gpus

By default, tensors are stored in CPU memory, so we have to move them to the GPU explicitly.

# Move the tensor to GPU memory (device 0).
y = x.to("cuda:0")
assert y.device == torch.device("cuda", 0)
# Or create a tensor directly on the GPU:
z = torch.zeros(32, 32, device="cuda:0")

2.2 tensor operations

2.2.1 tensor storage

In PyTorch, a tensor is a pointer into a block of allocated memory, stored as a flat array, together with metadata (such as strides) that describes how to reach each element.


x = torch.tensor([
    [0., 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
])

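[Figure: tensor memory]

assert x.stride(0) == 4 # Step to the next row (the element below)
assert x.stride(1) == 1 # Step to the next column (the element to the right)

# Flat storage index of the element at row r, column c
r, c = 1, 2
index = r * x.stride(0) + c * x.stride(1)
assert index == 6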

2.2.2 tensor slicing

Many tensor operations return a different view of the same tensor rather than a copy, so a modification made through one view is visible through the others.


x = torch.tensor([[1., 2, 3], [4, 5, 6]])

# Slicing returns a view (same_storage is a helper that checks whether two tensors share memory)
y = x[0]
assert torch.equal(y, torch.tensor([1., 2, 3]))
assert same_storage(x, y)

# A view with a different shape
y = x.view(3, 2)
assert torch.equal(y, torch.tensor([[1., 2], [3, 4], [5, 6]]))
assert same_storage(x, y)

# Transpose
y = x.transpose(1, 0)
assert torch.equal(y, torch.tensor([[1., 4], [2, 5], [3, 6]]))
assert same_storage(x, y)

Some operations, such as a transpose, leave the tensor non-contiguous, and later view calls can then fail. We can force the tensor to be contiguous, at the cost of a copy.


x = torch.tensor([[1., 2, 3], [4, 5, 6]])
y = x.transpose(1, 0)
assert not y.is_contiguous()
try:
    y.view(2, 3)
    assert False
except RuntimeError as e:
    assert "view size is not compatible with input tensor's size and stride" in str(e)

# Force the tensor to be contiguous (this copies the data)
y = x.transpose(1, 0).contiguous().view(2, 3)
assert not same_storage(x, y)
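The `same_storage` helper used in these snippets is not defined in the notes. One plausible way to write it, assuming PyTorch 2.0 or later (where `Tensor.untyped_storage()` is available), is to compare the data pointers of the underlying buffers:

```python
import torch

def same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    """True if both tensors are backed by the same underlying buffer."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.tensor([[1., 2, 3], [4, 5, 6]])
y = x.transpose(1, 0)            # a view: only the strides change
assert same_storage(x, y)
assert x.stride() == (3, 1) and y.stride() == (1, 3)

x[0][0] = 100                    # a write through x is visible through y
assert y[0][0] == 100

z = y.contiguous()               # a copy with fresh, row-major storage
assert not same_storage(x, z)
assert z.stride() == (2, 1)
```

The strides show why the transpose is free: no data moves, only the metadata that maps indices to storage positions changes.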

2.2.3 tensor elementwise

Elementwise operations apply an operation to each element of the tensor and return a new tensor of the same shape.


x = torch.tensor([1., 4, 9]) # Float dtype, so sqrt and rsqrt are well defined
assert torch.equal(x.pow(2), torch.tensor([1., 16, 81]))
assert torch.equal(x.sqrt(), torch.tensor([1., 2, 3]))
assert torch.allclose(x.rsqrt(), torch.tensor([1, 1 / 2, 1 / 3])) # i -> 1/sqrt(x_i)
assert torch.equal(x + x, torch.tensor([2., 8, 18]))
assert torch.equal(x * 2, torch.tensor([2., 8, 18]))
assert torch.equal(x / 0.5, torch.tensor([2., 8, 18]))

2.2.4 tensor matmul

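Tensors can also be matrix-multiplied. With batch and sequence dimensions in front, the multiplication is broadcast over them, so the same weight matrix is applied to every token.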


x = torch.ones(4, 8, 16, 32)
w = torch.ones(32, 2)
y = x @ w
assert y.size() == torch.Size([4, 8, 16, 2])

2.3 tensor einops

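einops is a library for manipulating tensors in which every dimension is named.

When working with tensors, we usually need to know what each dimension means. The direct options are to note the dimensions in a comment, or to annotate them with types (here via jaxtyping).

from jaxtyping import Float

x = torch.ones(2, 2, 1, 3) # batch seq heads hidden
x: Float[torch.Tensor, "batch seq heads hidden"] = torch.ones(2, 2, 1, 3)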

The einsum operation from einops makes it easier to label the dimensions:

from einops import einsum

x: Float[torch.Tensor, "batch seq1 hidden"] = torch.ones(2, 3, 4)
y: Float[torch.Tensor, "batch seq2 hidden"] = torch.ones(2, 3, 4)

z = x @ y.transpose(-2, -1) # batch, sequence, sequence
# Equivalent to
z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")
# Equivalent to (... stands for any number of leading dimensions)
z = einsum(x, y, "... seq1 hidden, ... seq2 hidden -> ... seq1 seq2")

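einops also supports reducing over a dimension and rearrange operations:

from einops import rearrange, reduce

y = x.mean(dim=-1)
# Equivalent to
y = reduce(x, "... hidden -> ...", "mean")

x: Float[torch.Tensor, "batch seq total_hidden"] = torch.ones(2, 3, 8)
# Split the last dimension into (heads, hidden1)
x = rearrange(x, "... (heads hidden1) -> ... heads hidden1", heads=2)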

2.4 tensor operations flops

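A floating-point operation (FLOP) is a basic arithmetic operation such as an addition or a multiplication.

FLOPs: the number of floating-point operations (a measure of the amount of computation)
FLOP/s: floating-point operations per second (a measure of hardware speed)

Consider a linear model with B data points, each D-dimensional, and a K-dimensional output. The input x has shape (B, D) and the weight matrix w has shape (D, K). For every index triple (i, j, k) we perform one multiplication, x[i][j] * w[j][k], and one addition, so the total is 2 * B * D * K FLOPs.

The FLOPs of a forward pass are therefore approximately 2 * (number of data points) * (number of parameters).

Model FLOPs utilization (MFU): the actual FLOP/s achieved divided by the hardware's rated peak FLOP/s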
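The 2 * B * D * K count can be checked by instrumenting a naive matrix multiplication. This is a minimal pure-Python sketch; the function names and the throughput numbers in the MFU example are made up for illustration.

```python
def matmul_with_flop_count(x: list[list[float]], w: list[list[float]]):
    """Naive (B, D) @ (D, K) matmul that counts every multiply and add."""
    B, D, K = len(x), len(w), len(w[0])
    y = [[0.0] * K for _ in range(B)]
    flops = 0
    for i in range(B):
        for j in range(D):
            for k in range(K):
                y[i][k] += x[i][j] * w[j][k]  # one multiplication and one addition
                flops += 2
    return y, flops

def mfu(actual_flops: float, seconds: float, peak_flops_per_sec: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by rated peak FLOP/s."""
    return (actual_flops / seconds) / peak_flops_per_sec

B, D, K = 3, 4, 5
x = [[1.0] * D for _ in range(B)]
w = [[1.0] * K for _ in range(D)]
y, flops = matmul_with_flop_count(x, w)
assert flops == 2 * B * D * K  # 120

# Hypothetical numbers: 1e15 FLOPs done in 2 s on hardware rated at 1e15 FLOP/s.
assert mfu(1e15, 2.0, 1e15) == 0.5
```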

2.5 gradients basics

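Computing gradients in PyTorch is straightforward. Take a simple linear model with loss $y = 0.5 (x w - 5)^2$:

x = torch.tensor([1., 2, 3])
w = torch.tensor([1., 1, 1], requires_grad=True) # Want gradient
pred_y = x @ w
loss = 0.5 * (pred_y - 5).pow(2)

# Compute the gradients
loss.backward()
assert torch.equal(w.grad, torch.tensor([1., 2, 3])) # d(loss)/dw = (pred_y - 5) * x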

2.6 gradients flops

Consider a two-layer linear model: h1 = x @ w1, h2 = h1 @ w2, loss = h2.pow(2).mean()


x = torch.ones(B, D, device=device)
w1 = torch.randn(D, D, device=device, requires_grad=True)
w2 = torch.randn(D, K, device=device, requires_grad=True)

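For the weight matrix w2, the backward pass computes two quantities:

w2.grad[j, k] = sum_i h1[i, j] * h2.grad[i, k], which takes 2 * B * D * K FLOPs
h1.grad[i, j] = sum_k w2[j, k] * h2.grad[i, k], which takes 2 * B * D * K FLOPs

So the backward pass takes 4 * (number of data points) * (number of parameters) FLOPs, twice the cost of the forward pass. Forward and backward together take about 6 * (number of data points) * (number of parameters) FLOPs.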
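Putting the forward and backward counts together gives a quick estimate of the total cost of a training run. The sketch below is back-of-the-envelope arithmetic only; the model size, token count, GPU count, peak throughput, and MFU are all hypothetical values chosen for illustration.

```python
def training_flops(num_tokens: float, num_params: float) -> float:
    """Forward (2 * tokens * params) plus backward (4 * tokens * params)."""
    forward = 2 * num_tokens * num_params
    backward = 4 * num_tokens * num_params
    return forward + backward

def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_sec: float, mfu: float) -> float:
    """Wall-clock days, given the fraction of peak throughput actually achieved."""
    seconds = total_flops / (num_gpus * peak_flops_per_sec * mfu)
    return seconds / (60 * 60 * 24)

# Hypothetical run: 70e9 parameters trained on 15e12 tokens.
total = training_flops(num_tokens=15e12, num_params=70e9)  # about 6.3e24 FLOPs

# Hypothetical cluster: 1024 GPUs, each rated at 1e15 FLOP/s, running at 50% MFU.
days = training_days(total, num_gpus=1024, peak_flops_per_sec=1e15, mfu=0.5)
print(f"{total:.2e} FLOPs, about {days:.0f} days")
```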

3. Models

3.1 module parameters

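Model parameters are stored in nn.Parameter objects.

For output = x @ w with standard normal entries, each element of output is a sum of input_dim terms, so its variance grows by a factor of input_dim and its magnitude scales as sqrt(input_dim). Such large values can make gradients blow up and training unstable. To keep the output scale independent of input_dim, the initial weights are divided by sqrt(input_dim). Up to a constant, this is Xavier initialization.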

w = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))
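The scaling argument can be checked empirically without PyTorch. The following is a rough simulation sketch using only the standard library; the function name, sample count, and seed are arbitrary choices, and the results are statistical estimates rather than exact values.

```python
import math
import random
import statistics

def output_std(input_dim: int, weight_scale: float,
               num_samples: int = 1000, seed: int = 0) -> float:
    """Estimate the standard deviation of one output element of x @ w,
    where x and w have independent standard normal entries and w is
    multiplied by weight_scale."""
    rng = random.Random(seed)
    outputs = [
        sum(rng.gauss(0, 1) * rng.gauss(0, 1) * weight_scale for _ in range(input_dim))
        for _ in range(num_samples)
    ]
    return statistics.pstdev(outputs)

input_dim = 64
unscaled = output_std(input_dim, weight_scale=1.0)
scaled = output_std(input_dim, weight_scale=1 / math.sqrt(input_dim))

print(unscaled)  # close to sqrt(64) = 8
print(scaled)    # close to 1
```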

3.2 custom model

import numpy as np
import torch
import torch.nn as nn

class Linear(nn.Module):
    """Simple linear layer."""
    def __init__(self, input_dim: int, output_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight

class Cruncher(nn.Module):
    """A stack of linear layers followed by a scalar output head."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([
            Linear(dim, dim)
            for i in range(num_layers)
        ])
        self.final = Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply linear layers
        B, D = x.size()
        for layer in self.layers:
            x = layer(x)
        # Apply final head
        x = self.final(x)
        assert x.size() == torch.Size([B, 1])
        # Remove the last dimension
        x = x.squeeze(-1)
        assert x.size() == torch.Size([B])
        return x

By default, CPU tensors live in pageable memory. We can pin them explicitly, which allows the copy from CPU to GPU to run asynchronously:

if torch.cuda.is_available():
    x = x.pin_memory()
x = x.to(device, non_blocking=True)

3.3 randomness

To make experiments reproducible, set a fixed random seed in every place that uses randomness:


# Torch
seed = 0
torch.manual_seed(seed)

# NumPy
import numpy as np
np.random.seed(seed)

# Python
import random
random.seed(seed)

3.4 data loading

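In language modeling, the data is a sequence of integers (the output of the tokenizer), which is easy to serialize as a numpy array:

import numpy as np

orig_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.int32)
orig_data.tofile("data.npy") # Writes raw bytes with no header, despite the .npy extension

The serialized file can be read back with memmap, which loads data lazily, so the whole file never has to be read into memory:

data = np.memmap("data.npy", dtype=np.int32)
assert np.array_equal(orig_data, data)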

3.5 optimizer

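Several common optimizers build on one another:

momentum = SGD + exponential moving average of the gradient
AdaGrad = SGD + scaling by the accumulated squared gradient
RMSProp = AdaGrad + exponential moving average of the squared gradient
Adam = RMSProp + momentum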
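To make the difference between AdaGrad and RMSProp concrete, here is a minimal sketch of one update step for each, written on plain Python lists. The function names, default hyperparameters, and the placement of `eps` inside the square root are illustrative assumptions; library implementations differ in such details.

```python
import math

def adagrad_step(params, grads, g2, lr=0.01, eps=1e-5):
    """AdaGrad: scale each gradient by the running SUM of its squares."""
    for i, g in enumerate(grads):
        g2[i] += g * g
        params[i] -= lr * g / math.sqrt(g2[i] + eps)

def rmsprop_step(params, grads, g2, lr=0.01, beta=0.9, eps=1e-5):
    """RMSProp: same idea, but with an exponential moving average of the squares."""
    for i, g in enumerate(grads):
        g2[i] = beta * g2[i] + (1 - beta) * g * g
        params[i] -= lr * g / math.sqrt(g2[i] + eps)

# With a constant gradient, AdaGrad's effective step shrinks like 1 / sqrt(t).
params, g2 = [1.0], [0.0]
steps = []
for t in range(4):
    before = params[0]
    adagrad_step(params, [2.0], g2, lr=0.1, eps=0.0)
    steps.append(before - params[0])
print(steps)  # roughly 0.1, 0.071, 0.058, 0.05
```

Because AdaGrad's accumulator only grows, its steps keep shrinking; RMSProp's moving average lets the scale adapt when gradients change over time.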



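# AdaGrad is not defined in this note; torch.optim.Adagrad is the built-in counterpart.
# B, D, model, and get_device() are assumed to be defined already, with B == 2 to match y.
optimizer = AdaGrad(model.parameters(), lr=0.01)

# Compute the loss and the gradients
x = torch.randn(B, D, device=get_device())
y = torch.tensor([4., 5.], device=get_device())
pred_y = model(x)
loss = F.mse_loss(input=pred_y, target=y)
loss.backward()

# Take one optimization step
optimizer.step()

# Free the gradient memory
optimizer.zero_grad(set_to_none=True)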

3.6 train loop

def train(name: str, get_batch,
          D: int, num_layers: int,
          B: int, num_train_steps: int, lr: float):
    # Use the num_layers and lr arguments instead of hard-coded values.
    # SGD is not defined in this note; torch.optim.SGD accepts the same arguments.
    model = Cruncher(dim=D, num_layers=num_layers).to(get_device())
    optimizer = SGD(model.parameters(), lr=lr)

    for t in range(num_train_steps):
        # Get data
        x, y = get_batch(B=B)

        # Forward (compute loss)
        pred_y = model(x)
        loss = F.mse_loss(pred_y, y)

        # Backward (compute gradients)
        loss.backward()

        # Update parameters
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

3.7 checkpoint

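Training a model takes a very long time, and we do not want to lose progress when a run is interrupted, so we periodically save the model parameters and the optimizer state during training:

model = Cruncher(dim=64, num_layers=3).to(get_device())
optimizer = AdaGrad(model.parameters(), lr=0.01)

# Save the checkpoint
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
torch.save(checkpoint, "model_checkpoint.pt")

# Load the checkpoint and restore both states
loaded_checkpoint = torch.load("model_checkpoint.pt")
model.load_state_dict(loaded_checkpoint["model"])
optimizer.load_state_dict(loaded_checkpoint["optimizer"])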

3.8 mixed precision training

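The choice of data type (float32, bfloat16, fp8) is a trade-off:

Higher precision: more accurate and stable training, but more memory and compute
Lower precision: less accurate and less stable training, but less memory and compute

Mixed precision training: use float32 by default, but use bfloat16 or fp8 wherever it is feasible.

Use low-precision data types in the forward pass
Use high-precision data types everywhere else (parameters, gradient computation)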
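One way to get this split in PyTorch is `torch.autocast`, sketched below on CPU with bfloat16. The layer sizes are arbitrary, and this is only an illustration of the idea under those assumptions, not a full mixed-precision training recipe.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)   # parameters are created and kept in float32
x = torch.randn(8, 16)

# Inside autocast, eligible ops such as the linear layer's matmul run in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

# Do the reduction in float32, then backpropagate as usual.
loss = out.float().pow(2).mean()
loss.backward()

assert out.dtype == torch.bfloat16                 # low-precision activations
assert model.weight.dtype == torch.float32         # high-precision parameters
assert model.weight.grad.dtype == torch.float32    # high-precision gradients
```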

CS336 Note 01: Overview, Tokenization

Tue, 12 May 2026 16:42:15 +0800

Course notes for CS 336, for reference only.

1. Overview

1.1 basic

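Building a complete language model involves three pieces: tokenization, the model architecture, and the training method.

1.1.1 Tokenization

A tokenizer converts between strings and sequences of integers (tokens).

[Figure: tokenizer]

This course focuses on the Byte-Pair Encoding (BPE) tokenizer (Sennrich+ 2015).

There are also tokenizer-free approaches, but so far they have not shown significant gains (Xue+ 2021, Yu+ 2023, Pagnoni+ 2024, Deiseroth+ 2024).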

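1.1.2 Architecture

[Figure: transformer architecture]

Several variants build on the original transformer:

Activation functions: ReLU, SwiGLU (Shazeer 2020)
Positional encodings: sinusoidal, RoPE (Su+ 2021)
Normalization: LayerNorm, RMSNorm (Zhang+ 2019)
Placement of normalization: pre-norm vs. post-norm (Xiong+ 2020)
MLP layer: mixture of experts (Shazeer+ 2017)
Attention: sliding window, linear attention (Jiang+ 2023, Katharopoulos+ 2020)
Lower-dimensional attention: GQA, MLA (Ainslie+ 2023, DeepSeek-AI+ 2024)
State-space models: Hyena (Poli+ 2023)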

1.1.3 Training

Optimizers: AdamW, Muon, SOAP (Kingma+ 2014, Loshchilov+ 2017, Keller 2024, Vyas+ 2024)
Learning rate schedules: cosine, WSD (Loshchilov+ 2016, Hu+ 2024)
Batch size: critical batch size (McCandlish+ 2018)
Regularization: dropout, weight decay
Hyperparameters: number of heads, hidden dimension

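1.2 Systems

1.2.1 Kernels

[Figure: dram and sram]

Inside a GPU, data moves between DRAM and SRAM, and kernels should be organized so that this data movement is as efficient as possible.

1.2.2 Parallelism

[Figure: gpu group]

With a cluster of GPUs, moving data between GPUs is even less efficient than moving it within a single GPU.

Use collective operations: gather, reduce, all-reduce
Shard data across GPUs (parameters, activations, gradients, optimizer state)
Distribute the computation: data, tensor, pipeline, and sequence parallelism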

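1.2.3 Inference

In general, the total compute spent on inference exceeds that of training, because training is a one-time cost while inference is paid on every use of the model.

[Figure: inference phase]

Inference can be split into two phases, prefill and decode:

prefill: the prompt tokens are given, so all of them can be processed at once
decode: tokens have to be generated one at a time, autoregressively

Ways to make decoding more efficient:

Use a cheaper model (pruning, quantization, distillation)
Speculative decoding: a cheap draft model proposes several tokens, and the full model then scores them
Systems optimizations: KV cache, batching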

1.3 Scaling Laws

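Run experiments at small scale in order to predict hyperparameters and loss at large scale.

Given a fixed FLOPs budget, is it better to train a larger model or to train on more tokens?

Compute-optimal scaling laws: Kaplan+ 2020, Hoffmann+ 2022

[Figure: scaling law]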
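As a rough illustration of how such a law answers the question, the sketch below combines two assumptions that are not stated in these notes: training cost C ≈ 6 * N * D for N parameters and D tokens, and the commonly quoted rule of thumb (associated with Hoffmann+ 2022) of about 20 training tokens per parameter. Both constants are approximations, and the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # assumed rule of thumb, not an exact law

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Split a FLOPs budget C into model size N and token count D.

    Assumes C = 6 * N * D and D = TOKENS_PER_PARAM * N,
    which gives N = sqrt(C / (6 * TOKENS_PER_PARAM)).
    """
    num_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    num_tokens = TOKENS_PER_PARAM * num_params
    return num_params, num_tokens

# Hypothetical budget: what a 1.4e9-parameter model trained on 2.8e10 tokens would cost.
budget = 6 * 1.4e9 * 2.8e10
num_params, num_tokens = compute_optimal(budget)
print(f"{num_params:.2e} parameters, {num_tokens:.2e} tokens")
```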

1.4 Data

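What capabilities do we want the model to have? Multilingual ability, coding, math.

1.4.1 Evaluation

Perplexity: the basic way to evaluate language models
Standardized benchmarks: MMLU, HellaSwag, GSM8K
Instruction following: AlpacaEval, IFEval, WildBench
LM-as-a-judge
Whole-system evaluation: RAG, agents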

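1.4.2 Data curation

Sources: web pages crawled from the internet, datasets, arXiv papers, GitHub code, etc.
The data may need to be licensed
Formats: HTML, PDF, directories

1.4.3 Data processing

Format conversion: convert HTML/PDF to text
Filtering: remove harmful content
Deduplication: saves compute and keeps the model from memorizing repeated text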
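The simplest form of deduplication is exact matching after light normalization. The sketch below hashes each normalized document and keeps the first occurrence; the normalization rule (lowercasing and collapsing whitespace) is an assumption made for illustration, and real pipelines often add fuzzy matching on top.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so trivial variants match."""
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing by content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Goodbye"]
print(deduplicate(docs))  # ['Hello  world', 'Goodbye']
```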

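1.5 Alignment

After pretraining, the model is already very good at predicting the next token, but alignment is what makes it actually usable:

Get the language model to follow instructions
Tune the style of its responses
Prevent harmful outputs

Alignment has two phases: supervised fine-tuning and reinforcement learning from feedback.

1.5.1 Supervised fine-tuning (SFT)

Train with supervised learning on instruction data, i.e. <prompt, response> pairs: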


sft_data: list[ChatExample] = [
    ChatExample(
        turns=[
            Turn(role="system", content="You are a helpful assistant."),
            Turn(role="user", content="What is 1 + 1?"),
            Turn(role="assistant", content="The answer is 2."),
        ],
    ),
]

1.5.2 Preference Data

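After supervised training on instruction data, we have a basic instruction-following model. We can make it stronger by training on preference data, without needing more annotated instruction data.

Preference data: several model responses to the same prompt are collected, and a user states which one they prefer.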


preference_data: list[PreferenceExample] = [
    PreferenceExample(
        history=[
            Turn(role="system", content="You are a helpful assistant."),
            Turn(role="user", content="What is the best way to train a language model?"),
        ],
        response_a="You should use a large dataset and train for a long time.",
        response_b="You should use a small dataset and train for a short time.",
        chosen="a",
    )
]

Proximal Policy Optimization (PPO): from reinforcement learning (Schulman+ 2017, Ouyang+ 2022)
Direct Preference Optimization (DPO): for preference data, simpler (Rafailov+ 2023)
Group Relative Policy Optimization (GRPO): removes the value function (Shao+ 2024)
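To give a feel for how a preference pair turns into a training signal, here is a minimal sketch of the DPO loss for a single example, computed from scalar log-probabilities. The function name, argument names, and the value of `beta` are illustrative assumptions; in practice the log-probabilities come from the policy being trained and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# If the policy equals the reference, the margin is 0 and the loss is log(2).
base = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's log-probability lowers the loss.
better = dpo_loss(-4.0, -7.0, -5.0, -7.0)
assert better < base
```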

2. Tokenization

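A tokenizer has to implement encoding and decoding:

Encoding: convert a string into tokens (integer indices)
Decoding: convert tokens back into a string

The number of possible tokens is called the vocabulary size.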

2.1 Character-based tokenization

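A string is a sequence of characters, and each character can be converted directly into an integer index that serves as its token. For example, ord returns a character's Unicode code point (which coincides with ASCII for ASCII characters), and that already gives a simple tokenizer.

Drawbacks:

There are roughly 150K Unicode characters, so the vocabulary is very large
Many characters are rarely used, so the vocabulary is sparse and inefficient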
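A character-based tokenizer along these lines fits in a few lines of Python. This is a minimal sketch with illustrative function names, using the built-in `ord` and `chr`:

```python
def encode(string: str) -> list[int]:
    """Map each character to its Unicode code point."""
    return [ord(ch) for ch in string]

def decode(indices: list[int]) -> str:
    """Map each code point back to its character."""
    return "".join(chr(i) for i in indices)

indices = encode("Hi 🌍")
print(indices)  # [72, 105, 32, 127757]
assert decode(indices) == "Hi 🌍"
```

The emoji's index, 127757, shows the drawback: the indices range over the whole Unicode code space even though most of them are almost never used.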

2.2 Byte-based tokenization

A Unicode string can also be represented as a sequence of bytes, each of which is an integer from 0 to 255:


assert bytes("a", encoding="utf-8") == b"a"
assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d"

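Only 256 byte values are needed to represent any string, so the vocabulary size is 256. However, each byte becomes one integer token, so the compression ratio is 1 byte per token. The resulting token sequences are very long, which makes self-attention very expensive, since its cost grows quadratically with sequence length.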
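The byte-based scheme and its compression ratio can be sketched as follows. The function names and the sample string are illustrative; the compression ratio is taken to mean the number of bytes per token.

```python
def encode(string: str) -> list[int]:
    """One token per UTF-8 byte, each in the range 0..255."""
    return list(string.encode("utf-8"))

def decode(indices: list[int]) -> str:
    """Reassemble the bytes and decode them as UTF-8."""
    return bytes(indices).decode("utf-8")

def compression_ratio(string: str, indices: list[int]) -> float:
    """Number of bytes represented per token."""
    return len(string.encode("utf-8")) / len(indices)

text = "Hello, 🌍! 你好!"
indices = encode(text)
print(len(text), len(indices))           # 13 characters become 20 tokens
print(compression_ratio(text, indices))  # 1.0
assert decode(indices) == text
```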

2.3 Word-based tokenization

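Classical NLP models commonly use word-based tokenization: first split the sentence into a sequence of words, then assign an integer index to each distinct word to build the tokenizer.

import regex # third-party package; the standard-library re module gives the same result for this pattern

string = "I'll say supercalifragilisticexpialidocious!"
segments = regex.findall(r"\w+|.", string)
# segments == ["I", "'", "ll", " ", "say", " ", "supercalifragilisticexpialidocious", "!"]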

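Drawbacks:

The number of distinct words is huge
Many words are rare, so the model cannot learn much about them
It is hard to settle on a fixed-size vocabulary, and words never seen before will still appear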
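The unseen-word problem is easy to reproduce. The sketch below builds a vocabulary from a tiny training string and maps anything unknown to a special token; the `<unk>` convention and all function names are assumptions made for illustration.

```python
import re

UNK = "<unk>"  # placeholder token for segments missing from the vocabulary

def segment(string: str) -> list[str]:
    """Split into runs of word characters and single other characters."""
    return re.findall(r"\w+|.", string)

def build_vocab(training_text: str) -> dict[str, int]:
    """Assign the next free index to each new segment, in order of appearance."""
    vocab = {UNK: 0}
    for seg in segment(training_text):
        vocab.setdefault(seg, len(vocab))
    return vocab

def encode(string: str, vocab: dict[str, int]) -> list[int]:
    return [vocab.get(seg, vocab[UNK]) for seg in segment(string)]

vocab = build_vocab("the cat sat")
print(vocab)                         # {'<unk>': 0, 'the': 1, ' ': 2, 'cat': 3, 'sat': 4}
print(encode("the dog sat", vocab))  # [1, 2, 0, 2, 4]
```

The word "dog" collapses to index 0, so the model cannot tell it apart from any other unseen word.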

2.4 Byte Pair Encoding

First represent the string as bytes, then let the tokenizer learn automatically which byte sequences to merge into tokens (these need not be complete human words). Common words end up as a single token, while rare or unseen words are represented by a combination of several tokens.


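from collections import defaultdict
from dataclasses import dataclass

# BPETokenizerParams and merge are helpers that this note does not show; minimal versions follow.
@dataclass(frozen=True)
class BPETokenizerParams:
    vocab: dict[int, bytes]             # index -> bytes
    merges: dict[tuple[int, int], int]  # (index1, index2) -> merged index

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    """Return indices with every occurrence of pair replaced by new_index."""
    new_indices = []
    i = 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:
    # Start with the list of bytes of string
    indices = list(map(int, string.encode("utf-8")))
    merges: dict[tuple[int, int], int] = {}
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}

    for i in range(num_merges):
        # Count the number of occurrences of each adjacent pair of tokens
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):
            counts[(index1, index2)] += 1
        if not counts:
            break  # Nothing left to merge

        # Find the most common pair; this is the pair to merge
        pair = max(counts, key=counts.get)
        index1, index2 = pair

        # Merge that pair
        new_index = 256 + i
        merges[pair] = new_index
        vocab[new_index] = vocab[index1] + vocab[index2]
        indices = merge(indices, pair, new_index)

    return BPETokenizerParams(vocab=vocab, merges=merges)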
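Once the merges are learned, encoding replays them in order on the bytes of a new string, and decoding concatenates the byte strings of the tokens. The sketch below is self-contained and uses a small hand-written merge table (the merges for "th", "the", and "the ") as a stand-in for a trained one; the names `MERGES`, `VOCAB`, `encode`, and `decode` are illustrative assumptions.

```python
def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    """Replace every occurrence of pair in indices with new_index."""
    out, i = [], 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            out.append(new_index)
            i += 2
        else:
            out.append(indices[i])
            i += 1
    return out

# Hypothetical merges, in the order they were learned: "th", "the", "the ".
MERGES = {(116, 104): 256, (256, 101): 257, (257, 32): 258}

# The vocabulary maps every index, including merged ones, back to bytes.
VOCAB = {i: bytes([i]) for i in range(256)}
for (index1, index2), new_index in MERGES.items():
    VOCAB[new_index] = VOCAB[index1] + VOCAB[index2]

def encode(string: str) -> list[int]:
    indices = list(string.encode("utf-8"))
    # Dicts preserve insertion order, so merges are applied in learned order.
    for pair, new_index in MERGES.items():
        indices = merge(indices, pair, new_index)
    return indices

def decode(indices: list[int]) -> str:
    return b"".join(VOCAB[i] for i in indices).decode("utf-8")

indices = encode("the quick brown fox")
print(indices[:3])  # [258, 113, 117]: "the " is a single token
assert decode(indices) == "the quick brown fox"
```

Words that were never merged, such as "quick", still encode correctly because they fall back to individual bytes, which is what removes the unseen-word problem of word-based tokenization.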