Fixing the "Cannot map memory with base addr ..." error in dmesg on a GPU server

Today a colleague reported that some requests were failing with HTTP 502, so I checked the system logs and found the following errors:

[Thu Jun 18 23:15:31 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a19b2647fa00 >= 46a199fa96c900
[Thu Jun 18 23:15:31 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:15:47 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a19ed0755100 >= 46a19d7daa5c00
[Thu Jun 18 23:15:47 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:16:02 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a1a24ff55d00 >= 46a1a1275da100
[Thu Jun 18 23:16:02 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:16:18 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a1a605946400 >= 46a1a4a97d1000
[Thu Jun 18 23:16:18 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:16:39 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a1ab01988c00 >= 46a1a9f4e5ed00
[Thu Jun 18 23:16:39 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:16:54 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a1ae78835400 >= 46a1ad5937f700
[Thu Jun 18 23:16:54 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[Thu Jun 18 23:17:15 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 46a1b33a5ae800 >= 46a1b0d05fc800
[Thu Jun 18 23:17:15 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!

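This error looks very similar to a problem I troubleshot before; see my earlier note: https://sulao.cn/post/1154
Reading further down the log, however, I found the following entries: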

[Mon Jun 22 17:06:22 2026] Cannot map memory with base addr 0x7ee2b3068000 and size of 0x40000 pages
[Mon Jun 22 17:06:22 2026] Cannot map memory with base addr 0x7f3ddf068000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7ef15b58c000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7fb2d758c000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7ee31958c000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7f19ad58c000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7f631d58c000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7ee7cb58c000 and size of 0x30a74 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7f90ab58c000 and size of 0x40000 pages
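
To get a feel for how large each failed mapping is, the page counts in these entries can be converted to bytes. The following is a minimal sketch, assuming 4 KiB base pages (the usual x86_64 default, not stated in the log); the names PAGE_SIZE and failed_mappings are my own.

```python
import re

# Assumption: 4 KiB base pages; adjust if your kernel uses a different page size.
PAGE_SIZE = 4096

MAP_FAIL = re.compile(
    r"Cannot map memory with base addr (0x[0-9a-fA-F]+) and size of (0x[0-9a-fA-F]+) pages"
)


def failed_mappings(log_text):
    """Return (base_address, size_in_bytes) for each failed mapping in the log."""
    results = []
    for match in MAP_FAIL.finditer(log_text):
        base = int(match.group(1), 16)
        pages = int(match.group(2), 16)
        results.append((base, pages * PAGE_SIZE))
    return results


# Two of the entries shown above.
sample = """\
[Mon Jun 22 17:06:22 2026] Cannot map memory with base addr 0x7ee2b3068000 and size of 0x40000 pages
[Mon Jun 22 17:19:09 2026] Cannot map memory with base addr 0x7ee7cb58c000 and size of 0x30a74 pages
"""

for base, size in failed_mappings(sample):
    print(f"base={base:#x} size={size / 2**30:.2f} GiB")
```

Under that page-size assumption, 0x40000 pages comes out to exactly 1 GiB per request.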

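At this point I was fairly sure the problem was related to page-locked (pinned) memory.
NVRM is the NVIDIA kernel driver. The driver tries to map host system memory or Unified Memory into the GPU virtual address space; when kernel virtual address space runs short and a large contiguous run of pages cannot be allocated, the mapping fails outright. The two kinds of log entries are a chained failure: scrolling further back in the log, we should find that the allocation failure appears first, after which tasks hang and the scheduling thread times out.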
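How the problem is triggered

pin_memory is enabled → the driver requests contiguous vmalloc address space → address space runs out and the mapping fails → GPU work cannot be enqueued → the internal scheduling thread times out.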
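
Whether the mapping failure really precedes the timeouts can be checked against the full log rather than by eye. This is a rough sketch, assuming the log is in the `dmesg -T` format shown earlier and that day and month names are in English; classify and first_seen are hypothetical helper names.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^\[(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})\] (.*)$")


def classify(line):
    """Return (timestamp, kind) for one of the two error types, else None."""
    match = STAMP.match(line)
    if not match:
        return None
    # Collapse the double space that appears before single-digit days.
    stamp = " ".join(match.group(1).split())
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    message = match.group(2)
    if "Cannot map memory" in message:
        return when, "map_failure"
    if "_threadNodeCheckTimeout" in message:
        return when, "timeout"
    return None


def first_seen(lines):
    """Map each error kind to the earliest timestamp at which it appears."""
    earliest = {}
    for line in lines:
        item = classify(line)
        if item is None:
            continue
        when, kind = item
        if kind not in earliest or when < earliest[kind]:
            earliest[kind] = when
    return earliest


# Two lines taken from the excerpts above.
sample = [
    "[Thu Jun 18 23:15:31 2026] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!",
    "[Mon Jun 22 17:06:22 2026] Cannot map memory with base addr 0x7ee2b3068000 and size of 0x40000 pages",
]

for kind, when in sorted(first_seen(sample).items(), key=lambda kv: kv[1]):
    print(when, kind)
```

On the two excerpts alone the timeout is the earlier event, so the ordering only becomes meaningful when the whole `dmesg -T` output is fed in.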

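The core causes of this problem are as follows.
1. The NVIDIA driver uses a large amount of the kernel vmalloc area on every GPU for host-to-GPU memory mappings.
The default vmalloc range reserved by the Linux kernel is on the small side. With three, four, or more GPUs, large models that enable unified memory, and several processes loading large weights concurrently, it fills up quickly; a request for hundreds of thousands of contiguous pages (0x40000 pages is 262,144 pages, or 1 GiB with 4 KiB pages) can then no longer be satisfied, and the driver reports Cannot map memory.
2. PyTorch, llama.cpp, and vLLM commonly load model files of tens of GB with mmap, requesting a huge contiguous mapping in one go.
Enabling unified memory or pin_memory=True greatly increases kernel address space usage.
When several Pods or processes share a GPU, multiple copies of a large model are mapped at the same time and the kernel virtual address space fills up almost instantly.
3. System memory fragmentation / a conflicting min_free_kbytes setting.
The kernel's minimum free memory watermark is set too high and large contiguous physical memory is consumed piecemeal, so a large number of contiguous pages cannot be allocated at once.
At the same time, available system memory is low, DMA buffer allocation fails, and the GPU mapping logic reports errors as a knock-on effect.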
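At the application level, you can reduce the number of concurrent tasks per GPU, split large batches, avoid having several processes load very large weights at the same time, and turn off unified memory and the mmap-ing of very large files in the application: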

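vLLM: --no-enable-unified-memory (check that your vLLM version accepts this flag)
llama.cpp: add --no-mmap and lower the number of GPU-offloaded layers with -ngl
PyTorch DataLoader: do not set pin_memory=True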

For a permanent fix, configure the bootloader to permanently enlarge the kernel vmalloc address space:

cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`( . /etc/os-release; echo ${NAME:-Ubuntu} ) 2>/dev/null || echo Ubuntu`
GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash vmalloc=512G transparent_hugepage=never"

Use 128G for a single GPU;
256G for 2 GPUs;
512G for 4 or more GPUs.
transparent_hugepage=never: disables transparent huge pages to reduce kernel memory fragmentation.
Confirm that your kernel actually honors vmalloc= before relying on it: the kernel documentation describes the parameter as a way to raise the 128 MB minimum on x86 and arm32 platforms, and on x86_64 the vmalloc range is already a fixed 32 TB region with 4-level paging.
Then regenerate the GRUB configuration:

update-grub

Next, tune the kernel parameters:

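tee -a /etc/sysctl.conf <<EOF
# Lower the minimum reserved free memory to keep large contiguous blocks available
vm.min_free_kbytes = 131072
# Raise the per-process limit on memory map areas (mmap regions)
vm.max_map_count = 2621440
# Always allow memory overcommit (mode 1)
vm.overcommit_memory = 1
# Note: overcommit_ratio is only consulted when vm.overcommit_memory = 2
vm.overcommit_ratio = 90
# Raise the system-wide limit on open file handles
fs.file-max = 1048576
EOF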

sysctl -p
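
After `sysctl -p`, it can be worth confirming that the running kernel really picked up the values. The sketch below compares a sysctl.conf-style snippet with the live values under /proc/sys; the function names are my own, and the check assumes the standard mapping of dots in a key to path separators.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def read_live(key, proc_root="/proc/sys"):
    """Read the running value of a sysctl key, or None if it is unavailable."""
    path = Path(proc_root, *key.split("."))
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


def mismatches(expected, proc_root="/proc/sys"):
    """Return {key: (expected, live)} for every key whose live value differs."""
    result = {}
    for key, value in expected.items():
        live = read_live(key, proc_root)
        if live != value:
            result[key] = (value, live)
    return result


wanted = parse_sysctl_conf("""
# values from the snippet above
vm.min_free_kbytes = 131072
vm.max_map_count = 2621440
vm.overcommit_memory = 1
""")

for key, (expected, live) in mismatches(wanted).items():
    print(f"{key}: expected {expected}, running value is {live}")
```

An empty output means every listed key already matches the running kernel.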

Then reboot the operating system:

reboot
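
Once the machine is back up, the boot parameters can be verified from /proc/cmdline. Here is a small sketch of that check; the BOOT_IMAGE and root values in the sample string are made up for illustration, and the parser does not handle quoted values that contain spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def check_boot_params(cmdline, wanted):
    """Return the wanted parameters that are missing or have a different value."""
    actual = parse_cmdline(cmdline)
    return {
        name: actual.get(name)
        for name, value in wanted.items()
        if actual.get(name) != value
    }


# Hypothetical command line; on a real host read it from /proc/cmdline instead:
#     cmdline = open("/proc/cmdline").read()
cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet splash vmalloc=512G transparent_hugepage=never"

wanted = {"vmalloc": "512G", "transparent_hugepage": "never"}
problems = check_boot_params(cmdline, wanted)
print("all boot parameters present" if not problems else f"mismatch: {problems}")
```

This only shows that the parameters reached the kernel command line, not that the kernel acted on them.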

Some tuning is also needed at the container and Kubernetes level.
Lift the memory-lock restriction in the security context:

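securityContext:
  # Kubernetes securityContext has no ulimits field; the IPC_LOCK capability
  # lets the container lock memory beyond its RLIMIT_MEMLOCK instead
  capabilities:
    add: ["IPC_LOCK"]
  runAsUser: 0 # run as root if a regular user lacks permission to mmap large weights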
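
To see which memory-lock limit a process inside the container actually ended up with, the limit can be read from within the container. A minimal sketch using Python's standard resource module (Unix only); describe_limit and memlock_report are illustrative names.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**20:.1f} MiB"


def memlock_report():
    """Report the soft and hard RLIMIT_MEMLOCK of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"memlock soft={describe_limit(soft)} hard={describe_limit(hard)}"


print(memlock_report())
```

Running this with the same user and entrypoint as the inference process gives the most representative result.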

Container environment variable: disable memory-hungry unified memory

env:
  - name: VLLM_NO_UNIFIED_MEMORY
    value: "1"

Append --no-enable-unified-memory to the startup arguments as well.
For PyTorch training/inference:

env:
  - name: CUDA_VISIBLE_DEVICES
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['nvidia.com/gpu.devices']
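  # Allocator setting intended to reduce CUDA memory-mapping overhead
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
  # TORCH_USE_CUDA_DSA relates to device-side assertions and does not switch off
  # pin_memory, which is a DataLoader argument set in code
  - name: TORCH_USE_CUDA_DSA
    value: "0"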
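
A quick way to confirm that the allocator setting reached the process is to parse the variable at startup. This sketch assumes the simple comma-separated option:value form and does not handle options whose values are bracketed lists; parse_alloc_conf is a hypothetical helper.

```python
import os


def parse_alloc_conf(value):
    """Parse 'option:value,option2:value2' into a dict (simple options only)."""
    conf = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition(":")
        if sep:
            conf[key] = val
    return conf


conf = parse_alloc_conf(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""))
print("expandable_segments =", conf.get("expandable_segments", "not set"))
```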

For llama.cpp, add --no-mmap to the startup arguments so that very large weight files are not mmap-ed.

You can also use the following commands to check whether vmalloc is exhausted and to inspect how fragmented contiguous memory is:

cat /proc/vmallocinfo | awk '{sum += $2} END {print sum/1024/1024 " MB used"}'
cat /proc/buddyinfo
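
The same two files can be summarized programmatically, which is handy for periodic monitoring. The sketch below parses text in the /proc/vmallocinfo and /proc/buddyinfo formats; the sample lines are illustrative and not taken from the affected host, and the caller names in them are made up.

```python
def vmalloc_used_bytes(text):
    """Sum the size column (second field, in bytes) of /proc/vmallocinfo."""
    total = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            total += int(fields[1])
    return total


def free_blocks_by_zone(text):
    """Parse /proc/buddyinfo into {(node, zone): [free block count per order]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <count for order 0> <order 1> ...
        if len(fields) > 4 and fields[0] == "Node" and fields[2] == "zone":
            zones[(int(fields[1]), fields[3])] = [int(x) for x in fields[4:]]
    return zones


def largest_free_order(counts):
    """Highest order that still has at least one free block, or -1 if none."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order] > 0:
            return order
    return -1


# Illustrative input only; on a real host read the files as root.
vmalloc_sample = """\
0xffffc90000000000-0xffffc90000005000   20480 hypothetical_caller_a+0x10/0x20 pages=4 vmalloc
0xffffc90000005000-0xffffc90000007000    8192 hypothetical_caller_b+0x30/0x40 pages=1 vmalloc
"""
buddy_sample = "Node 0, zone   Normal    120     60     30     10      4      2      1      0      0      0      0\n"

print(f"{vmalloc_used_bytes(vmalloc_sample) / 2**20:.3f} MB used")
for (node, zone), counts in free_blocks_by_zone(buddy_sample).items():
    print(f"node {node} zone {zone}: largest free order = {largest_free_order(counts)}")
```

A low largest free order in the Normal zone indicates that few large contiguous blocks remain.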


Use pinned memory only where it is needed; do not enable pin_memory globally in the DataLoader.

Copyright notice: unless otherwise stated, all articles on this site are original.

When reposting, please credit the source: https://www.sulao.cn/post/1181
