nccl-tests: A One-Click Test Script for Single-Node Multi-GPU NVIDIA Systems

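I run these tests often, so I wrote a bash script for single-node multi-GPU testing. It assumes the NVIDIA driver and the CUDA libraries are already installed, with CUDA in the default location under /usr/local/. I downloaded NCCL as a zip archive named nccl-master.zip and nccl-tests as a zip archive named nccl-tests-master.zip; both file names are hard-coded in the script. Save the script below to a file, put the two archives in the same directory, and run it with bash. The script also sources a common.sh from that directory, which must provide the logging helpers and the CHECK_* functions it calls.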
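
The prerequisites above can be checked before the script is started. The sketch below is one possible pre-flight check and is not part of the original script: the function names missing_files and missing_commands are hypothetical, and the file list simply mirrors the hard-coded archive names plus the common.sh that the script sources.

```shell
#!/bin/bash
# Hypothetical helper: print every required file that is missing from directory $1.
missing_files() {
    local dir="${1:-.}" f
    for f in nccl-master.zip nccl-tests-master.zip common.sh; do
        [ -f "${dir}/${f}" ] || echo "${f}"
    done
}

# Hypothetical helper: print every required command that is not on PATH.
# nvidia-smi ships with the driver; unzip and make are called by the script.
missing_commands() {
    local c
    for c in nvidia-smi unzip make; do
        command -v "${c}" >/dev/null 2>&1 || echo "${c}"
    done
}

# Example usage (left commented out so the block only defines functions):
#   missing_files "$(dirname "$0")"
#   missing_commands
```

If both functions print nothing, the files and tools the script expects are in place.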

The script is as follows:

#!/bin/bash
#set -e

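# Resolve the directory that contains this script.
CURRENT_PATH="$(readlink -f "$(dirname "$0")")"
# common.sh must provide INFO/WARNING/ERROR and the CHECK_* functions used below.
if [ -f "${CURRENT_PATH:=.}/common.sh" ]; then
    . "${CURRENT_PATH:=.}/common.sh"
else
    echo "Cannot find the common configuration file common.sh!"
    exit 1
fi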

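function BUILD_NCCL_TESTS(){
    # Cap the largest message size at 2G on SM 120 and SM 89 GPUs.
    if [ "$COMPUTE_SM" -eq 120 ] || [ "$COMPUTE_SM" -eq 89 ]; then
        export TEST_TOTAL="2G"
    fi
    INFO "Current compute capability: ${COMPUTE_SM}"
    cd "${CURRENT_PATH:=.}"
    if [ ! -d "${CURRENT_PATH:=.}/nccl-master" ]; then
        WARNING "${CURRENT_PATH:=.}/nccl-master does not exist, unpacking nccl-master.zip!"
        unzip "${CURRENT_PATH:=.}/nccl-master.zip"
    fi
    # Stop here if the source tree is missing, so make never runs in the wrong directory.
    cd "${CURRENT_PATH:=.}/nccl-master" || exit 1
    mkdir -p "${CURRENT_PATH:=.}/nccl"
    if [ -d "${CURRENT_PATH:=.}/nccl/lib" ]; then
        INFO "Found existing build output in ${CURRENT_PATH:=.}/nccl/lib, cleaning it first!"
        # Pass BUILDDIR so that clean targets the same directory the build writes to.
        make clean BUILDDIR="${CURRENT_PATH:=.}/nccl"
    fi
    INFO "Building nccl..."
    make -j"$(nproc)" src.build BUILDDIR="${CURRENT_PATH:=.}/nccl" CUDA_HOME="${CUDA_PATH}" NVCC_GENCODE="-gencode=arch=compute_${COMPUTE_SM},code=sm_${COMPUTE_SM}"
    if [ $? -eq 0 ]; then
        INFO "nccl build finished!"
    else
        ERROR "nccl build failed!"
        exit 1
    fi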
    
    cd "${CURRENT_PATH:=.}"
    if [ ! -d "${CURRENT_PATH:=.}/nccl-tests-master" ]; then
        WARNING "${CURRENT_PATH:=.}/nccl-tests-master does not exist, unpacking nccl-tests-master.zip!"
        unzip "${CURRENT_PATH:=.}/nccl-tests-master.zip"
    fi
    # Stop here if the source tree is missing, so make never runs in the wrong directory.
    cd "${CURRENT_PATH:=.}/nccl-tests-master" || exit 1
    if [ -d "${CURRENT_PATH:=.}/nccl-tests-master/build" ]; then
        INFO "Found existing build directory ${CURRENT_PATH:=.}/nccl-tests-master/build, cleaning it first!"
        make clean
    fi
    INFO "Building nccl-tests..."
    # Build against the NCCL library compiled above.
    make CUDA_HOME="${CUDA_PATH}" NCCL_HOME="${CURRENT_PATH:=.}/nccl"
    if [ $? -eq 0 ]; then
        INFO "nccl-tests build finished!"
    else
        ERROR "nccl-tests build failed!"
        exit 1
    fi
    export NCCL_TESTS_PATH="${CURRENT_PATH:=.}/nccl-tests-master"
}
function NCCL_COMP_TESTS(){
    INFO "Starting single-node multi-GPU communication tests, current LD_LIBRARY_PATH: ${LD_LIBRARY_PATH}"
    mkdir -p "${CURRENT_PATH:=.}/result"
    cd "${CURRENT_PATH:=.}"
    if [ "${GPU_TOTAL}" -gt 1 ]; then
        # Make the test binaries load the NCCL library built above.
        export LD_LIBRARY_PATH="${CURRENT_PATH:=.}/nccl/lib:$LD_LIBRARY_PATH"
        TEST_ITEM=('all_reduce_perf' 'all_gather_perf' 'alltoall_perf')
        for ITEM in "${TEST_ITEM[@]}"
        do
            sleep 10
            INFO "Starting single-node multi-GPU ${ITEM} test..."
            # Sweep message sizes from 8 bytes up to TEST_TOTAL (default 8G), doubling at each step.
            "${NCCL_TESTS_PATH}/build/${ITEM}" -b 8 -e "${TEST_TOTAL:-8G}" -f 2 -g "${GPU_TOTAL}" > "${CURRENT_PATH:=.}/result/${ITEM}_$(hostname).log"
            if [ $? -eq 0 ];then
                cat "${CURRENT_PATH:=.}/result/${ITEM}_$(hostname).log"
                INFO "nccl-tests ${ITEM} test finished!"
            else
                ERROR "nccl-tests ${ITEM} test failed. To run it manually, first set: export LD_LIBRARY_PATH=${CURRENT_PATH:=.}/nccl/lib:\$LD_LIBRARY_PATH , then run: ${NCCL_TESTS_PATH}/build/${ITEM} -b 8 -e ${TEST_TOTAL:-8G} -f 2 -g ${GPU_TOTAL}"
            fi
        done
    else
        WARNING "Current GPU count: ${GPU_TOTAL}, cannot run nccl-tests!"
    fi
}

CHECK_COMPUTE_SM
CHECK_GPU_COMMAND
CHECK_CUDA_PATH
BUILD_NCCL_TESTS
NCCL_COMP_TESTS
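
The script depends on a common.sh that is not shown in this post. Judging from how it is used, that file has to define the logging helpers INFO, WARNING and ERROR as well as the three CHECK_* functions, which presumably set COMPUTE_SM, GPU_TOTAL and CUDA_PATH. The following is only a guess at a minimal version so that the script can be tried out. Everything in it is an assumption rather than the author's actual file, including the helper sm_from_cap, the choice to read the compute capability of the first GPU through nvidia-smi's compute_cap query field (which needs a reasonably recent driver), and /usr/local/cuda as the CUDA location.

```shell
#!/bin/bash
# Hypothetical minimal common.sh; the author's real file is not shown.

INFO()    { echo "[INFO] $*"; }
WARNING() { echo "[WARNING] $*"; }
ERROR()   { echo "[ERROR] $*" >&2; }

# Convert a compute capability such as "8.9" into the SM number "89".
sm_from_cap() { echo "${1//./}"; }

# Assumption: use the compute capability reported for the first GPU.
CHECK_COMPUTE_SM() {
    local cap
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
    [ -n "${cap}" ] || { ERROR "Cannot read the GPU compute capability"; exit 1; }
    export COMPUTE_SM="$(sm_from_cap "${cap}")"
}

# Count GPUs: nvidia-smi -L prints one "GPU n: ..." line per device.
CHECK_GPU_COMMAND() {
    command -v nvidia-smi >/dev/null 2>&1 || { ERROR "nvidia-smi not found"; exit 1; }
    export GPU_TOTAL="$(nvidia-smi -L | grep -c '^GPU ')"
}

# Assumption: CUDA lives in /usr/local/cuda unless CUDA_PATH is already set.
CHECK_CUDA_PATH() {
    export CUDA_PATH="${CUDA_PATH:-/usr/local/cuda}"
    [ -x "${CUDA_PATH}/bin/nvcc" ] || { ERROR "nvcc not found under ${CUDA_PATH}"; exit 1; }
}
```

With this sketch, a GPU reporting compute capability 8.9 yields COMPUTE_SM=89, which matches the values the script compares against.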

Save the script as single_nccl_test.sh and run it with the following command:

bash single_nccl_test.sh

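The test results are written to the result directory next to the script, one log file per test, named <test>_$(hostname).log (for example all_reduce_perf_$(hostname).log).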
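
For a quick overview after a run, the summary line that nccl-tests prints near the end of its output (a line containing "Avg bus bandwidth") can be pulled out of every log. This is a small sketch, assuming the logs are the .log files the script writes into its result directory; the function name summarize_logs is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "<log file name> <avg bus bandwidth>" for each log in directory $1.
summarize_logs() {
    local dir="${1:-result}" log
    for log in "${dir}"/*.log; do
        # Skip the unexpanded pattern when the directory has no logs.
        [ -f "${log}" ] || continue
        # Take the value after the colon on the "Avg bus bandwidth" line and strip whitespace.
        awk -v name="$(basename "${log}")" -F':' \
            '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print name, $2 }' "${log}"
    done
}

# Example usage (left commented out so the block only defines the function):
#   summarize_logs ./result
```

Comparing these values across the three tests, or across machines, is usually quicker than reading the full per-size tables.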


