Installing the Mamba2 Environment on Windows: Problems and Solutions (causal_conv1d==1.4.0, mamba_ssm==2.2.2)

Navigation

Installation tutorial and package index

If you are unsure where to start, first read the beginner's index: Mamba / Vim / Vmamba Installation Tutorials and Package Index for Linux / Windows.

This tutorial series has been added to the ima knowledge base, and you are welcome to ask questions in the ima mini-program. If a problem remains unsolved, contact the WeChat account at the end of this post for installation support, package after-sales, or paper-collaboration ideas.

Background

The official Mamba repository (https://github.com/state-spaces/mamba) has been updated to Mamba2, and old and new causal-conv1d releases are not compatible with every Mamba version, so this post updates the earlier series entry "Mamba Environment Installation: Pitfalls and Solutions" for the new versions. On Linux, installing Mamba2 is no different from before; on Windows, follow this post. The versions installed here are mamba_ssm-2.2.2 and causal_conv1d-1.4.0.

Installation steps

1. Preparing the environment on Windows

The preliminary environment setup is the same as in the earlier post "Mamba Environment Installation: Pitfalls and Solutions", namely:

conda create -n mamba python=3.10
conda activate mamba
conda install cudatoolkit==11.8
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install setuptools==68.2.2
conda install nvidia/label/cuda-11.8.0::cuda-nvcc_win-64
conda install packaging
pip install triton-2.0.0-cp310-cp310-win_amd64.whl

The wheel triton-2.0.0-cp310-cp310-win_amd64.whl can be obtained the same way as in the earlier post (network drive).
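
Before building anything, it is worth double-checking that the CUDA build of PyTorch is actually active; a minimal sanity check (assuming the environment set up above):

import torch

print(torch.__version__)          # expect 2.1.1+cu118
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # must print True, or the CUDA extensions cannot be built or run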

2. Building causal-conv1d 1.4.0 from source

First, download the project files:

git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
set CAUSAL_CONV1D_FORCE_BUILD=TRUE  # or edit line 37 of setup.py
# apply the source modifications described below before running this last step
pip install .
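
For reference, the force-build switch in causal-conv1d's setup.py roughly follows the pattern below (a paraphrased sketch, not a verbatim copy; check line 37 of your checkout, as the exact wording may differ between versions). mamba-ssm's setup.py uses the same pattern with MAMBA_FORCE_BUILD:

import os

# Sketch: when the environment variable is TRUE, setup.py always compiles the
# extension locally instead of trying to download a prebuilt (Linux-only) wheel.
FORCE_BUILD = os.getenv("CAUSAL_CONV1D_FORCE_BUILD", "FALSE") == "TRUE"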

Before running the final build step, the sources need the following modifications.

1) In csrc/causal_conv1d.cpp, change "and" to "&&" on lines 159, 277 and 279 (MSVC does not accept the alternative operator token "and" in its default mode), i.e. change:

TORCH_CHECK(x.stride(2) % 8 == 0 and x.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8");

TORCH_CHECK(x.stride(2) % 8 == 0 and x.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8");

TORCH_CHECK(dout.stride(2) % 8 == 0 and dout.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (dout.stride(0) and dout.stride(2)) to be multiples of 8");

to:

TORCH_CHECK(x.stride(2) % 8 == 0 && x.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8");

TORCH_CHECK(x.stride(2) % 8 == 0 && x.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8");

TORCH_CHECK(dout.stride(2) % 8 == 0 && dout.stride(0) % 8 == 0, "causal_conv1d with channel last layout requires strides (dout.stride(0) and dout.stride(2)) to be multiples of 8");

2) In csrc/causal_conv1d_bwd.cu and csrc/causal_conv1d_fwd.cu, comment out (or remove) the parts guarded by USE_ROCM. (ROCm is the AMD-GPU dependency, so it can be dropped when building for NVIDIA GPUs; see the readme.) Concretely:
For csrc/causal_conv1d_bwd.cu, change the block at the top,

#ifndef USE_ROCM
    #include <cub/block/block_load.cuh>
    #include <cub/block/block_store.cuh>
    #include <cub/block/block_reduce.cuh>
#else
    #include <hipcub/hipcub.hpp>
    namespace cub = hipcub;
#endif

to:

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_reduce.cuh>

and, at the original lines 258-266, inside the function void causal_conv1d_bwd_launch, change

                #ifndef USE_ROCM
                C10_CUDA_CHECK(cudaFuncSetAttribute(
                    kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                #else
                // There is a slight signature discrepancy in HIP and CUDA "FuncSetAttribute" function.
                C10_CUDA_CHECK(cudaFuncSetAttribute(
                    (void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                std::cerr << "Warning (causal_conv1d bwd launch): attempting to set maxDynamicSharedMemorySize on an AMD GPU which is currently a non-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
                #endif

to:

                C10_CUDA_CHECK(cudaFuncSetAttribute(
                    kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));

Likewise, for csrc/causal_conv1d_fwd.cu, change the block at the top,

#ifndef USE_ROCM
    #include <cub/block/block_load.cuh>
    #include <cub/block/block_store.cuh>
#else
    #include <hipcub/hipcub.hpp>
    namespace cub = hipcub;
#endif

to:

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>

and, at the original lines 148-156, inside the function void causal_conv1d_fwd_launch, change

            #ifndef USE_ROCM
            C10_CUDA_CHECK(cudaFuncSetAttribute(
                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
            #else
            // There is a slight signature discrepancy in HIP and CUDA "FuncSetAttribute" function.
            C10_CUDA_CHECK(cudaFuncSetAttribute(
                (void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
            std::cerr << "Warning (causal_conv1d fwd launch): attempting to set maxDynamicSharedMemorySize on an AMD GPU which is currently a non-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
            #endif

to:

            C10_CUDA_CHECK(cudaFuncSetAttribute(
                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));

The official project does not ship a prebuilt Windows wheel, hence the manual build above. I have built causal_conv1d-1.4.0-cp310-cp310-win_amd64.whl for Windows, which can also be installed directly (it only works with torch 2.1 and compute capabilities 6.0-9.0; before downloading, read to the end, as a full bundle is linked later).

pip install causal_conv1d-1.4.0-cp310-cp310-win_amd64.whl

After a successful installation, a causal_conv1d_cuda.cp310-win_amd64.pyd file appears in the virtual environment (xxx\conda\envs\xxx\Lib\site-packages\); this file provides the causal_conv1d_cuda module.
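
A quick way to confirm that the compiled extension is the one Python picks up, using only the standard library (a minimal sketch):

import importlib.util

spec = importlib.util.find_spec("causal_conv1d_cuda")
# Expect a path ending in causal_conv1d_cuda.cp310-win_amd64.pyd
# inside your environment's site-packages directory.
print(spec.origin if spec else "causal_conv1d_cuda not found")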

3. Building mamba-ssm 2.2.2 from source

The preliminary setup and some of the file modifications are the same as in the earlier post "Mamba Environment Installation on Windows: Pitfalls and Solutions (no need to bypass selective_scan_cuda)". Concretely:
1) Prepare the mamba-ssm environment and download the project files:

git clone https://github.com/state-spaces/mamba.git
cd mamba
set MAMBA_FORCE_BUILD=TRUE  # or edit line 40 of setup.py
# apply the source modifications described below before running this last step
pip install . --no-build-isolation --verbose

2) Change the void selective_scan_fwd_launch function in csrc/selective_scan/selective_scan_fwd_kernel.cuh to:

void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
    // Only kNRows == 1 is tested for now, which ofc doesn't differ from previously when we had each block
    // processing 1 row.
    static constexpr int kNRows = 1;
    BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
        BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
            BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
                BOOL_SWITCH(params.z_ptr != nullptr, kHasZ, [&] {
                    using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
                    // constexpr int kSmemSize = Ktraits::kSmemSize;
                    static constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
                    // printf("smem_size = %d\n", kSmemSize);
                    dim3 grid(params.batch, params.dim / kNRows);
                    auto kernel = &selective_scan_fwd_kernel<Ktraits>;
                    if (kSmemSize >= 48 * 1024) {
                        C10_CUDA_CHECK(cudaFuncSetAttribute(
                            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                    }
                    kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
                    C10_CUDA_KERNEL_LAUNCH_CHECK();
                });
            });
        });
    });
}
  • Change the BOOL_SWITCH macro in csrc/selective_scan/static_switch.h to:
#define BOOL_SWITCH(COND, CONST_NAME, ...)                   \
    [&] {                                                    \
        if (COND) {                                          \
            static constexpr bool CONST_NAME = true;         \
            return __VA_ARGS__();                            \
        } else {                                             \
            static constexpr bool CONST_NAME = false;        \
            return __VA_ARGS__();                            \
        }                                                    \
    }()

(Both of these edits replace constexpr with static constexpr, which MSVC/nvcc handles more reliably inside the nested lambdas above.)

  • At the top of both csrc/selective_scan/selective_scan_bwd_kernel.cuh and csrc/selective_scan/selective_scan_fwd_kernel.cuh, add:
#ifndef M_LOG2E
#define M_LOG2E 1.4426950408889634074
#endif

3) The two steps above match the previous installation, but Mamba2 needs additional changes. As with causal-conv1d, comment out the USE_ROCM-related parts in csrc/selective_scan/selective_scan_fwd_kernel.cuh and csrc/selective_scan/selective_scan_bwd_kernel.cuh:
For csrc/selective_scan/selective_scan_fwd_kernel.cuh, change the block at the top,

#ifndef USE_ROCM
    #include <cub/block/block_load.cuh>
    #include <cub/block/block_store.cuh>
    #include <cub/block/block_scan.cuh>
#else
    #include <hipcub/hipcub.hpp>
    namespace cub = hipcub;
#endif

to:

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_scan.cuh>

and, at the original lines 332-339, inside the function void selective_scan_fwd_launch, change

                        #ifndef USE_ROCM
                        C10_CUDA_CHECK(cudaFuncSetAttribute(
                            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                        #else
                        C10_CUDA_CHECK(cudaFuncSetAttribute(
                            (void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                            std::cerr << "Warning (selective_scan_fwd_kernel): attempting to set maxDynamicSharedMemorySize on an AMD GPU which is currently a non-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
                        #endif

to:

                        C10_CUDA_CHECK(cudaFuncSetAttribute(
                            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));

Likewise, for csrc/selective_scan/selective_scan_bwd_kernel.cuh, change the block at the top,

#ifndef USE_ROCM
    #include <cub/block/block_load.cuh>
    #include <cub/block/block_store.cuh>
    #include <cub/block/block_scan.cuh>
    #include <cub/block/block_reduce.cuh>
#else
    #include <hipcub/hipcub.hpp>
    namespace cub = hipcub;
#endif

to:

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_scan.cuh>
#include <cub/block/block_reduce.cuh>

and, at the original lines 515-522, inside the function void selective_scan_bwd_launch, change

                            #ifndef USE_ROCM
                            C10_CUDA_CHECK(cudaFuncSetAttribute(
                                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                            #else
                            C10_CUDA_CHECK(cudaFuncSetAttribute(
                                (void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                            std::cerr << "Warning (selective_scan_bwd_kernel): attempting to set maxDynamicSharedMemorySize on an AMD GPU which is currently a non-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
                            #endif

to:

                            C10_CUDA_CHECK(cudaFuncSetAttribute(
                                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));

4) With the modifications done, run:

pip install . --no-build-isolation --verbose

to finish building mamba-ssm 2.2.2.

5) I have also built the Windows wheel mamba_ssm-2.2.2 (only for torch 2.1, compute capabilities 6.0-9.0), available in the "Mamba2 on Windows installation package bundle" for direct download. Install it with:

pip install mamba_ssm-2.2.2-cp310-cp310-win_amd64.whl

Since selective_scan_cuda is not bypassed this time, a selective_scan_cuda.cp310-win_amd64.pyd file is produced in the virtual environment (xxx\conda\envs\xxx\Lib\site-packages\).
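
To confirm the compiled scan kernel is in place, a short smoke test (recent mamba_ssm releases define a __version__ attribute; if yours does not, the two imports alone are a sufficient check):

import selective_scan_cuda  # succeeds only if the .pyd above is importable
import mamba_ssm

print(mamba_ssm.__version__)  # expect 2.2.2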

4. Verifying that the environment runs

Following the official readme, run the example below:

import torch
from mamba_ssm import Mamba
from mamba_ssm import Mamba2

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=16,  # SSM state expansion factor
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
print('Mamba:', x.shape)

batch, length, dim = 2, 64, 256
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=64,  # SSM state expansion factor, typically 64 or 128
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
print('Mamba2:', x.shape)

The script should finish without errors, printing Mamba: torch.Size([2, 64, 16]) and Mamba2: torch.Size([2, 64, 256]). (If you hit a KeyError: 'HOME', see the fix later in this post.)

Problems encountered while building causal-conv1d on Windows

1. error C2146: syntax error: missing ")" before identifier "and"

The following error appeared while building causal-conv1d (the cl.exe messages below are from a Chinese-locale MSVC):

      cl: 命令行 warning D9002 :忽略未知选项“-O3”
      causal_conv1d.cpp
      csrc/causal_conv1d.cpp(159): error C2146: 语法错误: 缺少“)”(在标识符“and”的前面)
      csrc/causal_conv1d.cpp(159): error C2143: 语法错误: 缺少“;”(在“{”的前面)
      csrc/causal_conv1d.cpp(277): error C2146: 语法错误: 缺少“)”(在标识符“and”的前面)
      csrc/causal_conv1d.cpp(277): error C2143: 语法错误: 缺少“;”(在“{”的前面)
      csrc/causal_conv1d.cpp(278): error C2146: 语法错误: 缺少“)”(在标识符“and”的前面)
      csrc/causal_conv1d.cpp(278): error C2143: 语法错误: 缺少“;”(在“{”的前面)
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.34.31933\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for causal_conv1d
  Running setup.py clean for causal_conv1d
Failed to build causal_conv1d
ERROR: Could not build wheels for causal_conv1d, which is required to install pyproject.toml-based projects

Solution:
As the error messages indicate, change "and" to "&&" on lines 159, 277 and 279 of csrc/causal_conv1d.cpp.

2. causal_conv1d_bwd.cu(250): error and causal_conv1d_fwd.cu(140): error

The following errors appeared while building causal-conv1d:

E:\user\Projects\user\causal-conv1d-main\csrc\causal_conv1d_bwd.cu(250): error: expected an expression
                detected during:
                  instantiation of "void causal_conv1d_bwd_launch<kNThreads,kWidth,input_t,weight_t>(ConvParamsBwd &, cudaStream_t) [with kNThreads=128, kWidth=4, input_t=c10::Half, weight_t=float]"
      (283): here
                  instantiation of "void causal_conv1d_bwd_cuda<input_t,weight_t>(ConvParamsBwd &, cudaStream_t) [with input_t=c10::Half, weight_t=float]"
      (610): here

      E:\user\Projects\user\causal-conv1d-main\csrc\causal_conv1d_bwd.cu(250): error: expected a ";"
                detected during:
                  instantiation of "void causal_conv1d_bwd_launch<kNThreads,kWidth,input_t,weight_t>(ConvParamsBwd &, cudaStream_t) [with kNThreads=128, kWidth=4, input_t=c10::Half, weight_t=float]"
      (283): here
                  instantiation of "void causal_conv1d_bwd_cuda<input_t,weight_t>(ConvParamsBwd &, cudaStream_t) [with input_t=c10::Half, weight_t=float]"
      (610): here

      Error limit reached.
      100 errors detected in the compilation of "csrc/causal_conv1d_bwd.cu".
      Compilation terminated.
      error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.6\\bin\\nvcc.exe' failed with exit code 4294967295
      [end of output]

and:


  E:\user\Projects\user\causal-conv1d-main\csrc\causal_conv1d_fwd.cu(140): error: expected an expression
                detected during:
                  instantiation of "void causal_conv1d_fwd_launch<kNThreads,kWidth,input_t,weight_t>(ConvParamsBase &, cudaStream_t) [with kNThreads=128, kWidth=4, input_t=float, weight_t=c10::Half]"
      (171): here

  E:\user\Projects\user\causal-conv1d-main\csrc\causal_conv1d_fwd.cu(60): warning #177-D: variable "smem_store_vec" was declared but never referenced
                detected during:
                  instantiation of "void causal_conv1d_fwd_kernel<Ktraits>(ConvParamsBase) [with Ktraits=Causal_conv1d_fwd_kernel_traits<128, 4, false, float, c10::Half>]"
      (140): here
                  instantiation of "void causal_conv1d_fwd_launch<kNThreads,kWidth,input_t,weight_t>(ConvParamsBase &, cudaStream_t) [with kNThreads=128, kWidth=4, input_t=float, weight_t=c10::Half]"
      (171): here
                  instantiation of "void causal_conv1d_fwd_cuda<input_t,weight_t>(ConvParamsBase &, cudaStream_t) [with input_t=float, weight_t=c10::Half]"
      (384): here
      

Solution:
Comment out the USE_ROCM-related parts in csrc/causal_conv1d_bwd.cu and csrc/causal_conv1d_fwd.cu. (ROCm is the AMD-GPU dependency, so it can be dropped when building for NVIDIA GPUs; see the readme.)

Problems encountered while building mamba2 on Windows

1. selective_scan_bwd_kernel.cuh(504): error

The following error appeared while building mamba_ssm:

E:\user\Projects\user\mamba-main\csrc\selective_scan\selective_scan_bwd_kernel.cuh(504): error: expected a ";"
                detected during:
                  instantiation of "void selective_scan_bwd_launch<kNThreads,kNItems,input_t,weight_t>(SSMParamsBwd &, cudaStream_t) [with kNThreads=32, kNItems=4, input_t=c10::BFloat16, weight_t=float]"
      (545): here
                  instantiation of "void selective_scan_bwd_cuda<input_t,weight_t>(SSMParamsBwd &, cudaStream_t) [with input_t=c10::BFloat16, weight_t=float]"
      E:\user\Projects\user\mamba-main\csrc\selective_scan\selective_scan_bwd_bf16_real.cu(9): here

Solution:
As with causal-conv1d, comment out the USE_ROCM-related parts in csrc/selective_scan/selective_scan_fwd_kernel.cuh and csrc/selective_scan/selective_scan_bwd_kernel.cuh.

2. Building wheel for mamba_ssm hangs indefinitely

To see which build step it is stuck on, add --verbose to the install command:

pip install . --no-build-isolation --verbose

Here --no-build-isolation makes pip build inside the current environment, so the torch that is already installed is used instead of being re-resolved in an isolated build environment (see the readme).

Problems encountered at runtime

1. KeyError: 'HOME'

Running the example script above, an error appears at the Mamba2 step:

Traceback (most recent call last):
  File "E:\user\Projects\user\mamba-main\temp.py", line 27, in <module>
    y = model(x)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\modules\mamba2.py", line 185, in forward
    out = mamba_split_conv1d_scan_combined(
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_combined.py", line 930, in mamba_split_conv1d_scan_combined
    return MambaSplitConv1dScanCombinedFn.apply(zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, initial_states, seq_idx, dt_limit, return_final_states, activation, rmsnorm_weight, rmsnorm_eps, outproj_weight, outproj_bias, headdim, ngroups, norm_before_gate)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_combined.py", line 795, in forward
    out_x, _, dt_out, dA_cumsum, states, final_states = _mamba_chunk_scan_combined_fwd(x, dt, A, B, C, chunk_size=chunk_size, D=D, z=None, dt_bias=dt_bias, initial_states=initial_states, seq_idx=seq_idx, dt_softplus=True, dt_limit=dt_limit)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_combined.py", line 312, in _mamba_chunk_scan_combined_fwd
    dA_cumsum, dt = _chunk_cumsum_fwd(dt, A, chunk_size, dt_bias=dt_bias, dt_softplus=dt_softplus, dt_limit=dt_limit)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_chunk_state.py", line 675, in _chunk_cumsum_fwd
    _chunk_cumsum_fwd_kernel[grid_chunk_cs](
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 73, in run
    timings = {
    
    config: self._bench(*args, config=config, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 73, in <dictcomp>
    timings = {
    
    config: self._bench(*args, config=config, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 63, in _bench
    return do_bench(kernel_call)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\testing.py", line 136, in do_bench
    fn()
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "<string>", line 41, in _chunk_cumsum_fwd_kernel
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1230, in compile
    so_cache_manager = CacheManager(so_cache_key)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1102, in __init__
    self.cache_dir = os.environ.get('TRITON_CACHE_DIR', default_cache_dir())
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1093, in default_cache_dir
    return os.path.join(os.environ["HOME"], ".triton", "cache")
  File "E:\user\conda\envs\mamba\lib\os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'

as well as:

Traceback (most recent call last):
  File "E:\user\Projects\user\mamba-main\temp.py", line 27, in <module>
    y = model(x)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\modules\mamba2.py", line 185, in forward
    out = mamba_split_conv1d_scan_combined(
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_combined.py", line 931, in mamba_split_conv1d_scan_combined
    return mamba_split_conv1d_scan_ref(zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, dt_limit, activation, rmsnorm_weight, rmsnorm_eps, outproj_weight, outproj_bias, headdim, ngroups, norm_before_gate)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\ssd_combined.py", line 977, in mamba_split_conv1d_scan_ref
    out = rmsnorm_fn(out, rmsnorm_weight, None, z=rearrange(z, "b l h p -> b l (h p)"), eps=rmsnorm_eps,
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\layernorm_gated.py", line 385, in rmsnorm_fn
    return LayerNormFn.apply(x, weight, bias, z, eps, group_size, norm_before_gate, True)
  File "E:\user\conda\envs\mamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\layernorm_gated.py", line 359, in forward
    y, mean, rstd = _layer_norm_fwd(x, weight, bias, eps, z=z, group_size=group_size, norm_before_gate=norm_before_gate, is_rms_norm=is_rms_norm)
  File "E:\user\conda\envs\mamba\lib\site-packages\mamba_ssm\ops\triton\layernorm_gated.py", line 140, in _layer_norm_fwd
    _layer_norm_fwd_1pass_kernel[grid](x, out, weight, bias, z, mean, rstd,
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\runtime\autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 41, in _layer_norm_fwd_1pass_kernel
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1230, in compile
    so_cache_manager = CacheManager(so_cache_key)
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1102, in __init__
    self.cache_dir = os.environ.get('TRITON_CACHE_DIR', default_cache_dir())
  File "E:\user\conda\envs\mamba\lib\site-packages\triton\compiler.py", line 1093, in default_cache_dir
    return os.path.join(os.environ["HOME"], ".triton", "cache")
  File "E:\user\conda\envs\mamba\lib\os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'

This happens because triton is not yet compatible with Windows, so @triton.jit cannot execute.
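
Note that, in the traceback, default_cache_dir() is evaluated eagerly as the default argument of os.environ.get, so setting TRITON_CACHE_DIR alone does not avoid the KeyError. Defining HOME before triton is used does silence it (an untested sketch on my side):

import os

# expanduser("~") falls back to USERPROFILE on Windows when HOME is unset,
# so this gives os.environ["HOME"] a sensible value for triton's cache path.
os.environ.setdefault("HOME", os.path.expanduser("~"))

This only defers the failure: later triton JIT steps still break on Windows (e.g. the "failed to find C compiler" error discussed below), which is why the code-level bypass that follows is needed.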

There are two solutions:

1) Use triton-windows: it is at version 3.1.0 and requires torch >= 2.4.0 and CUDA 12.x. It claims to support triton.jit and torch.compile; I have not verified this.
2) Bypass the code in question by editing the following files in the virtual environment (xxx\conda\envs\xxx\Lib\site-packages\mamba_ssm):

  • For the first error, edit line 930 of mamba_ssm\ops\triton\ssd_combined.py, changing
return MambaSplitConv1dScanCombinedFn.apply(zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, initial_states, seq_idx, dt_limit, return_final_states, activation, rmsnorm_weight, rmsnorm_eps, outproj_weight, outproj_bias, headdim, ngroups, norm_before_gate)

to:

return mamba_split_conv1d_scan_ref(zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, dt_limit, activation, rmsnorm_weight, rmsnorm_eps, outproj_weight, outproj_bias, headdim, ngroups, norm_before_gate)
  • For the second error, edit line 385 of mamba_ssm\ops\triton\layernorm_gated.py, changing
return LayerNormFn.apply(x, weight, bias, z, eps, group_size, norm_before_gate, True)

to:

return rms_norm_ref(x, weight, bias, z, eps, group_size, norm_before_gate, True)

That completes the bypass. After these edits, the Mamba2 forward pass goes through the pure-PyTorch reference implementations (mamba_split_conv1d_scan_ref, rms_norm_ref) instead of the triton kernels, so it works but typically runs slower.

In addition, if a triton error points at the layer_norm_fn function, also edit lines 886 and 920 of mamba_ssm/ops/triton/layer_norm.py (layer_norm_ref and rms_norm_ref are the reference implementations defined earlier in that same file, so no extra import is needed), changing

def layer_norm_fn(
    x,
    weight,
    bias,
    residual=None,
    x1=None,
    weight1=None,
    bias1=None,
    eps=1e-6,
    dropout_p=0.0,
    rowscale=None,
    prenorm=False,
    residual_in_fp32=False,
    is_rms_norm=False,
    return_dropout_mask=False,
):
    return LayerNormFn.apply(
        x,
        weight,
        bias,
        residual,
        x1,
        weight1,
        bias1,
        eps,
        dropout_p,
        rowscale,
        prenorm,
        residual_in_fp32,
        is_rms_norm,
        return_dropout_mask,
    )


def rms_norm_fn(
    x,
    weight,
    bias,
    residual=None,
    x1=None,
    weight1=None,
    bias1=None,
    eps=1e-6,
    dropout_p=0.0,
    rowscale=None,
    prenorm=False,
    residual_in_fp32=False,
    return_dropout_mask=False,
):
    return LayerNormFn.apply(
        x,
        weight,
        bias,
        residual,
        x1,
        weight1,
        bias1,
        eps,
        dropout_p,
        rowscale,
        prenorm,
        residual_in_fp32,
        True,
        return_dropout_mask,
    )

to:

def layer_norm_fn(
    x,
    weight,
    bias,
    residual=None,
    x1=None,
    weight1=None,
    bias1=None,
    eps=1e-6,
    dropout_p=0.0,
    rowscale=None,
    prenorm=False,
    residual_in_fp32=False,
    is_rms_norm=False,
    return_dropout_mask=False,
):
    return layer_norm_ref(
        x,
        weight,
        bias,
        residual,
        x1,
        weight1,
        bias1,
        eps,
        dropout_p,
        rowscale,
        prenorm,
        upcast=residual_in_fp32,
    )


def rms_norm_fn(
    x,
    weight,
    bias,
    residual=None,
    x1=None,
    weight1=None,
    bias1=None,
    eps=1e-6,
    dropout_p=0.0,
    rowscale=None,
    prenorm=False,
    residual_in_fp32=False,
    return_dropout_mask=False,
):
    return rms_norm_ref(
        x,
        weight,
        bias,
        residual,
        x1,
        weight1,
        bias1,
        eps,
        dropout_p,
        rowscale,
        prenorm,
        upcast=residual_in_fp32,
    )

2. About triton

Because triton officially supports only Linux, on Windows any function that calls into it will raise errors, including but not limited to:

  • KeyError: 'HOME'
  • RuntimeError: failed to find C compiler, Please specify via cc environment variable.

For the definitive fix, see "Ultimate Guide: Installing the Mamba / Vim / Vmamba Environment on Windows" and "Ultimate Guide: Installing Mamba2 / Vim / Vmamba on Windows, Problems and Solutions (no triton bypass needed)".

Notes on the paid wheels

  1. For both Linux and Windows, the entire build process for Mamba, Vim and Vmamba, together with every problem you may encounter, is fully documented in this open blog series; many readers have compiled everything successfully by following it.
  2. If funds are tight but you have time to spare, please build it yourself by following this tutorial and consult the rest of the series when problems come up; buying from any channel is discouraged!!!
  3. A convenience option for readers short on time: [causal_conv1d-1.4.0-cp310-cp310-win_amd64.whl]; [mamba_ssm-2.2.2-cp310-cp310-win_amd64.whl]; [Mamba2 on Windows installation package bundle].
  4. As my time is limited, full after-sales support is provided only to paying readers; the packages themselves have no intrinsic value, but guided installs take a lot of my time, so the fee is effectively a consulting fee. Other readers are helped as time allows.
  5. When using my wheels, make sure your Python, torch and CUDA versions match this post exactly; otherwise you will hit DLL load failed errors. For customized environment versions, message me on WeChat.
  6. Many copies of this tutorial series circulate online (even environment choices I made on a whim have become the "standard configuration" of those copies). Please rely on this series for authoritative answers and treat other channels with caution to avoid scams.