
Summary of pitfalls and solutions for Mamba environment installation under Windows (without bypassing selective_scan_cuda)

posted on 2024-11-02 14:20



Background

The official Mamba code is at https://github.com/state-spaces/mamba. Building on the original blog post "Summary of Mamba environment installation pitfalls and solutions", this guide installs the Mamba environment without bypassing selective_scan_cuda, so that the same speed as on Linux can be achieved¹.

For installation issues, resources, or paper-collaboration ideas, add the author on WeChat: 931744281.

Environment preparation under Windows

  1. The preliminary environment preparation is the same as in the original blog post "Summary of Mamba environment installation pitfalls and solutions", specifically:
conda create -n mamba python=3.10
conda activate mamba
conda install cudatoolkit==11.8
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install setuptools==68.2.2
conda install nvidia/label/cuda-11.8.0::cuda-nvcc_win-64
conda install packaging
pip install triton-2.0.0-cp310-cp310-win_amd64.whl

For more information on triton-2.0.0-cp310-cp310-win_amd64.whl, please refer to the original blog post "Summary of Mamba environment installation pitfalls and solutions".

  2. The causal-conv1d installation is the same as in the original blog post "Summary of Mamba environment installation pitfalls and solutions"; the specific steps are:
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
git checkout v1.1.3  # optional; skip this step to install the latest version
set CAUSAL_CONV1D_FORCE_BUILD=TRUE
pip install .

There is no officially compiled whl of causal-conv1d for Windows, so it must be compiled manually with the steps above. I have compiled causal_conv1d-1.1.1-cp310-cp310-win_amd64.whl for Windows, which can also be downloaded and installed directly (only for torch 2.1):

pip install causal_conv1d-1.1.1-cp310-cp310-win_amd64.whl

After a successful installation, a causal_conv1d_cuda.cp310-win_amd64.pyd file is generated in the virtual environment (xxx\conda\envs\xxx\Lib\site-packages\); this is the compiled causal_conv1d_cuda extension module.
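To confirm that the compiled extension is visible to Python, you can locate it with importlib. This is a generic helper I'm sketching here (not part of the original post), demonstrated with a standard-library extension module since the mamba packages may not be installed:

```python
import importlib.util

def extension_location(module_name):
    """Return the file (or origin) backing an importable module, else None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None

# After a successful build, extension_location("causal_conv1d_cuda") should
# return a path ending in .pyd under site-packages. Demonstrated here with a
# standard-library extension module instead:
print(extension_location("_socket"))
```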

  3. Prepare the mamba-ssm environment and download the project files:
git clone https://github.com/state-spaces/mamba.git
cd mamba
git checkout v1.1.3   # optional; skip this step to install the latest version

Note that a newer version of mamba-ssm must be paired with a correspondingly new version of causal-conv1d; otherwise their functions are incompatible. After completing this preparation, proceed to the actual compilation.

Compiling mamba-ssm under Windows

  • Modify line 41 of setup.py in the mamba source code:
FORCE_BUILD = os.getenv("MAMBA_FORCE_BUILD", "TRUE") == "TRUE"
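This edit changes the default of the flag (upstream it presumably defaults to "FALSE") so that a local source build is forced instead of downloading a prebuilt wheel, which only exists for Linux. A minimal sketch of how the check behaves:

```python
def force_build(env):
    # Mirrors the modified setup.py line: a source build is forced unless the
    # environment variable is explicitly set to something other than "TRUE".
    return env.get("MAMBA_FORCE_BUILD", "TRUE") == "TRUE"

print(force_build({}))                               # True: default now forces a local build
print(force_build({"MAMBA_FORCE_BUILD": "FALSE"}))   # False: opt back out explicitly
```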
  • Change the void selective_scan_fwd_launch function in csrc/selective_scan/selective_scan_fwd_kernel.cuh to:
void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
    // Only kNRows == 1 is tested for now, which ofc doesn't differ from previously when we had each block
    // processing 1 row.
    static constexpr int kNRows = 1;
    BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
        BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
            BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
                BOOL_SWITCH(params.z_ptr != nullptr , kHasZ, [&] {
                    using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
                    // constexpr int kSmemSize = Ktraits::kSmemSize;
                    static constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
                    // printf("smem_size = %d\n", kSmemSize);
                    dim3 grid(params.batch, params.dim / kNRows);
                    auto kernel = &selective_scan_fwd_kernel<Ktraits>;
                    if (kSmemSize >= 48 * 1024) {
                        C10_CUDA_CHECK(cudaFuncSetAttribute(
                            kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
                    }
                    kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
                    C10_CUDA_KERNEL_LAUNCH_CHECK();
                });
            });
        });
    });
}
  • Change the BOOL_SWITCH macro in csrc/selective_scan/static_switch.h to:
#define BOOL_SWITCH(COND, CONST_NAME, ...)                                           \
    [&] {                                                                            \
        if (COND) {                                                                  \
            static constexpr bool CONST_NAME = true;                                        \
            return __VA_ARGS__();                                                    \
        } else {                                                                     \
            static constexpr bool CONST_NAME = false;                                       \
            return __VA_ARGS__();                                                    \
        }                                                                            \
    }()

(Both of these steps change constexpr to static constexpr.)

  • Add the following to the beginning of the csrc/selective_scan/selective_scan_bwd_kernel.cuh and csrc/selective_scan/selective_scan_fwd_kernel.cuh files:
#ifndef M_LOG2E
#define M_LOG2E 1.4426950408889634074
#endif
  • After completing the above modifications, running pip install . should generally compile and install successfully.
  • I also have a compiled whl for Windows, mamba-ssm-1.1.3 (only for torch 2.1); you can download it directly or contact me for it. The command to install from the whl is:
pip install mamba_ssm-1.1.3-cp310-cp310-win_amd64.whl

Since selective_scan_cuda is not bypassed in this build, a selective_scan_cuda.cp310-win_amd64.pyd file is generated in the virtual environment (xxx\conda\envs\xxx\Lib\site-packages\), so the code runs at full speed.

Problems and solutions when compiling mamba_ssm under Windows (2024-07-14)

1. Basic error information

If you compile the source code directly with `pip install .` without making any changes, the following error appears:

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
      
      The above exception was the direct cause of the following exception:
      
      Traceback (most recent call last):

        File "/data/xxx/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

This is only the generic error message; it is printed whenever any compilation error occurs. If you don't see the specific error above it, you can change the following line in setup.py:

cmdclass={"bdist_wheel": CachedWheelsCommand, "build_ext": BuildExtension}

Change to

cmdclass={"bdist_wheel": CachedWheelsCommand, "build_ext": BuildExtension.with_options(use_ninja=False)}

By default, PyTorch uses ninja as the build backend². Disabling it reveals the specific error message, but compilation becomes slower, so change it back once the bug is fixed.

Note: some bloggers change ['ninja', '-v'] in the lib/python3.6/site-packages/torch/utils/cpp_extension.py file of the Anaconda environment to ['ninja', '–v'] or ['ninja', '–version']. This is the wrong approach: it only hides the symptom without addressing the root cause.

2. “M_LOG2E” is undefined

Under Windows, a large number of errors will appear as follows:

 xxx\mamba-1.1.3\csrc\selective_scan\selective_scan_bwd_kernel.cuh(221): error: identifier "M_LOG2E" is undefined

For the reason why this happens, refer to the issue:

Note for the owners: The reason for needing #define is stated here: https://stackoverflow.com/a/56974843:
“On windows it is using the Microsoft compiler for that. So the Microsoft compiler is correct to disallow VLA, and there is no way to avoid this AFAIK. Your code works on linux, because on linux nvcc uses the g++ host compiler, and it allows (in a non-standard-compliant way) the use of a VLA in C++ host code.”

Therefore, just add the following to the csrc/selective_scan/selective_scan_bwd_kernel.cuh and csrc/selective_scan/selective_scan_fwd_kernel.cuh files:

#ifndef M_LOG2E
#define M_LOG2E 1.4426950408889634074
#endif
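For reference, M_LOG2E is the mathematical constant log2(e); the hard-coded value in the define above matches what a math library computes, as this small check (my addition) shows:

```python
import math

# M_LOG2E is log2(e). MSVC only exposes it when _USE_MATH_DEFINES is set,
# which is why the header guard above hard-codes the value instead.
M_LOG2E = 1.4426950408889634074
print(math.log2(math.e))  # 1.4426950408889634
```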

3. error C2975: “kIsVariableC_”

Under Windows, a large number of errors will appear as follows:

error C2975: 'kIsVariableC_': invalid template argument for 'Selective_Scan_bwd_kernel_traits'; a compile-time constant expression is expected

In csrc/selective_scan/static_switch.h, change constexpr to static constexpr in the BOOL_SWITCH macro (see the issue). For the specific steps, refer to the previous section.

4. error C2975: “kNRows_”

Under Windows, a large number of errors will appear as follows:

xxx\mamba-1.1.3\csrc\selective_scan\selective_scan_fwd_kernel.cuh(314): error C2975: 'kNRows_': invalid template argument for 'Selective_Scan_fwd_kernel_traits'; a compile-time constant expression is expected

In the void selective_scan_fwd_launch function of csrc/selective_scan/selective_scan_fwd_kernel.cuh, change constexpr to static constexpr (see the issue). For the specific steps, refer to the previous section.

2024-07-24 Update

5. ImportError: DLL load failed

Some readers reported the following error after installation:

ImportError: DLL load failed while importing causal_conv1d_cuda: The specified procedure could not be found.

or

ImportError: DLL load failed while importing selective_scan_cuda: The specified procedure could not be found.

Although causal_conv1d_cuda.cp310-win_amd64.pyd and selective_scan_cuda.cp310-win_amd64.pyd have been generated in the corresponding locations of the virtual environment, they still cannot be imported and called.

After checking the dependencies of these two extension modules (dependency-viewer screenshots omitted here), I found that both rely heavily on torch-related DLLs, so I guessed the error was caused by a torch version conflict. Uninstalling and reinstalling torch solved the problem. (One student had installed two different versions of torch in the same environment, causing the conflict.)

pip uninstall torch
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118

Since both whls I compiled were built against torch 2.1, the torch you install must also be 2.1; otherwise this error is reported when the related functions are called after installing from the whl.
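A quick way to confirm the pairing is to parse torch's version string; this helper is my own sketch (pure standard library), and the expected values assume the cu118 install from this post:

```python
def check_torch_build(version_string, want_release=(2, 1), want_cuda="cu118"):
    """Check a torch version string such as '2.1.1+cu118' against the release
    line and CUDA tag that this post's wheels were compiled for."""
    release, _, local = version_string.partition("+")
    major_minor = tuple(int(p) for p in release.split(".")[:2])
    return major_minor == want_release and local == want_cuda

print(check_torch_build("2.1.1+cu118"))   # True: matches the wheels above
print(check_torch_build("2.3.0+cu121"))   # False: would trigger the DLL error
```

In a live environment you would pass torch.__version__, e.g. check_torch_build(torch.__version__), before installing the whls.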


  1. Windows Support #12 ↩︎

  2. Solution to the error "subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1" ↩︎




Author: Ineverleft

Link: http://www.pythonblackhole.com/blog/article/245806/283225757dea40ef1250/

Source: python black hole net

Please credit the source for any form of reprinting. Any infringement discovered will be pursued legally.
