Preface

Flash Attention is almost mandatory for LLM training and inference because of the speedup it brings. Usually pip install flash-attn is enough, but for specific hardware (e.g., A800) or bleeding-edge features you sometimes have to build from source.

After a few rounds of frustration, here’s the end-to-end process I used to build flash-attn from source on an A800 server, plus the traps to avoid.

Environment

Hardware/Software

  • Server: 64-core CPU, 512 GB RAM
  • GPU: NVIDIA A800 80 GB (PCIe)
  • CUDA: 12.2
  • PyTorch: 2.7.1+cu118

Official README

Requirements:

  • CUDA toolkit or ROCm toolkit
  • PyTorch 2.2 and above.
  • packaging Python package (pip install packaging)
  • ninja Python package (pip install ninja) *
  • Linux. Might work for Windows starting v2.3.2 (positive reports), but Windows builds need more testing.

* Make sure ninja is installed and works (ninja --version should exit 0). If not, reinstall (pip uninstall -y ninja && pip install ninja). Without ninja, builds can take ~2h single-threaded; with ninja, 3–5 minutes on a 64-core box.

To install:

pip install flash-attn --no-build-isolation

Or build from source:

python setup.py install

If your machine has less than 96GB RAM and lots of CPU cores, limit parallel jobs with MAX_JOBS.

A small rant: “3–5 minutes on a 64-core machine” must be on some dream hardware…

Core Build Steps

Upgrade core build tools

Keep pip, wheel, and setuptools current to dodge weird slowdowns.

pip install pip wheel setuptools -U

Old setuptools can be painfully slow—ask me how I know.
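
If you want to confirm what actually ended up in the environment before building, a quick check (plain pip/Python introspection, nothing flash-attn-specific):

pip --version
python -c "import setuptools, wheel; print(setuptools.__version__, wheel.__version__)"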

Install and verify ninja

The README stresses ninja. Without it, the build falls back to single-threaded and drags on forever.

  1. Install

    pip install ninja
    
  2. Verify

    The official check is ninja --version followed by echo $?, which should print 0 (the commands are spelled out just below). I prefer a visual check: start the build, open another terminal, run htop, and watch for dozens of cicc/nvcc processes pegging the CPUs. If they never appear, reinstall ninja with pip uninstall -y ninja && pip install ninja.

    (Screenshot: ninja enables multithreaded compilation)
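
    Spelled out as shell commands (a plain transcription of the check described above; the last line is only needed if the version check fails):

    ninja --version
    echo $?    # should print 0
    pip uninstall -y ninja && pip install ninja    # only if the check above fails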

Set critical env vars

These variables determine whether the build succeeds at all and how long it takes.

  1. FLASH_ATTN_CUDA_ARCHS: limit architectures

    Tell the compiler which compute capability to target. A100 and A800 are compute capability 8.0, so set it to 80. Without it, the build generates code for every supported architecture, which wastes a lot of time. (If you're unsure of your GPU's compute capability, see the quick check after this list.)

    export FLASH_ATTN_CUDA_ARCHS="80"
    
  2. MAX_JOBS: parallel build workers

    Controls how many parallel compile jobs ninja runs. More jobs mean a faster build, but each job needs its own chunk of RAM.

    • My run: On 64 cores / 512GB RAM, I set MAX_JOBS=64.
    • Result: 64 cores maxed, peak RAM >300GB, build took ~10–15 minutes. When RAM usage in htop drops, the build is almost done.
    • Tip: If RAM is tighter, start with half the cores, e.g., MAX_JOBS=32.
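
If you are unsure of your GPU's compute capability or how much headroom you have for parallel jobs, a quick pre-flight check before exporting anything (standard nvidia-smi / PyTorch / Linux commands; the compute_cap query needs a reasonably recent driver):

nvidia-smi --query-gpu=name,compute_cap --format=csv    # 8.0 on A100/A800 -> FLASH_ATTN_CUDA_ARCHS="80"
python -c "import torch; print(torch.cuda.get_device_capability())"
free -g     # available RAM, to pick a sensible MAX_JOBS
nproc       # CPU core count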

Install

Combine everything and build:

FLASH_ATTN_CUDA_ARCHS="80" MAX_JOBS=64 pip install flash-attn --no-build-isolation

--no-build-isolation is critical: it makes pip build against the PyTorch, ninja, and packaging already installed in your environment instead of a throwaway isolated one (flash-attn's setup.py needs to import your installed torch at build time).
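
Once the wheel is built and installed, a minimal sanity check (flash_attn_func is the package's main attention entry point):

python -c "import flash_attn; print(flash_attn.__version__)"
python -c "from flash_attn import flash_attn_func; print('flash_attn_func import OK')"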

Troubleshooting

  • g++: fatal error: Killed signal terminated program cc1plus

    • Cause: OOM (too many parallel jobs). Lower MAX_JOBS (e.g., 64 → 32 → 16).
  • GCC too new

    • At the time of writing, the build required gcc-11/g++-11; a newer system GCC gets rejected by nvcc.

    • Fix (Ubuntu 22.04 example):

      apt install gcc-11 g++-11
      which gcc-11
      which g++-11
      export CC=/usr/bin/gcc-11
      export CXX=/usr/bin/g++-11
      
  • Build takes forever

    • Likely ninja isn’t actually used and the build is single-threaded. Verify/reinstall ninja.
  • CUDA-related build errors

    • Usually FLASH_ATTN_CUDA_ARCHS is wrong or the PyTorch and CUDA toolkit versions don't match. Double-check your GPU model against its compute capability, and compare nvidia-smi with nvcc --version (see the version check below).
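
A quick way to line up the three CUDA versions in play (driver, toolkit, and the one PyTorch was built against); these are standard commands, not flash-attn-specific:

nvidia-smi | head -n 4    # driver version and the highest CUDA runtime it supports
nvcc --version            # CUDA toolkit that will compile the extension
python -c "import torch; print(torch.__version__, torch.version.cuda)"    # CUDA version PyTorch was built with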

Alternatives

If source builds are too painful, try:

  1. Prebuilt pip wheels: pip install flash-attn sometimes works out of the box, but versions may lag and hardware support isn't guaranteed; the environment check after this list shows the details a matching wheel has to agree with.
  2. conda: conda install flash-attn -c conda-forge might provide a prebuilt package.
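
If you go the prebuilt-wheel route (option 1 above), prebuilt flash-attn wheels are specific to a Python/PyTorch/CUDA/ABI combination, so it helps to print yours first (standard PyTorch APIs only):

python - <<'EOF'
import sys, torch
print("python   :", sys.version.split()[0])
print("torch    :", torch.__version__)
print("cuda     :", torch.version.cuda)
print("cxx11 abi:", torch.compiled_with_cxx11_abi())
EOF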

Thanks

  • Flash Attention on GitHub
  • Issues in the official repo