vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe
Utility helpers for the NVFP4 + FlashInfer fused-MoE path.
__all__ module-attribute ¶
__all__ = [
    "is_flashinfer_fp4_cutlass_moe_available",
    "reorder_w1w3_to_w3w1",
    "build_flashinfer_fp4_cutlass_moe_kernel",
    "flashinfer_fp4_cutlass_moe_forward",
]
build_flashinfer_fp4_cutlass_moe_kernel ¶
build_flashinfer_fp4_cutlass_moe_kernel(
    moe_parallel_config: FusedMoEParallelConfig,
) -> FusedMoEModularKernel
Create and return a FlashInfer CUTLASS fused-MoE modular kernel.
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
flashinfer_fp4_cutlass_moe_forward ¶
flashinfer_fp4_cutlass_moe_forward(
    fused_experts: FusedMoEModularKernel,
    layer: Module,
    x: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    activation: str,
    global_num_experts: int,
    expert_map: Optional[Tensor],
    apply_router_weight_on_input: bool,
) -> Tensor
Common forward wrapper for the FlashInfer NV-FP4 fused-MoE path.
is_flashinfer_fp4_cutlass_moe_available ¶
is_flashinfer_fp4_cutlass_moe_available() -> bool
Return True when FlashInfer CUTLASS NV-FP4 kernels can be used.
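A minimal sketch of how a caller might gate backend selection on this check. The import path and function name come from this page; the helper names `_load_availability_check` and `choose_moe_backend` and the backend labels are hypothetical, chosen only for illustration. The sketch falls back to "not available" when vLLM cannot be imported.

```python
from typing import Callable


def _load_availability_check() -> Callable[[], bool]:
    """Return the vLLM availability check, or a stub if vLLM is not importable."""
    try:
        from vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe import (
            is_flashinfer_fp4_cutlass_moe_available,
        )
        return is_flashinfer_fp4_cutlass_moe_available
    except Exception:
        # Assumption: treat any import failure as "kernels not available".
        return lambda: False


def choose_moe_backend() -> str:
    """Pick a backend label (hypothetical names) based on the availability check."""
    check = _load_availability_check()
    try:
        available = bool(check())
    except Exception:
        # Assumption: a failing check is treated the same as "not available".
        available = False
    return "flashinfer_cutlass" if available else "fallback"


backend = choose_moe_backend()
print(f"Selected NV-FP4 MoE backend: {backend}")
```

Resolving the check once at setup time, rather than on every forward pass, keeps the decision out of the hot path.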
reorder_w1w3_to_w3w1 ¶
Re-order the concatenated [w1, w3] tensors to [w3, w1].
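The reordering can be illustrated with a small standard-library sketch. This is not vLLM's implementation, which operates on tensors and whose signature is not shown on this page; the helper `swap_halves` is hypothetical and assumes the two projections are stacked along the row axis in equal halves.

```python
from typing import List, Sequence


def swap_halves(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the two equal row-halves of a stacked matrix: [w1, w3] -> [w3, w1]."""
    n = len(rows)
    if n % 2 != 0:
        raise ValueError(f"expected an even number of rows, got {n}")
    half = n // 2
    w1, w3 = rows[:half], rows[half:]
    # Copy each row so the result does not alias the input.
    return [list(r) for r in w3] + [list(r) for r in w1]


# Toy projections: two rows each, stacked as [w1, w3].
w1 = [[1.0, 2.0], [3.0, 4.0]]
w3 = [[5.0, 6.0], [7.0, 8.0]]
fused = w1 + w3

reordered = swap_halves(fused)
print(reordered)
```

Applying the swap twice restores the original layout. Under the same assumption, any per-row scale factors that accompany the weights would need the same swap to stay aligned.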
select_nvfp4_gemm_impl ¶
select_nvfp4_gemm_impl(allow_flashinfer: bool, moe, logger)
Return a GEMM experts implementation for NV-FP4 fused-MoE layers.