vllm.config.parallel
DistributedExecutorBackend module-attribute
¶
DistributedExecutorBackend = Literal[
"ray", "mp", "uni", "external_launcher"
]
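As a minimal sketch of how a `Literal` alias like this can be used for validation, the snippet below redefines the alias locally and checks a string against its allowed values. The helper name `is_known_backend` is hypothetical and is not part of vLLM.

```python
from typing import Literal, get_args

# Local copy of the alias shown above, so the sketch runs without vLLM.
DistributedExecutorBackend = Literal["ray", "mp", "uni", "external_launcher"]


def is_known_backend(name: str) -> bool:
    """Hypothetical helper: True if `name` is one of the literal values."""
    return name in get_args(DistributedExecutorBackend)
```

Here `is_known_backend("mp")` is true, while an unrelated string such as `"nccl"` is rejected.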
ParallelConfig ¶
Configuration for the distributed execution.
Source code in vllm/config/parallel.py
data_parallel_backend class-attribute
instance-attribute
¶
data_parallel_backend: str = 'mp'
Backend to use for data parallel, either "mp" or "ray".
data_parallel_external_lb class-attribute
instance-attribute
¶
data_parallel_external_lb: bool = False
Whether to use "external" DP LB mode. Applies only to online serving and when data_parallel_size > 0. This is useful for a "one-pod-per-rank" wide-EP setup in Kubernetes. Set implicitly when --data-parallel-rank is passed explicitly to vllm serve.
data_parallel_hybrid_lb class-attribute
instance-attribute
¶
data_parallel_hybrid_lb: bool = False
Whether to use "hybrid" DP LB mode. Applies only to online serving and when data_parallel_size > 0. Enables running an AsyncLLM and API server on a "per-node" basis where vLLM load balances between local data parallel ranks, but an external LB balances between vLLM nodes/replicas. Set explicitly in conjunction with --data-parallel-start-rank.
data_parallel_master_ip class-attribute
instance-attribute
¶
data_parallel_master_ip: str = '127.0.0.1'
IP of the data parallel master.
data_parallel_master_port class-attribute
instance-attribute
¶
data_parallel_master_port: int = 29500
Port of the data parallel master.
data_parallel_rank class-attribute
instance-attribute
¶
data_parallel_rank: int = 0
Rank of the data parallel group.
data_parallel_rank_local class-attribute
instance-attribute
¶
Local rank of the data parallel group, set only in SPMD mode.
data_parallel_rpc_port class-attribute
instance-attribute
¶
data_parallel_rpc_port: int = 29550
Port for data parallel messaging.
data_parallel_size class-attribute
instance-attribute
¶
data_parallel_size: int = 1
Number of data parallel groups. MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
data_parallel_size_local class-attribute
instance-attribute
¶
data_parallel_size_local: int = 1
Number of local data parallel groups.
disable_custom_all_reduce class-attribute
instance-attribute
¶
disable_custom_all_reduce: bool = False
Disable the custom all-reduce kernel and fall back to NCCL.
distributed_executor_backend class-attribute
instance-attribute
¶
distributed_executor_backend: Optional[
Union[DistributedExecutorBackend, type[ExecutorBase]]
] = None
Backend to use for distributed model workers, either "ray" or "mp" (multiprocessing). If the product of pipeline_parallel_size and tensor_parallel_size is less than or equal to the number of GPUs available, "mp" will be used to keep processing on a single host. Otherwise, this defaults to "ray" if Ray is installed and fails otherwise. Note that TPU only supports Ray for distributed inference.
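The default-selection rule described above can be sketched as follows. This is a simplified illustration only: the function name `pick_default_backend` and its parameters are assumptions made for this example, and the real logic in vLLM may consider additional factors such as the platform.

```python
def pick_default_backend(
    pipeline_parallel_size: int,
    tensor_parallel_size: int,
    gpus_on_host: int,
    ray_installed: bool,
) -> str:
    """Hypothetical sketch of the documented default for the executor backend."""
    workers_needed = pipeline_parallel_size * tensor_parallel_size
    if workers_needed <= gpus_on_host:
        # Everything fits on one host, so multiprocessing is enough.
        return "mp"
    if ray_installed:
        # More workers than local GPUs: fall back to Ray.
        return "ray"
    raise ValueError(
        "Not enough local GPUs for the requested parallelism, "
        "and Ray is not installed."
    )
```

For example, with 8 local GPUs a TP=4, PP=1 setup would select `"mp"`, while TP=8, PP=2 would select `"ray"` when Ray is available.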
enable_eplb class-attribute
instance-attribute
¶
enable_eplb: bool = False
Enable expert parallelism load balancing for MoE layers.
enable_expert_parallel class-attribute
instance-attribute
¶
enable_expert_parallel: bool = False
Use expert parallelism instead of tensor parallelism for MoE layers.
enable_multimodal_encoder_data_parallel class-attribute
instance-attribute
¶
enable_multimodal_encoder_data_parallel: bool = False
Use data parallelism instead of tensor parallelism for the vision encoder. Only Llama4 is supported for now.
eplb_log_balancedness class-attribute
instance-attribute
¶
eplb_log_balancedness: bool = False
Log the balancedness at each step of expert parallelism. This is turned off by default because it adds communication overhead.
eplb_step_interval class-attribute
instance-attribute
¶
eplb_step_interval: int = 3000
Interval for rearranging experts in expert parallelism. Note that if this is greater than the EPLB window size, only the metrics of the last eplb_window_size steps will be used for rearranging experts.
eplb_window_size class-attribute
instance-attribute
¶
eplb_window_size: int = 1000
Window size for expert load recording.
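To illustrate how eplb_step_interval and eplb_window_size interact, here is a small sketch that models the load window as a bounded deque. The function `steps_used_for_rearrangement` is hypothetical and only mimics the documented behavior; it is not how vLLM stores its metrics.

```python
from collections import deque


def steps_used_for_rearrangement(step_interval: int, window_size: int) -> list[int]:
    """Hypothetical model: which step indices are still in the window
    when a rearrangement fires after `step_interval` steps."""
    window: deque[int] = deque(maxlen=window_size)  # older steps fall out
    for step in range(step_interval):
        window.append(step)
    return list(window)
```

With the defaults shown above (an interval of 3000 and a window of 1000), only the most recent 1000 steps remain in the window when experts are rearranged.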
max_parallel_loading_workers class-attribute
instance-attribute
¶
Maximum number of parallel loading workers when loading the model sequentially in multiple batches. Use this to avoid running out of RAM when using tensor parallelism with large models.
num_redundant_experts class-attribute
instance-attribute
¶
num_redundant_experts: int = 0
Number of redundant experts to use for expert parallelism.
pipeline_parallel_size class-attribute
instance-attribute
¶
pipeline_parallel_size: int = 1
Number of pipeline parallel groups.
placement_group class-attribute
instance-attribute
¶
placement_group: Optional[PlacementGroup] = None
Ray placement group for distributed model workers.
ray_runtime_env class-attribute
instance-attribute
¶
ray_runtime_env: Optional[RuntimeEnv] = None
Ray runtime environment to pass to distributed workers.
ray_workers_use_nsight class-attribute
instance-attribute
¶
ray_workers_use_nsight: bool = False
Whether to profile Ray workers with nsight, see https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html#profiling-nsight-profiler.
sd_worker_cls class-attribute
instance-attribute
¶
sd_worker_cls: str = 'auto'
The full name of the worker class to use for speculative decoding. If "auto", the worker class will be determined based on the platform.
tensor_parallel_size class-attribute
instance-attribute
¶
tensor_parallel_size: int = 1
Number of tensor parallel groups.
worker_cls class-attribute
instance-attribute
¶
worker_cls: str = 'auto'
The full name of the worker class to use. If "auto", the worker class will be determined based on the platform.
worker_extension_cls class-attribute
instance-attribute
¶
worker_extension_cls: str = ''
The full name of the worker extension class to use. The worker extension class is dynamically inherited by the worker class. This is used to inject new attributes and methods to the worker class for use in collective_rpc calls.
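One way to picture "dynamically inherited" is to build a new class at runtime that has both the worker class and the extension class as bases. The sketch below uses placeholder classes (`Worker`, `MyExtension`) and a hypothetical helper `extend_worker_class`; the mechanism vLLM actually uses may differ.

```python
class Worker:
    """Placeholder standing in for a platform worker class."""

    def execute(self) -> str:
        return "executed"


class MyExtension:
    """Placeholder extension adding a method callable on the worker."""

    def describe(self) -> str:
        return f"{type(self).__name__} with extension"


def extend_worker_class(worker_cls: type, extension_cls: type) -> type:
    """Hypothetical helper: return a class inheriting from both inputs."""
    return type(worker_cls.__name__, (worker_cls, extension_cls), {})
```

An instance of the class returned by `extend_worker_class(Worker, MyExtension)` exposes both `execute` and `describe`, which is the property that makes extension methods reachable from the worker.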
world_size class-attribute
instance-attribute
¶
world_size is TP x PP (tensor parallel size times pipeline parallel size). It determines the number of workers created.
world_size_across_dp property
¶
world_size_across_dp: int
world_size_across_dp is TP x PP x DP. It is the size of the world including data parallelism.
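The two sizes reduce to simple products of the parallel dimensions. The standalone functions below are illustrative only; in ParallelConfig these values are an attribute and a property rather than free functions.

```python
def world_size(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """TP x PP: number of workers in one data parallel replica."""
    return tensor_parallel_size * pipeline_parallel_size


def world_size_across_dp(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    data_parallel_size: int,
) -> int:
    """TP x PP x DP: total workers across all data parallel replicas."""
    return (
        world_size(tensor_parallel_size, pipeline_parallel_size)
        * data_parallel_size
    )
```

For instance, TP=4, PP=2, DP=2 gives 8 workers per replica and 16 in total.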
__post_init__ ¶
_verify_args ¶
_verify_args() -> Self
compute_hash ¶
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
get_next_dp_init_port ¶
get_next_dp_init_port() -> int
We might need to initialize process groups related to data parallelism in multiple processes, e.g. both in the worker and in the engine, which can live in different processes. To avoid port conflicts, we increment the port number each time we need to initialize a new process group related to data parallelism.
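The increment-on-use behavior can be sketched with a small stand-in class. `DPPortAllocator` is a hypothetical name used only for this example; it assumes the counter starts at the data_parallel_master_port default shown earlier and is not the actual ParallelConfig implementation.

```python
from dataclasses import dataclass


@dataclass
class DPPortAllocator:
    """Hypothetical stand-in for the port counter kept on the config."""

    data_parallel_master_port: int = 29500

    def get_next_dp_init_port(self) -> int:
        # Hand out the current port, then advance so the next
        # process group gets a different one.
        port = self.data_parallel_master_port
        self.data_parallel_master_port += 1
        return port
```

Each call returns a distinct port, so two process groups initialized from the same config object do not collide.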