Merged
2 changes: 1 addition & 1 deletion doc/source/development/contributing_environment.rst
@@ -61,7 +61,7 @@ Conda environment. Here are the commands:

::

-conda install python=3.10
+conda install python=3.12
conda install nodejs

Install from source code
79 changes: 73 additions & 6 deletions doc/source/getting_started/environments.rst
@@ -23,15 +23,20 @@ necessary files such as logs and models, where ``<HOME>`` is the home
path of current user. You can change this directory by configuring this environment
variable.

-XINFERENCE_HEALTH_CHECK_ATTEMPTS
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The number of attempts for the health check at Xinference startup, if exceeded,
-will result in an error. The default value is 3.
+XINFERENCE_HEALTH_CHECK_FAILURE_THRESHOLD
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The maximum number of failed health checks tolerated at Xinference startup.
+Default value is 5.

XINFERENCE_HEALTH_CHECK_INTERVAL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The timeout duration for the health check at Xinference startup, if exceeded,
-will result in an error. The default value is 3.
+Health check interval (seconds) at Xinference startup.
+Default value is 5.

XINFERENCE_HEALTH_CHECK_TIMEOUT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Health check timeout (seconds) at Xinference startup.
Default value is 10.
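The three health-check variables above work together; a minimal sketch of tuning them for a slow-starting cluster, reusing the ``xinference-local`` launch shown elsewhere in these docs (the values are illustrative):

```shell
# Tolerate up to 10 failed probes, checking every 2 seconds,
# with a 5-second timeout per probe (illustrative values).
export XINFERENCE_HEALTH_CHECK_FAILURE_THRESHOLD=10
export XINFERENCE_HEALTH_CHECK_INTERVAL=2
export XINFERENCE_HEALTH_CHECK_TIMEOUT=5
xinference-local -H 0.0.0.0
```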

XINFERENCE_DISABLE_HEALTH_CHECK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -43,3 +48,65 @@ XINFERENCE_DISABLE_METRICS
Xinference will by default enable the metrics exporter on the supervisor and worker.
Setting this environment to 1 will disable the /metrics endpoint on the supervisor
and the HTTP service (only provide the /metrics endpoint) on the worker.
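Setting the variable inline keeps the override scoped to a single launch; a sketch, assuming a standard ``xinference-local`` start:

```shell
# Disable the /metrics endpoint on the supervisor and the
# metrics-only HTTP service on the worker for this launch only.
XINFERENCE_DISABLE_METRICS=1 xinference-local -H 0.0.0.0
```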

XINFERENCE_DOWNLOAD_MAX_ATTEMPTS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maximum download retry attempts for model files.
Default value is 3.

XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enable continuous batching for text-to-image models by specifying the target image size
(e.g., ``1024*1024``). Default is unset.
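A sketch of enabling it, with the size quoted so the shell does not expand the ``*``:

```shell
# Batch text-to-image requests at the 1024*1024 target size.
export XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE="1024*1024"
xinference-local -H 0.0.0.0
```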

XINFERENCE_SSE_PING_ATTEMPTS_SECONDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Server-Sent Events keepalive ping interval (seconds).
Default value is 600.

XINFERENCE_MAX_TOKENS
~~~~~~~~~~~~~~~~~~~~~
Global max tokens limit override for requests. Default is unset.

XINFERENCE_ALLOWED_IPS
~~~~~~~~~~~~~~~~~~~~~~
Restrict access to specified IPs or CIDR blocks. Default is unset (no restriction).

XINFERENCE_BATCH_SIZE
~~~~~~~~~~~~~~~~~~~~~
Default batch size used by the server when batching is enabled.
Default value is 32.

XINFERENCE_BATCH_INTERVAL
~~~~~~~~~~~~~~~~~~~~~~~~~
Default batching interval (seconds).
Default value is 0.003.

XINFERENCE_ALLOW_MULTI_REPLICA_PER_GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to allow multiple replicas on a single GPU.
Default value is 1 (enabled).

XINFERENCE_LAUNCH_STRATEGY
~~~~~~~~~~~~~~~~~~~~~~~~~~
GPU allocation strategy for replicas. Default is ``IDLE_FIRST_LAUNCH_STRATEGY``.

XINFERENCE_ENABLE_VIRTUAL_ENV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enable model virtual environments globally.
Default value is 1 (enabled, starting from v2.0).

XINFERENCE_VIRTUAL_ENV_SKIP_INSTALLED
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Skip packages already present in system site-packages when creating virtual environments.
Default value is 1.

XINFERENCE_CSG_TOKEN
~~~~~~~~~~~~~~~~~~~~
Authentication token for CSGHub model source.
Default is unset.

XINFERENCE_CSG_ENDPOINT
~~~~~~~~~~~~~~~~~~~~~~~
CSGHub endpoint for model source.
Default value is ``https://hub-stg.opencsg.com/``.
9 changes: 7 additions & 2 deletions doc/source/getting_started/installation.rst
@@ -43,12 +43,18 @@ PyTorch (transformers) supports the inference of most state-of-the-art models. It is

pip install "xinference[transformers]"

Notes:

- The transformers engine supports ``pytorch`` / ``gptq`` / ``awq`` / ``bnb`` / ``fp4`` formats.
- FP4 format requires ``transformers`` with ``FPQuantConfig`` support. If you see an import error,
please upgrade ``transformers`` to a newer version.
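A quick probe for the ``FPQuantConfig`` requirement mentioned above (a sketch; it only checks the import, not that fp4 inference works end to end):

```shell
# Succeeds only if the installed transformers build exposes FPQuantConfig.
python -c "from transformers import FPQuantConfig" 2>/dev/null \
  && echo "fp4 supported" \
  || echo "upgrade transformers for fp4 support"
```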


vLLM Backend
~~~~~~~~~~~~
vLLM is a fast and easy-to-use library for LLM inference and serving. Xinference will choose vLLM as the backend to achieve better throughput when the following conditions are met:

-- The model format is ``pytorch``, ``gptq`` or ``awq``.
+- The model format is ``pytorch``, ``gptq``, ``awq``, ``fp4``, ``fp8`` or ``bnb``.
- When the model format is ``pytorch``, the quantization is ``none``.
- When the model format is ``awq``, the quantization is ``Int4``.
- When the model format is ``gptq``, the quantization is ``Int3``, ``Int4`` or ``Int8``.
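As a sketch, a launch satisfying these conditions might look like the following; the model name and flags are illustrative, so consult ``xinference launch --help`` for the options your version actually supports:

```shell
# Illustrative: an awq/Int4 model, which matches the vLLM conditions above.
xinference launch \
  --model-engine vllm \
  --model-name qwen2.5-instruct \
  --model-format awq \
  --quantization Int4
```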
@@ -142,4 +148,3 @@ Other Platforms
~~~~~~~~~~~~~~~

* :ref:`Ascend NPU <installation_npu>`

9 changes: 3 additions & 6 deletions doc/source/getting_started/using_docker_image.rst
@@ -6,13 +6,14 @@ Xinference Docker Image

Xinference provides official images on Docker Hub.

.. versionchanged:: v2.0

   Starting from **Xinference v2.0**, to use the CUDA version of the image, the minimum CUDA version is **12.9**.

Prerequisites
=============
* The image can only run in an environment with GPUs and CUDA installed, because Xinference in the image relies on NVIDIA GPUs for acceleration.
* CUDA must be successfully installed on the host machine. You can verify this by checking that the ``nvidia-smi`` command runs successfully.
* For CUDA versions below 12.8, the CUDA version in the docker image is ``12.4``; the CUDA version on the host machine should be ``12.4`` or above, and the NVIDIA driver version should be ``550`` or above.
* For CUDA versions at or above 12.8 and below 12.9, the CUDA version in the docker image is ``12.8``; the CUDA version on the host machine should be ``12.8`` or above, and the NVIDIA driver version should be ``570`` or above.
* For CUDA versions 12.9 and above, the CUDA version in the docker image is ``12.9``; the CUDA version on the host machine should be ``12.9`` or above, and the NVIDIA driver version should be ``575`` or above.
* Ensure the `NVIDIA Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>`_ is installed.
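The prerequisites above can be verified before pulling the Xinference image; a sketch (the CUDA base image tag is just one example):

```shell
# Host driver and CUDA runtime visible?
nvidia-smi
# NVIDIA Container Toolkit wired into Docker?
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```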

@@ -26,8 +27,6 @@ Available tags include:
* ``v<release version>``: This image is built each time a Xinference release version is published, and it is typically more stable.
* ``latest``: This image is built with the latest Xinference release version.
* For CPU version, add ``-cpu`` suffix, e.g. ``nightly-main-cpu``.
-* For CUDA 12.8, add ``-cu128`` suffix, e.g. ``nightly-main-cu128``. (Xinference version should be between v1.8.1 and v1.15.0)
-* For CUDA 12.9, add ``-cu129`` suffix, e.g. ``nightly-main-cu129``. (Xinference version should be v1.16.0 at least)


Dockerfile for custom build
@@ -95,5 +94,3 @@ at <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. The command
--gpus all \
xprobe/xinference:v<your_version> \
xinference-local -H 0.0.0.0
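Once the container reports ready, a quick way to confirm the API is reachable (9997 is assumed to be the published port here; adjust to your ``-p`` mapping):

```shell
# List the models currently served; an empty list is a healthy response too.
curl http://localhost:9997/v1/models
```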

