Merged
2 changes: 1 addition & 1 deletion doc/source/development/contributing_environment.rst
@@ -61,7 +61,7 @@ Conda environment. Here are the commands:

::

-conda install python=3.10
+conda install python=3.12
conda install nodejs

Install from source code
79 changes: 73 additions & 6 deletions doc/source/getting_started/environments.rst
@@ -23,15 +23,20 @@ necessary files such as logs and models, where ``<HOME>`` is the home
path of current user. You can change this directory by configuring this environment
variable.

-XINFERENCE_HEALTH_CHECK_ATTEMPTS
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The number of attempts for the health check at Xinference startup, if exceeded,
-will result in an error. The default value is 3.
+XINFERENCE_HEALTH_CHECK_FAILURE_THRESHOLD
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The maximum number of failed health checks tolerated at Xinference startup.
+Default value is 5.

XINFERENCE_HEALTH_CHECK_INTERVAL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The timeout duration for the health check at Xinference startup, if exceeded,
-will result in an error. The default value is 3.
+Health check interval (seconds) at Xinference startup.
+Default value is 5.

XINFERENCE_HEALTH_CHECK_TIMEOUT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Health check timeout (seconds) at Xinference startup.
Default value is 10.
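The three health-check variables above work together; a minimal sketch of tuning them for a slow-starting cluster, reusing the ``xinference-local`` launch shown elsewhere in these docs (the values are illustrative):

```shell
# Tolerate up to 10 failed probes, checking every 2 seconds,
# with a 5-second timeout per probe (illustrative values).
export XINFERENCE_HEALTH_CHECK_FAILURE_THRESHOLD=10
export XINFERENCE_HEALTH_CHECK_INTERVAL=2
export XINFERENCE_HEALTH_CHECK_TIMEOUT=5
xinference-local -H 0.0.0.0
```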

XINFERENCE_DISABLE_HEALTH_CHECK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -43,3 +48,65 @@ XINFERENCE_DISABLE_METRICS
Xinference will by default enable the metrics exporter on the supervisor and worker.
Setting this environment to 1 will disable the /metrics endpoint on the supervisor
and the HTTP service (only provide the /metrics endpoint) on the worker.
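Setting the variable inline keeps the override scoped to a single launch; a sketch, assuming a standard ``xinference-local`` start:

```shell
# Disable the /metrics endpoint on the supervisor and the
# metrics-only HTTP service on the worker for this launch only.
XINFERENCE_DISABLE_METRICS=1 xinference-local -H 0.0.0.0
```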

XINFERENCE_DOWNLOAD_MAX_ATTEMPTS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maximum download retry attempts for model files.
Default value is 3.

XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enable continuous batching for text-to-image models by specifying the target image size
(e.g., ``1024*1024``). Default is unset.
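A sketch of enabling it, with the size quoted so the shell does not expand the ``*``:

```shell
# Batch text-to-image requests at the 1024*1024 target size.
export XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE="1024*1024"
xinference-local -H 0.0.0.0
```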

XINFERENCE_SSE_PING_ATTEMPTS_SECONDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Server-Sent Events keepalive ping interval (seconds).
Default value is 600.

XINFERENCE_MAX_TOKENS
~~~~~~~~~~~~~~~~~~~~~
Global max tokens limit override for requests. Default is unset.

XINFERENCE_ALLOWED_IPS
~~~~~~~~~~~~~~~~~~~~~~
Restrict access to specified IPs or CIDR blocks. Default is unset (no restriction).

XINFERENCE_BATCH_SIZE
~~~~~~~~~~~~~~~~~~~~~
Default batch size used by the server when batching is enabled.
Default value is 32.

XINFERENCE_BATCH_INTERVAL
~~~~~~~~~~~~~~~~~~~~~~~~~
Default batching interval (seconds).
Default value is 0.003.

XINFERENCE_ALLOW_MULTI_REPLICA_PER_GPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to allow multiple replicas on a single GPU.
Default value is 1 (enabled).

XINFERENCE_LAUNCH_STRATEGY
~~~~~~~~~~~~~~~~~~~~~~~~~~
GPU allocation strategy for replicas. Default is ``IDLE_FIRST_LAUNCH_STRATEGY``.

XINFERENCE_ENABLE_VIRTUAL_ENV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enable model virtual environments globally.
Default value is 1 (enabled, starting from v2.0).

XINFERENCE_VIRTUAL_ENV_SKIP_INSTALLED
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Skip packages already present in system site-packages when creating virtual environments.
Default value is 1.

XINFERENCE_CSG_TOKEN
~~~~~~~~~~~~~~~~~~~~
Authentication token for CSGHub model source.
Default is unset.

XINFERENCE_CSG_ENDPOINT
~~~~~~~~~~~~~~~~~~~~~~~
CSGHub endpoint for model source.
Default value is ``https://hub-stg.opencsg.com/``.
9 changes: 7 additions & 2 deletions doc/source/getting_started/installation.rst
@@ -43,12 +43,18 @@ PyTorch (transformers) supports the inference of most state-of-the-art models. It is

pip install "xinference[transformers]"

Notes:

- The transformers engine supports ``pytorch`` / ``gptq`` / ``awq`` / ``bnb`` / ``fp4`` formats.
- FP4 format requires ``transformers`` with ``FPQuantConfig`` support. If you see an import error,
please upgrade ``transformers`` to a newer version.
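A quick probe for the ``FPQuantConfig`` requirement mentioned above (a sketch; it only checks the import, not that fp4 inference works end to end):

```shell
# Succeeds only if the installed transformers build exposes FPQuantConfig.
python -c "from transformers import FPQuantConfig" 2>/dev/null \
  && echo "fp4 supported" \
  || echo "upgrade transformers for fp4 support"
```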


vLLM Backend
~~~~~~~~~~~~
vLLM is a fast and easy-to-use library for LLM inference and serving. Xinference will choose vLLM as the backend to achieve better throughput when the following conditions are met:

-- The model format is ``pytorch``, ``gptq`` or ``awq``.
+- The model format is ``pytorch``, ``gptq``, ``awq``, ``fp4``, ``fp8`` or ``bnb``.
- When the model format is ``pytorch``, the quantization is ``none``.
- When the model format is ``awq``, the quantization is ``Int4``.
- When the model format is ``gptq``, the quantization is ``Int3``, ``Int4`` or ``Int8``.
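As a sketch, a launch satisfying these conditions might look like the following; the model name and flags are illustrative, so consult ``xinference launch --help`` for the options your version actually supports:

```shell
# Illustrative: an awq/Int4 model, which matches the vLLM conditions above.
xinference launch \
  --model-engine vllm \
  --model-name qwen2.5-instruct \
  --model-format awq \
  --quantization Int4
```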
@@ -142,4 +148,3 @@ Other Platforms
~~~~~~~~~~~~~~~

* :ref:`Ascend NPU <installation_npu>`

9 changes: 3 additions & 6 deletions doc/source/getting_started/using_docker_image.rst
@@ -6,13 +6,14 @@ Xinference Docker Image

Xinference provides official images on Docker Hub.

.. versionchanged:: v2.0

   Starting from **Xinference v2.0**, to use the CUDA version of the image, the minimum CUDA version is **12.9**.

Prerequisites
=============
* The image can only run in an environment with GPUs and CUDA installed, because Xinference in the image relies on NVIDIA GPUs for acceleration.
* CUDA must be successfully installed on the host machine. You can verify this by checking that the ``nvidia-smi`` command runs successfully.
* For CUDA versions below 12.8, the CUDA version in the docker image is ``12.4``; the CUDA version on the host machine should be ``12.4`` or above, and the NVIDIA driver version should be ``550`` or above.
* For CUDA versions at or above 12.8 and below 12.9, the CUDA version in the docker image is ``12.8``; the CUDA version on the host machine should be ``12.8`` or above, and the NVIDIA driver version should be ``570`` or above.
* For CUDA versions 12.9 and above, the CUDA version in the docker image is ``12.9``; the CUDA version on the host machine should be ``12.9`` or above, and the NVIDIA driver version should be ``575`` or above.
* Ensure the `NVIDIA Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>`_ is installed.
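The prerequisites above can be verified before pulling the Xinference image; a sketch (the CUDA base image tag is just one example):

```shell
# Host driver and CUDA runtime visible?
nvidia-smi
# NVIDIA Container Toolkit wired into Docker?
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```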

@@ -26,8 +27,6 @@ Available tags include:
* ``v<release version>``: This image is built each time a Xinference release version is published, and it is typically more stable.
* ``latest``: This image is built with the latest Xinference release version.
* For CPU version, add ``-cpu`` suffix, e.g. ``nightly-main-cpu``.
-* For CUDA 12.8, add ``-cu128`` suffix, e.g. ``nightly-main-cu128``. (Xinference version should be between v1.8.1 and v1.15.0)
-* For CUDA 12.9, add ``-cu129`` suffix, e.g. ``nightly-main-cu129``. (Xinference version should be v1.16.0 at least)


Dockerfile for custom build
@@ -95,5 +94,3 @@ at <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. The command
--gpus all \
xprobe/xinference:v<your_version> \
xinference-local -H 0.0.0.0
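Once the container reports ready, a quick way to confirm the API is reachable (9997 is assumed to be the published port here; adjust to your ``-p`` mapping):

```shell
# List the models currently served; an empty list is a healthy response too.
curl http://localhost:9997/v1/models
```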

