PaddlePaddle · mattheliu · Nov 6, 2025
diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -1,5 +1,3 @@
-[简体中文](zh/benchmark.md)
-
 # Benchmark
 
 FastDeploy extends the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script with additional metrics, enabling more detailed performance benchmarking for FastDeploy.

diff --git a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-0.3B-Paddle.md)
-
 # ERNIE-4.5-0.3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
@@ -90,4 +88,5 @@ export FD_SAMPLING_CLASS=rejection
 ```
 
 ## FAQ
+
 If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
-
 # ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements

diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
-
 # ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements

diff --git a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
-
 # ERNIE-4.5-300B-A47B
 ## Environmental Preparation
 ### 1.1 Hardware requirements

diff --git a/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
-
 # ERNIE-4.5-VL-28B-A3B-Paddle
 
 ## 1. Environment Preparation

diff --git a/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
-
 # ERNIE-4.5-VL-424B-A47B-Paddle
 
 ## 1. Environment Preparation

diff --git a/docs/best_practices/FAQ.md b/docs/best_practices/FAQ.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/FAQ.md)
-
 # FAQ
 ## 1.CUDA out of memory
 1. when starting the service：

diff --git a/docs/best_practices/PaddleOCR-VL-0.9B.md b/docs/best_practices/PaddleOCR-VL-0.9B.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/PaddleOCR-VL-0.9B.md)
-
 # PaddleOCR-VL-0.9B
 
 ## 1. Environment Preparation

diff --git a/docs/best_practices/README.md b/docs/best_practices/README.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/best_practices/README.md)
-
 # Optimal Deployment
 
 - [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)

diff --git a/docs/cli/bench.md b/docs/cli/bench.md
@@ -1,5 +1,4 @@
 # bench: Benchmark Testing
-
 ## 1. bench latency: Offline Latency Test
 
 ### Parameters

diff --git a/docs/features/chunked_prefill.md b/docs/features/chunked_prefill.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/chunked_prefill.md)
-
 # Chunked Prefill
 
 Chunked Prefill employs a segmentation strategy that breaks down Prefill requests into smaller subtasks, which are then batched together with Decode requests. This approach better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, optimizes GPU resource utilization, reduces computational overhead and memory footprint per Prefill, thereby lowering peak memory usage and avoiding out-of-memory issues.

diff --git a/docs/features/data_parallel_service.md b/docs/features/data_parallel_service.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/data_parallel_service.md)
-
 # Data Parallelism
 Under the MOE model, enabling Expert Parallelism (EP) combined with Data Parallelism (DP), where EP distributes expert workloads and DP enables parallel request processing.
 

diff --git a/docs/features/disaggregated.md b/docs/features/disaggregated.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/disaggregated.md)
-
 # Disaggregated Deployment
 
 Large model inference consists of two phases: Prefill and Decode, which are compute-intensive and memory access-intensive respectively. Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.

diff --git a/docs/features/early_stop.md b/docs/features/early_stop.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/early_stop.md)
-
 # Early Stopping
 
 The early stopping is used to prematurely terminate the token generation of the model. Specifically, the early stopping uses different strategies to determine whether the currently generated token sequence meets the early stopping criteria. If so, token generation is terminated prematurely. FastDeploy currently supports the repetition strategy and stop sequence.

diff --git a/docs/features/graph_optimization.md b/docs/features/graph_optimization.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/graph_optimization.md)
-
 # Graph optimization technology in FastDeploy
 
 FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:

diff --git a/docs/features/load_balance.md b/docs/features/load_balance.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/load_balance.md)
-
 # Global Scheduler: Multi-Instance Load Balancing
 
 ## Design Overview

diff --git a/docs/features/multi-node_deployment.md b/docs/features/multi-node_deployment.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/multi-node_deployment.md)
-
 # Multi-Node Deployment
 
 ## Overview

diff --git a/docs/features/plas_attention.md b/docs/features/plas_attention.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/plas_attention.md)
-
 # PLAS
 
 ## Introduction

diff --git a/docs/features/plugins.md b/docs/features/plugins.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/plugins.md)
-
 # FastDeploy Plugin Mechanism Documentation
 
 FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.

diff --git a/docs/features/prefix_caching.md b/docs/features/prefix_caching.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/prefix_caching.md)
-
 # Prefix Caching
 
 Prefix Caching is a technique to optimize the inference efficiency of generative models. Its core idea is to cache intermediate computation results (KV Cache) of input sequences, avoiding redundant computations and thereby accelerating response times for multiple requests sharing the same prefix.

diff --git a/docs/features/reasoning_output.md b/docs/features/reasoning_output.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/reasoning_output.md)
-
 # Reasoning Outputs
 
 Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.

diff --git a/docs/features/sampling.md b/docs/features/sampling.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/sampling.md)
-
 # Sampling Strategies
 
 Sampling strategies are used to determine how to select the next token from the output probability distribution of a model. FastDeploy currently supports multiple sampling strategies including Top-p, Top-k_Top-p, and Min-p Sampling.

diff --git a/docs/features/speculative_decoding.md b/docs/features/speculative_decoding.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/speculative_decoding.md)
-
 # 🔮 Speculative Decoding
 
 This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.

diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/features/structured_outputs.md)
-
 # Structured Outputs
 
 ## Overview

diff --git a/docs/get_started/README.md b/docs/get_started/README.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/README.md)
-
 # Get Started
 
 - [Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes](quick_start.md)

diff --git a/docs/get_started/ernie-4.5-vl.md b/docs/get_started/ernie-4.5-vl.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/ernie-4.5-vl.md)
-
 # Deploy ERNIE-4.5-VL-424B-A47B Multimodal Model
 
 This document explains how to deploy the ERNIE-4.5-VL multimodal model, which supports users to interact with the model using multimodal data (including reasoning capabilities). Before starting the deployment, please ensure that your hardware environment meets the following requirements:

diff --git a/docs/get_started/ernie-4.5.md b/docs/get_started/ernie-4.5.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/ernie-4.5.md)
-
 # Deploy ERNIE-4.5-300B-A47B Model
 
 This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:

diff --git a/docs/get_started/installation/Enflame_gcu.md b/docs/get_started/installation/Enflame_gcu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/Enflame_gcu.md)
-
 # Running ERNIE 4.5 Series Models with FastDeploy
 
 The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.

diff --git a/docs/get_started/installation/README.md b/docs/get_started/installation/README.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/README.md)
-
 # FastDeploy Installation
 
 FastDeploy currently supports installation on the following hardware platforms:

diff --git a/docs/get_started/installation/hygon_dcu.md b/docs/get_started/installation/hygon_dcu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/hygon_dcu.md)
-
 # Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on hygon machine
 The current version of the software merely serves as a demonstration demo for the hygon k100AI combined with the Fastdeploy inference framework for large models. There may be issues when running the latest ERNIE4.5 model, and we will conduct repairs and performance optimization in the future. Subsequent versions will provide customers with a more stable version.
 

diff --git a/docs/get_started/installation/iluvatar_gpu.md b/docs/get_started/installation/iluvatar_gpu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/iluvatar_gpu.md)
-
 # Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on iluvatar machine
 
 ## Machine Preparation

diff --git a/docs/get_started/installation/intel_gaudi.md b/docs/get_started/installation/intel_gaudi.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/intel_gaudi.md)
-
 # Intel Gaudi Installation for running ERNIE 4.5 Series Models
 
 The following installation methods are available when your environment meets these requirements:

diff --git a/docs/get_started/installation/kunlunxin_xpu.md b/docs/get_started/installation/kunlunxin_xpu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/kunlunxin_xpu.md)
-
 # Kunlunxin XPU
 
 ## Requirements

diff --git a/docs/get_started/installation/metax_gpu.md b/docs/get_started/installation/metax_gpu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/metax_gpu.md)
-
 # Metax GPU Installation for running ERNIE 4.5 Series Models
 
 The following installation methods are available when your environment meets these requirements:

diff --git a/docs/get_started/installation/nvidia_gpu.md b/docs/get_started/installation/nvidia_gpu.md
@@ -1,5 +1,3 @@
-[简体中文](../../zh/get_started/installation/nvidia_gpu.md)
-
 # NVIDIA CUDA GPU Installation
 
 The following installation methods are available when your environment meets these requirements:

diff --git a/docs/get_started/quick_start.md b/docs/get_started/quick_start.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/quick_start.md)
-
 # Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes
 
 Before deployment, ensure your environment meets the following requirements:

diff --git a/docs/get_started/quick_start_qwen.md b/docs/get_started/quick_start_qwen.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/quick_start_qwen.md)
-
 # Deploy QWEN3-0.6b in 10 Minutes
 
 Before deployment, ensure your environment meets the following requirements:

diff --git a/docs/get_started/quick_start_vl.md b/docs/get_started/quick_start_vl.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/get_started/quick_start_vl.md)
-
 # Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
 
 Before deployment, please ensure your environment meets the following requirements:

diff --git a/docs/index.md b/docs/index.md
@@ -1,5 +1,3 @@
-[简体中文](zh/index.md)
-
 # FastDeploy
 
 **FastDeploy** is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:

diff --git a/docs/offline_inference.md b/docs/offline_inference.md
@@ -1,5 +1,3 @@
-[简体中文](zh/offline_inference.md)
-
 # Offline Inference
 
 ## 1. Usage

diff --git a/docs/online_serving/README.md b/docs/online_serving/README.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/online_serving/README.md)
-
 # OpenAI Protocol-Compatible API Server
 
 FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:

diff --git a/docs/online_serving/graceful_shutdown_service.md b/docs/online_serving/graceful_shutdown_service.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/online_serving/graceful_shutdown_service.md)
-
 # Graceful Service Node Shutdown Solution
 
 ## 1. Core Objective

diff --git a/docs/online_serving/metrics.md b/docs/online_serving/metrics.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/online_serving/metrics.md)
-
 # Monitoring Metrics
 
 After FastDeploy is launched, it supports continuous monitoring of the FastDeploy service status through Metrics. When starting FastDeploy, you can specify the port for the Metrics service by configuring the `metrics-port` parameter.

diff --git a/docs/online_serving/scheduler.md b/docs/online_serving/scheduler.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/online_serving/scheduler.md)
-
 # Scheduler
 
 FastDeploy currently supports two types of schedulers: **Local Scheduler** and **Global Scheduler**. The Global Scheduler is designed for large-scale clusters, enabling secondary load balancing across nodes based on real-time workload metrics.

diff --git a/docs/parameters.md b/docs/parameters.md
@@ -1,5 +1,3 @@
-[简体中文](zh/parameters.md)
-
 # FastDeploy Parameter Documentation
 
 ## Parameter Description

diff --git a/docs/quantization/README.md b/docs/quantization/README.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/quantization/README.md)
-
 # Quantization
 
 FastDeploy supports various quantization inference precisions including FP8, INT8, INT4, 2-bits, etc. It supports different precision inference for weights, activations, and KVCache tensors, which can meet the inference requirements of different scenarios such as low cost, low latency, and long context.

diff --git a/docs/quantization/online_quantization.md b/docs/quantization/online_quantization.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/quantization/online_quantization.md)
-
 # Online Quantization
 
 Online quantization refers to the inference engine quantizing weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to various precisions, including: INT4, INT8, and FP8.

diff --git a/docs/quantization/wint2.md b/docs/quantization/wint2.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/quantization/wint2.md)
-
 # WINT2 Quantization
 
 Weights are compressed offline using the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The actual stored numerical type of weights is INT8, with 4 weights packed into each INT8 value, equivalent to 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded in real-time to BF16 numerical type, and calculations are performed using BF16 numerical type.

diff --git a/docs/supported_models.md b/docs/supported_models.md
@@ -1,5 +1,3 @@
-[简体中文](zh/supported_models.md)
-
 # Supported Models
 
 FastDeploy currently supports the following models, which can be downloaded automatically during FastDeploy deployment.Specify the ``model`` parameter as the model name in the table below to automatically download model weights (all supports resumable downloads). The following three download sources are supported:

diff --git a/docs/usage/code_overview.md b/docs/usage/code_overview.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/usage/code_overview.md)
-
 # Code Overview
 
 Below is an overview of the FastDeploy code structure and functionality organized by directory.

diff --git a/docs/usage/environment_variables.md b/docs/usage/environment_variables.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/usage/environment_variables.md)
-
 # FastDeploy Environment Variables
 
 FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. Below is the documentation:

diff --git a/docs/usage/fastdeploy_unit_test_guide.md b/docs/usage/fastdeploy_unit_test_guide.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/usage/fastdeploy_unit_test_guide.md)
-
 # FastDeploy Unit Test Specification
 1. Test Naming Conventions
    - Test files must start with test_.

diff --git a/docs/usage/kunlunxin_xpu_deployment.md b/docs/usage/kunlunxin_xpu_deployment.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/usage/kunlunxin_xpu_deployment.md)
-
 ## Supported Models
 |Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Applicable Version|
 |-|-|-|-|-|-|

diff --git a/docs/usage/log.md b/docs/usage/log.md
@@ -1,5 +1,3 @@
-[简体中文](../zh/usage/log.md)
-
 # Log Description
 
 FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.

diff --git a/docs/zh/benchmark.md b/docs/zh/benchmark.md
@@ -1,5 +1,3 @@
-[English](../benchmark.md)
-
 # Benchmark
 
 FastDeploy基于[vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/)脚本，增加了部分统计信息，可用于benchmark FastDeploy更详细的性能指标。

diff --git a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-0.3B-Paddle.md)
-
 # ERNIE-4.5-0.3B
 ## 一、环境准备
 ### 1.1 支持情况

diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
-
 # ERNIE-4.5-21B-A3B
 ## 一、环境准备
 ### 1.1 支持情况

diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
-
 # ERNIE-4.5-21B-A3B-Thinking
 ## 一、环境准备
 ### 1.1 支持情况

diff --git a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
-
 # ERNIE-4.5-300B-A47B
 ## 一、环境准备
 ### 1.1 支持情况

diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
-
 # ERNIE-4.5-VL-28B-A3B-Paddle
 
 ## 一、环境准备

diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
-
 # ERNIE-4.5-VL-424B-A47B-Paddle
 
 ## 一、环境准备

diff --git a/docs/zh/best_practices/FAQ.md b/docs/zh/best_practices/FAQ.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/FAQ.md)
-
 # 常见问题FAQ
 ## 1.显存不足
 1. 启动服务时显存不足：

diff --git a/docs/zh/best_practices/PaddleOCR-VL-0.9B.md b/docs/zh/best_practices/PaddleOCR-VL-0.9B.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/PaddleOCR-VL-0.9B.md)
-
 # PaddleOCR-VL-0.9B
 
 ## 一、环境准备

diff --git a/docs/zh/best_practices/README.md b/docs/zh/best_practices/README.md
@@ -1,5 +1,3 @@
-[English](../../best_practices/README.md)
-
 # 最佳实践
 
 - [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)