run-llama · logan-markewich · Aug 9, 2023 · Aug 4, 2023 · Aug 4, 2023 · Aug 4, 2023
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,11 @@
 # ChangeLog
 
-## [0.7.22]
+## Unreleased
+
+### New Features
+- Added Xorbits inference for local deployments (#7151)
+
+## [0.7.22] - 2023-08-08
 
 ### New Features
 - add ensemble retriever notebook (#7190)

diff --git a/docs/core_modules/model_modules/llms/modules.md b/docs/core_modules/model_modules/llms/modules.md
@@ -81,3 +81,11 @@ maxdepth: 1
 ---
 /examples/llm/llama_api.ipynb
 ```
+
+## Xorbits Inference
+```{toctree}
+---
+maxdepth: 1
+---
+/examples/llm/XinferenceLocalDeployment.ipynb
+```
diff --git a/docs/examples/llm/XinferenceLocalDeployment.ipynb b/docs/examples/llm/XinferenceLocalDeployment.ipynb
@@ -0,0 +1,209 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7096589b-daaf-440a-b89d-b4956f2db4b2",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Using Xorbits Inference to Deploy Local LLMs - in 3 steps!\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "d8cfbe6f-4c50-4c4f-90f9-03bb91201ef5",
+   "metadata": {},
+   "source": [
+    "## <span style=\"font-size: xx-large;;\">🤖  </span> Installing and Running Xorbits Inference (1/3)\n",
+    "\n",
+    "#### i. Run `pip install \"xinference[all]\"` in a terminal window\n",
+    "\n",
+    "#### ii. After installation is complete, restart this jupyter notebook\n",
+    "\n",
+    "#### iii. Run `xinference` in a new terminal window\n",
+    "\n",
+    "#### iv. You should see something similar to the following output:\n",
+    "\n",
+    "```\n",
+    "INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997\n",
+    "INFO:xinference.core.service:Worker 127.0.0.1:21561 has been added successfully\n",
+    "INFO:xinference.deploy.worker:Xinference worker successfully started.\n",
+    "```\n",
+    "\n",
+    "#### v. In the endpoint description, locate the endpoint port number after the colon. In the above case it is `9997`\n",
+    "\n",
+    "#### vi. Paste the endpoint port number in the following cell"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5d520d56",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "port = 9997  # replace with your endpoint port number"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "93139076",
+   "metadata": {},
+   "source": [
+    "## <span style=\"font-size: xx-large;;\">🚀  </span> Downloading and Launching Local Models (2/3)\n",
+    "\n",
+    "#### In this step, simply run the following code blocks\n",
+    "\n",
+    "#### Also, feel free to change the model configuration for different experiences!\n",
+    "\n",
+    "#### The latest list of supported models can be found in Xorbits Inference's [official GitHub page](https://github.com/xorbitsai/inference/blob/main/README.md)\n",
+    "\n",
+    "##### Here are the parameter options for vicuna-v1.3, ranked from the least space-consuming to the most resource-intensive but high-performing:\n",
+    "\n",
+    "model_size_in_billions: `7`, `13`, `33`\n",
+    "\n",
+    "quantization: `q2_K`, `q3_K_L`, `q3_K_M`, `q3_K_S`, `q4_0`, `q4_1`, `q4_K_M`, `q4_K_S`, `q5_0`, `q5_1`, `q5_K_M`, `q5_K_S`, `q6_K`, `q8_0`\n",
+    "\n",
+    "##### Here are a few of the supported models:\n",
+    "\n",
+    "| Name          | Type             | Language | Format  | Size (in billions) | Quantization                            |\n",
+    "|---------------|------------------|----------|---------|--------------------|-----------------------------------------|\n",
+    "| baichuan      | Foundation Model | en, zh   | ggmlv3  | 7                  | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |\n",
+    "| llama-2-chat  | RLHF Model       | en       | ggmlv3  | 7, 13, 70          | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |\n",
+    "| chatglm       | SFT Model        | en, zh   | ggmlv3  | 6                  | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'  |\n",
+    "| chatglm2      | SFT Model        | en, zh   | ggmlv3  | 6                  | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'  |\n",
+    "| wizardlm-v1.0 | SFT Model        | en       | ggmlv3  | 7, 13, 33          | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |\n",
+    "| wizardlm-v1.1 | SFT Model        | en       | ggmlv3  | 13                 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |\n",
+    "| vicuna-v1.3   | SFT Model        | en       | ggmlv3  | 7, 13              | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |\n",
+    "\n",
+    "\n",
+    "In order to achieve satisfactory results, it is recommended to use models above 13 billion in size."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fd1d259c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If Xinference can not be imported, you may need to restart jupyter notebook\n",
+    "from llama_index import (\n",
+    "    ListIndex,\n",
+    "    TreeIndex,\n",
+    "    VectorStoreIndex,\n",
+    "    KeywordTableIndex,\n",
+    "    KnowledgeGraphIndex,\n",
+    "    SimpleDirectoryReader,\n",
+    "    ServiceContext,\n",
+    ")\n",
+    "from llama_index.llms import Xinference\n",
+    "from xinference.client import RESTfulClient\n",
+    "from IPython.display import Markdown, display"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b48c6d7a-7a38-440b-8ecb-f43f9050ee54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define a client to send commands to xinference\n",
+    "client = RESTfulClient(f\"http://localhost:{port}\")\n",
+    "\n",
+    "# Download and Launch a model, this may take a while the first time\n",
+    "model_uid = client.launch_model(\n",
+    "    model_name=\"llama-2-chat\",\n",
+    "    model_size_in_billions=7,\n",
+    "    model_format=\"ggmlv3\",\n",
+    "    quantization=\"q2_K\",\n",
+    "    n_ctx=4096,\n",
+    ")\n",
+    "\n",
+    "llm = Xinference(endpoint=f\"http://localhost:{port}\", model_uid=model_uid)\n",
+    "service_context = ServiceContext.from_defaults(llm=llm)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "094a02b7",
+   "metadata": {},
+   "source": [
+    "## <span style=\"font-size: xx-large;;\">🕺  </span> Index the Data and Start Chatting! (3/3)\n",
+    "\n",
+    "#### In this step, simply run the following code blocks\n",
+    "\n",
+    "#### Also, feel free to change the index that is used for different experiences\n",
+    "\n",
+    "#### A list of all available indexes can be found in Llama Index's [official Docs](https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/modules.html)\n",
+    "\n",
+    "Here are some available indexes that are imported:\n",
+    "\n",
+    "`ListIndex`, `TreeIndex`, `VetorStoreIndex`, `KeywordTableIndex`, `KnowledgeGraphIndex`\n",
+    "\n",
+    "The following code uses `VetorStoreIndex`. To change index, simply replace its name with another index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "708b323e-d314-4b83-864b-22a1ead60de9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create index from the data\n",
+    "documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()\n",
+    "\n",
+    "# change index name in the following line\n",
+    "index = VectorStoreIndex.from_documents(\n",
+    "    documents=documents, service_context=service_context\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c2de13b-133f-404e-9661-2acafcdf2573",
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "# ask a question and display the answer\n",
+    "query_engine = index.as_query_engine()\n",
+    "\n",
+    "question = \"What did the author do after his time at Y Combinator?\"\n",
+    "\n",
+    "response = query_engine.query(question)\n",
+    "display(Markdown(f\"<b>{response}</b>\"))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/llama_index/llms/__init__.py b/llama_index/llms/__init__.py
@@ -19,6 +19,7 @@
 from llama_index.llms.palm import PaLM
 from llama_index.llms.predibase import PredibaseLLM
 from llama_index.llms.replicate import Replicate
+from llama_index.llms.xinference import Xinference
 
 __all__ = [
     "OpenAI",
@@ -40,4 +41,5 @@
     "CompletionResponseGen",
     "CompletionResponseAsyncGen",
     "LLMMetadata",
+    "Xinference",
 ]