Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,11 @@
# ChangeLog

## [0.7.22]
## Unreleased

### New Features
- Added Xorbits inference for local deployments (#7151)

## [0.7.22] - 2023-08-08

### New Features
- add ensemble retriever notebook (#7190)
Expand Down
8 changes: 8 additions & 0 deletions docs/core_modules/model_modules/llms/modules.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,3 +81,11 @@ maxdepth: 1
---
/examples/llm/llama_api.ipynb
```

## Xorbits Inference
```{toctree}
---
maxdepth: 1
---
/examples/llm/XinferenceLocalDeployment.ipynb
```
209 changes: 209 additions & 0 deletions docs/examples/llm/XinferenceLocalDeployment.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,209 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "7096589b-daaf-440a-b89d-b4956f2db4b2",
"metadata": {
"tags": []
},
"source": [
"# Using Xorbits Inference to Deploy Local LLMs - in 3 steps!\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d8cfbe6f-4c50-4c4f-90f9-03bb91201ef5",
"metadata": {},
"source": [
"## <span style=\"font-size: xx-large;;\">🤖 </span> Installing and Running Xorbits Inference (1/3)\n",
"\n",
"#### i. Run `pip install \"xinference[all]\"` in a terminal window\n",
"\n",
"#### ii. After installation is complete, restart this jupyter notebook\n",
"\n",
"#### iii. Run `xinference` in a new terminal window\n",
"\n",
"#### iv. You should see something similar to the following output:\n",
"\n",
"```\n",
"INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997\n",
"INFO:xinference.core.service:Worker 127.0.0.1:21561 has been added successfully\n",
"INFO:xinference.deploy.worker:Xinference worker successfully started.\n",
"```\n",
"\n",
"#### v. In the endpoint description, locate the endpoint port number after the colon. In the above case it is `9997`\n",
"\n",
"#### vi. Paste the endpoint port number in the following cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d520d56",
"metadata": {},
"outputs": [],
"source": [
"port = 9997 # replace with your endpoint port number"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "93139076",
"metadata": {},
"source": [
"## <span style=\"font-size: xx-large;;\">🚀 </span> Downloading and Launching Local Models (2/3)\n",
"\n",
"#### In this step, simply run the following code blocks\n",
"\n",
"#### Also, feel free to change the model configuration for different experiences!\n",
"\n",
"#### The latest list of supported models can be found in Xorbits Inference's [official GitHub page](https://github.com/xorbitsai/inference/blob/main/README.md)\n",
"\n",
"##### Here are the parameter options for vicuna-v1.3, ranked from the least space-consuming to the most resource-intensive but high-performing:\n",
"\n",
"model_size_in_billions: `7`, `13`, `33`\n",
"\n",
"quantization: `q2_K`, `q3_K_L`, `q3_K_M`, `q3_K_S`, `q4_0`, `q4_1`, `q4_K_M`, `q4_K_S`, `q5_0`, `q5_1`, `q5_K_M`, `q5_K_S`, `q6_K`, `q8_0`\n",
"\n",
"##### Here are a few of the supported models:\n",
"\n",
"| Name | Type | Language | Format | Size (in billions) | Quantization |\n",
"|---------------|------------------|----------|---------|--------------------|-----------------------------------------|\n",
"| baichuan | Foundation Model | en, zh | ggmlv3 | 7 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |\n",
"| llama-2-chat | RLHF Model | en | ggmlv3 | 7, 13, 70 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |\n",
"| chatglm | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |\n",
"| chatglm2 | SFT Model | en, zh | ggmlv3 | 6 | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0' |\n",
"| wizardlm-v1.0 | SFT Model | en | ggmlv3 | 7, 13, 33 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |\n",
"| wizardlm-v1.1 | SFT Model | en | ggmlv3 | 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |\n",
"| vicuna-v1.3 | SFT Model | en | ggmlv3 | 7, 13 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0' |\n",
"\n",
"\n",
"In order to achieve satisfactory results, it is recommended to use models above 13 billion in size."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd1d259c",
"metadata": {},
"outputs": [],
"source": [
"# If Xinference can not be imported, you may need to restart jupyter notebook\n",
"from llama_index import (\n",
" ListIndex,\n",
" TreeIndex,\n",
" VectorStoreIndex,\n",
" KeywordTableIndex,\n",
" KnowledgeGraphIndex,\n",
" SimpleDirectoryReader,\n",
" ServiceContext,\n",
")\n",
"from llama_index.llms import Xinference\n",
"from xinference.client import RESTfulClient\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b48c6d7a-7a38-440b-8ecb-f43f9050ee54",
"metadata": {},
"outputs": [],
"source": [
"# Define a client to send commands to xinference\n",
"client = RESTfulClient(f\"http://localhost:{port}\")\n",
"\n",
"# Download and Launch a model, this may take a while the first time\n",
"model_uid = client.launch_model(\n",
" model_name=\"llama-2-chat\",\n",
" model_size_in_billions=7,\n",
" model_format=\"ggmlv3\",\n",
" quantization=\"q2_K\",\n",
" n_ctx=4096,\n",
")\n",
"\n",
"llm = Xinference(endpoint=f\"http://localhost:{port}\", model_uid=model_uid)\n",
"service_context = ServiceContext.from_defaults(llm=llm)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "094a02b7",
"metadata": {},
"source": [
"## <span style=\"font-size: xx-large;;\">🕺 </span> Index the Data and Start Chatting! (3/3)\n",
"\n",
"#### In this step, simply run the following code blocks\n",
"\n",
"#### Also, feel free to change the index that is used for different experiences\n",
"\n",
"#### A list of all available indexes can be found in Llama Index's [official Docs](https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/modules.html)\n",
"\n",
"Here are some available indexes that are imported:\n",
"\n",
"`ListIndex`, `TreeIndex`, `VetorStoreIndex`, `KeywordTableIndex`, `KnowledgeGraphIndex`\n",
"\n",
"The following code uses `VetorStoreIndex`. To change index, simply replace its name with another index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "708b323e-d314-4b83-864b-22a1ead60de9",
"metadata": {},
"outputs": [],
"source": [
"# create index from the data\n",
"documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()\n",
"\n",
"# change index name in the following line\n",
"index = VectorStoreIndex.from_documents(\n",
" documents=documents, service_context=service_context\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c2de13b-133f-404e-9661-2acafcdf2573",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# ask a question and display the answer\n",
"query_engine = index.as_query_engine()\n",
"\n",
"question = \"What did the author do after his time at Y Combinator?\"\n",
"\n",
"response = query_engine.query(question)\n",
"display(Markdown(f\"<b>{response}</b>\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
2 changes: 2 additions & 0 deletions llama_index/llms/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
from llama_index.llms.palm import PaLM
from llama_index.llms.predibase import PredibaseLLM
from llama_index.llms.replicate import Replicate
from llama_index.llms.xinference import Xinference

__all__ = [
"OpenAI",
Expand All @@ -40,4 +41,5 @@
"CompletionResponseGen",
"CompletionResponseAsyncGen",
"LLMMetadata",
"Xinference",
]
Loading