Xinference

王江南, July 2, 2025 (about 2 minutes to read)

Model inference framework

Note

To avoid dependency conflicts, install Langchain-Chatchat and model deployment frameworks such as Xinference in separate Python virtual environments (conda, venv, virtualenv, and so on).

Note

Xinference and Ollama can conflict with each other when both use the GPU.

echo $env:CUDA_VISIBLE_DEVICES

Ollama uses GPU UUIDs
Xinference uses GPU IDs (indices)
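Since the two tools expect different forms, it can help to check which form the current CUDA_VISIBLE_DEVICES value uses. A minimal sketch, assuming the common NVIDIA convention that UUID entries start with `GPU-` (or `MIG-`) and index entries are plain integers. `classify_visible_devices` is a hypothetical helper, not part of either tool.

```python
import os


def classify_visible_devices(value):
    """Label each entry of a CUDA_VISIBLE_DEVICES string as 'index', 'uuid', or 'unknown'."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty fragments such as a trailing comma
        if entry.isdigit():
            kind = "index"  # e.g. "0"
        elif entry.startswith(("GPU-", "MIG-")):
            kind = "uuid"  # assumed prefix convention for UUID entries
        else:
            kind = "unknown"
        result.append((entry, kind))
    return result


if __name__ == "__main__":
    current = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    print(classify_visible_devices(current))
```

An empty list means the variable is unset or empty, in which case every GPU is visible to the process.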

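Note

If you do not have unrestricted internet access (no proxy), configure ModelScope as the model download source.

Configuration

Windows (PowerShell)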

$env:CUDA_VISIBLE_DEVICES = "0"
$env:XINFERENCE_MODEL_SRC= "modelscope"

Installation

conda create -n xinference python=3.10
conda activate xinference
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

CPU

pip3 install "xinference" -i https://pypi.tuna.tsinghua.edu.cn/simple

GPU

pip3 install "xinference[all]" -i https://pypi.tuna.tsinghua.edu.cn/simple

xinference-local --host 0.0.0.0 --port 9997

Note

libgomp.so.1, needed by vendor/llama.cpp/ggml/src/libggml.so, not found

Locate the library with find /usr -name libgomp.so.1
Result:
/usr/lib/x86_64-linux-gnu/libgomp.so.1

Then, before running the install command, set LD_LIBRARY_PATH by entering the following command:
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

After that, the following command succeeded:
pip install "xinference[all]"
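To confirm that the dynamic loader can find the library after LD_LIBRARY_PATH is set, a quick check from Python is possible. This is only a sketch: `can_load` is a hypothetical helper name, and it tests loadability from the current process, which inherits LD_LIBRARY_PATH from the shell that started it.

```python
import ctypes


def can_load(library_name="libgomp.so.1"):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or not on the loader's search path
        return False
    return True


if __name__ == "__main__":
    print("libgomp.so.1 loadable:", can_load())
```

Run it in the same shell session where the export command was entered, otherwise the variable will not be visible to the Python process.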

Installing xinference[all] pulls in every module that needs GPU acceleration by default, which is why the installation failed.
To install only the CPU version:
pip3 install "xinference" -i https://pypi.tuna.tsinghua.edu.cn/simple

Settings

XINFERENCE_HOME=/tmp/xinference # storage location
XINFERENCE_MODEL_SRC=modelscope # model download source
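These settings are ordinary environment variables, so a launcher script can pass them to the server process. A sketch under assumptions: `xinference-local` is on PATH, and `build_server_env` and `build_server_command` are hypothetical helper names. The values are the ones used on this page.

```python
import os


def build_server_env(home="/tmp/xinference", model_src="modelscope", gpus="0", base=None):
    """Return a copy of the environment with the Xinference settings applied."""
    env = dict(os.environ if base is None else base)
    env["XINFERENCE_HOME"] = home            # storage location
    env["XINFERENCE_MODEL_SRC"] = model_src  # model download source
    env["CUDA_VISIBLE_DEVICES"] = gpus       # GPU index (or indices) exposed to the server
    return env


def build_server_command(host="0.0.0.0", port=9997):
    """Return the argument list for starting the local server."""
    return ["xinference-local", "--host", host, "--port", str(port)]
```

Passing both results to `subprocess.run(build_server_command(), env=build_server_env())` would start the server with these settings and block until it exits.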

Usage

List all running models

xinference list

When you no longer need a running model, stop it and release its resources as follows:

xinference terminate --model-uid "qwen2.5-instruct"

Query the parameter combinations available for the qwen-chat model

xinference engine -e http://localhost:9997 --model-name qwen-chat
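The list and terminate operations can also be scripted over HTTP. The sketch below assumes the server exposes `GET /v1/models` for running models and `DELETE /v1/models/{model_uid}` for termination. Treat both paths as assumptions to verify against your Xinference version. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9997"


def list_request(base_url=BASE_URL):
    """Build the request that lists running models (assumed endpoint)."""
    return urllib.request.Request(f"{base_url}/v1/models", method="GET")


def terminate_request(model_uid, base_url=BASE_URL):
    """Build the request that stops one running model (assumed endpoint)."""
    uid = urllib.parse.quote(model_uid, safe="")
    return urllib.request.Request(f"{base_url}/v1/models/{uid}", method="DELETE")


def send(request, timeout=10):
    """Send a prepared request and decode a JSON body if one is returned."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else None
```

For example, `send(terminate_request("qwen2.5-instruct"))` would correspond to the terminate command above once the server is running.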

Using another model hosting platform

$env:CUDA_VISIBLE_DEVICES="0"

$env:XINFERENCE_MODEL_SRC="modelscope"

xinference-local --host 0.0.0.0 --port 9997

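Note

On Windows, starting with --host 0.0.0.0 may not work. If startup reports an error, pass the machine's IP address to --host instead.

Then open http://ip:9997/ui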

xinference launch --model-engine Transformers -u my-ai -n qwen-chat -s 7 -f pytorch

echo $env:CUDA_VISIBLE_DEVICES

netstat -ano | findstr :9997
netstat -ano | findstr :24031
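The netstat commands show which process holds a port. To check only whether the server is reachable, a small connection test is enough. A sketch with a hypothetical helper name, `port_is_open`:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    print("9997 reachable:", port_is_open("127.0.0.1", 9997))
```

Replace 127.0.0.1 with the address passed to --host when the server is bound to a specific IP.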

set CUDA_VISIBLE_DEVICES=0

Models

embedding

jina-embeddings-v2-base-zh

xinference launch --model-name jina-embeddings-v2-base-zh --model-type embedding

curl -X 'POST' \
  'http://192.168.3.89:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "http://192.168.3.89:9997",
  "model_name": "jina-embeddings-v2-base-zh",
  "model_type": "embedding"
}'
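The same launch request can be built from Python with the standard library. This sketch reuses the URL path and the model_name and model_type fields from the curl call above. It leaves out model_engine on the assumption that the server does not require it for this embedding model, so add it if your version does. `launch_embedding_request` is a hypothetical helper name.

```python
import json
import urllib.request


def launch_embedding_request(base_url, model_name):
    """Build a POST /v1/models request that asks the server to launch an embedding model."""
    payload = {
        "model_name": model_name,
        "model_type": "embedding",
        # "model_engine" is intentionally omitted here (assumption, see the text above)
    }
    return urllib.request.Request(
        f"{base_url}/v1/models",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(launch_embedding_request("http://192.168.3.89:9997", "jina-embeddings-v2-base-zh"))` requires a running server at that address.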

Q&A

Models only run on the CPU

python -c "import torch; print(torch.cuda.is_available())"

If torch.cuda.is_available() returns False, the installed PyTorch is most likely a CPU-only build and a GPU build needs to be installed instead.

1. Check your CUDA environment first
# Check the NVIDIA driver
nvidia-smi

2. Uninstall the current PyTorch
pip uninstall torch torchvision torchaudio
3. Install a GPU build of PyTorch
Choose the install command that matches your CUDA version (the example below is for CUDA 12.8):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
4. Verify the installation
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('CUDA version:', torch.version.cuda)"
python -c "import torch; print('GPU count:', torch.cuda.device_count())"
5. Restart Xinference
xinference-local --host 192.168.3.89 --port 9997
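The three verification one-liners from step 4 can be gathered into one script, which is convenient to rerun after each reinstall. A sketch: `cuda_report` is a hypothetical helper, and it reports a missing PyTorch install instead of raising.

```python
import importlib.util


def cuda_report():
    """Collect the same facts as the three one-liners in step 4."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch  # imported lazily so the script still runs without PyTorch

    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,  # None on CPU-only builds
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it inside the xinference environment, since a different environment may hold a different PyTorch build.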