Deploying the ChatGLM3-6B Model on a CPU Server
Quantization techniques for large language models (LLMs) can significantly reduce the computational resources required for deployment. After quantization, an LLM's memory footprint can shrink several-fold, and the model can even run without any GPU memory at all. This makes quantized LLMs highly attractive for widespread use. This article introduces how to quantize the ChatGLM3-6B model into the GGML format and how to deploy the quantized model on Colab's CPU servers, helping readers understand model quantization and become familiar with working in Colab.
Terminology Introduction
Before starting the hands-on steps, we need to understand the tools and terminology involved, so we can better grasp the content that follows.
Colab
Colab is a free cloud-based Jupyter Notebook service provided by Google. It allows users to run and share Python code in the cloud without any setup or configuration. One of its biggest advantages is free access to Google's servers: free users can use CPU servers and T4 GPU servers, while paid users can also use TPU servers and GPUs such as the A100 and V100.
ChatGLM3-6B
ChatGLM3-6B is the latest generation of the conversational pre-trained model series developed jointly by Zhipu AI and Tsinghua University's Knowledge Engineering Group (KEG). The model retains the smooth dialogue and low deployment threshold of the previous two generations and introduces new capabilities such as tool calling and a code interpreter. More details can be found in my previous articles: ChatGLM3-6B Deployment Guide and ChatGLM3-6B Functional Principle Analysis.
GGML
GGML is both a tool library for LLM quantization and a file format for quantized models. A quantized LLM not only shrinks greatly in size (ChatGLM3-6B goes from 12 GB to 3.5 GB) but can also run directly on pure CPU servers. For more LLM quantization formats, see my earlier article Introduction to AI Model Quantization Formats.
Introduction to chatglm.cpp
We will use the chatglm.cpp tool for model quantization. It is a quantization tool built on the GGML library. Besides the ChatGLM series of LLMs, it also supports quantizing other LLMs such as Baichuan, CodeGeeX, and InternLM.
In addition to quantization, chatglm.cpp also provides various ways to run quantized models, including source code compilation, Python code execution, web service, and API service. These methods allow us to use the quantized model in different scenarios.
Quantizing the ChatGLM3-6B Model
First, we create a new Jupyter notebook on Colab and connect it to a runtime server. Quantization needs a relatively large amount of memory (about 15 GB), while the servers Colab provides to free users have only 12 GB, so we need a paid-tier server. Fortunately, Colab's paid pricing is modest: you can choose either the pay-as-you-go plan ($9.99 for 100 compute units) or the Pro plan at $9.99 per month. After upgrading, we can select a high-memory server. Here we choose a high-memory CPU server, since every operation in this article runs on the CPU only.
Then we can write code in the Jupyter notebook. First, we download the ChatGLM3-6B model. Downloading Huggingface resources on Colab is fast and usually completes in a few minutes.
git clone https://huggingface.co/THUDM/chatglm3-6b
The downloaded model will be saved under the /content path. Then download the chatglm.cpp project code. When using the git clone command, you need to add the --recursive parameter to ensure that the downloaded code includes the submodules.
git clone --recursive https://github.com/li-plus/chatglm.cpp.git
The downloaded chatglm.cpp will also be saved under the /content path. Next, we install the dependencies required by the project.
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
Then we can run the quantization command. Here we use the convert.py script to perform the quantization. The command is as follows:
python3 chatglm.cpp/chatglm_cpp/convert.py -i /content/chatglm3-6b -t q4_0 -o chatglm-ggml.bin
Here we use q4_0 as the quantization type; other supported types are listed in the chatglm.cpp documentation. After quantization, a chatglm-ggml.bin file is generated under the /content path, which is the quantized model file.
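If a different precision is needed, the same script accepts other quantization types; for example, the chatglm.cpp documentation lists q8_0 among the supported -t values. A minimal sketch (the output file name here is just an illustrative choice):

python3 chatglm.cpp/chatglm_cpp/convert.py -i /content/chatglm3-6b -t q8_0 -o chatglm-ggml-q8_0.bin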
The quantized model file is 3.5 GB, while the original model weights are about 12 GB. Since we use the q4_0 quantization method, the quantized model is roughly a quarter the size of the original, greatly reducing the model's footprint.
Saving the Quantized Model to Google Drive
We can save the quantized model to Google Drive, so if we restart the server later, we don’t need to repeat the above steps and can directly read the quantized model from Google Drive.
First, we need to mount Google Drive in the notebook. The command is as follows:
from google.colab import drive
drive.mount('/content/gdrive')
After executing the command, a Google Drive authorization page will pop up. Choose 'Allow'; once the mount succeeds, you can see the Google Drive mount directory /content/gdrive/MyDrive. Then save the quantized model file to Google Drive with the following command:
cp chatglm-ggml.bin /content/gdrive/MyDrive/chatglm-ggml.bin
If we restart the server later, we only need to remount Google Drive and can then reference the model file there directly, without repeating the steps above.
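For example, after a restart the only preparation needed is to remount Google Drive with the same drive.mount call shown above and then copy the saved model back into the working directory (a minimal sketch, simply the reverse of the earlier cp):

cp /content/gdrive/MyDrive/chatglm-ggml.bin /content/chatglm-ggml.bin

Alternatively, later commands can point their model path directly at /content/gdrive/MyDrive/chatglm-ggml.bin without copying.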
Uploading the Quantized Model to Huggingface
We can also upload the quantized model to Huggingface, which makes it convenient to deploy on other servers. Uploading files to Huggingface requires creating a new model repository, and then the following code can be used to upload files:
from huggingface_hub import login, HfApi
login()
api = HfApi()
api.upload_file(
    path_or_fileobj="/content/chatglm-ggml.bin",
    path_in_repo="chatglm-ggml.bin",
    repo_id="username/chatglm3-6b-ggml",
    repo_type="model",
)
- The path_or_fileobj parameter is the local path of the quantized model.
- The path_in_repo parameter is the path in the Huggingface repository.
- The repo_id parameter is the ID of the Huggingface repository, formatted as username/repo-name.
- The repo_type parameter is the type of the Huggingface repository, here it is model.
Note that during the code execution, we need to enter the Access Token of the Huggingface account. This token must have write permission; a read-only token will not work.
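If you prefer not to paste the token interactively, login() also accepts it as an argument. A minimal sketch (the token string below is a placeholder; use a write-scoped token created in your Huggingface account settings):

from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder for your own write-permission token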
GGUF
Readers familiar with GGML quantization might ask: does chatglm.cpp support the GGUF format? According to the official GGML introduction, the quantization format will gradually transition from GGML to GGUF, because GGUF can preserve more additional information about the model. However, because of the ChatGLM model architecture, chatglm.cpp currently does not support the GGUF format.
Compiling from Source to Run the Quantized Model
After obtaining the quantized model, we can run it to verify that it works correctly. chatglm.cpp provides multiple ways to run the model; here we first introduce compiling from source.
First, compile the chatglm.cpp binaries:
cd chatglm.cpp && cmake -B build && cmake --build build -j --config Release
After compilation, a build directory is generated under the chatglm.cpp directory, and the compiled binaries are located there. Then we can run the model. The command is as follows:
chatglm.cpp/build/bin/main -m chatglm-ggml.bin -p "hello"
# Output
Hello! How can I help you today?
In the above command, the -m parameter specifies the path of the quantized model, and the -p parameter is the input prompt. We can see the LLM's output, which matches the original model's result. We can also start an interactive conversation using the -i parameter, as follows:
chatglm.cpp/build/bin/main -m chatglm-ggml.bin -i
# Output
Welcome to ChatGLM.cpp! Ask whatever you want. Type 'clear' to clear context. Type 'stop' to exit.
Prompt > hello
ChatGLM3 > Hello! How can I help you today?
Using the Python Package
chatglm.cpp also provides a Python package, allowing us to run the quantized model using this toolkit. First, we install the Python dependencies as follows:
pip install -U chatglm-cpp
Code Execution
After installing the Python package, we can run the quantized model using Python code:
import chatglm_cpp
pipeline = chatglm_cpp.Pipeline("chatglm-ggml.bin")
pipeline.chat([chatglm_cpp.ChatMessage(role="user", content="hello")])
# Output
ChatMessage(role="assistant", content="Hello! How can I help you today?", tool_calls=[])
The result is the same as when running the model via the compiled binary.
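The Pipeline object can also stream the reply token by token, which is useful for long answers. A minimal sketch based on the chatglm.cpp README (parameter names may vary slightly between versions of the chatglm-cpp package):

import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("chatglm-ggml.bin")
messages = [chatglm_cpp.ChatMessage(role="user", content="hello")]
# Print the response chunk by chunk instead of waiting for the full message
for chunk in pipeline.chat(messages, stream=True):
    print(chunk.content, end="", flush=True)
print()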
Command-Line Execution
We can also use a Python script to run the quantized model. The script is located in the examples directory of the chatglm.cpp project. The command is as follows:
python3 chatglm.cpp/examples/cli_demo.py -m chatglm-ggml.bin -p hello
# Output
Hello! How can I help you today?
Deploying Web Services
We can deploy the quantized model as a web service, allowing us to call the model from a browser. Here we use the web service script provided by chatglm.cpp to deploy the model.
Gradio
First, install the Gradio dependencies:
python3 -m pip install gradio
Then modify the web service script examples/web_demo.py, setting the share attribute to True so the service is reachable from outside the server. Then start the web service as follows:
python3 chatglm.cpp/examples/web_demo.py -m chatglm-ggml.bin
# Output
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://41db812a8754cd8ab3.gradio.live
Browser view of the public URL:
Streamlit
Besides the Gradio web service, there is also a Streamlit web service for ChatGLM3-6B that integrates various tools. Let's deploy this service next. First, install the Streamlit dependencies:
python3 -m pip install streamlit jupyter_client ipython ipykernel
ipython kernel install --name chatglm3-demo --user
Modify the integrated service script examples/chatglm3_demo.py to point the model path at the quantized model, as follows:
-MODEL_PATH = Path(__file__).resolve().parent.parent / "chatglm3-ggml.bin"
+MODEL_PATH = "/content/chatglm-ggml.bin"
We also need to expose the web service on the Colab server to the public internet, which we do by downloading Cloudflare's tunnel client (cloudflared) as a reverse proxy:
wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
chmod +x cloudflared-linux-amd64
Finally, start the Streamlit web service and the proxy service:
# Start the Streamlit web service
streamlit run chatglm.cpp/examples/chatglm3_demo.py &>/content/logs.txt &
# Start the proxy service
nohup /content/cloudflared-linux-amd64 tunnel --url http://localhost:8501 &
grep -o 'https://.*\.trycloudflare.com' nohup.out | head -n 1 | xargs -I {} echo "Your tunnel url {}"
# Output
nohup: appending output to 'nohup.out'
Your tunnel url https://incorporated-attend-totally-humidity.trycloudflare.com
Browser view of the tunnel URL:
Deploying API Services
We can also deploy the quantized model as an API service. The Python package of chatglm.cpp provides the functionality to start an API service compatible with the OpenAI API.
First, install the chatglm.cpp API package:
pip install -U 'chatglm-cpp[api]'
Then start the API service:
MODEL=./chatglm-ggml.bin uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000
The MODEL environment variable specifies the path of the quantized model. Then we use the curl command to verify the API service:
curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages": [{"role": "user", "content": "hello"}]}'
# Response
{
  "id": "chatcmpl",
  "model": "default-model",
  "object": "chat.completion",
  "created": 1703052225,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 29,
    "total_tokens": 37
  }
}
The API returns a result structure similar to that of the OpenAI API.
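Because the response follows the OpenAI chat-completions schema, the service can also be called from Python. A minimal sketch using the requests library against the locally started service:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "hello"}]},
)
# Extract the assistant's reply from the OpenAI-style response
print(resp.json()["choices"][0]["message"]["content"])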
Conclusion
This article introduced how to quantize the ChatGLM3-6B model into the GGML format on Colab using chatglm.cpp, and walked through the various ways to deploy and run the quantized model. These methods let us use the quantized model in different scenarios, all on CPU servers without any GPU. I hope this article helps those deploying LLMs or building private LLM applications. The Colab script mentioned in the article can be found here. If there are any errors or omissions in the article, please point them out in the comments section.
Follow me to keep up with new AI and AIGC technologies. Feel free to exchange ideas, and if you have any questions or comments, please leave a message in the comments section.