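Deploying and Using the Open-Source Tongyi Qianwen Model

Start with the usage documentation provided by the ModelScope community, which is already quite thorough: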

https://modelscope.cn/models/qwen/Qwen-7B-Chat/quickstart

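Even so, I ran into several errors while following the documented steps, and searching turned up few solutions, so I am recording them here.

Deploying on a personal computer

I don't really recommend deploying the Tongyi Qianwen model on your own laptop, because it is simply too resource-hungry. On my MacBook Pro with an M2 chip the model did run, but it took four to five minutes to answer a single question, memory was completely full, and other applications were forced to quit. It is better to use the free resources the community provides, or a server with a higher spec. I also hit various problems along the way and only solved them after digging through many GitHub issues, which cost a lot of time and effort. I won't record them here, and I strongly advise against this approach.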

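Free compute servers

After opening the ModelScope site, click log in / register and you will see a promotion that gives away free compute.

Once registered, the model's page shows servers that can be started at any time.

The CPU-environment server can barely run the model: it is painfully slow, and several things have to be changed during setup. The GPU environment, by contrast, starts the model very smoothly and the experience is very good.

Starting in the CPU environment

The server the community provides is already well specced, with 8 cores and 32 GB of RAM, but because it is a pure CPU environment there are still some problems during startup.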
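As a rough sanity check on why 32 GB is tighter than it sounds, here is a back-of-the-envelope estimate of the memory needed just for the weights. It is only a sketch: the parameter count of 7e9 is an assumption read off the "7B" in the model name (the real count differs), `weights_gib` is a hypothetical helper, and activations and the runtime itself need memory on top of this.

```python
# Rough estimate of the RAM needed just to hold the model weights.
# ASSUMPTION: ~7e9 parameters, taken from the "7B" in the model name;
# the exact count differs, and activations/cache add more on top.
GIB = 1024 ** 3


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB for a given precision."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 7e9  # assumed, not measured

for name, width in (("fp16", 2), ("fp32", 4)):
    print(f"{name}: about {weights_gib(N_PARAMS, width):.1f} GiB")
```

Under that assumption the weights come to roughly 13 GiB in half precision and 26 GiB in single precision, so a float32 copy leaves little headroom on a 32 GB machine. That may be one reason weights end up being offloaded to disk in the CPU environment.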

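Installing dependencies

The first command does not need to be run; the server already ships with the modelscope package.

Just open a new Terminal window and run the second command.

Startup code

Running the code from the documentation as-is raises errors, because the environment is CPU-only.

Error 1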

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 8
      5 model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision = 'v1.0.5',device_map="auto", trust_remote_code=True,fp16 = True).eval()
      6 model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat",revision = 'v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
----> 8 response, history = model.chat(tokenizer, "你好", history=None)
      9 print(response)
     10 response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history)

File ~/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py:1010, in QWenLMHeadModel.chat(self, tokenizer, query, history, system, append_history, stream, stop_words_ids, **kwargs)
   1006 stop_words_ids.extend(get_stop_words_ids(
   1007     self.generation_config.chat_format, tokenizer
   1008 ))
   1009 input_ids = torch.tensor([context_tokens]).to(self.device)
-> 1010 outputs = self.generate(
   1011             input_ids,
   1012             stop_words_ids = stop_words_ids,
   1013             return_dict_in_generate = False,
   1014             **kwargs,
   1015         )
   1017 response = decode_tokens(
   1018     outputs[0],
   1019     tokenizer,
   (...)
   1024     errors='replace'
   1025 )
   1027 if append_history:

Error 2

ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 5
      2 from modelscope import GenerationConfig
      4 tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision = 'v1.0.5',trust_remote_code=True)
----> 5 model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision = 'v1.0.5',device_map="auto", trust_remote_code=True,fp16 = True).eval()
      6 model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat",revision = 'v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
      7 model.float()

File /opt/conda/lib/python3.8/site-packages/modelscope/utils/hf_util.py:98, in get_wrapped_class.<locals>.ClassWrapper.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
     95 else:
     96     model_dir = pretrained_model_name_or_path
---> 98 model = module_class.from_pretrained(model_dir, *model_args,
     99                                      **kwargs)
    100 model.model_dir = model_dir
    101 return model

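Solution

First make sure torch is at version 2.0.1. Then make the following two additions and the code will run: call model.float() after the model is loaded (for Error 1), and pass offload_folder as an argument to from_pretrained (for Error 2).

model.float()

offload_folder="offload_folder",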

from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
import datetime

print("Start time: " + str(datetime.datetime.now()))
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
# offload_folder gives weights that are offloaded to disk a place to go (Error 2)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", offload_folder="offload_folder", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)  # generation length, top_p and other hyperparameters can be set here
# Convert the weights to float32 so the CPU never has to run half-precision ops (Error 1)
model.float()

print("Starting chat: " + str(datetime.datetime.now()))
response, history = model.chat(tokenizer, "你好", history=None)  # "Hello"
print(response)
print("First question done: " + str(datetime.datetime.now()))
response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history)  # "Where is the capital of Zhejiang?"
print(response)
print("Second question done: " + str(datetime.datetime.now()))
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)  # "What fun sights does it have?"
print(response)
print("Third question done: " + str(datetime.datetime.now()))
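If the same notebook has to run in both the CPU and the GPU environment, the two fixes can be applied conditionally. The following is only a sketch of that idea: `installed_version`, `build_load_kwargs` and `needs_float32` are hypothetical helper names of my own, not part of modelscope, and the keyword arguments are simply the ones used in the code above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_load_kwargs(has_gpu: bool) -> dict:
    """Keyword arguments for from_pretrained, mirroring the calls in this post.

    Hypothetical helper for illustration; not a modelscope API.
    """
    kwargs = {
        "revision": "v1.0.5",
        "device_map": "auto",
        "trust_remote_code": True,
        "fp16": True,
    }
    if not has_gpu:
        # CPU-only: give offloaded weights a folder to land in (Error 2).
        kwargs["offload_folder"] = "offload_folder"
    return kwargs


def needs_float32(has_gpu: bool) -> bool:
    """On CPU, model.float() should be called after loading (Error 1)."""
    return not has_gpu


# The post used torch 2.0.1; this only reports what is installed.
print("torch version:", installed_version("torch"))
print(build_load_kwargs(has_gpu=False))
```

The resulting dictionary would be unpacked into the existing call, for example AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", **build_load_kwargs(has_gpu=False)), followed by model.float() whenever needs_float32 returns True.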

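Once it is running, the speed is painfully slow: each question takes about 5 minutes to answer, and there is some chance that startup fails outright.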
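To put a number on each answer instead of comparing printed timestamps by eye, a small timing wrapper can help. This is a minimal sketch: `timed` is a hypothetical helper, and `fake_chat` is a stand-in with the same call shape as model.chat so that the example runs without loading the model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_chat(tokenizer, query, history=None):
    """Stand-in for model.chat: returns a dummy reply and the updated history."""
    history = (history or []) + [(query, "...")]
    return "...", history


(response, history), seconds = timed(fake_chat, None, "你好", history=None)
print(f"answered in {seconds:.3f} s")
```

With the real model, the call would be timed(model.chat, tokenizer, "你好", history=None).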

An error dialog may pop up while the model is starting. Click OK and run the code again; the cause is probably high load on the server.
