A dream made real: Microsoft, living up to its reputation, has open-sourced the J.A.R.V.I.S. AI assistant system. Jarvis, short for "Just A Rather Very Intelligent System," helps Iron Man Tony Stark handle all kinds of tasks and challenges, including controlling and managing Tony's armor, providing real-time intelligence and data analysis, and helping Tony make decisions.
Today we can all have a Jarvis AI assistant of our own, and the cost is merely one RTX 3090 graphics card.
Setting Up the Jarvis Environment
Generally speaking, the mainstream entry-level graphics cards for deep learning are the RTX 2070 or 3070, while the 3090 is pretty much the ceiling among consumer-grade deep learning GPUs.
Above that you are into industrial-grade A-series and V-series cards. VRAM is the hard requirement here, because the large models need to be loaded locally; you can trim the model-loading code, but some functionality will inevitably be lost. If you don't have a 3090, you can also pair two 3060 12G cards: the VRAM meets the bar, but the compute power and overall performance still don't match a single 3090.
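Before going any further, it is worth confirming how much VRAM your card actually exposes. The following is a minimal sketch, assuming PyTorch with CUDA support is already installed; it is not part of the JARVIS project:
import torch

# Report the name and total VRAM of the first CUDA device, if any.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA-capable GPU detected")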
Once your local hardware is up to the task of running Jarvis, follow the usual routine and clone the project:
git clone https://github.com/microsoft/JARVIS.git
Then enter the project directory:
cd JARVIS
Edit the project's configuration file server/config.yaml:
openai:
  key: your_personal_key # gradio, your_personal_key
huggingface:
  cookie: # required for huggingface inference
local: # ignore: just for development
  endpoint: http://localhost:8003
dev: false
debug: false
log_file: logs/debug.log
model: text-davinci-003 # text-davinci-003
use_completion: true
inference_mode: hybrid # local, huggingface or hybrid
local_deployment: minimal # no, minimal, standard or full
num_candidate_models: 5
max_description_length: 100
proxy:
httpserver:
  host: localhost
  port: 8004
modelserver:
  host: localhost
  port: 8005
logit_bias:
  parse_task: 0.1
  choose_model: 5
Only three settings really need to be changed here: the OpenAI key, the cookie token from the huggingface website, and the OpenAI model; the default model is text-davinci-003.
After editing the config, the official docs recommend a conda virtual environment with Python 3.8. In my view a virtual environment is entirely unnecessary here, and Python 3.10 directly works just fine.
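For those who would rather follow the official recommendation, a typical conda setup would look something like this (the environment name jarvis is just an example):
# create and activate an isolated environment with the officially recommended Python version
conda create -n jarvis python=3.8 -y
conda activate jarvis
Either way, install the dependencies next: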
pip3 install -r requirements.txt
The project's dependencies are as follows:
git+https://github.com/huggingface/diffusers.git@8c530fc2f6a76a2aefb6b285dce6df1675092ac6#egg=diffusers
git+https://github.com/huggingface/transformers@c612628045822f909020f7eb6784c79700813eda#egg=transformers
git+https://github.com/patrickvonplaten/controlnet_aux@78efc716868a7f5669c288233d65b471f542ce40#egg=controlnet_aux
tiktoken==0.3.3
pydub==0.25.1
espnet==202301
espnet_model_zoo==0.1.7
flask==2.2.3
flask_cors==3.0.10
waitress==2.1.2
datasets==2.11.0
asteroid==0.6.0
speechbrain==0.5.14
timm==0.6.13
typeguard==2.13.3
accelerate==0.18.0
pytesseract==0.3.10
gradio==3.24.1
The web interface here is built on the fairly recent Flask 2.2, yet, oddly, Microsoft does not use the async features introduced in newer Flask versions.
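For context, the async capability referred to here is Flask 2.x's support for coroutine view functions (it requires the flask[async] extra to be installed). A minimal illustration, unrelated to the JARVIS code itself:
import asyncio
from flask import Flask

app = Flask(__name__)

# Flask 2.x accepts coroutine views and runs them on an event loop for you.
@app.route("/ping")
async def ping():
    await asyncio.sleep(0.1)  # stand-in for some non-blocking I/O
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(port=5000)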
Once installation is complete, enter the models directory:
cd models
Download the models and datasets:
sh download.sh
Brace yourself here: the models alone already eat up a massive amount of disk space, never mind the datasets. All of the files come from huggingface:
models="
nlpconnect/vit-gpt2-image-captioning
lllyasviel/ControlNet
runwayml/stable-diffusion-v1-5
CompVis/stable-diffusion-v1-4
stabilityai/stable-diffusion-2-1
Salesforce/blip-image-captioning-large
damo-vilab/text-to-video-ms-1.7b
microsoft/speecht5_asr
facebook/maskformer-swin-large-ade
microsoft/biogpt
facebook/esm2_t12_35M_UR50D
microsoft/trocr-base-printed
microsoft/trocr-base-handwritten
JorisCos/DCCRNet_Libri1Mix_enhsingle_16k
espnet/kan-bayashi_ljspeech_vits
facebook/detr-resnet-101
microsoft/speecht5_tts
microsoft/speecht5_hifigan
microsoft/speecht5_vc
facebook/timesformer-base-finetuned-k400
runwayml/stable-diffusion-v1-5
superb/wav2vec2-base-superb-ks
openai/whisper-base
Intel/dpt-large
microsoft/beit-base-patch16-224-pt22k-ft22k
facebook/detr-resnet-50-panoptic
facebook/detr-resnet-50
openai/clip-vit-large-patch14
google/owlvit-base-patch32
microsoft/DialoGPT-medium
bert-base-uncased
Jean-Baptiste/camembert-ner
deepset/roberta-base-squad2
facebook/bart-large-cnn
google/tapas-base-finetuned-wtq
distilbert-base-uncased-finetuned-sst-2-english
gpt2
mrm8488/t5-base-finetuned-question-generation-ap
Jean-Baptiste/camembert-ner
t5-base
impira/layoutlm-document-qa
ydshieh/vit-gpt2-coco-en
dandelin/vilt-b32-finetuned-vqa
lambdalabs/sd-image-variations-diffusers
facebook/timesformer-base-finetuned-k400
facebook/maskformer-swin-base-coco
Intel/dpt-hybrid-midas
lllyasviel/sd-controlnet-canny
lllyasviel/sd-controlnet-depth
lllyasviel/sd-controlnet-hed
lllyasviel/sd-controlnet-mlsd
lllyasviel/sd-controlnet-openpose
lllyasviel/sd-controlnet-scribble
lllyasviel/sd-controlnet-seg
"
# CURRENT_DIR=$(cd `dirname $0`; pwd)
CURRENT_DIR=$(pwd)
for model in $models;
do
    echo "----- Downloading from https://huggingface.co/"$model" -----"
    if [ -d "$model" ]; then
        # cd $model && git reset --hard && git pull && git lfs pull
        cd $model && git pull && git lfs pull
        cd $CURRENT_DIR
    else
        # git clone pulls the lfs files as well
        git clone https://huggingface.co/$model $model
    fi
done
datasets="Matthijs/cmu-arctic-xvectors"
for dataset in $datasets;
do
    echo "----- Downloading from https://huggingface.co/datasets/"$dataset" -----"
    if [ -d "$dataset" ]; then
        cd $dataset && git pull && git lfs pull
        cd $CURRENT_DIR
    else
        git clone https://huggingface.co/datasets/$dataset $dataset
    fi
done
You could also consider splitting this into two shell scripts and downloading in multiple processes, which would be much faster, as sketched below.
But honestly, really, don't download it all. The files are simply too huge; this is not something an ordinary person can play with. Of course, Jarvis can also run without the local models and datasets downloaded; see below.
Once the long download process is over, Jarvis is fully set up.
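As a rough sketch of that idea (reusing the $models list from download.sh and assuming xargs and git-lfs are available), the clones can be fanned out across several processes:
# Clone up to four model repos at a time instead of one after another.
echo "$models" | awk 'NF' | xargs -P 4 -I {} git clone https://huggingface.co/{} {}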
Running Jarvis
If you chose to download all of the models and datasets (respect, you're a brave one), start the services in a terminal:
python models_server.py --config config.yaml
python awesome_chat.py --config config.yaml --mode server
The first command brings up the local model server; the second starts a Flask service on port 8004, and you can then drive Jarvis by sending it HTTP requests:
curl --location 'http://localhost:8004/hugginggpt' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": "please generate a video based on \"Spiderman is surfing\""
}
]
}'
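The same request can of course also be sent from Python; a small sketch with the requests library, assuming the service really is listening on localhost:8004:
import requests

# Ask Jarvis to plan and execute the "Spiderman is surfing" video task.
payload = {
    "messages": [
        {"role": "user", "content": "please generate a video based on \"Spiderman is surfing\""}
    ]
}
resp = requests.post("http://localhost:8004/hugginggpt", json=payload, timeout=600)
print(resp.json())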
This asks Jarvis to generate a video of "Spiderman is surfing".
Of course, with my hardware there is no way this runs as-is, so the set of loaded models can be pared down. Around line 81 of the models_server.py file:
other_pipes = {
"nlpconnect/vit-gpt2-image-captioning":{
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"feature_extractor": ViTImageProcessor.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"tokenizer": AutoTokenizer.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"device": "cuda:0"
},
"Salesforce/blip-image-captioning-large": {
"model": BlipForConditionalGeneration.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
"processor": BlipProcessor.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
"device": "cuda:0"
},
"damo-vilab/text-to-video-ms-1.7b": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"),
"device": "cuda:0"
},
"facebook/maskformer-swin-large-ade": {
"model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-large-ade"),
"feature_extractor" : AutoFeatureExtractor.from_pretrained("facebook/maskformer-swin-large-ade"),
"device": "cuda:0"
},
"microsoft/trocr-base-printed": {
"processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
"device": "cuda:0"
},
"microsoft/trocr-base-handwritten": {
"processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
"device": "cuda:0"
},
"JorisCos/DCCRNet_Libri1Mix_enhsingle_16k": {
"model": BaseModel.from_pretrained("JorisCos/DCCRNet_Libri1Mix_enhsingle_16k"),
"device": "cuda:0"
},
"espnet/kan-bayashi_ljspeech_vits": {
"model": Text2Speech.from_pretrained(f"espnet/kan-bayashi_ljspeech_vits"),
"device": "cuda:0"
},
"lambdalabs/sd-image-variations-diffusers": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/lambdalabs/sd-image-variations-diffusers"), #torch_dtype=torch.float16
"device": "cuda:0"
},
"CompVis/stable-diffusion-v1-4": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/CompVis/stable-diffusion-v1-4"),
"device": "cuda:0"
},
"stabilityai/stable-diffusion-2-1": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/stabilityai/stable-diffusion-2-1"),
"device": "cuda:0"
},
"runwayml/stable-diffusion-v1-5": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/runwayml/stable-diffusion-v1-5"),
"device": "cuda:0"
},
"microsoft/speecht5_tts":{
"processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
"model": SpeechT5ForTextToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
"vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
"embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
"device": "cuda:0"
},
"speechbrain/mtl-mimic-voicebank": {
"model": WaveformEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank", savedir="models/mtl-mimic-voicebank"),
"device": "cuda:0"
},
"microsoft/speecht5_vc":{
"processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
"model": SpeechT5ForSpeechToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
"vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
"embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
"device": "cuda:0"
},
"julien-c/wine-quality": {
"model": joblib.load(cached_download(hf_hub_url("julien-c/wine-quality", "sklearn_model.joblib")))
},
"facebook/timesformer-base-finetuned-k400": {
"processor": AutoImageProcessor.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
"model": TimesformerForVideoClassification.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
"device": "cuda:0"
},
"facebook/maskformer-swin-base-coco": {
"feature_extractor": MaskFormerFeatureExtractor.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
"model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
"device": "cuda:0"
},
"Intel/dpt-hybrid-midas": {
"model": DPTForDepthEstimation.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas", low_cpu_mem_usage=True),
"feature_extractor": DPTFeatureExtractor.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas"),
"device": "cuda:0"
}
}
Simply comment out the models you don't need.
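For example, a stripped-down variant that keeps only the image-captioning pipeline might look like this (a sketch only; which entries survive depends on the tasks you actually need):
# Keep only the lightweight image-captioning model; heavier pipelines are removed or commented out.
other_pipes = {
    "nlpconnect/vit-gpt2-image-captioning": {
        "model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "feature_extractor": ViTImageProcessor.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "tokenizer": AutoTokenizer.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "device": "cuda:0"
    },
    # "Salesforce/blip-image-captioning-large": {...},  # dropped to save VRAM
}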
At the same time, you can also bypass the local service interface entirely by switching inference_mode in config.yaml to huggingface.
Then run the ChatGPT-flavored Jarvis directly in the terminal:
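Concretely, the line shown earlier in server/config.yaml changes from hybrid to:
inference_mode: huggingface # local, huggingface or hybrid
With this setting, the huge local model downloads from the previous section are no longer required.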
cd server
python3 awesome_chat.py --config config.yaml --mode cli
Sample output:
Welcome to Jarvis! A collaborative system that consists of an LLM as the controller and numerous expert models as collaborative executors. Jarvis can plan tasks, schedule Hugging Face models, generate friendly responses based on your requests, and help you with many things. Please enter your request (`exit` to exit).
[ User ]: Given a collection of image A: /examples/cat.jpg, B: /examples/z3.jpg, C: /examples/z1.jpg, please tell me how many zebras in these picture?
[ Jarvis ]: Based on your request, I have used several models to process the images you provided. For the image A: /examples/cat.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a cat sitting on a window sill looking out". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/a3dc.jpg', which contains two objects with labels 'potted plant' and 'cat'. For the image B: /examples/z3.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "two zebras standing in a field of dry grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/f5df.jpg', which contains three objects with labels 'zebra'. For the image C: /examples/z1.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a zebra standing in a field of tall grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/69c7.jpg', which contains one object with label 'zebra'. Finally, I used the canny-control model canny-control to generate the image '/images/7e18.png' based on the image /examples/savanna.jpg. Therefore, there are four zebras in these pictures. Is there anything else I can help you with?
[ User ]: Please answer all the named entities in the sentence: Iron Man is a superhero appearing in American comic books published by Marvel Comics. The character was co-created by writer and editor Stan Lee, developed by scripter Larry Lieber, and designed by artists Don Heck and Jack Kirby.
[ Jarvis ]: Sure, I understand your request. Based on the inference results of the models, I will answer your request step by step. Firstly, I used the token-classification model dslim/bert-base-NER to recognize the named entities in the sentence. The inference result is that the named entities are Iron Man, Stan Lee, Larry Lieber, Don Heck and Jack Kirby. Then, I used the question-answering model deepset/roberta-base-squad2 to answer your request. The inference result is that there is no output since the context cannot be empty. Therefore, I cannot make it. I hope this answer your request. Is there anything else I can help you with?
[ User ]: Please generate a canny image based on /examples/savanna.jpg
[ Jarvis ]: Sure. I have generated a canny image based on /examples/savanna.jpg. To do this, I first used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text description of the image, which is "a herd of giraffes and zebras grazing in a field". Then I used the canny-control model to generate the canny image of the original image, which is stored in the path "/images/1ea2.png". Finally, I used the canny-text-to-image model lllyasviel/sd-controlnet-canny to generate the canny image based on the text description and the canny image, which is stored in the path "/images/ba66.png". Do you need anything else?
You can think of it as ChatGPT-driven text-to-image generation wrapped in a Jarvis shell, and as a piece of theater it scores full marks.
Conclusion
All in all, much like the Visual ChatGPT that Microsoft open-sourced earlier, the symbolic significance of this kind of project far outweighs its practical significance. Jarvis embodies a vision shared by most of us in the field, and the direction of this kind of AI technology deserves recognition; but given the steep hardware requirements, we shouldn't expect too much from it in the short term.