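Become an Artist in Seconds with EasyNLP's Chinese Text-to-Image Generation Models

Authors: Chengyu Wang, Tingting Liu

Introduction

Nothing conveys things better than words; nothing preserves form better than painting. -- Lu Ji (Jin dynasty)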

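Multimodal data (text, images, sound) is an important medium through which humans perceive, understand, and express the world. In recent years, the explosive growth of multimodal data has fueled a thriving content-driven internet and created strong demand for multimodal content understanding and generation. Unlike the more common cross-modal understanding tasks, text-to-image generation is a popular cross-modal generation task that aims to produce an image matching a given text. It has greatly unleashed the imagination of AI and sparked human creativity. Typical models include DALL-E and DALL-E 2, developed by OpenAI. More recently, the industry has trained larger and newer text-to-image models, such as Parti and Imagen from Google.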

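However, these models generally cannot handle Chinese input, and their enormous parameter counts make them hard for users in the open-source community to fine-tune or run inference with directly. In this release, the open-source EasyNLP framework gets another major upgrade: it integrates the advanced Transformer+VQGAN text-to-image architecture and freely releases Chinese text-to-image checkpoints of different sizes to the open-source community, along with the corresponding fine-tuning and inference interfaces. Starting from our released checkpoints, users can perform a small amount of domain-specific fine-tuning and create all kinds of artwork without consuming large amounts of compute.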

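EasyNLP is an easy-to-use and feature-rich Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch, offering a one-stop NLP development experience from training to deployment. It provides concise interfaces for developing NLP models, including the AppZoo of NLP applications, the ModelZoo of pre-trained models, and the DataHub data repository. As demand for cross-modal understanding and generation keeps growing, EasyNLP also supports a range of cross-modal models, especially for Chinese, and releases them to the open-source community. For example, in earlier work EasyNLP added support for the Chinese CLIP model for text-image retrieval. We hope to serve more NLP and multimodal algorithm developers and researchers, and to work with the community to advance NLP and multimodal technology and bring models into production. This article briefly introduces text-to-image generation techniques and shows how to implement text-to-image generation easily in the EasyNLP framework, turning you into an artist in seconds. The showcase images at the top of this article were created by our models.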

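A Brief Overview of Text-to-Image Generation Models

Below we use several classic Transformer-based works to briefly introduce the techniques behind text-to-image generation. DALL-E, proposed by OpenAI, generates images in two stages. In the first stage, a dVAE (discrete variational autoencoder) is trained to convert a 256×256 RGB image into a 32×32 grid of image tokens; this step compresses and discretizes the image, which makes text-to-image generation tractable. In the second stage, DALL-E trains an autoregressive Transformer that maps the text input to those 32×32 = 1024 image tokens.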

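CogView, proposed by Tsinghua University and collaborators, further optimizes this two-stage text-to-image process. As shown in the figure below, CogView uses SentencePiece as its text tokenizer, which gives the input text a richer representation space, and applies several techniques during fine-tuning, such as image super-resolution and style transfer.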

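ERNIE-ViLG further considers the transferability of the knowledge learned by the Transformer, jointly learning two tasks: generating images from text and generating text from images. Its architecture is shown below: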

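As text-to-image technology keeps advancing, new models and techniques continue to emerge. For example, OFA unifies multiple cross-modal generation tasks in a single model architecture. DALL-E 2, also from OpenAI, is an upgraded version of DALL-E that adopts hierarchical image generation and uses the CLIP encoder as its encoder to make better use of CLIP's pre-trained cross-modal representations. Google further proposed a diffusion-model-based architecture (Imagen) that can effectively generate large, high-resolution images, as shown below: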

We will not go into these details here. Interested readers can consult the references.

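EasyNLP Text-to-Image Generation Models

Because the models above typically have billions to tens of billions of parameters, they can generate high-quality images, but their demands on compute and pre-training data make them hard to adopt widely in the open-source community, especially when a vertical domain must be targeted. In this section we describe in detail the Chinese text-to-image generation models provided by EasyNLP, which still produce good results with far fewer parameters.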

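Model Architecture

The model framework is shown in the figure below:

Because the complexity of a Transformer grows quadratically with sequence length, text-to-image models are usually trained in two stages: image vector quantization followed by autoregressive training.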

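Image vector quantization encodes an image into discrete tokens. For example, a 256×256 RGB image is downsampled by a factor of 16 to give a 16×16 discrete sequence, in which each image token corresponds to an entry in a codebook. Common image vector quantization methods include VQVAE, VQVAE-2, and VQGAN. We use the weights of the VQGAN f16_16384 model trained on ImageNet (16× downsampling, codebook size 16384) to produce the discrete image sequence.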
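To make the codebook lookup concrete, here is a minimal sketch in plain Python. It is not the VQGAN implementation: the four-entry 2-D codebook and the `quantize` helper are hypothetical, and only the underlying idea (each feature vector is replaced by the index of its nearest codebook entry) comes from the description above.

```python
# Toy illustration of vector quantization (not the actual VQGAN code).
# Assumption: a tiny hand-written codebook; the real f16_16384 model
# uses a learned codebook with 16384 entries.

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

indices = quantize(features, codebook)  # one discrete token per feature vector
print(indices)  # [0, 3, 2]
```

In the real model the feature vectors come from the VQGAN encoder, and the resulting indices are the image tokens the Transformer learns to predict.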

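Autoregressive training takes the text sequence and the image sequence as input. In the image part, each image token attends only to the tokens of the text sequence and to the image tokens that precede it. We use GPT as the backbone, which adapts to generation tasks at different model scales. At prediction time, given an input text sequence, the model autoregressively generates a fixed-length image sequence step by step, which the VQGAN decoder then reconstructs into an image.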
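The attention pattern described here can be pictured with a small sketch. It assumes a standard GPT-style causal (lower-triangular) mask over the concatenated text and image sequence; the `causal_mask` helper and the toy lengths are illustrative and are not taken from the EasyNLP code.

```python
# Sketch of a causal attention mask over [text tokens | image tokens].
# Assumption: a plain lower-triangular mask, as in GPT-style models.

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

text_len, img_len = 2, 3  # toy sizes; the released models use 32 and 256
mask = causal_mask(text_len + img_len)

first_img = text_len  # position of the first image token
# The first image token sees all text tokens and itself, but no later image token.
visible = [j for j, allowed in enumerate(mask[first_img]) if allowed]
print(visible)  # [0, 1, 2]
```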

Open-Source Model Parameter Settings

Model configuration	pai-painter-base-zh	pai-painter-large-zh
Parameters	202M	433M
Number of Layers	12	24
Attention Heads	12	16
Hidden Size	768	1024
Text Length	32	32
Image Sequence Length	16 x 16	16 x 16
Image Size	256 x 256	256 x 256
VQGAN Codebook Size	16384	16384
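As a quick sanity check, the lengths in the table fit together as sketched below. The variable names are ours; the values are those in the table, plus the 16× downsampling factor of the VQGAN f16 model. The total of 288 matches the `--sequence_length=288` setting used in the tutorial commands later in this article.

```python
# Sketch of how the sequence lengths fit together.
text_len = 32      # Text Length
image_size = 256   # Image Size (pixels per side)
downsample = 16    # VQGAN f16 downsampling factor

grid_side = image_size // downsample   # tokens per side of the image grid
img_len = grid_side * grid_side        # image tokens per image
sequence_length = text_len + img_len   # total tokens fed to the Transformer

print(grid_side, img_len, sequence_length)  # 16 256 288
```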

Model Implementation

In the EasyNLP framework, the model is built at the model layer on a minGPT-based backbone. The core part looks like this:

self.first_stage_model = VQModel(ckpt_path=vqgan_ckpt_path).eval()
self.transformer = GPT(self.config)

The encoding stage of VQModel is:

# in easynlp/appzoo/text2image_generation/model.py

@torch.no_grad()
def encode_to_z(self, x):
    quant_z, _, info = self.first_stage_model.encode(x)
    indices = info[2].view(quant_z.shape[0], -1)
    return quant_z, indices

x = inputs['image']
x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
# one step to produce the logits
_, z_indices = self.encode_to_z(x)  # z_indices: torch.Size([batch_size, 256])

The decoding stage of VQModel is:

# in easynlp/appzoo/text2image_generation/model.py

@torch.no_grad()
def decode_to_img(self, index, zshape):
    bhwc = (zshape[0],zshape[2],zshape[3],zshape[1])
    quant_z = self.first_stage_model.quantize.get_codebook_entry(
        index.reshape(-1), shape=bhwc)
    x = self.first_stage_model.decode(quant_z)
    return x

# sample generates results during training and works like generate at prediction time; see generate below
index_sample = self.sample(z_start_indices, c_indices,
                           steps=z_indices.shape[1],
                           ...)
x_sample = self.decode_to_img(index_sample, quant_z.shape)

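The Transformer is built with minGPT. It takes the text tokens concatenated with the discrete image codes as input and predicts the image tokens. The forward pass is: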

# in easynlp/appzoo/text2image_generation/model.py

def forward(self, inputs):
    x = inputs['image']
    c = inputs['text']
    # (B, H, W, C) -> (B, C, H, W)
    x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
    # one step to produce the logits
    _, z_indices = self.encode_to_z(x)  # z_indices: torch.Size([batch_size, 256])
    c_indices = c

    if self.training and self.pkeep < 1.0:
        # keep each image token with probability pkeep, otherwise replace it with a random id
        mask = torch.bernoulli(self.pkeep*torch.ones(z_indices.shape,
                                                     device=z_indices.device))
        mask = mask.round().to(dtype=torch.int64)
        r_indices = torch.randint_like(z_indices, self.transformer.config.vocab_size)
        a_indices = mask*z_indices+(1-mask)*r_indices
    else:
        a_indices = z_indices

    cz_indices = torch.cat((c_indices, a_indices), dim=1)
    # target includes all sequence elements (no need to handle first one
    # differently because we are conditioning)
    target = z_indices
    # make the prediction
    logits, _ = self.transformer(cz_indices[:, :-1])
    # cut off conditioning outputs - output i corresponds to p(z_i | z_{<i}, c)
    logits = logits[:, c_indices.shape[1]-1:]
    return logits, target
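The two slices in forward, `cz_indices[:, :-1]` and `logits[:, c_indices.shape[1]-1:]`, are easy to get wrong, so here is a small sketch that checks the alignment with plain Python lists. It uses symbolic tokens and toy lengths instead of tensors, and relies only on the usual property of a causal model that the output at position t predicts the token at position t + 1.

```python
# Sketch: why the kept logits line up one-to-one with the image-token targets.
text_len, img_len = 3, 4  # toy sizes (the released models use 32 and 256)

c = [f"c{i}" for i in range(text_len)]  # conditioning text tokens
z = [f"z{i}" for i in range(img_len)]   # image tokens, i.e. the targets

cz = c + z
inputs = cz[:-1]  # mirrors cz_indices[:, :-1]: the last image token is never an input

# In a causal model, the output at position t is the prediction for position t + 1.
predicted = [cz[t + 1] for t in range(len(inputs))]

# Mirrors logits[:, text_len - 1:]: drop the outputs that predict text tokens.
kept = predicted[text_len - 1:]

assert kept == z  # one prediction per image token, in order
print(kept)
```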

At prediction time, the input is text tokens and the output is a 256×256 image. First, the input text is preprocessed into a token sequence:

# in easynlp/appzoo/text2image_generation/predictor.py

def preprocess(self, in_data):
    if not in_data:
        raise RuntimeError("Input data should not be None.")

    if not isinstance(in_data, list):
        in_data = [in_data]
    rst = {"idx": [], "input_ids": []}
    # use the largest per-record sequence_length if provided, else the default
    max_seq_length = -1
    for record in in_data:
        if "sequence_length" not in record:
            break
        max_seq_length = max(max_seq_length, record["sequence_length"])
    max_seq_length = self.sequence_length if (max_seq_length == -1) else max_seq_length

    for record in in_data:
        text = record[self.first_sequence]
        try:
            self.MUTEX.acquire()
            text_ids = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))
            # truncate or pad the text to exactly text_len tokens
            text_ids = text_ids[: self.text_len]
            n_pad = self.text_len - len(text_ids)
            text_ids += [self.pad_id] * n_pad
            # shift the text ids by the size of the image vocabulary
            text_ids = np.array(text_ids) + self.img_vocab_size
        finally:
            self.MUTEX.release()

        rst["idx"].append(record["idx"])
        rst["input_ids"].append(text_ids)
    return rst
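The truncate, pad, and offset steps in preprocess can be tried in isolation with the sketch below. `encode_text` is a hypothetical helper written for this illustration, not an EasyNLP function, and the toy token ids are made up. Our reading of the offset is that it keeps text ids from colliding with image token ids (0 to 16383) in a shared vocabulary; treat that as an interpretation rather than a documented guarantee.

```python
# Sketch of the text-id preparation: truncate, pad, then shift by img_vocab_size.
# Assumption: encode_text and the example ids are illustrative only.

def encode_text(token_ids, text_len, pad_id, img_vocab_size):
    ids = list(token_ids)[:text_len]          # truncate to text_len
    ids += [pad_id] * (text_len - len(ids))   # pad up to text_len
    return [i + img_vocab_size for i in ids]  # shift past the image vocabulary

ids = encode_text([5, 9, 2], text_len=6, pad_id=0, img_vocab_size=16384)
print(ids)  # [16389, 16393, 16386, 16384, 16384, 16384]
```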

Next, the discrete image token sequence of length 16×16 is generated step by step:

# in easynlp/appzoo/text2image_generation/model.py

def generate(self, inputs, top_k=100, temperature=1.0):
    cidx = inputs
    sample = True
    steps = 256
    for k in range(steps):
        x_cond = cidx
        logits, _ = self.transformer(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = self.top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        cidx = torch.cat((cidx, ix), dim=1)
    img_idx = cidx[:, 32:]  # drop the 32 conditioning text tokens (text_len)
    return img_idx
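The `top_k_logits` helper called in generate is not shown above. The sketch below illustrates one common way to do top-k filtering followed by temperature-scaled sampling, written in plain Python on a single list of logits rather than on tensors. It is an assumption about the helper's behavior, not a copy of the EasyNLP code, and the function names other than `top_k_logits` are ours.

```python
import math
import random

def top_k_logits(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, top_k, temperature, rng):
    """Scale by temperature, filter to the top k, then sample one token id."""
    scaled = [x / temperature for x in logits]
    probs = softmax(top_k_logits(scaled, top_k))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0]
filtered = top_k_logits(logits, k=2)  # only positions 0 and 2 survive
token = sample_next(logits, top_k=2, temperature=1.0, rng=rng)
assert token in (0, 2)
```

A smaller `top_k` or a lower `temperature` makes sampling more conservative; larger values give more varied images at the cost of more artifacts.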

Finally, we call the decoding stage of VQModel to convert these discrete image tokens into an image.

Model Performance

We evaluated the text-to-image models in EasyNLP on four public Chinese datasets: COCO-CN, MUGE, Flickr8k-CN, and Flickr30k-CN. We also compared them with CogView and DALL-E, as shown below:

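Notes:
1) MUGE is a large-scale Chinese multimodal benchmark for e-commerce scenarios released on the Tianchi platform (http://tianchi.aliyun.com/muge). To simplify metric computation, we report results on the valid set for MUGE and on the test set for the other datasets.

2) CogView comes from https://github.com/THUDM/CogView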

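3) The DALL-E model has no publicly released official code. The part that has been released contains only the VQVAE code, not the Transformer. We reproduced it using the widely followed code at https://github.com/lucidrains... and the checkpoints recommended by that version. Those checkpoints have 209 million parameters, roughly 1/60 of the parameter count of OpenAI's DALL-E model (OpenAI's DALL-E has 12 billion parameters, of which CLIP has 400 million).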

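Classic Examples

We fine-tuned both the base and large models on the natural-scenery dataset COCO-CN. The results are shown below:

Example 1: 一隻俏皮的狗正跑過草地 (A playful dog is running across the grass)

Example 2: 一片水域的景色以日落為背景 (A view of a body of water with a sunset in the background)

We have also accumulated massive e-commerce product data from Alibaba Group and fine-tuned a text-to-image model for e-commerce products. The results are as follows:

Example 3: 女童套頭毛衣打底衫秋冬針織衫童裝兒童內搭上衣 (Girls' pullover sweater, base-layer top, autumn/winter knitwear, children's clothing, kids' inner top)

Example 4: 春夏真皮工作鞋女深色軟皮久站舒適上班面試職業皮鞋 (Spring/summer genuine-leather work shoes for women, dark soft leather, comfortable for long standing, professional leather shoes for office and interviews)

Beyond domain-specific applications, text-to-image generation is also a great aid to human artistic creation. With the trained model, we can become a "master of traditional Chinese painting" in seconds, as the examples below show:

More examples to enjoy: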

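Tutorial

Having enjoyed the model's creations, how do we go about training our own text-to-image model? Below we briefly describe how to fine-tune a pre-trained text-to-image model and run inference with it in the EasyNLP framework.

Installing EasyNLP

Users can install the EasyNLP framework by following the linked installation instructions.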

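Data Preparation

First, prepare the training and validation data as tsv files. Each file contains three columns separated by tabs (\t): the first column is the index, the second is the text, and the third is the base64 encoding of the image. The input file used for testing has two columns, containing only the index and the text.

For developers' convenience, we also provide sample code for converting an image to base64: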

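import base64
from io import BytesIO
from PIL import Image

fn = "example.jpg"  # placeholder: path to your own image file
img = Image.open(fn)
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)  # re-encode in the file's original format
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes; call .decode("utf-8") to get a str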
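To turn such a base64 string into a row of the training file, something like the following sketch could be used. `make_tsv_row` is a hypothetical helper, the image bytes are a stand-in rather than a real picture, and we assume the text itself contains no tab or newline characters.

```python
import base64

def make_tsv_row(idx, text, image_bytes):
    """Build one training row: index, text, base64-encoded image (tab-separated)."""
    img_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return "\t".join([str(idx), text, img_b64])

# Stand-in bytes; in practice use the bytes of an encoded image file.
fake_image = b"\x89PNG fake bytes"
row = make_tsv_row(0, "寬鬆T恤", fake_image)

cols = row.split("\t")
assert len(cols) == 3
assert base64.b64decode(cols[2]) == fake_image
```

Writing one such row per line produces a file with the three-column layout described above.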

The following files have already been preprocessed and can be used for testing:

# train
https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/painter_text2image/MUGE_train_text_imgbase64.tsv

# valid
https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/painter_text2image/MUGE_val_text_imgbase64.tsv

# test
https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/painter_text2image/MUGE_test.text.tsv

Model Training

We fine-tune the model with the following command:

easynlp \
    --mode=train \
    --worker_gpu=1 \
    --tables=MUGE_train_text_imgbase64.tsv,MUGE_val_text_imgbase64.tsv \
    --input_schema=idx:str:1,text:str:1,imgbase64:str:1 \
    --first_sequence=text \
    --second_sequence=imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --learning_rate=4e-5 \
    --epoch_num=1 \
    --random_seed=42 \
    --logging_steps=100 \
    --save_checkpoint_steps=1000 \
    --sequence_length=288 \
    --micro_batch_size=16 \
    --app_name=text2image_generation \
    --user_defined_parameters='
        pretrain_model_name_or_path=alibaba-pai/pai-painter-large-zh
        size=256
        text_len=32
        img_len=256
        img_vocab_size=16384
    '

We provide base and large versions of the pre-trained model; set pretrain_model_name_or_path to alibaba-pai/pai-painter-base-zh or alibaba-pai/pai-painter-large-zh, respectively.

After training, the model is saved to ./finetuned_model/.

Batch Inference

Once training is complete, the model can be used for image generation, for example:

easynlp \
    --mode=predict \
    --worker_gpu=1 \
    --tables=MUGE_test.text.tsv \
    --input_schema=idx:str:1,text:str:1 \
    --first_sequence=text \
    --outputs=./T2I_outputs.tsv \
    --output_schema=idx,text,gen_imgbase64 \
    --checkpoint_dir=./finetuned_model/ \
    --sequence_length=288 \
    --micro_batch_size=8 \
    --app_name=text2image_generation \
    --user_defined_parameters='
        size=256
        text_len=32
        img_len=256
        img_vocab_size=16384
    '

The results are stored in a tsv file in which each row corresponds to one input text, with the generated image encoded as base64.
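A row of the output file could be unpacked along the lines of the sketch below. It assumes the columns follow the order given in `--output_schema=idx,text,gen_imgbase64` and that the image uses URL-safe base64; both are assumptions about the file layout, and `parse_output_line` is a hypothetical helper. The demo row is fabricated so that the snippet runs without a real output file.

```python
import base64

def parse_output_line(line):
    """Split one output row into (idx, text, image_bytes)."""
    idx, text, img_b64 = line.rstrip("\n").split("\t")
    return idx, text, base64.urlsafe_b64decode(img_b64)

# Fabricated demo row; a real row carries an encoded image in the third column.
fake_image = b"not a real image"
demo_line = "0\t寬鬆T恤\t" + base64.urlsafe_b64encode(fake_image).decode("utf-8") + "\n"

idx, text, image_bytes = parse_output_line(demo_line)
assert image_bytes == fake_image
```

With real output, `image_bytes` can be written to a file or opened with PIL via `Image.open(BytesIO(image_bytes))`.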

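Quickly Trying Text-to-Image Generation with the Pipeline Interface

To make things even easier for developers, we have also implemented an Inference Pipeline in the EasyNLP framework. Users can call the fine-tuned e-commerce text-to-image model with the following code: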

import base64
from io import BytesIO
from PIL import Image
from easynlp.pipelines import pipeline  # assumed import path for the pipeline helper

# build the pipeline directly
default_ecommercial_pipeline = pipeline("pai-painter-commercial-base-zh")

# run prediction
data = ["寬鬆T恤"]  # "loose-fit T-shirt"
results = default_ecommercial_pipeline(data)  # each result holds the base64 encoding of a generated image

# convert base64 to an image
def base64_to_image(imgbase64_str):
    image = Image.open(BytesIO(base64.urlsafe_b64decode(imgbase64_str)))
    return image

# save each image under a file name taken from its text
for text, result in zip(data, results):
    imgpath = '{}.png'.format(text)
    imgbase64_str = result['gen_imgbase64']
    image = base64_to_image(imgbase64_str)
    image.save(imgpath)
    print('text: {}, save generated image: {}'.format(text, imgpath))

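Besides the e-commerce scenario, we also provide models for the following scenarios:

Natural scenery: “pai-painter-scenery-base-zh”
Chinese landscape painting: “pai-painter-painting-base-zh”

Replace “pai-painter-commercial-base-zh” in the code above with one of these names to try them directly. For text-to-image models that users have fine-tuned themselves, we also provide a Pipeline interface for loading custom models: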

# load the model and build the pipeline
local_model_path = ...
text_to_image_pipeline = pipeline("text2image_generation", local_model_path)

# run prediction
data = ["xxxx"]
results = text_to_image_pipeline(data)  # each result holds the base64 encoding of a generated image

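Future Outlook

In this release, we integrated Chinese text-to-image generation into the EasyNLP framework and released the model checkpoints, so that users in the open-source community can perform a small amount of domain-specific fine-tuning with limited resources and create all kinds of artwork. Going forward, we plan to release more related models in EasyNLP, so stay tuned. We will also integrate more SOTA models (especially Chinese models) into EasyNLP to support a variety of NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team continues to develop its own Chinese multimodal models. We welcome you to keep following our work and to join our open-source community to build Chinese NLP and multimodal algorithm libraries together!

GitHub: https://github.com/alibaba/EasyNLP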

Reference

Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin. EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing. arXiv
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever. Zero-Shot Text-to-Image Generation. ICML 2021: 8821-8831
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. NeurIPS 2021: 19822-19835
Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation. arXiv
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. ICML 2022
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. Neural Discrete Representation Learning. NIPS 2017
Patrick Esser, Robin Rombach, Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. CVPR 2021: 12873-12883
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv
