计算机与现代化 ›› 2024, Vol. 0 ›› Issue (12): 15-23.doi: 10.3969/j.issn.1006-2475.2024.12.003

• 人工智能 •

基于改进Stable Diffusion的时尚服饰图案生成


  1. (西安工程大学计算机科学学院,陕西 西安 710048)
  • 出版日期:2024-12-31 发布日期:2024-12-31
  • 基金资助:
    国家自然科学基金青年科学基金资助项目(62202366); 陕西省技术创新引导专项计划资助项目(2020CGXNG-012)

Fashion Clothing Pattern Generation Based on Improved Stable Diffusion

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
  • Online:2024-12-31 Published:2024-12-31

摘要: 服饰图案是人们展示个性与时尚的窗口。近年来,随着多模态技术的不断发展,基于文本的服饰图案生成得到了充分研究。但现有方法由于结合语义性较差和分辨率不高等问题并未得到很好的应用。大规模语言-图像预训练模型CLIP提出后,各种预训练扩散模型结合CLIP做文本图像生成任务已成为该领域的主流方法。但原始预训练模型对下游任务泛化能力较差,单纯依靠预训练模型并不能灵活准确控制服饰图案的颜色和结构,且其庞大的参数量很难从头重新训练。为解决上述问题,本文设计一个基于Stable Diffusion改进的网络FT-SDM-L(Fine Tuning-Stable Diffusion Model-Lion),该网络使用服饰图像文本数据集,对原模型中的交叉注意力模块进行权重更新。实验结果表明,微调后模型的ClipScore及HPS v2分数平均提高了0.08和1.22,验证了该模块在结合文本信息中的重要能力。随后为进一步增强模型在服饰领域的特征提取和数据映射能力,在该模块输出位置设计添加一个轻量级适配器Stable-Adapter,最大限度地感知输入提示的变化。该适配器仅额外增加0.75%的参数就可使模型的ClipScore及HPS v2分数进一步提高0.05、0.38。模型在服饰图案生成的保真度和语义一致性上均取得良好效果。

关键词: 文本图像生成; 扩散模型; 交叉注意力机制; 图像生成; 计算机视觉

Abstract: Clothing patterns are a window through which people express personality and fashion. In recent years, with the continuous development of multimodal techniques, text-based clothing pattern generation has been studied extensively. However, existing methods have seen limited practical application because of poor semantic alignment and low resolution. Since the large-scale language-image pretraining model CLIP was proposed, combining pretrained diffusion models with CLIP for text-to-image generation has become the mainstream approach in this field. However, the original pretrained models generalize poorly to downstream tasks: relying on a pretrained model alone does not allow flexible and accurate control over the color and structure of clothing patterns, and the huge number of parameters makes retraining from scratch impractical. To address these problems, this study designs FT-SDM-L (Fine Tuning-Stable Diffusion Model-Lion), a network improved from Stable Diffusion, which uses a clothing image-text dataset to update the weights of the cross-attention modules in the original model. Experimental results show that the fine-tuned model improves the ClipScore and HPS v2 scores by 0.08 and 1.22 on average, confirming the importance of this module in incorporating textual information. To further strengthen the model's feature extraction and data mapping capabilities in the clothing domain, a lightweight adapter, Stable-Adapter, is added at the output of this module to maximize sensitivity to changes in the input prompt. With only 0.75% additional parameters, the adapter further improves the ClipScore and HPS v2 scores by 0.05 and 0.38. The model achieves good results in both fidelity and semantic consistency of clothing pattern generation.
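The adapter described above — a small bottleneck inserted after the cross-attention output, trained while the backbone stays frozen — can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `StableAdapter`, the bottleneck width, the GELU nonlinearity, and the channel width of 320 are all assumptions for the example.

```python
import numpy as np

class StableAdapter:
    """Minimal bottleneck adapter: down-project, nonlinearity,
    up-project, plus a residual connection, so the module adds
    few parameters and only perturbs the frozen backbone."""

    def __init__(self, dim, bottleneck=16, seed=0):
        rng = np.random.default_rng(seed)
        # small random init for the down-projection
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # zero-init up-projection: the adapter starts as an identity map
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        h = x @ self.w_down
        # GELU, tanh approximation
        h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
        return x + h @ self.w_up  # residual: output = input + adapter branch

    def num_params(self):
        return self.w_down.size + self.w_up.size

dim = 320  # illustrative channel width of one cross-attention block
adapter = StableAdapter(dim)
x = np.ones((2, dim))
y = adapter(x)
print(y.shape)            # (2, 320)
print(np.allclose(y, x))  # True: zero-init up-projection -> identity at start
```

Zero-initializing the up-projection is a common trick in adapter-style fine-tuning: the adapted model initially reproduces the pretrained behavior exactly, and training moves it away only as far as the clothing data requires.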

Key words: text image generation; diffusion model; cross-attention mechanism; image generation; computer vision
