Computer and Modernization ›› 2024, Vol. 0 ›› Issue (12): 15-23. doi: 10.3969/j.issn.1006-2475.2024.12.003


Fashion Clothing Pattern Generation Based on Improved Stable Diffusion

  

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
  • Online: 2024-12-31    Published: 2024-12-31

Abstract: Clothing patterns are a window through which people express their personality and sense of fashion. In recent years, with the continuous development of multimodal technology, text-based clothing pattern generation has been studied extensively. However, existing methods have not been applied well in practice because of poor text-image semantic alignment and low resolution. Since the large-scale language-image pre-training model CLIP was proposed, pre-trained diffusion models combined with CLIP have become the mainstream approach to text-to-image generation. However, the original pre-trained models generalize poorly to this downstream task: relying on the pre-trained model alone does not allow flexible and accurate control of the color and structure of clothing patterns, and the large number of parameters makes retraining from scratch impractical. To address these problems, this study designs an improved Stable Diffusion network, FT-SDM-L (Fine Tuning-Stable Diffusion Model-Lion), which uses a clothing image-text dataset to update the weights of the cross-attention modules in the original model. Experimental results show that the fine-tuned model improves the ClipScore and HPS v2 scores by 0.08 and 1.22 on average, confirming the importance of this module in incorporating textual information. To further enhance the model's feature extraction and data-mapping capabilities in the clothing domain, a lightweight adapter, Stable-Adapter, is then added at the output of this module to better capture changes in the input prompts. With only 0.75% additional parameters introduced by the adapter, the ClipScore and HPS v2 scores improve by a further 0.05 and 0.38. The model achieves good results in terms of the fidelity and semantic consistency of generated clothing patterns.
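The abstract describes two mechanisms: updating only the cross-attention weights of a pre-trained Stable Diffusion U-Net, and inserting a small residual adapter at each cross-attention output. The sketch below illustrates one plausible realization in PyTorch; it is not the paper's actual implementation, and the class names, bottleneck size, and the "attn2" name filter are assumptions (the "attn2" suffix follows the diffusers naming convention for cross-attention blocks).

```python
import torch
import torch.nn as nn

class StableAdapter(nn.Module):
    """Residual bottleneck adapter placed after a cross-attention block.

    down-project -> GELU -> up-project, with the up-projection initialized
    to zero so the adapter starts as the identity and adds few parameters.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedCrossAttention(nn.Module):
    """Wraps an existing cross-attention module and passes its output
    through a StableAdapter."""
    def __init__(self, attn: nn.Module, dim: int):
        super().__init__()
        self.attn = attn
        self.adapter = StableAdapter(dim)

    def forward(self, hidden_states, encoder_hidden_states=None, **kwargs):
        out = self.attn(hidden_states,
                        encoder_hidden_states=encoder_hidden_states, **kwargs)
        return self.adapter(out)


def trainable_cross_attention_params(unet: nn.Module):
    """Freeze the whole U-Net, then unfreeze only modules whose name ends
    with 'attn2' (cross-attention under the assumed naming convention) and
    return their parameters for the optimizer."""
    for p in unet.parameters():
        p.requires_grad_(False)
    params = []
    for name, module in unet.named_modules():
        if name.endswith("attn2"):
            for p in module.parameters():
                p.requires_grad_(True)
                params.append(p)
    return params
```

In training, the returned cross-attention parameters plus the adapter parameters would be passed to the optimizer (the model name suggests the Lion optimizer; torch.optim.AdamW works as a stand-in), while the rest of the diffusion backbone stays frozen.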

Key words: text-to-image generation; diffusion model; cross-attention mechanism; image generation; computer vision
