Posted 2023-03-15 | Updated 2023-03-15 | 7 minutes read (About 1052 words)

Image and Text Multimodal

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

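Under the hood the model still takes image + text input (a multimodal element mixed in, perhaps to sound more impressive?) and still produces text outputs only. (I have heard that the ChatGPT Plus version can already output images; I suspect this is a combination of commands, similar to Visual ChatGPT, which Microsoft just proposed and which I covered in the previous post: vision models act as the tool models, and the large-scale pretrained language model acts as the agent.)
It accepts more input tokens (more personalized, easier to customize, more task-specific).
Some VQA performance comparisons were added.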

Professional and Academic Benchmarks

While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.

professional benchmarks

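It sweeps the world of exam-oriented education and outscores a great many ordinary people.

Time to put AI for science on the agenda: study it early, then replace ourselves (just kidding, though I do look forward to that day).

Reaching this level on the GRE and LeetCode would take even me a fair amount of time.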

academic benchmark

They already call it benchmark-specific tuning; the task-oriented DL hyperparameter tuners are trembling.

Thinking

This part made me realize that training a large model requires coordination across many areas, including:
- Pretraining
  - Compute cluster scaling
  - Data
  - Distributed training infrastructure
  - Hardware correctness
  - Optimization & architecture
  - Training run babysitting
- Long context
  - Long context research
  - Long context kernels
- Vision
  - Architecture research
  - Compute cluster scaling
  - Distributed training infrastructure
  - Hardware correctness
  - Data
  - Alignment data
  - Training run babysitting
  - Deployment & post-training
- Reinforcement Learning & Alignment
  - Dataset contributions
  - Data infrastructure
  - ChatML format
  - Model safety
  - Refusals
  - Foundational RLHF and InstructGPT work
  - Flagship training runs
  - Code capability
- Evaluation & analysis
  - OpenAI Evals library
  - Model-graded evaluation infrastructure
  - Acceleration forecasting
  - ChatGPT evaluations
  - Capability evaluations
  - Coding evaluations
  - Real-world use case evaluations
  - Contamination investigations
  - Instruction following and API evals
  - Novel capability discovery
  - Vision evaluations
  - Economic impact evaluation
  - Non-proliferation, international humanitarian law & national security red teaming
  - Overreliance analysis
  - Privacy and PII evaluations
  - Safety and policy evaluations
  - OpenAI adversarial testers
  - System card & broader impacts analysis
- Deployment
  - Inference research
  - GPT-4 API & ChatML deployment
  - GPT-4 web experience
  - Inference infrastructure
  - Reliability engineering
  - Trust & safety engineering
  - Trust & safety monitoring and response
  - Trust & safety policy
  - Deployment compute
  - Product management
- Additional contributions
  - Blog post & paper content
  - Communications
  - Compute allocation support
  - Contracting, revenue, pricing, & finance support
  - Launch partners & product operations
  - Legal
  - Security & privacy engineering
  - System administration & on-call support
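The most labor-intensive sub-teams are data and training (the parts shown in bold), followed by the domain experts who provide feedback (adversarial testers).
The algorithm side (Pretraining + long context + Vision + RL), testing and deployment (Evaluation + deployment), and the later marketing and product work are all indispensable and all critical. Still, it is gratifying to see AI products get to where they are today; earlier AI stayed at the level of very, very weak AI. The upside is the feeling that what I study really can change the world and that the field really is making explosive technical leaps. The downside is that I myself seem to be of little use anymore (happily so; but isn't the end point of progress always to be replaced? Language, education, design, law, computing, finance: across all industries, whether the work is professional or imaginative artistic generation, AI seems to have already beaten 90% of humans to some extent).
Three years ago I went all in on AI with real conviction and firmly believed in Deep Learning; the arrival of general AI may truly not be far off.
Today's AI is stronger, but it is still an assistant that helps humans with office work and improves efficiency; there is a long way to go before it fully replaces humans (even real commercialization is rather troublesome?). I increasingly feel that human emotion and emotional value are becoming more precious and harder to replace.
Will the future be a victory of reason or of feeling, of machines or of humans? If I get to witness it in my lifetime, that is something to look forward to.
For now, though: if you can't beat them, join them!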


Posted 2023-03-14 | Updated 2023-03-15 | 5 minutes read (About 690 words)

Paper | Visual ChatGPT Talking, Drawing and Editing with Visual Foundation Models | arXiv2023

Info

Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Keyword: Large Language Model (LLM), Visual Foundation Model (VFM)
Idea: Prompt Engineering
Source
- Paper: submitted to arXiv on March 8, 2023; a new work from Microsoft Research Asia. [2303.04671] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (arxiv.org)
- Code: released only a few days ago and already has more than 10,000 stars. microsoft/visual-chatgpt: VisualChatGPT (github.com)

Abstract

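Existing problems:

Large language models such as ChatGPT are trained on a single modality, language, so their ability to process visual information is very limited.
By comparison, Visual Foundation Models (VFMs) show great potential in computer vision and can understand and generate complex images (e.g., ViT, BLIP, Stable Diffusion). However, VFMs impose strict, fixed input-output formats, which makes them less flexible than conversational language models for human-machine interaction.

Contribution:

Prompt Engineering: connect ChatGPT with multiple SOTA visual foundation models.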

Method

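No training at all. The system consists of:

Part 1 ChatGPT (called directly through LangChain, a tool-integration library for large language models, using the OpenAI text-davinci-003 version)
Part 2 PromptManager

It builds one huge prompt that contains the system rules, the visual foundation model invocations, the dialogue history, the user query, the reasoning history, and the intermediate results. Put simply, it instructs ChatGPT how to call the models, when to call them, and how to handle the results. When ChatGPT and the VFMs refer to an image, they use a randomly generated uuid (universally unique identifier); no vectors or image data are exchanged between the two.
Part 3 VFMs (22 trained SOTA visual foundation models, called directly; all of them can be deployed on four V100 GPUs)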
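
The three parts can be sketched in a few lines of Python. This is only a minimal illustration, not the Visual ChatGPT code: the tool names, the two stand-in model functions, and the helpers `build_prompt` and `run_tool` are all hypothetical, and no real LLM or vision model is called. The point is that the language model and the tools exchange file names built from a uuid, never pixels or vectors.

```python
import uuid


def fake_text_to_image(prompt: str) -> str:
    """Hypothetical stand-in for a text-to-image VFM: returns only a file name."""
    return f"image/{uuid.uuid4().hex[:8]}.png"


def fake_image_captioning(image_path: str) -> str:
    """Hypothetical stand-in for a captioning VFM: takes a file name, returns text."""
    return f"a caption for {image_path}"


# Registry of tools the language model may call (names are made up for this sketch).
TOOLS = {
    "Generate Image From Text": fake_text_to_image,
    "Get Photo Description": fake_image_captioning,
}


def build_prompt(system_rules: str, history: str, user_query: str) -> str:
    """Assemble one large prompt: rules, available tools, dialogue history, new query."""
    tool_list = "\n".join(f"- {name}" for name in TOOLS)
    return (
        f"{system_rules}\n\nTools:\n{tool_list}\n\n"
        f"History:\n{history}\n\nUser: {user_query}\n"
    )


def run_tool(action: str, action_input: str) -> str:
    """Dispatch the action chosen by the language model to the matching tool."""
    if action not in TOOLS:
        raise KeyError(f"unknown tool: {action}")
    return TOOLS[action](action_input)


prompt = build_prompt("Refer to images by file name only.", "", "draw a cat")
image_name = run_tool("Generate Image From Text", "a cat")
caption = run_tool("Get Photo Description", image_name)
```

In a real system the action and its input would be parsed from the language model's reply; here they are passed in by hand to keep the sketch self-contained.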

Result

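It is not a true multimodal large model, but it is Prompt Engineering that ordinary players (small companies) can try.
Training a multi-task, large-scale vision-language model must be extremely compute-intensive. GPT-4, released on March 15, 2023, did not disclose detailed technical information, but my guess is that it adds Vision QA at the base level, i.e., an Image-to-Text capability; fully combining I2I, T2I, and I2T is still very hard.
Then again, brute force works wonders: stack more layers, feed more data.
My guess is that some of the image capabilities behind GPT-4 rely on simple logic of this kind.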


Posted 2023-03-07 | Updated 2023-04-08 | 2 minutes read (About 332 words)

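Frontend | Icarus Theme Customization

Adding the nest animated-line effect to the blog

Add the following code inside the body in themes\icarus\layout\layout.jsx; change the CDN to whichever one you use.

<script type="text/javascript" color="30,144,255" opacity='0.5' zIndex="-1" count="150" src="//cdn.bootcss.com/canvas-nest.js/1.0.0/canvas-nest.min.js"></script>

Besides loading it from a CDN, you can also download it and serve it locally; see the official documentation.

color="255,255,255" opacity='0.7' # changed to a color I like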

Code Highlight

You can pick a style you like from these code highlight themes. highlight.js/src/styles at 9.18.1 · highlightjs/highlight.js (github.com)

article:
    # Code highlight settings
    highlight:
        # Code highlight themes
        # https://github.com/highlightjs/highlight.js/tree/master/src/styles
        theme: xt256 # a cyberpunk-style dark code highlight theme
        # Show copy code button
        clipboard: true
        # Default folding status of the code blocks. Can be "", "folded", "unfolded"
        fold: unfolded

My favorite: monokai

LaTeX support

Install the mathjax plugin

npm uninstall hexo-renderer-marked
npm install hexo-filter-mathjax

Edit the blog's main _config.yml

mathjax:
  tags: none # or 'ams' or 'all'
  single_dollars: true # enable single dollar signs as in-line math delimiters
  cjk_width: 0.9 # relative CJK char width
  normal_width: 0.6 # relative normal (monospace) width
  append_css: true # add CSS to pages rendered by MathJax
  every_page: false # if true, every page will be rendered by MathJax regardless the `mathjax` setting in Front-matter

Set the front-matter of the md page

mathjax: true


Posted 2023-03-05 | Updated 2023-03-05 | 4 minutes read (About 558 words)

Paper | ZITS++ Image Inpainting by Improving the Incremental Transformer on Structural Priors | arXiv2023

Info

Title: ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors
Keyword: Transformer, High resolution Image Inpainting
Idea: the journal version of the earlier CVPR2022 conference paper, with some small improvements and additional experiments.
Source
- Paper: first version in October 2022, second version on February 23, 2023 (fresh off the press). [2210.05950] ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors (arxiv.org)
- Code: DQiaole/ZITS_inpainting: Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding (CVPR2022) (github.com); Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding (dqiaole.github.io)

Abstract

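Problems with ZITS:

The Canny edges used in ZITS cannot distinguish meaningful structures. In complex scenes, Canny edges produce confusing textures rather than informative underlying structures.

An in-depth study of high-resolution image inpainting guided by different image priors is needed.
Improve the texture restoration performance of LaMa.

Contributions:

Building on the original ZITS (transformer-based completion of edges and wireframes), the paper adds experimental analysis and discussion of many different priors, and finds that combining L-Edges, wireframes, and gradient priors works best.
Fusing the completed priors into the inpainting network requires upsampling. The paper proposes Edge Non-Maximum Suppression (E-NMS) to filter out redundant edge information (removing blurry edges near boundaries).
LaMa is modified by adding Large Kernel Attention and revising the model design. (Gains: large receptive fields and scale invariance. we promote the maxpool as the mask resizing strategy of PatchGAN instead of the nearest in LaMa)

A high-resolution image dataset, HR-Flickr, is provided.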

Method

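The learning-based edge detector CATS replaces the Canny edges used originally, and E-NMS (an existing algorithm) filters out uncertain edges. The priors finally used are CATS edges + wireframes + gradients.

The large kernel is decomposed with dilated convolutions; K=21 in the experiments.
Mask resize strategy: maxpool replaces nearest resize (this stabilizes training).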
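
Why the resize strategy matters can be seen on a toy mask. The sketch below is my own illustration in plain Python, not code from the paper: it assumes a binary mask with 1 marking the hole, a side length divisible by the factor, and a nearest resize that keeps the top-left pixel of each block. Under those assumptions a thin stroke can vanish with nearest resizing, while max pooling keeps every block that touches the hole.

```python
def nearest_downsample(mask, factor):
    """Keep the top-left pixel of every factor x factor block."""
    return [row[::factor] for row in mask[::factor]]


def maxpool_downsample(mask, factor):
    """Mark a block as masked if any pixel inside it is masked."""
    height, width = len(mask), len(mask[0])
    return [
        [
            max(mask[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            for x in range(0, width, factor)
        ]
        for y in range(0, height, factor)
    ]


# 4x4 mask with a one-pixel-wide vertical stroke in column 1 (1 = hole).
mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

nearest = nearest_downsample(mask, 2)  # the stroke disappears
pooled = maxpool_downsample(mask, 2)   # the stroke survives in the left column
```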

Evaluation

The quantitative improvement is clear.

The qualitative results are also good.

Face inpainting results are also good.

Posted 2023-03-05 | Updated 2023-03-05 | 13 minutes read (About 2019 words)

Paper | Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding | CVPR2022

Info

Title: Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding
Keyword: Transformer, High resolution Image Inpainting
Idea: Extract edges and contours with Transformer, Masking Positional Encoding
Source
- Paper: submitted in March 2022, a year ago now; accepted at CVPR2022. [2203.00867] Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding (arxiv.org)
- Code: some small improvements built on LaMa. DQiaole/ZITS_inpainting: Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding (CVPR2022) (github.com); Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding (dqiaole.github.io)
- PaperReading: CVPR2022|基于Transformer结构增强的增量式图像修复|ZITS - 知乎 (zhihu.com), a very good set of reading notes.

Abstract v1

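This paper builds on LaMa, the WACV'22 high-resolution image inpainting work, and leans more toward inpainting natural scenes (with more emphasis on structure and contour priors).

Existing problems:

1) Existing methods are limited by the restricted receptive field of CNNs and can only handle regular textures; recovering vivid textures and reasonable holistic structures remains a problem.
2) Attention-based models (Transformers) learn long-range dependencies better, but are limited by the heavy computation of inference on high-resolution images.

Solutions (contributions):

1) [Main contribution] An additional structure restorer that assists image inpainting incrementally.
- The holistic structure is restored in a fixed low-resolution sketch space (gray-scale space) and can be merged into the inpainting process through upsampling.
- Can be integrated with other pretrained inpainting models efficiently with the zero-initialized residual addition (it plugs into other pretrained inpainting models without training them from scratch; a short fine-tuning run is enough).
2) A masking positional encoding strategy improves performance when training with large irregular masks.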

Abstract v2

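Existing problems:

Existing inpainting methods can only handle regular textures; because of the limited receptive field of CNNs, they lose control over the holistic structure of the image.
Attention-based methods can alleviate this to some extent, but are limited by the heavy computation of high-resolution inference.

Contributions:

Motivation: for high-resolution natural image inpainting, edge information is crucial. Without a holistic understanding of a large image, it is hard to recover the edges and lines of a scene, especially in weakly textured scenes. Method: an additional structure restoration network assists the inpainting process incrementally. Specifically, a transformer-based network restores the edges and contour lines of the image in a fixed low-resolution sketch space; the result is then upsampled to high resolution and fused into the subsequent inpainting network.
Zero-initialized Residual Addition, an incremental training strategy: the proposed method can be easily integrated with other pretrained inpainting models (many other prior-based methods are multi-stage and multi-model with high training cost, whereas this strategy converges quickly in relatively few steps).
A Masking Positional Encoding strategy improves model performance under large-mask settings. (When inpainting high-resolution images with large missing regions, the model tends early on to repeatedly generate semantically meaningless artifacts in the masked region, wasting computation.)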

Introduction

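Image Inpainting Goal: The inpainted images should remain both semantically coherent textures and visually reasonable structures. This also offers a small insight: for face inpainting, semantic consistency is crucial, so guiding face inpainting with semantic segmentation information is a good idea; the latter, coherence of the holistic structure, is crucial for natural scene inpainting.
Existing problems in the image inpainting task
- 1) Limited receptive fields. The problem becomes more pronounced with large corrupted regions and high-resolution images.
- 2) Missing holistic structures. Holistic structure is lacking; the goal is recovering key edges and lines for scenes.
- 3) Heavy computations. Training GANs on high-resolution images is very tricky and costly.
- 4) No positional information in masked regions. Under large-mask settings the model generates meaningless artifacts, wasting computation.

Great, another one of my ideas has already been implemented by someone else. Time to read carefully and learn (●’◡’●)

The authors analyze the shortcomings of LaMa (which are actually quite obvious). In essence, LaMa applies a 1×1 convolution in the frequency domain, which ties together signals of the same period; this is the repetitive-texture restoration that the LaMa authors set out to solve. But such a method cannot guarantee the holistic structure, and it performs poorly on weakly textured images.

The first works to run transformer-based inpainting at low resolution and then upsample (super-resolve) with a CNN:

Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao. High-fidelity pluralistic image completion with transformers. arXiv preprint arXiv:2103.14031, 2021.

Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jianxiong Pan, Kaiwen Cui, Shijian Lu, Feiying Ma, Xuansong Xie, and Chunyan Miao. Diverse image inpainting with bidirectional and autoregressive transformers. arXiv preprint arXiv:2104.12335, 2021.

There are also many networks that use prior information, but they are usually multi-stage inpainting pipelines with high training cost (trained from scratch).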

Method

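First, the inputs are prepared: the mask; the masked image (valid pixels are 1 and the region to be filled is 0; when the mask is visualized it is inverted so that the region to be filled becomes 1, all for the convenience of later computation); the masked edges obtained with the Canny edge detector; and the masked lines obtained with a model previously proposed by the authors (wireframes, which mainly model the line connecting two points, so upsampling and downsampling introduce no ambiguity, whereas Canny edges extracted at different feature sizes may differ).
These are fed into the TSR. The 256×256 image is first downsampled three times to 32×32; a transformer based on axial attention and standard attention then reduces computation and improves efficiency, and finally outputs the restored 256×256 edges and wireframes. A simple four-layer CNN then upsamples the restored priors; it is trained on wireframe data only rather than wireframe plus edge data, which removes ambiguity better and yields priors that are more consistent across resolutions.
Because edge and wireframe information is sparse, a gated-convolution-based network extracts the more critical information and uses multi-scale features, namely the last layer of the middle block and the three upsampling layers. Through zero-initialized residual addition (essentially a simple residual operation), these are fused with the first four layers of the LaMa baseline; 50k steps of incremental fine-tuning then significantly improve the original model.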
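
The zero-initialized residual addition can be pictured with a tiny sketch. This is my own simplification in plain Python, assuming the ReZero-style form y = x + alpha * f with a single scalar alpha that starts at zero; the class name and the list-based features are hypothetical, and the real model works on feature maps with alpha learned by the optimizer.

```python
class ZeroInitResidual:
    """Fuse structure features into backbone features as y = x + alpha * f."""

    def __init__(self):
        # alpha starts at 0, so the pretrained backbone is untouched at step 0.
        self.alpha = 0.0

    def __call__(self, backbone_feat, structure_feat):
        return [x + self.alpha * f for x, f in zip(backbone_feat, structure_feat)]


fuse = ZeroInitResidual()
backbone = [1.0, 2.0]
structure = [10.0, 20.0]

before_finetune = fuse(backbone, structure)  # identical to the backbone features

fuse.alpha = 0.5  # stands in for the value reached after some fine-tuning steps
after_finetune = fuse(backbone, structure)
```

Because the fused output equals the pretrained features at the start, fine-tuning begins from the baseline's behavior and only gradually lets the structure branch contribute, which is one plausible reason a short fine-tuning run suffices.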
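As for MPE (Masking Positional Encoding), it simply convolves the mask region with a 3×3 all-ones kernel, which yields distance information relative to the center of a large mask as well as mask direction information; these are fed into the baseline network as auxiliary information. (Black regions are 1 and white are 0; a very simple convolution.)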
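
A rough sketch of the distance part of this idea follows. It is my own plain-Python illustration and not the authors' implementation: it assumes a binary mask with 1 for masked pixels, zero padding at the image border, and a repeated 3×3 all-ones convolution used as an erosion step; the direction encoding and any learned embedding are omitted.

```python
def conv3x3_ones(grid):
    """Convolve a 2D 0/1 grid with a 3x3 all-ones kernel (zero padding)."""
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total += grid[ny][nx]
            out[y][x] = total
    return out


def mask_distance(mask, max_steps=128):
    """Count, per masked pixel, how many erosion steps it survives."""
    height, width = len(mask), len(mask[0])
    dist = [[0] * width for _ in range(height)]
    current = [row[:] for row in mask]
    for _ in range(max_steps):
        if not any(any(row) for row in current):
            break
        for y in range(height):
            for x in range(width):
                dist[y][x] += current[y][x]
        sums = conv3x3_ones(current)
        # A pixel stays masked only if its whole 3x3 neighbourhood is masked.
        current = [
            [1 if sums[y][x] == 9 else 0 for x in range(width)]
            for y in range(height)
        ]
    return dist


# 7x7 image with a 5x5 hole in the middle (1 = masked).
mask = [[1 if 1 <= y <= 5 and 1 <= x <= 5 else 0 for x in range(7)] for y in range(7)]
dist = mask_distance(mask)
```

Pixels deep inside the hole get larger values than pixels near its border, which gives the network a positional cue inside regions that otherwise look identical.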

Evaluation

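The work mainly targets natural scene inpainting; the quantitative gain is not very obvious.

MPE in particular is of marginal value: the starting point is good, but the implementation is too simple, so it does not bring much gain either.

The qualitative results, however, are very good, mainly because holistic structure information (edges and wireframes) is critical for high-resolution natural scene images. The wireframe extraction model previously proposed by the authors works, I think, much like a perspective drawing at its core; perspective matters a lot for spatial layout, so the inpainted images look better.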

Posted 2023-02-28 | Updated 2023-02-28 | 5 minutes read (About 732 words)

Issues | Baseline MISF-CVPR2022 Reprod & GIQA improve

Official code: tsingqguo/misf (github.com)

About the loss terms

[('epoch', 1), ('iter', 1), ('l_d2', 0.707538366317749), ('l_g2', 0.07427514344453812), ('l_l1', 0.7772688865661621), ('l_per', 0.20167401432991028), ('l_sty', 0.393798291683197)]

logs = [
    ("l_d2", dis_loss.item()),
    ("l_g2", gen_gan_loss.item()),
    ("l_l1", gen_l1_loss.item()),
    ("l_per", gen_content_loss.item()),
    ("l_sty", gen_style_loss.item()),
]

Here, l_d2 is the Discriminator loss of the Inpainting Model, l_g2 is the Generator loss of the Inpainting Model, l_l1 is the L1 loss, l_per is the Perceptual loss, and l_sty is the Style loss.

The authors' code is based on the Edge Connect repo. The original Edge Connect has four training stages: Edge Model, Inpainting Model, Inpaint with Edge Model, and Joint Model. The MISF authors appear to have used only the Inpainting Model part and modified it.

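Using wandb

import wandb  # import wandb after the PyTorch import

default_config = dict(
    batch_size=128,
    dropout=0.5,
)

# mode is one of "online", "offline", or "disabled"
wandb.init(project="pj-name", config=default_config, mode="online")

batch_size = wandb.config.batch_size  # read hyperparameters from wandb.config for readability and consistency

# epoch, loss, and accuracy come from the training loop
wandb.log({'epoch': epoch, 'loss': loss, 'accuracy': accuracy})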

Package import

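sys.path is the list of module search paths. By default, when Python imports a file or module, it looks for it in the directories listed in sys.path; if the module cannot be found in any of them, the import fails with an error.

import sys
print(sys.path)
sys.path.append('/home/nsy/nlp')  # the package lives at /home/nsy/nlp/new_package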

['/home/user5/code/misf-main', '/home/user5/.pycharm_helpers/pydev', '/home/user5/code/misf-main', '/home/user5/.pycharm_helpers/pycharm_display', '/home/user5/.pycharm_helpers/third_party/thriftpy', '/home/user5/.pycharm_helpers/pydev', '/home/user5/code/misf-main/C', '/Users/75796/AppData/Local/JetBrains/PyCharm2021.3/cythonExtensions', '/home/user5/anaconda3/envs/testenv/lib/python38.zip', '/home/user5/anaconda3/envs/testenv/lib/python3.8', '/home/user5/anaconda3/envs/testenv/lib/python3.8/lib-dynload', '/home/user5/.local/lib/python3.8/site-packages', '/home/user5/code/PUT-main', '/home/user5/anaconda3/envs/testenv/lib/python3.8/site-packages', '/home/user5/.pycharm_helpers/pycharm_matplotlib_backend']
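
The effect of sys.path can be checked end to end with a throwaway package. The sketch below is only an illustration: the package name `new_package_demo` and its `GREETING` attribute are made up, and a temporary directory stands in for a real project folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Create a minimal package: <root>/new_package_demo/__init__.py
    pkg_dir = Path(root) / "new_package_demo"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("GREETING = 'hello from new_package_demo'\n")

    # Before the directory is on sys.path, the import fails.
    try:
        importlib.import_module("new_package_demo")
        found_before = True
    except ModuleNotFoundError:
        found_before = False

    # After appending the parent directory, the import succeeds.
    sys.path.append(root)
    importlib.invalidate_caches()
    module = importlib.import_module("new_package_demo")
    greeting = module.GREETING

    sys.path.remove(root)  # leave sys.path as it was
```

Note that the directory appended is the parent of the package, not the package folder itself, which matches the /home/nsy/nlp example above.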

Training in the background

nohup python -u main.py >02272115_loss.log 2>&1 &

FIQA, the upgraded GIQA

Best model?

- checkpoints/acc01090300/model_best.pth.tar
- /home/user5/code/QA/GIQA-master/MBC-GIQA/checkpoints/acc01090300/model_best.pth.tar

Freeze pretrained layer

Pytorch 如何精确的冻结我想冻结的预训练模型的某一层，有什么命令吗？ - 知乎 (zhihu.com)

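Leveling up as an algorithm engineer

To become an algorithm engineer you first need engineering ability, meaning you must be able to get work done; solid command of one programming language is essential.
Next comes domain knowledge. For recommendation algorithms, for example, you need to understand the principles, pros and cons, and application scenarios of the common ones.
Then the fundamentals of machine learning: Li Hang's Statistical Learning Methods (《统计学习方法》), Zhou Zhihua's Machine Learning (《机器学习》), and Bengio's Deep Learning (《深度学习》). Go through these three books at least once or twice; once the fundamentals are solid, everything else is much easier to learn. A shaky foundation brings the whole building down.
Then some knowledge of data structures and algorithms, which is quite important and helps a lot in writing efficient code.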

Posted 2023-02-26 | Updated 2023-03-02 | 8 minutes read (About 1234 words)

Paper | Resolution-robust Large Mask Inpainting with Fourier Convolutions | WACV2022

Info

Title: Resolution-robust Large Mask Inpainting with Fourier Convolutions
Keyword: Large Mask Inpainting
Idea: Fourier Convolutions
Source
- Paper: submitted on September 15, 2021, and eventually published at WACV2022. It truly is Applications of CV, very practical. Many later CVPR2022 works on high-resolution image inpainting compare against it. [2109.07161] Resolution-robust Large Mask Inpainting with Fourier Convolutions (arxiv.org)
- Code: a deployment-oriented work with very good results on high-resolution image inpainting. https://github.com/saic-mdal/lama, [Resolution-robust Large Mask Inpainting with Fourier Convolutions (advimman.github.io)](https://advimman.github.io/lama-project/)
- Video: an excellent walkthrough of the paper, not by the authors themselves, but the first author was invited for an interview. Resolution-robust Large Mask Inpainting with Fourier Convolutions (w/ Author Interview) - YouTube

Abstract

Existing problems:

Modern image inpainting systems often struggle with large missing areas, complex geometric structures, and high-resolution images. (Personally, I do not think the ill-posedness that comes with large missing areas is something Fourier convolutions can solve.)

Hypothesis:

How can this be solved? The authors argue that the main cause is the lack of an effective receptive field in both the inpainting network and the loss function.

Contributions of LaMa (Large mask inpainting):

Network architecture: an inpainting network architecture built on fast Fourier convolutions, giving an image-wide receptive field (the contribution of fast Fourier convolutions).
Loss function: a high receptive field perceptual loss.
Training strategy: large training masks.

Introduction

A large effective receptive field is essential for understanding the global structure of an image.

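First, a high receptive field architecture. The paper proposes a network architecture based on fast Fourier convolutions (FFCs), so that even the early layers have a receptive field covering the whole image. This improves perceptual quality, makes the network lighter, and generalizes well (it infers well even on high-resolution images not seen in training).
Second, a high receptive field loss function. The paper proposes a perceptual loss based on a semantic segmentation network with a large receptive field, which improves the consistency of global structures and shapes.
Third, an aggressive algorithm of training masks generation, which produces larger masks.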

Method

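Under large-mask settings, if conventional 3×3 ResNet convolution kernels are still used, the receptive field in the early layers may fall entirely inside the mask, so many layers lack global context, wasting computation and parameters.

Add FFC

Fast Fourier convolution (FFC), by contrast, lets the early layers use global context. It contains two parallel branches: 1) the local branch uses conventional convolutions; 2) the global branch uses a real FFT, which operates on real-valued signals. The FFT maps to the complex (frequency) domain, and the inverse real FFT guarantees that the output is real.
Here the real and imaginary parts of the complex output of the real FFT are simply concatenated, and a 1×1 convolution is applied in the frequency domain, i.e., a convolution over components of the same frequency. This helps restore periodic signals, that is, repetitive patterns. The authors' original motivation was that existing methods restore repetitive patterns poorly; repetitive patterns suggest periodic signals, hence the use of the FFT.

The paper proposes a Fast Fourier Conv Residual Block, i.e., a res block whose convolutions are replaced with fast Fourier convolutions. FFC also has interaction between the local and global branches, applied at every layer.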
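
The global branch can be sketched on a 1D signal with nothing but the standard library. This is a simplified illustration under my own assumptions, not the LaMa code: it uses a naive DFT instead of a fast FFT, works in 1D rather than 2D, and leaves out the normalization and activation that a real block would apply after the 1×1 convolution. The function names and the weight layout (first half of the rows produce real parts, second half imaginary parts) are hypothetical.

```python
import cmath
import math


def rfft(signal):
    """Real DFT: the n // 2 + 1 non-redundant frequency bins of a real signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def irfft(spectrum, n):
    """Inverse of rfft; the output is real by construction."""
    # Rebuild the full spectrum from conjugate symmetry.
    full = list(spectrum) + [spectrum[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def spectral_transform(channels, weight):
    """rfft -> concat(real, imag) -> 1x1 conv shared by all frequencies -> irfft.

    channels: list of C_in real signals of equal length.
    weight:   (2 * C_out) x (2 * C_in) matrix acting on [real parts, imaginary parts].
    """
    n = len(channels[0])
    spectra = [rfft(ch) for ch in channels]
    c_out = len(weight) // 2
    out_spectra = [[0j] * (n // 2 + 1) for _ in range(c_out)]
    for k in range(n // 2 + 1):
        stacked = [s[k].real for s in spectra] + [s[k].imag for s in spectra]
        mixed = [sum(w * v for w, v in zip(row, stacked)) for row in weight]
        for o in range(c_out):
            out_spectra[o][k] = complex(mixed[o], mixed[c_out + o])
    return [irfft(s, n) for s in out_spectra]


impulse = [0.0] * 8
impulse[1] = 1.0

identity = [[1.0, 0.0], [0.0, 1.0]]
same = spectral_transform([impulse], identity)[0]

keep_real_only = [[1.0, 0.0], [0.0, 0.0]]
spread = spectral_transform([impulse], keep_real_only)[0]
```

With the identity weight the signal comes back unchanged. With a weight that keeps only the real part, the single impulse at position 1 is split between positions 1 and 7: one pointwise operation in the frequency domain already moves information across the whole signal, which is the image-wide receptive field in miniature.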

Perceptual loss pro

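On the discriminator side, a segmentation model is used as the backbone in order to focus on high-level information, rather than a classification model backbone, which focuses more on low-level features such as texture. It can be implemented with either Fourier or dilated convolutions.

An ablation study verifies the effect of the upgraded perceptual loss. Because the generator pays more attention to global information, the discriminator must be strengthened as well, so that GAN training stays balanced.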

Generation of large mask

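The input data matters a lot for model performance. Compared with the masks of DeepFillv2 and with narrow masks, the paper generates a combination of large wide masks (wide polygonal strokes) and large box masks as training input.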
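
To make the idea of a large box mask concrete, here is a heavily simplified sketch. It is not LaMa's mask generator: it draws a single random rectangle, the size range (`min_frac`, `max_frac`) is a hypothetical choice of mine, and the wide polygonal strokes are not covered.

```python
import random


def large_box_mask(height, width, min_frac=0.3, max_frac=0.6, rng=None):
    """Return a height x width binary mask (1 = hole) with one large random box."""
    rng = rng or random.Random()
    box_h = int(height * rng.uniform(min_frac, max_frac))
    box_w = int(width * rng.uniform(min_frac, max_frac))
    top = rng.randint(0, height - box_h)
    left = rng.randint(0, width - box_w)
    mask = [[0] * width for _ in range(height)]
    for y in range(top, top + box_h):
        for x in range(left, left + box_w):
            mask[y][x] = 1
    return mask


mask = large_box_mask(64, 64, rng=random.Random(0))
covered = sum(map(sum, mask)) / (64 * 64)  # fraction of the image that is masked
```

Raising `min_frac` and `max_frac` is the knob that makes the training masks more aggressive in this toy version.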

Evaluation

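Red indicates the percentage by which this method improves over the others. Under the narrow mask setting the method beats the vast majority of methods, and under the wide mask setting it outperforms the others by a wide margin.

In the ablation on Fourier convolution, the improvement under narrow masks is not very pronounced, but under large masks the advantage is clear.

It also generalizes to high-resolution images.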

Posted 2023-02-24 | Updated 2023-02-24 | 6 minutes read (About 919 words)

Backend | About Node.js, NPM, and Node_modules

[Image: r/ProgrammerHumor meme, "Heaviest objects in the universe": Sun, neutron star, black hole, node_modules]

What they are

First I looked at the introduction on the Node.js official site.

As an asynchronous event-driven JavaScript runtime, Node.js is designed to build scalable network applications.

What on earth does that mean? I did not understand a word. Let's see what it looks like in plain language.

Node.js, which is a run-time environment that includes everything required to execute a program written in JavaScript.

Node.js is neither a programming language nor a framework; it’s an environment for them.

Node.js is a runtime environment for programs written in JavaScript.

NPM is Node.js’s package ecosystem. It is the largest ecosystem of all open-source libraries in the world, with over 1 million packages and growing. NPM is free to use, and thousands of open source developers contribute to it daily.

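NPM is the package ecosystem of Node.js, i.e., it manages packages. It is the largest open-source library ecosystem in the world; reportedly more than 200 new packages are registered every day.
Based on the package.json or package-lock.json file in a project, npm install installs all of the project's dependencies and stores them under node_modules.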

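Differences from Python package management

npm vs. pip
- npm uses local dependencies, so the same module is installed over and over again for every project and every transitive dependency. A package can be a tarball, a local file: path, or even a git repository URL. Hence node_modules: HEAVIEST OBJECTS IN THE UNIVERSE.
  node_modules
    mod-a
      node_modules
        mod-b@1.0
    mod-c
      node_modules
        mod-b@2.0
    mod-d
      node_modules
        mod-b@2.0
  Although mod-c and mod-d depend on the same version of mod-b, that version is installed twice. If an application uses many third-party libraries, and those libraries in turn depend on some very basic third-party library (such as lodash), node_modules fills up with duplicate copies of lodash.
- pip, by contrast, uses global dependencies (global at least within a virtual environment), which avoids the problem above.
standard library: Python vs. JS
- Python's standard library is relatively large compared with that of JS.
- So JS relies on more packages.
I think Node.js can be compared to Anaconda: both act like an environment or container.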
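
The duplication in the nested layout can be counted with a few lines of Python. This is only an illustration of the mod-a / mod-c / mod-d example from this section, written as a nested dict; it does not inspect a real node_modules folder, and the helper name `count_installs` is made up.

```python
from collections import Counter

# The nested layout from the example, as {package: its own nested dependencies}.
tree = {
    "mod-a": {"mod-b@1.0": {}},
    "mod-c": {"mod-b@2.0": {}},
    "mod-d": {"mod-b@2.0": {}},
}


def count_installs(node, counter=None):
    """Walk the tree and count how many times each package is installed."""
    counter = Counter() if counter is None else counter
    for name, deps in node.items():
        counter[name] += 1
        count_installs(deps, counter)
    return counter


installs = count_installs(tree)
duplicates = {name: n for name, n in installs.items() if n > 1}
```

Here `duplicates` contains only mod-b@2.0, installed twice: once under mod-c and once under mod-d.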

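The npm version and the git version of ICARUS

First of all, my blog is built on hexo.

There are two ways to install the theme. With npm install it goes straight into node_modules; it is essentially the GitHub repo published as a registered package (node_modules/hexo-theme-icarus contains a package.json, so it is a package rather than a module? I do not think the distinction between package and module matters much, though).
The git clone method, by contrast, places it under the themes folder.

If I want to change some settings of the home page, I need to modify the theme's source code. But the theme was installed through npm. Editing it in place does take effect (packages are resolved from the local path, so editing a library directly inside node_modules should be fine?), but this approach is very inelegant (for example, reinstalling node_modules one day would wipe out every change).

The recommended approach is to git clone the theme into the themes folder, i.e., into your own project. Once your changes are done, and if you have time to spare, you can publish it to npm, so that others can use your modified icarus plus version too, conveniently installed with a single npm command.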

Posted 2023-02-23 | Updated 2023-03-02 | 7 minutes read (About 1011 words)

Paper | SFI-Swin Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions | arXiv2023

Info

Title: SFI-Swin: Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions
Keyword: Face Inpainting, Swin Transformer
Idea: Symmetric (facial symmetry), Distinctly Learning Face Components Distributions (explicitly learning the distributions of face components)
Source
- Paper: submitted to arXiv on January 9, 2023. [2301.03130] SFI-Swin: Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions (arxiv.org)
- Code: the repo has been published, but the code has not been pushed yet. mohammadrezanaderi4/SFI-Swin: SFI-Swin: Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions https://arxiv.org/abs/2301.03130 (github.com)

My usual lament: why do others always publish the ideas I come up with so quickly? While I am still dragging my feet on implementing an idea, others have already validated it. I need to read more papers and, more importantly, write more code to implement and validate ideas. The greatest distance in the world is the one between knowing and doing.

Abstract

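Existing problems (problem statement):

None of the existing inpainting methods consider the symmetry and homogeneity of the picture.

That is, existing methods ignore the symmetry and homogeneity of the image during face inpainting.
The metrics that assess a repaired face image quality cannot measure the preservation of symmetry between the rebuilt and existing parts of a face.

That is, existing metrics cannot measure the symmetry of an inpainted face.

Proposed method (contributions):

Multiple discriminators separately verify the generation quality of each facial component (improving the understanding of high-level facial semantics), and a transformer-based network is built (its large receptive field helps preserve facial symmetry).
A symmetry concentration score is proposed to evaluate the symmetry of inpainted faces.
On the three dimensions of reality, symmetry, and homogeneity, it outperforms recently proposed SOTA algorithms.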

Introduction

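In image processing, homogeneity refers to measuring the local uniformity of an image.
In this paper, homogeneity means that the inpainted missing regions must be in harmony with the other regions of the face (global features of each part of the face). The inpainted regions must be homogeneous with the other parts of the face and highly correlated to the available surrounding areas of the input image.
Symmetry refers to the left-right symmetry of the face: facial symmetry must be preserved between the left and right sides.

The authors argue that the problem with existing methods is that the loss functions cannot convey a holistic understanding of facial features to the generator. This shortcoming is because the network losses do not convey a general understanding of the facial features to the generator.

The authors then analyze how several losses commonly used by mainstream inpainting methods affect training, including pixel-wise, adversarial, feature-matching, and perceptual loss.

pixel-wise loss. L1 and L2 norms only let the network understand low-level features. 👉 focus on low-level features (color, texture)
adversarial loss. It pulls the distributions of the gt and generated images together, with the discriminator and generator playing a game against each other. feature-matching loss. It takes gt and pred as input and extracts intermediate-layer features of the discriminator. These two losses only make the generated image look realistic; they cannot guarantee that the missing regions are exactly similar to the ground truth (the ill-posed nature of the inpainting task). Most discriminators are patch-based, so they can only ensure local realism. 👉 focus on the realism of generated patch content
perceptual loss. A pretrained seg network first extracts high-level semantic features, and then the L1 or L2 norm is computed. It mainly considers high-level features, such as edges. 👉 focus on the smoothness of edges and contours

Usually the features are extracted by a pretrained VGG-like backbone, and high-level features are taken to mean features at the semantic level and above.

Sometimes the losses above sacrifice facial symmetry to reach optimal local realism, so what is needed now is a 💡 homogeneity-aware loss to constrain the model. At the same time, the large receptive field of the transformer also helps preserve facial symmetry.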

Method

Evaluation

The method's results are mediocre; most of the gain comes from the Swin transformer.

Posted 2023-02-15 | Updated 2023-02-15 | Util | 4 minutes read (About 525 words)

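Util | GitHub Education Verification and Using Copilot

I wasted some time this afternoon stumbling into pitfalls; I hope this helps others who need education verification~

GitHub education verification

Do not use a VPN; open GitHub Education directly in Microsoft Edge and carry out the verification there.
Download the student status report from CHSI (学信网), translate it into English with DeepL document translation, and save a screenshot in jpg format.
Upload the image directly, choose Other as the proof type, and enter Ministry of Education Online Verification Report of Student Status

At first the upload kept failing with complaints about the profile, a missing valid date, and so on. I even redid my GitHub profile, but that was not the crux. The main problem was that the uploaded report was not in English or not legible: I had used Google's document translation, which produced very small text, and converting to JPEG with compression then made it blurry.

Using Copilot

With education verification you can use GitHub Copilot · Your AI pair programmer for free. That saves 10 dollars a month~

Just add the plugin in PyCharm.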

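Write a comment, and Copilot auto-completes the corresponding comments and code.
- Tab applies the suggestion (accepting the auto-completed code, or the code completed from a comment)
- alt+[ or alt+] cycles through the other suggestions

Two more plugins I absolutely love

One is Indent Rainbow 🌈, rainbow-colored indentation, which is great for writing Python! The other is Monokai Pro Theme; Monokai is my favorite code color scheme!

My experience:

Compared with ChatGPT, Copilot is probably more convenient for assisting everyday coding (especially code you write often; Copilot sits right inside the IDE and completes code), and it can improve a programmer's coding efficiency~

But for programming problems in real-world scenarios, as opposed to basic LeetCode algorithm problems or teaching cases, both ChatGPT and Copilot are only assistive tools.

They are still a long way from replacing programmers~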
