Has anyone explored assigning learnable weights to the non-overlapping image patches in a ViT? The weights would be trained along with the rest of the model's parameters. For instance, background patches are often uninformative for an image classification task, and I am hypothesising that including them in the image embedding adds noise.
It would be great if someone could point me to some relevant works.
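To make the idea concrete, here is a rough sketch of what I have in mind, forward pass only and in plain Python. It is one possible reading of "learnable patch weights", not an established method: each patch embedding gets a scalar score from a hypothetical learnable vector `w` (and bias `b`), the scores are softmax-normalised over the patches, and the image embedding is the weighted sum of patch embeddings instead of a plain mean. The names `weighted_patch_pool`, `w` and `b` are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_patch_pool(patch_embeddings, w, b=0.0):
    """Pool N patch embeddings (each of length D) into one vector.

    w and b are hypothetical learnable parameters: they score each patch,
    and the softmax of those scores gives the per-patch weights.
    """
    # One scalar score per patch: dot(w, patch) + b
    scores = [sum(wi * xi for wi, xi in zip(w, patch)) + b
              for patch in patch_embeddings]
    weights = softmax(scores)

    # Weighted sum of patch embeddings, dimension by dimension
    dim = len(patch_embeddings[0])
    pooled = [
        sum(weights[i] * patch_embeddings[i][d]
            for i in range(len(patch_embeddings)))
        for d in range(dim)
    ]
    return pooled, weights


# Toy example: three 2-d "patch embeddings"; the middle one scores highest.
patches = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
pooled, weights = weighted_patch_pool(patches, w=[1.0, 1.0])

# With w = 0 every patch gets the same weight, so this reduces to mean pooling.
mean_pooled, uniform_weights = weighted_patch_pool(patches, w=[0.0, 0.0])
```

In a real model `w` and `b` would be trained end to end with the classifier, and the patch embeddings would come from the transformer's output tokens; this sketch only shows how the weighting would enter the pooled embedding.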