I've been experimenting with instruction-tuning LLMs and VLMs, either adding new specialized tokens to the corresponding tokenizer/processor or leaving it untouched. The setup is typical: mask the instructions/prompts so the CE loss is computed only on the responses/answers. Nothing special, standard SFT.
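For reference, the "add new tokens" variant looks roughly like this. A minimal sketch assuming a HuggingFace-style causal LM; the checkpoint name and token strings are placeholders, not what I actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in checkpoint; any HF causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

NEW_TOKENS = ["<task>", "</task>"]  # hypothetical specialized tokens
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": NEW_TOKENS}
)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

# Standard SFT masking: labels set to -100 are ignored by the CE loss,
# so gradients only flow from the response tokens.
prompt_ids = tokenizer("Instruction: summarize the text.",
                       return_tensors="pt").input_ids
response_ids = tokenizer(" A short summary.", add_special_tokens=False,
                         return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt
loss = model(input_ids=input_ids, labels=labels).loss
```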
However, I've observed better validation losses and output quality from models trained with their base tokenizer/processor than from models trained with the modified one... Any thoughts on this? Feel free to shed light on it.
(My hunch: the newly added tokens start from freshly initialized embeddings, so it's difficult to increase their likelihood and the model simply can't learn them properly.)
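One thing worth checking, continuing the sketch above (it reuses `model` and `num_added`): since the resized rows are freshly initialized, a common mitigation is to initialize them at the mean of the pretrained embeddings rather than leaving them at random init. This is an illustrative sketch of that trick, not something my comparison included:

```python
# Continuing the sketch above (reuses `model` and `num_added`).
# Initialize the new embedding rows at the mean of the original vocabulary
# so their initial logits aren't outliers relative to pretrained tokens.
with torch.no_grad():
    emb = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)
    mean_emb = emb[:-num_added].mean(dim=0)     # mean over the original rows
    emb[-num_added:] = mean_emb

    # If input and output embeddings are untied, mirror the init in lm_head.
    out = model.get_output_embeddings()
    if out is not None and out.weight.data_ptr() != emb.data_ptr():
        out.weight[-num_added:] = mean_emb
```

If the base-tokenizer models still win after something like this, that would suggest the gap isn't just the embedding init.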