User is experiencing unexpected results when fine-tuning the Qwen 3 VL 2B model for multimodal tasks. They seek a reliable fine-tuning script for better output control.
I used the Qwen 3 VL 2B model for a multimodal task where it takes multiple images plus text and produces textual output. To fine-tune it I used the HF PEFT library, but the results are unexpected and a bit off: for example, the output doesn't stay within the bounds mentioned in the prompt, and generation only stops when the max token limit is reached. It might be due to some issue in my fine-tuning script (this is my first time doing it). Unsloth has a fine-tuning notebook for Qwen 3 VL 8B on their website. Should I trust it? If anyone has tried multimodal LLM fine-tuning and has a script for it, I would really appreciate it if you could share it. Thank you!