Improved Baselines with Visual Instruction Tuning

January 8, 2024

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient.
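To make the "fully-connected cross-modal connector" concrete, the sketch below shows the general shape of such a module: a small MLP that projects per-patch vision-encoder features into the language model's embedding space, so image patches can be consumed as ordinary input tokens. This is an illustrative reimplementation, not the authors' code; the class name `MLPConnector`, the GELU activation, and all dimensions are assumptions chosen for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPConnector:
    """Two-layer MLP mapping vision features into the LLM token-embedding space.

    Illustrative sketch only: dimensions, init scale, and activation are
    placeholders, not the actual LLaVA configuration.
    """
    def __init__(self, d_vision, d_llm, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w1 = rng.standard_normal((d_vision, d_llm)) * 0.02
        self.b1 = np.zeros(d_llm)
        self.w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
        self.b2 = np.zeros(d_llm)

    def __call__(self, patches):
        # patches: (n_patches, d_vision) -> projected tokens: (n_patches, d_llm)
        h = gelu(patches @ self.w1 + self.b1)
        return h @ self.w2 + self.b2

# usage: project 4 patch features of width 8 into a (toy) LLM space of width 16
connector = MLPConnector(d_vision=8, d_llm=16)
tokens = connector(np.random.default_rng(1).standard_normal((4, 8)))
print(tokens.shape)  # (4, 16)
```

Because the connector is just a projection between two fixed embedding spaces, it has very few parameters relative to the LLM, which is one intuition for why it can be trained data-efficiently.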