Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient.
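To make the architectural claim concrete, below is a minimal sketch (not the authors' code) of what such a fully-connected cross-modal connector amounts to: frozen vision-encoder patch features are simply projected into the language model's embedding space and treated as visual tokens. The class name and the dimensions (1024 for the vision encoder, 4096 for the LLM, a 24x24 patch grid) are illustrative assumptions, not values taken from this note.

import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Hypothetical fully-connected connector: vision features -> LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP projector; a single linear layer is another common choice.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns visual tokens in the LLM embedding space: (batch, num_patches, llm_dim)
        return self.proj(vision_features)

# Usage: project dummy patch features for one image.
features = torch.randn(1, 576, 1024)        # assumed 24x24 patch grid
tokens = VisionLanguageConnector()(features)
print(tokens.shape)                          # torch.Size([1, 576, 4096])

The point of the sketch is that the connector contains no cross-attention or resampling machinery: it is a plain projection, which is what makes its strength and data efficiency surprising.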