Bars (left axis) show total multimodal AI preprint volume per year. The dashed line (right axis) shows multimodal AI as a proportion of all AI preprints identified in the dataset.
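The yearly totals and shares behind a chart like this come from a simple aggregation. The sketch below is illustrative only: the `records` list of `(year, is_multimodal)` flags is a hypothetical stand-in for the actual dataset.

```python
from collections import Counter

# Hypothetical preprint records: (year, is_multimodal) flags.
# The real dataset and its labels are assumptions for illustration.
records = [
    (2021, True), (2021, False), (2021, False),
    (2022, True), (2022, True), (2022, False),
    (2023, True), (2023, True), (2023, True), (2023, False),
]

total_per_year = Counter(year for year, _ in records)
multimodal_per_year = Counter(year for year, mm in records if mm)

# Bars: absolute multimodal volume per year.
# Line: multimodal papers as a share of all AI preprints that year.
share = {
    year: multimodal_per_year[year] / total_per_year[year]
    for year in sorted(total_per_year)
}
```

With this toy data, `multimodal_per_year[2023]` is 3 and `share[2023]` is 0.75; the same two series drive the bars and the dashed line respectively.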
Vision and Language consistently dominate. Audio, Sensor, Graph, and Tabular modalities show emerging — and accelerating — growth from 2022 onwards.
Pairwise modality combinations remain the most common. Triple-modality papers are growing fastest in proportional terms, reflecting a trend towards richer, more complex multimodal systems.
Across all combination types, Vision & Language pairings dominate. The "Others" category captures novel pairings that do not involve Vision or Language as a primary modality — an area of growing research interest.
Vision & Language is by far the most common pairing. Filtering by a specific modality reveals which pairings are most or least explored relative to it.
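Counting pairings and filtering them by a chosen modality can be sketched as follows. The per-paper modality sets here are hypothetical examples, not drawn from the dataset; the category names simply mirror those used in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-paper modality tags (illustrative, not real data).
papers = [
    {"Vision", "Language"},
    {"Vision", "Language"},
    {"Vision", "Audio"},
    {"Language", "Audio"},
    {"Vision", "Language", "Sensor"},
]

# Count every unordered pair of modalities that co-occurs in a paper;
# a triple-modality paper contributes all three of its pairs.
pair_counts = Counter(
    frozenset(pair)
    for mods in papers
    for pair in combinations(sorted(mods), 2)
)

def pairings_with(modality):
    """Pairings involving `modality`, most common first."""
    return sorted(
        ((tuple(sorted(p)), n) for p, n in pair_counts.items() if modality in p),
        key=lambda item: -item[1],
    )
```

On this toy data, `pairings_with("Vision")` puts the Vision & Language pair first with a count of 3, mirroring the dominance described above, while filtering by a rarer modality such as Sensor surfaces its comparatively unexplored pairings.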
These underexplored combinations often involve non-standard modalities such as Sensor, Graph, Tabular, or Spatial data. Some show rapid recent growth, indicating emerging research directions.