Bars (left axis) show total multimodal AI preprint volume per year. The dashed line (right axis) shows multimodal AI as a proportion of all AI preprints identified in the dataset.
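The yearly totals and shares behind a chart like this come from a simple aggregation. The sketch below is illustrative only: the `records` list of `(year, is_multimodal)` flags is a hypothetical stand-in for the actual dataset.

```python
from collections import Counter

# Hypothetical preprint records: (year, is_multimodal) flags.
# The real dataset and its labels are assumptions for illustration.
records = [
    (2021, True), (2021, False), (2021, False),
    (2022, True), (2022, True), (2022, False),
    (2023, True), (2023, True), (2023, True), (2023, False),
]

total_per_year = Counter(year for year, _ in records)
multimodal_per_year = Counter(year for year, mm in records if mm)

# Bars: absolute multimodal volume per year.
# Line: multimodal papers as a share of all AI preprints that year.
share = {
    year: multimodal_per_year[year] / total_per_year[year]
    for year in sorted(total_per_year)
}
```

With this toy data, `multimodal_per_year[2023]` is 3 and `share[2023]` is 0.75; the same two series drive the bars and the dashed line respectively.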
Vision and Language consistently dominate. Audio, Sensor, Graph, and Tabular modalities show emerging — and accelerating — growth from 2022 onwards.
Pairwise modality combinations remain the most common. Triple-modality papers are growing fastest in proportional terms, reflecting a trend towards richer, more complex multimodal systems.
Across all combination types, Vision & Language pairings dominate. The "Others" category captures novel pairings that do not involve Vision or Language as a primary modality — an area of growing research interest.
Vision & Language is by far the most common pairing. Filtering by a specific modality reveals which pairings are most or least explored relative to it.
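Counting pairings and filtering them by a chosen modality can be sketched as follows. The per-paper modality sets here are hypothetical examples, not drawn from the dataset; the category names simply mirror those used in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-paper modality tags (illustrative, not real data).
papers = [
    {"Vision", "Language"},
    {"Vision", "Language"},
    {"Vision", "Audio"},
    {"Language", "Audio"},
    {"Vision", "Language", "Sensor"},
]

# Count every unordered pair of modalities that co-occurs in a paper;
# a triple-modality paper contributes all three of its pairs.
pair_counts = Counter(
    frozenset(pair)
    for mods in papers
    for pair in combinations(sorted(mods), 2)
)

def pairings_with(modality):
    """Pairings involving `modality`, most common first."""
    return sorted(
        ((tuple(sorted(p)), n) for p, n in pair_counts.items() if modality in p),
        key=lambda item: -item[1],
    )
```

On this toy data, `pairings_with("Vision")` puts the Vision & Language pair first with a count of 3, mirroring the dominance described above, while filtering by a rarer modality such as Sensor surfaces its comparatively unexplored pairings.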
These underexplored combinations often involve non-standard modalities such as Sensor, Graph, Tabular, or Spatial data. Some show rapid recent growth, indicating emerging research directions.