Round 1 Funded Projects
CRITICAL-MM: Cross-modal Reasoning with Integrated Clinical Assessment for Large Multimodal Models
Munib Mesinovic, University of Oxford
Summary: Clinical AI deployment failures rarely stem from technical limitations; rather, they arise when models cannot be trusted by clinicians, integrated by healthcare systems, or understood by patients. Current multimodal benchmarks evaluate what foundation models can do (accuracy, speed), not what they should do (trustworthiness, safety, fairness). CRITICAL-MM addresses this gap by developing the first benchmark co-designed with the complete ICU stakeholder ecosystem across multiple international sites. We will create a two-tier evaluation framework for multimodal foundation models on real-world ICU tasks (asynchronous data fusion, explainable deterioration prediction, clinical handover generation, resource-constrained triage). Tier 1 (Technical) measures predictive performance using publicly accessible ICU datasets: MIMIC-IV, MIMIC-III, eICU, HiRID, and AmsterdamUMCdb, enabling multinational cross-institutional generalisability testing. Tier 2 (Socio-Technical) evaluates outputs through participatory workshops with ICU clinicians, patient representatives, ethicists, and industry partners, assessing trust, interpretability, and ethical acceptability using co-designed rubrics. Through bidirectional exchange with NYU Abu Dhabi (Cleveland Clinic) and clinical validation at Oxford NHS Trust, we will test benchmark generalisability across diverse healthcare contexts. CRITICAL-MM will establish global standards for AI evaluation, directly informing NHS decisions and cross-cultural AI safety practices.
Towards Carbon-Neutral Living: An Open Dataset for Smart Social Housing
Lu Gan, Brunel University of London
Summary: This project will develop an open multimodal dataset for AI-driven interventions in carbon-neutral social housing. The dataset will integrate smart meter data, IoT and acoustic sensors, resident surveys, and community social media, capturing energy use, behavioural patterns, indoor air quality, thermal comfort, and local engagement to enable rigorous evaluation of AI models for energy forecasting, digital twin simulation, circular economy strategies, climate adaptation, and policy analysis. This matters as over 4 million residents (17% of England’s population) live in social housing, many facing fuel poverty, health risks, and exclusion from smart city innovations. Despite growing interest in AI for sustainability, there is a critical lack of standardised, open datasets combining technical performance with real-world social, environmental, and equity indicators. Our interdisciplinary team, spanning engineering, social science, and mental health, will co-design community-driven data collection in partnership with EYYA (IoT industry partner), Lancaster West Neighbourhood Team (LWNT) at the Royal Borough of Kensington and Chelsea (RBKC), and local housing associations. This collaborative approach will produce a foundational resource for responsible AI development, evidence-based housing policy, and inclusive, human-centred innovation in the sustainable housing sector.
Neural Architecture Search Benchmark for Multimodal Self-Driving Car Test Data
Stephen McGough, Newcastle University
Summary: Self-driving cars are almost here. However, manufacturers wish to ‘test’ their cars in more realistic, real-world scenarios in which, alongside the car collecting data on its surroundings, a secondary set of sensors records how the system is performing and reacting to the environment. This is a highly multimodal problem, given the vast range of sensors and cameras involved, along with data collected from the car itself and from engineers riding in it. For the last five years we have been running a Neural Architecture Search (NAS) competition at CVPR, and more recently at AutoML, based on novel (unseen) datasets created by us. With ORI.AI (a UK company), we will create a new set of self-driving car NAS benchmarks and run competitions at BMVC and NeurIPS. ORI.AI will provide raw labelled datasets; however, these will need anonymisation, cleaning, and conversion into appropriate benchmarks. We will develop two benchmarks: (a) a classification task (identifying how the car acted sub-optimally) and (b) generation of a full report on the car’s activity. These can be used for standard or NAS benchmarking. We anticipate wide impact for this work from three main communities: self-driving car developers, multimodal data researchers, and the NAS community.
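As a rough illustration, the two benchmark tasks described above might be organised along the following lines. This is a minimal sketch only: all class names, field names, and labels here are illustrative assumptions, not ORI.AI's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class DrivingEpisode:
    """Anonymised multimodal capture for one test-drive segment (hypothetical)."""
    episode_id: str
    camera_frames: list    # e.g. file paths or arrays, one list per camera
    lidar_sweeps: list
    engineer_notes: list   # free-text observations from in-car engineers

@dataclass
class ClassificationSample:
    """Task (a): label how the car acted sub-optimally."""
    episode: DrivingEpisode
    label: str             # e.g. "late_braking", "lane_drift", "none"

@dataclass
class ReportSample:
    """Task (b): generate a full natural-language report of the car's activity."""
    episode: DrivingEpisode
    reference_report: str

def accuracy(predictions, samples):
    """Fraction of classification samples predicted correctly."""
    correct = sum(p == s.label for p, s in zip(predictions, samples))
    return correct / len(samples)

ep = DrivingEpisode("ep-001", [], [], ["hesitated at roundabout"])
samples = [ClassificationSample(ep, "late_braking"),
           ClassificationSample(ep, "none")]
print(accuracy(["late_braking", "lane_drift"], samples))  # → 0.5
```

Task (b) would instead be scored with text-generation metrics against the reference report, which a simple accuracy function like this does not cover.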
SONAIR - Sim2real Operational beNchmark for AI Robotics
Tianyi Zeng, University of Nottingham
Summary: SONAIR is a new platform that helps researchers and industry test artificial intelligence systems in a fair and realistic way. It allows teams to run exactly the same tasks in both computer simulation and on real robots or industrial machines, using standardised inputs and outputs. Each team has an online “room” linked to their institution. They upload their results through a simple one-click workflow. Before anything is evaluated, the system automatically checks that the data format and timing are correct. A secure evaluator then generates easy-to-understand scorecards and a browser replay, so users can directly compare performance.
SONAIR also includes an “AI Lab” where users can choose how different types of data (such as video, depth, lidar, or sensor readings) are combined. The platform supports long-horizon reasoning, streaming data, uncertainty estimation, and efficient models that run on edge devices. All code, starter kits, and tools will be openly released on GitHub so the wider community can reuse and expand them.
Our UK and international industrial partners will contribute real equipment, safety requirements, and testing facilities. They will help review tasks, run pilot trials, and provide practical guidance. This ensures SONAIR is grounded in real operational needs and supports safe, trustworthy AI deployment in robotics, aerospace, marine, and industrial settings.
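The automated pre-evaluation checks SONAIR describes (verifying data format and timing before scoring) could be sketched roughly as follows. The submission keys, timing tolerance, and error messages here are assumptions for illustration, not the platform's actual interface.

```python
REQUIRED_KEYS = {"team_id", "task_id", "timestamps", "outputs"}

def validate_submission(sub: dict, max_gap_s: float = 0.5):
    """Check a submission before evaluation: required keys present, outputs
    aligned one-to-one with timestamps, and timestamps strictly increasing
    with no gap larger than max_gap_s. Returns (ok, list_of_problems)."""
    problems = []
    missing = REQUIRED_KEYS - sub.keys()
    if missing:
        # Without the basic schema, no further checks are meaningful.
        return False, [f"missing keys: {sorted(missing)}"]
    ts = sub["timestamps"]
    if len(ts) != len(sub["outputs"]):
        problems.append("timestamp/output length mismatch")
    for a, b in zip(ts, ts[1:]):
        if b <= a:
            problems.append(f"non-increasing timestamps: {a} -> {b}")
        elif b - a > max_gap_s:
            problems.append(f"gap of {b - a:.2f}s exceeds {max_gap_s}s")
    return not problems, problems

ok, issues = validate_submission(
    {"team_id": "t1", "task_id": "pick-place",
     "timestamps": [0.0, 0.1, 0.9], "outputs": ["a", "b", "c"]})
print(ok, issues)  # the 0.8s gap exceeds the 0.5s tolerance, so ok is False
```

Running such checks at upload time, before the secure evaluator is invoked, is what lets the platform reject malformed submissions with actionable feedback rather than failing silently during scoring.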
Contact us: omaib-ukomain-group@sheffield.ac.uk