While accuracy benchmarks have been helpful in driving progress in some areas of AI technology, they are typically not representative of the demands AI systems face when deployed in real-world contexts. The resulting risk is that strong benchmark scores create a misleading impression of performance or progress in the field. In response, AI research has been developing new approaches to benchmarking, for example measuring performance under changing conditions. Further research in this area can help develop benchmarks that better reflect the characteristics needed to deliver trustworthy AI in practice, while also raising awareness among developers and users of deployed systems about the limits of current methods for robustness, explainability, and fairness. In the long term, systems should be able to assess their own limits, indicating when they cannot give a reliable answer; delivering such systems will require significant further progress.
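The two ideas above — evaluating a system under changing conditions rather than only on clean benchmark data, and letting a system signal when it cannot give a reliable answer — can be illustrated with a minimal sketch. Everything here is hypothetical: the toy model, the synthetic benchmark, and the names `perturb` and `evaluate` are illustrative, not drawn from any specific benchmark suite.

```python
import random

random.seed(0)

def model_score(x):
    # Toy "model": the score is simply the mean of the feature vector.
    return sum(x) / len(x)

def predict(x, threshold=0.5, margin=0.0):
    # Return 1 or 0, or None (abstain) when the score falls within
    # `margin` of the decision threshold -- the system signalling
    # that it cannot give a reliable answer.
    s = model_score(x)
    if abs(s - threshold) < margin:
        return None
    return 1 if s > threshold else 0

def perturb(x, eps):
    # Shift every feature by uniform noise in [-eps, eps],
    # simulating changing deployment conditions.
    return [v + random.uniform(-eps, eps) for v in x]

def evaluate(data, eps=0.0, margin=0.0):
    # Accuracy over answered examples, plus coverage
    # (the fraction of examples the model chose to answer).
    correct = answered = 0
    for x, y in data:
        if eps:
            x = perturb(x, eps)
        pred = predict(x, margin=margin)
        if pred is None:
            continue  # abstentions are excluded from accuracy
        answered += 1
        correct += (pred == y)
    coverage = answered / len(data)
    accuracy = correct / answered if answered else 0.0
    return accuracy, coverage

# Synthetic benchmark: label 1 iff the true feature mean exceeds 0.5.
data = []
for _ in range(1000):
    x = [random.random() for _ in range(5)]
    data.append((x, 1 if sum(x) / len(x) > 0.5 else 0))

clean_acc, _ = evaluate(data)                        # standard benchmark
noisy_acc, _ = evaluate(data, eps=0.3)               # shifted conditions
abst_acc, cov = evaluate(data, eps=0.3, margin=0.1)  # with abstention

print(f"clean accuracy:     {clean_acc:.3f}")
print(f"perturbed accuracy: {noisy_acc:.3f}")
print(f"with abstention:    {abst_acc:.3f} (coverage {cov:.3f})")
```

The sketch makes the paragraph's point concrete: the model is perfect on the clean benchmark yet degrades once conditions shift, and allowing it to abstain near its decision boundary trades coverage for more reliable answers.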