
Director's Corner: Testing and Evaluation in the Machine-Learning Weather-Prediction Era

Spring 2024

Can data-driven weather prediction models, in which artificial-intelligence methods are used to train algorithms to predict specific weather elements out to a desired lead time, be competitive with state-of-the-art numerical weather prediction models built on dynamical cores and parameterized physics? Thanks to recent rapid advances in the field of data-driven weather prediction, what was a pipe dream just a few years ago now seems inevitable, particularly on the synoptic scale.

Currently available data-driven weather-prediction models include Google’s GraphCast, NVIDIA’s FourCastNet, Huawei’s Pangu-Weather, Microsoft’s Aurora, Fudan University’s FuXi, and ECMWF’s AIFS deterministic models, as well as Google’s GenCast probabilistic model. Google’s NeuralGCM is a hybrid model that fuses a traditional dynamical core with an artificial-intelligence module for representing physical-process tendencies, and the testing and evaluation of artificial-intelligence emulators for existing physical parameterizations (such as the Rapid Radiative Transfer Model, through CIRA and NOAA GSL) is ongoing. Though the extent to which these models have been evaluated remains extremely limited, they have demonstrated synoptic-scale forecast skill that is comparable to, or exceeds, that of the current standard bearers, ECMWF’s IFS deterministic and EPS probabilistic forecast systems. Data-driven models also appear to be capable of faithfully mimicking fundamental atmospheric physics and dynamics, learning such characteristics solely from training data.

While data-driven models are currently at low readiness levels and thus are not yet ready for operational use, their viability is actively being evaluated by operational forecast agencies. For example, ECMWF has developed its own data-driven model, the Artificial Intelligence/Integrated Forecasting System (AIFS), which it is evaluating alongside four other data-driven models (FourCastNet, FuXi, GraphCast, and Pangu-Weather). In addition, NCEP is evaluating a version of GraphCast initialized using GDAS data, and NOAA recently hosted a workshop to establish a vision for data-driven numerical weather prediction.

Given their robust performance, demonstrated ability to mimic synoptic-scale kinematic and thermodynamic properties, and ongoing rapid development, data-driven models will play an integral role in the present and future of numerical weather prediction. With that in mind, it is fair to ask: how will, and how should, data-driven models be tested and evaluated? The DTC’s Science Advisory Board began to broach these questions at its Fall 2023 meeting, and given continued advances in data-driven modeling since then, these questions are even more timely now. While I don’t claim to have the answers to these important questions, I would like to share a little about what I see as the current similarities and differences in testing and evaluating these models.

Testing innovations for traditional numerical weather prediction models typically involves making one or more code changes intended to address a known deficiency, then running a series of increasingly complex tests (e.g., from a physics simulator to a single-column model to two- and then fully three-dimensional models) to assess whether the changes have the desired effect. Doing so requires a modest computational expense that is tractable for many individual researchers and research groups. This process also holds for hybrid dynamical/data-driven models, in which the artificial-intelligence emulators are trained separately from the rest of the model, and it represents a potential new capacity for the DTC’s Common Community Physics Package.

Testing fully data-driven models is quite different. Although these models are shaped by many architectural and hyperparameter choices, testing changes to those attributes currently requires retraining the entire model, which carries a significant computational expense (many dedicated graphics and/or tensor processing units running for days to weeks) that is not presently tractable for most researchers. Thus, until these computational resources – ideally complemented by community frameworks for developing, testing, and evaluating data-driven models at scale – become more widely available, testing innovations to data-driven models may primarily remain the domain of well-resourced teams. In the interim, testing efforts may emphasize fine-tuning pretrained models for specific prediction tasks, which typically carries a much smaller computational expense; a minimal sketch of this idea follows.
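To make the fine-tuning idea concrete, here is a minimal sketch in PyTorch, assuming a hypothetical pretrained forecast model whose backbone is frozen while a small output head is retrained for a new prediction task. All of the names, shapes, and file paths below are illustrative assumptions; none correspond to the actual architectures or checkpoints of the models discussed above.

```python
import torch
from torch import nn

# Illustrative only: "PretrainedForecastNet" stands in for any pretrained
# data-driven forecast model; it is not an actual model from this article.
class PretrainedForecastNet(nn.Module):
    def __init__(self, n_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_channels, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.head = nn.Conv2d(256, n_channels, kernel_size=1)

    def forward(self, x):
        return self.head(self.backbone(x))

model = PretrainedForecastNet()
# model.load_state_dict(torch.load("pretrained_weights.pt"))  # hypothetical checkpoint

# Freeze the expensive-to-train backbone; only the small head is tuned.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One illustrative training step on synthetic data shaped like gridded fields
# (batch, channels, lat, lon); real fine-tuning would loop over reanalysis data.
x = torch.randn(2, 64, 32, 64)   # initial conditions
y = torch.randn(2, 64, 32, 64)   # target fields at the desired lead time
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```

Because only the small head receives gradient updates in this sketch, the computational expense stays modest relative to retraining the full model from scratch, which is the point of the fine-tuning strategy described above.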

There may be fewer changes in how models are evaluated, however. Evaluating traditional numerical weather prediction models typically involves process-oriented diagnoses and forecast-verification activities to assess whether their solutions are physically realistic and to quantify whether their solutions are improved relative to a suitable baseline. Although data-driven models are presently limited in their outputs (e.g., they do not currently attempt to mimic physical-process tendencies, and the temporal frequency of their outputs is limited by the specific prediction interval for which they are trained), their evaluation looks much the same as for traditional models. This is exemplified by idealized evaluations of Pangu-Weather by Greg Hakim and Sanjit Masanam and a case-study evaluation of Storm Ciarán by Andrew Charlton-Perez and collaborators.

There are, however, two noteworthy practical differences in evaluating traditional versus data-driven weather prediction models. First, because running data-driven models carries a trivial computational expense compared to traditional models, it is far easier to generate large forecast datasets spanning a wide range of cases or conditions, enabling more robust evaluation. This also makes case-study analyses accessible to a wider range of researchers, particularly at under-resourced institutions and in developing countries. Second, although similar metrics are currently used to verify both types of models, parallel verification infrastructures have become established for traditional and data-driven models: the DTC’s METplus verification framework, which underpins several modeling centers’ verification systems, and Google’s Python-based WeatherBench 2 suite for verifying data-driven models. Whereas the range of applications is currently far greater for METplus, WeatherBench 2 facilitates fully Python-based verification workflows against cloud-based gridded analysis datasets (a rough sketch of such a workflow follows). Consequently, I believe the broader weather-prediction community would benefit from closer collaboration between the traditional and data-driven modeling communities, such as through workshops, visitor projects, and informal discussions, so that each community can learn from the other’s expertise and experiences.
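As a rough illustration of what such a Python-based verification workflow can look like, the sketch below computes a latitude-weighted root-mean-square error for one variable by comparing forecast and analysis fields stored as cloud-hosted Zarr archives. The store paths, variable name, and coordinate names are placeholder assumptions and are not taken from the WeatherBench 2 API.

```python
import numpy as np
import xarray as xr

# Placeholder cloud paths (reading gs:// stores requires a backend such as gcsfs);
# these are illustrative assumptions, not actual WeatherBench 2 dataset locations.
FORECAST_STORE = "gs://example-bucket/data-driven-forecasts.zarr"
ANALYSIS_STORE = "gs://example-bucket/gridded-analysis.zarr"

# Assumed layout: forecasts indexed by initialization time, lead time
# ("prediction_timedelta"), latitude, and longitude; analyses by valid time.
forecast = xr.open_zarr(FORECAST_STORE)["geopotential_500hPa"]
analysis = xr.open_zarr(ANALYSIS_STORE)["geopotential_500hPa"]

# Align the analysis to each forecast's valid time (initialization + lead time).
valid_time = forecast["time"] + forecast["prediction_timedelta"]
truth = analysis.sel(time=valid_time)

# Latitude-weighted mean squared error over the globe, then averaged over
# initialization times to give RMSE as a function of lead time.
weights = np.cos(np.deg2rad(forecast["latitude"]))
squared_error = (forecast - truth) ** 2
rmse = np.sqrt(
    squared_error.weighted(weights)
    .mean(dim=["latitude", "longitude"])
    .mean(dim="time")
)
print(rmse.compute())
```

Because the entire workflow stays in xarray (with lazy, chunked reads from cloud storage), the same few lines scale from a single case study on a laptop to multi-year evaluations run in the cloud, which is much of the appeal of this style of verification for data-driven models.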

Contributed by Clark Evans, NOAA GSL (formerly Professor of Atmospheric Sciences and Chair of the Atmospheric Sciences Program in the UWM School of Freshwater Sciences)