November 5-7, 2025
2nd Workshop on Correctness and Reproducibility
for Earth System Software
Tutorial: Rigor and Reasoning in Research Software
We are excited to announce the second edition of the Workshop on Correctness and Reproducibility for Earth System Software, to be held on November 5-7, 2025, at the Mesa Laboratory of the NSF National Center for Atmospheric Research (NCAR) in Boulder, Colorado. We aim to provide a dedicated forum for Earth system modelers, software engineers, and the broader scientific software community to discuss challenges, opportunities, and recent advances in ensuring software correctness and reproducibility. This workshop is a follow-up to the inaugural workshop held in November 2023, which brought together participants from academia, research labs, and industry to share their experiences and insights on software correctness and reproducibility.
Sponsored by the 2025 Better Scientific Software (BSSw) Fellowship program, this year’s workshop will feature a Tutorial on Rigor and Reasoning in Research Software, which will include sessions on practical techniques for improving software quality and reliability in scientific computing. The tutorial will cover core topics such as unit testing, continuous integration (CI), property-based testing, correctness in AI, and reasoning in research software. The workshop will also include invited talks, panel discussions, and contributed presentations on a wide range of topics related to software correctness and reproducibility.
Call for Abstracts: We invite contributions from researchers, software engineers, and practitioners in the Earth System Modeling (ESM) community, as well as the broader scientific computing community, on topics related to software correctness and reproducibility. Relevant applications include simulation codes, external libraries, AI techniques, diagnostics, packaging, and development practices.
Join us for a hands-on tutorial on bringing rigor and reasoning to research software (R3Sw). The R3Sw Tutorial runs primarily on Day 1 (November 5, 2025), with optional sessions on Days 2–3. See the tentative program for more details.
Motivation. Scientific software enables critical research, yet it’s often built in a “code-and-fix” style: quick to prototype, hard to test, and difficult to reason about. The R3Sw Tutorial draws inspiration from the scientific method, introducing practical techniques for designing and verifying code with the same rigor and systematic reasoning that underpin trustworthy scientific discovery.
What you’ll do. Working through a running example (1-D heat equation), you’ll incrementally transform an unstructured, monolithic code into modular, testable, and trustworthy code. Key topics (a brief illustrative sketch follows this list):
pytest (unit testing)
Hypothesis (property-based testing)
CrossHair (contract checking via symbolic execution)
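To give a flavor of these tools, here is a minimal, illustrative sketch, not the tutorial's actual material: the function names, grid size, and contract below are assumptions made for this example. It shows an explicit-Euler step for the 1-D heat equation, a pytest unit test, a Hypothesis property test, and a toy CrossHair-checkable contract.

```python
# Illustrative sketch only; the tutorial's code and exercises may differ.
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


def heat_step(u: np.ndarray, alpha: float, dx: float, dt: float) -> np.ndarray:
    """Advance the 1-D heat equation one explicit-Euler step (boundary values held fixed)."""
    r = alpha * dt / dx**2  # explicit Euler is stable for r <= 0.5
    u_new = u.copy()
    u_new[1:-1] = u[1:-1] + r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return u_new


def test_uniform_field_is_unchanged():
    # pytest unit test: a spatially uniform field is a steady state of diffusion.
    u = np.full(11, 3.0)
    assert np.allclose(heat_step(u, alpha=1.0, dx=0.1, dt=1e-4), u)


@given(arrays(np.float64, 11, elements=st.floats(-1e3, 1e3)))
def test_no_new_extrema(u):
    # Hypothesis property test: a stable diffusion step creates no new extrema.
    u_next = heat_step(u, alpha=1.0, dx=0.1, dt=1e-4)
    assert u_next.max() <= u.max() + 1e-9
    assert u_next.min() >= u.min() - 1e-9


def clamp(x: int, lo: int, hi: int) -> int:
    """Toy helper with a PEP 316 docstring contract that CrossHair can check.

    pre: lo <= hi
    post: lo <= __return__ <= hi
    """
    return max(lo, min(hi, x))
```

Running pytest exercises both tests, and crosshair check on the module attempts to verify (or refute) the docstring contract.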
Who should attend? This tutorial is intended for scientists, engineers, and students involved in scientific computing, regardless of domain or career stage. No prior experience with testing or verification is required, but participants should have some familiarity with Python.
We will offer travel support for a limited number of students and early-career researchers. Indicate your interest on the registration form. This tutorial is led by Alper Altuntas, features guest lecturers from academia and industry, and is supported by the 2025 Better Scientific Software (BSSw) Fellowship program.
Special Session: AI/ML Reasoning and Explainability in Scientific Software (Thursday Morning Session)
The convergence of artificial intelligence and formal verification is creating a transformative virtuous cycle that will reshape how we create and validate knowledge. On one side, AI systems increasingly need verification to ensure correct outputs and build trust. On the other, formal verification methods require AI to overcome fundamental challenges in scalability, usability, and automation. This bidirectional relationship is already revolutionizing mathematics through systems like Lean4, where AI assists in proof discovery while formal foundations ensure mathematical rigor, demonstrating how this synergy accelerates reliable knowledge creation. The success of Lean4 in transforming mathematical research offers a compelling blueprint for achieving verifiable intelligence across domains, fundamentally changing our approach to creating systems that are both powerful and provably correct.
Neural networks deliver strong predictions across the sciences, but their opacity limits adoption, especially in geoscience, where understanding why a forecast is made matters. Explainable AI (XAI) methods aim to bridge this gap, yet their outputs are method-dependent and can disagree, and even faithful attributions may be physically misleading if the underlying model has learned noise rather than signal. We conduct a controlled study using a synthetic benchmark in which the true drivers of the target are known. By training neural networks across different data sizes and target noise levels (conditions that mirror many observational Earth system datasets), we test when XAI methods recover the true explanatory structure. Two main results emerge. First, explanatory fidelity increases as models capture a larger fraction of the learnable, signal-driven variance. Second, inter-method agreement tracks this fidelity and can serve as a practical proxy when ground truth is unavailable. Conversely, in low signal-to-noise or data-scarce regimes, explanations degrade and methods diverge. These findings offer concrete guidance for deploying XAI in geosciences and beyond: prioritize models that demonstrably learn signal, and use cross-method consensus as an operational check on explanation reliability.
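As a rough sketch of the cross-method consensus check described above (the attribution vectors and agreement metric here are placeholder assumptions, not the study's actual setup), agreement between two attribution maps for the same model and input could be quantified with a rank correlation:

```python
# Illustrative sketch: measure agreement between two feature-attribution vectors
# produced by different XAI methods for the same model and input.
import numpy as np
from scipy.stats import spearmanr


def attribution_agreement(attr_a: np.ndarray, attr_b: np.ndarray) -> float:
    """Spearman rank correlation between two per-feature attribution vectors."""
    rho, _ = spearmanr(attr_a.ravel(), attr_b.ravel())
    return float(rho)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attr_method_a = rng.normal(size=100)                        # stand-in for one method's attributions
    attr_method_b = attr_method_a + 0.1 * rng.normal(size=100)  # a second, broadly agreeing method
    print(f"Inter-method agreement (Spearman rho): {attribution_agreement(attr_method_a, attr_method_b):.2f}")
```

High consensus does not guarantee physical correctness, but, as noted above, low consensus is a useful warning sign in low signal-to-noise or data-scarce regimes.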
AI weather prediction models have demonstrated remarkable increases in accuracy over traditional NWP models with far less latency and computational requirements for prediction. However, there are a growing number of examples where the improvements in performance come at the expense of physical consistency guarantees that are necessary assumptions for downstream applications like data assimilation. AI weather prediction models also tend to experience error growth in ways that differ noticeably from physics-based models. This presentation will examine different error scenarios for AI weather prediction models and show how some of these errors are being mitigated through architecture and physics constraints in the NCAR CREDIT platform.
Both the workshop and tutorial will be held in person (with a virtual option) at the Mesa Laboratory of the NSF National Center for Atmospheric Research. (See Helpful things to know for your visit.)
Virtual Meeting details will be announced later.