Data processing

Our data processing

Data on the climate data factory are derived from the original model data available on the official IPCC data portal (ESGF). These data are processed with advanced peer-reviewed methods to make them ready to use. This article summarizes the processing; for more details, download the full technical report available on ResearchGate.

Short description of the processing chain

Data sourcing
Original data: searching and downloading official model data from the ESGF archive and the Copernicus Climate Change Service.

Spatial interpolation
Input quality control: checking integrity of original data
Remapping: interpolating data onto a common grid

Adjustment
Downscaling: adjusting variables to observations and increasing data resolution
Standardization: rewriting data files according to the climate community's standards
Output quality control: checking downscaled data values and metadata

Operational
File merging: merging the multiple 10-year data files into a single 150-year set
Spatial extraction: producing country- and city-level data sets instead of global ones

Figure 1. Summary of the data processing chain

Detailed description of the processing chain

Data sourcing

Original data
Original data come from research organisations around the world that run Earth System Models to simulate the future climate under different greenhouse gas scenarios. These data are accessible to everyone through a petascale distributed database called the Earth System Grid Federation (ESGF), or through the Copernicus Climate Change Service data store, which offers a more user-oriented subset of the data. We use the Synda software developed by IPSL to search and download data files from ESGF.

Spatial interpolation

Input quality control
At this step, we check the original input model data to make sure they contain no technical or numerical errors, and we validate the integrity of their metadata.
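
As an illustration only, an input check of this kind could look like the sketch below. The function name, the plausible-range thresholds and the sample values are assumptions made for the example; they are not the checks actually run in our processing chain.

```python
import math


def check_input(values, times, valid_min, valid_max):
    """Return a list of human-readable problems found in one variable."""
    problems = []
    # Numerical check: NaN or infinite values usually signal a corrupted file.
    if any(not math.isfinite(v) for v in values):
        problems.append("non-finite values (NaN or inf)")
    # Plausibility check on the finite values only.
    finite = [v for v in values if math.isfinite(v)]
    if finite and (min(finite) < valid_min or max(finite) > valid_max):
        problems.append("values outside the plausible range")
    # Metadata check: the time axis must be strictly increasing.
    if any(b <= a for a, b in zip(times, times[1:])):
        problems.append("time axis is not strictly increasing")
    return problems


# Hypothetical near-surface temperature samples in kelvin.
good = check_input([280.1, 281.4, 279.9], [0, 1, 2], 170.0, 340.0)
bad = check_input([280.1, float("nan"), 999.0], [0, 2, 1], 170.0, 340.0)
```

Here `good` is an empty list, while `bad` reports all three kinds of problem.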

Remapping
Because climate models have different spatial resolutions (or “grids”), we remap raw model data onto a finer regular reference grid (here 0.25° x 0.25° or 0.10° x 0.10°). This makes it possible to compare models with one another, or to compare model output with gridded observations.

Different remapping methods are used from one variable to another, depending on the trend (linear, non-linear) or distribution (Gaussian, non-Gaussian, etc.) of the variable:
- tas, tasmin, tasmax and sfcWind are interpolated with a bicubic method.
- pr and rsds are interpolated with a conservative method (first and second order).
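
To give an idea of what "conservative" means, here is a minimal one-dimensional sketch of first-order conservative remapping. It assumes cells are described by their edges, and the grid and precipitation values are made up for the example. It is not the tool used in our processing chain, and the bicubic and second-order methods are not shown.

```python
def conservative_remap_1d(src_edges, src_values, dst_edges):
    """First-order conservative remapping on a 1D grid.

    Each destination cell receives the overlap-weighted mean of the source
    cells it intersects, so the integral of the field is preserved.
    """
    dst_values = []
    for lo, hi in zip(dst_edges[:-1], dst_edges[1:]):
        total = 0.0
        covered = 0.0
        for s_lo, s_hi, value in zip(src_edges[:-1], src_edges[1:], src_values):
            overlap = min(hi, s_hi) - max(lo, s_lo)
            if overlap > 0:
                total += overlap * value
                covered += overlap
        # Cells with no source coverage are marked as missing.
        dst_values.append(total / covered if covered > 0 else float("nan"))
    return dst_values


def integral(edges, values):
    """Sum of cell width times cell value."""
    return sum((hi - lo) * v for lo, hi, v in zip(edges[:-1], edges[1:], values))


# Hypothetical 1° source cells remapped onto 0.25° target cells.
src_edges = [0.0, 1.0, 2.0, 3.0]
src_pr = [2.0, 6.0, 1.0]
dst_edges = [i * 0.25 for i in range(13)]
dst_pr = conservative_remap_1d(src_edges, src_pr, dst_edges)
```

The integral of the field is the same before and after remapping, which is why conservative methods suit flux-like variables such as pr and rsds, whereas a smooth interpolation such as bicubic suits continuous fields like temperature.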

Adjustment/Downscaling

Removing biases and increasing resolution with statistical downscaling
A climate model is an approximate representation of the real-world climate drivers. This simplification stems from an incomplete understanding of climate physics and is also required for computational reasons. It inevitably introduces errors in model simulations, which become apparent when their statistical properties (e.g., mean, variance) are compared with climatological observations, and this limits the use of raw model data in impact studies.

To address this, we remove model biases and increase spatial resolution using statistical methods and observations, producing data that are better suited to climate change impact studies than raw model data. We use the Cumulative Distribution Function transform (CDF-t) method, one of many adjustment methods found in the literature, which we co-developed with academics to address the limitations of climate models. The code is freely available as an R package and has been used and referenced in more than 100 peer-reviewed publications.

The statistical approach needs observational data as a reference. Their spatial and temporal characteristics constrain those of the "calibrated model data". We use the ERA5 reanalysis as a proxy for observational data sets. It is the latest climate reanalysis produced by ECMWF and is accessible through the Copernicus Climate Change Service data store, providing hourly data on atmospheric, land-surface and sea-state parameters together with estimates of uncertainty. ERA5 data are available on regular latitude-longitude grids at 0.25° x 0.25°.
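
The idea behind CDF-t can be illustrated with a simplified sketch. As we understand the published formulation, the CDF of the variable at the observation scale in the future period is estimated as F_Xf(x) = F_Xh(Q_Mh(F_Mf(x))), where X denotes observations, M the model, h the historical period, f the future period, F a CDF and Q its inverse. Model values are then mapped through the inverse of F_Xf. The code below is not the R package: the helper names are hypothetical, the CDFs are plain empirical estimates, the data are synthetic, and values falling outside the overlap between the distributions are simply clamped.

```python
from bisect import bisect_right


def make_cdf(sample):
    """Empirical CDF with linear interpolation between sorted values."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        if x <= ordered[0]:
            return 0.0
        if x >= ordered[-1]:
            return 1.0
        i = bisect_right(ordered, x) - 1  # ordered[i] <= x < ordered[i + 1]
        frac = (x - ordered[i]) / (ordered[i + 1] - ordered[i])
        return (i + frac) / (n - 1)

    return cdf


def make_quantile(sample):
    """Inverse of make_cdf: maps a probability back to a value."""
    ordered = sorted(sample)
    n = len(ordered)

    def quantile(p):
        pos = min(max(p, 0.0), 1.0) * (n - 1)
        i = min(int(pos), n - 2)
        return ordered[i] + (pos - i) * (ordered[i + 1] - ordered[i])

    return quantile


def cdft_adjust(obs_hist, model_hist, model_future):
    """Adjust model_future using the inverse of F_Xf = F_Xh o Q_Mh o F_Mf."""
    f_mf, q_mf = make_cdf(model_future), make_quantile(model_future)
    f_mh, q_xh = make_cdf(model_hist), make_quantile(obs_hist)
    # Inverse of F_Xf applied to p = F_Mf(m) is Q_Mf(F_Mh(Q_Xh(p))).
    return [q_mf(f_mh(q_xh(f_mf(m)))) for m in model_future]


# Synthetic data: the model is 1 degree too warm and warms by 2 degrees.
obs_hist = [10.0 + 0.1 * k for k in range(101)]
model_hist = [v + 1.0 for v in obs_hist]
model_future = [v + 2.0 for v in model_hist]
adjusted = cdft_adjust(obs_hist, model_hist, model_future)
```

In this synthetic case the adjusted values keep the 2-degree change simulated by the model while the 1-degree bias is removed, except in the lower tail where the sketch clamps values.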

Standardization
Standardization consists of rewriting output data files and related metadata to comply with the climate community’s standards (e.g., the Climate and Forecast metadata convention and the CORDEX Data Reference Syntax for adjusted simulations, which we adapted for CMIP5). We use the Climate Model Output Rewriter 2 (CMOR 2) library.

Output quality control
For each bias-adjusted variable, we check compliance with the climate community's standards, data consistency and metadata. Quality control is a crucial step in data publication and re-use. We use the QA-DKRZ tool combined with an additional in-house quality control that checks the values of the adjusted and standardized variables. The in-house quality control is built upon the CDO and NCO tools and consists of two checks:
- Analyzing the difference between adjusted model and observation values over the reference period,
- Analyzing the difference in time evolution between the adjusted and non-adjusted model.
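
The two checks above can be sketched as follows. This is an illustration in plain Python, not our CDO/NCO-based implementation: the function names are hypothetical, the series are synthetic, and a linear trend is only one possible way to summarize the time evolution.

```python
from statistics import mean


def reference_bias(adjusted_ref, observed_ref):
    """Mean difference between adjusted model and observations (reference period)."""
    return mean(adjusted_ref) - mean(observed_ref)


def linear_trend(series):
    """Least-squares slope of the series per time step."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def trend_difference(adjusted, raw):
    """Difference in time evolution between adjusted and non-adjusted model."""
    return linear_trend(adjusted) - linear_trend(raw)


# Synthetic annual means: the raw model is 1.2 degrees too warm.
raw = [15.0 + 0.03 * t for t in range(150)]
adjusted = [v - 1.2 for v in raw]
observed_ref = adjusted[:30]  # stand-in for observations on a 30-year reference period

bias = reference_bias(adjusted[:30], observed_ref)
drift = trend_difference(adjusted, raw)
```

A remaining bias close to zero and a small trend difference indicate that the adjustment corrected the reference period without distorting the climate change signal of the model.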

Operational

File merging
Raw ESGF files are stored in 10-year periods, so we merge them into a single 150-year file to make data handling easier for users.
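
Conceptually, the merge is a concatenation along the time axis after checking that the pieces fit together. The sketch below uses plain lists as stand-ins for the 10-year files; the function name and the 1951-2100 period are assumptions made for the example.

```python
def merge_decadal_chunks(chunks):
    """Concatenate chunks along time after checking they are contiguous.

    Each chunk is a (years, values) pair standing in for one 10-year file.
    """
    ordered = sorted(chunks, key=lambda chunk: chunk[0][0])
    years, values = [], []
    for chunk_years, chunk_values in ordered:
        if len(chunk_years) != len(chunk_values):
            raise ValueError("time axis and data length differ")
        if years and chunk_years[0] != years[-1] + 1:
            raise ValueError(f"gap or overlap before {chunk_years[0]}")
        years.extend(chunk_years)
        values.extend(chunk_values)
    return years, values


# Hypothetical set of 15 decadal chunks, deliberately listed out of order.
chunks = [
    (list(range(start, start + 10)), [0.0] * 10)
    for start in range(2091, 1950, -10)
]
merged_years, merged_values = merge_decadal_chunks(chunks)
```

Sorting first and rejecting gaps or overlaps means a missing decade is reported instead of silently producing a shorter series.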

Spatial extraction
Raw ESGF files are stored as global data files, so we extract city- or country-level information to help users focus on their area of interest. City-level data are available on our Data Shop; other regions are available on demand.
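
A minimal sketch of such an extraction on a regular 0.25° grid is shown below: the nearest grid cell for a city, and a latitude-longitude box for a region. The function names, the grid extent and the coordinates are assumptions made for the example, and longitude wrap-around is ignored.

```python
def nearest_index(coords, target):
    """Index of the grid coordinate closest to the target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


def extract_point(field, lats, lons, lat, lon):
    """Value of the grid cell nearest to a location (city-level extraction)."""
    return field[nearest_index(lats, lat)][nearest_index(lons, lon)]


def extract_box(field, lats, lons, lat_min, lat_max, lon_min, lon_max):
    """Sub-grid covering a bounding box (region-level extraction)."""
    rows = [i for i, v in enumerate(lats) if lat_min <= v <= lat_max]
    cols = [j for j, v in enumerate(lons) if lon_min <= v <= lon_max]
    return [[field[i][j] for j in cols] for i in rows]


# Hypothetical 0.25° grid from 40°N to 50°N and 5°W to 10°E.
lats = [40.0 + 0.25 * i for i in range(41)]
lons = [-5.0 + 0.25 * j for j in range(61)]
# Each cell stores an identifier so the selected cell is easy to recognize.
field = [[i * 100 + j for j in range(len(lons))] for i in range(len(lats))]

paris_cell = extract_point(field, lats, lons, 48.8566, 2.3522)
box = extract_box(field, lats, lons, 48.0, 49.0, 2.0, 3.0)
```

Here the location falls in the cell centred on 48.75°N, 2.25°E, and the box contains 5 x 5 cells.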

Updated on: 06/01/2021
