Data on the climate data factory are derived from the raw data available on the official IPCC data portal (ESGF). The raw data are processed to make them actionable for climate change impact studies. This article summarizes the processing; for more details, you can download the full technical report available on ResearchGate.
Short description of the processing chain
- Raw data: searching and downloading raw data from the ESGF archive
- Input quality control: checking integrity of raw data
- Remapping: interpolating raw data onto a common grid
- Bias-adjustment: adjusting variables to observations for synoptic biases
- Standardization: rewriting files according to the climate community's standards
- Output quality control: checking bias-adjusted variables and metadata
- Temporal concatenation: linking the multiple 10-year data files into a single 150-year set
- Geographical extraction: producing country-level data sets instead of global ones
Detailed description of the processing chain
Raw data
Raw data come from research organisations around the world that run Earth System Models to simulate the future world climate under different greenhouse gas scenarios. These data are accessible to everyone through a peta-scale distributed database called the Earth System Grid Federation, or ESGF. We use the Synda software developed by IPSL to search and download data files from ESGF.
Input quality control
At this step, we check the raw input data to make sure there are no technical or numerical bugs and to validate metadata integrity.
Remapping
Because climate models have different spatial resolutions (or “grids”), we remap raw data onto a reference grid. This allows model inter-comparison as well as comparison of model outputs with gridded observations.
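To illustrate the idea of putting all models on one grid, here is a toy nearest-neighbour remapping sketch in Python/numpy. It is only a stand-in for the interpolation step: the grids, values and the nearest-neighbour choice are assumptions made for the example, not the method used in the production chain.

```python
import numpy as np

def remap_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Remap a 2-D field onto a destination grid by nearest neighbour.

    field has shape (len(src_lat), len(src_lon)).
    """
    # For each destination coordinate, find the index of the closest
    # source coordinate along each axis.
    i = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    j = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(i, j)]

# Example: remap a coarse 3x4 field onto a finer 1-degree grid.
src_lat = np.array([0.0, 2.0, 4.0])
src_lon = np.array([0.0, 2.0, 4.0, 6.0])
field = np.arange(12, dtype=float).reshape(3, 4)

dst_lat = np.arange(0.0, 4.1, 1.0)   # 5 points
dst_lon = np.arange(0.0, 6.1, 1.0)   # 7 points
out = remap_nearest(field, src_lat, src_lon, dst_lat, dst_lon)
print(out.shape)  # (5, 7)
```

Real remapping generally uses bilinear or conservative interpolation, which preserves gradients and budgets better than nearest neighbour, but the grid-to-grid mapping structure is the same.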
Bias adjustment
A climate model is an approximate representation of the real-world climate drivers. This simplification is due to incomplete understanding of climate physics and is required for computational reasons. It inevitably introduces errors in model simulations, visible when their statistical properties (e.g., mean, variance) are compared to climatological observations, thus limiting the use of raw model data in impact studies.
We therefore remove model biases with statistical methods and observations to produce adjusted data that are better suited for climate change impact studies. We use the Cumulative Distribution Function transform (CDF-t) method, one of many adjustment methods found in the literature, which we co-developed with academics to address climate model limitations. The method is freely available as an R package and is extensively used and referenced in more than 100 peer-reviewed publications.
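To give a feel for the family of methods CDF-t belongs to, here is a minimal empirical quantile-mapping sketch in Python/numpy. This is not the CDF-t algorithm itself (CDF-t additionally accounts for the change of the model distribution between the reference and future periods), and the temperatures and biases below are synthetic numbers invented for the example.

```python
import numpy as np

def quantile_map(model_ref, obs_ref, model_fut):
    """Empirical quantile mapping: locate each future model value in the
    reference-period model CDF, then read the observed value at the same
    quantile.  A simplified sketch, not the CDF-t implementation."""
    quantiles = np.linspace(0.0, 1.0, 101)
    model_q = np.quantile(model_ref, quantiles)
    obs_q = np.quantile(obs_ref, quantiles)
    probs = np.interp(model_fut, model_q, quantiles)
    return np.interp(probs, quantiles, obs_q)

rng = np.random.default_rng(0)
obs_ref = rng.normal(15.0, 3.0, 5000)    # "observed" temperatures
model_ref = rng.normal(13.0, 4.0, 5000)  # model with a cold bias
model_fut = rng.normal(14.0, 4.0, 5000)  # future run, same bias

adjusted = quantile_map(model_ref, obs_ref, model_fut)
# The adjusted series takes on observation-like statistics while
# keeping the warming simulated by the model.
print(adjusted.mean(), adjusted.std())
```

The adjusted distribution inherits the observed mean and spread over the reference period, which is exactly the property checked later in the output quality control.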
Standardization
Standardization consists of rewriting output data files and related metadata to comply with the climate community’s standards (e.g., the Climate and Forecast (CF) metadata convention and the Data Reference Syntax). We use the Climate Model Output Rewriter 2 (CMOR 2) library.
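As an illustration of what such standards prescribe, the dictionaries below sketch CF-style attributes for a near-surface air temperature variable (“tas”). In practice CMOR 2 writes these from controlled vocabularies; the experiment and institute values here are hypothetical examples.

```python
# Illustrative CF-style metadata for a "tas" variable.  Real files are
# produced by CMOR 2; the specific values below are example assumptions.
variable_attrs = {
    "standard_name": "air_temperature",
    "long_name": "Near-Surface Air Temperature",
    "units": "K",
    "cell_methods": "time: mean",
}
global_attrs = {
    "Conventions": "CF-1.4",
    "frequency": "day",
    "experiment_id": "rcp85",   # hypothetical example value
    "institute_id": "IPSL",     # hypothetical example value
}
```

Consistent names and units like these are what make files from different modelling groups directly comparable.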
Output quality control
For each bias-adjusted variable, we check data compliance with the climate community's standards, data consistency and metadata. Quality control is crucial for data publication and re-use. We use QA-DKRZ combined with an additional in-house quality control that checks the values of the adjusted and standardized variables. The in-house quality control is built upon the CDO and NCO tools and performs two checks:
- Analyzing the difference between adjusted model and observation values over the reference period,
- Analyzing the difference in time evolution between the adjusted and non-adjusted model data.
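The two checks above can be sketched as follows. This is a toy Python/numpy version of the idea (the real in-house control operates on netCDF files with CDO and NCO), and the tolerance value is an arbitrary assumption for the example.

```python
import numpy as np

def qc_report(obs_ref, adj_ref, raw_full, adj_full, tol=0.5):
    """Toy version of the two in-house checks; returns (bias_ok, trend_ok)."""
    # Check 1: adjusted data should sit close to observations over the
    # reference period.
    bias_ok = abs(adj_ref.mean() - obs_ref.mean()) < tol
    # Check 2: adjustment should not distort the simulated long-term
    # change (last decade mean minus first decade mean, daily data).
    raw_change = raw_full[-3650:].mean() - raw_full[:3650].mean()
    adj_change = adj_full[-3650:].mean() - adj_full[:3650].mean()
    trend_ok = abs(adj_change - raw_change) < tol
    return bool(bias_ok), bool(trend_ok)

rng = np.random.default_rng(1)
obs_ref = rng.normal(15.0, 1.0, 3650)           # observations, 10 years
raw_ref = rng.normal(13.0, 1.0, 3650)           # biased model, same period
adj_ref = raw_ref + 2.0                         # bias removed
warming = np.linspace(0.0, 3.0, 36500)          # 100-year trend
raw_full = rng.normal(13.0, 1.0, 36500) + warming
adj_full = raw_full + 2.0

print(qc_report(obs_ref, adj_ref, raw_full, adj_full))  # (True, True)
```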
Temporal concatenation
Raw ESGF files are stored in 10-year periods, so we concatenate them into a single 150-year file to make data handling easier for users.
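The concatenation idea can be sketched as follows, with plain numpy arrays standing in for the decade-long netCDF files; the contiguity check before joining is an assumption added for illustration.

```python
import numpy as np

# Each "file" covers one decade of daily values; a (start_day, values)
# pair stands in for a netCDF file with its time axis.
files = [(i * 3650, np.full(3650, float(i))) for i in range(15)]

# Check the decades are contiguous in time before concatenating.
for (s0, v0), (s1, _) in zip(files, files[1:]):
    assert s1 == s0 + len(v0), "gap or overlap between decade files"

series = np.concatenate([v for _, v in files])  # single 150-year series
print(series.shape)  # (54750,)
```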
Geographical extraction
Raw ESGF files are stored as global (CMIP5) or continental (CORDEX) domains (Asia, Europe, etc.), so we extract country-level and city-level information to help users focus on their area of interest.
Our country-level extraction method consists in identifying the “border” grid points for a country and drawing a rectangle around them. The drawback is that points from neighboring countries can be included in this rectangle, so we additionally create a mask per country to keep only that country's points.
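A minimal sketch of the rectangle-plus-mask idea, assuming the country membership of each grid point is already known (the toy boolean grid below is invented for the example):

```python
import numpy as np

def country_box_and_mask(in_country):
    """in_country: boolean 2-D array flagging grid points inside the
    country.  Returns the bounding-box slices plus the mask restricted
    to the box, so neighbouring-country points inside the rectangle
    can be excluded."""
    rows = np.where(in_country.any(axis=1))[0]
    cols = np.where(in_country.any(axis=0))[0]
    box = (slice(rows[0], rows[-1] + 1), slice(cols[0], cols[-1] + 1))
    return box, in_country[box]

# Toy 6x8 grid with an irregular "country" shape.
in_country = np.zeros((6, 8), dtype=bool)
in_country[2:5, 3:6] = True
in_country[3, 6] = True   # a point sticking out of the main block

box, mask = country_box_and_mask(in_country)
print(mask.shape, int(mask.sum()))  # (3, 4) 10
```

Applying `mask` to any variable sliced with `box` keeps only the country's own grid points.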
Model grid points are spaced approximately every 100 to 50 km for CMIP5 models and 15 to 10 km for CORDEX models. To extract city-level information, we take the grid point nearest to the city (note: we only consider continental points). Keep in mind that our city-level data correspond to a single grid point. They give the trend but do not account for local phenomena like the “urban heat island” effect, which modulates small-scale changes and requires higher resolution (typically 100 m) and specific modeling to be resolved (e.g., check out the urban climate modeling done by our friends at Vito).
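The nearest-continental-grid-point selection can be sketched as below. The flat-Earth distance approximation, the toy grid, and the coordinates (roughly those of Paris) are assumptions made for the example.

```python
import numpy as np

def nearest_grid_point(lat, lon, grid_lat, grid_lon, land_mask):
    """Pick the continental grid point closest to a city.

    grid_lat/grid_lon are 1-D coordinate axes; land_mask flags land
    points.  Distance is approximated on a flat Earth, with longitudes
    scaled by cos(latitude), which is fine at grid scale."""
    la, lo = np.meshgrid(grid_lat, grid_lon, indexing="ij")
    d2 = (la - lat) ** 2 + ((lo - lon) * np.cos(np.radians(lat))) ** 2
    d2 = np.where(land_mask, d2, np.inf)   # exclude ocean points
    i, j = np.unravel_index(np.argmin(d2), d2.shape)
    return int(i), int(j)

# Toy 0.5-degree grid over western Europe, all points flagged as land.
grid_lat = np.arange(40.0, 55.0, 0.5)
grid_lon = np.arange(-5.0, 10.0, 0.5)
land_mask = np.ones((grid_lat.size, grid_lon.size), dtype=bool)

i, j = nearest_grid_point(48.85, 2.35, grid_lat, grid_lon, land_mask)
print(i, j)  # 18 15 -> the grid point at 49.0N, 2.5E
```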
We provide city-level graphics and data for more than 4,300 cities worldwide, from 100,000 inhabitants (yellow), to more than 500,000 (orange) and more than 2,000,000 (red).
Our online service enables users to search, find and download ready-to-use IPCC climate projection data and graphics. Original model data are extracted from ESGF, remapped onto a reference grid, and variable biases are adjusted to make the data better suited for impact studies. Our processing chain is transparent, the methods are referenced in rank-A journals [1, 2] and the workflow software is available on GitHub.
The table below compares what the climate data factory provides with what ESGF provides.
ESGF official site https://esgf.llnl.gov/
Read more in our technical report "Bias adjusting climate model projections" (2018) available on ResearchGate.