Data on the climate data factory are derived from the original model data available on the official IPCC data portal (ESGF). These data went through either a light processing (Raw data) or an advanced processing (Ready to use data). A summary of the processing is presented in this article; for more details, download the full technical report available on ResearchGate.
Short description of the processing chain
- Original data: searching and downloading official model data from the ESGF archive
- Input quality control: checking integrity of original data
- Remapping: interpolating data onto a common grid
- Bias adjustment: adjusting variables against observations to correct systematic biases
- Standardization: rewriting files according to the climate community's standards
- Output quality control: checking bias-adjusted variables and metadata
- File merging: linking the multiple 10-year data files into a single 150-year set
- Spatial extraction: producing country- and city-level data sets instead of global ones
Figure 1. Summary of the data processing chain
Detailed description of the processing chain
- Original data
Original data come from research organisations around the world that run Earth System Models to simulate the future world climate under different greenhouse gas scenarios. These data are accessible to everyone through a peta-scale distributed database called the Earth System Grid Federation, or ESGF. We use the Synda software developed by IPSL to search for and download data files from ESGF.
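For reference, Synda downloads are driven by plain-text selection files listing key=value facets. The fragment below is only illustrative: the facet names and values shown are assumptions, and the exact syntax depends on the Synda version, so check the Synda documentation before use.

```ini
# Illustrative Synda selection file (facet values are examples only)
project=CMIP5
model=IPSL-CM5A-MR
experiment=historical rcp85
ensemble=r1i1p1
variable[atmos][day]=tas pr
```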
- Input quality control
At this step, we check the original input model data to make sure they contain neither technical nor numerical errors, and we validate metadata integrity.
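As a minimal sketch of what a numerical sanity check can look like, the function below scans a field for missing values and physically implausible values. It assumes daily near-surface air temperature in kelvin; the thresholds are illustrative placeholders, not the actual QC criteria used in the chain.

```python
import numpy as np

def check_field(values, vmin=170.0, vmax=340.0):
    """Return a list of problems found in a model field.

    Thresholds are illustrative: they loosely bracket plausible
    near-surface air temperatures in kelvin.
    """
    problems = []
    if np.isnan(values).any():
        problems.append("missing values (NaN)")
    if np.nan_to_num(values, nan=vmin).min() < vmin or \
       np.nan_to_num(values, nan=vmax).max() > vmax:
        problems.append("values outside physical range")
    return problems
```

A clean field returns an empty list; a field with a NaN and an out-of-range value returns two problems.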
- Remapping
Because climate models have different spatial resolutions (or "grids"), we remap raw data onto a reference grid. This allows model inter-comparison, as well as comparison of model outputs with gridded observations.
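The idea of remapping can be sketched with a nearest-neighbour lookup on regular lat/lon grids. Production chains typically use bilinear or conservative schemes (e.g., via CDO or ESMF); nearest-neighbour simply keeps the principle visible. The function and grids below are illustrative, not the actual operators used.

```python
import numpy as np

def remap_nearest(src_lats, src_lons, field, dst_lats, dst_lons):
    """Remap a 2-D field onto a new regular grid by nearest-neighbour
    lookup along each axis (a simplified stand-in for bilinear or
    conservative remapping)."""
    # For each destination coordinate, find the closest source coordinate
    lat_idx = np.abs(src_lats[:, None] - dst_lats[None, :]).argmin(axis=0)
    lon_idx = np.abs(src_lons[:, None] - dst_lons[None, :]).argmin(axis=0)
    # Gather the corresponding source values
    return field[np.ix_(lat_idx, lon_idx)]
```

For example, a field on a 2° grid can be remapped onto a 1° reference grid, doubling the number of points along each axis while preserving the values.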
- Bias adjustment
A climate model is an approximate representation of the real-world climate drivers. This simplification is due to an incomplete understanding of climate physics and is required for computational reasons. It inevitably introduces errors in model simulations when their statistical properties (e.g., mean, variance) are compared to climatological observations, thus limiting the use of raw model data in impact studies.
We therefore remove model biases with statistical methods and observations to produce adjusted data that are better suited for climate change impact studies. We use the Cumulative Distribution Function transform (CDF-t) method, one of many adjustment methods found in the literature, which we co-developed with academics to address climate model limitations. The code is freely available as an R package and is extensively used and referenced in more than 100 peer-reviewed publications.
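CDF-t itself also accounts for the change in the model distribution between the historical and future periods; its simpler cousin, empirical quantile mapping, is enough to illustrate the core idea. The sketch below is an assumption-laden illustration, not the CDF-t implementation: each model value is mapped through the historical model CDF, then through the inverse observed CDF.

```python
import numpy as np

def quantile_map(model_hist, obs, model_fut):
    """Empirical quantile mapping (simplified cousin of CDF-t):
    send each value through the historical model CDF, then invert
    the observed CDF at that probability."""
    # Empirical CDF of the historical model at each future value
    probs = np.searchsorted(np.sort(model_hist), model_fut) / len(model_hist)
    probs = np.clip(probs, 0.0, 1.0)
    # Inverse observed CDF (empirical quantile function)
    return np.quantile(obs, probs)
```

For instance, applied to a model with a constant 2-degree cold bias, the adjusted values recover the observed mean.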
- Standardization
Standardization consists of rewriting output data files and related metadata to comply with the climate community's standards (e.g., the Climate and Forecast metadata convention and the CORDEX Data Reference Syntax for adjusted simulations, which we adapted for CMIP5). We use the Climate Model Output Rewriter 2 (CMOR 2) library.
- Output quality control
For each bias-adjusted variable, we check data compliance with the climate community's standards, data consistency and metadata. Quality control is crucial for data publication and data re-use. We use the QA-DKRZ tool combined with an additional in-house quality control that checks the values of the adjusted and standardized variables. The in-house quality control is built on the CDO and NCO tools and consists of two checks:
- Analyzing the difference between adjusted model and observation values over the reference period,
- Analyzing the difference in time evolution between the adjusted and non-adjusted model.
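The two checks above can be sketched as follows. The function, tolerances, and pass/fail logic are illustrative assumptions (the actual chain works on NetCDF files with CDO/NCO): check 1 compares the adjusted series to observations over the reference period, and check 2 verifies that adjustment has not distorted the raw model's long-term trend.

```python
import numpy as np

def qc_report(obs_ref, adj_ref, raw_series, adj_series, mean_tol=0.5):
    """Two illustrative QC checks (tolerances are placeholders):
    1) adjusted values should be close to observations over the
       reference period;
    2) adjustment should preserve the raw model's linear trend."""
    mean_ok = abs(adj_ref.mean() - obs_ref.mean()) < mean_tol
    # Linear trend (slope per time step) of raw vs adjusted series
    t = np.arange(len(raw_series))
    raw_slope = np.polyfit(t, raw_series, 1)[0]
    adj_slope = np.polyfit(t, adj_series, 1)[0]
    trend_ok = abs(raw_slope - adj_slope) < 0.1 * abs(raw_slope)
    return mean_ok, trend_ok
```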
- File merging
Raw ESGF files are stored in 10-year periods, so we merge them into a single 150-year file to make data handling easier for users.
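Conceptually, merging is a chronological concatenation along the time axis. In the real chain this is done on NetCDF files (e.g., with a tool such as `cdo mergetime`); the sketch below assumes each decadal chunk is already a (time, lat, lon) array on the same grid, which is a simplification.

```python
import numpy as np

def merge_time(chunks, start_years):
    """Concatenate decadal chunks along the time axis in
    chronological order of their start years."""
    order = np.argsort(start_years)
    return np.concatenate([chunks[i] for i in order], axis=0)
```

Passing chunks out of order is harmless: they are sorted by start year before concatenation.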
- Spatial extraction
Raw ESGF files are stored as global (CMIP5) or continental (CORDEX) domains (Asia, Europe, etc.), so we extract country-level and city-level information to help users focus on their area of interest.
Our country-level extraction method consists of identifying the "border" grid points of a country and drawing a rectangle around them. The drawback is that grid points from neighboring countries can be included in this rectangle, so our next step is to create a per-country mask to keep only the points inside the country.
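Both steps, the bounding rectangle and the planned per-country mask, can be sketched on a boolean grid mask. The mask here is a hypothetical stand-in; real masks would come from country border polygons.

```python
import numpy as np

def country_box(mask, field):
    """Cut the bounding rectangle of a country out of a gridded field,
    plus a masked version where non-country points are set to NaN."""
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    box = field[r0:r1, c0:c1]                       # rectangle extraction
    sub_mask = mask[r0:r1, c0:c1]
    masked = np.where(sub_mask, box, np.nan)        # neighbours removed
    return box, masked
```

On an L-shaped country, the rectangle includes one grid point outside the country, which the masked version turns into NaN.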
Figure 2. We provide country-level data worldwide.
Model grid points are spaced approximately every 50 to 100 km for CMIP5 models and 10 to 15 km for CORDEX models. To extract city-level information, we take the grid point nearest to the city (note: we only consider continental points). Keep in mind that our city-level data correspond to a single grid point. They give the trend but do not account for local phenomena like the "urban heat island" effect, which modulates small-scale changes and requires higher resolution (typically 100 m) and specific modeling to be resolved (e.g., check out the urban climate modeling done by our friends at Vito).
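The nearest-grid-point lookup is straightforward on a regular lat/lon grid; the sketch below is an assumption-level illustration (a great-circle distance would be more accurate on irregular or high-latitude grids, and it omits the continental-point filter).

```python
import numpy as np

def nearest_grid_point(lats, lons, city_lat, city_lon):
    """Indices of the grid point closest to a city on a regular
    lat/lon grid (axis-wise nearest coordinate)."""
    i = np.abs(lats - city_lat).argmin()
    j = np.abs(lons - city_lon).argmin()
    return i, j
```

On a 0.5° grid, the selected point is within a quarter of a degree of the city's coordinates.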
Figure 3. We provide city level data in more than 4,300 cities, starting from 100,000 inhabitants (yellow) (i.e., >500,000 (orange) and >2,000,000 (red))
Our online service enables users to search for, find and download Raw and Ready to use climate information. Original model data are extracted from ESGF and remapped onto a reference grid (the Raw data), and variable biases are adjusted (the Ready to use data) to make them better suited for impact studies. Our processing chain is transparent, the methods are referenced in rank A journals [1, 2] and the workflow software is available on GitHub.
The table below compares what the climate data factory provides with what ESGF provides.
ESGF official site https://esgf.llnl.gov/
Read more on our data processing in our technical report "Bias adjusting climate model projections" (2018) available on ResearchGate.