Writing a new reader class

This section describes how to write a reader class for data that cerbere does not decode correctly out of the box (because the standard metadata and coordinates are expressed in a way cerbere cannot infer). Writing a new reader class for a given dataset serves several purposes:

  • fixing known formats that store or name standard information in non-conventional ways (dates and times, lat/lon, dimension or coordinate names, …), or whose stored data you want to transform or reshape in some way (even adding virtual variables computed on the fly).

  • fixing issues in the format and content that cause failures in xarray or cerbere

  • automatically selecting the correct reader class for a given dataset, guessed from its data filename pattern.

Handling a completely new type of format not yet supported by Xarray is better approached by writing a xarray.backends.BackendEntrypoint class, following the Xarray recommendations (we had to write such backend engines for formats such as BUFR, EPS, …).
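For illustration, here is a minimal sketch of such an entry point for a hypothetical ".myf" binary format; read_my_format is a made-up placeholder decoder, and only the class structure follows the Xarray backend API:

from pathlib import Path

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


def read_my_format(path):
    """Placeholder decoder for a hypothetical '.myf' binary format."""
    # a real implementation would parse the proprietary file content;
    # here we just return dummy data and attributes
    return np.zeros((2, 3)), {"title": "sample"}


class MyFormatBackendEntrypoint(BackendEntrypoint):

    description: str = "Open hypothetical '.myf' files in Xarray"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        data, attrs = read_my_format(filename_or_obj)
        return xr.Dataset(
            {"measurement": (("row", "cell"), data)}, attrs=attrs)

    def guess_can_open(self, filename_or_obj):
        return Path(filename_or_obj).suffix == ".myf"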

The following sections explore different cases to help you write your own reader class. Essentially, this consists of writing a few attributes and methods that help cerbere understand and access a file's content (a minimal skeleton is sketched after the list below):

  • engine: an attribute holding the name of the engine to be passed to the Xarray open_dataset method to read the data files

  • guess_can_open: a method returning True if a given filename matches the expected file name pattern of the dataset the class is meant to read

  • postprocess: a method post-processing the Xarray Dataset object read with open_dataset, fixing the various format and content issues and returning a new Dataset object complying with the CF/Cerbere requirements
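In its simplest form, a reader class is therefore just the following skeleton, where the file pattern and the body of postprocess are placeholders to adapt to your own dataset:

from pathlib import Path
import re
import typing as T


class MyReader:

    # placeholder file pattern: adapt it to your dataset naming
    pattern = re.compile(r".*\.nc$")
    engine: str = "netcdf4"

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        return re.match(
            MyReader.pattern, Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, **kwargs):
        # fix non-conventional names, attributes, variables, etc. here
        return ds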

Example 1: OSI SAF Level 2 Scatterometer datasets

Let’s take an OSI SAF scatterometer L2 product, an example of which can be downloaded at:

It is a swath product in NetCDF format; however it uses conventions not understood by the existing cerbere accessors:

  • lat, lon and time coordinates are respectively named wvc_lat, wvc_lon and row_time

  • the row and cell dimensions expected for a Swath feature are named respectively NUMROWS (sometimes numrows in other products) and NUMCELLS (or numcells)

This is quite easy to fix because the cerberize method of the cerbere Dataset accessor accepts dictionaries translating the coordinate and dimension names to the expected naming.

The OSISAFSCATL2 reader class for this dataset can be written in the following way:

"""
Reader class for KNMI / OSI SAF scatterometer Level 2 NETCDF files

Example of file pattern:
ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc
"""
from pathlib import Path
import re
import typing as T


class OSISAFSCATL2:

    pattern = re.compile(r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw\.l2\.nc$")
    engine: str = "netcdf4"
    description: str = "Use OSI SAF NetCDF scatterometer files in Xarray " \
                       "and cerbere"
    url: str = "https://link_to/your_backend/documentation"

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        return re.match(
            OSISAFSCATL2.pattern,
            Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, **kwargs):
        # translate the non-standard dimension and coordinate names to
        # the vocabulary expected by cerbere for a Swath feature
        ds.cb.cerberize(
            dim_matching={
                'numrows': 'row',
                'NUMROWS': 'row',
                'numcells': 'cell',
                'NUMCELLS': 'cell',
                'NUMAMBIGS': 'solutions',
                'NUMVIEWS': 'views'
            },
            var_matching={
                'row_time': 'time',
                'wvc_lat': 'lat',
                'wvc_lon': 'lon'
            }
        )

        # normalize longitude to -180/180
        ds.cb.cfdataset['lon'] = ds.cb.cfdataset['lon'].where(
            ds.cb.cfdataset['lon'] < 180,
            ds.cb.cfdataset['lon'] - 360
        )

        return ds.cb.cfdataset
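Given a file matching the above pattern, such as the one in the module docstring, the reader class can then be exercised directly; a minimal sketch:

import xarray as xr

f = "ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc"

if OSISAFSCATL2.guess_can_open(f):
    # open with the engine declared by the reader class, then fix
    # names and content to comply with the CF/Cerbere requirements
    ds = xr.open_dataset(f, engine=OSISAFSCATL2.engine)
    ds = OSISAFSCATL2.postprocess(ds)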

Note

Note that in such a (simple) case, you don't even need to create a new class: a file can be properly read and standardized by providing the dim_matching and var_matching dictionaries directly to the cerberize method, as done in the postprocess method of the example above.
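A minimal sketch of this direct usage, assuming that importing cerbere is what registers the .cb accessor on Xarray objects:

import xarray as xr
import cerbere  # assumed to register the .cb Xarray accessor

ds = xr.open_dataset(
    "ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc")
ds.cb.cerberize(
    dim_matching={'NUMROWS': 'row', 'NUMCELLS': 'cell'},
    var_matching={'row_time': 'time', 'wvc_lat': 'lat', 'wvc_lon': 'lon'})
ds = ds.cb.cfdataset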

Example 2: GHRSST datasets

The GHRSST format specification for NetCDF, which applies to L2, L3 and L4 datasets, is based on the CF convention but has a peculiarity that makes it non-compliant with the CF/Cerbere requirements: the time is split into two different variables, time and sst_dtime, that need to be summed up to reconstruct a full time variable. This is done again in the postprocess method, as shown below:

from pathlib import Path
import re
import typing as T

import numpy as np


class GHRSST:

    pattern: re.Pattern = re.compile(
        r"^[0-9]{8,14}-.*-L(2P|3\w|4)_GHRSST-.*\.nc$")
    engine: str = "netcdf4"
    description: str = "Use GHRSST files in Xarray"
    url: str = "https://link_to/your_backend/documentation"

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        return re.match(
            GHRSST.pattern,
            Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, decode_times: bool = True):
        if "sst_dtime" in ds.variables and decode_times:
            # replace with a new time variable decoding the GHRSST time
            dunit = ds.sst_dtime.attrs.get(
                "units", ds.sst_dtime.encoding.get("units"))
            if dunit is None:
                raise ValueError("missing unit for sst_dtime")
            if dunit not in ["second", "seconds"]:
                raise TypeError(f"can not process sst_dtime in {dunit}")

            # the single value along the dummy time dimension is the
            # reference to which the sst_dtime offsets are added
            reference = np.datetime64(ds["time"].values[0])

            time = ds["sst_dtime"].astype("timedelta64[s]") + reference
            time.attrs = ds["time"].attrs
            time.encoding = ds["time"].encoding

            # remove dummy time dimension
            ds = ds.squeeze(dim="time")

            # replace with single time field
            ds = ds.drop_vars(["time", "sst_dtime"])
            ds['time'] = time.squeeze(dim="time")
            ds = ds.set_coords(["time"])
            if "comment" in ds["time"].attrs:
                ds["time"].attrs.pop("comment")
            ds["time"].attrs["long_name"] = "measurement time"

        # normalize collection id (GDS 1.x => 2.x)
        if "DSD_entry_id" in ds.attrs:
            ds.attrs["id"] = ds.attrs.pop("DSD_entry_id")

        return ds
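As in the first example, the class can be exercised directly; a minimal sketch, using a made-up file name that matches the GHRSST pattern:

import xarray as xr

# hypothetical GHRSST L2P file name, for illustration only
f = "20230210120000-XXX-L2P_GHRSST-SSTskin-XXX-v02.0-fv01.0.nc"

if GHRSST.guess_can_open(f):
    ds = xr.open_dataset(f, engine=GHRSST.engine)
    # reconstruct the full measurement time from time and sst_dtime
    ds = GHRSST.postprocess(ds)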