.. _Xarray: http://xarray.pydata.org

.. |BackendEntrypoint| replace:: :class:`xarray.backends.BackendEntrypoint`

==============================
Writing a new ``reader`` class
==============================

This section describes how to write a ``reader`` class that allows reading
data that are not correctly decoded by ``cerbere`` (because the standard
metadata and coordinates are expressed in a way ``cerbere`` does not manage
to infer).

Writing a new ``reader`` class for a given dataset has several advantages:

* handling known formats that use non-conventional ways of storing or naming
  standard information (dates and times, lat/lon, dimension or coordinate
  names, ...), or for which you want to transform or reshape the stored data
  in some way (even adding virtual variables computed on the fly)
* fixing issues in the format and content that cause failures in xarray or
  ``cerbere``
* automatically recognizing the correct ``reader`` class to be used to read
  a dataset, guessing from its data filename pattern

Handling a completely new type of format not yet supported by Xarray_ is
better approached by writing a |BackendEntrypoint| class, following the
Xarray_ recommendations (we had to write backend engines for formats such as
BUFR, EPS, ...).

The following sections explore different cases to help you write your own
``reader`` class. Essentially, it consists in writing a few attributes and
methods that help ``cerbere`` understand and access a file's content:

* ``engine``: an attribute returning the name of the engine to be used by
  the Xarray_ ``open_dataset`` method to read the data files
* ``guess_can_open``: a method returning True if a given filename matches
  the expected pattern of the files of the dataset it is supposed to read
* ``postprocess``: a method postprocessing the Xarray_ ``Dataset`` object
  read with ``open_dataset``, fixing the various format and content issues
  and returning a new ``Dataset`` object complying with the CF/Cerbere
  requirements
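Putting these pieces together, a reader class has the following skeleton (a
minimal sketch: the class name, file pattern and metadata below are
placeholders to adapt to your own dataset, not an actual ``cerbere`` reader):

.. code-block:: python

    from pathlib import Path
    import re
    import typing as T


    class MyReader:
        # placeholder pattern: adapt to your dataset's file naming scheme
        pattern: re.Pattern = re.compile(r"my_dataset_.*\.nc$")

        # engine passed to the Xarray open_dataset method
        engine: str = "netcdf4"
        description: str = "Use my dataset files in Xarray and cerbere"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            # return True if the file name matches the dataset's pattern
            return re.match(
                MyReader.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, **kwargs):
            # fix format and content issues here and return a Dataset
            # complying with the CF/Cerbere requirements
            return ds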
Example 1: OSI SAF Level 2 Scatterometer datasets
==================================================

Let's take an OSI SAF scatterometer L2 product, an example of which can be
downloaded at:

It is a swath product in NetCDF format; however it uses conventions not
understood by the existing ``cerbere`` accessors:

* the ``lat``, ``lon`` and ``time`` coordinates are respectively named
  ``wvc_lat``, ``wvc_lon`` and ``row_time``
* the ``row`` and ``cell`` dimensions expected for a
  :class:`~cerbere.feature.cswath.Swath` feature are respectively named
  ``NUMROWS`` (sometimes ``numrows`` in other products) and ``NUMCELLS``
  (or ``numcells``)

This is quite easy to fix because the ``cerberize`` method of the ``cerbere``
Dataset accessor accepts dictionaries translating the coordinate and
dimension names to the expected naming. The ``OSISAFSCATL2`` reader class for
this dataset can be written in the following way:

.. code-block:: python

    """Reader class for KNMI / OSI SAF scatterometer Level 2 NetCDF files.

    Example of file pattern:
    ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc
    """
    from pathlib import Path
    import re
    import typing as T


    class OSISAFSCATL2:
        pattern: re.Pattern = re.compile(
            r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw\.l2\.nc$")
        engine: str = "netcdf4"
        description: str = "Use OSI SAF NetCDF scatterometer files in " \
                           "Xarray and cerbere"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            return re.match(
                OSISAFSCATL2.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, **kwargs):
            # translate dimension and coordinate names to the expected
            # CF/Cerbere naming
            ds.cb.cerberize(
                dim_matching={
                    'numrows': 'row',
                    'NUMROWS': 'row',
                    'numcells': 'cell',
                    'NUMCELLS': 'cell',
                    'NUMAMBIGS': 'solutions',
                    'NUMVIEWS': 'views'
                },
                var_matching={
                    'row_time': 'time',
                    'wvc_lat': 'lat',
                    'wvc_lon': 'lon'
                }
            )

            # normalize longitude to -180/180
            ds.cb.cfdataset['lon'] = ds.cb.cfdataset['lon'].where(
                ds.cb.cfdataset['lon'] < 180,
                ds.cb.cfdataset['lon'] - 360
            )

            return ds.cb.cfdataset

.. note::

   In such a (simple) case, you don't even need to create a new class: a
   file can be properly read and standardized by providing the
   ``dim_matching`` and ``var_matching`` dictionaries to the ``cerberize``
   method, as is done in the ``postprocess`` method of the example above.

Example 2: GHRSST datasets
==========================

The GHRSST format specification for NetCDF, which applies to L2, L3 and L4
datasets, is based on the CF convention but has a peculiarity that makes it
non-compliant with the CF/Cerbere requirements: the time is split into two
variables, ``time`` and ``sst_dtime``, that need to be summed up to
reconstruct a full ``time`` variable. This is done again in the
``postprocess`` method, as shown below:

.. code-block:: python

    from pathlib import Path
    import re
    import typing as T

    import numpy as np


    class GHRSST:
        pattern: re.Pattern = re.compile(
            r"^[0-9]{8,14}-.*-L(2P|3\w|4)_GHRSST-.*\.nc$")
        engine: str = "netcdf4"
        description: str = "Use GHRSST files in Xarray"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            return re.match(
                GHRSST.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, decode_times: bool = True):
            if "sst_dtime" in ds.variables and decode_times:
                # replace with new time variable decoding GHRSST time
                dunit = ds.sst_dtime.attrs.get(
                    "units", ds.sst_dtime.encoding.get("units"))
                if dunit is None:
                    raise ValueError("missing unit for sst_dtime")
                if dunit not in ["second", "seconds"]:
                    raise TypeError(
                        "Can not process sst_dtime in {}".format(dunit))

                reference = np.datetime64(ds["time"].values[0])
                time = ds["sst_dtime"].astype("timedelta64[s]") + reference
                time.attrs = ds["time"].attrs
                time.encoding = ds["time"].encoding

                # remove dummy time dimension
                ds = ds.squeeze(dim="time")

                # replace with single time field
                ds = ds.drop_vars(["time", "sst_dtime"])
                ds['time'] = time.squeeze(dim="time")
                ds = ds.set_coords(["time"])
                if "comment" in ds["time"].attrs:
                    ds["time"].attrs.pop("comment")
                ds["time"].attrs["long_name"] = "measurement time"

            # normalize collection id (GDS 1.x => 2.x)
            if "DSD_entry_id" in ds.attrs:
                ds.attrs["id"] = ds.attrs.pop("DSD_entry_id")

            return ds
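As a usage sketch, such a reader class can be exercised directly with Xarray
(the file name below is a made-up example matching the GHRSST naming pattern,
not a real product):

.. code-block:: python

    import xarray as xr

    # hypothetical file name, for illustration only
    path = "20230210120000-IFR-L3C_GHRSST-SSTsubskin-ODYSSEA-GLOB_010-v02.0-fv01.0.nc"

    if GHRSST.guess_can_open(path):
        ds = xr.open_dataset(path, engine=GHRSST.engine)
        ds = GHRSST.postprocess(ds, decode_times=True)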