.. _Xarray: http://xarray.pydata.org

.. |BackendEntrypoint| replace:: :class:`xarray.backends.BackendEntrypoint`

==============================
Writing a new ``reader`` class
==============================

This section describes how to write a ``reader`` class that allows reading
data that are not correctly decoded by ``cerbere`` (because the standard
metadata and coordinates are expressed in a way ``cerbere`` does not manage
to infer).

Writing a new ``reader`` class for a given dataset has several advantages:

* handling known formats that use non-conventional ways of storing or naming
  standard information (dates and times, lat/lon, dimension or coordinate
  names, ...), or for which you want to transform or reshape the stored data
  in some way (even adding virtual variables computed on the fly)
* fixing issues in the format and content that cause failures in xarray or
  ``cerbere``
* automatically recognizing the correct ``reader`` class to be used to read
  a dataset, guessing from its data filename pattern

Handling a completely new type of format not yet supported by Xarray_ is
better approached by writing a |BackendEntrypoint| class, following the
Xarray_ recommendations (we had to write backend engines for formats such as
BUFR, EPS, ...).

The following sections explore different cases to help you write your own
``reader`` class. Essentially, it consists in writing a few attributes and
methods that help ``cerbere`` understand and access a file's content:

* ``engine``: an attribute returning the name of the engine to be used by
  the Xarray_ ``open_dataset`` method to read the data files
* ``guess_can_open``: a method returning True if a given filename matches
  the expected pattern of the files of the dataset it is supposed to read
* ``postprocess``: a method postprocessing the Xarray_ ``Dataset`` object
  read with ``open_dataset``, fixing the various format and content issues
  and returning a new ``Dataset`` object complying with the CF/Cerbere
  requirements
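Putting these pieces together, a reader class has the following skeleton (a
minimal sketch: the class name, file pattern and metadata below are
placeholders to adapt to your own dataset, not an actual ``cerbere`` reader):

.. code-block:: python

    from pathlib import Path
    import re
    import typing as T


    class MyReader:
        # placeholder pattern: adapt to your dataset's file naming scheme
        pattern: re.Pattern = re.compile(r"my_dataset_.*\.nc$")

        # engine passed to the Xarray open_dataset method
        engine: str = "netcdf4"
        description: str = "Use my dataset files in Xarray and cerbere"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            # return True if the file name matches the dataset's pattern
            return re.match(
                MyReader.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, **kwargs):
            # fix format and content issues here and return a Dataset
            # complying with the CF/Cerbere requirements
            return ds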
Example 1: OSI SAF Level 2 Scatterometer datasets
==================================================

Let's take an OSI SAF scatterometer L2 product, an example of which can be
downloaded at:

It is a swath product in NetCDF format; however it uses conventions not
understood by the existing ``cerbere`` accessors:

* the ``lat``, ``lon`` and ``time`` coordinates are respectively named
  ``wvc_lat``, ``wvc_lon`` and ``row_time``
* the ``row`` and ``cell`` dimensions expected for a
  :class:`~cerbere.feature.cswath.Swath` feature are respectively named
  ``NUMROWS`` (sometimes ``numrows`` in other products) and ``NUMCELLS``
  (or ``numcells``)

This is quite easy to fix because the ``cerberize`` method of the ``cerbere``
Dataset accessor accepts dictionaries translating the coordinate and
dimension names to the expected naming. The ``OSISAFSCATL2`` reader class for
this dataset can be written in the following way:

.. code-block:: python

    """Reader class for KNMI / OSI SAF scatterometer Level 2 NetCDF files.

    Example of file pattern:
    ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc
    """
    from pathlib import Path
    import re
    import typing as T


    class OSISAFSCATL2:
        pattern: re.Pattern = re.compile(
            r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw\.l2\.nc$")
        engine: str = "netcdf4"
        description: str = "Use OSI SAF NetCDF scatterometer files in " \
                           "Xarray and cerbere"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            return re.match(
                OSISAFSCATL2.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, **kwargs):
            # translate dimension and coordinate names to the expected
            # CF/Cerbere naming
            ds.cb.cerberize(
                dim_matching={
                    'numrows': 'row',
                    'NUMROWS': 'row',
                    'numcells': 'cell',
                    'NUMCELLS': 'cell',
                    'NUMAMBIGS': 'solutions',
                    'NUMVIEWS': 'views'
                },
                var_matching={
                    'row_time': 'time',
                    'wvc_lat': 'lat',
                    'wvc_lon': 'lon'
                }
            )

            # normalize longitude to -180/180
            ds.cb.cfdataset['lon'] = ds.cb.cfdataset['lon'].where(
                ds.cb.cfdataset['lon'] < 180,
                ds.cb.cfdataset['lon'] - 360
            )

            return ds.cb.cfdataset

.. note::

   In such a (simple) case, you don't even need to create a new class: a
   file can be properly read and standardized by providing the
   ``dim_matching`` and ``var_matching`` dictionaries to the ``cerberize``
   method, as is done in the ``postprocess`` method of the example above.

Example 2: GHRSST datasets
==========================

The GHRSST format specification for NetCDF, which applies to L2, L3 and L4
datasets, is based on the CF convention but has a peculiarity that makes it
non-compliant with the CF/Cerbere requirements: the time is split into two
variables, ``time`` and ``sst_dtime``, that need to be summed up to
reconstruct a full ``time`` variable. This is done again in the
``postprocess`` method, as shown below:

.. code-block:: python

    from pathlib import Path
    import re
    import typing as T

    import numpy as np


    class GHRSST:
        pattern: re.Pattern = re.compile(
            r"^[0-9]{8,14}-.*-L(2P|3\w|4)_GHRSST-.*\.nc$")
        engine: str = "netcdf4"
        description: str = "Use GHRSST files in Xarray"
        url: str = "https://link_to/your_backend/documentation"

        @staticmethod
        def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
            return re.match(
                GHRSST.pattern, Path(filename_or_obj).name) is not None

        @staticmethod
        def postprocess(ds, decode_times: bool = True):
            if "sst_dtime" in ds.variables and decode_times:
                # replace with new time variable decoding GHRSST time
                dunit = ds.sst_dtime.attrs.get(
                    "units", ds.sst_dtime.encoding.get("units"))
                if dunit is None:
                    raise ValueError("missing unit for sst_dtime")
                if dunit not in ["second", "seconds"]:
                    raise TypeError(
                        "Can not process sst_dtime in {}".format(dunit))

                reference = np.datetime64(ds["time"].values[0])
                time = ds["sst_dtime"].astype("timedelta64[s]") + reference
                time.attrs = ds["time"].attrs
                time.encoding = ds["time"].encoding

                # remove dummy time dimension
                ds = ds.squeeze(dim="time")

                # replace with single time field
                ds = ds.drop_vars(["time", "sst_dtime"])
                ds['time'] = time.squeeze(dim="time")
                ds = ds.set_coords(["time"])
                if "comment" in ds["time"].attrs:
                    ds["time"].attrs.pop("comment")
                ds["time"].attrs["long_name"] = "measurement time"

            # normalize collection id (GDS 1.x => 2.x)
            if "DSD_entry_id" in ds.attrs:
                ds.attrs["id"] = ds.attrs.pop("DSD_entry_id")

            return ds
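As a usage sketch, such a reader class can be exercised directly with Xarray
(the file name below is a made-up example matching the GHRSST naming pattern,
not a real product):

.. code-block:: python

    import xarray as xr

    # hypothetical file name, for illustration only
    path = "20230210120000-IFR-L3C_GHRSST-SSTsubskin-ODYSSEA-GLOB_010-v02.0-fv01.0.nc"

    if GHRSST.guess_can_open(path):
        ds = xr.open_dataset(path, engine=GHRSST.engine)
        ds = GHRSST.postprocess(ds, decode_times=True)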