Writing a new reader class
This section describes how to write a reader class that allows reading data that are not correctly decoded by cerbere (because the standard metadata and coordinates are expressed in a way cerbere does not manage to infer). Writing a new reader class for a given dataset may have several advantages:
- fixing known formats with non-conventional ways of storing or naming standard information (dates and times, lat/lon, dimension or coordinate names, …), or for which you want to transform or reshape the stored data in some way (even adding virtual variables computed on the fly)
- fixing issues in the format and content that cause failures in xarray or cerbere
- automatically recognizing the correct reader class to use for a given dataset, guessing from the data filename pattern
Handling a completely new type of format not yet supported by Xarray is better approached by writing a xarray.backends.BackendEntrypoint class, following the Xarray recommendations (we had to write backend engines for formats such as BUFR, EPS, …).
The following sections explore different cases to help you write your own reader class. Basically, it consists in writing a couple of attributes and methods that help cerbere understand and access a file's content:

- engine: an attribute returning the name of the engine to be used by the Xarray open_dataset method to read the data files
- guess_can_open: a method returning True if a given filename matches the expected pattern of the files of the dataset it is supposed to read
- postprocess: a method postprocessing the Xarray Dataset object read with open_dataset, fixing the different format and content issues and returning a new Dataset object complying with the CF/Cerbere requirements
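As a sketch, a minimal reader class exposing these three members might look as follows. The class name, engine and filename pattern here are placeholders for illustration, not taken from an actual cerbere reader:

```python
import re
import typing as T
from pathlib import Path


class MyReader:
    # Xarray engine used by open_dataset (placeholder choice)
    engine: str = "netcdf4"

    # placeholder filename pattern for this hypothetical dataset
    pattern = re.compile(r"^mydata_[0-9]{8}\.nc$")

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        # True when the file name matches the dataset naming convention
        return re.match(
            MyReader.pattern, Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, **kwargs):
        # fix format and content issues on the decoded Dataset, then
        # return a Dataset complying with the CF/Cerbere requirements
        return ds
```

The two examples below fill in this skeleton for real datasets.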
Example 1: OSI SAF Level 2 Scatterometer datasets
Let’s take an OSI SAF scatterometer L2 product, an example of which can be downloaded at:
It is a swath product in NetCDF format; however, it uses conventions not understood by the existing cerbere accessors:
- the lat, lon and time coordinates are respectively named wvc_lat, wvc_lon and row_time
- the row and cell dimensions expected for a Swath feature are respectively named NUMROWS (sometimes numrows in other products) and NUMCELLS (or numcells)
This is quite easy to fix because the cerberize method of the cerbere Dataset accessor allows passing dictionaries to translate the coordinate and dimension names to the expected naming. The OSISAFSCATL2 reader class for this dataset can be written in the following way:
"""
Reader class for KNMI / OSI SAF scatterometer Level 2 NETCDF files
Example of file pattern:
ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc
"""
from pathlib import Path
import re
import typing as T
class OSISAFSCATL2:
    pattern = re.compile(r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw.l2.nc$")
    engine: str = "netcdf4"
    description: str = "Use OSI SAF NetCDF scatterometer files in Xarray " \
                       "and cerbere"
    url: str = "https://link_to/your_backend/documentation"

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        return re.match(
            OSISAFSCATL2.pattern,
            Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, **kwargs):
        # translate dimension and coordinate names to the expected naming
        ds.cb.cerberize(
            dim_matching={
                'numrows': 'row',
                'NUMROWS': 'row',
                'numcells': 'cell',
                'NUMCELLS': 'cell',
                'NUMAMBIGS': 'solutions',
                'NUMVIEWS': 'views'
            },
            var_matching={
                'row_time': 'time',
                'wvc_lat': 'lat',
                'wvc_lon': 'lon'
            }
        )
        # normalize longitude to [-180, 180[
        ds.cb.cfdataset['lon'] = ds.cb.cfdataset['lon'].where(
            ds.cb.cfdataset['lon'] < 180,
            ds.cb.cfdataset['lon'] - 360
        )
        return ds.cb.cfdataset
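As a quick sanity check, the filename matching logic can be exercised standalone against the sample file name from the docstring. The pattern is copied from the class above, re-declared as a plain function so the snippet is self-contained:

```python
import re
from pathlib import Path

# filename pattern copied from the OSISAFSCATL2 class
pattern = re.compile(r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw.l2.nc$")


def guess_can_open(filename_or_obj) -> bool:
    # only the file name, not the full path, is matched
    return re.match(pattern, Path(filename_or_obj).name) is not None


print(guess_can_open(
    "ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc"))  # True
```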
Note
Note that in such a (simple) case, you don't even need to create a new class. A file could be properly read and standardized by providing the dim_matching and var_matching dictionaries directly to the cerberize method, as is done in the postprocess method of the example above.
Example 2: GHRSST datasets
The GHRSST format specification for NetCDF, applying to L2, L3 and L4 datasets, is based on the CF convention but has a peculiarity that makes it non-compliant with the CF/Cerbere requirements: the time is split into two different variables, time and sst_dtime, that need to be summed up to reconstruct a full time variable. This is done again in the postprocess method, as shown below:
from pathlib import Path
import re
import typing as T
import numpy as np
class GHRSST:
    pattern: re.Pattern = re.compile(
        r"^[0-9]{8,14}-.*-L(2P|3\w|4)_GHRSST-.*nc$")
    engine: str = "netcdf4"
    description: str = "Use GHRSST files in Xarray"
    url: str = "https://link_to/your_backend/documentation"

    @staticmethod
    def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
        return re.match(
            GHRSST.pattern,
            Path(filename_or_obj).name) is not None

    @staticmethod
    def postprocess(ds, decode_times: bool = True):
        if "sst_dtime" in ds.variables and decode_times:
            # replace with a new time variable decoding the GHRSST time
            dunit = ds.sst_dtime.attrs.get(
                "units", ds.sst_dtime.encoding.get("units"))
            if dunit is None:
                raise ValueError("missing unit for sst_dtime")
            if dunit not in ["second", "seconds"]:
                raise TypeError(
                    "Can not process sst_dtime in {}".format(dunit))

            reference = np.datetime64(ds["time"].values[0])
            time = ds["sst_dtime"].astype("timedelta64[s]") + reference
            time.attrs = ds["time"].attrs
            time.encoding = ds["time"].encoding

            # remove the dummy time dimension
            ds = ds.squeeze(dim="time")

            # replace with a single time field
            ds = ds.drop_vars(["time", "sst_dtime"])
            ds["time"] = time.squeeze(dim="time")
            ds = ds.set_coords(["time"])
            if "comment" in ds["time"].attrs:
                ds["time"].attrs.pop("comment")
            ds["time"].attrs["long_name"] = "measurement time"

        # normalize collection id (GDS 1.x => 2.x)
        if "DSD_entry_id" in ds.attrs:
            ds.attrs["id"] = ds.attrs.pop("DSD_entry_id")

        return ds
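The core of the time reconstruction in postprocess is plain NumPy datetime arithmetic: the per-pixel sst_dtime offsets (in seconds) are added to the single reference value stored in the time variable. A minimal sketch with made-up values:

```python
import numpy as np

# hypothetical reference time, as stored in the size-1 "time" variable
reference = np.datetime64("2023-02-10T00:00:00")

# hypothetical per-pixel offsets in seconds, as stored in "sst_dtime"
sst_dtime = np.array([0, 3600, 7200], dtype="timedelta64[s]")

# full measurement times, as rebuilt in postprocess
time = sst_dtime + reference
print(time)
```

The result is a datetime64 array with one full timestamp per measurement, which then replaces the original time and sst_dtime variables.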