Writing a new reader class
This section describes how to write a reader class that will allow
to read data that are not correctly decoded by cerbere (because the
standard metadata and coordinates are expressed in a way cerbere does not
manage to infer. Writing a new reader class for a given dataset may have
several advantages:
fixing known formats but with non-conventional ways of storing or naming standard information (dates and times, lat/lon, dimension or coordinate names, …) or for which you want to transform, reshape the stored data in some way (even adding virtual variables computed on the fly).
fixing issues in the format and content that cause failures in xarray or
cerbereautomatically recognizing the correct
readerclass to be used to read some dataset, guessing from their data filenames pattern.
Handling a completely new type of format not yet supported by Xarray should
be better approached by writing a xarray.backends.BackendEntrypoint class, following the
Xarray recommendations (we had to write had to write backend engines for
formats such as BUFR, EPS,…).
The following sections explore different cases to help you write your own
reader class. Basically it consists in writing a couple of
attributes and methods that help cerbere to understand and access a file
content:
engine: an attribute returning the name of the engine to be used by Xarrayopen_datasetmethod to read the data filesguess_can_open: a method returning True if a given filename matches the expected pattern of the files of the dataset it is supposed to readpostprocess: a method postprocessing the XarrayDatasetobject read withopen_datasetto fix the different format and content issues and returning a newDatasetobject complying to the CF/Cerbere requirements
Example 1 : OSI SAF Level 2 Scatterometer datasets
Let’s take an OSI SAF scatterometer L2 product, an example of which can be downloaded at:
It is a swath product NetCDF format, however it uses conventions not understood
by the existing cerbere accessors:
lat,lonandtimecoordinates are respectively namedwvc_lat,wvc_lonandrow_timethe
rowandcelldimensions expected for aSwathfeature are named respectivelyNUMROWS(sometimesnumrowsin other products) andNUMCELLS(ornumcells)
This is quite easy to fix because the cerberize method in cerbere
Dataset accessor allows to pass some dictionary to translate the coordinate and
dimension names to the expected naming.
The OSISAFSCATL2 reader class for this dataset will be written in the
following way:
"""
Reader class for KNMI / OSI SAF scatterometer Level 2 NETCDF files
Example of file pattern:
ascat_20230210_165700_metopc_22116_eps_o_coa_3301_ovw.l2.nc
"""
from pathlib import Path
import re
import typing as T
class OSISAFSCATL2:
pattern = re.compile(r"[a-z]*_[0-9]{8}_[0-9]{6}.*_ovw.l2.nc$")
engine: str = "netcdf4"
description: str = "Use OSI SAF NetCDF scatterometer files in Xarray and " \
"cerbere"
url: str = "https://link_to/your_backend/documentation"
@staticmethod
def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
return re.match(
OSISAFSCATL2.pattern,
Path(filename_or_obj).name) is not None
@staticmethod
def postprocess(ds, **kwargs):
ds.cb.cerberize(
dim_matching={
'numrows': 'row',
'NUMROWS': 'row',
'numcells': 'cell',
'NUMCELLS': 'cell',
'NUMAMBIGS': 'solutions',
'NUMVIEWS': 'views'
},
var_matching={
'row_time': 'time',
'wvc_lat': 'lat',
'wvc_lon': 'lon'
}
)
# normalize longitude to -180/180
ds.cb.cfdataset['lon'] = ds.cb.cfdataset['lon'].where(
ds.cb.cfdataset['lon'] < 180,
ds.cb.cfdataset['lon'] - 360
)
return ds.cb.cfdataset
Note
Note also that in such (simple) case, you don’t even need to create a new
class. A file could be properly read and standardize by providing
dim_matching and var_matching dictionaries to the cerberize
method as it is done in the postprocess of the example above.
Example 2: GHRSST datasets
GHRSST format specification for NetCDF, applying to L2, L3 and L4 datasets, is
based on CF convention but has a peculiarity that makes it non compliant to
CF/Cerbere requirements: the time is splitted into two different variables,
time and sst_dtime that need to be sum up to reconstruct a full
time variable. This is done again in the postprocess method, as shown
below:
from pathlib import Path
import re
import typing as T
import numpy as np
class GHRSST:
pattern: str = re.compile(r"^[0-9]{8,14}-.*-L[2P|3\w|4]P*_GHRSST-.*nc$")
engine: str = "netcdf4"
description: str = "Use GHRSST files in Xarray"
url: str = "https://link_to/your_backend/documentation"
@staticmethod
def guess_can_open(filename_or_obj: T.Union[str, Path]) -> bool:
return re.match(
GHRSST.pattern,
Path(filename_or_obj).name) is not None
@staticmethod
def postprocess(ds, decode_times: bool = True):
if "sst_dtime" in ds.variables and decode_times:
# replace with new time variable decoding GHRSST time
dunit = ds.sst_dtime.attrs.get(
"units", ds.sst_dtime.encoding.get("units"))
if dunit is None:
raise ValueError("missing unit for sst_dtime")
if dunit not in ["second", "seconds"]:
raise TypeError("Can not process sst_dtime in {}".format(dunit))
reference = np.datetime64(ds["time"].values[0])
time = ds["sst_dtime"].astype("timedelta64[s]") + reference
time.attrs = ds["time"].attrs
time.encoding = ds["time"].encoding
# remove dummy time dimension
ds = ds.squeeze(dim="time")
# replace with single time field
ds = ds.drop_vars(["time", "sst_dtime"])
ds['time'] = time.squeeze(dim="time")
ds = ds.set_coords(["time"])
if "comment" in ds["time"].attrs:
ds["time"].attrs.pop("comment")
ds["time"].attrs["long_name"] = "measurement time"
# normalize collection id (GDS 1.x => 2.x)
if "DSD_entry_id" in ds.attrs:
ds.attrs["id"] = ds.attrs.pop("DSD_entry_id")
return ds