Writing a new dataset class

This section describes how to write a dataset class that gives read access to a new format (complementary information on using such a class to save data in a specific format is given at the end of the section).

Writing a new dataset class may be needed in different cases:

  • to handle a completely new type of format not yet supported by cerbere (contrib datasets had to be written for formats such as GRIB, BUFR, EPS, …). This requires the most code writing.

  • to handle known formats that store or name standard information (dates and times, lat/lon, dimension or coordinate names, …) in non-conventional ways, or for which you want to transform or reshape the stored data in some way (even adding virtual variables computed on the fly). Such a dataset class can usually be derived from a base class handling the same format and then requires less code writing.

The following sections explore these cases to help you write your own dataset class. Basically, it consists in writing or overriding a set of functions that help cerbere understand and access a file's content. The class must implement the interface defined by the Dataset class, and therefore inherit from this class or one of its derived classes.

Case 1 : customize an existing format

Let’s take an OSI SAF scatterometer product, an example of which can be downloaded at:

It is a swath product in NetCDF format; however, it uses conventions not understood by the existing NCDataset class:

  • lat, lon and time coordinates are respectively named wvc_lat, wvc_lon and row_time

  • the row and cell dimensions expected for a Swath feature are named respectively NUMROWS (sometimes numrows in other products) and NUMCELLS (or numcells)

This is quite easy to fix because the NCDataset constructor allows passing dictionaries to translate the coordinate and dimension names to the expected naming. The derived class for this dataset can be written in the following way:

# -*- coding: utf-8 -*-
"""
:mod:`~cerbere.dataset` class for KNMI OSI SAF scatterometer netcdf files
"""
from cerbere.dataset.ncdataset import NCDataset


class KNMIL2NCDataset(NCDataset):
    """Mapper class for KNMI netcdf files
    """

    def __init__(self, *args,  **kwargs):
        super(KNMIL2NCDataset, self).__init__(
            *args,
            dim_matching={
                'time': ['time'],
                'row': ['NUMROWS', 'numrows'],
                'cell': ['NUMCELLS', 'numcells'],
                'solutions': 'NUMAMBIGS',
                'views': 'NUMVIEWS'
            },
            field_matching={
                'time': 'row_time',
                'lat': ['lat', 'wvc_lat'],
                'lon': ['lon', 'wvc_lon']
            },
            **kwargs
        )

Note

In the above example, other cases are addressed as well, which is why lists can be provided as values to match several possible conventions. Other dimensions or fields can also be renamed through the dim_matching and field_matching dictionaries.

Note also that in such a (simple) case, you don't even need to create a new class. A file can be properly read and standardized by providing the dim_matching and field_matching dictionaries directly to the NCDataset constructor:

>>> from cerbere.dataset.ncdataset import NCDataset
>>> prod = NCDataset(
...     'toto.nc',
...     dim_matching={
...         'row': 'NUMROWS',
...         'cell': 'NUMCELLS'},
...     field_matching={
...         'time': 'row_time',
...         'lat': 'wvc_lat',
...         'lon': 'wvc_lon'}
... )

Case 2: Using a xarray contrib

If we are now dealing with a new format for which cerbere has no dataset class yet, say GRIB or GeoTIFF, first check whether xarray already provides a backend to open this type of file as a Dataset object. For the above formats, there is in fact:
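For instance, such formats can be opened through xarray's plug-in backend mechanism. This is a minimal sketch; the engine names (cfgrib for GRIB, rasterio for GeoTIFF) refer to third-party backends that must be installed separately, and the file name in the usage comment is hypothetical:

```python
import xarray as xr


def open_with_backend(path, engine):
    """Open a file through one of xarray's I/O backends and return the
    decoded xarray.Dataset.

    Example engines: 'cfgrib' for GRIB files, 'rasterio' for GeoTIFF.
    """
    return xr.open_dataset(path, engine=engine)


# hypothetical usage:
# ds = open_with_backend('model_output.grib', engine='cfgrib')
```

The returned xarray.Dataset can then be wrapped or reworked by a cerbere dataset class, as described in the next case.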

Case 3: Creating a new dataset class

Let's create a new dataset class called MyDataset:

First create a new module file mydataset.py with a basic structure as follows:

"""
    Dataset class for <the format and/or product type handled by this class>

    :license: Released under GPL v3 license, see :ref:`license`.

    .. sectionauthor:: <your name>
    .. codeauthor:: <your name>
    """
# import parent class
from cerbere.dataset.dataset import Dataset

class MyDataset(Dataset):

    """Dataset class to read <the format and/or product type handled by this
    class> files
    """
    def __init__(self, dataset, **kwargs):
        """Initialize a <the format and/or product type handled by this
        class> file dataset
        """
        super(MyDataset, self).__init__(
            dataset, **kwargs
            )

The class methods to override depend on the type of format.

Let's take the BUFR format from ECMWF as a first example: it is a binary format that is not self-described and does not easily permit subsetting. To keep it simple, we will load the full content into memory. In such a case we just need to override the cerbere.dataset.dataset.Dataset._open_dataset() private method, which returns an xarray.Dataset object built from the file content: in this function we build that object with all the properties and naming expected by cerbere.
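A minimal sketch of what such an override could return is shown below. The variable names and shapes are placeholders, not taken from a real BUFR product; in an actual class the arrays would be decoded from the BUFR messages, and the helper below would be the body of _open_dataset():

```python
import numpy as np
import xarray as xr


def build_swath_dataset(nrow=4, ncell=3):
    """Build the in-memory xarray.Dataset that _open_dataset() would
    return, using the coordinate names (time, lat, lon) and dimension
    names (row, cell) that cerbere expects for a swath feature.

    All values here are placeholders standing in for decoded content.
    """
    return xr.Dataset(
        coords={
            # one acquisition time per swath row
            'time': ('row', np.array(['2020-01-01'] * nrow,
                                     dtype='datetime64[ns]')),
            'lat': (('row', 'cell'), np.zeros((nrow, ncell))),
            'lon': (('row', 'cell'), np.zeros((nrow, ncell))),
        },
        data_vars={
            # placeholder geophysical variable
            'wind_speed': (('row', 'cell'), np.ones((nrow, ncell))),
        },
    )
```

Building the object with these names up front means the rest of cerbere can treat the file like any other swath product.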

The following methods of the Dataset class have to be overridden, if necessary:

  • _open()

  • close()
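As a pattern, these two overrides typically manage the underlying file handle. In the sketch below a stub stands in for the cerbere Dataset base class so the example is self-contained; it is an illustration of the life cycle, not the real base class API:

```python
class _DatasetStub:
    """Stand-in for cerbere.dataset.dataset.Dataset (illustration only)."""

    def __init__(self, dataset, **kwargs):
        self._url = dataset
        self._handler = None


class MyDataset(_DatasetStub):
    """Sketch of the two life-cycle overrides: _open() acquires the
    underlying file handle, close() releases it."""

    def _open(self, **kwargs):
        # open the underlying file and keep the handle
        self._handler = open(self._url, 'rb')
        return self._handler

    def close(self):
        # release the file handle if it is still open
        if self._handler is not None:
            self._handler.close()
            self._handler = None
```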

Saving data with a mapper

If the file can also be opened in write mode, the following methods must be implemented:

  • create_dim

  • create_field

  • write_field

  • write_global_attributes
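A toy in-memory mapper can illustrate the contract of these four methods: dimensions are declared first, then field structures, then values and global attributes. This is a sketch of the call sequence with simplified signatures, not a real file writer:

```python
class InMemoryMapper:
    """Toy mapper keeping everything in dictionaries, to illustrate the
    write-mode call sequence (simplified signatures, illustration only)."""

    def __init__(self):
        self._dims = {}
        self._fields = {}
        self._attrs = {}

    def create_dim(self, dimname, size=None):
        # size=None stands for an unlimited dimension
        self._dims[dimname] = size

    def create_field(self, fieldname, dims):
        # declare the field structure; values are written later
        self._fields[fieldname] = {'dims': tuple(dims), 'values': None}

    def write_field(self, fieldname, values):
        # write the values array of a previously created field
        self._fields[fieldname]['values'] = values

    def write_global_attributes(self, attrs):
        # write the global attributes of the file
        self._attrs.update(attrs)
```

A real mapper would forward each call to the underlying format library (for instance a netCDF or GeoTIFF writer) instead of dictionaries.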

If you don't plan to write files in this mapper's format, just copy the following block into your class source code:

def create_field(self, field, dim_translation=None):
    """Creates a new field in the mapper.

    Creates the field structure but does not yet write its values array.

    Args:
        field (Field): the field to be created.

    See also:
        :func:`write_field` for writing the values array.
    """
    raise NotImplementedError

def create_dim(self, dimname, size=None):
    """Add a new dimension.

    Args:
        dimname (str): name of the dimension.
        size (int): size of the dimension (unlimited if None)
    """
    raise NotImplementedError

def write_field(self, fieldname):
    """Writes the field data on disk.

    Args:
        fieldname (str): name of the field to write.
    """
    raise NotImplementedError

def write_global_attributes(self, attrs):
    """Write the global attributes of the file.

    Args:
        attrs (dict<string, string or number or datetime>): a dictionary
            containing the attributes names and values to be written.
    """
    raise NotImplementedError