Reading or creating datasets

Reading from a file

The dataset package and other contribution packages provide classes that read EO data products and standardize their format and content. Each EO data product type should have a corresponding class in dataset that reads its content. Some of these classes, such as NCDataset for CF-compliant NetCDF files, can read a wide range of EO products that share similar format conventions. Every class derives from the Dataset base class and inherits all its methods.

To read data from a file, first instantiate a dataset object of the corresponding class, specifying the path to the file. For instance, let’s create a dataset object from a Mercator Ocean Model file (test file available at ftp://ftp.ifremer.fr/ifremer/cersat/projects/cerbere/test_data/NCDataset/mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc ). It is a CF-compliant NetCDF file, so we can use the NCDataset class:

>>> import cerbere
>>> # instantiate the dataset object with the file path as argument
>>> dst = cerbere.open_dataset(
...     'NCDataset', "mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc")

or, directly importing the NCDataset class:

>>> from cerbere.dataset.ncdataset import NCDataset
>>> # instantiate the dataset object with the file path as argument
>>> dst = NCDataset("mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc")

Print the dataset description:

>>> print(dst)
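
The snippets above pass a bare file name, which assumes the test file has already been downloaded into the current working directory. As a minimal sketch (the helpers local_name and fetch_test_file are hypothetical and not part of cerbere), the file could be fetched with the Python standard library:

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL of the test file quoted in the text above
TEST_FILE_URL = (
    "ftp://ftp.ifremer.fr/ifremer/cersat/projects/cerbere/test_data/"
    "NCDataset/mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc"
)


def local_name(url):
    """Return the file name component of a URL (hypothetical helper)."""
    return Path(urlparse(url).path).name


def fetch_test_file(url=TEST_FILE_URL, dest_dir="."):
    """Download the file into dest_dir unless it is already there.

    Hypothetical helper, not part of cerbere. Returns the local path.
    """
    target = Path(dest_dir) / local_name(url)
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Not called here because it needs network access:
# path = fetch_test_file()
# dst = NCDataset(str(path))
```

The resulting path can then be passed to NCDataset or cerbere.open_dataset exactly as in the examples above.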

A Dataset can also be created from a list of files.

Creating a new dataset

A Dataset object (or an object of any class in the dataset package that inherits from Dataset) can be created in memory, without a pre-existing file. This can be done in several ways:

  • from an xarray Dataset object

  • using xarray data_vars, coords, attrs arguments

  • from a dict, using xarray syntax (as in xarray from_dict())

  • from another cerbere dataset object

Creating a Dataset from an xarray Dataset object

The xarray Dataset object must have latitude, longitude and time coordinates with valid cerbere names (lat, lon, time):

>>> import numpy as np
>>> import xarray as xr
>>> from cerbere.dataset.dataset import Dataset
>>> xrobj = xr.Dataset(
...     coords={
...         'lat': np.arange(0, 10, 0.1),
...         'lon': np.arange(5, 15, 0.1),
...         'time': np.full((100,), np.datetime64('2010-02-03', 'D'))
...     },
...     data_vars={'myvar': (('time',), np.ones(100))}
... )
>>> dst = Dataset(xrobj)

Creating a dataset from a dictionary

This uses the same syntax as xarray's Dataset.from_dict() (see: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.from_dict.html#xarray.Dataset.from_dict ): the content of the dataset is provided as a dictionary.

The provided dict must have latitude, longitude and time coordinates with valid cerbere names (lat, lon, time, optionally z):

>>> from cerbere.dataset.dataset import Dataset
>>> import numpy as np
>>> from datetime import datetime
>>> dst = Dataset(
...        {'time': {'dims': ('time',), 'data': [datetime(2018, 1, 1)]},
...         'lat': {'dims': ('lat',), 'data': np.arange(-80, 80, 1)},
...         'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)},
...         'myvar': {'dims': ('lat', 'lon',),
...                  'data': np.ones(shape=(160, 360))}
...         }
...    )
>>> print(dst)
Dataset: Dataset
Feature Dims :
   .        lat : 160
   .        lon : 360
   .        time : 1
Other Dims :
Feature Coordinates :
   .        time  (time: 1)
   .        lat  (lat: 160)
   .        lon  (lon: 360)
Other Coordinates :
Fields :
   .        myvar  (lat: 160, lon: 360)
Global Attributes :
   .        time_coverage_start     2018-01-01 00:00:00
   .        time_coverage_end       2018-01-01 00:00:00

xarray also accepts a second syntax that explicitly separates coordinates (coords), fields (data_vars), dimensions (dims) and global attributes (attrs). Again, these must be passed as a single dictionary to the Dataset constructor:

>>> dst = Dataset({
...     'coords': {
...        'time': {'dims': ('time',), 'data': [datetime(2018, 1, 1)],
...                 'attrs': {'units': 'seconds since 2001-01-01 00:00:00'}},
...        'lat': {'dims': ('lat',), 'data': np.arange(-80, 80, 1)},
...        'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)}},
...     'attrs': {'gattr1': 'gattr_val'},
...     'dims': ('time', 'lon', 'lat'),
...     'data_vars': {'myvar': {'dims': ('lat', 'lon',),
...                             'data': np.ones(shape=(160, 360))}}}
...     )
>>> print(dst)
Dataset: Dataset
Feature Dims :
   .        lat : 160
   .        lon : 360
   .        time : 1
Other Dims :
Feature Coordinates :
   .        time  (time: 1)
   .        lat  (lat: 160)
   .        lon  (lon: 360)
Other Coordinates :
Fields :
   .        myvar  (lat: 160, lon: 360)
Global Attributes :
   .        gattr1  gattr_val
   .        time_coverage_start     2018-01-01 00:00:00
   .        time_coverage_end       2018-01-01 00:00:00
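
Both dictionary layouts shown above must define the lat, lon and time coordinates. As an illustrative sketch (the helper missing_coords is hypothetical and not part of cerbere, and cerbere's own validation may behave differently), a dictionary can be checked in plain Python before it is passed to Dataset:

```python
REQUIRED_COORDS = ("lat", "lon", "time")  # names taken from the text above


def missing_coords(spec):
    """Return the required coordinate names absent from an xarray-style dict.

    Hypothetical helper. Handles both layouts: the flat one, where
    coordinates and fields are top-level keys, and the explicit one,
    where coordinates sit under a 'coords' key.
    """
    names = spec["coords"] if "coords" in spec else spec
    return [name for name in REQUIRED_COORDS if name not in names]


# Flat layout with the 'lon' coordinate deliberately left out
flat = {
    "time": {"dims": ("time",), "data": [0]},
    "lat": {"dims": ("lat",), "data": [-1, 0, 1]},
    "myvar": {"dims": ("lat",), "data": [1.0, 1.0, 1.0]},
}

# Explicit layout with all three required coordinates
explicit = {
    "coords": {
        "time": {"dims": ("time",), "data": [0]},
        "lat": {"dims": ("lat",), "data": [-1, 0, 1]},
        "lon": {"dims": ("lon",), "data": [10, 20]},
    },
    "data_vars": {"myvar": {"dims": ("lat", "lon"), "data": [[1, 1]] * 3}},
}

print(missing_coords(flat))      # ['lon']
print(missing_coords(explicit))  # []
```

The check only looks at names; it does not verify dimensions or data types.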

cerbere Field objects can also be used as values in the dictionary, alongside xarray-style entries:

>>> from cerbere.dataset.field import Field
>>> field = Field(
...        np.ones(shape=(160, 360)),
...        'myvar',
...        dims=('lat', 'lon',),
...        attrs={'myattr': 'attr_val'}
...    )
>>> dst = Dataset(
...        {'time': {'dims': ('time',), 'data': [datetime(2018, 1, 1)]},
...         'lat': {'dims': ('lat',), 'data': np.arange(-80, 80, 1)},
...         'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)},
...         'myvar': field
...         }
...    )
>>> print(dst)
Dataset: Dataset
Feature Dims :
   .        lat : 160
   .        lon : 360
   .        time : 1
Other Dims :
Feature Coordinates :
   .        time  (time: 1)
   .        lat  (lat: 160)
   .        lon  (lon: 360)
Other Coordinates :
Fields :
   .        myvar  (lat: 160, lon: 360)
Global Attributes :
   .        time_coverage_start     2018-01-01 00:00:00
   .        time_coverage_end       2018-01-01 00:00:00

Creating a dataset from xarray arguments

This uses the same syntax as the xarray Dataset constructor (see: http://xarray.pydata.org/en/stable/data-structures.html#dataset ).

The provided coords must include latitude, longitude and time coordinates with valid cerbere names (lat, lon, time, optionally z), and the dimensions must follow the same naming:

>>> dst = Dataset(
...     {'myvar': (['lat', 'lon'], np.ones(shape=(160, 360)))},
...     coords={
...         'time': (['time'], [datetime(2018, 1, 1)], {'units': 'seconds since 2001-01-01 00:00:00'}),
...         'lat': (['lat'], np.arange(-80, 80, 1)),
...         'lon': (['lon'], np.arange(-180, 180, 1))
...     },
...     attrs={'gattr1': 'gattr_val'}
... )
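
The time coordinate in the examples above carries a units attribute of the form 'seconds since <reference date>', which is the CF convention for expressing times as numbers. How cerbere or xarray applies this attribute when writing a file is not described here. As a sketch of what the convention means, assuming naive datetimes and no leap seconds, the numeric value for the example date can be computed with the standard library (seconds_since is a hypothetical helper, not part of cerbere):

```python
from datetime import datetime


def seconds_since(value, units):
    """Convert a datetime to a number following a 'seconds since ...' string.

    Hypothetical helper; only handles the exact format used in the examples.
    """
    prefix = "seconds since "
    if not units.startswith(prefix):
        raise ValueError("unsupported units: %r" % units)
    reference = datetime.strptime(units[len(prefix):], "%Y-%m-%d %H:%M:%S")
    return (value - reference).total_seconds()


encoded = seconds_since(
    datetime(2018, 1, 1), "seconds since 2001-01-01 00:00:00")
print(encoded)  # 536457600.0
```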