Dataset accessor

cerbere provides an accessor to the xarray Dataset class, called cb. This accessor offers a set of attributes and methods that enrich those provided natively by xarray.

Harmonization

cerbere harmonizes Dataset objects holding Earth observation data by enforcing the CF and other conventions: consistent naming of coordinate variables and dimensions, as well as of variable and global attributes, and similar types and conventions for coordinate data. It also tries to fill in some generic standard attributes defined by these conventions. Harmonization rules include:

  • consistent naming of latitude, longitude and time as lat, lon, time

  • detection of X, Y, Z, T axis coordinates and dimensions

  • detection of instance dimensions for collections of discrete features

  • longitudes converted to the [-180, 180] range, unless the longitude180 global option is set to False with cerbere.set_options(longitude180=False)

  • consistent naming (inferring from data) of global attributes such as time_coverage_start, time_coverage_end, …
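The longitude rule in the list above amounts to a modular shift. Here is a minimal sketch in plain Python, not cerbere's actual implementation: the function name wrap_longitude is hypothetical, and mapping 180 to -180 (a half-open interval) is an assumption of this sketch.

```python
def wrap_longitude(lon: float) -> float:
    """Map a longitude in degrees to the half-open range [-180, 180)."""
    # Shift by 180, wrap into [0, 360), then shift back.
    return (lon + 180.0) % 360.0 - 180.0


# Longitudes expressed on a [0, 360) grid (illustrative values)
for lon in (0.0, 90.0, 190.0, 359.0):
    print(lon, "->", wrap_longitude(lon))
```

Values already inside the target range are returned unchanged, so applying the function twice gives the same result as applying it once.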

Given an xarray Dataset, cerbere makes some guesses to provide a more harmonized version of this dataset, available through the cfdataset property of the cerbere cb accessor. For instance:

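# create a Dataset (assumes `import numpy as np`, `import xarray as xr`
# and `from datetime import datetime` were run beforehand)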
In [1]: dst = xr.Dataset(
   ...:     {'myvar': (['LATITUDE', 'LONGITUDE', 'Z'],
   ...:                 np.ones(shape=(160, 360, 3)))},
   ...:     coords={
   ...:         'TIME': (['TIME'], [datetime(2018, 1, 1)],
   ...:             {'units': 'seconds since 2001-01-01 00:00:00'}),
   ...:         'LATITUDE': (['LATITUDE'], np.arange(-80, 80, 1)),
   ...:         'LONGITUDE': (['LONGITUDE'], np.arange(-180, 180, 1)),
   ...:         'DEPTH':  (['Z'], np.arange(5, 20, 5))},
   ...:     attrs={
   ...:         'start_date': '2018-01-01 00:00:00',
   ...:         'stop_date': '2018-02-01 00:00:00'}
   ...: )
   ...: 

In [2]: import cerbere

# get a cerberized version with renaming of variables, dimensions and
# attributes, added complementary CF attributes
In [3]: cfdst = dst.cb.cfdataset

In [4]: cfdst
Out[4]: 
<xarray.Dataset>
Dimensions:  (lat: 160, lon: 360, z: 3, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01
  * lat      (lat) int64 -80 -79 -78 -77 -76 -75 -74 ... 73 74 75 76 77 78 79
  * lon      (lon) int64 -180 -179 -178 -177 -176 -175 ... 175 176 177 178 179
    DEPTH    (z) int64 5 10 15
Dimensions without coordinates: z
Data variables:
    myvar    (lat, lon, z) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
    time_coverage_start:  2018-01-01 00:00:00
    time_coverage_end:    2018-02-01 00:00:00

Let’s now look at what changed in the above dataset when its cerbere-compliant version was retrieved through the cb.cfdataset property.

Spatiotemporal coordinates

TIME, LATITUDE and LONGITUDE were renamed to time, lat and lon, the names used consistently within cerbere. Users can safely rely on these harmonized variable names whenever reading and manipulating Earth observation data with cerbere. Other naming variants of these coordinates are recognized; if the time, lat and lon variables cannot be found in the file, an error is raised, and such a dataset should be handled through a specific reader.
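The renaming described above can be pictured as a lookup in an alias table. This is a sketch only: the alias table and the build_rename_map helper are hypothetical, and the set of variants cerbere actually recognizes is not reproduced here.

```python
# Hypothetical alias table: NOT the list cerbere actually recognizes.
COORD_ALIASES = {
    "time": {"time", "TIME"},
    "lat": {"lat", "latitude", "LATITUDE"},
    "lon": {"lon", "longitude", "LONGITUDE"},
}


def build_rename_map(names):
    """Return {original_name: harmonized_name} for the coordinates found."""
    rename = {}
    for target, aliases in COORD_ALIASES.items():
        matches = [name for name in names if name in aliases]
        if not matches:
            # As described above, a missing coordinate is treated as an error.
            raise KeyError(f"no coordinate found for '{target}'")
        rename[matches[0]] = target
    return rename


print(build_rename_map(["TIME", "LATITUDE", "LONGITUDE", "DEPTH"]))
```

A mapping of this form is what xarray's Dataset.rename accepts. DEPTH is left out of the mapping, so it keeps its original name.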

The spatiotemporal coordinates can be listed with the cf_coords property, which provides the mapping between each CF coordinate reference and the corresponding variable name in the dataset after cerbere harmonization.

# get the spatiotemporal coordinates
In [5]: cfdst.cb.cf_coords
Out[5]: {'latitude': 'lat', 'longitude': 'lon', 'time': 'time', 'vertical': 'DEPTH'}

Note the time, lat, lon naming convention in cerbere, corresponding to the time, latitude, longitude CF standard names. cerbere does not rename the vertical coordinate, as doing so could be misleading given the many ways of expressing this quantity (depth, altitude, pressure, sigma level, …), but it can be accessed in a unified way through the vertical property:

# get the vertical coordinate
In [6]: cfdst.cb.vertical
Out[6]: 
<xarray.DataArray 'DEPTH' (z: 3)>
array([ 5, 10, 15])
Coordinates:
    DEPTH    (z) int64 5 10 15
Dimensions without coordinates: z
Attributes:
    axis:     Z

Similarly, the time, latitude and longitude properties can be used to access the other spatiotemporal coordinates.

Spatiotemporal axes

The CF axis dimensions corresponding to spatiotemporal information (X, Y, Z, T) have been detected and can be listed with the cf_axis_dims property:

# get the dimension for each CF axis
In [7]: cfdst.cb.cf_axis_dims
Out[7]: {'Y': 'LATITUDE', 'X': 'LONGITUDE', 'T': 'TIME', 'Z': 'Z'}

Coordinates that depend on a single spatiotemporal axis only are referred to as axis coordinates. cerbere sets the axis attribute on the corresponding coordinate variables. They can be listed with the cf_axis_coords property:

# get the axis coordinate for each CF axis
In [8]: cfdst.cb.cf_axis_coords
Out[8]: {'Y': 'lat', 'X': 'lon', 'T': 'time', 'Z': 'DEPTH'}

# axis attribute value for time, added by cerbere
In [9]: cfdst['time'].attrs['axis']
Out[9]: 'T'
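The notion of axis coordinate can be illustrated without xarray. The sketch below is an assumption about the selection logic, not cerbere's code: the axis_coords helper is hypothetical, and the dimension names are written in their harmonized form (lat, lon, time, z).

```python
def axis_coords(coord_dims, axis_dims):
    """Map each CF axis to the coordinate depending only on that axis dimension."""
    found = {}
    for axis, dim in axis_dims.items():
        for name, dims in coord_dims.items():
            # An axis coordinate is one-dimensional along the axis dimension.
            if dims == (dim,):
                found[axis] = name
                break
    return found


# Coordinates of the example dataset and the dimensions they span
coord_dims = {"time": ("time",), "lat": ("lat",), "lon": ("lon",), "DEPTH": ("z",)}
# Dimension associated with each CF axis (harmonized names assumed)
axis_dims = {"Y": "lat", "X": "lon", "T": "time", "Z": "z"}

print(axis_coords(coord_dims, axis_dims))
```

A two-dimensional coordinate, such as a latitude defined over (row, cell) in swath data, would not match any single axis dimension and would therefore be left out of the result.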

Note that axis coordinate variables can be individually retrieved using their CF standard axis name, X, Y, Z or T (None is returned if it does not exist or was not identified as such by cerbere):

In [10]: cfdst.cb.X
Out[10]: 
<xarray.DataArray 'lon' (lon: 360)>
array([-180, -179, -178, ...,  177,  178,  179])
Coordinates:
  * lon      (lon) int64 -180 -179 -178 -177 -176 -175 ... 175 176 177 178 179
Attributes:
    standard_name:  longitude
    units:          degree_east
    long_name:      longitude
    authority:      CF-1.11, ACDD-1.3
    axis:           X

Spatiotemporal attributes

cerbere also harmonizes some attributes. In the above example, start_date and stop_date were detected as aliases of the more standard time_coverage_start and time_coverage_end (from the ACDD convention), which can be accessed through the properties of the same name:

In [11]: cfdst.cb.time_coverage_start
Out[11]: datetime.datetime(2018, 1, 1, 0, 0)
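The aliasing of global attributes can be sketched in the same spirit. The alias table and the harmonize_attrs helper below are hypothetical and cover only the two attributes of the example; the date format string is likewise an assumption based on the values shown above.

```python
from datetime import datetime

# Hypothetical alias table, limited to the two attributes of the example.
ATTR_ALIASES = {
    "start_date": "time_coverage_start",
    "stop_date": "time_coverage_end",
}


def harmonize_attrs(attrs):
    """Rename known aliases to their ACDD names, leaving other attributes as-is."""
    return {ATTR_ALIASES.get(key, key): value for key, value in attrs.items()}


attrs = harmonize_attrs({
    "start_date": "2018-01-01 00:00:00",
    "stop_date": "2018-02-01 00:00:00",
})
# Parse the harmonized attribute into a datetime object
start = datetime.strptime(attrs["time_coverage_start"], "%Y-%m-%d %H:%M:%S")
print(start)
```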