Dataset accessor
cerbere provides an accessor to the xarray Dataset class, called cb. This
accessor offers a set of attributes and methods that enrich those provided
natively by xarray.
Harmonization
cerbere harmonizes Dataset objects holding Earth observation data by
enforcing CF and other conventions: consistent naming for coordinate
variables and dimensions as well as variable and global attributes,
similar types and conventions for coordinate data, etc. It also tries to
fill in some generic standard attributes defined by these conventions.
Harmonization rules include:
consistent naming of latitude, longitude and time as lat, lon, time
detection of X, Y, Z, T axis coordinates and dimensions
detection of instance dimensions for collections of discrete features
longitudes converted to the [-180, 180] range, unless the longitude180 global option is set to False with cerbere.set_options(longitude180=False)
consistent naming (inferred from data) of global attributes such as time_coverage_start, time_coverage_end, …
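As an illustration of the longitude rule above, here is a minimal sketch of a conversion to a 180-degree convention. It is not cerbere's implementation: the function name wrap_longitude_180 is hypothetical, and mapping 180 to -180 (a half-open range) is an assumption of this sketch.

```python
def wrap_longitude_180(lon: float) -> float:
    """Map a longitude in degrees east onto the half-open range [-180, 180)."""
    # Shift by 180 so the modulo folds any value into [0, 360),
    # then shift back to obtain [-180, 180).
    return (lon + 180.0) % 360.0 - 180.0


# Longitudes expressed in a [0, 360] convention
lons_360 = [0.0, 90.0, 180.0, 270.0, 359.0]
lons_180 = [wrap_longitude_180(lon) for lon in lons_360]
print(lons_180)
```

With longitude180 set to False, the text above indicates that cerbere skips this conversion and leaves longitudes as they are in the source data.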
Given an xarray Dataset, cerbere makes some guesses to provide a more
harmonized version of this dataset through the cfdataset property of the
cb accessor. For instance:
# create a Dataset (assumes numpy as np, xarray as xr and
# datetime.datetime have already been imported)
In [1]: dst = xr.Dataset(
...: {'myvar': (['LATITUDE', 'LONGITUDE', 'Z'],
...: np.ones(shape=(160, 360,3)))},
...: coords={
...: 'TIME': (['TIME'], [datetime(2018, 1, 1)],
...: {'units': 'seconds since 2001-01-01 00:00:00'}),
...: 'LATITUDE': (['LATITUDE'], np.arange(-80, 80, 1)),
...: 'LONGITUDE': (['LONGITUDE'], np.arange(-180, 180, 1)),
...: 'DEPTH': (['Z'], np.arange(5, 20, 5))},
...: attrs={
...: 'start_date': '2018-01-01 00:00:00',
...: 'stop_date': '2018-02-01 00:00:00'}
...: )
...:
In [2]: import cerbere
# get a cerberized version with variables, dimensions and attributes
# renamed, and complementary CF attributes added
In [3]: cfdst = dst.cb.cfdataset
In [4]: cfdst
Out[4]:
<xarray.Dataset>
Dimensions: (lat: 160, lon: 360, z: 3, time: 1)
Coordinates:
* time (time) datetime64[ns] 2018-01-01
* lat (lat) int64 -80 -79 -78 -77 -76 -75 -74 ... 73 74 75 76 77 78 79
* lon (lon) int64 -180 -179 -178 -177 -176 -175 ... 175 176 177 178 179
DEPTH (z) int64 5 10 15
Dimensions without coordinates: z
Data variables:
myvar (lat, lon, z) float64 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
Attributes:
time_coverage_start: 2018-01-01 00:00:00
time_coverage_end: 2018-02-01 00:00:00
Now let’s see what happened to the above dataset when retrieving its
cerbere-compliant version through the cb.cfdataset property.
Spatiotemporal coordinates
TIME, LATITUDE and LONGITUDE were renamed to time, lat, lon, the names used
consistently within cerbere. Users can safely rely on these harmonized
variable names whenever reading and manipulating Earth Observation data with
cerbere. Other variants of coordinate naming are recognized; if the time,
lat, lon variables cannot be found in the file, an error is raised, and such
a dataset should be handled through a specific reader.
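The renaming described above can be pictured as an alias lookup. The sketch below is only an illustration: the alias table COORD_ALIASES and the helper build_rename_map are hypothetical, and the actual list of variants recognized by cerbere is not given here.

```python
# Hypothetical alias table: the real variants recognized by cerbere may differ.
COORD_ALIASES = {
    "time": ("time", "TIME"),
    "lat": ("lat", "latitude", "LATITUDE"),
    "lon": ("lon", "longitude", "LONGITUDE"),
}


def build_rename_map(names):
    """Return {original name: harmonized name} for time, lat and lon.

    Raises ValueError when one of the three coordinates cannot be found.
    """
    available = set(names)
    rename_map = {}
    for target, variants in COORD_ALIASES.items():
        # Take the first known variant present in the dataset, if any.
        match = next((v for v in variants if v in available), None)
        if match is None:
            raise ValueError(f"no candidate found for coordinate '{target}'")
        if match != target:
            rename_map[match] = target
    return rename_map


rename_map = build_rename_map(["TIME", "LATITUDE", "LONGITUDE", "DEPTH"])
print(rename_map)
```

A mapping of this shape is what xarray's Dataset.rename accepts, so it could be applied directly to a dataset whose coordinates use the original names.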
The spatiotemporal coordinates can be listed with the cf_coords property,
which provides the mapping between the CF coordinate reference and the
internal naming in the dataset after cerbere harmonization.
# get the spatiotemporal coordinates
In [5]: cfdst.cb.cf_coords
Out[5]: {'latitude': 'lat', 'longitude': 'lon', 'time': 'time', 'vertical': 'DEPTH'}
Note the time, lat, lon naming convention in cerbere, matching the
time, latitude, longitude CF standard names. cerbere does not rename the
vertical coordinate, as this could be misleading given the many ways of
expressing this quantity (depth, altitude, pressure, sigma level, …), but
it can be accessed in a unified way through the vertical property:
# get the spatiotemporal vertical coordinate
In [6]: cfdst.cb.vertical
Out[6]:
<xarray.DataArray 'DEPTH' (z: 3)>
array([ 5, 10, 15])
Coordinates:
DEPTH (z) int64 5 10 15
Dimensions without coordinates: z
Attributes:
axis: Z
Similarly, the time, latitude and longitude properties can be used to access
the other spatiotemporal coordinates.
Spatiotemporal axes
The CF axis dimensions corresponding to spatiotemporal information (X, Y,
Z, T) have been detected and can be listed with the cf_axis_dims property:
# get the dimension for each CF axis
In [7]: cfdst.cb.cf_axis_dims
Out[7]: {'Y': 'LATITUDE', 'X': 'LONGITUDE', 'T': 'TIME', 'Z': 'Z'}
Coordinates that depend only on a single spatiotemporal axis are referred to
as axis coordinates. The axis attribute is set by cerbere for the
corresponding coordinate variables. They can be listed with the
cf_axis_coords property:
# get the axis coordinate for each CF axis
In [8]: cfdst.cb.cf_axis_coords
Out[8]: {'Y': 'lat', 'X': 'lon', 'T': 'time', 'Z': 'DEPTH'}
# axis attribute value for time, added by cerbere
In [9]: cfdst['time'].attrs['axis']
Out[9]: 'T'
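To make the notion of axis coordinate concrete, the sketch below derives the axis mapping from the cf_coords mapping and the dimensions of each coordinate, keeping only coordinates that span a single dimension. The helper axis_coords and the table ROLE_TO_AXIS are hypothetical names used for illustration; this is a reading of the definition above, not cerbere's detection code.

```python
# CF coordinate role -> CF axis letter
ROLE_TO_AXIS = {"latitude": "Y", "longitude": "X", "time": "T", "vertical": "Z"}


def axis_coords(cf_coords, coord_dims):
    """Return {axis: coordinate name} for coordinates spanning one dimension."""
    result = {}
    for role, name in cf_coords.items():
        # A two-dimensional lat(row, cell), as found in swath data,
        # is not an axis coordinate and is skipped.
        if len(coord_dims[name]) == 1:
            result[ROLE_TO_AXIS[role]] = name
    return result


# Values taken from the example dataset above
cf_coords = {"latitude": "lat", "longitude": "lon", "time": "time", "vertical": "DEPTH"}
coord_dims = {"lat": ("lat",), "lon": ("lon",), "time": ("time",), "DEPTH": ("z",)}
print(axis_coords(cf_coords, coord_dims))
```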
Note that axis coordinate variables can be individually retrieved using their
CF standard axis name, X, Y, Z or T (None is returned if it does not exist
or was not identified as such by cerbere):
In [10]: cfdst.cb.X
Out[10]:
<xarray.DataArray 'lon' (lon: 360)>
array([-180, -179, -178, ..., 177, 178, 179])
Coordinates:
* lon (lon) int64 -180 -179 -178 -177 -176 -175 ... 175 176 177 178 179
Attributes:
standard_name: longitude
units: degree_east
long_name: longitude
authority: CF-1.11, ACDD-1.3
axis: X
Spatiotemporal attributes
cerbere also harmonizes some attributes. In the above example, start_date and
stop_date were detected as aliases of the more standard time_coverage_start
and time_coverage_end (from the ACDD convention), which can be accessed with
the time_coverage_start and time_coverage_end properties:
In [11]: cfdst.cb.time_coverage_start
Out[11]: datetime.datetime(2018, 1, 1, 0, 0)