Reading or creating datasets
Reading from a file
The dataset package and other contribution packages provide various
classes to read and standardize the format and content of EO data products.
Each EO data product type should have a corresponding class in dataset to
read its content. Some of these classes, such as NCDataset for CF-compliant
NetCDF files, can read a wide range of EO products sharing similar format
conventions. Each class derives from the main Dataset base class and
inherits all its methods.
To read data from a file, first instantiate a dataset object of the
corresponding class, specifying the path to this file. For instance, let's
create a dataset object from a Mercator Ocean Model file (test file available at
ftp://ftp.ifremer.fr/ifremer/cersat/projects/cerbere/test_data/NCDataset/mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc
). It is a CF-compliant NetCDF file, so we can use the NCDataset class:
>>> import cerbere
>>> # instantiate the dataset object with the file path as argument
>>> dst = cerbere.open_dataset(
...     'NCDataset', "mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc")
or, directly importing the NCDataset class:
>>> from cerbere.dataset.ncdataset import NCDataset
>>> # instantiate the dataset object with the file path as argument
>>> dst = NCDataset("mercatorpsy4v3r1_gl12_hrly_20200219_R20200210.nc")
Print the dataset description:
>>> print(dst)
A Dataset can also be created from a list of files.
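As a rough sketch of how such a list of files might be assembled, the helper below collects the files matching a pattern in a directory, in sorted order. The helper name `list_dataset_files` and the file name pattern are illustrative assumptions, not part of cerbere. How a dataset class accepts the list is not detailed in this section, so that step is only mentioned in a comment.

```python
from pathlib import Path


def list_dataset_files(directory, pattern="*.nc"):
    """Return the sorted paths (as strings) of the files in `directory` matching `pattern`."""
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())


# Hypothetical pattern matching several hourly files of the same product.
files = list_dataset_files(".", "mercatorpsy4v3r1_gl12_hrly_*.nc")

# Assumption: this list is what would be handed to the dataset class
# in place of a single file path.
print(len(files), "file(s) found")
```

Sorting the paths keeps the order deterministic, which matters when file names encode the time of each product.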
Creating a new dataset
An object of the Dataset class (or of any class inheriting from it in the
dataset package) can be created in memory, without a pre-existing file. There
are several ways to do so:
Creating a Dataset from an xarray Dataset object
The xarray Dataset object must have latitude, longitude and time
coordinates with valid cerbere names (lat, lon, time):
>>> import xarray as xr
>>> import numpy as np
>>> from cerbere.dataset.dataset import Dataset
>>> xrobj = xr.Dataset(
...     coords={
...         'lat': np.arange(0, 10, 0.1),
...         'lon': np.arange(5, 15, 0.1),
...         'time': np.full((100,), np.datetime64('2010-02-03', 'D'))
...     },
...     data_vars={'myvar': (('time',), np.ones(100))}
... )
>>> # wrap the xarray object into a cerbere Dataset
>>> dst = Dataset(xrobj)
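When an existing xarray object uses other coordinate names, it can be renamed before being wrapped. The following is a minimal sketch using xarray's own `rename` method; the source names `latitude`, `longitude` and `date` are made up for the example, and the final check is a plain set comparison rather than a cerbere function.

```python
import numpy as np
import xarray as xr

# A dataset whose coordinates do not use the expected names
# ('latitude', 'longitude' and 'date' are hypothetical).
raw = xr.Dataset(
    coords={
        "latitude": np.arange(0, 10, 1.0),
        "longitude": np.arange(5, 15, 1.0),
        "date": np.array(["2010-02-03"], dtype="datetime64[ns]"),
    },
    data_vars={"myvar": (("latitude", "longitude"), np.ones((10, 10)))},
)

# rename() changes the coordinate names and the matching dimension names.
renamed = raw.rename({"latitude": "lat", "longitude": "lon", "date": "time"})

# Check that the three required coordinate names are now present.
missing = {"lat", "lon", "time"} - set(renamed.coords)
print(sorted(renamed.coords), "missing:", sorted(missing))
```

The renamed object could then be passed to the Dataset class as in the example above.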
Creating a dataset from a dictionary
This uses the same syntax as xarray's Dataset.from_dict (see: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.from_dict.html#xarray.Dataset.from_dict ): the arguments are provided as a dictionary.
The provided dictionary must have latitude, longitude and time coordinates with
valid cerbere names (lat, lon, time, optionally z):
>>> from cerbere.dataset.dataset import Dataset
>>> import numpy as np
>>> from datetime import datetime
>>> dst = Dataset(
... {'time': {'dims': ('time'), 'data': [datetime(2018, 1, 1)]},
... 'lat': {'dims': ('lat'), 'data': np.arange(-80, 80, 1)},
... 'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)},
... 'myvar': {'dims': ('lat', 'lon',),
... 'data': np.ones(shape=(160, 360))}
... }
... )
>>> print(dst)
Dataset: Dataset
Feature Dims :
. lat : 160
. lon : 360
. time : 1
Other Dims :
Feature Coordinates :
. time (time: 1)
. lat (lat: 160)
. lon (lon: 360)
Other Coordinates :
Fields :
. myvar (lat: 160, lon: 360)
Global Attributes :
. time_coverage_start 2018-01-01 00:00:00
. time_coverage_end 2018-01-01 00:00:00
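Since the dictionary must carry the lat, lon and time entries, it can help to check it before building the dataset. The sketch below is plain Python and does not call cerbere; the helper name `missing_coords` is an assumption made for illustration.

```python
REQUIRED_COORDS = ("lat", "lon", "time")  # names required by the text above


def missing_coords(definition, required=REQUIRED_COORDS):
    """Return the required coordinate names absent from a flat dataset dictionary."""
    return [name for name in required if name not in definition]


# A deliberately incomplete definition: there is no 'lon' entry.
definition = {
    "time": {"dims": ("time",), "data": ["2018-01-01"]},
    "lat": {"dims": ("lat",), "data": list(range(-80, 80))},
    "myvar": {"dims": ("lat",), "data": [1.0] * 160},
}

print(missing_coords(definition))  # prints ['lon']
```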
Another syntax accepted by xarray provides explicit coordinates (coords),
fields (data_vars), dimensions (dims) and global attributes (attrs),
which, again, have to be passed as a dictionary to the Dataset constructor:
>>> dst = Dataset({
... 'coords': {
... 'time': {'dims': ('time'), 'data': [datetime(2018, 1, 1)],
... 'attrs': {'units': 'seconds since 2001-01-01 00:00:00'}},
... 'lat': {'dims': ('lat'), 'data': np.arange(-80, 80, 1)},
... 'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)}},
... 'attrs': {'gattr1': 'gattr_val'},
... 'dims': ('time', 'lon', 'lat'),
... 'data_vars': {'myvar': {'dims': ('lat', 'lon',),
... 'data': np.ones(shape=(160, 360))}}}
... )
>>> print(dst)
Dataset: Dataset
Feature Dims :
. lat : 160
. lon : 360
. time : 1
Other Dims :
Feature Coordinates :
. time (time: 1)
. lat (lat: 160)
. lon (lon: 360)
Other Coordinates :
Fields :
. myvar (lat: 160, lon: 360)
Global Attributes :
. gattr1 gattr_val
. time_coverage_start 2018-01-01 00:00:00
. time_coverage_end 2018-01-01 00:00:00
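The two dictionary layouts carry the same information, so one can be derived from the other. As an illustration only, the hypothetical helper `to_explicit` below splits a flat dictionary into the `coords`, `data_vars` and `attrs` entries of the explicit layout, treating the coordinate names used in this section as coordinates. It is plain Python, not a cerbere or xarray function, and it leaves out the optional `dims` entry.

```python
COORD_NAMES = ("time", "lat", "lon", "z")  # coordinate names used in this section


def to_explicit(flat, attrs=None, coord_names=COORD_NAMES):
    """Split a flat {name: definition} dictionary into 'coords' and 'data_vars'."""
    coords = {k: v for k, v in flat.items() if k in coord_names}
    data_vars = {k: v for k, v in flat.items() if k not in coord_names}
    return {"coords": coords, "data_vars": data_vars, "attrs": dict(attrs or {})}


flat = {
    "time": {"dims": ("time",), "data": ["2018-01-01"]},
    "lat": {"dims": ("lat",), "data": [0, 1]},
    "lon": {"dims": ("lon",), "data": [0, 1, 2]},
    "myvar": {"dims": ("lat", "lon"), "data": [[1, 1, 1], [1, 1, 1]]},
}

explicit = to_explicit(flat, attrs={"gattr1": "gattr_val"})
print(sorted(explicit["coords"]), sorted(explicit["data_vars"]))
```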
cerbere Field objects can also be mixed in:
>>> from cerbere.dataset.field import Field
>>> field = Field(
... np.ones(shape=(160, 360)),
... 'myvar',
... dims=('lat', 'lon',),
... attrs={'myattr': 'attr_val'}
... )
>>> dst = Dataset(
... {'time': {'dims': ('time'), 'data': [datetime(2018, 1, 1)]},
... 'lat': {'dims': ('lat'), 'data': np.arange(-80, 80, 1)},
... 'lon': {'dims': ('lon',), 'data': np.arange(-180, 180, 1)},
... 'myvar': field
... }
... )
>>> print(dst)
Dataset: Dataset
Feature Dims :
. lat : 160
. lon : 360
. time : 1
Other Dims :
Feature Coordinates :
. time (time: 1)
. lat (lat: 160)
. lon (lon: 360)
Other Coordinates :
Fields :
. myvar (lat: 160, lon: 360)
Global Attributes :
. time_coverage_start 2018-01-01 00:00:00
. time_coverage_end 2018-01-01 00:00:00
Creating a dataset from xarray arguments
This uses the same syntax as the xarray Dataset constructor (see: http://xarray.pydata.org/en/stable/data-structures.html#dataset ).
The provided coords must include latitude, longitude and time coordinates with
valid cerbere names (lat, lon, time, optionally z), and the same goes for
dimensions:
>>> dst = Dataset(
... {'myvar': (['lat', 'lon'], np.ones(shape=(160, 360)))},
... coords={
... 'time': (['time'], [datetime(2018, 1, 1)], {'units': 'seconds since 2001-01-01 00:00:00'}),
... 'lat': (['lat'], np.arange(-80, 80, 1)),
... 'lon': (['lon'], np.arange(-180, 180, 1))
... },
... attrs={'gattr1': 'gattr_val'}
... )