<!-- double header is needed to workaround doxygen bug -->
# The NetCDF NCZarr Implementation {#nczarr_head}
\tableofcontents
# NCZarr Introduction {#nczarr_introduction}
Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 <a href="#ref_aws">[1]</a>).
This extension provides a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr <a href="#ref_zarrv2">[6]</a> data model.
The NetCDF version of this storage format is called NCZarr <a href="#ref_nczarr">[4]</a>.
A note on terminology in this document.
1. The term "dataset" is used to refer to all of the Zarr objects constituting the meta-data and data.
NCZarr uses a data model <a href="#ref_nczarr">[4]</a> that, by design, extends the Zarr Version 2 Specification <a href="#ref_zarrv2">[6]</a> to add support for the NetCDF-4 data model.
__Note Carefully__: a legal _NCZarr_ dataset is also a legal _Zarr_ dataset under two assumptions:
1. Within Zarr meta-data objects, like ''.zarray'', unrecognized dictionary keys are ignored.
2. The compressors (aka filters) used by the dataset can be encoded/decoded by the implementation. NCZarr uses HDF5-style filters, so ensuring access to such filters is somewhat complicated. See [the companion document on filters](./md_filters.html "filters") for details.
If these assumptions hold for an implementation, then an _NCZarr_ dataset is a legal _Zarr_ dataset and should be readable by that _Zarr_ implementation.
With respect to full netCDF-4, the following concepts are
currently unsupported.
- User-defined types (enum, opaque, VLEN, and Compound)
- Unlimited dimensions
- Contiguous or compact storage
Note that contiguous and compact are not actually supported
because they are HDF5 specific.
When specified, they are treated as chunked where the file consists of only one chunk.
This means that testing for contiguous or compact is not possible; the _nc_inq_var_chunking_ function will always return NC_CHUNKED and the chunksizes will be the same as the dimension sizes of the variable's dimensions.
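As a minimal sketch (the URL and variable name here are hypothetical), the following shows that querying the chunking of an NCZarr variable always reports NC_CHUNKED:
````
#include <stdio.h>
#include <netcdf.h>

int main(void) {
    int ncid, varid, storage;
    size_t chunksizes[NC_MAX_VAR_DIMS];

    /* Hypothetical NCZarr dataset URL; the URL format is described below. */
    if (nc_open("file:///tmp/example.file#mode=nczarr,file", NC_NOWRITE, &ncid)) return 1;
    if (nc_inq_varid(ncid, "v1", &varid)) return 1;
    /* For NCZarr, storage is always reported as NC_CHUNKED; for a variable
       declared contiguous or compact, the chunk sizes equal the dimension sizes. */
    if (nc_inq_var_chunking(ncid, varid, &storage, chunksizes)) return 1;
    printf("chunked: %s\n", storage == NC_CHUNKED ? "yes" : "no");
    return nc_close(ncid);
}
````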
If the _--disable-nczarr_ option is used with './configure', then NCZarr (and Zarr) support is disabled.
If NCZarr support is enabled, then support for datasets stored as files in a directory tree is provided as the only guaranteed mechanism for storing datasets.
However, several additional storage mechanisms are available if the corresponding libraries are installed.
1. Zip format -- if _libzip_ is installed, then it is possible to directly read and write datasets stored in zip files.
2. Amazon S3 -- if the AWS C++ SDK and _libcurl_ are installed, then it is possible to directly read and write datasets stored in Amazon S3 cloud storage.
# Accessing Data Using the NCZarr Protocol {#nczarr_accessing_data}
In order to access an NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format.
Note specifically that there is no NC_NCZARR flag for the mode argument of _nc_create_ or _nc_open_.
Instead, the choice of NCZarr is indicated by the URL itself.
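For example, the following minimal sketch (the file URL is hypothetical) creates and then re-opens an NCZarr dataset; the choice of NCZarr is carried entirely by the URL fragment described below:
````
#include <netcdf.h>

int main(void) {
    int ncid;
    /* The fragment mode=nczarr,file selects NCZarr with directory-tree storage. */
    const char *url = "file:///tmp/example.file#mode=nczarr,file";

    /* Create: there is no NC_NCZARR mode flag; NC_NETCDF4 selects the enhanced model. */
    if (nc_create(url, NC_NETCDF4 | NC_CLOBBER, &ncid)) return 1;
    if (nc_close(ncid)) return 1;

    /* Open: again, the URL alone indicates the NCZarr format. */
    if (nc_open(url, NC_NOWRITE, &ncid)) return 1;
    return nc_close(ncid);
}
````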
## URL Format
The URL has the usual format:
````
scheme://host:port/path?query#fragment
````
There are some details that are important.
- Scheme: this should be _https_, _s3_, or _file_.
The _s3_ scheme is equivalent
to "https" plus setting "mode=nczarr,s3" (see below).
Specifying "file" is mostly used for testing, but is also used to support
the directory tree and zip storage formats.
- Host: normally an Amazon S3 endpoint. It is possible to use other, non-Amazon cloud storage, but
that is cloud-library dependent.
- Query: currently not used.
- Fragment: the fragment is of the form _key=value&key=value&..._.
Depending on the key, the _value_ part may be left out and some
default value will be used.
## Client Parameters
The fragment part of a URL is used to specify information controlling what data format is to be used, as well as additional controls for that data format.
For NCZarr support, the following _key=value_ pairs are allowed.
- mode=nczarr|zarr|noxarray|file|zip|s3
Typically one will specify two mode flags: one to indicate what format
to use and one to specify the way the dataset is to be stored.
For example, a common combination is "mode=zarr,file".
Using _mode=nczarr_ causes the URL to be interpreted as a
reference to a dataset that is stored in NCZarr format.
The _zarr_ mode tells the library to
use NCZarr, but to restrict its operation to pure
Zarr Version 2 datasets.
The modes _s3_, _file_, and _zip_ tell the library what storage
driver to use.
* The _s3_ driver is the default and indicates using Amazon S3 or some equivalent.
* The _file_ format stores data in a directory tree.
* The _zip_ format stores data in a local zip file.
Note that it should be the case that zipping a _file_
format directory tree will produce a file readable by the
_zip_ storage format, and vice versa.
Example URLs combining these mode flags are shown after this list.
- log=\<output-stream\>: this control turns on logging output,
which is useful for debugging and testing.
If just _log_ is used
then it is equivalent to _log=stderr_.
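As illustrations (the host, bucket, and paths are hypothetical), URLs combining these fragment parameters might look like the following:
````
https://s3.us-east-1.amazonaws.com/mybucket/datasets/example#mode=nczarr,s3
file:///tmp/example.file#mode=nczarr,file
file:///tmp/example.zip#mode=zarr,zip&log
````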
# NCZarr Map Implementation {#nczarr_mapimpl}
Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python _MutableMap_ <a href="#ref_python">[5]</a> class.
In NCZarr, the corresponding type is called _zmap_.
The __zmap__ API essentially implements a simplified variant
of the Amazon S3 API.
As with Amazon S3, __keys__ are utf8 strings with a specific structure:
that of a path similar to a Unix path with '/' as the separator.
The zmap model rests on two concepts:
1. __Key__ - a utf8 string in the path form just described; every object is identified by a unique key.
2. __Object__ - the equivalent of the S3 object; each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.
The zmap API defined here isolates the key-value pair mapping
code from the Zarr-based implementation of NetCDF-4.
It wraps an internal C dispatch table that implements an
abstract data structure realizing the zmap key/object model.
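The following is a minimal C sketch of what such a dispatch table might look like; the type and function names are illustrative assumptions, not the actual internal zmap API.
````
#include <stddef.h>

/* Illustrative sketch only: the real dispatch table in libnczarr differs in detail. */
typedef struct ZmapOps {
    /* Read |count| bytes starting at |start| from the object at |key| into |buf|. */
    int (*read)(void *map, const char *key, size_t start, size_t count, void *buf);
    /* Create or overwrite the object at |key| with |count| bytes from |buf|. */
    int (*write)(void *map, const char *key, size_t count, const void *buf);
    /* Return the names that are immediate suffixes of |prefix| (see "search" below). */
    int (*search)(void *map, const char *prefix, size_t *nnamesp, char ***namesp);
    /* Close the map, optionally deleting the underlying storage. */
    int (*close)(void *map, int deleteit);
} ZmapOps;
````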
Of special note is the "search" function of the API.
__Search__: The search function has two purposes:
1. Support reading of pure zarr datasets (because they do not explicitly track their contents).
2. Debugging to allow raw examination of the storage. See zdump for example.
The search function takes a prefix path which has a key syntax (see above).
The set of legal keys is the set of keys such that the key references a content-bearing object -- e.g. /x/y/.zarray or /.zgroup.
Essentially this is the set of keys pointing to the leaf objects of the tree of keys constituting a dataset.
This set potentially limits the set of keys that need to be examined during search.
The search function returns a limited set of names, where the set of names are immediate suffixes of a given prefix path.
That is, if _\<prefix\>_ is the prefix path, then search returns all _\<name\>_ such that _\<prefix\>/\<name\>_ is itself a prefix of a "legal" key.
This can be used to implement glob-style searches such as "/x/y/*" or "/x/y/**".
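For example, given the two legal keys below, successive searches walk down the key tree one level at a time:
````
legal keys:    /x/y/.zarray    /x/z/.zgroup
search("/x")   -> {"y", "z"}
search("/x/y") -> {".zarray"}
````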
These semantics were chosen because they appear to be the minimum required to implement all other kinds of search using recursion.
They were also chosen to limit the number of names returned from the search.
Specifically
1. Avoid returning keys that are not a prefix of some legal key.
2. Avoid returning all the legal keys in the dataset because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.
3. Allow for use of partial read mechanisms such as iterators, if available.
This can support processing a limited set of keys for each iteration.
This is a straightforward tradeoff of space against time.
As a side note, S3 supports this kind of search using common prefixes with a delimiter of '/', although its use is a bit tricky.
For the file system zmap implementation, the legal search keys can be obtained one level at a time, which directly implements the search semantics.
For the zip file implementation, this semantics is not possible, so the whole
tree must be obtained and searched.
__Issues:__
1. S3 limits key lengths to 1024 bytes.
Some deeply nested netcdf files will almost certainly exceed this limit.
2. Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text.
As far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.
__A Note on Error Codes:__
The zmap API returns certain distinguished error codes:
1. NC_NOERR if an operation succeeded.
2. NC_EEMPTY is returned when accessing a key that has no content.
3. NC_EOBJECT is returned when an object is found that should not exist.
4. NC_ENOOBJECT is returned when an object that should exist is not found.
This does not preclude other errors being returned, such as NC_EACCESS, NC_EPERM, or NC_EINVAL, if there are permission errors or illegal function arguments, for example.
It also does not preclude the use of other error codes internal to the zmap implementation.
So zmap_file, for example, uses NC_ENOTFOUND internally because the file system makes it possible to detect the existence of directories and files.
But this does not propagate outside the zmap_file implementation.
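Continuing the hypothetical _ZmapOps_ sketch from above, a caller might distinguish these cases as follows:
````
#include <netcdf.h>  /* for NC_NOERR, NC_EEMPTY, NC_ENOOBJECT */

/* Illustrative only: probe a key via the hypothetical ZmapOps read function.
   Returns 1 if the key exists, 0 if it does not, -1 on any other error. */
static int probe_key(const ZmapOps *ops, void *map, const char *key) {
    char byte;
    switch (ops->read(map, key, 0, 1, &byte)) {
    case NC_NOERR:     return 1;  /* key exists and has content */
    case NC_EEMPTY:    return 1;  /* key exists but has no content */
    case NC_ENOOBJECT: return 0;  /* no such object */
    default:           return -1; /* e.g. NC_EACCESS, NC_EPERM, NC_EINVAL */
    }
}
````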
## Zmap Implementations
The primary zmap implementation is _s3_ (i.e. _mode=nczarr,s3_) and indicates that Amazon S3 cloud storage -- or some related appliance -- is to be used.
Another storage format uses a file system tree of directories and files (_mode=nczarr,file_).
A third storage format uses a zip file (_mode=nczarr,zip_).
The latter two are used mostly for debugging and testing.
However, the _file_ and _zip_ formats are important because they are intended to match corresponding storage formats used by the Python Zarr implementation.
Note that an S3 URL may place the bucket name either in the host (virtual style) or as the first segment of the path (path style).
The NCZarr code will accept either form, although internally, it is standardized on path style.
The reason for this is that the bucket name forms the initial segment in the keys.
# Zarr vs NCZarr {#nczarr_vs_zarr}
## Data Model
The NCZarr storage format is almost identical to that of the standard Zarr version 2 format.
The data model differs as follows.
1. Zarr only supports anonymous dimensions -- NCZarr supports only shared (named) dimensions.
2. Zarr attributes are untyped -- or perhaps more correctly characterized as of type string.
## Storage Format
Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
Containment is modeled using the fact that the containing group's key is a prefix of the variable's key.
So for example, if variable _v1_ is contained in top-level group _g1_ -- with key _/g1_ -- then the key for _v1_ is _/g1/v1_.
Additional meta-data information is stored in special objects whose names start with ".z".
In Zarr, the following special objects exist.
1. Information about a group is kept in a special object named _.zgroup_;
so for example the object _/g1/.zgroup_.
2. Information about an array is kept as a special object named _.zarray_;
so for example the object _/g1/v1/.zarray_.
3. Group-level attributes and variable-level attributes are stored in a special object named _.zattr_;
so for example the objects _/g1/.zattr_ and _/g1/v1/.zattr_.
4. Chunk data is stored in objects named "\<n1\>.\<n2\>...\<nr\>" where the ni are non-negative integers representing the chunk index for the ith dimension.
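For example, a dataset with a top-level group _g1_ containing a two-dimensional variable _v1_ split into two chunks might comprise the following keys (illustrative):
````
/.zgroup
/g1/.zgroup
/g1/.zattr
/g1/v1/.zarray
/g1/v1/.zattr
/g1/v1/0.0
/g1/v1/0.1
````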
The first three kinds of objects contain meta-data in the form of a string representing a JSON-formatted dictionary.
The NCZarr format uses the same objects as Zarr, but inserts NCZarr
specific key-value pairs in them to hold NCZarr specific information.
The value of each of these keys is a JSON dictionary containing a variety
of NCZarr specific information.
In particular, the _.zattr_ object carries an extra NCZarr key recording the types of the attributes.
This means that technically, it is an attribute, but one for which access
is normally suppressed.
Specifically it contains the following keys:
* types -- the types of all of the other attributes in the _.zattr_ object.
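For illustration only -- the exact key name and type encoding here are assumptions that depend on the format version -- a _.zattr_ object augmented in this way might look like the following:
````
{
    "version": 1,
    "_nczarr_attr": { "types": { "version": "<i4" } }
}
````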
## Translation {#nczarr_translation}
With some constraints, it is possible for an NCZarr library to read Zarr and for a Zarr library to read the NCZarr format.
The latter case, Zarr reading NCZarr, is possible if the Zarr library is willing to ignore keys whose name it does not recognize; specifically anything beginning with _\_NCZARR\__.
The former case, NCZarr reading Zarr, is also possible if the NCZarr library can simulate or infer the contents of the missing _\_NCZARR\_XXX_ objects.
1. _\_nczarr_group\__ -- The list of contained variables and sub-groups can be computed using the search API to list the keys "contained" in the key for a group.
One remaining constraint is that a Zarr dataset may store data in column-major order instead of row-major order; the effect of encountering such a dataset is to output the data in the wrong order.
In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.
## XArray
The Xarray <a href="#ref_xarray">[7]</a> Zarr implementation uses its own mechanism for specifying shared dimensions.
It uses a special attribute named ''_ARRAY_DIMENSIONS''.
The value of this attribute is a list of dimension names (strings).
An example might be ````["time", "lon", "lat"]````.
It is essentially equivalent to the ````_nczarr_array "dimrefs" list````, except that the latter uses fully qualified names so the referenced dimensions can be anywhere in the dataset.
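Illustratively (the dimension locations are hypothetical), the same three dimensions might be recorded in the two forms as follows:
````
"_ARRAY_DIMENSIONS": ["time", "lon", "lat"]
"dimrefs": ["/time", "/lon", "/lat"]
````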
<a name="ref_nczarr">[4]</a> [NetCDF ZARR Data Model Specification](https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification)<br>
## Building NCZarr S3 Support with CMake
When building with CMake, the following options parallel the Automake ones.
1. -DENABLE_NCZARR=off -- equivalent to the Automake _--disable-nczarr_ option.
2. -DENABLE_NCZARR_S3=off -- equivalent to the Automake _--enable-nczarr-s3_ option.
3. -DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake _--enable-nczarr-s3-tests_ option.
Note that, unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify _-laws-cpp-sdk-s3_, assuming that the AWS S3 libraries are installed in the default location.
For CMake with Visual Studio, the default location is here:
````
C:/Program Files (x86)/aws-cpp-sdk-all
````
It is possible to install the sdk library in another location.
In this case, one must add the following flag to the cmake command.
````
cmake ... -DAWSSDK_DIR=\<awssdkdir\>
````
where "awssdkdir" is the path to the sdk installation.
For example, this might be as follows.
````
cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
````
This can be useful if blanks in path names cause problems in your build environment.
## Testing S3 Support {#nczarr_testing_S3_support}
The relevant tests for S3 support are in the _nczarr_test_ directory.
Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group.
## Byte-Range Access {#nczarr_byterange}
The NetCDF-C library contains an alternate mechanism for accessing traditional netcdf-4 files stored in Amazon S3: the byte-range mechanism.
The idea is to treat the remote data as if it were a single large file.
This remote "file" can be randomly accessed using the HTTP Byte-Range header.
In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket.
Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL with the trailing fragment containing the string ````#mode=bytes````.
An examination of the test program _nc_test/test_byterange.sh_ shows simple examples using the _ncdump_ program.
Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, _noaa-goes16_ in this case, is part of the URL path instead of the host.
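For example, a byte-range URL in path format has the following general shape, where the object path is illustrative:
````
https://s3.us-east-1.amazonaws.com/noaa-goes16/<path-to-object>.nc#mode=bytes
````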
## NCZarr Version 1 {#nczarr_version1}
In NCZarr Version 1, the NCZarr specific metadata was represented using new objects rather than as keys in existing Zarr objects.
Due to conflicts with the Zarr specification, that format is deprecated in favor of the one described above.
However the netcdf-c NCZarr support can still read the version 1 format.
The version 1 format defines three specific objects: _.nczgroup_, _.nczarray_, and _.nczattr_.
These are stored in parallel with the corresponding Zarr objects. So if there is a key of the form "/x/y/.zarray", then there is also a key "/x/y/.nczarray".
The content of these objects is the same as the contents of the corresponding keys. So the value of the ''_NCZARR_ARRAY'' key is the same as the content of the ''.nczarray'' object. The list of connections is as follows:
* ''.nczgroup'' <=> ''_NCZARR_GROUP_''
* ''.nczarray'' <=> ''_NCZARR_ARRAY_''
* ''.nczattr'' <=> ''_NCZARR_ATTR_''