netcdf-c/libnczarr/zmap.h
Dennis Heimbigner 59e04ae071 This PR adds EXPERIMENTAL support for accessing data in the
cloud using a variant of the Zarr protocol and storage
format. This enhancement is generically referred to as "NCZarr".

The data model supported by NCZarr is netcdf-4 minus the user-defined
types and the String type. In this sense it is similar to the CDF-5
data model.

More detailed information about enabling and using NCZarr is
described in the document NUG/nczarr.md and in a
[Unidata Developer's blog entry](https://www.unidata.ucar.edu/blogs/developer/en/entry/overview-of-zarr-support-in).

WARNING: this code has had limited testing, so do not use this version
for production work. Also, performance improvements are ongoing.
Note especially the following platform matrix of successful tests:

Platform       | Build System | S3 support
---------------|--------------|-----------
Linux+gcc      | Automake     | yes
Linux+gcc      | CMake        | yes
Visual Studio  | CMake        | no

Additionally, and as a consequence of the addition of NCZarr,
major changes have been made to the Filter API. NOTE: NCZarr
does not yet support filters, but these changes are enablers for
that support in the future.  Note that it is possible
(probable?) that there will be some accidental reversions if the
changes here did not correctly mimic the existing filter testing.

In any case, previously filter ids and parameters were of type
unsigned int. In order to support the more general zarr filter
model, this was all converted to char*.  The old HDF5-specific,
unsigned int operations are still supported but they are
wrappers around the new, char* based nc_filterx_XXX functions.
This entailed at least the following changes:
1. Added the files libdispatch/dfilterx.c and include/ncfilter.h
2. Some filterx utilities have been moved to libdispatch/daux.c
3. A new entry, "filter_actions" was added to the NCDispatch table
   and the version bumped.
4. An overly complex set of structs was created to support funnelling
   all of the filterx operations thru a single dispatch
   "filter_actions" entry.
5. Move common code from libhdf5 to libsrc4 so that it is accessible
   to nczarr.

Changes directly related to Zarr:
1. Modified CMakeLists.txt and configure.ac to support both C and C++
   -- this is needed for S3 support via the aws-sdk libraries.
2. Define a size64_t type to support nczarr.
3. More reworking of libdispatch/dinfermodel.c to
   support zarr and to regularize the structure of the fragments
   section of a URL.

Changes not directly related to Zarr:
1. Make client-side filter registration be conditional, with default off.
2. Hack include/nc4internal.h to make some flags added by Ed be unique:
   e.g. NC_CREAT, NC_INDEF, etc.
3. cleanup include/nchttp.h and libdispatch/dhttp.c.
4. Misc. changes to support compiling under Visual Studio including:
   * Better testing under windows for dirent.h and opendir and closedir.
5. Misc. changes to the oc2 code to support various libcurl CURLOPT flags
   and to centralize error reporting.
6. By default, suppress the vlen tests that have unfixed memory leaks; add option to enable them.
7. Make part of the nc_test/test_byterange.sh test be contingent on remotetest.unidata.ucar.edu being accessible.

Changes Left TO-DO:
1. fix provenance code, it is too HDF5 specific.
2020-06-28 18:02:47 -06:00


/* Copyright 2018-2018 University Corporation for Atmospheric
Research/Unidata. */
/**
* @file This header file contains types (and type-related macros)
* for the libzarr code.
*
*
* @author Dennis Heimbigner
*/
/*
This API essentially implements a simplified variant
of the Amazon S3 API. Specifically, we have the following
kinds of things.
As with Amazon S3, keys are utf8 strings with a specific structure:
that of a path similar to those of a Unix path with '/' as the
separator for the segments of the path.
As with Unix, all keys have this BNF syntax:
<pre>
key: '/' | key segment ;
segment: <sequence of UTF-8 characters except control characters and '/'>
</pre>
Obviously, one can infer a tree structure from this key structure.
A containment relationship is defined by key prefixes.
Thus one key is "contained" (possibly transitively)
by another if one key is a prefix (in the string sense) of the other.
So in this sense the key "/x/y/z" is contained by the key "/x/y".
As with S3, a key refers to an "object" that can contain content.
An important restriction is placed on the structure of the tree.
Namely, keys are only defined for content-bearing objects.
Further, all the leaves of the tree are these content-bearing objects.
This means that the key for one content-bearing object cannot
be a prefix of any other key.
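As a minimal sketch of the containment rule above, the following
checks whether one key contains another. The name key_contains is
hypothetical and is not part of zmap. It also assumes that the prefix
must end on a segment boundary, so that "/x/y" contains "/x/y/z" but
not "/x/yz"; a pure string-prefix test would accept both.

```c
#include <assert.h>
#include <string.h>

// Hypothetical helper, not part of the zmap API.
// Returns 1 if `inner` is contained (possibly transitively) by `outer`,
// i.e. `outer` is a proper prefix of `inner` ending on a '/' boundary.
int key_contains(const char* outer, const char* inner)
{
    assert(outer != NULL && inner != NULL);
    size_t n = strlen(outer);
    if (strncmp(outer, inner, n) != 0) return 0; // not a string prefix
    if (inner[n] == '\0') return 0;              // same key, not contained
    if (n > 0 && outer[n - 1] == '/') return 1;  // outer is "/" or ends in '/'
    return inner[n] == '/';                      // next char must start a segment
}
```

Under these assumptions, key_contains("/x/y", "/x/y/z") returns 1 and
key_contains("/x/y", "/x/yz") returns 0.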
There are several other concepts of note.
1. Dataset - a dataset is the complete tree contained by the key defining
the root of the dataset.
2. Object - equivalent of the S3 object; Each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.
Notes:
1. The search function is optional. It has two purposes:
a. Support reading of pure zarr datasets (because they do not explicitly
track their contents).
b. Debugging to allow raw examination of the storage. See zdump
for example.
The zmap API defined here isolates the key-value pair mapping code
from the Zarr-based implementation of NetCDF-4.
It wraps an internal C dispatch table manager
for implementing an abstract data structure
implementing the key/object model.
Issues:
1. S3 limits key lengths to 1024 bytes. Some deeply nested netcdf files
will almost certainly exceed this limit.
2. Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text. As
far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.
A Note on Error Codes:
This model uses the S3 concepts of keys. All legal keys "exist"
in that it is possible to write to them. The concept of a key
not existing has no meaning: all keys exist. Normally, in S3,
each key specifies an object, but unless that object has
content, it does not exist. Therefore we distinguish
content-bearing "objects" from non-content-bearing objects. Our
model only holds content-bearing objects. Note that the length of
that content may be zero. The important point is that in this
model, only content-bearing objects actually exist. Note that
this is different from, say, a directory tree where a key will
always lead to something: a directory or a file.
In any case, the zmap API returns two distinguished error codes:
1. NC_NOERR if a content-bearing object is created or referenced.
2. NC_EEMPTY is returned when accessing a key that has no content.
This does not preclude other errors being returned, such as NC_EACCESS,
NC_EPERM, or NC_EINVAL,
if there are permission errors or illegal function arguments, for example.
It also does not preclude the use of other error codes internal to the zmap
implementation. So zmap_nzf, for example, uses NC_ENOTFOUND internally
because it is possible to detect the existence of directories and files.
This does not propagate to the API.
Note that NC_EEMPTY is a new error code that signals that the
caller asked for a non-content-bearing key.
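The distinction between the two codes can be sketched with a toy
lookup. Everything below is hypothetical: the sk_ names, the fixed
table of objects, and the numeric values of the codes are placeholders,
and real code would use NC_NOERR and NC_EEMPTY from the netCDF headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Placeholder codes for illustration only.
enum { SK_NOERR = 0, SK_EEMPTY = -1 };

// A toy "map": a fixed table of content-bearing objects.
struct sk_object { const char* key; const char* content; };

static const struct sk_object sk_objects[] = {
    { "/g/meta",  "{}" },
    { "/g/empty", ""   },  // zero-length content still counts as content
};

// Hypothetical lookup: every legal key "exists", so the only question
// is whether the key currently bears content.
int sk_len(const char* key, size_t* sizep)
{
    assert(key != NULL && sizep != NULL);
    for (size_t i = 0; i < sizeof(sk_objects) / sizeof(sk_objects[0]); i++) {
        if (strcmp(sk_objects[i].key, key) == 0) {
            *sizep = strlen(sk_objects[i].content);
            return SK_NOERR;
        }
    }
    return SK_EEMPTY; // legal key, but no content behind it
}
```

Here "/g" is a prefix of content-bearing keys but bears no content
itself, so sk_len reports SK_EEMPTY for it, while "/g/empty" succeeds
with a size of zero.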
The current set of operations defined for zmaps is defined by the
generic nczmap_xxx functions below.
*/
#ifndef ZMAP_H
#define ZMAP_H
#define NCZM_SEP "/"
#define NCZM_DOT '.'
/* Forward */
typedef struct NCZMAP_API NCZMAP_API;
/* Define the space of implemented (eventually) map implementations */
typedef enum NCZM_IMPL {
NCZM_UNDEF=0, /* Undefined; no implementation selected */
NCZM_S3=1, /* Amazon S3 implementation */
NCZM_NC4=2, /* Netcdf-4 file based implementation */
NCZM_FILE=3, /* File system directory-based implementation */
} NCZM_IMPL;
/* Define the default map implementation */
#define NCZM_DEFAULT NCZM_NC4
/*
For each dataset, we create what amounts to a class
defining data and the API function implementations.
All datasets are subclasses of NCZMAP.
In the usual C approach, subclassing is performed by
casting.
So all Dataset structs have this as their first field
so we can cast to this form; avoids need for
a separate per-implementation malloc piece.
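The casting approach described above can be sketched as follows.
All sk_ names are hypothetical stand-ins rather than the zmap types;
the sketch only assumes that the base struct is the first field of
each implementation struct and that calls go through a function table.

```c
#include <assert.h>
#include <stddef.h>

struct sk_map;
struct sk_api { int (*close)(struct sk_map* map, int deleteit); };

// "Base class": every implementation struct starts with this.
struct sk_map { int format; const struct sk_api* api; };

// "Subclass": the base is the first field, so a pointer to the
// subclass can be converted to a pointer to the base and back.
struct sk_filemap { struct sk_map map; int closed; int deleted; };

int sk_file_close(struct sk_map* map, int deleteit)
{
    struct sk_filemap* fmap = (struct sk_filemap*)map; // downcast
    fmap->closed = 1;
    fmap->deleted = deleteit ? 1 : 0;
    return 0;
}

const struct sk_api sk_file_api = { sk_file_close };

// Generic wrapper: dispatches through the table without knowing
// which implementation is behind the map.
int sk_close(struct sk_map* map, int deleteit)
{
    assert(map != NULL && map->api != NULL);
    return map->api->close(map, deleteit);
}
```

A caller holding a struct sk_filemap passes the address of its map
field to sk_close; a single allocation covers both the base and the
implementation-specific fields.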
*/
typedef struct NCZMAP {
NCZM_IMPL format;
char* url;
int mode;
size64_t flags;
struct NCZMAP_API* api;
} NCZMAP;
/* Forward */
/* Define the object-level API */
struct NCZMAP_API {
int version;
/* Map Operations */
int (*close)(NCZMAP* map, int deleteit);
/* Object Operations */
int (*exists)(NCZMAP* map, const char* key);
int (*len)(NCZMAP* map, const char* key, size64_t* sizep);
int (*defineobj)(NCZMAP* map, const char* key);
int (*read)(NCZMAP* map, const char* key, size64_t start, size64_t count, void* content);
int (*write)(NCZMAP* map, const char* key, size64_t start, size64_t count, const void* content);
int (*search)(NCZMAP* map, const char* prefix, NClist* matches);
};
/* Define the Dataset level API */
typedef struct NCZMAP_DS_API {
int version;
int (*create)(const char *path, int mode, size64_t flags, void* parameters, NCZMAP** mapp);
int (*open)(const char *path, int mode, size64_t flags, void* parameters, NCZMAP** mapp);
} NCZMAP_DS_API;
#ifdef __cplusplus
extern "C" {
#endif
/* Object API Wrappers; note that there are no group operations
because group keys do not map to directories.
*/
/**
Check if a specified content-bearing object exists or not.
@param map -- the containing map
@param key -- the key specifying the content-bearing object
@return NC_NOERR if the object exists
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_exists(NCZMAP* map, const char* key);
/**
Return the current size of a specified content-bearing object.
@param map -- the containing map
@param key -- the key specifying the content-bearing object
@param sizep -- the object's size is returned thru this pointer.
@return NC_NOERR if the object exists
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_len(NCZMAP* map, const char* key, size64_t* sizep);
/**
Create a specified content-bearing object.
@param map -- the containing map
@param key -- the key specifying the content-bearing object
@return NC_NOERR if the object is created
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_defineobj(NCZMAP* map, const char* key);
/**
Read the content of a specified content-bearing object.
@param map -- the containing map
@param key -- the key specifying the content-bearing object
@param start -- offset into the content to start reading
@param count -- number of bytes to read
@param content -- read the data into this memory
@return NC_NOERR if the operation succeeded
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_read(NCZMAP* map, const char* key, size64_t start, size64_t count, void* content);
/**
Write the content of a specified content-bearing object.
@param map -- the containing map
@param key -- the key specifying the content-bearing object
@param start -- offset into the content to start writing
@param count -- number of bytes to write
@param content -- write the data from this memory
@return NC_NOERR if the operation succeeded
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_write(NCZMAP* map, const char* key, size64_t start, size64_t count, const void* content);
/**
Return a vector of keys representing the content-bearing
objects that are immediately contained by the prefix key.
@param map -- the containing map
@param prefix -- the key into the tree where the search is to occur
@param matches -- return the set of keys in this list
@return NC_NOERR if the operation succeeded
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_search(NCZMAP* map, const char* prefix, NClist* matches);
/**
Close a map
@param map -- the map to close
@param deleteit -- if true, then delete the corresponding dataset
@return NC_NOERR if the operation succeeded
@return NC_ENOTFOUND if the object does not exist
@return NC_EXXX if the operation failed for one of several possible reasons
*/
extern int nczmap_close(NCZMAP* map, int deleteit);
/* Create/open and control a dataset using a specific implementation */
extern int nczmap_create(NCZM_IMPL impl, const char *path, int mode, size64_t flags, void* parameters, NCZMAP** mapp);
extern int nczmap_open(NCZM_IMPL impl, const char *path, int mode, size64_t flags, void* parameters, NCZMAP** mapp);
/* Utility functions */
/** Split a path into pieces along '/' character; elide any leading '/' */
extern int nczm_split(const char* path, NClist* segments);
/* Split a path into pieces along some character; elide any leading char */
extern int nczm_split_delim(const char* path, char delim, NClist* segments);
/* Convenience: Join all segments into a path using '/' character */
extern int nczm_join(NClist* segments, char** pathp);
/* Convenience: concat two strings; caller frees */
extern int nczm_concat(const char* prefix, const char* suffix, char** pathp);
/* Break a key into prefix and suffix, where prefix is the first nsegs segments;
nsegs can be negative to specify that suffix is |nsegs| long
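A minimal sketch of these semantics follows. The function
sketch_divide_at is a hypothetical stand-in, not the library's
implementation. It assumes keys start with '/' and contain no empty
segments, and its handling of out-of-range nsegs (returning -1) is
an assumption as well.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Returns 0 on success, -1 on bad arguments or allocation failure.
// The caller frees the strings stored through prefixp and suffixp.
int sketch_divide_at(const char* key, int nsegs, char** prefixp, char** suffixp)
{
    assert(key != NULL && prefixp != NULL && suffixp != NULL);
    if (key[0] != '/') return -1;

    // Count segments: each '/' that is followed by a character starts one.
    int total = 0;
    for (const char* p = key; *p; p++)
        if (*p == '/' && p[1] != '\0') total++;

    // Number of segments that go into the prefix.
    int want = (nsegs >= 0) ? nsegs : total + nsegs;
    if (want < 0 || want > total) return -1;

    // The prefix ends just before the '/' that starts segment want+1.
    size_t cut = strlen(key);
    int seen = 0;
    for (size_t i = 0; key[i]; i++) {
        if (key[i] == '/') {
            if (seen == want) { cut = i; break; }
            seen++;
        }
    }

    char* prefix = (char*)malloc(cut + 1);
    char* suffix = (char*)malloc(strlen(key) - cut + 1);
    if (prefix == NULL || suffix == NULL) { free(prefix); free(suffix); return -1; }
    memcpy(prefix, key, cut);
    prefix[cut] = '\0';
    strcpy(suffix, key + cut);
    *prefixp = prefix;
    *suffixp = suffix;
    return 0;
}
```

For the key "/a/b/c", both nsegs=2 and nsegs=-1 yield the prefix
"/a/b" and the suffix "/c" in this sketch.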
*/
extern int nczm_divide_at(const char* key, int nsegs, char** prefixp, char** suffixp);
/* Reclaim the content of a map but not the map itself */
extern int nczm_clear(NCZMAP* map);
/* Return 1 if path is absolute; takes Windows drive letters into account */
extern int nczm_isabsolutepath(const char* path);
#ifdef __cplusplus
}
#endif
#endif /*ZMAP_H*/
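
The path-splitting utility declared above (nczm_split) is described only by its one-line comment. As a rough illustration of that behavior, here is a sketch that splits a path along '/' and elides any leading '/'. The function sk_split is hypothetical: it fills a caller-provided array rather than an NClist, and skipping empty segments is an assumption not stated in the header.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for nczm_split.
// Returns the number of segments found, or -1 on allocation failure
// or if more than `max` segments are present.
// Each stored segment is heap-allocated; the caller frees them.
int sk_split(const char* path, char** segments, int max)
{
    assert(path != NULL && segments != NULL);
    int n = 0;
    const char* p = path;
    while (*p == '/') p++;                  // elide any leading '/'
    while (*p != '\0') {
        const char* q = strchr(p, '/');     // end of this segment
        size_t len = q ? (size_t)(q - p) : strlen(p);
        if (len > 0) {                      // skip empty segments ("a//b")
            if (n == max) goto fail;
            segments[n] = (char*)malloc(len + 1);
            if (segments[n] == NULL) goto fail;
            memcpy(segments[n], p, len);
            segments[n][len] = '\0';
            n++;
        }
        p += len;
        if (*p == '/') p++;                 // step over the separator
    }
    return n;
fail:
    while (n > 0) free(segments[--n]);      // release partial results
    return -1;
}
```

Splitting "/a/bb/c" with this sketch produces the three segments "a", "bb", and "c"; splitting "/" produces none.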