netcdf-c/docs/quickstart_paths.md
Dennis Heimbigner 49737888ca Improve S3 Documentation and Support
## Improvements to S3 Documentation
* Create a new document *quickstart_paths.md* that gives a summary of the legal path formats used by netcdf-c. This includes both file paths and URL paths.
* Modify *nczarr.md* to remove most of the S3 related text.
* Move the S3 text from *nczarr.md* to a new document *cloud.md*.
* Add some S3-related text to the *byterange.md* document.

Hopefully, this will make it easier for users to find the information they want.

## Rebuild NCZarr Testing
In order to avoid problems with running make check in parallel, two changes were made:
1. The *nczarr_test* test system was rebuilt. Now, for each test,
any generated files are kept in a test-specific directory, isolated
from all other test executions.
2. Similarly, since the S3 test bucket is shared, any generated S3 objects
are isolated using a test-specific key path.

## Other S3 Related Changes
* Add code to ensure that files created on S3 are reclaimed at end of testing.
* Used the bash "trap" command to ensure S3 cleanup even if the test fails.
* Clean up the S3-related configure.ac flag set, since S3 is used in several places. One should now use the option *--enable-s3* instead of *--enable-nczarr-s3*, although the latter is still kept as a deprecated alias for the former.
* Get some of the GitHub Actions yml to work with S3; this required fixing various test scripts and adding a secret to access the Unidata S3 bucket.
* Clean up the S3 portion of libnetcdf.settings.in, netcdf_meta.h.in, and test_common.in.
* Merge partial S3 support into dhttp.c.
* Create an experimental S3 access library, intended especially for use with Windows. It is enabled by using the options *--enable-s3-internal* (automake) or *-DENABLE_S3_INTERNAL=ON* (CMake). Also add a unit test for it.
* Move some definitions from ncrc.h to ncs3sdk.h.

## Other Changes
* Provide a default implementation of strlcpy and move this and similar defaults into *dmissing.c*.
2023-04-25 17:15:06 -06:00


Appendix D.2. Specifying Paths for NetCDF-C

A key concept in netcdf-c is the notion of a "path". A path specifies a dataset that is of interest to a user. It is the primary argument to the nc_open and nc_create functions in the NetCDF-C API, as defined by the file netcdf.h. It is also the primary argument to the NetCDF-C utilities: ncdump, nccopy, and ncgen. Hence, understanding which kinds of paths are acceptable is important for using the NetCDF-C library.

Classification of Paths

Basically, there are two kinds of paths:

  1. File system paths, and
  2. Uniform Resource Locator (URL) paths.

File System Paths

The most common form of path accepted by the NetCDF-C library is a file system path. Anyone who has used a computer operating system is familiar with the idea of a file system path.

Each operating system has some special quirks when specifying file system paths. Here are some example paths for various versions of Linux/Unix.

  • /
  • /a/b/c/d

As a rule, Linux/Unix has a single root path typically indicated by "/", and "/" is also used as the file separator.

For Windows, some example paths would be as follows:

  • c:\
  • d:\a\b\c

Windows has a notion of a drive ("d:") and each drive serves as the root of its own file system. Windows uses "\" as its file separator, although many programs also accept "/".
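
The differences between the two conventions can be seen with a short script. This is only an illustrative sketch using Python's standard pathlib module, which models both conventions without touching the disk; it is not part of netcdf-c and says nothing about how the library itself parses paths.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Linux/Unix: a single root ("/"), with "/" as the separator.
posix = PurePosixPath("/a/b/c/d")
print(posix.root)    # the root of the path
print(posix.parts)   # root followed by each path segment

# Windows: each drive is the root of its own file system.
win = PureWindowsPath(r"d:\a\b\c")
print(win.drive)     # the drive part of the path
print(win.parts)     # drive root followed by each path segment

# Forward slashes are accepted as an alternative separator.
print(win == PureWindowsPath("d:/a/b/c"))
```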

Uniform Resource Locator (URL) Paths

The NetCDF-C library can access datasets that reside on remote computers. Hence, NetCDF-C also accepts URLs to specify those remote datasets.

The general form of a URL is as follows:

<protocol>://<user>:<pwd>@<host>:<port>/<path>?<query>#<fragment>
  • <protocol> - specifies the format of the messages between the local and remote computers. Some examples used by NetCDF-C are http, https, s3, dap, dap4, or file.
  • <user>:<pwd> - Passes authorization credentials to the remote computer.
  • <host>:<port> - The specific remote computer to access, such as thredds.ucar.edu:8080.
  • <path> - A file-path-like string that indicates some specific resource on the remote computer.
  • <query> - A sequence of (key=value) pairs, separated by "&", and providing special information to the remote computer.
  • <fragment> - A sequence of (key=value) pairs, separated by "&", and providing special information to the local computer.

Most parts of a URL are optional. As a rule, the protocol is always required, and either the host is required or, if the protocol is "file", then the path is required. The query and fragment are optional.
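
As a rough illustration of the general form, the following sketch splits a URL into the parts named above and applies the rule of thumb about required parts. It uses Python's standard urllib.parse module rather than netcdf-c's own URL parser, so the library may well differ in edge cases. The function names describe_url and satisfies_rule are hypothetical, and the user, password, path, and query in the example URL are made up.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the parts named in the general form."""
    u = urlsplit(url)
    return {
        "protocol": u.scheme,
        "user": u.username,
        "pwd": u.password,
        "host": u.hostname,
        "port": u.port,
        "path": u.path,
        "query": u.query,
        "fragment": u.fragment,
    }

def satisfies_rule(url):
    """Rule of thumb: a protocol is always required, plus a host,
    or a path when the protocol is "file"."""
    u = urlsplit(url)
    if not u.scheme:
        return False
    if u.scheme == "file":
        return bool(u.path)
    return bool(u.hostname)

# Hypothetical URL: only the host and port come from the text.
example = "https://user:pwd@thredds.ucar.edu:8080/data/test.nc?x=1#mode=bytes"
parts = describe_url(example)
for name, value in parts.items():
    print(f"{name:9} {value}")

print(satisfies_rule("file:///a/b/c/d"))  # file protocol with a path
print(satisfies_rule("/a/b/c/d"))         # no protocol: not a URL path
```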

Examples of URL Paths for NetCDF-C

Addendum A. Amazon S3 Specific URLs

A URL path is required for accessing datasets on the Amazon S3 storage cloud. Unfortunately, S3 URLs are complicated. An S3 URL has the following features:

  • Protocol: https or s3. The s3 scheme is equivalent to "https" plus setting various tags in the query and/or fragment part of the URL.
  • Host: Amazon S3 defines three forms (Virtual, Path, and S3); other, non-Amazon hosts are also possible:
    • Virtual: the host includes the bucket name as in bucket.s3.<region>.amazonaws.com or bucket.s3.amazonaws.com
    • Path: the host does not include the bucket name, but rather the bucket name is the first segment of the path. For example s3.<region>.amazonaws.com/bucket or s3.amazonaws.com/bucket
    • S3: the protocol is "s3:" and if the host is a single name, then it is interpreted as the bucket. The region is determined using an algorithm defined in the nczarr documentation.
    • Other: It is possible to use other non-Amazon cloud storage, but that is cloud library dependent.
  • Query: currently not used.
  • Fragment: the fragment is of the form key=value&key=value&.... Depending on the key, the value part may be left out and some default value will be used. The exact set of possible keys is defined in the nczarr documentation.
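
The three host forms can be told apart by simple pattern matching, as in the sketch below. This is an illustration only and is not the algorithm used inside netcdf-c: the function name classify_s3_url and the bucket name mybucket are hypothetical, region determination is not attempted, and any host that does not match an Amazon pattern is simply reported as "other".

```python
from urllib.parse import urlsplit

def classify_s3_url(url):
    """Return (form, bucket) for the host forms listed above.

    Simplified pattern matching for illustration; region lookup and
    non-Amazon hosts are not handled.
    """
    u = urlsplit(url)
    host = u.hostname or ""
    segments = [s for s in u.path.split("/") if s]
    if u.scheme == "s3" and host and "." not in host:
        # S3 form: a single-name host is interpreted as the bucket.
        return ("s3", host)
    if host.endswith(".amazonaws.com"):
        if host.startswith("s3."):
            # Path form: the bucket is the first segment of the path.
            return ("path", segments[0] if segments else None)
        if ".s3." in host:
            # Virtual form: the bucket is the host prefix before ".s3.".
            return ("virtual", host.split(".s3.", 1)[0])
    return ("other", None)

for url in (
    "https://mybucket.s3.us-east-1.amazonaws.com/key",
    "https://s3.amazonaws.com/mybucket/key",
    "s3://mybucket/key",
):
    print(url, "->", classify_s3_url(url))
```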

Addendum B. Known Fragment Keys

The fragment part of a URL is used to pass information deep into the netcdf-c library to control its actions. This addendum lists the known keys, although it may be somewhat out of date.

The current set of keys used in the netcdf-c library is as follows.

  • mode -- A special key that is used to provide single values for controlling the netcdf-c library. It consists of a comma-separated sequence of values primarily used to control the file format. The mode key supports the following values:
    • dap2 -- Specifies that the URL accesses a resource using the DAP2 protocol
    • dap4 -- Specifies that the URL accesses a resource using the DAP4 protocol
    • netcdf-3 -- Specifies that a file is a netcdf-classic file
    • classic -- Alias for netcdf-3
    • netcdf-4 -- Specifies that a file is a netcdf-enhanced file
    • enhanced -- Alias for netcdf-4
    • udf0 -- Specifies that the file format is defined by a User Defined format
    • udf1 -- Specifies that the file format is defined by a User Defined format
    • nczarr -- Specifies that the file is in NCZarr format
    • zarr -- Specifies that the file is in Zarr format
    • xarray -- Specifies that the file is in Zarr format and uses the XARRAY convention
    • noxarray -- Specifies that the file is in Zarr format and does not use the XARRAY convention
    • s3 -- Specifies that the file is remote and is stored on the Amazon S3 cloud
    • file -- Specifies that the file is an NCZarr/Zarr file stored as a file tree
    • zip -- Specifies that the file is an NCZarr/Zarr file stored as a zip file
    • bytes -- Specifies that the file is remote and is to be read using byte-range support in NCZarr format
  • dap2 -- equivalent to "mode=dap2"
  • dap4 -- equivalent to "mode=dap4"
  • bytes -- equivalent to "mode=bytes"
  • log -- turn on logging for the duration of the data request
  • show=fetch -- log curl fetch commands
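
To make the fragment syntax concrete, the sketch below parses a fragment into its mode values and its remaining keys, treating the bare keys dap2, dap4, and bytes as equivalent to the corresponding mode values, as listed above. It is a minimal illustration in Python, not the parser used by netcdf-c, which may treat edge cases differently. The function name parse_fragment and the example.org URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Bare keys described above as equivalent to "mode=<key>".
MODE_SHORTHANDS = ("dap2", "dap4", "bytes")

def parse_fragment(url):
    """Return (modes, others) parsed from the fragment of a URL."""
    fragment = urlsplit(url).fragment
    modes, others = [], {}
    for item in fragment.split("&"):
        if not item:
            continue
        # A key may appear without a value; value is then empty.
        key, _, value = item.partition("=")
        if key == "mode":
            # The mode value is a comma-separated sequence.
            modes.extend(v for v in value.split(",") if v)
        elif key in MODE_SHORTHANDS and not value:
            modes.append(key)
        else:
            others[key] = value
    return modes, others

print(parse_fragment("https://example.org/data#mode=nczarr,s3&log"))
print(parse_fragment("https://example.org/data.nc#bytes"))
```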