netcdf-c/docs/cloud.md
Dennis Heimbigner 49737888ca Improve S3 Documentation and Support
## Improvements to S3 Documentation
* Create a new document *quickstart_paths.md* that give a summary of the legal path formats used by netcdf-c. This includes both file paths and URL paths.
* Modify *nczarr.md* to remove most of the S3 related text.
* Move the S3 text from *nczarr.md* to a new document *cloud.md*.
* Add some S3-related text to the *byterange.md* document.

Hopefully, this will make it easier for users to find the information they want.

## Rebuild NCZarr Testing
In order to avoid problems with running make check in parallel, two changes were made:
1. The *nczarr_test* test system was rebuilt. Now, for each test.
any generated files are kept in a test-specific directory, isolated
from all other test executions.
2. Similarly, since the S3 test bucket is shared, any generated S3 objects
are isolated using a test-specific key path.

## Other S3 Related Changes
* Add code to ensure that files created on S3 are reclaimed at end of testing.
* Used the bash "trap" command to ensure S3 cleanup even if the test fails.
* Cleanup the S3 related configure.ac flag set since S3 is used in several places. So now one should use the option *--enable-s3* instead of *--enable-nczarr-s3*, although the latter is still kept as a deprecated alias for the former.
* Get some of the github actions yml to work with S3; required fixing various test scripts adding a secret to access the Unidata S3 bucket.
* Cleanup S3 portion of libnetcdf.settings.in and netcdf_meta.h.in and test_common.in.
* Merge partial S3 support into dhttp.c.
* Create an experimental s3 access library especially for use with Windows. It is enabled by using the options *--enable-s3-internal* (automake) or *-DENABLE_S3_INTERNAL=ON* (CMake). Also add a unit-test for it.
* Move some definitions from ncrc.h to ncs3sdk.h

## Other Changes
* Provide a default implementation of strlcpy and move this and similar defaults into *dmissing.c*.
2023-04-25 17:15:06 -06:00

14 KiB

Cloud Storage Access Using The NetCDF-C Library

Cloud Storage Access Using The NetCDF-C Library

\tableofcontents

Introduction

The NetCDF-C library supports limited access to cloud storage. Currently, that access is restricted to the Amazon S3 cloud storage, so this document is S3-centric. It is expected that over time, access to additional cloud stores will be added, and this document will be expanded to cover those additional cases.

S3 Use-Cases in NetCDF-C

At the moment, the NetCDF-C library provides access to S3 for the following purposes:

  • Byte-Range access to Netcdf-4/HDF5 datasets stored as single object in S3.
  • Zarr dataset access as represented as a "tree" of objects in S3.

Amazon S3 Storage Access Mechanisms

Three S3 storage drivers are available for accessing Amazon S3.

  1. H5FDros3 -- This is an HDF5 Virtual File Driver provided as part of the HDF5 library. It is specifically used by NetCDF-C to perform byte-range access to netcdf-4 files. In order for this functionality to be available to NetCDF-C, it is necessary to build the HDF5 library with the --enable-ros3-vfd option.
  2. aws-sdk-cpp -- This is the standard Amazon AWS S3 SDK. It is written in C++. It is used specifically for NCZarr access to Zarr formatted datasets on S3.
  3. nch5s3comms -- This is an experimental SDK, written in C, that provides the minimum functionality necessary to access S3. As with aws-sdk-cpp, it is used specifically for NCZarr access to Zarr formatted datasets on S3.

All three S3 drivers use the AWS profile mechanism to provide configuration information, and especially to provide authorization information. Specifically, the ''~/.aws/credentials'' file should contain something like this.

[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...

Byte-Range Access

The NetCDF-C library contains a mechanism for accessing traditional netcdf-4 files stored on remote computers. The idea is to treat the remote data as if it was one big file, and to use the HTTP protocol to read a contiguous sequence of bytes from that remote "file". This is performed using the "byte-range" header in an HTTP request.

In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netdf-4 file, is uploaded into a single object in some bucket. Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object. The dataset object is referenced using a URL with the trailing fragment containing the string #mode=bytes.

An examination of the test program nc_test/test_byterange.sh shows simple examples using the ncdump program. One such test is specified as follows:

https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes

Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, noaa-goes16 in this case, is part of the URL path instead of the host.

The #mode=bytes mechanism generalizes to work with most servers that support byte-range access.

Specifically, Thredds servers support such access using the HttpServer access method as can be seen from this URL taken from the above test program.

https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes

References

[1] Amazon Simple Storage Service Documentation
[2] Amazon Simple Storage Service Library
[11] Conda-forge / packages / aws-sdk-cpp

Appendix A. S3 Build Support

Currently the following build cases are known to work.

Operating SystemBuild SystemSDKS3 Support
Linux Automake aws-s3-sdk yes
Linux Automake nch5s3comms yes
Linux CMake aws-s3-sdk yes
Linux CMake nch5s3comms yes
OSX Automake aws-s3-sdk unknown
OSX Automake nch5s3comms unknown
OSX CMake aws-s3-sdk unknown
OSX CMake nch5s3comms unknown
Visual Studio CMake aws-s3-sdk no (tests-fail)
Visual Studio CMake nch5s3comms yes
Cygwin Automake aws-s3-sdk unknown
Cygwin Automake nch5s3comms yes
Cygwin CMake aws-s3-sdk unknown
Cygwin CMake nch5s3comms unknown
Mingw Automake aws-s3-sdk unknown
Mingw Automake nch5s3comms unknown
Mingw CMake aws-s3-sdk unknown
Mingw CMake nch5s3comms unknown

Automake

There are several options relevant to NCZarr support and to Amazon S3 support. These are as follows.

  1. --enable-s3 -- Enable S3 support.
  2. --enable-s3-internal -- Force use of the nch5s3comms SDK instead of the aws-cpp-sdk (assuming it is available).

A note about using S3 with Automake. If S3 support is desired, and using the Amazon "aws-sdk-cpp" SDK, and using Automake, then LDFLAGS must be properly set, namely to this.

LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"

The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.

Note also that if S3 support is enabled, then you need to have a C++ compiler installed because the "aws-sdk-cpp" S3 support code is written in C++.

CMake

The necessary CMake flags are as follows (with defaults)

  1. -DENABLE_S3 -- Enable S3 support, including NCZarr support if NCZarr is enabled
  2. -DENABLE_S3_INTERNAL -- Force use of the nch5s3comms SDK instead of the aws-cpp-sdk.

Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify -laws-cpp-sdk-s3 assuming that the aws s3 libraries are installed in the default location. For CMake with Visual Studio, the default location is here:

C:/Program Files (x86)/aws-cpp-sdk-all

It is possible to install the sdk library in another location. In this case, one must add the following flag to the cmake command.

cmake ... -DAWSSDK_DIR=\<awssdkdir\>

where "awssdkdir" is the path to the sdk installation. For example, this might be as follows.

cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"

This can be useful if blanks in path names cause problems in your build environment.

Appendix B. Building the S3 SDKs

As mentioned, three S3 storage drivers are available for accessing Amazon S3.

  1. H5FDros3
  2. aws-sdk-cpp
  3. nch5s3comms

Building H5FDros3

This driver is part of the HDF5 library codebase. It must be enabled at the time that the HDF5 library is built by using the --enable-ros3=vfd option. If built, then the NetCDF-C build process should detect it and make use of it.

Building aws-sdk-cpp

Amazon provides (thru AWS-labs) an SDK for accessing the Amazon S3 cloud. This library, aws-sdk-cpp library, has a number of properties of interest:

  • It is written in C++
  • It is available on GitHub,
  • It uses CMake + ninja as its primary build system.

*nix* Build

For linux, the following context works. Of course your mileage may vary.

  • OS: ubuntu 21
  • aws-sdk-cpp version 1.9.96 (or later)
  • Dependencies: openssl, libcurl, cmake, ninja (ninja-build in apt)

AWS-SDK-CPP CMake Build Recipe

git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
PREFIX=/usr/local
FLAGS="-DCMAKE_INSTALL_PREFIX=${PREFIX} \
       -DCMAKE_INSTALL_LIBDIR=lib \
       -DCMAKE_MODULE_PATH=${PREFIX}/lib/cmake \
       -DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
       -DBUILD_ONLY=s3 \
       -DENABLE_UNITY_BUILD=ON \
       -DENABLE_TESTING=OFF \
       -DCMAKE_BUILD_TYPE=$CFG \
       -DSIMPLE_INSTALL=ON"
cmake -GNinja $FLAGS ..
ninja all
ninja install
cd ..
popd

Windows build

It is possible to build and install aws-sdk-cpp on Windows using CMake. Unfortunately, testing currently fails.

For Windows, the following context work. Of course your mileage may vary.

  • OS: Windows 10 64-bit with Visual Studio community edition 2019.
  • aws-sdk-cpp version 1.9.96 (or later)
  • Dependencies: openssl, libcurl, cmake

AWS-SDK-CPP Build Recipe

This command-line build assumes one is using Cygwin or Mingw to provide tools such as bash.

git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
CFG="Release"
PREFIX="c:/tools/aws-sdk-cpp"

FLAGS="-DCMAKE_INSTALL_PREFIX=${PREFIX} \
       -DCMAKE_INSTALL_LIBDIR=lib" \
       -DCMAKE_MODULE_PATH=${PREFIX}/cmake \
       -DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
       -DBUILD_ONLY=s3 \
       -DENABLE_UNITY_BUILD=ON \
       -DCMAKE_BUILD_TYPE=$CFG \
       -DSIMPLE_INSTALL=ON"

rm -fr build
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=${CFG} $FLAGS ..
cmake --build . --config ${CFG}
cmake --install . --config ${CFG}
cd ..
popd

Notice that the sdk is being installed in the directory "c:\tools\aws-sdk-cpp" rather than the default location "c:\Program Files (x86)/aws-sdk-cpp-all" This is because when using a command line, an install path that contains blanks may not work.

In order for CMake to find the aws sdk libraries, the following environment variables must be set:

AWSSDK_ROOT_DIR="c:/tools/aws-sdk-cpp"
AWSSDKBIN="/cygdrive/c/tools/aws-sdk-cpp/bin"
PATH="$PATH:${AWSSDKBIN}"

Then the following options must be specified for cmake.

-DAWSSDK_ROOT_DIR=${AWSSDK_ROOT_DIR}
-DAWSSDK_DIR=${AWSSDK_ROOT_DIR}/lib/cmake/AWSSDK"

Building ``nch5s3comms''

This is an experimental SDK provided internally in the netcdf-c library.

  • It is written in C
  • It provides the minimum functionality necessary to read/write/search an Amazon S3 bucket.
  • It was constructed by heavily modifying the HDF5 H5FDros3 Virtual File Driver and combining it with crypto code wrappers provided by libcurl.
  • The resulting code is rather ugly, but appears to work under at least Linux and under Windows (using Visual C++).

Dependencies

  • *nix*: the following packages need to be installed: openssl, libcurl, (optionally) libxml2.
  • Windows (Visual C++): the following packages need to be installed: libcurl, (optionally) libxml2.

Build Options

In order to enable this SDK, the Automake option --enable-s3-internal or the CMake option -DENABLE_S3_INTERNAL=ON must be specified.

Testing S3 Support

The pure S3 test(s) are in the unit_tests directory. Currently, by default, testing of S3 is supported only for Unidata members of the NetCDF Development Group. This is because it uses a Unidata-specific bucket is inaccessible to the general user.

Appendix C. AWS Selection Algorithms.

If byterange support is enabled, the netcdf-c library will parse the files

${HOME}/.aws/config
and
${HOME}/.aws/credentials

to extract profile names plus a list of key=value pairs. In case of duplicates, credentials takes precedence over config.

This example is typical of the contents of these files.

[default]
    aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
    aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
    aws_region=ZZZZZZZZZ

The keys in the profile will be used to set various parameters in the library

Profile Selection

The algorithm for choosing the active profile to use is as follows:

  1. If the "aws.profile" fragment flag is defined in the URL, then it is used. For example, see this URL.
https://...#mode=nczarr,s3&aws.profile=xxx
  1. If the "AWS.PROFILE" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
  2. Otherwise the profile "default" is used.

The profile named "none" is a special profile that the netcdf-c library automatically defines. It should not be defined anywhere else. It signals to the library that no credentialas are to used. It is equivalent to the "--no-sign-request" option in the AWS CLI. Also, it must be explicitly specified by name. Otherwise "default" will be used.

Region Selection

If the specified URL is of the form

s3://<bucket>/key

Then this is rebuilt to this form:

s3://s2.&lt;region&gt.amazonaws.com>/key

However this requires figuring out the region to use. The algorithm for picking an region is as follows.

  1. If the "aws.region" fragment flag is defined in the URL, then it is used.
  2. The active profile is searched for the "aws_region" key.
  3. If the "AWS.REGION" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
  4. Otherwise use "us-east-1" region.

Authorization Selection

Picking an access-key/secret-key pair is always determined by the current active profile. To choose to not use keys requires that the active profile must be "none".

Change Log

[Note: minor text changes are not included.]

3/8/2023

  1. Add an internal, minimal Amazon S3 SDK to support S3 access especially for Windows.

Point of Contact

Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 3/8/2023
Last Revised: 3/8/2023