Cloud Storage Access Using The NetCDF-C Library
============================
<!-- double header is needed to workaround doxygen bug -->
# Cloud Storage Access Using The NetCDF-C Library {#nccloud_head}
\tableofcontents
# Introduction {#nccloud_introduction}
The NetCDF-C library supports limited access to cloud storage.
Currently, that access is restricted to the Amazon S3 cloud storage,
so this document is S3-centric.
It is expected that over time, access to additional cloud stores will be added,
and this document will be expanded to cover those additional cases.
# S3 Use-Cases in NetCDF-C
At the moment, the NetCDF-C library provides access to S3 for the following purposes:
* Byte-range access to netcdf-4/HDF5 datasets stored as a single object in S3.
* Access to Zarr datasets represented as a "tree" of objects in S3.
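For illustration, these two use-cases correspond roughly to URLs of the following forms (the angle-bracket names are placeholders; the details are discussed in the sections below):
````
# Byte-range access to a netcdf-4/HDF5 dataset stored as a single S3 object
https://s3.<region>.amazonaws.com/<bucket>/<key>#mode=bytes
# NCZarr access to a Zarr dataset stored as a tree of objects under a key prefix
https://s3.<region>.amazonaws.com/<bucket>/<prefix>#mode=nczarr,s3
````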
# Amazon S3 Storage Access Mechanisms {#nccloud_s3_sdks}
Three S3 storage drivers are available for accessing Amazon S3.
1. *H5FDros3* -- This is an HDF5 Virtual File Driver provided as part of the HDF5 library. It is specifically used by NetCDF-C to perform byte-range access to netcdf-4 files. In order for this functionality to be available to NetCDF-C,
it is necessary to build the HDF5 library with the *--enable-ros3-vfd* option.
2. *aws-sdk-cpp* -- This is the standard Amazon AWS S3 SDK. It is written in C++. It is used specifically for NCZarr access to Zarr formatted datasets on S3.
3. *nch5s3comms* -- This is an experimental SDK, written in C, that provides the minimum functionality necessary to access S3. As with *aws-sdk-cpp*, it is used specifically for NCZarr access to Zarr formatted datasets on S3.
All three S3 drivers use the AWS profile mechanism to provide configuration information, and especially to provide authorization information.
Specifically, the _~/.aws/credentials_ file should contain something like this.
```
[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
```
## Byte-Range Access {#nccloud_byterange}
The NetCDF-C library contains a mechanism for accessing traditional netcdf-4 files stored on remote computers.
The idea is to treat the remote data as if it were one big file,
and to use the HTTP protocol to read a contiguous sequence of bytes
from that remote "file".
This is performed using the "byte-range" header in an HTTP request.
In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket.
Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL with the trailing fragment containing the string ````#mode=bytes````.
An examination of the test program _nc_test/test_byterange.sh_ shows simple examples using the _ncdump_ program.
One such test is specified as follows:
````
https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
````
Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, _noaa-goes16_ in this case, is part of the URL path instead of the host.
The _#mode=bytes_ mechanism generalizes to work with most servers that support byte-range access.
Specifically, THREDDS servers support such access via the HTTPServer access method, as can be seen from this URL taken from the above test program.
````
https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
````
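As a quick check, such a URL can be passed directly to _ncdump_, assuming the library was built with byte-range support; for example, using the test object shown above:
````
# Print only the header (metadata) of the remote object via HTTP byte-range requests
ncdump -h "https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes"
````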
# References {#nccloud_bib}
<a name="cloud_ref_aws">[1]</a> [Amazon Simple Storage Service Documentation](https://docs.aws.amazon.com/s3/index.html)<br>
<a name="cloud_ref_awssdk">[2]</a> [Amazon Simple Storage Service Library](https://github.com/aws/aws-sdk-cpp)<br>
<a name="cloud_ref_awssdk_conda">[11]</a> [Conda-forge / packages / aws-sdk-cpp](https://anaconda.org/conda-forge/aws-sdk-cpp)<br>
# Appendix A. S3 Build Support {#nccloud_s3build}
Currently the following build cases are known to work.
<table>
<tr><td><u>Operating System</u><td><u>Build System</u><td><u>SDK</u><td><u>S3 Support</u>
<tr><td>Linux<td>Automake<td>aws-s3-sdk<td>yes
<tr><td>Linux<td>Automake<td>nch5s3comms<td>yes
<tr><td>Linux<td>CMake<td>aws-s3-sdk<td>yes
<tr><td>Linux<td>CMake<td>nch5s3comms<td>yes
<tr><td>OSX<td>Automake<td>aws-s3-sdk<td>unknown
<tr><td>OSX<td>Automake<td>nch5s3comms<td>yes
<tr><td>OSX<td>CMake<td>aws-s3-sdk<td>unknown
<tr><td>OSX<td>CMake<td>nch5s3comms<td>yes
<tr><td>Visual Studio<td>CMake<td>aws-s3-sdk<td>no (tests fail)
<tr><td>Visual Studio<td>CMake<td>nch5s3comms<td>yes
<tr><td>Cygwin<td>Automake<td>aws-s3-sdk<td>no (tests fail)
<tr><td>Cygwin<td>Automake<td>nch5s3comms<td>yes
<tr><td>Cygwin<td>CMake<td>aws-s3-sdk<td>no
<tr><td>Cygwin<td>CMake<td>nch5s3comms<td>yes
<tr><td>Mingw<td>Automake<td>aws-s3-sdk<td>unknown
<tr><td>Mingw<td>Automake<td>nch5s3comms<td>yes
<tr><td>Mingw<td>CMake<td>aws-s3-sdk<td>unknown
<tr><td>Mingw<td>CMake<td>nch5s3comms<td>yes
</table>
## Automake
There are several options relevant to Amazon S3 support.
These are as follows.
1. _--enable-s3_ -- Enable S3 support.
2. _--enable-s3-internal_ -- Force use of the *nch5s3comms* SDK instead of the *aws-sdk-cpp* SDK (assuming the latter is available).
3. _--with-s3-testing_=yes|no|public -- "yes" means do all S3 tests, "no" means do no S3 testing, "public" means do only those tests that involve publicly accessible S3 data.
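For example, a hypothetical configure invocation that enables S3 support via the internal SDK and restricts testing to public data might look like this (combine with whatever other options your build requires):
````
./configure --enable-s3 --enable-s3-internal --with-s3-testing=public
````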
__A note about using S3 with Automake.__
If S3 support using the Amazon "aws-sdk-cpp" SDK is desired under Automake, then LDFLAGS must be set appropriately, namely as follows.
````
LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"
````
The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.
Note also that if S3 support is enabled, then you need to have a C++ compiler installed because the "aws-sdk-cpp" S3 support code is written in C++.
## CMake {#nccloud_cmake}
The relevant CMake options are as follows.
1. *-DENABLE_S3* -- Control S3 support.
2. *-DENABLE_S3_INTERNAL* -- Force use of the *nch5s3comms* SDK instead of the *aws-sdk-cpp* SDK.
3. *-DWITH_S3_TESTING*=ON|OFF|PUBLIC -- "ON" means do all S3 tests, "OFF" means do no S3 testing, "PUBLIC" means do only those tests that involve publicly accessible S3 data.
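For example, a hypothetical CMake configuration enabling S3 support with the internal SDK and public-data testing only might look like this:
````
cmake .. -DENABLE_S3=ON -DENABLE_S3_INTERNAL=ON -DWITH_S3_TESTING=PUBLIC
````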
Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify _-laws-cpp-sdk-s3_ assuming that the aws s3 libraries are installed in the default location.
For CMake with Visual Studio, the default location is here:
````
C:/Program Files (x86)/aws-cpp-sdk-all
````
It is possible to install the sdk library in another location.
In this case, one must add the following flag to the cmake command.
````
cmake ... -DAWSSDK_DIR=<awssdkdir>
````
where "awssdkdir" is the path to the sdk installation.
For example, this might be as follows.
````
cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
````
This can be useful if blanks in path names cause problems in your build environment.
# Appendix B. Building the S3 SDKs {#nccloud_s3sdk}
As mentioned, three S3 storage drivers are available for accessing Amazon S3.
1. *H5FDros3*
2. *aws-sdk-cpp*
3. *nch5s3comms*
## Building *H5FDros3*
This driver is part of the HDF5 library codebase.
It must be enabled at the time that the HDF5 library is built
by using the *--enable-ros3-vfd* option.
If built, then the NetCDF-C build process should detect it and make use of it.
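For reference, a minimal illustrative HDF5 configure invocation enabling this driver might look like the following; consult the HDF5 documentation for the options appropriate to your installation.
````
# Build HDF5 with the read-only S3 (ros3) virtual file driver enabled
./configure --enable-ros3-vfd --prefix=/usr/local
````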
## Building *aws-sdk-cpp*
Amazon provides (through AWS Labs) an SDK for accessing the Amazon S3 cloud.
This library, the [aws-sdk-cpp library](https://github.com/aws/aws-sdk-cpp.git),
has a number of properties of interest:
* It is written in C++.
* It is available on [GitHub](https://github.com/aws/aws-sdk-cpp.git).
* It uses CMake as its primary build system.
### *\*nix\** Build
For Linux, the following context works. Of course your mileage may vary.
* OS: Ubuntu 21
* aws-sdk-cpp version 1.9.96 (or later)
* Dependencies: openssl, libcurl, cmake, ninja (ninja-build using *apt-get*)
#### AWS-SDK-CPP CMake Build Recipe
````
git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
PREFIX=/usr/local
FLAGS="-DCMAKE_INSTALL_PREFIX=/usr/local \
-DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
-DBUILD_ONLY=s3;transfer \
-DENABLE_UNITY_BUILD=ON \
-DCMAKE_BUILD_TYPE=Release \
-DSIMPLE_INSTALL=ON \
-DENABLE_TESTING=OFF"
cmake $FLAGS ..
cmake --build . --config Release
sudo cmake --install . --config Release
cd ..
popd
````
### Windows build
It is possible to build and install aws-sdk-cpp on Windows using CMake.
Unfortunately, testing currently fails.
For Windows, the following context works. Of course your mileage may vary.
* OS: Windows 10 64-bit with Visual Studio community edition 2019.
* aws-sdk-cpp version 1.9.96 (or later)
* Dependencies: openssl, libcurl, cmake
#### AWS-SDK-CPP Build Recipe
This command-line build assumes one is using Cygwin or Mingw to provide
tools such as bash.
````
git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
CFG="Release"
PREFIX="c:/tools/aws-sdk-cpp"
FLAGS="-DCMAKE_INSTALL_PREFIX=${PREFIX} \
-DCMAKE_INSTALL_LIBDIR=lib" \
-DCMAKE_MODULE_PATH=${PREFIX}/cmake \
-DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
-DBUILD_ONLY=s3 \
-DENABLE_UNITY_BUILD=ON \
-DCMAKE_BUILD_TYPE=$CFG \
-DSIMPLE_INSTALL=ON"
rm -fr build
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=${CFG} $FLAGS ..
cmake --build . --config ${CFG}
cmake --install . --config ${CFG}
cd ..
popd
````
Notice that the sdk is being installed in the directory "c:\tools\aws-sdk-cpp"
rather than the default location "c:/Program Files (x86)/aws-cpp-sdk-all".
This is because, when working from the command line, an install path that contains
blanks may not work.
In order for CMake to find the aws sdk libraries,
the following environment variables must be set:
````
AWSSDK_ROOT_DIR="c:/tools/aws-sdk-cpp"
AWSSDKBIN="/cygdrive/c/tools/aws-sdk-cpp/bin"
PATH="$PATH:${AWSSDKBIN}"
````
Then the following options must be specified for cmake.
````
-DAWSSDK_ROOT_DIR=${AWSSDK_ROOT_DIR}
-DAWSSDK_DIR=${AWSSDK_ROOT_DIR}/lib/cmake/AWSSDK
````
## Building *nch5s3comms*
This is an experimental SDK provided internally in the netcdf-c library.
* It is written in C
* It provides the minimum functionality necessary to read/write/search an Amazon S3 bucket.
* It was constructed by heavily modifying the HDF5 *H5Fs3commons.c* file and combining it with crypto code wrappers provided by libcurl. The resulting file was then modified to fit into the netcdf coding style.
* The resulting code is rather ugly, but appears to work under at least Linux and under Windows (using Visual C++).
### Dependencies
* *\*nix\**: the following packages need to be installed: openssl, libcurl, (optionally) libxml2.
* *Windows (Visual C++)*: the following packages need to be installed: libcurl, (optionally) libxml2.
### Build Options
In order to enable this SDK, the Automake option *--enable-s3-internal* or the CMake option *-DENABLE_S3_INTERNAL=ON* must be specified.
### Testing S3 Support {#nccloud_testing_S3_support}
The pure S3 test(s) are in the _unit_tests_ directory.
Currently, by default, testing of S3 is supported only for Unidata members of the NetCDF Development Group.
This is because it uses a Unidata-specific bucket that is inaccessible to the general user.
# Appendix C. AWS Selection Algorithms. {#nccloud_awsselect}
If byterange support is enabled, the netcdf-c library will parse the files
````
${HOME}/.aws/config
and
${HOME}/.aws/credentials
````
to extract profile names plus a list of key=value pairs.
In case of duplicates, *credentials* takes precedence over *config*.
This example is typical of the contents of these files.
````
[default]
aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
aws_region=ZZZZZZZZZ
````
The keys in the profile will be used to set various parameters in the library.
## Profile Selection
The algorithm for choosing the active profile to use is as follows:
1. If the "aws.profile" fragment flag is defined in the URL, then it is used. For example, see this URL.
````
https://...#mode=nczarr,s3&aws.profile=xxx
````
2. If the "AWS.PROFILE" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
3. If the profile "default" is defined, then it is used.
4. Otherwise, the profile "no" is used.
The profile named "no" is a special profile that the netcdf-c library automatically defines.
It should not be defined anywhere else. It signals to the library that no credentials are to be used.
It is equivalent to the "--no-sign-request" option in the AWS CLI.
## Region Selection
If the specified URL is of the form
````
s3://<bucket>/key
````
Then this is rebuilt to this form:
````
s3://s3.<region>.amazonaws.com/key
````
However, this rebuild requires figuring out the region to use.
The algorithm for picking a region is as follows.
1. If the "aws.region" fragment flag is defined in the URL, then it is used.
2. The active profile is searched for the "aws_region" key.
3. If the "AWS.REGION" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
4. Otherwise, the "us-east-1" region is used.
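As an illustration, both the region and the profile can be supplied as fragment parameters on the dataset URL (placeholder bucket and key; the parameter names are those described above):
````
ncdump -h "s3://<bucket>/<key>#mode=nczarr,s3&aws.region=us-west-2&aws.profile=default"
````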
## Authorization Selection
Picking an access-key/secret-key pair is always determined
by the currently active profile. Choosing not to use keys
requires that the active profile be "no".
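For example, to force unsigned requests one might select the "no" profile in the .rc file (a hypothetical _.dodsrc_ entry):
````
AWS.PROFILE=no
````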
# Change Log {#nccloud_changelog}
[Note: minor text changes are not included.]
## 3/8/2023
1. Add an internal, minimal Amazon S3 SDK to support S3 access especially for Windows.
# Point of Contact {#nccloud_poc}
__Author__: Dennis Heimbigner<br>
__Email__: dmh at ucar dot edu<br>
__Initial Version__: 3/8/2023<br>
__Last Revised__: 3/8/2023