netcdf-c/ncdump/nccopy.1
Dennis Heimbigner f376c23329 Make utilities support NC_COMPACT
re: https://github.com/Unidata/netcdf-c/issues/1642

Modify ncdump, nccopy, and ncgen to support the NC_COMPACT storage option.
Added test cases and added description to the man pages for the utilities.

1. ncdump: For a compact storage variable, print the special attribute _Storage as
````
    <var>: _Storage = "compact";
````

2. ncgen: parse and implement
````
    <var>: _Storage = "compact";
````
in a .cdl file

3. nccopy: Extend the chunk specification (-c flag) to support
   compact using the forms
````
nccopy ... -c <var>:compact
````
and
````
nccopy ... -c <var>:contiguous
````

Misc. other changes
1. Clean up the copy_chunking function in ncdump/nccopy.c
2020-02-29 12:06:21 -07:00


.\" $Id: nccopy.1 400 2010-08-27 21:02:52Z russ $
.TH NCCOPY 1 "2012-03-08" "Release 4.2" "UNIDATA UTILITIES"
.SH NAME
nccopy \- Copy a netCDF file, optionally changing format, compression, or chunking in the output.
.SH SYNOPSIS
.ft B
.HP
nccopy
.nh
\%[\-k \fI kind_name \fP]
\%[\-\fIkind_code\fP]
\%[\-d \fI n \fP]
\%[\-s]
\%[\-c \fI chunkspec \fP]
\%[\-u]
\%[\-w]
\%[\-[v|V] var1,...]
\%[\-[g|G] grp1,...]
\%[\-m \fI bufsize \fP]
\%[\-h \fI chunk_cache \fP]
\%[\-e \fI cache_elems \fP]
\%[\-r]
\%[\-F \fI filterspec \fP]
\%[\-L \fI n \fP]
\%[\-M \fI n \fP]
\%\fI infile \fP
\%\fI outfile \fP
.hy
.ft
.SH DESCRIPTION
.LP
The \fBnccopy\fP utility copies an input netCDF file in any supported
format variant to an output netCDF file, optionally converting the
output to any compatible netCDF format variant, compressing the data,
or rechunking the data. For example, if built with the netCDF-3
library, a netCDF classic file may be copied to a netCDF 64-bit offset
file, permitting larger variables. If built with the netCDF-4
library, a netCDF classic file may be copied to a netCDF-4 file or to
a netCDF-4 classic model file as well, permitting data compression,
efficient schema changes, larger variable sizes, and use of other
netCDF-4 features.
.LP
If no output format is specified, with either \-k \fIkind_name\fP
or \-\fIkind_code\fP, then the output will use the same
format as the input, unless the input is classic or 64-bit offset
and either chunking or compression is specified, in which case the
output will be netCDF-4 classic model format. Attempting
some kinds of format conversion will result in an error, if the
conversion is not possible. For example, an attempt to copy a
netCDF-4 file that uses features of the enhanced model, such as
groups or variable-length strings, to any of the other kinds of netCDF
formats that use the classic model will result in an error.
.LP
\fBnccopy\fP also serves as an example of a generic netCDF-4 program,
with its ability to read any valid netCDF file and handle nested
groups, strings, and user-defined types, including arbitrarily
nested compound types, variable-length types, and data of any valid
netCDF-4 type.
.LP
If DAP support was enabled when \fBnccopy\fP was built, the input file name may
specify a DAP URL. This may be used to convert data on DAP servers to
local netCDF files.
.SH OPTIONS
.IP "\fB \-k \fP \fI kind_name \fP"
Use format name to specify the kind of file to be created
and, by inference, the data model (i.e. netcdf-3 (classic) or
netcdf-4 (enhanced)). The possible arguments are:
.RS
.RS
.IP "'nc3' or 'classic' => netCDF classic format"
.IP "'nc6' or '64-bit offset' => netCDF 64-bit format"
.IP "'nc4' or 'netCDF-4' => netCDF-4 format (enhanced data model)"
.IP "'nc7' or 'netCDF-4 classic model' => netCDF-4 classic model format"
.RE
.RE
.IP
Note: The old format numbers '1', '2', '3', and '4', equivalent
to the format names 'nc3', 'nc6', 'nc4', and 'nc7' respectively, are
also still accepted but deprecated, due to easy confusion between
format numbers and format names.
.IP "[\fB-\fP\fIkind_code\fP]"
Use format numeric code (instead of format name) to specify the kind of file to be created
and, by inference, the data model (i.e. netcdf-3 (classic) versus
netcdf-4 (enhanced)). The numeric codes are:
.RS
.RS
.IP "3 => netcdf classic format"
.IP "6 => netCDF 64-bit format"
.IP "4 => netCDF-4 format (enhanced data model)"
.IP "7 => netCDF-4 classic model format"
.RE
.RE
.IP
The numeric code "7" is used because "7=3+4", specifying the format
that uses the netCDF-3 data model for compatibility with the netCDF-4
storage format for performance. Credit is due to NCO for use of these
numeric codes instead of the old and confusing format numbers.
.IP "\fB \-d \fP \fI n \fP"
For netCDF-4 output, including netCDF-4 classic model, specify
deflation level (level of compression) for variable data output. 0
corresponds to no compression and 9 to maximum compression, with
higher levels of compression requiring marginally more time to
compress or uncompress than lower levels. Compression achieved may
also depend on output chunking parameters. If this option is
specified for a classic format or 64-bit offset format input file, it
is not necessary to also specify that the output should be netCDF-4
classic model, as that will be the default. If this option is not
specified and the input file has compressed variables, the compression
will still be preserved in the output, using the same chunking as in
the input by default.
.IP
Note that \fBnccopy\fP requires all variables to be compressed using the
same compression level, but the API has no such restriction. With
a program you can customize compression for each variable independently.
.IP "\fB \-s \fP"
For netCDF-4 output, including netCDF-4 classic model, specify
shuffling of variable data bytes before compression or after
decompression. Shuffling refers to interlacing of bytes in a chunk so
that the first bytes of all values are contiguous in storage, followed
by all the second bytes, and so on, which often improves compression.
This option is ignored unless a non-zero deflation level is specified.
Using \-d0 to specify no deflation on input data that has been
compressed and shuffled turns off both compression and shuffling in
the output.
.IP "\fB \-u \fP"
Convert any unlimited size dimensions in the input to fixed size
dimensions in the output. This can speed up variable-at-a-time
access, but slow down record-at-a-time access to multiple variables
along an unlimited dimension.
.IP "\fB \-w \fP"
Keep output in memory (as a diskless netCDF file) until output is
closed, at which time the output file is written to disk. This can
greatly speed up operations such as converting unlimited dimensions to
fixed size (\-u option), chunking, rechunking, or compressing the
input. It requires that available memory is large enough to hold the
output file. This option may provide a larger speedup than careful
tuning of the \-m, \-h, or \-e options, and it's certainly a lot simpler.
.IP "\fB \-c \fP \fIchunkspec\fP"
For netCDF-4 output, including netCDF-4 classic model, specify
chunking (multidimensional tiling) for variable data in the output.
This is useful to specify the units of disk access, compression, or
other filters such as checksums. Changing the chunking in a netCDF
file can also greatly speed up access, by choosing chunk shapes that
are appropriate for the most common access patterns.
.IP
The \fIchunkspec\fP argument has several forms. The first form is the
original, deprecated form and is a string of comma-separated associations,
each specifying a dimension name, a '/' character, and optionally the
corresponding chunk length for that dimension. No blanks should
appear in the chunkspec string, except possibly escaped blanks that
are part of a dimension name. A chunkspec names at least one
dimension, and may omit dimensions which are not to be chunked or for
which the default chunk length is desired. If a dimension name is
followed by a '/' character but no subsequent chunk length, the actual
dimension length is assumed. If copying a classic model file to a
netCDF-4 output file and not naming all dimensions in the chunkspec,
unnamed dimensions will also use the actual dimension length for the
chunk length. An example of a chunkspec for variables that use 'm'
and 'n' dimensions might be 'm/100,n/200' to specify 100 by 200
chunks. To see the chunking resulting from copying with a chunkspec,
use the '\-s' option of ncdump on the output file.
.IP
The chunkspec '/' that omits all dimension names and
corresponding chunk lengths specifies that no chunking is to occur in
the output, so it can be used to unchunk all the chunked variables.
.IP
As an I/O optimization, \fBnccopy\fP has a threshold for the minimum size of
non-record variables that get chunked, currently 8192 bytes. The \-M option
can be used to override this value.
.IP
Note that \fBnccopy\fP requires variables that share a dimension to also
share the chunk size associated with that dimension, but the
programming interface has no such restriction. If you need to
customize chunking for variables independently, you will need to use
the second form of chunkspec. This second form of chunkspec has this
syntax: \fI var:n1,n2,...,nn \fP. This assumes that the variable named
"var" has rank n. The chunking to be applied to each dimension of the
variable is specified by the values of n1 through nn. This second
form of chunking specification can be repeated multiple times to specify
the exact chunking for different variables.
If the variable is specified but no chunk sizes are specified
(i.e. \fI -c var: \fP)
then chunking is disabled for that variable.
If the same variable is specified
more than once, the second and later specifications are ignored.
Also, this second form, per-variable chunking, takes precedence over any
per-dimension chunking except the bare "/" case.
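.IP
As an illustration of the per-variable form, assuming hypothetical
three-dimensional variables named temp and pres in a file in.nc, the
following sketch would request 10 by 20 by 30 chunks for temp and
disable chunking for pres:
.RS
.HP
nccopy \-c temp:10,20,30 \-c pres: in.nc out.nc
.RE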
.IP
The third form of the \fIchunkspec\fP has the
syntax: \fI var:compact\fP or \fI var:contiguous\fP.
This explicitly attempts to set the variable storage type as
compact or contiguous, respectively. These may be overridden
if other flags require the variable to be chunked.
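.IP
For example, assuming a small hypothetical variable named station_id
and a larger one named temp, a command of the following shape would
ask for compact storage for the first and contiguous storage for the second:
.RS
.HP
nccopy \-c station_id:compact \-c temp:contiguous in.nc out.nc
.RE
.IP
As noted above, either request may be overridden if other options
require the variable to be chunked.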
.IP "\fB \-v \fP \fI var1,... \fP"
The output will include data values for the specified variables, in
addition to the declarations of all dimensions, variables, and
attributes. One or more variables must be specified by name in the
comma-delimited list following this option. The list must be a single
argument to the command, hence cannot contain unescaped blanks or
other white space characters. The named variables must be valid netCDF
variables in the input-file. A variable within a group in a netCDF-4
file may be specified with an absolute path name, such as
"/GroupA/GroupA2/var". Use of a relative path name such as 'var' or
"grp/var" specifies all matching variable names in the file. The
default, without this option, is to include data values for \fI all \fP variables
in the output.
.IP "\fB \-V \fP \fI var1,... \fP"
The output will include only the specified variables, along with all dimensions and
global or group attributes. One or more variables must be specified by name in the
comma-delimited list following this option. The list must be a single argument
to the command, hence cannot contain unescaped blanks or other white space
characters. The named variables must be valid netCDF variables in the
input-file. A variable within a group in a netCDF-4 file may be specified with
an absolute path name, such as '/GroupA/GroupA2/var'. Use of a relative path
name such as 'var' or 'grp/var' specifies all matching variable names in the
file. The default, without this option, is to include \fI all \fP variables in the
output.
.IP "\fB \-g \fP \fI grp1,... \fP"
The output will include data values only for the specified groups.
One or more groups must be specified by name in the comma-delimited
list following this option. The list must be a single argument to the
command. The named groups must be valid netCDF groups in the
input-file. The default, without this option, is to include data values for all
groups in the output.
.IP "\fB \-G \fP \fI grp1,... \fP"
The output will include only the specified groups.
One or more groups must be specified by name in the comma-delimited
list following this option. The list must be a single argument to the
command. The named groups must be valid netCDF groups in the
input-file. The default, without this option, is to include all groups in the
output.
.IP "\fB \-m \fP \fI bufsize \fP"
An integer or floating-point number that specifies the size, in bytes,
of the copy buffer used to copy large variables. A suffix of K, M, G,
or T multiplies the copy buffer size by one thousand, million,
billion, or trillion, respectively. The default is 5 Mbytes,
but will be increased if necessary to hold at least one chunk of
netCDF-4 chunked variables in the input file. You may want to specify
a value larger than the default for copying large files over high
latency networks. Using the '\-w' option may provide better
performance, if the output fits in memory.
.IP "\fB \-h \fP \fI chunk_cache \fP"
For netCDF-4 output, including netCDF-4 classic model, an integer or
floating-point number that specifies the size in bytes of chunk cache
allocated for each chunked variable. This is not a property of the file, but merely
a performance tuning parameter for avoiding compressing or
decompressing the same data multiple times while copying and changing
chunk shapes. A suffix of K, M, G, or T multiplies the chunk cache
size by one thousand, million, billion, or trillion, respectively.
The default is 4.194304 Mbytes (or whatever was specified for the
configure-time constant CHUNK_CACHE_SIZE when the netCDF library was
built). Ideally, the \fBnccopy\fP utility should accept only one memory
buffer size and divide it optimally between a copy buffer and chunk
cache, but no general algorithm for computing the optimum chunk cache
size has been implemented yet. Using the '\-w' option may provide
better performance, if the output fits in memory.
.IP "\fB \-e \fP \fI cache_elems \fP"
For netCDF-4 output, including netCDF-4 classic model, specifies
number of chunks that the chunk cache can hold. A suffix of K, M, G,
or T multiplies the number of chunks that can be held in the cache
by one thousand, million, billion, or trillion, respectively. This is not a
property of the file, but merely a performance tuning parameter for
avoiding compressing or decompressing the same data multiple times
while copying and changing chunk shapes. The default is 1009 (or
whatever was specified for the configure-time constant
CHUNK_CACHE_NELEMS when the netCDF library was built). Ideally, the
\fBnccopy\fP utility should determine an optimum value for this parameter,
but no general algorithm for computing the optimum number of chunk
cache elements has been implemented yet.
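.IP
As a sketch of how the \-m, \-h, and \-e tuning options combine, the
following command rechunks a hypothetical file slow.nc using a 100 million
byte copy buffer, a 500 million byte chunk cache, and room for 2000 chunks
in the cache. The values are illustrative only, not recommendations:
.RS
.HP
nccopy \-m 100M \-h 500M \-e 2000 \-c time/1000,lat/40,lon/40 slow.nc fast.nc
.RE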
.IP "\fB \-r \fP"
Read netCDF classic or 64-bit offset input file into a diskless netCDF
file in memory before copying. Requires that input file be small
enough to fit into memory. For \fBnccopy\fP, this doesn't seem to provide
any significant speedup, so may not be a useful option.
.IP "\fB \-L \fP \fIn\fP"
Set the log level; only usable if nccopy supports netCDF-4 (enhanced).
.IP "\fB \-M \fP \fIn\fP"
Set the minimum chunk size; only usable if nccopy supports netCDF-4 (enhanced).
.IP "\fB \-F \fP \fIfilterspec\fP"
For netCDF-4 output, including netCDF-4 classic model, specify a filter
to apply to a specified set of variables in the output. As a rule, the filter
is a compression/decompression algorithm with a unique numeric identifier
assigned by the HDF Group (see https://support.hdfgroup.org/services/filters.html).
.IP
The \fIfilterspec\fP argument has this general form:
.RS
.nf
fqn1|fqn2...,filterid,param1,param2...paramn
or
*,filterid,param1,param2...paramn
.fi
.RE
An fqn (fully qualified name) is the name
of a variable prefixed by its containing
groups with the group names separated by forward slash ('/').
An example might be \fI/g1/g2/var\fP. Alternatively,
just the variable name can be given if it is in the root group:
e.g. \fIvar\fP. Backslash escapes may be used as needed.
A note of warning: the '|' separator is a bash reserved character, so you will
probably need to put the filter spec in some kind of quotes or otherwise escape it.
.IP
The filterid is an unsigned positive integer representing the id
assigned by the HDF Group to the filter. Following the id is a sequence of
parameters defining the operation of the filter. Each parameter
is a 32-bit unsigned integer.
.IP
This option may be repeated multiple times with different
variable names.
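.IP
As an illustration only, the bzip2 filter is registered with the HDF Group
under id 307 and takes a single parameter, the compression level.
Assuming that filter plugin is installed where the HDF5 library can find it,
and assuming a hypothetical variable var in group /g1/g2,
a filter specification might look like this:
.RS
.HP
nccopy \-F '/g1/g2/var,307,9' in.nc out.nc
.RE
.IP
The quotes are not strictly needed in this case, but they become necessary
once the specification contains '|' or '*'.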
.SH EXAMPLES
.LP
Make a copy of foo1.nc, a netCDF file of any type, to foo2.nc, a
netCDF file of the same type:
.RS
.HP
nccopy foo1.nc foo2.nc
.RE
.LP
Note that the above copy will not be as fast as use of cp or other
simple copy utility, because the file is copied using only the netCDF
API. If the input file has extra bytes after the end of the netCDF
data, those will not be copied, because they are not accessible
through the netCDF interface. If the original file was generated in
"No fill" mode so that fill values are not stored for padding for data
alignment, the output file may have different padding bytes.
.LP
Convert a netCDF-4 classic model file, compressed.nc, that uses compression,
to a netCDF-3 file classic.nc:
.RS
.HP
nccopy \-k classic compressed.nc classic.nc
.RE
.LP
Note that 'nc3' could be used instead of 'classic'.
.LP
Download the variable 'time_bnds' and its associated attributes from
an OPeNDAP server and copy the result to a netCDF file named 'tb.nc':
.RS
.HP
nccopy 'http://test.opendap.org/opendap/data/nc/sst.mnmean.nc.gz?time_bnds' tb.nc
.RE
.LP
Note that URLs that name specific variables as command-line arguments
should generally be quoted, to avoid the shell interpreting special
characters such as '?'.
.LP
Compress all the variables in the input file foo.nc, a netCDF file of any
type, to the output file bar.nc:
.RS
.HP
nccopy \-d1 foo.nc bar.nc
.RE
.LP
If foo.nc was a classic or 64-bit offset netCDF file, bar.nc will be a
netCDF-4 classic model netCDF file, because the classic and 64-bit
offset format variants don't support compression. If foo.nc was a
netCDF-4 file with some variables compressed using various deflation
levels, the output will also be a netCDF-4 file of the same type, but
all the variables, including any uncompressed variables in the input,
will now use deflation level 1.
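.LP
To add byte shuffling before deflation, combine \-s with a non-zero
deflation level. For the same foo.nc, a sketch with an arbitrarily
chosen level of 5 would be:
.RS
.HP
nccopy \-d5 \-s foo.nc bar.nc
.RE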
.LP
Assume the input data includes gridded variables that use time, lat,
lon dimensions, with 1000 times by 1000 latitudes by 1000 longitudes,
and that the time dimension varies most slowly. Also assume that
users want quick access to data at all times for a small set of
lat-lon points. Accessing data for 1000 times would typically require
accessing 1000 disk blocks, which may be slow.
.LP
Reorganizing the data into chunks on disk that have all the time in
each chunk for a few lat and lon coordinates would greatly speed up
such access. To chunk the data in the input file slow.nc, a netCDF
file of any type, to the output file fast.nc, you could use:
.RS
.HP
nccopy \-c time/1000,lat/40,lon/40 slow.nc fast.nc
.RE
.LP
to specify data chunks of 1000 times, 40 latitudes, and 40 longitudes.
If you had enough memory to contain the output file, you could speed
up the rechunking operation significantly by creating the output in
memory before writing it to disk on close (using the \-w flag):
.RS
.HP
nccopy \-w \-c time/1000,lat/40,lon/40 slow.nc fast.nc
.RE
.LP
Alternatively, one could write this using the alternate, variable-specific
chunking specification, assuming that time, lat, and lon
are variables:
.RS
.HP
nccopy \-c time:1000 \-c lat:40 \-c lon:40 slow.nc fast.nc
.RE
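.LP
To check the result, the special attributes that record storage layout
can be listed with the \-s option of ncdump, here combined with \-h so
that only the header is shown:
.RS
.HP
ncdump \-h \-s fast.nc
.RE
.LP
Chunked variables should then show a _ChunkSizes attribute.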
.LP
.SH "CHUNKING RULES"
.LP
The complete set of chunking rules is captured here. As a rough
summary, these rules preserve all chunking properties from the
input file. These rules apply only when the selected output
format supports chunking, i.e. for the netCDF-4 variants.
.LP
The variable-specific chunking specification
translates directly to the corresponding "nc_def_var_chunking"
API call.
.LP
.\" see: https://github.com/Unidata/netcdf-c/issues/725
The original per-dimension chunking specification requires some
interpretation by nccopy.
The following rules are applied in the given order independently
for each variable to be copied from input to output. The rules are
written assuming we are trying to determine the chunking for a given
output variable Vout that comes from an input variable Vin.
.IP "1."
If there is no '-c' option that applies to a variable and the
corresponding input variable is contiguous or the input is some
netcdf-3 variant, then let the netcdf-c library make all chunking
decisions.
.IP "2."
For each dimension of Vout explicitly specified on the command line
(using the '-c' option), apply the chunking value for that
dimension regardless of input format or input properties.
.IP "3."
For dimensions of Vout not named on the command line in a '-c' option, preserve chunk
sizes from the corresponding input variable, if it is chunked.
.IP "4."
If Vin is contiguous, and none of its dimensions are
named on the command line, and chunking is not mandated by other
options, then make Vout be contiguous.
.IP "5."
If the input variable is contiguous (or the input is some netCDF-3
variant) and there are no options requiring chunking, or the '/'
special case for the '\-c' option is specified, then the output
variable Vout is marked as contiguous.
.IP "6."
Final, default case: some or all chunk sizes are not
determined by the command line or the input
variable. This includes the non-chunked input cases such
as netcdf-3, cdf5, and DAP. In these cases retain all
chunk sizes determined by previous rules, and use the full
dimension size as the default. The exception is unlimited dimensions,
where the default is 4 megabytes.
.SH "SEE ALSO"
.LP
.BR ncdump (1),
.BR ncgen (1),
.BR netcdf (3)