netcdf-c/docs/indexing.dox

/** \file

\internal

\page Indexed Access to Metadata Objects

\tableofcontents

The original internal representations of metadata in memory
relied on linear searching of lists to locate various objects
by name or by numeric id: by _varid_ or by _grpid_ for example.

In recent years, the flaws in that approach have become obvious
as users create files with extremely large numbers of objects:
groups, variables, attributes, and dimensions. One case
has 14 megabytes of metadata. Creating and (especially) later
opening such files was exceedingly slow.

This problem was partially alleviated in both netcdf-3 (libsrc)
and netcdf-4 (libsrc4) by adding name hashing tables.
However, and especially for netcdf-4, linear search still prevailed.

A pervasive change has been made to try to change searches by name
or by id from O(n) to O(1).
This uses hashing for name-based search
and vectors for numeric id-based search.
All other cases were left as O(n) searches.

This document describes the architecture and details of the netCDF
internal object lookup mechanisms now in place.

\section Sindexed_searches Indexed Searches

There are, as a rule, two searches that are used to locate
metadata objects: (1) search by name and (2) search by
externally visible id (e.g. dimid or varid).

It is currently the case that after all the metadata is read or
created, hashing is used for locating objects by name. In all
other cases -- apparently -- lookup is by linear search of a
of linked list.

It is relevant that, once created, no metadata object -- except
attributes -- can be deleted.  They can be renamed, but that
does not change the associated structure or id.  Deletion only
occurs when an error occurs in creating an object or on invoking
"nc_close()".

The numeric identifiers for dimensions, types, and groups are
all globally unique across a file.  But note that variable id's
are not globally unique (IMO a bad design decision) but are only
unique within the containing group.  Thus, in order to provide a
unique id for a variable it must be composed of the containing
group id plus the variable id.

Note also that names are unique only within a group and with respect
to some kind of metadata. That is a group cannot have e.g. two
dimensions with the same name. But it can have a variable and a dimension
with the same name (as with coordinate variables).

Finally, attribute names are unique only with respect to each other
and with respect to the containing object (a variable or a group).

\section Sbasic_data_structures Basic Data Structures

The basic data structures used by the new lookup mechanisms
are described in the following sections.

\subsection Snclist NClist

With rare exceptions, vectors of objects are maintained as
instances of NClist, which provides a dynamically extendible
vector of pointers: pointers to metadata objects in this case.
It is possible to append new objects or insert at a specific
vector offset, or overwrite an existing pointer at a specific
offset.

The NClist structure definition is as follows.

\code
typedef struct NClist {
  size_t alloc;
  size_t length;
  void** content;
} NClist;
\endcode

\subsection Snc_exhashmap NCexhashmap

The NCexhashmap type is a hash table mapping a key
to a data item. As a rule, the data item is a pointer to a
metadata object. This hash map uses Fagin's extendible hashing
algorithm. The current implementation supports table
expansion as described for that algorithm.

The hashtable definition is as follows.
\code
typedef struct NCexhashmap {
    int depth; /* Global depth */
    int nactive; /* # of active entries in whole table */
    NCexleaf** directory; /* |directory| == 2^depth */
    ...
} NCexhashmap;
\endcode

where the directory (of leaf pointers) has (1<<depth) pointers,
and where nactive is the total number of entries. There are other
fields not relevant to this discussion.

An entry in a leaf has this form:
\code
typedef struct NCexentry {
    ncexhashkey_t hashkey; /* Hash id */
    uintptr_t data;
} NCexentry;
\endcode

The _data_ field is of type _uintptr\_t_. Note that we assume
that sizeof(unintptr_t) == sizeof(void*). This is very
important.  It is supposed to be the case that a value of type
uintptr_t is an integer of sufficient size to hold a void* pointer.

This means that the data field can hold an unsigned integer or a
void* pointer. As a pointer, it often points to an instance of a
variable, or dimension, or other object.

The hashkey field is a CRC64 hash of the key. It is assumed that
64 bits is sufficient to guarantee that the hash of the key (a
string typically) is always unique.

This means that comparing hashkeys should be sufficient to
ensure that the corresponding keys are the same because hash
collisions will not occur. If a violation of this assumption
is detected, then a netcdf library internal error code is returned.

Finally, a leaf holds a vector of entries, and keeps them in
hashkey sorted order. Lookup within the leaf uses binary search.
This differs form Fagin's original algorithm, which used
secondary hashing within the leaf.

\subsection Sncindex NCxindex

An "index" \(aka instance of type "NCxindex"\) is a combination
of one NClist instance plus one NCexhashmap instance.
The hashmap maps a name to the position of the correspondingly
named object in the NClist part of the NCindex.

An index is used to provide several kinds of lookup with respect
to a specific list of metadata objects. For example, the
subgroups of a group are stored using a vector of pointers to
the subgroup objects.  This also provides information about
creation order, which is sometimes important.  However, we often
need fast access to that vector by name so an NCindex object
provides these capabilities. The NCindex object contains:

1. A vector into which the object pointers can be stored
   and iterated over.
2. A map from name to the corresponding object index in the vector.

Note that currently, NCxindex is only used in libsrc4 and
libhdf5 and libnczarr.  But if performance issues warrant, it
will eventually also be used in libsrc.

Note also that alternative implementations are feasible that do not
use a hash table for name indexing, but rather keep a list sorted by name
and use binary search to do name-based lookup. If this alternative were
implemented, then it is probable that we could get rid of using the NCexhashmap
structure altogether for netcdf-4. There is a performance cost since binary
search is O(log n). In practice, it is probable that this is of negligible
effect. The advantage is that rename operations become considerably simpler.

\section Sglobal_object_access Global Object Access

As mentioned, dimension, group, and type external id's (dimid,
grpid, typeid) are unique across the whole file. It is therefore
convenient to store in memory a per-file vector for each object
type such that the external id of the object is the same as the
position of that object in the corresponding per-file
vector. This makes lookup by external id very efficient.
Note that this was already the case for netcdf-3 (libsrc) so
this is a change for libsrc4 only.

The global set of dimensions, types, and groups is maintained by
three instances of NClist (within an index object) in the
NC_FILE_INFO structure: namely _alldims_, _alltypes_, and
_allgroups_.  The position of the object within the
corresponding list determines the object's external id. Thus, a
position of a dimension object within the "alldims" field of the
file structure determines its dimid. Similarly for types and groups.
Note that _alldims_ is an _NClist_ rather than an _NCxindex_ because
the dimension names are not unique unless they are full names (i.e. /g1/g2/dim).

\section Sper_group_object_access Per-Group Object Access

Each group object (NC_GRP_INFO_T) contains five instances of
NCxindex. One is for dimensions, one is for types, one is for
subgroups, one is for variables, and one is for attributes.  An
index is used for two reasons. First, it allows name-based lookup
for these items. Second, the declaration order is maintained by
the list within the index's vector. Note that the position of
an object in a group index vector has no necessary
relationship to the position of that object within the global
vectors.

Note however that the index vector for variables does define
the variable id, which is unique only within a group.
In this special case, the external id for the variable is
the same as its offset in the index's vector for the group.

A note about typeids. Since user defined types have an external
id starting at NC_FIRSTUSERTYPEID, we leave the global type
vector entries 0..NC_FIRSTUSERTYPEID-1 empty.

\section Smetadata_object_header Metadata Object Header

Each metadata object (e.g. NC_DIM_INFO_T, NC_VAR_INFO_T)
now has what is called a "hdr" object as its first field.
This provides a form of pseudo-inheritance for these objects
because they can all be cast to "NC_OBJ" to get common information.
The structure of the header is as follows.

\code
typedef struct NC_OBJ {
    NC_SORT sort;
    char* name; /* assumed to be null terminated */
    size_t id;
    ncexhashkey_t hashkey;
} NC_OBJ;
\endcode

The sort is one of the values _NCVAR_, _NCDIM_, _NCATT_, _NCTYP_, or _NCGRP_.
The name is assumed to be nul terminated. The id is the assigned id
for the object. The hashkey is the same hash value as used in NCexhashmap.

\section Scliches Programming cliches

\subsection Slookupname Lookup an Object by Name

In the original code, the following _cliche_ (code snippet)
was common for looking up an object by name.
\code
NC_GRP_INFO_T* grp = ...;
...
NC_GRP_INFO_T* g;
...
for (g = grp->children; g; g = g->l.next) {
    if(strcmp(name,g->name)==0) {
      ... code to process matching grp by name
    }
...
}
\endcode
In this case, this loop is iterating across all the subgroups (children)
of the grp. It does so by walking the linked list of child groups.
It does a name comparison in order to find the group with the desired name.

In the new code, this iteration cliche is replaced by something
that looks like this.
\code
NC_GRP_INFO_T* grp = ...;
NC_GRP_INFO_T* g;
...
g = ncxindexlookup(grp->children,name);
if(g != NULL)
      ... code to process matching grp by name
}
\endcode
In this case, the iteration is replaced by a hashtable lookup.

\subsection Slookupid Lookup an Object by id

In the original code, the following _cliche_ (code snippet)
was common for looking up an object by its id (dimid, varid, etc).
\code
NC_GRP_INFO_T* grp = ...;
...
NC_DIM_INFO_T* d;
...
for (d = grp->dim; d; d = d->l.next) {
    if(varid == d->dimid)
      ... code to process matching dim by index
    }
...
}
\endcode
In this case, this loop is iterating across all the dimension objects
of the grp. It does so by walking the linked list of dimensions.
It does an id comparison in order to find the group with the desired
dimension.

In the new code, this iteration cliche is replaced by something
that looks like this.
\code
NC_FILE_INFO_T* h5 = ...;
NC_DIM_INFO_T* d;;
...
d = nclistget(h5->alldims,id);
if(d != NULL)
      ... code to process matching dim by id
}
\endcode
This shows how the alldims vector is used to map from a
dimid directly to the matching dimension object.
In this example, h5 is the NC_FILE_INFO_T file object.
This approach works for dimension ids, group ids, and type ids
because they are globally unique.

For variables and attributes, we have to use the containing group's
Ncxindex, such as grp->vars. In this case, the varid, is mapped using
code like this.
\code
NC_GRP_INFO_T* grp = ...;
NC_VAR_INFO_T* v;
...
v = ncxindexith(grp->vars,id);
if(v != NULL)
      ... code to process matching variable by id
}
\endcode

\subsection Siterate Iterating over sets of objects

In the original code, the following _cliche_ (code snippet)
was common.
\code
NC_GRP_INFO_T* grp;
...
NC_GRP_INFO_T* g;
...
for (g = grp->children; g; g = g->l.next)
...
\endcode
In this case, this loop is iterating across all the subgroups (children)
of the grp. It does so by walking the linked list of child groups.
Similar loops are used to walk a list of dimensions, variables, types,
or attributes.

In the new code, this iteration cliche is replaced by something
that looks like this.
\code
NC_GRP_INFO_T* grp;
...
for(i=0;i<ncxindexsize(grp->children);i++) {
  NC_GRP_INFO_T* g = nclistith(grp->children,i);
...
}
\endcode
In this case, the iteration is by index into the underlying vector.

\section Sperf Performance

The initial impetus for this change was to improve the performance
of netcdf-4 metadata loading by replacing linear searches with O(1)
searches.

In fact, this goal was not initially met. It appears to be the case
that the metadata loading costs are entirely dominated by the
performance of the HDF5 library. The reason for this is that
the netcdf-c library used to load all the metadata immediately
when a file was opened. This in turn meant that all of the metadata
was immediately extracted from the underlying HDF5 file. So, there was
no opportunity for lazy loading to be used.

The netcdf library has since been modified to do lazy loading
of variables and attributes. A not pursued alternative would
have been to store a single metadata object into the file so it could
be read at one shot. This object would then be processed
in-memory to construct the internal metadata. The costs for
this approach are space in the file plus the need to keep it
consistent with the actual metadata stored by HDF5.

It should be noted that there is an effect from this change.
Using gprof, one can see that in the original code the obj_list_add
function was the dominate function called by a large percentage (about 20%).
Whereas with the new code, the function call distribution is much more
even with no function taking more than 4-5%.

Some other observations:

1. the utf8 code now shows up as taking about 4%. Given that most names
   are straight ASCII, it might pay to try to optimize for this to avoid
   invoking the utf8 processing code.
2. In the new code, attribute processing appears to take up a lot of the
   time. This, however might be an artifact of the test cases.
3. There is a small performance improvement from avoiding walking the linked
   list. It appears that creating a file is about 10% faster and opening a file
   is also about 10% faster.

\section Snotes_and_warnings Notes and Warning

1. Ncxindex is currently not used for enum constants and compound fields.
   Additionally, it is not used for listing the dimensions associated
   with a variable. Small size is the reason.
2. References between meta-data objects (e.g. group parent or
   containing group) are stored directly and not using any kind
   of vector or hashtable.
3. Attribute rename and deletion are still moderately costly operations.
4. As in the original code, object ids (dimid, etc) are assigned
   explicitly using counters within the NC_FILE_INFO_T object.
   When stored into, for example, "alldims", the position of the
   object is forcibly made to match the value of the assigned id.
5. The file ncxindex.c has a constant, NCNOHASH, that controls
   if the index uses that hash table versus just searching the
   index's vector. This is for experimental purposes.

\section Sprovenance Contact Information

__Author__: Dennis Heimbigner<br>
__Initial Version__: 01/10/2018<br>
__Last Revised__: 10/30/2020

*/