netcdf-c/docs/indexing.dox

/** \file

\internal

\page nchashmap Indexed Access to Metadata Objects

\tableofcontents

The original internal representations of metadata in memory
relied on linear searching of lists to locate various objects
by name or by numeric id (e.g. varid or grpid).

In recent years, the flaws in that approach have become obvious
as users create files with extremely large numbers of objects:
group, variables, attributes, and dimensions. One case
has 14 megabytes of metadata. Creating and (especially) later
opening such files was exceedingly slow.

This problem was partially alleviated in both netcdfd-3 (libsrc)
and netcdf-4 (libsrc4) by adding name hashing tables.
However, and especially for netcdf-4, linear search still prevailed.

A pervasive change has been made to try to remove (almost) all
occurrences of linear search and replace it with either hashing
(for name-based lookup) or vectors (for numeric id-based
lookup).  The cases left as linear search include these.

1. Enum constants for an enumeration
2. Dimensions associated with a variable
3. Fields of Compound types

This document describes the architecture and details of the netCDF
internal object lookup mechanisms now in place.

\section S1 Indexed Searches

There are, as a rule, two searches that are used to locate
metadata object: (1) search by name and (2) search by
externally visible id (e.g. dimid or varid).

Currently, and after all the metadata is read or created,
hashing is used for locating objects by name. In all other
cases -- apparently -- lookup is by linear search of some
kind of linked list or a vector.

It is relevant that, once created, no metadata object -- except
attributes -- can be deleted.  They can be renamed, but that
does not change the associated id.  Deletion only occurs when an
error occurs in creating an object or on invoking nc_close.

The numeric identifiers for dimensions, types, and groups are
all globally unique across a file.  But note that variable id's
are not globally unique (IMO a bad design decision) but are only
unique within the containing group.  Thus, in order to provide a
unique id for a variable it must be composed of the containing
group id plus the variable id.

Note also that names are unique only within a group and with respect
to some kind of metadata. That is a group cannot have e.g. two
dimensions with the same name. But it can have a variable and a dimension
with the same name (as with coordinate variables).

Finally, attribute names are unique only with respect to each other
and with respect to the containing object (a variable or a group).

\section S2 Basic Data Structures

The basic data structures used by the new lookup mechanisms
are described in the following sections.

\subsection SS1_1 NClist

With rare exceptions, vectors of objects are maintained as
instances of NClist, which provides a dynamically extendible
vector of pointers: pointers to metadata objects in this case.
It is possible to append new objects or insert at a specific
vector offset, or overwrite an existing pointer at a specific
offset.

The definition is as follows.

\code
typedef struct NClist {
  size_t alloc;
  size_t length;
  void** content;
} NClist;
\endcode


\subsection SS1_2 NC_hashmap

The NC_hashmap type is a hash table mapping a name to a pointer.
As a rule, the pointer points to a metadata object.  The current
implementation supports table expansion when the # of entries in
the table starts to get too large. Basically a simple linear
rehash is used for collisions and no separate hash-chain is
used.  This means that when expanded, it must be completely
rebuilt. The performance hit for this has yet to be determined.

The hashtable definition is as follows.

\code
typedef struct NC_hashmap {
  size_t size;
  size_t count;
  NC_hentry* table;
} NC_hashmap;
\endcode

where size is the current allocated size and count is the
number of active entries in the table. The "table" field is
a vector of entries of this form.

\code
typedef struct NC_hentry {
    int flags;
    void* data;
    size_t hashkey; /* Hash id */
    char* key; /* actual key; do not free */
} NC_hentry;
\endcode

The flags indicate the state of the entry and can be one of three states:

1. ACTIVE - there is an object referenced in this entry
2. DELETED - an entry was deleted, but must be marked so
   that linear rehash will work.
3. EMPTY - unused

There is an important WARNING with respect to the "key" field.
The key is not a copy of the object's name, but in fact is a duplicate
pointer to that same string. This means (1) that it should never be
free()'d and (2) if the name of the metadata object is changed, then
it must be removed and re-inserted into the table to that the key
points to the current name.

The "data" field is of type void*. Often it is a pointer to an instance
of a variable, or dimension, or other object. When used as part of an
NC_listmap (see below), then the key is an integer index into the
associated vector. In order to do this correctly, we need to rely
on  the type "uintptr_t". It is supposed to be the case
that a value of type uintptr_t is an integer of sufficient size to
hold a void* pointer. Usually, but not always, this would be the same
size as an "unsigned long" value. Using this allows the hashtable
to store either pointers or integer indices.

One further WARNING: any object that will be inserted into an NC_hashmap
must have its name as the first field so it can be cast
to char** for use with the hashtable.

\subsection SS1_3 NC_listmap

A listmap is a combination of an NClist and an NC_hashtable.
It is used to provide name-based lookup with respect to a
specific list of metadata objects. For example, the subgroups
of a group are stored using a listmap, where the list is a
vector of pointers to the subgroup objects and the hashmap
maps the subgroup name (unique to that group, remember) to
the corresponding index into the vector. In theory, only
the hashmap is needed because it could be walked to get all
of the metadata objects. However, the creation order is sometimes
important, so that is maintained by the vector.
This is especially important for attribute storage.

Note that currently, NC_listmap is only used in libsrc4,
but if performance issues warrant, it will also be used in
libsrc.

\section S3 Global Object Access

As mentioned, dimension, group, and type external id's (dimid,
grpid, typeid) are unique across the whole file. It is therefore
convenient to store in memory a per-file vector for each object
type such that the external id of the object is the same as the
position of that object in the corresponding per-file
vector. This maked lookup by external id efficient.
Note that this is was already the case for netcdf-3 (libsrc) so
this is a change for libsrc4 only.

The global set of dimensions, types, and groups is maintained by
three instances of NClist in the NC_HDF5_FILE_INFO structure:
alldims, alltypes, and allgroups.
The position of the object within the corresponding list determines
the object's external id. Thus, a position of a dimension object within the
"alldims" field of the file structure determines its dimid. Similarly
for types and groups.

\section S4 Per-Group Object Access

Each group object (NC_GRP_INFO) contains four
instances of NC_listmap. One is for dimensions, one is for
types, one is for subgroups, and one is for variables.  A
listmap is used for two reasons. First, allows name-based lookup
for these items. Second, the declaration order is maintained by
the list within the listmap's vector. Note that the position of
an object in a group listmap vector has no necessary
relationship to the position of that object within the global
vectors.  Note also that there is no global vector for variables
because variable external ids are unique only within the
group. In this special case, the external id for the variable is
the same as its offset in the listmap's vector for the group.

A note about typeids. Since user defined types have an external
id starting at NC_FIRSTUSERTYPEID, we leave the global type
vector entries 0..NC_FIRSTUSERTYPEID-1 empty.

\section S5 Exceptions

NC_Listmap is currently not used for enum constants and compound fields.
Additionally, it is not used for listing the dimensions associated
with a variable.

References between meta-data objects (e.g. group parent or
containing group) are stored directly and not using any kind
of vector or hashtable.

*/