/** \file

\internal

\page nchashmap Indexed Access to Metadata Objects

\tableofcontents

The original internal representations of metadata in memory
relied on linear searching of lists to locate various objects
by name or by numeric id (e.g. varid or grpid).

In recent years, the flaws in that approach have become obvious
as users create files with extremely large numbers of objects:
groups, variables, attributes, and dimensions. One case
has 14 megabytes of metadata. Creating and (especially) later
opening such files was exceedingly slow.

This problem was partially alleviated in both netcdf-3 (libsrc)
and netcdf-4 (libsrc4) by adding name hashing tables.
However, and especially for netcdf-4, linear search still prevailed.

A pervasive change has been made to try to remove (almost) all
occurrences of linear search and replace them with either hashing
(for name-based lookup) or vectors (for numeric id-based
lookup). The cases left as linear search are the following.

1. Enum constants for an enumeration
2. Dimensions associated with a variable
3. Fields of compound types

This document describes the architecture and details of the netCDF
internal object lookup mechanisms now in place.
\section S1 Indexed Searches

There are, as a rule, two searches that are used to locate
a metadata object: (1) search by name and (2) search by
externally visible id (e.g. dimid or varid).

Currently, and after all the metadata is read or created,
hashing is used for locating objects by name. In all other
cases, apparently, lookup is by linear search of some
kind of linked list or a vector.

It is relevant that, once created, no metadata object other than
an attribute can be deleted. Objects can be renamed, but that
does not change the associated id. Deletion only occurs when an
error occurs in creating an object or on invoking nc_close.

The numeric identifiers for dimensions, types, and groups are
all globally unique across a file. But note that variable ids
are not globally unique (IMO a bad design decision) but are only
unique within the containing group. Thus, in order to provide a
unique id for a variable, it must be composed of the containing
group id plus the variable id.

Note also that names are unique only within a group and with respect
to some kind of metadata. That is, a group cannot have e.g. two
dimensions with the same name. But it can have a variable and a dimension
with the same name (as with coordinate variables).

Finally, attribute names are unique only with respect to each other
and with respect to the containing object (a variable or a group).
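
As an illustration of composing a file-wide variable identifier from a
group id and a variable id, the following is a minimal sketch. The
function name varkey_pack and the 64-bit packing are assumptions made
for this example only; they are not part of the netCDF library.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: combine a group id and a variable id into one
// 64-bit key. The group id occupies the high 32 bits and the variable
// id the low 32 bits, so distinct (grpid, varid) pairs give distinct keys.
static uint64_t varkey_pack(int grpid, int varid)
{
    assert(grpid >= 0 && varid >= 0); // ids are assumed non-negative here
    return ((uint64_t)(uint32_t)grpid << 32) | (uint64_t)(uint32_t)varid;
}

// Recover the two parts of a packed key.
static int varkey_grpid(uint64_t key) { return (int)(key >> 32); }
static int varkey_varid(uint64_t key) { return (int)(key & 0xFFFFFFFFu); }
```

With this packing, variable 0 of group 1 and variable 1 of group 0
produce different keys even though each varid is only unique within its
own group.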
\section S2 Basic Data Structures

The basic data structures used by the new lookup mechanisms
are described in the following sections.

\subsection SS1_1 NClist

With rare exceptions, vectors of objects are maintained as
instances of NClist, which provides a dynamically extendible
vector of pointers: pointers to metadata objects in this case.
It is possible to append new objects, insert at a specific
vector offset, or overwrite an existing pointer at a specific
offset.

The definition is as follows.

\code
typedef struct NClist {
  size_t alloc;   // number of pointer slots allocated in content
  size_t length;  // number of slots currently in use
  void** content; // the vector of pointers
} NClist;
\endcode
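
The append and indexed-read operations on such a vector can be sketched
as follows. This is only an illustration under stated assumptions: the
Vec type mirrors the field layout shown above, but the names Vec,
vec_push, and vec_get and the doubling growth policy are hypothetical
and are not the library's own functions.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical stand-in with the same fields as the definition above.
typedef struct Vec {
    size_t alloc;
    size_t length;
    void** content;
} Vec;

// Append elem, doubling the allocation when the vector is full.
// Returns 1 on success and 0 if memory could not be obtained.
static int vec_push(Vec* v, void* elem)
{
    assert(v != NULL);
    if (v->length == v->alloc) {
        size_t newalloc = (v->alloc == 0) ? 8 : 2 * v->alloc;
        void** p = realloc(v->content, newalloc * sizeof(void*));
        if (p == NULL) return 0;
        v->content = p;
        v->alloc = newalloc;
    }
    v->content[v->length++] = elem;
    return 1;
}

// Return the pointer stored at index i, or NULL if i is out of range.
static void* vec_get(const Vec* v, size_t i)
{
    assert(v != NULL);
    return (i < v->length) ? v->content[i] : NULL;
}
```

A vector that starts as all zeros (alloc 0, length 0, content NULL) is
a valid empty vector for these helpers, and the caller releases it by
calling free() on the content field.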
\subsection SS1_2 NC_hashmap

The NC_hashmap type is a hash table mapping a name to a pointer.
As a rule, the pointer points to a metadata object. The current
implementation supports table expansion when the number of entries in
the table starts to get too large. A simple linear
rehash is used for collisions and no separate hash chain is
used. This means that when the table is expanded, it must be completely
rebuilt. The performance hit for this has yet to be determined.

The hashtable definition is as follows.

\code
typedef struct NC_hashmap {
  size_t size;
  size_t count;
  NC_hentry* table;
} NC_hashmap;
\endcode

where size is the current allocated size and count is the
number of active entries in the table. The "table" field is
a vector of entries of this form.

\code
typedef struct NC_hentry {
  int flags;
  void* data;
  size_t hashkey; // Hash id
  char* key;      // actual key; do not free
} NC_hentry;
\endcode

The flags field indicates the state of the entry, which is one of three states:

1. ACTIVE - there is an object referenced in this entry
2. DELETED - an entry was deleted, but must be marked so
   that linear rehash will work.
3. EMPTY - unused
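
The role of the three states can be illustrated with a small open
addressing table. This is a sketch under assumptions, not the library's
implementation: the names Slot, Table, table_find, table_insert, and
table_remove, the djb2 hash function, and the fixed-size table with no
expansion are all choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// All names here are hypothetical; this is not the library's code.
enum { SLOT_EMPTY = 0, SLOT_ACTIVE = 1, SLOT_DELETED = 2 };

typedef struct Slot {
    int flags;
    void* data;
    size_t hashkey;
    const char* key; // borrowed pointer to the object's name
} Slot;

typedef struct Table {
    size_t size;  // number of slots in table
    size_t count; // number of ACTIVE slots
    Slot* table;
} Table;

// djb2 string hash, chosen only to keep the example short.
static size_t hash_name(const char* s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// Return the ACTIVE slot holding key, or NULL if the key is absent.
static Slot* table_find(Table* t, const char* key)
{
    assert(t->size > 0);
    size_t h = hash_name(key);
    size_t start = h % t->size;
    for (size_t i = 0; i < t->size; i++) {
        Slot* s = &t->table[(start + i) % t->size];
        if (s->flags == SLOT_EMPTY) return NULL; // end of the probe run
        if (s->flags == SLOT_ACTIVE && s->hashkey == h
            && strcmp(s->key, key) == 0) return s;
        // DELETED, or ACTIVE with another key: keep probing
    }
    return NULL;
}

// Insert key without resizing. Returns 1 on success, 0 if the key is
// already present or no slot is free.
static int table_insert(Table* t, const char* key, void* data)
{
    assert(t->size > 0);
    size_t h = hash_name(key);
    size_t start = h % t->size;
    Slot* target = NULL;
    for (size_t i = 0; i < t->size; i++) {
        Slot* s = &t->table[(start + i) % t->size];
        if (s->flags == SLOT_ACTIVE) {
            if (s->hashkey == h && strcmp(s->key, key) == 0) return 0;
            continue;
        }
        if (target == NULL) target = s;    // first reusable slot
        if (s->flags == SLOT_EMPTY) break; // key cannot lie beyond here
    }
    if (target == NULL) return 0;
    target->flags = SLOT_ACTIVE;
    target->data = data;
    target->hashkey = h;
    target->key = key;
    t->count++;
    return 1;
}

// Mark the slot for key as DELETED. Returns 1 if the key was present.
static int table_remove(Table* t, const char* key)
{
    Slot* s = table_find(t, key);
    if (s == NULL) return 0;
    s->flags = SLOT_DELETED; // keeps later entries in the run reachable
    s->data = NULL;
    s->key = NULL;
    t->count--;
    return 1;
}
```

If a removed slot were reset to EMPTY instead of DELETED, a lookup for
a key that had collided with it and been placed further along would stop
at that slot and wrongly report the key as absent.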

There is an important WARNING with respect to the "key" field.
The key is not a copy of the object's name, but is in fact a second
pointer to that same string. This means (1) that it should never be
free()'d and (2) that if the name of the metadata object is changed, then
the object must be removed and re-inserted into the table so that the key
points to the current name.

The "data" field is of type void*. Often it is a pointer to an instance
of a variable, dimension, or other object. When the hashmap is used as part of an
NC_listmap (see below), the data is instead an integer index into the
associated vector. In order to do this correctly, we need to rely
on the type "uintptr_t". A value of type uintptr_t is supposed to be
an integer of sufficient size to
hold a void* pointer. Usually, but not always, this would be the same
size as an "unsigned long" value. Using this type allows the hashtable
to store either pointers or integer indices.
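
Storing an index in a void* field by way of uintptr_t can be sketched as
follows. The helper names index_to_data and data_to_index are
hypothetical. Note also that the C standard only guarantees the round
trip from pointer to uintptr_t and back; recovering an integer that was
converted to a pointer is implementation-defined, although it behaves as
expected on the common platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encode a vector index as a void* value.
static void* index_to_data(size_t index)
{
    assert(index <= UINTPTR_MAX); // the index must fit in uintptr_t
    return (void*)(uintptr_t)index;
}

// Hypothetical helper: decode a void* value produced by index_to_data.
static size_t data_to_index(void* data)
{
    return (size_t)(uintptr_t)data;
}
```

A value produced this way must never be dereferenced; the code that
reads the entry has to know from context whether the field holds a real
pointer or an encoded index.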

One further WARNING: any object that will be inserted into an NC_hashmap
must have its name as the first field so that a pointer to the object can be cast
to char** for use with the hashtable.
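
The first-field requirement can be illustrated with the sketch below.
The Dim and Var structures and the function object_name are hypothetical
stand-ins, not the library's types. The cast works because C guarantees
that a pointer to a structure, suitably converted, points to its first
member.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Two hypothetical object types; both place the name first.
typedef struct Dim {
    char* name;
    size_t len;
} Dim;

typedef struct Var {
    char* name;
    int ndims;
} Var;

// Read the name of any object whose first field is a char* name,
// without knowing the object's actual type.
static const char* object_name(void* obj)
{
    assert(obj != NULL);
    return *(char**)obj;
}
```

If the name were not the first field of one of these structures, the
same cast would silently read some other field as if it were a string
pointer.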
\subsection SS1_3 NC_listmap

A listmap is a combination of an NClist and an NC_hashmap.
It is used to provide name-based lookup with respect to a
specific list of metadata objects. For example, the subgroups
of a group are stored using a listmap, where the list is a
vector of pointers to the subgroup objects and the hashmap
maps the subgroup name (unique to that group, remember) to
the corresponding index into the vector. In theory, only
the hashmap is needed because it could be walked to get all
of the metadata objects. However, the creation order is sometimes
important, so that is maintained by the vector.
This is especially important for attribute storage.

Note that currently, NC_listmap is only used in libsrc4,
but if performance issues warrant, it will also be used in
libsrc.
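
The combination of an ordered vector with a name-to-index table can be
sketched as follows. Everything in this example is an assumption made
for illustration: the names ListMap, lm_add, and lm_lookup, the fixed
capacities, the hash function, and the omission of deletion and
expansion. It is not the library's NC_listmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LM_CAP 16   // fixed vector capacity for the sketch
#define LM_SLOTS 32 // hash slots; larger than LM_CAP so one is always empty

// Hypothetical object type with its name as the first field.
typedef struct Obj {
    const char* name;
} Obj;

typedef struct ListMap {
    size_t length;              // number of objects in list
    void* list[LM_CAP];         // objects in creation order
    const char* keys[LM_SLOTS]; // NULL marks an empty slot
    void* data[LM_SLOTS];       // vector index, stored via uintptr_t
} ListMap;

static size_t lm_hash(const char* s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// Append obj and index it by name. Returns its position in the
// vector, or -1 if the vector is full or the name is already present.
static int lm_add(ListMap* lm, Obj* obj)
{
    assert(lm != NULL && obj != NULL && obj->name != NULL);
    if (lm->length == LM_CAP) return -1;
    size_t start = lm_hash(obj->name) % LM_SLOTS;
    for (size_t i = 0; i < LM_SLOTS; i++) {
        size_t slot = (start + i) % LM_SLOTS;
        if (lm->keys[slot] == NULL) {
            size_t index = lm->length;
            lm->keys[slot] = obj->name;
            lm->data[slot] = (void*)(uintptr_t)index;
            lm->list[index] = obj;
            lm->length = index + 1;
            return (int)index;
        }
        if (strcmp(lm->keys[slot], obj->name) == 0) return -1;
    }
    return -1;
}

// Name-based lookup: hash to a slot, read the index, fetch from the vector.
static void* lm_lookup(const ListMap* lm, const char* name)
{
    size_t start = lm_hash(name) % LM_SLOTS;
    for (size_t i = 0; i < LM_SLOTS; i++) {
        size_t slot = (start + i) % LM_SLOTS;
        if (lm->keys[slot] == NULL) return NULL;
        if (strcmp(lm->keys[slot], name) == 0)
            return lm->list[(size_t)(uintptr_t)lm->data[slot]];
    }
    return NULL;
}
```

Walking list[0] through list[length-1] visits the objects in creation
order, which the hash slots alone could not provide.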
\section S3 Global Object Access

As mentioned, dimension, group, and type external ids (dimid,
grpid, typeid) are unique across the whole file. It is therefore
convenient to store in memory a per-file vector for each object
type such that the external id of the object is the same as the
position of that object in the corresponding per-file
vector. This makes lookup by external id efficient.
Note that this was already the case for netcdf-3 (libsrc), so
this is a change for libsrc4 only.

The global set of dimensions, types, and groups is maintained by
three instances of NClist in the NC_HDF5_FILE_INFO structure:
alldims, alltypes, and allgroups.
The position of the object within the corresponding list determines
the object's external id. Thus, the position of a dimension object within the
"alldims" field of the file structure determines its dimid. Similarly
for types and groups.
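
Lookup by external id then reduces to a bounds check and an array
access, as in this minimal sketch. The IdVec type and the function
idvec_lookup are hypothetical names used only for the example.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical per-file vector: the position of an entry is its external id.
typedef struct IdVec {
    size_t length;
    void** content;
} IdVec;

// Return the object with the given external id, or NULL for an invalid id.
static void* idvec_lookup(const IdVec* v, int id)
{
    assert(v != NULL);
    if (id < 0 || (size_t)id >= v->length) return NULL;
    return v->content[id];
}
```

The cost of this lookup does not depend on how many objects the file
contains, in contrast to a linear search of a list.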
\section S4 Per-Group Object Access

Each group object (NC_GRP_INFO) contains four
instances of NC_listmap. One is for dimensions, one is for
types, one is for subgroups, and one is for variables. A
listmap is used for two reasons. First, it allows name-based lookup
for these items. Second, the declaration order is maintained by
the listmap's vector. Note that the position of
an object in a group listmap vector has no necessary
relationship to the position of that object within the global
vectors. Note also that there is no global vector for variables
because variable external ids are unique only within the
group. In this special case, the external id for the variable is
the same as its offset in the listmap's vector for the group.

A note about typeids: since user-defined types have an external
id starting at NC_FIRSTUSERTYPEID, we leave the global type
vector entries 0..NC_FIRSTUSERTYPEID-1 empty.
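
Reserving the leading entries of the type vector can be sketched as
follows. The names TypeVec, typevec_init, and typevec_add are
hypothetical, and FIRST_USER_TYPEID is a stand-in for
NC_FIRSTUSERTYPEID whose value here is assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

#define FIRST_USER_TYPEID 32 // stand-in for NC_FIRSTUSERTYPEID (assumed value)
#define MAX_TYPES 64         // fixed capacity for the sketch

typedef struct TypeVec {
    size_t length;
    void* content[MAX_TYPES];
} TypeVec;

// Leave entries 0..FIRST_USER_TYPEID-1 empty so that the first
// user-defined type lands at position FIRST_USER_TYPEID.
static void typevec_init(TypeVec* v)
{
    assert(v != NULL);
    for (size_t i = 0; i < FIRST_USER_TYPEID; i++) v->content[i] = NULL;
    v->length = FIRST_USER_TYPEID;
}

// Append a user-defined type. Returns its typeid, which equals its
// position in the vector, or -1 if the vector is full.
static int typevec_add(TypeVec* v, void* type)
{
    assert(v != NULL && type != NULL);
    if (v->length == MAX_TYPES) return -1;
    size_t id = v->length;
    v->content[id] = type;
    v->length = id + 1;
    return (int)id;
}
```

The unused leading entries cost a little memory but keep the rule that
an object's external id equals its position in the vector.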
\section S5 Exceptions

NC_listmap is currently not used for enum constants and compound fields.
Additionally, it is not used for listing the dimensions associated
with a variable.

References between metadata objects (e.g. group parent or
containing group) are stored directly and not using any kind
of vector or hashtable.

*/