netcdf-c/ncgen/internals.html
2019-09-18 08:03:01 -06:00


<html>
<body>
<center>
<pre>
/* Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.
See the COPYRIGHT file for more information. */
</pre>
<h1>NCGEN Internals Documentation</h1>
<h3>Draft: 03/07/2009<br>
Last Revised: 07/15/2009</h3>
</center>
<h1><u>Introduction</u></h1>
This document is an ongoing effort to
describe the internal operation of the ncgen
CDL compiler; ncgen is part of the netCDF
system.
<p>
The document has two primary parts.
<ol>
<li><a href="#LANG">Language Support</a>
-- describes how to add a new output language to ncgen.
<p>
<li><a href="#GIT">General Internals Information</a>
-- describes additional information about the internals;
parsing, for example.
</ol>
<h1><u><a name="LANG">Modifying NCGEN to Output a New Language</a></u></h1>
This section outlines the general method for adding
a new language output to ncgen. Currently, ncgen supports
binary, C, and (experimentally) NcML and Java.
Before reading this section, the reader should also
review the <a href="#GIT">General Internals Information</a> section.
<p>
Also, the reader should note that the code is a bit crufty
and needs refactoring. This is primarily because
it was originally designed to support only C, and
each new language stresses the code.
<p>
In order to get ncgen to generate output for a new
language, the following steps are required.
<ol>
<li> <a href="#Misc">Modify various files to invoke the new language output.</a>
<li> <a href="#Create">Create a new set of generate functions.</a>
</ol>
<h2><a name="Misc">Modify various files to invoke the new language output.</a></h2>
The following steps are required to provide the necessary code
to invoke a new language output.
For the purposes of this discussion, let us call the language Java.
<h4>ncgen.h</h4>
<ol>
<li> Locate the code enabler #defines
(e.g. <code>#define ENABLE_C</code>)
and insert a new one of the form
<pre>
#define ENABLE_JAVA
</pre>
</ol>
<h4>main.c</h4>
<ol>
<li> Locate the global declaration (<code>int fortran_flag;</code>)
and insert a new declaration.
<pre>int java_flag;</pre>
<li> Locate the initialization (<code>fortran_flag = 0;</code>)
in the body of the main() procedure and add a new initialization.
<pre>java_flag = 0;</pre>
<li>Locate the options processing switch case for -l (<code>case 'l':</code>).
Duplicate one of the instances there and add to the conditionals.
It should look like this.
<pre>
} else if(strcmp(lang_name, "java") == 0
|| strcmp(lang_name, "Java") == 0) {java_flag = 1;}
</pre>
<li> Just after the options processing switch code,
there are a number of #ifndef conditionals
(e.g. <code>#ifndef ENABLE_C</code>).
Add a new one for Java.
It should look like this.
<pre>
#ifndef ENABLE_JAVA
if(java_flag) {
fprintf(stderr,"Java not currently supported\n");
exit(1);
}
#endif
</pre>
</ol>
<h2><a name="Create">Create a new set of generate functions.</a></h2>
The hard part is creating the actual code generation files.
To do this, it is easiest to take one of the existing
generators and modify it, viz:
<ul>
<li> copy genc.c genj.c
<li> copy cdata.c jdata.c
</ul>
The genj.c file will do most of the code generation. The jdata.c file
will generate lists of data constants that come from the CDL data: section.
There is nothing magical about using two files: they can be refactored
as desired.
<p>
In order to facilitate code generation, it is useful to look
at the translations produced by other languages.
The idea is to take these translations and decide what the
corresponding Java (for example) code would look like.
Then the idea is to modify the genc code (in genj.c)
to reflect that translation.
<p>
In most of the rest of this discussion, the genc.c and cdata.c
code will be used to explain the operation.
Appropriate procedure renaming should be done for new languages
(e.g., for Java, <i>genc_XXX</i> is changed to <i>genj_XXX</i>
consistently).
<h3>Useful Output Procedures</h3>
The following output procedures are defined in genc.c to create C output.
The idea is that output is accumulated in a <a href="#Bytebuffer">Bytebuffer</a>
called ccode. Periodically, ccode
contents are flushed to stdout.
The relevant procedures from the C code are as follows.
<ol>
<li> <code>void cprint(Bytebuffer* buf)</code>
-- dump the contents of buf to output (ccode actually).
<li> <code>void cpartial(char* line)</code>
-- dump the specified string to output.
<li> <code>void cline(char* line)</code>
-- dump the specified string to output and add a newline.
<li> <code>void clined(int n, char* line)</code>
-- dump the specified string to output preceded by
<i>n</i> instances of indentation.
<li> <code>void cflush(void)</code>
-- dump the contents of ccode to standard output
and reset the ccode buffer.
</ol>
There is, of course, nothing sacred about these procedures:
feel free to modify as needed. In fact, there are two
important reasons to modify the code.
First, the indentation rules may differ from language to language
(FORTRAN 77 for example). Second, the rules for folding lines
that are too long differ across languages.
It is usually easiest to handle both of these issues
in the output procedures.
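<p>
As a rough illustration of how such output procedures might be organized for a new language, the following is a minimal sketch. It is not the genc.c implementation: the names out_partial, out_line, out_lined, and out_flush are hypothetical, and a fixed-size character array stands in for the Bytebuffer. The indentation unit is a single constant, which is where language-specific rules would be applied.

```c
#include <stdio.h>
#include <string.h>

#define OUT_MAX 4096          /* fixed capacity, for illustration only */
#define INDENT_UNIT "    "    /* one indentation level; language-specific */

static char out_buf[OUT_MAX]; /* hypothetical stand-in for the ccode Bytebuffer */
static size_t out_len = 0;

/* Append a string to the pending output without a newline. */
void out_partial(const char* text)
{
    size_t n = strlen(text);
    if(out_len + n >= OUT_MAX) n = OUT_MAX - 1 - out_len; /* truncate */
    memcpy(out_buf + out_len, text, n);
    out_len += n;
    out_buf[out_len] = '\0';
}

/* Append a string followed by a newline. */
void out_line(const char* text)
{
    out_partial(text);
    out_partial("\n");
}

/* Append a line preceded by n levels of indentation. */
void out_lined(int n, const char* text)
{
    while(n-- > 0) out_partial(INDENT_UNIT);
    out_line(text);
}

/* Write the pending output to a stream and reset the buffer. */
void out_flush(FILE* stream)
{
    fputs(out_buf, stream);
    out_len = 0;
    out_buf[0] = '\0';
}
```

Line folding for a language with a maximum line length could be added inside out_line in the same way, so that the code generator itself never needs to know about it.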
<p>
The <a href="#Bytebuffer">Bytebuffer</a> type is an important data structure.
It allows for dynamically creating strings of characters
(actually arbitrary 8 bit values).
Most of the operations should be obvious: examine bytebuffer.h.
It is used widely in this code especially to capture sub-pieces
of the generated code that must be saved for out-of-order output.
<h3>Code Generation</h3>
The code generation method used for C is a pretty good
general paradigm, so this discussion will use it as a model.
The gen_ncc procedure is responsible for
creating and dumping the generated C code.
<p>
It has at its disposal several global lists of Symbols.
Note that the lists cross all groups.
<ul>
<li>dimdefs - the set of symbols defining dimensions.
<li>vardefs - the set of symbols defining variables.
<li>attdefs - the set of symbols defining non-global attributes.
<li>gattdefs - the set of symbols defining global attributes.
<li>grpdefs - the set of symbols defining groups.
<li>typdefs - the set of symbols defining types; note that this list
has been topologically sorted so that a given type depends only
on types with lower indices in the list.
</ul>
<p>
The superficial operation of gen_ncc is as follows; the details
are provided later where the operation is complex.
<ol>
<li>Generate header code (e.g. <code>#include &lt;stdio.h&gt;</code>).
<li>Generate C type definitions corresponding to the
CDL types.
<li>Generate VLEN constants.
<li>Generate chunking constants.
<li>Generate initial part of the main() procedure.
<li>Generate C variable definitions to hold the ncids
for all created groups.
<li>Generate C variable definitions to hold the typeids
of all created types.
<li>Generate C variables and constants that correspond
to the CDL dimensions.
<li>Generate C variable definitions to hold the dimids
of all created dimensions.
<li>Generate C variable definitions to hold the varids
of all created variables.
<li>Generate C code to create the netCDF binary file.
<li>Generate C code to create all the groups in the proper
hierarchy.
<li>Generate C code to create the type definitions.
<li>Generate C code to create the dimension definitions.
<li>Generate C code to create the variable definitions.
<li>Generate C code to create the global attributes.
<li>Generate C code to create the non-global attributes.
<li>Generate C code to leave define mode.
<li>Generate C code to assign variable datalists.
</ol>
<p>
The following code generates C code for defining the groups.
It is fairly canonical and can be seen repeated in variant form
when defining dimensions, types, variables, and attributes.
<p>
This code is redundant but for consistency, the root group
ncid is stored like all other group ncids.
Note that nprintf is a macro wrapper around snprintf.
<pre>
nprintf(stmt,sizeof(stmt),"%s%s = ncid;",indented(1),groupncid(rootgroup));
cline(stmt);
</pre>
<p>
The loop walks all group symbols in preorder form
and generates a C call to nc_def_grp
using parameters taken from the group Symbol instance (gsym).
The call to nc_def_grp is followed by a call to the
check_err procedure to verify the operation's result code.
<pre>
for(igrp=0;igrp&lt;listlength(grpdefs);igrp++) {
Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
if(gsym == rootgroup) continue; // ignore root
if(gsym->container == NULL) PANIC("null container");
nprintf(stmt,sizeof(stmt),
"%sstat = nc_def_grp(%s, \"%s\", &%s);",
indented(1),
groupncid(gsym->container),
gsym->name, groupncid(gsym));
cline(stmt); // print the def_grp call
clined(1,"check_err(stat,__LINE__,__FILE__);");
}
flushcode();
</pre>
Note the call to indented(). It generates a string of blanks corresponding
to indentation level N, where N is its argument; level N might produce
more or fewer than N blank characters.
<p>
Note also that one must be careful when dumping names
(e.g. gsym->name above) if the name is expected to contain
UTF-8 characters. For C, UTF-8 works fine in strings, but with
a language like Java, which takes UTF-16 characters,
some special encoding is required to convert the non-ASCII
characters to use the \uxxxx form.
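<p>
The conversion to the \uxxxx form can be sketched as follows. This is an illustrative routine, not code taken from ncgen; the name escape_utf8 is hypothetical. It decodes each UTF-8 sequence to a code point and emits a UTF-16 surrogate pair for code points above U+FFFF. It assumes reasonably well-formed input (overlong encodings are not rejected) and does not escape quotes or backslashes, which a real generator would also need to handle.

```c
#include <stdio.h>
#include <stddef.h>

/* Escape a UTF-8 string into dst using \uXXXX for non-ASCII characters.
   Returns 0 on success, -1 on malformed input or insufficient space. */
int escape_utf8(const char* src, char* dst, size_t dstlen)
{
    const unsigned char* p = (const unsigned char*)src;
    size_t used = 0;
    while(*p) {
        unsigned long cp;
        int extra, i;
        /* The lead byte determines how many continuation bytes follow. */
        if(*p < 0x80) { cp = *p; extra = 0; }
        else if((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;
        p++;
        for(i = 0; i < extra; i++, p++) {
            if((*p & 0xC0) != 0x80) return -1; /* also catches an early NUL */
            cp = (cp << 6) | (*p & 0x3F);
        }
        if(dstlen - used < 13) return -1; /* worst case: two escapes + NUL */
        if(cp < 0x80) {
            dst[used++] = (char)cp;
        } else if(cp < 0x10000) {
            used += (size_t)sprintf(dst + used, "\\u%04lx", cp);
        } else {
            cp -= 0x10000; /* split into a UTF-16 surrogate pair */
            used += (size_t)sprintf(dst + used, "\\u%04lx\\u%04lx",
                                    0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
    if(used >= dstlen) return -1;
    dst[used] = '\0';
    return 0;
}
```

A generator for Java would apply such a routine to every name and string constant before writing it into the generated source.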
<p>
The code to generate dimensions, types, attributes, and variables
is similar, although often more complex.
<p>
The code to generate C equivalents of CDL types is
in the procedure definectype().
Note that this code is not the code that invokes e.g. nc_def_vlen.
The generated C types are used when generating datalists
so that the standard C constant assignment mechanism will produce
the correct memory values.
<p>
For non-C languages, the interaction between this code and the
nc_def_TYPE code may be rather more complex than with C.
<p>
The genc_deftype procedure is the one that actually
generates C code to define the netcdf types.
The generated C code is designed to store the resulting
typeid into the C variable defined earlier
for holding that typeid.
<p>
Note that for compound types, the NC_COMPOUND_OFFSET
macro is normally used to match netcdf offsets to
the corresponding struct type generated in definectype.
However, there is a flag, TESTALIGNMENT,
that can be set to use a computed value for the offset.
And for non-C languages, handling offsets is tricky and is
addressed in more detail below.
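<p>
To make the idea of a computed offset concrete, the following sketch places each field at the next position that satisfies its alignment. It is only an assumption about how such a computation could look, not the TESTALIGNMENT code itself; the struct Obs and the function names are hypothetical. In C, the result can be compared against offsetof (which is what NC_COMPOUND_OFFSET is based on) for the corresponding generated struct.

```c
#include <stddef.h>

/* Hypothetical C struct corresponding to a CDL compound type. */
typedef struct Obs {
    char   flag;
    int    count;
    double value;
} Obs;

/* Round offset up to the next multiple of alignment (alignment > 0). */
size_t align_up(size_t offset, size_t alignment)
{
    size_t rem = offset % alignment;
    return rem == 0 ? offset : offset + (alignment - rem);
}

/* Compute field offsets from per-field sizes and alignments,
   placing each field at the next suitably aligned position. */
void compute_offsets(int nfields, const size_t* sizes,
                     const size_t* alignments, size_t* offsets)
{
    size_t next = 0;
    int i;
    for(i = 0; i < nfields; i++) {
        offsets[i] = align_up(next, alignments[i]);
        next = offsets[i] + sizes[i];
    }
}
```

On a platform where char, int, and double have sizes and alignments 1, 4, and 8, this yields offsets 0, 4, and 8, which would be expected to agree with offsetof(Obs, flag), offsetof(Obs, count), and offsetof(Obs, value). A non-C target language has no struct to consult, so it would have to rely on a computation of this kind.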
<h3>Data Generation Methods</h3>
There are basically three known approaches for generating
the data constants that are passed to, for example, <i>nc_put_vara</i>.
<ol>
<li> For C (and C++) it is possible to generate C language constants
directly into the code using the C initializer syntax.
This is because CDL was originally defined with C in mind.
This method can also be used for FORTRAN when doing classic model only.
<p>
<li> Generate the binary data
and convert it to a large single string constant using
appropriate escaping mechanisms; this was done in the original
ncgen.
This method has the advantage that it can be used for most
languages, but it has (at least) two disadvantages:
(1) it is not generally portable because the machine architecture
influences the memory encoding; (2) it loses all information
about the structure of the memory and hence makes debugging
more difficult.
<p>
<li>Extend the netCDF interface with a set
of operations to build up the memory structure piece by piece.
This is the approach taken in the Java generation code.
<p>
The idea is that one has a set of procedures in C with a simple
interface that can be invoked by the output language.
These procedures do the following.
<ol>
<li>Create a dynamically extendible memory buffer (much like Bytebuffer).
<li>Append an array of instances
of some primitive type to a specified buffer.
<li>Invoke nc_put_vara with a specified buffer.
<li>Reclaim a buffer.
</ol>
Appropriate calls to these procedures can construct any required memory
in a portable fashion.
<p>
This method is appropriate for most non-C languages, including interpreted
languages (e.g., Ruby and Perl), and is probably even the best way to
get FORTRAN to handle the full netcdf-4 data model.
</ol>
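<p>
The C side of the third approach could be sketched as below. The type Membuf and the functions membuf_new, membuf_append, and membuf_free are hypothetical names chosen for illustration; they are not part of the netCDF API. The step that writes the data is omitted so that the sketch compiles without netcdf.h; it would pass the data pointer of the buffer to nc_put_vara.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; all names here are illustrative. */
typedef struct Membuf {
    size_t length;  /* bytes in use */
    size_t alloc;   /* bytes allocated */
    char*  data;
} Membuf;

/* Create an empty buffer; returns NULL on allocation failure. */
Membuf* membuf_new(void)
{
    return (Membuf*)calloc(1, sizeof(Membuf));
}

/* Append count elements of elemsize bytes each; returns 0 on success. */
int membuf_append(Membuf* buf, const void* elems, size_t count, size_t elemsize)
{
    size_t nbytes = count * elemsize;
    if(buf->length + nbytes > buf->alloc) {
        size_t newalloc = buf->alloc ? buf->alloc : 64;
        char* newdata;
        while(newalloc < buf->length + nbytes) newalloc *= 2;
        newdata = (char*)realloc(buf->data, newalloc);
        if(newdata == NULL) return -1;
        buf->data = newdata;
        buf->alloc = newalloc;
    }
    if(nbytes > 0) memcpy(buf->data + buf->length, elems, nbytes);
    buf->length += nbytes;
    return 0;
}

/* Release the buffer and its contents. */
void membuf_free(Membuf* buf)
{
    if(buf == NULL) return;
    free(buf->data);
    free(buf);
}
```

Because the bytes are laid out by compiled C code on the target machine, the generated program in the output language never has to know the memory encoding.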
<h3>Data Generation: Overview</h3>
The way to think about data generation is to consider
the following tree.
<ul>
<li>The root is a convenience and represents the whole
set of variables specified in the CDL "data:" section.
<li>The nodes in the tree just below the root represent
the set of variables to which values are assigned in the
data section.
<li>The subtrees below each variable are the basetypes
of each variable. Thus if a variable x has a basetype
that is a compound type, then the node below x will
represent the whole compound type and the nodes below
that compound type node will be the fields of the compound
type, and so on.
<li>The leaves of this tree are all of primitive type
(e.g. NC_CHAR, NC_INT, NC_STRING).
</ul>
<p>
The data generation code is divided into two
primary groups. One group handles all non-primitive variables
and types. The other group handles all primitive variables
and types (especially fields). The reason for this is that
almost all languages can handle simple lists of primitive values.
However, for non-primitive types, one of the methods from the previous
section needs to be used.
<p>
Secondarily, the primitive handling code is divided into
two groups. One group handles the character type
and the other group handles all other primitive types.
The code for the first group is in chardata.c and is generally
usable across all languages.
<p>
This split exists for historical reasons.
It turns out that it is tricky to properly handle variables
(or Compound type fields) of type NC_CHAR.
Here the term "proper" means to mimic the output of
the original ncgen program. To this end, a set of generically useful routines
is defined in the chardata.c file. These routines take a datasource
and walk it to build a single string of characters, with appropriate fill,
to correspond to a NC_CHAR typed variable or field.
Unless your language has special
requirements, it is probably best to always use these routines to process
datalists for variables of type NC_CHAR.
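<p>
The general idea of building a single character string with fill can be sketched as follows. This is a deliberately simplified illustration and not the chardata.c algorithm, whose rules for matching the original ncgen output are more involved; the function name build_chars and its parameters are hypothetical.

```c
#include <string.h>

/* Concatenate nstrings string constants into dst, then pad with fill
   up to total characters. Returns the number of characters written
   (excluding the terminating NUL). dst must hold total+1 bytes. */
size_t build_chars(const char* const* strings, int nstrings,
                   char fill, size_t total, char* dst)
{
    size_t used = 0;
    int i;
    for(i = 0; i < nstrings && used < total; i++) {
        size_t n = strlen(strings[i]);
        if(n > total - used) n = total - used; /* truncate excess */
        memcpy(dst + used, strings[i], n);
        used += n;
    }
    while(used < total) dst[used++] = fill;
    dst[used] = '\0';
    return used;
}
```

For example, the constants "ab" and "c" assigned to a character variable of five elements with fill character '_' would produce the string "abc__".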
<h3>Data Generation: Part I</h3>
Data generation occurs in several places, but is roughly
divided into two parts. First, the genc.c code will set up
appropriate declarations to hold the data. Second, the code
in cdata.c will generate the actual memory contents that must be
passed to nc_put_vara.
<p>
As a rule, the genc.c code calls a limited set of
entry points into cdata.c. Again as a rule,
cdata.c does not call genc.c code except for the closure
mechanism described below.
<p>
The critical pieces of code for part I are the procedures
genc_defineattr() and genc_definevardata() in genc.c.
<h4>genc_definevardata</h4>
This procedure is responsible for generating C constants corresponding
to the data to be assigned to a variable as defined in the "data:" section
of a CDL file. It is also responsible for
generating the appropriate nc_put_vara_XXX code to actually assign
the data to the variable.
<h4>genc_defineattr</h4>
This procedure is responsible for generating C constants corresponding
to the data to be assigned to an attribute as defined
in a CDL file. It is also responsible for
generating the appropriate nc_put_att_XXX code to actually define
the attribute.
<p>
As with variables, defining attributes of type NC_CHAR requires use
of the gen_charXXX procedures.
<h3>Data Generation: Part II</h3>
The procedures in cdata.c walk a datalist
and generate a sequence of space-separated constants,
possibly with nested paired braces ("{...}") as needed.
The result is placed into a specified Bytebuffer.
<p>
As an aside, commas are added when needed to the list of constants
using the <i>commify</i> procedure.
<p>
There are three primary procedures that are called from
the genc.c code.
<ul>
<li>genc_attrdata --
store (in its Bytebuffer argument) the sequence of constants
corresponding to a given attribute datalist.
<li>genc_scalardata --
store the single constant (which may be of a user-defined type)
corresponding to its variable's datalist.
<li>genc_arraydata --
store the vector of constants corresponding to its variable's datalist.
This is by far the most complicated of the three procedures.
</ul>
<p>
Internally, each of these three procedures invokes
the <i>genc_data</i> procedure to process part of a datalist.
<h3>Closures and VLEN</h3>
Closures and VLEN handling are two rather specialized mechanisms.
<h4>Closures</h4>
The data generation code uses a concept of closure or callback
to allow the datalist processing to periodically
call external code to do the actual C code generation.
The reason for this is that it significantly improves
performance if the generated datalist is periodically
dumped to the netcdf .nc file using <i>nc_put_vara</i>.
Note that the closure mechanism is only used for generating
variable data; attributes cannot use this mechanism
since they are defined all at once.
<p>
Basically, each call to the callback will generate
C code for some C constants and calls to nc_put_vara().
The closure data structure (struct Putvar) is defined as follows.
<pre>
typedef struct Putvar {
int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
int rank;
Bytebuffer* code;
size_t startset[NC_MAX_VAR_DIMS];
struct CDF {
int grpid;
int varid;
} cdf;
struct C {
Symbol* var;
} c;
} Putvar;
</pre>
An instance of the closure is created for
each variable that is the target of nc_put_vara().
It is initialized with the variable's symbol, rank, group id and variable
id. It is also provided with a Bytebuffer into which it is supposed
to store the generated C code.
The startset is the cached previous set of dimension indices used
for generating the nc_put_vara (see below).
<p>
For generating C code, the callback field ("putvar")
is assigned the procedure cputvara()
(defined in genc.c).
This procedure takes as arguments the closure object,
an <a href="#odometer">odometer</a> describing the current set of dimension indices,
and a Bytebuffer containing the generated C constants
to be assigned to this slice of the variable.
<p>
Every time the closure procedure is called, it generates a C variable
to hold the generated C constant. It also generates
C constants to hold the start and count vectors required
by <i>nc_put_vara</i>. It then generates an <i>nc_put_vara()</i> call.
The start vector argument for the nc_put_vara is defined by the startset
field of the closure. The count vector argument to nc_put_vara
is computed from the current cached
start vector and from the indices in the odometer.
After the nc_put_vara() is generated, the odometer vector
is assigned to the startset field in the closure for use on the next call.
<p>
There are some important assumptions about the state of the odometer
when it is called.
<ol>
<li>The zeroth index controls the count set.
<li>All other indices are assumed to be at their max values.
</ol>
<p>
In particular, this means that the start vector is zero
for all positions except position zero. Likewise, for all positions
except zero, the count vector entry is the index in the odometer,
which is assumed to be the max.
<p>
For start position zero, the position is taken from the last
saved startset. The count position zero is the difference between
that last start position and the current odometer zeroth index.
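<p>
The start and count computation described above can be sketched as follows. This is an illustration of the stated rules rather than the cputvara() code; the function name slab_vectors and its parameter names are hypothetical, and the odometer is reduced to a plain array of indices.

```c
#include <stddef.h>

/* Compute the start and count vectors for one slab, given the start
   position saved by the previous call (startset) and the current
   odometer indices. Afterwards, startset is advanced for the next call.
   Assumes rank >= 1. */
void slab_vectors(int rank, size_t* startset, const size_t* odomindex,
                  size_t* start, size_t* count)
{
    int i;
    /* Position zero: resume where the previous slab ended. */
    start[0] = startset[0];
    count[0] = odomindex[0] - startset[0];
    /* Other positions: whole extent, assumed to be at its maximum. */
    for(i = 1; i < rank; i++) {
        start[i] = 0;
        count[i] = odomindex[i];
    }
    /* Cache the odometer indices as the next starting point. */
    for(i = 0; i < rank; i++) startset[i] = odomindex[i];
}
```

For a 10 x 3 variable, a first call with odometer indices (4,3) gives start (0,0) and count (4,3); a second call with indices (10,3) gives start (4,0) and count (6,3), so the two slabs together cover the whole variable.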
<h4>VLEN Constants</h4>
VLEN constants need to be constructed
as separate C data constants because
the C compiler will never convert nested
groups ({...}) to separate memory chunks.
Thus, ncgen must in several places
generate the VLEN constants as separate variables
and then insert pointers to them in the appropriate
places in the later datalist C constants.
Note that this process can be very tricky
for non-C languages (see genj.c and jdata.c for one approach).
<p>
As an optimization, ncgen tracks which datatypes
will require use of vlen constants.
This is any type whose definition is a vlen or whose
basetype contains a vlen type.
<p>
The vlen generation process is two-fold.
First, in the procedure processdatalist1() in semantics.c,
the location of the struct Datalist objects
that correspond to vlen constants is stored in a list called vlenconstants.
When detected, each such Datalist object is tagged with
a unique identifier and the vlen length (count).
These will be used later to generate references to the vlen constant.
These counts are only accurate for non-char typed variables;
special handling is in place to handle character vlen constants.
<p>
The second vlen constant processing action is in the
procedure genc_vlenconstant() in cdata.c. It walks the
vlenconstants list and generates C code for C variables to
define the vlen constant and C code to assign the vlen
constant's data to that C variable.
<p>
When, later, the genc_datalist procedure encounters
a Datalist tagged as representing a vlen constant, it can generate
a nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
and use it directly in the generated C datalist constant.
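<p>
The shape of the generated C for vlen constants might look roughly like the following. The variable names are invented for illustration and do not reflect the names ncgen actually generates. A local struct stands in for nc_vlen_t (a length followed by a pointer) so that the sketch compiles without netcdf.h. The CDL being modelled is assumed to be a variable of a vlen-of-int type with the data <code>{1, 2, 3}, {4, 5}</code>.

```c
#include <stddef.h>

/* Local stand-in with the same layout idea as netCDF's nc_vlen_t
   (a length plus a pointer), so the sketch compiles without netcdf.h. */
typedef struct vlen_sketch_t {
    size_t len;
    void*  p;
} vlen_sketch_t;

/* Each vlen constant is emitted first as its own C array ... */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* ... and the datalist constant then refers to each one by count and name. */
static vlen_sketch_t v_data[] = {
    {3, (void*)vlen_1},
    {2, (void*)vlen_2}
};
```

Writing the nested braces directly inside v_data would not work, because the C compiler would try to treat them as initializers for the struct members rather than as separately allocated arrays.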
<h2>Utility Data Structures</h2>
<h3>Pool Memory Allocation</h3>
As an approximation to garbage collection,
this code uses a pool allocation mechanism.
The goal is to allow dynamic construction of strings
that have very short life-times; typically they are used
to construct strings to send to the output file.
<p>
The pool mechanism wraps malloc and records the malloc'd
memory in a circular buffer. When the buffer reaches its maximum
size, previously allocated pool buffers are free'd.
This is good in that the user does not have to litter
code with free() statements. It is bad in that the pool
allocated memory can be free'd too early if the memory
does not have a short enough life.
If you suspect the latter, then bump the size of the circular buffer
and see if the problem goes away. If so, then your code
is probably holding on to a pool buffer too long and should use
regular malloc/free.
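<p>
The mechanism can be sketched in a few lines. This is a simplified illustration under assumed names (pool_alloc, POOL_MAX), not the ncgen pool code; the pool size is kept tiny so that the reuse behavior is easy to see.

```c
#include <stdlib.h>

#define POOL_MAX 4 /* deliberately tiny; a real pool would be larger */

static void* pool[POOL_MAX]; /* circular record of live allocations */
static int pool_next = 0;    /* slot that the next allocation reuses */

/* Allocate nbytes, freeing whatever allocation previously occupied
   the slot being reused. Callers never free pool memory themselves. */
void* pool_alloc(size_t nbytes)
{
    void* mem = malloc(nbytes);
    if(mem == NULL) return NULL;
    free(pool[pool_next]); /* free(NULL) is a no-op for unused slots */
    pool[pool_next] = mem;
    pool_next = (pool_next + 1) % POOL_MAX;
    return mem;
}
```

With POOL_MAX set to 4, the fifth call to pool_alloc frees the memory returned by the first call. Any pointer held across more than POOL_MAX - 1 later allocations is therefore dangling, which is exactly the early-free hazard described above.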
<p>
In the end, I am not sure if this is a good idea, but
it does make the code simpler.
<h3><a name="List">List</a> and <a name="Bytebuffer">Bytebuffer</a></h3>
The two datatypes List and Bytebuffer are used throughout the
code. They correspond closely in semantics to the Java ArrayList
and StringBuffer types, respectively. They are used to help
encapsulate dynamically growing lists of objects or bytes
to reduce certain kinds of memory allocation errors.
<p>
The canonical code for non-destructive walking of a List&lt;T&gt;
is as follows.
<pre>
for(i=0;i&lt;listlength(list);i++) {
T* element = (T*)listget(list,i);
...
}
</pre>
<p>
Bytebuffer provides two ways to access its internal buffer of characters.
One is "bbContents()", which returns a direct pointer to the buffer,
and the other is "bbDup()", which returns a malloc'd string containing
the contents and is guaranteed to be null terminated.
<h3><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h3>
The odometer data type is used to convert
multiple dimensions into a single integer.
The rule for converting a multi-dimensional
array to a single dimension is as follows.
<p>
Suppose we have the declaration <code>int F[2][5][3];</code>.
There are obviously a total of 2 X 5 X 3 = 30 integers in F.
Thus, these three dimensions will be reduced to a single dimension of size 30.
<p>
A particular point in the three dimensions, say [x][y][z], is reduced to
a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
The corresponding general C code is as follows.
<pre>
size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
int i;
size_t count = 0;
for(i=0;i&lt;rank;i++) {
if(i > 0) count *= sizes[i];
count += indices[i];
}
return count;
}
</pre>
In this code, the indices variable corresponds to the x,y, and z.
The sizes variable corresponds to the 2,5, and 3.
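<p>
As a worked check, the point [1][2][2] maps to ((1*5)+2)*3+2 = 23. The sketch below repeats the dimmap computation under the hypothetical name linear_index and verifies that, for every element of <code>int F[2][5][3]</code>, the computed index matches the position that C itself assigns to that element in memory.

```c
#include <stddef.h>

/* Same computation as dimmap above: fold indices into one offset. */
size_t linear_index(int rank, const size_t* indices, const size_t* sizes)
{
    size_t count = 0;
    int i;
    for(i = 0; i < rank; i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}

/* Return 1 if linear_index agrees with C's own row-major layout for
   every element of int F[2][5][3], else 0. */
int check_layout(void)
{
    static int F[2][5][3];
    const size_t sizes[3] = {2, 5, 3};
    size_t x, y, z;
    for(x = 0; x < 2; x++)
        for(y = 0; y < 5; y++)
            for(z = 0; z < 3; z++) {
                size_t indices[3];
                size_t byteoffset;
                indices[0] = x; indices[1] = y; indices[2] = z;
                /* Byte distance of F[x][y][z] from the start of F. */
                byteoffset = (size_t)((char*)&F[x][y][z] - (char*)F);
                if(byteoffset != linear_index(3, indices, sizes) * sizeof(int))
                    return 0;
            }
    return 1;
}
```

Note that the size of the first dimension (2) never enters the computation; only the sizes of the later dimensions are used as multipliers.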
<p>
The Odometer type stores a set of dimensions
and supports operations to iterate over all possible
dimension combinations.
An odometer is defined by the types Odometer and Dimdata.
<pre>
typedef struct Dimdata {
unsigned long datasize; // actual size of the datalist item
unsigned long index; // 0 <= index < datasize
unsigned long declsize;
} Dimdata;
typedef struct Odometer {
int rank;
Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
</pre>
The following primary operations are defined.
<ul>
<li>Odometer* newodometer(Dimset*) - create an odometer from a set of Dimsets.
<li>void freeodometer(Odometer*) - release the memory of an odometer.
<li>int odometermore(Odometer* odom) - return 1 if there are more combinations
of dimension values.
<li>int odometerincr(Odometer* odo,int) - move to the next combination
of dimension values.
<li>unsigned long odometercount(Odometer* odo) -
apply the above algorithm to convert the current odometer combination
into a single integer.
</ul>
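<p>
How these operations fit together can be sketched with a reduced odometer type. The type Odom and the functions odom_more, odom_incr, and odom_count are hypothetical simplifications: they keep only a size and a current index per dimension and do not reproduce the datasize/declsize distinction or the exact signatures of the real procedures.

```c
#include <stddef.h>

#define ODOM_MAX_RANK 8 /* illustrative; the real type uses NC_MAX_VAR_DIMS */

/* Simplified odometer: only a size and a current index per dimension. */
typedef struct Odom {
    int rank;
    size_t size[ODOM_MAX_RANK];
    size_t index[ODOM_MAX_RANK];
} Odom;

/* Return 1 while the odometer still denotes a valid combination. */
int odom_more(const Odom* odom)
{
    return odom->index[0] < odom->size[0];
}

/* Advance to the next combination; the last dimension varies fastest.
   Index 0 is allowed to reach size[0], which signals exhaustion. */
void odom_incr(Odom* odom)
{
    int i;
    for(i = odom->rank - 1; i >= 0; i--) {
        odom->index[i]++;
        if(i == 0 || odom->index[i] < odom->size[i]) break;
        odom->index[i] = 0; /* wrap and carry into the next dimension */
    }
}

/* Fold the current indices into a single linear offset. */
size_t odom_count(const Odom* odom)
{
    size_t count = 0;
    int i;
    for(i = 0; i < odom->rank; i++) {
        if(i > 0) count *= odom->size[i];
        count += odom->index[i];
    }
    return count;
}
```

Iterating with odom_more and odom_incr over sizes 2, 5, and 3 visits 30 combinations, and odom_count returns 0 through 29 in order along the way.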
<h2>Misc. Notes</h2>
<ul>
<li> The flag "usingclassic" should be consulted when appropriate to determine
whether this CDL file should be treated as using only the netCDF classic model.
</ul>
<h2><u>Change Log</u></h2>
<ul>
<li>07/04/2009 - First draft.
</ul>
</body>
</html>
<p>
<i>genc_scalardata</i> or <i>genc_arraydata</i>.
It stores in its Bytebuffer argument the sequence of constants
corresponding to a given datalist. Handling commas is a tricky issue,
so you will find that many of the non-top-level routines in cdata.c
take a pointer to a global state element, commap, that determines the
current state of adding commas. The idea is that at the beginning of
any (sub-)Datalist, we want to turn off the comma in front of the
first generated constant and then add commas until we reach the end
of that (sub-)Datalist.
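<p>
The comma rule can be illustrated with a small recursive sketch. It is not the cdata.c code: the Node type and the functions emit and emit_list are hypothetical, the constants are restricted to integers, and a local flag per (sub-)list is used in place of the shared commap state pointer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical nested list of integer constants: a node is either a
   leaf value (nitems == 0) or a sublist of nitems child nodes. */
typedef struct Node {
    int value;
    int nitems;
    const struct Node* items;
} Node;

/* Append text to dst (capacity cap), silently truncating. */
static void emit(char* dst, size_t cap, const char* text)
{
    size_t len = strlen(dst);
    if(len + 1 < cap) strncat(dst, text, cap - len - 1);
}

/* Emit a (sub-)list: the comma flag starts off, so no comma precedes
   the first constant, and is turned on for every constant after it. */
void emit_list(const Node* list, char* dst, size_t cap)
{
    int comma = 0; /* plays the role of the comma state */
    int i;
    emit(dst, cap, "{");
    for(i = 0; i < list->nitems; i++) {
        const Node* item = &list->items[i];
        if(comma) emit(dst, cap, ", ");
        comma = 1;
        if(item->nitems > 0) {
            emit_list(item, dst, cap); /* nested list restarts the state */
        } else {
            char num[32];
            snprintf(num, sizeof(num), "%d", item->value);
            emit(dst, cap, num);
        }
    }
    emit(dst, cap, "}");
}
```

A list holding 1, the sublist (3, 4), and 5 is emitted as <code>{1, {3, 4}, 5}</code>: no comma follows an opening brace, and the sublist as a whole counts as one item of the enclosing list.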
<h1><u><a name="GIT">General Internals Information</a></u></h1>
<h2><u>Primary NCGEN Data Structures</u></h2>
There are two primary structures used in ncgen:
<a href="#Symbol">struct Symbol</a> and
<a href="#Datalist">struct Datalist</a>.
<h3><a name="Symbol">struct Symbol</a></h3>
Symbol objects are linked into hierarchical structures
to represent netcdf dimensions, types, groups, and variables.
The struct has the following fields.
<table>
<tr><th colspan=3>struct Symbol Fields
<tr valign=top><td>struct Symbol* next<td>-<td>
The Symbol objects are all kept on a single linked list.
No symbol is ever deleted until the end of the program.
<tr valign=top><td>nc_class objectclass<td>-<td>
This defines the general class of symbol, one of: NC_GRP, NC_DIM, NC_VAR, NC_ATT, or NC_TYPE.
<tr valign=top><td>nc_class subclass<td>-<td>
This defines the sub class of symbol, one of:
NC_PRIM, NC_OPAQUE, NC_ENUM,
NC_FIELD, NC_VLEN, NC_COMPOUND,
NC_ECONST, NC_ARRAY, or NC_FILLVALUE.
<tr valign=top><td>char* name<td>-<td>
The symbol's name.
<tr valign=top><td>struct&nbsp;Symbol*&nbsp;container<td>-<td>
The symbol that is the container for this symbol.
Typically, this is the group symbol that contains
this symbol.
<tr valign=top><td>struct Symbol* location<td>-<td>
The current group that was open when this symbol was created.
<tr valign=top><td>List* subnodes<td>-<td>
The list of child symbols of this symbol.
For example, a group symbol will have its dimensions,
types, vars, and subgroups in this list.
<tr valign=top><td>int is_prefixed<td>-<td>
True if the name of this symbol contains a complete
prefix path (e.g. /x/y/z).
<tr valign=top><td>List* prefix<td>-<td>
A list of the prefix names for this node.
Note that if is_prefixed is false, then this
list was constructed from the set of enclosing groups.
<tr valign=top><td>struct Datalist* data<td>-<td>
Stores the constants from attribute or datalist
constructs.
<tr valign=top><td>Typeinfo typ<td>-<td>
Type information about this symbol
as defined by the Typeinfo structure.
<tr valign=top><td>Varinfo var<td>-<td>
Variable information about a variable symbol
as defined by the Varinfo structure.
<tr valign=top><td>Attrinfo att<td>-<td>
Attribute information about an attribute symbol
as defined by the Attrinfo structure.
<tr valign=top><td>Diminfo dim<td>-<td>
Dimension information about a dimension symbol
as defined by the Diminfo structure.
<tr valign=top><td>Groupinfo grp<td>-<td>
Group information about a group symbol
as defined by the Groupinfo structure.
<tr valign=top><td>int lineno<td>-<td>
The source line in which this symbol was created.
<tr valign=top><td>int touched<td>-<td>
Used in transitive closure operations
to prevent revisiting symbols.
<tr valign=top><td>char* lname<td>-<td>
Cached C or FORTRAN name (not used?).
<tr valign=top><td>int ncid<td>-<td>
The ncid/varid/dimid, etc., used when
defining netcdf objects.
</table>
<h4>struct Groupinfo</h4>
Group symbols primarily keep the group
containment structure in the subnodes field of the Symbol.
<p>
<table>
<tr><th colspan=3>struct Groupinfo Fields
<tr valign=top><td>int is_root<td>-<td>
Is this the root group?
</table>
<h4>struct Diminfo</h4>
The only important information about a dimension,
aside from name, is the dimension size.
Additionally, type definitions may have anonymous
(unnamed) dimensions.
<p>
<table>
<tr><th colspan=3>struct Diminfo Fields
<tr valign=top><td>int isconstant<td>-<td>
Is this an anonymous dimension?
<tr valign=top><td>unsigned int size<td>-<td>
The size of the dimension.
</table>
<h4>struct Varinfo</h4>
Variables require two primary pieces of information:
the set of attributes (including special attributes)
and dimension information. The dimension information
is kept in the Typeinfo structure because things
other than variables have dimensions (e.g. user defined types).
<p>
<table>
<tr><th colspan=3>struct Varinfo Fields
<tr valign=top><td>int nattributes<td>-<td>
The number of attributes; this is redundant but useful.
<tr valign=top><td>List* attributes<td>-<td>
The list of all attribute symbols associated with this
variable.
<tr valign=top><td>Specialdata special<td>-<td>
Special attribute values.
</table>
<h4>struct Typeinfo</h4>
The type information is probably the second most
used structure in all of the code (second to Symbol itself).
<p>
<table>
<tr><th colspan=3>struct Typeinfo Fields
<tr valign=top><td>struct Symbol* basetype<td>-<td>
Provide a reference to the base type of this symbol.
This applies to other types, variables, and attributes.
<tr valign=top><td>int hasvlen<td>-<td>
Does the type have a vlen definition anywhere within it?
This is used as an optimization to avoid searching datalists
for vlen constants.
<tr valign=top><td>nc_type typecode<td>-<td>
The typecode of the basetype. This is most useful
when the basetype is a primitive type.
<tr valign=top><td>unsigned long size<td>-<td>
The size of this object.
<tr valign=top><td>unsigned long offset<td>-<td>
The field offset for fields in compound types.
<tr valign=top><td>unsigned long alignment<td>-<td>
The memory alignment (i.e. 1,2,4,or 8).
<tr valign=top><td>Constant econst<td>-<td>
For enumeration constants, the actual value of the constant.
<tr valign=top><td>Dimset dimset<td>-<td>
The dimension information for the type or variable.
The dimset stores the number of dimensions and a list
of pointers to the corresponding dimension symbols.
</table>
<h4>struct Attrinfo</h4>
Note that the actual attribute data is stored
in the data field of the containing Symbol.
<p>
<table>
<tr><th colspan=3>struct Attrinfo Fields
<tr valign=top><td>struct Symbol* var<td>-<td>
The variable with which this attribute is associated;
it is NULL for global attributes.
<tr valign=top><td>unsigned long count<td>-<td>
The number of instances associated with the attribute value.
</table>
<h3><a name="Datalist">Datalists and Datasrcs</a></h3>
Whenever a datalist is encountered during parsing, it is converted
to an instance of struct Datalist.
Each datalist instance contains a vector of instances of
struct Constant that contains the actual data.
<p>
Each datalist instance contains the following information.
<table>
<tr><th colspan=3>struct Datalist Fields
<tr valign=top><td>struct Datalist* next<td>-<td>
All datalists are chained for reclamation.
<tr valign=top><td>int readonly<td>-<td>
Can this datalist be modified?
<tr valign=top><td>unsigned int length<td>-<td>
The number of Constant instances in the data field.
<tr valign=top><td>unsigned int alloc<td>-<td>
The memory space allocated to the data field.
<tr valign=top><td>Constant* data<td>-<td>
The vector in sequential memory of the constants comprising this datalist.
<tr valign=top><td>struct Symbol* schema<td>-<td>
The symbol (type, variable, or attribute) defining the structure of this datalist,
if known.
<tr valign=top><td>struct Vlen {<td>-<td>
Information about the vlen instances contained in this datalist.
<tr><td>unsigned int count;
<tr><td>unsigned int uid;
<tr><td>} vlen
<tr valign=top><td>Odometer* dimdata<td>-<td>
A tracker to count through dimensions associated with this datalist via the schema.
</table>
<p>
In turn, a Constant instance is defined as follows.
<pre>
typedef struct Constant {
nc_type nctype;
int lineno;
Constvalue value;
} Constant;
</pre>
It indicates the type of the value and the source line number (if known)
in which this constant was created.
<p>
The Constvalue type is a union
of all possible values that can occur
in a datalist.
<pre>
typedef union Constvalue {
struct Datalist* compoundv; // NC_COMPOUND
char charv; // NC_CHAR
signed char int8v; // NC_BYTE
unsigned char uint8v; // NC_UBYTE
short int16v; // NC_SHORT
unsigned short uint16v; // NC_USHORT
int int32v; // NC_INT
unsigned int uint32v; // NC_UINT
long long int64v; // NC_INT64
unsigned long long uint64v; // NC_UINT64
float floatv; // NC_FLOAT
double doublev; // NC_DOUBLE
struct Stringv { // NC_STRING
int len;
char* stringv;
} stringv;
struct Opaquev { // NC_OPAQUE
int len; // length as originally written (rounded to even number)
char* stringv; //as constant was written
// (padded to even # chars >= 16)
// without leading 0x
} opaquev;
struct Symbol* enumv; // NC_ECONST
} Constvalue;
</pre>
<p>
Several fields are of particular interest:
<table>
<tr><th colspan=3>Selected Constvalue Fields
<tr valign=top><td>struct&nbsp;Datalist*&nbsp;compoundv<td>-<td>
This stores nested datalists - typically
of the form "{...{...}...}".
<tr valign=top><td>struct&nbsp;Stringv&nbsp;{int&nbsp;len;&nbsp;char*&nbsp;stringv;}&nbsp;stringv<td>-<td>
Store string constants.
<tr valign=top><td>struct&nbsp;Opaquev&nbsp;{int&nbsp;len;&nbsp;char*&nbsp;stringv;}&nbsp;opaquev<td>-<td>
Store opaque constants as written (i.e. abc...),
without the leading 0x, and
padded to an even number of characters to be
at least 16 characters long.
<tr valign=top><td>struct Symbol* enumv<td>-<td>
Pointer to an enumeration constant definition.
</table>
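<p>
As a rough sketch of how the nctype tag selects the active union member,
the following uses cut-down stand-ins for Constant and Constvalue.
The Kind enum, the Const and Value types, and the constant_as_double helper
are hypothetical names introduced only for this illustration; the real code
uses nc_type codes and the full union shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the nc_type codes; values are illustrative only. */
typedef enum Kind { K_INT, K_DOUBLE, K_STRING } Kind;

/* Cut-down version of the Constvalue union shown above. */
typedef union Value {
    int int32v;
    double doublev;
    struct { int len; const char* stringv; } stringv;
} Value;

/* Cut-down version of struct Constant: a tag, a line number, and a value. */
typedef struct Const {
    Kind kind;
    int lineno;
    Value value;
} Const;

/* Read a numeric constant as a double; the tag decides which union
   member is valid. Returns 1 on success, 0 if the constant is not numeric. */
int
constant_as_double(const Const* con, double* result)
{
    assert(con != NULL && result != NULL);
    switch (con->kind) {
    case K_INT:
        *result = (double)con->value.int32v;
        return 1;
    case K_DOUBLE:
        *result = con->value.doublev;
        return 1;
    default:
        return 0; /* strings are not converted in this sketch */
    }
}
```

Reading any member other than the one named by the tag is meaningless, which is
why every consumer of a Constant is expected to switch on the tag first.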
<h4>struct Datasrc</h4>
When it comes time to generate datalists for output,
it is necessary to "walk" the datalist (including nested
datalists). The Datasrc structure is used to do this.
Its definition is as follows.
<pre>
typedef struct Datasrc {
unsigned int index; // 0..length-1
unsigned int length;
int autopop; // pop when at end
Constant* data; // duplicate pointer; so do not free.
struct Datasrc* stack;
} Datasrc;
</pre>
The Datasrc tracks the "current" location in the sequence
of Constants (taken from a Datalist). The index field indicates
the current location.
In effect, Datasrc is the lexer, and the code
that is walking it is parsing the data sequence.
The following operations are supported (see data.[ch]).
<ul>
<li>datalist2src - takes a Datalist and constructs a Datasrc.
<li>srcpush - assumes the current constant is a nested Datalist
and pushes into that Datalist.
<li>srcpushlist - pushes into the passed Datalist argument.
<li>srcpop - pops the current list and resumes the next list in the
stack.
<li>srcnext - return the value at the index
and then advance the Datasrc index.
If at the end of the current datalist, then return NULL;
srcincr is an alias for srcnext.
<li>srcmore - return 1 if not at the end of the current Datasrc.
Pushed datalists are not considered.
<li>srcline - return a usable line number associated with the current
position of the Datasrc (that is why Constant instances have a line
number).
<li>srcpeek - return the value at the index but do not advance.
If at the end of the current datalist, then return NULL; srcget is an alias
for srcpeek.
</ul>
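<p>
The walking operations above can be sketched roughly as follows.
This is a simplified re-implementation for illustration, not the code in data.c:
the Demo* types and demo_* functions are made-up names, constants hold only an int
or a nested list, and the stack of suspended lists is a fixed array rather than
a chain of Datasrc instances.

```c
#include <assert.h>
#include <stddef.h>

struct DemoList;

/* Simplified constant: an int, or a nested list when list != NULL. */
typedef struct DemoConst {
    int value;
    struct DemoList* list;
} DemoConst;

typedef struct DemoList {
    unsigned int length;
    DemoConst* data;
} DemoList;

#define DEMO_MAXDEPTH 16
typedef struct DemoPos {
    unsigned int index;
    unsigned int length;
    DemoConst* data;
} DemoPos;

typedef struct DemoSrc {
    DemoPos cur;                  /* current list and position */
    DemoPos saved[DEMO_MAXDEPTH]; /* suspended outer lists */
    int depth;
} DemoSrc;

void demo_init(DemoSrc* src, DemoList* list)
{
    src->cur.index = 0;
    src->cur.length = list->length;
    src->cur.data = list->data;
    src->depth = 0;
}

/* Like srcmore: only the current list is considered. */
int demo_more(const DemoSrc* src) { return src->cur.index < src->cur.length; }

/* Like srcpeek: return the current constant without advancing. */
DemoConst* demo_peek(DemoSrc* src)
{
    return demo_more(src) ? &src->cur.data[src->cur.index] : NULL;
}

/* Like srcnext: return the current constant, then advance. */
DemoConst* demo_next(DemoSrc* src)
{
    DemoConst* con = demo_peek(src);
    if (con != NULL) src->cur.index++;
    return con;
}

/* Like srcpush: the current constant must be a nested list. */
void demo_push(DemoSrc* src)
{
    DemoConst* con = demo_peek(src);
    assert(con != NULL && con->list != NULL && src->depth < DEMO_MAXDEPTH);
    src->saved[src->depth] = src->cur;
    src->saved[src->depth].index++; /* resume after the nested constant */
    src->depth++;
    src->cur.index = 0;
    src->cur.length = con->list->length;
    src->cur.data = con->list->data;
}

/* Like srcpop: resume the suspended outer list. */
void demo_pop(DemoSrc* src)
{
    assert(src->depth > 0);
    src->cur = src->saved[--src->depth];
}

/* Example walker: sum every int, descending into nested lists. */
int demo_sum(DemoSrc* src)
{
    int total = 0;
    while (demo_more(src)) {
        if (demo_peek(src)->list != NULL) {
            demo_push(src);
            total += demo_sum(src);
            demo_pop(src);
        } else {
            total += demo_next(src)->value;
        }
    }
    return total;
}
```

With the list {1, {2, 3}, 4}, demo_sum returns 10 and leaves the source at its end,
so demo_more returns 0 and demo_next returns NULL.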
<h2><u>The CDL Parser</u></h2>
The CDL parser and associated lexer
(primarily files "ncgen.y" and "ncgen.l")
parse CDL files into various data structures
for use by the remaining ncgen code.
The data structures described above
(<a href="#Symbol">Symbol</a> and
<a href="#Datalist">Datalist</a>)
are primarily generated by the parser.
<h3>Parse Cliches</h3>
<h4>Node Stacking</h4>
One of the issues that must be addressed by any bottom-up
parser is handling the accumulation of sets of items (nodes,
etc.). The YACC/Bison parse stack cannot be used
because the set of accumulated nodes is unbounded
and the YACC stack mechanism is bounded (i.e. each rule
has a bounded right hand side length).
<p>
The node stacking set of cliches is ubiquitous in the
parser, so they must be understood to understand how the
parser works. The cliche here is shown in the handling of,
for example, the varlist rule, which is defined as follows.
<pre>
varlist: varspec
{$$=listlength(stack); listpush(stack,(elem_t)$1);}
| varlist ',' varspec
{$$=$1; listpush(stack,(elem_t)$3);}
;
</pre>
The varlist rule collects variable name declarations (via the varspec rule).
The idea is to use a separate stack named "stack", and to track
the index into the stack of the start of the collection of objects.
The varlist value (in the YACC sense) is defined as an integer
representing the size of the stack at the start of a list of variables.
That is what this code does: <code>$$=listlength(stack)</code>.
<p>
At the point where the set of varspecs should be processed, the following code cliche
is used.
<pre>
vardecl: typeref varlist
{...
stackbase=$2;
stacklen=listlength(stack);
for(i=stackbase;i&lt;stacklen;i++) {
Symbol* sym = (Symbol*)listget(stack,i);
...
}
listsetlength(stack,stackbase);// remove stack nodes
}
...
</pre>
The start of the set of variable declaration symbols is extracted
as the integer associated with right-side non-terminal $2, i.e.
<code>stackbase=$2</code>.
The current stack length is obtained from <code>stacklen=listlength(stack)</code>.
Then the elements of the stack are extracted one by one using the above loop.
Finally, the nodes on the stack are cleared by the code segment
<code>listsetlength(stack,stackbase)</code>.
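<p>
Outside of YACC, the same cliche can be sketched in plain C as follows.
The NodeStack type and nodestack_* functions are hypothetical stand-ins for
the parser's List named "stack" and its list* operations; the point is only to show
the base-index bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

#define NODESTACK_MAX 64

/* Stand-in for the parser's auxiliary List called "stack". */
typedef struct NodeStack {
    size_t length;
    const char* elems[NODESTACK_MAX];
} NodeStack;

void
nodestack_push(NodeStack* s, const char* elem)
{
    assert(s != NULL && s->length < NODESTACK_MAX);
    s->elems[s->length++] = elem;
}

/* First alternative of varlist: record where the collection starts,
   then push the first element. The returned base plays the role of $$. */
size_t
nodestack_begin(NodeStack* s, const char* first)
{
    size_t base = s->length;
    nodestack_push(s, first);
    return base;
}

/* vardecl action: copy out elements [base,length) and then drop them.
   Returns the number of elements collected. */
size_t
nodestack_collect(NodeStack* s, size_t base, const char** out, size_t outmax)
{
    size_t i, n = 0;
    assert(base <= s->length);
    for (i = base; i < s->length && n < outmax; i++)
        out[n++] = s->elems[i];
    s->length = base; /* same effect as listsetlength(stack,stackbase) */
    return n;
}
```

Because only the elements at or above the base are consumed, anything pushed earlier
by an enclosing rule is left untouched on the stack.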
<h4><u>Semantic Processing</u></h4>
Semantic processing takes the output of the parser
and adds various pieces of semantic information.
The semantic actions are as follows.
<ol>
<li> Procedure processtypes().
<ol>
<li>Do a topological sort of the types based on dependency
so that the least dependent are first in the typdefs list.
<li>Fill in type typecodes.
<li>Mark types that have a vlen.
</ol>
<li> Procedure filltypecodes() - Fill in implied type codes.
<li> Procedure processvars() - Fill in missing values.
<li> Procedure processattributes() -
Process attributes to connect to corresponding variable.
<li> Procedure processcompound() -
Process each compound type to compute its size.
<li> Procedure processenums() -
Fix up enum constant values.
<li> Procedure processdatalists() -
Fix up datalists.
<li> Procedure checkconsistency() -
Check internal consistency.
<li> Procedure validate() -
Do any needed additional semantic checks.
</ol>
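<p>
The topological sort mentioned for processtypes() can be pictured with the following sketch.
It assumes a depth-first, post-order traversal over an acyclic dependency graph;
the algorithm actually used by ncgen may differ, and DemoType, demo_visit, and demo_toposort
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXTYPES 32
#define DEMO_MAXDEPS 4

typedef struct DemoType {
    const char* name;
    int ndeps;
    int deps[DEMO_MAXDEPS]; /* indices of the types this one refers to */
} DemoType;

/* Emit all dependencies of type i before emitting i itself. */
static void
demo_visit(const DemoType* types, int i, int* done, int* order, int* norder)
{
    int d;
    if (done[i]) return;
    done[i] = 1;
    for (d = 0; d < types[i].ndeps; d++)
        demo_visit(types, types[i].deps[d], done, order, norder);
    order[(*norder)++] = i;
}

/* Fill order[] so that every type appears after the types it depends on.
   Assumes the dependency graph has no cycles. */
void
demo_toposort(const DemoType* types, int ntypes, int* order)
{
    int done[DEMO_MAXTYPES] = {0};
    int i, norder = 0;
    assert(ntypes >= 0 && ntypes <= DEMO_MAXTYPES);
    for (i = 0; i < ntypes; i++)
        demo_visit(types, i, done, order, &norder);
}
```

For three types where type 0 depends on types 1 and 2, and type 1 depends on type 2,
the resulting order is 2, 1, 0: the least dependent type comes first.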
<h2><u>Generating C Code</u></h2>
The source code for generating C code output (via the -c option)
is of most interest because it is the pattern to be used
for other languages and because, frankly, it is complex and ugly
at the moment and so guidance is needed in understanding it.
<p>
The files genc.[ch] and cdata.c are the primary files for C code generation.
The files data.[ch] are also important.
<h3><u>Output Routines</u></h3>
The output routines are a bit of a mixed bag.
It is important to know that code is not directly
dumped to the output file; rather it is accumulated
in a global Bytebuffer instance called "ccode".
<p>
The output routines are as follows.
<ul>
<li>flushcode(void) - flush the ccode buffer to the output file.
<li>cprint(Bytebuffer* buf) - dump the contents
of buf to the ccode buffer.
<li>cpartial(char* line) - dump the contents of line
to the ccode buffer, but do not add a trailing newline.
<li>cline(char* line) - dump the contents of line
to the ccode buffer and add a trailing newline.
<li>clined(int n, char* line) - dump the contents of line to the ccode
buffer; prefix with n indentations (typically 4 blanks each)
and suffix with a trailing newline.
</ul>
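<p>
The buffering behavior of these routines can be sketched as follows.
This is not the genc.c code: the demo_* names are hypothetical, a fixed character array
stands in for the Bytebuffer "ccode", and the output file is passed explicitly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_BUFMAX 4096
static char demo_code[DEMO_BUFMAX]; /* plays the role of the global "ccode" */
static size_t demo_len = 0;

static void
demo_append(const char* text)
{
    size_t n = strlen(text);
    assert(demo_len + n < DEMO_BUFMAX);
    memcpy(demo_code + demo_len, text, n + 1); /* copies the trailing NUL too */
    demo_len += n;
}

/* Like cpartial: append without a trailing newline. */
void demo_cpartial(const char* line) { demo_append(line); }

/* Like cline: append and add a trailing newline. */
void demo_cline(const char* line) { demo_append(line); demo_append("\n"); }

/* Like clined: prefix with n indentations of 4 blanks each. */
void
demo_clined(int n, const char* line)
{
    while (n-- > 0) demo_append("    ");
    demo_cline(line);
}

/* Like flushcode: write the accumulated text and empty the buffer. */
void
demo_flushcode(FILE* out)
{
    fwrite(demo_code, 1, demo_len, out);
    fflush(out);
    demo_len = 0;
    demo_code[0] = '\0';
}
```

Calling demo_cpartial("int x"), demo_cline(" = 1;") and demo_clined(1,"x++;") leaves
two complete lines in the buffer; nothing reaches the file until demo_flushcode is called.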
<h3><u>gen_ncc</u></h3>
The gen_ncc procedure is responsible for
creating and dumping the generated C code.
<p>
It has at its disposal several global lists of Symbols.
Note that the lists cross all groups.
<ul>
<li>dimdefs - the set of symbols defining dimensions.
<li>vardefs - the set of symbols defining variables.
<li>attdefs - the set of symbols defining non-global attributes.
<li>gattdefs - the set of symbols defining global attributes.
<li>grpdefs - the set of symbols defining groups.
<li>typdefs - the set of symbols defining types; note that this list
has been topologically sorted so that a given type depends only
on types with lower indices in the list.
</ul>
<p>
The superficial operation of gen_ncc is as follows; the details
are provided later where the operation is complex.
<ol>
<li>Generate header code (e.g. "#include &lt;stdio.h&gt;").
<li>Generate C type definitions corresponding to the
CDL types.
<li>Generate VLEN constants.
<li>Generate chunking constants.
<li>Generate initial part of the main() procedure.
<li>Generate C variable definitions to hold the ncids
for all created groups.
<li>Generate C variable definitions to hold the typeids
of all created types.
<li>Generate C variables and constants that correspond
to the CDL dimensions.
<li>Generate C variable definitions to hold the varids
of all created variables.
<li>Generate C code to create the netCDF binary file.
<li>Generate C code to create all the groups in the proper
hierarchy.
<li>Generate C code to create the type definitions.
<li>Generate C code to create the dimension definitions.
<li>Generate C code to create the variable definitions.
<li>Generate C code to create the global attributes.
<li>Generate C code to create the non-global attributes.
<li>Generate C code to leave define mode.
<li>Generate C code to assign variable datalists.
</ol>
<p>
The following code generates C code for defining the groups.
It is fairly canonical and can be seen repeated in variant form
when defining dimensions, types, variables, and attributes.
<p>
This code is redundant but for consistency, the root group
ncid is stored like all other group ncids.
Note that nprintf is a macro wrapper around snprintf.
<pre>
nprintf(stmt,sizeof(stmt)," %s = ncid;",groupncid(rootgroup));
cline(stmt);
</pre>
<p>
The loop walks all group symbols in preorder form
and generates a C call to nc_def_grp
using parameters taken from the group Symbol instance (gsym).
The call to nc_def_grp is followed by a call to the
check_err procedure to verify the operation's result code.
<pre>
for(igrp=0;igrp&lt;listlength(grpdefs);igrp++) {
Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
if(gsym == rootgroup) continue; // ignore root
if(gsym->container == NULL) PANIC("null container");
nprintf(stmt,sizeof(stmt),
" stat = nc_def_grp(%s, \"%s\", &%s);",
groupncid(gsym->container),
gsym->name, groupncid(gsym));
cline(stmt); // print the def_grp call
clined(1,"check_err(stat,__LINE__,__FILE__);");
}
flushcode();
</pre>
<p>
The code to generate dimensions, types, attributes, and variables
is similar, although often more complex.
<p>
The code to generate C equivalents of CDL types is
in the procedure definectype().
Note that this code is not the code that invokes e.g. nc_def_vlen.
The generated C types are used when generating datalists
so that the standard C constant assignment mechanism will produce
the correct memory values.
<p>
The genc_deftype procedure is the one that actually
generates C code to define the netcdf types.
The generated C code is designed to store the resulting
typeid into the C variable defined earlier
for holding that typeid.
<p>
Note that for compound types, the NC_COMPOUND_OFFSET
macro is normally used to match netcdf offsets to
the corresponding struct type generated in definectype.
However, there is a flag, TESTALIGNMENT,
that can be set to use a computed value for the offset.
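<p>
The following sketch compares an offset taken from the C compiler with offsetof
against one computed from field sizes and alignments, which is presumably the kind
of computation the TESTALIGNMENT path performs.
The struct and the demo_* helpers are hypothetical, and the layout rule
(round each offset up to the field's alignment) is an assumption about typical C ABIs,
not a description of the ncgen code. It uses the C11 _Alignof operator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct of the kind generated for a CDL compound type. */
typedef struct demo_cmpd {
    char flag;
    double value;
    short count;
} demo_cmpd;

/* Round offset up to the next multiple of alignment. */
size_t
demo_align(size_t offset, size_t alignment)
{
    size_t rem;
    assert(alignment > 0);
    rem = offset % alignment;
    return rem == 0 ? offset : offset + (alignment - rem);
}

/* Lay fields out in order, padding each to its alignment.
   Returns the offset just past the last field. */
size_t
demo_layout(int nfields, const size_t* sizes, const size_t* aligns,
            size_t* offsets)
{
    size_t offset = 0;
    int i;
    for (i = 0; i < nfields; i++) {
        offset = demo_align(offset, aligns[i]);
        offsets[i] = offset;
        offset += sizes[i];
    }
    return offset;
}

/* Return 1 if the computed offsets agree with the compiler's. */
int
demo_offsets_match(void)
{
    size_t sizes[3] = {sizeof(char), sizeof(double), sizeof(short)};
    size_t aligns[3] = {_Alignof(char), _Alignof(double), _Alignof(short)};
    size_t offsets[3];
    demo_layout(3, sizes, aligns, offsets);
    return offsets[0] == offsetof(demo_cmpd, flag)
        && offsets[1] == offsetof(demo_cmpd, value)
        && offsets[2] == offsetof(demo_cmpd, count);
}
```

If the two disagree on some platform, the computed alignment values are the first thing to check.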
<h3><u>C Constant Datalist Generation</u></h3>
All attributes, and some variables, require the
construction of a memory object containing data
to be assigned to that attribute or variable.
The code to do this is by far the most complicated
in ncgen.
The file cdata.c contains the procedure genc_datalist(),
which does most of the heavy lifting.
<p>
For attributes, the general form generated is
<pre>
T* attributevar = {...};
</pre>
Except for VLENs, the datalist is completely
contained in the brackets, with bracket nesting as required.
A generated pointer to attributevar is included
in the generated call to nc_put_att().
<p>
For variables, the general form generated is similar to attributes.
<pre>
T* varvar = {...};
</pre>
Again, VLENs are handled specially.
Also, for performance purposes, the datalist
is loaded in pieces using nc_put_vara(). This is required if
there are UNLIMITED dimensions, but is used for all cases
for uniformity.
<h4>Datalist Closures</h4>
The code uses a concept of closure or callback
to allow the datalist processing to periodically
call external code to do the actual C code generation.
Basically, each call to the callback will generate
C code for constants and calls to nc_put_vara().
The closure data structure (struct Putvar) is defined as follows.
<pre>
typedef struct Putvar {
int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
int rank;
Bytebuffer* code;
size_t startset[NC_MAX_VAR_DIMS];
struct CDF {
int grpid;
int varid;
} cdf;
struct C {
Symbol* var;
} c;
} Putvar;
</pre>
An instance of the closure is created for
each variable that is the target of nc_put_vara().
It is initialized with the variable's symbol, rank, group id and variable
id. It is also provided with a Bytebuffer into which it is supposed
to store the generated C code.
The startset is the cached previous set of dimension indices used
for generating the nc_put_vara (see below).
<p>
The callback procedure (field "putvar")
used for generating C code is set to the procedure cputvara()
(defined in genc.c).
This procedure takes as arguments the closure object,
an odometer describing the current set of dimension indices,
and a Bytebuffer containing the generated C constants
to be assigned to this slice of the variable.
<p>
Every time the closure procedure is called, it generates a C variable
to hold the generated C constant. It then generates an nc_put_vara()
call. The start vector argument for the nc_put_vara is defined by the startset
field of the closure. The count vector argument to nc_put_vara
is computed from the current cached
start vector and from the indices in the odometer.
After the nc_put_vara() is generated, the odometer vector
is assigned to the startset field in the closure for use on the next call.
<p>
There are some important assumptions about the state of the odometer
when it is called.
<ol>
<li>The zeroth index controls the count set.
<li>All other indices are assumed to be at their max values.
</ol>
<p>
In particular, this means that the start vector is zero
for all positions except position zero. For every position
except zero, the count vector holds the index in the odometer,
which is assumed to be the max.
<p>
For start position zero, the position is taken from the last
saved startset. The count position zero is the difference between
that last start position and the current odometer zeroth index.
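<p>
The start/count rules just described can be sketched as follows.
This is one reading of the description, not the cputvara() source;
demo_slice and its parameter names are hypothetical, and the odometer is reduced
to a plain vector of indices.

```c
#include <assert.h>
#include <stddef.h>

/* Compute start/count for one generated nc_put_vara() call, then
   remember the odometer indices in startset for the next call. */
void
demo_slice(int rank, size_t* startset, const size_t* odomindex,
           size_t* start, size_t* count)
{
    int i;
    assert(rank > 0);
    assert(odomindex[0] >= startset[0]);
    /* Position zero: continue from where the previous call stopped. */
    start[0] = startset[0];
    count[0] = odomindex[0] - startset[0];
    /* Other positions: whole extent, assumed to be at the max. */
    for (i = 1; i < rank; i++) {
        start[i] = 0;
        count[i] = odomindex[i];
    }
    for (i = 0; i < rank; i++)
        startset[i] = odomindex[i];
}
```

For a rank-2 variable whose second dimension has size 3, a first call with odometer indices {2,3}
and a zeroed startset gives start {0,0} and count {2,3}; a second call with {5,3}
gives start {2,0} and count {3,3}.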
<p>
If all of this sounds complex, it is, and if/when I have time
I will rethink the whole process of datalist generation
from beginning to end.
<h4>VLEN Constants</h4>
VLEN constants need to be constructed
as separate C data constants because
the C compiler will never convert nested
groups ({...}) to separate memory chunks.
Thus, ncgen must in several places
generate the VLEN constants as separate variables
and then insert pointers to them in the appropriate
places in the later datalist C constants.
<p>
As an optimization, ncgen tracks which datatypes
will require use of vlen constants.
This is any type whose definition is a vlen or whose
basetype contains a vlen type.
<p>
The vlen generation process is two-fold.
First, in the procedure processdatalist1() in semantics.c,
the location of the struct Datalist objects
that correspond to vlen constants is stored in a list called vlenconstants.
When detected, each such struct Datalist object is tagged with
a unique identifier and the vlen length (count).
These will be used later to generate references to the vlen constant.
<p>
The second vlen constant processing action is in the
procedure genc_vlenconstant() in cdata.c. It walks the
vlenconstants list and generates C code for variables to
define the vlen constant and C code to assign the vlen
constant's data to that variable.
<p>
When, later, the genc_datalist procedure encounters
a Datalist tagged as representing a vlen constant, it can generate
a nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
and use it directly in the generated C datalist constant.
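<p>
The shape of the resulting C can be illustrated with the following sketch
for a variable holding two vlen-of-int values, {1,2,3} and {4,5}.
The variable names are made up and do not follow ncgen's actual naming scheme, and
demo_vlen_t is a local stand-in with the same two fields (len and p) as netCDF's nc_vlen_t,
so that the example compiles without netcdf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same two fields as nc_vlen_t. */
typedef struct demo_vlen_t {
    size_t len;
    void* p;
} demo_vlen_t;

/* Step 1: each vlen constant becomes its own C array. */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* Step 2: the datalist constant refers to those arrays
   as {count, pointer} pairs instead of nesting the data. */
static demo_vlen_t var_data[] = {
    {3, vlen_1},
    {2, vlen_2}
};

/* Sum every int reachable from an array of int vlens. */
long
demo_vlen_total(const demo_vlen_t* items, size_t nitems)
{
    long total = 0;
    size_t i, j;
    assert(items != NULL);
    for (i = 0; i < nitems; i++) {
        const int* data = (const int*)items[i].p;
        for (j = 0; j < items[i].len; j++)
            total += data[j];
    }
    return total;
}
```

Writing the nested form {{1,2,3},{4,5}} directly would not work here, because the C compiler
would try to initialize the len and p fields from the inner braces rather than
allocate separate arrays.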
<h4>Walking the Datalist</h4>
To actually generate the C code for a datalist constant,
the procedure genc_datalist wraps the Datalist in a Datasrc,
and proceeds to walk it constant by constant, generating
the corresponding C constant. The bulk of the work
is performed in the recursive procedure genc_datalist1().
<p>
For better or worse, the code
acts like a 1-lookahead parser. This means that it decides
what to do based on the current type, the current constant and, when necessary,
the next constant in the Datasrc. In practice, the lookahead
is hidden, so it is not represented in the following table.
<p>
<table border=1>
<tr><th>Current Type<th>Current Constant<th>action
<tr valign=top><td>NC_PRIM<td>Primitive Constant<td>Generate the C constant; convert as necessary.
<tr valign=top><td>NC_OPAQUE<td>''<td>''
<tr valign=top><td>NC_ENUM<td>''<td>''
<tr valign=top><td>NC_COMPOUND<td>Nested Datalist Constant<td>Push into the datalist and recurse on each field; when done, pop back to the previous datalist.
<tr valign=top><td>NC_COMPOUND<td>Any other Constant<td>
Continue to recurse on each field; this allows
specification of fields without enclosing them in {...}.
<tr valign=top><td>NC_VLEN<td>Nested Datalist Constant<td>Generate the
nc_vlen_t instance using the tagged information in the struct Datalist.
<tr valign=top><td>NC_FIELD<td>NA<td>If this field is dimensioned,
then call genc_fielddata to walk the dimensions. Otherwise, just
recurse on genc_datalist1.
</table>
<p>
The genc_fielddata() procedure iterates over a field dimension
and calls itself recursively to walk the remaining dimensions.
If this is the last dimension, then it calls genc_datalist1 to
generate C code for the basetype of the field.
<h4>String/Char Handling</h4>
All through the genc_datalist code,
there are special cases for handling string constants.
The reason is, of course, that the string constant "abcd.."
may, depending on the type context, be either a string
or an array of characters.
<h4>Generating Variable Data</h4>
The genc_datalist code does not call closures.
The closures are used in the genc_vardata() and genc_vardata1()
procedures; genc_vardata1 being the recursive procedure that actually
calls the closure.
<p>
The genc_vardata1() procedure, like genc_fielddata,
iterates over a top-level dimension and calls itself recursively
to iterate over the remaining dimensions.
The term "top-level" refers to the fact that these are the dimensions
specified for a variable as opposed to field dimensions.
<p>
When iterating an UNLIMITED dimension, or when iterating the first
dimension, the code generates a datalist for this subslice
and then calls the closure to generate the C code.
<h2><u>Miscellaneous</u></h2>
<h4>Pool Memory Allocation</h4>
As an approximation to garbage collection,
this code uses a pool allocation mechanism.
The goal is to allow dynamic construction of strings
that have very short life-times; typically they are used
to construct strings to send to the output file.
<p>
The pool mechanism wraps malloc and records the malloc'd
memory in a circular buffer. When the buffer reaches its maximum
size, previously allocated pool buffers are free'd.
This is good in that the user does not have to litter
code with free() statements. It is bad in that the pool
allocated memory can be free'd too early if the memory
does not have a short enough life.
If you suspect the latter, then bump the size of the circular buffer
and see if the problem goes away. If so, then your code
is probably holding on to a pool buffer too long and should use
regular malloc/free.
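<p>
The mechanism can be sketched in a few lines. This is an illustration of the idea only:
demo_poolalloc, demo_pool, and the buffer size of 4 are hypothetical and are not taken
from the ncgen source.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_POOLMAX 4
static void* demo_pool[DEMO_POOLMAX]; /* circular buffer of live allocations */
static int demo_poolnext = 0;

/* Allocate n bytes from the pool. The returned memory is freed
   automatically DEMO_POOLMAX allocations later, so callers must not
   free it themselves and must not keep it for long. */
void*
demo_poolalloc(size_t n)
{
    void* mem;
    free(demo_pool[demo_poolnext]); /* free(NULL) is a no-op */
    mem = malloc(n);
    demo_pool[demo_poolnext] = mem;
    demo_poolnext = (demo_poolnext + 1) % DEMO_POOLMAX;
    return mem;
}
```

With a buffer of 4 slots, the pointer returned by the first call becomes invalid during the fifth call;
using it after that point is exactly the "free'd too early" failure described above.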
<p>
In the end, I am not sure if this is a good idea, but
it does make the code simpler.
<h4>List and Bytebuffer</h4>
The two datatypes List and Bytebuffer are used throughout the
code. They correspond closely in semantics to the Java ArrayList
and StringBuffer types, respectively. They are used to help
encapsulate dynamically growing lists of objects or bytes
to reduce certain kinds of memory allocation errors.
<p>
The canonical code for non-destructive walking of a List&lt;T&gt;
is as follows.
<pre>
for(i=0;i&lt;listlength(list);i++) {
T* element = (T*)listget(list,i);
...
}
</pre>
<p>
Bytebuffer provides two ways to access its internal buffer of characters.
One is "bbContents()", which returns a direct pointer to the buffer,
and the other is "bbDup()", which returns a malloc'd string containing
the contents and is guaranteed to be null terminated.
<h4><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h4>
The odometer data type is used to convert
multiple dimensions into a single integer.
The rule for converting a multi-dimensional
array to a single dimension is as follows.
<p>
Suppose we have the declaration <code>int F[2][5][3];</code>.
There are obviously a total of 2 X 5 X 3 = 30 integers in F.
Thus, these three dimensions will be reduced to a single dimension of size 30.
<p>
A particular point in the three dimensions, say [x][y][z], is reduced to
a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
The corresponding general C code is as follows.
<pre>
size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
int i;
size_t count = 0;
for(i=0;i&lt;rank;i++) {
if(i > 0) count *= sizes[i];
count += indices[i];
}
return count;
}
</pre>
In this code, the indices variable corresponds to the x,y, and z.
The sizes variable corresponds to the 2,5, and 3.
<p>
The Odometer type stores a set of dimensions
and supports operations to iterate over all possible
dimension combinations.
Odometer is defined by the types Odometer and Dimdata.
<pre>
typedef struct Dimdata {
unsigned long datasize; // actual size of the datalist item
unsigned long index; // 0 &lt;= index &lt; datasize
unsigned long declsize;
} Dimdata;
typedef struct Odometer {
int rank;
Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
</pre>
The following primary operations are defined.
<ul>
<li>Odometer* newodometer(Dimset*) - create an odometer from a Dimset.
<li>void freeodometer(Odometer*) - release the memory of an odometer.
<li>int odometermore(Odometer* odom) - return 1 if there are more combinations
of dimension values.
<li>int odometerincr(Odometer* odo,int) - move to the next combination
of dimension values.
<li>unsigned long odometercount(Odometer* odo) -
apply the above algorithm to convert the current odometer combination
into a single integer.
</ul>
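<p>
How these operations fit together can be sketched with a reduced odometer.
The DemoOdom type and demo_odom* functions are hypothetical simplifications: they keep only
a size and an index per dimension, and the increment takes no extra argument, unlike
the real odometerincr.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXRANK 8
typedef struct DemoOdom {
    int rank;
    size_t size[DEMO_MAXRANK];
    size_t index[DEMO_MAXRANK];
} DemoOdom;

void
demo_odominit(DemoOdom* odom, int rank, const size_t* sizes)
{
    int i;
    assert(rank > 0 && rank <= DEMO_MAXRANK);
    odom->rank = rank;
    for (i = 0; i < rank; i++) {
        odom->size[i] = sizes[i];
        odom->index[i] = 0;
    }
}

/* More combinations remain while the slowest index is in range. */
int
demo_odommore(const DemoOdom* odom)
{
    return odom->index[0] < odom->size[0];
}

/* Advance the last (fastest) index, carrying into earlier positions. */
void
demo_odomincr(DemoOdom* odom)
{
    int i;
    for (i = odom->rank - 1; i >= 0; i--) {
        odom->index[i]++;
        if (odom->index[i] < odom->size[i]) break;
        if (i == 0) break; /* leave index[0] == size[0] to mark the end */
        odom->index[i] = 0;
    }
}

/* Same computation as dimmap above, applied to the current indices. */
size_t
demo_odomcount(const DemoOdom* odom)
{
    size_t count = 0;
    int i;
    for (i = 0; i < odom->rank; i++) {
        if (i > 0) count *= odom->size[i];
        count += odom->index[i];
    }
    return count;
}
```

For the F[2][5][3] example, looping while demo_odommore is true and calling demo_odomincr each time
visits 30 combinations, and demo_odomcount yields 0 through 29 in order.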
<h3><u>Change Log</u></h3>
<ul>
<li>04/15/2009 - Add major discussion about adding a new output language.
<li>03/10/2009 - Fix typos.
<li>03/07/2009 - First draft.
</ul>
</body>
</html>