/* Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.
   See the COPYRIGHT file for more information. */

NCGEN Internals Documentation

Draft: 03/07/2009
Last Revised: 07/15/2009

Introduction

This document is an ongoing effort to describe the internal operation of the ncgen CDL compiler; ncgen is part of the netCDF system.

The document has two primary parts.

  1. Language Support -- describes how to add a new output language to ncgen.

  2. General Internals Information -- describes additional information about the internals; parsing, for example.

Modifying NCGEN to Output a New Language

This document outlines the general method for adding a new language output to ncgen. Currently, it supports binary, C, and (experimentally) NcML and Java. Before reading this document, the reader should also review the internals.html document.

Also, the reader should note that the code is a bit crufty and needs refactoring. This is primarily because it was originally designed to support only C, and each new language stresses the code.

In order to get ncgen to generate output for a new language, the following steps are required.

  1. Modify various files to invoke the new language output.
  2. Create a new set of generate functions.

Modify various files to invoke the new language output.

The following steps are required to provide the necessary code to invoke a new language output. For the purposes of this discussion, let us call the language Java.

ncgen.h

  1. Locate the code enabler #defines (e.g. #define ENABLE_C) and insert a new one of the form
    #define ENABLE_JAVA
    

main.c

  1. Locate the global declaration (int fortran_flag;) and insert a new declaration.
    int java_flag;
  2. Locate the initialization (fortran_flag = 0;) in the body of the main() procedure and add a new initialization.
    java_flag = 0;
  3. Locate the options processing switch case for -l (case 'l':). Duplicate one of the instances there and add to the conditionals. It should look like this.
        } else if(strcmp(lang_name, "java") == 0
           || strcmp(lang_name, "Java") == 0) {java_flag = 1;}
    
  4. Just after the options processing switch code, there are a number of #ifndef conditionals (e.g. #ifndef ENABLE_C). Add a new one for Java. It should look like this.
    #ifndef ENABLE_JAVA
        if(java_flag) {
    	  fprintf(stderr,"Java not currently supported\n");
    	  exit(1);
        }
    #endif
    

Create a new set of generate functions.

The hard part is creating the actual code generation files. The easiest way to do this is to take one of the existing generators and modify it. For Java, the genj.c file does most of the code generation, and the jdata.c file generates the lists of data constants that come from the CDL data: section. There is nothing magical about using two files: they can be refactored as desired.

In order to facilitate code generation, it is useful to look at the translations produced by other languages. The idea is to take these translations and decide what the corresponding Java (for example) code would look like. Then the idea is to modify the genc code (in genj.c) to reflect that translation.

In most of the rest of this discussion, the genc.c and cdata.c code will be used to explain the operation. Appropriate procedure renaming should be done for new languages (e.g., for Java, genc_XXX is changed to genj_XXX consistently).

Useful Output Procedures

The following output procedures are defined in genc.c to create C output. The idea is that output is accumulated in a Bytebuffer called ccode. Periodically, ccode contents are flushed to stdout. The relevant procedures from the C code are as follows.
  1. void cprint(Bytebuffer* buf) -- dump the contents of buf to output (ccode actually).
  2. void cpartial(char* line) -- dump the specified string to output.
  3. void cline(char* line) -- dump the specified string to output and add a newline.
  4. void clined(int n, char* line) -- dump the specified string to output preceded by n instances of indentation.
  5. void cflush(void) -- dump the contents of ccode to standard output and reset the ccode buffer.
There is, of course, nothing sacred about these procedures: feel free to modify as needed. In fact, there are two important reasons to modify the code. First, the indentation rules may differ from language to language (FORTRAN 77 for example). Second, the rules for folding lines that are too long differ across languages. It is usually easiest to handle both of these issues in the output procedures.

The Bytebuffer type is an important data structure. It allows for dynamically creating strings of characters (actually arbitrary 8 bit values). Most of the operations should be obvious: examine bytebuffer.h. It is used widely in this code especially to capture sub-pieces of the generated code that must be saved for out-of-order output.

Code Generation

The code generation method used for C is a pretty good general paradigm, so this discussion will use it as a model. The gen_ncc procedure is responsible for creating and dumping the generated C code.

It has at its disposal several global lists of Symbols. Note that the lists cross all groups.

The superficial operation of gen_ncc is as follows; the details are provided later where the operation is complex.

  1. Generate header code (e.g. #include directives).
  2. Generate C type definitions corresponding to the CDL types.
  3. Generate VLEN constants.
  4. Generate chunking constants.
  5. Generate initial part of the main() procedure.
  6. Generate C variable definitions to hold the ncids for all created groups.
  7. Generate C variable definitions to hold the typeids of all created types.
  8. Generate C variables and constants that correspond to the CDL dimensions.
  9. Generate C variable definitions to hold the dimids of all created dimensions.
  10. Generate C variable definitions to hold the varids of all created variables.
  11. Generate C code to create the netCDF binary file.
  12. Generate C code to create all the groups in the proper hierarchy.
  13. Generate C code to create the type definitions.
  14. Generate C code to create the dimension definitions.
  15. Generate C code to create the variable definitions.
  16. Generate C code to create the global attributes.
  17. Generate C code to create the non-global attributes.
  18. Generate C code to leave define mode.
  19. Generate C code to assign variable datalists.

The following code generates C code for defining the groups. It is fairly canonical and can be seen repeated in variant form when defining dimensions, types, variables, and attributes.

This code is redundant, but for consistency the root group ncid is stored like all other group ncids. Note that nprintf is a macro wrapper around snprintf.

nprintf(stmt,sizeof(stmt),"%s%s = ncid;",indented(1),groupncid(rootgroup));
cline(stmt);

The loop walks all group symbols in preorder form and generates a C call to nc_def_grp using parameters taken from the group Symbol instance (gsym). The call to nc_def_grp is followed by a call to the check_err procedure to verify the operation's result code.

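for(igrp=0;igrp<listlength(grpdefs);igrp++) {
    // grpdefs: the global list of group symbols (name assumed)
    Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
    if(gsym == rootgroup) continue; // root group was handled above
    if(gsym->container == NULL) PANIC("null container");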
    nprintf(stmt,sizeof(stmt),
            "%sstat = nc_def_grp(%s, \"%s\", &%s);",
	    indented(1),
            groupncid(gsym->container),
            gsym->name, groupncid(gsym));
    cline(stmt); // print the def_grp call
    clined(1,"check_err(stat,__LINE__,__FILE__);");
}
flushcode();
Note the call to indented(). It generates a string of blanks for indentation level N, where N is its argument; level N might produce more or fewer than N blank characters.

Note also that one must be careful when dumping names (e.g. gsym->name above) if the name is expected to contain utf8 characters. For C, utf8 works fine in strings, but with a language like Java, which takes utf-16 characters, some special encoding is required to convert the non-ascii characters to use the \uxxxx form.
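The following is a minimal sketch of the kind of encoding step this implies for Java output. The helper java_escape is hypothetical (it is not part of ncgen), it assumes the input is well-formed UTF-8, and it does not escape ASCII characters such as quotes or backslashes, which a real generator would also need to handle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of ncgen): copy a UTF-8 name into dst,
   replacing each non-ASCII character with Java \uXXXX escapes.
   Assumes src is well-formed UTF-8.
   Returns 0 on success, -1 if dst is too small. */
static int java_escape(const char* src, char* dst, size_t dstlen)
{
    const unsigned char* p = (const unsigned char*)src;
    size_t used = 0;
    assert(dstlen > 0);
    while(*p) {
        unsigned long cp;
        int extra, i;
        /* Decode the lead byte and note how many continuation bytes follow. */
        if(*p < 0x80)      {cp = *p;        extra = 0;}
        else if(*p < 0xE0) {cp = *p & 0x1F; extra = 1;}
        else if(*p < 0xF0) {cp = *p & 0x0F; extra = 2;}
        else               {cp = *p & 0x07; extra = 3;}
        p++;
        for(i = 0; i < extra && *p; i++, p++)
            cp = (cp << 6) | (*p & 0x3F);
        if(used + 13 > dstlen) return -1; /* worst case: 12 chars + NUL */
        if(cp < 0x80) {
            dst[used++] = (char)cp;
        } else if(cp < 0x10000) {
            used += (size_t)sprintf(dst + used, "\\u%04lx", cp);
        } else { /* outside the BMP: emit a UTF-16 surrogate pair */
            cp -= 0x10000;
            used += (size_t)sprintf(dst + used, "\\u%04lx\\u%04lx",
                                    0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
    dst[used] = '\0';
    return 0;
}
```

For example, the UTF-8 bytes 0xC3 0xA9 (e with an acute accent) become the six characters \u00e9, which a Java compiler accepts inside a string literal.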

The code to generate dimensions, types, attributes, variables is similar, although often more complex.

The code to generate C equivalents of CDL types is in the procedure definectype(). Note that this code is not the code that invokes e.g. nc_def_vlen. The generated C types are used when generating datalists so that the standard C constant assignment mechanism will produce the correct memory values.

For non-C languages, the interaction between this code and the nc_def_TYPE code may be rather more complex than with C.

The genc_deftype procedure is the one that actually generates C code to define the netcdf types. The generated C code is designed to store the resulting typeid into the C variable defined earlier for holding that typeid.

Note that for compound types, the NC_COMPOUND_OFFSET macro is normally used to match netcdf offsets to the corresponding struct type generated in definectype. However, there is a flag, TESTALIGNMENT, that can be set to use a computed value for the offset. And for non-C languages, handling offsets is tricky and is addressed in more detail below.

Data Generation Methods

There are basically three known approaches for generating the data constants that are passed to, for example, nc_put_vara.
  1. For C (and C++) it is possible to generate C language constants directly into the code using the C initializer syntax. This is because CDL was originally defined with C in mind. This method can also be used for FORTRAN when doing classic model only.

  2. Generate the binary data and convert it to a large single string constant using appropriate escaping mechanisms; this was done in the original ncgen. This method has the advantage that it can be used for most languages, but it has (at least) two disadvantages: (1) it is not generally portable because the machine architecture influences the memory encoding; (2) it loses all information about the structure of the memory and hence makes debugging more difficult.

  3. Extend the netCDF interface with a set of operations to build up the memory structure piece by piece. This is the approach taken in the Java generation code.

    The idea is that one has a set of procedures in C with a simple interface that can be invoked by the output language. These procedures do the following.

    1. Create a dynamically extendible memory buffer (much like Bytebuffer).
    2. Append an array of instances of some primitive type to a specified buffer.
    3. Invoke nc_put_vara with a specified buffer.
    4. Reclaim a buffer.
    Appropriate calls to these procedures can construct any required memory in a portable fashion.

    This method is appropriate to use with most non-C languages, with interpretive languages (e.g., Ruby and Perl), and even is probably the best way to get FORTRAN to handle the full netcdf-4 data model.
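The buffer procedures of the third approach might look roughly like the sketch below. The names MemBuf, membuf_new, membuf_append, and membuf_free are hypothetical and do not come from the netCDF library. Step 3 (invoking nc_put_vara with the buffer) is omitted because it requires the netCDF library, and the sketch packs values back to back, whereas a real version would have to honor alignment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer API; the names are illustrative only. */
typedef struct MemBuf {
    unsigned char* data;
    size_t length; /* bytes in use */
    size_t alloc;  /* bytes allocated */
} MemBuf;

/* Step 1: create an empty, extendible buffer. */
static MemBuf* membuf_new(void)
{
    return (MemBuf*)calloc(1, sizeof(MemBuf));
}

/* Step 2: append count instances of a primitive type of size elemsize.
   Returns 0 on success, -1 on allocation failure. */
static int membuf_append(MemBuf* buf, const void* values, size_t elemsize, size_t count)
{
    size_t nbytes = elemsize * count;
    assert(buf != NULL && values != NULL);
    if(buf->length + nbytes > buf->alloc) {
        size_t newalloc = (buf->alloc == 0 ? 64 : buf->alloc);
        unsigned char* newdata;
        while(newalloc < buf->length + nbytes) newalloc *= 2;
        newdata = (unsigned char*)realloc(buf->data, newalloc);
        if(newdata == NULL) return -1;
        buf->data = newdata;
        buf->alloc = newalloc;
    }
    memcpy(buf->data + buf->length, values, nbytes);
    buf->length += nbytes;
    return 0;
}

/* Step 4: reclaim the buffer. */
static void membuf_free(MemBuf* buf)
{
    if(buf == NULL) return;
    free(buf->data);
    free(buf);
}
```

The generated code in the output language would then consist of a sequence of calls to the append procedure, one per run of primitive values in the datalist.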

Data Generation: Overview

The way to think about data generation is to consider the following tree.

The data generation code is divided into two primary groups. One group handles all non-primitive variables and types. The other group handles all primitive variables and types (especially fields). The reason for this is that almost all languages can handle simple lists of primitive values. However, for non-primitive types, one of the methods from the previous section needs to be used.

Secondarily, the primitive handling code is divided into two groups. One group handles the character type and the other group handles all other primitive types. The code for the first group is in chardata.c and is generally usable across all languages.

This split exists for historical reasons. It turns out that it is tricky to properly handle variables (or Compound type fields) of type NC_CHAR. Here the term "proper" means to mimic the output of the original ncgen program. To this end, a set of generically useful routines is defined in the chardata.c file. These routines take a datasource and walk it to build a single string of characters, with appropriate fill, to correspond to a NC_CHAR typed variable or field. Unless your language has special requirements, it is probably best to always use these routines to process datalists for variables of type NC_CHAR.

Data Generation: Part I

Data generation occurs in several places, but is roughly divided into two parts. First, the genc.c code will set up appropriate declarations to hold the data. Second, the code in cdata.c will generate the actual memory contents that must be passed to nc_put_vara.

As a rule, the genc.c code calls a limited set of entry points into cdata.c. Again as a rule, cdata.c does not call genc.c code except for the closure mechanism described below.

The critical pieces of code for part I are the procedures genc_defineattr() and genc_definevardata() in genc.c.

genc_definevardata

This procedure is responsible for generating C constants corresponding to the data to be assigned to a variable as defined in the "data:" section of a CDL file. It is also responsible for generating the appropriate nc_put_vara_XXX code to actually assign the data to the variable.

genc_defineattr

This procedure is responsible for generating C constants corresponding to the data to be assigned to an attribute from a CDL file. It is also responsible for generating the appropriate nc_put_att_XXX code to actually define the attribute.

As with variables, defining attributes of type NC_CHAR requires use of the gen_charXXX procedures.

Data Generation: Part II

The procedures in cdata.c walk a datalist and generate a sequence of space-separated constants, possibly with nested paired braces ("{...}") as needed. The result is placed into a specified Bytebuffer.

As an aside, commas are added when needed to the list of constants using the commify procedure.

There are three primary procedures that are called from the genj.c code.

Internally, each of these three procedures invokes the genc_data procedure to process part of a datalist.

Closures and VLEN

Closures and VLEN handling are two rather specialized mechanisms.

Closures

The data generation code uses a concept of closure or callback to allow the datalist processing to periodically call external code to do the actual C code generation. The reason for this is that it significantly improves performance if the generated datalist is periodically dumped to the netcdf .nc file using nc_put_vara. Note that the closure mechanism is only used for generating variable data; attributes cannot use this mechanism since they are defined all at once.

Basically, each call to the callback will generate C code for some C constants and calls to nc_put_vara(). The closure data structure (struct Putvar) is defined as follows.

typedef struct Putvar {
    int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
    int rank;
    Bytebuffer* code;
    size_t startset[NC_MAX_VAR_DIMS];
    struct CDF {
        int grpid;
        int varid;
    } cdf;
    struct C {
        Symbol* var;
    } c;
} Putvar;
An instance of the closure is created for each variable that is the target of nc_put_vara(). It is initialized with the variable's symbol, rank, group id and variable id. It is also provided with a Bytebuffer into which it is supposed to store the generated C code. The startset is the cached previous set of dimension indices used for generating the nc_put_vara (see below).

For generating C code, the callback field "putvar" is assigned the procedure cputvara() (defined in genc.c). This procedure takes as arguments the closure object, an odometer describing the current set of dimension indices, and a Bytebuffer containing the generated C constants to be assigned to this slice of the variable.

Every time the closure procedure is called, it generates a C variable to hold the generated C constant. It also generates C constants to hold the start and count vectors required by nc_put_vara. It then generates an nc_put_vara() call. The start vector argument for the nc_put_vara is defined by the startset field of the closure. The count vector argument to nc_put_vara is computed from the current cached start vector and from the indices in the odometer. After the nc_put_vara() is generated, the odometer vector is assigned to the startset field in the closure for use on the next call.

There are some important assumptions about the state of the odometer when it is called.

  1. The zeroth index controls the count set.
  2. All other indices are assumed to be at their max values.

In particular, this means that the start vector is zero for all positions except position zero. In the count vector, every position except zero holds the corresponding index in the odometer, which is assumed to be the max.

For start position zero, the position is taken from the last saved startset. The count position zero is the difference between that last start position and the current odometer zeroth index.
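The start and count computation just described can be sketched as follows. The Slice type and next_slice function are hypothetical simplifications of struct Putvar and cputvara(); only the index arithmetic is shown, not the generation of C text.

```c
#include <assert.h>
#include <stddef.h>

#define MAXRANK 8 /* illustrative stand-in for NC_MAX_VAR_DIMS */

/* Simplified closure state: only the cached start indices. */
typedef struct Slice {
    int    rank;
    size_t startset[MAXRANK];
} Slice;

/* Given the odometer's current indices, fill in the start and count
   vectors for the next nc_put_vara call and update the cache. */
static void next_slice(Slice* s, const size_t* odomindex, size_t* start, size_t* count)
{
    int i;
    assert(s->rank > 0 && s->rank <= MAXRANK);
    /* Position zero: resume where the previous slice stopped. */
    start[0] = s->startset[0];
    count[0] = odomindex[0] - s->startset[0];
    /* Other positions: assumed to be at their max, so cover them fully. */
    for(i = 1; i < s->rank; i++) {
        start[i] = 0;
        count[i] = odomindex[i];
    }
    /* Cache the odometer vector for the next call. */
    for(i = 0; i < s->rank; i++) s->startset[i] = odomindex[i];
}
```

For a 10 x 3 variable written in two calls, with the odometer at {4,3} and then at {10,3}, this yields start {0,0} with count {4,3}, followed by start {4,0} with count {6,3}.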

VLEN Constants

VLEN constants need to be constructed as separate C data constants because the C compiler will never convert nested groups ({...}) to separate memory chunks. Thus, ncgen must in several places generate the VLEN constants as separate variables and then insert pointers to them in the appropriate places in the later datalist C constants. Note that this process can be very tricky for non-C languages (see genj.c and jdata.c for one approach).

As an optimization, ncgen tracks which datatypes will require use of vlen constants. This is any type whose definition is a vlen or whose basetype contains a vlen type.

The vlen generation process is two-fold. First, in the procedure processdatalist1() in semantics.c, the location of the struct Datalist objects that correspond to vlen constants is stored in a list called vlenconstants. When detected, each such Datalist object is tagged with a unique identifier and the vlen length (count). These will be used later to generate references to the vlen constant. These counts are only accurate for non-char typed variables; special handling is in place to handle character vlen constants.

The second vlen constant processing action is in the procedure genc_vlenconstant() in cdata.c. It walks the vlenconstants list and generates C code for C variables to define the vlen constant and C code to assign the vlen constant's data to that C variable.

When, later, the genc_datalist procedure encounters a Datalist tagged as representing a vlen constant, it can generate an nc_vlen_t constant as {<count>,<vlenconstantname>} and use it directly in the generated C datalist constant.
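To make the shape of the generated code concrete, here is a hand-written sketch of what the output for two vlen instances might look like. It is not actual ncgen output: the variable names are invented, and vlen_t is a local stand-in with the same two fields as netCDF's nc_vlen_t so that the sketch compiles without netcdf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same layout as netCDF's nc_vlen_t, declared here so
   that the sketch compiles without netcdf.h. */
typedef struct {
    size_t len; /* number of elements */
    void*  p;   /* pointer to the elements */
} vlen_t;

/* Each vlen constant is emitted first, as its own C variable.
   The names vlen_1 and vlen_2 are illustrative. */
static int vlen_1[] = {17, 18, 19};
static int vlen_2[] = {1, 2};

/* CDL data such as {17, 18, 19}, {1, 2} then becomes a list of
   {<count>,<vlenconstantname>} pairs that point at those variables. */
static vlen_t v_data[2] = {{3, vlen_1}, {2, vlen_2}};

/* Sum the elements of one vlen instance whose base type is int. */
static int vlen_sum(const vlen_t* v)
{
    const int* elems = (const int*)v->p;
    size_t i;
    int total = 0;
    assert(v->len == 0 || elems != NULL);
    for(i = 0; i < v->len; i++) total += elems[i];
    return total;
}
```

The helper vlen_sum exists only to show that the pointers and counts line up; generated code would instead pass v_data to the appropriate nc_put_vara call.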

Utility Data Structures

Pool Memory Allocation

As an approximation to garbage collection, this code uses a pool allocation mechanism. The goal is to allow dynamic construction of strings that have very short life-times; typically they are used to construct strings to send to the output file.

The pool mechanism wraps malloc and records the malloc'd memory in a circular buffer. When the buffer reaches its maximum size, previously allocated pool buffers are free'd. This is good in that the user does not have to litter code with free() statements. It is bad in that the pool allocated memory can be free'd too early if the memory does not have a short enough life. If you suspect the latter, then bump the size of the circular buffer and see if the problem goes away. If so, then your code is probably holding on to a pool buffer too long and should use regular malloc/free.
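A minimal sketch of such a pool, assuming a very small circular buffer for illustration, is shown below. The names pool, poolindex, pool_alloc, and POOLMAX are hypothetical and need not match the ncgen source.

```c
#include <assert.h>
#include <stdlib.h>

#define POOLMAX 4 /* illustrative; a real pool would be much larger */

static void* pool[POOLMAX]; /* circular buffer of live allocations */
static int   poolindex = 0; /* next slot to reuse */

/* Allocate nbytes and record the pointer in the circular buffer.
   The allocation that previously occupied the slot is freed, so a
   pointer stays valid only until POOLMAX later allocations occur. */
static void* pool_alloc(size_t nbytes)
{
    void* mem = malloc(nbytes == 0 ? 1 : nbytes);
    assert(mem != NULL); /* sketch: treat exhaustion as fatal */
    free(pool[poolindex]); /* free(NULL) is a no-op for unused slots */
    pool[poolindex] = mem;
    poolindex = (poolindex + 1) % POOLMAX;
    return mem;
}
```

In this sketch, "bumping the size of the circular buffer" corresponds to increasing POOLMAX, which lengthens the lifetime of every pool-allocated string.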

In the end, I am not sure if this is a good idea, but it does make the code simpler.

List and Bytebuffer

The two datatypes List and Bytebuffer are used throughout the code. They correspond closely in semantics to the Java ArrayList and StringBuffer types, respectively. They are used to help encapsulate dynamically growing lists of objects or bytes to reduce certain kinds of memory allocation errors.

The canonical code for non-destructive walking of a List is as follows.

for(i=0;i<listlength(list);i++) {
    T* element = (T*)listget(list,i);
    ...
}

Bytebuffer provides two ways to access its internal buffer of characters. One is "bbContents()", which returns a direct pointer to the buffer, and the other is "bbDup()", which returns a malloc'd string containing the contents and is guaranteed to be null terminated.

Odometer: Multi-Dimensional Array Handling

The odometer data type is used to convert multiple dimensions into a single integer. The rule for converting a multi-dimensional array to a single dimension is as follows.

Suppose we have the declaration int F[2][5][3];. There are obviously a total of 2 X 5 X 3 = 30 integers in F. Thus, these three dimensions will be reduced to a single dimension of size 30.

A particular point in the three dimensions, say [x][y][z], is reduced to a number in the range 0..29 by computing ((x*5)+y)*3+z. The corresponding general C code is as follows.

size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i=0;i<rank;i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}
In this code, the indices variable corresponds to the x,y, and z. The sizes variable corresponds to the 2,5, and 3.
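As a check on the arithmetic, the point [1][2][1] in F maps to ((1*5)+2)*3+1 = 22. The sketch below goes the other way and recovers the indices from a linear offset. The function dimunmap is hypothetical and is offered only to illustrate the mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inverse of dimmap: recover the indices for a linear
   offset, given the dimension sizes. The last dimension varies fastest. */
static void dimunmap(int rank, size_t offset, const size_t* sizes, size_t* indices)
{
    int i;
    assert(rank > 0);
    for(i = rank - 1; i > 0; i--) {
        indices[i] = offset % sizes[i];
        offset /= sizes[i];
    }
    indices[0] = offset; /* sizes[0] is never needed, as in dimmap */
}
```

With sizes 2, 5, and 3, offset 22 yields the indices 1, 2, and 1, and offset 29 yields 1, 4, and 2, the last element of F.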

The Odometer type stores a set of dimensions and supports operations to iterate over all possible dimension combinations. It is defined by the types Odometer and Dimdata.

typedef struct Dimdata {
    unsigned long datasize; // actual size of the datalist item
    unsigned long index;    // 0 <= index < datasize
    unsigned long declsize;
} Dimdata;

typedef struct Odometer {
    int     rank;
    Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
The following primary operations are defined.
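As a rough illustration of iterating over all dimension combinations, the sketch below uses simplified versions of the two types and two hypothetical operations, odom_reset and odom_next. These names and signatures are assumptions and may not match the operations actually provided by ncgen.

```c
#include <assert.h>

#define MAXRANK 8 /* illustrative stand-in for NC_MAX_VAR_DIMS */

/* Simplified copies of the types above; only the fields used here. */
typedef struct Dim {
    unsigned long datasize;
    unsigned long index; /* 0 <= index < datasize */
} Dim;

typedef struct Odom {
    int rank;
    Dim dims[MAXRANK];
} Odom;

/* Hypothetical operation: reset all indices to zero. */
static void odom_reset(Odom* odom)
{
    int i;
    for(i = 0; i < odom->rank; i++) odom->dims[i].index = 0;
}

/* Hypothetical operation: step to the next combination, with the last
   dimension varying fastest. Returns 1 if a new combination was
   produced, 0 if the odometer wrapped around to all zeros. */
static int odom_next(Odom* odom)
{
    int i;
    assert(odom->rank > 0 && odom->rank <= MAXRANK);
    for(i = odom->rank - 1; i >= 0; i--) {
        if(++odom->dims[i].index < odom->dims[i].datasize) return 1;
        odom->dims[i].index = 0; /* carry into the next slower dimension */
    }
    return 0;
}
```

Starting from a reset odometer with sizes 2 and 3, repeated calls to odom_next visit the remaining five combinations before wrapping, for six combinations in total.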

Misc. Notes

Change Log

genc_scalardata or genc_arraydata. It stores in its Bytebuffer argument the sequence of constants corresponding to a given datalist. Handling commas is a tricky issue, so you will find that many of the non-top-level routines in cdata.c take a pointer to a global state element, commap, that determines the current state of adding commas. The idea is that at the beginning of any (sub-)Datalist, we want to turn off the comma in front of the first generated constant and then add commas until we reach the end of that (sub-)Datalist.

General Internals Information

Primary NCGEN Data Structures

There are two primary structures used in ncgen: struct Symbol and struct Datalist.

struct Symbol

Symbol objects are linked into hierarchical structures to represent netcdf dimensions, types, groups, and variables. The struct has the following fields.
struct Symbol Fields
struct Symbol* next- The Symbol objects are all kept on a single linked list. No symbol is ever deleted until the end of the program.
nc_class objectclass- This defines the general class of symbol, one of: NC_GRP, NC_DIM, NC_VAR, NC_ATT, or NC_TYPE.
nc_class subclass- This defines the sub class of symbol, one of: NC_PRIM, NC_OPAQUE, NC_ENUM, NC_FIELD, NC_VLEN, NC_COMPOUND, NC_ECONST, NC_ARRAY, or NC_FILLVALUE.
char* name- The symbol's name.
struct Symbol* container- The symbol that is the container for this symbol. Typically, this is the group symbol that contains this symbol.
struct Symbol* location- The current group that was open when this symbol was created.
List* subnodes- The list of child symbols of this symbol. For example, a group symbol will have its dimensions, types, vars, and subgroups in this list.
int is_prefixed- True if the name of this symbol contains a complete prefix path (e.g. /x/y/z).
List* prefix- A list of the prefix names for this node. Note that if is_prefixed is false, then this list was constructed from the set of enclosing groups.
struct Datalist* data- Stores the constants from attribute or datalist constructs.
Typeinfo typ- Type information about this symbol as defined by the Typeinfo structure.
Varinfo var- Variable information about a variable symbol as defined by the Varinfo structure.
Attrinfo att- Attribute information about an attribute symbol as defined by the Attrinfo structure.
Diminfo dim- Dimension information about a dimension symbol as defined by the Diminfo structure.
Groupinfo grp- Group information about a group symbol as defined by the Groupinfo structure.
int lineno- The source line in which this symbol was created.
int touched- Used in transitive closure operations to prevent revisiting symbols.
char* lname- Cached C or FORTRAN name (not used?).
int ncid- The ncid/varid/dimid, etc when defining netcdf objects.

struct Groupinfo

Group symbols primarily keep the group containment structure in the subnodes field of the Symbol.

struct Groupinfo Fields
int is_root- Is this the root group?

struct Diminfo

The only important information about a dimension, aside from name, is the dimension size. Additionally, type definitions may have anonymous (unnamed) dimensions.

struct Diminfo Fields
int isconstant- Is this an anonymous dimension?
unsigned int size- The size of the dimension.

struct Varinfo

Variables require two primary pieces of information: the set of attributes (including special attributes) and dimension information. The dimension information is kept in the Typeinfo structure because things other than variables have dimensions (e.g. user defined types).

struct Varinfo Fields
int nattributes- The number of attributes; this is redundant but useful.
List* attributes- The list of all attribute symbols associated with this variable.
Specialdata special- Special attribute values.

struct Typeinfo

The type information is probably the second most used structure in all of the code (second to Symbol itself).

struct Typeinfo Fields
struct Symbol* basetype- Provide a reference to the base type of this symbol. This applies to other types, variables, and attributes.
int hasvlen- Does the type have a vlen definition anywhere within it. This is used as an optimization to avoid searching datalists for vlen constants.
nc_type typecode- The typecode of the basetype. This is most useful when the basetype is a primitive type.
unsigned long size- The size of this object.
unsigned long offset- The field offset for fields in compound types.
unsigned long alignment- The memory alignment (i.e. 1,2,4,or 8).
Constant econst- For enumeration constants, the actual value of the constant.
Dimset dimset- The dimension information for the type or variable. The dimset stores the number of dimensions and a list of pointers to the corresponding dimension symbols.

struct Attrinfo

Note that the actual attribute data is stored in the data field of the containing Symbol.

struct Attrinfo Fields
struct Symbol* var- The variable with which this attribute is associated; it is NULL for global attributes.
unsigned long count- The number of instances associated with the attribute value.

Datalists and Datasrcs

Whenever a datalist is encountered during parsing, it is converted to an instance of struct Datalist. Each datalist instance contains a vector of instances of struct Constant that contains the actual data.

Each datalist instance contains the following information.
struct Datalist Fields
struct Datalist* next- All datalists are chained for reclamation.
int readonly- Can this datalist be modified?
unsigned int length- The number of Constant instances in the data field.
unsigned int alloc- The memory space allocated to the data field.
Constant* data- The vector in sequential memory of the constants comprising this datalist.
struct Symbol* schema- The symbol (type, variable, or attribute) defining the structure of this datalist, if known.
struct Vlen {unsigned int count; unsigned int uid;} vlen- Information about the vlen instances contained in this datalist.
Odometer* dimdata- A tracker to count through dimensions associated with this datalist via the schema.

In turn, a Constant instance is defined as follows.

typedef struct Constant {
    nc_type 	  nctype;
    int		  lineno;
    Constvalue    value;
} Constant;
It indicates the type of the value and the source line number (if known) in which this constant was created.

The Constvalue type is a union of all possible values that can occur in a datalist.

typedef union Constvalue {
    struct Datalist* compoundv; // NC_COMPOUND
    char charv;                 // NC_CHAR
    signed char int8v;          // NC_BYTE
    unsigned char uint8v;       // NC_UBYTE
    short int16v;               // NC_SHORT
    unsigned short uint16v;     // NC_USHORT
    int int32v;                 // NC_INT
    unsigned int uint32v;       // NC_UINT
    long long int64v;           // NC_INT64
    unsigned long long uint64v; // NC_UINT64
    float floatv;               // NC_FLOAT
    double doublev;             // NC_DOUBLE
    struct Stringv {            // NC_STRING
        int len;
        char* stringv;
    } stringv;
    struct Opaquev {     // NC_OPAQUE
        int len; // length as originally written (rounded to even number)
        char* stringv; //as  constant was written
                      // (padded to even # chars >= 16)
                      // without leading 0x
    } opaquev;
    struct Symbol* enumv;   // NC_ECONST
} Constvalue;

Several fields are of particular interest:
Selected Constvalue Fields
struct Datalist* compoundv- This stores nested datalists - typically of the form "{...{...}...}".
struct Stringv {int len; char* stringv;} stringv- Store string constants.
struct Opaquev {int len; char* stringv;} opaquev- Store opaque constants as written (i.e. abc...), without the leading 0x, and padded to an even number of characters to be at least 16 characters long.
struct Symbol* enumv- Pointer to an enumeration constant definition.

struct Datasrc

When it comes time to generate datalists for output, it is necessary to "walk" the datalist (including nested datalists). The Datasrc structure is used to do this. Its definition is as follows.
typedef struct Datasrc {
    unsigned int index;     // 0..length-1
    unsigned int length;
    int          autopop;   // pop when at end
    Constant*    data;      // duplicate pointer; so do not free.
    struct Datasrc* stack;
} Datasrc;
The Datasrc tracks the "current" location in the sequence of Constants (taken from a Datalist). The index field indicates the current location. In effect, Datasrc acts as the lexer, and the code that walks it parses the data sequence. The following operations are supported (see data.[ch]).
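The walking pattern can be sketched with simplified types as follows. Const and Src are cut-down stand-ins for Constant and Datasrc, and src_push, src_pop, and src_sum are hypothetical names chosen for this illustration; the actual operations in data.[ch] may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified constant: either an int or a nested list of constants. */
typedef struct Const {
    int                 isnested;
    int                 intv;
    const struct Const* nested;    /* used when isnested is true */
    unsigned int        nestedlen;
} Const;

/* Simplified Datasrc: a cursor plus a stack of suspended cursors. */
typedef struct Src {
    unsigned int index;
    unsigned int length;
    const Const* data;   /* borrowed pointer; never freed here */
    struct Src*  stack;  /* enclosing list, or NULL at top level */
} Src;

/* Hypothetical operation: enter the nested list at the current constant. */
static Src* src_push(Src* src)
{
    const Const* con = &src->data[src->index];
    Src* inner = (Src*)malloc(sizeof(Src));
    assert(inner != NULL && con->isnested);
    inner->index = 0;
    inner->length = con->nestedlen;
    inner->data = con->nested;
    inner->stack = src;
    return inner;
}

/* Hypothetical operation: leave a nested list and step past it. */
static Src* src_pop(Src* inner)
{
    Src* outer = inner->stack;
    free(inner);
    outer->index++;
    return outer;
}

/* Walk every constant, descending into nested lists, and sum the ints. */
static int src_sum(const Const* data, unsigned int length)
{
    Src top = {0, 0, NULL, NULL};
    Src* src = &top;
    int total = 0;
    top.length = length;
    top.data = data;
    for(;;) {
        if(src->index >= src->length) {
            if(src->stack == NULL) break; /* end of the top-level list */
            src = src_pop(src);
        } else if(src->data[src->index].isnested) {
            src = src_push(src);
        } else {
            total += src->data[src->index++].intv;
        }
    }
    return total;
}
```

A code generator would replace the summation with calls that emit each constant, using push and pop at the points where nested braces open and close.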

The CDL Parser

The CDL parser and associated lexer (primarily files "ncgen.y" and "ncgen.l") parse CDL files into various data structures for use by the remaining ncgen code. The data structures described above (Symbol and Datalist) are primarily generated by the parser.

Parse Cliches

Node Stacking

One of the issues that must be addressed by any bottom-up parser is handling the accumulation of sets of items (nodes, etc.). The YACC/Bison parse stack cannot be used because the set of accumulated nodes is unbounded and the YACC stack mechanism is bounded (i.e. each rule has a bounded right hand side length).

The node stacking set of cliches is ubiquitous in the parser, so they must be understood to understand how the parser works. The cliche here is shown in the handling of, for example, the varlist rule, which is defined as follows.

varlist:   varspec
             {$$=listlength(stack); listpush(stack,(elem_t)$1);}
         | varlist ',' varspec
	     {$$=$1; listpush(stack,(elem_t)$3);}
         ;
The varlist rule collects variable name declarations (via the varspec rule). The idea is to use a separate stack named "stack" and to track the index into the stack of the start of the collection of objects. The varlist value (in the YACC sense) is defined as an integer representing the size of the stack at the start of a list of variables. That is what this code does: $$=listlength(stack).

At the point where the set of varspecs should be processed, the following code cliche is used.

vardecl: typeref varlist
           {...
            stackbase=$2;
            stacklen=listlength(stack);
            for(i=stackbase;i<stacklen;i++) {
              Symbol* sym = (Symbol*)listget(stack,i);
              ...
            }
            listsetlength(stack,stackbase);// remove stack nodes
           }
           ...
The start of the set of variable declaration symbols is extracted as the integer associated with right-side non-terminal $2, e.g. stackbase=$2. The current stack length is obtained from stacklen=listlength(stack). Then the elements of the stack are extracted one by one using the above loop. Finally, the nodes on the stack are cleared by the code segment listsetlength(stack,stackbase).
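
The two halves of the cliche can also be sketched outside of YACC. The following is a minimal sketch, assuming a simplified stand-in for the List type; the helper names startlist and consumelist are hypothetical and only mirror what the varlist and vardecl actions do.

```c
#include <stdlib.h>

/* Hypothetical stand-in for ncgen's List; only what the sketch needs. */
typedef struct List {
    size_t length;
    size_t alloc;
    void** content;
} List;

static size_t listlength(List* l) { return l->length; }

static void listpush(List* l, void* elem)
{
    if(l->length == l->alloc) {
        size_t newalloc = (l->alloc == 0 ? 8 : 2*l->alloc);
        void** newcontent = (void**)realloc(l->content, newalloc*sizeof(void*));
        if(newcontent == NULL) abort();
        l->content = newcontent;
        l->alloc = newalloc;
    }
    l->content[l->length++] = elem;
}

static void* listget(List* l, size_t i) { return l->content[i]; }

static void listsetlength(List* l, size_t len) { l->length = len; }

/* First alternative of varlist: remember the base index, then push. */
static size_t startlist(List* stack, void* first)
{
    size_t base = listlength(stack);
    listpush(stack, first);
    return base;
}

/* The vardecl action: copy out elements [base,length), then pop them. */
static size_t consumelist(List* stack, size_t base, void** out)
{
    size_t i, len = listlength(stack);
    for(i=base;i<len;i++) out[i-base] = listget(stack,i);
    listsetlength(stack, base);
    return len - base;
}
```

Because only the base index travels through the parser's value stack, lists can nest: an outer collection that is still in progress is untouched when an inner one is consumed.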

Semantic Processing

Semantic processing takes the output of the parser and adds various pieces of semantic information. The semantic actions are as follows.
  1. Procedure processtypes().
    1. Do a topological sort of the types based on dependency so that the least dependent are first in the typdefs list.
    2. Fill in type typecodes.
    3. Mark types that have a vlen.
  2. Procedure filltypecodes() - Fill in implied type codes.
  3. Procedure processvars() - Fill in missing values.
  4. Procedure processattributes() - Process attributes to connect to corresponding variable.
  5. Procedure processcompound() - Process each compound type to compute its size.
  6. Procedure processenums() - Fix up enum constant values.
  7. Procedure processdatalists() - Fix up datalists.
  8. Procedure checkconsistency() - Check internal consistency.
  9. Procedure validate() - Do any needed additional semantic checks.
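
The dependency ordering done in processtypes() can be illustrated with a depth-first sort. This is a minimal sketch and not the ncgen implementation; the Type record, the MAXDEPS limit, and the function names are all hypothetical, and the type dependencies are assumed to be acyclic.

```c
#include <stddef.h>

#define MAXDEPS 4   /* hypothetical limit, for the sketch only */

/* Hypothetical, simplified type record; ncgen's Symbol is far richer. */
typedef struct Type {
    const char* name;
    struct Type* deps[MAXDEPS]; /* types this one uses; NULL ends the list */
    int visited;
} Type;

/* Depth-first visit: emit everything a type depends on, then the type. */
static size_t visit(Type* t, Type** sorted, size_t n)
{
    int i;
    if(t->visited) return n;
    t->visited = 1;
    for(i=0;i<MAXDEPS && t->deps[i] != NULL;i++)
        n = visit(t->deps[i], sorted, n);
    sorted[n++] = t;
    return n;
}

/* Fill sorted[] so that every type appears after the types it uses. */
static size_t toposort(Type** types, size_t ntypes, Type** sorted)
{
    size_t i, n = 0;
    for(i=0;i<ntypes;i++) n = visit(types[i], sorted, n);
    return n;
}
```

With this ordering, code that defines the types in list order never refers to a type that has not yet been defined.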

Generating C Code

The source code for generating C code output (via the -c option) is of most interest because it is the pattern to be used for other languages and because, frankly, it is complex and ugly at the moment and so guidance is needed in understanding it.

The files genc.[ch] and cdata.c are the primary files for C code generation. The files data.[ch] are also important.

Output Routines

The output routines are a bit of a mixed bag. It is important to know that code is not directly dumped to the output file; rather, it is accumulated in a global Bytebuffer instance called "ccode".

The output routines are as follows.

gen_ncc

The gen_ncc procedure is responsible for creating and dumping the generated C code.

It has at its disposal several global lists of Symbols. Note that the lists cross all groups.

The superficial operation of gen_ncc is as follows; the details are provided later where the operation is complex.

  1. Generate header code (e.g. #include directives).
  2. Generate C type definitions corresponding to the CDL types.
  3. Generate VLEN constants.
  4. Generate chunking constants.
  5. Generate initial part of the main() procedure.
  6. Generate C variable definitions to hold the ncids for all created groups.
  7. Generate C variable definitions to hold the typeids of all created types.
  8. Generate C variables and constants that correspond to the CDL dimensions.
  9. Generate C variable definitions to hold the varids of all created variables.
  10. Generate C code to create the netCDF binary file.
  11. Generate C code to create all groups in the proper hierarchy.
  12. Generate C code to create the type definitions.
  13. Generate C code to create the dimension definitions.
  14. Generate C code to create the variable definitions.
  15. Generate C code to create the global attributes.
  16. Generate C code to create the non-global attributes.
  17. Generate C code to leave define mode.
  18. Generate C code to assign variable datalists.

The following code generates C code for defining the groups. It is fairly canonical and can be seen repeated in variant form when defining dimensions, types, variables, and attributes.

This code is redundant, but for consistency the root group ncid is stored like all other group ncids. Note that nprintf is a macro wrapper around snprintf.

nprintf(stmt,sizeof(stmt),"    %s = ncid;",groupncid(rootgroup));
cline(stmt);

The loop walks all group symbols in preorder form and generates a C code call to nc_def_grp using parameters taken from the group Symbol instance (gsym). The call to nc_def_grp is followed by a call to the check_err procedure to verify the operation's result code.

for(igrp=0;igrp<...;igrp++) {
    ...
    if(gsym->container == NULL) PANIC("null container");
    nprintf(stmt,sizeof(stmt),
            "    stat = nc_def_grp(%s, \"%s\", &%s);",
            groupncid(gsym->container),
            gsym->name, groupncid(gsym));
    cline(stmt); // print the def_grp call
    clined(1,"check_err(stat,__LINE__,__FILE__);");
}
flushcode();

The code to generate dimensions, types, attributes, and variables is similar, although often more complex.
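
The essence of the pattern is formatting one generated statement per symbol into a buffer. The following is a minimal sketch, assuming a simplified Group record in place of Symbol and a plain character buffer in place of the Bytebuffer; gen_def_grp and the field names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified group record; the real Symbol has many more fields. */
typedef struct Group {
    const char* name;
    const char* ncidvar;      /* name of the C variable holding this group's ncid */
    struct Group* container;  /* parent group; NULL for the root */
} Group;

/* Append the generated C statements that define one group to out, which
   must already hold a NUL-terminated string shorter than outlen.
   Returns 0 on success, -1 if the group has no container or out is too small. */
static int gen_def_grp(const Group* g, char* out, size_t outlen)
{
    size_t used = strlen(out);
    int n;
    if(g->container == NULL) return -1;
    n = snprintf(out+used, outlen-used,
                 "    stat = nc_def_grp(%s, \"%s\", &%s);\n"
                 "    check_err(stat,__LINE__,__FILE__);\n",
                 g->container->ncidvar, g->name, g->ncidvar);
    if(n < 0 || (size_t)n >= outlen-used) return -1;
    return 0;
}
```

Note that __LINE__ and __FILE__ sit inside the format string, so they are expanded when the generated program is compiled, not when the generator runs.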

The code to generate C equivalents of CDL types is in the procedure definectype(). Note that this code is not the code that invokes e.g. nc_def_vlen. The generated C types are used when generating datalists so that the standard C constant assignment mechanism will produce the correct memory values.

The genc_deftype procedure is the one that actually generates C code to define the netcdf types. The generated C code is designed to store the resulting typeid into the C variable defined earlier for holding that typeid.

Note that for compound types, the NC_COMPOUND_OFFSET macro is normally used to match netcdf offsets to the corresponding struct type generated in definectype. However, there is a flag, TESTALIGNMENT, that can be set to use a computed value for the offset.
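
The reason for taking offsets from the generated struct can be seen in a small example. This is a sketch only: it uses the standard offsetof macro as a stand-in for NC_COMPOUND_OFFSET, the struct cmpd_t is hypothetical, and the "packed" values are a naive sum of member sizes, not what TESTALIGNMENT computes.

```c
#include <stddef.h>

/* Hypothetical C struct of the kind generated for a CDL compound type. */
typedef struct cmpd_t {
    char   tag;
    double value;
    int    count;
} cmpd_t;

/* Offsets as the compiler lays the struct out, padding included. */
static const size_t tag_offset   = offsetof(cmpd_t, tag);
static const size_t value_offset = offsetof(cmpd_t, value);
static const size_t count_offset = offsetof(cmpd_t, count);

/* Naive offsets obtained by summing member sizes, ignoring alignment. */
static const size_t packed_value_offset = sizeof(char);
static const size_t packed_count_offset = sizeof(char) + sizeof(double);
```

On most platforms value_offset is larger than packed_value_offset because the compiler pads after tag. Any computed offset therefore has to reproduce the compiler's alignment rules, or the netcdf field offsets will not match the memory layout of the C constants.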

C Constant Datalist Generation

All attributes, and some variables, require the construction of a memory object containing data to be assigned to that attribute or variable. The code to do this is by far the most complicated in ncgen. The file cdata.c contains the procedure genc_datalist(), which does most of the heavy lifting.

For attributes, the general form generated is

T* attributevar = {...};
Except for VLENs, the datalist is completely contained in the braces, with brace nesting as required. A generated pointer to attributevar is included in the generated call to nc_put_att().

For variables, the general form generated is similar to attributes.

T* varvar = {...};
Again, VLENs are handled specially. Also, for performance purposes, the datalist is loaded in pieces using nc_put_vara(). This is required if there are UNLIMITED dimensions, but is used for all cases for uniformity.

Datalist Closures

The code uses a concept of closure or callback to allow the datalist processing to periodically call external code to do the actual C code generation. Basically, each call to the callback will generate C code for constants and calls to nc_put_vara(). The closure data structure (struct Putvar) is defined as follows.
typedef struct Putvar {
    int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
    int rank;
    Bytebuffer* code;
    size_t startset[NC_MAX_VAR_DIMS];
    struct CDF {
        int grpid;
        int varid;
    } cdf;
    struct C {
        Symbol* var;
    } c;
} Putvar;
An instance of the closure is created for each variable that is the target of nc_put_vara(). It is initialized with the variable's symbol, rank, group id and variable id. It is also provided with a Bytebuffer into which it is supposed to store the generated C code. The startset is the cached previous set of dimension indices used for generating the nc_put_vara (see below).

The callback procedure (field "putvar") for generating C code is assigned to the procedure cputvara() (defined in genc.c). This procedure takes as arguments the closure object, an odometer describing the current set of dimension indices, and a Bytebuffer containing the generated C constants to be assigned to this slice of the variable.

Every time the closure procedure is called, it generates a C variable to hold the generated C constant. It then generates an nc_put_vara() call. The start vector argument for the nc_put_vara is defined by the startset field of the closure. The count vector argument to nc_put_vara is computed from the current cached start vector and from the indices in the odometer. After the nc_put_vara() is generated, the odometer vector is assigned to the startset field in the closure for use on the next call.

There are some important assumptions about the state of the odometer when it is called.

  1. The zeroth index controls the count set.
  2. All other indices are assumed to be at their max values.

In particular, this means that the start vector is zero for all positions except position zero. For all positions except zero, the count vector holds the index in the odometer, which is assumed to be the max.

For start position zero, the position is taken from the last saved startset. The count position zero is the difference between that last start position and the current odometer zeroth index.
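
The start and count computation described above can be sketched as follows. This is not the code in cputvara(); the function name slice and the constant MAXRANK are hypothetical, and the odometer is reduced to a plain array of indices. The rank is assumed to be at least one.

```c
#include <stddef.h>

#define MAXRANK 8  /* stand-in for NC_MAX_VAR_DIMS */

/* Compute the start and count vectors for one nc_put_vara() call.
   startset: dimension indices cached from the previous call
   odom:     current odometer indices; positions 1..rank-1 are at their max
   On return, startset has been advanced to odom for the next call. */
static void slice(int rank, size_t* startset, const size_t* odom,
                  size_t* start, size_t* count)
{
    int i;
    /* Position zero: continue from where the previous call stopped. */
    start[0] = startset[0];
    count[0] = odom[0] - startset[0];
    /* All other positions: write the whole extent, starting at zero. */
    for(i=1;i<rank;i++) {
        start[i] = 0;
        count[i] = odom[i];
    }
    /* Cache the odometer for the next call. */
    for(i=0;i<rank;i++) startset[i] = odom[i];
}
```

For a variable with dimensions [*][5][3], a first call with odometer {2,5,3} gives start {0,0,0} and count {2,5,3}; a second call with odometer {6,5,3} gives start {2,0,0} and count {4,5,3}.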

If all of this sounds complex, it is, and if/when I have time I will rethink the whole process of datalist generation from beginning to end.

VLEN Constants

VLEN constants need to be constructed as separate C data constants because the C compiler will never convert nested groups ({...}) to separate memory chunks. Thus, ncgen must in several places generate the VLEN constants as separate variables and then insert pointers to them in the appropriate places in the later datalist C constants.

As an optimization, ncgen tracks which datatypes will require use of vlen constants. This is any type whose definition is a vlen or whose basetype contains a vlen type.

The vlen generation process is two-fold. First, in the procedure processdatalist1() in semantics.c, the location of the struct Datalist objects that correspond to vlen constants is stored in a list called vlenconstants. When detected, each such struct Datalist object is tagged with a unique identifier and the vlen length (count). These will be used later to generate references to the vlen constant.

The second vlen constant processing action is in the procedure genc_vlenconstant() in cdata.c. It walks the vlenconstants list and generates C code for variables to define each vlen constant and C code to assign the vlen constant's data to that variable.

When, later, the genc_datalist procedure encounters a Datalist tagged as representing a vlen constant, it can generate an nc_vlen_t constant as {<count>,<vlenconstantname>} and use it directly in the generated C datalist constant.
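
The shape of the generated code can be sketched as follows. The names vlen_1, vlen_2, and v_data are hypothetical, and vlen_t is a stand-in with the same two members as netCDF's nc_vlen_t so that the sketch compiles without netcdf.h.

```c
#include <stddef.h>

/* Stand-in with the same two members as netCDF's nc_vlen_t. */
typedef struct {
    size_t len;  /* number of elements */
    void*  p;    /* pointer to the elements */
} vlen_t;

/* Step 1: each vlen constant becomes its own array. */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* Step 2: the datalist constant refers to them as {<count>,<name>}. */
static vlen_t v_data[] = {
    {3, vlen_1},
    {2, vlen_2}
};
```

Writing the inner lists inline, as in {{1,2,3},{4,5}}, would not work, because the initializer for a pointer member must be an address, not a brace-enclosed list.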

Walking the Datalist

To actually generate the C code for a datalist constant, the procedure genc_datalist wraps the Datalist in a Datasrc and proceeds to walk it constant by constant, generating the corresponding C constant. The bulk of the work is performed in the recursive procedure genc_datalist1().

For better or worse, the code acts like a 1-lookahead parser. This means that it decides what to do based on the current type, the current constant and, when necessary, the next constant in the Datasrc. In practice, the lookahead is hidden, so it is not represented in the following table.

Current Type   Current Constant           Action
NC_PRIM        Primitive Constant         Generate the C constant; convert as necessary.
NC_OPAQUE      ''                         ''
NC_ENUM        ''                         ''
NC_COMPOUND    Nested Datalist Constant   Push into the datalist and recurse on each field; when done, pop back to the previous datalist.
NC_COMPOUND    Any other Constant         Continue to recurse on each field; this allows specification of fields without enclosing them in {...}.
NC_VLEN        Nested Datalist Constant   Generate the nc_vlen_t instance using the tagged information in the struct Datalist.
NC_FIELD       NA                         If this field is dimensioned, then call genc_fielddata to walk the dimensions. Otherwise, just recurse on genc_datalist1.

The genc_fielddata() procedure iterates over a field dimension and calls itself recursively to walk the remaining dimensions. If this is the last dimension, then it calls genc_datalist1 to generate C code for the basetype of the field.

String/Char Handling

All through the genc_datalist code, there are special cases for handling string constants. The reason is, of course, that the string constant "abcd.." may, depending on the type context, be either a string or an array of characters.
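
The two readings of the same constant lead to different C. A minimal illustration follows; the variable names are hypothetical and the exact form that ncgen emits may differ.

```c
#include <string.h>

/* As a string (CDL type string): one pointer to a NUL-terminated value. */
static const char* as_string = "abcd";

/* As characters (CDL type char with a dimension of size 4): four separate
   values and no terminating NUL. */
static const char as_chars[4] = {'a', 'b', 'c', 'd'};
```

The first form occupies one pointer-sized slot in the data for the variable; the second occupies exactly four bytes. This is why the type context has to be consulted before the constant is emitted.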

Generating Variable Data

The genc_datalist code does not call closures. The closures are used in the genc_vardata() and genc_vardata1() procedures; genc_vardata1 being the recursive procedure that actually calls the closure.

The genc_vardata1() procedure, like genc_fielddata, iterates over a top-level dimension and calls itself recursively to iterate over the remaining dimensions. The term "top-level" refers to the fact that these are the dimensions specified for a variable as opposed to field dimensions.

When iterating an UNLIMITED dimension, or when iterating the first dimension, the code generates a datalist for this subslice and then calls the closure to generate the C code.

Miscellaneous

Pool Memory Allocation

As an approximation to garbage collection, this code uses a pool allocation mechanism. The goal is to allow dynamic construction of strings that have very short life-times; typically they are used to construct strings to send to the output file.

The pool mechanism wraps malloc and records the malloc'd memory in a circular buffer. When the buffer reaches its maximum size, previously allocated pool buffers are free'd. This is good in that the user does not have to litter code with free() statements. It is bad in that the pool allocated memory can be free'd too early if the memory does not have a short enough life. If you suspect the latter, then bump the size of the circular buffer and see if the problem goes away. If so, then your code is probably holding on to a pool buffer too long and should use regular malloc/free.

In the end, I am not sure if this is a good idea, but it does make the code simpler.
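
The mechanism can be sketched in a few lines. This is a minimal illustration and not the ncgen pool code; the names poolalloc, pool, poolnext, and POOLMAX are hypothetical.

```c
#include <stdlib.h>

#define POOLMAX 4   /* deliberately tiny so that recycling is easy to see */

static void* pool[POOLMAX];  /* circular buffer of live allocations */
static int   poolnext = 0;   /* slot to be recycled next */

/* Allocate n bytes; the block is freed automatically once POOLMAX
   later allocations have been made. */
static void* poolalloc(size_t n)
{
    void* mem = malloc(n);
    if(mem == NULL) return NULL;
    free(pool[poolnext]);        /* free(NULL) is a no-op */
    pool[poolnext] = mem;
    poolnext = (poolnext + 1) % POOLMAX;
    return mem;
}
```

With POOLMAX set to 4, the pointer returned by the first call becomes invalid during the fifth call. That is exactly the early-free hazard described above, and enlarging the buffer only postpones it.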

List and Bytebuffer

The two datatypes List and Bytebuffer are used throughout the code. They correspond closely in semantics to the Java ArrayList and StringBuffer types, respectively. They are used to help encapsulate dynamically growing lists of objects or bytes to reduce certain kinds of memory allocation errors.

The canonical code for non-destructive walking of a List is as follows.

for(i=0;i<listlength(list);i++) {
    T* element = (T*)listget(list,i);
    ...
}

Bytebuffer provides two ways to access its internal buffer of characters. One is "bbContents()", which returns a direct pointer to the buffer, and the other is "bbDup()", which returns a malloc'd string containing the contents and is guaranteed to be null terminated.
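
The difference between the two accessors matters for ownership and lifetime. The following is a minimal sketch, assuming a much reduced Bytebuffer; the struct layout and the function bodies are hypothetical and only illustrate the contract described above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, minimal Bytebuffer; the real one has more fields and checks. */
typedef struct Bytebuffer {
    size_t length;   /* bytes in use */
    size_t alloc;    /* bytes allocated */
    char*  content;
} Bytebuffer;

/* Append a string, growing the buffer as needed. Returns 0 on failure. */
static int bbCat(Bytebuffer* bb, const char* s)
{
    size_t n = strlen(s);
    if(n == 0) return 1;
    if(bb->length + n > bb->alloc) {
        size_t newalloc = 2*(bb->length + n);
        char* newcontent = (char*)realloc(bb->content, newalloc);
        if(newcontent == NULL) return 0;
        bb->content = newcontent;
        bb->alloc = newalloc;
    }
    memcpy(bb->content + bb->length, s, n);
    bb->length += n;
    return 1;
}

/* Direct pointer into the buffer: no copy, not necessarily NUL-terminated,
   and invalidated by the next append that reallocates. */
static char* bbContents(Bytebuffer* bb) { return bb->content; }

/* Independent malloc'd copy, always NUL-terminated; the caller frees it. */
static char* bbDup(Bytebuffer* bb)
{
    char* copy = (char*)malloc(bb->length + 1);
    if(copy == NULL) return NULL;
    if(bb->length > 0) memcpy(copy, bb->content, bb->length);
    copy[bb->length] = '\0';
    return copy;
}
```

As a rule of thumb, use the direct pointer for an immediate write of a known length, and the duplicate whenever the text must outlive further appends or must be passed to code that expects a C string.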

Odometer: Multi-Dimensional Array Handling

The odometer data type is used to convert multiple dimensions into a single integer. The rule for converting a multi-dimensional array to a single dimension is as follows.

Suppose we have the declaration int F[2][5][3];. There are obviously a total of 2 X 5 X 3 = 30 integers in F. Thus, these three dimensions will be reduced to a single dimension of size 30.

A particular point in the three dimensions, say [x][y][z], is reduced to a number in the range 0..29 by computing ((x*5)+y)*3+z. The corresponding general C code is as follows.

size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i=0;i<rank;i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}
In this code, the indices variable corresponds to the x,y, and z. The sizes variable corresponds to the 2,5, and 3.
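
As a worked example for F[2][5][3], the point [1][2][1] maps to ((1*5)+2)*3+1 = 22, and the last point [1][4][2] maps to ((1*5)+4)*3+2 = 29. The sketch below restates dimmap so that it stands alone and adds an inverse, dimunmap, which is a hypothetical helper and not part of ncgen.

```c
#include <stddef.h>

/* Same computation as dimmap() above, restated so the sketch stands alone. */
static size_t dimmap(int rank, const size_t* indices, const size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i=0;i<rank;i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}

/* Inverse mapping (hypothetical helper): recover the indices from a
   linear offset by peeling off the last dimension first. */
static void dimunmap(int rank, size_t offset, const size_t* sizes, size_t* indices)
{
    int i;
    for(i=rank-1;i>=0;i--) {
        indices[i] = offset % sizes[i];
        offset /= sizes[i];
    }
}
```

Note that the size of the first dimension never enters the forward computation, which is why the same rule also works when the first dimension is UNLIMITED.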

The Odometer type stores a set of dimensions and supports operations to iterate over all possible dimension combinations. It is defined by the types Odometer and Dimdata.

typedef struct Dimdata {
    unsigned long datasize; // actual size of the datalist item
    unsigned long index;    // 0 <= index < datasize
    unsigned long declsize;
} Dimdata;

typedef struct Odometer {
    int     rank;
    Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
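
Iteration over all index combinations can be sketched with these types as follows. The function odom_next is a hypothetical helper written for this illustration and is not claimed to match any operation in the ncgen source; MAXRANK stands in for NC_MAX_VAR_DIMS.

```c
#include <stddef.h>

#define MAXRANK 8  /* stand-in for NC_MAX_VAR_DIMS */

/* Copies of the types above, so that the sketch stands alone. */
typedef struct Dimdata {
    unsigned long datasize; /* actual size of the datalist item */
    unsigned long index;    /* 0 <= index < datasize */
    unsigned long declsize;
} Dimdata;

typedef struct Odometer {
    int     rank;
    Dimdata dims[MAXRANK];
} Odometer;

/* Advance to the next index combination, rightmost dimension fastest.
   Returns 1 if a new combination is available, 0 if the odometer has
   wrapped around to all zeros. */
static int odom_next(Odometer* odom)
{
    int i;
    for(i=odom->rank-1;i>=0;i--) {
        odom->dims[i].index++;
        if(odom->dims[i].index < odom->dims[i].datasize) return 1;
        odom->dims[i].index = 0; /* carry into the next dimension to the left */
    }
    return 0;
}
```

Starting from all zeros, a rank 2 odometer with sizes 2 and 3 visits six combinations, in the same order in which dimmap numbers them.
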
The following primary operations are defined.

Change Log