mirror of
https://github.com/Unidata/netcdf-c.git
synced 2024-12-27 08:49:16 +08:00
1478 lines
56 KiB
HTML
|
<html>
|
||
|
<body>
|
||
|
<center>
|
||
|
<pre>
|
||
|
/* Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.
|
||
|
See the COPYRIGHT file for more information. */
|
||
|
</pre>
|
||
|
<h1>NCGEN Internals Documentation</h1>
|
||
|
<h3>Draft: 03/07/2009<br>
|
||
|
Last Revised: 07/15/2009</h3>
|
||
|
</center>
|
||
|
|
||
|
<h1><u>Introduction</u></h1>
|
||
|
This document is an ongoing effort to
|
||
|
describe the internal operation of the ncgen
|
||
|
cdl compiler; ncgen is a part of the netcdf
|
||
|
system.
|
||
|
<p>
|
||
|
The document has two primary parts.
|
||
|
<ol>
|
||
|
<li><a href="#LANG">Language Support</a>
|
||
|
-- describes how to add a new output language to ncgen.
|
||
|
<p>
|
||
|
<li><a href="#GIT">General Internals Information</a>
|
||
|
-- describes additional information about the internals;
|
||
|
parsing, for example.
|
||
|
</ol>
|
||
|
|
||
|
<h1><u><a name="LANG">Modifying NCGEN to Output a New Language</a></u></h1>
|
||
|
|
||
|
This document outlines the general method for adding
|
||
|
a new language output to ncgen. Currently, it supports
|
||
|
binary, C, and (experimentally) NcML and Java.
|
||
|
Before reading this document, the reader should also
|
||
|
review the internals.html document.
|
||
|
<p>
|
||
|
Also, the reader should note that the code is a bit crufty
|
||
|
and needs refactoring. This is primarily because
|
||
|
it was originally defined to support only C and
|
||
|
each new language stresses the code.
|
||
|
<p>
|
||
|
In order to get ncgen to generate output for a new
|
||
|
language, the following steps are required.
|
||
|
|
||
|
<ol>
|
||
|
<li> <a href="#Misc">Modify various files to invoke the new language output.</a>
|
||
|
<li> <a href="#Create">Create a new set of generate functions.</a>
|
||
|
</ol>
|
||
|
|
||
|
<h2><a name="Misc">Modify various files to invoke the new language output.</a></h2>
|
||
|
The following steps are required to provide the necessary code
|
||
|
to invoke a new language output.
|
||
|
For the purposes of this discussion, let us call the language Java.
|
||
|
|
||
|
<h4>ncgen.h</h4>
|
||
|
<ol>
|
||
|
<li> Locate the code enabler #defines
|
||
|
(e.g. <code>#define ENABLE_C</code>)
|
||
|
and insert a new one of the form
|
||
|
<pre>
|
||
|
#define ENABLE_JAVA
|
||
|
</pre>
|
||
|
</ol>
|
||
|
|
||
|
<h4>main.c</h4>
|
||
|
<ol>
|
||
|
<li> Locate the global declaration (<code>int fortran_flag;</code>)
|
||
|
and insert a new declaration.
|
||
|
<pre>int java_flag;</pre>
|
||
|
|
||
|
<li> Locate the initialization (<code>fortran_flag = 0;</code>)
|
||
|
in the body of the main() procedure and add a new initialization.
|
||
|
<pre>java_flag = 0;</pre>
|
||
|
|
||
|
<li>Locate the options processing switch case for -l (<code>case 'l':</code>).
|
||
|
Duplicate one of the instances there and add to the conditionals.
|
||
|
It should look like this.
|
||
|
<pre>
|
||
|
} else if(strcmp(lang_name, "java") == 0
|
||
|
|| strcmp(lang_name, "Java") == 0) {java_flag = 1;}
|
||
|
</pre>
|
||
|
|
||
|
<li> Just after the options processing switch code,
|
||
|
there are a number of #ifndef conditionals
|
||
|
(e.g. <code>#ifndef ENABLE_C</code>).
|
||
|
Add a new one for Java.
|
||
|
It should look like this.
|
||
|
<pre>
|
||
|
#ifndef ENABLE_JAVA
|
||
|
if(java_flag) {
|
||
|
fprintf(stderr,"Java not currently supported\n");
|
||
|
exit(1);
|
||
|
}
|
||
|
#endif
|
||
|
</pre>
|
||
|
</ol>
|
||
|
|
||
|
<h2><a name="Create">Create a new set of generate functions.</a></h2>
|
||
|
The hard part is creating the actual code generation files.
|
||
|
To do this, it is easiest to take one of the existing
|
||
|
generators and modify it, viz:
|
||
|
<ul>
|
||
|
<li> copy genc.c genj.c
|
||
|
<li> copy cdata.c jdata.c
|
||
|
</ul>
|
||
|
The genj.c file will do most of the code generation. The jdata.c file
|
||
|
will generate lists of data constants that come from the CDL data: section.
|
||
|
There is nothing magical about using two files: they can be refactored
|
||
|
as desired.
|
||
|
<p>
|
||
|
In order to facilitate code generation, it is useful to look
|
||
|
at the translations produced by other languages.
|
||
|
The idea is to take these translations and decide what the
|
||
|
corresponding Java (for example) code would look like.
|
||
|
Then the idea is to modify the genc code (in genj.c)
|
||
|
to reflect that translation.
|
||
|
<p>
|
||
|
In most of the rest of this discussion, the genc.c and cdata.c
|
||
|
code will be used to explain the operation.
|
||
|
Appropriate procedure renaming should be done for new languages
|
||
|
(e.g., for Java, <i>genc_XXX</i> is changed to <i>genj_XXX</i>
|
||
|
consistently).
|
||
|
|
||
|
<h3>Useful Output Procedures</h3>
|
||
|
The following output procedures are defined in genc.c to create C output.
|
||
|
The idea is that output is accumulated in a <a href="#Bytebuffer">Bytebuffer</a>
|
||
|
called ccode. Periodically, ccode
|
||
|
contents are flushed to stdout.
|
||
|
The relevant procedures from the C code are as follows.
|
||
|
<ol>
|
||
|
<li> <code>void cprint(Bytebuffer* buf)</code>
|
||
|
-- dump the contents of buf to output (ccode actually).
|
||
|
<li> <code>void cpartial(char* line)</code>
|
||
|
-- dump the specified string to output.
|
||
|
<li> <code>void cline(char* line)</code>
|
||
|
-- dump the specified string to output and add a newline.
|
||
|
<li> <code>void clined(int n, char* line)</code>
|
||
|
-- dump the specified string to output preceded by
|
||
|
<i>n</i> instances of indentation.
|
||
|
<li> <code>void cflush(void)</code>
|
||
|
-- dump the contents of ccode to standard output
|
||
|
and reset the ccode buffer.
|
||
|
</ol>
|
||
|
There is, of course, nothing sacred about these procedures:
|
||
|
feel free to modify as needed. In fact, there are two
|
||
|
important reasons to modify the code.
|
||
|
First, the indentation rules may differ from language to language
|
||
|
(FORTRAN 77 for example). Second, the rules for folding lines
|
||
|
that are too long differ across languages.
|
||
|
It is usually easiest to handle both of these issues
|
||
|
in the output procedures.
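<p>
The output procedures above can be sketched as follows. This is a minimal illustration, not the genc.c code: it assumes a fixed-size character array in place of the ccode Bytebuffer, and the names <code>out_partial</code>, <code>out_line</code>, <code>out_lined</code>, <code>out_reset</code>, <code>OUT_MAX</code> and <code>INDENT_WIDTH</code> are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OUT_MAX 4096      /* hypothetical capacity; the real Bytebuffer grows dynamically */
#define INDENT_WIDTH 4    /* hypothetical: blanks per indentation level */

static char out_buf[OUT_MAX];  /* stand-in for the ccode Bytebuffer */
static size_t out_len = 0;

/* Append a string to the accumulated output (compare cpartial). */
static void out_partial(const char* s)
{
    size_t n = strlen(s);
    assert(out_len + n < OUT_MAX);
    memcpy(out_buf + out_len, s, n);
    out_len += n;
    out_buf[out_len] = '\0';
}

/* Append a string followed by a newline (compare cline). */
static void out_line(const char* s)
{
    out_partial(s);
    out_partial("\n");
}

/* Append n levels of indentation, then the line (compare clined). */
static void out_lined(int n, const char* s)
{
    int i, j;
    for(i = 0; i < n; i++)
        for(j = 0; j < INDENT_WIDTH; j++)
            out_partial(" ");
    out_line(s);
}

/* Discard the accumulated output; the real cflush writes it to stdout first. */
static void out_reset(void)
{
    out_len = 0;
    out_buf[0] = '\0';
}
```

A generator for a language with different indentation or line-folding rules would change only <code>out_lined</code> and <code>out_line</code>; callers stay the same.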
|
||
|
<p>
|
||
|
The <a href="#Bytebuffer">Bytebuffer</a> type is an important data structure.
|
||
|
It allows for dynamically creating strings of characters
|
||
|
(actually arbitrary 8 bit values).
|
||
|
Most of the operations should be obvious: examine bytebuffer.h.
|
||
|
It is used widely in this code especially to capture sub-pieces
|
||
|
of the generated code that must be saved for out-of-order output.
|
||
|
|
||
|
<h3>Code Generation</h3>
|
||
|
The code generation method used for C is a pretty good
|
||
|
general paradigm, so this discussion will use it as a model.
|
||
|
The gen_ncc procedure is responsible for
|
||
|
creating and dumping the generated C code.
|
||
|
<p>
|
||
|
It has at its disposal several global lists of Symbols.
|
||
|
Note that the lists cross all groups.
|
||
|
<ul>
|
||
|
<li>dimdefs - the set of symbols defining dimensions.
|
||
|
<li>vardefs - the set of symbols defining variables.
|
||
|
<li>attdefs - the set of symbols defining non-global attributes.
|
||
|
<li>gattdefs - the set of symbols defining global attributes.
|
||
|
<li>grpdefs - the set of symbols defining groups.
|
||
|
<li>typdefs - the set of symbols defining types; note that this list
|
||
|
has been topologically sorted so that a given type depends only
|
||
|
on types with lower indices in the list.
|
||
|
</ul>
|
||
|
<p>
|
||
|
The superficial operation of gen_ncc is as follows; the details
|
||
|
are provided later where the operation is complex.
|
||
|
<ol>
|
||
|
<li>Generate header code (e.g. <code>#include &lt;stdio.h&gt;</code>).
|
||
|
<li>Generate C type definitions corresponding to the
|
||
|
CDL types.
|
||
|
<li>Generate VLEN constants.
|
||
|
<li>Generate chunking constants.
|
||
|
<li>Generate initial part of the main() procedure.
|
||
|
<li>Generate C variable definitions to hold the ncids
|
||
|
for all created groups.
|
||
|
<li>Generate C variable definitions to hold the typeids
|
||
|
of all created types.
|
||
|
<li>Generate C variables and constants that correspond
to the CDL dimensions.
|
||
|
<li>Generate C variable definitions to hold the dimids
|
||
|
of all created dimensions.
|
||
|
<li>Generate C variable definitions to hold the varids
|
||
|
of all created variables.
|
||
|
<li>Generate C code to create the netCDF binary file.
|
||
|
<li>Generate C code to create all the groups in the proper
|
||
|
hierarchy.
|
||
|
<li>Generate C code to create the type definitions.
|
||
|
<li>Generate C code to create the dimension definitions.
|
||
|
<li>Generate C code to create the variable definitions.
|
||
|
<li>Generate C code to create the global attributes.
|
||
|
<li>Generate C code to create the non-global attributes.
|
||
|
<li>Generate C code to leave define mode.
|
||
|
<li>Generate C code to assign variable datalists.
|
||
|
</ol>
|
||
|
<p>
|
||
|
The following code generates C code for defining the groups.
|
||
|
It is fairly canonical and can be seen repeated in variant form
|
||
|
when defining dimensions, types, variables, and attributes.
|
||
|
<p>
|
||
|
This code is redundant but for consistency, the root group
|
||
|
ncid is stored like all other group ncids.
|
||
|
Note that nprintf is a macro wrapper around snprintf.
|
||
|
<pre>
|
||
|
nprintf(stmt,sizeof(stmt),"%s%s = ncid;",indented(1),groupncid(rootgroup));
|
||
|
cline(stmt);
|
||
|
</pre>
|
||
|
<p>
|
||
|
The loop walks all group symbols in preorder
and generates a C call to nc_def_grp
using parameters taken from the group Symbol instance (gsym).
The call to nc_def_grp is followed by a call to the
check_err procedure to verify the operation's result code.
|
||
|
<pre>
|
||
|
for(igrp=0;igrp<listlength(grpdefs);igrp++) {
|
||
|
Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
|
||
|
if(gsym == rootgroup) continue; // ignore root
|
||
|
if(gsym->container == NULL) PANIC("null container");
|
||
|
nprintf(stmt,sizeof(stmt),
|
||
|
"%sstat = nc_def_grp(%s, \"%s\", &%s);",
|
||
|
indented(1),
|
||
|
groupncid(gsym->container),
|
||
|
gsym->name, groupncid(gsym));
|
||
|
cline(stmt); // print the def_grp call
|
||
|
clined(1,"check_err(stat,__LINE__,__FILE__);");
|
||
|
}
|
||
|
flushcode();
|
||
|
</pre>
|
||
|
Note the call to indented(). It generates a blank string corresponding
to indentation level N, where N is its argument; level N might result in
more or fewer than N blank characters.
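<p>
One way such an indentation helper could work is sketched below. It is an assumption about the general technique, not the ncgen implementation; <code>indent_sketch</code>, <code>INDENT_STEP</code> and <code>blanks</code> are hypothetical names. The helper returns a pointer into the tail of a static run of blanks, so nothing needs to be allocated or freed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INDENT_STEP 4  /* hypothetical: blanks per indentation level */

/* A static run of blanks; its length caps the deepest indentation. */
static const char blanks[] = "                                ";

/* Return a string of blanks for the given level, clamped to the run above. */
static const char* indent_sketch(int level)
{
    size_t total = sizeof(blanks) - 1;
    size_t want;
    assert(level >= 0);
    want = (size_t)level * INDENT_STEP;
    if(want > total) want = total;   /* clamp very deep nesting */
    return blanks + (total - want);
}
```

Because the step is a constant, level 1 yields four blanks here rather than one, which is the sense in which a level does not equal a character count.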
|
||
|
<p>
|
||
|
Note also that one must be careful when dumping names
|
||
|
(e.g. gsym->name above) if the name is expected to contain
|
||
|
utf8 characters. For C, utf8 works fine in strings, but with
|
||
|
a language like Java, which takes utf-16 characters,
|
||
|
some special encoding is required to convert the non-ascii
|
||
|
characters to use the \uxxxx form.
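<p>
The conversion described above can be sketched in C as follows. This is an illustration only, not code from genj.c: the function name <code>utf8_to_java_escapes</code> is hypothetical, the input is assumed to be well-formed UTF-8, and code points above U+FFFF are written as a UTF-16 surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Convert a UTF-8 string to ASCII with \uXXXX escapes for non-ASCII characters.
   Assumes well-formed UTF-8. Returns 0 on success, -1 if out is too small. */
static int utf8_to_java_escapes(const char* in, char* out, size_t outsize)
{
    const unsigned char* p = (const unsigned char*)in;
    size_t used = 0;
    if(outsize == 0) return -1;
    while(*p) {
        unsigned long cp;
        int extra, i;
        /* Decode the lead byte: it fixes the number of continuation bytes. */
        if(*p < 0x80)      { cp = *p;        extra = 0; }
        else if(*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if(*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else               { cp = *p & 0x07; extra = 3; }
        p++;
        for(i = 0; i < extra && *p; i++) {
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
            p++;
        }
        /* Worst case is two escapes (12 characters) plus the terminator. */
        if(used + 13 > outsize) return -1;
        if(cp < 0x80) {
            out[used++] = (char)cp;
        } else if(cp < 0x10000UL) {
            used += (size_t)sprintf(out + used, "\\u%04lx", cp);
        } else {
            unsigned long v = cp - 0x10000UL;
            used += (size_t)sprintf(out + used, "\\u%04lx\\u%04lx",
                                    0xD800UL + (v >> 10), 0xDC00UL + (v & 0x3FFUL));
        }
    }
    out[used] = '\0';
    return 0;
}
```

A generator would apply such a function to names like gsym->name before embedding them in a Java string literal.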
|
||
|
<p>
|
||
|
The code to generate dimensions, types, attributes, and variables
|
||
|
is similar, although often more complex.
|
||
|
<p>
|
||
|
The code to generate C equivalents of CDL types is
|
||
|
in the procedure definectype().
|
||
|
Note that this code is not the code that invokes e.g. nc_def_vlen.
|
||
|
The generated C types are used when generating datalists
|
||
|
so that the standard C constant assignment mechanism will produce
|
||
|
the correct memory values.
|
||
|
<p>
|
||
|
For non-C languages, the interaction between this code and the
|
||
|
nc_def_TYPE code may be rather more complex than with C.
|
||
|
<p>
|
||
|
The genc_deftype procedure is the one that actually
|
||
|
generates C code to define the netcdf types.
|
||
|
The generated C code is designed to store the resulting
|
||
|
typeid into the C variable defined earlier
|
||
|
for holding that typeid.
|
||
|
<p>
|
||
|
Note that for compound types, the NC_COMPOUND_OFFSET
|
||
|
macro is normally used to match netcdf offsets to
|
||
|
the corresponding struct type generated in definectype.
|
||
|
However, there is a flag, TESTALIGNMENT,
|
||
|
that can be set to use a computed value for the offset.
|
||
|
And for non-C languages, handling offsets is tricky and is
|
||
|
addressed in more detail below.
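<p>
The two ways of obtaining field offsets can be compared with a small sketch. It rests on assumptions: the struct <code>obs_t</code> is a hypothetical example of what definectype might emit, <code>offsetof</code> is used directly (NC_COMPOUND_OFFSET is, to my knowledge, a thin wrapper around it), and <code>align_up</code> and <code>compute_offsets</code> are hypothetical helpers showing one plausible way to compute offsets by hand, not the code behind TESTALIGNMENT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C struct corresponding to a CDL compound type. */
typedef struct obs_t {
    char   flag;
    double value;
    int    count;
} obs_t;

/* Round offset up to the next multiple of alignment. */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return ((offset + alignment - 1) / alignment) * alignment;
}

/* Lay fields out in declaration order, padding each to its alignment. */
static void compute_offsets(int nfields, const size_t* sizes,
                            const size_t* aligns, size_t* offsets)
{
    size_t next = 0;
    int i;
    for(i = 0; i < nfields; i++) {
        offsets[i] = align_up(next, aligns[i]);
        next = offsets[i] + sizes[i];
    }
}
```

On common C ABIs the computed values agree with <code>offsetof(obs_t, value)</code> and <code>offsetof(obs_t, count)</code>. A non-C target has no compiler-provided struct layout, so it would need a computation of this kind.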
|
||
|
|
||
|
<h3>Data Generation Methods</h3>
|
||
|
There are basically three known approaches for generating
|
||
|
the data constants that are passed to, for example, <i>nc_put_vara</i>.
|
||
|
<ol>
|
||
|
<li> For C (and C++) it is possible to generate C language constants
|
||
|
directly into the code using the C initializer syntax.
|
||
|
This is because CDL was originally defined with C in mind.
|
||
|
This method can also be used for FORTRAN when doing classic model only.
|
||
|
<p>
|
||
|
<li> Generate the binary data
|
||
|
and convert it to a large single string constant using
|
||
|
appropriate escaping mechanisms; this was done in the original
|
||
|
ncgen.
|
||
|
This method has the advantage that it can be used for most
|
||
|
languages, but it has (at least) two disadvantages:
|
||
|
(1) it is not generally portable because the machine architecture
|
||
|
influences the memory encoding; (2) it loses all information
|
||
|
about the structure of the memory and hence makes debugging more
difficult.
|
||
|
<p>
|
||
|
<li>Extend the netCDF interface with a set
|
||
|
of operations to build up the memory structure piece by piece.
|
||
|
This is the approach taken in the Java generation code.
|
||
|
<p>
|
||
|
The idea is that one has a set of procedures in C with a simple
|
||
|
interface that can be invoked by the output language.
|
||
|
These procedures do the following.
|
||
|
<ol>
|
||
|
<li>Create a dynamically extendible memory buffer (much like Bytebuffer).
|
||
|
<li>Append an array of instances
|
||
|
of some primitive type to a specified buffer.
|
||
|
<li>Invoke nc_put_vara with a specified buffer.
|
||
|
<li>Reclaim a buffer
|
||
|
</ol>
|
||
|
Appropriate calls to these procedures can construct any required memory
|
||
|
in a portable fashion.
|
||
|
<p>
|
||
|
This method is appropriate to use with most non-C languages, including interpreted
languages (e.g., Ruby and Perl), and is probably even the best way to
get FORTRAN to handle the full netcdf-4 data model.
|
||
|
</ol>
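<p>
The third approach can be sketched as a small C interface. Everything here is hypothetical: <code>Membuf</code>, <code>membuf_new</code>, <code>membuf_append</code> and <code>membuf_free</code> are names invented for this example and are not part of the netCDF API. The step that passes the buffer to nc_put_vara is omitted because it needs the netCDF library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extendible memory buffer. */
typedef struct Membuf {
    unsigned char* data;
    size_t length;  /* bytes in use */
    size_t alloc;   /* bytes allocated */
} Membuf;

/* Create an empty buffer; returns NULL if allocation fails. */
static Membuf* membuf_new(void)
{
    return (Membuf*)calloc(1, sizeof(Membuf));
}

/* Append count instances of a primitive type of the given size.
   Returns 0 on success, -1 if memory cannot be obtained. */
static int membuf_append(Membuf* mb, const void* values, size_t count, size_t size)
{
    size_t nbytes = count * size;
    assert(mb != NULL);
    if(nbytes == 0) return 0;
    if(mb->length + nbytes > mb->alloc) {
        size_t newalloc = (mb->alloc == 0 ? 64 : mb->alloc);
        unsigned char* newdata;
        while(newalloc < mb->length + nbytes) newalloc *= 2;
        newdata = (unsigned char*)realloc(mb->data, newalloc);
        if(newdata == NULL) return -1;
        mb->data = newdata;
        mb->alloc = newalloc;
    }
    memcpy(mb->data + mb->length, values, nbytes);
    mb->length += nbytes;
    return 0;
}

/* Reclaim the buffer and its contents. */
static void membuf_free(Membuf* mb)
{
    if(mb == NULL) return;
    free(mb->data);
    free(mb);
}
```

The generated program in the output language would call these procedures in sequence, so the memory image is built on the target machine rather than frozen into the generated source.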
|
||
|
|
||
|
<h3>Data Generation: Overview</h3>
|
||
|
The way to think about data generation is to consider
|
||
|
the following tree.
|
||
|
<ul>
|
||
|
<li>The root is a convenience and represents the whole
|
||
|
set of variables specified in the CDL "data:" section.
|
||
|
<li>The nodes in the tree just below the root represent
|
||
|
the set of variables to which values are assigned in the
|
||
|
data section.
|
||
|
<li>The subtrees below each variable are the basetypes
|
||
|
of each variable. Thus if a variable x has a basetype
|
||
|
that is a compound type, then the node below x will
|
||
|
represent the whole compound type and the nodes below
|
||
|
that compound type node will be the fields of the compound
|
||
|
type, and so on.
|
||
|
<li>The leaves of this tree are all of primitive type
|
||
|
(e.g. NC_CHAR, NC_INT, NC_STRING).
|
||
|
</ul>
|
||
|
<p>
|
||
|
The data generation code is divided into two
|
||
|
primary groups. One group handles all non-primitive variables
|
||
|
and types. The other group handles all primitive variables
|
||
|
and types (especially fields). The reason for this is that
|
||
|
almost all languages can handle simple lists of primitive values.
|
||
|
However, for non-primitive types, one of the methods from the previous
|
||
|
section needs to be used.
|
||
|
<p>
|
||
|
Secondarily, the primitive handling code is divided into
|
||
|
two groups. One group handles the character type
|
||
|
and the other group handles all other primitive types.
|
||
|
The code for the first group is in chardata.c and is generally
|
||
|
usable across all languages.
|
||
|
<p>
|
||
|
This split exists for historical reasons.
|
||
|
It turns out that it is tricky to properly handle variables
|
||
|
(or Compound type fields) of type NC_CHAR.
|
||
|
Here the term "proper" means to mimic the output of
|
||
|
the original ncgen program. To this end, a set of generically useful routines
is defined in the chardata.c file. These routines take a datasource
|
||
|
and walk it to build a single string of characters, with appropriate fill,
|
||
|
to correspond to a NC_CHAR typed variable or field.
|
||
|
Unless your language has special
|
||
|
requirements, it is probably best to always use these routines to process
|
||
|
datalists for variables of type NC_CHAR.
|
||
|
|
||
|
<h3>Data Generation: Part I</h3>
|
||
|
Data generation occurs in several places, but is roughly
|
||
|
divided into two parts. First, the genc.c code will set up
|
||
|
appropriate declarations to hold the data. Second, the code
|
||
|
in cdata.c will generate the actual memory contents that must be
|
||
|
passed to nc_put_vara.
|
||
|
<p>
|
||
|
As a rule, the genc.c code calls a limited set of
|
||
|
entry points into cdata.c. Again as a rule,
|
||
|
cdata.c does not call genc.c code except for the closure
|
||
|
mechanism described below.
|
||
|
<p>
|
||
|
The critical pieces of code for part I are the procedures
|
||
|
genc_defineattr() and genc_definevardata() in genc.c.
|
||
|
|
||
|
<h4>genc_definevardata</h4>
|
||
|
This procedure is responsible for generating C constants corresponding
|
||
|
to the data to be assigned to a variable as defined in the "data:" section
|
||
|
of a CDL file. It is also responsible for
|
||
|
generating the appropriate nc_put_vara_XXX code to actually assign
|
||
|
the data to the variable.
|
||
|
|
||
|
<h4>genc_defineattr</h4>
|
||
|
This procedure is responsible for generating C constants corresponding
|
||
|
to the data to be assigned to an attribute
|
||
|
from a CDL file. It is also responsible for
|
||
|
generating the appropriate nc_put_att_XXX code to actually define
|
||
|
the attribute.
|
||
|
<p>
|
||
|
As with variables, defining attributes of type NC_CHAR requires use
|
||
|
of the gen_charXXX procedures.
|
||
|
|
||
|
<h3>Data Generation: Part II</h3>
|
||
|
The procedures in cdata.c walk a datalist
|
||
|
and generate a sequence of space-separated constants,
possibly with nested paired braces ("{...}") as needed.
|
||
|
The result is placed into a specified Bytebuffer.
|
||
|
<p>
|
||
|
As an aside, commas are added when needed to the list of constants
|
||
|
using the <i>commify</i> procedure.
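<p>
The effect of adding commas to a list of space-separated constants can be sketched as below. This is not the commify procedure itself, whose interface is not shown here; <code>commify_sketch</code> is a hypothetical function that works on plain character arrays and assumes no blanks occur inside the constants (so quoted strings containing blanks are not handled).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rewrite space-separated constants as comma-separated ones.
   Returns 0 on success, -1 if out is too small. */
static int commify_sketch(const char* in, char* out, size_t outsize)
{
    size_t used = 0;
    char prev = '\0';   /* last non-blank character emitted */
    int pending = 0;    /* blanks were seen since prev */
    for(; *in; in++) {
        if(*in == ' ') { pending = 1; continue; }
        /* Room for a possible ", " plus this character plus the terminator. */
        if(used + 3 >= outsize) return -1;
        /* Separate two items, but not after '{' or before '}'. */
        if(pending && prev != '\0' && prev != '{' && *in != '}') {
            out[used++] = ',';
            out[used++] = ' ';
        }
        out[used++] = *in;
        prev = *in;
        pending = 0;
    }
    if(used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}
```

For example, the input <code>1 2 {3 4} 5</code> becomes <code>1, 2, {3, 4}, 5</code>, which is valid C initializer syntax.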
|
||
|
<p>
|
||
|
There are three primary procedures that are called from
the genc.c code.
|
||
|
<ul>
|
||
|
<li>genc_attrdata --
|
||
|
store (in its Bytebuffer argument) the sequence of constants
|
||
|
corresponding to a given attribute datalist.
|
||
|
<li>genc_scalardata --
|
||
|
store the single constant (which may be of a user-defined type)
|
||
|
corresponding to its variable's datalist.
|
||
|
<li>genc_arraydata --
|
||
|
store the vector of constants corresponding to its variable's datalist.
|
||
|
This is by far the most complicated of the three procedures.
|
||
|
</ul>
|
||
|
<p>
|
||
|
Internally, each of these three procedures invokes
|
||
|
the <i>genc_data</i> procedure to process part of a datalist.
|
||
|
|
||
|
|
||
|
<h3>Closures and VLEN</h3>
|
||
|
Closures and VLEN handling are two rather specialized mechanisms.
|
||
|
|
||
|
<h4>Closures</h4>
|
||
|
The data generation code uses a concept of closure or callback
|
||
|
to allow the datalist processing to periodically
|
||
|
call external code to do the actual C code generation.
|
||
|
The reason for this is that it significantly improves
|
||
|
performance if the generated datalist is periodically
|
||
|
dumped to the netcdf .nc file using <i>nc_put_vara</i>.
|
||
|
Note that the closure mechanism is only used for generating
|
||
|
variable data; attributes cannot use this mechanism
|
||
|
since they are defined all at once.
|
||
|
<p>
|
||
|
Basically, each call to the callback will generate
|
||
|
C code for some C constants and calls to nc_put_vara().
|
||
|
The closure data structure (struct Putvar) is defined as follows.
|
||
|
<pre>
|
||
|
typedef struct Putvar {
|
||
|
int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
|
||
|
int rank;
|
||
|
Bytebuffer* code;
|
||
|
size_t startset[NC_MAX_VAR_DIMS];
|
||
|
struct CDF {
|
||
|
int grpid;
|
||
|
int varid;
|
||
|
} cdf;
|
||
|
struct C {
|
||
|
Symbol* var;
|
||
|
} c;
|
||
|
} Putvar;
|
||
|
</pre>
|
||
|
An instance of the closure is created for
|
||
|
each variable that is the target of nc_put_vara().
|
||
|
It is initialized with the variable's symbol, rank, group id and variable
|
||
|
id. It is also provided with a Bytebuffer into which it is supposed
|
||
|
to store the generated C code.
|
||
|
The startset is the cached previous set of dimension indices used
|
||
|
for generating the nc_put_vara (see below).
|
||
|
<p>
|
||
|
For generating C code, the callback field ("putvar")
is assigned the procedure cputvara()
(defined in genc.c).
|
||
|
This procedure takes as arguments the closure object,
|
||
|
an <a href="#odometer">odometer</a> describing the current set of dimension indices,
|
||
|
and a Bytebuffer containing the generated C constants
|
||
|
to be assigned to this slice of the variable.
|
||
|
<p>
|
||
|
Every time the closure procedure is called, it generates a C variable
|
||
|
to hold the generated C constant. It also generates
|
||
|
C constants to hold the start and count vectors required
|
||
|
by <i>nc_put_vara</i>. It then generates an <i>nc_put_vara()</i> call.
|
||
|
The start vector argument for the nc_put_vara is defined by the startset
|
||
|
field of the closure. The count vector argument to nc_put_vara
|
||
|
is computed from the current cached
|
||
|
start vector and from the indices in the odometer.
|
||
|
After the nc_put_vara() is generated, the odometer vector
|
||
|
is assigned to the startset field in the closure for use on the next call.
|
||
|
<p>
|
||
|
There are some important assumptions about the state of the odometer
|
||
|
when it is called.
|
||
|
<ol>
|
||
|
<li>The zeroth index controls the count set.
|
||
|
<li>All other indices are assumed to be at their max values.
|
||
|
</ol>
|
||
|
<p>
|
||
|
In particular, this means that the start vector is zero
|
||
|
for all positions except position zero. The count vector
for all positions except zero is the index in the odometer,
which is assumed to be the max.
|
||
|
<p>
|
||
|
For start position zero, the position is taken from the last
|
||
|
saved startset. The count position zero is the difference between
|
||
|
that last start position and the current odometer zeroth index.
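<p>
The start and count rules just described can be written out as a short sketch. It is not the cputvara code: <code>slice_vectors</code> and <code>MAX_DIMS</code> are hypothetical names, plain arrays replace the Putvar and Odometer structures, and "max" is taken here to mean the full extent of the dimension.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMS 8  /* hypothetical stand-in for NC_MAX_VAR_DIMS */

/* Compute the start and count vectors for one slice of a variable.
   startset:  indices saved by the previous call (all zero on the first call).
   odomindex: current odometer indices; positions 1..rank-1 are assumed
              to be at their maximum values. */
static void slice_vectors(int rank, const size_t* startset, const size_t* odomindex,
                          size_t* start, size_t* count)
{
    int i;
    assert(rank > 0 && rank <= MAX_DIMS);
    /* Position zero: resume where the previous slice ended. */
    start[0] = startset[0];
    count[0] = odomindex[0] - startset[0];
    /* All other positions: the whole dimension, starting at zero. */
    for(i = 1; i < rank; i++) {
        start[i] = 0;
        count[i] = odomindex[i];
    }
}
```

With a saved startset of {2,0} and odometer indices {4,3}, the slice starts at {2,0} and has counts {2,3}; the odometer indices then become the next startset.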
|
||
|
|
||
|
<h4>VLEN Constants</h4>
|
||
|
VLEN constants need to be constructed
|
||
|
as separate C data constants because
|
||
|
the C compiler will never convert nested
|
||
|
groups ({...}) to separate memory chunks.
|
||
|
Thus, ncgen must in several places
|
||
|
generate the VLEN constants as separate variables
|
||
|
and then insert pointers to them in the appropriate
|
||
|
places in the later datalist C constants.
|
||
|
Note that this process can be very tricky
|
||
|
for non-C languages (see genj.c and jdata.c for one approach).
|
||
|
<p>
|
||
|
As an optimization, ncgen tracks which datatypes
|
||
|
will require use of vlen constants.
|
||
|
This is any type whose definition is a vlen or whose
|
||
|
basetype contains a vlen type.
|
||
|
<p>
|
||
|
The vlen generation process is two-fold.
|
||
|
First, in the procedure processdatalist1() in semantics.c,
|
||
|
the location of the struct Datalist objects
|
||
|
that correspond to vlen constants is stored in a list called vlenconstants.
|
||
|
When detected, each such Datalist object is tagged with
|
||
|
a unique identifier and the vlen length (count).
|
||
|
These will be used later to generate references to the vlen constant.
|
||
|
These counts are only accurate for non-char typed variables;
special handling is in place to handle character vlen constants.
|
||
|
<p>
|
||
|
The second vlen constant processing action is in the
|
||
|
procedure genc_vlenconstant() in cdata.c. It walks the
|
||
|
vlenconstants list and generates C code for C variables to
|
||
|
define the vlen constant and C code to assign the vlen
|
||
|
constant's data to that C variable.
|
||
|
<p>
|
||
|
When, later, the genc_datalist procedure encounters
|
||
|
a Datalist tagged as representing a vlen constant, it can generate
|
||
|
a nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
|
||
|
and use it directly in the generated C datalist constant.
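<p>
The shape of the generated C for vlen data can be sketched as follows. The variable names <code>vlen_1</code>, <code>vlen_2</code> and <code>v_data</code> are hypothetical, and <code>vlen_sketch_t</code> is a local stand-in that mirrors the two fields of nc_vlen_t (a length followed by a data pointer) so that the example compiles without netcdf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same two fields as nc_vlen_t. */
typedef struct vlen_sketch_t {
    size_t len;
    void*  p;
} vlen_sketch_t;

/* Step 1: each vlen constant becomes its own array variable. */
static const int vlen_1[] = {17, 18, 19};
static const int vlen_2[] = {1, 2};

/* Step 2: the datalist constant refers to them as {count, name} pairs. */
static const vlen_sketch_t v_data[] = {
    {3, (void*)vlen_1},
    {2, (void*)vlen_2}
};

/* Fetch element i of the k-th vlen instance, checking the stored count. */
static int vlen_get(size_t k, size_t i)
{
    assert(i < v_data[k].len);
    return ((const int*)v_data[k].p)[i];
}
```

Writing <code>{{17, 18, 19}, {1, 2}}</code> directly would not work, because the compiler would try to store the inner values in the struct fields instead of in separate memory.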
|
||
|
|
||
|
|
||
|
<h2>Utility Data Structures</h2>
|
||
|
|
||
|
<h3>Pool Memory Allocation</h3>
|
||
|
As an approximation to garbage collection,
|
||
|
this code uses a pool allocation mechanism.
|
||
|
The goal is to allow dynamic construction of strings
|
||
|
that have very short life-times; typically they are used
|
||
|
to construct strings to send to the output file.
|
||
|
<p>
|
||
|
The pool mechanism wraps malloc and records the malloc'd
|
||
|
memory in a circular buffer. When the buffer reaches its maximum
|
||
|
size, previously allocated pool buffers are free'd.
|
||
|
This is good in that the user does not have to litter
|
||
|
code with free() statements. It is bad in that the pool
|
||
|
allocated memory can be free'd too early if the memory
|
||
|
does not have a short enough life.
|
||
|
If you suspect the latter, then bump the size of the circular buffer
|
||
|
and see if the problem goes away. If so, then your code
|
||
|
is probably holding on to a pool buffer too long and should use
|
||
|
regular malloc/free.
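<p>
A pool of this kind can be sketched in a few lines. This is an illustration under assumptions, not the ncgen pool code: <code>pool_alloc</code>, <code>pool_dup</code> and <code>POOL_MAX</code> are hypothetical names, and the pool size is deliberately tiny so that recycling is easy to observe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_MAX 4  /* hypothetical size of the circular buffer */

static void* pool[POOL_MAX];   /* circular record of live allocations */
static int pool_next = 0;      /* slot the next allocation will occupy */

/* Allocate zeroed memory; whatever was allocated POOL_MAX calls ago is freed. */
static void* pool_alloc(size_t size)
{
    void* mem = calloc(1, size == 0 ? 1 : size);
    if(mem == NULL) return NULL;
    free(pool[pool_next]);      /* free(NULL) is a no-op for unused slots */
    pool[pool_next] = mem;
    pool_next = (pool_next + 1) % POOL_MAX;
    return mem;
}

/* Return a pool-allocated copy of a string. */
static char* pool_dup(const char* s)
{
    size_t n;
    char* copy;
    assert(s != NULL);
    n = strlen(s) + 1;
    copy = (char*)pool_alloc(n);
    if(copy != NULL) memcpy(copy, s, n);
    return copy;
}
```

With POOL_MAX set to 4, the fifth allocation frees the first, so any pointer kept across more than three later allocations dangles. That is the "freed too early" failure described above, and enlarging POOL_MAX is the quick diagnostic.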
|
||
|
<p>
|
||
|
In the end, I am not sure if this is a good idea, but
|
||
|
it does make the code simpler.
|
||
|
|
||
|
<h3><a name="List">List</a> and <a name="Bytebuffer">Bytebuffer</a></h3>
|
||
|
The two datatypes List and Bytebuffer are used throughout the
code. They correspond closely in semantics to the Java ArrayList
and StringBuffer types, respectively. They are used to help
|
||
|
encapsulate dynamically growing lists of objects or bytes
|
||
|
to reduce certain kinds of memory allocation errors.
|
||
|
<p>
|
||
|
The canonical code for non-destructive walking of a List&lt;T&gt;
|
||
|
is as follows.
|
||
|
<pre>
|
||
|
for(i=0;i<listlength(list);i++) {
|
||
|
T* element = (T*)listget(list,i);
|
||
|
...
|
||
|
}
|
||
|
</pre>
|
||
|
<p>
|
||
|
Bytebuffer provides two ways to access its internal buffer of characters.
|
||
|
One is "bbContents()", which returns a direct pointer to the buffer,
|
||
|
and the other is "bbDup()", which returns a malloc'd string containing
|
||
|
the contents and is guaranteed to be null terminated.
|
||
|
|
||
|
<h3><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h3>
|
||
|
The odometer data type is used to convert
|
||
|
multiple dimensions into a single integer.
|
||
|
The rule for converting a multi-dimensional
|
||
|
array to a single dimension is as follows.
|
||
|
<p>
|
||
|
Suppose we have the declaration <code>int F[2][5][3];</code>.
|
||
|
There are obviously a total of 2 X 5 X 3 = 30 integers in F.
|
||
|
Thus, these three dimensions will be reduced to a single dimension of size 30.
|
||
|
<p>
|
||
|
A particular point in the three dimensions, say [x][y][z], is reduced to
|
||
|
a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
|
||
|
The corresponding general C code is as follows.
|
||
|
<pre>
|
||
|
size_t
|
||
|
dimmap(int rank, size_t* indices, size_t* sizes)
|
||
|
{
|
||
|
int i;
|
||
|
size_t count = 0;
|
||
|
for(i=0;i<rank;i++) {
|
||
|
if(i > 0) count *= sizes[i];
|
||
|
count += indices[i];
|
||
|
}
|
||
|
return count;
|
||
|
}
|
||
|
</pre>
|
||
|
In this code, the indices variable corresponds to the x,y, and z.
|
||
|
The sizes variable corresponds to the 2,5, and 3.
|
||
|
<p>
|
||
|
The Odometer type stores a set of dimensions
|
||
|
and supports operations to iterate over all possible
|
||
|
dimension combinations.
|
||
|
Odometer is defined by the types Odometer and Dimdata.
|
||
|
<pre>
|
||
|
typedef struct Dimdata {
|
||
|
unsigned long datasize; // actual size of the datalist item
|
||
|
unsigned long index; // 0 <= index < datasize
|
||
|
unsigned long declsize;
|
||
|
} Dimdata;
|
||
|
|
||
|
typedef struct Odometer {
|
||
|
int rank;
|
||
|
Dimdata dims[NC_MAX_VAR_DIMS];
|
||
|
} Odometer;
|
||
|
</pre>
|
||
|
The following primary operations are defined.
|
||
|
<ul>
|
||
|
<li>Odometer* newodometer(Dimset*) - create an odometer from a set of Dimsets.
|
||
|
<li>void freeodometer(Odometer*) - release the memory of an odometer.
|
||
|
<li>int odometermore(Odometer* odom) - return 1 if there are more combinations
|
||
|
of dimension values.
|
||
|
<li>int odometerincr(Odometer* odo,int) - move to the next combination
|
||
|
of dimension values.
|
||
|
<li>unsigned long odometercount(Odometer* odo) -
|
||
|
apply the above algorithm to convert the current odometer combination
|
||
|
into a single integer.
|
||
|
</ul>
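<p>
The iteration pattern these operations support can be sketched with a simplified odometer. This is not the ncgen Odometer: <code>OdomSketch</code> keeps only one index and one size per dimension, and <code>odom_init</code>, <code>odom_more</code>, <code>odom_incr</code>, <code>odom_count</code> and <code>MAX_RANK</code> are hypothetical names. All dimension sizes are assumed to be non-zero.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 8  /* hypothetical stand-in for NC_MAX_VAR_DIMS */

/* Simplified odometer: one index and one size per dimension. */
typedef struct OdomSketch {
    int rank;
    size_t index[MAX_RANK];
    size_t size[MAX_RANK];
} OdomSketch;

/* Start at the all-zero combination. */
static void odom_init(OdomSketch* o, int rank, const size_t* sizes)
{
    int i;
    assert(rank > 0 && rank <= MAX_RANK);
    o->rank = rank;
    for(i = 0; i < rank; i++) {
        o->index[i] = 0;
        o->size[i] = sizes[i];
    }
}

/* Return 1 while the current combination is valid. */
static int odom_more(const OdomSketch* o)
{
    return o->index[0] < o->size[0];
}

/* Advance the rightmost index, carrying leftward like a mileage counter. */
static void odom_incr(OdomSketch* o)
{
    int i;
    for(i = o->rank - 1; i >= 0; i--) {
        o->index[i]++;
        if(o->index[i] < o->size[i]) break;
        if(i == 0) break;   /* leave index[0] == size[0] to signal completion */
        o->index[i] = 0;
    }
}

/* Linear offset of the current combination, using the dimmap rule. */
static size_t odom_count(const OdomSketch* o)
{
    size_t count = 0;
    int i;
    for(i = 0; i < o->rank; i++)
        count = count * o->size[i] + o->index[i];
    return count;
}
```

For sizes 2, 5 and 3, looping with <code>odom_more</code> and <code>odom_incr</code> visits 30 combinations, and <code>odom_count</code> runs through 0..29 in order.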
|
||
|
|
||
|
|
||
|
|
||
|
<h2>Misc. Notes</h2>
|
||
|
<ul>
|
||
|
<li> The flag "usingclassic" should be consulted when appropriate to determine
|
||
|
if this CDL file should be treated as using only the netCDF classic model.
|
||
|
</ul>
|
||
|
|
||
|
<h2><u>Change Log</u></h2>
|
||
|
<ul>
|
||
|
<li>07/04/2009 - First draft.
|
||
|
</ul>
|
||
|
|
||
|
</body>
|
||
|
</html>
|
||
|
|
||
|
<p>
|
||
|
|
||
|
|
||
|
<i>genc_scalardata</i> or <i>genc_arraydata</i>.
|
||
|
It stores in its Bytebuffer argument the sequence of constants
|
||
|
corresponding to a given datalist. Handling commas is a tricky issue
|
||
|
so you will find that many of the non-top-level routines in cdata.c
|
||
|
take a pointer to a global state element, commap, that determines the
|
||
|
current state of adding commas. The idea is that at the beginning of
|
||
|
any (sub-) Datalist, we want to turn off the comma in front of the
|
||
|
first generated constant and then add commas until we reach the end
|
||
|
of that (sub-)Datalist.
|
||
|
|
||
|
<h1><u><a name="GIT">General Internals Information</a></u></h1>
|
||
|
|
||
|
<h2><u>Primary NCGEN Data Structures</u></h2>
|
||
|
There are two primary structures used in ncgen:
|
||
|
<a href="#Symbol">struct Symbol</a> and
<a href="#Datalist">struct Datalist</a>.
|
||
|
|
||
|
<h3><a name="Symbol">struct Symbol</a></h3>
|
||
|
Symbol objects are linked into hierarchical structures
|
||
|
to represent netcdf dimensions, types, groups, and variables.
|
||
|
The struct has the following fields.
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Symbol Fields
|
||
|
<tr valign=top><td>struct Symbol* next<td>-<td>
|
||
|
The Symbol objects are all kept on a single linked list.
|
||
|
No symbol is ever deleted until the end of the program.
|
||
|
<tr valign=top><td>nc_class objectclass<td>-<td>
|
||
|
This defines the general class of symbol, one of: NC_GRP, NC_DIM, NC_VAR, NC_ATT, or NC_TYPE.
|
||
|
<tr valign=top><td>nc_class subclass<td>-<td>
|
||
|
This defines the sub class of symbol, one of:
|
||
|
NC_PRIM, NC_OPAQUE, NC_ENUM,
|
||
|
NC_FIELD, NC_VLEN, NC_COMPOUND,
|
||
|
NC_ECONST, NC_ARRAY, or NC_FILLVALUE.
|
||
|
<tr valign=top><td>char* name<td>-<td>
|
||
|
The symbol's name.
|
||
|
<tr valign=top><td>struct Symbol* container<td>-<td>
|
||
|
The symbol that is the container for this symbol.
|
||
|
Typically, this is the group symbol that contains
|
||
|
this symbol.
|
||
|
<tr valign=top><td>struct Symbol location<td>-<td>
|
||
|
The current group that was open when this symbol was created.
|
||
|
<tr valign=top><td>List* subnodes<td>-<td>
|
||
|
The list of child symbols of this symbol.
|
||
|
For example, a group symbol will have its dimensions,
|
||
|
types, vars, and subgroups in this list.
|
||
|
<tr valign=top><td>int is_prefixed<td>-<td>
|
||
|
True if the name of this symbol contains a complete
|
||
|
prefix path (e.g. /x/y/z).
|
||
|
<tr valign=top><td>List* prefix<td>-<td>
|
||
|
A list of the prefix names for this node.
|
||
|
Note that if is_prefixed is false, then this
|
||
|
list was constructed from the set of enclosing groups.
|
||
|
<tr valign=top><td>struct Datalist* data<td>-<td>
|
||
|
Stores the constants from attribute or datalist
|
||
|
constructs.
|
||
|
<tr valign=top><td>Typeinfo typ<td>-<td>
|
||
|
Type information about this symbol
|
||
|
as defined by the Typeinfo structure.
|
||
|
<tr valign=top><td>Varinfo var<td>-<td>
|
||
|
Variable information about a variable symbol
|
||
|
as defined by the Varinfo structure.
|
||
|
<tr valign=top><td>Attrinfo att<td>-<td>
|
||
|
Attribute information about an attribute symbol
|
||
|
as defined by the Attrinfo structure.
|
||
|
<tr valign=top><td>Diminfo dim<td>-<td>
|
||
|
Dimension information about a dimension symbol
|
||
|
as defined by the Diminfo structure.
|
||
|
<tr valign=top><td>Groupinfo grp<td>-<td>
|
||
|
Group information about a group symbol
|
||
|
as defined by the Groupinfo structure.
|
||
|
<tr valign=top><td>int lineno<td>-<td>
|
||
|
The source line in which this symbol was created.
|
||
|
<tr valign=top><td>int touched<td>-<td>
|
||
|
Used in transitive closure operations
|
||
|
to prevent revisiting symbols.
|
||
|
<tr valign=top><td>char* lname<td>-<td>
|
||
|
Cached C or FORTRAN name (not used?).
|
||
|
<tr valign=top><td>int ncid<td>-<td>
|
||
|
The ncid/varid/dimid, etc when
|
||
|
defining netcdf objects.
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Groupinfo</h4>
|
||
|
Group symbols primarily keep the group
|
||
|
containment structure in the subnodes field of the Symbol.
|
||
|
<p>
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Groupinfo Fields
|
||
|
<tr valign=top><td>int is_root<td>-<td>
|
||
|
Is this the root group?
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Diminfo</h4>
|
||
|
The only important information about a dimension,
|
||
|
aside from name, is the dimension size.
|
||
|
Additionally, type definitions may have anonymous
|
||
|
(unnamed) dimensions.
|
||
|
<p>
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Diminfo Fields
|
||
|
<tr valign=top><td>int isconstant<td>-<td>
|
||
|
Is this an anonymous dimension?
|
||
|
<tr valign=top><td>unsigned int size<td>-<td>
|
||
|
The size of the dimension.
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Varinfo</h4>
|
||
|
Variables require two primary pieces of information:
|
||
|
the set of attributes (including special attributes)
|
||
|
and dimension information. The dimension information
|
||
|
is kept in the Typeinfo structure because things
|
||
|
other than variables have dimensions (e.g. user defined types).
|
||
|
<p>
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Varinfo Fields
|
||
|
<tr valign=top><td>int nattributes<td>-<td>
|
||
|
The number of attributes; this is redundant but useful.
|
||
|
<tr valign=top><td>List* attributes<td>-<td>
|
||
|
The list of all attribute symbols associated with this
|
||
|
variable.
|
||
|
<tr valign=top><td>Specialdata special<td>-<td>
|
||
|
Special attribute values.
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Typeinfo</h4>
|
||
|
The type information is probably the second most
|
||
|
used structure in all of the code (second to Symbol itself).
|
||
|
<p>
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Typeinfo Fields
|
||
|
<tr valign=top><td>struct Symbol* basetype<td>-<td>
|
||
|
Provide a reference to the base type of this symbol.
|
||
|
This applies to other types, variables, and attributes.
|
||
|
<tr valign=top><td>int hasvlen<td>-<td>
|
||
|
Does the type have a vlen definition anywhere within it?
|
||
|
This is used as an optimization to avoid searching datalists
|
||
|
for vlen constants.
|
||
|
<tr valign=top><td>nc_type typecode<td>-<td>
|
||
|
The typecode of the basetype. This is most useful
|
||
|
when the basetype is a primitive type.
|
||
|
<tr valign=top><td>unsigned long size<td>-<td>
|
||
|
The size of this object.
|
||
|
<tr valign=top><td>unsigned long offset<td>-<td>
|
||
|
The field offset for fields in compound types.
|
||
|
<tr valign=top><td>unsigned long alignment<td>-<td>
|
||
|
The memory alignment (i.e. 1,2,4,or 8).
|
||
|
<tr valign=top><td>Constant econst<td>-<td>
|
||
|
For enumeration constants, the actual value of the constant.
|
||
|
<tr valign=top><td>Dimset dimset<td>-<td>
|
||
|
The dimension information for the type or variable.
|
||
|
The dimset stores the number of dimensions and a list
|
||
|
of pointers to the corresponding dimension symbols.
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Attrinfo</h4>
|
||
|
Note that the actual attribute data is stored
|
||
|
in the data field of the containing Symbol.
|
||
|
<p>
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Attrinfo Fields
|
||
|
<tr valign=top><td>struct Symbol* var<td>-<td>
|
||
|
The variable with which this attribute is associated;
|
||
|
it is NULL for global attributes.
|
||
|
<tr valign=top><td>unsigned long count<td>-<td>
|
||
|
The number of instances associated with the attribute value.
|
||
|
</table>
|
||
|
|
||
|
<h3><a name="Datalist">Datalists and Datasrcs</a></h3>
|
||
|
Whenever a datalist is encountered during parsing, it is converted
|
||
|
to an instance of struct Datalist.
|
||
|
Each datalist instance contains a vector of instances of
|
||
|
struct Constant that contains the actual data.
|
||
|
<p>
|
||
|
Each datalist instance contains the following information.
|
||
|
<table>
|
||
|
<tr><th colspan=3>struct Datalist Fields
|
||
|
<tr valign=top><td>struct Datalist* next<td>-<td>
|
||
|
All datalists are chained for reclamation.
|
||
|
<tr valign=top><td>int readonly<td>-<td>
|
||
|
Can this datalist be modified?
|
||
|
<tr valign=top><td>unsigned int length<td>-<td>
|
||
|
The number of Constant instances in the data field.
|
||
|
<tr valign=top><td>unsigned int alloc<td>-<td>
|
||
|
The memory space allocated to the data field.
|
||
|
<tr valign=top><td>Constant* data<td>-<td>
|
||
|
The vector in sequential memory of the constants comprising this datalist.
|
||
|
<tr valign=top><td>struct Symbol* schema<td>-<td>
|
||
|
The symbol (type, variable, or attribute) defining the structure of this datalist,
|
||
|
if known.
|
||
|
<tr valign=top><td>struct Vlen {<td>-<td>
|
||
|
Information about the vlen instances contained in this datalist.
|
||
|
<tr><td>unsigned int count;
|
||
|
<tr><td>unsigned int uid;
|
||
|
<tr><td>} vlen
|
||
|
<tr valign=top><td>Odometer* dimdata<td>-<td>
|
||
|
A tracker to count through dimensions associated with this datalist via the schema.
|
||
|
</table>
|
||
|
<p>
|
||
|
In turn, a Constant instance is defined as follows.
|
||
|
<pre>
|
||
|
typedef struct Constant {
|
||
|
nc_type nctype;
|
||
|
int lineno;
|
||
|
Constvalue value;
|
||
|
} Constant;
|
||
|
</pre>
|
||
|
It indicates the type of the value and the source line number (if known)
|
||
|
in which this constant was created.
|
||
|
<p>
|
||
|
The Constvalue type is a union
|
||
|
of all possible values that can occur
|
||
|
in a datalist.
|
||
|
<pre>
|
||
|
typedef union Constvalue {
|
||
|
struct Datalist* compoundv; // NC_COMPOUND
|
||
|
char charv; // NC_CHAR
|
||
|
signed char int8v; // NC_BYTE
|
||
|
unsigned char uint8v; // NC_UBYTE
|
||
|
short int16v; // NC_SHORT
|
||
|
unsigned short uint16v; // NC_USHORT
|
||
|
int int32v; // NC_INT
|
||
|
unsigned int uint32v; // NC_UINT
|
||
|
long long int64v; // NC_INT64
|
||
|
unsigned long long uint64v; // NC_UINT64
|
||
|
float floatv; // NC_FLOAT
|
||
|
double doublev; // NC_DOUBLE
|
||
|
struct Stringv { // NC_STRING
|
||
|
int len;
|
||
|
char* stringv;
|
||
|
} stringv;
|
||
|
struct Opaquev { // NC_OPAQUE
|
||
|
int len; // length as originally written (rounded to even number)
|
||
|
char* stringv; //as constant was written
|
||
|
// (padded to even # chars >= 16)
|
||
|
// without leading 0x
|
||
|
} opaquev;
|
||
|
struct Symbol* enumv; // NC_ECONST
|
||
|
} Constvalue;
|
||
|
</pre>
|
||
|
<p>
|
||
|
Several fields are of particular interest:
|
||
|
<table>
|
||
|
<tr><th colspan=3>Selected Constvalue Fields
|
||
|
<tr valign=top><td>struct Datalist* compoundv<td>-<td>
|
||
|
This stores nested datalists - typically
|
||
|
of the form "{...{...}...}".
|
||
|
<tr valign=top><td>struct Stringv {int len; char* stringv;} stringv<td>-<td>
|
||
|
Store string constants.
|
||
|
<tr valign=top><td>struct Opaquev {int len; char* stringv;} opaquev<td>-<td>
|
||
|
Store opaque constants as written (i.e. abc...),
|
||
|
without the leading 0x, and
|
||
|
padded to an even number of characters to be
|
||
|
at least 16 characters long.
|
||
|
<tr valign=top><td>struct Symbol* enumv<td>-<td>
|
||
|
Pointer to an enumeration constant definition.
|
||
|
</table>
|
||
|
|
||
|
<h4>struct Datasrc</h4>
|
||
|
When it comes time to generate datalists for output,
|
||
|
it is necessary to "walk" the datalist (including nested
|
||
|
datalists). The Datasrc structure is used to do this.
|
||
|
Its definition is as follows.
|
||
|
<pre>
|
||
|
typedef struct Datasrc {
|
||
|
unsigned int index; // 0..length-1
|
||
|
unsigned int length;
|
||
|
int autopop; // pop when at end
|
||
|
Constant* data; // duplicate pointer; so do not free.
|
||
|
struct Datasrc* stack;
|
||
|
} Datasrc;
|
||
|
</pre>
|
||
|
The Datasrc tracks the "current" location in the sequence
|
||
|
of Constants (taken from a Datalist). The index field indicates
|
||
|
the current location.
|
||
|
In effect, Datasrc is the lexer and the code
|
||
|
that walks it is parsing the data sequence.
|
||
|
The following operations are supported (see data.[ch]).
|
||
|
<ul>
|
||
|
<li>datalist2src - takes a Datalist and constructs a Datasrc.
|
||
|
<li>srcpush - assumes the current constant is a nested Datalist
|
||
|
and pushes into that Datalist.
|
||
|
<li>srcpushlist - pushes into the passed Datalist argument.
|
||
|
<li>srcpop - pops the current list and resumes the next list in the
|
||
|
stack.
|
||
|
<li>srcnext - return the value at the index
|
||
|
and then advance the Datasrc index.
|
||
|
If at the end of the current datalist, then return NULL;
|
||
|
srcincr is an alias for srcnext.
|
||
|
<li>srcmore - return 1 if not at the end of the current Datasrc.
|
||
|
Pushed datalists are not considered.
|
||
|
<li>srcline - return a usable line number associated with the current
|
||
|
position of the Datasrc (that is why Constant instances have a line
|
||
|
number).
|
||
|
<li>srcpeek - return the value at the index but do not advance.
|
||
|
If at the end of the current datalist, then return NULL; srcget is an alias
|
||
|
for srcpeek.
|
||
|
</ul>
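<p>
The way these operations combine to walk a nested datalist can be sketched as
follows. The types below are simplified stand-ins (a Constant holds only an
int or a nested list), and the sk_ functions are hypothetical analogues of
datalist2src, srcmore, srcpeek, srcnext, srcpush, and srcpop; the real
versions in data.[ch] may differ in detail, for example in exactly when
the index is advanced past a nested list.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the ncgen types (assumptions for illustration). */
typedef struct SkDatalist SkDatalist;
typedef struct SkConstant {
    int         iscompound; /* 1 if this constant is a nested datalist */
    int         intv;       /* the value when not compound */
    SkDatalist* compoundv;  /* the nested list when compound */
} SkConstant;
struct SkDatalist { unsigned int length; SkConstant* data; };

typedef struct SkDatasrc {
    unsigned int index;       /* 0..length-1 */
    unsigned int length;
    SkConstant* data;         /* borrowed pointer; never freed here */
    struct SkDatasrc* stack;  /* saved position in the enclosing list */
} SkDatasrc;

static SkDatasrc* sk_list2src(SkDatalist* list)
{
    SkDatasrc* src = calloc(1, sizeof(SkDatasrc));
    assert(src != NULL);
    src->length = list->length;
    src->data = list->data;
    return src;
}

static int sk_more(SkDatasrc* s) { return s->index < s->length; }

static SkConstant* sk_peek(SkDatasrc* s)
{ return sk_more(s) ? &s->data[s->index] : NULL; }

static SkConstant* sk_next(SkDatasrc* s)
{ SkConstant* c = sk_peek(s); if(c != NULL) s->index++; return c; }

/* Save the current position and descend into the nested list at the index. */
static void sk_push(SkDatasrc* s)
{
    SkConstant* c = sk_peek(s);
    SkDatasrc* saved = malloc(sizeof(SkDatasrc));
    assert(c != NULL && c->iscompound && saved != NULL);
    *saved = *s;
    s->index = 0;
    s->length = c->compoundv->length;
    s->data = c->compoundv->data;
    s->stack = saved;
}

/* Return to the enclosing list and step past the nested list just walked. */
static void sk_pop(SkDatasrc* s)
{
    SkDatasrc* saved = s->stack;
    assert(saved != NULL);
    *s = *saved;
    free(saved);
    s->index++;
}

/* Example walker: sum every integer, recursing into nested lists. */
static int sk_sum(SkDatasrc* s)
{
    int total = 0;
    while(sk_more(s)) {
        SkConstant* c = sk_peek(s);
        if(c->iscompound) { sk_push(s); total += sk_sum(s); sk_pop(s); }
        else total += sk_next(s)->intv;
    }
    return total;
}
```

The walker never looks at the Datalist directly; it only asks the Datasrc for
the current constant, which is what makes the Datasrc behave like a lexer.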
|
||
|
|
||
|
<h2><u>The CDL Parser</u></h2>
|
||
|
|
||
|
The CDL parser and associated lexer
|
||
|
(primarily files "ncgen.y" and "ncgen.l")
|
||
|
parse CDL files into various data structures
|
||
|
for use by the remaining ncgen code.
|
||
|
The data structures described above,
|
||
|
(<a href="#Symbol">Symbol</a>, and
|
||
|
<a href="#Datalist">Datalist</a>)
|
||
|
are primarily generated by the parser.
|
||
|
|
||
|
<h3>Parse Cliches</h3>
|
||
|
<h4>Node Stacking</h4>
|
||
|
One of the issues that must be addressed by any bottom-up
|
||
|
parser is handling the accumulation of sets of items (nodes,
|
||
|
etc.). The YACC/Bison parse stack cannot be used
|
||
|
because the set of accumulated nodes is unbounded
|
||
|
and the YACC stack mechanism is bounded (i.e. each rule
|
||
|
has a bounded right hand side length).
|
||
|
<p>
|
||
|
The node stacking set of cliches is ubiquitous in the
|
||
|
parser, so they must be understood to understand how the
|
||
|
parser works. The cliche here is shown in the handling of,
|
||
|
for example, the varlist rule, which is defined as follows.
|
||
|
<pre>
|
||
|
varlist: varspec
|
||
|
{$$=listlength(stack); listpush(stack,(elem_t)$1);}
|
||
|
| varlist ',' varspec
|
||
|
{$$=$1; listpush(stack,(elem_t)$3);}
|
||
|
;
|
||
|
</pre>
|
||
|
The varlist rule collects variable name declarations (via the varspec rule).
|
||
|
The idea is to use a separate stack named "stack", and to track
|
||
|
the index into the stack of the start of collection of objects.
|
||
|
The varlist value (in the YACC sense) is defined as an integer
|
||
|
representing the size of the stack at the start of a list of variables.
|
||
|
That is what this code does: <code>$$=listlength(stack)</code>.
|
||
|
<p>
|
||
|
At the point where the set of varspecs should be processed, the following code cliche
|
||
|
is used.
|
||
|
<pre>
|
||
|
vardecl: typeref varlist
|
||
|
{...
|
||
|
stackbase=$2;
|
||
|
stacklen=listlength(stack);
|
||
|
for(i=stackbase;i&lt;stacklen;i++) {
|
||
|
Symbol* sym = (Symbol*)listget(stack,i);
|
||
|
...
|
||
|
}
|
||
|
listsetlength(stack,stackbase);// remove stack nodes
|
||
|
}
|
||
|
...
|
||
|
</pre>
|
||
|
The start of the set of variable declaration symbols is extracted
|
||
|
as the integer associated with right-side non-terminal $2, e.g.
|
||
|
<code>stackbase=$2</code>.
|
||
|
The current stack length is obtained from <code>stacklen=listlength(stack)</code>.
|
||
|
Then the elements of the stack are extracted one by one using the above loop.
|
||
|
Finally, the nodes on the stack are cleared by the code segment
|
||
|
<code>listsetlength(stack,stackbase)</code>.
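<p>
Outside of YACC, the same mark-and-truncate cliche can be sketched in plain C.
The SkStack type and the sk_ functions here are hypothetical stand-ins for
the List type and the listlength/listpush/listsetlength calls; sk_collect plays
the role of the varlist rule and sk_consume plays the role of the vardecl action.

```c
#include <assert.h>
#include <stddef.h>

#define SK_STACKMAX 64
typedef struct SkStack {
    size_t length;
    const char* elems[SK_STACKMAX];
} SkStack;

static size_t sk_length(const SkStack* s) { return s->length; }

static void sk_pushelem(SkStack* s, const char* e)
{ assert(s->length < SK_STACKMAX); s->elems[s->length++] = e; }

static void sk_setlength(SkStack* s, size_t n)
{ assert(n <= s->length); s->length = n; }

/* Like the varlist rule: remember where this collection starts,
   push the items, and return the start index as the rule's value. */
static size_t sk_collect(SkStack* s, const char** names, size_t n)
{
    size_t base = sk_length(s); /* corresponds to $$=listlength(stack) */
    size_t i;
    for(i = 0; i < n; i++) sk_pushelem(s, names[i]);
    return base;
}

/* Like the vardecl action: process everything from base to the top,
   then remove those nodes. Returns the number of items processed. */
static size_t sk_consume(SkStack* s, size_t base)
{
    size_t count = 0;
    size_t i;
    for(i = base; i < sk_length(s); i++) {
        if(s->elems[i] != NULL) count++; /* "process" the item */
    }
    sk_setlength(s, base); /* corresponds to listsetlength(stack,stackbase) */
    return count;
}
```

Because each collection records its own base, collections can nest: an inner
list can be collected and consumed while an outer one is still on the stack.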
|
||
|
|
||
|
<h4><u>Semantic Processing</u></h4>
|
||
|
Semantic processing takes the output of the parser
|
||
|
and adds various pieces of semantic information.
|
||
|
The semantic actions are as follows.
|
||
|
<ol>
|
||
|
<li> Procedure processtypes().
|
||
|
<ol>
|
||
|
<li>Do a topological sort of the types based on dependency
|
||
|
so that the least dependent are first in the typdefs list.
|
||
|
<li>Fill in type typecodes.
|
||
|
<li>Mark types that have a vlen.
|
||
|
</ol>
|
||
|
<li> Procedure filltypecodes() - Fill in implied type codes.
|
||
|
<li> Procedure processvars() - Fill in missing values.
|
||
|
<li> Procedure processattributes() -
|
||
|
Process attributes to connect to corresponding variable.
|
||
|
<li> Procedure processcompound() -
|
||
|
Process each compound type to compute its size.
|
||
|
<li> Procedure processenums() -
|
||
|
Fix up enum constant values.
|
||
|
<li> Procedure processdatalists() -
|
||
|
Fix up datalists.
|
||
|
<li> Procedure checkconsistency() -
|
||
|
Check internal consistency.
|
||
|
<li> Procedure validate() -
|
||
|
Do any needed additional semantic checks.
|
||
|
</ol>
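<p>
The topological sort in the first step can be sketched as a depth-first
traversal that uses a touched flag to avoid revisiting a type. The SkType
structure and function names are hypothetical, the sketch assumes the
dependency graph has no cycles, and the actual algorithm in processtypes() may
be organized differently.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAXDEPS 4
typedef struct SkType {
    const char* name;
    int ndeps;
    struct SkType* deps[SK_MAXDEPS]; /* types this type refers to */
    int touched;                     /* set once the type has been visited */
} SkType;

/* Visit a type: first place everything it depends on, then the type itself. */
static void
sk_visit(SkType* t, SkType** sorted, int* nsorted)
{
    int i;
    if(t->touched) return;
    t->touched = 1;
    for(i = 0; i < t->ndeps; i++)
        sk_visit(t->deps[i], sorted, nsorted);
    sorted[(*nsorted)++] = t;
}

/* Fill sorted[] so that every type appears after the types it depends on.
   Returns the number of entries written. */
static int
sk_toposort(SkType** types, int n, SkType** sorted)
{
    int i;
    int nsorted = 0;
    assert(types != NULL && sorted != NULL);
    for(i = 0; i < n; i++) types[i]->touched = 0;
    for(i = 0; i < n; i++) sk_visit(types[i], sorted, &nsorted);
    return nsorted;
}
```

For example, if a compound type refers to a vlen type and an enum type, and the
vlen type refers to the enum type, the sorted order is enum, vlen, compound.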
|
||
|
|
||
|
<h2><u>Generating C Code</u></h2>
|
||
|
The source code for generating C code output (via the -c option)
|
||
|
is of most interest because it is the pattern to be used
|
||
|
for other languages and because, frankly, it is complex and ugly
|
||
|
at the moment and so guidance is needed in understanding it.
|
||
|
<p>
|
||
|
The files genc.[ch] and cdata.c are the primary files for C code generation.
|
||
|
The files data.[ch] are also important.
|
||
|
|
||
|
<h3><u>Output Routines</u></h3>
|
||
|
The output routines are a bit of a mixed bag.
|
||
|
It is important to know that code is not directly
|
||
|
dumped to the output file; rather it is accumulated
|
||
|
in a global Bytebuffer instance called "ccode".
|
||
|
<p>
|
||
|
The output routines are as follows.
|
||
|
<ul>
|
||
|
<li>flushcode(void) - flush the ccode buffer to the output file.
|
||
|
<li>cprint(Bytebuffer* buf) - dump the contents
|
||
|
of buf to the ccode buffer.
|
||
|
<li>cpartial(char* line) - dump the contents of line
|
||
|
to the ccode buffer, but do not add a trailing newline.
|
||
|
<li>cline(char* line) - dump the contents of line
|
||
|
to the ccode buffer and add a trailing newline.
|
||
|
<li>clined(int n, char* line) - dump the contents of line to the ccode
|
||
|
buffer; prefix with n indentations (typically 4 blanks each)
|
||
|
and suffix with a trailing newline.
|
||
|
</ul>
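<p>
The buffering behavior of these routines can be sketched as follows. A fixed
character array stands in for the global Bytebuffer "ccode", and the sk_
functions are hypothetical analogues of cpartial, cline, clined, and flushcode;
the indentation width of four blanks is taken from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_CODEMAX 4096
static char sk_code[SK_CODEMAX]; /* stand-in for the global "ccode" buffer */

/* Append text without a trailing newline. */
static void sk_partial(const char* line)
{
    size_t used = strlen(sk_code);
    assert(used + strlen(line) < SK_CODEMAX);
    strcpy(sk_code + used, line);
}

/* Append text followed by a newline. */
static void sk_line(const char* line)
{
    sk_partial(line);
    sk_partial("\n");
}

/* Append text prefixed by n indentations and followed by a newline. */
static void sk_lined(int n, const char* line)
{
    while(n-- > 0) sk_partial("    ");
    sk_line(line);
}

/* Write the accumulated text to the output and empty the buffer. */
static void sk_flush(FILE* out)
{
    fputs(sk_code, out);
    sk_code[0] = '\0';
}
```

Nothing reaches the output file until the flush call, so a generator can build
a statement from several partial pieces before committing it.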
|
||
|
|
||
|
<h3><u>gen_ncc</u></h3>
|
||
|
The gen_ncc procedure is responsible for
|
||
|
creating and dumping the generated C code.
|
||
|
<p>
|
||
|
It has at its disposal several global lists of Symbols.
|
||
|
Note that the lists cross all groups.
|
||
|
<ul>
|
||
|
<li>dimdefs - the set of symbols defining dimensions.
|
||
|
<li>vardefs - the set of symbols defining variables.
|
||
|
<li>attdefs - the set of symbols defining non-global attributes.
|
||
|
<li>gattdefs - the set of symbols defining global attributes.
|
||
|
<li>grpdefs - the set of symbols defining groups.
|
||
|
<li>typdefs - the set of symbols defining types; note that this list
|
||
|
has been topologically sorted so that a given type depends only
|
||
|
on types with lower indices in the list.
|
||
|
</ul>
|
||
|
<p>
|
||
|
The superficial operation of gen_ncc is as follows; the details
|
||
|
are provided later where the operation is complex.
|
||
|
<ol>
|
||
|
<li>Generate header code (e.g. "#include &lt;stdio.h&gt;").
|
||
|
<li>Generate C type definitions corresponding to the
|
||
|
CDL types.
|
||
|
<li>Generate VLEN constants.
|
||
|
<li>Generate chunking constants.
|
||
|
<li>Generate initial part of the main() procedure.
|
||
|
<li>Generate C variable definitions to hold the ncids
|
||
|
for all created groups.
|
||
|
<li>Generate C variable definitions to hold the typeids
|
||
|
of all created types.
|
||
|
<li>Generate C variables and constants that correspond to
|
||
|
the CDL dimensions.
|
||
|
<li>Generate C variable definitions to hold the varids
|
||
|
of all created variables.
|
||
|
<li>Generate C code to create the netCDF binary file.
|
||
|
<li>Generate C code to create all the groups in the proper
|
||
|
hierarchy.
|
||
|
<li>Generate C code to create the type definitions.
|
||
|
<li>Generate C code to create the dimension definitions.
|
||
|
<li>Generate C code to create the variable definitions.
|
||
|
<li>Generate C code to create the global attributes.
|
||
|
<li>Generate C code to create the non-global attributes.
|
||
|
<li>Generate C code to leave define mode.
|
||
|
<li>Generate C code to assign variable datalists.
|
||
|
</ol>
|
||
|
<p>
|
||
|
The following code generates C code for defining the groups.
|
||
|
It is fairly canonical and can be seen repeated in variant form
|
||
|
when defining dimensions, types, variables, and attributes.
|
||
|
<p>
|
||
|
This code is redundant, but for consistency the root group
|
||
|
ncid is stored like all other group ncids.
|
||
|
Note that nprintf is a macro wrapper around snprintf.
|
||
|
<pre>
|
||
|
nprintf(stmt,sizeof(stmt)," %s = ncid;",groupncid(rootgroup));
|
||
|
cline(stmt);
|
||
|
</pre>
|
||
|
<p>
|
||
|
The loop walks all group symbols in preorder form
|
||
|
and generates a C call to nc_def_grp
|
||
|
using parameters taken from the group Symbol instance (gsym).
|
||
|
The call to nc_def_grp is followed by a call to the
|
||
|
check_err procedure to verify the operation's result code.
|
||
|
<pre>
|
||
|
for(igrp=0;igrp&lt;listlength(grpdefs);igrp++) {
|
||
|
Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
|
||
|
if(gsym == rootgroup) continue; // ignore root
|
||
|
if(gsym->container == NULL) PANIC("null container");
|
||
|
nprintf(stmt,sizeof(stmt),
|
||
|
" stat = nc_def_grp(%s, \"%s\", &%s);",
|
||
|
groupncid(gsym->container),
|
||
|
gsym->name, groupncid(gsym));
|
||
|
cline(stmt); // print the def_grp call
|
||
|
clined(1,"check_err(stat,__LINE__,__FILE__);");
|
||
|
}
|
||
|
flushcode();
|
||
|
</pre>
|
||
|
<p>
|
||
|
The code to generate dimensions, types, attributes, and variables
|
||
|
is similar, although often more complex.
|
||
|
<p>
|
||
|
The code to generate C equivalents of CDL types is
|
||
|
in the procedure definectype().
|
||
|
Note that this code is not the code that invokes e.g. nc_def_vlen.
|
||
|
The generated C types are used when generating datalists
|
||
|
so that the standard C constant assignment mechanism will produce
|
||
|
the correct memory values.
|
||
|
<p>
|
||
|
The genc_deftype procedure is the one that actually
|
||
|
generates C code to define the netcdf types.
|
||
|
The generated C code is designed to store the resulting
|
||
|
typeid into the C variable defined earlier
|
||
|
for holding that typeid.
|
||
|
<p>
|
||
|
Note that for compound types, the NC_COMPOUND_OFFSET
|
||
|
macro is normally used to match netcdf offsets to
|
||
|
the corresponding struct type generated in definectype.
|
||
|
However, there is a flag, TESTALIGNMENT,
|
||
|
that can be set to use a computed value for the offset.
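<p>
The two ways of obtaining field offsets can be sketched as follows. The obs_t
struct is a hypothetical example of what definectype might emit for a compound
type; the standard offsetof macro is used as a stand-in for asking the C
compiler for the offset, and sk_compute_offsets shows one plausible way to
compute offsets from per-field sizes and alignments. It is not the
TESTALIGNMENT code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct generated for a CDL compound type. */
typedef struct obs_t {
    char   flag;
    double value;
    short  count;
} obs_t;

/* Offsets as the C compiler lays them out. */
static void
sk_compiler_offsets(size_t offsets[3])
{
    offsets[0] = offsetof(obs_t, flag);
    offsets[1] = offsetof(obs_t, value);
    offsets[2] = offsetof(obs_t, count);
}

/* Round offset up to the next multiple of alignment. */
static size_t
sk_align_up(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return ((offset + alignment - 1) / alignment) * alignment;
}

/* Offsets computed from field sizes and alignments: each field starts at
   the first properly aligned position after the previous field ends. */
static void
sk_compute_offsets(size_t n, const size_t* sizes, const size_t* aligns,
                   size_t* offsets)
{
    size_t i;
    size_t next = 0;
    for(i = 0; i < n; i++) {
        offsets[i] = sk_align_up(next, aligns[i]);
        next = offsets[i] + sizes[i];
    }
}
```

With field sizes 1, 8, 2 and alignments 1, 8, 2, the computed offsets are
0, 8, and 16. Whether that matches the compiler's layout depends on the
platform's alignment rules, which is presumably why the two approaches are
worth comparing.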
|
||
|
|
||
|
<h3><u>C Constant Datalist Generation</u></h3>
|
||
|
All attributes, and some variables, require the
|
||
|
construction of a memory object containing data
|
||
|
to be assigned to that attribute or variable.
|
||
|
The code to do this is by far the most complicated
|
||
|
in ncgen.
|
||
|
The file cdata.c contains the procedure genc_datalist(),
|
||
|
which does most of the heavy lifting.
|
||
|
<p>
|
||
|
For attributes, the general form generated is
|
||
|
<pre>
|
||
|
T* attributevar = {...};
|
||
|
</pre>
|
||
|
Except for VLENs, the datalist is completely
|
||
|
contained in the brackets, with bracket nesting as required.
|
||
|
A generated pointer to the attributevar is included
|
||
|
in the generated call to nc_put_att().
|
||
|
<p>
|
||
|
For variables, the general form generated is similar to attributes.
|
||
|
<pre>
|
||
|
T* varvar = {...};
|
||
|
</pre>
|
||
|
Again, VLENs are handled specially.
|
||
|
Also, for performance purposes, the datalist
|
||
|
is loaded in pieces using nc_put_vara(). This is required if
|
||
|
there are UNLIMITED dimensions, but is used for all cases
|
||
|
for uniformity.
|
||
|
|
||
|
<h4>Datalist Closures</h4>
|
||
|
The code uses a concept of closure or callback
|
||
|
to allow the datalist processing to periodically
|
||
|
call external code to do the actual C code generation.
|
||
|
Basically, each call to the callback will generate
|
||
|
C code for constants and calls to nc_put_vara().
|
||
|
The closure data structure (struct Putvar) is defined as follows.
|
||
|
<pre>
|
||
|
typedef struct Putvar {
|
||
|
int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
|
||
|
int rank;
|
||
|
Bytebuffer* code;
|
||
|
size_t startset[NC_MAX_VAR_DIMS];
|
||
|
struct CDF {
|
||
|
int grpid;
|
||
|
int varid;
|
||
|
} cdf;
|
||
|
struct C {
|
||
|
Symbol* var;
|
||
|
} c;
|
||
|
} Putvar;
|
||
|
</pre>
|
||
|
An instance of the closure is created for
|
||
|
each variable that is the target of nc_put_vara().
|
||
|
It is initialized with the variable's symbol, rank, group id and variable
|
||
|
id. It is also provided with a Bytebuffer into which it is supposed
|
||
|
to store the generated C code.
|
||
|
The startset is the cached previous set of dimension indices used
|
||
|
for generating the nc_put_vara (see below).
|
||
|
<p>
|
||
|
The callback procedure (field "putvar")
|
||
|
for generating C code is set to the procedure cputvara()
|
||
|
(defined in genc.c).
|
||
|
This procedure takes as arguments the closure object,
|
||
|
an odometer describing the current set of dimension indices,
|
||
|
and a Bytebuffer containing the generated C constants
|
||
|
to be assigned to this slice of the variable.
|
||
|
<p>
|
||
|
Every time the closure procedure is called, it generates a C variable
|
||
|
to hold the generated C constant. It then generates an nc_put_vara()
|
||
|
call. The start vector argument for the nc_put_vara is defined by the startset
|
||
|
field of the closure. The count vector argument to nc_put_vara
|
||
|
is computed from the current cached
|
||
|
start vector and from the indices in the odometer.
|
||
|
After the nc_put_vara() is generated, the odometer vector
|
||
|
is assigned to the startset field in the closure for use on the next call.
|
||
|
<p>
|
||
|
There are some important assumptions about the state of the odometer
|
||
|
when it is called.
|
||
|
<ol>
|
||
|
<li>The zeroth index controls the count set.
|
||
|
<li>All other indices are assumed to be at their max values.
|
||
|
</ol>
|
||
|
<p>
|
||
|
In particular, this means that the start vector is zero
|
||
|
for all positions except position zero. For every position
except zero, the count vector entry is the index in the odometer,
which is assumed to be the max.
|
||
|
<p>
|
||
|
For start position zero, the position is taken from the last
|
||
|
saved startset. The count position zero is the difference between
|
||
|
that last start position and the current odometer zeroth index.
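<p>
Under the assumptions listed above, the start and count computation can be
sketched as follows. The function sk_slice_vectors is hypothetical and works
on plain arrays rather than the Putvar and Odometer structures; indices holds
the odometer indices at the time of the call, and startset is the cached
vector from the previous call.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAXRANK 8

/* Compute the start and count vectors for one nc_put_vara-style call,
   then cache the odometer indices for the next call. */
static void
sk_slice_vectors(int rank, const size_t* indices, size_t* startset,
                 size_t* start, size_t* count)
{
    int i;
    assert(rank > 0 && rank <= SK_MAXRANK);
    assert(indices[0] >= startset[0]);

    /* Position zero: resume where the previous slice ended. */
    start[0] = startset[0];
    count[0] = indices[0] - startset[0];

    /* All other positions: whole extent, starting at zero. */
    for(i = 1; i < rank; i++) {
        start[i] = 0;
        count[i] = indices[i]; /* assumed to be at its max value */
    }

    /* The odometer vector becomes the cached startset. */
    for(i = 0; i < rank; i++) startset[i] = indices[i];
}
```

For a rank 2 variable whose second dimension has size 3, a first call with
odometer indices (2,3) gives start (0,0) and count (2,3); a second call with
indices (5,3) gives start (2,0) and count (3,3).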
|
||
|
<p>
|
||
|
If all of this sounds complex, it is, and if/when I have time
|
||
|
I will rethink the whole process of datalist generation
|
||
|
from beginning to end.
|
||
|
|
||
|
<h4>VLEN Constants</h4>
|
||
|
VLEN constants need to be constructed
|
||
|
as separate C data constants because
|
||
|
the C compiler will never convert nested
|
||
|
groups ({...}) to separate memory chunks.
|
||
|
Thus, ncgen must in several places
|
||
|
generate the VLEN constants as separate variables
|
||
|
and then insert pointers to them in the appropriate
|
||
|
places in the later datalist C constants.
|
||
|
<p>
|
||
|
As an optimization, ncgen tracks which datatypes
|
||
|
will require use of vlen constants.
|
||
|
This is any type whose definition is a vlen or whose
|
||
|
basetype contains a vlen type.
|
||
|
<p>
|
||
|
The vlen generation process is two-fold.
|
||
|
First, in the procedure processdatalist1() in semantics.c,
|
||
|
the location of the struct Datalist objects
|
||
|
that correspond to vlen constants is stored in a list called vlenconstants.
|
||
|
When detected, each such struct Datalist object is tagged with
|
||
|
a unique identifier and the vlen length (count).
|
||
|
These will be used later to generate references to the vlen constant.
|
||
|
<p>
|
||
|
The second vlen constant processing action is in the
|
||
|
procedure genc_vlenconstant() in cdata.c. It walks the
|
||
|
vlenconstants list and generates C code for variables to
|
||
|
define the vlen constant and C code to assign the vlen
|
||
|
constant's data to that variable.
|
||
|
<p>
|
||
|
When, later, the genc_datalist procedure encounters
|
||
|
a Datalist tagged as representing a vlen constant, it can generate
|
||
|
an nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
|
||
|
and use it directly in the generated C datalist constant.
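<p>
The shape of the resulting C code can be sketched as follows. The variable
names (vlen_1, vlen_2, v_data) are hypothetical, and sk_vlen_t is a local
stand-in with the same length-plus-pointer shape as nc_vlen_t so that the
sketch does not depend on netcdf.h. It assumes a variable holding two vlen
instances with the values {17, 18, 19} and {20, 21}.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for nc_vlen_t: element count plus pointer to the data. */
typedef struct sk_vlen_t {
    size_t len;
    void*  p;
} sk_vlen_t;

/* Step 1: each vlen constant becomes its own C variable,
   named using the unique identifier assigned to its Datalist. */
static int vlen_1[] = {17, 18, 19};
static int vlen_2[] = {20, 21};

/* Step 2: the datalist constant refers to those variables
   using the {count, name} form. */
static sk_vlen_t v_data[] = { {3, vlen_1}, {2, vlen_2} };

/* Helper used only to show that the data is reachable through the pointer. */
static int
sk_vlen_sum(const sk_vlen_t* v)
{
    const int* values = (const int*)v->p;
    int total = 0;
    size_t i;
    assert(v != NULL);
    for(i = 0; i < v->len; i++) total += values[i];
    return total;
}
```

Writing the nested braces inline would not work, because the C compiler would
try to treat the inner list as the initializer of the len and p fields rather
than as a separately allocated array.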
|
||
|
|
||
|
<h4>Walking the Datalist</h4>
|
||
|
To actually generate the C code for a datalist constant,
|
||
|
the procedure genc_datalist wraps the Datalist in a Datasrc,
|
||
|
and proceeds to walk it constant by constant, generating
|
||
|
the corresponding C constant. The bulk of the work
|
||
|
is performed in the recursive procedure genc_datalist1().
|
||
|
<p>
|
||
|
For better or worse, the code
|
||
|
acts like a 1-lookahead parser. This means that it decides
|
||
|
what to do based on the current type, the current constant and, when necessary,
|
||
|
the next constant in the Datasrc. In practice, the lookahead
|
||
|
is hidden, so it is not represented in the following table.
|
||
|
<p>
|
||
|
<table border=1>
|
||
|
<tr><th>Current Type<th>Current Constant<th>action
|
||
|
<tr valign=top><td>NC_PRIM<td>Primitive Constant<td>Generate the C constant; convert as necessary.
|
||
|
<tr valign=top><td>NC_OPAQUE<td>''<td>''
|
||
|
<tr valign=top><td>NC_ENUM<td>''<td>''
|
||
|
|
||
|
<tr valign=top><td>NC_COMPOUND<td>Nested Datalist Constant<td>Push into the datalist and recurse on each field; when done, pop back to the previous datalist.
|
||
|
<tr valign=top><td>NC_COMPOUND<td>Any other Constant<td>
|
||
|
Continue to recurse on each field; this allows
|
||
|
specification of fields without enclosing in {...}.
|
||
|
<tr valign=top><td>NC_VLEN<td>Nested Datalist Constant<td>Generate the
|
||
|
nc_vlen_t instance using the tagged information in the struct Datalist.
|
||
|
<tr valign=top><td>NC_FIELD<td>NA<td>If this field is dimensioned,
|
||
|
then call genc_fielddata to walk the dimensions. Otherwise, just
|
||
|
recurse on genc_datalist1.
|
||
|
</table>
|
||
|
<p>
|
||
|
The genc_fielddata() procedure iterates over a field dimension
|
||
|
and calls itself recursively to walk the remaining dimensions.
|
||
|
If this is the last dimension, then it calls genc_datalist1 to
|
||
|
generate C code for the basetype of the field.
|
||
|
|
||
|
<h4>String/Char Handling</h4>
|
||
|
All through the genc_datalist code,
|
||
|
there are special cases for handling string constants.
|
||
|
The reason is, of course, that the string constant "abcd.."
|
||
|
may, depending on the type context, be either a string
|
||
|
or an array of characters.
|
||
|
|
||
|
<h4>Generating Variable Data</h4>
|
||
|
The genc_datalist code does not call closures.
|
||
|
The closures are used in the genc_vardata() and genc_vardata1()
|
||
|
procedures; genc_vardata1 being the recursive procedure that actually
|
||
|
calls the closure.
|
||
|
<p>
|
||
|
The genc_vardata1() procedure, like genc_fielddata,
|
||
|
iterates over a top-level dimension and calls itself recursively
|
||
|
to iterate over the remaining dimensions.
|
||
|
The term "top-level" refers to the fact that these are the dimensions
|
||
|
specified for a variable as opposed to field dimensions.
|
||
|
<p>
|
||
|
When iterating an UNLIMITED dimension, or when iterating the first
|
||
|
dimension, the code generates a datalist for this subslice
|
||
|
and then calls the closure to generate the C code.
|
||
|
|
||
|
|
||
|
<h2><u>Miscellaneous</u></h2>
|
||
|
<h4>Pool Memory Allocation</h4>
|
||
|
As an approximation to garbage collection,
|
||
|
this code uses a pool allocation mechanism.
|
||
|
The goal is to allow dynamic construction of strings
|
||
|
that have very short life-times; typically they are used
|
||
|
to construct strings to send to the output file.
|
||
|
<p>
|
||
|
The pool mechanism wraps malloc and records the malloc'd
|
||
|
memory in a circular buffer. When the buffer reaches its maximum
|
||
|
size, previously allocated pool buffers are free'd.
|
||
|
This is good in that the user does not have to litter
|
||
|
code with free() statements. It is bad in that the pool
|
||
|
allocated memory can be free'd too early if the memory
|
||
|
does not have a short enough life.
|
||
|
If you suspect the latter, then bump the size of the circular buffer
|
||
|
and see if the problem goes away. If so, then your code
|
||
|
is probably holding on to a pool buffer too long and should use
|
||
|
regular malloc/free.
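<p>
The pool mechanism can be sketched as follows. The names sk_poolalloc and
sk_pooldup and the deliberately tiny buffer size are assumptions made for
illustration; the real pool functions and their buffer size may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_POOLMAX 4 /* deliberately tiny so that reuse is easy to see */
static char* sk_pool[SK_POOLMAX]; /* circular record of live allocations */
static int sk_poolnext = 0;       /* slot that will be reused next */

/* Allocate n zeroed bytes and record them in the circular buffer.
   The allocation previously held in the reused slot is freed. */
static char*
sk_poolalloc(size_t n)
{
    char* mem;
    if(n == 0) n = 1;
    mem = calloc(n, 1);
    assert(mem != NULL);
    free(sk_pool[sk_poolnext]); /* free(NULL) is a no-op */
    sk_pool[sk_poolnext] = mem;
    sk_poolnext = (sk_poolnext + 1) % SK_POOLMAX;
    return mem;
}

/* Duplicate a string into pool memory. */
static char*
sk_pooldup(const char* s)
{
    char* mem = sk_poolalloc(strlen(s) + 1);
    strcpy(mem, s);
    return mem;
}
```

With a buffer of four slots, the string returned by the first call is freed
during the fifth call. Any code still holding that first pointer at that time
has exactly the lifetime problem described above.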
|
||
|
<p>
|
||
|
In the end, I am not sure if this is a good idea, but
|
||
|
it does make the code simpler.
|
||
|
|
||
|
<h4>List and Bytebuffer</h4>
|
||
|
The two datatypes List and Bytebuffer are used throughout the
|
||
|
code. They correspond closely in semantics to the Java ArrayList
|
||
|
and StringBuffer types, respectively. They are used to help
|
||
|
encapsulate dynamically growing lists of objects or bytes
|
||
|
to reduce certain kinds of memory allocation errors.
|
||
|
<p>
|
||
|
The canonical code for non-destructive walking of a List&lt;T&gt;
|
||
|
is as follows.
|
||
|
<pre>
|
||
|
for(i=0;i&lt;listlength(list);i++) {
|
||
|
T* element = (T*)listget(list,i);
|
||
|
...
|
||
|
}
|
||
|
</pre>
|
||
|
<p>
|
||
|
Bytebuffer provides two ways to access its internal buffer of characters.
|
||
|
One is "bbContents()", which returns a direct pointer to the buffer,
|
||
|
and the other is "bbDup()", which returns a malloc'd string containing
|
||
|
the contents and is guaranteed to be null terminated.
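<p>
The difference between the two access styles can be sketched with a minimal
growable buffer. SkBuf and its functions are hypothetical stand-ins for
Bytebuffer, bbContents, and bbDup; in this sketch the direct pointer is not
null terminated and may be invalidated by a later append, which is an
assumption about why a duplicating accessor is useful.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct SkBuf {
    size_t length;  /* bytes in use */
    size_t alloc;   /* bytes allocated */
    char*  content; /* not necessarily null terminated */
} SkBuf;

/* Append n bytes, growing the buffer as needed. */
static void
skbuf_append(SkBuf* b, const char* s, size_t n)
{
    if(n == 0) return;
    if(b->length + n > b->alloc) {
        size_t newalloc = (b->alloc == 0 ? 16 : b->alloc);
        char* newcontent;
        while(newalloc < b->length + n) newalloc *= 2;
        newcontent = realloc(b->content, newalloc);
        assert(newcontent != NULL);
        b->content = newcontent;
        b->alloc = newalloc;
    }
    memcpy(b->content + b->length, s, n);
    b->length += n;
}

/* Direct view of the internal buffer; use together with b->length. */
static const char*
skbuf_contents(const SkBuf* b)
{
    return b->content;
}

/* Independent, null terminated copy; the caller must free it. */
static char*
skbuf_dup(const SkBuf* b)
{
    char* copy = malloc(b->length + 1);
    assert(copy != NULL);
    if(b->length > 0) memcpy(copy, b->content, b->length);
    copy[b->length] = '\0';
    return copy;
}
```

The direct pointer is cheap but only safe for immediate use; the duplicate can
be passed to string functions and kept after the buffer changes.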
|
||
|
|
||
|
<h4><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h4>
|
||
|
The odometer data type is used to convert
|
||
|
multiple dimensions into a single integer.
|
||
|
The rule for converting a multi-dimensional
|
||
|
array to a single dimension is as follows.
|
||
|
<p>
|
||
|
Suppose we have the declaration <code>int F[2][5][3];</code>.
|
||
|
There are obviously a total of 2 X 5 X 3 = 30 integers in F.
|
||
|
Thus, these three dimensions will be reduced to a single dimension of size 30.
|
||
|
<p>
|
||
|
A particular point in the three dimensions, say [x][y][z], is reduced to
|
||
|
a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
|
||
|
The corresponding general C code is as follows.
|
||
|
<pre>
size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i = 0; i < rank; i++) {
        /* scale the running offset by the size of this dimension */
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}
</pre>
In this code, the indices variable corresponds to x, y, and z.
The sizes variable corresponds to 2, 5, and 3.
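<p>
As a worked check of the formula, the point [1][4][2] in F maps to
((1*5)+4)*3+2 = 29, the last element, and [0][0][0] maps to 0.
The sketch below repeats dimmap so that it is self-contained and adds
an inverse mapping, dimunmap_sketch, which is a hypothetical helper
introduced only for illustration and is not part of ncgen.

```c
#include <stddef.h>

/* Same algorithm as dimmap in the text. */
size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i = 0; i < rank; i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}

/* Hypothetical inverse: recover the indices from a flat offset.
   Assumes offset is less than the product of all sizes. */
void
dimunmap_sketch(int rank, size_t offset, size_t* sizes, size_t* indices)
{
    int i;
    for(i = rank - 1; i >= 0; i--) {
        indices[i] = offset % sizes[i]; /* the last dimension varies fastest */
        offset /= sizes[i];
    }
}
```

For example, with sizes {2,5,3}, the offset 16 corresponds to the
point [1][0][1], since ((1*5)+0)*3+1 = 16.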
<p>
The Odometer type stores a set of dimensions
and supports operations to iterate over all possible
dimension combinations.
It is defined by the following two types, Dimdata and Odometer.
<pre>
typedef struct Dimdata {
    unsigned long datasize; // actual size of the datalist item
    unsigned long index;    // 0 <= index < datasize
    unsigned long declsize;
} Dimdata;

typedef struct Odometer {
    int rank;
    Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
</pre>
The following primary operations are defined.
<ul>
<li>Odometer* newodometer(Dimset*) - create an odometer from a Dimset.
<li>void freeodometer(Odometer*) - release the memory of an odometer.
<li>int odometermore(Odometer* odom) - return 1 if there are more combinations
of dimension values.
<li>int odometerincr(Odometer* odo,int) - move to the next combination
of dimension values.
<li>unsigned long odometercount(Odometer* odo) -
apply the dimmap algorithm above to convert the current odometer combination
into a single integer.
</ul>
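<p>
The way these operations fit together can be sketched with a
simplified odometer. This is an illustration under assumptions, not
the ncgen implementation: MiniOdom, MAXRANK, miniodom_more,
miniodom_incr, and miniodom_count are hypothetical names, and the
sketch keeps only one size and one index per dimension instead of the
full Dimdata record.

```c
#include <stddef.h>

#define MAXRANK 8 /* assumed bound; the real code uses NC_MAX_VAR_DIMS */

/* Hypothetical, simplified odometer; assumes rank >= 1. */
typedef struct MiniOdom {
    int rank;
    size_t size[MAXRANK];   /* extent of each dimension */
    size_t index[MAXRANK];  /* current position, 0 <= index[i] < size[i] */
} MiniOdom;

/* Return 1 while the odometer still denotes a valid combination. */
int
miniodom_more(MiniOdom* odom)
{
    return odom->index[0] < odom->size[0];
}

/* Advance to the next combination; the last dimension varies fastest. */
void
miniodom_incr(MiniOdom* odom)
{
    int i;
    for(i = odom->rank - 1; i >= 0; i--) {
        odom->index[i]++;
        if(odom->index[i] < odom->size[i]) break; /* no carry needed */
        if(i == 0) break;      /* leave index[0] == size[0] to signal the end */
        odom->index[i] = 0;    /* wrap and carry into the next dimension */
    }
}

/* Convert the current combination into a single integer. */
size_t
miniodom_count(MiniOdom* odom)
{
    int i;
    size_t count = 0;
    for(i = 0; i < odom->rank; i++) {
        if(i > 0) count *= odom->size[i];
        count += odom->index[i];
    }
    return count;
}
```

A typical loop starts with all indices at zero, tests miniodom_more,
uses miniodom_count for the current combination, and then calls
miniodom_incr. For sizes {2,5,3} such a loop visits the offsets
0 through 29 in order.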

<h3><u>Change Log</u></h3>
<ul>
<li>04/15/2009 - Add major discussion about adding a new output language.
<li>03/10/2009 - Fix typos.
<li>03/07/2009 - First draft.
</ul>

</body>
</html>