2002-05-01 05:08:11 +08:00
Internals of the Netwide Assembler
==================================

The Netwide Assembler is intended to be a modular, re-usable x86
assembler, which can be embedded in other programs, for example as
the back end to a compiler.

The assembler is composed of modules. The interfaces between them
look like:

                  +--- preproc.c ----+
                  |                  |
                  +---- parser.c ----+
                  |        |         |
                  |     float.c      |
                  |                  |
                  +--- assemble.c ---+
                  |        |         |
        nasm.c ---+     insnsa.c     +--- nasmlib.c
                  |                  |
                  +--- listing.c ----+
                  |                  |
                  +---- labels.c ----+
                  |                  |
                  +--- outform.c ----+
                  |                  |
                  +----- *out.c -----+

In other words, each of `preproc.c', `parser.c', `assemble.c',
`labels.c', `listing.c', `outform.c' and the output format modules
`*out.c' is an independent module, and these modules do not
communicate with one another directly, only through the main
program.

The Netwide *Disassembler* is not intended to be particularly
portable or reusable or anything, however. So I won't bother
documenting it here. :-)

nasmlib.c
---------

This is a library module; it contains simple library routines which
may be referenced by all other modules. Among these are a set of
wrappers around the standard `malloc' routines, which will report a
fatal error if they run out of memory, rather than returning NULL.

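As a rough illustration of the wrapper idea described for `nasmlib.c',
here is a minimal sketch. The names `checked_malloc', `checked_realloc'
and `fatal_out_of_memory' are invented for this example and are not
taken from the NASM sources; the real module reports the failure
through NASM's own error mechanism rather than printing directly.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error hook: this sketch just prints a message and
 * exits instead of calling a real error-reporting routine. */
static void fatal_out_of_memory(void)
{
    fputs("fatal: out of memory\n", stderr);
    exit(EXIT_FAILURE);
}

/* Like malloc(), but does not return NULL when allocation fails. */
void *checked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL && size != 0)
        fatal_out_of_memory();
    return p;
}

/* Like realloc(), with the same "no NULL on failure" contract. */
void *checked_realloc(void *old, size_t size)
{
    void *p = realloc(old, size);
    if (p == NULL && size != 0)
        fatal_out_of_memory();
    return p;
}
```

With wrappers of this kind, callers can use the returned pointer
without testing it, which keeps the other modules free of repeated
out-of-memory checks.
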
preproc.c
---------

This contains a macro preprocessor, which takes a file name as input
and returns a sequence of preprocessed source lines. The only symbol
exported from the module is `nasmpp', which is a data structure of
type `Preproc', declared in nasm.h. This structure contains pointers
to all the functions designed to be callable from outside the
module.

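The "one exported structure of function pointers" arrangement used by
`preproc.c' might be sketched as follows. Everything here is an
assumption made for illustration: `struct toy_preproc', its member
names and `toy_pp' are invented, and the sketch reads from an
in-memory string rather than from a named file and performs no macro
expansion at all. The actual `Preproc' declaration is in nasm.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the `Preproc' type: a table
 * of function pointers that is the module's only exported symbol. */
struct toy_preproc {
    void (*reset)(const char *text);  /* start reading new input */
    const char *(*get_line)(void);    /* next line, or NULL at end */
    void (*cleanup)(void);            /* forget the current input */
};

/* Module-private state: callers cannot see or touch it. */
static const char *cursor;
static char line_buf[256];

static void toy_reset(const char *text)
{
    cursor = text;
}

static const char *toy_get_line(void)
{
    size_t len;

    if (cursor == NULL || *cursor == '\0')
        return NULL;
    len = strcspn(cursor, "\n");
    if (len >= sizeof line_buf)
        len = sizeof line_buf - 1;  /* over-long lines get split */
    memcpy(line_buf, cursor, len);
    line_buf[len] = '\0';
    cursor += len;
    if (*cursor == '\n')
        cursor++;                   /* step over the line terminator */
    return line_buf;
}

static void toy_cleanup(void)
{
    cursor = NULL;
}

/* The single exported symbol. */
const struct toy_preproc toy_pp = { toy_reset, toy_get_line, toy_cleanup };
```

A caller would drive it with `toy_pp.reset(text)', then call
`toy_pp.get_line()' until it returns NULL, then `toy_pp.cleanup()'.
Because every helper is `static', the structure is the only way in.
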
parser.c
--------

This contains a source-line parser. It parses `canonical' assembly
source lines, containing some combination of the `label', `opcode',
`operand' and `comment' fields: it does not process directives or
macros. It exports two functions: `parse_line' and `cleanup_insn'.

`parse_line' is the main parser function: you pass it a source line
in ASCII text form, and it returns an `insn' structure containing
all the details of the instruction on that line. The parameters it
requires are:

- The location (segment, offset) where the instruction on this line
  will eventually be placed. This is necessary in order to evaluate
  expressions containing the Here token, `$'.

- A function which can be called to retrieve the value of any
  symbols the source line references.

- Which pass the assembler is on: an undefined symbol only causes an
  error condition on pass two.

- The source line to be parsed.

- A structure to fill with the results of the parse.

- A function which can be called to report errors.

Some instructions (DB, DW, DD for example) can require an arbitrary
amount of storage, and so some of the members of the resulting
`insn' structure will be dynamically allocated. The other function
exported by `parser.c' is `cleanup_insn', which can be called to
deallocate any dynamic storage associated with the results of a
parse.

names.c
-------

This doesn't count as a module - it defines a few arrays which are
shared between NASM and NDISASM, so it's a separate file which is
#included by both parser.c and disasm.c.

float.c
-------

This is essentially a library module: it exports one function,
`float_const', which converts an ASCII representation of a
floating-point number into an x86-compatible binary representation,
without using any built-in floating-point arithmetic (so it will run
on any platform, portably). It calls nothing, and is called only by
`parser.c'. Note that the function `float_const' must be passed an
error reporting routine.

assemble.c
----------

This module contains the code generator: it translates `insn'
structures as returned from the parser module into actual generated
code which can be placed in an output file. It exports two
functions, `assemble' and `insn_size'.

`insn_size' is designed to be called on pass one of assembly: it
takes an `insn' structure as input, and returns the amount of space
that would be taken up if the instruction described in the structure
were to be converted to real machine code. `insn_size' also needs to
be told the location (as a segment/offset pair) where the
instruction would be assembled, the mode of assembly (16/32 bit
default), and a function it can call to report errors.

`assemble' is designed to be called on pass two: it takes all the
parameters that `insn_size' does, but has an extra parameter which
is an output driver. `assemble' actually converts the input
instruction into machine code, and outputs the machine code by means
of calling the `output' function of the driver.

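The division of labour between the two passes can be shown with a toy
model. This is a sketch only: `struct toy_insn', `pass_one' and
`pass_two' are invented names, sizes are supplied by hand instead of
being computed from an instruction table, and nothing is encoded. It
shows why sizing alone is enough on pass one (every label receives an
offset, including ones referenced before they are defined) and why an
undefined symbol only becomes an error on pass two.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LABELS 16

/* Toy instruction: this is not NASM's `insn' structure. */
struct toy_insn {
    const char *label;   /* label defined at this instruction, or NULL */
    long size;           /* number of bytes the instruction occupies */
    const char *target;  /* label this instruction refers to, or NULL */
};

static const char *names[MAX_LABELS];
static long offsets[MAX_LABELS];
static int nlabels;

/* Pass one: add up sizes and record the offset of each label. */
void pass_one(const struct toy_insn *prog, int n)
{
    long offset = 0;
    int i;

    nlabels = 0;
    for (i = 0; i < n; i++) {
        if (prog[i].label != NULL && nlabels < MAX_LABELS) {
            names[nlabels] = prog[i].label;
            offsets[nlabels++] = offset;
        }
        offset += prog[i].size;
    }
}

/* Pass two: resolve every reference against the recorded labels.
 * resolved[i] is the target offset, or -1 if instruction i has no
 * target. Returns 0 on success, -1 on an undefined symbol. */
int pass_two(const struct toy_insn *prog, int n, long *resolved)
{
    int i, j;

    for (i = 0; i < n; i++) {
        resolved[i] = -1;
        if (prog[i].target == NULL)
            continue;
        for (j = 0; j < nlabels; j++)
            if (strcmp(names[j], prog[i].target) == 0)
                break;
        if (j == nlabels)
            return -1;      /* undefined: an error only on this pass */
        resolved[i] = offsets[j];
    }
    return 0;
}
```

For a three-instruction program of sizes 2, 3 and 1 in which the
first instruction refers forward to a label on the third, pass one
places that label at offset 5 and pass two resolves the reference.
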
insnsa.c
--------

This is another library module: it exports one very big array of
instruction translations. It is generated automatically from the
insns.dat file by the insns.pl script.

labels.c
--------

This module contains a label manager. It exports six functions:

`init_labels' should be called before any other function in the
module. `cleanup_labels' may be called after all other use of the
module has finished, to deallocate storage.

`define_label' is called to define new labels: you pass it the name
of the label to be defined, and the (segment,offset) pair giving the
value of the label. It is also passed an error-reporting function,
and an output driver structure (so that it can call the output
driver's label-definition function). `define_label' mentally
prepends the name of the most recently defined non-local label to
any label beginning with a period.

`define_label_stub' is designed to be called in pass two, once all
the labels have already been defined: it does nothing except to
update the "most-recently-defined-non-local-label" status, so that
references to local labels in pass two will work correctly.

`declare_as_global' is used to declare that a label should be
global. It must be called _before_ the label in question is defined.

Finally, `lookup_label' attempts to translate a label name into a
(segment,offset) pair. It returns non-zero on success.

The label manager module is (theoretically :) restartable: after
calling `cleanup_labels', you can call `init_labels' again, and
start a new assembly with a new set of symbols.

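The local-label rule that `define_label' applies (a name beginning
with a period is prefixed with the most recently defined non-local
label) can be sketched like this. `qualify_label' and `prev_label'
are names made up for the example, and the sketch uses the C99
function `snprintf' for brevity; it stores no values and calls no
output driver.

```c
#include <stdio.h>
#include <string.h>

#define LABEL_MAX 128

/* The most recently defined non-local label (hypothetical state). */
static char prev_label[LABEL_MAX] = "";

/* Write the full name of `name' into `out' and return `out'.
 * A name starting with a period is local, so it is prefixed with the
 * previous non-local label; any other name becomes the new prefix. */
const char *qualify_label(const char *name, char *out, size_t outlen)
{
    if (name[0] == '.') {
        snprintf(out, outlen, "%s%s", prev_label, name);
    } else {
        snprintf(prev_label, sizeof prev_label, "%s", name);
        snprintf(out, outlen, "%s", name);
    }
    return out;
}
```

Defining `main', `.loop', `exit', `.loop' in that order yields the
full names `main', `main.loop', `exit' and `exit.loop'. In this
model, the job described for `define_label_stub' corresponds to
updating `prev_label' alone, so that a `.loop' seen on pass two is
expanded against the same prefix as on pass one.
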
listing.c
---------

This file contains the listing file generator. The interface to the
module is through the one symbol it exports, `nasmlist', which is a
structure containing six function pointers. The calling semantics of
these functions aren't terribly well thought out, as yet, but they
work (just about) so they're going to get left alone for now...

outform.c
---------

This small module contains a set of routines to manage a list of
output formats, and select one given a keyword. It contains three
small routines: `ofmt_register' which registers an output driver as
part of the managed list, `ofmt_list' which lists the available
drivers on stdout, and `ofmt_find' which tries to find the driver
corresponding to a given name.

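A registry of the kind `outform.c' manages could look roughly like
the following. The `toy_' names and the two-member driver structure
are assumptions for the sake of the example; they are not the
signatures of `ofmt_register', `ofmt_list' or `ofmt_find', and a real
output driver structure also carries the pointers to the format's
output functions.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_FORMATS 16

/* Hypothetical, cut-down output driver description. */
struct toy_ofmt {
    const char *shortname;  /* keyword used to select the format */
    const char *fullname;   /* description printed in the listing */
};

static const struct toy_ofmt *drivers[MAX_FORMATS];
static int ndrivers;

/* Add a driver to the managed list; returns 0, or -1 if it is full. */
int toy_register(const struct toy_ofmt *fmt)
{
    if (ndrivers >= MAX_FORMATS)
        return -1;
    drivers[ndrivers++] = fmt;
    return 0;
}

/* Find the driver for a keyword; returns NULL if none matches. */
const struct toy_ofmt *toy_find(const char *name)
{
    int i;

    for (i = 0; i < ndrivers; i++)
        if (strcmp(drivers[i]->shortname, name) == 0)
            return drivers[i];
    return NULL;
}

/* Print the registered drivers to `fp', one per line. */
void toy_list(FILE *fp)
{
    int i;

    for (i = 0; i < ndrivers; i++)
        fprintf(fp, "%-8s %s\n", drivers[i]->shortname,
                drivers[i]->fullname);
}
```

The main program would register each driver once at start-up, then
call `toy_find' with the format keyword the user supplied and treat a
NULL result as an unknown format.
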
The output modules
------------------

Each of the output modules, `outbin.o', `outelf.o' and so on,
exports only one symbol, which is an output driver data structure
containing pointers to all the functions needed to produce output
files of the appropriate type.

The exception to this is `outcoff.o', which exports _two_ output
driver structures, since COFF and Win32 object file formats are very
similar and most of the code is shared between them.

nasm.c
------

This is the main program: it calls all the functions in the above
modules, and puts them together to form a working assembler. We
hope. :-)

Segment Mechanism
-----------------

In NASM, the term `segment' is used to cover all of the different
sections/segments/groups of which an object file is composed.
Essentially, every address NASM is capable of understanding is
expressed as an offset from the beginning of some segment.

The defining property of a segment is that if two symbols are
declared in the same segment, then the distance between them is
fixed at assembly time. Hence every externally-declared variable
must be declared in its own segment, since none of the locations of
these are known, and so no distances may be computed at assembly
time.

The special segment value NO_SEG (-1) is used to denote an absolute
value, e.g. a constant whose value does not depend on relocation,
such as the _size_ of a data object.

Apart from NO_SEG, segment indices all have their least significant
bit clear, if they refer to actual in-memory segments. For each
segment of this type, there is an auxiliary segment value, defined
to be the same number but with the LSB set, which denotes the
segment-base value of that segment, for object formats which support
it (Microsoft .OBJ, for example).

Hence, if `textsym' is declared in a code segment with index 2, then
referencing `SEG textsym' would return zero offset from
segment-index 3. Or, in object formats which don't understand such
references, it would return an error instead.

The next twist is SEG_ABS. Some symbols may be declared with a
segment value of SEG_ABS plus a 16-bit constant: this indicates that
they are far-absolute symbols, such as the BIOS keyboard buffer
under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
handled with care in the parser, since they are supposed to evaluate
simply to their offset part within expressions, but applying SEG to
one should yield its segment part. A far-absolute should never find
its way _out_ of the parser, unless it is enclosed in a WRT clause,
in which case Microsoft 16-bit object formats will want to know
about it.

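The numbering scheme for segment values can be made concrete with a
few helper functions. This is a sketch under stated assumptions: the
helper names are invented, and the value chosen for `TOY_SEG_ABS' is
an assumption for illustration only, since the text above says only
that a 16-bit constant is added to SEG_ABS. The value of NO_SEG is
the one given above.

```c
#define NO_SEG (-1L)            /* value given in the text */

/* Assumed value, for illustration only. */
#define TOY_SEG_ABS 0x40000000L

/* Non-zero if `seg' names an actual in-memory segment: not NO_SEG,
 * below the far-absolute range, and with its LSB clear. */
int is_memory_segment(long seg)
{
    return seg != NO_SEG && seg >= 0 && seg < TOY_SEG_ABS
           && (seg & 1L) == 0;
}

/* The auxiliary index denoting the segment-base value of `seg':
 * the same number with the LSB set. */
long segment_base_index(long seg)
{
    return seg | 1L;
}

/* Segment value for a far-absolute symbol whose segment part is the
 * 16-bit constant `paragraph'. */
long far_absolute_segment(unsigned paragraph)
{
    return TOY_SEG_ABS + (long)(paragraph & 0xFFFFu);
}

/* Recover the 16-bit segment part of a far-absolute segment value,
 * which is what applying SEG to such a symbol should yield. */
unsigned far_absolute_paragraph(long seg)
{
    return (unsigned)(seg - TOY_SEG_ABS) & 0xFFFFu;
}
```

With these, `segment_base_index(2)' is 3, matching the `textsym'
example, and the keyboard buffer at 0040h:001Eh would carry the
segment value `far_absolute_segment(0x0040)' with offset 001Eh.
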
Porting Issues
--------------

We have tried to write NASM in portable ANSI C: we do not assume
little-endianness or any hardware characteristics (in order that
NASM should work as a cross-assembler for x86 platforms, even when
run on other, stranger machines).

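One common way to avoid depending on the host's byte order, shown
here as a sketch and not as a quotation from the NASM sources, is to
build little-endian output one byte at a time with shifts and masks
instead of copying an integer's in-memory representation. The names
`put_le16' and `put_le32' are invented for this example.

```c
/* Store the low 16 bits of `v' at p[0..1], least significant byte
 * first, whatever the byte order of the host machine. */
void put_le16(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
}

/* Store the low 32 bits of `v' at p[0..3], least significant byte
 * first. */
void put_le32(unsigned char *p, unsigned long v)
{
    put_le16(p, v);
    put_le16(p + 2, v >> 16);
}
```

Because only arithmetic on values is involved, the bytes written are
the same on a big-endian host as on an x86 one.
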
Assumptions we _have_ made are:

- We assume that `short' is at least 16 bits, and `long' at least
  32. This really _shouldn't_ be a problem, since Kernighan and
  Ritchie tell us we are entitled to do so.

- We rely on having more than 6 characters of significance on
  externally linked symbols in the NASM sources. This may get fixed
  at some point. We haven't yet come across a linker brain-dead
  enough to get it wrong anyway.

- We assume that `fopen' using the mode "wb" can be used to write
  binary data files. This may be wrong on systems like VMS, with a
  strange file system. Though why you'd want to run NASM on VMS is
  beyond me anyway.

That's it. Subject to those caveats, NASM should be completely
portable. If not, we _really_ want to know about it.

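The first assumption in the list above can be written down as a
check using <limits.h>. This is only a sketch and the function name
`type_widths_ok' is invented; on a conforming compiler the `#error'
lines can never fire, which is exactly why the assumption is a safe
one.

```c
#include <limits.h>

/* Refuse to compile where the stated assumptions do not hold. */
#if USHRT_MAX < 0xFFFF
#error "short must be at least 16 bits"
#endif
#if ULONG_MAX < 0xFFFFFFFF
#error "long must be at least 32 bits"
#endif

/* Run-time form of the same check, counting storage bits; returns
 * non-zero if both widths are large enough. */
int type_widths_ok(void)
{
    return sizeof(short) * CHAR_BIT >= 16
        && sizeof(long) * CHAR_BIT >= 32;
}
```
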
Porting Non-Issues
------------------

The following is _not_ a portability problem, although it looks like
one.

- When compiling with some versions of DJGPP, you may get
  diagnostics such as `warning: ANSI C forbids braced-groups within
  expressions'. This isn't NASM's fault - the problem seems to be
  that DJGPP's definitions of the <ctype.h> macros include a
  GNU-specific C extension. So when compiling using -ansi and
  -pedantic, DJGPP complains about its own header files. It isn't a
  problem anyway, since it still generates correct code.