* doc/cppinternals.texi: Update.

From-SVN: r46009
This commit is contained in:
Neil Booth 2001-10-04 12:22:03 +00:00 committed by Neil Booth
parent 3054eeed1d
commit d3d43aabbd
2 changed files with 338 additions and 143 deletions

View File

@@ -1,3 +1,7 @@
2001-10-04 Neil Booth <neil@daikokuya.demon.co.uk>
* doc/cppinternals.texi: Update.
2001-10-04 Eric Christopher <echristo@redhat.com>
* config/mips/mips.c (init_cumulative_args): Remember to set

View File

@@ -66,7 +66,8 @@ into another language, under the above conditions for modified versions.
@contents
@page
@node Top, Conventions,, (DIR)
@node Top
@top
@chapter Cpplib---the core of the GNU C Preprocessor
The GNU C preprocessor in GCC 3.x has been completely rewritten. It is
@@ -87,16 +88,18 @@ tricky issues encountered. It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.
@menu
* Conventions:: Conventions used in the code.
* Lexer:: The combined C, C++ and Objective-C Lexer.
* Whitespace:: Input and output newlines and whitespace.
* Hash Nodes:: All identifiers are hashed.
* Macro Expansion:: Macro expansion algorithm.
* Files:: File handling.
* Index:: Index.
* Conventions:: Conventions used in the code.
* Lexer:: The combined C, C++ and Objective-C Lexer.
* Hash Nodes:: All identifiers are entered into a hash table.
* Macro Expansion:: Macro expansion algorithm.
* Token Spacing:: Spacing and paste avoidance issues.
* Line Numbering:: Tracking location within files.
* Guard Macros:: Optimizing header files with guard macros.
* Files:: File handling.
* Index:: Index.
@end menu
@node Conventions, Lexer, Top, Top
@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files
@@ -118,9 +121,11 @@ change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behaviour.
@node Lexer, Whitespace, Conventions, Top
@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines
@section Overview
The lexer is contained in the file @file{cpplex.c}. It is a hand-coded
@@ -143,7 +148,7 @@ output.
@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines. In its current form the code is quite complicated,
with read ahead characters and suchlike, since it strives to not step
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
@@ -175,7 +180,7 @@ using the line map code.
The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line. This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a callback to clients that want
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
@@ -219,9 +224,28 @@ foo
@end smallexample
This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
in the preprocessor; there are plenty of tests in the test-suite for
corner cases like this.
The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator. This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.
Handling newlines like this is made simpler by doing it in one place
only. The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.
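The following minimal sketch illustrates the idea behind such a newline
handler; the function name, signature and line counter are illustrative
assumptions, not cpplib's actual interface.
@smallexample
/* Illustrative sketch only: treat \n, \r, \r\n and \n\r as a single
   logical newline.  CUR points at the first newline character; the
   return value points just past the complete indicator.  */
static const char *
handle_newline_sketch (const char *cur, unsigned int *line)
@{
  char first = *cur++;
  /* A two-character indicator is a CR/LF pair in either order.  */
  if ((*cur == '\n' || *cur == '\r') && *cur != first)
    cur++;
  (*line)++;
  return cur;
@}
@end smallexample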
The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
@@ -255,6 +279,7 @@ should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
@@ -282,71 +307,298 @@ two separate @samp{:} tokens and almost certainly a syntax error. Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls cpp_destroy, probably at the end of the translation
unit.
@anchor{Lexing a line}
@section Lexing a line
@cindex token run
@node Whitespace, Hash Nodes, Lexer, Top
@unnumbered Whitespace
@cindex whitespace
@cindex newlines
@cindex escaped newlines
When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.
Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognising it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream. The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.
The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).
In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.
Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?
Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run. Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
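A token run can be pictured as a block of token storage with a link to
the next block. The declarations below are only a sketch; the type and
member names are assumptions, not cpplib's actual definitions.
@smallexample
/* Illustrative sketch of a chained token run.  */
struct cpp_token;                /* cpplib's token structure.  */
struct token_run
@{
  struct cpp_token *base;        /* First token slot in this run.  */
  struct cpp_token *limit;       /* One past the last slot.  */
  struct token_run *next;        /* Next run, chained on overflow;
                                    NULL until overflow occurs.  */
@};
@end smallexample
Because overflow only ever appends a new run to the chain, tokens
already handed out from earlier runs are never moved, which is what
preserves the pointer guarantee described above.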
The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}. So if a macro is expanded in our line of tokens,
the pointers to the tokens of its expansion that we return will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringification and token pasting. I handled this by allocating
space for these tokens from the lexer's token run chain. This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.
Lexing into a line of tokens solves some of the token memory management
issues, but not all. The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line. So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero. Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis of, and reading the arguments of, a function-like macro. In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.
The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made. @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.
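In outline, its behaviour might look something like the following
sketch. The helper functions and the flag value are hypothetical
stand-ins for cpplib's internal state handling; none of them are real
cpplib interfaces.
@smallexample
/* Hypothetical helpers standing in for cpplib's internal state.  */
struct reader;
struct cpp_token @{ unsigned short flags; /* ... */ @};
#define BOL (1 << 0)             /* Illustrative flag value only.  */
int stepped_back (struct reader *);
const struct cpp_token *next_saved_token (struct reader *);
int run_full (struct reader *);
void append_new_run (struct reader *);
const struct cpp_token *lex_direct (struct reader *);
int handle_directive (struct reader *, const struct cpp_token *);
void line_callback (struct reader *, const struct cpp_token *);
int skipping_failed_conditional (struct reader *);
void invalidate_control_macro (struct reader *);

/* Sketch of the dispatch performed by _cpp_lex_token.  */
const struct cpp_token *
lex_token_sketch (struct reader *r)
@{
  for (;;)
    @{
      const struct cpp_token *tok;
      if (stepped_back (r))
        tok = next_saved_token (r);   /* Previously-lexed token.  */
      else
        @{
          if (run_full (r))
            append_new_run (r);       /* Chain a fresh token run.  */
          tok = lex_direct (r);       /* Lex a new token.  */
        @}
      if (tok->flags & BOL)
        @{
          if (handle_directive (r, tok))
            continue;                 /* A directive consumed its line.  */
          line_callback (r, tok);     /* Start-of-line call-back.  */
        @}
      if (skipping_failed_conditional (r))
        continue;                     /* Inside a failed conditional.  */
      invalidate_control_macro (r);   /* Token outside a directive, so
                                         the multiple-include control
                                         macro no longer applies.  */
      return tok;
    @}
@}
@end smallexample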
@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators
When cpplib encounters an ``identifier'', it computes a hash code for
it and enters the identifier in the hash table. By ``identifier'' we
mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on. For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.
Each node in the hash table contains various information about the
identifier it represents, such as its length and type. At any one
time, each identifier falls into exactly one of three categories:
@itemize @bullet
@item Macros
These have been declared to be macros, either on the command line or
with @code{#define}. A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialisation. The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion. The hash node for a built-in macro is flagged as
special, and instead contains an enum indicating which built-in macro
it is.
@item Assertions
Assertions are in a separate namespace to macros. To enforce this, cpp
actually prepends a @code{#} character before hashing and entering it in
the hash table. An assertion's node points to a chain of answers to
that assertion.
@item Void
Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.
When preprocessing C++, this category also includes the named operators,
such as @code{xor}. In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently. This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}. Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize
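For example, over the course of preprocessing a single name can move
between these categories (the last line uses the GNU @code{#assert}
extension):
@smallexample
int x;                /* "x" is an ordinary identifier: void.  */
#define x 1           /* "x" is now a macro.  */
#undef x              /* ... and void once more.  */
#assert machine(vax)  /* The assertion node "#machine" gains the answer
                         "(vax)"; the identifier "machine" itself
                         remains void.  */
@end smallexample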
All occurrences of the same identifier share a single hash node. Since
each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a @code{#define} statement, CPP flags each argument's identifier
hash node with the index of that argument. This makes duplicated
argument checking an O(1) operation for each argument. Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation. Further,
each directive name, such as @code{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).
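Putting this together, a hash node might conceptually carry something
like the following. This is a deliberately simplified sketch; the type
and member names are assumptions, not cpplib's real declarations.
@smallexample
/* Simplified sketch of the information a hash node might carry.  */
struct macro_info;               /* Details of a normal macro.  */
struct answer;                   /* One answer to an assertion.  */
enum node_category @{ CAT_VOID, CAT_MACRO, CAT_ASSERTION @};
struct hash_node_sketch
@{
  unsigned int len;              /* Length of the identifier.  */
  unsigned short flags;          /* Named operator, built-in, etc.  */
  unsigned short arg_index;      /* Argument position while parsing a
                                    #define, for O(1) duplicate checks.  */
  unsigned char directive;       /* Directive enum, e.g. for "endif".  */
  enum node_category category;   /* Which of the three categories.  */
  union
  @{
    const struct macro_info *macro;  /* Expansion of a normal macro.  */
    int builtin;                     /* Which built-in macro this is.  */
    struct answer *answers;          /* Answer chain of an assertion.  */
  @} value;
@};
@end smallexample
The flags, the argument index and the directive enum are what make the
O(1) operations described above possible.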
@node Macro Expansion
@unnumbered Macro Expansion Algorithm
@c TODO
@node Token Spacing
@unnumbered Token Spacing
@cindex paste avoidance
@cindex spacing
@cindex token spacing
First, let's look at an issue that only concerns the stand-alone
preprocessor: we want to guarantee that re-reading its preprocessed
output results in an identical token stream. Without taking special
measures, this might not be the case because of macro substitution. For
example:
@smallexample
#define PLUS +
#define EMPTY
#define f(x) =x=
+PLUS -EMPTY- PLUS+ f(=)
@expansion{} + + - - + + = = =
@emph{not}
@expansion{} ++ -- ++ ===
@end smallexample
One solution would be to simply insert a space between all adjacent
tokens. However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.
For now, just notice that the only places we need to be careful about
@dfn{paste avoidance} are when tokens are added to (or removed from)
the original token stream. This only occurs because of macro expansion,
but
care is needed in many places: before @strong{and} after each macro
replacement, each argument replacement, and additionally each token
created by the @samp{#} and @samp{##} operators.
Let's look at how the preprocessor gets whitespace output correct
normally. The @code{cpp_token} structure contains a flags byte, and one
of those flags is @code{PREV_WHITE}. This is flagged by the lexer, and
indicates that the token was preceded by whitespace of some form other
than a new line. The stand-alone preprocessor can use this flag to
decide whether to insert a space between tokens in the output.
Now consider the following:
@smallexample
#define add(x, y, z) x + y +z;
sum = add (1,2, 3);
@expansion{} sum = 1 + 2 +3;
@end smallexample
The interesting thing here is that the tokens @samp{1} and @samp{2} are
output with a preceding space, and @samp{3} is output without a
preceding space, but when lexed none of these tokens had that property.
Careful consideration reveals that @samp{1} gets its preceding
whitespace from the space preceding @samp{add} in the macro
@emph{invocation}, @samp{2} gets its whitespace from the space preceding
the parameter @samp{y} in the macro @emph{replacement list}, and
@samp{3} has no preceding space because parameter @samp{z} has none in
the replacement list.
Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions. So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a
@dfn{padding token}, into the token stream in front of every macro
expansion and expanded macro argument, to indicate that the subsequent
token should assume its @code{PREV_WHITE} flag from a different
@dfn{source token}. In the above example, the source tokens are
@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
macro replacement list, respectively.
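Conceptually, then, all a padding token needs to record is its source
token; the declaration below is purely illustrative and does not
reflect cpplib's actual token representation.
@smallexample
/* Illustrative sketch: a padding token merely names a source token.  */
struct cpp_token;                  /* cpplib's token structure.  */
struct padding_sketch
@{
  const struct cpp_token *source;  /* Token whose PREV_WHITE flag the
                                      following token should assume.  */
@};
@end smallexample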
It is quite easy to get multiple padding tokens in a row, for example if
a macro's first replacement token expands straight into another macro.
@smallexample
#define foo bar
#define bar baz
[foo]
@expansion{} [baz]
@end smallexample
Here, two padding tokens are generated, with sources the @samp{foo}
token between the brackets and the @samp{bar} token from @samp{foo}'s
replacement list, respectively. Clearly the
first padding token is the one that matters. But what if we happen to
leave a macro expansion? Adjusting the above example slightly:
@smallexample
#define foo bar
#define bar EMPTY baz
#define EMPTY
[foo] EMPTY;
@expansion{} [ baz] ;
@end smallexample
As shown, there should now be a space before @samp{baz} and before the
semicolon. Our initial algorithm fails for the former, because we would
see three padding tokens, one per macro invocation, followed by
@samp{baz}, which would inherit its spacing from the original source,
@samp{foo}, which has no leading space. Note that it is vital that
cpplib get
spacing correct in these examples, since any of these macro expansions
could be stringified, where spacing matters.
We have now seen that not only entering macro and argument expansions,
but also leaving them, requires special handling. So cpplib inserts a
padding token with a @code{NULL} source token when leaving
macro expansions and after each replaced argument in a macro's
replacement list. It also inserts appropriate padding tokens on either
side of tokens created by the @samp{#} and @samp{##} operators.
Now we can see the relationship with paste avoidance: we have to be
careful about paste avoidance in exactly the same locations we take care
to get whitespace correct. This makes implementation of paste
avoidance easy: wherever the stand-alone preprocessor is fixing up
spacing because of padding tokens, and it turns out that no space is
needed, it has to take the extra step to check that a space is not
needed after all to avoid an accidental paste. The function
@code{cpp_avoid_paste} advises whether a space is required between two
consecutive tokens. To avoid excessive spacing, it tries hard to only
require a space if one is likely to be necessary, but for reasons of
efficiency it is slightly conservative and might recommend a space where
one is not strictly needed.
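As a sketch of how the stand-alone preprocessor's output routines might
combine these checks, assuming only the @code{cpp_avoid_paste}
interface described above (the function below and the way whitespace
state is tracked are illustrative assumptions, not the real output
code):
@smallexample
#include <stdio.h>
#include "cpplib.h"

/* Illustrative only: decide whether to print a space before TOKEN.
   WHITE is non-zero if PREV_WHITE handling (possibly via a padding
   token's source) already calls for a space; PADDED is non-zero if
   padding tokens preceded TOKEN, so an accidental paste with PREV,
   the token most recently written out, must be checked for.  */
static void
maybe_print_space (cpp_reader *pfile, const cpp_token *token,
                   const cpp_token *prev, int white, int padded)
@{
  if (white
      || (padded && prev && cpp_avoid_paste (pfile, prev, token)))
    putc (' ', stdout);
@}
@end smallexample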
@node Line Numbering
@unnumbered Line numbering
@cindex line numbers
The lexer has been written to treat each of @samp{\r}, @samp{\n},
@samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows
it to transparently preprocess MS-DOS, Macintosh and Unix files without
their needing to pass through a special filter beforehand.
We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.
Handling newlines like this is made simpler by doing it in one place
only. The function @samp{handle_newline} takes care of all newline
characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @samp{handle_newline}
to handle the newlines themselves.
Another whitespace issue only concerns the stand-alone preprocessor: we
want to guarantee that re-reading the preprocessed output results in an
identical token stream. Without taking special measures, this might not
be the case because of macro substitution. We could simply insert a
space between adjacent tokens, but ideally we would like to keep this to
a minimum, both for aesthetic reasons and because it causes problems for
people who still try to abuse the preprocessor for things like Fortran
source and Makefiles.
The token structure contains a flags byte, and two flags are of interest
here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE}
indicates that the token was preceded by whitespace; if this is the case
we need not worry about it incorrectly pasting with its predecessor.
The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
indicates that paste avoidance by insertion of a space to the left of
the token may be necessary. Recursively, the first token of a macro
substitution, the first token after a macro substitution, the first
token of a substituted argument, and the first token after a substituted
argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
and the routine @code{cpp_avoid_paste} determines that it might be
misinterpreted by the lexer if a space is not inserted between it and
the immediately preceding token, then stand-alone CPP's output routines
will insert a space between them. To avoid excessive spacing,
@code{cpp_avoid_paste} tries hard to only request a space if one is
likely to be necessary, but for reasons of efficiency it is slightly
conservative and might recommend a space where one is not strictly
needed.
Finally, the preprocessor takes great care to ensure it keeps track of
both the position of a token in the source file, for diagnostic
purposes, and where it should appear in the output file, because using
CPP for other languages like assembler requires this. The two positions
may differ for the following reasons:
The preprocessor takes great care to ensure it keeps track of both the
position of a token in the source file, for diagnostic purposes, and
where it should appear in the output file, because using CPP for other
languages like assembler requires this. The two positions may differ
for the following reasons:
@itemize @bullet
@item
@@ -367,75 +619,14 @@ The source file location is maintained in the @code{lineno} member of the
current position in the buffer relative to the @code{line_base} buffer
variable, which is updated with every newline whether escaped or not.
TODO: Finish this.
@c FINISH THIS
@node Hash Nodes, Macro Expansion, Whitespace, Top
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators
@node Guard Macros
@unnumbered The Multiple-Include Optimization
When cpplib encounters an ``identifier'', it generates a hash code for it
and stores it in the hash table. By ``identifier'' we mean tokens with
type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
well as keywords, directive names, macro names and so on. For example,
all of @samp{pragma}, @samp{int}, @samp{foo} and @samp{__GNUC__} are identifiers and hashed
when lexed.
@c TODO
Each node in the hash table contain various information about the
identifier it represents. For example, its length and type. At any one
time, each identifier falls into exactly one of three categories:
@itemize @bullet
@item Macros
These have been declared to be macros, either on the command line or
with @code{#define}. A few, such as @samp{__TIME__} are builtins
entered in the hash table during initialisation. The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion. Builtin macros are flagged as special, and instead
contain an enum indicating which of the various builtin macros it is.
@item Assertions
Assertions are in a separate namespace to macros. To enforce this, cpp
actually prepends a @code{#} character before hashing and entering it in
the hash table. An assertion's node points to a chain of answers to
that assertion.
@item Void
Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.
When preprocessing C++, this category also includes the named operators,
such as @samp{xor}. In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently. This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}. Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize
The same identifiers share the same hash node. Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a @code{#define} statement, CPP flags each argument's identifier
hash node with the index of that argument. This makes duplicated
argument checking an O(1) operation for each argument. Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation. Further,
each directive name, such as @samp{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).
@node Macro Expansion, Files, Hash Nodes, Top
@unnumbered Macro Expansion Algorithm
@node Files, Index, Macro Expansion, Top
@node Files
@unnumbered File Handling
@cindex files
@@ -459,10 +650,10 @@ filesystem queries whilst searching for the correct file.
For each file we try to open, we store the constructed path in a splay
tree. This path first undergoes simplification by the function
@code{_cpp_simplify_pathname}. For example,
@samp{/usr/include/bits/../foo.h} is simplified to
@samp{/usr/include/foo.h} before we enter it in the splay tree and try
@file{/usr/include/bits/../foo.h} is simplified to
@file{/usr/include/foo.h} before we enter it in the splay tree and try
to @code{open ()} the file. CPP will then find subsequent uses of
@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
save system calls.
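To give a flavour of the transformation involved, here is a
deliberately simplified sketch of such path simplification. It is not
the real @code{_cpp_simplify_pathname}, which has to cope with many
more cases.
@smallexample
/* Deliberately simplified: collapse "./" and "dir/../" components of
   PATH in place.  */
static void
simplify_path_sketch (char *path)
@{
  char *out = path, *in = path;
  while (*in)
    @{
      if (in[0] == '.' && in[1] == '/'
          && (out == path || out[-1] == '/'))
        in += 2;                 /* Drop a "./" component.  */
      else if (in[0] == '.' && in[1] == '.' && in[2] == '/'
               && out > path + 1 && out[-1] == '/')
        @{
          /* Drop "../" together with the preceding component.  */
          for (out--; out > path && out[-1] != '/'; )
            out--;
          in += 3;
        @}
      else
        *out++ = *in++;
    @}
  *out = '\0';
@}
@end smallexample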
Further, it is likely the file contents have also been cached, saving a
@@ -486,7 +677,7 @@ directory on a per-file basis is handled by the function
Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.
Enough information is stored in the splay tree that CPP can immediately
@@ -503,7 +694,7 @@ command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.
@node Index,, Files, Top
@node Index
@unnumbered Index
@printindex cp