postgresql/doc/TODO.detail/yacc

From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
	for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$ Revision: 1.18 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10)  with ESMTP
	id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500
Sender: selkovjr@mcs.anl.gov
Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
Date: Sat, 25 Jul 1998 08:25:49 +0000
From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov>
Organization: MCS, Argonne Natl. Lab
X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
Subject: position-aware scanners
References: <199807250524.BAA07296@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: RO

Bruce,

I attached here (trough the web links) a couple examples, totally
irrelevant to postgres but good enough to discuss token locations. I
might as well try to patch the backend parser, though not sure how soon.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.

The first c parser I wrote,
http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
sophisticated, so token locations reported by yyerr() may be slightly
incorrect (+/- one position depending on the existence and type of the
lookahead token. It is a filter used to typeset the units of measurement
with eqn. To use it, unpack the tar file and run make. The Makefile is
not too generic but I built it on various systems including linux,
freebsd and sunos 4.3. The invocation can be something like this:

./check 0 parse "l**3/(mmoll*min)"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

l**3/(mmoll*min)
      ^^^^^

Now to the guts. As far as I can imagine, the only way to consistently
keep track of each character read by the scanner (regardless of the
length of expressions it will match) is to redefine its YY_INPUT like
this:

#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
{ \
	int c	= (int) buffer[pos++]; \
	result = (c == '\0') ?	YY_NULL	: (buf[0] = c, 1); \
}

Here, buffer is the pointer to the origin of the string being scanned
and pos is a global variable, similar in usage to a file pointer (you
can both read and manipulate it at will). The buffer and the pointer are
initialized by the function

void setString(char *s)
{
   buffer = s;
   pos = 0;
}

each time the new string is to be parsed. This (exportable) function is
part of the interface.

In this simplistic design, yyerror() is part of the scanner module and
it uses the pos variable to report the location of unexpected tokens.
The downside of such arrangement is that in case of error condition, you
can't easily tell whether your context is current or lookahead token, it
just reports the position of the last token read (be it $ (end of
buffer) or something else):

./check 0 convert "mol/foo"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mol/foo
       ^^^

(should be at the beginning of "foo")

./check 0 convert "mmol//l"
parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
`'(''

mmol//l
    ^

(should be at the second '/')


I believe this is why most simple parsers made with yacc would report
parse errors being "at or near" some token, which is fair enough if the
expression is not too complex.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. The second version of the same scanner,
http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this
problem by recording exact locations of the tokens in each instance of
the token semantic data structure. The global,

UNIT_YYSTYPE unit_yylval;

would be normally used to export the token semantics (including its
original or modified text and location data) to the parser.
Unfortunately, I cannot show you the parser part in c, because that's
about when I stopped writing parsers in c. Instead, I included a small
test program, test.c, that mimics the parser's expectations for the
scanner data pretty well. I am assuming here that you are not interested
in digging someone else's ugly guts for relatively small bit of
information; let me know if I am wrong and I will send you the complete
perl code (also generated with bison).

To run this example, unpack the tar file and run Make. Then do

  gcc test.c scanner.o

and run a.out

Note the line

    yylval = unit_getyylval();

in test.c. You will not normally need it in a c parser. It is enough to
define yylval as an external variable and link it to yylval in yylex()

In the bison-generated parser, yylval gets pushed into a stack (pointed
to by yylsp) each time a new token is read. For each syntax rule, the
bison macros @1, @2, ... are just shortcuts to locations in the stack 1,
2, ... levels deep. In following code fragment, @3 refers to the
location info for the third term in the rule (INTEGER):

(sorry about perl, but I think you can do the same things in c without
significant changes to your existing parser)

term:           base    {
                        $$ = $1;
                        $$->{'order'} = 1;
                }
        |       base EXP INTEGER {
                        $$ = $1;
                        $$->{'order'} = @3->{'text'};
                        $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
                        if ( $$->{'order'} == 0 ) {
                                yyerror("Error: expecting a non-zero
integer exponent");
                                YYERROR;
                        }
                }


which translates to:

  ($yyn == 10)    && do {
          $yyval = $yyvsa[-1];
          $yyval->{'order'} = 1;
          last SWITCH;
  };

  ($yyn == 11)    && do {
          $yyval = $yyvsa[-3];
          $yyval->{'order'} = $yylsa[-1]->{'text'}
          $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
          if ( $yyval->{'order'} == 0 ) {
                   yyerror("Error: expecting a non-zero integer
exponent");
                   goto yyerrlab1 ;
          }
          last SWITCH;
  };

In c, you will have a bit more complicated pointer arithmetic to adress
the stack, but the usage of objects will be the same. Note here that it
is convenient to keep all information about the token in its location
info, (yylsa, yylsp, yylval, @n), while everything relating to the value
of the expression, or to the parse tree, is better placed in the
semantic stack (yyssa, yyssp, yysval, $n). Also note that in some cases
you can do semantic checks inside rules and report useful messages
before or instead of invoking yyerror();

Finally, it is useful to make the following wrapper function around
external yylex() in order to maintain your own token stack. Unlike the
parser's internal stack which is only as deep as the rule being reduced,
this one can hold all tokens recognized during the current run, and that
can be extremely helpful for error reporting and any transformations you
may need. In this way, you can even scan (tokenize) the whole buffer
before handing it off to the parser (who knows, you may need a token
ahead of what is currently seen by the parser):


sub tokenize {
    undef @tokenTable;
    my ($tok, $text, $name, $unit, $first_line, $first_column,
$last_line, $last_column);

    while ( ($tok = &UnitLex::yylex()) > 0 ) { # this is where the
c-coded yylex is called,
                                               # UnitLex is the perl
extension encapsulating it
       ( $text, $name, $unit, $first_line, $first_column, $last_line,
$last_column ) = &UnitLex::getyylval;
       push(@tokenTable,
           Unit::yyltype->new (
              'token'         => $tok,
              'text'          => $text,
              'name'          => $name,
              'unit'          => $unit,
              'first_line'    => $first_line,
              'first_column'  => $first_column,
              'last_line'     => $last_line,
              'last_column'   => $last_column,
           )
       )
    }

}


It is now a lot easier to handle various state-related problems, such as
backtracking and error reporting. The yylex() function as seen by the
parser might be constructed somewhat like this:

sub yylex {
    $yylloc = $tokenTable[$tokenNo];  # $tokenNo is a global; now
instead of a "file pointer",
                                      # as in the first example, we have
a "token pointer"
    undef $yylval;


    # disregard this; name this block "computing semantic values"
    if ( $yylloc->{'token'} == UNIT) {
        $yylval = Unit::Operand->new(
        'unit'  => Unit::Dict::unit($yylloc->{'unit'}),
        'base'  => Unit::Dict::base($yylloc->{'unit'}),
        'scale' => Unit::Dict::scale($yylloc->{'unit'}),
        'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}),
        'loc'   => $yylloc,
       );
    }
    elsif ( ($yylloc->{'token'} == INTEGER ) || ($yylloc->{'token'} ==
POSITIVE_NUMBER) ) {
        $yylval = Unit::Operand->new(
          'unit' => '1',
          'base' => '1',
          'scale' => 1,
          'scaleToBase' => 1,
          'loc'   => $yylloc,
        );
    }

    $tokenNo++;
    return(%{$yylloc}->{'token'}); # This is all the parser needs to
know about this token.
                                   # But we already made sure we saved
everything we need to know.
}


Now the most interesting part, the error reporting routine:


sub yyerror {
    my ($str) = @_;
    my ($message, $start, $end, $loc);

    $loc = $tokenTable[$tokenNo-1]; # This is the same as to say,
                                    # "obtain the location info for the
current token"

    # You may use this routine for your own purposes or let parser use
it
    if( $str ne 'parse error' ) {
        $message = "$str instead of `" . $loc->{'name'} . "' <" .
$loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\
n";
    }
    else {
        $message = "unexpected token `" . $loc->{'name'} . "' <" .
$loc->{'text'} . ">,  at line " . loc->{'first_line'} . ":\n
\n";
    }

    $message .= $parseBuffer . "\n"; # that's the original string that
was used to set the parser buffer

    $message .= ( ' ' x ($loc->{'first_column'} + 1) ) . ( '^' x
length($loc->{'text'}) ). "\n";
    if( $str ne 'parse error' ) {
        print STDERR "$str instead of `", $loc->{'name'}, "' {",
$loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }
    else {
        print STDERR "unexpected token `", $loc->{'name'}, "' {",
$loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
    }

    print STDERR "$parseBuffer\n";
    print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x
length($loc->{'text'}), "\n";
}

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scanners used in these examples assume there is a single line of text on
the input (the first_line and last_line elements of yylloc are simply
ignored). If you want to be able to parse multi-line buffers, just add a
lex rule for '\n' that will increment the line count and reset the pos
variable to zero.


Ugly as it may seem, I find this approach extremely liberating. If the
grammar becomes too complicated for a LALR(1) parser, I can cascade
multiple parsers. The token table can then be used to reassemble parts
of original expression for subordinate parsers, preserving the location
info all the way down, so that subordinate parsers can report their
problems consistently. You probably don't need this, as SQL is very well
thought of and has parsable grammar. But it may be of some help, for
error reporting.


--Gene