Improve parser's reporting of statement start locations.

Up to now, the parser's reporting of a statement's stmt_location
included any preceding whitespace or comments.  This isn't really
desirable but was done to avoid accounting honestly for nonterminals
that reduce to empty.  It causes problems for pg_stat_statements,
which partially compensates by manually stripping whitespace, but
is not bright enough to strip /*-style comments.  There will be
more problems with an upcoming patch to improve reporting of errors
in extension scripts, so it's time to do something about this.

The thing we have to do to make it work right is to adjust
YYLLOC_DEFAULT to scan the inputs of each production to find the
first one that has a valid location (i.e., did not reduce to
empty).  In theory this adds a little bit of per-reduction overhead,
but in practice it's negligible.  I checked by measuring the time
to run raw_parser() on the contents of information_schema.sql, and
there was basically no change.

Having done that, we can rely on any nonterminal that didn't reduce
to completely empty to have a correct starting location, and we don't
need the kluges the stmtmulti production formerly used.

This should have a side benefit of allowing parse error reports to
include an error position in some cases where they formerly failed to
do so, due to trying to report the position of an empty nonterminal.
I did not go looking for an example though.  The one previously known
case where that could happen (OptSchemaEltList) no longer needs the
kluge it had; but I rather doubt that that was the only case.

Discussion: https://postgr.es/m/ZvV1ClhnbJLCz7Sm@msg.df7cb.de
This commit is contained in:
Tom Lane 2024-10-22 11:26:05 -04:00
parent 7c4d3fe272
commit 14e5680eee
4 changed files with 34 additions and 46 deletions

View File

@ -19,8 +19,9 @@ SELECT 1 AS "int";
1
(1 row)
/* this comment should not appear in the output */
SELECT 'hello'
-- multiline
-- but this one will appear
AS "text";
text
-------
@ -129,7 +130,7 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
-------+------+------------------------------------------------------------------------------
1 | 1 | PREPARE pgss_test (int) AS SELECT $1, $2 LIMIT $3
4 | 4 | SELECT $1 +
| | -- multiline +
| | -- but this one will appear +
| | AS "text"
2 | 2 | SELECT $1 + $2
3 | 3 | SELECT $1 + $2 + $3 AS "add"

View File

@ -12,8 +12,9 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
--
SELECT 1 AS "int";
/* this comment should not appear in the output */
SELECT 'hello'
-- multiline
-- but this one will appear
AS "text";
SELECT 'world' AS "text";

View File

@ -90,6 +90,12 @@ CleanQuerytext(const char *query, int *location, int *len)
/*
* Discard leading and trailing whitespace, too. Use scanner_isspace()
* not libc's isspace(), because we want to match the lexer's behavior.
*
* Note: the parser now strips leading comments and whitespace from the
* reported stmt_location, so this first loop will only iterate in the
* unusual case that the location didn't propagate to here. But the
* statement length will extend to the end-of-string or terminating
* semicolon, so the second loop often does something useful.
*/
while (query_len > 0 && scanner_isspace(query[0]))
query++, query_location++, query_len--;

View File

@ -67,39 +67,25 @@
/*
* Location tracking support --- simpler than bison's default, since we only
* want to track the start position not the end position of each nonterminal.
* Location tracking support. Unlike bison's default, we only want
* to track the start position not the end position of each nonterminal.
* Nonterminals that reduce to empty receive position "-1". Since a
* production's leading RHS nonterminal(s) may have reduced to empty,
* we have to scan to find the first one that's not -1.
*/
#define YYLLOC_DEFAULT(Current, Rhs, N) \
do { \
if ((N) > 0) \
(Current) = (Rhs)[1]; \
else \
(Current) = (-1); \
(Current) = (-1); \
for (int _i = 1; _i <= (N); _i++) \
{ \
if ((Rhs)[_i] >= 0) \
{ \
(Current) = (Rhs)[_i]; \
break; \
} \
} \
} while (0)
/*
* The above macro assigns -1 (unknown) as the parse location of any
* nonterminal that was reduced from an empty rule, or whose leftmost
* component was reduced from an empty rule. This is problematic
* for nonterminals defined like
* OptFooList: / * EMPTY * / { ... } | OptFooList Foo { ... } ;
* because we'll set -1 as the location during the first reduction and then
* copy it during each subsequent reduction, leaving us with -1 for the
* location even when the list is not empty. To fix that, do this in the
* action for the nonempty rule(s):
* if (@$ < 0) @$ = @2;
* (Although we have many nonterminals that follow this pattern, we only
* bother with fixing @$ like this when the nonterminal's parse location
* is actually referenced in some rule.)
*
* A cleaner answer would be to make YYLLOC_DEFAULT scan all the Rhs
* locations until it's found one that's not -1. Then we'd get a correct
* location for any nonterminal that isn't entirely empty. But this way
* would add overhead to every rule reduction, and so far there's not been
* a compelling reason to pay that overhead.
*/
/*
* Bison doesn't allocate anything that needs to live across parser calls,
* so we can easily have it use palloc instead of malloc. This prevents
@ -930,7 +916,7 @@ parse_toplevel:
| MODE_PLPGSQL_EXPR PLpgSQL_Expr
{
pg_yyget_extra(yyscanner)->parsetree =
list_make1(makeRawStmt($2, 0));
list_make1(makeRawStmt($2, @2));
}
| MODE_PLPGSQL_ASSIGN1 PLAssignStmt
{
@ -938,7 +924,7 @@ parse_toplevel:
n->nnames = 1;
pg_yyget_extra(yyscanner)->parsetree =
list_make1(makeRawStmt((Node *) n, 0));
list_make1(makeRawStmt((Node *) n, @2));
}
| MODE_PLPGSQL_ASSIGN2 PLAssignStmt
{
@ -946,7 +932,7 @@ parse_toplevel:
n->nnames = 2;
pg_yyget_extra(yyscanner)->parsetree =
list_make1(makeRawStmt((Node *) n, 0));
list_make1(makeRawStmt((Node *) n, @2));
}
| MODE_PLPGSQL_ASSIGN3 PLAssignStmt
{
@ -954,19 +940,15 @@ parse_toplevel:
n->nnames = 3;
pg_yyget_extra(yyscanner)->parsetree =
list_make1(makeRawStmt((Node *) n, 0));
list_make1(makeRawStmt((Node *) n, @2));
}
;
/*
* At top level, we wrap each stmt with a RawStmt node carrying start location
* and length of the stmt's text. Notice that the start loc/len are driven
* entirely from semicolon locations (@2). It would seem natural to use
* @1 or @3 to get the true start location of a stmt, but that doesn't work
* for statements that can start with empty nonterminals (opt_with_clause is
* the main offender here); as noted in the comments for YYLLOC_DEFAULT,
* we'd get -1 for the location in such cases.
* We also take care to discard empty statements entirely.
* and length of the stmt's text.
* We also take care to discard empty statements entirely (which among other
* things dodges the problem of assigning them a location).
*/
stmtmulti: stmtmulti ';' toplevel_stmt
{
@ -976,14 +958,14 @@ stmtmulti: stmtmulti ';' toplevel_stmt
updateRawStmtEnd(llast_node(RawStmt, $1), @2);
}
if ($3 != NULL)
$$ = lappend($1, makeRawStmt($3, @2 + 1));
$$ = lappend($1, makeRawStmt($3, @3));
else
$$ = $1;
}
| toplevel_stmt
{
if ($1 != NULL)
$$ = list_make1(makeRawStmt($1, 0));
$$ = list_make1(makeRawStmt($1, @1));
else
$$ = NIL;
}
@ -1584,8 +1566,6 @@ CreateSchemaStmt:
OptSchemaEltList:
OptSchemaEltList schema_stmt
{
if (@$ < 0) /* see comments for YYLLOC_DEFAULT */
@$ = @2;
$$ = lappend($1, $2);
}
| /* EMPTY */