quote: emit invalid UTF-8 rather than just dropping a strange value

If an UTF-8 value exceeds 0x7fffffff, there is no legitimate encoding for it. However, using FE or FF as leading bytes provide at least some kind of encoding. This is assembly, and the programmer is (almost?) always right. It might be worthwhile to add a suppressible warning for invalid UTF-8 strings in general, though, including any character > 0x10ffff, surrogates, or a string that is constructed by hand. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2025-01-18 16:25:05 +08:00 · 2019-06-06 17:26:28 -07:00 · 2019-06-06 17:26:28 -07:00 · 10d9589f02
commit 10d9589f02
parent 236f4a832b
1 changed files with 9 additions and 7 deletions
--- a/asm/quote.c
+++ b/asm/quote.c
@ -220,14 +220,16 @@ static unsigned char *emit_utf8(unsigned char *q, uint32_t v)
        goto out4;
    }

+    /*
+     * Note: this is invalid even for "classic" (pre-UTF16) 31-bit
+     * UTF-8 if the value is >= 0x8000000. This at least tries to do
+     * something vaguely sensible with it. Caveat programmer.
+     * The __utf*__ string transform functions do reject these
+     * as invalid input.
+     */
    vb5 = vb4 >> 6;
-    if (vb5 <= 0x01) {
    *q++ = 0xfc + vb5;
    goto out5;
-    }
-
-    /* Otherwise invalid, even with 31-bit "extended Unicode" (pre-UTF-16) */
-    goto out0;

    /* Emit extension bytes as appropriate */
 out5: *q++ = 0x80 + (vb4 & 63);