103 Commits

Author SHA1 Message Date
Nick Wellnhofer
8cf6129bbd html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both the pull and push parsers.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
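
The reordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not libxml2's C code: `split_leading_whitespace` and `HTML_WHITESPACE` are hypothetical names, and the whitespace set is assumed to be the HTML5 ASCII whitespace characters.

```python
# Hypothetical helper, not part of libxml2.
HTML_WHITESPACE = " \t\n\f\r"  # assumption: HTML5 ASCII whitespace


def split_leading_whitespace(text):
    """Return (leading_whitespace, rest) for a run of character data."""
    rest = text.lstrip(HTML_WHITESPACE)
    return text[:len(text) - len(rest)], rest


# For the character data that follows "<head>" in the example above:
leading, rest = split_leading_whitespace("   x")
# `leading` would be added where the parser currently is (inside <head>);
# only `rest` would go on to imply <body> and <p>.
print(repr(leading), repr(rest))
```

Splitting first means the decision to create implied tags only depends on the first non-whitespace character, which should be less sensitive to where a chunk boundary falls.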
2025-02-13 14:31:44 +01:00
Nick Wellnhofer
080285724b html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
f77ec16db0 html: Optimize htmlParseCharData 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
575be6c1f1 html: Fix line numbers with CRs 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.

Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
9678163f54 html: Don't check for valid XML characters 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
4eeac30944 html: Start to fix EOF and U+0000 handling 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522 html: Normalize newlines 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87 html: Parse bogus comments instead of ignoring them
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e1834745e0 html: Add character data tests 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972 html: HTML5 character data states 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239 html: Parse named character references according to HTML5 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9 html: Allow attributes in end tags
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
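
As a rough Python sketch of what "allowed but ignored" means for an end tag (the function and pattern names are hypothetical, and the pattern is deliberately simplified compared with the HTML5 tokenizer):

```python
import re

# Simplification: a real HTML5 tokenizer also copes with '>' inside quoted
# attribute values; this pattern stops at the first '>'.
_END_TAG = re.compile(r"</([A-Za-z][^\s/>]*)[^>]*>")


def parse_end_tag(markup):
    """Return the lowercased name of an end tag, discarding any attributes."""
    match = _END_TAG.match(markup)
    return match.group(1).lower() if match else None


print(parse_end_tag('</P class="x">'))
```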
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dcb2abb2fe html: Parse tag and attribute names according to HTML5
HTML5 allows basically all characters in tag and attribute names.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e395946194 html: Reenable buggy detection of XML declarations
Switch to UTF-8 if a document starts with '<?xm' to match old behavior.
Also enable this check in the push parser.

Fixes #637.
2023-11-30 16:22:59 +01:00
Nick Wellnhofer
d7d0bc6581 SAX2: Ignore namespaces in HTML documents
In commit 21ca8829, we started to ignore namespaces in HTML element
names but we still called xmlSplitQName, effectively stripping the
namespace prefix. This caused elements like <o:p> to be parsed
as <p>. Now we leave the name untouched.

Fixes #508.
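
The difference between the old and new behavior can be illustrated with a small Python sketch. All function names here are hypothetical, and `split_qname` only approximates what a QName split does; it is not a port of xmlSplitQName.

```python
def split_qname(name):
    """Split 'prefix:local' into (prefix, local); no colon means no prefix."""
    prefix, sep, local = name.partition(":")
    return (prefix, local) if sep else (None, name)


def html_element_name_old(name):
    # Old behavior: the prefix was dropped, so <o:p> ended up as <p>.
    return split_qname(name)[1]


def html_element_name_new(name):
    # New behavior: the name is left untouched.
    return name


print(html_element_name_old("o:p"), html_element_name_new("o:p"))
```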
2023-03-31 17:08:43 +02:00
Nick Wellnhofer
76c6da4209 error: Make sure that error messages are valid UTF-8
This has caused issues with the Python bindings for a long time.

Should fix #64.
2022-12-04 23:34:19 +01:00
Nick Wellnhofer
76d6b0d768 html: Don't escape ASCII chars in href attributes
In several cases, href attributes can contain ASCII characters which are
illegal in URIs. Escaping them often does more harm than good.

Fixes #321.
2022-11-20 21:16:03 +01:00
Nick Wellnhofer
e986d09cf5 Skip incorrectly opened HTML comments
Commit 4fd69f3e fixed handling of '<' characters not followed by an
ASCII letter. But a '<!' sequence followed by invalid characters should
be treated as bogus comment and skipped.

Fixes #380.
2022-08-02 14:38:09 +02:00
Nick Wellnhofer
f1c32b4c78 Allow missing result files in runtest
Treat missing files as empty.
2022-04-04 04:28:15 +02:00
Mike Dalessio
d7b287b94c htmlParseComment: handle abruptly-closed comments
See guidance provided on abruptly-closed comments here:

https://html.spec.whatwg.org/multipage/parsing.html#parse-error-abrupt-closing-of-empty-comment
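
For illustration, the two abruptly-closed forms covered by that part of the HTML5 spec are `<!-->` and `<!--->`, both of which yield an empty comment. A minimal Python sketch of recognizing them follows; the function name is hypothetical and this is not the logic of htmlParseComment itself.

```python
def abruptly_closed_comment_length(markup):
    """Return the length of an abruptly-closed empty comment at the start
    of `markup`, or 0 if the markup does not start with one."""
    for form in ("<!-->", "<!--->"):
        if markup.startswith(form):
            return len(form)
    return 0


print(abruptly_closed_comment_length("<!-->text"))
```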
2022-03-02 14:42:47 +00:00
Mike Dalessio
24cdc89006 test coverage for abruptly-closed comments
These establish baseline behavior so that the subsequent commit is
clear about the behavior it will modify.
2022-03-02 14:42:47 +00:00
Nick Wellnhofer
2732b23466 Fix regression parsing public IDs literals in HTML
Fix regression introduced when reworking htmlParsePubidLiteral in
commit 93ce33c2.

Fixes #318.
2022-01-10 13:37:59 +01:00
Mike Dalessio
a67b63d183 use new htmlParseLookupCommentEnd to find comment ends
Note that the caret in error messages generated during comment parsing
may have moved by one byte.

See guidance provided on incorrectly-closed comments here:

https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment
2020-12-16 16:12:07 +01:00
Mike Dalessio
29f5d20e84 htmlParseComment: treat --!> as if it closed the comment
See guidance provided on incorrectly-closed comments here:

https://html.spec.whatwg.org/multipage/parsing.html#parse-error-incorrectly-closed-comment
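
A minimal sketch of the idea in Python, assuming the comment body is searched for whichever of `-->` or `--!>` comes first. The function name is hypothetical and the code is not taken from HTMLparser.c.

```python
def find_comment_end(markup, start=0):
    """Return (index, terminator_length) of the first comment terminator at
    or after `start`, or None if the comment is not closed yet."""
    best = None
    for terminator in ("-->", "--!>"):
        pos = markup.find(terminator, start)
        if pos >= 0 and (best is None or pos < best[0]):
            best = (pos, len(terminator))
    return best


# Search starts after the opening "<!--" (4 characters).
print(find_comment_end("<!-- a --!> b -->", 4))
```

Returning the terminator length along with the index lets the caller skip the right number of characters for either form.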
2020-12-16 16:12:07 +01:00
Mike Dalessio
e28d9347bc add test coverage for incorrectly-closed comments
this establishes the baseline behavior so that subsequent commits
which modify this behavior are clear about what's being changed.
2020-12-16 16:12:07 +01:00
Nick Wellnhofer
93ce33c2b8 Fix several quadratic runtime issues in HTML push parser
Fix a few remaining cases where the HTML push parser would scan more
content during lookahead than was parsed later.

Make sure that htmlParseDocTypeDecl consumes all content up to the
final '>' in case of errors. The old comment said "We shouldn't try to
resynchronize", but ignoring invalid content is also what the HTML5
spec mandates.

Likewise, make htmlParseEndTag skip to the final '>' in invalid end
tags even if not in recovery mode. This is probably the most visible
change in practice and leads to different output for some tests but is
also more in line with HTML5.

Make sure that htmlParsePI and htmlParseComment don't abort if invalid
characters are encountered but log an error and ignore the character.

Change some other end-of-buffer checks to test for a zero byte instead
of relying on IS_CHAR.

Fix usage of IS_CHAR macro in htmlParseScript.
2020-07-23 20:47:35 +02:00
Nick Wellnhofer
477c7f6aff Fix quadratic runtime in HTML parser
Commit eeb99329 removed an important optimization avoiding quadratic
runtime when repeatedly scanning the input buffer for terminating
characters in the HTML push parser. The related bug is

    https://bugzilla.gnome.org/show_bug.cgi?id=444994

Make sure that ctxt->checkIndex is always written and store additional
parser state in ctxt->inSubset which is unused in the HTML parser.

Found by OSS-Fuzz.
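
The optimization can be sketched as follows in Python. The class is hypothetical; `check_index` only mirrors the role that the message attributes to ctxt->checkIndex, namely remembering how far the buffer has already been scanned so that each new chunk does not trigger a rescan from the beginning.

```python
class IncrementalLookup:
    """Search a growing buffer for a terminator without rescanning old data."""

    def __init__(self, terminator):
        self.terminator = terminator
        self.buffer = ""
        self.check_index = 0  # data before this offset was already scanned

    def feed(self, chunk):
        """Append `chunk`; return the terminator's index, or -1 if absent."""
        self.buffer += chunk
        # Back up a little so a terminator that straddles the previous
        # chunk boundary is still found.
        start = max(0, self.check_index - (len(self.terminator) - 1))
        pos = self.buffer.find(self.terminator, start)
        if pos < 0:
            self.check_index = len(self.buffer)
            return -1
        self.check_index = 0
        return pos


lookup = IncrementalLookup("-->")
for piece in ("<!-- a -", "-", "> tail"):
    print(lookup.feed(piece))
```

Without the stored index, every call would search from offset 0, which is where the quadratic behavior comes from when data arrives in many small chunks.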
2020-07-06 12:17:20 +02:00
Nick Wellnhofer
0b2d5c48e3 Initialize keepBlanks in HTML parser
This caused failures in the HTML push tests but the fix required to
change the expected output of the HTML SAX tests.
2017-06-12 19:11:54 +02:00
David Kilzer
85c112a082 Add test cases for bug 758518
test/HTML/758518-entity.html exposed a bug in pushParseTest() in
runtest.c which assumed that an input file was at least 4 bytes long.
That test case is only 3 bytes, so we now take the minimum of 4 bytes
or the length of the test input.  We also now use 'chunkSize' in place
of the hard-coded value '1024' later in the function.
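
A rough Python sketch of the chunking described above. The function and parameter names are hypothetical; the real code lives in pushParseTest() in runtest.c and is written in C.

```python
def split_for_push(data, chunk_size=1024, probe=4):
    """Split `data` into an initial probe (at most `probe` bytes) and the
    remaining chunks of at most `chunk_size` bytes each."""
    first = data[:min(probe, len(data))]
    rest = data[len(first):]
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return first, chunks


# A 3-byte input no longer assumes that 4 bytes are available.
print(split_for_push(b"&a;"))
```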
2017-06-12 18:26:11 +02:00
Pranjal Jumde
0bcd05c5cd Heap-based buffer overread in htmlCurrentChar
For https://bugzilla.gnome.org/show_bug.cgi?id=758606

* parserInternals.c:
(xmlNextChar): Add a test to catch other issues on ctxt->input
corruption proactively.
For non-UTF-8 charsets, xmlNextChar() failed to check for the end
of the input buffer and would continue reading.  Fix this by
pulling out the check for the end of the input buffer into common
code, and return if we reach the end of the input buffer
prematurely.
* result/HTML/758606.html: Added.
* result/HTML/758606.html.err: Added.
* result/HTML/758606.html.sax: Added.
* result/HTML/758606_2.html: Added.
* result/HTML/758606_2.html.err: Added.
* result/HTML/758606_2.html.sax: Added.
* test/HTML/758606.html: Added test case.
* test/HTML/758606_2.html: Added test case.
2016-05-23 15:01:07 +08:00
Hugh Davenport
beca86e8c8 Detect change of encoding when parsing HTML names
From https://bugzilla.gnome.org/show_bug.cgi?id=758518

Happens when a file has a name getting parsed, but no valid encoding
set, so libxml has to guess what the encoding is. This patch detects
when the buffer location changes, and if it does, restarts the parsing
of the name.

This slightly changes the output of a couple of regression tests.
2016-05-23 15:01:07 +08:00
Pranjal Jumde
a820dbeac2 Bug 758605: Heap-based buffer overread in xmlDictAddString <https://bugzilla.gnome.org/show_bug.cgi?id=758605>
Reviewed by David Kilzer.

* HTMLparser.c:
(htmlParseName): Add bounds check.
(htmlParseNameComplex): Ditto.
* result/HTML/758605.html: Added.
* result/HTML/758605.html.err: Added.
* result/HTML/758605.html.sax: Added.
* runtest.c:
(pushParseTest): The input for the new test case was so small
(4 bytes) that htmlParseChunk() was never called after
htmlCreatePushParserCtxt(), thereby creating a false positive
test failure.  Fixed by using a do-while loop so we always call
htmlParseChunk() at least once.
* test/HTML/758605.html: Added.
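
The do-while fix for pushParseTest can be approximated in Python, which has no do-while, with a loop that always runs its body once. `push_rest` and `parse_chunk` are hypothetical names; `parse_chunk` stands in for a call such as htmlParseChunk() with a terminate flag.

```python
def push_rest(data, start, chunk_size, parse_chunk):
    """Feed data[start:] to `parse_chunk(chunk, terminate)` in pieces,
    calling it at least once even when nothing is left to feed."""
    cur = start
    while True:
        size = min(chunk_size, len(data) - cur)
        terminate = cur + size >= len(data)
        parse_chunk(data[cur:cur + size], terminate)
        cur += size
        if terminate:
            break


calls = []
# A 4-byte input whose first 4 bytes were already consumed at creation time.
push_rest(b"<ab>", 4, 1024, lambda chunk, last: calls.append((chunk, last)))
print(calls)
```

With a plain `while cur < len(data)` loop the callback would never run for this input, so the parser would never be told that the document has ended.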
2016-05-23 15:01:07 +08:00
Daniel Veillard
f933c89813 Keep non-significant blanks node in HTML parser
For https://bugzilla.gnome.org/show_bug.cgi?id=681822

Regardless of whether the option HTML_PARSE_NOBLANKS is set, blank nodes
are removed from an HTML document, for example:

<html>
  <head>
    <title>This is a test.</title>
  </head>
  <body>
    <p>This is a test.</p>
  </body>
</html>

is read as:

<html><head><title>This is a test.</title></head><body>
    <p>This is a test.</p>
  </body></html>

This changes the default behaviour but the old behaviour is available
as expected when using the parser flag HTML_PARSE_NOBLANKS

Based on original patch from Igor Ignatyuk <igor_ignatiouk@hotmail.com>

* HTMLparser.c: change various places in the parser where ignorable_space
  SAX callback was called without checking for the parser flag preference
* xmllint.c: make sure we use the new flag even for HTML parsing
* result/HTML/*: this modifies the output of a number of tests
2012-09-07 19:32:12 +08:00
Denis Pauk
a0cd075d94 HTML parser error with <noscript> in the <head>
For https://bugzilla.gnome.org/show_bug.cgi?id=615785
When the <noscript> is found, <head> is closed and a <body> element is created.
The real <body id="xxx"> gets skipped over, so I can't see any of the
body's attributes.
Just don't close <head> when encountering a <noscript>
Add a regression test too
2012-05-11 19:31:12 +08:00
Denis Pauk
868d92da89 Add HTML parser support for HTML5 meta charset encoding declaration
For https://bugzilla.gnome.org/show_bug.cgi?id=655218

http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element

"""
The charset attribute specifies the character encoding used by the document.
This is a character encoding declaration. If the attribute is present in an XML
document, its value must be an ASCII case-insensitive match for the string
"UTF-8" (and the document is therefore forced to use UTF-8 as its
encoding).
"""

However, while <meta http-equiv="Content-Type" content="text/html;
charset=utf8"> works, <meta charset="utf8"> does not.

While libxml2 HTML parser is not tuned for HTML5, this is a simple
addition

Also added a testcase
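
To illustrate the two declaration forms mentioned above, here is a simplified Python sketch that extracts the charset from either one. The names are hypothetical, the patterns assume a fixed attribute order, and this is not how HTMLparser.c implements the detection.

```python
import re

# HTML5 form: <meta charset="...">
_META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

# Older form: <meta http-equiv="Content-Type" content="...; charset=...">
_HTTP_EQUIV = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    r'content\s*=\s*["\'][^"\']*charset\s*=\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE)


def sniff_meta_charset(html):
    """Return the declared charset name, or None if no declaration is found."""
    for pattern in (_META_CHARSET, _HTTP_EQUIV):
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


print(sniff_meta_charset('<meta charset="utf8">'))
```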
2012-05-10 15:34:57 +08:00
Daniel Veillard
3c080d6d72 Don't give default HTML boolean attribute values in parser
* HTMLparser.c: don't default value of HTML boolean attributes in the
  parser
* SAX2.c: move this to SAX2 tree building backend
* result/HTML/doc2.htm.sax result/HTML/doc3.htm.sax
  result/HTML/wired.html.sax: this changes a few HTML SAX regression
  tests
2010-03-15 15:47:50 +01:00
Daniel Veillard
a57ba4ce96 fix an HTML parsing error on large data sections reported by Mike Day add
* HTMLparser.c: fix an HTML parsing error on large data sections
  reported by Mike Day
* test/HTML/utf8bug.html result/HTML/utf8bug.html.err
  result/HTML/utf8bug.html.sax result/HTML/utf8bug.html: add the
  reproducer to the test suite
daniel

svn path=/trunk/; revision=3797
2008-09-25 16:06:18 +00:00
Daniel Veillard
42720248e6 change the way script/style are parsed to not try to detect comments,
* HTMLparser.c: change the way script/style are parsed to
  not try to detect comments, reported by Mike Day
* result/HTML/doc3.*: affects the result of that test
Daniel

svn path=/trunk/; revision=3598
2007-04-16 07:02:31 +00:00
Daniel Veillard
c47d263049 fixing HTML minimized attribute values to be generated internally if not
* HTMLparser.c: fixing HTML minimized attribute values to be generated
  internally if not present, fixes bug #332124
* result/HTML/doc2.htm.sax result/HTML/doc3.htm.sax
  result/HTML/wired.html.sax: this affects the SAX event stream for
  a few test cases
Daniel
2006-10-17 16:13:27 +00:00
Daniel Veillard
48519092e5 fixing HTML entities in attributes parsing bug #362552 added to the
* HTMLparser.c: fixing HTML entities in attributes parsing bug #362552
* result/HTML/entities2.html* test/HTML/entities2.html: added to
  the regression suite
Daniel
2006-10-17 15:56:35 +00:00
Daniel Veillard
b990008f05 script HTML parser error fix, corrects bug #319715 added test from Michael
* HTMLparser.c: script HTML parser error fix, corrects bug #319715
* result/HTML/53867* test/HTML/53867.html: added test from Michael Day
  to the regression suite
Daniel
2005-10-25 12:36:29 +00:00
Daniel Veillard
36d73403ff Applied the last patch from Gary Coady for #304637 changing the behaviour
* HTMLparser.c: Applied the last patch from Gary Coady for #304637
  changing the behaviour when text nodes are found in body
* result/HTML/*: this changes the output of some tests
Daniel
2005-09-01 09:52:30 +00:00
Daniel Veillard
b8c8016044 fixed bug #310333 with a patch close to the provided patch for HTML UTF-8
* HTMLtree.c: fixed bug #310333 with a patch close to the provided
  patch for HTML UTF-8 serialization
* result/HTML/script2.html: this changed the output of that test
Daniel
2005-08-08 13:46:45 +00:00
Daniel Veillard
358fef4b1e applied UTF-8 script parsing bug #310229 fix from Jiri Netolicky added the
* HTMLparser.c: applied UTF-8 script parsing bug #310229 fix from
  Jiri Netolicky
* result/HTML/script2.html* test/HTML/script2.html: added the test
  case from the regression suite
Daniel
2005-07-13 16:37:38 +00:00
Daniel Veillard
597f1c1f34 applied patch from James Bursa fixing an html parsing bug in push mode
* HTMLparser.c: applied patch from James Bursa fixing an html parsing
  bug in push mode
* result/HTML/repeat.html* test/HTML/repeat.html: added the test to the
  regression suite
Daniel
2005-07-03 23:00:18 +00:00
Daniel Veillard
fc484dd0a0 added support for HTML PIs #156087 added specific tests Daniel
* HTMLparser.c: added support for HTML PIs #156087
* test/HTML/python.html result/HTML/python.html*: added specific tests
Daniel
2004-10-22 14:34:23 +00:00
Daniel Veillard
18a65095e0 fix to the fix for #141864 from Paul Elseth apply fix from David Gatwood
* xmlIO.c: fix to the fix for #141864 from Paul Elseth
* HTMLparser.c result/HTML/doc3.htm: apply fix from David Gatwood for
  #141195 about text between comments.
Daniel
2004-05-11 15:57:42 +00:00
Daniel Veillard
42fd412637 change --html to make sure we use the HTML serialization rule by default
* xmllint.c: change --html to make sure we use the HTML serialization
  rule by default when HTML parser is used, add --xmlout to allow to
  force the XML serializer on HTML.
* HTMLtree.c: ugly tweak to fix the output on <p> element and
  solve #125093
* result/HTML/*: this changes the output of some tests
Daniel
2003-11-04 08:47:48 +00:00