Directory

KSES: Preserve some additional invalid HTML comment syntaxes. by dmsnell · Pull Request #6395 · WordPress/wordpress-develop · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Apr 15, 2024

Trac ticket: Core-61009

Status

The one test failure appears to be related to deprecating the outdated IE comment syntax. It's not a problem because the test enshrines a mis-parsing of special HTML comments.

This change fixes that mis-parsing and will prevent further corruption.

Description

When wp_kses_split processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.

In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:

  • HTML comments with invalid syntax, <!-->, <!-- --!>, etc…
  • HTML closing tags whose tag name is invalid </3>, </%happy>, etc…
  • Things that look like XML CDATA sections, <![CDATA[…]]>
  • Things that look like XML Processor Instruction nodes, <?include "blarg">

This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:

  • HTML comments with the incorrect closer --!>, because this one was a simple and easy change.
  • Closing tags with an invalid tag name, e.g. </%dolly>, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"

These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.

On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.

Discussion

Concerning the removed kses test, as far as I can tell there is no support for conditional comments like these after IE9, which I believe we don't support anymore. All modern browsers will interpret the entire span of content as a single comment.

Unfortunately there's a twist to the attack which follow best-practice recommendations for conditional comments, which is to add a short abruptly-closed comment after the opening.

<!--[if gte IE 4]><!-->
<SCRIPT>alert('XSS');</SCRIPT>
<![endif]-->

The addition of <!--> ends the opening comment and re-reveals the <SCRIPT> tag. However, this should not affect the security of WordPress because it will be looking for the SCRIPT tags through other code paths after the removal of this restriction.

Warning: Should verify this in a clear way.

When `wp_kses_split` processes a document it attempts to leave HTML
comments relatively alone. It makes minor adjustments, but leaves the
comments in the document in its output.

Unfortunately it only recognizes one kind of HTML comment and rejects
many other kinds which appear as the result of various invalid HTML
markup.

This patch makes a minor adjustment to the algorithm in `wp_kses_split`
to allow two additional kinds of HTML comments:

 - HTML comments with the incorrect closer `--!>`.
 - Closing tags with an invalid tag name, e.g. `</%dolly>`.

In an HTML parser these all become comments, and so leaving them in the
document should be a benign operation, improving the reliability of
detecting comments in Core. These invalid closing tags, which in a
browser are interpreted as comments, are one proposal for a placeholder
mechanism in the HTML API unlocking HTML templating, a new kind of
shortcode, and more. Having these persist in Core is a requirement for
exploring and utilizing the new syntax.
Copy link

github-actions bot commented Apr 15, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, zieladam, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@adamziel
Copy link
Contributor

The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:

Conditional comments are conditional statements interpreted by Microsoft Internet Explorer versions 5 through 9 in HTML source code

https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/

* it should be interpreted as an HTML comment. It extends until the
* first `>` character after the initial opening `</`.
*/
if ( 1 === preg_match( '~</[^a-zA-Z][^>]*>~', $content ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this bales out an returns $content if there is a match anywhere in the string. Is this expected? Would this kick in if that's inside of an attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call: I think this needs anchors. I'll review it and add them accordingly

@adamziel
Copy link
Contributor

adamziel commented Apr 16, 2024

Other than my preg_match question, this LGTM. I'd just remove the failing test case, conditional comments were only relevant up to IE9 which WordPress doesn't support anymore. And even if it did, the new output doesn't seem like an XSS vector. I'd still get inputs from others, cc @westonruter

@westonruter
Copy link
Member

Confirmed that Chrome parses these as comments:

<!-- comment 1 -->
foo
<!-- comment 2 --!>
bar
</%comment3>
baz

image

How common are these other invalid comments?

I presume you'll be adding tests specifically for the changes here?

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 17, 2024

How common are these other invalid comments?

I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch only addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.

I presume you'll be adding tests specifically for the changes here?

Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular.

In a local branch I was playing with replacing wp_kses_split2 using the HTML API and it provides some great affordances, but as usual, far more questions to resolve.

For instance, WordPress currently strips away <:::>, interpreting it as an invalid tag. It's not though, it's legitimate plaintext despite being invalid markup. So now what do we do?

  • Do we attempt to maintain this legacy behavior that corrupts documents that otherwise are fine?
  • Do we preserve this legitimate HTML and allow its surprising form to thread through a stack of code expecting more familiar HTML?
  • Do we eagerly replace confusing HTML like this before passing it along into WordPress, taking a performance hit to provide regularity and mitigate further misinterpretation by later code on the render chain?

There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required.

@westonruter
Copy link
Member

This patch only addresses a single kind of invalid comment leaving most other constructions alone.

Is there a list of all the possible invalid comment syntaxes?

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 18, 2024

Is there a list of all the possible invalid comment syntaxes?

They are not listed separated like this but you can find them in the parsing errors in the specification.

The Tag Processor draws its own classification which is retrievable with get_comment_type(), whereas it distinguishes:

  • Comments that are basically well-formed.
  • Comments that were abruptly closed.
  • Comments that appear as though someone wanted to write a CDATA section.
  • Comments that appear as though someone wanted to write an SGML processing instruction.
  • Comments that form from a variety of other invalid HTML constructions.

It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders.

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 19, 2024

How common are these other invalid comments?

@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at / for the domain.

invalid-comments.txt

This file has ANSI terminal codes for clearer output. You can cat it and see colors…

Screenshot 2024-04-19 at 6 47 42 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants