-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KSES: Preserve some additional invalid HTML comment syntaxes. #6395
base: trunk
Are you sure you want to change the base?
KSES: Preserve some additional invalid HTML comment syntaxes. #6395
Conversation
When `wp_kses_split` processes a document it attempts to leave HTML comments relatively alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many other kinds which appear as the result of various invalid HTML markup. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to allow two additional kinds of HTML comments: - HTML comments with the incorrect closer `--!>`. - Closing tags with an invalid tag name, e.g. `</%dolly>`. In an HTML parser these all become comments, and so leaving them in the document should be a benign operation, improving the reliability of detecting comments in Core. These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax.
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:
https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/ |
* it should be interpreted as an HTML comment. It extends until the | ||
* first `>` character after the initial opening `</`. | ||
*/ | ||
if ( 1 === preg_match( '~</[^a-zA-Z][^>]*>~', $content ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this bales out an returns $content
if there is a match anywhere in the string. Is this expected? Would this kick in if that's inside of an attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call: I think this needs anchors. I'll review it and add them accordingly
Other than my |
I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch only addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.
Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular. In a local branch I was playing with replacing For instance, WordPress currently strips away
There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required. |
Is there a list of all the possible invalid comment syntaxes? |
They are not listed separated like this but you can find them in the parsing errors in the specification. The Tag Processor draws its own classification which is retrievable with
It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders. |
@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at This file has ANSI terminal codes for clearer output. You can |
Trac ticket: Core-61009
Status
The one test failure appears to be related to deprecating the outdated IE comment syntax. It's not a problem because the test enshrines a mis-parsing of special HTML comments.
This change fixes that mis-parsing and will prevent further corruption.
Description
When
wp_kses_split
processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:
<!-->
,<!-- --!>
, etc…</3>
,</%happy>
, etc…<![CDATA[…]]>
<?include "blarg">
This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:
--!>
, because this one was a simple and easy change.</%dolly>
, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.
On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.
Discussion
Concerning the removed
kses
test, as far as I can tell there is no support for conditional comments like these after IE9, which I believe we don't support anymore. All modern browsers will interpret the entire span of content as a single comment.Unfortunately there's a twist to the attack which follow best-practice recommendations for conditional comments, which is to add a short abruptly-closed comment after the opening.
The addition of
<!-->
ends the opening comment and re-reveals the<SCRIPT>
tag. However, this should not affect the security of WordPress because it will be looking for theSCRIPT
tags through other code paths after the removal of this restriction.