HTML API: Add custom text decoder by dmsnell · Pull Request #6387 · WordPress/wordpress-develop · GitHub
Open · wants to merge 22 commits into trunk

Conversation

dmsnell (Contributor)

@dmsnell dmsnell commented Apr 12, 2024

Trac ticket: Core-61072
Token Map Trac ticket: Core-60698

Takes the HTML text decoder from #5337.

Replaces WordPress/gutenberg#47040

Status

The code should now be working and fully spec-compliant.
Tests are largely covered by the html5lib test suite.

Performance

After some initial testing this appears to be around 20% slower at decoding text values than html_entity_decode() in its current state. I tested against a set of 296,046 web pages fetched from the root path of a list of top-ranked domains that I found online.

The decoder itself, when run against a worst-case document containing one million randomly-generated hexadecimal and decimal numeric character references, is much slower than html_entity_decode() (a comparison which is largely moot given that function's reliability issues). html_entity_decode() processes that test file at 195 MB/s while the HTML decoder processes it at 6.40 MB/s. With the pre-decoder patch this rises to 9.78 MB/s.
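For reference, a worst-case fixture like the one described can be produced in a few lines of PHP. This is an illustrative sketch under my own naming (`generate_worst_case` is a hypothetical helper), not the actual generator used for the benchmark:

```php
// Hypothetical sketch of a worst-case benchmark fixture: a long run of
// randomly-generated decimal and hexadecimal numeric character references.
function generate_worst_case( $count ) {
	$text = '';
	for ( $i = 0; $i < $count; $i++ ) {
		$code_point = random_int( 1, 0x10FFFF );
		$text      .= ( 0 === $i % 2 )
			? sprintf( '&#x%X;', $code_point ) // Hexadecimal form.
			: sprintf( '&#%d;', $code_point ); // Decimal form.
	}
	return $text;
}
```

Invalid code points (e.g. surrogates) are fine here, since stress-testing the error paths is part of the point.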
Understanding context is important here because most documents aren't worst-case documents. Typical web performance is reported in this PR based on real web pages from high-ranked domains. The fact that this step is slow does not make page rendering slow, because decoding is such a small part of the whole rendering path. In WordPress the impact would be even smaller because we're not parsing all text content of all HTML on the server; only a fraction of it.

The impact is quite marginal, adding around 60 µs per page. For the set of close to 300k pages this took the total runtime from 87 s to 105 s. I tested with the following main loop, using microtime( true ) before and after the loop to accumulate the total time, in an attempt to eliminate I/O wait from the results. This is a worst-case scenario in which we decode every attribute and every text node. Again, in practice, WordPress would likely only experience a fraction of that 60 µs because it doesn't decode every text node and every attribute of the HTML it ships to a browser.

I attempted to avoid string allocations, which raised a challenge: strpos() doesn't provide a way to stop at a given index. This led me to try replacing it with a simple loop that advances character by character until finding a &. That slowed things down to about 25% slower than html_entity_decode(), so I removed it and instead relied on strpos() with the possibility that it scans far past the end of the value. On the test set of data this was still faster.

$total_code_points = 0;

while ( $p->next_token() ) {
	$token_name = $p->get_token_name();
	$token_type = $p->get_token_type();

	if ( '#tag' === $token_type && ! $p->is_tag_closer() ) {
		foreach ( $p->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
			$chunk = $p->get_attribute( $name );
			if ( is_string( $chunk ) ) {
				$total_code_points += mb_strlen( $chunk );
			}
		}
	}

	$text = $p->get_modifiable_text();
	if ( '' !== $text ) {
		$total_code_points += mb_strlen( $text );
	}
}
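The strpos() tradeoff described above can be sketched as follows. These helpers are illustrative only, not part of the patch:

```php
// Illustrative comparison of the two scanning strategies. strpos() has no
// way to stop at a boundary, so it may scan past $end and the match must
// be rejected afterward; the byte-by-byte loop stops early but runs each
// iteration in userland PHP, which proved slower in practice.
function find_amp_strpos( $html, $at, $end ) {
	$pos = strpos( $html, '&', $at );
	return ( false === $pos || $pos >= $end ) ? false : $pos;
}

function find_amp_loop( $html, $at, $end ) {
	for ( $i = $at; $i < $end; $i++ ) {
		if ( '&' === $html[ $i ] ) {
			return $i;
		}
	}
	return false;
}
```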

For comparison, I built a version that skips the WP_Token_Map and instead relies on a basic associative array whose keys are the character reference names and whose values are the replacements. This was 840% slower than html_entity_decode() and increased the average page processing time by 2.175 ms, making the token map approximately 36x faster than the naive implementation on a per-page basis.
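The naive associative-array lookup looks roughly like the following sketch, using a tiny illustrative subset of the real table (which holds over two thousand names); `naive_lookup` is a hypothetical helper:

```php
// Naive alternative to WP_Token_Map: a plain associative array keyed by
// character reference name. Since names vary in length, a lookup has to
// try progressively shorter prefixes, and this repeated substr() work is
// where the cost accumulates.
$named_refs = array(
	'amp;'  => '&',
	'amp'   => '&',
	'lt;'   => '<',
	'nbsp;' => "\xC2\xA0",
);

function naive_lookup( $map, $text, $at ) {
	// The longest reference names are around 32 bytes; try longest first.
	for ( $length = min( 32, strlen( $text ) - $at ); $length > 0; $length-- ) {
		$candidate = substr( $text, $at, $length );
		if ( isset( $map[ $candidate ] ) ) {
			return $map[ $candidate ];
		}
	}
	return false;
}
```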

Pre-decoding

In an attempt to rely more on html_entity_decode() I added a pre-decoding step to handle all well-formed numeric character references. The logic is that if a quick preg_replace_callback() pass can push as much work as possible into C code, by means of html_entity_decode(), then it might be worth it even with the additional pass.

Unfortunately the results were immediately slower, adding another 20% slowdown on my first 100k domains under test. That is, it's over 40% slower than a pure html_entity_decode(), whereas the code without the pre-decoding step is only 20% slower.

The Pre-Decoder
		// pre-decode certain known numeric character references.
		$text = preg_replace_callback(
			'~&#((?P<is_hex>[Xx])0*(?P<hex_digits>[1-9A-Fa-f][0-9A-Fa-f]{0,5})|0*(?P<dec_digits>[1-9][0-9]{0,6}));~',
			static function ( $matches ) {
				$is_hex = strlen( $matches['is_hex'] ) > 0;
				$digits = $matches[ $is_hex ? 'hex_digits' : 'dec_digits' ];

				if ( ( $is_hex ? 2 : 3 ) === strlen( $digits ) ) {
					$code_point = intval( $digits, $is_hex ? 16 : 10 );

					/*
					 * Noncharacters, 0x0D, and non-ASCII-whitespace control characters.
					 *
					 * > A noncharacter is a code point that is in the range U+FDD0 to U+FDEF,
					 * > inclusive, or U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF,
					 * > U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE,
					 * > U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF,
					 * > U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
					 * > U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, or U+10FFFF.
					 *
					 * A C0 control is a code point that is in the range of U+00 to U+1F,
					 * but ASCII whitespace includes U+09, U+0A, U+0C, and U+0D.
					 *
					 * These characters are invalid, but they still decode as if they
					 * were valid characters. This comment notes and explains why
					 * there's no check to remove or replace them.
					 *
					 * @see https://infra.spec.whatwg.org/#noncharacter
					 */

					/*
					 * Code points in the C1 controls area need to be remapped as if they
					 * were stored in Windows-1252. Note! This transformation only happens
					 * for numeric character references. The raw code points in the byte
					 * stream are not translated.
					 *
					 * > If the number is one of the numbers in the first column of
					 * > the following table, then find the row with that number in
					 * > the first column, and set the character reference code to
					 * > the number in the second column of that row.
					 */
					if ( $code_point >= 0x80 && $code_point <= 0x9F ) {
						$windows_1252_mapping = array(
							'€', // 0x80 -> EURO SIGN (€).
							"\xC2\x81",   // 0x81 -> (no change).
							'‚', // 0x82 -> SINGLE LOW-9 QUOTATION MARK (‚).
							'ƒ', // 0x83 -> LATIN SMALL LETTER F WITH HOOK (ƒ).
							'„', // 0x84 -> DOUBLE LOW-9 QUOTATION MARK („).
							'…', // 0x85 -> HORIZONTAL ELLIPSIS (…).
							'†', // 0x86 -> DAGGER (†).
							'‡', // 0x87 -> DOUBLE DAGGER (‡).
							'ˆ', // 0x88 -> MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ).
							'‰', // 0x89 -> PER MILLE SIGN (‰).
							'Š', // 0x8A -> LATIN CAPITAL LETTER S WITH CARON (Š).
							'‹', // 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹).
							'Œ', // 0x8C -> LATIN CAPITAL LIGATURE OE (Œ).
							"\xC2\x8D",   // 0x8D -> (no change).
							'Ž', // 0x8E -> LATIN CAPITAL LETTER Z WITH CARON (Ž).
							"\xC2\x8F",   // 0x8F -> (no change).
							"\xC2\x90",   // 0x90 -> (no change).
							'‘', // 0x91 -> LEFT SINGLE QUOTATION MARK (‘).
							'’', // 0x92 -> RIGHT SINGLE QUOTATION MARK (’).
							'“', // 0x93 -> LEFT DOUBLE QUOTATION MARK (“).
							'”', // 0x94 -> RIGHT DOUBLE QUOTATION MARK (”).
							'•', // 0x95 -> BULLET (•).
							'–', // 0x96 -> EN DASH (–).
							'—', // 0x97 -> EM DASH (—).
							'˜', // 0x98 -> SMALL TILDE (˜).
							'™', // 0x99 -> TRADE MARK SIGN (™).
							'š', // 0x9A -> LATIN SMALL LETTER S WITH CARON (š).
							'›', // 0x9B -> SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›).
							'œ', // 0x9C -> LATIN SMALL LIGATURE OE (œ).
							"\xC2\x9D",   // 0x9D -> (no change).
							'ž', // 0x9E -> LATIN SMALL LETTER Z WITH CARON (ž).
							'Ÿ', // 0x9F -> LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ).
						);

						return $windows_1252_mapping[ $code_point - 0x80 ];
					}
				}

				return html_entity_decode( $matches[0], ENT_SUBSTITUTE, 'UTF-8' );
			},
			$text
		);
Faster integer decoding.

I attempted to parse the code point inline while scanning the digits, hoping to save some computation time, but this dramatically slowed down the decoder. I suspect that per-character parsing in PHP is much slower than a single intval() call.
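For reference, the inline approach looked roughly like this. This is a hypothetical reconstruction (`parse_decimal_inline` is my own naming), not the committed code:

```php
// Accumulate the code point digit by digit while scanning, instead of
// measuring the digit span and handing it to intval(). Each iteration
// here runs in userland PHP, which is what makes it slower overall.
function parse_decimal_inline( $text, $at ) {
	$code_point = 0;
	$end        = strlen( $text );
	while ( $at < $end ) {
		$digit = ord( $text[ $at ] ) - 0x30; // '0' is 0x30.
		if ( $digit < 0 || $digit > 9 ) {
			break;
		}
		$code_point = $code_point * 10 + $digit;
		$at++;
	}
	return array( $code_point, $at );
}
```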

Faster digit detection.

I attempted to replace strspn( $text, $numeric_digits ) with a custom loop examining each character for whether it fell in the digit ranges, but this was just as slow as the custom integer decoder.

Quick table lookup of group/small token indexing.

On the idea that finding the group key or small word in the lookup strings might be slow, given that they must be scanned on every lookup, I tried a patch introducing an index table for direct lookup of where words with a given starting letter begin, and whether any exist in the table at all.
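The structure of that index can be sketched with illustrative values (the offsets below are made up; the real ones come from the packed word strings):

```php
// Each possible first byte maps to a big-endian uint16 offset into the
// packed lookup string; 0xFFFF marks bytes with no entries.
$index = str_repeat( "\xFF\xFF", 256 );

// Suppose entries starting with 'B' begin at offset 481 (0x01 0xE1).
$offset                      = pack( 'n', 481 );
$index[ 2 * ord( 'B' ) ]     = $offset[0];
$index[ 2 * ord( 'B' ) + 1 ] = $offset[1];

// Lookup is then a single unpack() per starting byte.
$start = unpack( 'n', $index, 2 * ord( 'B' ) )[1]; // 481.
$none  = unpack( 'n', $index, 2 * ord( 'A' ) )[1]; // 0xFFFF: absent.
```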

Table-lookup patch
diff --git a/src/wp-includes/class-wp-token-map.php b/src/wp-includes/class-wp-token-map.php
index c7bb9316ed..60a189b37d 100644
--- a/src/wp-includes/class-wp-token-map.php
+++ b/src/wp-includes/class-wp-token-map.php
@@ -182,6 +182,38 @@ class WP_Token_Map {
 	 */
 	private $groups = '';
 
+	/**
+	 * Indicates where the first group key starting with a given
+	 * letter is to be found in the groups string.
+	 *
+	 *  - Each position in the string corresponds to the first byte
+	 *    of a lookup key. E.g. keys starting with `A` will look
+	 *    at the 65th index (A is U+0041) to see if it's in the set
+	 *    and where.
+	 *
+	 *  - Each index is stored as a two-byte unsigned integer
+	 *    indicating where to start looking. This limits the
+	 *    key size to 256 bytes.
+	 *
+	 *  - A null value indicates that there are no words present
+	 *    starting with the given byte.
+	 *
+	 * Example:
+	 *
+	 *     Suppose that there exists a key starting with `B` at
+	 *     offset 481 (0x01 0xE1) but none starting with `A`.
+	 *     ┌────────┬───────┬───────┬───────┐
+	 *     │  ...   │   A   │   B   │  ...  │
+	 *     ├────────┼───────┼───────┼───────┤
+	 *     │        │ 00 00 │ 01 E1 │       │
+	 *     └────────┴───────┴───────┴───────┘
+	 *
+	 * @since 6.6.0
+	 *
+	 * @var string
+	 */
+	private $group_index;
+
 	/**
 	 * Stores an optimized row of small words, where every entry is
 	 * `$this->key_size + 1` bytes long and zero-extended.
@@ -200,6 +232,38 @@ class WP_Token_Map {
 	 */
 	private $small_words = '';
 
+	/**
+	 * Indicates where the first word starting with a given letter
+	 * is to be found in the small words string.
+	 *
+	 *  - Each position in the string corresponds to the first byte
+	 *    of a lookup word. E.g. words starting with `A` will look
+	 *    at the 65th index (A is U+0041) to see if it's in the set
+	 *    and where.
+	 *
+	 *  - Each index is stored as a two-byte unsigned integer
+	 *    indicating where to start looking. This limits the
+	 *    key size to 256 bytes.
+	 *
+	 *  - A null value indicates that there are no words present
+	 *    starting with the given byte.
+	 *
+	 * Example:
+	 *
+	 *     Suppose that there exists a word starting with `B` at
+	 *     offset 481 (0x01 0xE1) but none starting with `A`.
+	 *     ┌────────┬───────┬───────┬───────┐
+	 *     │  ...   │   A   │   B   │  ...  │
+	 *     ├────────┼───────┼───────┼───────┤
+	 *     │        │ 00 00 │ 01 E1 │       │
+	 *     └────────┴───────┴───────┴───────┘
+	 *
+	 * @since 6.6.0
+	 *
+	 * @var string
+	 */
+	private $small_index;
+
 	/**
 	 * Replacements for the small words, in the same order they appear.
 	 *
@@ -287,9 +351,22 @@ class WP_Token_Map {
 			);
 		}
 
+		// Prime the search indices.
+		$map->small_index = str_repeat( "\xFF\xFF", 256 );
+		$map->group_index = str_repeat( "\xFF\xFF", 256 );
+
 		// Finally construct the optimized lookups.
 
+		$last_byte = "\x00";
 		foreach ( $shorts as $word ) {
+			if ( $last_byte !== $word[0] ) {
+				$last_byte                         = $word[0];
+				$index_at                          = 2 * ord( $last_byte );
+				$offset                            = pack( 'n', strlen( $map->small_words ) );
+				$map->small_index[ $index_at ]     = $offset[0];
+				$map->small_index[ $index_at + 1 ] = $offset[1];
+			}
+
 			$map->small_words     .= str_pad( $word, $key_length + 1, "\x00", STR_PAD_RIGHT );
 			$map->small_mappings[] = $mappings[ $word ];
 		}
@@ -297,7 +374,16 @@ class WP_Token_Map {
 		$group_keys = array_keys( $groups );
 		sort( $group_keys );
 
+		$last_byte = "\x00";
 		foreach ( $group_keys as $group ) {
+			if ( $last_byte !== $group[0] ) {
+				$last_byte                         = $group[0];
+				$index_at                          = 2 * ord( $last_byte );
+				$offset                            = pack( 'n', strlen( $map->groups ) );
+				$map->group_index[ $index_at ]     = $offset[0];
+				$map->group_index[ $index_at + 1 ] = $offset[1];
+			}
+
 			$map->groups .= "{$group}\x00";
 
 			$group_string = '';
@@ -327,19 +413,23 @@ class WP_Token_Map {
 	 *
 	 * @param int    $key_length     Group key length.
 	 * @param string $groups         Group lookup index.
+	 * @param string $group_index    Locations in the group lookup where each character starts.
 	 * @param array  $large_words    Large word groups and packed strings.
 	 * @param string $small_words    Small words packed string.
+	 * @param string $small_index    Locations in the small word lookup where each character starts.
 	 * @param array  $small_mappings Small word mappings.
 	 *
 	 * @return WP_Token_Map Map with precomputed data loaded.
 	 */
-	public static function from_precomputed_table( $key_length, $groups, $large_words, $small_words, $small_mappings ) {
+	public static function from_precomputed_table( $key_length, $groups, $group_index, $large_words, $small_words, $small_index, $small_mappings ) {
 		$map = new WP_Token_Map();
 
 		$map->key_length     = $key_length;
 		$map->groups         = $groups;
+		$map->group_index    = $group_index;
 		$map->large_words    = $large_words;
 		$map->small_words    = $small_words;
+		$map->small_index    = $small_index;
 		$map->small_mappings = $small_mappings;
 
 		return $map;
@@ -454,7 +544,18 @@ class WP_Token_Map {
 		if ( $text_length > $this->key_length ) {
 			$group_key = substr( $text, $offset, $this->key_length );
 
-			$group_at = $ignore_case ? stripos( $this->groups, $group_key ) : strpos( $this->groups, $group_key );
+			$group_index = unpack( 'n', $this->group_index, 2 * ord( $text[ $offset ] ) )[1];
+			if ( 0xFFFF === $group_index && ! $ignore_case ) {
+				// Perhaps a short word then.
+				return strlen( $this->small_words ) > 0
+					? $this->read_small_token( $text, $offset, $skip_bytes, $case_sensitivity )
+					: false;
+			}
+
+			$group_at = $ignore_case
+				? stripos( $this->groups, $group_key )
+				: strpos( $this->groups, $group_key, $group_index );
+
 			if ( false === $group_at ) {
 				// Perhaps a short word then.
 				return strlen( $this->small_words ) > 0
@@ -499,7 +600,14 @@ class WP_Token_Map {
 	 * @return string|false Mapped value of lookup key if found, otherwise `false`.
 	 */
 	private function read_small_token( $text, $offset, &$skip_bytes, $case_sensitivity = 'case-sensitive' ) {
-		$ignore_case  = 'case-insensitive' === $case_sensitivity;
+		$ignore_case = 'case-insensitive' === $case_sensitivity;
+
+		// Quickly eliminate impossible matches.
+		$small_index = unpack( 'n', $this->small_index, 2 * ord( $text[ $offset ] ) )[1];
+		if ( 0xFFFF === $small_index && ! $ignore_case ) {
+			return false;
+		}
+
 		$small_length = strlen( $this->small_words );
 		$search_text  = substr( $text, $offset, $this->key_length );
 		if ( $ignore_case ) {
@@ -507,7 +615,7 @@ class WP_Token_Map {
 		}
 		$starting_char = $search_text[0];
 
-		$at = 0;
+		$at = $ignore_case ? 0 : $small_index;
 		while ( $at < $small_length ) {
 			if (
 				$starting_char !== $this->small_words[ $at ] &&
@@ -621,6 +729,13 @@ class WP_Token_Map {
 		$group_line = str_replace( "\x00", "\\x00", $this->groups );
 		$output    .= "{$i1}\"{$group_line}\",\n";
 
+		$group_index        = '';
+		$group_index_length = strlen( $this->group_index );
+		for ( $i = 0; $i < $group_index_length; $i++ ) {
+			$group_index .= '\\x' . str_pad( dechex( ord( $this->group_index[ $i ] ) ), 2, '0', STR_PAD_LEFT );
+		}
+		$output .= "{$i1}\"{$group_index}\",\n";
+
 		$output .= "{$i1}array(\n";
 
 		$prefixes = explode( "\x00", $this->groups );
@@ -685,6 +800,13 @@ class WP_Token_Map {
 		$small_text = str_replace( "\x00", '\x00', implode( '', $small_words ) );
 		$output    .= "{$i1}\"{$small_text}\",\n";
 
+		$small_index        = '';
+		$small_index_length = strlen( $this->small_index );
+		for ( $i = 0; $i < $small_index_length; $i++ ) {
+			$small_index .= '\\x' . str_pad( dechex( ord( $this->small_index[ $i ] ) ), 2, '0', STR_PAD_LEFT );
+		}
+		$output .= "{$i1}\"{$small_index}\",\n";
+
 		$output .= "{$i1}array(\n";
 		foreach ( $this->small_mappings as $mapping ) {
 			$output .= "{$i2}\"{$mapping}\",\n";
diff --git a/src/wp-includes/html-api/html5-named-character-references.php b/src/wp-includes/html-api/html5-named-character-references.php
index 41c467e268..b7827f9b8e 100644
--- a/src/wp-includes/html-api/html5-named-character-references.php
+++ b/src/wp-includes/html-api/html5-named-character-references.php
@@ -31,6 +31,7 @@ global $html5_named_character_references;
 $html5_named_character_references = WP_Token_Map::from_precomputed_table(
 	2,
 	"AE\x00AM\x00Aa\x00Ab\x00Ac\x00Af\x00Ag\x00Al\x00Am\x00An\x00Ao\x00Ap\x00Ar\x00As\x00At\x00Au\x00Ba\x00Bc\x00Be\x00Bf\x00Bo\x00Br\x00Bs\x00Bu\x00CH\x00CO\x00Ca\x00Cc\x00Cd\x00Ce\x00Cf\x00Ch\x00Ci\x00Cl\x00Co\x00Cr\x00Cs\x00Cu\x00DD\x00DJ\x00DS\x00DZ\x00Da\x00Dc\x00De\x00Df\x00Di\x00Do\x00Ds\x00EN\x00ET\x00Ea\x00Ec\x00Ed\x00Ef\x00Eg\x00El\x00Em\x00Eo\x00Ep\x00Eq\x00Es\x00Et\x00Eu\x00Ex\x00Fc\x00Ff\x00Fi\x00Fo\x00Fs\x00GJ\x00GT\x00Ga\x00Gb\x00Gc\x00Gd\x00Gf\x00Gg\x00Go\x00Gr\x00Gs\x00Gt\x00HA\x00Ha\x00Hc\x00Hf\x00Hi\x00Ho\x00Hs\x00Hu\x00IE\x00IJ\x00IO\x00Ia\x00Ic\x00Id\x00If\x00Ig\x00Im\x00In\x00Io\x00Is\x00It\x00Iu\x00Jc\x00Jf\x00Jo\x00Js\x00Ju\x00KH\x00KJ\x00Ka\x00Kc\x00Kf\x00Ko\x00Ks\x00LJ\x00LT\x00La\x00Lc\x00Le\x00Lf\x00Ll\x00Lm\x00Lo\x00Ls\x00Lt\x00Ma\x00Mc\x00Me\x00Mf\x00Mi\x00Mo\x00Ms\x00Mu\x00NJ\x00Na\x00Nc\x00Ne\x00Nf\x00No\x00Ns\x00Nt\x00Nu\x00OE\x00Oa\x00Oc\x00Od\x00Of\x00Og\x00Om\x00Oo\x00Op\x00Or\x00Os\x00Ot\x00Ou\x00Ov\x00Pa\x00Pc\x00Pf\x00Ph\x00Pi\x00Pl\x00Po\x00Pr\x00Ps\x00QU\x00Qf\x00Qo\x00Qs\x00RB\x00RE\x00Ra\x00Rc\x00Re\x00Rf\x00Rh\x00Ri\x00Ro\x00Rr\x00Rs\x00Ru\x00SH\x00SO\x00Sa\x00Sc\x00Sf\x00Sh\x00Si\x00Sm\x00So\x00Sq\x00Ss\x00St\x00Su\x00TH\x00TR\x00TS\x00Ta\x00Tc\x00Tf\x00Th\x00Ti\x00To\x00Tr\x00Ts\x00Ua\x00Ub\x00Uc\x00Ud\x00Uf\x00Ug\x00Um\x00Un\x00Uo\x00Up\x00Ur\x00Us\x00Ut\x00Uu\x00VD\x00Vb\x00Vc\x00Vd\x00Ve\x00Vf\x00Vo\x00Vs\x00Vv\x00Wc\x00We\x00Wf\x00Wo\x00Ws\x00Xf\x00Xi\x00Xo\x00Xs\x00YA\x00YI\x00YU\x00Ya\x00Yc\x00Yf\x00Yo\x00Ys\x00Yu\x00ZH\x00Za\x00Zc\x00Zd\x00Ze\x00Zf\x00Zo\x00Zs\x00aa\x00ab\x00ac\x00ae\x00af\x00ag\x00al\x00am\x00an\x00ao\x00ap\x00ar\x00as\x00at\x00au\x00aw\x00bN\x00ba\x00bb\x00bc\x00bd\x00be\x00bf\x00bi\x00bk\x00bl\x00bn\x00bo\x00bp\x00br\x00bs\x00bu\x00ca\x00cc\x00cd\x00ce\x00cf\x00ch\x00ci\x00cl\x00co\x00cr\x00cs\x00ct\x00cu\x00cw\x00cy\x00dA\x00dH\x00da\x00db\x00dc\x00dd\x00de\x00df\x00dh\x00di\x00dj\x00dl\x00do\x00dr\x00ds\x00dt\x00du\x00dw\x00dz\x00eD\x00ea\x00ec\x00ed\x00ee\x00ef\x00eg\x00el\x00em\x00en\x00eo\x0
0ep\x00eq\x00er\x00es\x00et\x00eu\x00ex\x00fa\x00fc\x00fe\x00ff\x00fi\x00fj\x00fl\x00fn\x00fo\x00fp\x00fr\x00fs\x00gE\x00ga\x00gb\x00gc\x00gd\x00ge\x00gf\x00gg\x00gi\x00gj\x00gl\x00gn\x00go\x00gr\x00gs\x00gt\x00gv\x00hA\x00ha\x00hb\x00hc\x00he\x00hf\x00hk\x00ho\x00hs\x00hy\x00ia\x00ic\x00ie\x00if\x00ig\x00ii\x00ij\x00im\x00in\x00io\x00ip\x00iq\x00is\x00it\x00iu\x00jc\x00jf\x00jm\x00jo\x00js\x00ju\x00ka\x00kc\x00kf\x00kg\x00kh\x00kj\x00ko\x00ks\x00lA\x00lB\x00lE\x00lH\x00la\x00lb\x00lc\x00ld\x00le\x00lf\x00lg\x00lh\x00lj\x00ll\x00lm\x00ln\x00lo\x00lp\x00lr\x00ls\x00lt\x00lu\x00lv\x00mD\x00ma\x00mc\x00md\x00me\x00mf\x00mh\x00mi\x00ml\x00mn\x00mo\x00mp\x00ms\x00mu\x00nG\x00nL\x00nR\x00nV\x00na\x00nb\x00nc\x00nd\x00ne\x00nf\x00ng\x00nh\x00ni\x00nj\x00nl\x00nm\x00no\x00np\x00nr\x00ns\x00nt\x00nu\x00nv\x00nw\x00oS\x00oa\x00oc\x00od\x00oe\x00of\x00og\x00oh\x00oi\x00ol\x00om\x00oo\x00op\x00or\x00os\x00ot\x00ou\x00ov\x00pa\x00pc\x00pe\x00pf\x00ph\x00pi\x00pl\x00pm\x00po\x00pr\x00ps\x00pu\x00qf\x00qi\x00qo\x00qp\x00qs\x00qu\x00rA\x00rB\x00rH\x00ra\x00rb\x00rc\x00rd\x00re\x00rf\x00rh\x00ri\x00rl\x00rm\x00rn\x00ro\x00rp\x00rr\x00rs\x00rt\x00ru\x00rx\x00sa\x00sb\x00sc\x00sd\x00se\x00sf\x00sh\x00si\x00sl\x00sm\x00so\x00sp\x00sq\x00sr\x00ss\x00st\x00su\x00sw\x00sz\x00ta\x00tb\x00tc\x00td\x00te\x00tf\x00th\x00ti\x00to\x00tp\x00tr\x00ts\x00tw\x00uA\x00uH\x00ua\x00ub\x00uc\x00ud\x00uf\x00ug\x00uh\x00ul\x00um\x00uo\x00up\x00ur\x00us\x00ut\x00uu\x00uw\x00vA\x00vB\x00vD\x00va\x00vc\x00vd\x00ve\x00vf\x00vl\x00vn\x00vo\x00vp\x00vr\x00vs\x00vz\x00wc\x00we\x00wf\x00wo\x00wp\x00wr\x00ws\x00xc\x00xd\x00xf\x00xh\x00xi\x00xl\x00xm\x00xn\x00xo\x00xr\x00xs\x00xu\x00xv\x00xw\x00ya\x00yc\x00ye\x00yf\x00yi\x00yo\x00ys\x00yu\x00za\x00zc\x00zd\x00ze\x00zf\x00zh\x00zi\x00zo\x00zs\x00zw\x00",
+	"\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x30\x00\x48\x00\x72\x00\x93\x00\xc3\x00\xd2\x00\xf6\x01\x0e\x01\x38\x01\x47\x01\x5c\x01\x7d\x01\x95\x01\xb0\x01\xda\x01\xf5\x02\x01\x02\x25\x02\x4c\x02\x6d\x02\x97\x02\xb2\x02\xc1\x02\xcd\x02\xe8\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x03\x00\x03\x30\x03\x60\x03\x8d\x03\xc6\x03\xfc\x04\x20\x04\x53\x04\x71\x04\x9e\x04\xb0\x04\xc8\x05\x0d\x05\x37\x05\x7f\x05\xb5\x05\xd9\x05\xeb\x06\x2a\x06\x63\x06\x8a\x06\xc0\x06\xed\x07\x02\x07\x2c\x07\x44\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\
xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
 	array(
 		// AElig;[Æ] AElig[Æ].
 		"\x04lig;\x02Æ\x03lig\x02Æ",
@@ -1294,6 +1295,7 @@ $html5_named_character_references = WP_Token_Map::from_precomputed_table(
 		"\x03nj;\x03‌\x02j;\x03‍",
 	),
 	"GT\x00LT\x00gt\x00lt\x00",
+	"\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x03\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x06\xff\xff\xff\xff\xff\xff\xff\xff\x00\x09\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\
xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
 	array(
 		">",
 		"<",

This did not introduce a measurable speedup or slowdown on the dataset of 300k HTML pages. While I believe the table lookup could speed up certain workloads that are heavy with named character references, it does not justify itself on realistic data, so I'm leaving the patch out.

Metrics on character references.

From the same set of 296k web pages I counted the frequency of each character reference. This includes the full syntax, so if &#x00000000039 were encountered it would appear in the list. The linked file contains ANSI terminal codes, so view it with cat or less -R.

all-text-and-ref-counts.txt

Based on this data I added a special case for &quot;, &nbsp;, and &amp; before calling into the WP_Token_Map, but it had no measurable impact on performance. I'm led to conclude that it's not these common character references slowing things down; more likely it's the numeric character references.
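The special case amounted to something like the following sketch (`quick_named_ref` is a hypothetical helper; the actual experiment differed in its details):

```php
// Short-circuit the three most frequent named references with direct
// string comparison before falling back to the full WP_Token_Map search.
function quick_named_ref( $text, $at, &$skip_bytes ) {
	static $common = array(
		'quot;' => '"',
		'nbsp;' => "\xC2\xA0",
		'amp;'  => '&',
	);

	foreach ( $common as $name => $replacement ) {
		if ( substr( $text, $at, strlen( $name ) ) === $name ) {
			$skip_bytes = strlen( $name );
			return $replacement;
		}
	}

	return false; // Fall through to the token map.
}
```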

In another experiment I replaced my custom code_point_to_utf8_bytes() function with a call to mb_chr(), and again the impact wasn't significant. mb_chr() performs natively the same computation this application-level code does, so this is not surprising.
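For context, the computation both perform is standard UTF-8 encoding of a code point. A sketch of that computation in pure PHP (not the actual code_point_to_utf8_bytes() implementation):

```php
// Encode a Unicode code point into its UTF-8 byte sequence in pure PHP,
// the same work mb_chr() performs natively in C.
function utf8_bytes( $code_point ) {
	if ( $code_point <= 0x7F ) {
		return chr( $code_point );
	}
	if ( $code_point <= 0x7FF ) {
		return chr( 0xC0 | ( $code_point >> 6 ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	if ( $code_point <= 0xFFFF ) {
		return chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	return chr( 0xF0 | ( $code_point >> 18 ) )
		. chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $code_point & 0x3F ) );
}
```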

For clearer performance direction it's probably most helpful to profile a run of decoding and see where the CPU is spending its time. It appears to be fairly quick as it is in this patch.

Attempted alternatives

  • Tried using an associative array whose keys are the character reference names and whose values are the translated UTF-8 strings. This ran more than 8x slower than the WP_Token_Map implementation.
  • Based on the frequency of named character references, tried short-circuiting &quot;, &amp;, and &nbsp;, as they might account for 70-80% of all named character references in practice. This didn't impact the runtime, which is likely dominated by numeric character reference decoding.
  • In 0351b78, tried to micro-optimize by eliminating if checks, rearranging code based on frequency analysis of code points, and replacing the Windows-1252 remapping with direct replacement. In a test of 10 million randomly-generated numeric character references this performed around 3-5% faster than the branch, but in real tests I could not measure any impact. The micro-optimizations are likely inert in a real context.

In my benchmark of decoding 10 million randomly-generated numeric character references about half the time is spent exclusively inside read_character_reference() and the other half is spent in code_point_to_utf8_bytes().

  • Tried replacing substr() + intval() with an unrolled, table-lookup custom string-to-integer decoder. While that decoder performed significantly better than a naive pure-PHP decoder, it was still noticeably slower than intval().

I'm led to believe that this is nearly optimal for a pure-PHP solution.

Character-set detections.

The following CSV file is the result of surveying the / path of popular domains. It records whether the HTML found at that path is valid UTF-8, valid Windows-1252, valid ASCII, and whether it's valid in its self-reported character set(s).

A 1 indicates that the HTML passes mb_check_encoding() for the encoding of the given column. A 0 indicates that it doesn't. A missing value indicates that the site did not self-report to contain that encoding.

Note that a site might self-report being encoded in multiple simultaneous and mutually-exclusive encodings.

charset-detections.csv

html5lib tests

Results Output
Before: Tests: 609, Assertions: 172, Failures: 63, Skipped: 435.
After:  Tests: 607, Assertions: 172, Skipped: 435.
Tests that are now possible to run that previously weren't.

Differences from html_entity_decode()

PHP misses 720 character references &AElig &AMP &AMP; &Aacute &Acirc &Agrave &ApplyFunction; &Aring &Assign; &Atilde &Auml &Backslash; &Barwed; &Bernoullis; &Bumpeq; &COPY &COPY; &Cayleys; &Ccedil &CircleMinus; &ClockwiseContourIntegral; &CloseCurlyDoubleQuote; &CloseCurlyQuote; &Conint; &Copf; &CounterClockwiseContourIntegral; &DD; &Del; &DiacriticalDot; &DiacriticalGrave; &Diamond; &Dot; &DotEqual; &DoubleDownArrow; &DoubleLeftRightArrow; &DoubleLeftTee; &DoubleLongLeftArrow; &DoubleLongLeftRightArrow; &DoubleRightArrow; &DoubleUpDownArrow; &DoubleVerticalBar; &DownArrow; &DownLeftVector; &DownRightVector; &ETH &Eacute &Ecirc &Egrave &Element; &EqualTilde; &Equilibrium; &Escr; &Euml &ExponentialE; &FilledVerySmallSquare; &ForAll; &Fscr; &GT &GT; &GreaterEqual; &GreaterEqualLess; &GreaterFullEqual; &GreaterLess; &GreaterSlantEqual; &Gt; &Hscr; &HumpDownHump; &Iacute &Icirc &Igrave &Im; &Intersection; &InvisibleComma; &Iscr; &Iuml &LT &LT; &Laplacetrf; &LeftAngleBracket; &LeftArrow; &LeftArrowRightArrow; &LeftCeiling; &LeftDownVector; &LeftRightArrow; &LeftTee; &LeftTriangle; &LeftUpVector; &LeftVector; &Leftarrow; &Leftrightarrow; &LessEqualGreater; &LessFullEqual; &LessGreater; &LessSlantEqual; &Lleftarrow; &LongLeftArrow; &Longleftarrow; &Longleftrightarrow; &Longrightarrow; &LowerLeftArrow; &Lscr; &Lt; &Mscr; &NegativeMediumSpace; &NegativeThickSpace; &NegativeThinSpace; &NegativeVeryThinSpace; &NestedGreaterGreater; &NestedLessLess; &NonBreakingSpace; &Nopf; &NotDoubleVerticalBar; &NotElement; &NotEqualTilde; &NotExists; &NotGreater; &NotGreaterEqual; &NotGreaterSlantEqual; &NotGreaterTilde; &NotHumpDownHump; &NotHumpEqual; &NotLeftTriangle; &NotLeftTriangleEqual; &NotLessGreater; &NotLessLess; &NotLessSlantEqual; &NotLessTilde; &NotPrecedes; &NotReverseElement; &NotRightTriangle; &NotSubset; &NotSuperset; &NotTildeEqual; &NotTildeFullEqual; &NotTildeTilde; &NotVerticalBar; &Ntilde &Oacute &Ocirc &Ograve &Oslash &Otilde &Ouml &OverBar; &PartialD; &PlusMinus; 
&Poincareplane; &Popf; &Precedes; &PrecedesEqual; &PrecedesTilde; &Product; &Proportion; &Proportional; &QUOT &QUOT; &Qopf; &RBarr; &REG &REG; &Rarr; &Re; &ReverseEquilibrium; &RightArrow; &RightArrowLeftArrow; &RightTee; &RightTeeArrow; &RightTriangle; &RightVector; &Rightarrow; &Rrightarrow; &Rscr; &Rsh; &ShortDownArrow; &ShortLeftArrow; &ShortRightArrow; &ShortUpArrow; &SmallCircle; &SquareIntersection; &SquareSubset; &SquareSuperset; &SquareUnion; &Subset; &Succeeds; &SucceedsSlantEqual; &SuchThat; &Sum; &Sup; &Superset; &SupersetEqual; &THORN &TRADE; &Therefore; &Tilde; &TildeEqual; &TildeTilde; &Uacute &Ucirc &Ugrave &UnderBar; &UnderBracket; &Union; &UpArrow; &UpArrowDownArrow; &UpEquilibrium; &UpTee; &Uparrow; &UpperLeftArrow; &Upsi; &Uuml &Vee; &Vert; &VerticalBar; &VerticalLine; &VerticalTilde; &VeryThinSpace; &Wedge; &Yacute &aacute &acirc &acute &acute; &aelig &agrave &alefsym; &amp &ang; &angst; &ap; &approxeq; &aring &asymp; &asympeq; &atilde &auml &backcong; &backsim; &barwedge; &becaus; &because; &bepsi; &bernou; &bigodot; &bigstar; &bigvee; &bigwedge; &blacklozenge; &blacksquare; &bot; &bottom; &boxh; &boxtimes; &bprime; &breve; &brvbar &bsime; &bullet; &bumpe; &bumpeq; &caron; &ccedil &cedil &cedil; &cent &centerdot; &checkmark; &circlearrowleft; &circlearrowright; &circledR; &circledS; &circledast; &circledcirc; &circleddash; &cire; &clubsuit; &colone; &complement; &cong; &conint; &coprod; &copy &cuepr; &cularr; &curlyeqsucc; &curren &curvearrowright; &cuvee; &cuwed; &dArr; &dash; &dblac; &dd; &ddagger; &ddarr; &deg &dharr; &diam; &diams; &die; &digamma; &div; &divide &divideontimes; &dlcorn; &doteq; &dotminus; &dotplus; &dotsquare; &downarrow; &downharpoonleft; &downharpoonright; &dtri; &dtrif; &duarr; &duhar; &eDDot; &eDot; &eacute &ecirc &ecolon; &ee; &efDot; &egrave &emptyset; &emptyv; &epsilon; &epsiv; &eqcirc; &eqsim; &eqslantgtr; &eqslantless; &equiv; &erDot; &eth &euml &exist; &fork; &frac12 &frac12; &frac14 &frac34 &gE; &gel; &geq; 
&geqslant; &ggg; &gnE; &gnapprox; &gneq; &gsim; &gt &gtdot; &gtrapprox; &gtreqqless; &gtrless; &gtrsim; &gvnE; &hamilt; &hbar; &heartsuit; &hksearow; &hkswarow; &hookleftarrow; &hookrightarrow; &hslash; &iacute &icirc &iexcl &iff; &igrave &ii; &iiint; &image; &imagpart; &imath; &in; &int; &integers; &intercal; &intprod; &iquest &isin; &it; &iuml &kappav; &lArr; &lEg; &lang; &laquo &larrb; &lcub; &ldquo; &ldquor; &le; &leftarrow; &leftarrowtail; &leftleftarrows; &leftrightarrow; &leftrightarrows; &leftrightsquigarrow; &leftthreetimes; &leg; &leqq; &leqslant; &lessapprox; &lesssim; &lfloor; &lg; &lhard; &lharu; &lmoustache; &lnE; &lnapprox; &lneq; &lobrk; &longleftrightarrow; &longmapsto; &longrightarrow; &looparrowleft; &loz; &lrcorner; &lrhar; &lsh; &lsim; &lsqb; &lsquo; &lsquor; &lt &ltdot; &ltrie; &ltrif; &lvnE; &macr &malt; &mapsto; &mapstodown; &mapstoleft; &mapstoup; &measuredangle; &micro &midast; &middot &middot; &minusb; &mldr; &mnplus; &mp; &mstpos; &multimap; &nGtv; &nLeftrightarrow; &nRightarrow; &nap; &natural; &nbsp &ne; &nearr; &nearrow; &nequiv; &nesear; &nexists; &ngE; &nge; &ngeqq; &ngeqslant; &ngt; &nharr; &ni; &niv; &nlArr; &nlarr; &nle; &nleq; &nleqq; &nleqslant; &nless; &nlt; &nmid; &not &notinva; &notni; &npar; &nprcue; &npre; &nprec; &npreceq; &nrightarrow; &nrtri; &nrtrie; &nsc; &nsccue; &nsce; &nshortparallel; &nsim; &nsimeq; &nsmid; &nspar; &nsqsube; &nsqsupe; &nsube; &nsubset; &nsubseteq; &nsubseteqq; &nsucc; &nsucceq; &nsupE; &nsupe; &nsupseteq; &ntgl; &ntilde &ntriangleleft; &ntrianglelefteq; &ntrianglerighteq; &nwarr; &oacute &ocirc &odot; &ograve &ohm; &oint; &oplus; &order; &ordf &ordm &oscr; &oslash &otilde &otimes; &ouml &par; &para &parallel; &phiv; &phmmat; &plankv; &plusb; &plusmn &pm; &pound &pr; &prap; &prcue; &pre; &preccurlyeq; &prnE; &prnap; &prnsim; &propto; &prsim; &qint; &quaternions; &questeq; &quot &rArr; &rBarr; &radic; &rang; &rangle; &raquo &rarr; &rarrb; &rarrlp; &rbarr; &rbrace; &rbrack; &rceil; &rdquor; &real; 
&realpart; &reals; &reg &rfloor; &rightarrow; &rightarrowtail; &rightharpoondown; &rightharpoonup; &rightrightarrows; &rightsquigarrow; &rightthreetimes; &rlarr; &rlhar; &rmoustache; &robrk; &rsquor; &rtrie; &rtrif; &sc; &scap; &sccue; &sce; &scnap; &scsim; &searr; &searrow; &sect &setminus; &setmn; &sfrown; &shortmid; &shy &sigmaf; &sime; &slarr; &smallsetminus; &smid; &spades; &spar; &sqsube; &sqsubset; &sqsubseteq; &sqsup; &sqsupe; &sqsupseteq; &squ; &square; &squf; &ssmile; &sstarf; &strns; &sube; &subnE; &subne; &subset; &subseteq; &subseteqq; &succeq; &succneqq; &succnsim; &succsim; &sup1 &sup2 &sup3 &supE; &supne; &supset; &supseteq; &supsetneqq; &swarrow; &szlig &tbrk; &tdot; &therefore; &thetav; &thickapprox; &thicksim; &thinsp; &thkap; &thksim; &thorn &tilde; &times &top; &tosa; &triangleleft; &trianglelefteq; &triangleright; &trianglerighteq; &trie; &twixt; &twoheadleftarrow; &uArr; &uacute &ucirc &ugrave &uharr; &ulcorn; &uml &uml; &uparrow; &updownarrow; &upharpoonleft; &upharpoonright; &uplus; &upsilon; &urcorn; &utri; &utrif; &uuarr; &uuml &vArr; &vDash; &varepsilon; &varnothing; &varphi; &varpi; &varpropto; &varr; &varrho; &varsigma; &varsubsetneq; &varsubsetneqq; &varsupsetneq; &vartheta; &vartriangleright; &vee; &verbar; &vltri; &vnsup; &vprop; &vsupnE; &wedge; &weierp; &wreath; &xcap; &xcirc; &xcup; &xdtri; &xharr; &xlarr; &xoplus; &xotime; &xrArr; &xrarr; &xsqcup; &xuplus; &xutri; &yacute &yen &yuml &zeetrf;

This list contains many named character references without a trailing ; because HTML does not require one in all cases. A related behavior applies to numeric character references, where the trailing ; isn't required at certain boundaries.

Further, whether the trailing ; is required is subject to the ambiguous ampersand rule, which preserves a legacy behavior for certain query args in URL attributes that weren't properly encoded.
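As a concrete illustration (using the same flags as the FromPHP column below), PHP's built-in decoder only recognizes fully-terminated numeric references, while a browser decoding data would also decode the semicolon-less form when it ends at a boundary such as end-of-input or a non-digit character:

```php
<?php
// PHP's built-in decoder only recognizes references terminated by a
// semicolon, so the boundary behavior above is out of its reach.
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

assert( "\u{A9}" === html_entity_decode( '&#xA9;', $flags ) ); // © — decoded.
assert( '&#xA9'  === html_entity_decode( '&#xA9',  $flags ) ); // left as-is.
```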


Outputs from this PR

The FromPHP column shows how html_entity_decode( $input, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5 ) would decode the input.

The Data and Attribute columns show how the HTML API decodes the text in the context of markup (data) and of an attribute value (attribute). These contexts differ in HTML, and unfortunately PHP provides no way to distinguish them. The main difference is the so-called "ambiguous ampersand" rule, which allows many character references to be written without the terminating semicolon ; (though not all of the named character references permit this). In attributes, however, some of these can look like URL query params. E.g. is &not=dogs supposed to be ¬=dogs, or a query arg named not whose value is dogs? HTML chose to ensure the safety of URLs and forbids decoding character references in these ambiguous cases.
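A small illustration of the ambiguous case; note that PHP's decoder can express neither context for the semicolon-less form:

```php
<?php
// Illustration of the ambiguous ampersand case with PHP's decoder.
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

// With the semicolon, every context agrees:
assert( "\u{AC}=dogs" === html_entity_decode( '&not;=dogs', $flags ) ); // ¬=dogs

// Without it, PHP always leaves the text as-is. A spec-compliant decoder
// would produce ¬=dogs in a data context, but would also leave an
// *attribute* value untouched, protecting URL query args like ?not=dogs.
assert( '&not=dogs' === html_entity_decode( '&not=dogs', $flags ) );
```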

Screenshot 2024-04-17 at 8 32 52 PM

Outputs from a browser

I've compared Firefox and Safari. The middle column shows the data value and the right column has extracted the title attribute of the input and set it as the innerHTML of the TD.

Screenshot 2024-04-17 at 8 33 06 PM

The empty boxes represent unrendered Unicode characters. While some characters, like the null byte, are replaced with a Replacement Character (U+FFFD), "non-characters" are passed through, even though they are parser errors.

Trac ticket:


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.


Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@westonruter
Member

Is there a performance penalty to implement this in PHP versus using html_entity_decode()?

@dmsnell
Contributor Author

dmsnell commented Apr 18, 2024

Is there a performance penalty to implement this in PHP versus using html_entity_decode()?

Probably, and I will attempt to quantify that. This patch is mostly about avoiding corruption, though, so a performance penalty, if one exists, won't be a blocker in principle unless it's heavy enough to cause real problems.

Personally I've spent too much time trying over and over again to get WordPress to properly save and render markup with specific characters after it corrupts or eliminates what I wrote. This is editor time wasted and not render time, but it's hard to quantify the costs of data corruption.

@dmsnell
Contributor Author

dmsnell commented Apr 18, 2024

Is there a performance penalty to implement this in PHP versus using html_entity_decode()?

Further, there's a real clear need to push this upstream, but that requires some design changes as PHP's html_entity_decode() does not expose a way to indicate the decoding context, specifically concerning the ambiguous ampersand rule.

Theoretically this patch can cure the decoding for now while we wait for PHP to get a better decoder, and then we can rely on PHP's functionality, which is surely faster.

@dmsnell dmsnell force-pushed the html-api/add-text-decoder branch 4 times, most recently from ff30946 to 796d11a Compare April 26, 2024 16:00
@dmsnell
Contributor Author

dmsnell commented Apr 26, 2024

@westonruter I have some preliminary performance-testing data. This is slower, but not so much that it should make much of an impact. In the worst-case scenario it was adding tens of microseconds when decoding every text value in the document, including every attribute value.

@dmsnell dmsnell force-pushed the html-api/add-text-decoder branch 2 times, most recently from 0351b78 to b80e784 Compare April 28, 2024 13:11
@dmsnell dmsnell marked this pull request as ready for review April 29, 2024 13:20

github-actions bot commented Apr 29, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell force-pushed the html-api/add-text-decoder branch 3 times, most recently from d12f1a7 to a99fda8 Compare May 1, 2024 23:23
@dmsnell dmsnell force-pushed the html-api/add-text-decoder branch 6 times, most recently from 388fc03 to b05d048 Compare May 14, 2024 01:58
Provide a custom decoder for strings coming from HTML attributes and
markup. This custom decoder is necessary because of deficiencies in
PHP's `html_entity_decode()` function:

 - It isn't aware of 720 of the possible named character references in
   HTML, leaving many out that should be translated.

 - It isn't able to decode character references in data segments where
   the final semicolon is missing, or when there are ambiguous
   characters after the reference name but before the semicolon.
   This one is complicated: refer to the HTML5 specification to clarify.

This decoder will also provide some conveniences, such as making a
single-pass and interruptible decode operation possible. This will
provide a number of opportunities to optimize detection and decoding
of things like value prefixes, and whether a value contains a given
substring.
@sirreal
Member

sirreal commented May 28, 2024

I've pushed a few changes; I think there was some inconsistency between null and false in the read_character_reference return type. This resulted in calling strlen( null ), showing up in tests as:

1) Tests_HtmlApi_WpHtmlDecoder::test_detects_ascii_case_insensitive_attribute_prefixes with data set "&#X6A;avascript&colon" ('&#X6A;avascript&colon', 'javascript:')
strlen(): Passing null to parameter #1 ($string) of type string is deprecated

/var/www/src/wp-includes/html-api/class-wp-html-decoder.php:63
/var/www/tests/phpunit/tests/html-api/wpHtmlDecoder.php:27
/var/www/vendor/bin/phpunit:122

* @param ?string $case_sensitivity Set to `ascii-case-insensitive` to ignore ASCII case when matching.
* @return bool Whether the attribute value starts with the given string.
*/
public static function attribute_starts_with( $haystack, $search_text, $case_sensitivity = 'case-sensitive' ) {
Member

I was a little bit surprised to see this method. It seems like an optimization on what could be str_starts_with + decode_attribute. Will you explain the motivation?

Member

@dmsnell explained the motivation to me, I'll paraphrase here (my interpretation):

It is an optimization, but an important one. I've seen a number of crashes where many megabytes of data URIs are decoded (or attempted to be) just to check whether a URL starts with https. This provides an optimized but also safer way to perform these smaller checks.
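The idea can be sketched in userland PHP (a simplified stand-in with an assumed name, not the Core implementation: it handles only decimal numeric character references and matches case-sensitively):

```php
<?php
// Decode only as many characters of the raw attribute value as the prefix
// requires, instead of decoding the entire value up front.
function raw_attribute_starts_with( string $raw, string $prefix ): bool {
	$decoded = '';
	$at      = 0;
	$length  = strlen( $raw );

	while ( strlen( $decoded ) < strlen( $prefix ) && $at < $length ) {
		if ( '&' === $raw[ $at ] && 1 === preg_match( '~^&#([0-9]+);~', substr( $raw, $at, 12 ), $m ) ) {
			$decoded .= mb_chr( (int) $m[1], 'UTF-8' );
			$at      += strlen( $m[0] );
		} else {
			$decoded .= $raw[ $at ];
			++$at;
		}
	}

	return str_starts_with( $decoded, $prefix );
}

// Only the first few bytes of a multi-megabyte value are ever examined.
assert( true  === raw_attribute_starts_with( '&#104;ttps://example.com/', 'https' ) );
assert( false === raw_attribute_starts_with( 'data:image/png;base64,' . str_repeat( 'A', 1024 ), 'https' ) );
```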

Comment on lines 28 to 31
* @param string $haystack String containing the raw non-decoded attribute value.
* @param string $search_text Does the attribute value start with this plain string.
* @param ?string $case_sensitivity Set to `ascii-case-insensitive` to ignore ASCII case when matching.
* @return bool Whether the attribute value starts with the given string.
Member

I'll push a change updating this (and something similar in WP_Token_Map).


I believe ?string is a nullable string, not an optional string, equivalent to string|null.

optional vs. nullable params
<?php

function f( ?string $s = "default" ) { var_dump($s); };
function g(  string $s = "default" ) { var_dump($s); };

echo "F\n";
f();
f(null);
f('y');
echo "G\n";
g();
try { g(null); } catch ( Error $e ) { echo $e->getMessage() . "\n"; }
g('y');
F
string(7) "default"
NULL
string(1) "y"
G
string(7) "default"
g(): Argument #1 ($s) must be of type string, null given, called in /in/EicXB on line 12
string(1) "y"

I don't think PHPDoc has a way of annotating optional parameters; we just state it:

Suggested change
* @param string $haystack String containing the raw non-decoded attribute value.
* @param string $search_text Does the attribute value start with this plain string.
* @param ?string $case_sensitivity Set to `ascii-case-insensitive` to ignore ASCII case when matching.
* @return bool Whether the attribute value starts with the given string.
* @param string $haystack String containing the raw non-decoded attribute value.
* @param string $search_text Does the attribute value start with this plain string.
* @param string $case_sensitivity Optional. Set to `ascii-case-insensitive` to ignore ASCII case when matching.
* @return bool Whether the attribute value starts with the given string.

?string (nullable string) was used for these types.
The type should be string, optionality can appear in the param description
but is derived from a default argument.
?Type is a nullable Type, not an optional parameter. Identify optional
parameters in the @param description.
Instead of a nullable parameter that updates to `0` in the body of the
function when null, use a default argument `0`, update the @param type
and remove the null check and set from the function body.
Add mixed case and non-matching test cases
Update the parameter name and description to align with the descriptions
of analogous parameters used in WP_Token_Map methods.

while ( $search_at < $search_length && $haystack_at < $haystack_end ) {
	$chars_match = $loose_case
		? strtolower( $haystack[ $haystack_at ] ) === strtolower( $search_text[ $search_at ] )
Member

I mentioned this in the Token Map PR (#5373 (comment)) and now I've taken a moment to do some synthetic benchmarks, and I think we should change both that comparison in Token Map and this case to use strcasecmp. The difference is small, but may be relevant for highly optimized code. In ~3 million comparisons, strcasecmp ran 1.52 ± 0.02 times faster than strtolower.

Suggested change
? strtolower( $haystack[ $haystack_at ] ) === strtolower( $search_text[ $search_at ] )
? 0 === strcasecmp( $haystack[ $haystack_at ], $search_text[ $search_at ] )

I also put these both into 3v4l.org to get "Vulcan Logic Dumper" results, which does show strcasecmp performing fewer operations, supporting the idea that strcasecmp may perform slightly better:

Function str_case_cmp:
Finding entry points
Branch analysis from position: 0
1 jumps found. (Code = 62) Position 1 = -2
filename:       /in/6a2UE
function name:  str_case_cmp
number of ops:  9
compiled vars:  !0 = $a, !1 = $b
line      #* E I O op                           fetch          ext  return  operands
-------------------------------------------------------------------------------------
    3     0  E >   RECV                                             !0      
          1        RECV                                             !1      
    4     2        INIT_FCALL                                               'strcasecmp'
          3        SEND_VAR                                                 !0
          4        SEND_VAR                                                 !1
          5        DO_ICALL                                         $2      
          6        IS_IDENTICAL                                     ~3      $2, 0
          7      > RETURN                                                   ~3
    5     8*     > RETURN                                                   null

End of function str_case_cmp

Function str_to_lower:
Finding entry points
Branch analysis from position: 0
1 jumps found. (Code = 62) Position 1 = -2
filename:       /in/6a2UE
function name:  str_to_lower
number of ops:  11
compiled vars:  !0 = $a, !1 = $b
line      #* E I O op                           fetch          ext  return  operands
-------------------------------------------------------------------------------------
    6     0  E >   RECV                                             !0      
          1        RECV                                             !1      
    7     2        INIT_FCALL                                               'strtolower'
          3        SEND_VAR                                                 !0
          4        DO_ICALL                                         $2      
          5        INIT_FCALL                                               'strtolower'
          6        SEND_VAR                                                 !1
          7        DO_ICALL                                         $3      
          8        IS_IDENTICAL                                     ~4      $2, $3
          9      > RETURN                                                   ~4
    8    10*     > RETURN                                                   null

Member

We can land dmsnell#15 to switch to strcasecmp here if you agree.

Contributor Author

please let me add another word of caution here, because you're comparing two very different functions. we don't know how long the comparison is that we want to make, so I suspect we'd need to perform multiple iterations through the short characters to find out. not a big deal, but with insignificant improvements, insignificant setbacks can add up too.

so the first question is how many characters of the source text are involved in a match for a short word? We don't know this up-front, which is why the existing algorithm moves character-by-character until it finds a result. that is, it proceeds until we have matches leading up to a null byte.

I've taken a moment to do some synthetic benchmarks, and I think we should change both that comparison in Token Map and this case to use strcasecmp. The difference is small, but may be relevant for highly optimized code. In ~3 million comparisons, strcasecmp ran 1.52 ± 0.02 times faster than strtolower.

thanks for running these, but I'm not interested in synthetic benchmarks, particularly because of how misleading they can be. as with the broader measurements I've made in this patch, we're 6x - 30x slower than html_entity_decode() and yet the impact on WordPress is insignificant because we aren't spending our time on this.

in all of my explorations, both with 300,000 actual web pages and randomly-generated files full of character entities, never were named character reference lookups in the hot path.

if we want to make this faster we should attack numeric character references, and I've failed to find any way to make that faster.

strcasecmp performing fewer operations, supporting the idea that strcasecmp may perform slightly better:

I'll believe measurements and I'll use VLD dumps to gain insight, but not for performance. Those operations can take wildly different amounts of time, plus two strtolower() calls on independent data might end up executing in parallel on the CPU, which can eliminate the sequential bottleneck. Modern CPUs are complicated beasts.


I'm nervous about this because I find strcasecmp()'s interface much riskier: it's not doing what we want but something different, and we have to adapt around that difference. So if you want to change this, just make sure to include ample real-world metrics and try hard to break it, particularly with maps containing overlapping short tokens, e.g. [ 'a' => '1', 'aa' => '2' ], and also maps with different prefix lengths.

Member

@sirreal sirreal left a comment

I've been over this very carefully and compared it with the standard. I can't find any issues. Nicely done.

My confidence is greatly increased by the html5lib entities tests that were largely broken before (with PHP html entity implementations) and are all passing with this change.

* @param string $context `attribute` for decoding attribute values, `data` otherwise.
* @param string $text Text document containing span of text to decode.
* @param int $at Optional. Byte offset into text where span begins, defaults to the beginning (0).
* @param ?int &$matched_token_byte_length Optional. Holds byte-length of found lookup key if matched, otherwise not set. Default null.
Member

I don't mind matched_token_byte_length, but an alternative that may be better here is matched_character_reference_byte_length because we are looking for character references and I don't think we mention tokens elsewhere here.

* @param int $code_point Which code point to convert.
* @return string Converted code point, or `�` if invalid.
*/
public static function code_point_to_utf8_bytes( $code_point ) {
Member

This section makes a lot more sense after seeing the table here:

Screenshot 2024-05-29 at 21 34 19

It might be nice to include a textual version here, although something is certainly lost without the coloring.

Code point to UTF-8 conversion follows this pattern:

First code point    Last code point    Byte 1      Byte 2      Byte 3      Byte 4
U+0000              U+007F             0xxxxxxx                                  
U+0080              U+07FF             110xxxxx    10xxxxxx                      
U+0800              U+FFFF             1110xxxx    10xxxxxx    10xxxxxx          
U+010000            U+10FFFF           11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
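A direct transcription of the table into code (an illustrative sketch with an assumed function name, not the Core method; it also omits HTML's additional replacement rules, e.g. for surrogate code points):

```php
<?php
// Pack a code point into UTF-8 bytes following the ranges in the table above.
function utf8_bytes_for_code_point( int $code_point ): string {
	if ( $code_point <= 0x7F ) {
		return chr( $code_point );
	}
	if ( $code_point <= 0x7FF ) {
		return chr( 0xC0 | ( $code_point >> 6 ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	if ( $code_point <= 0xFFFF ) {
		return chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	if ( $code_point <= 0x10FFFF ) {
		return chr( 0xF0 | ( $code_point >> 18 ) )
			. chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	return "\u{FFFD}"; // Out of range: Replacement Character.
}

assert( 'j' === utf8_bytes_for_code_point( 0x6A ) );
assert( "\xE2\x82\xAC" === utf8_bytes_for_code_point( 0x20AC ) ); // €
```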

Contributor Author

I'm torn on adding this, because I want the code to be clearer, but I also don't want to embed the UTF-8 standard inside the code. WordPress already has a couple of UTF-8 functions using this pattern, and part of me simply wants the docblock to link to and reference this rather than explain it.

I've updated the docs a little anyway.

Member

That's fine. This isn't a section of code I'd expect to change much.
