Unicode double quotation marks get replaced

Started by MegaBrutal, January 16, 2025, 07:45:00 PM

MegaBrutal

Not sure if this is a bug per se, or a feature I could disable, but I recently upgraded my forum to SMF 2.1.4, and noticed that the Unicode double quotation marks I post get replaced with other, similarly looking characters.

Specifically, U+201E DOUBLE LOW-9 QUOTATION MARK gets replaced with the character sequence: ,, (2 ASCII comma characters, U+002C). And U+201D RIGHT DOUBLE QUOTATION MARK gets replaced with " (ASCII quotation mark, U+0022).

I'd prefer to preserve the original characters rather than have arbitrary replacements made: they look ugly, and I don't know what other characters might get replaced by surprise.
Despite this.
I feel obligated to suggest.
Should you choose to create this world once more.
Another path would be better suited.


shawnb61

,,I'm not seeing this..."

If I look at the hex values in the DB, they're correct, e2809d & e2809e.
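Those hex values are the UTF-8 encodings of the two code points named in the first post. A minimal Rust sketch to check that (the helper name `utf8_hex` is made up for illustration and is not part of SMF):

```rust
/// Returns the UTF-8 encoding of `c` as a lowercase hex string.
/// `utf8_hex` is a hypothetical helper name used only for this illustration.
fn utf8_hex(c: char) -> String {
    // A single code point needs at most four bytes in UTF-8.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf)
        .bytes()
        .map(|b| format!("{:02x}", b))
        .collect()
}
```

Calling `utf8_hex('\u{201D}')` gives `e2809d`, and `utf8_hex('\u{201E}')` gives `e2809e`, matching what is stored in the DB.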

However...  If I copy & paste from elsewhere, you are correct, it happens...  It seems to be dependent where & how you're copying & pasting from...
A question worth asking is born in experience & driven by necessity. - Fripp

MegaBrutal

Yes, I also noticed that the source text remains intact in the DB, because when I quote or modify the post, I get the original characters back. The conversion happens when the post is rendered for viewing.

Here I try to post the original characters: ,,".

live627

There's a function deep within Subs.php which purports to change some characters copied from MS Word. Maybe that's the cause.

MegaBrutal

Well, this is a total anti-feature. It's true that MS Word (and I think also LibreOffice) tends to replace " quotes with Unicode quotation marks, which may be annoying when the user really wants to type the ASCII quote character. But when people intentionally use the Unicode quote pairs because they look better in text, this replacement is really annoying. I don't think SMF should decide which characters to display and which not.

I don't know what other MS Word characters get replaced, other replacements might be reasonable, in case they would really screw up the rendering of the page.

I also think that the SMF drafts feature will hopefully tone down the habit of composing posts in Word and copy-pasting them into SMF when they are ready (I knew some users who were doing that).

Kindred

SMF *HAS* to "decide" what to do with those characters, because stuff in Word does not follow the HTML standards and can break (and has broken) layouts.
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

MegaBrutal

As I said, for characters that really break the layout, it's reasonable. But that surely doesn't apply to the double quotes in question.

shawnb61

The problem here isn't in the capture - it is properly captured & stored in the DB as utf8.  The work done in SMF (nice work by @Sesquipedalian ) to clean up pasted content of invalid characters is working very well. 

Something funky is happening upon display though.  I don't know if the funkiness is a browser thing, or an SMF thing.  The characters displayed are not the proper characters; they've been dumbed down to a subset of utf8 somewhere along the line.

Sesquipedalian

Quote from: shawnb61 on January 29, 2025, 03:54:54 PM
The problem here isn't in the capture - it is properly captured & stored in the DB as utf8.  The work done in SMF (nice work by @Sesquipedalian ) to clean up pasted content of invalid characters is working very well.

Something funky is happening upon display though.  I don't know if the funkiness is a browser thing, or an SMF thing.  The characters displayed are not the proper characters; they've been dumbed down to a subset of utf8 somewhere along the line.

Correct. In SMF 2.1, those characters are replaced by the sanitizeMSCutPaste() function, which is called from the parse_bbc() function. It is old and arguably useless code in the Unicode-aware world of 2025. In SMF 3.0, sanitizeMSCutPaste() has a comment attached to it suggesting that it be deprecated and no longer used, and I think that is what we will do.
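To illustrate why the stored text stays intact while the rendered text changes, here is a rough Rust sketch of a display-time substitution. It is not SMF's code (SMF is written in PHP): the function name `render_dumbed_down` is hypothetical, and the only replacements included are the two reported in this thread.

```rust
/// Applies, at render time only, the two substitutions reported in this thread.
/// Hypothetical illustration; this is not how sanitizeMSCutPaste() is written.
fn render_dumbed_down(stored: &str) -> String {
    stored
        .replace('\u{201E}', ",,") // DOUBLE LOW-9 QUOTATION MARK -> two ASCII commas
        .replace('\u{201D}', "\"") // RIGHT DOUBLE QUOTATION MARK -> ASCII quote
}
```

Because the function returns a new string and never touches `stored`, quoting or modifying the post still yields the original characters, which matches the behaviour described above.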
I promise you nothing.

Sesqu... Sesqui... what?
Sesquipedalian, the best word in the English language.

Sesquipedalian

As of #8423, sanitizeMSCutPaste() has been deprecated in SMF 3.0 and is no longer used.

MegaBrutal

Quote from: Sesquipedalian on January 29, 2025, 06:36:42 PM
In SMF 2.1, those characters are replaced by the sanitizeMSCutPaste() function, which is called during parse_bbc() function. It is old and arguably useless code in the Unicode-aware world of 2025.

Strange because I don't remember this being an issue in SMF 2.0 – I guess it was there too if it's that old. Or maybe my font back then didn't make it visually so apparent.


Quote from: Sesquipedalian on January 29, 2025, 06:55:13 PM
As of #8423, sanitizeMSCutPaste() has been deprecated in SMF 3.0 and is no longer used.

Uhm, awesome. Is there any hope to get a fix for SMF 2.1?

Arantor

If only it were truly that simple.

You see, the reason the function was introduced is not because things weren't UTF-8 aware back then, but because what Word does is... 😬

When Word puts in typographical quotes, *they are not Unicode by default*. They're Win-1251 or Win-1252 quotes (or others, depends on your system locale, but Win locale 1033 uses Win-1252 by default) rather than Unicode, and so much drama was caused by copy/pasting from Word and posts getting truncated by those quotes, precisely because they're not legal UTF-8.
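As a small illustration of the "not legal UTF-8" point: in Windows-1252 the right double quote is the single byte 0x94, which cannot appear on its own in a UTF-8 stream. The sketch below (with a hypothetical helper name, `first_invalid_offset`) reports where a strict validator would stop, which is roughly where a truncated post would end.

```rust
/// Returns the byte offset of the first invalid UTF-8 sequence in `bytes`,
/// or `None` if the whole slice is well-formed UTF-8.
/// `first_invalid_offset` is a hypothetical name used only for this sketch.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes)
        .err()
        .map(|e| e.valid_up_to())
}
```

For the bytes `[b'o', b'k', 0x94, b'!']` this returns `Some(2)`, while the properly encoded string `"ok\u{201D}!"` returns `None`.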

I will admit the function was enthusiastic - but better to do that than lose half a post outright. Browsers have always been instructed to coerce content to UTF-8 but there's no guarantee they actually will. Microsoft browsers are the worst for this. I suspect if you look, this is possibly the source of some of the inconsistency when it does/does not behave as expected.

I wouldn't discount the people who write in other apps and then copy/paste it over - a lot of that originally came about because things got lost, e.g. session timeout or some other error and pressing back could empty the editor. People with long memories learned this a long time ago - it's one of the reasons drafts were introduced.
Holder of controversial views, all of which my own.


Sesquipedalian

I'm fairly sure that the full scale Unicode normalization and sanitization we now perform will be more than enough to deal with that.

Arantor

I hope so, I'd really hate to go back to the bad old days. I remember fielding so many support issues about it.

Sesquipedalian

Well, if it turns out that I'm wrong and we still need that code, we can restore it easily enough. :)

MegaBrutal

Quote from: Arantor on January 30, 2025, 05:15:51 AM
You see, the reason the function was introduced is not because things weren't UTF-8 aware back then, but because what Word does is... 😬

When Word puts in typographical quotes, *they are not Unicode by default*. They're Win-1251 or Win-1252 quotes (or others, depends on your system locale, but Win locale 1033 uses Win-1252 by default) rather than Unicode, and so much drama was caused by copy/pasting from Word and posts getting truncated by those quotes, precisely because they're not legal UTF-8.

But this function seems to replace valid UTF-8 characters, otherwise my quotes wouldn't be replaced. Truly invalid characters could be replaced with their legit UTF-8 counterparts.
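What I mean by replacing truly invalid bytes with their proper counterparts could look roughly like the sketch below, assuming the stray bytes are Windows-1252. Only the five quote bytes are mapped, and both function names are hypothetical; a real decoder would need the whole code page and some way to know which legacy encoding it is dealing with.

```rust
/// Maps a few Windows-1252 quote bytes to the Unicode characters they stand for.
/// Hypothetical helper: deliberately covers only the quotation marks.
fn cp1252_quote_to_char(byte: u8) -> Option<char> {
    match byte {
        0x84 => Some('\u{201E}'), // DOUBLE LOW-9 QUOTATION MARK
        0x91 => Some('\u{2018}'), // LEFT SINGLE QUOTATION MARK
        0x92 => Some('\u{2019}'), // RIGHT SINGLE QUOTATION MARK
        0x93 => Some('\u{201C}'), // LEFT DOUBLE QUOTATION MARK
        0x94 => Some('\u{201D}'), // RIGHT DOUBLE QUOTATION MARK
        _ => None,
    }
}

/// Keeps valid UTF-8 untouched, turns known Windows-1252 quote bytes into real
/// Unicode quotes, and replaces any other invalid byte with U+FFFD.
fn repair_cp1252_quotes(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    loop {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                // Everything before `valid_up_to()` is known to be valid UTF-8.
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                // `bad` is non-empty here; handle one offending byte and retry.
                out.push(cp1252_quote_to_char(bad[0]).unwrap_or('\u{FFFD}'));
                rest = &bad[1..];
            }
        }
    }
}
```

The key property is that already valid quotes such as U+201E pass through unchanged, which is exactly what the current replacement does not do.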

Quote from: Sesquipedalian on January 30, 2025, 11:36:10 AM
I'm fairly sure that the full scale Unicode normalization and sanitization we now perform will be more than enough to deal with that.

Inspired by Rust, where all strings (String type) are UTF-8 validated (at least, you need to go out of your way to construct an unvalidated String by explicitly calling an "unsafe" function), I'd suggest replacing all invalid characters with U+FFFD REPLACEMENT CHARACTER (�). Here is a function I once wrote that does that:

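use log::debug; // assumption: debug! comes from the `log` crate

// Repairs `buffer` in place so that it holds valid UTF-8, without copying it.
fn buffer_to_str_lossy(buffer: &mut Vec<u8>) -> &mut str {
    // Re-validate after each fix until the whole buffer is accepted.
    while let Err(e) = std::str::from_utf8(buffer) {
        let bad = e.valid_up_to(); // index of the first invalid byte
        if bad + 2 < buffer.len() {
            debug!("buffer_to_str_lossy: correcting invalid UTF-8 sequence ({})", e);
            // Overwrite three bytes with U+FFFD (EF BF BD). If the invalid
            // sequence is shorter than three bytes, the bytes after it are lost too.
            buffer[bad] = 0xef;
            buffer[bad + 1] = 0xbf;
            buffer[bad + 2] = 0xbd;
        }
        else {
            // Fewer than three bytes remain: cut off the invalid tail instead.
            debug!("buffer_to_str_lossy: truncating invalid UTF-8 at end of buffer ({})", e);
            buffer.truncate(bad);
        }
    }
    // SAFETY: the loop only exits once from_utf8 has accepted the buffer.
    unsafe { std::str::from_utf8_unchecked_mut(buffer) }
}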

Well... the hard part is actually recognizing the invalid sequences, which I left to the standard library. (Normally, I wouldn't even need this function, but the ones provided by std tend to clone the input buffer, which was a no go for my use case.) But I suppose PHP should have a function like this; UTF-8 validation is kind of a given in today's programming languages.
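For comparison, the std route is `String::from_utf8_lossy`, which also substitutes U+FFFD. It returns a `Cow<str>`: borrowed when the input is already valid, and a freshly allocated `String` when something had to be replaced. The sketch below makes that visible; `describe_lossy` is a hypothetical helper name.

```rust
use std::borrow::Cow;

/// Decodes `bytes` lossily and reports whether std had to allocate a new
/// string to do it. `describe_lossy` is a hypothetical name for this sketch.
fn describe_lossy(bytes: &[u8]) -> (String, bool) {
    match String::from_utf8_lossy(bytes) {
        // Input was valid UTF-8: std just borrowed it.
        Cow::Borrowed(s) => (s.to_string(), false),
        // Input needed repairs: std built a new String with U+FFFD inserted.
        Cow::Owned(s) => (s, true),
    }
}
```

With `b"abc"` the second element is `false`; with `[b'a', 0xFF, b'b']` the result is `"a\u{FFFD}b"` and `true`.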

Also here's a great stress test file full of invalid UTF-8 sequences:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

If you can post its contents to SMF without breaking anything, i.e. the post stored in DB will be legit UTF-8, you know the validation works properly.

Speaking of the DB, it also needs to be ensured that the tables use a UTF-8 character set and collation, because otherwise the DB might mangle UTF-8 sequences. I knew someone who installed SMF for someone of Japanese origin with a crappy ANSI collation; I'd guess ISO-8859-1, because it didn't even have proper support for Hungarian accented letters. At one point members wanted to open a Japanese learning topic, and it turned out pretty quickly that it didn't work. You knew it was the DB because you saw your posts properly in the preview, but when you actually submitted your post, your characters were gone. Meanwhile I was happily posting Chinese hanzi on my UTF-8 backed SMF forum.
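If my ISO-8859-1 guess is right, the failure is easy to predict: that character set only covers code points U+0000 to U+00FF. A tiny sketch of the check (the name `fits_latin1` is hypothetical, and this says nothing about how any particular DB engine behaves when it meets a character it cannot store):

```rust
/// True if every character of `s` has a code point in U+0000..=U+00FF,
/// i.e. the text could be stored in an ISO-8859-1 column without loss.
/// `fits_latin1` is a hypothetical helper name used only for this sketch.
fn fits_latin1(s: &str) -> bool {
    s.chars().all(|c| (c as u32) <= 0xFF)
}
```

French "café" fits, but the Hungarian ő (U+0151) and any Japanese text do not.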

If I were you, I'd drop support for any other DB collation than UTF-8.

Sesquipedalian

Thanks, MegaBrutal. We've already got that well covered, though. Feel free to look at the sanitizeChars() and normalize() methods in https://github.com/SimpleMachines/SMF/blob/release-3.0/Sources/Utils.php, as well as https://github.com/SimpleMachines/SMF/blob/release-3.0/Sources/Unicode/Utf8String.php and its attendant data files if you're curious. SMF 2.1 works similarly, except that 2.1 isn't OOP and thus the functions are scattered around Subs.php and Load.php.