Subjects truncated to max 20 chars when posting in Greek!

Started by spiros, November 08, 2004, 09:23:38 PM


spiros

I have just installed the forum, and when testing it I found that it limits the subject to 20 characters if the subject is in Greek. If it is in English, there is no such limitation!

Please help!
http://www.translatum.gr/forum/index.php

This is a sample Greek subject:
Αυτή είναι μία δοκιμαστική δημοσίευση...

which gets truncated to
Αυτή είναι μία δοκι&

see it at
http://www.translatum.gr/forum/index.php/board,11.0.html

spiros

I did find the solution finally... I changed the character set in the English language file and it now works OK. It is a very funny error though - I wish somebody could explain it to me!

[Unknown]

The facility that makes Greek work with the wrong character set requires much more space (character-wise) than regular English characters.  Using the correct character set is the "right" way to do it.
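
A minimal sketch of why the sample subject is cut where it is. It assumes two things that are not confirmed at this point in the thread: that the browser sends each Greek letter as a decimal numeric entity, and that the subject is cut at 100 characters of stored text. to_numeric_entities() is a hypothetical helper written for this illustration, not an SMF function, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of SMF): rewrite every non-ASCII character of
// a UTF-8 string as a decimal numeric entity, roughly what a browser does when
// the page's character set cannot represent the character.
function to_numeric_entities(string $utf8): string
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $bytes = array_values(unpack('C*', $m[0]));
        $count = count($bytes);
        // Drop the length-marker bits of the lead byte, then append the
        // low six bits of each continuation byte.
        $code = $bytes[0] & (0xFF >> ($count + 1));
        for ($i = 1; $i < $count; $i++) {
            $code = ($code << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $code . ';';
    }, $utf8);
}

$subject = 'Αυτή είναι μία δοκιμαστική δημοσίευση...';
$stored = to_numeric_entities($subject);

// Assumed limit of 100 characters, applied to the entity-encoded text.
$truncated = substr($stored, 0, 100);

// Each Greek letter costs six characters (for example "&#913;"), so the cut
// lands right after the "&" that starts the 17th letter.
echo $truncated, "\n";
echo html_entity_decode($truncated, ENT_QUOTES, 'UTF-8'), "\n";
```

Under those assumptions, 16 letters and 3 spaces use up 99 characters, so the 100th character is the stray "&" seen at the end of the truncated subject.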

-[Unknown]

geonahta

Quote from: afaton on November 08, 2004, 10:42:29 PM
I did find the solution finally... I changed the character set in the English language file and it now works OK. It is a very funny error though - I wish somebody could explain it to me!

Care to share a step-by-step description of the solution you found? This is a bug that I have faced as well.


[Unknown]

Again, it is not a bug.  You are exploiting functionality that has been carefully designed to let you do it wrong.  For example, just because <br> works in xhtml doesn't mean it's valid xhtml.  Nor does it mean it will work properly.

If you are posting Greek with the Western character set, you are doing it the wrong way.  It is only because of special code in SMF that this even *kinda* works.  The limitations on text length cannot be so easily resolved, when you are doing it this - again I note - WRONG way.

To do it the right way, in your index language file, you need to specify another character set.  For example, you might put "Big5" for Chinese.  This file is Themes/default/languages/index.yourlanguagename.php.  The character set is located near the very top of the file.
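
As a rough illustration of that edit, the line near the top of the index language file might look like the fragment below. The key name $txt['lang_character_set'] is an assumption about this SMF version, so check the actual file for the exact spelling. The value shown is the Greek character set reported to work in this topic.

```php
<?php
// Themes/default/languages/index.yourlanguagename.php (near the top).
// Assumed key name; verify it against your own copy of the file.
$txt['lang_character_set'] = 'windows-1253';
```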

-[Unknown]

spiros

Now the same thing happens to Greek text posted with UTF-8....  Only windows-1253 seems to work properly for Greek; however, in case of putting accented characters (like French) they come out as unparsed entities, i.e. the French character "ç" comes out as &#231; in Fran&#231;ais.

Surely if this forum uses western encoding and it copes with Greek somehow "Ελληνικά" (test Greek word) there should be some solution.

Any ideas?

[Unknown]

Indeed, there is a solution - but it means that limits are not imposed correctly.

Of course the correct solution is to use a French language package (using the French character set) and a Greek language package (using Greek.)

-[Unknown]

drf

Oh yes... changing the character set in the index language file to windows-1253, as mentioned, works well! :)

spiros

I guess you use neither of these packages on this board, yet my previous posting, which contained examples of both Greek and French characters, displayed OK. In other words, how is it possible to have correct display of Greek AND French characters on this board, i.e. to correctly parse entities in this case?

Quote from: [Unknown] on November 12, 2004, 12:00:16 PM
Indeed, there is a solution - but it means that limits are not imposed correctly.

Of course the correct solution is to use a French language package (using the French character set) and a Greek language package (using Greek.)

-[Unknown]

[Unknown]

Like I said, it fixes things automatically.  This can only be done when the wrong character set is used, but makes the limits (such as the subject) incorrect.

-[Unknown]

spiros

So let me see if I get this right (excuse my simple-mindedness): what you are saying is that if I change the board to, say, ISO-8859-1, then BOTH French AND Greek characters will be displayed correctly?

My problem is not having French and Greek in different pages - it is a translation forum and people ask about translations in various languages and French, English, Greek are mixed in the same post.

I am sure there must be some other solution like parsing the "strange characters" in a way that they display properly no matter the character set. I think it is just a matter of what editor is used. For example if I have a Greek character set and I insert in the html "&agrave;" or "&#224;" it will be displayed properly as an "a" with grave accent. Now, is it difficult for SMF to do the same thing when one pastes these sort of characters when default charset is Windows-1253 given that this is already the case with this board?

Quote from: [Unknown] on November 13, 2004, 07:45:16 PM
Like I said, it fixes things automatically.  This can only be done when the wrong character set is used, but makes the limits (such as the subject) incorrect.

-[Unknown]

[Unknown]

Yes, because they are sent by the browser incorrectly.  And this is caused by character set problems, which affects how the browser works.

For example, look at the Russian board.  You'll notice two different types of text - that which is Russian, and that which cannot be easily read (gibberish of accented characters.)  The Russian is being sent in the ISO-8859-1 character set.  SMF detects this, and fixes it just as you say - using those entities.  However, the gibberish is sent in the Russian character set.  Because it is sent in a different character set, but SMF is expecting ISO-8859-1, the text is WRONGLY INTERPRETED as it is sent WRONGLY.  This is caused by some users explicitly setting the character set (which is needed to view some sites properly, sadly.)  Luckily, in this case at least, the situation can be fixed for those using Russian - they can set it to interpret everything as Russian, and they will see their Russian properly.

This same type of problem is happening for your forum.  Since you are adamant, I hope you won't mind learning a bit about computers.  As I'm sure you know, every character is stored (by your computer and this forum) as a number.  They aren't usually expressed as numbers, but that's what they are.  The &#224; entity means "character #224".  The meaning of that, and here's the important part, differs between character sets.  In other words, in character set #1, 224 might be an a with a grave accent.  However, in a Japanese character set, it might be a specific kanji glyph.
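
A small sketch of that point, using the number 224 as a raw byte rather than as an entity. It assumes PHP's iconv extension is available and that it recognises the name WINDOWS-1253. The two character sets are chosen only because they come up in this topic.

```php
<?php
// The same number, 224 (0xE0), read as a single byte in two character sets.
$byte = chr(224);

$asWestern = iconv('ISO-8859-1', 'UTF-8', $byte);   // Western European
$asGreek   = iconv('WINDOWS-1253', 'UTF-8', $byte); // Greek

echo $asWestern, "\n"; // an "a" with a grave accent
echo $asGreek, "\n";   // a Greek letter (upsilon with dialytika and tonos)
```

The byte never changes. Only the table used to look it up does, which is why text typed under one character set turns into gibberish when read under another.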

Now, when everything is in the same character set, everything is fine.  Indeed, ISO-8859-1 is a good character set to use sometimes, because many characters (for example, that kanji glyph) are not in the character set at all... exactly.  Because of this, browsers send an extended entity character code - that is, &#12345; or similar.  Now, obviously, if you can only post 80 characters, and your subject is "&#12345;&#12345;&#12345;&#12345;&#12345;&#12345;&#12345;&#12345;&#12345;&#12345;"... well, you're going to have problems - that's already 80 characters!
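
The arithmetic in that example can be checked with a few lines of PHP. This is only an illustration of the counting, not SMF's own code, and the 80-character figure is the one used in the example above.

```php
<?php
// Ten characters that ISO-8859-1 cannot represent, as entities.
$subject = str_repeat('&#12345;', 10);

// What a length limit sees: the raw text, entities included.
$storedLength = strlen($subject);

// What the reader sees: decode the entities and count the characters.
$decoded = html_entity_decode($subject, ENT_QUOTES, 'UTF-8');
$visibleLength = preg_match_all('/./us', $decoded);

echo "stored: $storedLength, visible: $visibleLength\n"; // stored: 80, visible: 10
```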

I suppose the best solution to your problem would be to remove the limiting completely.  It's in Sources/Post.php, look for the following two comments:

// Make sure the subject isn't too long.
// At this point, we want to make sure the subject isn't too long.  Stripslashes first to avoid a trailing slash.

Remove the block of code immediately below each of them (two lines, for a total of three with the comment).  The length will then be limited to 255 characters instead, which at five characters per letter (100 / 20 = 5) should be about fifty-one letters (255 / 5 = 51).

-[Unknown]

spiros

Thank you for your reply; I do appreciate your effort to explain things.

I understand what you say about numeric entities (e.g. &#38;) being interpreted differently in different character sets.  However, character entities (e.g. &amp;) tend to be interpreted the same no matter the character set. Please correct me if I am wrong.

Even in the case of numeric entities they should be at least parsed as a character (albeit wrong if one sees the page with the wrong character coding) rather than not being parsed at all (that is to say maintaining a "&#192;" form).

Checking the actual code produced, I think this is the crux: the ampersand character of an entity is "expanded", i.e. instead of being output as, for example, "&#192;" it is output as "&amp;#192;". The "amp;" part is the expansion of the ampersand character, and it is what actually causes these results.

Now I checked this with my editor, Dreamweaver. When I paste the  character "&#192;" in design view it comes out as "&amp;#192;" in the code; whereas when I paste it in the actual code it comes out as it should be "&#192;"  and it is displayed correctly as an accented A.
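
The same expansion can be reproduced in PHP, assuming the forum escapes posted text with the standard htmlspecialchars() function. This is a sketch of the effect, not a copy of SMF's code.

```php
<?php
$pasted = '&#192;';

// Default behaviour: every "&" is escaped, including one that starts an entity.
$escaped = htmlspecialchars($pasted, ENT_QUOTES, 'UTF-8');

// With the double_encode argument set to false, existing entities are kept.
$kept = htmlspecialchars($pasted, ENT_QUOTES, 'UTF-8', false);

echo $escaped, "\n"; // &amp;#192;
echo $kept, "\n";    // &#192;
```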

To recapitulate, with my modest brains and lack of php knowledge I can envisage two solutions to this:

  • Perhaps a hack that would allow SMF to drop the amp; bit of the code when such a character is pasted, thus at least producing a character that once the page is changed to the correct character coding it would be displayed correctly
  • And taking it a step further (the previous being a prerequisite for this one) another hack that would convert numeric entities to character entities

In fact there is proof that the case does not rest entirely on the browser. I tried doing the same thing with the same parameters (Windows-1253) with Mambo's standard forum component (simpleboard) and both French and Greek displayed correctly (even in the subject line) in all cases apart from the hierarchical listing of threads.

You can see the results yourself at
http://www.paxos.tk/m/component/option,com_simpleboard/Itemid,42/func,view/id,7/catid,3/

As you see from the message posted here  "I used Windows-1253 and both Greek (παιδί) and French (Français plutôt) is displayed correctly".

You can even post a new topic yourself using both Greek and French and see that it works. I guess this is a strong enough proof that it is not all a matter of how they are "being sent by the browser ".

Quote from: [Unknown] on November 14, 2004, 04:01:37 AM
  In other words, in character set #1, 224 might be an a with a grave accent.  However, in a Japanese character set, it might be a specific kanji glyph.

[Unknown]

Your browser does not send named entities (a grave, etc.); it only sends numeric ones.

It is expanded only for some entities.  For example, look at the source of this:

漢語 - 汉语 - 한국어 - คนไทย

Those are all entities sent by my browser.  Numeric ones.  And, they're all parsed properly.  Shorter entities need not be, and it can cause confusion if they are.  There were problems with parsing the shorter entities, so we had to roll it back to not parsing them.

-[Unknown]

spiros

Thank you again for your reply!

I do not want to abuse your time but as you understand, given that I run a languages site, this is a very important point for me and my users (and as I can see from other posts in this forum it appears to be important for others as well).

Therefore, would it be possible to explain, if this is not too much hassle for you, what these problems were and which bit of code does one need to manipulate to roll it back to parsing the shorter entities, or, if possible, to parse some of these shorter entities?

[Unknown]

All in Sources/Post.php:

Look for:
$form_subject = preg_replace('~&amp;#(\d{4,5}|[3-9]\d{2,4}|2[6-9]\d);~', '&#$1;', $form_subject);

Replace with:
$form_subject = preg_replace('~&amp;#(\d{1,5});~', '&#$1;', $form_subject);

Look for:
$_POST['subject'] = preg_replace('~&amp;#(\d{4,5}|[3-9]\d{2,4}|2[6-9]\d);~', '&#$1;', $_POST['subject']);

Replace with:
$_POST['subject'] = preg_replace('~&amp;#(\d{1,5});~', '&#$1;', $_POST['subject']);

Look for:
$quote_mozilla = strtr(preg_replace('~&amp;#(\d{4,5}|[3-9]\d{2,4}|2[6-9]\d);~', '&#$1;', htmlspecialchars($quote)), array('&quot;' => '"'));

Replace with:
$quote_mozilla = strtr(preg_replace('~&amp;#(\d{1,5});~', '&#$1;', htmlspecialchars($quote)), array('&quot;' => '"'));

And in Subs-Post.php:

Look for:
$message = preg_replace('~&amp;#(\d{4,5}|[3-9]\d{2,4}|2[6-9]\d);~', '&#$1;', $message);

Replace with:
$message = preg_replace('~&amp;#(\d{1,5});~', '&#$1;', $message);

I've already attempted to fix the other problems... so, you can try this if you'd like and you may not run into them now.
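
To see what the change does, here is a small stand-alone comparison of the two patterns. The input string is made up for illustration. The patterns are the ones from the lines above.

```php
<?php
// Original pattern: only restores entities numbered roughly 260 and above.
$original = '~&amp;#(\d{4,5}|[3-9]\d{2,4}|2[6-9]\d);~';
// Relaxed pattern: restores any entity with one to five digits.
$relaxed = '~&amp;#(\d{1,5});~';

// Escaped input: a French c-cedilla (231) and a Greek capital alpha (913).
$input = 'Fran&amp;#231;ais &amp;#913;';

$before = preg_replace($original, '&#$1;', $input);
$after = preg_replace($relaxed, '&#$1;', $input);

echo $before, "\n"; // Fran&amp;#231;ais &#913;
echo $after, "\n";  // Fran&#231;ais &#913;
```

With the original pattern the Greek entity is restored but the French one stays escaped, which matches the unparsed accented characters reported earlier in this topic.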

-[Unknown]

spiros

Thank you!

I tried it on a couple of posts and it works like magic!

I am really grateful for your help. Perhaps this hack should be integrated in the final release or be made available to other people who need this sort of functionality - the language boards would be a good place to start.

Thanks again...

andrea

Moved this topic into language support area since it might be instructive for other languages as well.

vkot

Quote from: spiros on November 14, 2004, 12:36:44 PM
.... both French and Greek displayed correctly (even in the subject line) in all cases apart from the hierarchical listing of threads.

This is another sad story and has to do with the locale setting in your MySQL installation (or your host's). If you are on a shared host outside Greece, there's nothing you can do.


Quote from: [Unknown] on November 14, 2004, 06:09:09 PM
漢語 - 汉语 - 한국어 - คนไทย

Those are all entities sent by my browser. Numeric ones. And, they're all parsed properly.

Nope. I get  ?? - ?? - ??? - (and some hebrew(?))

(I usually use Firefox 1.0. In I.E. it shows squares instead of question marks.)

[Unknown]

Quote from: vkot on December 29, 2004, 06:11:31 AM
Quote from: [Unknown] on November 14, 2004, 06:09:09 PM
漢語 - 汉语 - 한국어 - คนไทย

Those are all entities sent by my browser. Numeric ones. And, they're all parsed properly.

Nope. I get  ?? - ?? - ??? - (and some hebrew(?))

They are parsed properly.  You have not installed support for East Asian and other languages in Regional Settings, which is why you do not see them.

-[Unknown]
