Charset conflicts

This is an archive of a topic from NESdev BBS, taken in mid-October 2019 before a server upgrade.
Charset conflicts
by on (#38)
Whenever I attempt to fetch a document from the forum, Apache is specifying the character set to be ISO-8859-1, causing Mozilla and Firefox to IGNORE the charset defined in the document's <meta http-equiv...> tag. As a result, ALL forum languages are being displayed as ISO-8859-1 rather than the character sets they specify.
If I recall correctly, the problem is within PHP's configuration - it has a way to specify a default character set, which must be DISABLED in order for this to work properly.


[EDIT]
An alternate fix for this problem is as follows:
1. Open includes/page_header.php
2. At the end, locate the line `$template->pparse('overall_header');`
3. Before that line, insert `header ('Content-Type: text/html; charset=' . $lang['ENCODING']);`
4. Open admin/page_header.php
5. At the end, locate `$template->pparse('header');`
6. Before, insert `header ('Content-Type: text/html; charset=' . $lang['ENCODING']);`
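
To see why sending the header fixes things, here is a rough Python sketch of the precedence rule described above: a charset in the HTTP Content-Type header wins over the one in the document's <meta> tag. The function name `effective_charset` is made up for illustration, and real browsers do more than this (BOM checks, sniffing), so treat it as a simplification rather than what Mozilla actually runs.

```python
import re

def effective_charset(content_type_header, html):
    """Return the charset a browser would likely use: HTTP header first, then <meta>."""
    if content_type_header:
        m = re.search(r'charset=([\w-]+)', content_type_header, re.I)
        if m:
            # A charset in the HTTP header overrides anything in the document.
            return m.group(1).lower()
    # No charset in the header: fall back to the <meta http-equiv...> tag.
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.I)
    return m.group(1).lower() if m else None

page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(effective_charset("text/html; charset=ISO-8859-1", page))
print(effective_charset("text/html", page))
```

With Apache's default header the page is treated as ISO-8859-1 no matter what the <meta> tag says; once the header carries no charset (or the right one), the document's own declaration takes effect.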

by on (#39)
It's early in the morning, so pardon me if I'm off my rocker. *still rubbing his eyes*

I don't think this kind of change is possible; PHP is run as a CGI, not an Apache module.

Meaning, there's no way (I know of) to set default_charset to "" (to disable PHP sending a charset in the HTTP Content-Type header, as mentioned here) outside of using "php_value" or "php_admin_value" inside of Apache via .htaccess or the httpd.conf.

I would think using ini_set() inside of PHP would come too late in the process (default HTTP headers already being sent at that point), but I really don't know just how much of a "pre"-parser PHP is, if you get my drift. Otherwise, I'd probably just use ini_set() inside of phpBB's extension.inc.
Re: Charset conflicts
by on (#40)
Quietust wrote:
An alternate fix for this problem is as follows:
1. Open includes/page_header.php
2. At the end, locate the line `$template->pparse('overall_header');`
3. Before that line, insert `header ('Content-Type: text/html; charset=' . $lang['ENCODING']);`
4. Open admin/page_header.php
5. At the end, locate `$template->pparse('header');`
6. Before, insert `header ('Content-Type: text/html; charset=' . $lang['ENCODING']);`


I've deployed this -- so far it looks legit.

There sure seems to be an awful lot of oversights in this forum software...
Re: Charset conflicts
by on (#41)
koitsu wrote:
I've deployed this -- so far it looks legit.

And it looks good here, too. Firefox is finally displaying the Japanese language pack in UTF-8 like it's supposed to.

Quote:
There sure seems to be an awful lot of oversights in this forum software...

The phpBB team actually knows about this problem already; they simply decided to defer fixing it until phpBB 2.2 (which is still pretty far from release).
Re: Charset conflicts
by on (#43)
Good deal. Also solved the Email issue, and tinkered with some of the style colours a bit... I'm really going to need tepples or someone else to go through those and figure out what looks good for the boards.

Of course, this is also just a "test run" -- if the back-end db gets horked or we break something in the process, it's not a big deal. Entirely different if and when the board gets moved or migrated... :)

Wiki stuff is coming along well (blargg dropped me an Email), I see!

by on (#44)
That reply should've been from me, koitsu. Damn login things... ;D

by on (#2011)
Er, it looks like this problem has reappeared...

by on (#2012)
Sounds like my upgrade of MySQL 4.0.8 to 4.0.11 somehow "broke" the default table settings. I'll fix it after I take care of some stuff first.

Pretty weird that it'd break though. Very odd...

-- koitsu

by on (#2013)
That should be 4.1.8 to 4.1.11, sorry. God I'm going brain-dead...

by on (#2014)
Looks like it's not related to MySQL this time, but rather the server headers.

I see a lot of the PHP scripts for the board here updated on May 7th (today) about 5 hours ago, so my guess is that memblers upgraded it or something along those lines.

The problem looks like the Content-Type being sent in the HTML is iso-8859-1 regardless of what the actual content is, so I'll have to go fix all of that up... Heh. :-)

by on (#2015)
Yep, found that some of the language/ files had been changed (particularly lang_english's stuff) back to iso-8859-1. I set it to utf-8. The other languages were not touched (still utf-8).

In addition, the includes/ and admin/ PHP files mentioned in this thread had also been updated and lacked the set-lang-encoding code.

According to my peers, things are working now. View Source shows the right thing, so I believe it's working okay.

Let me know. :-)

-- koi

by on (#2016)
Yep, I updated phpBB today by uploading the changed files. Sorry if I broke something, heheh. Maybe it's safer to use the patch, but I don't know how.

by on (#2018)
Memblers wrote:
Yep, I updated phpBB today by uploading the changed files. Sorry if I broke something, heheh. Maybe it's safer to use the patch, but I don't know how.


Don't worry about it, man. :-) Q caught it, so it's all good.

Maybe I should write up some stupid little doc thing with some perl+sed crap which will recursively go through and fix all of it after an upgrade. *lazy zzzz*
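
A rough idea of what such a fix-up could look like, sketched in Python rather than perl+sed. It only works on a string holding the file contents (reading and writing the files is left to the caller), and the names `HEADER_LINE` and `reapply_charset_fix` are invented for this sketch. The inserted line and the `pparse` calls are the ones from the fix earlier in this thread.

```python
HEADER_LINE = "header ('Content-Type: text/html; charset=' . $lang['ENCODING']);"

def reapply_charset_fix(source, pparse_call):
    """Insert HEADER_LINE before the last line containing pparse_call, at most once."""
    if HEADER_LINE in source:
        return source  # already patched, nothing to do
    lines = source.split("\n")
    # Search from the end, since the pparse call sits near the bottom of the file.
    for i in range(len(lines) - 1, -1, -1):
        if pparse_call in lines[i]:
            # Reuse the indentation of the pparse line for the inserted line.
            indent = lines[i][: len(lines[i]) - len(lines[i].lstrip())]
            lines.insert(i, indent + HEADER_LINE)
            return "\n".join(lines)
    raise ValueError("pparse call not found: " + pparse_call)
```

It would be called once with the contents of includes/page_header.php and `$template->pparse('overall_header');`, and once with admin/page_header.php and `$template->pparse('header');`. Running it twice on the same file is harmless.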

-- koi
Thanks
by on (#4801)
Damn, I never thought of looking at it that way, thanks.

by on (#6691)
It's time to apply this fix again...

Dammit, Memblers, next time use the patch file!

by on (#7466)
The English language pack has been incorrectly reverted to use the character set ISO-8859-1; as a result, posts made by people using other language packs will appear garbled (where non-English characters are being used).

by on (#7781)
This one's for Quietust:

Good fucking lord MySQL has way too many places to change character set encoding / translation.

I think there's still some leftover crap dealing with UTF-8 migration. The database and tables were converted to UTF-8 long ago like I mentioned, but I believe the character set encoding used between the MySQL client (i.e. the webserver) and the MySQL server (on another box) is still latin1.

According to one of the user comments on this MySQL documentation page, the client/server encoding can screw up UTF-8 as well:

http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html

While using the nesdev_phpbb database, SHOW VARIABLES returns the following (I'm using the standard mysql client on the webserver, which I don't have set to use utf8):

Code:
| character_set_client            | latin1                                                     |
| character_set_connection        | latin1                                                     |
| character_set_database          | utf8                                                       |
| character_set_results           | latin1                                                     |
| character_set_server            | latin1                                                     |
| character_set_system            | utf8                                                       |


I think part of the solution is to add these queries to the MySQL connection code in phpbb. I think the CHARACTER_SET one is already being used (I think I added this myself), but the NAMES one I don't think I'm using:

Code:
SET NAMES utf8;
SET CHARACTER_SET utf8;

by on (#7782)
Strike that.

Looks like db/mysql4.php got overwritten to deal with some changes back in November (Memblers mentioned this). mysql4.php was one of the scripts I had to change to explicitly request a different character set when initiating a mysql connection.

I made a backup of the original, renaming it to mysql4.php__not_utf8_do_not_use:

Code:
-rw-r--r--  1 memblers  users  6066 Nov 16 18:56 mysql.php
-rw-r--r--  1 memblers  users  6486 Nov 16 18:56 mysql4.php
-rw-------  1 memblers  users  6482 Jul 17  2004 mysql4.php__not_utf8_do_not_use


However, a diff between the original non-utf8 and the currently used one (mysql4.php) shows absolutely no character set encoding requests or anything (meaning my changes got wiped out):

Code:
--- mysql4.php  Wed Nov 16 18:56:14 2005
+++ mysql4.php__not_utf8_do_not_use     Sat Jul 17 08:58:20 2004
@@ -6,7 +6,7 @@
  *   copyright            : (C) 2001 The phpBB Group
  *   email                : support@phpbb.com
  *
- *   $Id: mysql4.php,v 1.5.2.1 2005/09/18 16:17:20 acydburn Exp $
+ *   $Id: mysql4.php,v 1.5 2002/04/02 21:13:47 the_systech Exp $
  *
  ***************************************************************************/

@@ -271,7 +271,7 @@
                                {
                                        if( $this->rowset[$query_id] )
                                        {
-                                               $result = $this->rowset[$query_id][0][$field];
+                                               $result = $this->rowset[$query_id][$field];
                                        }
                                        else if( $this->row[$query_id] )
                                        {


So, my guess is that this is what's causing the problem.

I'll edit mysql4.php momentarily to put the UTF-8 stuff back in. Argh. :-)

by on (#7784)
Okay, so here's the diff (in the case that we go through this again in the future):

Code:
--- mysql4.php__not_utf8_do_not_use     Wed Nov 16 18:56:13 2005
+++ mysql4.php  Wed Dec 28 17:14:47 2005
@@ -61,6 +61,13 @@
                                }
                        }

+                       /**
+                        * Custom hack for utf8 support; we need to tell the MySQL server, before
+                        * any data is sent/received, to do everything using utf8.
+                        */
+                       @mysql_query("SET NAMES 'utf8'");
+                       @mysql_query("SET CHARACTER_SET utf8");
+
                        return $this->db_connect_id;
                }
                else


This has one major drawback in regards to the board right now though:

Many posts between November and now were done using UTF-8 here on the forums, but were essentially submitted into the MySQL database using latin1.

The above change (made today) should fix this and revert things to how they should have been, but definitely breaks non-Latin character posts over the past 2 months.

If I remove the SET NAMES clause, the posts between November and present appear to work correctly. Sounds great, I know, but the problem is that if I remove that clause, the client and server communicate everything in latin1 (even though the actual encoding of the content is utf8). I believe there's translation that goes on (i.e. utf8 characters being converted into latin1 for the MySQL connection, then being converted back into utf8 when the actual data is stored in the database).
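
One plausible model of what is going on, sketched in Python. This is an assumption about the mechanism, not something confirmed from MySQL itself: the three function names are invented, and Python's "latin-1" codec only approximates MySQL's latin1. The idea is that UTF-8 bytes sent over a latin1 connection get treated as latin1 characters and re-encoded as utf8 for storage, so they only look right when read back over the same kind of connection.

```python
def store_over_latin1_connection(text):
    """Browser submits UTF-8 bytes; the server assumes latin1 and converts to utf8 for storage."""
    sent = text.encode("utf-8")
    return sent.decode("latin-1").encode("utf-8")  # stored bytes, now double-encoded

def read_over_latin1_connection(stored):
    """Server converts stored utf8 back to latin1; the browser reads those bytes as UTF-8."""
    return stored.decode("utf-8").encode("latin-1").decode("utf-8")

def read_over_utf8_connection(stored):
    """With SET NAMES 'utf8' the stored bytes come back without conversion."""
    return stored.decode("utf-8")

original = "\u65e5\u672c\u8a9e"  # a short Japanese string
stored = store_over_latin1_connection(original)
print(read_over_latin1_connection(stored) == original)  # round trip survives
print(read_over_utf8_connection(stored) == original)    # garbled once SET NAMES is on
```

Under this model the two conversions cancel out as long as both directions use latin1, which would explain why the recent posts look fine without SET NAMES and break with it.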

SET NAMES 'utf8' basically does this:

Code:
mysql> SET character_set_client = utf8;
mysql> SET character_set_results = utf8;
mysql> SET character_set_connection = utf8;


I'll leave this one up to Quietust to decide.

by on (#7787)
Unfortunately, this has wiped out nearly every single post in the (now reasonably active) FCDev forum. Do you think you could turn this off temporarily so the existing posts can be saved, and then turn it back on so the posts can be fixed?


Personally, I think it's best to just leave the database with no encoding at all - just let it store the data as if it were binary. Let the software running on the site (phpBB, the wiki, etc.) and the users (namely, their browsers) decide how to interpret it.

by on (#7788)
Quietust wrote:
Unfortunately, this has wiped out nearly every single post in the (now reasonably active) FCDev forum. Do you think you could turn this off temporarily so the existing posts can be saved, and then turn it back on so the posts can be fixed?


Sure thing, I'll revert the SET NAMES entry in a moment.

I wish I knew of a way to "convert" the existing data into what would work with SET NAMES, you know what I'm saying? That way I could convert all of the older posts to what presently works, without breaking things. Sadly I don't know of a way to do that. Maybe ALTER can do it, but I'm still not sure how to accomplish that.
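
For what it's worth, if the older posts are simply UTF-8 that got misread as latin1 once (an assumption, not something verified against the actual tables), the text itself can be repaired outside the database. A minimal Python sketch, with the hypothetical name `repair_double_encoded`:

```python
def repair_double_encoded(text):
    """Undo one round of 'UTF-8 bytes misread as latin1'; return text unchanged if that does not apply."""
    try:
        # Turn the bogus latin1 characters back into bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1 (already correct),
        # or the bytes are not valid UTF-8 (plain latin1 text): leave it alone.
        return text
```

This is a heuristic: a legitimate latin1 string that happens to form a valid UTF-8 byte sequence would be "repaired" wrongly, so the results would still need checking by someone who can read the posts.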

Quote:
Personally, I think it's best to just leave the database with no encoding at all - just let it store the data as if it were binary. Let the software running on the site (phpBB, the wiki, etc.) and the users (namely, their browsers) decide how to interpret it.


It doesn't seem to work that way. There's a lot of pieces to the puzzle:

* Database character encoding
* Table character encoding
* Column character encoding (not used in this case though)
* Client-server character encoding
* Conversion between all of the above (i.e. utf8 converted to latin1, etc.)
* Browser character encoding
* The character encoding type specified in the HTTP header

What a mess.

At this point, anything that relies on MySQL requires that you set the character encoding. All of this was introduced in 4.1, and now that 5.0 is the official stable release, I expect to see it even more prominently used.

My personal view is that people (i.e. forum software authors) need to stop mucking around with "support for multiple languages" (by having absurd "packages" for different languages/character sets, etc.). They need to just use utf8 and solve the problem in one swoop.

by on (#7789)
I've gone ahead and reverted the SET NAMES addition, but kept SET CHARACTER_SET.

The posts on the FCdev forum look correct to me. I don't have Japanese installed, so someone will have to check to be sure.

I can see the encoding difference between SET NAMES vs. without SET NAMES, and without SET NAMES the characters look to be the same as they were when we lacked SET CHARACTER_SET.

Might want to make some test posts in the Test forum to be 100% sure.

by on (#7790)
Ugh, and now I've just read we'll (possibly) be going through even more pain when I get around to upgrading to MySQL 5.0...

http://dev.mysql.com/doc/refman/4.1/en/charset-upgrading.html

But there's hope in some way. In regards to my idea of converting the presently-existing posts from the older format to what presently is correct/works, this might help:

http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html

by on (#7793)
koitsu wrote:
My personal view is that people (i.e. forum software authors) need to stop mucking around with "support for multiple languages" (by having absurd "packages" for different languages/character sets, etc.). They need to just use utf8 and solve the problem in one swoop.

Though there isn't really much of a reason for "packages" for languages in which text is written, there is still a reason for "packages" for languages in which to display the interface (e.g. "Post a reply").

by on (#7795)
I've grabbed the text for all of the Japanese posts in the FCdev forum (as well as one in nesemdev), so you can re-add the SET NAMES if you so desire.

Fixing the posts may be a bit troublesome, though, since they'll have to be done manually; fortunately, there are only a dozen topics that need to be fixed. Temporarily commenting out the "$edited_sql = ..." line in functions_post.php (should be line 267) will allow the posts to be edited (by a forum moderator) without inserting/updating the "Last edited [date], [N] edits total" at the bottom of each post.

by on (#7798)
Quietust wrote:
I've grabbed the text for all of the Japanese posts in the FCdev forum (as well as one in nesemdev), so you can re-add the SET NAMES if you so desire.


Re-enabled.

Quote:
Fixing the posts may be a bit troublesome, though, since they'll have to be done manually; fortunately, there are only a dozen topics that need to be fixed. Temporarily commenting out the "$edited_sql = ..." line in functions_post.php (should be line 267) will allow the posts to be edited (by a forum moderator) without inserting/updating the "Last edited [date], [N] edits total" at the bottom of each post.


Makes sense. I should add you as a forum moderator and let you take care of it (otherwise I can get to it later tonight); let me know. :-)

by on (#7799)
koitsu wrote:
I should add you as a forum moderator and let you take care of it


That'll probably work better - since you said you don't have Japanese fonts installed, you might not be able to tell if they were fixed properly.

by on (#7800)
tepples wrote:
Though there isn't really much of a reason for "packages" for languages in which text is written, there is still a reason for "packages" for languages in which to display the interface (e.g. "Post a reply").


True. I didn't think about that.

Though, admittedly, maintaining a multi-language interface is something everyone's already done (by this I mean the code/framework + all the necessary files are there). The problem is that all of the data is written in a non-Unicode character set, so it wouldn't end up displaying right under utf8 anyways.

Apache has modules for handling stuff like this: mod_negotiation (and the MultiViews option). Based on browser preferences and some HTTP headers, you can determine what language someone prefers. Compare this to, say, a forum which has a dropdown for what "language" they want the interface in.

Not everyone uses Apache, but this would provide what people want for the most part.

mod_negotiation: http://httpd.apache.org/docs/2.0/mod/mod_negotiation.html
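
The general idea of picking a language from browser preferences can be sketched in a few lines of Python. This is not mod_negotiation's actual algorithm, just a simplified reading of the Accept-Language header; the function name `pick_language` and its parameters are made up for illustration.

```python
def pick_language(accept_language, available, default="en"):
    """Choose the best available language from an Accept-Language header value."""
    prefs = []
    for part in accept_language.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not wanted"
        prefs.append((q, tag.strip().lower()))
    # Highest q first; the stable sort keeps header order among equal q-values.
    prefs.sort(key=lambda p: p[0], reverse=True)
    for q, tag in prefs:
        if q <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "en-us" falls back to "en"
        if primary in available:
            return primary
    return default

print(pick_language("ja,en-US;q=0.8,en;q=0.5", {"en", "ja", "ru"}))
```

A forum could use something like this to pick the interface language by default and keep the dropdown only as an override.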

by on (#7801)
The biggest oxymoron in the phpBB Language Pack system is that while each language specifies a different character set, the forum allows multiple language packs to be installed at once, which results in horrible conflicts when messages get posted.

The obvious solutions would be to either restrict the forum to only allow one language pack at once (which is rather stupid) OR force them all to use the same character set (i.e. Unicode, preferably UTF-8). The phpBB Team is aware of this problem, and it looks like they might fix it in the next major release, but I'm not holding my breath.

by on (#7802)
Quietust wrote:
That'll probably work better - since you said you don't have Japanese fonts installed, you might not be able to tell if they were fixed properly.


Done.

I'll coordinate with you on IRC (Freenode) in regards to the post editing stuff and the like.

by on (#7803)
All of the posts in FCdev have been fixed (along with the one in NESemdev) - the only remaining 'broken' topic is the Russian one in the General Discussion forum (which I forgot to grab a copy of beforehand), though that may have just been a spam topic all along (considering the latest reply linked to a Russian forum post full of pictures of sex toys).

You may remove the "post-has-been-edited" message suppression (and restore me to a standard user) at your own convenience - I'm done here.

by on (#8817)
Quietust wrote:
You may remove the "post-has-been-edited" message suppression at your own convenience.


Just reiterating, since you don't appear to have changed this back yet.

by on (#8818)
Quietust wrote:
Quietust wrote:
You may remove the "post-has-been-edited" message suppression at your own convenience.


Just reiterating, since you don't appear to have changed this back yet.


I never changed it in the first place, which is why I was amazed your changes to all of the posts worked.