I tried this, but the problem is still the same. Within Climate Prediction the name is correct, but within Boincstats it is not.
It is still written in the wrong way.
You can see it here: http://www.boincstats.com/stats/team_graph.php?pr=cpdn&id=5624
But it should belong to the right team:
http://www.boincstats.com/stats/boinc_team_graph.php?pr=bo&id=27493
For us it does not matter how the team name is written within Climate Prediction. What matters to us is that the Climate Prediction results belong to the right team within Boincstats.
I'll see if I can get John and/or Richard to advise.
Hmm. Looks like a misinterpretation of the Unicode characters between the Windows fonts and international universal fonts. There is often more than one way of getting the same symbol, even in the same font set. They look the same to the viewer, but the underlying string of bytes representing the characters is different. Often, as in this case, there are two bytes for a single non-Latin character, the first byte indicating where to start counting in the table of more than 256 characters. Standard ASCII characters are numbered from 0 to 127 (the printable ones from 32 to 126), with extended ones up to 255. Some of these extended ones are accented letters, but these could be repeated later in the table, often more than once if they are used in different language sets; this is probably where the problem arises.
I have had this problem, especially with German characters, but I get round it by always using the same method for any string input, output or matching, and by ensuring that Windows is always using the same international settings and fonts for a particular country-based database. Obviously you can't do this here, as you don't know what international setting is used for the string matching or the string input/output. The real problem is that Microsoft is American English based at its core. For British users this only means spelling some words like colour and programme differently, but for languages with accented or even different characters this can be much worse.
Sorry if this isn't much help, but it may at least explain the problem. This might help, or try googling unicode for more information.
You're sending UTF-8 (a Unicode encoding) to the Boincstats site, but that site is expecting (probably) CP1252 (possibly ISO 8859-1). Solution: EITHER whatever is sending the data to Boincstats needs to send what that site expects (try CP1252), OR (better) Boincstats should support Unicode (UTF-8).
There's a Japanese term for this... mojibake :)
--Richard
<edit> Just verified. Boincstats is serving the page up encoded as ISO8859-1 and stuffing UTF-8 into it. A bit naughty! They should change the content="text/html; charset=iso-8859-1"> at the top of their served pages to content="text/html; charset=utf-8">
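As a rough sketch of the first option (sending what the receiving site expects), the transcoding step could look something like this in Python. The helper name `to_cp1252` and the sample team name are made up for illustration; nothing here is taken from the actual BOINC or Boincstats code, and it assumes the incoming data really is valid UTF-8.

```python
def to_cp1252(utf8_bytes: bytes) -> bytes:
    """Re-encode UTF-8 bytes as CP1252 (hypothetical helper)."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" substitutes '?' for any character CP1252 cannot represent
    return text.encode("cp1252", errors="replace")


name_utf8 = "Bär".encode("utf-8")   # made-up sample team name
name_cp1252 = to_cp1252(name_utf8)

print(name_utf8)     # b'B\xc3\xa4r'  (two bytes for the ä)
print(name_cp1252)   # b'B\xe4r'      (one byte for the ä)
```

The drawback is visible in the `errors="replace"` line: any character outside CP1252 is lost, which is why having the site serve UTF-8 is the better option.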
Sorry about the delay in reply, had to get my daughter to bed.
I wouldn't think so. It's definitely UTF-8 that's appearing on the boincstats pages and it looks like the correct (2 byte) UTF-8 sequences are being used. Unfortunately the page is being served as an ISO8859-1 page and as a result the 2 byte sequence is not being interpreted as one character, but as two.
I notice that the climateprediction page for that team is also an 8859-1 encoded page, but in this case the correct code values are being used. 'ä' is encoded as the single byte 0xE4 in 8859-1 and this is being used on the cpdn pages.
I don't know how the team name is getting propagated to the boincstats servers, but something along the way has translated it to UTF-8. The encoding for 'ä' in UTF-8 is the 2-byte sequence 0xC3 0xA4. However, if you read that as 8859-1, then instead of translating that sequence into the one character U+00E4 (ä) it gets viewed as the two 8859-1 characters 0xC3 and 0xA4. 0xC3 is a Ã, 0xA4 is a ¤.
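The byte values above can be checked with a few lines of Python. This is only a small demonstration of the misreading, not a reconstruction of what the servers actually do; the variable names are arbitrary.

```python
text = "ä"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")
print(latin1_bytes)   # b'\xe4'      one byte in 8859-1
print(utf8_bytes)     # b'\xc3\xa4'  two bytes in UTF-8

# Reading the UTF-8 bytes as if they were 8859-1 yields two characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)        # Ã¤

# The same happens for ü (UTF-8 bytes 0xC3 0xBC).
misread_u = "ü".encode("utf-8").decode("iso-8859-1")
print(misread_u)      # Ã¼

# As long as no bytes were lost, the misreading is reversible.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)       # ä
```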
To fix the problem you need to make sure that whatever is sending the team names to boincstats is doing so in an encoding that boincstats understands. There's nothing at all wrong with UTF-8, and my preferred solution is for boincstats to use UTF-8 in its webpages. Not only would this fix this problem, it'd also allow teams (and names) to use any character, such as Japanese or Korean characters. That is quite impossible in 8859-1: there are only 256 characters in that character set, as opposed to about 1.1 million in Unicode (although I think only about 150,000 are currently in use).
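To illustrate the point about character coverage, here is a minimal Python check. The helper `fits_in_latin1` and both sample names are invented for the example.

```python
import sys


def fits_in_latin1(name: str) -> bool:
    """Return True if every character of name exists in ISO 8859-1."""
    try:
        name.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


german = "Bär"            # made-up sample names
japanese = "日本チーム"

print(fits_in_latin1(german))      # True
print(fits_in_latin1(japanese))    # False

# UTF-8 can represent both names and round-trips them unchanged.
print(japanese.encode("utf-8").decode("utf-8") == japanese)   # True

# Size of the Unicode code space (code points 0 to 0x10FFFF).
print(sys.maxunicode + 1)          # 1114112
```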
--Richard
Les Bayliss (Forum moderator):
There are other stats sites, so I guess that a check on how they're handling this is also needed. It may just be BOINCstats.
What I don't understand is why the combinations that the member types for ä and ü, which must be two different combinations, both translate to the same string ï¿½. This looks like a list of 3 items.
Richard Rodway and I submitted this problem as boinc Trac ticket #57
We thought this was a boinc problem rather than a defect in the cpdn (and other project) software. I think I'd better ask Milo in Oxford to have a look at this thread.
Fossilised reply, but just for interest. That encoded sequence in the XML is the UTF-8 encoding of the Unicode code point U+FFFD, which is the 'replacement' character. It's used when you are trying to convert something to Unicode and that conversion failed. So in other words, whatever is generating that XML is trying to translate the a umlaut and u umlaut to UTF-8 and failing (maybe because it's assuming an ASCII source or something?).
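A short Python sketch of how two different letters can both collapse into the same replacement sequence. It assumes, as guessed above, that the generator treats its input as ASCII; which encoding it really assumes is not known.

```python
REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER

# The UTF-8 encoding of U+FFFD is three bytes.
print(REPLACEMENT.encode("utf-8"))   # b'\xef\xbf\xbd'

# ä and ü as single 8859-1 bytes, as stored on the cpdn side.
a_umlaut = "ä".encode("iso-8859-1")  # b'\xe4'
u_umlaut = "ü".encode("iso-8859-1")  # b'\xfc'

# Decoding either byte as ASCII fails; with errors="replace" both
# become the same character, so the original letter is lost.
a_decoded = a_umlaut.decode("ascii", errors="replace")
u_decoded = u_umlaut.decode("ascii", errors="replace")
print(a_decoded == u_decoded == REPLACEMENT)   # True

# Viewed as 8859-1, the three UTF-8 bytes show up as three characters.
as_latin1 = REPLACEMENT.encode("utf-8").decode("iso-8859-1")
print(as_latin1)   # ï¿½
```

This would account both for ä and ü turning into the same string and for that string looking like three separate items.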
However, this doesn't explain what actually ended up in BOINCstats. Somehow the 'real' data got through to it, otherwise we'd have seen ï¿½ in the team name on the pages, not Ã¤ (for the a umlaut).
As a matter of interest I had a look through some Japanese team names. Most just use English names (probably because they worked out that Japanese names didn't work :)). I didn't find any with correctly displaying Japanese names, but I did find some with names displaying the same symptoms as we see here (UTF-8 displayed as 8859-1).
All of this is too late to be of any interest, I suspect. I've been way, way too busy recently.
Well, it looks as if we wasted our time submitting the problem to the wrong people/place. And Milo didn't get an answer to the query he added either. Here's the fate of our ticket - wontfix.