character encoding frustration

aeiou
18 Sep 2009, 3:42 AM
Hi all,
I know there are quite a few threads on this topic in this forum, and I have searched through them, but none of the many answers I found works for me, and I don't know why (the answers themselves seem well reasoned).

We are building a pan-European web application (Java-based) that has to support Swedish, German and Spanish, among other languages. The problem is that I get bad characters in JavaScript snippets, as described below.

My code in the servlet class, just before sending the response back to the client, is:


PrintWriter out = response.getWriter();
response.setHeader("Content-type", "application/json; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
System.out.println(json);
out.print (json);
json is just a Java String. I print it to the Tomcat logs, and the message looks fine there. The response Firebug shows in its Response tab for that request is:


{"groups":
[{"id":550, "name":"Hospital Gregorio Marañón","cod":"10"},
{"id":401, "name":"Hospital Ramón y Cajal","cod":"09"}]
}
which looks fine as well. But when I try to do something like:


var myStore = grpCombo.store;
myStore.addListener('load', function (thisStore, records, options) {
    Ext.each(records, function (rec, i, allRecs) {
        console.info(rec.data);
        console.info(escape(rec.data.name));
        console.info(decodeURIComponent(rec.data.name));
        // rec.data.name = decodeURIComponent(rec.data.name);
    });
    alert(records[1].data.name + ":" + decodeURIComponent(records[1].data.name));
    utils.raiseMsg(escape(records[1].data.name) + ":" + unescape(records[1].data.name));
});
everything looks wrong, like:

Hospital%20Gregorio%20Mara%uFFFD%uFFFDn:Hospital Gregorio Mara??n

I have tried almost every combination I found here in the forum, on both server and client :((.
Ext.util.Format.htmlEncode/htmlDecode didn't work, and neither did templating like:


tpl: '<tpl for="."><div ext:qtip="{name:htmlEncode}" class="x-combo-list-item">{name:htmlEncode}</div></tpl>',
I guess I am forgetting something, but I don't know what.
Could anyone provide an idea or hint to lead me out of the darkness, please?

Thanks in advance

w i l l y

Condor
18 Sep 2009, 4:17 AM
The HTML file that loads this JavaScript also needs to be saved as UTF-8.

And putting in:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
does NOT make an HTML file UTF-8!

Animal
18 Sep 2009, 4:18 AM
Should work.

What's all this console.blah?

A pan-european blah blah and you won't debug properly? You need to set a break at the point the data is read and examine how it gets processed.

Animal
18 Sep 2009, 4:19 AM
The HTML file that loads this JavaScript also needs to be saved as UTF-8.

And putting in:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
does NOT make an HTML file UTF-8!

No, but

response.setCharacterEncoding("UTF-8");

should. It must be putting it on the wire encoded as UTF-8 bytes, so I suggest examining the client.

Animal
18 Sep 2009, 4:20 AM
Mmmmm...

The API docs for HttpServletResponse.setCharacterEncoding state:



This method has no effect if it is called after getWriter has been called or after the response has been committed.
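
To illustrate why the call order matters, here is a minimal JDK-only sketch. It does not use the servlet API; WriterCharsetSketch, writeWith and hex are made-up names. It only mimics a writer whose charset is fixed at the moment the writer is created, which is what the quoted API note implies for getWriter. ISO-8859-1 is used as the "too late" case because that is the servlet default when no charset has been set.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a response: the charset is captured when the writer is created.
final class WriterCharsetSketch {

    // Writes text through a PrintWriter bound to the given charset and returns the raw bytes.
    static byte[] writeWith(Charset charset, String text) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(wire, charset));
        out.print(text);
        out.flush();
        return wire.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ñó" written with escapes so the result does not depend on the source file encoding.
        String text = "\u00F1\u00F3";
        System.out.println(hex(writeWith(StandardCharsets.ISO_8859_1, text))); // F1 F3
        System.out.println(hex(writeWith(StandardCharsets.UTF_8, text)));      // C3 B1 C3 B3
    }
}
```

In servlet terms, the takeaway would be to call response.setCharacterEncoding("UTF-8") (or setContentType with a charset) before response.getWriter(), so that the writer is built with the intended charset.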

Condor
18 Sep 2009, 4:23 AM
No, but

response.setCharacterEncoding("UTF-8");

should.

Yes, it's sending the JSON data in UTF-8 format, but is the HTML that is loading this JSON data UTF-8?

Animal
18 Sep 2009, 4:31 AM
It shouldn't matter how the original page was encoded.

An XHR's responseText will be decoded using whatever scheme the server claims* it was encoded with.

* It's lying in this case because the method was called too late.
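
A rough model of that behaviour in plain Java, as a sketch only: DeclaredCharsetSketch, charsetFrom and decode are hypothetical names, the header parsing is deliberately simplistic (no quoted parameter values), and the ISO-8859-1 fallback is an assumption of this sketch rather than a statement about any particular browser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a client that trusts the Content-Type header when decoding a body.
final class DeclaredCharsetSketch {

    // Extracts the charset parameter from a Content-Type value.
    // Falls back to ISO-8859-1 when none is given (an assumption of this sketch).
    static Charset charsetFrom(String contentType) {
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                return Charset.forName(p.substring(key.length()).trim());
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Decodes the body bytes with whatever charset the header claims.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFrom(contentType));
    }

    public static void main(String[] args) {
        String name = "Hospital Ram\u00F3n y Cajal";
        String header = "application/json; charset=UTF-8";

        byte[] honest = name.getBytes(StandardCharsets.UTF_8);     // bytes match the claim
        byte[] lying = name.getBytes(StandardCharsets.ISO_8859_1); // bytes contradict the claim

        System.out.println(decode(honest, header).equals(name)); // true
        System.out.println(decode(lying, header).equals(name));  // false
    }
}
```

The client has no way to notice the contradiction on its own: it simply applies the declared charset to whatever bytes arrive.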

Condor
18 Sep 2009, 4:39 AM
1. From the ServletResponse JavaDoc:

The setContentType or setLocale method must be called before getWriter for the charset to affect the construction of the writer.
2. If you want to display received UTF-8 text in a browser then the HTML page itself also needs to be UTF-8.

Animal
18 Sep 2009, 4:44 AM
2. If you want to display received UTF-8 text in a browser then the HTML page itself also needs to be UTF-8.


Why would that be the case? Once inside the browser, there is no UTF-8.

All characters are just Unicode characters. The method in which they were originally encoded into a byte stream to be put on the wire is forgotten.
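
The same point can be sketched in Java as an analogy (browser internals are not modelled here, and DecodedTextSketch and roundTrip are made-up names): the bytes on the wire differ per encoding, but once each byte sequence is decoded with the charset that produced it, the resulting strings are indistinguishable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: after correct decoding, the wire encoding leaves no trace.
final class DecodedTextSketch {

    // Encodes the text with the given charset and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] wire = text.getBytes(charset);
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        String name = "Hospital Gregorio Mara\u00F1\u00F3n";

        // The byte streams differ in length: two bytes per accented letter in UTF-8, one in ISO-8859-1.
        int utf8Length = name.getBytes(StandardCharsets.UTF_8).length;
        int latin1Length = name.getBytes(StandardCharsets.ISO_8859_1).length;
        System.out.println(utf8Length + " vs " + latin1Length); // 27 vs 25

        // The decoded strings are equal, whichever encoding carried them.
        String fromUtf8 = roundTrip(name, StandardCharsets.UTF_8);
        String fromLatin1 = roundTrip(name, StandardCharsets.ISO_8859_1);
        System.out.println(fromUtf8.equals(fromLatin1)); // true
    }
}
```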

Condor
18 Sep 2009, 4:57 AM
Why would that be the case? Once inside the browser, there is no UTF-8.

All characters are just Unicode characters. The method in which they were originally encoded into a byte stream to be put on the wire is forgotten.

OK, let me put this another way:

What would a browser display if you tried to put a UTF-8 text in the innerHTML of an ISO-8859-1 HTML document body?
It would display question marks for all characters that exist in UTF-8, but not in ISO-8859-1!

Animal
18 Sep 2009, 4:57 AM
There's no such thing as UTF-8 text.

Animal
18 Sep 2009, 4:59 AM
If you read some bytes off the wire and decoded them using a scheme which was NOT the scheme used by the software that encoded them, then you might end up with the Unicode "unknown" character (U+FFFD, the replacement character) in the resulting Unicode string... usually shown as a question mark.
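
A small Java sketch of such a mismatch, in both directions. MismatchSketch and recode are hypothetical names, and the sketch cannot confirm what the server in this thread actually sent; it only shows that ISO-8859-1 bytes read as UTF-8 produce one U+FFFD per accented letter, which looks consistent with the %uFFFD%uFFFD in the first post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of decoding bytes with a charset other than the one that encoded them.
final class MismatchSketch {

    // Encodes with one charset and decodes with another, as a mismatched client would.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String name = "Mara\u00F1\u00F3n";

        // ISO-8859-1 bytes read as UTF-8: each accented letter is an invalid sequence.
        String broken = recode(name, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(broken.equals("Mara\uFFFD\uFFFDn")); // true

        // UTF-8 bytes read as ISO-8859-1: each accented letter turns into two Latin-1 characters.
        String mojibake = recode(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake.equals("Mara\u00C3\u00B1\u00C3\u00B3n")); // true
    }
}
```

The two failure modes look different on screen, so the kind of garbage displayed can hint at which side picked the wrong charset.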

aeiou
18 Sep 2009, 5:29 AM
Why would that be the case? Once inside the browser, there is no UTF-8.

All characters are just Unicode characters. The method in which they were originally encoded into a byte stream to be put on the wire is forgotten.
Hi,
Many thanks to you two, but Animal was right (as so often :>). As I said in my post, I was sure I had forgotten some small detail, but I am working alone on this issue and nobody here knew anything about this topic...

thanks again

w i l l y

PS: the console.blah showed the same as the Firebug debugger, but I think it is a faster way to debug when the problem is not algorithmic. But this is just my opinion.