
Charset Problem

Posted: Mon Jun 25, 2012 3:05 pm
by jhb50
I'm having a character set problem with the titles in my Top100.groovy.

When I use my browser to save the HTML page returned by YouTube for "http://www.youtube.com/playlist?list=MCUS", I get the line

<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>

where "Monae" has the hex characters 4D 6F 6E E1 65


When I use html = new URL("http://www.youtube.com/playlist?list=MCUS").getText() in my Groovy script and save the text, I get the line

<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>

where "Monae" has the hex characters 4D 6F 6E C3 A1 65
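
For reference, both byte sequences can be reproduced without touching the network. This is only a sketch: the word is typed in as a literal with a Unicode escape (so the script's own file encoding does not matter), and the `hex` closure is a helper made up for this example.

```groovy
// Encode the same text under two charsets and print the bytes as hex.
def word = "Mon\u00E1e"   // "Monáe", with the á written as a Unicode escape

// Hypothetical helper: format a byte array as space-separated hex pairs.
def hex = { byte[] bytes -> bytes.collect { String.format('%02X', it & 0xFF) }.join(' ') }

def latin1Hex = hex(word.getBytes('ISO-8859-1'))
def utf8Hex   = hex(word.getBytes('UTF-8'))

println "ISO-8859-1: $latin1Hex"   // 4D 6F 6E E1 65
println "UTF-8:      $utf8Hex"     // 4D 6F 6E C3 A1 65
```

So E1 and C3 A1 are the same character, á, written in two different encodings; neither dump is corrupt by itself.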


My research says I need to specify a code page that correctly recognizes E1.

So I look at http://www.fileformat.info/info/unicode ... upport.htm
and see that ISO-8859-1 contains the correct E1
and change the groovy to use html = new URL("http://www.youtube.com/playlist?list=MCUS").getText("ISO-8859-1")

but I still get

<span class="title video-title " dir="ltr">Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]</span>

where Monae still has the hex characters 4D 6F 6E C3 A1 65

Other charsets do the same. How do I get the correct characters returned in my groovy?

Re: Charset Problem

Posted: Mon Jun 25, 2012 4:23 pm
by zip
Looking at the page, it displays correctly in the browser when the encoding is set to UTF-8; with ISO-8859-1 it shows the wrong character.

Try .getText("utf-8")
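
A rough sketch of why the charset argument matters, assuming the page really is served as UTF-8 (which is what the browser check suggests). The `served` bytes below stand in for what comes over the wire; they are built locally so the snippet runs offline.

```groovy
// Stand-in for the bytes the server sends (assumption: the page is UTF-8).
byte[] served = "Mon\u00E1e".getBytes('UTF-8')

// Decoding with the matching charset gives back the original 5 characters.
def asUtf8 = new String(served, 'UTF-8')

// Decoding as ISO-8859-1 turns each of the two bytes C3 A1 into its own character.
def asLatin1 = new String(served, 'ISO-8859-1')

println asUtf8.length()     // 5
println asLatin1.length()   // 6

// With the real page the equivalent call would be:
// def html = new URL("http://www.youtube.com/playlist?list=MCUS").getText('UTF-8')
```

The charset passed to getText only tells Groovy how to decode the incoming bytes; it has to match what the server sent, not what you want to end up with.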

Re: Charset Problem

Posted: Mon Jun 25, 2012 5:09 pm
by jhb50
I had tried that and the console output showed

title=[6] Fun.: We Are Young ft. Janelle Monße [OFFICIAL VIDEO]

so I assumed utf-8 did not work, but if I send the output to a file I get

title=[6] Fun.: We Are Young ft. Janelle Monáe [OFFICIAL VIDEO]

which is correct!

Go figure! Thanks!
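
For anyone who hits the same thing: the string was most likely fine all along, and only the console's code page rendered it differently (an ß where á should be is what you would expect if byte E1 is displayed under an OEM code page such as CP437 or CP850). A minimal sketch of making the output encoding explicit, assuming a temp file as a stand-in for the real output file:

```groovy
def title = "Fun.: We Are Young ft. Janelle Mon\u00E1e [OFFICIAL VIDEO]"

// Write to a file with an explicit charset instead of the platform default.
def out = File.createTempFile('titles', '.txt')   // hypothetical output file
out.setText(title, 'UTF-8')

// Reading it back with the same charset returns the identical string.
def roundTrip = out.getText('UTF-8')
println roundTrip == title

// For the console, System.out can be wrapped with an explicit encoding.
// Whether the terminal then displays it correctly depends on the terminal's own settings.
def console = new PrintStream(System.out, true, 'UTF-8')
console.println(title)

out.delete()
```

Checking the file rather than the console is the more reliable test, since the file's bytes do not depend on how the terminal is configured.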