Portuguese character set on Apache
October 9, 2008 9:00 PM

I need to display Portuguese characters on a webpage, but when I upload the files to the Apache server, they stop displaying correctly. And, yes, I'm using UTF-8 encoding.

I've been provided with a number of HTML pages with Portuguese characters on them. If I look at the files on my (Windows) desktop machine, they look fine, but once I upload them to the (Ubuntu 7.04) web server, they're replaced with those annoying question mark characters.

I've done some digging around, and have found that if I copy/paste the chars into a new file, and then request it, they display just fine; it seems the act of SCPing the file up to the server screws them up (either that or pasting the text fixes them up).

If I strip it right back to a single-word file (containing "Você" -- that last character is code #234 in ISO-8859-1, which is outside the ASCII range proper), and compare the character codes of the bytes contained therein, I get the following:

- this one displays correctly in one's browser (i.e. was the copy/paste effort):
V => 86
o => 111
c => 99
à => 195
ª => 170

- but this one doesn't (i.e. SCPed) despite the character code looking correct:
V => 86
o => 111
c => 99
ê => 234

If necessary I'll write a perl script to translate the character combos to the correct HTML entities but I'd rather not. Is there some sly way I can get this to work?
posted by John Shaft to computers & internet (11 comments total)
Could you post a link to a page that exhibits the behavior you describe?
posted by exphysicist345 at 9:26 PM on October 9, 2008


Here ye go:

- works
- doesn't
posted by John Shaft at 9:40 PM on October 9, 2008


Caveat -- the question marks appear in Firefox; otherwise blank in IE.
posted by John Shaft at 9:42 PM on October 9, 2008


It would help if your page were properly formatted HTML, with proper HEAD and BODY sections.

Then you could put the following into your HEAD section:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

And that would instruct your browser to interpret the encoding properly.
posted by Class Goat at 10:06 PM on October 9, 2008


Your web server is claiming that both files are UTF-8. However, the non-working one is actually encoded in ISO-8859-1, not UTF-8.

When you open the HTML file locally, there is no server to say "this is UTF-8", so your browser either has to sniff the bytes and guess the encoding, or it defaults to ISO-8859-1. Either way, it's making the characters appear to be correct. They just aren't really UTF-8.
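
To make the difference concrete, here is a minimal Python sketch (Python is my choice here, not something from the original files) that encodes "Você" both ways and then decodes each the wrong way. The byte values match the two lists in the question. The variable names are only illustrative.

```python
# "Voc\u00ea" is "Você"; the escape keeps this snippet independent of editor encoding.
word = "Voc\u00ea"

utf8_bytes = word.encode("utf-8")         # what the working file contains
latin1_bytes = word.encode("iso-8859-1")  # what the broken file contains

print(list(utf8_bytes))    # [86, 111, 99, 195, 170]
print(list(latin1_bytes))  # [86, 111, 99, 234]

# The server says "charset=UTF-8", so the browser decodes the ISO-8859-1 bytes
# as UTF-8. Byte 234 starts a multi-byte sequence that never completes, so it
# is replaced with U+FFFD, the "question mark" replacement character.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(shown))        # 'Voc\ufffd'

# The reverse mistake: UTF-8 bytes read as ISO-8859-1 give two characters
# (codes 195 and 170) where there should be one.
mojibake = utf8_bytes.decode("iso-8859-1")
print([ord(ch) for ch in mojibake])  # [86, 111, 99, 195, 170]
```

So a lone byte 234 is a perfectly good ê in ISO-8859-1, but it is not valid UTF-8 on its own.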
posted by xil at 10:12 PM on October 9, 2008


There's a whole ton of complexity in character encoding issues.

One thing that occurs to me initially is that there's a file header called the "Byte Order Mark" (BOM) that is used more frequently by Windows systems than Linux. Some Linux software (PHP, for example) has issues when it sees the BOM.

Here's an idea: How about you install BabelPad on your Windows system and experiment with saving a copy of one of the files with various settings (it lets you strip the BOM, for one thing). That ought to either corroborate or disprove that the problem occurs upon SCPing.
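
If you'd rather check for a BOM from a script than from an editor, something like this Python sketch would do it. The function name `strip_utf8_bom` is made up for illustration, and it only handles the UTF-8 BOM, not the UTF-16 ones.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A file saved as "UTF-8 with BOM" starts with EF BB BF before the real content.
print(strip_utf8_bom(b"\xef\xbb\xbfVoc\xc3\xaa"))  # b'Voc\xc3\xaa'
```

Reading a file's first few bytes with `open(path, "rb")` and passing them through this would tell you whether the BOM is there before and after the upload.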
posted by XMLicious at 10:14 PM on October 9, 2008


What are you using as your scp client? Is it being too "helpful"?
posted by i_am_joe's_spleen at 1:03 AM on October 10, 2008


On inspection with Firebug, Apache is sending the right headers for UTF-8 for both files:

Content-Type text/html; charset=UTF-8

but the "doesn't" file is not UTF-8. Therefore I'm going to blame something other than Apache.
posted by i_am_joe's_spleen at 1:06 AM on October 10, 2008


Like xil said above, the "doesn't work" file is ISO-8859-1. Open it in Firefox, use "View > Character Encoding" to change it to "Western (ISO-8859-1)" and it displays correctly.

My hunch is that you'll have to take the original Portuguese HTML files given to you, open them in a text editor that can save in many encodings (like EditPad Pro), and save copies as UTF-8 before you start working on them.
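
If there turn out to be a lot of files, a script could do the re-saving for you. Below is a rough Python sketch, not a tested tool: `to_utf8` and `convert_tree` are hypothetical names, the `*.html` pattern is a guess at your file layout, and it assumes that anything that isn't already valid UTF-8 is ISO-8859-1. If the files were saved as Windows "ANSI", `cp1252` may be the better fallback. Run it on a copy, since it overwrites files in place.

```python
from pathlib import Path

def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, treating it as `fallback` if it is not valid UTF-8."""
    try:
        raw.decode("utf-8")  # already valid UTF-8 (plain ASCII included): leave alone
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")

def convert_tree(root: str, pattern: str = "*.html") -> None:
    """Rewrite every matching file under root as UTF-8, in place."""
    for path in Path(root).rglob(pattern):
        path.write_bytes(to_utf8(path.read_bytes()))

# The single byte 234 becomes the two-byte UTF-8 sequence 195, 170.
print(list(to_utf8(b"Voc\xea")))  # [86, 111, 99, 195, 170]
```

The "is it already UTF-8?" check is there so that running the script twice doesn't double-encode files that were converted the first time.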
posted by msittig at 7:32 AM on October 10, 2008


It was the original files' encoding -- ISO-8859-1 as some of you suspected. If I save the files (one at a time...) in Textpad from ANSI to UTF-8, then upload them, they're fine. Hooray!
posted by John Shaft at 9:12 PM on October 11, 2008


There's that many? If you know Perl and anticipate ever facing this situation again, the Encode module would be a good thing to check out.
posted by msittig at 1:19 AM on October 13, 2008

