How do I extract website content for translation?
March 24, 2006 9:30 AM

How do I extract the content of a website (including images and alt tags) for translation into different languages?

I have been tasked with preparing one of our websites for translation. They've sent me a Word doc into which I am to copy and paste all the content, graphic details (such as wording), and alt tags. I really don't think this is the best way to go about it, and I can't believe that I am the only person doing this. Is there a piece of software that will grab all the content and its relationship on the page, including graphics, and arrange it so that it can be translated?

Help me Obi-Wans... you're my only hope.
posted by Botunda to Computers & Internet (12 answers total)
What is the structure of the source (English) pages? Straight HTML? JSP pages with embedded text? JSP pages with XML files? The answer depends on which.
If the site wasn't built with translation in mind - it will likely be a fairly manual process of pulling out text and recreating pages after the translation comes back.
If the text is straight HTML - most of the translation tools that an agency will have know how to mark all of the non-text bits as such, which prevents translators from mucking about with them. You send the files to the agency - they send back translated HTML files - nothing needs doing on your side.
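To give a rough idea of what "marking the non-text bits" means, here is a minimal Python sketch using only the standard library. It is not how any particular agency tool works; the names TranslatableExtractor and extract_segments are made up for illustration. It pulls out visible text plus alt and title attribute values and skips script and style content.

```python
from html.parser import HTMLParser


class TranslatableExtractor(HTMLParser):
    """Collects visible text and alt/title values; ignores script and style."""

    SKIP = {"script", "style"}   # content that must never go to a translator
    ATTRS = {"alt", "title"}     # attributes that carry human-readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []       # list of (kind, text) tuples, in page order
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        for name, value in attrs:
            if name in self.ATTRS and value and value.strip():
                self.segments.append((name, value.strip()))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if text and not self._skip_depth:
            self.segments.append(("text", text))


def extract_segments(html):
    """Return the translatable segments found in one HTML string."""
    parser = TranslatableExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    page = ('<p>Please <a href="info.html">click here</a> for more info.</p>'
            '<img src="logo.gif" alt="Company logo">')
    for kind, text in extract_segments(page):
        print(kind, "|", text)
```

Note that this naive version chops a sentence into pieces wherever a tag interrupts it. Real translation tools keep the sentence whole and show the tags as locked placeholders, which is one reason to send the HTML itself rather than a Word doc.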
Plenty can be done with other file formats - you just need to know what you have and what kind of translation tools your translator is using.
For the graphics - you're going to hope - and likely be disappointed - that the original source PSD files exist somewhere. Ideally you can send the translator an image with the text on its own layer for super easy translation - again - agency translation tools will handle this. If not - you need to make an inventory of all the images and then a spreadsheet with the graphic name and the text in it so that when it comes back you or your translation agency can recreate all the images by converting the gifs back to real images and "coloring" over the text and laying down the new translated text. I'm a localization manager and if I had a dollar for every time I'd done that - I'd be a very rich woman.
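The image inventory can at least be started automatically. The following is a minimal Python sketch, assuming the pages are already on disk and have been read into a dict of filename to HTML; ImageCollector, image_inventory_csv and the column names are invented for this example. The text that is baked into each graphic still has to be typed in by hand, since it lives in the pixels and not in the HTML.

```python
import csv
import io
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Records the src and alt of every img tag on a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append((a.get("src") or "", a.get("alt") or ""))


def image_inventory_csv(pages):
    """pages maps a filename to its HTML; returns CSV text, one row per image."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # text_in_image and translation are left blank to be filled in by hand
    writer.writerow(["page", "image", "alt_text", "text_in_image", "translation"])
    for page, html in sorted(pages.items()):
        collector = ImageCollector()
        collector.feed(html)
        collector.close()
        for src, alt in collector.images:
            writer.writerow([page, src, alt, "", ""])
    return out.getvalue()


if __name__ == "__main__":
    demo = {"index.html": '<img src="btn_buy.gif" alt="Buy now">'}
    print(image_inventory_csv(demo))
```

The resulting file opens in any spreadsheet program, so it can go to the translator alongside the HTML.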
I do this for a living - user name at gmail if you need any more specific help with this.
posted by Wolfie at 10:21 AM on March 24, 2006

Gawd, that is definitely not the best way to go about this, since you're going to get a monolithic file back from the translator and will then need to break that out into the same directory structure as the English site, mark it all up, etc.

I've done website translations where the client simply said "Here's the URL. Translate everything." I didn't need to manipulate any graphics, though I did need to translate any text in them. The tools I used were a site ripper (there are numerous apps you can use for this, but that's what the class of apps is called), where you give it a URL and it downloads to disk everything under it, and BBEdit.

Working directly with HTML is kind of a drag, especially if it is tag-soup markup. Tables are especially problematic, as they make it impossible to read the HTML in a linear fashion. I would wind up printing out the pages just so I could read the pages as they were intended. This did slow me down, but it also meant that there was no additional work in marking up the translation for the web, so it might be worth it to find a translator willing to work this way and pay him a little extra. Obviously not all translators will be happy working in this mode.

I've also done website translations where the client faxed (!) me every page on the site, or put every page into slides in a powerpoint presentation, or whatever. Sheesh.

There are CMSs out there that are specifically designed to facilitate creating multilingual websites; basically they are their own localization tools. You don't say what form your content is currently stored in. If it's static, the site ripper + text editor approach is probably best. If it's in a database, you should explore the multilingual CMS option. That way, you'd just give the translator some level of admin permissions and a list of entries to run through.
posted by adamrice at 10:25 AM on March 24, 2006

Response by poster: It is all straight HTML with some javascript thrown in.
posted by Botunda at 10:30 AM on March 24, 2006

Yes, you should be able to find a translator who can handle being given a URL and told to get on with it. I don't take on that kind of work but it looks like you need someone like adamrice (depending on the language, natch).
posted by altolinguistic at 10:33 AM on March 24, 2006

Are you working with a translator directly - or a translation/localization agency? You'll pay a bit more to work through an agency - but it may be worth it as they will be able to handle all of this for you. You send them the HTML pages and the images - they send you back a translated site. Done. You'll pay a slightly higher per word rate, hourly charges for graphic work and you'll pay 10-15% of the total in a project management fee - but if you're new at this and there are no experts in your office - it can be well worth the price. If you want to cut down on costs - create the file with all the text in the images and do the graphics production in-house. If you are working directly with a translator - they may not even be willing to do the graphics production work.
The other benefit of working with an agency is that you can ask that they start you with a translation memory. Most professional translators will also be able to do this for you - but "some guy in the French office" or the "brother of someone who happens to speak Chinese" will not. This will allow you to do updates to the site in the future without having to pay to have the entire site sent out and retranslated. Make sure you get in writing that the memory is owned by you. Ask for a delivery of this file at the end of the project - even if you have no tools to do anything with it directly. This also allows you to switch vendors if you find you're not happy with the one you are using and not lose the investment you've made so far in translation.
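To illustrate why a translation memory saves money on updates, here is a deliberately simplified Python sketch. It treats the memory as a plain dictionary of source sentence to translation, which is an assumption made for this example; real memories are files produced by the translation tool and also handle near matches. The function name split_by_memory is hypothetical.

```python
def split_by_memory(segments, memory):
    """Separate segments already in the memory from ones that need translating.

    segments: source-language strings from the updated site
    memory:   dict mapping a source string to its existing translation
    Returns (known, new): known maps source to translation, new is a list.
    """
    known, new = {}, []
    for seg in segments:
        key = " ".join(seg.split())      # normalize whitespace before lookup
        if not key:
            continue
        if key in memory:
            known[key] = memory[key]     # reuse the translation already paid for
        elif key not in new:
            new.append(key)              # only this goes out to the translator
    return known, new


if __name__ == "__main__":
    memory = {"Contact us": "Contactez-nous"}
    site = ["Contact us", "New product page", "New product page"]
    known, new = split_by_memory(site, memory)
    print("reused:", known)
    print("to translate:", new)
```

Only the "to translate" list needs to be sent out, which is the whole point of owning the memory.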
posted by Wolfie at 10:48 AM on March 24, 2006

Response by poster: Can you suggest a service or agency that you would recommend? This will have to be translated into about 10 languages.
posted by Botunda at 11:02 AM on March 24, 2006

Response by poster: I just sent this to Wolfie (I think) but I'll post it here as well.

I have a couple more questions. I am wondering about links, as in "click here". How do translation services handle that? Say the English link text is "click here" and it is translated into Japanese, where it takes 5 words or characters to say the same thing. How does that get put back and linked correctly? How do they know where or what to link?

As in: Please click here for more info.
Is now: ここにかちりと鳴らしなさい より多くのインフォメーションのために

How the hell does that work?
posted by Botunda at 11:20 AM on March 24, 2006

Botunda, I'd second the earlier advice to hand the project to a localization company; they handle issues like this as routine. I've a list of recommended companies on another hard drive that I'll dig out, but SDL, ITR and Lionbridge are all very good at website localization.
posted by ceri richard at 12:03 PM on March 24, 2006

Wow, ya gotta love machine translation. That one's a hoot.

The way you deal with linked text depends on how the translation is being handled at the bottom of the pile (that is, by the individual translators).

If they're working from something like a printout, they can be directed "please underline the text that should be the link text." If there are multiple links in one paragraph or sentence, this could get really problematic to reconstruct further along, so you might need to change the instructions to something like "please underline the text to use as link text and enclose the target URL in [brackets]". Or you could instruct them to use mediawiki markup, or put the target URL in a footnote, or something like that. In this scenario, you're going to need to do some reprocessing on the document you receive back, so make the format you choose something that's amenable to GREPping.
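As an example of a grep-friendly format, suppose (purely as an assumption for this sketch) that the translator types underscores around the link text in place of underlining and puts the target right after it in square brackets, like _click here_ [info.html]. A few lines of Python can then turn the returned text back into HTML links. The pattern and the function name rebuild_links are made up for illustration.

```python
import re
from html import escape

# _link text_ followed by [target]; the target may not contain spaces or "]"
LINK = re.compile(r"_([^_]+)_\s*\[([^\]\s]+)\]")


def rebuild_links(text):
    """Replace each _label_ [url] pair with an HTML anchor element.

    Only the link markup is touched; the rest of the text is assumed
    to be HTML-safe already.
    """
    def to_anchor(match):
        label, url = match.group(1), match.group(2)
        return '<a href="{}">{}</a>'.format(escape(url, quote=True), label)

    return LINK.sub(to_anchor, text)


if __name__ == "__main__":
    print(rebuild_links("Please _click here_ [info.html] for more info."))
```

Because each link carries its own target, this keeps working when a sentence has several links or when the translation moves the link to a different place in the sentence.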

If they're working with raw HTML, it's not an issue. If they're working in some kind of WYSIWYG HTML editor and overtyping a copy of the source doc, they'll need to be careful not to accidentally delete the linked text before reconstructing it, but other than that, it shouldn't be an issue.
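The reason raw HTML is not an issue is that the link is just a pair of tags wrapped around some words, so the translator moves the tags along with whichever words should be clickable. If you want to check the files that come back, a minimal Python sketch like the one below can confirm that every link target in the source page still appears in the translated page. LinkCollector, links_in and same_link_targets are hypothetical names, and the check only compares href values.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) for every anchor that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_in(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def same_link_targets(source_html, translated_html):
    """True if both pages link to the same targets the same number of times."""
    source = sorted(href for href, _ in links_in(source_html))
    translated = sorted(href for href, _ in links_in(translated_html))
    return source == translated


if __name__ == "__main__":
    english = '<p>Please <a href="info.html">click here</a> for more info.</p>'
    japanese = '<p>詳細は<a href="info.html">こちら</a>をクリックしてください。</p>'
    print(links_in(english))
    print(links_in(japanese))
    print(same_link_targets(english, japanese))
```

The link text and its position in the sentence differ between the two pages, but the target is the same, which is all the browser cares about.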

Feel free to contact me by e-mail for more info.
posted by adamrice at 12:05 PM on March 24, 2006

Depending on the size of the site - you should get a number of quotes from smaller to mid-size companies. You should ensure that the pricing for the translation takes into account words or sections of the site that repeat - these should be cheaper than unique sections.
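Before asking for quotes it helps to know roughly how much of the site repeats. Here is a minimal Python sketch, assuming you already have the site's text as a list of segments (sentences, headings, menu items). The function name repetition_report is invented, and it counts exact repeats only, whereas agency tools also discount near matches.

```python
from collections import Counter


def repetition_report(segments):
    """Count words in all segments versus words in distinct segments."""
    # normalize whitespace so trivially different copies count as repeats
    counts = Counter(" ".join(s.split()) for s in segments if s.strip())
    unique_words = sum(len(seg.split()) for seg in counts)
    total_words = sum(len(seg.split()) * n for seg, n in counts.items())
    return {
        "total_words": total_words,
        "unique_words": unique_words,
        "repeated_words": total_words - unique_words,
    }


if __name__ == "__main__":
    demo = ["Contact us", "Contact us", "About our company", "Contact  us"]
    print(repetition_report(demo))
```

A high repeated-word count is worth pointing out when you compare quotes, since navigation menus and footers that appear on every page should not be billed as new text each time.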

WorldServer is a tool that combines a content management system with a translation tool and would be an ideal solution for managing a website project. Idiom, the creator of WorldServer, has a number of partners who use this product - any would be good to contact.

Feel free to contact me - I work for a translation service company that does a lot of these types of projects. Disclosure - we are also an Idiom partner, however we only specialize in a certain industry so we will not be bidding for this project. But free advice - no problem.... (oh and Hi Wolfie!!)
posted by clarkie666 at 1:22 PM on March 24, 2006

There are software packages that let you translate all relevant text (including alt texts [not “tags”]) but prevent you from touching HTML elements and other attributes. I read about it in that error-strewn book about Web internationalization a couple of years ago. I wish I could come up with some names for you.

Try asking Richard Ishida of the W3C.
posted by joeclark at 4:11 PM on March 24, 2006

Joe - the main software packages are: Trados/SDLX, Deja Vu and WorldServer. These tools are called CAT (Computer Aided Translation) tools. There is an overview of prices and features on Proz.
posted by clarkie666 at 11:20 AM on March 27, 2006
