What's wrong with my XML?
May 25, 2006 12:13 PM

XML Character Woes: I'm getting the error "Reference to undefined entity 'ldquo'" (and 'rdquo', etc.) when I try to open my XML files with IE or Firefox. Can you help me fix it?

I know the parser is seeing it as an entity and looking for a definition, but I can't define them in the DTD because I don't know what entities might be coming in.

I've set the elements to CDATA hoping the parser would ignore it, but that doesn't change anything. I've also tried changing the entities to the various numerical entities.

My goal is just to have valid XHTML entities in the text. These files are certainly going to be converted to HTML at some point but who knows where else they'll go. They might go back into InDesign, etc.

In case it matters: I'm getting the content from InDesign and running it through some scripts to fix it up. InDesign is giving me Unicode, and I'm converting the Unicode special characters to the 'rdquo' style HTML entities.
posted by miniape to Computers & Internet (9 answers total)
 
I think it's because those are HTML entities, not XML entities.

Would &quot; work better?
posted by utsutsu at 12:34 PM on May 25, 2006


You could just leave the Unicode (UTF-8, I assume) alone and encode only the bare minimum required by XML, as long as you store and process the content with Unicode-friendly software throughout.

If you want to handle it as ASCII or ISO-8859-1, then using numeric entities should work fine; I've done that myself to get around Unicode-hostile systems. Are you sure you tried it properly?
posted by malevolent at 12:43 PM on May 25, 2006
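A minimal sketch of the "leave the Unicode alone" approach, in Python rather than the poster's PHP. It assumes the text is already decoded Unicode and that the file will be written as UTF-8; the helper name `to_xml_text` and the `<p>` wrapper are made up for illustration. Only `&`, `<`, and `>` are escaped, and the curly quotes pass through untouched.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def to_xml_text(text: str) -> str:
    """Escape only the characters XML requires in element content (&, <, >)."""
    return escape(text)


# Curly quotes stay as real characters; only the ampersand is escaped.
raw = "\u201cFish & chips\u201d"
document = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<p>" + to_xml_text(raw) + "</p>"
)

# Encode to UTF-8 bytes, as the file on disk would be, then parse it back.
parsed = ET.fromstring(document.encode("utf-8"))
round_trip_ok = parsed.text == raw
```

If `round_trip_ok` is true, the parser read the same characters back, with no entity definitions needed.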


"InDesign is giving me Unicode, and I'm converting the Unicode special characters to the 'rdquo' style html entities"

Don't do that. Use the Unicode code points to create numeric character references (for example, &#8221; for the right double quote), which work in both HTML and XML.
posted by scottreynen at 12:57 PM on May 25, 2006
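As a rough illustration of that advice (in Python, not the poster's PHP, and not scottreynen's own code): the standard `xmlcharrefreplace` error handler turns every non-ASCII character into a decimal reference, which keeps the output pure ASCII. The function name `to_ascii_xml` is hypothetical.

```python
from xml.sax.saxutils import escape


def to_ascii_xml(text: str) -> str:
    """Return ASCII-only XML text with non-ASCII characters as &#NNNN; references."""
    # Escape &, <, > first so the references added below are not double-escaped.
    escaped = escape(text)
    # Any character outside ASCII becomes a decimal numeric character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "\u201cIt\u2019s caf\u00e9 & more\u201d"
print(to_ascii_xml(sample))
# -> &#8220;It&#8217;s caf&#233; &amp; more&#8221;
```

The result is safe to hand to ASCII-only consumers and still parses as XML without any DTD.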


The only XML entities are: amp, lt, gt, apos, quot. That's it.
XHTML has optional support for the character entities, but you'll have to include the entities in xhtml-lat1.ent, xhtml-special.ent, and xhtml-symbol.ent, and even then many applications won't read the entities correctly.

As malevolent said, use UTF-8. All modern applications support it. If that's not an option, use numeric entities.
posted by Sharcho at 12:59 PM on May 25, 2006


Response by poster: Wow. Thanks. I must have been trying the decimal numbers improperly because it looks like it might work.

I would like to use UTF-8, and I'll probably keep a copy in that form, but I'm getting a little resistance from outside forces. Someone we shipped some of this content to (a huge internet company, no less) actually asked for ASCII CSVs, so we're trying to keep it as simple as possible in case something like that comes up again.

So basically, scottreynen's solution, but

I was using a PHP function:
mb_convert_encoding($contents, 'HTML-ENTITIES', "UTF-8");

to convert them. Is there a decimal-entity equivalent to HTML-ENTITIES? I couldn't find one in the PHP docs.

I prefer not to come up with a conversion table if possible.
posted by miniape at 1:12 PM on May 25, 2006


Best answer: "The only XML entities are: amp, lt, gt, apos, quot. That's it."

Untrue. XML also handles numeric entities.

"XHTML has optional support for the character entities but you'll have to include the entities in xhtml-lat1.ent, xhtml-special.ent, xhtml-symbol.ent"

Also untrue. These entities are already included in XHTML. You don't need to do anything special to include them.

"Is there a Decimal Entity equivalent to HTML-ENTITIES? I couldn't find one in the PHP docs."

Not that I know of. A while back I wrote some functions to do things like this. You could use those like so:

$contents_for_xml_and_html = unicode_to_entities_preserving_ascii( utf8_to_unicode( $utf8_contents ) );
posted by scottreynen at 1:22 PM on May 25, 2006
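For files that already contain 'rdquo' style names, one possible repair is to rewrite each named entity as its decimal reference. The sketch below is in Python, not PHP, and is not the set of functions mentioned above; it leans on the standard library's `html.entities.name2codepoint` table so no hand-built conversion table is needed. The names `named_to_numeric` and `XML_PREDEFINED` are invented for this example.

```python
import re
from html.entities import name2codepoint

# The five names every XML parser understands; these are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches things shaped like a named entity reference, e.g. &rdquo;
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Rewrite HTML named entities (&rdquo;) as decimal references (&#8221;)."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Keep XML's own entities, and anything unrecognized, untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _NAMED_ENTITY.sub(replace, markup)


print(named_to_numeric("&ldquo;Hi&rdquo; &amp; bye&nbsp;now"))
# -> &#8220;Hi&#8221; &amp; bye&#160;now
```

Unknown names are deliberately passed through rather than dropped, so a later parse error still points at them.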


If you're using a stylesheet (XSL), you can also define the ones you want near the top. For example:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#160;"> ]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
XSLT brand noobie here, but I ran into that problem trying to get non-breaking spaces into table cells and googled the specifics. Here's the link I pulled this from, in case the code above doesn't show up right. I'm not sure it's the same one I found from work, but the general approach is the same. (Isn't there a way to show pre/code stuff here? I guess I should check the FAQ.)
posted by phrits at 7:13 PM on May 25, 2006


On overnight review, it's not limited to stylesheets, I guess. You can declare any of your commonly used entities such that the shorthand (e.g., &nbsp;) references the numeric code.
posted by phrits at 3:30 AM on May 26, 2006
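A small Python check of that idea, assuming Python's standard ElementTree parser stands in for the browser: the same markup fails without a declaration and parses once the entities are declared in an internal DTD subset. The `<p>` element and the helper `try_parse` are made up for the example.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE: the parser has never heard of ldquo or rdquo.
undeclared = "<p>&ldquo;Hello&rdquo;</p>"

# Same content, with each shorthand declared as a numeric character reference.
declared = (
    "<!DOCTYPE p [\n"
    '  <!ENTITY ldquo "&#8220;">\n'
    '  <!ENTITY rdquo "&#8221;">\n'
    "]>\n"
    "<p>&ldquo;Hello&rdquo;</p>"
)


def try_parse(source: str):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        # Undefined entities surface here as a parse error.
        return None


without_dtd = try_parse(undeclared)  # None: undefined entity
with_dtd = try_parse(declared)       # the text with real curly quotes
```

The catch, as noted at the top of the thread, is that this only helps when you know in advance which entity names will show up.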


Response by poster: Thanks to everyone. All is well. I'm marking scottreynen's answer as best because his link has some great resources and his functions worked perfectly.
posted by miniape at 7:54 AM on May 26, 2006

