Semantic markup and the world wide web: non-noob needs an explanation.
December 22, 2007 8:29 AM

What is all this semantic markup talk about, and where can I learn more?

I've been creating web tools/apps/etc. for years with well-written (as far as I can tell) and valid XHTML & CSS code. But over the last few years I've heard lots of people talk about "Semantic Markup."

What is this, and how can I tell if I'm already doing it? Why is it so good, and are there any problems with it? Finally, what are your favorite web and non-web resources for learning more?
posted by josh.ev9 to Technology (9 answers total) 9 users marked this as a favorite
This is a good place to start.
posted by Steven C. Den Beste at 8:32 AM on December 22, 2007

Semantic markup just means using the right tag for the job. So use h1 for the main heading, h2 for the subheadings, ul/ol/dl for lists, etc. There are some areas wide open to interpretation/experimentation (e.g. forms), but it's just about keeping the markup clean and logical, with only a sprinkling of extra divs/spans added to provide more hooks for CSS.
posted by malevolent at 9:35 AM on December 22, 2007

The basic idea is to use elements for their intended purpose, not for their default display style.

The biggest one of these in the 'non semantic gripes' is tables. Tables are for tabular data, but many people use them for other purposes, e.g. forms and general layout. But it applies to any element.
It's also about using the correct elements when you should be using them, as well as not using them for things you shouldn't. E.g. if you have a 'list' of links, you should use a list tag (usually ul).

'Valid XHTML' means very little, particularly given that certain browsers (IE, I'm looking at you) don't follow the standards. Saying your XHTML is valid simply means all your tags are closed correctly and in the correct order, and you don't have any tags where they shouldn't be (and you've encoded all your damn ampersands, grrr...). It doesn't mean your code is well written.

Why is it good?
It improves the machine readability of your site, which is good for search engine spiders and 'alternative' browsers, e.g. screen readers for the blind. Using the correct elements for your content allows machines to make better assumptions about the content and how it should be interpreted logically.
posted by missmagenta at 9:36 AM on December 22, 2007

The Web Ontology Language might be of interest to you as well (but it's definitely meant for building ontologies).
posted by Nelsormensch at 9:53 AM on December 22, 2007

In addition to the good comments above, I recommend that you take a look at the implications of the upcoming HTML 5 and XHTML 2 standards and the difference between these two.
posted by Foci for Analysis at 10:47 AM on December 22, 2007

Semantic markup was actually the original design for HTML, with tags like em, h1/h2/h3, p, and so on, having specific meanings for the reader of the text. The idea was that each browser or each user would be free to render "emphasis" or "subsubheading" in an appropriate way, as well as making use of the semantic information provided by that (perhaps the browser could provide a TOC of the page in a sidebar, for example, extracted from the 'h' tags). There are also visual-markup tags, like i and b, which specify a particular rendering. But mostly you were supposed to stick to the markup vocabulary of HTML, which was pretty good at representing essays and articles, but didn't go much beyond that.
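To make the TOC idea concrete, here is a minimal sketch in Python of what a client could do with the 'h' tags. It uses the standard library's html.parser as a stand-in for a browser; the class name, the function name and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collects (level, text) pairs for every heading in a page."""

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, heading text)
        self._level = None   # level of the heading currently open, if any
        self._chunks = []    # text pieces seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Collapse whitespace so nested tags like <em> do not leave gaps.
            text = " ".join("".join(self._chunks).split())
            self.outline.append((self._level, text))
            self._level = None


def table_of_contents(html):
    """Return the headings as indented lines, two spaces per level."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return ["  " * (level - 1) + text for level, text in parser.outline]


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """
<h1>Semantic markup</h1>
<p>Some introduction.</p>
<h2>Why it <em>matters</em></h2>
<p>Body text.</p>
<h2>Further reading</h2>
"""

for line in table_of_contents(SAMPLE_PAGE):
    print(line)
```

None of this works if the headings are just paragraphs with a big font: the parser has nothing to look for.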

During the '90s, web page designers became very intent on visual design, trying to produce pages that would render pixel-identical for every user (and as a result you'd get pages that told you you have to be running at exactly 800x600 in a particular version of MSIE and so on or else the whole layout would fall apart). Presumably these were people coming from print or video design, fields which embed no semantic information at all, who didn't get that the browser was an active part of the system. This also meant that all browsers had to use exactly the same visual interpretations of the supposedly-semantic markup tags.

Stylesheets (e.g. CSS) are an attempt to let people have it both ways, ideally putting the semantic information in the main HTML document (this is a pull quote; this is emphasized; this is a subheading) and visual information in an associated stylesheet (pull quotes are italic and light gray; emphasis is represented by italics; subheadings are bold, larger type, and set off by some white space).
posted by hattifattener at 11:02 AM on December 22, 2007 [1 favorite]

Best answer: Steve's link is very good.

The web has traditionally worked like this: HTML code is rendered on a two-dimensional screen and the user works out what it means. That's a navigation bar! That's a headline! That's a postal address! Those are individual articles! The user infers meaning from layout and text content.

This sucks because machines - like the Googlebot - aren't as smart as human users, so they can't infer the same meaning (semantics) from the HTML. They can't spot a postal address in a page. They can't know that this text is the page headline and should be more important than this text over here when you are indexing the site. They can't identify the navigation bar so a blind user can skip right over it.

The semantic web is about putting this meaning explicitly into the code. HTML 4 lets you do that a little bit. Use H1 instead of FONT SIZE=ENORMOUS BOLD UNDERLINE and some machines - like screen readers - can then tell that is the page heading, and let you skip to it, or use the heading to index the page. But that's about it in HTML 4. Imagine you're trying to write a program - say, a web browser for blind people - that can take a web page and identify the main content, like being able to open the New York Times website and say "There are fourteen articles, here are their headlines, select which one and I'll follow the link to the article." How can you get that from the HTML? Only by laying it out on the screen and looking at it with a smart human. It's not explicitly in the code.*

Compare that to RSS, which has much more semantics: you have a TITLE element, and a DESCRIPTION element, and so on. You can build programs that process and amend and combine RSS feeds because their content is explicitly defined and therefore machine-readable.
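As a rough illustration of how little work that is, here is a sketch in Python using only the standard library. The feed is invented for the example and item_titles is a hypothetical helper name, not part of any RSS library.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, trimmed to the elements mentioned above.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed</description>
    <item>
      <title>First story</title>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>
"""


def item_titles(feed_xml):
    """Return the TITLE of every ITEM in the feed, in document order."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


print(item_titles(SAMPLE_FEED))
```

The program never has to guess which text is a title: the feed says so.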

Now look at HTML 5: lots more semantic markup, more like RSS than HTML4. There is a HEADER element, and an ARTICLE element, and a FOOTER element, and lots more. Think about that problem with the New York Times again, and imagine how easy the same problem becomes when the HEADER and FOOTER elements let your web browser for blind people ignore the stuff at the top and bottom of the page and the ARTICLE element lets it pull out the fourteen articles. Suddenly a machine can read the web page just like a human and obtain the same meaning/semantics. Search engines and aggregation programs and accessibility tools and archivers and lots of things can suddenly do cool stuff with content from other people's sites, like they do now with RSS. Great!
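Here is a sketch of that New York Times scenario in Python, assuming the ARTICLE, HEADER and FOOTER elements end up working roughly as described. The page is invented, the class and function names are hypothetical, and the rule "the first heading inside an ARTICLE is its headline" is my own assumption, not something a spec promises.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ArticleHeadlines(HTMLParser):
    """Collects the first heading found inside each top-level <article>."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._depth = 0          # how many <article> elements we are inside
        self._wanted = False     # still looking for this article's headline
        self._capturing = False  # currently inside that headline
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1
            if self._depth == 1:
                self._wanted = True
        elif tag in HEADINGS and self._depth > 0 and self._wanted:
            self._capturing = True
            self._chunks = []

    def handle_data(self, data):
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._wanted = False
        elif tag in HEADINGS and self._capturing:
            self.headlines.append(" ".join("".join(self._chunks).split()))
            self._capturing = False
            self._wanted = False


def list_headlines(html):
    parser = ArticleHeadlines()
    parser.feed(html)
    parser.close()
    return parser.headlines


# Hypothetical front page, invented for this example.
FRONT_PAGE = """
<header><h1>The Example Times</h1></header>
<article><h2>Markets rally</h2><p>Story text.</p></article>
<article><h2>Storm warning</h2><p>Story text.</p></article>
<article><h2>Local team wins</h2><p>Story text.</p></article>
<footer><p>Copyright notice</p></footer>
"""

headlines = list_headlines(FRONT_PAGE)
print("There are %d articles:" % len(headlines))
for headline in headlines:
    print(" - " + headline)
```

The masthead in the HEADER is skipped without any cleverness, simply because it is not inside an ARTICLE.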

That's the idea, anyway. What does all this mean for you as a web designer? Do nothing on the semantic web until HTML5 comes out, browsers support it somehow - even if it's just rendering it as well as HTML4 - and someone major starts using it on their website. If the New York Times or the BBC does it, then follow their lead. There'll be some Firefox extension that uses the semantics to do something very cool and you can then demo your site with it to customers.

For now, though, I'd suggest just using the heading elements H1 to H6 properly for your blind users (plus LABEL elements and attributes, and always, always alt attributes) and RSS autodiscovery, and not worrying too much about anything else. Using UL and OL for lists, and hCard, maybe? I'm an accessibility guy, rather than a web designer, so other people might have suggestions for other HTML4 elements.

(You should follow current fashion, of course, or your peers and customers will think less of you. Avoid TABLE elements for layout and use CSS instead. Make sure your pages validate. But they don't really matter in terms of the semantic web.)

* Microformats are an attempt to add more semantic meaning to HTML 4 - hey everyone, let's say that this particular HTML code means a postal address! It's saying that certain combinations of elements have a semantic meaning. Works so long as lots of people use them consistently.
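A rough sketch of what that buys a machine, in Python. The class names (adr, street-address, locality, region, postal-code, country-name) are the ones the hCard microformat uses for addresses; everything else here, including the parser class, the helper name and the sample address, is invented for illustration. The sketch assumes tidy markup with every tag inside the address closed, which real pages will not always give you.

```python
from html.parser import HTMLParser

# Address sub-properties from the hCard "adr" microformat.
ADDRESS_FIELDS = ("street-address", "locality", "region",
                  "postal-code", "country-name")


class AdrParser(HTMLParser):
    """Pulls postal addresses out of elements marked class="adr"."""

    def __init__(self):
        super().__init__()
        self.addresses = []
        self._current = None  # dict for the address being read, if any
        self._depth = 0       # open tags inside the current adr element
        self._field = None    # address field currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._current is None:
            if "adr" in classes:
                self._current = {}
                self._depth = 1
            return
        self._depth += 1
        for name in ADDRESS_FIELDS:
            if name in classes:
                self._field = name
                self._current[name] = ""

    def handle_data(self, data):
        if self._current is not None and self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        self._field = None
        if self._depth == 0:
            cleaned = {k: v.strip() for k, v in self._current.items()}
            self.addresses.append(cleaned)
            self._current = None


def find_addresses(html):
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return parser.addresses


# Hypothetical snippet, invented for this example.
CONTACT_HTML = """
<p>Visit us:</p>
<div class="adr">
  <span class="street-address">123 Example Street</span>,
  <span class="locality">Springfield</span>,
  <span class="region">IL</span>
  <span class="postal-code">62701</span>
</div>
"""

print(find_addresses(CONTACT_HTML))
```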
posted by alasdair at 1:59 PM on December 22, 2007 [1 favorite]

I've been creating web tools/apps/etc. for years with well-written (as far as I can tell) and valid XHTML & CSS code.

Writing "valid" XHTML is sort-of a joke, since true XHTML has to be served with the Content-Type header set to application/xhtml+xml (which maybe 1% of the valid XHTML folk do) and even then, IE6 will choke unless it's text/html--which completely defeats the purpose.

That said, "valid" HTML is sort-of a joke, too. Besides myself, I don't know anyone who actually codes valid HTML. You can tell your markup is invalid HTML if you close your <br /> or <input ... /> tags, for example. According to the spec, they shouldn't be closed up (go down the "end tag" column in this list--wherever you see an F, that means forbidden). Do you put your <tfoot> tags after the <tbody> tags (which you would think would make sense)? That's a no-no as well.

Pedantry aside, the theory is that the closer you conform to the standard, the more likely the final, rendered result will look the same across various browser implementations. But we all know that's a load of malarkey (at least, as of 2007). So what happens when a browser sees code that's invalid according to the declared syntax? It turns it into "tag soup" and tries to render it the best it can.

What does that mean to you, the writer? Not a whole lot. Browsers are designed to be enormously fault-resistant to bad code. Unlike, say, XML parsers, which will simply give you an error when they encounter bad structure, the flexibility afforded to the web designer is such that you can follow just about whatever convention you prefer and still have better-than-average chances that your end-users will see what you intended...
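You can see the difference in attitude with a few lines of Python. This is only a sketch: the standard library's html.parser stands in for a browser here (real browsers do far more recovery than this), and the tag soup, class and function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Deliberately broken markup: overlapping tags, unclosed <p> and <br>.
TAG_SOUP = "<p>An unclosed <b>bold and <i>italic</b> mess</i><br><p>Next paragraph"


class TextCollector(HTMLParser):
    """Lenient reader: keeps whatever text it can find."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def xml_accepts(markup):
    """True if a strict XML parser will take the markup at all."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_text(markup):
    """Text recovered by the lenient HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.text)


print("XML parser accepts it:", xml_accepts(TAG_SOUP))
print("HTML parser recovered:", repr(html_text(TAG_SOUP)))
```

The XML parser refuses the whole document over one mismatched tag, while the HTML parser shrugs and hands back the text.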

...provided you use semantic markup.

Why is it so good, and are there any problems with it?

Thus we arrive at the true power of semantic markup. To illustrate this with an example, I was recently working on revamping an old corporate website. The one thing I insisted on was that the markup be semantic. Not Strict, not point-whatever-validation-100%--none of that bullshit. Just semantic. The code was a thing of beauty. A few DIV and SPAN sacrifices to make the browser gods happy, but 99% of it was tagged correctly.

What is "correct" markup? It's the difference between <b>this</b> and <strong>this</strong>. What's that? "It looks the same," you say? Well, it may look the same, but <strong> means the block should be read with force, while <b> just means "display=boldface." To you the result is the same. To a blind person, one may be read at a louder volume than the other. Same with <i> and <em>--one is purely for display. The other is semantic code for emphasis. How the browser (or screen reader) decides to handle that is up to them. The point is, it gets you closer to the actual intent of your document's content.
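Here is a small Python sketch of a hypothetical non-visual client that only cares about meaning. It treats <em> and <strong> as emphasis and deliberately ignores <b> and <i>, since those only describe appearance. The names and the sample sentence are invented, and whether any particular screen reader behaves like this is, as above, up to the client.

```python
from html.parser import HTMLParser

# Tags that carry meaning. <b> and <i> are left out on purpose.
SEMANTIC_EMPHASIS = {"em", "strong"}


class EmphasisFinder(HTMLParser):
    """Collects (tag, text) for each semantically emphasized span."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._open = None   # emphasis tag currently open, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_EMPHASIS and self._open is None:
            self._open = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.spans.append((tag, "".join(self._chunks).strip()))
            self._open = None


def emphasis_spans(html):
    finder = EmphasisFinder()
    finder.feed(html)
    finder.close()
    return finder.spans


# Hypothetical sentence mixing visual and semantic tags.
SAMPLE = ("<p>Do <b>not</b> press <strong>the red button</strong>, "
          "<i>ever</i>, <em>please</em>.</p>")

print(emphasis_spans(SAMPLE))
```

On screen all four spans look emphasized, but this client only learns about two of them.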

That's what semantic markup means. It means using the right tag for the right purpose. Why is it so good? Well, in the example I mentioned before about the corporate redesign, at one point I had to test the site from a remote ssh connection. All I had was a straight Linux shell--no browser. I could ping the site and it worked fine, but out of prurient curiosity, I decided to try viewing the site in Lynx. To my astonishment, it looked great. Not pixel-for-pixel perfect, mind you (lynx is text-based, so that would be impossible), but it was like the website had been converted into a document... sensibly. It looked right.
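To give a feel for why that works, here is a toy text-mode renderer in Python. It is nowhere near what Lynx actually does; it is just a sketch, with invented names and an invented page, of how a client can lay out a document sensibly from nothing but the tags.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "p", "li"}


class TextRenderer(HTMLParser):
    """Turns a handful of semantic tags into plain-text lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None    # block element currently open, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <strong> just pass through
        text = " ".join("".join(self._chunks).split())
        if tag == "h1":
            self.lines += [text.upper(), "=" * len(text)]
        elif tag == "h2":
            self.lines += [text, "-" * len(text)]
        elif tag == "li":
            self.lines.append("  * " + text)
        else:
            self.lines.append(text)
        self._tag = None


def render_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    return renderer.lines


# Hypothetical corporate page, invented for this example.
CORPORATE_PAGE = """
<h1>Acme Corp</h1>
<p>We make <strong>everything</strong>.</p>
<h2>Products</h2>
<ul><li>Anvils</li><li>Rockets</li></ul>
"""

print("\n".join(render_text(CORPORATE_PAGE)))
```

Feed the same renderer a page built from layout tables and font tags and it has nothing to go on.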

That's why you use semantic markup. Because you give whatever the client software is the best possible chance to render the content as correctly as possible.
posted by Civil_Disobedient at 5:46 PM on December 22, 2007 [1 favorite]

Presumably these were people coming from print or video design, fields which embed no semantic information at all, who didn't get that the browser was an active part of the system.

I think this is a common idea about people who designed this way. I can tell you the real reason: customers and clients. Nearly all of them coming into the web were familiar with advertising they'd done in the print world, where it could look "exactly so." (One lady kept bugging us about the yellow background of her site. Seriously. She didn't seem to grasp that we had a different monitor than hers, so it was going to look different.) Many a time I had a wonderfully semantic website that needed to be ruined because a customer wanted the display to look a little different.

Also, until recently CSS support was so all over the map that even though these were "standards" you would actually get more cross-platform consistency using things like the font tag, or using tables for positioning. Really frikking annoying.

Today, the nice thing is that customers and clients have grown up a little. Now, they care slightly less about how the website looks, and care more about it integrating into all of this crazy Web 2.0 they keep hearing about. It makes it a bit easier to make their websites semantic-based rather than visual-based.

Although they will still always want their logos bigger.
posted by Deathalicious at 4:42 AM on December 23, 2007
