The Trouble with Semantic Markup: Response to schema.org

First thing this morning, checking in on the Twitter streams, I saw Jeff Evans (@joffaboy) announce the article, “Google, Bing & Yahoo’s New Schema.org Creates New Standards for Web Content Markup.”

Initial tweet

My heart began pounding as soon as I read the title. The arch-rivals of search, the biggest dogs in the yard, the great institutions of the web were collaborating to propose a solution to the problem of markup that has plagued me from the beginning: Markup doesn’t really address the substance of the web, just its most basic structure. My hopes were further raised by the mention of a “recipe” content type, which if you follow my writings, you’ll recognize as a regular example.

I retweeted in a flash: This is what I’ve been looking for!

My first retweet

Then, I visited schema.org, and all my hopes came crashing to Earth again. The Search Giant monsters have created a new monster.

My second retweet

Quick Overview

As I understand it, schema.org is proposing additions to HTML that the “Big Three” search engines are going to interpret, in order to improve the accuracy of search results. By augmenting the markup in web content, they are together settling on a standard vocabulary, so that they will all be recognizing the same language. Presumably, once they’ve built this standard language into their sorting algorithms, any content that has these augmentations will rise to the top of search results, above content that doesn’t.

In principle, that sounds good, doesn’t it?

I’d like to offer some reflections on a few practical implications of this effort.

Corporations try to head off the “free” Semantic Web

For-profit companies have been watching in dismay for twenty years the rise of the “free” WorldWide Web. Content is free. Software is free. Social Networking is free. And more and more of the web is being driven by “free” efforts, like the WorldWideWeb Consortium. Volunteerism is a huge threat to capitalism, and they know it.

Among the greatest of these free efforts is the quest for the Semantic Web, which in its simplest terms, seeks a set of standards for describing the meaning of content. Human language is always problematic—as are those who use it—because words are never just words. The meaning of words is rich, contextual, ambiguous, and worst of all, ever changing. There are a lot of really, really smart people, all over the world, almost exclusively volunteer (with some corporate support), working hard to figure this out. If you want to get a sense of the complexity of it all, talk to Rachel Lovinger (@rlovinger) at Razorfish. She’s one of the true semantic geeks, and I’ll just have to take her word on most of what she says. She’s fab.

But instead of supporting this “free” effort, the Search Giants have imposed a de facto standard for the Semantic Web, and they’re pushing it with the strength of their size and popularity. Like the Zen question of the tree in the forest:

If a search engine doesn’t support your semantic standard, will anyone find your content?

I am suspicious of their motives. I read it as an effort to bypass all the work that’s already gone into the Semantic Web.

Markup is more than basic structure and presentation

It has been a great struggle since the beginning of the web to strike the appropriate balance between the structure of content and its presentation. In other words, what content is should be distinct from how content looks. But HTML—even up to HTML5—still only addresses the most basic aspects of content, and even now, offers only tags that address the pieces of the “webpage”—like the “header” and “navigation.” There isn’t markup to describe the content’s substance.

CSS as semantic markers

Cascading Stylesheets, in a roundabout way is one approach to the problem, although it’s originally meant to control the presentation of the content. Let me give an example.

Lists are a primary content structure. We create lists for everything—ingredients, footnotes, archives, contacts, links, Q&A, references, etcetera ad nauseum—but HTML offers us only two choices: “Ordered lists” (numbered) and “Unordered lists” (bulleted).

If your website had a list of links in a sidebar and a list of staff names on a contact page, you use the same basic markup:

<ul>
    <li><a href= “http://url.for.link/1” title= “This is the first list item”>Link Text 1</a></li>
    <li><a href= “http://url.for.link/2” title= “This is the second list item”>Link Text 2</a></li>
</ul>

…and then…

<ul>
    <li>Contact Name 1</li>
    <li>Contact Name 2</li>
</ul>

Here’s the problem: The web browser has a default way of rendering these lists, and they will look exactly the same, except that the links will be underlined. If you want to distinguish them from each other, you can add CSS classes, which give you a way to style them differently.

Now, CSS gurus (the best of whom are really content strategists underneath it all) will tell you that you should NEVER use class names that describe how something looks, like “class= ‘blue_text’.” The class names should describe what they are, which is, in fact, a semantic indication:

<ul class="links”>
    […]
</ul>

…versus…

<ul class=“contacts”>
    […]
</ul>

Using these identifiers, the designer can define precisely how each component of a website should look. In a better world, however, they could also be used to identify what they are. Defining standard CSS classes and identifiers as part of XHTML would be one approach to encoding the meaning into markup.

But not Google, Bing, and Yahoo—Noooooooo.

The Search Giants, though, instead of building on CSS or any other existing approach, have introduced another “standard,” which superimposes another layer of markup on top of the feeble XHTML we already have. Here is the example from schema.org:

<div>
    <h1>Avatar</h1>
    <span>Director: James Cameron (born August 16, 1954)</span>
    <span>Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>

Before I go any further, I have to say that this code doesn’t look like any real XHTML I’ve ever seen, and that’s a worry right from the start. Nevertheless…

Once they’ve applied their markup augmentations, again right from schema.org, it becomes:

<div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Avatar</h1>
    <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954)</span>
    </div>

    <span itemprop="genre">Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

There are many, many, many things wrong with this picture.

All the complexity of XML without any of its simplicity

XML is the mother of all markup. In fact, XHTML is just one markup language based on the XML standard. Using XML as the basis of your web code is an elegant—but very complex—solution to defining your content. When it’s all worked out, however, it lets you replace that gobbledygook above with something more like this:

<movie>
    <title>Avatar</title>
    <director>
        <name>James Cameron</name>
        <birthdate> August 16, 1954</birthdate>
    </director>
    <genre>Science fiction</genre>
    <trailer url= “../movies/avatar-theatrical-trailer.html” />
</movie>

Putting it simply, by augmenting XHTML with another layer of markup, the Search Giants have complicated the code immensely, making it just as complex as if they had done it in XML, but without any of the benefits of XML’s simple elegance.

Content is rarely this simple

The examples above deceive us, in any case: Yes, we can add fields to CMS templates for isolated metadata like “title” and “director,” but what about the main content itself? What about the meaning embedded in the article? Let’s say we’re writing an article about motion picture history, and we include the following sentence:

<p>James Cameron, best known for directing the sci-fi thriller,
“Avatar,” was born on August 16, 1954.</p>

All of the information in the schema.org example is present in that sentence, and if we were searching for content about James Cameron, we would have to rely on full-text searching.

If we were to use the schema.org augmentation, in order to make it all accessible to the search engines, it would get very messy, something like:

<p>
    <span itemscope itemtype ="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
        James Cameron
        </span>
    </span>,
best known for directing the
    <span itemscope itemtype ="http://schema.org/Movie">
        <span itemprop="genre">sci-fi thriller</span>,
        <span itemprop="name”>Avatar</span>
    </span>
,” was born on
    <span itemscope itemtype ="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
        <span itemprop="birthDate">August 16, 1954</span>
    </span>
    </span>.
</p>

Not for mere mortal content authors

Now we come to the main practicality of content: Content authors.

I have marked up a lot of content in my career, and I am an obsessive, precise, exacting author. On the other hand, I’ve implemented CMS templates and tried to configure the best WYSIWYG editors to be able to apply the right CSS classes within content. And I’ve worked with a lot of content owners to teach them the importance of good markup.

Here’s the hard reality: No matter how powerful the technology, no matter how carefully designed and coded the CMS templates, no matter how sophisticated the WYSIWYG editor, and no matter how much training we offer, any markup will ultimately succeed or fail on the content authors’ ability to use it.

And that brings me to my main issue with the Semantic Web.

The Semantic Web cannot rely on encoding alone

If the main difficulty of searching the web is in understanding the meaning of the content (given all the languages, people, markup skill, and so many more factors), then we can really only solve it the hard way: Intelligent reading. We cannot rely on the human beings who create content to make it speak for itself, by making sure that everything is tagged correctly. They just can’t do it.

We cannot rely on markup because XHTML is insufficient, XML is too complicated for more than data structures, and the schema.org effort is unrealistic. In the end, each method may play a limited role in addressing the findability of content, but ultimately, it will require some other kind of intelligence—intelligence in the interpreting of meaning, rather than its encoding.

I don’t know what will happen with the schema.org markup augmentations. Personally, I hope that it just sags under its own weight and disappears into the marshes from whence it came. And I heartily encourage all the folks who are working on this problem to keep at it: There’s no path to success here but the long one. Eventually, perhaps new kinds of computers will be able to understand us weird, wonderful human beings, but for now, we remain inscrutable to the mechanical, algorithmic mind.

4 comments

  1. georgecee says:

    Sorry to be a NOOB. I have searched the web high and dry and I can I cannot find a real explanation on how to really use the new Schema.org markup.

    I have just programmed this OK looking website that you can view at http://www.parmaair.com/

    I just want to simply implement the code as they do in the Schema.Org example.

    Here is my code:

    Furnace and Air Conditioner repair in Parma Ohio

    Parma Air Heating and Air Conditioning is a family owned and operated company with very simple goals, which include offering exceptional services at the best possible pricing and providing products which will bring our customers nothing less than 100% satisfaction. Our specialists are qualified to identify heating and cooling issues in your home that may result in lowering your monthly energy bills! In addition we stock a full inventory of award-winning ENERGY STAR® rated furnaces and air conditioners that will add value to your home while keeping energy cost low.

    If I try to add

    then my backgrounds gets all screwed up and the sight doen’t act properly.

    I am using Dreamweaver 5.5.

    What the heck am I doing wrong?

    I appreciate your time.

    Thankyou.

    George from Ohio

  2. georgecee says:

    Sorry. The code did not show up in my reply.

  3. rsgracey says:

    Hey, George!

    I wouldn’t worry about it. The Search Monsters are trying to impose a standard, but I don’t believe that it’s in any shape to be implemented. I haven’t seen any examples on schema.org of anything like real xhtml: They’re mostly just fragments of basic database dumps, so in my view, it’s not going to take us very far. It’s rather like three Godzillas stomping through a model of Tokyo: Yes, they can smash it to bits, but it’s still only a model.

    Hope this helps some–

    Stephen

  4. rsgracey says:

    This is a comment from @snyderwriter (Bill Snyder). (We had a commenting glitch!)

    No luck. Just tried clearing my cache to make sure that wasn’t the problem, but no change. I get a comment field on pages but not on blog posts.

    What I was going to say, in addition to the comments below, is that I do see a useful scenario for this metadata.

    If you’re running an e-commerce site with a lot of SKUs, or a database like IMDB (Internet Movie Database, if you don’t use it), you are using the same content type repeatedly. So, looking at the example they gave, if their server spit out pages with all the same data types, only the specific values changing (e.g., movie title, actor’s name, etc.), it would make sense. You code it once, and you’re done. Well, that’s assuming it doesn’t break anything else, but that’s a separate concern. In editorial, however, trying to do anything like that would be a nightmare.

    So far, for all the talk of the Semantic Web, I’m yet to see it adopted. That was part of the purpose of XHTML and using CSS to create the look of the site. And there are tons of tags out there, say , that no one uses, because 98% of the time, sites are coded for the design, not the semantics. And it makes sense. Coders don’t know the content intimately enough. And though most writers, editors, and content strategists have something between a working knowledge and near proficiency of HTML/CSS, we can’t sit there and dictate every tag.

    Again, thanks for a great post. Since I can’t seem to post a response, if you’d like to use anything I’ve written you as a follow-up to the post, feel free. Just use my name. :-)

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>