The Trouble with Semantic Markup: Response to schema.org

First thing this morning, checking in on the Twitter streams, I saw Jeff Evans (@joffaboy) announce the article, “Google, Bing & Yahoo’s New Schema.org Creates New Standards for Web Content Markup.”

Initial tweet

My heart began pounding as soon as I read the title. The arch-rivals of search, the biggest dogs in the yard, the great institutions of the web were collaborating to propose a solution to the problem of markup that has plagued me from the beginning: Markup doesn’t really address the substance of the web, just its most basic structure. My hopes were further raised by the mention of a “recipe” content type, which, if you follow my writings, you’ll recognize as a regular example.

I retweeted in a flash: This is what I’ve been looking for!

My first retweet

Then, I visited schema.org, and all my hopes came crashing to Earth again. The Search Giant monsters have created a new monster.

My second retweet

Quick Overview

As I understand it, schema.org is proposing additions to HTML that the “Big Three” search engines are going to interpret, in order to improve the accuracy of search results. By augmenting the markup in web content, they are together settling on a standard vocabulary, so that they will all be recognizing the same language. Presumably, once they’ve built this standard language into their sorting algorithms, any content that has these augmentations will rise to the top of search results, above content that doesn’t.

In principle, that sounds good, doesn’t it?

I’d like to offer some reflections on a few practical implications of this effort.

Corporations try to head off the “free” Semantic Web

For twenty years, for-profit companies have watched in dismay the rise of the “free” World Wide Web. Content is free. Software is free. Social networking is free. And more and more of the web is being driven by “free” efforts, like the World Wide Web Consortium. Volunteerism is a huge threat to capitalism, and they know it.

Among the greatest of these free efforts is the quest for the Semantic Web, which in its simplest terms, seeks a set of standards for describing the meaning of content. Human language is always problematic—as are those who use it—because words are never just words. The meaning of words is rich, contextual, ambiguous, and worst of all, ever changing. There are a lot of really, really smart people, all over the world, almost exclusively volunteer (with some corporate support), working hard to figure this out. If you want to get a sense of the complexity of it all, talk to Rachel Lovinger (@rlovinger) at Razorfish. She’s one of the true semantic geeks, and I’ll just have to take her word on most of what she says. She’s fab.

But instead of supporting this “free” effort, the Search Giants have imposed a de facto standard for the Semantic Web, and they’re pushing it with the strength of their size and popularity. Like the Zen question of the tree in the forest:

If a search engine doesn’t support your semantic standard, will anyone find your content?

I am suspicious of their motives. I read it as an effort to bypass all the work that’s already gone into the Semantic Web.

Markup is more than basic structure and presentation

It has been a great struggle since the beginning of the web to strike the appropriate balance between the structure of content and its presentation. In other words, what content is should be distinct from how content looks. But HTML—even up to HTML5—still only addresses the most basic aspects of content, and even now, offers only tags that address the pieces of the “webpage”—like the “header” and “navigation.” There isn’t markup to describe the content’s substance.

CSS as semantic markers

Cascading Style Sheets (CSS) are, in a roundabout way, one approach to the problem, although they were originally meant to control the presentation of content. Let me give an example.

Lists are a primary content structure. We create lists for everything (ingredients, footnotes, archives, contacts, links, Q&A, references, et cetera ad nauseam), but HTML offers us only two choices: “Ordered lists” (numbered) and “Unordered lists” (bulleted).

If your website had a list of links in a sidebar and a list of staff names on a contact page, you would use the same basic markup for both:

<ul>
    <li><a href="http://url.for.link/1" title="This is the first list item">Link Text 1</a></li>
    <li><a href="http://url.for.link/2" title="This is the second list item">Link Text 2</a></li>
</ul>

…and then…

<ul>
    <li>Contact Name 1</li>
    <li>Contact Name 2</li>
</ul>

Here’s the problem: The web browser has a default way of rendering these lists, and they will look exactly the same, except that the links will be underlined. If you want to distinguish them from each other, you can add CSS classes, which give you a way to style them differently.

Now, CSS gurus (the best of whom are really content strategists underneath it all) will tell you that you should NEVER use class names that describe how something looks, like class="blue_text". The class names should describe what they are, which is, in fact, a semantic indication:

<ul class="links">
    […]
</ul>

…versus…

<ul class="contacts">
    […]
</ul>

Using these identifiers, the designer can define precisely how each component of a website should look. In a better world, however, they could also be used to identify what they are. Defining standard CSS classes and identifiers as part of XHTML would be one approach to encoding the meaning into markup.
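To make that idea concrete, here is a minimal sketch in Python of a program reading class names as hints about what a list is, rather than how it looks. It uses only the standard library's html.parser. The page fragment and the names ListsByClass and lists_by_class are made up for illustration; no search engine is known to work this way.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: two lists that differ only by class name.
PAGE = """
<ul class="links">
    <li><a href="http://url.for.link/1">Link Text 1</a></li>
    <li><a href="http://url.for.link/2">Link Text 2</a></li>
</ul>
<ul class="contacts">
    <li>Contact Name 1</li>
    <li>Contact Name 2</li>
</ul>
"""


class ListsByClass(HTMLParser):
    """Group the text of each <li> under the class of its enclosing <ul>."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # class name -> list of item texts
        self.current = None   # class of the <ul> we are inside, if any
        self.in_item = False  # True while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.current = dict(attrs).get("class", "(no class)")
            self.groups.setdefault(self.current, [])
        elif tag == "li" and self.current is not None:
            self.in_item = True
            self.groups[self.current].append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False
        elif tag == "ul":
            self.current = None

    def handle_data(self, data):
        if self.in_item:
            self.groups[self.current][-1] += data.strip()


def lists_by_class(html):
    """Return a dict mapping each list's class name to its item texts."""
    parser = ListsByClass()
    parser.feed(html)
    return parser.groups


if __name__ == "__main__":
    print(lists_by_class(PAGE))
```

The sketch only works because the class names are meaningful. A class like "blue_text" would tell the program nothing about the content.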

But not Google, Bing, and Yahoo—Noooooooo.

The Search Giants, though, instead of building on CSS or any other existing approach, have introduced another “standard,” which superimposes another layer of markup on top of the feeble XHTML we already have. Here is the example from schema.org:

<div>
    <h1>Avatar</h1>
    <span>Director: James Cameron (born August 16, 1954)</span>
    <span>Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>

Before I go any further, I have to say that this code doesn’t look like any real XHTML I’ve ever seen, and that’s a worry right from the start. Nevertheless…

Once they’ve applied their markup augmentations, again right from schema.org, it becomes:

<div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Avatar</h1>
    <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
    </div>

    <span itemprop="genre">Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

There are many, many, many things wrong with this picture.

All the complexity of XML without any of its simplicity

XML is the mother of all markup. In fact, XHTML is just one markup language based on the XML standard. Using XML as the basis of your web code is an elegant—but very complex—solution to defining your content. When it’s all worked out, however, it lets you replace that gobbledygook above with something more like this:

<movie>
    <title>Avatar</title>
    <director>
        <name>James Cameron</name>
        <birthdate>August 16, 1954</birthdate>
    </director>
    <genre>Science fiction</genre>
    <trailer url="../movies/avatar-theatrical-trailer.html" />
</movie>

Putting it simply, by augmenting XHTML with another layer of markup, the Search Giants have complicated the code immensely, making it just as complex as if they had done it in XML, but without any of the benefits of XML’s simple elegance.
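As a rough illustration of what that simplicity buys, here is a short Python sketch that reads the XML version with the standard library's xml.etree.ElementTree. The element names come from the example above; the function name describe_movie is my own invention, and this assumes a custom movie vocabulary that someone would still have to define and agree on.

```python
import xml.etree.ElementTree as ET

# The hypothetical <movie> document from the example above.
MOVIE_XML = """
<movie>
    <title>Avatar</title>
    <director>
        <name>James Cameron</name>
        <birthdate>August 16, 1954</birthdate>
    </director>
    <genre>Science fiction</genre>
    <trailer url="../movies/avatar-theatrical-trailer.html" />
</movie>
"""


def describe_movie(xml_text):
    """Pull the named fields out of a <movie> document into a plain dict."""
    movie = ET.fromstring(xml_text.strip())
    return {
        "title": movie.findtext("title"),
        "director": movie.findtext("director/name"),
        "birthdate": movie.findtext("director/birthdate"),
        "genre": movie.findtext("genre"),
        "trailer": movie.find("trailer").get("url"),
    }


if __name__ == "__main__":
    print(describe_movie(MOVIE_XML))
```

Each piece of meaning has its own element, so the reader asks for it by name instead of hunting through attributes layered onto presentational tags.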

Content is rarely this simple

The examples above deceive us, in any case: Yes, we can add fields to CMS templates for isolated metadata like “title” and “director,” but what about the main content itself? What about the meaning embedded in the article? Let’s say we’re writing an article about motion picture history, and we include the following sentence:

<p>James Cameron, best known for directing the sci-fi thriller,
“Avatar,” was born on August 16, 1954.</p>

All of the information in the schema.org example is present in that sentence, and if we were searching for content about James Cameron, we would have to rely on full-text searching.

If we were to use the schema.org augmentation, in order to make it all accessible to the search engines, it would get very messy, something like:

<p>
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
        James Cameron
        </span>
    </span>,
best known for directing the
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="genre">sci-fi thriller</span>,
        “<span itemprop="name">Avatar</span>
    </span>
,” was born on
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
            <span itemprop="birthDate">August 16, 1954</span>
        </span>
    </span>.
</p>
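To see what a program might actually recover from markup like this, here is a deliberately simplified Python sketch. It is not the real microdata extraction algorithm, which also tracks nesting and item references. It only counts top-level items (an itemscope with no itemprop) and collects the text of leaf properties (an itemprop with no itemscope). The class name ItemPropReader is hypothetical, and the parser assumes every tag is explicitly closed, as in this sample.

```python
from html.parser import HTMLParser

# The marked-up sentence from above, as a test input.
SENTENCE = """
<p>
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
        James Cameron
        </span>
    </span>,
best known for directing the
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="genre">sci-fi thriller</span>,
        “<span itemprop="name">Avatar</span>
    </span>
,” was born on
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
            <span itemprop="birthDate">August 16, 1954</span>
        </span>
    </span>.
</p>
"""


class ItemPropReader(HTMLParser):
    """Simplified reader: counts top-level items and collects leaf properties."""

    def __init__(self):
        super().__init__()
        self.stack = []           # per open element: index into props, or None
        self.props = []           # [property name, text] pairs
        self.top_level_items = 0  # itemscope elements that are not properties

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        entry = None
        if "itemscope" in attrs and "itemprop" not in attrs:
            self.top_level_items += 1
        elif "itemprop" in attrs and "itemscope" not in attrs:
            self.props.append([attrs["itemprop"], ""])
            entry = len(self.props) - 1
        self.stack.append(entry)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only text directly inside a leaf property is recorded.
        if self.stack and self.stack[-1] is not None:
            self.props[self.stack[-1]][1] += data.strip()


if __name__ == "__main__":
    reader = ItemPropReader()
    reader.feed(SENTENCE)
    print(reader.top_level_items, reader.props)
```

Under this simplified reading, one sentence about one film yields three separate Movie items, and the director's name is never captured at all, because it was not wrapped in its own itemprop="name". Getting it right would take even more spans.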

Not for mere mortal content authors

Now we come to the main practicality of content: Content authors.

I have marked up a lot of content in my career, and I am an obsessive, precise, exacting author. On the other hand, I’ve implemented CMS templates and tried to configure the best WYSIWYG editors to be able to apply the right CSS classes within content. And I’ve worked with a lot of content owners to teach them the importance of good markup.

Here’s the hard reality: No matter how powerful the technology, no matter how carefully designed and coded the CMS templates, no matter how sophisticated the WYSIWYG editor, and no matter how much training we offer, any markup will ultimately succeed or fail on the content authors’ ability to use it.

And that brings me to my main issue with the Semantic Web.

The Semantic Web cannot rely on encoding alone

If the main difficulty of searching the web is in understanding the meaning of the content (given all the languages, people, markup skill, and so many more factors), then we can really only solve it the hard way: Intelligent reading. We cannot rely on the human beings who create content to make it speak for itself, by making sure that everything is tagged correctly. They just can’t do it.

We cannot rely on markup because XHTML is insufficient, XML is too complicated for more than data structures, and the schema.org effort is unrealistic. In the end, each method may play a limited role in addressing the findability of content, but ultimately, it will require some other kind of intelligence—intelligence in the interpreting of meaning, rather than its encoding.

I don’t know what will happen with the schema.org markup augmentations. Personally, I hope that it just sags under its own weight and disappears into the marshes from whence it came. And I heartily encourage all the folks who are working on this problem to keep at it: There’s no path to success here but the long one. Eventually, perhaps new kinds of computers will be able to understand us weird, wonderful human beings, but for now, we remain inscrutable to the mechanical, algorithmic mind.

Taxonomy: A “Disambiguation”

I was not able to attend the several workshops on “taxonomy” at the recent WebContent2010 conference (#wcconf) in Chicago: Tough choices were made. Yet I think I got a lot out of those workshops because of the seriously faithful tweeting coming out of them, and when I said so to some new friends, they almost all said, “How? I didn’t understand any of it…overwhelming.” I replied that when you follow a tweetstream, you only see what people understand, already interpreted for you. (Which is a recommendation, really, to follow conferences you can’t attend: Done well, the tweets will give you at least the essential points.)

Amid the summary tweets of the workshops’ content, however, I saw comments such as these:

“A workshop and a session on taxonomy and I’m still confused. Is it just me? #wcconf” – @EvanKittleton

“Ouch. My head hurts. Taxonomy not an easy beast to wrestle. #wcconf” – @cc_holland

A lot of the confusion centered on how the idea of taxonomy relates to—and differs from—other elements of Information Architecture, such as sitemaps and navigation. Are they the same thing? Is it just your metadata?

With the guidance of my best-bud colleague Becky Bristol (@paintingblue) as technical reviewer, I’m going to try to “disambiguate” it, that is, to explain and clarify.

Disclaimer: I’m an explainer, not a taxonomist, so if you’d like to help with the definition, please by all means chime in.

The Roots of Taxonomy

“Taxonomy” is an ancient scientific practice. It means finding names for things. In naming things, you try to figure out how sets of things are related to one another, so that each unique item will have not only a unique name, but also a reference to the others to which it relates.

Taxonomy creates a hierarchy of inheritance, from general down to specific and back: A giant tree, on which there is a unique place for every item, like the leaves at the ends of twigs at the ends of branches connected to a trunk and running deep into the earth.

In order to build a taxonomy in the scientific sense, you have to create a framework that tells you how to name a thing. This is the “schema.” The most famous schema was created by Carl Linnaeus, an 18th-century Swedish botanist, to categorize and name life on Earth. In its modern form, it has eight major taxonomic ranks:

Domain -> Kingdom -> Phylum (zoology)/Division (botany) -> Class -> Order -> Family -> Genus -> Species

If you’re REALLY geeky, you can lay it out in Latin:

Regio -> Regnum -> Phylum/Divisio -> Classis -> Ordo -> Familia -> Genus -> Species

There are only certain terms you can put into those fields. Imagine drop-down boxes from which you MUST choose. Let’s try it on ourselves, humans:

  • Domain: Eukarya
  • Kingdom: Animalia
  • Phylum: Chordata
  • Class: Mammalia
  • Order: Primates
  • Family: Hominidae
  • Genus: Homo
  • Species: H. sapiens

When the terms don’t apply at a certain point, then you get to pick a new term, which at that point, creates a new branch. If you find a new item in nature, something that hasn’t been named before, you get to name it yourself, but you will use the same set of terms down the tree as far as you can to demonstrate your new species’s relationship to all other life.
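The branching idea can be sketched in a few lines of Python. The human classification comes from the ranks above; the house cat is my own added example, and the function names shared_ranks and branch_point are made up for illustration. Zoological naming is assumed, so the third rank is Phylum.

```python
# The eight ranks, from most general to most specific.
RANKS = ["Domain", "Kingdom", "Phylum", "Class",
         "Order", "Family", "Genus", "Species"]

HUMAN = ["Eukarya", "Animalia", "Chordata", "Mammalia",
         "Primates", "Hominidae", "Homo", "H. sapiens"]

# Added for comparison; not part of the original example.
HOUSE_CAT = ["Eukarya", "Animalia", "Chordata", "Mammalia",
             "Carnivora", "Felidae", "Felis", "F. catus"]


def shared_ranks(a, b):
    """Return the (rank, term) pairs two classifications share from the top down."""
    shared = []
    for rank, term_a, term_b in zip(RANKS, a, b):
        if term_a != term_b:
            break
        shared.append((rank, term_a))
    return shared


def branch_point(a, b):
    """Name the rank at which two classifications part ways, or None if identical."""
    depth = len(shared_ranks(a, b))
    return RANKS[depth] if depth < len(RANKS) else None


if __name__ == "__main__":
    print(shared_ranks(HUMAN, HOUSE_CAT))
    print(branch_point(HUMAN, HOUSE_CAT))
```

Humans and house cats use the same terms down through Class, then split into different branches at Order. A newly named species would reuse existing terms in just this way until the point where it diverges.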

Taken all together, this classification system becomes the official way of understanding the whole world of animals, plants, and bacteria. Taxonomy is powerful because it is universally adopted: You could try to work out a new system, but then you’d have to explain it to everyone and get buy-in for it to mean anything to anyone else but you. It is at this point that we make the transition to the Web…

Taxonomy on the Web

Now at some point, the word “taxonomy” was appropriated by information architects to talk about web content. When one discipline borrows a term from another, the meaning and use of the term can change significantly, and so “taxonomy” doesn’t mean to the web professional quite what it means to the biologist.

A website’s taxonomy describes how all the pieces of content relate to one another. Through its rigidly controlled network of meaning, there is a way to say with confidence:

“Item X and Item Y are in the same group. When you look at Item X, you may also be interested in Item Y.”

We take this kind of connection for granted these days because Amazon and other e-commerce giants have made such ubiquitous and successful use of taxonomy to sell related things, but it’s really quite difficult to establish those kinds of relationships in your content without taxonomy.

In summary to this point, then, “taxonomy” on a website is a classification system that maps all your content to other content. Taxonomy on a website creates a scaffold that holds your content together.
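Here is a minimal Python sketch of the “Item X and Item Y” statement. The catalog, its terms, and the function names are all hypothetical, and real recommendation systems are far more elaborate. The point is only that relatedness falls out of shared classification.

```python
from collections import defaultdict

# Hypothetical catalog: each item is filed under one taxonomy term.
CATALOG = {
    "Item X": "Turtlenecks",
    "Item Y": "Turtlenecks",
    "Item Z": "Jeans",
}


def group_by_term(catalog):
    """Invert the catalog: taxonomy term -> list of items filed under it."""
    groups = defaultdict(list)
    for item, term in catalog.items():
        groups[term].append(item)
    return groups


def related_items(item, catalog):
    """Return the other items that share this item's taxonomy term."""
    term = catalog[item]
    return sorted(other for other in group_by_term(catalog)[term] if other != item)


if __name__ == "__main__":
    print(related_items("Item X", CATALOG))
```

Without the shared term, nothing in the data connects Item X to Item Y.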

Not one taxonomy, but many

It gets a little more complicated from here. Whereas in a biological taxonomy, we’re dealing with only one dimension of relationship, the ultimate relationship of one species to another through its name, on a website, there can be many classification systems to govern the relationship of content along many dimensions.

Let’s start with a clothing retailer. The most basic taxonomy would divide the products into groups of “kind” to answer the question, “What article of clothing is this?”

Clothing for the upper body

  • Shirts
    • Blouses
    • T-shirts
    • Polos
    • Turtlenecks
  • Jackets
    • Blazer
    • Windbreaker
  • Sweaters
    • Cardigan
    • Pull-over
    • Vest

Clothing for the legs

  • Pants
    • Dress pants
    • Jeans
    • Shorts
  • Skirts
    • Full-length
    • Wraps
    • Culottes (really a hybrid)

Accessories

  • Jewelry
    • Rings
    • Earrings
    • Watches
    • Necklaces
  • Belts
  • Hats
  • Bags

So far, so good. We have a system for identifying items by basic type. But that’s not so good for sales.
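As a rough sketch, that “kind” taxonomy can be written down as a nested structure in Python, with a lookup that answers where a given term sits in the tree. The structure mirrors the lists above; the function name path_to is hypothetical.

```python
# The "kind" taxonomy from the lists above, as nested dicts and lists.
CLOTHING = {
    "Clothing for the upper body": {
        "Shirts": ["Blouses", "T-shirts", "Polos", "Turtlenecks"],
        "Jackets": ["Blazer", "Windbreaker"],
        "Sweaters": ["Cardigan", "Pull-over", "Vest"],
    },
    "Clothing for the legs": {
        "Pants": ["Dress pants", "Jeans", "Shorts"],
        "Skirts": ["Full-length", "Wraps", "Culottes"],
    },
    "Accessories": {
        "Jewelry": ["Rings", "Earrings", "Watches", "Necklaces"],
        "Belts": [],
        "Hats": [],
        "Bags": [],
    },
}


def path_to(term, tree=CLOTHING, trail=()):
    """Return the path from the top of the tree to `term`, or None if absent."""
    if isinstance(tree, dict):
        branches = tree.items()
    else:
        branches = ((leaf, None) for leaf in tree)  # a list holds only leaves
    for name, children in branches:
        here = trail + (name,)
        if name == term:
            return list(here)
        if children:
            found = path_to(term, children, here)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(path_to("Jeans"))
```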

There will be, then, additional taxonomies to build up a multidimensional system that organizes products into classes: For women or men, girls or boys; for casual, work or formal contexts; for outdoor or indoor; by color; by season; by ethnic origin; and so on, and so on…
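One way to picture several taxonomies working together is as independent facets on each product. The Python sketch below uses invented products and invented facet names (kind, audience, context) to show how a question along any combination of dimensions can be answered.

```python
# Hypothetical products, each classified along several independent dimensions.
PRODUCTS = [
    {"name": "Linen blazer", "kind": "Jackets", "audience": "women", "context": "work"},
    {"name": "Denim jacket", "kind": "Jackets", "audience": "men", "context": "casual"},
    {"name": "Silk blouse", "kind": "Shirts", "audience": "women", "context": "work"},
    {"name": "Graphic tee", "kind": "Shirts", "audience": "boys", "context": "casual"},
]


def find(products, **facets):
    """Return the names of products that match every requested facet value."""
    return [
        product["name"]
        for product in products
        if all(product.get(facet) == value for facet, value in facets.items())
    ]


if __name__ == "__main__":
    print(find(PRODUCTS, audience="women", context="work"))
    print(find(PRODUCTS, kind="Jackets"))
```

Each facet is its own small taxonomy. The product sits at the intersection of all of them.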

But that’s just the products. There will be other content that accompanies these products, and all that content must also be organized into categories.

  • “How to” content might include tying neckties, caring for leather, or assembling an ensemble for an evening out in Paris.
  • “About us” content might go through all the ways that this company works for environmental activism.
  • Product information might include stories about where the materials came from, or who made them.

The taxonomy must account for all these dimensions of content description and classification, so that when you pull up the product page for that pair of shoes you’re considering, you also can see:

  • What other colors are available?
  • What other shoes are in its class?
  • How do you care for them?
  • What accessories would complete your outfit?
  • How have other customers worn this item? (From their photos)
  • How long would it take to get them if you clicked the button right now?

Taxonomy implemented through metadata

All this work of understanding the interrelationship of content has a specific and practical end: Metadata.

It is beyond the scope of this article to explain the process of developing taxonomic systems and how they are then translated into metadata for your web content. It is crucial, however, to recognize that having a clear, controlled system of metadata, which is then meticulously and consistently connected to your content, is the only way to ensure that your search and coordinated applications serve up the content the user expects, in the language the user expects, in combinations that make sense to the user.
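To give a feel for what “controlled” means in practice, here is a small Python sketch that checks one content item's metadata against a fixed vocabulary, much like the drop-down boxes described earlier. The fields, the allowed terms, and the function name check_metadata are assumptions for illustration, not a real CMS feature.

```python
# Hypothetical controlled vocabulary: the only terms each field may hold.
VOCABULARY = {
    "kind": {"Shirts", "Jackets", "Sweaters", "Pants", "Skirts"},
    "audience": {"women", "men", "girls", "boys"},
    "context": {"casual", "work", "formal"},
}


def check_metadata(metadata, vocabulary=VOCABULARY):
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    for field, allowed in vocabulary.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif metadata[field] not in allowed:
            problems.append(f"{field}: {metadata[field]!r} is not a controlled term")
    for field in metadata:
        if field not in vocabulary:
            problems.append(f"unknown field: {field}")
    return problems


if __name__ == "__main__":
    print(check_metadata({"kind": "Jacket", "audience": "women"}))
```

A near miss such as “Jacket” for “Jackets” is exactly the kind of inconsistency that quietly breaks related-content features, which is why the connection has to be meticulous.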

Rich, interactive experiences require taxonomy

Creating rich internet applications (RIAs) is partly about the technology to evaluate and serve up all these connections, but it is impossible without care, design, and maintenance of your content’s taxonomy.

Again, unlike our scientific counterparts, there can be no single, universal taxonomy for web content because each content domain has its own context of purpose, vocabulary, and peculiarity. There are commercially available taxonomic systems to get you started, but they all have to be evaluated for your specific purpose, and there will always be adaptation of the metadata.

Taxonomy, Navigation, and Sitemaps

A lot of the confusion in the workshops dealt with how a website’s taxonomy relates to the other aspects of its information architecture. As we explore these concepts, keep in mind that when done well, the taxonomy is completely invisible to the user. It just makes everything run smoothly.

Sitemaps

The sitemap reveals the website’s overall organization. Every bit of content on a website needs a primary “home.” Ultimately, when you reach a content item, you are (virtually, of course) in a particular location on the site. The information architect’s job is to choose from the infinite range of organizational possibilities to anchor the user experience, which then is the foundation for the richness that the taxonomy creates.

The sitemap probably will reflect some basic aspects of the taxonomy underlying the content, but when you consider the richness and complexity described above, any relation between the sitemap and the taxonomy will be loose.

Navigation

Navigation is more closely related to the sitemap than to the taxonomy. The main navigation provides the user an organized path around the website, intended for browsing. Like the sitemap, it may reflect some aspects of the taxonomy, but it doesn’t have to.

The taxonomy will enable, however, the local navigation options through access points to content elsewhere on the site, reached through the relatedness of content.

IAs help you put it together!

It’s the job of information architects to work all these intricacies out. The skills for designing the taxonomy and associated metadata are extensive and precise. The content strategist helps to define the content domain and the language that will best represent it, but the IA will be able to build an organizational framework that links the content domain with the technical wizardry that serves up the user experience.

In conclusion, as my best-bud Becky says, “There is no right or wrong way of [creating taxonomy]. The trick is to come up with a taxonomy that works for your users.”

I hope that this article has helped to clarify the definition of taxonomy and its application. Please offer corrections, amplifications, and clarification. It’s a matter of wide importance, and we need to get it right!

Content Typology: Getting a Handle on Your Content Types

“Content types” are among the least understood, and yet most potent, aspects of user experience and web design. Most people encounter them for the first time when implementing a grand-scale content management system (CMS) because you have to define content types before building templates for each kind of content you’re going to publish. (Everything I know about content types began with Bob Boiko’s Content Management Bible, and I recommend it to anyone facing a new CMS.)

Because they associate content types so closely with CMS, some make the mistake of equating content strategy with content management. They’re not the same thing, though they are certainly related. Your content strategy specifies the content types that will then be modeled for your CMS.

I want to take some time, then, to tell you what I understand about content typology, so that you’ll be able to address content types in your strategy.


Find the Distinctions That Make a Difference

Rachel Lovinger (@rlovinger) just published a great piece on categorizing, called “Splitting Tigers, Lumping Rabbits,” on Scatter/Gather. I love her simple, elegant advice: “You just need to find the right balance between lumping and splitting.”

Since I read it, I’ve been wondering: How do you find that balance? Is it just some feeling that comes upon you when you have all the pieces in the proper order? Is it like sorting male and female chicks?—something that is learned unconsciously through experience? Is there some way to work it out systematically?

I believe that finding the balance lies in discovering which distinctions make the most difference for the users of your content. If you can articulate what makes this thing different from that one, and why that difference matters to your users, then you will have identified the dimensions of difference. You will also have created a test for your categories, your labels, your navigation, and perhaps even the whole content strategy for your website.
