Schema.org and Structured Data

Introduction

Metadata plays a central role in most computer applications, and several data models exist for structuring it. The data model we will use to add metadata to our web pages is called schema.org.

Data Models: An Overview

Some data models provide a basic structure, but others are more involved. Regardless, they all provide a sort of controlled vocabulary, which means they:

  • reduce ambiguity by using approved terms
  • promote interoperability between systems, platforms, or institutions
  • allow for machine processing by enforcing structures and values.

For example, the Dublin Core data model provides a flat list of controlled terms. The main terms include:

  • contributor
  • coverage
  • creator
  • date
  • description
  • format
  • identifier
  • language
  • publisher
  • relation
  • rights
  • source
  • subject
  • title
  • type

We can use this data model for all sorts of items or works. Here I use it to describe Toni Morrison's Pulitzer Prize-winning novel, Beloved:

DC Term       Value
Creator       Toni Morrison
Title         Beloved
Publisher     Alfred A. Knopf Inc.
Language      English
Date          1987
Identifier    1-58060-120-0
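
A record like the one above maps naturally onto key-value pairs. The following is a minimal Python sketch, not an official Dublin Core tool: the names DC_TERMS and unknown_terms are hypothetical, introduced here only to illustrate how a controlled vocabulary lets a program reject fields that fall outside the approved list.

```python
# The fifteen Dublin Core terms listed earlier in this section.
DC_TERMS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# The Beloved record, keyed by Dublin Core term.
beloved = {
    "creator": "Toni Morrison",
    "title": "Beloved",
    "publisher": "Alfred A. Knopf Inc.",
    "language": "English",
    "date": "1987",
    "identifier": "1-58060-120-0",
}

def unknown_terms(record):
    """Return the keys of record that are not Dublin Core terms, sorted."""
    return sorted(set(record) - DC_TERMS)

print(unknown_terms(beloved))                     # []
print(unknown_terms({**beloved, "author": "x"}))  # ['author']
```

Because "author" is not an approved term, the second call flags it. This is the sense in which a controlled vocabulary reduces ambiguity: the same idea must always be expressed with the same term, here creator.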

Unlike Dublin Core, other controlled vocabulary data models are designed to be more interconnected. For example, MeSH (Medical Subject Headings) and LCSH (Library of Congress Subject Headings) are types of thesauri. Both are organized in a tree-like hierarchical structure. MeSH terms are arranged from broad categories (e.g., Neoplasms) to specific categories (e.g., Neoplasm Metastasis). Likewise, LCSH employs broader terms (BT, like C (Computer Programming Language)), related terms (RT, like Objective-C (Computer Programming Language)), and narrower terms (NT, like Small-C (Computer Programming Language)). MeSH is used in biomedical and health-related indexing, such as in PubMed. LCSH is used by the Library of Congress and in academic library catalogs (e.g., the University of Kentucky's InfoKat).

Schema.org

Then there's schema.org, which was created by Google, Microsoft, Yahoo, and Yandex to describe web content for search engines. Unlike the prior data models, schema.org functions more like a taxonomy and an ontology. As a taxonomy, schema.org is a hierarchical, relational classification system. As an ontology, it stresses foundational components, such as concepts or classes, properties or attributes, relationships, and instances.

Like other data models, the schema.org vocabulary provides a method for adding structured, linked data. The data is linked because the vocabulary is interconnected via a hierarchical data model. This means that each type or property can point to or be reused across datasets, and it's this characteristic that creates a web of meaning that is readable by machines.

For example, the root data type in schema.org is Thing. The Thing type includes child data types such as Action, Person, Place, Organization, and more. These are all types of Things (or classes). The child data types include additional descendants; for example, the following are all examples of specific classes of an Organization thing:

  • Airline
  • EducationalOrganization
  • PoliticalParty
  • LocalBusiness, and more.

Digging deeper, if we focus on the EducationalOrganization type, we find that it has more specific subtypes of its own:

  • CollegeOrUniversity
  • ElementarySchool
  • HighSchool, and so on.

The subclass relationship between schema.org types is transitive (if a is a subclass of b, and b is a subclass of c, then a is a subclass of c). For example, the University of Kentucky is an instance of the CollegeOrUniversity type. CollegeOrUniversity is a subclass of EducationalOrganization. We could go on: EducationalOrganization is a subclass of Organization. And finally, Organization is a subclass of Thing. We might represent this as follows:

- Thing
    - Organization
        - EducationalOrganization
            - CollegeOrUniversity
                - University of Kentucky (instance)
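
The chain above can be walked programmatically. The following Python sketch assumes a hand-written PARENT table that covers only the four types in this example (it is not the full schema.org hierarchy), and the function names ancestors and is_subtype_of are hypothetical.

```python
# Hand-written subset of the hierarchy: each type maps to its parent type.
PARENT = {
    "Organization": "Thing",
    "EducationalOrganization": "Organization",
    "CollegeOrUniversity": "EducationalOrganization",
}

def ancestors(type_name):
    """Follow parent links upward and return every broader type, nearest first."""
    chain = []
    while type_name in PARENT:
        type_name = PARENT[type_name]
        chain.append(type_name)
    return chain

def is_subtype_of(narrow, broad):
    """Return True if broad can be reached from narrow by following parent links."""
    return broad in ancestors(narrow)

print(ancestors("CollegeOrUniversity"))
# ['EducationalOrganization', 'Organization', 'Thing']
print(is_subtype_of("CollegeOrUniversity", "Thing"))  # True
```

The second call shows transitivity at work: nothing in the table says directly that CollegeOrUniversity is a kind of Thing, yet the answer follows from the intermediate links.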

All classes eventually descend back to the Thing type, just as in biology, all life on Earth is classified in a taxonomy with Domain holding the broadest rank.

Furthermore, all types have properties. A Thing type can have the following properties:

  • image
  • name
  • description

An EducationalOrganization thing can additionally have alumni as a property.

Each of those properties takes on a value, and that value may itself be a thing with properties of its own. For example, for image, we can simply provide a URL to an actual image. Or we may describe the image more fully and provide a caption for it.
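
The two ways of filling in image can be sketched as Python dictionaries. This is an illustration only: the example.com address and the caption text are made up, the "@type" key follows the JSON-LD convention for naming a thing's type, and image_url is a hypothetical helper.

```python
# Form 1: the value of image is a plain URL (hypothetical address).
uk_simple = {
    "@type": "CollegeOrUniversity",
    "name": "University of Kentucky",
    "image": "https://example.com/campus.jpg",
}

# Form 2: the value of image is itself a typed thing with its own properties.
uk_detailed = {
    "@type": "CollegeOrUniversity",
    "name": "University of Kentucky",
    "image": {
        "@type": "ImageObject",
        "url": "https://example.com/campus.jpg",
        "caption": "A view of campus",  # made-up caption
    },
}

def image_url(thing):
    """Return the image URL whichever of the two forms was used."""
    image = thing["image"]
    return image if isinstance(image, str) else image["url"]

print(image_url(uk_simple) == image_url(uk_detailed))  # True
```

Both forms point to the same picture. The second costs a few more lines but leaves room for a caption and other details about the image.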

Just as University of Kentucky can be counted as an instance of CollegeOrUniversity, each type in schema.org has its own set of instances, and those instances can use the properties of every type above them. To illustrate: since a CollegeOrUniversity thing is also an EducationalOrganization thing, it may have the properties specific to EducationalOrganization, such as alumni. And because CollegeOrUniversity ultimately descends from Thing, it can also use general properties like name, description, and url.
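
Property inheritance amounts to collecting properties while climbing the hierarchy. The sketch below is deliberately incomplete: PARENT and OWN_PROPERTIES are hand-written tables holding only the types and properties named in this section (the real vocabulary defines many more), and available_properties is a hypothetical helper.

```python
# Hand-written subset: each type maps to its parent type.
PARENT = {
    "Organization": "Thing",
    "EducationalOrganization": "Organization",
    "CollegeOrUniversity": "EducationalOrganization",
}

# Only the properties mentioned in this section, listed on the type that defines them.
OWN_PROPERTIES = {
    "Thing": ["name", "description", "image", "url"],
    "EducationalOrganization": ["alumni"],
}

def available_properties(type_name):
    """Collect the properties of a type and of every type above it, sorted."""
    props = []
    while type_name is not None:
        props.extend(OWN_PROPERTIES.get(type_name, []))
        type_name = PARENT.get(type_name)  # None once we climb past Thing
    return sorted(props)

print(available_properties("CollegeOrUniversity"))
# ['alumni', 'description', 'image', 'name', 'url']
print(available_properties("Organization"))
# ['description', 'image', 'name', 'url']
```

In this sketch, alumni is available to CollegeOrUniversity but not to Organization, because it is defined lower in the hierarchy, on EducationalOrganization.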

A CollegeOrUniversity thing may also take on properties of types that are not in its direct lineage. For example, a CollegeOrUniversity thing may also be a CivicStructure thing and a Place thing, even though neither of those is a descendant of EducationalOrganization. In this way, specific things and properties can interconnect or link to each other, forming linked data. It is this ability to belong to multiple classes, and to inherit properties from each of them, that enables schema.org to describe real-world complexity more naturally than rigid, single-hierarchy systems.
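
Belonging to several classes at once can be sketched by letting a thing declare more than one type. In the Python example below, PARENTS is a hand-written table limited to the types discussed in this section and arranged as the text describes them, not a copy of the real vocabulary. The names all_types and uk_types are hypothetical.

```python
# Hand-written subset: each type maps to a list of its parent types.
PARENTS = {
    "Thing": [],
    "Organization": ["Thing"],
    "EducationalOrganization": ["Organization"],
    "CollegeOrUniversity": ["EducationalOrganization"],
    "Place": ["Thing"],
    "CivicStructure": ["Place"],
}

def all_types(declared):
    """Return the declared types plus every broader type, without duplicates."""
    seen = set()
    stack = list(declared)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return sorted(seen)

# The University of Kentucky described as both a university and a civic structure.
uk_types = ["CollegeOrUniversity", "CivicStructure"]
print(all_types(uk_types))
# ['CivicStructure', 'CollegeOrUniversity', 'EducationalOrganization',
#  'Organization', 'Place', 'Thing']
```

Two separate branches, one through Organization and one through Place, meet again at Thing. The instance can draw properties from every type in the resulting list.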

Visual representation of the connections between schema.org types using University of Kentucky as an example
Fig. 2. Schema.org Example Map of University of Kentucky. Types are represented in square shapes. Properties are represented in oval shapes. Instances are represented in hexagonal shapes. Diagram created using Dia.

As you can see, a particular instance of some Thing may be a member of many classes, or be many types of Things. The same is true of you and me. For example, I am a professor, a parent, an offspring, etc. You might be a student and an offspring. Thus, we share at least one class, and we both inherit the properties of that class and its broader classes, like Person. By using this organizational model to describe the content of a web page, search engines can begin to understand that content, its context, and the relationships among Things on the web.

Employing schema.org on your web pages requires some familiarity with the data model and what it offers. Therefore, begin by reviewing the Full schema hierarchy for a complete listing of what is available.

Conclusion

In this section, we were introduced to the schema.org data model. We learned that schema.org is a hierarchical and extensible data model, with Thing as the root class and many descendant classes that inherit and extend its properties. By understanding this structure, we will be able to select the right types and properties that describe our web pages.

To sum it up: metadata serialized in JSON-LD is used by search engines, AI, and other services to understand the content expressed in HTML, which is itself written for human consumption. The schema.org vocabulary supplies the terms for that metadata (although other data models exist for different contexts).

In the next section, we will use the schema.org data model to model the content of our web pages. Then we will focus on the practical aspects of serializing the models we create as JSON-LD.