Schema.org and Structured Data
Introduction
Metadata plays a central role in most computer applications. There are different data models to structure metadata. The data model we will use to add metadata to our pages is called schema.org.
Data Models: An Overview
Some data models provide a basic structure, but others are more involved. Regardless, they all provide a sort of controlled vocabulary, which means they:
- reduce ambiguity by using approved terms
- promote interoperability between systems, platforms, or institutions
- allow for machine processing by enforcing structures and values.
For example, the Dublin Core data model provides a flat list of controlled terms. The main terms include:
- contributor
- coverage
- creator
- date
- description
- format
- identifier
- language
- publisher
- relation
- rights
- source
- subject
- title
- type
We can use this data model for all sorts of items or works. Here I use it to describe Toni Morrison's Pulitzer Prize-winning novel, Beloved:
DC Term | Value |
---|---|
Creator | Toni Morrison |
Title | Beloved |
Publisher | Alfred A. Knopf Inc. |
Language | English |
Date | 1987 |
Identifier | 1-58060-120-0 |
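To make the idea of a controlled vocabulary concrete, here is a minimal sketch in Python. It stores the record from the table as a dictionary and checks its keys against the fifteen terms listed above. The helper name `unknown_terms` is hypothetical and is not part of any Dublin Core tooling.

```python
# The fifteen Dublin Core terms listed above, used as a controlled vocabulary.
DC_TERMS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# The record from the table, keyed by lowercase DC term.
beloved = {
    "creator": "Toni Morrison",
    "title": "Beloved",
    "publisher": "Alfred A. Knopf Inc.",
    "language": "English",
    "date": "1987",
    "identifier": "1-58060-120-0",
}


def unknown_terms(record):
    """Return the keys in a record that are not Dublin Core terms (hypothetical helper)."""
    return sorted(set(record) - DC_TERMS)


print(unknown_terms(beloved))                      # []
print(unknown_terms({"author": "Toni Morrison"}))  # ['author']
```

A key such as `author` is rejected because it is not an approved term, which is how a controlled vocabulary reduces ambiguity.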
Unlike Dublin Core, other controlled vocabulary data models are designed to be more interconnected. For example, MeSH (Medical Subject Headings) and LCSH (Library of Congress Subject Headings) are types of thesauri. Both are organized in a tree-like hierarchical structure. MeSH terms are arranged from broad categories (e.g., Neoplasms) to specific categories (e.g., Neoplasm Metastasis). Likewise, LCSH employs broader terms (BT, like C (Computer program language)), related terms (RT, like Objective-C (Computer program language)), and narrower terms (NT, like Small-C (Computer program language)). MeSH is used in biomedical and health-related indexing, such as in PubMed. LCSH is used by the Library of Congress and in academic library catalogs (e.g., the University of Kentucky's InfoKat).
Schema.org
Then there's schema.org, which was created by Google, Microsoft, Yahoo, and Yandex, for the purpose of describing web content for search engines. Unlike the prior data models, schema.org functions more like a taxonomy and ontology. As a taxonomy, schema.org is a kind of hierarchical, relational classification system, and as an ontology, schema.org stresses foundational components, such as concepts or classes, properties or attributes, relationships, and instances.
Like other data models, the schema.org vocabulary provides a method for adding structured, linked data. The data is linked because the vocabulary is interconnected via a hierarchical data model. This means that each type or property can point to or be reused across datasets, and it's this characteristic that creates a web of meaning that is readable by machines.
For example, the root data type in schema.org is Thing. The Thing type includes child data types such as Action, Person, Place, Organization, and more. These are all types of Things (or classes).
The child data types include additional descendants; for example, the following are all specific classes of Organization: Airline, EducationalOrganization, PoliticalParty, LocalBusiness, and more.
Digging deeper, if we focus on the EducationalOrganization type, we find that it includes more specific types of its own: CollegeOrUniversity, ElementarySchool, HighSchool, and so on.
The subclass relationship among schema.org data types is transitive (if a > b and b > c, then a > c).
For example, the University of Kentucky is an instance of the CollegeOrUniversity type. This type is itself a subclass of EducationalOrganization. We could go on: the EducationalOrganization type is a subclass of the Organization type. And finally, the Organization type is a subclass of Thing.
We might represent this as follows:
- Thing
  - Organization
    - EducationalOrganization
      - CollegeOrUniversity
        - University of Kentucky (instance)
All classes ultimately descend from the Thing type, just as in biology all life on Earth is classified in a taxonomy with Domain as the broadest rank.
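The lineage and its transitivity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENT` table holds just the four types named in this section, and the function names `lineage` and `is_subclass_of` are hypothetical.

```python
# Each type maps to its parent; Thing is the root and has no parent.
# Only the types named in this section are included.
PARENT = {
    "Thing": None,
    "Organization": "Thing",
    "EducationalOrganization": "Organization",
    "CollegeOrUniversity": "EducationalOrganization",
}


def lineage(type_name):
    """Return the chain from a type up to the root, starting with the type itself."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT[type_name]
    return chain


def is_subclass_of(child, ancestor):
    """True if ancestor appears anywhere above child, not just directly above it."""
    return ancestor in lineage(child)[1:]


print(" > ".join(reversed(lineage("CollegeOrUniversity"))))
# Thing > Organization > EducationalOrganization > CollegeOrUniversity
print(is_subclass_of("CollegeOrUniversity", "Thing"))  # True
```

Because `is_subclass_of` walks the whole chain, CollegeOrUniversity counts as a subclass of Thing even though three steps separate them.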
Furthermore, all types have properties. A Thing type can have the following properties: image, name, and description. And an EducationalOrganization Thing can have alumni as a property.
And each of those properties may have additional properties or take on values. For example, for image, we can provide a URL to an actual image. Or we may provide a caption for it.
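As a rough preview of how an instance and its properties might look as data, the sketch below builds a Python dictionary and prints it as JSON. The `@context` and `@type` keys follow JSON-LD conventions. The description, caption, and image URL are placeholders invented for illustration, not real values.

```python
import json

# One instance (University of Kentucky) of the CollegeOrUniversity type.
# The image property takes a nested ImageObject with its own properties.
university = {
    "@context": "https://schema.org",
    "@type": "CollegeOrUniversity",
    "name": "University of Kentucky",
    "description": "A placeholder description for illustration.",
    "image": {
        "@type": "ImageObject",
        "url": "https://example.org/campus.jpg",  # placeholder, not a real image
        "caption": "A placeholder caption.",
    },
}

print(json.dumps(university, indent=2))
```

Here `image` holds a nested object with its own `url` and `caption`. A plain URL string could be used as the value instead.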
Just as the University of Kentucky counts as an instance of CollegeOrUniversity, each type in schema.org has its own set of instances.
To illustrate: since a CollegeOrUniversity thing is also an EducationalOrganization thing, a CollegeOrUniversity thing may also have the properties specific to EducationalOrganization, such as alumni.
For instance, because CollegeOrUniversity inherits from Thing, it can use general properties like name, description, and url, as well as more specific ones like alumni that it inherits directly from EducationalOrganization.
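Property inheritance along this lineage can be sketched as follows. The tables are deliberately tiny: `OWN_PROPERTIES` lists only the properties mentioned in this section, so an empty list means "none shown in this sketch" and does not mean the type defines no properties of its own. The function name `all_properties` is hypothetical.

```python
# Parent links for the types named in this section.
PARENT = {
    "Thing": None,
    "Organization": "Thing",
    "EducationalOrganization": "Organization",
    "CollegeOrUniversity": "EducationalOrganization",
}

# Properties declared directly on each type (only those mentioned in the text).
OWN_PROPERTIES = {
    "Thing": ["name", "description", "image", "url"],
    "Organization": [],
    "EducationalOrganization": ["alumni"],
    "CollegeOrUniversity": [],
}


def all_properties(type_name):
    """Collect properties from the type and every ancestor, root first."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT[type_name]
    props = []
    for ancestor in reversed(chain):
        props.extend(OWN_PROPERTIES[ancestor])
    return props


print(all_properties("CollegeOrUniversity"))
# ['name', 'description', 'image', 'url', 'alumni']
print(all_properties("Organization"))
# ['name', 'description', 'image', 'url']
```

In this sketch, Organization does not get `alumni`, because that property is declared lower in the tree, on EducationalOrganization.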
A CollegeOrUniversity thing may also inherit properties from types outside the single lineage traced above.
For example, a CollegeOrUniversity thing may also be a CivicStructure thing and a Place thing, even though neither of those is a descendant of EducationalOrganization.
In this way, specific things and properties can interconnect or link to each other, forming linked data.
That is, it's this ability to belong to multiple classes, and to inherit properties from these classes,
that enables schema.org to describe real-world complexity more naturally than rigid, single-hierarchy systems.
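One way to picture membership in several classes is to let each type list more than one parent. The sketch below does that. Treat the parent links as assumptions for illustration, in particular the link from EducationalOrganization to CivicStructure, and check them against the schema.org hierarchy before relying on them. The function name `ancestors` is hypothetical.

```python
# Each type maps to a list of parents, so a type can sit in more than one branch.
# The CivicStructure link is an assumption made for this illustration.
PARENTS = {
    "Thing": [],
    "Organization": ["Thing"],
    "Place": ["Thing"],
    "CivicStructure": ["Place"],
    "EducationalOrganization": ["Organization", "CivicStructure"],
    "CollegeOrUniversity": ["EducationalOrganization"],
}


def ancestors(type_name):
    """Return every type reachable by following parent links, without duplicates."""
    seen = []
    stack = list(PARENTS[type_name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PARENTS[current])
    return seen


print(sorted(ancestors("CollegeOrUniversity")))
# ['CivicStructure', 'EducationalOrganization', 'Organization', 'Place', 'Thing']
```

The structure is a graph with a single root, not a strict tree: Thing is reached through both the Organization branch and the Place branch.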
As you can see, a particular instance of some Thing may be a member of many classes, or be many types of Things. The same is true of you and me. For example, I am a professor, a parent, an offspring, and so on. You might be a student and an offspring. Thus, we share at least one class, and we inherit the properties of that class and of its broader classes, like Person.
By using this organizational model to describe the content of a web page, search engines can begin to understand
that content and its context and the relationship among Things on the web.
Employing schema.org on your web pages requires some familiarity with the data model and what it offers. Therefore, begin by reviewing the Full schema hierarchy for a complete listing of what is available.
Conclusion
In this section, we were introduced to the schema.org data model. We learned that schema.org is a hierarchical and extensible data model, with Thing as the root class and many descendant classes that inherit and extend its properties. By understanding this structure, we will be able to select the right types and properties to describe our web pages.
To sum up: search engines, AI, and other services use metadata serialized as JSON-LD to understand the content expressed in HTML, which is itself written for human consumption. The schema.org vocabulary supplies the terms for that metadata (although other data models exist for different contexts).
In the next section, we will use the schema.org data model to model the content of our web pages, and then focus on the practical aspects of serializing those models as JSON-LD.