Dr. Tom’s Meta-Data Guide

Thomas D. Wason, Ph.D. (aka Dr. Tom)
http://www.tomwason.com
wason@mindspring.com

One of the Dr. Tom Guides

More about meta-data than anyone would ever want to know.

I shall try to use a minimum of specialized terms and acronyms. I will be unsuccessful, but I'll try. You will note that I hyphenate "meta-data". There is a company, the Metadata Corporation, that is vigorously defending a copyright on the terms "Metadata" and "METADATA". It is on pretty shaky ground in my opinion, but no one wants to commit the resources to fight them in court, so IMS has adopted the practice of hyphenating the term. As I read the copyright information that its lawyers sent me, it is not copyrighting the lower case "metadata", but it is cumbersome to switch back and forth.

Technically the term "meta-data" is plural: "These are the meta-data", just as the term "data" is the plural of "datum". As with "data", "meta-data" is frequently used in the singular. Such is the evolution of language.

As a starting point, we can say that meta-data is/are "descriptive information". About what? How does it describe it? Ay, there's the rub.

There is Dwight Waldo’s parable of the five blind men encountering an elephant. Each describes what the animal seems to be, based on the part each has encountered (i.e., the trunk makes the beast seem like a big snake, a leg like a tree, the side like a wall, the tail like a rope, the ear like a big leaf). There seem to be five different elephants. Meta-data similarly can be looked at from several viewpoints. Discussions of meta-data absolutely require that all parties understand the particular viewpoint of the moment. All are valid. None alone is adequate.

Some of the essential questions about meta-data seem pretty confusing:

What is meta-data, anyway?
How is the meta-data structured?
What is it called?
What does it mean?
What is it about?
How is it represented?
Why bother?

The diamonds are in the details of the answers, one hopes. At the end I'll give a few examples of meta-data in IMS and Dublin Core systems.

What is meta-data, anyway?

Meta-data can be considered as "data about data". This is the historical definition of meta-data. This is largely of interest to programmers. What are the data fields in a database? What are permissible values? How are the values expressed? This is really "pure" meta-data: it describes a data set and the data formats of the values.

An example of a data value is a calendar date. There are several different standards and specifications for calendar dates. A programmer must know which one is being used in order to interpret a block of data. Currently the term "meta-data" applies to more than just descriptions of data in a database. The information in the database itself is considered meta-data about something else: books, images, digital resources and so forth. This derived sense of meta-data is frequently a source of miscommunication between librarians and programmers.

Let us consider a block of data about a vertical column in the ocean as a resource. I'm over my head in ignorance about oceanographic data, so float with me here. Let us consider data along a vertical line going down from the surface at a particular geospatial coordinate on a particular date starting at a particular time. At intervals there are measurements of depth, time, temperature, density, current speed and flow direction. Let us imagine this as a table of data with columns for each of the parameters, and that each row contains column values gathered at the same depth at the same time. This is a simple two-dimensional matrix.

Let us take that first column, depth. How is the data encoded: integers? decimal values? What is the permissible range of values? For example, does the cell allow a negative value? How is a missing value encoded? To this basic information we can give some interpretation: What are the units? Meters? Feet? Fathoms? Given waves and tides, what does "depth" mean? Instantaneous depth? Depth relative to the mean? Can there be a value greater than the diameter of the earth? To this might be added information about the value: What is the range of error? Error might also be an additional table column. What does "zero" depth mean? Is it information just above the water surface?

All of this information about the data in the cell is meta-data. It describes something about the data. There are two kinds of information here, though: technical and interpretative. The technical information pertains to what the allowable data in the cell can be. This is the stuff that a programmer must deal with. This will usually be only data type and range for a single cell of data. If the depth is an integer, then decimal values must be rounded, truncated or rejected when someone tries to enter them. How such exceptions are handled is a matter of policy for the organization. The rest of the information is important for interpretation of the number in the cell. Frequently that information is carried in the data set someplace. It is all meta-data, but from a programmer's standpoint, traditionally only the technical information is considered "meta-data". Some technical people feel that the term "meta-data" has been usurped. It's all information that must be available for every cell in the matrix.
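
The two kinds of information can be kept side by side in code. What follows is a minimal Python sketch, not a prescribed format: the class name ColumnSpec, its field names, the range limits, the missing-value code of -9999 and the units are all assumptions that I made up for illustration. It shows the technical rules (type, range, missing value) being enforced, while the interpretative information (units, reference level) simply travels along with them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical meta-data for one column of the ocean table."""
    # Technical meta-data: what a cell is allowed to hold.
    name: str
    data_type: type
    minimum: float
    maximum: float
    missing_code: int
    # Interpretative meta-data: what the number means.
    units: str
    reference: str

    def accept(self, raw: float, policy: str = "reject") -> Optional[float]:
        """Apply the technical rules to one raw value.

        Returns None for a missing value, otherwise the value to store.
        What to do with a decimal in an integer column ("round",
        "truncate" or "reject") is an organizational policy, so it is
        passed in as a parameter.
        """
        if raw == self.missing_code:
            return None
        if self.data_type is int and raw != int(raw):
            if policy == "round":
                raw = round(raw)
            elif policy == "truncate":
                raw = int(raw)
            else:
                raise ValueError(f"{self.name}: {raw} is not an integer")
        if not self.minimum <= raw <= self.maximum:
            raise ValueError(f"{self.name}: {raw} is out of range")
        return self.data_type(raw)


# Assumed values, chosen only to make the example concrete.
DEPTH = ColumnSpec(
    name="depth",
    data_type=int,
    minimum=0,
    maximum=11000,
    missing_code=-9999,
    units="meters",
    reference="below mean sea level",
)
```

With this sketch, DEPTH.accept(12.6, "truncate") stores 12, DEPTH.accept(-9999) reports a missing value, and DEPTH.accept(12.6) is rejected under the default policy.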

At the next level we can look at the entire table: How is it built? How many columns are there, how many rows are there, and what are the titles for each column? This is the structure of the dataset. That information is meta-data, and is probably the most original meaning of the concept. From this information a programmer knows where to "put" and "get" information. Again, it is technical information.

The entire dataset can be considered as a resource. It can be described with bibliographic information. The dataset is a particular kind of information about a particular patch of ocean at a particular time. The data has an author (whatever that may mean) and a publisher, and is disseminated in a particular format, e.g., an Excel spreadsheet. This is the sort of information that one might use when searching for information. It is sometimes referred to as cataloging information or indexing. This information may have predefined descriptions of which information is to be included and how the information is to be represented. This is meta-data about the meta-data. Sometimes this is called "meta-meta-data". I shall use this term only when it is meta-data about a meta-data record included within that record. Clearly one can have meta-data about a collection of meta-data. It is important that clear, well-defined terminology be used. It is suggested that a glossary of these terms be created.

It is important that the type of meta-data discussed at any particular point be very clearly articulated. Errors in this sort of reference can create significant confusion.

Well, that was pretty confusing for a while, but we can see that for our purposes, the meta-data that we are considering here is meta-data about the resources. It is "bibliographic meta-data". We will need to create information describing what information is captured in that meta-data, and how it is represented. Each meta-data record should be able to refer to the specification that describes its content and format. Programmers typically are interested in the technical aspects of the meta-data: its structure and format. They are not the appropriate people to decide what information is important enough to be captured. With luck, however, they will be interested in what information is to be captured so that useful meta-data creation and search tools can be created. Programmers who know about the intended use of the meta-data can also intelligently construct storage systems for effective storage and retrieval. Information that does not need to be searched can be treated in a different manner from that which is to be searched.

How is the meta-data structured?

Structured? And you thought meta-data was just a list of descriptive terms. What complexity rears its ugly head here? The information structure is sometimes called the "data model". There are two basic types of meta-data structures that I will call the "clothesline" and the "mobile" models.

Remember clotheslines? Those were cords strung up outside, on which one hung wet clothes to dry with clothespins. The clothesline model can be thought of as a clothesline with a bunch of spring-loaded wooden clothespins. Each clothespin has a label written on it: Title, Creator, Subject, etc. Each clothespin can hold a single strip of paper with a data value on it. This is a set of names with values. These are called name-value pairs. Wow. Clever, huh? For example, the meta-data field called "Author" may contain the name of the author:

Author = Jane Doe.

The equals sign expresses the one-to-one nature of the name and the value. As Dublin Core meta-data using an HTML "meta" tag this would be expressed:

<META NAME="dc.Creator" CONTENT="Jane Doe">
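
A clothesline is easy to mimic in code. The following Python sketch turns a set of name-value pairs into META tags of the form shown above. The function name to_meta_tags, the "dc" prefix parameter and the sample record are my own assumptions for illustration, not part of any Dublin Core tool.

```python
from html import escape
from typing import Dict, List


def to_meta_tags(pairs: Dict[str, str], prefix: str = "dc") -> List[str]:
    """Render each name-value pair as one HTML META tag."""
    return [
        # escape() protects quotes and angle brackets inside the values.
        f'<META NAME="{prefix}.{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in pairs.items()
    ]


# An invented record: one clothespin per field.
record = {"Title": "A Sample Resource", "Creator": "Jane Doe"}
for tag in to_meta_tags(record):
    print(tag)
```

Note that a plain dictionary can hold only one value per name, which is exactly the limitation of the clothesline model.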

Meta-data may also be structured. Consider a mobile such as those created by the late Calder. They hang from a single point and have a set of pieces that each might have yet other pieces suspended from them. An author's name can be expressed in a structure:

Name
|--FirstName = Jane
|--LastName = Doe

Or to get closer to reality, Author may be a person:

Author
|--Person
|--|--Name
|--|--|--FirstName =
|--|--|--LastName =
|--|--Address
|--|--|--Organization =
|--|--|--Street1 =
|--|--|--Street2 =
|--|--|--City =
|--|--|--State =
|--|--|--Zipcode =
|--|--|--Country =
|--|--Telephone
|--|--|--Business =
|--|--|--Home =

Familiar, isn't it?

We could also have multiple Authors, repeating Person:

Author
|--Person
|--Person
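
The mobile can be sketched in Python as nested dictionaries and lists. This is only an illustration under my own assumptions: the sample names are invented, and the dotted path notation with bracketed positions produced by the hypothetical flatten function is not taken from any specification.

```python
from typing import Any, Iterator, Tuple

# A nested ("mobile") record with two authors. All values are invented.
record = {
    "Author": {
        "Person": [
            {"Name": {"FirstName": "Jane", "LastName": "Doe"}},
            {"Name": {"FirstName": "John", "LastName": "Roe"}},
        ]
    }
}


def flatten(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Walk the structure and yield a (path, value) pair for every leaf."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from flatten(child, f"{path}.{name}" if path else name)
    elif isinstance(node, list):
        # Repeated nodes are numbered so that each path stays unique.
        for index, child in enumerate(node, start=1):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, str(node)


for path, value in flatten(record):
    print(f"{path} = {value}")
```

Flattening shows the relationship between the two models: every leaf of the mobile becomes a name-value pair, but the name now has to carry the whole path to keep its meaning.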

Structured meta-data can become quite elaborate. The resulting richness requires that a description of the structure exist, and be maintained, someplace where others can access it. This is often called a "registry". An outline, or hierarchy, is one of the simplest structures. One could also have a single person filling several roles: author and publisher, for instance. A "polyhierarchical" structure may result. So we have meta-data about the meta-data. This is particularly true in the geosciences world in which there are complex datasets.

A structured meta-data system may use the structure to create the definition of a field. Within the IMS and IEEE LOM meta-data is a structure called "Classification". This relates to a system for classifying a resource in some manner. Part of the structure of Classification is:

Classification
|--Purpose
|--Description
|--Keywords

"Purpose" is the definition of the type of classification. For example, a Purpose of "Discipline" is equivalent to "Subject". The term is the result of an international negotiation (as noted below), but the definition would be understood by Americans under the rubric of "Subject". Therefore, the Description is a description of the subject area of the resource. The Keywords relate solely to the subject area. In other words, the Description and Keywords must be interpreted with respect to the Purpose of the classification. The structural relationships have meaning.

"Author" in the IMS meta-data system in an XML binding would be expressed as follows:

<LIFECYCLE> 
<CONTRIBUTE> 
<ROLE> 
<LANGSTRING lang="en">Author</LANGSTRING> 
</ROLE> 
<CENTITY> 
<VCARD> 
BEGIN:vCard
FN:Jane Doe
N:Doe;Jane
END:vCard
</VCARD> 
</CENTITY> 
</CONTRIBUTE> 
</LIFECYCLE>

LIFECYCLE refers to the lifecycle of the resource. This system recognizes that there may be contributors to other facets of the resource or its meta-data, such as contributions to the meta-data record itself, and annotations. CONTRIBUTE means that this section contains information about contributions. ROLE refers to the kind of contribution (e.g., Author, Instructional Designer, Editor, Publisher). LANGSTRING is a method of containing a string and defining the language in which its data is expressed. Clearly one could repeat LANGSTRING and have ROLE repeated in different languages. Author is the kind of contribution. CENTITY (C - ENTITY) is a class of person or organization. The term CENTITY is used because ENTITY is a reserved word in the DTD language. CENTITY may be repeated for multiple contributors. VCARD is a specific, standard, widely adopted expression of information about a person or organization. The block from BEGIN:vCard to END:vCard follows a standard format for this information, a fragment of which is demonstrated here. The entire CONTRIBUTE block may be repeated for different kinds (ROLES) of contributions.
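
To see how a program might read such a record, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes that the fragment is well-formed (that is, it ends with a closing LIFECYCLE tag) and that each vCard property sits on its own line. The function name contributions and the choice to pull out only the FN (formatted name) line are mine. A real system would use a proper vCard parser.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# The example record, assumed to be well-formed XML.
IMS_FRAGMENT = """\
<LIFECYCLE>
  <CONTRIBUTE>
    <ROLE>
      <LANGSTRING lang="en">Author</LANGSTRING>
    </ROLE>
    <CENTITY>
      <VCARD>
BEGIN:vCard
FN:Jane Doe
N:Doe;Jane
END:vCard
      </VCARD>
    </CENTITY>
  </CONTRIBUTE>
</LIFECYCLE>
"""


def contributions(xml_text: str) -> List[Tuple[Optional[str], Optional[str], Optional[str]]]:
    """Return (role, language, formatted name) for each contributor."""
    root = ET.fromstring(xml_text)
    found = []
    for contribute in root.iter("CONTRIBUTE"):
        for role in contribute.iterfind("ROLE/LANGSTRING"):
            for vcard in contribute.iterfind("CENTITY/VCARD"):
                # Pull the FN (formatted name) line out of the vCard text.
                names = [
                    line.strip()[3:]
                    for line in (vcard.text or "").splitlines()
                    if line.strip().startswith("FN:")
                ]
                name = names[0] if names else None
                found.append((role.text, role.get("lang"), name))
    return found


print(contributions(IMS_FRAGMENT))
```

The nested loops mirror the structure: a CONTRIBUTE block may carry its ROLE in several languages and may name several CENTITY contributors.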

The IMS (and IEEE LOM) structures can be expanded to include the date of a particular kind of Role. One kind of role is "Terminator". It turns out that knowing that a resource is no longer available is a useful piece of information.

What is it called?

Each field, however it is created (name-value or structured), has some sort of name. Often, this is how it is located (directly or indirectly) in a database. In theory, the name should not mean anything. Numbers would work. It is simply a mechanism for getting to the field value. In fact, an implementation may substitute numbers for names without the knowledge of the user. However, users do not normally encounter numbers as labels. People care passionately about the names of fields. There is a tendency to confuse the name of a field with its definition; therefore, people crave meaningful names. The problem is that words have different meanings for different people and communities. Within the IEEE LTSC LOM (Institute of Electrical and Electronics Engineers Learning Technology Standards Committee Learning Object Metadata) work group, we found that our European counterparts had very different views of what the term "concept" meant. We finally settled on the term "idea". This scenario was repeated many times. Add domain specialization to this internationalization. In K-12 education, a brief description of a work is a "summary". In higher ed, it is an "abstract". Functionally they are the same.

There are methods for creating names. The MARC system (Machine Readable Cataloging: http://lcweb.loc.gov/marc/marc.html) of the Library of Congress uses a naming system of concatenation. MARC has over 3,000 named fields. There are sub-area working groups in MARC that maintain the dictionaries. Dublin Core (http://purl.org/metadata/dublin_core_elements), a radical simplification of MARC, started with 15 fields, but uses a "dotted" notation for adding qualifiers: Date.Publish. IMS meta-data (http://www.imsglobal.org/specifications.html), which uses IEEE LTSC LOM as its base, is a structured system, so naming has to do with the names of the nodes in the structure. The ability to reuse nodes in IMS is important in name selection. For example, "Description" is used in many places. It is a node that actually contains multiple language representation capabilities.

What does it mean?

Semantics. This is the definition of a field that the people within a domain agree to. Implementers (e.g., programmers) basically say: Tell me where to find it in my data base and I'll get it for you. You figure out what it means. Field definitions are not the purview of programmers, although one would hope that programmers would be interested so that they can create effective interfaces and search strategies. The creation of the definitions of fields is an important area of work that must involve the domain experts and the entire user community.

Gene Alloway of the University of Michigan told me early in my work in meta-data to ignore the field names and concentrate on the field definitions first. This was good advice that I have found to be most effective. In international discussions about meta-data fields, we have often put the name of the field aside for a while and concentrated on creating a written definition of the field. This is actually an easier task than one might think. It is critical that the definition be written down, as this will form the basis for interoperability among systems. Once the definition has been resolved, naming is much more tractable, as all understand what the name needs to accomplish, and what the limitations of the name alone will be. The intensity associated with the name as a device for conveying the definition has been defused. Any organization that is embarking on the creation and/or adoption of a meta-data system should focus considerable energies on the definitions. Definitions, not field names or structures, also provide the basis for evaluating existing systems for adoption. Often one finds that an existing system's definitions are better, and more refined, than what one has started with. This is not surprising, as the existing definitions have been carefully worked out. Thus, one should follow the programmer's model and adopt other work whenever possible. This also enhances the possibility of interoperability.

The product of a workgroup creating or extending a meta-data system is a dictionary of fields with complete definitions. This reduces definition "drift", a frequently encountered problem.

What is it about?

The oceanographic "core" example described in "What is meta-data, anyway?" presented a variety of descriptive information: bibliographic meta-data, meta-data about the structure of the resource, and meta-data about the format and range of data values in the data set. It also distinguished meta-meta-data, the description of how the meta-data within a record was constructed, from the actual meta-data within the record. We must be clear about the sort of meta-data we are referring to.

Additionally, we need to know if the meta-data is about human-usable resources or if the resources need to be processed to be used. Is the resource a processing tool? Will one want the same sort of meta-data on all three? Does one want types of meta-data? There might be a common core with different sub-classes of meta-data. How are these related?

How is it represented?

Meta-data has some sort of conceptual model, as described above, the "data model". This model must somehow be realized in some concrete fashion. In English, human language words are represented by alphabetic arrangements that create phonemes (most of the time). The actual technical representation of meta-data is called its "binding". In HTML, meta-data has been encapsulated in META tags with very few controls:

<META name="description" content="The IMS meta-data system."> 
<META name="keywords" CONTENT="IMS, Metadata, Meta Data, meta-data, 
  fields, online, on-line, on line, knowledge, distributed, instruction, 
  education, learning">

The W3C (http://www.w3.org/) has developed a standard for binding structured data called eXtensible Markup Language (XML), which has been called "HTML on steroids". It is a subset of SGML, a somewhat hairy system used by librarians. Many organizations are using XML as the binding for data, particularly meta-data. XML allows the creation of new elements. The contents of each element, which may be more elements or data, can be defined. This is called the Content Model. There are specializations of XML that use XML itself to create schemas. The two most important ones are XML-RDF and XML-Schema. It is beyond the scope of this primer to go into what these are and how they differ. At this time, it looks as if XML-Schema will be more widely adopted. In any case, XML is particularly well-suited for expressing structured meta-data. An example of an XML representation of meta-data is given above.

Why bother?

This all seems so complicated when you just want to find stuff. Wouldn't it be easier to just search the text? This is called free text searching, and there is a significant school of thought that meta-data should be absolutely minimal; free text searching should be used whenever possible. Intelligent agents should search the resources' text to determine what the resource is about relative to the user's interests. It may even be possible to search images. There are three principal reasons for using a meta-data system:

Sufficiency,
Scalability and
Interoperability

1. Sufficiency

Can the resource be adequately described by the resource itself? For example, an image may contain a picture of a particular geologic structure, but it would be hard to search this. Words are needed. Although some resources may contain text, they may need further information to describe or use them. For example, an online children's book with letters of the alphabet and pictures to match could benefit from a description stating that it is an illustrated children's book for ages 4-6 whose subject is the alphabet. In other words, not all materials contain inherently adequate self-descriptions. A data set such as the oceanographic data set described above may well benefit from descriptive cataloging. Additionally, it may be useful to know something about the sort of software needed to use the resource.

2. Scalability

It may be possible to do full text analysis on a few thousand resources, but it is probably impractical for large repositories of resources. Additionally, a few users may be serviced with text analysis, but a large number of users may cause the system to perform at unusably slow rates. Meta-data provides highly targeted, rapid search and recovery at the cost of lower flexibility. If the information is not available in the meta-data, then it can't be accessed through the meta-data.

3. Interoperability

The ability for different systems to interchange information, processes and resources is called "interoperability". If different systems can agree to create a mapping between their meta-data, then it is possible for each to search the other's meta-data. It is also possible for systems to accomplish wide area searches among many systems if they all have created common mappings. Meta-data, as a descriptive system, should allow descriptive mappings amongst systems, hence, interoperability. Interoperability is important for systems that expect to access resources from a variety of sources.
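
A mapping of this kind can be as simple as a lookup table. The Python sketch below is a toy under stated assumptions: the local field names, the choice of Dublin Core targets and the function names to_dublin_core and search are all invented for illustration, and real crosswalks must also reconcile definitions and data formats, not just names.

```python
from typing import Dict, List

# Hypothetical crosswalk from one system's field names to Dublin Core names.
LOCAL_TO_DC = {
    "Author": "Creator",
    "Abstract": "Description",
    "Keywords": "Subject",
}


def to_dublin_core(record: Dict[str, str]) -> Dict[str, str]:
    """Translate a local record, dropping fields that have no mapping."""
    return {
        LOCAL_TO_DC[name]: value
        for name, value in record.items()
        if name in LOCAL_TO_DC
    }


def search(records: List[Dict[str, str]], dc_field: str, term: str) -> List[Dict[str, str]]:
    """Search already-mapped records from any system by a common field name."""
    term = term.lower()
    return [r for r in records if term in r.get(dc_field, "").lower()]


# An invented local record. "ShelfNumber" has no common equivalent.
local = {
    "Author": "Jane Doe",
    "Abstract": "Ocean temperature profile",
    "ShelfNumber": "A-12",
}
mapped = to_dublin_core(local)
print(mapped)
print(search([mapped], "Description", "ocean"))
```

Anything without an agreed mapping, such as the shelf number here, is invisible to the other system. That is the price of a common mapping.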

An intermediate approach is to use agents to create the meta-data using free-text access by the agents. This is also called automatic creation of meta-data. In reality, it is a meta-data system, not a free-text system, as a predefined set of fields is populated.

Meta-data systems

What is a "meta-data system"? It is the combination of fields, definitions, data formats, structure(s), binding, rules and controls. It may also include a method for making information about these components public. The purpose of the Dublin Core system is to support wide area searching with a small set of commonly used fields. The concept is to limit the number of fields so that finding resources can be supported across many systems. The Dublin Core system defines 15 fields:

TITLE
CREATOR
SUBJECT
DESCRIPTION
PUBLISHER
CONTRIBUTOR
DATE
TYPE
FORMAT
IDENTIFIER
SOURCE
LANGUAGE
RELATION
COVERAGE
RIGHTS

Data formats are not defined. There is a standard set of definitions for Dublin Core elements. These 15 fields have been found by many organizations to lack specificity, so qualifiers are sometimes added to the basic fields. For example, a "classification" qualifier can be added to DC Subject:

DC.Subject.Classification = 301.12
DC.Subject.Classification = SfB Uhj
DC.Subject.Classification = 301:624(England)

Not every query system will understand a particular set of qualifiers. A basic rule of Dublin Core is called "dumbing down". This means that any field with a qualifier must be useful if the qualifier is removed. This makes the system robust, but limits its expressiveness. There are currently groups working on various qualifier sets.
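
Dumbing down is mechanical enough to sketch in a few lines of Python. The function names dumb_down and dumb_down_record are my own, as is the representation of a record as a list of (name, value) pairs, which I chose because a field may repeat. The date value in the sample is invented.

```python
from typing import List, Tuple

# The 15 basic Dublin Core fields listed above.
BASE_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def dumb_down(field_name: str) -> str:
    """Reduce a qualified name such as 'DC.Subject.Classification' to 'DC.Subject'."""
    parts = field_name.split(".")
    if len(parts) < 2 or parts[0] != "DC" or parts[1] not in BASE_ELEMENTS:
        raise ValueError(f"not a Dublin Core field: {field_name!r}")
    # Keep the prefix and the basic field, discard any qualifiers.
    return ".".join(parts[:2])


def dumb_down_record(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip the qualifiers from every field in a record, keeping the values."""
    return [(dumb_down(name), value) for name, value in pairs]


record = [
    ("DC.Subject.Classification", "301.12"),
    ("DC.Date.Publish", "1999-05-01"),  # invented value
    ("DC.Creator", "Jane Doe"),
]
for name, value in dumb_down_record(record):
    print(f"{name} = {value}")
```

The first line of output, DC.Subject = 301.12, shows the limit of the rule: the value survives, but a query system that does not know the qualifier can no longer tell which classification scheme the number came from.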

The IMS and IEEE LTSC meta-data systems (they are very closely related) use structured data, well supported by XML (see above). Someplace there must be a registry that contains the description of the meta-data structure. Several related structures may be managed within a single registry. Multiple registries may be interrelated. A registry may provide access to, and services for, one or more specific taxonomies or vocabularies to be used at specific nodes within a structure. The registry manages this information and makes it available. This is part of a meta-data system.

A "repository" is where the actual meta-data (and sometimes the resources) are stored.

Many of the terms in this guide are defined in the Glossary.

Author:

Thomas D. Wason, Ph.D. (aka Dr. Tom)
wason@mindspring.com
http://www.tomwason.com
+1 919.602.6370