Metadata, as a form of data affixed to other data, is indispensable to data science and the interconnected domain of data analytics. Metadata describes data, provides context, and is vital for accurate data interpretation and use by both humans and machines. Given this dependency, it is logical to conclude that metadata innovation ought to have progressed in tandem with advances in big data and data science. To this end, leading data science journals and conferences have been increasing coverage of metadata research and development (R&D). Examples include
Research literature and notable reports reveal the metadata research lag in data science. For example, Smith et al. (2014) call for metadata research, emphasizing that “current big data ecosystems lack a principled approach to metadata management.” Another clear example is the US report entitled
This article considers these questions as steps toward advancing the synergy between metadata and data science. The following section describes metadata and data science, followed by discussion of two intersecting factors that challenge metadata research in the digital ecosystem. Next, the paper introduces the concepts
Exploring the interconnection between metadata and data science requires a review of these two concepts.
Metadata has been loosely defined and popularized as data about data, information about information. More comprehensive definitions address metadata as structured data supporting functions associated with an
To understand the full extent of metadata, it is important to recognize that the adjectival label “metadata” is not always used when, in fact, the data of interest has a
Beyond labeling and categorization, metadata can more universally be thought of as value-added language that serves as an integrated layer in an information system. When appropriately placed and accessible, by human or machine, metadata language eloquently enables the interplay between an object, such as data, and the desired activity, such as discovery, access, provenance tracking, calculation, or other directives. To understand metadata research opportunities in data science, it is useful to also review the meaning of data science, as follows.
Data science is an interdisciplinary field that targets studying and leveraging data to gain insights. A data science undertaking may enable one to predict a phenomenon or automate decision-making. The Data Science Association defines data science as the “scientific study of the creation, validation and transformation of data to create meaning” (Data Science Association, 2017). Data science draws upon the full range of data (small, big, static, structured, unstructured, or streaming), and applies scientific and statistical methodologies to learn from data (van der Aalst, 2016).
Data science has many aspects, and the collection of definitions reveals different emphases. For example, Dhar (2013) focuses on the predictive capabilities of data, emphasizing application of statistical methods. Stanton (2012) offers a broader definition, explaining that data science encompasses a full range of activities, including the “collection, preparation, analysis, visualization, management, and preservation of large collections of information.” The unifying factor across various definitions is the “science” that comprises defining appropriate questions, selecting and obtaining suitable data, and applying the correct, and at times innovative, modeling and statistical methods.
The “science” of data science indicates a methodological and systematic approach to leveraging data as part of studying a problem or a phenomenon. Data science endeavors rely not only on data but also on accurate description of the data, hence metadata. Given the reliance on metadata, one would anticipate appropriate support for, and recognition of, the value of research addressing metadata processes, applications, and societal impacts. Unfortunately, there are a number of key impediments to understanding the scientific merit of metadata research. These impediments are reviewed below in the context of challenges to metadata R&D.
Information and library science, computer science, and a number of disciplinary domains (e.g. biology, medicine, materials science, and geography to name a few) support a generally tightly-knit, robust metadata community through interest groups within a larger association, several targeted conferences, and focused publications, such as the
Metadata is generally viewed as a practical application relating to cataloging, indexing, database development, and the recording of digital transactions. This point is underscored in “Metadata in Everyday Life,” the first section in NISO National Information Standards Organization (
National Information Standards Organization (
Another utilitarian aspect affecting perceptions of metadata stems from the pressing need for metadata to accommodate the exponential growth of data and the larger digital ecosystem, which limits resources (time, personnel, and finances) that could otherwise be allocated toward deeper metadata research analyses and theoretical development (Greenberg, 2009). As noted above, there is a robust metadata research community; however, the pragmatic strength and necessities of metadata have very likely impeded development of a more rigorous metadata agenda in data science.
Metadata carries baggage similar to that of cataloging (Coleman, 2005; Tennant, 2002). Specific criticisms address the Semantic Web, with claims that ontologies cannot support automatic reasoning (Shirky, 2005), that the mark-up is excessive (Manian, 2011), and that the goals underlying linked data are unrealistic. There is also the concept of “metacrap,” coined by Doctorow (2001), referring to the impossibility of “exhaustive, reliable metadata” due to “insurmountable obstacles,” and proclamations that automated methods will take over, obviating the need to investigate metadata (Dimitrova, 2004). The metadata community has internal critics as well, as demonstrated by Beall’s “
Traditional perceptions are further reflected in differing opinions about metadata and what constitutes a science. An illustrative example is found in “There Is No Science of Data,” a discussion on
Few (the blog author) has over 30 years of experience in business intelligence and information design, and his viewpoint clearly illustrates that many simply equate metadata with the nuts-and-bolts of an information system, rather than a research-worthy topic. Few continues his blog discussion by observing that information science and data science are also misnomers, even though a discussion contributor named Konrad shares, “Actually there is a whole academic discipline dedicated to the study of information…” (Konrad, January 24, 2017, at 1:11 am). Further, this participant references the Wikipedia entry for information science, which is substantive, with credible references confirming the existence of the discipline. Although continued discussion of what is a science extends beyond the scope of this
Overall, the discussion above provides insight into why metadata research faces impediments in data science, and other disciplines. Nevertheless, the value of metadata cannot be denied. In fact, the significance of metadata became a mainstream media topic with Edward Snowden’s whistleblowing on the US government’s surveillance of personal phone record metadata, without individual consent or knowledge of this activity (Greenwald, 2013). In advocating for greater attention to metadata research, the following section presents three concepts to foster dialogue about metadata, and help provide a framework for metadata-focused research in data science.
Every domain has its
The data science enterprise has been motivated by the availability of massive amounts of digital data and new capacities for data-driven solutions. These ideas are central to the “fourth paradigm,” a dimension coined by Microsoft Research visionary Jim Gray, and captured by Hey, Tansley, and Tolle (2009), to explain the growing, unprecedented opportunity for data-driven science. Metadata is a vital component of the fourth paradigm, although the significance of metadata is often overlooked or only noted in a limited way. Metadata can garner new research attention if it is understood as
Big metadata is both a first-class object and an auxiliary associated with the wide, seemingly countless variety of data formats, types, and genres. Simon’s piece, “
Beyond an association with big data diversity and size,
Big metadata is defined below in Table 1 by the
Table 1. The five Vs of big metadata.
| Five Vs | Definition |
|---|---|
| Volume | The quantity and usefulness of metadata generated daily confirm the existence of big metadata. At times metadata is less than or equal to the extent of the data it describes in size (bytes). At other times the metadata exceeds the data being described or tracked, due to the complexity of the data lifecycle activity. Linked data offers an example, with metadata renderings that can be larger than the volume of data object(s) being represented. Like big data, not all big metadata is useful, and a challenge is to identify the big metadata that is useful for data science and analytic endeavors. |
| Velocity | Metadata is generated via automatic processes at immense speed, correlating with the rate of digital transactions. For example, searching Google, answering an email, purchasing an item online, and day-to-day office activities such as word processing all log data, as well as associated metadata. |
| Variety | Metadata reflects the wide variety of data formats, types, and genres along with the extensive range of data and metadata lifecycles. In addition, the different types of metadata (e.g. discovery, technical, preservation, etc.) as well as unique domain-specific metadata requirements intensify the variety. |
| Variability | There is an unmistakable unevenness of metadata across the digital ecosystem. Lack of uniformity is extensive for data descriptions across different domains, systems, and processes. This unevenness can even be profound within domains, given economic factors supporting metadata generation, competing standards, or, simply, differing adoption policies. For example, two organizations may use the same metadata standard, but have different implementation practices. Even when standardization is imposed, an organization, process, and human activity can contribute to inconsistencies. |
| Value | Metadata, as the |
Singh (2013) identified data as the new black gold on Wired.com.
Table 1 draws from the commonly applied 5Vs (Marr, 2014), although other big data frameworks with nuanced or even different criteria likely apply to big metadata. Clearly, data science is not limited to big data; however, exploring the framework above is warranted inasmuch as it helps define big metadata and identify research pathways. Smart metadata, discussed in the following section, offers another fresh insight into metadata in the area of data science.
Metadata is inherently smart data because it provides context and meaning for data. One of the earliest uses of “smart metadata” was for a special session entitled “Smart Metadata” at the 2003 Dublin Core Conference, Seattle, Washington (DCMI, 2003). Themes in this special session included interoperable metadata, Semantic Web support, accessibility, and ontologies. Around the same time, van Hemel et al. (2003) promoted the idea of smart metadata in reference to the Semantic Web and the use of the Resource Description Framework (RDF) for topic maps. In 2007, Kogen, Miller, and Schobbe (2007) of the Microsoft Corporation used the term smart metadata as part of a patent description for a technique supporting metadata field management in a taxonomy system. Since that time, there does not appear to be a clear path for using the term “smart metadata” although research and discussions acknowledge metadata as a value-added factor supporting smart search, and as an enabler or characteristic of the Semantic Web and linked data (e.g. Fatima, Luca, & Wilson, 2014; Oh, Yi, & Jang, 2015). Zeng underscores this point in her work on smart data in the humanities, specifically in a recent discussion segment entitled “
A related aspect of smart metadata is the alignment with smart technology, including smart, mobile devices, and appliances. Examples include
Smart metadata has received attention within smart technology research. For example, Abbasi, Vassilopoulou, and Stergioulas (2017) used the phrase “smart metadata” to identify research directions and new tools supporting better use of digital media and the larger IoT. Contractor et al. (2015) refer to smart metadata in their analysis of the Learning Content Hub, a content management system supporting automatic metadata assignment, and the use of analytics to build customized educational applications. Similarly, researchers identify smart metadata as part of their design for a personalized recommendation engine for TV programs (Thyagaraju & Kulkarni, 2011). In all these cases, metadata is smart in that it enables an action that draws on the data being represented or tracked. The action depends on good quality metadata that is accessible, preserved over time, and trusted. These ideas translate into the principles presented in Figure 2, forming a smart metadata matrix.
The smart metadata principles defined here qualify metadata as value-added data. The next section of this paper explores value relating to metadata more thoroughly through the concept of metadata capital.
“Metadata capital” is a concept that emerged through research on data and metadata reuse in the Dryad data repository (Greenberg, Swauger, & Feinstein, 2013). Capital, broadly speaking, is understood as an asset with value; and the value may be financial, intellectual, social, or defined in other ways. Capital is most commonly associated with finance and wealth, and draws from work such as Adam Smith’s
It is important to point out that cost and value are not always aligned; this is because a product can cost more than it is worth, or be assigned a price that is below its worth. Even so, financial cost can be calculated. The metadata capital work postulates that when a purchased item is reused, over time, it is worth more than its original cost. Analogies to consider include a top-end stainless steel pot that is used over and over, without any change, and always supporting cooking to perfection; or an antique chest that has been passed down generation after generation, and is used to store sweaters in the summer, while also serving as a piece of furniture, becoming more valuable with age.
As stated above, capital, wealth, and value do not solely apply to financial matters, despite the fact that much of the big data and data science coverage is associated with economic incentive and opportunity. The broader interpretation of capital extends to knowledge (intellectual capital), and friendships—personal and professional relationships (social capital), as well as other areas, including some still likely to be discovered. Drawing on this broader context of value, a formalized definition of “metadata capital” is as follows, which was originally published in the
An asset that contains contextual knowledge about content.
Content is the data or information contained in any information object (any “entity, form, or mode”).
Context is who, what, where, when, how, why, etc., which can be captured via metadata attributes (Kunze, 2001).
A product or service generated by human labor and/or machine-driven processes with value that increases over time or that enables the value increase of other assets.
A good (a service facilitator) supporting a range of functions such as discovery, provenance tracking, rights management, authentication, preservation and other functions associated with lifecycle management and access.
A public good if the product (metadata) is open, following which the services can be open.
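As a minimal sketch of how the context attributes named above (who, what, where, when) might be captured as metadata, the following Python example defines a small record type. The class name `ContextRecord`, the helper `describe`, and all field values are hypothetical and chosen only for illustration; they are not drawn from Kunze (2001) or from any particular metadata standard.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ContextRecord:
    """Hypothetical context attributes describing one information object."""
    who: str    # agent responsible for the content
    what: str   # short description of the content
    where: str  # location associated with the content
    when: str   # date associated with the content


def describe(rec: ContextRecord) -> str:
    """Render the record as a single 'key=value' line, in field order."""
    return "; ".join(f"{key}={value}" for key, value in asdict(rec).items())


# Illustrative values only; not taken from any real dataset.
record = ContextRecord(
    who="Example Lab",
    what="Soil moisture readings",
    where="Field station A",
    when="2016-05-01",
)

print(describe(record))
```

Making the record immutable (`frozen=True`) is one possible design choice here, reflecting the idea that a description reused over time should not change silently.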
Metadata capital is defined as an asset, a product, a good, including a public good, which enables gain through knowledge, access, and services. Under this broader interpretation, metadata capital aligns with the promise of big data, given the unprecedented opportunity to address real-world problems in energy, health, and the environment (Greenberg & Garoufallou, 2013). Metadata is essential for using data to compare new energy approaches; track the progression and decline of a health crisis, such as the Ebola virus; or study climate change.
The biggest challenge with metadata valuation in this broader spectrum, and even with financial aspects, is the formidable task of substantiating value. In pursuing metadata capital as a financial topic, costs can be identified, or at least estimated, by adding system expenses, staff salaries distributed by hours dedicated to metadata tasks, and other incurred costs. However, determining exactly where to begin measuring cost is not an easy task. Does cost start with the metadata system design, the salary of the person who had the idea to build the metadata system, the person or team that implemented workflow design, or the cost of the code library that allowed the system to be built? Assessing social and intellectual value is even more daunting. How can we determine long-term consequences for metadata created today that allows for a major health discovery five or ten years from now?
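The financial estimate described above can be sketched as simple arithmetic. The Python example below sums system expenses, staff salaries distributed by hours dedicated to metadata tasks, and other incurred costs, and then spreads that total across reuses. The function names, the per-reuse calculation, and every figure are assumptions made for illustration; the example does not resolve where cost measurement should begin, and it says nothing about social or intellectual value.

```python
def estimate_metadata_cost(system_expenses, staff, other_costs=0.0):
    """Estimate total metadata cost.

    staff is a list of (hourly_rate, hours_on_metadata_tasks) pairs,
    so salary is counted only for time dedicated to metadata work.
    """
    labor = sum(rate * hours for rate, hours in staff)
    return system_expenses + labor + other_costs


def cost_per_use(total_cost, reuse_count):
    """Spread the total cost over the number of uses (at least one)."""
    return total_cost / max(reuse_count, 1)


# Hypothetical figures: one system expense, two staff members, other costs.
total = estimate_metadata_cost(
    system_expenses=1000.0,
    staff=[(30.0, 10), (50.0, 4)],
    other_costs=100.0,
)

print(total)                   # total estimated cost
print(cost_per_use(total, 8))  # cost attributed to each of 8 uses
```

Under these assumptions, the cost attributed to each use falls as reuse grows, which is one simple way to express the reuse argument in the metadata capital work.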
There are more questions than answers in pinpointing or even approximating the value of metadata; it is a predicament that underscores a significant challenge and invites research. Metadata capital requires further study, including drawing upon valuation and appraisal frameworks from other disciplines. What frameworks exist for measuring value across the domains of energy, health, and the environment? How do people assess the value of knowledge, personal friendships, and professional contacts? Although there is no single answer, drawing upon valuation research from other domains can help chart metadata research directions, and, in the future, demonstrate the value of metadata entrenched in data science.
Metadata, while applauded by many, has not been vigorously pursued as a research topic in data science, compared to statistical modeling, algorithm testing, data mining, and visualization. To be clear, there is metadata research; however, metadata-focused scientific and scholarly output in data science venues has not kept pace with these other topics. Articulating a problem is one of the first steps to addressing a challenge. This paper took initial steps toward addressing this challenge in the following ways:
The immediate contribution of this work is, simply, that it may elicit response, critique, or revision. A more impactful contribution is that this work may motivate more researchers to consider the significance of metadata as a topic worthy of research within data science and the larger digital ecosystem. In a recent discussion, my colleague at Drexel University, Dr. Rosina Weber, asked me, “Can you imagine data science without metadata?” I cannot think of a statement more profound than this to motivate next steps. This question needs to be considered by anyone who applauds or dismisses the value of metadata.
Data science cannot progress without metadata research; and while an extensive range of metadata topics are important, researchers need to ask: