Introduction

Vocabulary systems in the form of controlled vocabularies, thesauri, and ontologies are used to index resources stored in information systems and aid in information retrieval, discovery, and access. These types of vocabulary systems generally structure and encode the relationships among terms. Structuring processes frequently adhere to standards, such as the ANSI/NISO Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies (NISO, 2010). Structured vocabularies following this standard encode hierarchical, associative, and equivalent relationships among terms. Over nearly the last two decades, the World Wide Web Consortium (W3C) has introduced encoding standards, such as the Simple Knowledge Organization System (SKOS) (Miles et al., 2005) and the Web Ontology Language (OWL) (Antoniou & van Harmelen, 2004), providing new opportunities for structured vocabularies (NISO, 2010) across the Web. These standards build on eXtensible Mark-up Language (XML) and integrate the Resource Description Framework (RDF) model, making vocabularies machine processable and enabling support for computational approaches. These innovations support linked data (Heath & Bizer, 2011) and the goals of the Semantic Web (Shadbolt, Berners-Lee, & Hall, 2006) by providing the necessary infrastructure for connecting information resources across the Web. Linked data capabilities have excelled in selected areas of medicine and biology; however, progress specific to consumer health seems limited, which is surprising, as millions of individuals turn to digital media for health information on a daily basis looking for genuine answers and assistance.

Consumers seeking health information face the full range of information across the Web, from consumer-oriented to scientific, as well as commercial information that has not necessarily been vetted. The realization of a consumer health vocabulary has historical roots in public libraries and early digital libraries (Rubenstein, 2012); and, in 2007, researchers at the Brigham and Women's Hospital and National Library of Medicine joined together to develop the first-generation Consumer Health Vocabulary (CHV) (Zeng, 2007). Sponsored by the National Institutes of Health-National Library of Medicine (NIH-NLM), the initial CHV vocabulary was developed as a flat file (lacking hierarchies) and maintained over close to a four-year period. In 2004, drawing on Semantic Web (Heath & Bizer, 2011) principles and infrastructure, an interconnected group of researchers developed the Open Consumer Health Vocabulary (OCHV), which was recently (2016) registered in the National Center for Biomedical Ontology's BioPortal (Salvadores et al., 2013). These two initiatives stand as important contributions to the consumer health ontology space, although, unfortunately, they have been discontinued, chiefly due to funding limitations and changes in research priorities.

Today, given the promise of the Big Data rush and AI, it seems more pressing than ever to advance the original CHV ontology. The research presented in this paper takes a step toward addressing this need and developing a best practice for extending CHV. The research is motivated by the importance of consumer health to society and the value of ontologies for both users and computational activities. The following sections cover background and data sources, a review of relevant prior research in this area, and the goals and objectives of the research discussed in this study. The paper also reviews the research methods and procedures, including the incorporation of the Helping Interdisciplinary Vocabulary Engineering (HIVE) (Greenberg et al., 2011) technology, and presents the results and contextual discussion. Finally, the conclusion restates key outcomes and considers next steps.

Background and data sources
Consumer health vocabulary

The consumer health vocabulary (CHV) is a controlled vocabulary containing a mapping between consumer terms and UMLS medical terms. CHV was initially proposed in an expert panel at MEDINFO in 2001 (IOS Press Ebooks, 2001). In the early 1990s, with the advancement of the Web, the imbalance of domain knowledge between professionals and laypersons gradually shifted as consumers across many societal sectors increasingly gained access to information online. A 2008 study based on the popular online health community (OHC) site PatientsLikeMe found that 43% of the patient-described symptoms have a match in the Unified Medical Language System (UMLS) Metathesaurus (Smith & Wicks, 2008). To help consumers identify all symptoms by matching them with UMLS terms, there was a need to bridge the gap between consumer and medical terminology. CHV was proposed as a solution to bridge this gap.

Over the years since the initial development of CHV, there have been a few efforts to expand the vocabulary. One such effort by Zeng et al. (2007) used human review as well as logistic regression and other formulas to expand the vocabulary. Another study by Doing-Harris and Zeng-Treitler (2011) proposed an expansion of CHV using social network mining.

Consumer terms come from a variety of sources, including social media, discussion boards, and patient diaries. Given this heterogeneity of consumer terms, one can anticipate that multiple consumer terms map to the same preferred UMLS term. As part of background exploratory work for the research reported on below, we found there are about 58,000 UMLS terms that are mapped to over 150,000 consumer terms. This means that each UMLS term has, on average, about three consumer terms associated with it. The Medical Subject Headings (MeSH) thesaurus is another ontology that is a part of UMLS and offers possibilities for enhancing CHV. The next section discusses MeSH.

MeSH

The Medical Subject Headings (MeSH) thesaurus is another ontology system, often referred to as a controlled vocabulary, which is published by the National Library of Medicine. The main goal of MeSH is to organize all articles and journals published in the life sciences. In particular, all MEDLINE/PubMed articles are indexed using the MeSH controlled vocabulary. MeSH organizes all medical terminology into a tree structure. There are 16 initial categories, which are then divided further into subcategories. The thesaurus has up to 13 levels of subcategories. The most compelling case for the research presented here is that, in addition to organizing terms in categories, both MeSH and CHV are a part of the collection of UMLS vocabularies. As such, they share a unique identifier called the Concept Unique Identifier (CUI).

The shared CUI provides an important system feature that could be leveraged in pursuing ontological fusion. Specific to the goal of enhancing CHV, the fact that the two ontologies share a unique identifier provides a pathway for automatically enhancing the consumer health vocabulary using MeSH. Pursuing this step will allow researchers to capture health consumers’ ever-evolving preferences in a machine-readable format, which could further be utilized by algorithms to improve the quality of information delivered in response to consumers’ health-related queries.

Related work
Interoperability of ontologies

As we aim to extend and enhance CHV using another ontology, we look to other research on achieving and improving the interoperability among ontologies. There are a number of projects aimed at linking multiple ontologies to improve our ability to index and retrieve data. Zeng (2019) presents an insightful account detailing key approaches for merging two or more ontologies, and this work serves as a guide for achieving interoperability. Research by Isaac et al. (2009), focusing on library data, also details the merging of ontologies using different techniques and provides insight into various metrics that help determine whether terms can be matched. Francesconi et al. (2008) also discuss a method for merging ontologies; however, their research focuses on combining five ontologies that are of importance in the EU. Chan and Zheng (2002) discuss matching ontologies using different algorithms and evaluation criteria. Background research also points to a study by Slater, Gkoutos, and Hoehndorf (2020) and their method for combining biomedical ontologies into a single meta-ontology. This piece addresses a range of challenges, given the different semantic levels and relationships found in ontologies. The research reviewed here helped us to gain insight into various approaches and challenges, and provided grounding for how we might enhance the CHV.

Expanding CHV

As part of our review, we also examined research projects aimed at expanding CHV. It seems that previous projects mainly focused on expanding the ontology using unstructured text. One example is the study by Jiang and Yang (2015) that enhanced CHV with new terms by mining online health communities and determined whether terms should be added to the ontology using co-occurrence metrics. Another example is a study by He et al. (2017) exploring mining a social Q&A site to find candidate terms for expanding CHV. This stands as important work, although its results have not been incorporated into the publicly accessible CHV. Overall, reporting on expanding the CHV and evaluating the outcomes is limited. Our aim is to expand the CHV, and, as part of our approach, we seek to improve the interoperability of CHV using structured data from a well-maintained ontology.

Research goals and objectives

The overriding goal of this research is to extend and enhance the original CHV. Many new medical conditions have emerged since the last CHV update. Examples such as illnesses associated with opioid addiction and vaping have consumer terms associated with them; however, neither the conditions nor their associated consumer terms are currently captured in CHV. Another chief limitation to address is that the current CHV ontology is a flat terminology. Relationships among terms are central to ontology construction and to the ability of ontologies to support computational activity and machine learning. To address this limitation, we took steps to enhance CHV with parent terms and additional associations, and to capture relationships useful to consumers in learning about an illness as well as to future data-driven activity supporting machine learning. These enhancements will help us overcome the challenges associated with CHV, particularly having multiple term variants at the same level. Additionally, adding parent terms, as part of a hierarchy, will help us identify where to insert new terms more easily. Our key objective in this research is to convert CHV to a hierarchical ontology using terms from MeSH. This research will outline steps to perform this process automatically. By adding a hierarchical structure to CHV, we will enable consumers to obtain more relevant results when searching for terms using this ontology. Our plan is to eliminate redundant and irrelevant results as well as improve the discovery of terms using their relationships.

Methodology
Methods

We present our approach for automatically extending and enhancing CHV using identifier mapping and ontological fusion. To achieve these goals, we performed the following steps:

Implementation of a hierarchical structure. We enhanced CHV by converting the controlled vocabulary from a flat list to a hierarchical structure. The CHV ontology is a mapping from consumer terms to MeSH terms integrated within the UMLS. We selected MeSH as the source of the hierarchy because MeSH and CHV share an identifier, because its language is understandable by the general consumer, and because it offers a hierarchical structure. We then performed a join using the UMLS terms that both ontologies share.

HIVE (Helping Interdisciplinary Vocabulary Engineering) integration/development. We worked with the HIVE application to view and study the enhanced CHV, with the longer-term goal of enabling automatic assignment of subject metadata with CHV using HIVE. HIVE is both a model and a system that supports automatic metadata generation by drawing descriptors from multiple Simple Knowledge Organization System (SKOS)-encoded controlled vocabularies (Greenberg et al., 2011). This goal was accomplished by creating new unique identifiers that mapped to multiple alternate terms. In developing identifiers, our convention was to use the URI associated with the institution of the authors and assign a sequential integer identifier to each member of the vocabulary. The result is a significant reduction of the number of terms returned in HIVE's search results by removing all alternate terms from the search results.
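
The two steps above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the CHV and MeSH exports have already been loaded as lists of dictionaries; the field names, the `http://example.org/cchv/` namespace, and the CUI values are hypothetical placeholders. The identifiers are zero-padded to eight digits on the assumption that this is how the sequential integers are written into the URI.

```python
BASE_URI = "http://example.org/cchv/"  # hypothetical namespace, not the authors' institutional URI

# Illustrative rows only; the CUI values are placeholders, not real UMLS identifiers.
chv_rows = [
    {"consumer_term": "bunions", "umls_term": "bunion", "cui": "C0000001"},
    {"consumer_term": "hallux valgus", "umls_term": "bunion", "cui": "C0000001"},
    {"consumer_term": "liver failure", "umls_term": "liver failure", "cui": "C0000002"},
]
mesh_rows = [
    {"cui": "C0000001", "heading": "Bunion"},
]


def join_on_cui(chv_rows, mesh_rows):
    """Left-join CHV rows to MeSH rows on the shared CUI field."""
    mesh_by_cui = {row["cui"]: row["heading"] for row in mesh_rows}
    return [
        {**row, "mesh_heading": mesh_by_cui.get(row["cui"])}
        for row in chv_rows
    ]


def mint_uris(terms, base_uri=BASE_URI, start=1):
    """Assign each distinct term a URI ending in a zero-padded sequential integer."""
    distinct = list(dict.fromkeys(terms))  # drop duplicates, keep first-seen order
    return {term: f"{base_uri}{n:08d}" for n, term in enumerate(distinct, start=start)}


joined = join_on_cui(chv_rows, mesh_rows)
uris = mint_uris(row["consumer_term"] for row in joined)
```

A left join is used here so that CHV terms without a MeSH counterpart are kept with an empty `mesh_heading` rather than being dropped.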

Our methodology is described in Figure 1. We enhance and improve CHV and rename it the Combined Consumer Health Vocabulary (CCHV).

Figure 1

Diagram of methodology for producing the Combined Consumer Health Vocabulary.

Procedures

The original CHV ontology contains 158,519 terms of two types: UMLS terms, which have a CUI directly associated with them, and consumer terms, which do not. Each consumer term is mapped to a UMLS term and is therefore associated with a CUI only indirectly. There are 57,819 unique UMLS terms in the CHV ontology. As Figure 1 shows, we started by joining our CHV and UMLS data to the MeSH terms using the common CUI field. Because the consumer terms are mapped to UMLS terms but have no unique identifier of their own, the next step in processing this vocabulary was to assign a unique identifier to each consumer term in the ontology. This differentiates the consumer terms from the UMLS terms, as they are represented as independent terms with their own identifiers. The unique identifiers were generated by assigning each term a URI that ends in an eight-digit unique identifier. We connected the consumer terms and the UMLS terms in the ontology with the alternate relationship. This is because in CHV, the terms are connected with both an alternate and a related relationship. Additionally, terms are connected with an alternate relationship in MeSH (they are defined as synonyms), so using an alternate relationship helps maintain consistency in the new ontology. Using the UMLS terms in CHV, we identified all terms in MeSH that were parents of those UMLS terms and added these terms and their entire hierarchy. These parent terms were connected as parents to both the UMLS terms and the consumer terms in CHV. Additionally, relationships between existing terms were retrieved from MeSH. This step is described in Figure 1 as the extraction of the hierarchical structure and led to additional alternate relationships in the ontology. We added these terms and relationships to CHV, transforming it from a flat mapping to a hierarchy with multiple connections. The new ontology is named CCHV.
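
The extraction of the hierarchical structure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure as implemented: the `parents` mapping is a hypothetical stand-in for the parent links retrieved from MeSH, and the sample hierarchy is illustrative rather than a copy of the actual MeSH tree.

```python
# Hypothetical child -> parents mapping standing in for links retrieved from MeSH.
# A list is used because a MeSH term may have more than one parent.
parents = {
    "bunion": ["acquired foot deformities"],
    "acquired foot deformities": ["foot deformities"],
}


def ancestors(term, parents):
    """Return every ancestor of `term`, nearest first, following parent links."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(parents.get(parent, []))
    return seen


def broader_pairs(umls_term, consumer_terms, parents):
    """Build (narrower, broader) pairs for a UMLS term and its consumer terms."""
    pairs = set()
    # Direct parents are attached to the UMLS term and to each consumer term.
    for parent in parents.get(umls_term, []):
        pairs.add((umls_term, parent))
        for consumer_term in consumer_terms:
            pairs.add((consumer_term, parent))
    # The rest of the hierarchy above the term is carried over as well.
    for ancestor in ancestors(umls_term, parents):
        for grandparent in parents.get(ancestor, []):
            pairs.add((ancestor, grandparent))
    return pairs


pairs = broader_pairs("bunion", ["bunions"], parents)
```

Each pair can be read in both directions, which is consistent with the broader and narrower counts being equal in the resulting vocabulary.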
The final step in this procedure is to enhance HIVE. HIVE was enhanced to store all alternate terms as concepts. This means that we could use these relationships to discover more information in our data. Furthermore, HIVE has been modified to return only preferred terms when performing a search. This means a more streamlined result when performing a search in HIVE. Rather than returning all terms that contain the search query, we now return only the preferred terms for any term returned by the search query.
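
The change in search behaviour can be sketched in a few lines. This is a minimal illustration that assumes substring matching over labels and a hypothetical `preferred_of` lookup from every label (preferred or alternate) to its preferred term; HIVE's actual search code is not shown in this paper and may differ.

```python
# Hypothetical lookup: every label maps to the preferred term of its concept.
preferred_of = {
    "bunion": "bunion",
    "bunions": "bunion",
    "hallux valgus": "bunion",
    "tailor's bunion": "tailor's bunion",
}


def search_all_labels(query, preferred_of):
    """Earlier behaviour: return every label that contains the query."""
    q = query.lower()
    return [label for label in preferred_of if q in label.lower()]


def search_preferred(query, preferred_of):
    """Later behaviour: return each matching concept once, by its preferred term."""
    q = query.lower()
    results = []
    for label, preferred in preferred_of.items():
        if q in label.lower() and preferred not in results:
            results.append(preferred)
    return results


before = search_all_labels("bunion", preferred_of)
after = search_preferred("bunion", preferred_of)
```

With this sample data, a query for an alternate such as hallux valgus resolves to the single preferred term bunion, while a query for bunion returns fewer entries than the label-level search.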

Results

Transitioning to CCHV means that we can now query the vocabulary and retrieve results that contain more information about each term. In particular, we can learn more about the parent, child, and related terms for each search result. Additionally, fewer irrelevant results are retrieved from a query since the search result contains only preferred terms. For example, while bunion and bunions both exist in the vocabulary as consumer terms, searching for either one will return bunion, since bunion is the preferred term. Before enhancing CHV, we had no broader or narrower relationships. After transitioning to CCHV, a total of 8,269 MeSH parent terms were added, and some parent-child relationships were uncovered between terms that already existed in the vocabulary. Furthermore, the number of broader-narrower relationships among terms increased from 0 to 55,571. The original CHV also contained 30,060 related relationships and 51,995 alternate relationships. The alternate relationships were retained while the related relationships were enhanced. By adding the related relationships from MeSH, we now have 324,545 related term relationships. These new relationships allow users to discover more information by clicking through the concepts. We also use the relationships in querying and indexing. The change in semantic relationship counts from the original CHV to the enhanced CHV, called CCHV, is summarized in Table 1.

Table 1. Semantic relationship counts in the original CHV and the new CCHV.

| Vocabulary | Alternate term | Related term | Broader term | Narrower term |
|---|---|---|---|---|
| Original CHV | 51,995 | 30,060 | 0 | 0 |
| CCHV | 51,995 | 325,545 | 55,571 | 55,571 |

The change in the data is demonstrated in the examples in Figures 2 and 3. Figure 2 is a screenshot of the term bunion from CHV in HIVE before adding the additional information from MeSH. This term is assigned the CUI from UMLS. Figure 3 shows CCHV which contains the additional MeSH terms. In Figure 3, we see the same bunion term from the original CHV but enhanced with the broader and narrower terms. We combined all the relationships to bunion from all alternates of bunion to create one combined bunion term. Since we introduced new relationships, we also assigned a new ID to this term.

Figure 2

Search results for the term bunion in the previous version of HIVE using CHV.

Figure 3

Search results for the term bunion in the improved version of HIVE using CCHV.

Adding terms and relationships from MeSH helps us discover relationships in our data and adds more context to the consumer terms. Since HIVE is a tool for indexing documents, uncovering more connections in the vocabulary used to index will mean more context on the terms in the document. In the example, we see that Tailor's bunion is a term that existed in CHV before the enhancement but now we know that it is a narrower term of bunion. However, the term foot deformities did not exist in the data and has been added from MeSH. Additionally, we have used the preferred label for each term so that searching for an alternate will return the preferred term. For example, in Figure 4, hallux valgus is an alternate term for bunion. Searching for hallux valgus returns the term bunion as a first result.

Figure 4

Search results for the term hallux valgus in the improved version of HIVE using CCHV.

Examining other search queries and comparing only the number of terms returned, we typically find a reduction in results. In Table 2, we see an improvement in results for the query liver failure. In the previous version of HIVE, the same term appears twice in the results, while in the new version of HIVE it appears only once.

Table 2. A comparison of search results for the term “liver failure” between the previous and the improved versions of HIVE.

| HIVE Results - Before Improvements | HIVE Results - After Improvements |
|---|---|
| Liver failure | Liver failure |
| Liver failure | |

In Table 3, we present the result of the query ovarian cysts. The new version of HIVE returns fewer terms. Some of the terms are not directly related to the query but are returned in our search because of an alternate. In this case, the term brca1 genes is returned even though it is not an exact match for ovarian cysts.

Table 3. A comparison of search results for the term “ovarian cysts” between the previous and the improved versions of HIVE and CHV.

| HIVE Results - Before Improvements | HIVE Results - After Improvements |
|---|---|
| cyst ovarian | brca1 genes |
| cysts ovarian | ovarian cysts |
| ovarian cyst | parovarian cyst |
| ovarian cystectomy | |
| ovarian cysts | |
| ovarian cysts | |
| parovarian cyst | |
| parovarian cyst | |
| parovarian cysts | |

Discussion

The results presented above demonstrate the improvements made in both CHV and HIVE. The new CCHV ontology transforms a flat mapping of consumer terms to UMLS terms into a hierarchical dataset that has a tree structure and contains both new terms from MeSH and new relationships between existing terms uncovered from MeSH. In many areas of research, a hierarchical ontology is considered an improvement on a flat mapping since it can convey more information through the relationships between the terms. One area of research where such an improvement is observed is protein function prediction using gene ontologies (Eisner et al., 2005). Another area that illustrates the benefits of a hierarchical ontology over a flat one is the classification of web search results (Singh & Nakata, 2005).

In addition to enhancing the structure of the ontology, our study has also added improvements to HIVE. HIVE has been enhanced to produce more relevant search results. HIVE now returns one search result per preferred term rather than returning all alternate terms containing the search query. To demonstrate this, we tested a sample of 30 search queries and saw a reduction in the number of results with the new version of HIVE. The mean number of results returned went down from 140.4 in the previous version of HIVE to 22.87 in the new version. Other research, such as the study by Kaisser, Hearst, and Lowe (2008), has shown that shorter search results are considered preferable. Therefore, a reduction in the number of terms returned in a query is a desirable result. The bunion search query highlights the improvements in our results. In previous versions of HIVE and CHV, there were no broader, narrower, or alternate terms connected to the bunion entry in CHV. After enhancing both HIVE and CHV, however, we added the relationships to the alternates hallux valgus, acquired hallux valgus, congenital hallux valgus, and bunions. We also added the relationships to the narrower terms Tailor's bunion and bunionette. This is illustrated in Figures 2–4. The related terms previously existed in our dataset, but the relationships between them were not retrieved in the HIVE search. We also added the terms foot deformities and acquired foot deformities from MeSH. Having these terms in our dataset allows us to connect the term bunion with other types of foot deformities. The additional relationships between the terms in the updated ontology allow for additional data discovery for consumers.
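
As a small worked check on the mean result counts reported above, the relative reduction can be computed directly. Only the two means come from the text; the variable names are illustrative.

```python
mean_before = 140.4   # mean results per query, previous version of HIVE
mean_after = 22.87    # mean results per query, new version of HIVE

# Fraction of results removed, relative to the previous version.
relative_reduction = (mean_before - mean_after) / mean_before
print(f"Relative reduction: {relative_reduction:.1%}")
```

On these figures, the new version returns about 84 percent fewer results per query on average across the 30 sampled queries.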

While we have reduced the number of results per search, occasionally, we will get a result that is not directly related to the query. For example, when searching for ovarian cysts, our search query produces less duplication and streamlines the result by combining all alternates into one preferred term. However, our query also returned brca1 genes. While both terms are related to women's health, the result is not synonymous with the query. This is a definite area for improvement. The algorithm should be improved to display only the most relevant results for each query. One possible method is by using an algorithm for synonym detection like the one developed by Turney (2001).

Conclusion

This research was performed with the aim of enhancing CHV and improving HIVE to produce better search results. The new CHV, named CCHV, contains more relationships between existing concepts. It also contains more types of relationships, since the previous version of CHV contained no broader or narrower terms. CHV has been transformed into a hierarchical structure by adding broader terms from MeSH. HIVE has been enhanced by improving query results when searching for medical terms. After enhancing CHV and HIVE, search queries now return fewer terms. The terms returned are only preferred terms, rather than multiple variations of the same term with slightly different spellings. We have consolidated all duplicate and very similar terms to produce a more streamlined result that is easier to sift through. We can still uncover the alternate terms by looking at the alternates of the query result. This research is a first step in enhancing CHV to include new terms that have been added to other medical ontologies since the last release of CHV.

Allowing users to search for consumer terms and obtain results of the preferred medical term can have great benefits to both consumers and medical practitioners (McCray et al., 1993). Using such a resource can help identify medical conditions or symptoms of a condition from discussions in messaging boards or improve results in search engines. By increasing the number of connections between terms, we allow for improved discovery of medical conditions and symptoms using consumer terms.

While this research significantly improves both CHV and HIVE, in performing the research we identify two main areas for further work. The first area is to devise an algorithm to identify alternates that are not synonymous with each other. We identified such an example with the relationship between ovarian cysts and brca1 genes. The two are not synonymous but related. However, when searching for one, the other was returned in our results. To further improve our results, we will devise a machine learning algorithm that will learn to identify the most appropriate relationship between terms and recommend whether the terms have a broader-narrower relationship, an alternate relationship, or a related relationship. This future research will further improve our search results. A second area of improvement is to perform the second stage of this analysis and enhance CHV with new terms that did not exist when CHV was last updated. There are many conditions that do not exist in the current version of CHV. One example is conditions related to vaping. There is currently no mention of these medical conditions in CHV and consumers and medical practitioners would both benefit from such enhancements to the vocabulary. The plan for CHV enhancement with new terms will also use a machine learning algorithm. Our plan is to scrape multiple message boards that contain questions about health and use natural language processing to identify relationships between consumer terms discussed in the message boards and UMLS terms.

eISSN: 2543-683X
Language: English
Publication frequency: 4 times a year
Journal subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining