Vocabulary systems in the form of controlled vocabularies, thesauri, and ontologies are used to index resources stored in information systems and aid in information retrieval, discovery, and access. These types of vocabulary systems generally structure and encode the relationships among terms. Structuring processes frequently adhere to standards, such as the ANSI/NISO Z39.19
Consumers seeking health information face the full range of information across the Web, from consumer-oriented to scientific, as well as commercial information that has not necessarily been vetted. The realization of a consumer health vocabulary has historical roots in public libraries and early digital libraries (Rubenstein, 2012); and, in 2007, researchers at the Brigham and Women's Hospital and National Library of Medicine joined together to develop the first-generation
Today, given the promise of the Big Data rush and AI, it seems more pressing than ever to advance the original CHV ontology. The research presented in this paper takes a step toward addressing this need and developing a best practice for extending CHV. The research is motivated by the importance of consumer health to society and the value of ontologies for both users and computational activities. The following sections cover background and data sources, review relevant prior research in this area, and state the goals and objectives of the research discussed in this study. The paper then reviews the research methods and procedures, including the incorporation of the Helping Interdisciplinary Vocabulary Engineering (HIVE) (Greenberg et al., 2011) technology, and presents the results and contextual discussion. Finally, the conclusion restates key outcomes and considers next steps.
The consumer health vocabulary (CHV) is a controlled vocabulary containing a mapping between consumer terms and UMLS medical terms. CHV was initially proposed in an expert panel at MEDINFO in 2001 (IOS Press Ebooks, 2001). Beginning in the early 1990s, with the advancement of the Web, the imbalance of domain knowledge between professionals and laypersons gradually shifted as consumers across many societal sectors increasingly gained access to information online. A 2008 study based on the popular online health community (OHC) site PatientsLikeMe found that 43% of the patient-described symptoms had a match in the Unified Medical Language System Metathesaurus (Smith & Wicks, 2008). To help consumers identify all symptoms by matching them with UMLS terms, there was a need to bridge the gap between consumer and medical terminology, and CHV was proposed as a solution.
Over the years since the initial development of CHV, there have been a few efforts to expand the vocabulary. One such effort by Zeng et al. (2007) used human review as well as logistic regression and other formulas to expand the vocabulary. Another study by Doing-Harris and Zeng-Treitler (2011) proposed an expansion of CHV using social network mining.
Consumer terms come from a variety of sources, including social media, discussion boards, and patient diaries. Given this heterogeneity, one can anticipate that multiple consumer terms map to the same preferred UMLS term. As part of background exploratory work for the research reported below, we found that about 58,000 UMLS terms are mapped to over 150,000 consumer terms. This means that each UMLS term has, on average, roughly three consumer terms associated with it. The Medical Subject Headings (MeSH) thesaurus is another ontology that is a part of UMLS and offers possibilities for enhancing CHV. The next section discusses MeSH.
The Medical Subject Headings (MeSH) thesaurus is another ontology system, often referred to as a controlled vocabulary, which is published by the National Library of Medicine. The main goal of MeSH is to organize all articles and journals published in the life sciences. In particular, all MEDLINE/PubMed articles are indexed using the MeSH controlled vocabulary. MeSH organizes all medical terminology into a tree structure. There are 16 initial categories, which are then divided further into subcategories. The thesaurus has up to 13 levels of subcategories. The most compelling case for the research presented here is that, in addition to organizing terms in categories, both MeSH and CHV are a part of the collection of UMLS vocabularies. As such, they share a unique identifier called the Concept Unique Identifier (CUI).
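To make the tree structure concrete, the following is a minimal sketch of how a position in the MeSH hierarchy can be read from a dot-separated MeSH tree number. The helper names are hypothetical, and the tree number used in the example is an illustrative placeholder rather than a value taken from this study.

```python
from typing import List, Optional


def tree_depth(tree_number: str) -> int:
    """Return the number of dot-separated segments in a tree number."""
    return len(tree_number.split("."))


def parent_tree_number(tree_number: str) -> Optional[str]:
    """Return the tree number one level up, or None at the top segment."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def ancestors(tree_number: str) -> List[str]:
    """Return all ancestor tree numbers, nearest parent first."""
    result = []
    parent = parent_tree_number(tree_number)
    while parent is not None:
        result.append(parent)
        parent = parent_tree_number(parent)
    return result


# Illustrative placeholder tree number (assumption, not from the paper).
example = "C06.552.308"
example_depth = tree_depth(example)
example_ancestors = ancestors(example)
```

Because each ancestor is a prefix of the tree number, walking up the hierarchy needs no lookup table, which is one reason a tree-structured vocabulary is convenient for adding parent terms.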
The shared CUI provides an important system feature that can be leveraged in pursuing ontological fusion. Specific to the goal of enhancing CHV, the fact that the two ontologies share a unique identifier provides a pathway for automatically enhancing the consumer health vocabulary using MeSH. Pursuing this step will allow researchers to capture health consumers’ ever-evolving preferences in a machine-readable format, which could further be utilized by algorithms to improve the quality of information delivered in response to consumers’ health-related queries.
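As an illustration of how a shared identifier enables this kind of fusion, the sketch below joins CHV-style rows to MeSH-style rows on a CUI field. The row layout, field names, CUI values, and the consumer terms are all hypothetical placeholders; the actual file formats used in this research are not described here.

```python
# Hypothetical CHV rows: each consumer term is mapped to a UMLS term with a CUI.
chv_rows = [
    {"cui": "C0000001", "umls_term": "liver failure", "consumer_term": "failing liver"},
    {"cui": "C0000001", "umls_term": "liver failure", "consumer_term": "liver shutdown"},
    {"cui": "C0000002", "umls_term": "example term", "consumer_term": "example variant"},
]

# Hypothetical MeSH rows keyed by the same CUI field.
mesh_rows = [
    {"cui": "C0000001", "mesh_heading": "Liver Failure"},
]


def join_on_cui(chv_rows, mesh_rows):
    """Inner-join CHV rows to MeSH rows using the shared CUI."""
    mesh_by_cui = {row["cui"]: row for row in mesh_rows}
    joined = []
    for row in chv_rows:
        mesh = mesh_by_cui.get(row["cui"])
        if mesh is not None:
            # Keep the CHV fields and attach the matching MeSH heading.
            joined.append({**row, "mesh_heading": mesh["mesh_heading"]})
    return joined


joined = join_on_cui(chv_rows, mesh_rows)
```

In this sketch, CHV rows whose CUI has no MeSH counterpart are simply left out of the joined result; whether such rows should be retained unchanged is a design decision the sketch does not settle.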
As we aim to extend and enhance CHV using another ontology, we look to other research on achieving and improving interoperability among ontologies. A number of projects aim at linking multiple ontologies to improve our ability to index and retrieve data. Zeng (2019) presents an insightful account detailing key approaches for merging two or more ontologies, and this work serves as a guide for achieving interoperability. Research by Isaac et al. (2009), focusing on library data, also details the merging of ontologies using different techniques and provides insight into various metrics that help determine whether terms can be matched. Francesconi et al. (2008) also discuss a method for merging ontologies; however, their research focuses on combining five ontologies that are of importance in the EU. Chan and Zheng (2002) discuss matching ontologies using different algorithms and evaluation criteria. Background research also points to a study by Slater, Gkoutos, and Hoehndorf (2020) and their method for combining biomedical ontologies into a single meta-ontology. This piece addresses a range of challenges arising from the different semantic levels and relationships found in ontologies. The research reviewed here helped us gain insight into various approaches and challenges, and provided grounding for how we might enhance the CHV.
As part of our review, we also examined research projects aimed at expanding CHV. Previous projects appear to have focused mainly on expanding the ontology using unstructured text. One example is the study by Jiang and Yang (2015), which enhanced CHV with new terms by mining online health communities and used co-occurrence metrics to determine whether terms should be added to the ontology. Another example is a study by He et al. (2017) that explored mining a social Q&A site to find candidate terms for expanding CHV. This stands as important work, although its results have not been incorporated into the publicly accessible CHV. Overall, reporting on expanding the CHV and evaluating the outcomes is limited. Our aim is to expand the CHV and, as part of our approach, to improve the interoperability of CHV using structured data from a well-maintained ontology.
The overriding goal of this research is to extend and enhance the original CHV. Many new medical conditions have emerged since the last CHV update. Conditions such as illnesses associated with opioid addiction and vaping have consumer terms associated with them; however, neither the conditions nor their associated consumer terms are currently captured in CHV. Another chief limitation to address is that the current CHV ontology is a flat terminology. Relationships among terms are central to ontology construction and to an ontology's ability to support computational activity and machine learning. To address this limitation, we took steps to enhance CHV with parent terms and additional associations, capturing relationships that are useful to consumers learning about an illness as well as to future data-driven activity supporting machine learning. These enhancements will help us overcome the challenges associated with CHV, particularly having multiple term variants at the same level. Additionally, adding parent terms as part of a hierarchy will help us identify where to insert new terms more easily. Our key objective in this research is to convert CHV to a hierarchical ontology using terms from MeSH, and this research outlines steps to perform this process automatically. By adding a hierarchical structure to CHV, we will enable consumers to obtain more relevant results when searching for terms using this ontology. Our plan is to eliminate redundant and irrelevant results as well as improve the discovery of terms using their relationships.
We present our approach for automatically extending and enhancing CHV using identifier mapping and ontological fusion. To achieve these goals, we performed the following steps:
Our methodology is described in Figure 1. We enhance and improve CHV and rename it the Combined Consumer Health Vocabulary (CCHV).
Diagram of methodology for producing the Combined Consumer Health Vocabulary.
The original CHV ontology contains 158,519 terms of two types: UMLS terms, which have a CUI directly associated with them, and consumer terms, which do not. Each consumer term is mapped to a UMLS term and is therefore associated with a CUI only indirectly. There are 57,819 unique UMLS terms in the CHV ontology. As Figure 1 shows, we started by joining our CHV and UMLS data to the MeSH terms using the common CUI field. Because the consumer terms are mapped to UMLS terms but have no unique identifier of their own, the next step in processing this vocabulary was to assign a unique identifier to each consumer term in the ontology. This differentiates the consumer terms from the UMLS terms, since each is represented as an independent term with its own identifier. The unique identifiers were generated by assigning each term a URI ending in an eight-digit unique identifier. We connected the consumer terms and the UMLS terms in the ontology with the alternate relationship. In CHV, the terms are connected with both an alternate and a related relationship; in MeSH, terms are connected with an alternate relationship (they are defined as synonyms), so using the alternate relationship helps maintain consistency in the new ontology. Using the UMLS terms in CHV, we identified all terms in MeSH that were parents of those UMLS terms and added these parent terms along with their entire hierarchy. The parent terms were connected as parents to both the UMLS terms and the consumer terms in CHV. Additionally, relationships between existing terms were retrieved from MeSH. This step, described in Figure 1 as the extraction of the hierarchical structure, led to additional alternate relationships in the ontology. We added these terms and relationships to CHV, transforming it from a flat mapping into a hierarchy with multiple connections. The new ontology is named CCHV.
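The procedure above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the implementation used in this research: the base URI, the function names, the CUI values, the consumer terms, and the small parent map are all hypothetical. Only the general steps (minting a URI that ends in an eight-digit identifier for each consumer term, linking consumer and UMLS terms as alternates, and attaching MeSH parents and their ancestors) follow the description in the text.

```python
import itertools

BASE_URI = "http://example.org/cchv/"  # hypothetical namespace (assumption)


def make_id_minter(start=1):
    """Return a function that mints URIs ending in an eight-digit identifier."""
    counter = itertools.count(start)
    return lambda: f"{BASE_URI}{next(counter):08d}"


def build_cchv(mappings, cui_of, mesh_parents):
    """Build a small hierarchical vocabulary from flat consumer-to-UMLS mappings.

    mappings:     list of (consumer_term, umls_term) pairs
    cui_of:       dict of UMLS/MeSH term -> CUI (placeholder values here)
    mesh_parents: dict of term -> list of MeSH parent terms
    """
    mint = make_id_minter()
    concepts = {}

    def concept(label):
        if label not in concepts:
            # UMLS/MeSH terms keep their CUI; consumer terms get a minted URI.
            identifier = cui_of.get(label) or mint()
            concepts[label] = {"id": identifier, "alternate": set(),
                               "broader": set(), "narrower": set()}
        return concepts[label]

    def link(child, parent):
        # Broader and narrower are recorded as inverse relationships.
        concept(child)["broader"].add(parent)
        concept(parent)["narrower"].add(child)

    for consumer_term, umls_term in mappings:
        concept(consumer_term)["alternate"].add(umls_term)
        concept(umls_term)["alternate"].add(consumer_term)
        # Attach direct MeSH parents to both the UMLS term and the consumer term.
        todo = list(mesh_parents.get(umls_term, []))
        for parent in todo:
            link(umls_term, parent)
            link(consumer_term, parent)
        # Climb the rest of the hierarchy above those parents.
        seen = set()
        while todo:
            node = todo.pop()
            if node in seen:
                continue
            seen.add(node)
            for grandparent in mesh_parents.get(node, []):
                link(node, grandparent)
                todo.append(grandparent)
    return concepts


# Illustrative data only; terms, CUIs, and hierarchy are assumptions.
cchv = build_cchv(
    mappings=[("liver shutdown", "liver failure"), ("failing liver", "liver failure")],
    cui_of={"liver failure": "C0000001", "liver diseases": "C0000002",
            "digestive system diseases": "C0000003"},
    mesh_parents={"liver failure": ["liver diseases"],
                  "liver diseases": ["digestive system diseases"]},
)
```

Storing relationships as sets keeps the sketch idempotent: processing two consumer terms that map to the same UMLS term does not duplicate the shared parent links.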
The final step in this procedure was to enhance HIVE. HIVE was enhanced to store all alternate terms as concepts, which means that we can use these relationships to discover more information in our data. Furthermore, HIVE was modified to return only preferred terms when performing a search, which gives a more streamlined result: rather than returning all terms that contain the search query, HIVE now returns only the preferred terms for any term matched by the search query.
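The difference between the two search behaviors can be illustrated with a minimal sketch. This is not HIVE's code or API; the function names, the substring matching rule, the query, and the choice of which variant counts as the preferred term are assumptions made for illustration.

```python
# Hypothetical index: every term variant points to its preferred term.
index = {
    "ovarian cysts": "ovarian cysts",
    "ovarian cyst": "ovarian cysts",
    "cyst ovarian": "ovarian cysts",
    "cysts ovarian": "ovarian cysts",
    "parovarian cyst": "parovarian cyst",
    "parovarian cysts": "parovarian cyst",
}


def search_all_terms(index, query):
    """Earlier behavior (as described): return every term containing the query."""
    q = query.lower()
    return sorted(term for term in index if q in term.lower())


def search_preferred(index, query):
    """Revised behavior (as described): return each preferred term once."""
    q = query.lower()
    return sorted({index[term] for term in index if q in term.lower()})


all_hits = search_all_terms(index, "cyst")
preferred_hits = search_preferred(index, "cyst")
```

With this toy index, the first function returns all six variants, while the second collapses them to the two preferred terms; the alternates remain reachable through the index.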
Transitioning to CCHV means that we can now query the vocabulary and retrieve results that contain more information about each term. Particularly, we can learn more about the parent, child, and related terms for each search result. Additionally, fewer irrelevant results are retrieved from a query since the search result only contains preferred terms. For example, while
Relationships between the original CHV and the new CCHV.
VOCABULARY | ALTERNATE TERM | RELATED TERM | BROADER TERM | NARROWER TERM |
---|---|---|---|---|
Original CHV | 51,995 | 30,060 | 0 | 0 |
CCHV | 51,995 | 325,545 | 55,571 | 55,571 |
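The counts in the table above can be compared directly. The short sketch below computes, for each relationship type, how many relationships were added and the ratio of the new count to the original one; the variable and function names are hypothetical, and the figures are copied from the table.

```python
# (original CHV count, CCHV count) per relationship type, from the table.
counts = {
    "alternate": (51_995, 51_995),
    "related": (30_060, 325_545),
    "broader": (0, 55_571),
    "narrower": (0, 55_571),
}


def summarize(before, after):
    """Return (number added, growth ratio); ratio is None when before is 0."""
    added = after - before
    ratio = after / before if before else None
    return added, ratio


summary = {name: summarize(*pair) for name, pair in counts.items()}

for name, (added, ratio) in summary.items():
    ratio_text = f"{ratio:.2f}x" if ratio is not None else "n/a (none before)"
    print(f"{name:>9}: +{added:,} ({ratio_text})")
```

The equal broader and narrower counts are what one would expect if each broader link is stored together with its inverse narrower link.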
The change in the data is demonstrated in the examples in Figures 2 and 3. Figure 2 is a screenshot of the term
Search results for the term
Search results for the term
Adding terms and relationships from MeSH helps us discover relationships in our data and adds more context to the consumer terms. Since HIVE is a tool for indexing documents, uncovering more connections in the vocabulary used to index will mean more context on the terms in the document. In the example, we see that
Search results for the term
Examining other search queries and comparing only the number of search results, we typically find a reduction. In Table 2, we see an improvement in results for the query
A comparison of search results for the term “liver failure” between the previous and the improved versions of HIVE.
HIVE Results - Before Improvements | HIVE Results - After Improvements |
---|---|
Liver failure | Liver failure |
Liver failure | |
In Table 3, we present the result of the query
A comparison of search results for the term
HIVE Results - Before Improvements | HIVE Results - After Improvements |
---|---|
cyst ovarian | brca1 genes |
cysts ovarian | ovarian cysts |
ovarian cyst | parovarian cyst |
ovarian cystectomy | |
ovarian cysts | |
ovarian cysts | |
parovarian cyst | |
parovarian cyst | |
parovarian cysts | |
The results presented above demonstrate the improvements made in both CHV and HIVE. The new CCHV ontology transforms a flat mapping of consumer terms to UMLS terms into a hierarchical dataset that has a tree structure and contains both new terms from MeSH and new relationships between existing terms uncovered from MeSH. In many areas of research, a hierarchical ontology is considered an improvement on a flat mapping since it can convey more information through the relationships between the terms. One example of an area of research where such an improvement is observed is protein function prediction using gene ontologies (Eisner et al., 2005). Another area of research that illustrates the benefits of a hierarchical ontology over a flat one is the classification of web search results (Singh & Nakata, 2005). In addition to enhancing the structure of the ontology, our study has also added improvements to HIVE. HIVE has been enhanced to produce more relevant search results: it now returns one search result per preferred term rather than returning all alternate terms containing the search query. To demonstrate this, we tested a sample of 30 search queries and saw a reduction in the number of results with the new version of HIVE. The mean number of results returned fell from 140.4 in the previous version of HIVE to 22.87 in the new version. The
While we have reduced the number of results per search, occasionally, we will get a result that is not directly related to the query. For example, when searching for
This research was performed with the aim of enhancing CHV and improving HIVE to produce improved search results. The new CCHV contains more relationships between existing concepts. It also contains more types of relationships, since the previous version of CHV contained no broader or narrower terms and far fewer related terms. CHV has been transformed into a hierarchical structure by adding broader terms from MeSH. HIVE has been enhanced by improving query results when searching for medical terms. After enhancing CHV and HIVE, search queries now return fewer terms. The terms returned are only preferred terms, rather than multiple variations of the same term with slightly different spelling. We have consolidated all duplicate and very similar terms to produce a more streamlined result that is easier to sift through. We can still uncover the alternate terms by looking at the alternates of the query result. This research is a first step in enhancing CHV to include new terms that have been added to other medical ontologies since the last release of CHV.
Allowing users to search for consumer terms and obtain the preferred medical term as a result can have great benefits for both consumers and medical practitioners (McCray et al., 1993). Such a resource can help identify medical conditions or symptoms of a condition from discussions on message boards, or improve results in search engines. By increasing the number of connections between terms, we allow for improved discovery of medical conditions and symptoms using consumer terms.
While this research significantly improves both CHV and HIVE, in performing the research we identified two main areas for further work. The first area is to devise an algorithm to identify alternates that are not synonymous with each other. We identified such an example with the relationship between