Volume 2 (2020): Issue 1 (December 2020)
Journal Details
First Published
20 Oct 2019
Publication timeframe
1 time per year
Access type: Open Access

Supporting secondary research in early drug discovery process through a Natural Language Processing based system

Published Online: 31 May 2021
Page range: 209 - 222

Recent decades have been characterised by a constant decline in the productivity of pharmaceutical companies' research and development activities. This decline stems from the intrinsic risk of the drug discovery process, which must be managed efficiently. Within this process, early-phase projects could be streamlined by doing more secondary research. These activities would integrate chemical and biological knowledge from the scientific literature to give an overview of a research area and of its evolution, which would in turn help refine research and development operations.

Given the vast number of published pharmaceutical studies, identifying the important information is not easy. To address this task, a series of projects have leveraged the open pharmacological space through state-of-the-art technologies, the most popular being Knowledge Graph methods. Although extremely useful, this technology requires a substantial investment of time and human resources. An alternative would be to develop a system built from Natural Language Processing blocks. However, there is still no defined framework or reusable code template for the compound development use case.

This study presents the design and development of a system that uses Dynamic Topic Modelling and Named Entity Recognition modules to extract meaningful information from a large volume of unstructured text. Moreover, the dynamic character of the topic modelling technique makes it possible to analyse how different subject areas evolve over time. To validate the system, a collection of articles from the Pharmaceutical Research Journal was used.
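The idea of following a subject area over time can be illustrated with a minimal sketch. It is not Dynamic Topic Modelling and not the system described here: it is a much simpler stand-in that splits a corpus into yearly slices and tracks the share of each term per slice. The toy corpus, the stopword list, and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (publication year, abstract text). Not data from the study.
DOCS = [
    (2001, "amorphous drug systems and crystalline stability"),
    (2001, "paracellular permeability of insulin"),
    (2010, "insulin resistance and insulin signalling"),
    (2010, "crystalline systems under humidity"),
]

# Assumed minimal stopword list, just enough for the toy corpus.
STOPWORDS = {"and", "of", "under", "the"}


def term_share_by_year(docs):
    """Return {year: {term: share of that year's non-stopword tokens}}."""
    counts = defaultdict(Counter)
    for year, text in docs:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        counts[year].update(tokens)
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {term: n / total for term, n in counter.items()}
    return shares


def trend(shares, term):
    """Share of `term` in each time slice, in chronological order."""
    return [(year, shares[year].get(term, 0.0)) for year in sorted(shares)]


SHARES = term_share_by_year(DOCS)
```

Calling `trend(SHARES, "insulin")` returns one share per year, so a rising or falling term can be read off directly. A real dynamic topic model would instead learn topics as word distributions and let those distributions drift between slices.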

Our results show that the system is able to identify the main research areas of the last 20 years, namely crystalline and amorphous systems, insulin resistance, and paracellular permeability. Additionally, the evolution of these subjects is a highly valuable resource that can be used to gain an in-depth understanding of the shifts that have taken place in a specific domain.

However, a limitation of this system is that it cannot detect an association between two concepts or entities unless they appear in the same document.
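This limitation can be made concrete with a small sketch of document-level co-occurrence counting. It assumes a Named Entity Recognition step has already produced one set of entity names per document. The entity sets below are invented placeholders, and the function is a hypothetical illustration, not the system's code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: one set of entity names per document (toy values).
DOC_ENTITIES = [
    {"insulin", "metformin"},
    {"metformin", "AMPK"},
    {"insulin", "metformin", "GLUT4"},
]


def cooccurrence(doc_entities):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in doc_entities:
        # Sorting gives every pair a single canonical (a, b) ordering.
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts


PAIRS = cooccurrence(DOC_ENTITIES)
```

In this toy data, `PAIRS[("insulin", "metformin")]` is 2, while `PAIRS[("AMPK", "insulin")]` is 0. Both entities co-occur with a third one, yet they never share a document, so a purely document-level approach records no link between them.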


