Volume 3 (2019): Issue 4 (October 2019)
Journal Details
First Published
30 Jan 2017
Publication timeframe
4 times per year
Access type: Open Access

A comparison of machine learning algorithms for the prediction of Hepatitis C NS3 protease cleavage sites

Published Online: 23 Oct 2019
Page range: 167 - 174

Hepatitis is a global disease that is on the rise and currently causes more deaths each year than the human immunodeficiency virus. As a result, there is an increasing need for antivirals. Effective antivirals have previously been found in the form of substrate-mimetic protease inhibitors. Machine learning has been used to predict the cleavage patterns of viral proteases and so provide information for future drug design. This study applied and compared several machine learning algorithms on hepatitis C virus NS3 serine protease cleavage data. The results show that differences in sequence-extraction methods can outweigh differences in algorithm choice. Models produced from pseudo-coded datasets all achieved high accuracy and outperformed models created from orthogonal-coded datasets. However, no single pseudo-coded model performed significantly better than any other. Evaluation of the performance measures also shows that the correct choice of model scoring system is essential for unbiased model assessment.
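
Two ideas in the abstract can be illustrated with a short sketch: orthogonal coding of a peptide (one binary indicator per residue type and position) and the way a scoring system can flatter a model on imbalanced data. This is a minimal sketch under stated assumptions, not the study's implementation. The alphabet order, the example peptide, the confusion-matrix counts and the use of the Matthews correlation coefficient as the alternative score are all chosen here for illustration. Pseudo-coding is not sketched because the abstract does not define it.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def orthogonal_encode(peptide):
    """Return a flat 0/1 vector with 20 bits per residue, exactly one bit set each."""
    vector = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        bits[AMINO_ACIDS.index(residue)] = 1  # ValueError on non-standard letters
        vector.extend(bits)
    return vector


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; taken as 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Arbitrary 10-residue peptide for illustration only (not from the study's data).
encoded = orthogonal_encode("ACDEFGHIKL")
print(len(encoded), sum(encoded))  # 200 features, 10 of them set

# Made-up counts: 10 cleaved and 90 uncleaved peptides, scored for a
# hypothetical model that labels every peptide "uncleaved".
print(accuracy(tp=0, tn=90, fp=0, fn=10))  # 0.9
print(mcc(tp=0, tn=90, fp=0, fn=10))       # 0.0
```

In this made-up case the model finds no cleavage sites at all, yet accuracy reports 0.9 while the correlation-based score reports 0.0, which is the kind of bias a poorly chosen scoring system can introduce.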


