Volume 3 (2019): Issue 4 (October 2019)
Journal Details
First Published: 30 Jan 2017
Publication timeframe: 4 times per year
Access type: Open Access

A comparison of machine learning algorithms for the prediction of Hepatitis C NS3 protease cleavage sites

Published Online: 23 Oct 2019
Volume & Issue: Volume 3 (2019) - Issue 4 (October 2019)
Page range: 167 - 174

Hepatitis is a global disease that is on the rise and currently causes more deaths each year than the human immunodeficiency virus. As a result, there is an increasing need for antivirals. Effective antivirals have previously been found in the form of substrate-mimetic protease inhibitors. Machine learning has been used to predict the cleavage patterns of viral proteases and so provide information for future drug design. This study applied and compared several machine learning algorithms on hepatitis C viral NS3 serine protease cleavage data. The results show that differences in sequence-extraction methods can outweigh differences in algorithm choice. Models built from pseudo-coded datasets all achieved high accuracy and outperformed models built from orthogonal-coded datasets; however, no single pseudo-coded model performed significantly better than any other. Evaluation of the performance measures also shows that the correct choice of model scoring system is essential for unbiased model assessment.
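Two ideas in the abstract can be illustrated with a minimal sketch. It assumes that "orthogonal coding" means the usual one-hot scheme (20 binary features per residue) and uses the Matthews correlation coefficient (MCC) as one example of a scoring measure that is less sensitive to class imbalance than accuracy. The abstract does not list the measures the study used, so the choice of MCC, the function names, the example peptide, and the 5-versus-95 class split are all hypothetical and for illustration only.

```python
import math

# The 20 standard amino acids, in an arbitrary but fixed order (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def orthogonal_encode(peptide):
    """One-hot ("orthogonal") encode a peptide: 20 binary features per residue."""
    vector = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        bits[AMINO_ACIDS.index(residue)] = 1  # raises ValueError on unknown residues
        vector.extend(bits)
    return vector


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 means 'cleaved'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 by convention when undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical decapeptide: 10 residues -> 200 binary features.
features = orthogonal_encode("ACDEFGHIKL")
print(len(features))

# Hypothetical imbalanced data: 5 cleaved peptides, 95 non-cleaved.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a trivial model that never predicts cleavage

print(accuracy(y_true, always_negative))  # 0.95
print(mcc(y_true, always_negative))       # 0.0
```

On this made-up data, a model that never predicts cleavage scores 95% accuracy but an MCC of zero, which is the kind of scoring bias the abstract's final sentence warns about.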


