Volume 45 (2020): Issue 4 (December 2020)
Journal details
First published: 24 Oct 2012
Publication frequency: 4 times per year
Access type: Open access

Return on Investment in Machine Learning: Crossing the Chasm between Academia and Business

Published online: 16 Dec 2020
Volume & Issue: Volume 45 (2020) - Issue 4 (December 2020)
Pages: 281 - 304
Received: 16 Mar 2020
Accepted: 30 Nov 2020

Academia remains the central place of machine learning education. While academic culture is the predominant influence on how machine learning is taught to students, many practitioners question this culture, citing a lack of alignment between academic and business environments. Drawing on professional experience from both sides of the chasm, we describe the main points of contention, in the hope that this will help align academic syllabi with the expectations placed on future machine learning practitioners. We also provide recommendations for teaching the applied aspects of machine learning.


