Volume 2022 (2022): Issue 2 (April 2022)
Journal Details
First Published
16 Apr 2015
Publication timeframe
4 times per year
Access type: Open Access

Who Knows I Like Jelly Beans? An Investigation Into Search Privacy

Published Online: 03 Mar 2022
Page range: 426 - 446
Received: 31 Aug 2021
Accepted: 16 Dec 2021

Internal site search is an integral part of how users navigate modern sites, from restaurant reservations to house hunting to searching for medical solutions. Search terms on these sites may contain sensitive information such as location, medical information, or sexual preferences; when further coupled with a user’s IP address or a browser’s user agent string, this information can become very specific, and in some cases possibly identifying.

In this paper, we measure the ways in which search terms are sent to third parties when a user submits a search query. We developed a methodology for identifying and interacting with search components, which we implemented on top of an instrumented headless browser. We used this crawler to visit the Tranco top one million websites and analyzed search term leakage across three vectors: URL query parameters, payloads, and the Referer HTTP header. Our crawler found that 512,701 of the top one million sites had internal site search. We found that 81.3% of websites with internal site search sent (or, from a user’s perspective, leaked) our search terms to third parties in some form. We then compared our results with what would be expected from a natural language analysis of the privacy policies of those leaking websites (where available) and found that about 87% of those privacy policies do not mention search terms explicitly. However, about 75% of these privacy policies seem to mention the sharing of some information with third parties in a generic manner. We then present a few countermeasures, including a browser extension that warns users about imminent search term leakage to third parties. We conclude by recommending ways to clarify the privacy implications of internal site search for end users.
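To make the three leakage vectors concrete, the following is a minimal sketch of how a single captured request might be checked for a search term. It is an illustration, not the crawler described in the paper: the `Request` record, the `leak_vectors` function, the example hosts, and the search term `jellybeans` are all hypothetical, and the third-party test is a naive host-suffix comparison rather than a proper registrable-domain lookup.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, unquote_plus, urlsplit


@dataclass
class Request:
    """Hypothetical record of one outgoing request captured during a crawl."""
    url: str
    referer: str = ""
    body: str = ""


def host(url: str) -> str:
    """Return the lowercased hostname of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def is_third_party(request_url: str, site_host: str) -> bool:
    """Naive check (an assumption of this sketch): a request is first-party
    only if its host equals the site host or is a subdomain of it."""
    h = host(request_url)
    return not (h == site_host or h.endswith("." + site_host))


def leak_vectors(req: Request, site_host: str, term: str) -> set[str]:
    """Return which of the three vectors carry the term to a third party."""
    if not is_third_party(req.url, site_host):
        return set()
    needle = term.lower()
    found = set()
    # Vector 1: URL query parameters of the third-party request.
    pairs = parse_qsl(urlsplit(req.url).query, keep_blank_values=True)
    if any(needle in k.lower() or needle in v.lower() for k, v in pairs):
        found.add("url")
    # Vector 2: request payload (decoded as if URL-encoded).
    if needle in unquote_plus(req.body).lower():
        found.add("payload")
    # Vector 3: Referer header sent along with the request.
    if needle in unquote_plus(req.referer).lower():
        found.add("referer")
    return found


if __name__ == "__main__":
    req = Request(
        url="https://tracker.example/collect?q=jellybeans",
        referer="https://shop.example/search?q=jellybeans",
    )
    print(sorted(leak_vectors(req, "shop.example", "jellybeans")))
```

A plain substring match like this would miss terms that are hashed or otherwise encoded before being sent, so it should be read as a lower bound on what a real measurement would need to handle.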


