Algorithmic Copyright Enforcement and AI: Issues and Potential Solutions through the Lens of Text and Data Mining



Although digitalization and the emergence of the Internet have caused a long-term crisis for copyright law, technology itself also appears to offer an ideal solution to the challenges of the digital age: copyright has been a major use case for algorithmic enforcement, from the early days of digital rights management technologies to the more advanced content recognition algorithms. These technologies identify and filter possibly infringing content automatically, effectively and often preventively. They have been criticized for their shortcomings, such as a lack of transparency, bias and the possible impairment of fundamental rights. Self-learning machines and semi-autonomous AI have the potential to offer even more sophisticated and expeditious enforcement by code; however, they could also aggravate these issues. As the EU legislator intends to make the use of such technologies essentially obligatory for certain online content-sharing service providers (via the infamous Article 17 of the Directive on Copyright in the Digital Single Market), assessing the situation in light of future technological development has become a timely task.

This paper aims to identify the main issues and potential long-term consequences of legislation that in practice requires the use of such filtering algorithms, and to outline possible solutions. It focuses on the role that a broad copyright exception for text and data mining could play in counterbalancing the problems associated with algorithmic enforcement.

Keywords: AI, Copyright Law, EU Law, Machine Learning, Technology, Text and Data Mining

Author biography

Andrea Katalin Tóth

Hungarian Intellectual Property Office; Eötvös Loránd University, Faculty of Law and Political Science

Legal officer at the International Copyright Law at the Hungarian Intellectual Property Office

Ph.D. candidate at the Eötvös Loránd University Faculty of Law and Political Science

