Sonia Cooper, Assistant General Counsel, Open Innovation, Microsoft, "What Open Data Means in the world of Generative AI" from State of Open: The UK in 2023 Phase Two, Part Two

Sonia Cooper, Assistant General Counsel, Open Innovation, Microsoft

State of Open: The UK in 2023

Phase Two “Show us the Money”

Part 2: “AI Openness”

Sonia emphasises the pivotal role of open data in advancing AI, particularly in the context of Generative AI. Generative AI relies on extensive, diverse datasets to improve model performance, enabling it to learn patterns and make accurate predictions. The widening data divide is identified as a barrier, limiting access to data and hindering the development of AI across sectors. The opportunities presented by AI, such as automation and optimisation, can only be fully realised if there is broad access to usable data. The text also highlights concerns regarding copyright issues and the need for policymakers to clarify regulations to avoid hindering text and data mining. Overall, the call is for fostering a culture of openness to unlock the full potential of AI for the benefit of all.

Thought Leadership: What Open Data Means in the world of Generative AI

The importance of open data is back in focus. AI may well be the most consequential technology of our time, delivering advancements across all industries and sectors. The advent of foundation models for Generative AI, which require vast amounts of broad and diverse data, has renewed the importance of open data. Varied datasets enable AI models to learn patterns, recognise objects, make accurate predictions, and help people create. The larger and more representative the dataset, the better the AI model performs. Data that can be broadly accessed and that is made as open as possible plays a pivotal role in ensuring that there is sufficient data available to train AI models, to facilitate innovation, and to ensure unbiased, performant AI.

Data accessibility and the widening divide

AI holds immense potential to revolutionise industries, enhance decision making, and address complex challenges. The opportunities for automation, optimisation, and prediction, which empower organisations and individuals to achieve greater efficiency and innovation, are unprecedented. From healthcare and finance to transport and education, AI-driven solutions have the capability to improve outcomes, save costs, and enhance the overall quality of our lives. These opportunities will be unlocked only if organisations developing AI have sufficient access to broad and diverse data.

That data must be as accessible and usable as possible. There is, however, a widening divide between those who are able to put data to work and those who cannot. This data divide is due both to barriers to accessing data and to barriers to using data, the latter caused by a lack of data science and analytics expertise. A recent study found that the use of data tags to enable effective search functions is limited.38 Outside of the tech sector, fewer than 1% of companies have data scientists who can work with their data.

For AI to benefit everyone, everyone should be able to access and use the data they need to develop and use AI, so that research and development can occur across all sectors and for the benefit of all parts of society. Conversely, restricting access to data for the purpose of training AI will see research and development in AI restricted to furthering the priorities of the limited number of organisations that have access to data.

The opportunities presented by AI must be available to all if we are going to develop AI that benefits everyone. In this context, Generative AI presents particularly interesting opportunities and challenges. It presents us with the opportunity to democratise AI and the availability of data – by using large language models to develop conversational interfaces to AI functionality, AI becomes easier for everyone to use. By using Generative AI to interact with data, the insights in data become more widely available to more people. Generative AI interfaces will make it easier to request feedback from open data users, thereby increasing the incentives to publish open data. As more data tools incorporate Generative AI, data users will be able to inform data publishers about how they are using the data while they interact with it. As we find data gaps, Generative AI may even help plug those gaps by generating synthetic data or by making it easier to publish data. Data publishing tools incorporating AI will also assist with the creation of metadata when publishing. This virtuous cycle underscores the exponential benefits when we are more open with data. The challenges presented by Generative AI, in turn, arise from its greater dependency on vast amounts of data. Without access to enough data, AI will not perform well and this virtuous cycle cannot be created.

As we found in the open data steward peer learning network, run by Microsoft and the Open Data Institute39, feedback loops that show data publishers and other data users how data is being used encourage greater community engagement and more effective data publishing. This can include providing mechanisms for users to contribute back to the data sources, report errors, suggest improvements, or share derived insights. Such collaboration promotes data quality, ensures ongoing updates and maintenance, and strengthens the open data ecosystem. This is one of the principles explored in the Microsoft and Open Data Institute toolkit for new data publishers40.

Open data practices and copyright

Issues surrounding data access are compounded by a lack of confidence among those who want to use data, due in part to questions as to whether copyright is infringed when copyrighted works are used to train AI. In 2022, the UK Government withdrew its decision to implement an explicit copyright exception for text and data mining, creating confusion regarding the ability to carry out text and data mining in the UK. The UK Intellectual Property Office is now working on a code of practice for text and data mining to avoid the need for legislative change. This seems to be a good opportunity for the UK Government to clarify that exceptions permitting text and data mining already exist in UK legislation.

Performing text and data mining on works protected by copyright should not be considered a copyright infringement. As required under Article 9(2) of TRIPS, copyright should not extend to ideas, procedures, or mathematical concepts. Everyone should have the right to extract knowledge from copyrighted works – to read, to learn, to understand, to develop ways to create new works and use technology as a tool to enable this.

Policy makers should take care not to attempt to reverse fundamental principles of copyright law which prevent copyright from extending to ideas and concepts. Some are suggesting that data made publicly available on the internet cannot be used for text and data mining without a copyright licence, without regard to existing exceptions. This is a worrying development. Placing conditions on the use of data on the internet that extend beyond the limits of copyright risks further exacerbating the data divide and putting a cap on future innovations. It also attacks the notion of a free and open internet. We are only beginning to see the opportunities that AI presents. We must be careful not to limit our ability to share facts and ideas widely. Instead, governments should help foster a culture of openness and incentivise data sharing and usage, so we can unlock the full potential of AI and ensure that its benefits are accessible to all.
