Data ready from day one

22 October 2024
Posted by: Rob Green

Emma Thwaites, Director of Global Policy and Corporate Affairs at the Open Data Institute (ODI), will be appearing at this year’s Rewired conference. Here she looks at how a National Data Library could work and what is needed for it to be a success.

The idea of a National Data Library (NDL) emerged in the Labour Party Manifesto ahead of this year’s General Election. Since then, there’s been ongoing discussion about its shape and scope.

My organisation, the Open Data Institute, has contributed thoughts on the NDL to the AI Action Plan, reviewed research and written about the concept more broadly. The original vision from Labour for the NDL was to “bring together existing research programmes and help deliver data-driven public services, whilst maintaining strong safeguards and ensuring all of the public benefit”. This will only be possible if we create a well-thought-through piece of infrastructure. This involves ensuring access to high-value datasets, building infrastructure that the public can trust and ensuring that people have the right skills to use and maintain it. Many of the things that information professionals think about daily, no doubt.

As we’ve argued at the ODI, without data, there is no AI. As a country, we need well-structured and well-governed data to support AI stacks. To illustrate just how critical this point is, one study found that analysts typically spend 80 per cent of their time preparing data for AI use. To ensure the same challenge doesn’t blight the NDL, we must design it to be AI-ready from the outset.

The NDL will require high-quality datasets curated from existing public sector bodies, research organisations, and beyond. However, data often sits in legacy IT systems that don’t “talk” to each other and vary enormously in format and quality. For example, a great deal of data still only exists in PDF documents stored in rudimentary databases.

With these kinds of challenges, it’s tempting to try to design a comprehensive data architecture from the get-go, with accompanying detailed rules and reporting requirements. But it’s important not to stifle good practices where they exist or, indeed, to make the task ridiculously daunting for those who don’t already have high-quality data. It’s also critical to understand the cost of digital (and data) transformation when public finances are under pressure. While the government’s newly published Green (consultation) Paper, Invest 2035: The UK’s Modern Industrial Strategy, has a goal of “using public sector data as a driver of growth”, we must ensure that the approach to implementing the library takes account of the varying data maturity of public sector and research organisations whose data might come within its remit.

In the ODI’s recommendations for the AI Action Plan, we suggest that the library initially focuses on three key areas: high-quality public data, federated Trusted Research Environments (TREs), and cultural heritage data. These datasets offer significant benefits and are well-structured for reliable public-sector deployment. Making them available to digital innovators could enhance public service delivery and stimulate the ecosystem necessary for sustainable impact. This would also give the NDL robust foundations on which to build. Regardless of the roadmap to building the library, resources and investment will be needed, just as they are required for building physical infrastructure.

In some ways, as our Senior Policy Advisor, Gavin Freegard, noted back in July, the NDL can be likened to a traditional library. He cited a report by Onward proposing that “the Government should establish a British Library for data – a centralised, secure platform to collate high-quality data for scientists and start-ups.” Just as libraries store, organise, and provide access to knowledge, the NDL can curate and make data accessible to researchers, businesses, and public services. However, just as libraries have different access levels depending on the sensitivity of their materials, the NDL will need to offer tiered access to protect personal data. Data stewards will be critical in guiding users through this vast resource, ensuring the NDL becomes a trusted and effective driver of innovation and societal benefit.

Indeed, building and maintaining public trust will be a critical success factor in the project. The Lloyd’s Register Foundation’s World Risk Poll 2024 reveals that only 11 per cent of global citizens trust their governments to protect personal data, and the problem is particularly acute when it comes to personal or sensitive data, like benefits information or health records. The 2021 GPDPR campaign (https://tinyurl.com/odiNHSdata), where patient opt-outs spiked, shows the risks of ignoring public concerns. In building the National Data Library, we can avoid this by exploring the potential of new Privacy Enhancing Technologies (PETs), which include personal data stores, federated learning models and multi-party computation. These technologies, which the ODI is researching, offer ways to ensure data security and empower individuals to control data about them.

For example, in a federated learning model, the model travels to the data, rather than the data being ‘pooled’ in one place for the algorithm to be run on it. The model is trained locally on each device, and the data never leaves its original location. This technology is already being successfully used in projects such as the EU’s MELLODDY and the secure healthcare research collaboration between Moorfields Eye Hospital and Bitfount, which aims to improve the early detection of eye diseases. Enabling algorithms to be trained across multiple local datasets without exchanging the underlying data presents great potential to unlock value from data that might traditionally have been kept closed.
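To make the idea concrete, here is a minimal sketch of federated averaging in Python. It is an illustration only and not how MELLODDY or the Moorfields and Bitfount collaboration is implemented. The three "sites", their records and the one-parameter model (y ≈ weight × x) are all hypothetical, and everything runs in a single process purely for demonstration. The point to notice is that only the model weight moves between the coordinator and the sites; the records themselves are never pooled.

```python
from typing import List, Tuple

# Hypothetical type: the (x, y) records held locally at one site.
Site = List[Tuple[float, float]]


def local_update(weight: float, data: Site, lr: float = 0.01, epochs: int = 5) -> float:
    """Train y ≈ weight * x on one site's own data using gradient descent.

    Only the updated weight is returned; the records stay at the site.
    """
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(weight: float, sites: List[Site]) -> float:
    """Send the current weight to every site, then average the returned
    weights, weighting each site by the number of records it holds."""
    total_records = sum(len(site) for site in sites)
    return sum(local_update(weight, site) * len(site) for site in sites) / total_records


def train(sites: List[Site], rounds: int = 50) -> float:
    """Repeat federated rounds, starting from a weight of zero."""
    weight = 0.0
    for _ in range(rounds):
        weight = federated_round(weight, sites)
    return weight


# Hypothetical data: three sites whose records all follow y = 3x.
sites = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
learned_weight = train(sites)
print(f"Learned weight: {learned_weight:.4f}")
```

In a real deployment each call to local_update would run inside the organisation that holds the data, and the updates themselves would usually need further protection, because model updates can still leak information about the records they were trained on.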

With personal data stores, individuals control their personal data and decide how it can be used across the Web. There are several examples of this, including Solid, an open-source project, community and standard (or protocol) originated by the ODI’s co-founder Sir Tim Berners-Lee. Solid has been stewarded by the ODI since October 2024, and its model is already proving effective. For example, a company called Inrupt (www.inrupt.com) has provided enterprise-grade Solid software for projects including one with the government of Flanders, through the Flanders Data Utility Company. This has given 6.8 million citizens their own Personal Online Data Stores (Pods) through which they can share data with government services, demonstrating the potential for trusted, transparent public sector data exchanges.

Multi-party computation (MPC) is a cryptographic protocol that enables individual stakeholders who hold sensitive data to pool it with others for joint computations without revealing the underlying data itself. Again, this offers a potential solution to some of the challenges around sharing sensitive public sector information via the NDL. Earlier this year, the ODI looked at how the Boston Women’s Workforce Council (BWWC) has partnered with Boston University to use MPC, enabling companies from the Greater Boston area to collectively compute the sum of their payroll data without revealing their individual contributions. The MPC protocol was developed to benchmark employers and produce a pay equity report covering the Greater Boston area. Introducing MPC technologies to the NDL could enable researchers to conduct similar analyses using UK public sector data.
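One simple way to compute a joint sum without revealing individual inputs is additive secret sharing, sketched below in Python. This is an assumption chosen for illustration and not a description of the protocol the BWWC and Boston University actually use. The payroll figures, the function names and the modulus are all hypothetical, the parties are simulated inside one process, and the sketch assumes every party follows the protocol honestly and that the true total is smaller than the modulus.

```python
import secrets
from typing import List

# Public modulus (an illustrative choice): all share arithmetic wraps around it.
MODULUS = 2**61 - 1


def make_shares(value: int, n_parties: int) -> List[int]:
    """Split a private value into n shares that sum to it modulo MODULUS.

    Any n - 1 of the shares look like uniformly random numbers.
    """
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values: List[int]) -> int:
    """Simulate a secure sum in which no party sees another party's value."""
    n = len(private_values)
    # Step 1: each party splits its value and sends one share to every party.
    all_shares = [make_shares(value, n) for value in private_values]
    # Step 2: party j adds up the shares it received and publishes the subtotal.
    subtotals = [sum(row[j] for row in all_shares) % MODULUS for j in range(n)]
    # Step 3: the published subtotals combine into the true total.
    return sum(subtotals) % MODULUS


# Hypothetical payroll totals held privately by three employers.
payrolls = [1_200_000, 850_000, 2_430_000]
total = secure_sum(payrolls)
print(f"Combined payroll: {total}")
```

Each published subtotal mixes random shares from every participant, so it reveals nothing about any single employer's payroll, yet the subtotals still add up to the correct combined figure.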

A further significant consideration for the NDL is the data skills gap. AI tools can help the public sector and researchers work with complex datasets, but there’s an urgent need for digital and data skills training across all sectors. It would be a poor outcome for the NDL if we built it but did not provide people with the skills they need to use it. Multiverse recently found that new technology often goes under-utilised in organisations, and data is frequently mistrusted. Their Mapping the Data Skills Gap Intelligence Report 2024 revealed that while businesses invest substantial amounts in software, they aren’t matching that investment with skills training, leaving employees at a disadvantage.

Keeping up with the pace of technological change is hard, but it’s essential if the NDL is going to be able to reach its potential. Forrester’s 2023 report, Your data culture is in crisis (https://tinyurl.com/odiCulture), indicated that 41 per cent of employees often mistrust the data available to them for decision-making. According to Multiverse, as much as a third of the time spent by employees working with data is unproductive, with just under two-thirds reporting that they don’t even have basic Excel skills. In its design and inception, the NDL must encourage and develop data literacy, ensuring that those building it and using it can get the maximum possible value.

There is potential for the National Data Library – as an essential part of our national data infrastructure – to act as a showcase for responsible data practices while being stewarded and governed in the national interest. It can become a valuable resource for the public and private sectors and the UK’s research community. Using the country’s considerable expertise in cutting-edge technologies and well-designed data governance models, including those that allow for safe and ethical data sharing, we can make the UK’s NDL a global leader. As our Executive Chair, Sir Nigel Shadbolt, has said, “With open minds, we can build platforms, services, and ways of working to realise the social and economic value in health, education, and other public data.”

Published: 21 October 2024
