This open-source project is using Python, SQL and Docker to understand coronavirus health data

About

Django and Python developers working alongside clinicians and researchers have built a new analytics platform that looks at electronic health records for 24 million people.

By: Jo Best | ZDNet.

As the largest health provider in the world, the NHS holds an unparalleled amount of health data, which scientists and researchers should be able to draw on to help them find ways to treat or prevent diseases.

In practice, NHS patient data hasn’t always been as accessible to researchers as they would have wanted.

But the urgent threat of coronavirus created an impetus to put the huge repository of data at researchers’ disposal as soon as possible, in order to help them find answers to questions such as why some people are more likely to die from the disease, and whether the medications a patient takes can affect whether they develop severe symptoms or not.

OpenSafely was created in just five weeks by the University of Oxford, the London School of Hygiene and Tropical Medicine, and health records companies including TPP; NHS England is acting as the data controller. The idea of an analytics platform like OpenSafely predated COVID, but the threat of the disease, together with an understanding of the value of the data the NHS holds, spurred the organisations to kickstart the project. At the same time, the Control of Patient Information (COPI) notice from NHSX, the health service’s tech and digital unit, made information governance around patient data during coronavirus more straightforward.

“There was a need to access an unprecedented scale of data, but to do that, we had to come up with a model that was much more secure than anything that had gone before,” says Dr Ben Goldacre, director of the University of Oxford’s EBM Data Lab.

Issues around security and privacy have cast a shadow over projects looking to use NHS data for research in the past and, given the extreme sensitivity of health data, making sure that ‘anonymised’ or ‘pseudonymised’ records couldn’t be reverse engineered into giving up sensitive data on an individual was key for OpenSafely.

To do this, OpenSafely uses a series of tiered tables, each giving up less and less information on individuals, and researchers cannot run database queries on the raw event-level patient data.

“They provide a description of what their analytic cohort should look like, in code, and then that runs remotely. They can’t do a simple database query, which is where all of the security risks would reside,” Goldacre says.
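As a rough illustration of that idea, the Python sketch below separates a declarative cohort description from the code that executes it next to the data. It is not the OpenSafely API: the names CohortSpec, build_demo_db and run_remotely, the table layout, the clinical codes and the rows are all invented for this example, and an in-memory SQLite database stands in for the record holder's database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortSpec:
    """Hypothetical declarative cohort description: intent only, no SQL."""
    min_age: int
    condition_codes: tuple[str, ...]  # clinical codes that define the condition


def build_demo_db() -> sqlite3.Connection:
    """Stand-in for the secure database; every row here is invented."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
        CREATE TABLE events (patient_id INTEGER, code TEXT);
        INSERT INTO patients VALUES (1, 72), (2, 34), (3, 81), (4, 65);
        INSERT INTO events VALUES
            (1, 'ASTHMA'), (2, 'ASTHMA'), (3, 'DIAB'), (4, 'ASTHMA');
        """
    )
    return conn


def run_remotely(spec: CohortSpec, conn: sqlite3.Connection) -> dict[str, int]:
    """Runs where the data lives. The researcher never writes the query:
    the spec is translated into a fixed, parameterised statement and only
    an aggregate is returned."""
    placeholders = ",".join("?" for _ in spec.condition_codes)
    query = f"""
        SELECT COUNT(DISTINCT p.patient_id)
        FROM patients AS p
        JOIN events AS e ON e.patient_id = p.patient_id
        WHERE p.age >= ? AND e.code IN ({placeholders})
    """
    params = (spec.min_age, *spec.condition_codes)
    (count,) = conn.execute(query, params).fetchone()
    return {"cohort_size": count}


if __name__ == "__main__":
    spec = CohortSpec(min_age=60, condition_codes=("ASTHMA",))
    print(run_remotely(spec, build_demo_db()))
```

The point of the design is that the researcher controls only the values in the description, while the shape of the query stays under the control of whoever operates the secure environment.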

To keep NHS patients’ data as secure as possible, OpenSafely has shifted from a model based on trust (where trusted researchers are approved to work on raw data) to one more based on proof.

“That’s partly a concept that you inherit from working with software developers. You put tests in your code, you want proof that something works, you don’t want to rely on trust,” Goldacre says.

“I think it would have been unambiguously completely impossible and incredibly dangerous to analyse the primary care records of 40% of the population using the traditional model of large data extracts. That would have been unimaginably dangerous and I think even a general purpose trusted research environment would have been very, very risky.”

Researchers will only be able to analyse the OpenSafely data inside the electronic health record company’s datacentre. Rather than the usual model of exporting datasets for researchers to work on locally (and so exposing them to all the local security risks), all the analysis takes place where the records reside, and researchers can extract only summary tables.
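The "summary tables only" rule can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the records are invented, and summarise_deaths and check_exportable are hypothetical helpers, not part of OpenSafely. Real output checking is likely to involve much more than the single column check shown here.

```python
from collections import Counter

# Invented row-level records; in this model they never leave the datacentre.
RECORDS = [
    {"patient_id": 1, "age_band": "60-69", "died": False},
    {"patient_id": 2, "age_band": "60-69", "died": True},
    {"patient_id": 3, "age_band": "70-79", "died": True},
    {"patient_id": 4, "age_band": "70-79", "died": True},
    {"patient_id": 5, "age_band": "70-79", "died": False},
]


def summarise_deaths(records):
    """Collapse row-level records into one summary row per age band."""
    totals = Counter(r["age_band"] for r in records)
    deaths = Counter(r["age_band"] for r in records if r["died"])
    return [
        {"age_band": band, "patients": totals[band], "deaths": deaths[band]}
        for band in sorted(totals)
    ]


def check_exportable(table, forbidden=("patient_id",)):
    """Hypothetical export gate: refuse any table that still carries
    patient-level identifier columns."""
    for row in table:
        leaked = set(row) & set(forbidden)
        if leaked:
            raise ValueError(f"row-level columns present: {sorted(leaked)}")
    return table


if __name__ == "__main__":
    # The aggregated table passes the gate; the raw records would not.
    for row in check_exportable(summarise_deaths(RECORDS)):
        print(row)
```

Passing RECORDS straight to check_exportable raises a ValueError, which mirrors the idea that raw extracts are never what the researcher takes away.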

OpenSafely is also available under an open-source licence, with all code published on GitHub alongside the study definition for the first study run on the data.

Projects like OpenSafely could ultimately help push the research community to a more open, less proprietorial stance with their data and analysis. “In some respects, we have built OpenSafely to help and encourage epidemiologists to become better at sharing their work, not by hectoring them, but just by making it a completely normal part of the workflow,” Goldacre says.

The system makes a feature of showing your working, which demands more openness than clinicians and researchers might traditionally have felt comfortable with.

OpenSafely is built to encourage researchers to share everything they do as they go. When users make a code list (a list of the clinical codes that identify a particular condition, for example) or an analytic script, it’s all shared on GitHub.

“Everything that you do is shared by design,” Goldacre adds.

It hasn’t taken long for OpenSafely to bear its first fruit: a study of 17 million records published last month found that people from Black and Asian backgrounds were more at risk of dying of COVID-19, even when their additional medical risk factors and any social deprivation had been accounted for. It also identified key risk factors for death from COVID, including being male, being older, and having severe asthma or poorly controlled diabetes.

Read the full article here.