
UZH News

Data Management

Navigating a Sea of Data

Processing, storing and ensuring access to large amounts of data is becoming increasingly important for many researchers. The Data Stewards Network at UZH is there to help them find their way through the data jungle.
Text: Theo von Däniken; Translation: Gena Olson

Guanghao You, Andrea Malits and Andrea Farnham (left to right) in the server room at UZH's Irchel Campus – these members of the UZH community are working to make research data widely accessible. (Image: Diana Ulrich)

Scientists have always generated, collected and analyzed data. With digitalization, however, the amount of available data has increased dramatically. Data analysis capabilities are also growing rapidly. “Data is a valuable raw material nowadays,” says Andrea Malits, the open science coordinator at the University Library Zurich. It can be collected, analyzed and combined in a variety of ways and forms the foundation for self-learning systems like AI. For this reason, data is no longer interesting only for the researchers who collect and analyze it; it can also be valuable for other research questions. Data collected by linguists for studies on dialect development could also be of interest to human geographers researching migration patterns, for instance.

FAIR Data

However, for data to be useful to other researchers, it must be available and properly prepared. Here the magic word is FAIR: an acronym that stands for four requirements that need to be met for data to be usable in research. Data needs to be findable, accessible, interoperable and reusable.

In practice, though, each of these four requirements poses challenges to researchers. This is why UZH established the Data Stewards Network last year. The network connects scientists who have grappled with FAIR requirements in their own work of processing data and making it accessible. They hope to use their knowledge to help other researchers navigate this new jungle of data.


Data is a valuable raw material nowadays.

Andrea Malits
Head of Open Science Services at the University Library Zurich

The first challenge often comes when making data findable. “The research data that we have at UZH is highly valuable,” explains Malits. “But we can’t make full use of it because we don’t have an overview of what and where all the data is.” Making data from different fields easily discoverable is particularly important for interdisciplinary research projects.

For scientific publications, UZH has ZORA (Zurich Open Repository and Archive), an open directory with information on all publications by UZH researchers; for research data, no such central repository exists. It's not about creating a standardized storage location for the data itself, Malits emphasizes, but rather about having a university-wide directory that lists which data is available, in which formats, and where it is stored.
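
To make the idea more concrete, a minimal sketch of such a directory is shown below in Python. The entries, field names and URLs are invented for illustration and do not correspond to an actual UZH catalog schema; the point is simply that a lightweight index over decentralized data holdings can already answer questions like “who has data on migration?”

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One entry in a hypothetical university-wide data directory.

    The record describes where a dataset lives and what it contains;
    the data itself stays in whatever repository the research group uses.
    """
    title: str
    discipline: str
    formats: list[str]
    repository_url: str
    contact: str
    keywords: list[str] = field(default_factory=list)

# A toy catalog with two invented entries.
CATALOG = [
    DatasetRecord(
        title="Swiss German dialect interviews 2015-2020",
        discipline="Linguistics",
        formats=["wav", "TextGrid", "csv"],
        repository_url="https://example.org/lars/dialects-2015",
        contact="dialects@example.org",
        keywords=["dialect", "migration", "audio"],
    ),
    DatasetRecord(
        title="Urban mobility survey, canton of Zurich",
        discipline="Human Geography",
        formats=["csv"],
        repository_url="https://example.org/geo/mobility",
        contact="mobility@example.org",
        keywords=["migration", "survey"],
    ),
]

def find_datasets(keyword: str) -> list[DatasetRecord]:
    """Return all catalog entries whose title or keywords mention `keyword`."""
    kw = keyword.lower()
    return [
        r for r in CATALOG
        if kw in r.title.lower() or any(kw == k.lower() for k in r.keywords)
    ]

if __name__ == "__main__":
    # A human geographer searching for migration-related data would also
    # discover the linguistics corpus this way.
    for record in find_datasets("migration"):
        print(record.title, "->", record.repository_url)
```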

“Good incentives should be created for researchers to register their data collections in a directory like this,” says Malits. After all, it would be additional work that researchers would have to do, often without yielding any direct benefits. A data directory could also help communicate the research achievements of UZH to the outside world.

The principle of openness

Open data and the FAIR principles are a matter close to the heart of epidemiologist Andrea Farnham, who works at the Population Research Center at UZH. That’s why she dedicates part of her time to preparing her data for other researchers. “It is important for our data to be accessible, since it was ultimately collected thanks to public funds,” she says. In practice, however, implementing these principles involves significant hurdles in her particular case.

Farnham is the scientific director of the SwissPrEPared project, which aims to reduce HIV infections and other sexually transmitted diseases. The program involves a total of 10,000 participants who receive PrEP, a medication that prevents HIV infection. The participants are regularly surveyed about their health, sexual behavior and drug use habits as part of an accompanying longitudinal study. The aim of the study is to improve our understanding of the needs and behaviors of risk groups and to design prevention measures and healthcare services based on these insights.

This involves collecting and safeguarding highly sensitive health and medical data. “Because the data is very detailed, it is very difficult to fully anonymize,” explains Farnham. In addition, there are data protection regulations, and study participants must consent to any use of their data outside of the study.

Protecting privacy

This means that the possibilities for sharing the data and making it available to others are very limited. “In theory, the FAIR principles are ideal,” says Farnham. “In practice, though, they cannot always be implemented since our obligation to protect the privacy of participants is more important.” Anyone who wants access to the study data must submit an application, which is reviewed by external experts and a scientific committee. And even then, the data is often only made available in parts. “We’ve never shared the entire data set,” she says.


It’s important to make our data accessible to other researchers.

Andrea Farnham
Epidemiologist

Even if direct access to the data is not possible, Farnham is still concerned with making the data findable. “We could at least publish the metadata that describes the type of data we have collected,” she says. But this also entails additional effort, since the scope of the study is constantly being adapted to current developments. For example, when monkeypox (mpox) broke out last year, the questionnaire was expanded accordingly. “We need to update our metadata at least once per year to keep it up to date,” says Farnham.

It was only thanks to the Data Stewards Network that Farnham came up with the idea of making the data findable by publishing its metadata: “There I learned that publishing your metadata is considered FAIR best practice.” She also received tips from her data steward colleagues on how the metadata should be structured so that it is actually findable and interoperable – meaning that it can be used for other studies as well.
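
What publishing descriptive metadata – rather than the data itself – might look like can be sketched roughly as follows. The field names are loosely modeled on common repository metadata schemas and are assumptions for illustration, not the actual SwissPrEPared metadata.

```python
import json
from datetime import date

# Descriptive metadata only: no participant-level data leaves the project.
# All field names and values are illustrative assumptions.
study_metadata = {
    "title": "SwissPrEPared accompanying longitudinal study (metadata only)",
    "creators": ["SwissPrEPared study team"],
    "description": (
        "Regular surveys of PrEP users on health, sexual behavior and "
        "drug use. Individual-level data is available on request only."
    ),
    "keywords": ["HIV prevention", "PrEP", "longitudinal study"],
    "variables": [
        {"name": "visit_date", "type": "date"},
        {"name": "sti_screening_result", "type": "categorical"},
        # Added when the questionnaire was expanded during the mpox outbreak.
        {"name": "mpox_vaccination_status", "type": "categorical"},
    ],
    "access": "restricted: application reviewed by external experts",
    "metadata_version": "2024-1",
    "last_updated": date.today().isoformat(),
}

# The resulting JSON record is what a repository or catalog could index.
print(json.dumps(study_metadata, indent=2))
```

Keeping such a record current is the recurring effort Farnham describes: each time the questionnaire changes, only the variable list and the version fields need to be updated.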

Networking and mutual learning

Additionally, she learned how to share the programming code for analyzing her data on the code repository GitLab. This has significantly improved collaborative programming and the quality of the code. “We’ve also become much more efficient at troubleshooting as a result,” she says. In her role as a data steward, she now passes on the knowledge she has acquired in her own project. “Many people don't know where to get help and support with these questions,” she says.

Networking and mutual learning is one of the goals that UZH is pursuing with the data stewards initiative; it's also about motivating research communities to act and raising awareness about the issue. Many data-related problems and solutions are specific to individual research areas. How metadata is written, how large amounts of data are stored, or which repositories data is published in – all of this can vary greatly depending on the discipline.

“Ultimately, it's up to the researchers to develop standards for their field,” says Malits. However, the data stewards can help connect the specific needs of different fields with the resources and support services available at UZH.

Crucial role for funding

Much of this still depends on the personal commitment of data stewards like Farnham who value the principle of open science. The network now comprises 30 people and is coordinated by Susanna Weber from the University Library. “In the long run, these responsibilities would need to be formalized and receive appropriate funding,” says Malits. Only in this way could data be stored and made available over the long term.

Funding is also a sticking point: research projects always have a limited timeframe and a limited budget. “For individual researchers, there is little incentive to commit to ensuring that data is preserved and remains accessible,” says Malits.

Hard-to-achieve harmonization

One specialist who works specifically on data management is Guanghao You. He works at the university’s Linguistic Research Infrastructure (LiRI) for the National Centre of Competence in Research, advising researchers on how to aggregate, evaluate and store data. “My main job is to merge data from a wide variety of sources so that we can analyze it within the scope of the project,” You explains.

In his own research, You primarily focuses on the earliest stages of language acquisition in infants and toddlers. For this purpose, his research group records everyday situations involving toddlers and their parents. These recordings are transcribed for analysis and then stored together with structured annotations and information about the speakers and the recorded situation.


It helped me a lot to talk to other people about strategies and shared challenges.

Guanghao You
Linguist

In parallel, You also works with data sources that are freely available online. “But with those, the metadata is often missing or incomplete,” he says. This makes it difficult to use the data when, for example, information about the age of the speakers is missing because of anonymization measures.

You’s group also follows its own protocol for how annotations, descriptions and glossaries are written, staying as close as possible to established standards in linguistics. Data from other sources, however, is sometimes prepared and described in a completely different way, making it difficult to integrate directly into the group’s own database.

This example shows that the interoperability requirement – meaning that data can be used by different research groups – is a high hurdle even when the data meets the findability and accessibility requirements. This is because – depending on the discipline – there are no uniform standards for how data should be described, and each group can follow its own protocol. “Since data often comes from projects that have already been completed, we cannot influence the protocols,” says You.
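
A purely hypothetical sketch of what this harmonization work can involve: renaming fields and remapping annotation tags from an external record onto a group’s own schema, while tolerating gaps such as a speaker age removed during anonymization. Both schemas and the tag sets below are invented for illustration.

```python
# Invented mapping from an external corpus's field names to a group's own schema.
EXTERNAL_TO_INTERNAL_FIELDS = {
    "spkr": "speaker_id",
    "age_months": "speaker_age_months",
    "utt": "utterance",
    "pos_tag": "part_of_speech",
}

# The external corpus also uses different part-of-speech labels.
POS_TAG_MAP = {"N": "NOUN", "V": "VERB", "ADJ": "ADJ"}


def harmonize(record: dict) -> dict:
    """Rename fields and remap annotation tags; keep None where information
    is missing (e.g. speaker age removed during anonymization)."""
    out = {internal: record.get(external)
           for external, internal in EXTERNAL_TO_INTERNAL_FIELDS.items()}
    if out["part_of_speech"] is not None:
        out["part_of_speech"] = POS_TAG_MAP.get(out["part_of_speech"], "UNKNOWN")
    return out


external_record = {"spkr": "CHI01", "utt": "more juice", "pos_tag": "N"}
print(harmonize(external_record))
# {'speaker_id': 'CHI01', 'speaker_age_months': None,
#  'utterance': 'more juice', 'part_of_speech': 'NOUN'}
```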

You and his staff have had to make a considerable additional effort to be able to use outside data for their project. In the few cases where You knows the researchers personally, he can discuss the data processing with them, which makes the data easier to use.

Decentralized data storage

Data storage is another challenge for You, because the National Centres of Competence in Research encompass many different research areas. The types and amounts of data that accumulate are enormous and extremely wide-ranging. “How a data repository can meet all the different requirements was a big challenge for me,” he explains.

You came up with his solution partially thanks to exchanges with other data stewards. “Through talking to them, I came up with the idea that a unified repository isn’t necessarily needed,” says You. Instead, he has now set up a central index listing all data repositories and the data they contain. This allows researchers to store their data in repositories that meet their needs. Meeting with other data stewards also showed You that he wasn't alone with the issues he faced: “It helped me a lot to talk to other people about strategies and shared challenges.”
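
The underlying idea – one lightweight central index pointing to many decentralized repositories – can be sketched in a few lines. The dataset names, repositories and URLs below are invented and do not describe the actual NCCR setup.

```python
# A central index records which repository holds which dataset, while each
# group keeps using the storage that fits its data. All entries are invented.
REPOSITORY_INDEX = {
    "infant-speech-recordings": {
        "repository": "audio-archive",
        "url": "https://example.org/audio-archive/infant-speech",
        "size_gb": 850,
    },
    "eye-tracking-reading-study": {
        "repository": "lab-data-server",
        "url": "https://example.org/lab-data/eye-tracking",
        "size_gb": 40,
    },
}


def locate(dataset_id: str) -> str:
    """Return the storage location of a dataset, wherever it actually lives."""
    entry = REPOSITORY_INDEX.get(dataset_id)
    if entry is None:
        raise KeyError(f"No repository registered for {dataset_id!r}")
    return entry["url"]


print(locate("infant-speech-recordings"))
```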

Differences between the disciplines

According to Malits, having researchers deposit data in field-specific repositories is also in line with UZH’s approach. “When working with MRI data from medicine, you need a different skillset than when researching with text data,” she says.

In linguistics, UZH has taken a leading role domestically and operates the Language Repository of Switzerland (LaRS), which provides researchers from Swiss universities with a place to deposit their linguistic research data. Researchers using the repository receive advice and support from specialists at the University Library and LiRI. Additionally, the repository is embedded in CLARIN, the European research infrastructure for language resources.

“LaRS is a successful pilot project,” says Malits. “But it doesn't mean that these repositories now have to be built within every discipline.” In many fields, widely used international data repositories have existed for some time, and there it wouldn’t make sense to establish new infrastructure. “The approach needs to be adapted to the needs of the researchers,” says Malits.

Motivating the research community

Using a bottom-up approach, the Data Stewards Network is helping to make these issues and their solutions more widely known. Malits believes that they have found a way to “achieve a significant leverage effect.”

She says that during Data Protection Week in January, they managed to bring together expertise from different research communities. “The data stewards ‘activated’ their colleagues, who not only participated in workshops but in some cases also immediately organized their own,” she adds.

The Open Science working group, led by VP Christian Schwarzenegger and VP Elisabeth Stark, deals with issues related to data management for the university as a whole. The group includes representatives from all disciplines, who can then address the issues within their own faculties. The goal, according to Malits, is for as many researchers as possible to process and make their data accessible according to FAIR principles. Only in this way can these new treasure troves of data be used to benefit as many people as possible.