The anonymisation of personal data is an essential aspect of good scientific practice. According to § 3, paragraph 6 of the BDSG (German Federal Data Protection Act), anonymisation refers to the practice of changing personal data in such a way that "individual characteristics cannot, or only by use of a disproportionate amount of time, cost and effort, be attributed to a particular natural person." Another way of protecting personal or sensitive data is pseudonymisation.
An archive is a system which makes the organised storage and retrieval of historical data, documents and objects possible. The way its contents are organised depends on the underlying policy. Archives can be provided as a service or set up and operated independently. For long-term preservation of 10 years or more, special archiving systems are required. A particular form of archive is a repository.
Artificial Intelligence (AI)
Artificial intelligence (AI) is a research field of computer science. Its goal is to teach computers to complete tasks that are considered uniquely human, i.e. tasks requiring intelligence. The term AI does not specify how this goal is to be reached.
As a subdiscipline of AI, machine learning (ML) enables systems to recognise patterns in existing data sets and algorithms and to develop solutions based on those patterns. In essence, artificial knowledge is generated from experience. Artificial neural networks with multiple layers between input and output are used to extract deep and complex structures; this method is called deep learning.
Authority data were developed to unambiguously identify persons, institutions, sponsors, etc. Authority data are collected when texts or artifacts are digitised, catalogued and archived. In addition to a person's name, a unique identifier is recorded to avoid false attributions; regardless of spelling variants, the relevant information remains findable via the search term. The "Integrated Authority File" (GND) of the German National Library is the central authority file in Germany.
The securing of data is called backup; backups can be used to restore the original data in case of loss. There are different methods for securing data:
- A full backup is usually done automatically in regular intervals and stored separately from the original data, so that physical harm (e.g. by fire) doesn’t destroy the data completely.
- A differential backup only saves data that was changed or added after the last full backup. It is only a selective supplement to the full backup and therefore faster and less storage-intensive.
- In contrast, incremental backups only save files or parts of files that were added or changed since the last incremental backup. This form of backup requires the least amount of storage space. However, a restoration has to be done step by step and relies on recombining the partial backups.
- An image backup is used to secure entire storage media (hard drive, network drive, etc.). In addition to all data, user settings, programs and the operating system are backed up as well. An image backup can be used to restore a computer after a total failure.
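The difference between differential and incremental backups can be sketched in a few lines of Python. The file names and timestamps below are invented placeholders; the point is only which files each strategy would copy:

```python
from datetime import datetime

# Hypothetical file records: name -> last-modified timestamp.
files = {
    "thesis.tex": datetime(2024, 3, 10),
    "data.csv":   datetime(2024, 3, 1),
    "notes.md":   datetime(2024, 2, 20),
}

last_full_backup = datetime(2024, 2, 25)         # when the full backup ran
last_incremental_backup = datetime(2024, 3, 5)   # most recent incremental

# Differential: everything changed since the last FULL backup.
differential = [n for n, mtime in files.items() if mtime > last_full_backup]

# Incremental: only what changed since the last backup of ANY kind.
incremental = [n for n, mtime in files.items() if mtime > last_incremental_backup]

print(sorted(differential))  # ['data.csv', 'thesis.tex']
print(sorted(incremental))   # ['thesis.tex']
```

The differential set grows until the next full backup, while each incremental set stays small but must be replayed in order during restoration.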
The term best practice refers to an already tested and proven method for running a work process. Such a method is "a technique or methodology that has proven through experience and research to be reliable in achieving a desired result". Committing to best practice means committing to use all available knowledge and technology to guarantee a successful execution. For research data management, this term characterises the standards needed to create high-quality records. Metadata standards are predominant in this context.
The CARE Principles for Indigenous Data Governance were developed as a supplement to the FAIR Principles by the Research Data Alliance International Indigenous Data Sovereignty Interest Group and published by the Global Indigenous Data Alliance (GIDA).
The acronym stands for Collective Benefit, Authority to Control, Responsibility, and Ethics. Based on the CARE Principles, researchers are sensitized to ensure that the rights and interests of indigenous peoples are respected in Open Data and Open Science efforts. The CARE Principles are intended to prevent the right to self-determination of indigenous people and groups of people from being disregarded due to different power relations or historical inequalities.
In the area of research data management certification generally relates to repositories. By abiding by specific standards repositories can receive a certificate. It confirms the quality and the trustworthiness of the repository.
The citation of data publications varies depending on research field and discipline. To date there are no consistent standards for the citation of research data.
German copyright law regulates the use of literary, artistic and scientific works which fall under its specifications. If authors of such works do not grant users further usage rights by using a license such as Creative Commons, re-use is only possible within the restrictive limits of German copyright law.
Whether research data are subject to copyright or not depends on whether the threshold of originality is reached or whether the data fall under database protection law ("sui generis"). If in doubt as to whether either of these laws apply, it is best to consult a specialist attorney.
In order to ensure maximum reusability of scientific research data (which may be protected by copyright law) authors should consider granting more usage rights by choosing a less restrictive license. Licensed data are usually reused and cited more which can lead to better visibility and reputational gains for the data author even beyond their own research community.
Creative Commons Licences
In order to ensure maximum reusability of scientific research data, which might be subject to copyright law, the additional allocation of usage rights by using a suitable license should be considered. One possibility of determining reuse conditions of published research data is the use of liberal licensing models such as the widely accepted Creative Commons (CC) model.
Database Rights (sui generis database rights)
The sui generis database rights protect databases for a duration of 15 years from unauthorised use or duplication if an "essential investment" of money, time, energy etc. was necessary for their creation. German sui generis database rights are based on the European Database Directive (96/9/EC). They do not depend on a database's content, which may itself be subject to copyright law; instead they protect the systematic and methodological compilation itself.
Data curation describes the management activities necessary to maintain research data long-term so that they are available for preservation and reuse. In the broadest sense, curation is a compilation of processes and actions performed to create, manage, maintain and validate a component. It is therefore the active and ongoing management of data throughout its life cycle. Data curation facilitates the search, discovery and availability of data as well as quality control, value and reuse over time.
Data curation profile
A data curation profile describes the 'history' of a data set or data collection, i.e. the origin and life cycle of a data set within a research project. Developed by Purdue University Libraries, the profile and its associated toolkit comprise both a tool and a collection of completed profiles. The tool is an interview instrument that guides the user through a very thorough 'data exploration' and becomes a 'profile' as it is filled in. The collection can be searched for completed data curation profiles, e.g. to inform research data management services about the data curation needs of a specific discipline or research method.
Data journals are publications with the main goal of providing access to data sets. In general, they aim to establish research data as an academic achievement in their own right and to facilitate their reuse. Moreover, they attempt to improve the transparency of academic methods and processes and the associated research results, support good data management practices and provide long-term access to data.
Data Management Plan
A data management plan systematically describes how research data are managed within research projects. It documents the storage, indexing, maintenance and processing of data. A data management plan is essential in order to make data interpretable and re-usable for third parties. It is therefore recommended to assign data management responsibilities before the start of a project. The following questions can serve as an orientation:
- Which data will be generated and used within the project?
- Which data have to be archived at the end of a project?
- Who is responsible for the indexing of metadata?
- For what period of time will the data be archived?
- Who will be able to use the data after the end of the project and under which licensing conditions?
Data Protection Law
The term data protection refers to technical and organisational measures to prevent the misuse of personal data. Misuse is defined as gathering, processing or using such data in an unauthorised way. Data protection is regulated by the EU General Data Protection Regulation (GDPR), by the German Federal Data Protection Act as well as the corresponding laws on state level, for example the Data Protection Act of Baden-Württemberg.
Personal data are gathered and used especially in medical and social science studies. It is mandatory to encode/encrypt data of this kind and store them in an especially secure location. Subsequent pseudonymisation and anonymisation can ensure that individuals cannot be identified which can make a publication of these kinds of data possible.
Data stewards are experts in Research Data Management. They work at research institutions to support researchers in the sustainable handling of their data. Embedded data stewards help researchers with discipline-specific inquiries on the faculty, department or project level. The tasks of data stewards include support, training, needs-assessment as well as requirements engineering.
A digital artefact is the end result of the process of digitisation, during which an analogue object (a book, manuscript, picture, sculpture etc.) is transformed into digital values in order to store it electronically. As opposed to an analogue object, a digital artefact can be distributed in the form of digital research data and processed by machines. Another advantage of working with digital artefacts is that further alteration of or damage to sensitive analogue objects can be avoided.
The DINI certificate is a widely recognised quality seal for repositories. It guarantees a high quality service for authors, users and funders. It indicates that open access standards, guidelines and best practices have been implemented. The 2013 version can also be used to certify that all services maintained by a hosting provider comply with certain minimum requirements from the criteria catalogue. These criteria are marked as DINI-ready for the hosting provider and don't have to be certified separately during the certification process.
Digital object identifier (DOI)
A Digital Object Identifier (DOI) is one of the most common systems for persistent identification of digital documents. A DOI remains the same over the entire lifetime of a designated object. The DOI system is managed by the International DOI Foundation. Another well-known system for persistent identification is the Uniform Resource Name (URN).
A (temporary) embargo is a time span in which only descriptions of the research data, i.e. descriptive metadata, are accessible, for example in repositories. The corresponding data publication is not available. An embargo can be used if the publication of research data is supposed to be delayed (e.g. during a peer review process).
In science, an "enhanced publication" is an electronic publication with the digital research data attached and publicly accessible.
The term FAIR (Findable, Accessible, Interoperable and Reusable) Data was coined by the FORCE11 community for sustainable research data management in 2016. The main goal of the FAIR data principles is to promote professional management of research data in order to make them more findable, accessible, interoperable and reusable. The FAIR principles were adopted by the European Commission and integrated into the Horizon 2020 funding guidelines.
File Format (File Type)
The file format (sometimes also called file type) is created when a file is saved and contains information about the structure of the data present in the file, its purpose and affiliation. With the help of the information available in the file format, application programmes can interpret the data and make the contents available. To indicate the format of a file, a specific extension consisting of a dot and two to four letters is added to the actual file name.
In the case of the so-called proprietary formats, the files can only be processed with the associated application, support or system programmes (for example .doc/.docx, .xls/.xlsx). Open formats (for example .html, .jpg, .mp3, .gif), on the other hand, make it possible to open and edit the file with software from different providers.
File formats can be changed by the user through conversion when saving, but this can lead to data loss. In research, one should pay attention above all to compatibility, suitability for long-term preservation and loss-free conversion into alternative formats.
Good Scientific Practice
The guidelines on safeguarding good research practice serve as an orientation for scientific research and academic workflows. In Germany such a set of rules can be found in recommendations 15 to 17 by the German Research Foundation (DFG). It stipulates that "Researchers back up research data and results made publicly available, as well as the central materials on which they are based and the research software used, by adequate means according to the standards of the relevant subject area, and retain them for an appropriate period of time. Where justifiable reasons exist for not archiving particular data, researchers explain these reasons. HEIs (Higher Education Institutions) and non-HEI research institutions ensure that the infrastructure necessary to enable archiving is in place". This is meant to ensure the reproducibility of research results. Publishing data also facilitates the reuse of research data.
Harvesting protocols are used to automatically extract data. One of the most commonly used protocols is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) which is based on XML. Because there are many different metadata standards, OAI-PMH chose Dublin Core as the smallest common denominator in metadata representation.
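A harvester receives such OAI-PMH responses as XML and extracts the Dublin Core fields. A minimal sketch using only Python's standard library; the response shown here is a fabricated example, not the output of any real repository:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up OAI-PMH ListRecords response containing one
# Dublin Core record, as a harvester might receive it over HTTP.
response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Ocean Temperature Measurements 2021</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Map prefixes to the namespaces used in OAI-PMH responses.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(response)
titles = [t.text for t in root.findall(".//dc:title", ns)]
print(titles)  # ['Ocean Temperature Measurements 2021']
```

In a real harvester the response would be fetched from a repository's OAI-PMH endpoint (verb `ListRecords`, metadata prefix `oai_dc`) and paginated via resumption tokens.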
High-performance computing (HPC) is a specific area in computational calculations where big amounts of computational power and storage capacity are needed to solve complex and extensive calculations. To achieve this goal parallel processing plays an important role. Optimized computer clusters depend on fast connections and extremely short response times between the different computational nodes.
Baden-Württemberg has started the project bwHPC-S5 (Scientific Simulation and Storage Support Services). The primary goal of this project is the establishment of a state-wide computing and storage infrastructure. The inter-university coordination efficiently ensures a consistent state-wide user support. The project is funded by the ministry of science, research and arts (MWK) of the state of Baden-Württemberg. This support is in line with the High-performance computing and Data Intensive Computing strategy of Baden-Württemberg.
Informed consent is the process by which the researcher provides appropriate information about what happens to the participant's (personal) data during and after the duration of the study/research/project, in order for the potential participant/subject to make an informed choice about whether or not to be part of the research. It is usually sought before the research begins and needs to be freely given, informed, unambiguous, specific and confirmed by a clear affirmative action such as, for instance, signing an information/consent form. It is crucial to gain informed consent from potential participants in order to fulfil one's legal and ethical obligations as a researcher.
Ingest is the part of the data life cycle in which research data are transferred into an archive or repository. After the arrival of the data has been registered, a decision must be made about the form in which the data will be taken over.
Depending on the specific content the workflow varies. Typically, quality checks are performed (e.g. checking metadata) and preparations are made (e.g. more metadata added).
JSON is a compact, easily readable and software-independent data format for data exchange between applications. Web applications in particular use it to convey structured data between systems. For the same amount of information, JSON needs less storage space than XML, but its range of application is more limited.
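The compactness claim is easy to illustrate. Below, the same made-up metadata record is serialised as JSON and, for comparison, written out as an equivalent XML string:

```python
import json

# One small (invented) metadata record.
record = {"title": "Ocean Temperatures", "year": 2021, "creator": "Doe, Jane"}

json_str = json.dumps(record)
xml_str = ("<record><title>Ocean Temperatures</title>"
           "<year>2021</year><creator>Doe, Jane</creator></record>")

print(len(json_str) < len(xml_str))  # True: JSON is the more compact encoding
print(json.loads(json_str)["year"])  # 2021
```

The saving comes from JSON not repeating element names in closing tags; for deeply nested records the difference grows accordingly.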
The aim of long-term preservation ist to ensure access to archived data over a long period of time. The limited durability of storage media, technological change and safety risks complicate this task which is why extensive and forward-thinking planning is necessary. In order to avoid data loss and ensure long-term data recall, a suitable archiving system (metadata, structure) has to be employed. During the planning stage different aspects like IT infrastructure, hardware and software have to be considered. Additionally, societal developments should also be taken into account.
Machine actionable data
Machine-actionable data can be found and used automatically by computer systems with no or minimal human assistance. The prerequisite for machine actionability is a uniform data structure. The machines or computers that are supposed to read and use these data are programmed based on this structure.
Mapping is the process of transforming data from one model into another. It is the first step in integrating external information into one's own information system. It includes data transformation during electronic data exchange, which typically uses XML as the markup language and JSON as the data format.
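A minimal sketch of such a mapping: a record with project-internal field names is translated into Dublin Core elements via a lookup table. All field names and values here are invented for illustration:

```python
# A hypothetical source record using project-internal (German) field names.
source = {"Titel": "Stadtplan Konstanz 1880", "Autor": "Unbekannt", "Jahr": "1880"}

# Mapping table: source field -> Dublin Core element (illustrative).
field_map = {"Titel": "dc:title", "Autor": "dc:creator", "Jahr": "dc:date"}

# Rename every field that has a known target; unmapped fields are dropped.
target = {field_map[k]: v for k, v in source.items() if k in field_map}

print(target)
# {'dc:title': 'Stadtplan Konstanz 1880', 'dc:creator': 'Unbekannt', 'dc:date': '1880'}
```

Real mappings additionally handle value transformations (date formats, controlled vocabularies) and fields without a one-to-one counterpart.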
Metadata are independent data which contain structured information about other data and/or resources and their characteristics. Metadata are stored either independently of or together with the data they describe. An exact definition of metadata is difficult since the term is used in different contexts and distinctions can vary according to perspective.
Usually there is a distinction between discipline-specific and technical/administrative metadata. Whereas the latter are definitely considered to be metadata, the former might also be viewed as research data.
In order to raise the effectiveness of metadata, a standardisation of descriptions is necessary. By using metadata standards, metadata from different sources can be linked and processed together.
For interoperability, i.e. the linking and common processing of metadata, metadata standards for specific purposes were set up. Metadata standards aim at a uniform description of similar data, both in terms of content and structure. A metadata standard as such can often provide a so-called mapping to another metadata standard.
National Research Data Infrastructure (NFDI)
The NFDI has the objective to become a widespread and interconnected infrastructure that offers portfolios and support for creating and using research data. This is achieved through research field or method-specific consortia.
The NFDI is supposed to “systematically manage scientific and research data, provide long-term data storage, backup and accessibility, and network the data both nationally and internationally. The NFDI will bring multiple stakeholders together in a coordinated network of consortia tasked with providing science-driven data services to research communities.” DFG
The creation of the NFDI was initiated by the Joint Science Conference (Gemeinsame Wissenschaftskonferenz, GWK) and is financed by the states and the federal government. The DFG (German Research Foundation) is responsible for the assessment and appraisal of consortia proposals. The selection process has three rounds. Nine first-round NFDI consortia started up in October 2020. There will be two more rounds in 2020 and 2021.
The term open access refers to free and unimpeded access to digital scientific content. Users are usually given a wide range of usage rights and provided with easy modes of access. The copyright, however, generally remains in the hands of the author. Through open access scientific information can be widely disseminated, used and re-processed. As such it represents an important achievement of the open science movement.
When publishing scientific content, there are two open access options:
- Publishing the content in a genuine open access medium is referred to as the "golden path" of open access.
- Publishing the content in a traditional, subscription-based medium with an open access version paid for by the author is called the "green path".
Open data refers to data that may be used, disseminated and reused by third parties for any purpose (e.g. for information, analysis or even commercial reuse). Restrictions to use are only permitted in order to preserve the provenance and openness of the knowledge; for example, the CC-BY license requires the author be named. The goal of open data is that free reuse allows for greater transparency and more collaboration.
The Open Researcher and Contributor iD (ORCID) is an internationally recognised persistent identifier that helps to clearly identify researchers. It is independent of journals and institutions and can be used by researchers long-term. The iD consists of 16 digits in groups of four (e.g. 0000-0002-2792-2625). ORCID has been adopted by many publishers, universities and research institutes and is included in workflows, e.g. the assessment of journal articles.
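The last character of an ORCID iD is a check digit computed with the ISO 7064 MOD 11-2 algorithm over the first 15 digits, which lets software detect typos. A sketch, using the example iD above:

```python
def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check digit, as used for ORCID iDs."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

# First 15 digits of the example iD 0000-0002-2792-2625:
print(orcid_check_digit("000000022792262"))  # 5  -- matches the final digit
```

A remainder of 10 is written as "X", which is why some ORCID iDs end in that letter.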
The project “ORCID DE – Förderung der Open Researcher and Contributor ID in Deutschland” was created to promote ORCID in Germany and was supported by the German Research Foundation (DFG) in 2016 for three years.
The aim of the DFG project ORCID DE is to support the implementation of ORCID at universities and other research institutes. In Baden-Württemberg, the AK FDM has recommended the use of the ORCID iD.
Persistent identification is the process of assigning a permanent, digital identifier consisting of numbers and/or alphanumerical characters to a data set (or any other digital object).
Frequently used identification systems are DOI (Digital Object Identifier) and URN (Uniform Resource Name). As opposed to other serial identifiers (such as URL addresses) a persistent identifier refers to the object itself rather than to its location on the internet. Even if the location of a persistently identified object changes, the identifier remains the same. All that needs to be changed is the URL location in the identification database. In this way it can be ensured that data sets are permanently findable, retrievable and citable.
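The principle of indirection described above can be reduced to a toy resolver: a registry maps each persistent identifier to the object's current location, and only the registry entry changes when the object moves. The identifier and URLs below are invented placeholders:

```python
# Toy resolver registry: persistent identifier -> current location.
registry = {"doi:10.1234/example.5678": "https://old-repo.example.org/ds/42"}

def resolve(pid: str) -> str:
    """Return the current URL for a persistent identifier."""
    return registry[pid]

# The dataset moves to a new repository: only the registry entry is
# updated, while the identifier cited in publications stays valid.
registry["doi:10.1234/example.5678"] = "https://new-repo.example.org/datasets/42"

print(resolve("doi:10.1234/example.5678"))
```

Real systems such as DOI and URN work the same way conceptually, with the registry maintained by a resolver infrastructure rather than a local dictionary.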
German data protection law (BDSG) defines personal data as "information on personal characteristics or circumstances of a particular natural person (affected party)." Data are considered personal if they can be attributed to a particular natural person. Typical examples are name, profession, height or nationality of a person. German data protection law also stipulates that information on ethnicity, political opinion, religious or philosophical affiliation, union membership, health and sexuality are especially sensitive and therefore subject to even stricter protection.
Policies and Guidelines
Policies establish certain rules for the handling and managing of research data for all employees of a research institution. They usually also determine which methods of research data management should be applied. In Germany most research data policies do not contain detailed regulations, but instead usually consist of a basic self-commitment to the principles of open access.
Primary research data are unprocessed and uncommented raw data which have not yet been associated with any metadata. They form the foundation of all scientific activity. The distinction between research data and primary research data usually only has theoretical merit because raw data are hardly ever published without any associated metadata. Digital artefacts are generally not published by their proprietors (such as scientific libraries) without background information such as provenance and other information.
As opposed to anonymisation, the technique of pseudonymisation substitutes letter and/or number codes for identifying characteristics such as names in order to impede, or ideally prevent, the identification of individuals (BDSG § 3, paragraph 6a). During the course of a scientific study, the reference list linking personal data to their codes should be kept separate from the actual study data. The data can be anonymised by deleting this reference list after the completion of the project, so that no individual person can be connected to the study results.
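A minimal sketch of this scheme: names are replaced by codes, and the reference list is kept in a separate structure that can later be deleted. The names and code format are invented for illustration:

```python
import itertools

# Invented participant names for illustration.
names = ["Alice Meier", "Bora Yilmaz", "Alice Meier", "Chen Wei"]

codes = {}                    # reference list: name -> pseudonym (keep separate!)
counter = itertools.count(1)

def pseudonymise(name: str) -> str:
    """Replace a name with a stable code, creating one on first use."""
    if name not in codes:
        codes[name] = f"P{next(counter):03d}"
    return codes[name]

study_data = [pseudonymise(n) for n in names]
print(study_data)  # ['P001', 'P002', 'P001', 'P003']

# Deleting `codes` after the project removes the only link back to the
# persons, effectively anonymising `study_data`.
```

Note that the same person always receives the same code, so analyses across records remain possible while the identity is hidden.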
ReadMe files contain information about research data, research datasets or research data collections in a compact and structured form and are often available as simple text files or in TEI-xml (.txt; .md; .xml). In this sense, ReadMe files can be published to accompany research data or can be used for structured storage of research data at the end of a project (e.g., on an institute server or repository for archiving). ReadMe files collect central metadata about the project in which the data originated (e.g., project name, persons involved, funding), provide information about naming standards used, folder structures, abbreviations, and norm data, and record changes to and versioning of research data.
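The elements listed above can be combined into a short ReadMe skeleton. All names, numbers and paths below are invented placeholders:

```
Project:        Lake Constance Water Quality (hypothetical example)
Contact:        Jane Doe, University of Example
Funding:        DFG project no. 000000 (placeholder)
Folder layout:  raw/    unprocessed sensor output (.csv)
                proc/   cleaned data produced by scripts/clean.py
Naming scheme:  YYYY-MM-DD_station_parameter.csv
Abbreviations:  KN = Konstanz station; T = temperature (degrees C)
Versioning:     v1.0 initial deposit; v1.1 corrected station codes
```

Plain-text key-value layouts like this keep the file readable for both humans and simple parsing scripts.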
Replication studies are scientific studies that check whether the results of a previous study are reproducible.
A repository can be viewed as a particular kind of archive. In the digital age, the term refers to an administrated storage space for digital objects. Since repositories are generally publicly accessible, or at least accessible to a specific group of users, they are closely connected to the issue of open access.
Data that are a) created through scientific processes/research (for example through measurements, surveys, source work), b) the basis for scientific research (for example digital artefacts), or c) documenting the results of research, can be called research data.
This means that research data vary according to projects and academic disciplines and therefore, require different methods of processing and management, subsumed under the term research data management. There is also a distinction between primary data and metadata, however, the latter do not strictly count as research data in many disciplines.
Research Data Lifecycle
The model of the research data life cycle illustrates the stages research data can go through, from the collection to its reuse. The stages of the data lifecycle can vary, but in general the data lifecycle comprises the following phases:
- Planning research projects (including handling of the data in the research project, see data management plan)
- Creation and collection
- Processing and analysis
- Sharing and publication
Research Data Management
The term research data management refers to the process of transforming, selecting and storing research data with the aim of making them accessible, re-usable and reproducible independently from the data author for a long period of time. To achieve that aim systematic actions can be taken at all points in the data life cycle in order to maintain the scientific value of research data, ensure their accessibility for analysis by third parties and to secure the chain of evidence.
Research data policy
A research data policy is a document that dictates how research data should be handled at a given institution.
It is supposed to contribute to the efficient management of the valuable resource research data. By now, there are research data policies for individual universities (institutional policies) as well as interdisciplinary and disciplinary policies in Germany. Even some scientific journals have their own research data policies.
Rights to data
Rights to/for data can be defined from two different perspectives. From the researcher's perspective, the right to make decisions about the data derives from its creation. From the user's point of view, they are rights that need to be respected when reusing the data. Rights can be established legally and communicated through licenses and contractual agreements.
For the reuse of data, at least the rules of good scientific practice have to be followed. This means that the author has to be cited correctly (copyright). Through the use of the Creative Commons license CC BY, this can be arranged legally. Data protection laws, patent laws and personality rights can hinder the reuse of data.
The semantic web is an attempt to systematise the World Wide Web for the purpose of facilitating automated exchange and processing. By contextualising pivotal terms, which appear on websites in unstructured form, through additional information (metadata), it clarifies, for example, whether "Berlin" refers to the capital of Germany, another city, or a name. To teach a term's context to a machine, machine-readable metadata standards are used. Because of its complexity and workload, the desired connectivity of information on the web through contextualisation is still in its early stages. However, it will most likely improve the searchability of the web long-term.
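The "Berlin" example can be expressed in machine-readable form, e.g. with JSON-LD and the schema.org vocabulary. The fragment below is purely illustrative and not tied to any particular website:

```json
{
  "@context": "https://schema.org",
  "@type": "City",
  "name": "Berlin",
  "containedInPlace": { "@type": "Country", "name": "Germany" }
}
```

Embedded in a web page, such a block tells a crawler unambiguously that this "Berlin" is a city in Germany rather than a person's name.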
Specialised Information Services (SIS)
The "Specialised Information Services" (SIS) are a funding programme for scientific libraries run by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG). SIS are supposed to improve the information infrastructure for research. They are the follow-up to the Special Subject Collections at scientific libraries in Germany, which had been funded since 1949. Many SIS offer discipline-specific information about research data management.
Threshold of Originality
The threshold of originality is a measure of the degree to which an object or work incorporates the personal characteristics of its author. Whether a work reaches this threshold is a decisive criterion for its protection under German copyright law. An important aspect of the threshold of originality is that the work is a result of its author's creativity and personality rather than an outcome of external circumstances (objective, functionality etc.). This is why research data very rarely fall under German copyright law.
URN (Uniform Resource Name)
URN is the acronym of an identification and addressing system used in a similar way to a DOI (Digital Object Identifier) for the persistent identification of digital objects (network publications, data sets, etc.). URNs are particularly widespread in German-speaking countries, as the German National Library uses, administers and resolves URNs for persistent identification and addressing.
When working with data, they inevitably change. It is recommended to mark the respective work statuses with the help of versioning and thus make them traceable. A predefined, easy-to-understand versioning scheme (e.g. version 1.3 or version 2.1.4) should be used for this purpose. Data can be versioned either manually or using versioning software such as git. Versioning should be done during the research process itself, for example, to identify different working versions of data, and also in the case of subsequent changes to research datasets that have already been published, to enable subsequent users to cite the correct version of a research dataset.
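The "major.minor.patch" scheme mentioned above can be sketched as a small helper that bumps the appropriate component. The scheme and comments are illustrative, modelled on semantic-versioning conventions:

```python
def bump(version: str, part: str) -> str:
    """Bump a 'major.minor.patch' version string (illustrative scheme)."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

print(bump("2.1.4", "patch"))  # 2.1.5 -- e.g. small correction to published data
print(bump("2.1.4", "minor"))  # 2.2.0 -- e.g. new variables added
print(bump("2.1.4", "major"))  # 3.0.0 -- e.g. incompatible restructuring
```

Tools such as git automate this bookkeeping, but a documented manual scheme like the one above is often sufficient for published research datasets.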
Virtual Research Environment (VRE)
Virtual research environments are software solutions or platforms that facilitate location-independent collaboration between researchers. A VRE is a service provided by infrastructure facilities (e.g. libraries) for specific research alliances and communities. The software combines discipline-specific tools, tool collections and work environments. Cross-disciplinary applications are currently still a distant goal.
XML (Extensible Markup Language)
XML is a markup language for saving hierarchically structured information as text files. It is most commonly used for platform-independent data exchange between applications or computers. The coding is both human- and machine-readable. It is possible to check the validity of the content of an XML file if additional content-related rules have been defined in an external file; in this way, both the form and the content of the coded information can be described. Thanks to XSL (Extensible Stylesheet Language), the information can be interpreted and rendered in other formats.
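A basic machine check on XML is well-formedness (tags properly nested and closed), which Python's standard library can verify directly. Validation against an external rule file (DTD or XML Schema) requires additional tooling, e.g. the third-party lxml library. A sketch with invented example documents:

```python
import xml.etree.ElementTree as ET

good = "<dataset><title>Soil samples</title><year>2021</year></dataset>"
bad = "<dataset><title>Soil samples</dataset>"  # <title> is never closed

def well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

Well-formedness is a purely syntactic property; whether a `<year>` element may appear inside `<dataset>` is a question for schema validation.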