The anonymisation of personal data is an essential aspect of good scientific practice. According to § 3 of the BDSG (the German Federal Data Protection Act), anonymisation refers to the practice of altering personal data in such a way that "individual characteristics cannot, or only by use of a disproportionate amount of time, cost and effort, be attributed to a particular natural person." Another way of protecting personal or sensitive data is pseudonymisation.
An archive is a system which makes the organised storage and retrieval of historical data, documents and objects possible. The way its contents are organised depends on the underlying policy. Archives can be provided as a service or set up and operated independently. For long-term preservation of ten years or more, special archiving systems are required. A particular form of archive is a repository.
The term backup refers to the practice of making redundant copies of data and storing them on separate storage media. In case of data loss, these copies can be used to recover the original data. There are several backup methods.
A full backup copies all data automatically at regular intervals and stores them in a location separate from the original data, in order to avoid total data loss in case of physical damage through fire or similar events.
A differential backup saves only those data that have accumulated or changed since the last full backup. It is less time- and resource-intensive than performing a full backup each time.
With an incremental backup, only files or parts of files that have changed since the last incremental backup are saved. This form of backup requires the least storage space; however, in case of data loss, data can only be restored in a resource-intensive multi-step process of retrieving chains of partial backups.
An image backup saves an entire storage medium (hard drive, network drive etc.) including all data, user setups and programs and in some cases even the entire operating system. Restoring an image backup can recover a system even after total data loss.
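The difference between the full, differential and incremental strategies can be sketched in a few lines of Python. This is an illustrative sketch, not a real backup tool: the function name and the timestamp-based selection logic are our own, and a production tool would read modification times from the file system.

```python
# Illustrative sketch only: decide which files a backup run would copy.
# `files` maps a path to its last-modification timestamp; a real tool
# would obtain these via os.path.getmtime(). All names are invented.
def select_for_backup(files, last_backup_time, mode="incremental", last_full_time=None):
    if mode == "full":
        # a full backup always copies everything
        return sorted(files)
    # differential: everything changed since the last FULL backup;
    # incremental: only what changed since the last backup of any kind
    reference_time = last_full_time if mode == "differential" else last_backup_time
    return sorted(path for path, mtime in files.items() if mtime > reference_time)
```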
In Germany the German copyright law regulates the use of literary, artistic and scientific works which fall under its specifications. If authors of such works do not grant users further usage rights by using a license such as Creative Commons, re-use is only possible within the restrictive limits of German copyright law.
Whether research data are subject to copyright or not depends on whether the threshold of originality is reached or whether the data fall under database protection law ("sui generis"). If in doubt as to whether either of these laws apply, it is best to consult a specialist attorney.
In order to ensure maximum reusability of scientific research data (which may be protected by copyright law) authors should consider granting more usage rights by choosing a less restrictive license. Licensed data are usually reused and cited more often, which can lead to better visibility and reputational gains for the data author even beyond their own research community.
Creative Commons licences
In order to ensure maximum reusability of scientific research data, which might be subject to copyright law, the additional allocation of usage rights by using a suitable license should be considered. One possibility of determining reuse conditions of published research data is the use of liberal licensing models such as the widely accepted Creative Commons (CC) model.
Database protection law
Database protection law protects databases from unauthorised use and reproduction for up to 15 years if a "significant investment" of money, time and/or effort was required for their creation; unlike copyright, this protection does not depend on the threshold of originality. German database protection law is based on Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases. This directive does not apply to the contents of a database, which in themselves may be subject to copyright law, but to the act of systematically and methodically assembling the database itself.
The term data curation refers to the kinds of management activities necessary in order to maintain the reusability of research data in the long term. In the broadest sense, curation comprises a range of activities and processes which aim at creating, managing, maintaining and validating a component. It can also be described as the active and ongoing administration of data during the data life cycle. Data curation improves the searchability and retrieval of data as well as their quality, added value and re-usability over time.
Data curation profile
A data curation profile describes the ‘history’ of a data set or a data collection in the form of the provenance and the life cycle of the data within a research project. The profile and the associated toolkit were developed by Purdue University Libraries and consist of a tool as well as a data collection in its own right. The tool contains an interviewing instrument on the basis of which a thorough data analysis can be conducted, which then serves as the foundation for filling in the data curation profile. The data collection can be searched for finished data curation profiles in order to obtain information, for example, on information services available for curating data in a specific research discipline or for a specific research method.
The data format indicates the syntax and semantics of the data within a file. In order for a computer or a computer application to be able to interpret the data within a file, information about the data format, usually indicated by the file extension, is necessary. Most data formats are designed for a particular type of use and can be grouped according to certain criteria:
- executable files
- system files
- library files
- user files: image files (vector graphics [SVG, ...], raster graphics [JPG, PNG, ...]), text files, video files, etc.
Data formats can be proprietary or open:
- proprietary formats are usually provided by software manufacturers or platforms and are subject to licensing and/or patenting, or require manufacturer-specific knowledge for implementation
- open formats allow unrestricted access to their source code and can be adapted by users
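Because file extensions can be missing or wrong, tools often identify a format by its "magic number", the characteristic first bytes of a file. A minimal Python sketch: the function name is our own invention, while the byte signatures for PNG, JPEG and PDF are the published ones.

```python
# Published file signatures ("magic numbers") for three common formats.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG (raster graphic, open format)",
    b"\xff\xd8\xff":      "JPEG (raster graphic, open format)",
    b"%PDF":              "PDF (document format)",
}

def sniff_format(first_bytes):
    # Compare the start of the file against each known signature.
    for signature, name in MAGIC_NUMBERS.items():
        if first_bytes.startswith(signature):
            return name
    return "unknown"
```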
Data journals are publications with the main goal of providing access to data sets. In general, they aim to establish research data as an academic achievement in their own right and to facilitate their reuse. Moreover, they attempt to improve the transparency of academic methods and processes and the associated research results, support good data management practices and provide long-term access to data.
Data protection law
The term data protection refers to technical and organisational measures to prevent the misuse of personal data. Misuse is defined as gathering, processing or using such data in an unauthorised way. Data protection is regulated by the European data protection directive 95/46/EC, by the German Federal Data Protection Act as well as the corresponding laws at state level, for example the Data Protection Act of Baden-Württemberg.
Personal data are gathered and used especially in medical and social science studies. It is mandatory to encode/encrypt data of this kind and store them in an especially secure location. Subsequent pseudonymisation and anonymisation can ensure that individuals cannot be identified, which can make a publication of these kinds of data possible.
Data management plan
A data management plan systematically describes how research data are managed within research projects. It documents the storage, indexing, maintenance and processing of data. A data management plan is essential in order to make data interpretable and re-usable for third parties. It is therefore recommended to assign data management responsibilities before the start of a project. The following questions can serve as an orientation:
Which data will be generated and used within the project?
Which data have to be archived at the end of a project?
Who is responsible for the indexing of meta data?
For what period of time will the data be archived?
Who will be able to use the data after the end of the project and under which licensing conditions?
Data stewards are experts in Research Data Management. They work at research institutions to support researchers in the sustainable handling of their data. Embedded data stewards help researchers with discipline-specific inquiries on the faculty, department or project level. The tasks of data stewards include support, training, needs-assessment as well as requirements engineering.
A digital object is the end result of the process of digitisation, during which an analog object (a book, manuscript, picture, sculpture etc.) is transformed into digital values in order to store it electronically. As opposed to an analog object, a digital object can be distributed in the form of digital research data and machine-processed. Another advantage of working with digital objects is that further alteration or damage to sensitive analog objects can be avoided.
The DINI certificate is a widely recognised quality seal for repositories. It guarantees a high-quality service for authors, users and funders. It indicates that open access standards, guidelines and best practices have been implemented. The 2013 version can also be used to certify that all services maintained by a hosting provider comply with certain minimum requirements from the criteria catalogue. These criteria are marked as DINI-ready for the hosting provider and do not have to be certified separately during the certification process.
File Format (file type)
The file format (sometimes also called file type) is created when a file is saved and contains information about the structure of the data present in the file, their purpose and their affiliation. With the help of the information available in the file format, application programs can interpret the data and make the contents available. To indicate the format of a file, a specific extension consisting of a dot and two to four letters is added to the actual file name.
In the case of so-called proprietary formats, the files can only be processed with the associated application, support or system programs (for example .doc/.docx, .xls/.xlsx). Open formats (for example .html, .jpg, .mp3, .gif), on the other hand, make it possible to open and edit the file with software from different providers.
File formats can be changed by the user through conversion when saving, but this can lead to data loss. In research, one should pay attention above all to compatibility, suitability for long-term preservation and loss-free conversion into alternative formats.
Good scientific practice
The rules of good scientific practice serve as an orientation for scientific research and academic workflows. In Germany such a set of rules can be found in recommendation 7 of the German Research Foundation (DFG). It stipulates that "primary data as the foundation for academic publications should be stored for a minimum of 10 years on a secure and stable medium at the institution where they were created". This is meant to ensure the reproducibility of research results. Publishing data additionally facilitates their reuse.
The term harvesting describes the automated gathering of data or meta data from archives and repositories via so-called data providers such as BASE, OAIster, OpenAIRE or Scientific Commons.
For this process, harvesting protocols which gather data automatically are used. One of the most frequently used harvesting protocols is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which is based on XML. Since there is a large number of meta data standards, the OAI-PMH protocol employs the Dublin Core model as the lowest common denominator for meta data representation.
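An OAI-PMH request is an ordinary HTTP GET request whose parameters select the verb and the metadata format. The following Python sketch only builds such a request URL; the base URL is a hypothetical example, while "oai_dc" is the Dublin Core prefix that every OAI-PMH repository must support.

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb="ListRecords", metadata_prefix="oai_dc", **extra):
    # Assemble an OAI-PMH GET request; `extra` may carry optional
    # arguments such as `set` or `resumptionToken`.
    params = {"verb": verb, "metadataPrefix": metadata_prefix, **extra}
    return base_url + "?" + urlencode(params)
```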
The term ingest refers to the stage in the research data life cycle at which the data are deposited in an archive or repository. First, receipt of the data package is confirmed; then it is decided in what form the data are ingested.
Depending on the content to be ingested, the associated workflows can differ. In general, after confirmation of receipt the data package goes through quality control (checking of meta data and sensitivity) and several enrichment processes (for example meta data enrichment).
JavaScript Object Notation (JSON)
JSON is a compact, easy-to-read, software-independent data format used for data exchange between applications. It is mainly utilised by web applications for transmitting structured data in order to integrate them into other systems or applications. For this purpose, JSON needs significantly less storage space than XML; however, it is not as versatile.
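A small Python example of the round trip JSON is designed for: a structured record (the field names and values are invented) is serialised to a compact text string and parsed back without loss.

```python
import json

# An invented example record; field names and values are illustrative.
record = {
    "title": "Example dataset",
    "creator": "Doe, Jane",
    "year": 2020,
    "keywords": ["example", "sketch"],
}

# Serialise compactly (no whitespace) and parse back without loss.
text = json.dumps(record, separators=(",", ":"))
assert json.loads(text) == record
```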
The aim of long-term preservation is to ensure access to archived data over a long period of time. The limited durability of storage media, technological change and security risks complicate this task, which is why extensive and forward-thinking planning is necessary. In order to avoid data loss and ensure long-term retrieval of data, a suitable archiving system (meta data, structure) has to be employed. During the planning stage, different aspects like IT infrastructure, hardware and software have to be considered. Additionally, societal developments should also be taken into account.
Machine actionable data
Machine-actionable data can be found and used automatically by computer systems with no or only minimal human assistance. The prerequisite for machine usability is a uniform data structure. The machines or computers that are supposed to read and use these data are programmed on the basis of this structure.
Mapping (data mapping)
Data mapping refers to the transfer of data elements from one data model to another. This is the first step towards integrating external data into an information system. Data mapping includes data transformation during electronic data exchange and is usually done using the markup language XML or the data format JSON.
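A minimal sketch of data mapping in Python: the elements of a hypothetical source record are transferred into a differently named target model via an explicit field mapping. Both schemas and all field names are invented for illustration.

```python
# Invented source-to-target field mapping; real mappings would relate
# two actual metadata schemas.
FIELD_MAP = {
    "author":   "creator",
    "name":     "title",
    "pub_year": "date",
}

def map_record(source):
    # Carry over only the fields the target model knows about,
    # renaming them according to the mapping.
    return {FIELD_MAP[k]: v for k, v in source.items() if k in FIELD_MAP}
```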
Meta data are independent data which contain structured information about other data and/or resources and their characteristics. Meta data are stored either independently of or together with the data they describe. An exact definition of meta data is difficult, since the term is used in different contexts and distinctions can vary according to perspective.
Usually there is a distinction between discipline-specific and technical/administrative meta data. Whereas the latter are definitely considered to be metadata, the former might also be viewed as research data.
In order to raise the effectiveness of meta data, a standardisation of descriptions is necessary. By using meta data standards, meta data from different sources can be linked and processed together.
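As an illustration, a metadata record can be as simple as a set of element/value pairs. The element names below are real Dublin Core elements (dc:title, dc:creator, dc:date, dc:identifier); the values, the completeness check and the choice of "required" elements are invented for this sketch.

```python
# Example record using real Dublin Core element names; values are invented.
dublin_core_record = {
    "dc:title":      "Survey on data reuse (example)",
    "dc:creator":    "Doe, Jane",
    "dc:date":       "2020-01-15",
    "dc:identifier": "doi:10.1234/example",
}

def is_complete(record, required=("dc:title", "dc:creator", "dc:date")):
    # Check that every mandatory element of our (assumed) profile
    # is present and non-empty.
    return all(record.get(field) for field in required)
```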
Meta data standards
In order to ensure the interoperability, that is the linking and shared processing of meta data, different kinds of meta data standards exist. They facilitate a homogenous structural and textual description of similar data. Meta data standards can often be transformed into one another by using the process of mapping.
The term open access refers to free and unimpeded access to digital scientific content. Users are usually given a wide range of usage rights and provided with easy modes of access. The copyright, however, generally remains in the hands of the author. Through open access, scientific information can be widely disseminated, used and re-processed. As such it represents an important achievement of the open science movement.
When publishing scientific content, there are two open access options:
Publishing the content in a genuine open access medium is referred to as the "golden path" of open access.
Publishing the content in a traditional, subscription-based medium and additionally depositing a version of it in an openly accessible repository (self-archiving) is called the "green path".
Open data refers to data that may be used, disseminated and reused by third parties for any purpose (e. g. for information, analysis or even commercial reuse). Restrictions on use are only permitted in order to preserve the provenance and openness of the knowledge; for example, the CC-BY license requires that the author be named. The goal of open data is that free reuse allows for greater transparency and more collaboration.
Data ownership can be viewed from different perspectives. From the perspective of the researcher it is about control of the data they have gathered/generated. From the user perspective it is an issue of usage rights. Rights can be granted and communicated in the form of licences and their associated license agreements.
When re-using data in any form, the most basic standard to conform to is the rules of good scientific practice, which means that data users are required to attribute the data correctly. By using the Creative Commons license CC-BY this requirement can also be made explicit. Data protection laws, patent laws and personal rights regulations can also impede re-use.
Persistent identification is the process of assigning a permanent, digital identifier consisting of numbers and/or alphanumerical characters to a data set (or any other digital object).
Frequently used identification systems are DOI (Digital Object Identifier) and URN (Uniform Resource Name). As opposed to other identifiers such as URL addresses, a persistent identifier refers to the object itself rather than to its location on the internet. Even if the location of a persistently identified object changes, the identifier remains the same. All that needs to be changed is the URL location in the identification database. In this way it can be ensured that data sets are permanently findable, retrievable and citable.
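The mechanism can be illustrated with a toy resolver table in Python: the identifier stays fixed, while the stored location is updated when the object moves. The DOI and URLs are invented examples.

```python
# A toy resolver table; the DOI and URLs are invented examples.
resolver = {"doi:10.1234/example": "https://old-host.example.org/data/42"}

def resolve(pid):
    # Look up the current location of a persistently identified object.
    return resolver[pid]

# The object moves: only the resolver entry is updated,
# the identifier itself never changes.
resolver["doi:10.1234/example"] = "https://new-host.example.org/archive/42"
```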
German data protection law (BDSG) defines personal data as "information on personal characteristics or circumstances of a particular natural person (affected party)." Data are considered personal if they can be attributed to a particular natural person. Typical examples are the name, profession, height or nationality of a person. German data protection law moreover stipulates that information on ethnicity, political opinion, religious or philosophical affiliation, union membership, health and sexuality is especially sensitive and therefore subject to even stricter protection.
Policies establish certain rules for the handling and managing of research data for all employees of a research institution. They usually also determine which methods of research data management should be applied. In Germany most research data policies do not contain detailed regulations, but instead usually consist of a basic self-commitment to the principles of open access.
Primary research data
Primary research data are unprocessed and uncommented raw data which have not yet been associated with any meta data. They form the foundation of all scientific activity. The distinction between research data and primary research data usually only has theoretical merit, because raw data are hardly ever published without any associated meta data. Digital objects are generally not published by their proprietors (such as scientific libraries) without background information such as provenance.
As opposed to anonymisation, the technique of pseudonymisation substitutes letter and/or number codes for identifying characteristics such as names in order to impede, or ideally prevent, the identification of individuals (BDSG § 3, paragraph 6a). During the course of a scientific study, the reference list linking personal data to their codes should be kept separate from the actual study data. The data can be anonymised by deleting this reference list after the completion of the project, so that no individual person can be connected to the study results.
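The separation of study data and reference list can be sketched in Python. The function name and record structure are invented; the point is that the code-to-identity mapping is returned separately, so that it can be stored elsewhere and later deleted to anonymise the data.

```python
import secrets

def pseudonymise(records, key_field="name"):
    # Replace the identifying field of each record with a random code and
    # collect the code-to-identity mapping in a separate reference list.
    # Deleting the reference list later anonymises the study data.
    reference_list = {}
    study_data = []
    for record in records:
        code = "P-" + secrets.token_hex(4)
        reference_list[code] = record[key_field]
        study_data.append(dict(record, **{key_field: code}))
    return study_data, reference_list
```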
A repository can be viewed as a particular kind of archive. In the digital age it refers to an administrated storage space for digital objects. Since repositories are generally publicly accessible, or at least accessible to a specific group of users, they are closely connected to the issue of open access.
Research Data Lifecycle
The model of the research data life cycle illustrates the stages research data can go through, from the collection to its reuse. The stages of the data lifecycle can vary, but in general the data lifecycle comprises the following phases:
- Planning research projects (including handling of the data in the research project, see data management plan)
- Creation and collection
- Processing and analysis
- Sharing and publication
- Archiving and reuse
Research data management
The term research data management refers to the process of transforming, selecting and storing research data with the aim of making them accessible, re-usable and reproducible independently of the data author for a long period of time. To achieve that aim, systematic measures can be taken at all points in the data life cycle in order to maintain the scientific value of research data, ensure their accessibility for analysis by third parties and secure the chain of evidence.
Efforts to systematise the world wide web in order to facilitate automated information exchange between computers are summarised under the term semantic web. Central unstructured terms on a website are contextualised with additional information (meta data) so that it becomes clear whether the mention of "Berlin" refers to the capital of Germany, an entirely different city or a name. In order to convey the context of a term to a computer, machine-readable meta data standards are utilised. Interconnecting web information through contextualisation is still a project in its infancy due to its complexity and sheer scope. However, it will certainly contribute to improved searchability of the web.
Threshold of Originality
When an object or work is created, the threshold of originality is a measure of the degree to which it incorporates personal characteristics of its author. Whether a work reaches this threshold of originality is a decisive criterion for its protection by German copyright law. An important aspect of the threshold of originality is that the work is a result of its author's creativity and personality rather than an outcome of external circumstances (objective, functionality, objectivity etc.). This is why research data very rarely fall under German copyright law.
URN (Uniform Resource Name)
URN is an identification and addressing system and, like DOI, is used for the persistent identification of digital objects (online publications, data sets etc.). It is prevalent especially in the German-speaking realm, since the German National Library serves as the administrating, hosting and resolving institution for URNs.
Virtual Research Environments (VRE)
Virtual Research Environments are software solutions or platforms designed to enable location-independent collaboration between researchers. A VRE is above all a user-oriented service, usually offered by infrastructure institutions like libraries or computing centres for research associations and research communities. VREs usually incorporate discipline-specific tools, tool collections and work environments. Realising generic, discipline-independent applications has so far remained a long-term objective.
XML (Extensible Markup Language)
XML is a markup language used for storing hierarchically structured information as a simple text file. It is mainly utilised for platform-independent data exchange between applications or computers. The coding is machine-readable as well as human-readable. It is also possible to check the contents of an XML document for validity if, on top of general formal rules, content-related rules have been defined in an external file. As a result, the form and content of the coded information can be described very precisely. Using XSL (XML Stylesheet Language) it is possible to interpret the stored information and convert it into other data formats for visualisation.
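A minimal Python example using the standard library's ElementTree module: a short XML document with invented element names is stored as plain text, parsed, and queried.

```python
import xml.etree.ElementTree as ET

# A short XML document stored as plain text; element names are invented.
document = """<dataset>
  <title>Example dataset</title>
  <creator>Doe, Jane</creator>
</dataset>"""

root = ET.fromstring(document)    # parse the hierarchical structure
title = root.find("title").text   # query a child element's text content
```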