Policy-tool-kit

Here you will find the current version of the ODINE Policy Tool Kit.

1. Introduction

This guide gives a first outline of what to consider when data is published as open data. It sets out what has already been researched in the first months of the project, and provides an outline of what is planned for the legal and privacy toolkit v2.

The objective for version 1 is to answer the question “What are the critical things to consider when opening up data?”, with a focus on the privacy of the data.

1.1 How to use this guide

The legal and privacy toolkit aims to give practical advice to all consortium partners of the ODINE project, to funded projects and to other interested persons. It covers how to release data, the legal aspects that are crucial when open and enterprise data are used in combination, and what has to be considered in the process, with the wider goal of pushing open data forward in Europe.

To that end, chapter 2, “Checklist – Top things to consider when publishing open data”, gives a quick overview of the basics you should consider at the beginning of the process.

The later sections then go into more detail.

2. Checklist – Top things to consider when publishing open data

Licence

Guide to licensing: At its simplest, open data requires just two things: data and openness. There are lots of aspects to openness, but at its most fundamental, the key is how the data is licensed. Data that doesn’t explicitly have an open licence is not open data.
List of licences: This section of the Open Knowledge website lists licences that conform to the principles laid out in the Open Definition.
Choose a license for your code: http://choosealicense.com/. As the default licence for data we recommend Creative Commons 4.0 or Creative Commons Zero. See details at Legal code understandable for humans.
Contribute to the discussion on licensing here.
Encourage re-use of data.

Managing the risk of publishing data relating to individuals

Understanding whether you have data relating to individuals
- A quick definition of personal data by the EU.
- A quick intro to ‘what is an identifier’
Reducing the risk of publishing any personal data by anonymising the data
- A quick checklist
- A deeper Anonymisation Decision-making Framework and free online course by the UK Anonymisation Network
- Free anonymisation workshops coming up in Europe: see events

Be aware of the legal framework in the EU and your country

EU legislation
Implementation on country level
Voluntary recommended approaches, e.g.
- https://responsibledata.io/
- http://wiki.okfn.org/Personal_Data_and_Privacy

After publishing great open data

Make sure the public knows; see these steps

3. Open data and privacy – managing the risk of publishing data relating to individuals

Privacy, with its legal issues, and openness are not opposing forces; they are two sides of the same coin and equally important. Simple as that observation looks, it may be important for moving things forward on this crucial issue.

Open data advocates often suggest that openness should be the default for all human knowledge and that we could and should share, re-use and compare data freely and in doing so reap the benefits of innovation, cost savings and increased citizen participation, just to name a few positive effects.

On the other side are concerns about privacy, especially about personal information: people worry that the path of openness could lead to a world where all our information is shared with everyone, permanently.

One way forward is to see openness and privacy as complementary forces, with privacy as a governing framework that controls access to, and the collection and usage of, information. In essence, privacy laws enable knowledge of and control over data about citizens and their surroundings.

In Europe, Swedes have access to tax records, whereas Germans are fighting against such bulk data collection, owing to cultural and historical differences. How does all this add up to a more or less coherent set of norms for a European digital single market, so that European citizens know what to expect on the legal side? It’s understandable that advocates of open data and of privacy rally to different points of view.

But one might also consider that you want privacy for exactly the same reason you want openness: because you want to know whether the information held by the government or any business on a given problem, or indeed on you, is true and verifiable. Like openness and ‘open by default’, privacy is a principle that cuts across all forms of data release. It is fundamentally the same thing. Privacy permits us to share selectively, and grant people access but with limitations.

Therefore we need to include the arguments by privacy groups in those open data conversations. If we shift our thinking on open data and privacy from one of competing interests to one of a single inextricably linked, albeit complex, issue then we can find a path that enables us to cut a way through the jungle.

We need to agree whether and how open data can include personal information. And we need to stop making a dichotomous distinction between freedom of information laws and data protection; between open data policies and privacy policies. We need one single policy framework that controls as well as encourages the use of ‘open’ data.

This delivery and its next version will not remove the tensions between these two points of view, but they might allow us to work toward common objectives and explanations that further each perspective’s work rather than leave us in this deadlock.

3.1 Making data open while considering privacy

An organisation that makes data open must also maintain high standards of privacy in the data it releases.

By definition, open data is non-personal data. To be crystal clear: personal information of private citizens should not be released as open data.

Sometimes there is no clear-cut distinction between non-personal and personal information, and a dataset may include de-identified personal information. De-identification is the removal or obscuring of personal identifiers and personal information so that the individuals who are the subject of the information can no longer be identified.

The privacy risks of open data

While there are significant economic, democratic and social benefits to the release of government data, it can pose risks to the privacy of personal information. The primary risk to privacy during the release of data is the identification of individuals, that is, releasing personal information or data that can be turned into personal information by easily linking it with other information.

The consequences of a privacy violation that leads to the identification of an individual can be significant, including humiliation and financial or employment-status impact, depending on the type of data released and the extent of any identification. Identification can happen either as spontaneous recognition, which is made without any special effort because of rare characteristics, or as a deliberate attempt to combine various characteristics and datasets.

3.2 Reducing the risks of personal identification in open data

Assessing the risks

Assessing the risks of identification of individuals in the release of open data is one of the necessary steps to mitigate those risks to acceptable levels. The following points should be considered:

Determining any specific unique identifiers, such as name or date of birth
Cross-referencing to determine unique combinations, such as age, gender and ZIPcode
Acquiring knowledge of other publicly available datasets and information that could be used for list matching
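
The cross-referencing step can be sketched in a few lines of Python. This is a minimal illustration rather than a full risk assessment: the records, the column names and the choice of age, gender and ZIPcode as the combination to check are all invented for the example.

```python
from collections import Counter

# Hypothetical records; all values are made up for illustration.
records = [
    {"age": 24, "gender": "m", "zipcode": "12345"},
    {"age": 24, "gender": "m", "zipcode": "12345"},
    {"age": 37, "gender": "f", "zipcode": "23456"},
    {"age": 52, "gender": "f", "zipcode": "34567"},
]


def unique_combinations(rows, columns):
    """Return the value combinations that occur for exactly one record."""
    counts = Counter(tuple(row[column] for column in columns) for row in rows)
    return [combination for combination, n in counts.items() if n == 1]


# Combinations held by a single person are candidates for re-identification.
risky = unique_combinations(records, ["age", "gender", "zipcode"])
print(risky)
```

Any combination returned here belongs to a single record, so that record could be singled out by anyone who knows those three values from another source.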

The level of privacy risk will be dependent on the likelihood that identification could occur from the release of the data and the consequences of such a release. The level of risk will determine what steps the agency takes to mitigate the privacy risks.

The likelihood of an identification breach and its potential impact

The likelihood of identification depends on the data provided, for example names, dates of birth or unique identifiers such as customer numbers.

Even with such variables missing, other factors should be considered:

Motivation to attempt identification
Level of details (the more detail the more likely identification becomes)
Presence of rare characteristics
Presence of other information (the dataset itself does not include any data that can identify an individual, but it may include enough variables that can be matched with other information)

Always keep in mind what the potential breach of the privacy could mean for the individual.

Reducing privacy risks through de-identification

Several techniques can be applied to properly de-identify the dataset and reduce any risks of identification of an individual.

Removing identifiers

The first step of de-identification is to remove clearly identifying variables from the data (name, date of birth or address).

Removing the identifiers from

Customer	Customerid	Address	Items	Postcode	Annual kilometers	Age
John Doe	23	Street 1	Bike	12345	7500	24

results in

Customerid	Items	Postcode	Annual kilometers	Age
23	Bike	12345	7500	24

While some identifiers are stripped, the result retains a relatively high potential for re-identification: the data still exists at an individual level and other, potentially identifying, information has been retained. For example, some ZIPcodes have very small populations, and combining this data with other publicly available information can make re-identification a relatively easy task.

While it may be tempting for agencies to strip out all potentially identifying information, doing so could render the data meaningless. The fact that somewhere in Germany there is a railroad customer aged 24 travelling 7500 km by railroad a year may have limited potential use.
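
As a rough sketch, removing identifiers amounts to dropping columns from each record. The Python below replays the John Doe example; treating "Customer" and "Address" as the direct identifiers is an assumption made to match the table above, and the function name is invented for the illustration.

```python
# Columns treated as direct identifiers in this sketch (an assumption).
DIRECT_IDENTIFIERS = {"Customer", "Address"}


def remove_identifiers(row, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the row without the direct identifier columns."""
    return {key: value for key, value in row.items() if key not in identifiers}


customer = {
    "Customer": "John Doe",
    "Customerid": 23,
    "Address": "Street 1",
    "Items": "Bike",
    "Postcode": "12345",
    "Annual kilometers": 7500,
    "Age": 24,
}

stripped = remove_identifiers(customer)
print(stripped)
```

The customer ID, postcode and age are still present in the output, which is why this step on its own leaves a high potential for re-identification.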

Pseudonymisation

Another method of de-identification is ‘pseudonymisation’ which involves consistently replacing recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym. In our example John Doe would be assigned a randomly selected number.

Pseudo#	Address	Items	Postcode	Annual kilometers	Age
pseudo123	Street 1	Bike	12345	7500	24

This pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without the consequence of direct identification of the individual. For example, the information above could be correlated with:

Pseudo#	Year	Month	Restaurant-Waggon	Food	Age
pseudo123	2015	July	yes	yes	24

Be aware, pseudonymisation also has a relatively high potential for re-identification, as the data exists on an individual level with other potentially identifying information being retained. Also, because pseudonymisation is generally used when an individual is tracked over more than one dataset, if re-identification does occur more personal information will be revealed concerning the individual.
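
One possible way to pseudonymise consistently is sketched below in Python. It is only an illustration: the class and function names are hypothetical, the pseudonym format is arbitrary, and the two input records are simplified versions of the tables above. The important property is that the same person receives the same random pseudonym in every dataset.

```python
import secrets


class Pseudonymiser:
    """Consistently map each identifier to a randomly generated pseudonym."""

    def __init__(self):
        # This lookup table links pseudonyms back to people, so it must be
        # kept secret and never published together with the data.
        self._lookup = {}

    def pseudonym_for(self, identifier):
        if identifier not in self._lookup:
            self._lookup[identifier] = "pseudo" + secrets.token_hex(8)
        return self._lookup[identifier]


def pseudonymise(row, pseudonymiser, column="Customer"):
    """Replace the identifying column with a 'Pseudo#' column."""
    result = {key: value for key, value in row.items() if key != column}
    result["Pseudo#"] = pseudonymiser.pseudonym_for(row[column])
    return result


pseudonymiser = Pseudonymiser()

# Two hypothetical datasets that mention the same customer.
purchases = {"Customer": "John Doe", "Items": "Bike", "Age": 24}
journeys = {"Customer": "John Doe", "Year": 2015, "Month": "July"}

purchases_out = pseudonymise(purchases, pseudonymiser)
journeys_out = pseudonymise(journeys, pseudonymiser)

# Both records carry the same pseudonym and can still be correlated.
print(purchases_out["Pseudo#"] == journeys_out["Pseudo#"])
```

Because the records stay linkable, a single successful re-identification exposes everything stored under that pseudonym, which is the risk described above.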

Reducing the precision of the data

Rendering personally identifiable information less precise can reduce the possibility of re-identification. For example, dates of birth or ages can be replaced by age groups.

Pseudo#	Year	Month	Restaurant-Waggon	Food	Age
pseudo123	2015	July	yes	yes	20-30
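
Replacing an exact age by an age group can be sketched as follows. The bands are taken from the examples in this guide; the function name and the decision to return None for ages outside every band are assumptions of the sketch.

```python
# Age bands as used in the examples of this guide; other bands are equally valid.
AGE_BANDS = [(20, 30), (31, 40), (41, 50)]


def age_group(age, bands=AGE_BANDS):
    """Return the label of the band that contains age, or None if none fits."""
    for low, high in bands:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


record = {"Pseudo#": "pseudo123", "Year": 2015, "Month": "July", "Age": 24}

# Publish the age group instead of the exact age.
generalised = dict(record, Age=age_group(record["Age"]))
print(generalised)
```

Ages that fall outside every band come back as None here; in practice they would need additional bands or some other treatment before release.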

Related techniques include suppression of cells with low values or conducting statistical analysis to determine whether particular values can be correlated to individuals. In such cases it may be necessary to apply the frequency rule by setting a threshold for the minimum number of units contributing to any cell. Common threshold values are 3, 5 and 10. For example, under a threshold of 3, any cell with fewer than 3 contributing units may be suppressed or aggregated into a bigger range. In the following table the 41-50 age group, with only 3 train riders, sits exactly at that threshold and would be suppressed under a threshold of 5.

Age	ZIPcode	train-riders	annual kilometer
20-30	12345	21	<1000
31-40	23456	12	1001-4999
41-50	34567	3	>5000
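
The frequency rule can be sketched in Python using the table above. The sketch assumes that a cell is suppressed when its count is strictly below the threshold and that suppression means replacing the count with None; both are choices made for the illustration, and the function name is invented.

```python
def apply_frequency_rule(rows, count_column, threshold):
    """Return copies of the rows with counts below the threshold suppressed."""
    result = []
    for row in rows:
        row = dict(row)  # copy, so the input table is left unchanged
        if row[count_column] < threshold:
            row[count_column] = None  # suppressed cell
        result.append(row)
    return result


table = [
    {"Age": "20-30", "ZIPcode": "12345", "train-riders": 21},
    {"Age": "31-40", "ZIPcode": "23456", "train-riders": 12},
    {"Age": "41-50", "ZIPcode": "34567", "train-riders": 3},
]

with_threshold_3 = apply_frequency_rule(table, "train-riders", 3)
with_threshold_5 = apply_frequency_rule(table, "train-riders", 5)

print(with_threshold_3[2]["train-riders"])  # 3 is not below 3, so it is kept
print(with_threshold_5[2]["train-riders"])  # 3 is below 5, so it is suppressed
```

Merging the small cell into a wider age range, as mentioned above, would be the alternative to suppressing it.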

Introducing random values, or ‘adding noise’, is a more advanced technique. It may include altering the underlying data in a small way so that original values cannot be known with certainty but the aggregate results are unaffected.
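
One simple way to add noise while keeping a total unchanged is sketched below. It is an illustrative assumption, not a recommended production method: the function name, the noise scale and the sample values are invented, and this kind of noise gives no formal privacy guarantee.

```python
import random


def add_zero_sum_noise(values, scale, rng=None):
    """Perturb each value by random noise whose overall sum is zero."""
    if not values:
        return []
    rng = rng or random.Random()
    noise = [rng.uniform(-scale, scale) for _ in values]
    mean_noise = sum(noise) / len(noise)
    # Centring the noise makes it sum to zero (up to floating-point
    # rounding), so the total of the values is preserved.
    return [value + n - mean_noise for value, n in zip(values, noise)]


# Hypothetical annual kilometers for four customers.
kilometers = [7500, 1200, 4300, 9800]
noisy = add_zero_sum_noise(kilometers, scale=100, rng=random.Random(42))

print(noisy)
print(sum(kilometers), round(sum(noisy), 6))
```

Each published value differs slightly from the original, while the sum, and therefore the mean, stays the same.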

Aggregation

Individual data can be combined to provide information about groups or populations. The larger the group and the less specific the data is about them, the less potential there will be for identifying an individual within the group. In our example, this could mean aggregating the ZIPcodes to state level.
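
Aggregating ZIPcodes to state level can be sketched as below. The lookup from ZIPcode to state is entirely hypothetical; real data would need an official mapping. The rider counts are the ones used in the frequency rule example.

```python
from collections import defaultdict

# Hypothetical ZIPcode-to-state lookup, invented for this sketch.
ZIP_TO_STATE = {"12345": "State A", "23456": "State A", "34567": "State B"}


def aggregate_by_state(rows, zip_to_state=ZIP_TO_STATE):
    """Sum the train-rider counts of all ZIPcodes in the same state."""
    totals = defaultdict(int)
    for row in rows:
        state = zip_to_state[row["ZIPcode"]]
        totals[state] += row["train-riders"]
    return dict(totals)


rows = [
    {"ZIPcode": "12345", "train-riders": 21},
    {"ZIPcode": "23456", "train-riders": 12},
    {"ZIPcode": "34567", "train-riders": 3},
]

by_state = aggregate_by_state(rows)
print(by_state)
```

In this made-up mapping one state still contains only three riders, which shows that aggregation may need to be combined with a threshold check before release.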

3.3 Privacy tools list for de-identification

The following tools and software packages are examples of aids for de-identifying datasets. These tools provide an automated way of applying a particular de-identification method and may help an agency determine with more precision the success of the method applied and the privacy risk of publicly releasing the dataset.

μ-ARGUS – Statistics Netherlands

http://neon.vb.cbs.nl/casc/mu.htm

Privacy Analytics Risk Assessment Tool

http://www.privacy-analytics.com/

Cornell Anonymization Toolbox

http://sourceforge.net/projects/anony-toolkit/

University of Texas Anonymisation Toolbox

http://cs.utdallas.edu/dspl/cgi-bin/toolbox/

4. Open data, privacy and European Law

This section of the delivery aims to assist in understanding and addressing the risks to privacy when considering the public release of datasets, and has been developed to ensure compliance with European law. A more detailed version will follow with version 2.

Privacy and data protection are fundamental rights in the EU.

Data protection is a fundamental right, protected by European law and enshrined in Article 8 of the Charter of Fundamental Rights of the European Union.

Under EU law as well as under CoE law, ‘personal data’ are defined as information relating to an identified or identifiable natural person, that is, information about a person whose identity is either manifestly clear or can at least be established by obtaining additional information.

(Data Protection Directive, Art. 2 (a); Convention 108, Art. 2 (a).)

The first EU Data Protection Directive was written in 1995. Under this directive, any data “by which an individual can be identified” was the sole responsibility of the data controller, i.e. the owner of this data.

But a newer and stronger regulation is currently on the way, being developed to take into account the vast technology changes since then.

The current plan by the EU is to finalise the regulation in 2015 and implement it by 2017 (around the end of the ODINE project and version 2 of this delivery). As with any regulation, the current draft could change.

Under the newly proposed regulation, any business or individual that processes this data will also be held responsible for its protection, including third parties such as cloud providers. To put it simply, anyone who accesses your data, wherever they are based, is responsible in the case of a data breach. The implications are wide: for example, third parties will need to be very attentive when securing the data of others, and data owners will want to vet their partners thoroughly.

More specifically, the rules for data protection in the EU institutions, as well as the duties of the European Data Protection Supervisor (EDPS), are set out in Regulation (EC) No 45/2001. The EDPS is a relatively new but increasingly influential independent supervisory authority with responsibility for monitoring the processing of personal data by the EU institutions and bodies, advising on policies and legislation that affect privacy, and cooperating with similar authorities to ensure consistent data protection.