Here you find the current version of the ODINE Policy Tool Kit.
This guide gives a first outline of what to consider when data is published as open data. It sets out what has already been researched in the first months of the project, and provides an outline of what is planned for the legal and privacy toolkit v2.
The objective for version 1 is “What are the critical things to consider when opening up data?” with the focus on privacy of the data.
The aim of the legal and privacy toolkit is to provide practical advice to all consortium partners of the ODINE project, to funded projects and to generally interested persons on how to release data, on the legal aspects that are crucial for scenarios in which open and enterprise data are used in combination, and on what has to be considered in this process, in order to further push forward open data in Europe.
To that end, chapter 2, “Top three things to consider when publishing open data”, provides a quick overview of the basic things you should consider at the beginning of the process.
After that we provide more detailed input in the further sections.
Privacy, with its legal issues, and openness are not opposing forces; in fact they are two sides of the same coin and equally important. Simple as that may sound, recognising it may help move things forward on this crucial issue.
Open data advocates often suggest that openness should be the default for all human knowledge and that we could and should share, re-use and compare data freely and in doing so reap the benefits of innovation, cost savings and increased citizen participation, just to name a few positive effects.
The other side includes the concerns about privacy, especially personal information, meaning people worry that the path of openness could lead to a world where all our information is shared with everyone, permanently.
One way forward would be to see openness and privacy as complementary forces, with privacy as a governing framework that controls access to, and the collection and usage of, information. In essence, privacy laws enable knowledge and control of data about citizens and their surroundings.
On the one hand, Swedes have access to tax records; on the other, Germans are fighting against such bulk data collection, owing to cultural and historical differences. How does all this add up to a set of more or less coherent norms for a single European digital market, so that European citizens know what they can expect on the legal side? It’s understandable why advocates of open data and privacy rally to different points of view.
But one might also consider the fact that you want privacy for exactly the same reason you want openness: because you want to know whether the information held by the government or any business on a given problem, or indeed on you, is true and verifiable. Like openness and ‘open by default’, privacy is a principle that cuts across all forms of data release. It is fundamentally the same thing. Privacy permits us to share selectively, and grant people access but with limitations.
Therefore we need to include the arguments by privacy groups in those open data conversations. If we shift our thinking on open data and privacy from one of competing interests to one of a single inextricably linked, albeit complex, issue then we can find a path that enables us to cut a way through the jungle.
We need to agree whether and how open data can include personal information. And we need to stop making a dichotomous distinction between freedom of information laws and data protection; between open data policies and privacy policies. We need one single policy framework that controls as well as encourages the use of ‘open’ data.
This delivery and its next version will not remove the tensions between these two opposite points of view, but it might allow us to work toward common objectives and explanations that further each perspective’s work rather than leave us in this lockdown.
Whoever makes data open must also maintain high standards of privacy in the data released.
By definition, open data means non-personal data. To be crystal clear, personal information of private citizens should not be released through open data.
Sometimes there may not be a clear-cut distinction between non-personal and personal information, and a dataset may include de-identified personal information. De-identification of personal information is the removal or obscuring of personal identifiers and personal information so that identification of the individuals who are the subject of the information is no longer possible.
While there are significant economic, democratic and social benefits to the release of government data, it can pose risks to the privacy of personal information. The primary risk to privacy during the release of data is the identification of individuals, that is, releasing personal information or data that can be turned into personal information by easily linking it with other information.
The consequences of a privacy violation that leads to the identification of an individual can be significant, including humiliation or financial or employment-status impact, depending on the type of data released and the extent of the identification. Identification can happen either as spontaneous recognition, which is made without any special effort because of rare characteristics, or as a deliberate attempt to combine various characteristics and datasets.
Assessing the risks of identification of individuals in the release of open data is one of the necessary steps to mitigate those risks to acceptable levels. The following points should be considered:
could be used for list matching.
The level of privacy risk will be dependent on the likelihood that identification could occur from the release of the data and the consequences of such a release. The level of risk will determine what steps the agency takes to mitigate the privacy risks.
The likelihood of identification depends on the data provided, such as names, dates of birth or unique identifiers like customer numbers.
Even with such variables missing, other factors should be considered:
can identify an individual, but it may include enough variables that can be matched with other information)
Always keep in mind what the potential breach of the privacy could mean for the individual.
Several techniques can be applied to properly de-identify the dataset and reduce any risks of identification of an individual.
The first step of de-identification is to remove clearly identifying variables from the data (name, date of birth or address).
Removing the identifiers from
Customer | Customerid | Address | Items | Postcode | Annual kilometers | Age |
John Doe | 23 | Street 1 | Bike | 12345 | 7500 | 24 |
results in
Customerid | Items | Postcode | Annual kilometers | Age |
23 | Bike | 12345 | 7500 | 24 |
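As a minimal sketch of this step, assuming each record is held as a Python dictionary keyed by the column names of the table above, the identifying columns could be dropped as follows. The names `DIRECT_IDENTIFIERS` and `strip_identifiers` are illustrative and not part of the toolkit.

```python
# Columns treated as direct identifiers in this example (an assumption
# based on the table above; adjust to the dataset at hand).
DIRECT_IDENTIFIERS = {"Customer", "Address"}


def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the directly identifying columns."""
    return {key: value for key, value in record.items() if key not in identifiers}


record = {
    "Customer": "John Doe",
    "Customerid": 23,
    "Address": "Street 1",
    "Items": "Bike",
    "Postcode": "12345",
    "Annual kilometers": 7500,
    "Age": 24,
}

print(strip_identifiers(record))
```

The original record is left untouched, so the full data can stay in the internal system while only the stripped copy is considered for release.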
While some identifiers have been stripped, the resulting data retains a relatively high potential for re-identification: the data still exists at an individual level and other, potentially identifying, information has been retained. For example, some ZIP codes have very small populations, and combining this data with other publicly available information can make re-identification a relatively easy task.
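One rough way to gauge this risk is to count how many records share each combination of the remaining attributes; a combination that occurs only once points to a single individual. The sketch below assumes a small, made-up list of records and treats postcode and age as the attributes an outsider might know. Both the data and the choice of attributes are hypothetical.

```python
from collections import Counter

# Hypothetical de-identified records (illustrative data only).
records = [
    {"Customerid": 23, "Postcode": "12345", "Age": 24},
    {"Customerid": 24, "Postcode": "12345", "Age": 31},
    {"Customerid": 25, "Postcode": "23456", "Age": 31},
    {"Customerid": 26, "Postcode": "23456", "Age": 31},
]

# Attributes assumed to be matchable against other public sources.
QUASI_IDENTIFIERS = ("Postcode", "Age")


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return the attribute combinations that occur exactly once."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return [combo for combo, count in counts.items() if count == 1]


print(unique_combinations(records))
```

Here the two customers in postcode 12345 each have a unique combination of postcode and age, while the two 31-year-olds in postcode 23456 cannot be told apart on these attributes alone.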
While it may be tempting for agencies to strip out all potentially identifying information, doing so could render the data meaningless. The fact that somewhere in Germany there is a railroad customer with the age 24 traveling 7500 km by railroad a year may have limited potential use.
Another method of de-identification is ‘pseudonymisation’ which involves consistently replacing recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym. In our example John Doe would be assigned a randomly selected number.
Pseudo# | Address | Items | Postcode | Annual kilometers | Age |
pseudo123 | Street 1 | Bike | 12345 | 7500 | 24 |
This pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without the consequence of direct identification of the individual. For example, the information above could be correlated with:
Pseudo# | Year | Month | Restaurant-Waggon | Food | Age |
pseudo123 | 2015 | July | yes | yes | 24 |
Be aware, pseudonymisation also has a relatively high potential for re-identification, as the data exists on an individual level with other potentially identifying information being retained. Also, because pseudonymisation is generally used when an individual is tracked over more than one dataset, if re-identification does occur more personal information will be revealed concerning the individual.
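A minimal sketch of consistent pseudonymisation, assuming the pseudonyms are random tokens kept in an in-memory lookup table. The class name `Pseudonymiser` and the two example rows are hypothetical; a real deployment would need to store the lookup table securely and separately from the published data.

```python
import secrets


class Pseudonymiser:
    """Consistently map each identifier to a randomly generated pseudonym."""

    def __init__(self):
        # The mapping must never be published: whoever holds it can
        # reverse the pseudonymisation.
        self._mapping = {}

    def pseudonym_for(self, identifier):
        if identifier not in self._mapping:
            # Collisions between random tokens are ignored here for brevity.
            self._mapping[identifier] = "pseudo" + secrets.token_hex(8)
        return self._mapping[identifier]


pseudonymiser = Pseudonymiser()

# Two hypothetical datasets that both refer to the same customer.
purchases = {"Customer": "John Doe", "Items": "Bike", "Age": 24}
journeys = {"Customer": "John Doe", "Year": 2015, "Month": "July"}

for row in (purchases, journeys):
    # Replace the name with its pseudonym in each dataset.
    row["Pseudo#"] = pseudonymiser.pseudonym_for(row.pop("Customer"))

print(purchases["Pseudo#"] == journeys["Pseudo#"])
```

Because the same identifier always receives the same pseudonym, the two rows can still be linked, which is exactly what makes a successful re-identification more damaging.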
Rendering personally identifiable information less precise can reduce the possibility of re-identification. For example, dates of birth or ages can be replaced by age groups.
Pseudo# | Year | Month | Restaurant-Waggon | Food | Age |
pseudo123 | 2015 | July | yes | yes | 20-30 |
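The generalisation described above could be sketched as follows. The function name `age_group` and the band width of ten years are assumptions; the sketch also uses non-overlapping bands such as 20-29, which differ slightly from the 20-30 shown in the table.

```python
def age_group(age, width=10):
    """Map an exact age to a band such as '20-29' (band width is an assumption)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


print(age_group(24))
```

Non-overlapping bands avoid the question of which group a 30-year-old belongs to. Wider bands lower the risk further but also make the data less useful.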
Related techniques include suppression of cells with low values or conducting statistical analysis to determine whether particular values can be correlated to individuals. In such cases it may be necessary to apply the frequency rule by setting a threshold for the minimum number of units contributing to any cell. Common threshold values are 3, 5 and 10. For example, applying a threshold value of 3 to a table like the following, any cell with a count below 3 would be suppressed or aggregated into a larger range.
Age | ZIPcode | train-riders | annual kilometer |
20-30 | 12345 | 21 | <1000 |
31-40 | 23456 | 12 | 1001-4999 |
41-50 | 34567 | 3 | >5000 |
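A minimal sketch of the frequency rule applied to the rows of the table above. The threshold of 5 is an assumption chosen from the common values so that one cell is affected, and marking suppressed cells with `None` is an illustrative choice.

```python
THRESHOLD = 5  # assumed minimum count per cell; 3, 5 and 10 are common values

# Rows of the table above: (age group, ZIP code, number of train riders).
rows = [
    ("20-30", "12345", 21),
    ("31-40", "23456", 12),
    ("41-50", "34567", 3),
]


def apply_frequency_rule(table, threshold=THRESHOLD):
    """Replace counts below the threshold with None to mark them as suppressed."""
    return [
        (age, zipcode, count if count >= threshold else None)
        for age, zipcode, count in table
    ]


for row in apply_frequency_rule(rows):
    print(row)
```

With a threshold of 5 the count for the 41-50 group is suppressed; with a threshold of 3 it would be kept, because three units still meet the minimum.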
Introducing random values, or ‘adding noise’, is a more advanced technique and may involve altering the underlying data in a small way so that original values cannot be known with certainty but the aggregate results are unaffected.
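One simple way to illustrate this idea is to add random noise that is centred so that it sums to zero, which leaves the total (and therefore the mean) unchanged. This is only a sketch under that assumption, not a recommended disclosure-control method; the function name `add_zero_sum_noise`, the `spread` parameter and the input values are hypothetical.

```python
import random


def add_zero_sum_noise(values, spread=50.0, seed=None):
    """Perturb each value with random noise whose total is zero."""
    rng = random.Random(seed)
    noise = [rng.uniform(-spread, spread) for _ in values]
    mean_noise = sum(noise) / len(noise)
    # Centre the noise so the perturbations cancel out across the dataset.
    return [value + n - mean_noise for value, n in zip(values, noise)]


annual_kilometers = [7500, 1200, 4300, 900]  # hypothetical values
noisy = add_zero_sum_noise(annual_kilometers, seed=1)

print(noisy)
print(round(sum(noisy)) == sum(annual_kilometers))
```

Only the total and the mean are preserved here (up to floating-point rounding); other statistics, such as the variance, change slightly.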
Individual data can be combined to provide information about groups or populations. The larger the group and the less specific the data is about its members, the less potential there will be for identifying an individual within the group. In our example, this could mean aggregating the ZIP codes at state level.
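As a rough sketch of such an aggregation, the rider counts from the train-rider table earlier in this section can be summed per state. The mapping from ZIP code to state is invented for illustration, as are the names `ZIP_TO_STATE` and `aggregate_by_state`.

```python
from collections import defaultdict

# Hypothetical lookup from ZIP code to state (illustrative values only).
ZIP_TO_STATE = {"12345": "State A", "23456": "State A", "34567": "State B"}

# (ZIP code, number of train riders), taken from the table earlier in this section.
riders_by_zip = [("12345", 21), ("23456", 12), ("34567", 3)]


def aggregate_by_state(rows, lookup=ZIP_TO_STATE):
    """Sum the rider counts of all ZIP codes that belong to the same state."""
    totals = defaultdict(int)
    for zipcode, riders in rows:
        totals[lookup[zipcode]] += riders
    return dict(totals)


print(aggregate_by_state(riders_by_zip))
```

Aggregation alone does not guarantee safety: a state that ends up with a very small total may still need to be suppressed or merged with another group.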
The following tools and software packages are examples that can help with de-identifying datasets. These tools provide an automated way of applying a particular de-identification method and may help an agency determine with more precision the success of the method applied and the privacy risk of publicly releasing the dataset.
μ-ARGUS – Statistics Netherlands
http://neon.vb.cbs.nl/casc/mu.htm
Privacy Analytics Risk Assessment Tool
http://www.privacy-analytics.com/
Cornell Anonymization Toolbox
http://sourceforge.net/projects/anony-toolkit/
University of Texas Anonymisation Toolbox
http://cs.utdallas.edu/dspl/cgi-bin/toolbox/
This section of the delivery aims to assist in understanding and addressing the risks to privacy when considering the public release of datasets, and has been developed to ensure compliance with European law. A more detailed version will follow with version 2.
Privacy and data protection are fundamental rights in the EU.
Data protection is a fundamental right, protected by European law and enshrined in Article 8 of the Charter of Fundamental Rights of the European Union.
Under EU law as well as under CoE law, ‘personal data’ are defined as information relating to an identified or identifiable natural person, that is, information about a person whose identity is either manifestly clear or can at least be established by obtaining additional information.
(Data Protection Directive, Art. 2 (a); Convention 108, Art. 2 (a).)
The first EU Data Protection Directive was written in 1995. Under this directive, any data “by which an individual can be identified” was the sole responsibility of the data controller, i.e. the owner of this data.
But a newer and stronger regulation is currently on the way, being developed to take into account the vast technology changes since then.
The EU currently plans to finalise the regulation in 2015 and implement it by 2017 (around the same time as the end of the ODINE project and version 2 of this delivery). As with any regulation, the current draft could change.
Under the newly proposed regulation, any business or individual that processes this data will also be held responsible for its protection, including third parties such as cloud providers. Put simply, anyone who accesses your data, wherever they are based, is responsible in the case of a data breach. The implications are wide: for example, third parties will need to be very attentive when securing the data of others, and data owners will want to vet their partners thoroughly.
More specifically, the rules for data protection in the EU institutions, as well as the duties of the European Data Protection Supervisor (EDPS), are set out in Regulation (EC) No 45/2001. The EDPS is a relatively new but increasingly influential independent supervisory authority responsible for monitoring the processing of personal data by the EU institutions and bodies, advising on policies and legislation that affect privacy, and cooperating with similar authorities to ensure consistent data protection.
Certify your open data: Show that it’s easy to find, use and share and describe the
Further reading:
http://theodi.org/guides/what-open-data
The Open Data Maturity Model is a way to assess how well an organisation publishes and consumes open data, and identifies actions for improvement.
Handbook on European data protection law
http://fra.europa.eu/sites/default/files/fra-2014-handbook-data-protection-law-2nd-ed_en.pdf
Open Data Handbook
http://opendatahandbook.org/guide/en/how-to-open-up-data/
Open Data World Bank Blog
http://blogs.worldbank.org/opendata/how-can-the-open-government-data-toolkit-help-you
Open Definition
OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data
Opening a new Chapter for Data Protection
Protection of personal data
http://ec.europa.eu/justice/data-protection/
UK Anonymisation Network
http://privacytools.seas.harvard.edu/
http://privacytools.seas.harvard.edu/publications/automating-open-science-big-data
http://opendefinition.org/guide/data/
Guidelines on recommended standard licences, datasets and charging for the reuse of
documents (2014/C 240/01)
http://ec.europa.eu/newsroom/dae/document.cfm?action=display&doc_id=6421
http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html