Open Data – An Introduction

What is Open Data?

“Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.” OpenDefinition.org

The Open Definition gives full details on the requirements for ‘open’ data and content. Key features are:

Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.

Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets. The data must be machine-readable.

Universal Participation: everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

Read the full Open Definition

Open Data: How We Got Here, and Where We’re Going.
From the LIFT 2012 conference.

What Kinds of Open Data?

[Figure: Types of Data: Geodata, Culture, Science, Financial, Statistics, Climate, Environment, Transport]

Geodata

The data that is used to make maps — from the location of roads and buildings to topography and boundaries.

Cultural

Data about cultural works and artefacts – for example titles and authors – and generally collected and held by galleries, libraries, archives and museums.

Science

Data that is produced as part of scientific research from astronomy to zoology.

Finance

Data such as government accounts (expenditure and revenue) and information on financial markets (stocks, shares, bonds etc).

Statistics

Data produced by statistical offices such as the census and key socioeconomic indicators.

Weather

The many types of information used to understand and predict the weather and climate.

Environment

Information related to the natural environment, such as the presence and level of pollutants and the quality of rivers and seas.

Transport

Data such as timetables, routes, on-time statistics.

Why Open Data?

Why should data be open? The answer, of course, depends somewhat on the type of data. However, there are common reasons such as:

Transparency. In a well-functioning, democratic society citizens need to know what their government is doing. To do that, they must be able to freely access government data and information and to share that information with other citizens. Transparency isn’t just about access; it is also about sharing and reuse. Often, material must be analyzed and visualized before it can be understood, and this requires that the material be open so that it can be freely used and reused.

Releasing social and commercial value. In a digital age, data is a key resource for social and commercial activities. Everything from finding your local post office to building a search engine requires access to data, much of which is created or held by government. By opening up data, government can help drive the creation of innovative business and services that deliver social and commercial value.

Participation and engagement. This covers participatory governance as well as businesses and organizations engaging with their users and audiences. Much of the time citizens are only able to engage with their own governance sporadically, maybe just at an election every 4 or 5 years. By opening up data, citizens are enabled to be much more directly informed and involved in decision-making. This is more than transparency: it’s about making a full “read/write” society, which is not just about knowing what is happening in the process of governance but about being able to contribute to it.

How to Open Up Data

If you are looking for more detailed, practical advice on how to open up data, have a look at the Open Data Handbook, which discusses the legal, social and technical aspects of opening up data. Here we provide some short suggestions for initial steps.

3 Key Rules

There are three key rules we recommend following when opening up data:

  • Keep it simple. Start out small, simple and fast. There is no requirement that every dataset must be made open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine — of course, the more datasets you can open up the better.

    Remember this is about innovation. Moving as rapidly as possible is good because it means you can build momentum and learn from experience — innovation is as much about failure as success and not every dataset will be useful.

  • Engage early and engage often. Engage with actual and potential users and re-users of the data as early and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration of your service is as relevant as it can be.

    It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via ‘infomediaries’. These are the people who take the data and transform or remix it to be presented. For example, most of us don’t want or need a large database of GPS coordinates; we would much prefer a map. Thus, engage with infomediaries first. They will re-use and repurpose the material.

  • Address common fears and misunderstandings. This is especially important if you are working with or within large institutions such as government. When opening up data you will encounter plenty of questions and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as possible.

The Four Steps

These are in very approximate order – many of the steps can be done simultaneously.

  1. Choose your dataset(s). Choose the dataset(s) you plan to make open. Keep in mind that you can (and may need to) return to this step if you encounter problems at a later stage.
  2. Apply an open license.
    1. Determine what intellectual property rights exist in the data.
    2. Apply a suitable ‘open’ license that licenses all of these rights and supports the definition of openness discussed in the ‘What is Open Data?’ section above.
    3. NB: if you can’t do this go back to step 1 and try a different dataset.
  3. Make the data available – in bulk and in a useful format. You may also wish to consider alternative ways of making it available such as via an API.
  4. Make it discoverable – post on the web and perhaps organize a central catalog to list your open datasets.
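
As a rough illustration of steps 2 to 4, the sketch below builds a tiny catalog of dataset records and searches it by keyword. Everything in it is an assumption made for the example: the field names, the function names `make_catalog_entry` and `find_datasets`, the license identifier and the example.org URLs are placeholders, not part of any catalog standard or of the Open Data Handbook.

```python
import json


def make_catalog_entry(title, license_id, download_url, file_format):
    """Build one catalog record for an open dataset (hypothetical field names)."""
    return {
        "title": title,
        "license": license_id,         # step 2: the open license applied
        "download_url": download_url,  # step 3: where the bulk file lives
        "format": file_format,         # step 3: a machine-readable format
    }


def find_datasets(catalog, keyword):
    """Step 4: a very simple discovery aid that matches a keyword against titles."""
    keyword = keyword.lower()
    return [entry for entry in catalog if keyword in entry["title"].lower()]


# Hypothetical example data; titles and URLs are placeholders.
CATALOG = [
    make_catalog_entry("City bus timetables", "CC-BY-4.0",
                       "https://example.org/data/bus-timetables.csv", "CSV"),
    make_catalog_entry("Council spending 2012", "CC-BY-4.0",
                       "https://example.org/data/spending-2012.csv", "CSV"),
]

if __name__ == "__main__":
    print(json.dumps(find_datasets(CATALOG, "bus"), indent=2))
```

A real catalog would usually carry more metadata (publisher, update frequency, contact), but even a short list like this makes datasets easier to find than scattered download links.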

Frequently Asked Questions – FAQs

Commercial Use

A key element of the definition is that commercial use of open data is allowed – there should be no restrictions on commercial, for-profit, use of open data.

In the full Open Definition, this is included as “No Discrimination Against Fields of Endeavor — The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for genetic research.”

The major intention of this clause is to prohibit license traps that prevent open material from being used commercially; we want commercial users to join our community, not feel excluded from it.

Attribution, “Integrity” and Share-alike

Whilst the Open Definition permits very few conditions to be placed on how someone can use open data, it does allow a few specific exceptions:

  • Attribution: an open data provider is allowed to require attribution (that you credit them in an appropriate way). This can be important in allowing open data providers to receive credit and for downstream users to know where data came from.
  • Integrity: an open data provider may require that a user of the data make it clear if the data has been changed. This can be very relevant for governments who wish to ensure that people do not claim data is official if it has been changed.
  • Share-alike: an open data provider may impose a share-alike requirement that any new datasets created using their data are also shared as open data.

Machine-readability and Bulk access

Data can be provided in many ways and this can have significant impact on the ability to easily use it. The Definition thus requires that data be machine-readable and available in “bulk”.

Data is machine-readable if it can be easily processed by a computer. This does not just mean that the data is digital, but that it is in a digital structure appropriate for the relevant processing. For example, consider a PDF document containing tables of data. These are digital, but computers will struggle to extract the information from the PDF (even though it is very human-readable!). The equivalent tables in a format such as a spreadsheet would be machine-readable. Read more about machine-readability in the open data glossary.
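
To make the difference concrete, here is a minimal sketch of what "easily processed by a computer" can look like when a table is published as CSV, a common spreadsheet-style format. The sample data, the column names and the function name `total_population` are invented for illustration; only Python's standard `csv` module is used.

```python
import csv
import io

# Hypothetical sample data: a small table as it might appear in a CSV file.
SAMPLE_CSV = """town,country,population
Springfield,Exampleland,30000
Rivertown,Exampleland,12500
Hillview,Otherland,8200
"""


def total_population(csv_text):
    """Sum the 'population' column of a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first row gives column names
    return sum(int(row["population"]) for row in reader)


if __name__ == "__main__":
    print(total_population(SAMPLE_CSV))  # prints 50700
```

Getting the same total out of a table embedded in a PDF would typically need specialised extraction tools and manual checking, which is exactly the obstacle the machine-readability requirement is meant to remove.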

Data is available in bulk if you can download or access the whole dataset easily. Conversely, it is non-bulk if you are limited to getting parts of the dataset, for example if you are restricted to a few elements of the data at a time – imagine trying to download a database of all the towns in the world one element at a time.

APIs versus Bulk Access

Providing data through an API is great – and often more convenient for many uses than bulk access. However, the Open Definition requires bulk access rather than an API. Why? The answer is twofold:

  • Bulk access allows you to build an API (if you want), but an API does not mean you get all the data. Think of Twitter, for example: using their API it would be impossible, or at least very hard and very inefficient, to get access to the database in bulk. Thus, bulk is the only way to guarantee full access to the data for everyone.
  • Bulk access is significantly cheaper than providing an API. Today you can store gigabytes of data for less than a dollar a month. By contrast, running even a basic API can cost much more, and running a proper API that supports high demand can be very expensive.
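
The first point can be sketched in a few lines: anyone holding the bulk file can load it into an index and answer the kind of query an API would serve. This is only an illustration under stated assumptions; the sample data and the names `build_index` and `towns_in` are hypothetical, and a real service would add a web layer on top.

```python
import csv
import io
from collections import defaultdict

# Hypothetical bulk dump: the complete dataset in a single CSV file.
BULK_DUMP = """town,country,population
Springfield,Exampleland,30000
Rivertown,Exampleland,12500
Hillview,Otherland,8200
"""


def build_index(bulk_csv):
    """Load the whole dump once and group its rows by country."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(bulk_csv)):
        index[row["country"]].append(row)
    return dict(index)


def towns_in(index, country):
    """Answer the kind of query an API endpoint might offer."""
    return sorted(row["town"] for row in index.get(country, []))


if __name__ == "__main__":
    index = build_index(BULK_DUMP)
    print(towns_in(index, "Exampleland"))  # prints ['Rivertown', 'Springfield']
```

The reverse does not hold: a client that can only call something like `towns_in` one country at a time has no guarantee of ever reconstructing the full dump.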

Thus, having an API is not a requirement for data to be open – though it is, of course, great if one is provided.

Moreover, it is perfectly fine for someone to charge for access to open data through an API – as long as they also provide the data for free in bulk. (Strictly speaking, the requirement isn’t that it’s free but that it’s no more than the extra cost of reproduction. For online download, that’s very close to free.) This makes sense: open data must be free but open data services (such as an API) can be charged for and this provides one of the most immediate business opportunities around open data.

Aside: what about real-time data that is changing all the time (think again of Twitter or live traffic information)? The answer here depends somewhat on the situation but for open real-time data one would imagine a combination of bulk and some way to get rapid or regular updates. For example, you might provide a stream of the latest X items that is available all the time and a bulk dump of a complete day’s data every night.
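
One possible shape for that combination is sketched below: a fixed-size buffer holds the latest X items, while everything published during the day is kept for the nightly dump. The class name `RealTimeFeed`, its methods and the JSON Lines dump format are assumptions chosen for the example, not a prescribed design.

```python
import json
from collections import deque


class RealTimeFeed:
    """Keep the latest X items for streaming plus a full log for the daily dump."""

    def __init__(self, latest_size):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.latest = deque(maxlen=latest_size)
        self.todays_items = []

    def publish(self, item):
        """Record a new item in both the live stream and the daily log."""
        self.latest.append(item)
        self.todays_items.append(item)

    def latest_items(self):
        """Return the most recent items, oldest first."""
        return list(self.latest)

    def nightly_dump(self):
        """Return the complete day's data as JSON Lines and start a new day."""
        dump = "\n".join(json.dumps(item) for item in self.todays_items)
        self.todays_items = []
        return dump


if __name__ == "__main__":
    feed = RealTimeFeed(latest_size=2)
    for minute in range(3):
        feed.publish({"minute": minute, "delay_seconds": 30 * minute})
    print(feed.latest_items())  # only the two most recent items
    print(feed.nightly_dump())  # all three items, one JSON object per line
```

Consumers who need rapid updates read the short stream, while anyone who wants the full record can rely on the nightly files, so bulk access is preserved.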

Further Information