If data, contextualised, is information, then how might information professionals put their skills to work in opening up access to data and supporting their patrons' use of it? On the one hand, we have questions of access; on the other, sensemaking around the data. Let's look at each in turn.
Data accessibility starts with discoverability - actually locating a dataset that may contain the data you require - but it doesn't end there.
Once we've found a dataset, how are we to download the data file? Many publishers release tabular datasets as Excel spreadsheet files, which may contain multiple sheets; others release data using simple text-based CSV files.
The latter have the advantage of generally being easy to open, although the file encoding used to save the file (particularly from Windows office applications) may cause problems with accented characters.
A downside of CSV files is that they do not contain information about what sort of thing each data column represents (whether it's a date column, for example, or a numerical one), although steps are being taken to address this by bundling CSV data files with additional metadata files (for example, the ONS are exploring the use of the tabular data package).
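A minimal sketch of the problem in Python with pandas (the sample data and column names are made up): read naively, a CSV date column is just strings, and the reader has to supply the type information that the file itself doesn't carry.

```python
import io

import pandas as pd

# A CSV file is just text: nothing in it says that "date" holds dates
# or that "value" holds numbers. (Hypothetical sample data.)
csv_text = "date,value\n2014-01-01,100\n2014-02-01,120\n"

# Read naively: pandas treats the date column as plain strings.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive["date"].dtype)   # object, i.e. strings

# Read with explicit type information, of the sort a bundled
# metadata file could supply.
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(typed["date"].dtype)   # a proper datetime type
```

Something like the tabular data package formalises exactly this: a small JSON file alongside the CSV declaring each column's type, so tools don't have to guess.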
Many datasets need cleaning before they can be used - so-called "dirty data" may include typographical errors, arbitrary use of abbreviations (Limited, Ltd and Ltd. are not the same thing...) or date formats (10th March, 2014, 3-10-14 and 10/03/14 may all refer to the same date, but unless you treat them as dates they'll be seen as distinct values). £1,249 million may be an amount, but it's not a number we can do sums with (unlike 1249000000, which we can easily work with).
Others need reshaping before we can work with them effectively. A spreadsheet in a wide format, with a separate column for each month's figures, may be easier to work with if the data is cast into a long form with one column identifying the month and a second containing the actual figure.
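The wide-to-long recast can be done in one step with pandas; the regions and figures below are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per month.
wide = pd.DataFrame({
    "region": ["North", "South"],
    "Jan": [10, 20],
    "Feb": [12, 18],
    "Mar": [15, 22],
})

# Cast into long form: one row per (region, month) pair,
# with the month as data rather than as a column heading.
tidy = wide.melt(id_vars="region", var_name="month", value_name="figure")
print(tidy)
```

The long form is what most plotting and grouping operations expect: "figures for February" becomes a filter on the month column rather than a choice of column name.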
Another important question to ask when publishing data relates to the granularity of the datasets you share. It's all very well publishing monthly datasets as separate files, but what if someone wants to do a time series analysis over a period of months? Is it up to them to download all the separate data files (or do you only publish "this month's data"?) and aggregate them themselves?
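If the answer is "yes, it's up to them", the user's first task looks something like the following sketch (the file layout and contents are hypothetical): fetch every monthly file and stitch them back into one time series.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical: a publisher releases one small CSV per month,
# e.g. data-2014-01.csv, data-2014-02.csv, ...
# (Here we fabricate the files in a temporary directory.)
tmpdir = tempfile.mkdtemp()
for month, value in [("2014-01", 100), ("2014-02", 120), ("2014-03", 90)]:
    with open(os.path.join(tmpdir, f"data-{month}.csv"), "w") as f:
        f.write(f"month,value\n{month},{value}\n")

# The user's aggregation step: read every monthly file and
# concatenate them into a single frame for time series analysis.
paths = sorted(glob.glob(os.path.join(tmpdir, "data-*.csv")))
monthly = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(monthly)
```

It works, but it's busywork the publisher could have spared every downstream user by also offering the aggregated series.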
One approach is to make the whole of a dataset available as a single database file (indeed, this is encouraged by pure open data definitions); another is to allow users to make their own queries against your database and then export just the data they want.
On sites like the Electoral Commission's PEF Online (party and election funding) website, an empty search returns all the data that can be downloaded as a CSV file.
And if you really want your data to be used? Firstly, make sure the licence conditions are clear (public sector bodies should use the OGL, the Open Government Licence, wherever possible; for others, a good place to start is the list of conformant licences on the OpenDefinition website). Secondly, try to make your data linkable by including identifiers (where appropriate) drawn from standard identification schemes; for example, when talking about companies, try to identify them by company number; if talking about publications, include ISBNs or DOIs.
Making sense of data
Once access to a dataset has been gained, the question naturally arises as to how to work with it.
Accessibility issues around data filetypes, formats, and shapes may make it easier or harder for a user to work with the data in an analysis tool of their choice. Supporting tool choice - and the adoption of tools that facilitate good working practices and effective workflows - is an important consideration when thinking about how data will be used.
For many people, that tool is likely to be Excel. In the enterprise, data warehouses allow users to interrogate data cubes and look at garish charts to their heart's content. But even so, many users will still download the dataset and open it as a spreadsheet, where they can torture it using formulas and macros of their own devising. (This creates a whole other set of data management and data quality issues, but we won't go there in this post!)
If you require a more powerful environment, statistical analysis packages such as Stata or SPSS are the tools of choice for many. Increasingly popular tools for working with data include the R and Python programming languages.
Using programming tools to work with data means that you can create "reproducible reports" that explicitly describe every operation applied to a dataset, in the order they are applied, along with the results of those operations. Productivity tools such as RStudio and IPython notebooks both integrate with workflows that allow documents to be published in a variety of formats, from PDF documents to HTML pages, the latter including interactive charts generated automatically alongside "flat" chart images intended for print.
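A minimal sketch of the idea in Python (the sales figures are invented): every operation applied to the dataset sits in the script, in order, so re-running it regenerates the same results - and the reported numbers are computed from the data rather than typed in by hand.

```python
import io

import pandas as pd

# Stand-in for a source data file (hypothetical figures).
raw = "month,sales\n2014-01,100\n2014-02,120\n2014-03,90\n"

data = pd.read_csv(io.StringIO(raw))     # 1. load the data
data["change"] = data["sales"].diff()    # 2. derive month-on-month change
mean_sales = data["sales"].mean()        # 3. summarise

# 4. the reported figure is generated from the data itself
print(f"Mean monthly sales: {mean_sales:.1f}")
```

In a notebook environment, each of those steps would sit in its own cell with its output displayed directly beneath it, and the whole document can be exported to PDF or HTML.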
It's often said that to become a good writer you should read a lot. When data is reported on, visualisations and charts are often used to represent it. But how skilled are we at actually reading charts, let alone creating them?
Most people are introduced to the holy trinity of pie charts, bar charts and line charts in primary school - and then their chart literacy education stops.
One of my favourite data books - the admittedly niche interest, if exquisitely titled, Making Sense of Squiggly Lines - spends almost 150 pages describing how to read, in detail, a very small handful of chart types. Whether or not the subject matter interests you, the attention to detail that comes across when trying to read these charts - and learn something from them - is illuminating.
At the moment, there is a skills gap when it comes to reading charts and data visualisations. We - the library? - need to educate users in how to read data charts produced by others more effectively. All too often, audiences may be misled by shiny data rhetoric and largely meaningless infographics. But education will also help learners better understand and interpret the charts they may generate as part of their own analyses.
In terms of data education more broadly, I favour developing a "conversations with data" skills-based approach: one that gives people the skills they need to manipulate a dataset so that they can actually use the analytical and data visualisation tools that help them better understand, or ideally gain insight from, that dataset.
This education should include power tools developed for working with data, rather than the general tools that are packaged as office suite applications.
At The Open University, we're currently writing a third year equivalent course that seeks to provide a broad, if technical, introduction to data analysis and management, using IPython notebooks and the pandas data wrangling library as both an instructional medium and a hands-on environment.
At the School of Data, we produce short-form online hands-on courses and tutorials covering a wide range of data wrangling activities and tasks, as well as delivering face-to-face training sessions across the world.
So what can libraries do to help? In many organisations, where do you go to ask for help on data matters? If not the library, then why not? I think libraries can support users in two main ways:
- supporting accessibility: identify good quality open data information sources in appropriate areas and link to them. When it comes to an organisation publishing its own data, work with the publishers to make sure their data is accessible.
- supporting sensemaking and education: statisticians and analysts can do the statistics, but there are plenty of skills gaps when it comes to working with data. On the one hand, literacy in reading charts and graphics, even familiar forms such as line charts and bar charts; on the other, using power tools for working with data, or at least raising awareness of what tools exist and the sorts of thing they can do. This extends not just to the task of data analysis, but also to managing the workflows associated with guaranteeing data quality and provenance, as well as reproducibility, reporting and archiving.
Data is often the first step on the road to creating information, and increasingly the data is out there - already collected and packaged - just waiting for its secrets to be unlocked and its stories revealed.
Surely, as information professionals, we should be able to help?
How else can library and information professionals help open up access to data? Share your thoughts in the comments below.