Wednesday, August 6, 2014

The Six Challenges of Census Data

Recently I have been interpreting census data as part of the preparation for a presentation I will be giving later in the year surrounding Library New Grads and the future of the library workforce. As I am sufficiently familiar with TableBuilder, a tool for extracting customised information from the census, it seemed like a straightforward enough task to analyse information on professional level library staff.

How little I knew.

The first warning came when I broke the Librarian data up into five-year age categories. In the 2006 census 98 people aged 15-19 had their occupation categorised as 'Librarian'. Should you not be familiar with the library world, this is a professional occupation that almost invariably requires a degree or postgraduate qualification recognised by a professional body, ALIA, much as is the case for accountants and other professionals.

I don't think it's unreasonable to be slightly sceptical of the accuracy of this particular number. In a group of 10068 it's not huge, but it was the first sign that this data might not be quite as straightforward as I initially thought.
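To put "not huge" into numbers, here is a minimal sketch of the arithmetic. It assumes that 10068 is the total number of people coded as 'Librarian' in 2006, which is how I read the paragraph above; the variable names are only illustrative.

```python
# Figures quoted above for the 2006 census.
teenage_librarians = 98   # people aged 15-19 coded as 'Librarian'
all_librarians = 10068    # assumed to be the total coded as 'Librarian'

# Share of the group that falls in the implausible age band.
share = teenage_librarians / all_librarians
print(f"{share:.2%} of Librarians were aged 15-19")
```

Just under one percent: small enough not to swamp the totals, but large enough to suggest something systematic rather than a handful of stray forms.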

Things got more interesting when I discovered that the Teacher Librarians had gone MIA.

For reasons I will explain below, I broadened the scope of my analysis. The problems multiplied roughly in proportion to this. Initially this was very frustrating but it rapidly became fascinating in its own, potentially relevance-challenging way.

The Six Challenges of Census Data (So Far)

1. Free-text Census Questions

Most of the work I have done relates to the question of occupation. The 2006 census paper's question on this looks like so:

Whilst the sheer range of occupations in existence makes the necessity of asking the question in this way clear, a free text field that ultimately produces categorised numerical data carries a significant risk of error even before points 2 and 3 are introduced.

2. User error

What exactly goes into that occupation box when a particular person fills it in? The accuracy is very difficult to control. 'Occupation' is also a term that can be interpreted in a considerably different way to 'Job'. I could argue that I have been a professionally recognised Librarian and in that career path for considerably longer than my title has contained the word 'Librarian'. This might have been stretching the truth and rather optimistic but not all that difficult to justify even if it misses the spirit of the question. (Just in case anyone from the ABS stumbles across this, I didn't. It would have been easy though, and virtually undetectable.)

3. Re-categorisation to a standard classification

Collecting statistics for each variant answer in the 'occupation' box would be messy and produce fairly meaningless data. Consequently a classification system is used - ANZSCO in both 2006 and 2011.

Widely varied responses must be fitted into the classification based on the listed occupation; here, take note of the question's first explanatory point, 'give full title'. In 2004 an ALA survey found 37 common job titles for library support staff and numerous less common titles, and it is more than likely that Australia has a similar range. Many are not obviously library jobs, so they could be categorised all over the occupational spectrum.

It's not too much of a stretch to think that job titles in other industries might create inbound interference on top of this.

I strongly suspect that aspects of points 2 and 3 are responsible for the large number of teenage Librarians.
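To illustrate how points 2 and 3 can combine, here is a toy sketch of coding free-text answers into categories. The keyword rules and the `code_occupation` function are entirely hypothetical; this is not how the ABS codes responses, only a way of showing how the wording on the form can decide the category.

```python
# Hypothetical keyword rules, checked in order. NOT the real ABS procedure.
RULES = [
    ("teacher", "School Teachers"),
    ("librarian", "Librarians"),
    ("library technician", "Library Technicians"),
    ("library assistant", "Library Assistants"),
]


def code_occupation(free_text):
    """Return the first category whose keyword appears in the answer."""
    text = free_text.strip().lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return "Unclassified"


answers = [
    "Librarian",
    "Junior librarian (casual shelver)",  # self-described, not a professional role
    "Teacher Librarian",
    "Information Services Officer",       # a library job with no 'library' in the title
]
for answer in answers:
    print(f"{answer!r} -> {code_occupation(answer)}")
```

Under these made-up rules a teenager who writes 'junior librarian' is counted as a Librarian, while a library worker with a less obvious title is not counted as library staff at all.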

4. Clumped professions, or: Night of the Vanishing Teacher Librarians

One night while crunching some numbers I discovered that the Teacher Librarians had gone AWOL.

When using a guide to ANZSCO to work out where Librarians might be in TableBuilder (harder than you might expect), I found a page explaining 'Unit Group 2246 Librarians'. Partway down, it explains that 'Teacher-Librarians are included in Minor Group 241 School Teachers', a group which, on investigation, is only further divided by the category of school. I know there are Teacher Librarians in there. Somewhere.

5. Classification system modifications

By this point these issues, several other data issues and points raised in discussion had led me to broaden my analysis to include other library occupations. Two were available.

Library Assistants were nice and straightforward.

Library Technicians were another matter entirely. In 2006 there was a category titled 'Library Technicians', but in 2011 the census used a new edition of ANZSCO and the category became 'Gallery, Library and Museum Technicians'. Unsurprisingly, while a decline of around 3% is observed in the Librarian and Library Assistant categories, this one has grown, acquiring two more industries' worth of technicians. The trends between 2006 and 2011 behave in a way that is consistent with the other categories; however, an interesting anomaly relative to the national figures, observed in the Library Technician data for South Australia in 2006, has disappeared. I am totally unable to determine if this disappearance is real or a result of the classification changing.
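The comparison problem can be sketched as a simple scope check. The category names come from the two censuses as described above, but the sets of occupations inside each category are my own simplified guesses from the names alone, not taken from ANZSCO documentation.

```python
# Assumed scope of each category, guessed from the category names only.
SCOPE_2006 = {
    "Librarians": {"librarian"},
    "Library Assistants": {"library assistant"},
    "Library Technicians": {"library technician"},
}
SCOPE_2011 = {
    "Librarians": {"librarian"},
    "Library Assistants": {"library assistant"},
    "Gallery, Library and Museum Technicians": {
        "gallery technician",
        "library technician",
        "museum technician",
    },
}

# Which 2006 category I would naively line up with which 2011 category.
PAIRS = [
    ("Librarians", "Librarians"),
    ("Library Assistants", "Library Assistants"),
    ("Library Technicians", "Gallery, Library and Museum Technicians"),
]


def is_like_for_like(old, new):
    """True only if both categories cover exactly the same occupations."""
    return SCOPE_2006[old] == SCOPE_2011[new]


for old, new in PAIRS:
    extra = sorted(SCOPE_2011[new] - SCOPE_2006[old])
    status = "comparable" if is_like_for_like(old, new) else f"not comparable, adds {extra}"
    print(f"{old} -> {new}: {status}")
```

Any change between 2006 and 2011 in the third pair mixes a real change in Library Technician numbers with the newly added occupations, and the published totals alone cannot separate the two.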

6. Data availability

The last issue is the availability of census data. I am unable to find the information I need in standard ABS releases so must use TableBuilder. At present only 2006 and 2011 data is available. While I believe I see a trend that is consistent across all levels of library staff, I cannot see if this is the continuation of a long-term trend, a new development or perhaps even the repetition of a cycle within the workforce.

Another census will be held in 2016. At some point after this I should be able to add another set of figures and start to answer this question. Until then I must, with the aid of prior studies and related papers and reports, make an educated guess.

In Conclusion

Teacher Librarians have gone MIA, Gallery and Museum Technicians invaded in 2011 and the most personally valuable lessons and insights from this exercise might be from the process rather than the outcome.

What appeared to be a fairly straightforward interpretation of data from a large and respected source tested not only my ability to retrieve, analyse and interpret data but also my ability to spot the potential for errors and work out which errors were significant, which could be adjusted for and which could not.

This has reinforced that even the most authoritative, objective and thorough sources of information are fallible. While I believe I will still be able to complete a useful analysis with the scale of data I wished to work with, factoring in these issues is going to be a challenge in itself.

P.S. Reviews and crafts will return one day. Right now this analysis is taking up the time I might spend on those.


  1. I worked in the data processing centre on the 2006 census - it actually triggered my interest in information management - and one of our jobs was to code professions that the computer didn't know how to. So for something like librarian, the computer would - unfortunately - classify that as specifically as it could to a particular ANZSCO number, probably. But for something like 'manager' or 'project manager', a person gets involved - that was me! (and my team) - so we'd get to look at snippets of the census form that may also help us be more specific about someone's _actual job_ - their daily duties, and a couple of other fields I forget what they were... and then if it still wasn't enough information, we could either code it as 'unclassifiable', or as a 10000 type level job, or we could - rarely - access someone's entire form to see if we could glean more info. So now that I know more about librarians, I'd probably code them differently - or treat it as generically as 'manager' - but I don't know what the Census Machine's default is unfortunately.

    1. While on reflection I can see how the complexity would be incredible, I hadn't expected it - it was a really interesting journey of discovery! Job market changes make it difficult but it'd be fantastic if the data were treated more consistently between censuses - at least from the point of view of someone trying to interpret a fairly specific slice of data. Although I've slowed down a bit on this project as I no longer have the same deadline, I keep on learning more from it :)