Showing posts with label information overload. Show all posts

Wednesday, August 6, 2014

The Six Challenges of Census Data

Recently I have been interpreting census data as part of the preparation for a presentation I will be giving later in the year surrounding Library New Grads and the future of the library workforce. As I am sufficiently familiar with TableBuilder, a tool for extracting customised information from the census, it seemed like a straightforward enough task to analyse information on professional level library staff.

How little I knew.

The first warning came in the data breaking Librarians up into five-year age categories. In the 2006 census, 98 people aged 15-19 had their occupation categorised as 'Librarian'. Should you not be familiar with the library world, this is a professional occupation that almost invariably requires a degree or postgraduate qualification recognised by a professional body, ALIA, much as is the case for accountants and other professionals.

I don't think it's unreasonable to be slightly sceptical of the accuracy of this particular number. In a group of 10068 it's not huge, but it was the first sign that this data might not be quite as straightforward as I initially thought.
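For a sense of scale, here is a quick back-of-the-envelope check. It is only a sketch: it uses the two figures quoted above and assumes that 10068 is the total number of people counted as 'Librarian' in 2006. The variable names are mine, not anything from TableBuilder.

```python
# Figures quoted in this post (2006 census, occupation 'Librarian').
teen_librarians = 98      # people aged 15-19 categorised as 'Librarian'
all_librarians = 10068    # assumed to be the whole 'Librarian' group

# Share of the group made up of 15-19 year olds.
share = teen_librarians / all_librarians
print(f"{share:.2%} of Librarians were aged 15-19")
```

That works out to a little under one in a hundred: small, but not nothing for an age group that should be close to empty.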

Things got more interesting when I discovered that the Teacher Librarians had gone MIA.

For reasons I will explain below, I broadened the scope of my analysis. The problems multiplied roughly in proportion to this. Initially this was very frustrating but it rapidly became fascinating in its own, potentially relevance-challenging way.

The Six Challenges of Census Data (So Far)


1. Free-text Census Questions

Most of the work I have done relates to the question of occupation. The 2006 census paper's question on this looks like so:


Whilst the sheer range of occupations in existence makes the necessity of asking the question in this way clear, a free-text field that ultimately produces categorised numerical data has a significant risk of error even before points 2 and 3 are introduced.


2. User error

What exactly goes into that occupation box when a particular person fills it in? The accuracy is very difficult to control. 'Occupation' is also a term that can be interpreted in a considerably different way to 'Job'. I could argue that I have been a professionally recognised Librarian and in that career path for considerably longer than my title has contained the word 'Librarian'. That would have been stretching the truth and rather optimistic, but not all that difficult to justify, even if it misses the spirit of the question. (Just in case anyone from the ABS stumbles across this, I didn't. It would have been easy though, and virtually undetectable.)


3. Re-categorisation to a standard classification

Collecting statistics for each variant answer in the 'occupation' box would be messy and produce fairly meaningless data. Consequently a classification system is used - ANZSCO in both 2006 and 2011.

Widely varied responses must be fitted into the classification based on the listed occupation; here, take note of the question's first explanatory point, 'give full title'. In 2004 an ALA survey found 37 common job titles for library support staff and numerous less common titles, and it is more than likely that Australia has a similar range. Many are not obviously library jobs. These could be categorised all over the occupational spectrum.

It's not too much of a stretch to think that job titles in other industries might create inbound interference on top of this.

I strongly suspect that aspects of points 2 and 3 are responsible for the large number of teenage Librarians.


4. Clumped professions; or, Night of the Vanishing Teacher Librarians

One night while crunching some numbers I discovered that the Teacher Librarians had gone AWOL.

When using a guide to ANZSCO to work out where Librarians might be in TableBuilder (harder than you might expect) I found a page explaining 'Unit Group 2246 Librarians' where, partway down, it explains 'Teacher-Librarians are included in Minor Group 241 School Teachers' which, on investigation, is only further divided by the category of school. I know there are Teacher Librarians in there. Somewhere.

5. Classification system modifications

By this point in my analysis these issues and several other data issues and points raised in discussion led me to broaden my analysis to include other library occupations. Two were available.

Library Assistants were nice and straightforward.

Library Technicians were another matter entirely. In 2006 there was a category titled 'Library Technicians', but in 2011 the census used a new edition of ANZSCO and the category became 'Gallery, Library and Museum Technicians'. Unsurprisingly, while a decline of around 3% is observed in the Librarian and Library Assistant categories, this one has grown, acquiring two more industries' worth of technicians. The trends between 2006 and 2011 behave in a way that is consistent with the other categories. However, an interesting anomaly from the national figures, observed in the Library Technician data for South Australia in 2006, has disappeared. I am totally unable to determine if this disappearance is real or a result of the classification changing.


6. Data availability

The last issue is the availability of census data. I am unable to find the information I need in standard ABS releases so must use TableBuilder. At present only 2006 and 2011 data is available. While I believe I see a trend that is consistent across all levels of library staff I cannot see if this is the continuation of a long-term trend, a new development or perhaps even the repetition of a cycle within the workforce.

Another census will be held in 2016. At some point after this I should be able to add another set of figures and start to answer this question. Until then I must, with the aid of prior studies and related papers and reports, make an educated guess.


In Conclusion

Teacher Librarians have gone MIA, Gallery and Museum Technicians invaded in 2011 and the most personally valuable lessons and insights from this exercise might be from the process rather than the outcome.

What appeared to be a fairly straightforward interpretation of data from a large and respected source tested not only my ability to retrieve, analyse and interpret data but also my ability to spot the potential for errors and work out which errors were significant, which could be adjusted for and which could not.

This has reinforced that even the most authoritative, objective and thorough sources of information are fallible. While I believe I will still be able to complete a useful analysis with the scale of data I wished to work with, factoring in these issues is going to be a challenge in itself.



P.S. Reviews and crafts will return one day. Right now this analysis is taking up the time I might spend on those.

Tuesday, March 26, 2013

Reading too much

I've got some good news and some bad news - I'm back, but my good friend Google Reader is dying. Let me tell you the story of my relationship with Google Reader. It's a bit long, but I promise this is all going somewhere.

Warning: some self evaluation ahead.

I was never really away, but mobile data speed, format, reliability and bandwidth were very restrictive. I managed to stay more or less in touch on Facebook, read important emails and do critical things like pay the rent, but not a whole lot else. Naturally this means that things built up in a big way. It's happened before, I've coped before. It's frustrating but nothing that time and patience won't solve.

But then it got interesting. About halfway through the wait for my internet connection I heard that Google Reader will be retired in June.

Initially I was horrified. I use this thing daily on both my PC and my phone. I follow 152 RSS feeds which range from near dead to high volume updaters. Before I started using Google Reader it was a nightmare to keep track of my blogs - and there were considerably fewer of them then.

I wasn't completely disconnected from Google Reader while I was waiting for my internet connection, but I was only able to read a tiny minority of posts, focusing on webcomics and text-only blogs; otherwise I risked running up a large phone bill. This meant that when I turned on my PC with the new internet connection up and running I saw the dreaded new post count list - 1000+. After that Google Reader stops giving you a precise count. I don't know how many were there exactly but the Android app had been giving me that number for quite some time.

I decided to put off other things and start tackling what I saw as the problem. I started by reading the webcomics. They're always my first stop in Google Reader and the site's learnt that - it always lists them first on the home screen no matter what is most recent. After that I started at the newest post and just started working my way down, but I wasn't really getting anywhere. I entirely cleared a few blogs that are only occasionally interesting and then picked the two largest post count offenders. These were Craft Gossip and Laughing Squid.

Craft Gossip is a huge blog that aggregates craft tutorials and news from all over the internet. I blitzed through it pretty quickly, considering there were over 500 unread posts. I starred maybe twenty of the most interesting posts for later reference and then moved on.

Laughing Squid is another aggregator, this time of news, memes and all sorts of tech/geek focused bits and pieces from all over the Internet. It started at just under 600 unread posts. Sorting this was considerably harder. There are a lot of videos and links onward, and a much higher proportion are of interest. I watched short videos I was interested in - those up to maybe three minutes (mostly cats) - but more of them, all the way up to two hours, I starred to watch when I have more time. I also starred a number that I intend on sharing with friends later. I can only put so many on Facebook at once without bordering on spam, after all. Eventually I looked at the clock - and it was pretty horrifying. What's worse, there were still over 300 Laughing Squid posts unread, and the total was still at 1000+.

At this point I began to realise that falling behind and the demise of Google Reader might not be the issue here. I actually wrote some dot points for this article because I couldn't stop thinking about it. This didn't really help the getting to bed problem.

When I woke up in the morning, feeling somewhat average and having trouble waking up, I picked up my phone and started again from there. Just from the top, not from any particular blog. In just under an hour I knocked off all of the posts that had appeared overnight by either starring to read later or discarding. I began to work on getting the numbers down again.

By the time I raced out of the house with only a banana for breakfast and not even time for a glass of water let alone a coffee, there were 858 posts.

Right now, 5:51 p.m., my webcomics have been read and there are 914 unread posts remaining - 297 of which are Laughing Squid and 34 of which are Craft Gossip.

This is becoming a chore. I'm not enjoying it. I'm at risk of falling out of some good patterns I've been able to set for myself since moving in, and there are still hours of reading ahead. I'm not getting to enjoy the books I've borrowed either. It is consuming my life, my energy and more.

Possibly worse, all of those starred articles, recipes, craft tutorials and videos are still waiting. It's not just the ones I've done in the last 24 hours either. I have been building those up for more than a year. There must be thousands. Every now and then I watch a few but the list grows far faster than I deal with it.

There are other readers out there. Google Reader is not unique. The Old Reader appears to be particularly similar - I can even import my subscriptions. I cannot, however, find a way to import my list of starred items. When Google Reader goes, they go.

Preserving them can be done - either Evernote or OneNote would be a good candidate. As a bonus it'd be cloud based, not subject to the blogger's alterations and nicely organised. But like catching up with these posts and watching the starred videos, it would be hours and hours of work and an absolutely vast volume of data to capture it all.

I started thinking about the hours and hours (heck, weeks) that catching up, backtracking to see things I've put aside for later and backing up the parts I want to keep might take. The picture below is a reasonable approximation of how all this had me feeling. The article from Naldz Graphics that it's from is getting at the problem here.


This is not freshly out of control; it's just that right now it's ramped up so much further that it's really struck me. I'm spending amounts of time bordering on insanity just trying to keep up with this. It's getting backlogged more and more with time. It seems a bit ironic that someone whose business is information management has such a bad case of out-of-control information overload.

I think I spend more time reading craft tutorials than I spend crafting and I have a vast list of tutorials waiting for me. And even if I tried a new saved recipe every meal it'd take me months to try all the ones I've saved.

So instead of thinking about how daunting this is and wishing that Google Reader was staying around I choose to look at it as an opportunity. I'm clearing this thing up and retaking control.

Make no mistake. I'm not going to stop reading blogs. There's some great professional reading in there, I look forward to a number of webcomics, there are some really inspiring crafters blogging and when this all started out I was really enjoying myself.

But some of these have to go.

Craft Gossip, you've shown me all kinds of amazing things. But I can't keep up with five new craft projects added to my to do list every day even if I don't spend all that time reading about them. Those five are less than ten percent of the posts I trawl from this subscription.

Laughing Squid, the content linked is really wonderful stuff - funny, interesting, thought-provoking... but the volume of it is so vast that most of the interesting stuff winds up being saved for a later that never comes. Part of me is sad to lose the fantastic content but I just can't take this.

I'm not just talking about removing things either. As I type this line at around 6:15 those two are gone. But I'm not done yet. There are still 584 unread posts on the list. Some represent blogs that will stay. Many do not.

I estimate this has already halved the rate of incoming posts. That estimate may be conservative.

So although it'll take a little time to sort the rest out, it's probably less time than it'd take to read them. I'm about to get started. The fitness blog that occasionally posts an interesting recipe but mostly posts workouts I'm not interested in? The mommy blogs that posted one craft tutorial I liked over a year ago? The free vintage graphics feed that usually gets the 'mark all posts read' treatment? More smaller aggregators that still churn out a dozen or more posts a day? Other blogs that have mostly become something to click past?

It's time to say goodbye.

So losing Google Reader is not a tragedy. And being partially offline for more than two weeks might have been frustrating, but it wasn't blinding me to the world.

All of this has given me an opportunity to start over. To clean out, take control and stop losing hours and hours of every day.

When I'm done I should have a manageable list of the things I most want and need to read about. Taking back control feels pretty good.

Email subscriptions, I've got my eye on you.






P.S. I asked a friend to read this over to make sure it made sense to others too. While he did that, I kept working. I'm down to 402, with more than twenty listings gone. This doesn't clear the lists of saved articles, but I'll get onto that soon, probably just deleting most of them. I'll tame this beast yet.