December 7, 2006

My Job in 10 Years: Collections: Further Thoughts on Abstracting & Indexing Databases

To recap:

My Job in 10 Years:

(PDF version for printing here.)

The last one of these posts from way back in the fall of 2005 sparked a bit of response, in the comments and in an email from Roddy Macleod of EEVL, taking issue with my implication that traditional fee-based bibliographic databases are going the way of the dodo bird. The main bone of contention seems to be based around the value of subject-based indexing and thesauri provided in bibliographic databases versus the lack thereof in that most popular of free web search engines: Google.

Before I go too much further, I think I should clarify exactly what I’m talking about when I say Google / Google Scholar. I tend to use the lower case version, google, almost as one would say kleenex for tissue. I mean specialized and general freely available search engines that can be used for scholarly research: Google Classic, Google Scholar, the new kid on the block, Windows Live Academic, and all the others that exist now and will exist in the next 10 years. Let’s just call it googlesoft.

One of the interesting things about speculating about the future is that everyone has a different take on what’s in store. Of course, this is half the fun: if we’re all thinking about the future at the same time, we get to bounce ideas off each other and hopefully change and grow our own conceptions. There are also, I guess, two ways to speculate about the future: first, how we think things are going to turn out and second, how we would like things to turn out. Dystopian versus utopian, in a way. I guess many have viewed my speculations as quite dystopian and I can live with that. Certainly, I view our roles as professional librarians as evolving quite a bit over time, in particular from being a group that had a definite service to offer that the customers really had no choice but to use (the situation up until the early 1990s in many ways) to being a group that has to make a case for its usefulness to its potential customers (i.e., the net generation, millennials, whatever you want to call them). This group of potential customers has lots of choices about how they are going to do the research they need, both for their courses and, for faculty members and grad students, for the work that makes up their theses and research. And we must not forget that today’s connected millennials are going to be the new faculty members in that magic 10 years. Certainly the habits and expectations they have now will manifest themselves in their new, adult roles. If anything, they will be intensified. Current faculty members are certainly attached to the system of journals and conferences, to publishing monographs, to the apparatus of scholarly publishing that we know. But will they retain this attachment, and will their new colleagues have that attachment at all? I suspect the answer to this question is that, no, they will not retain the same level of attachment to the journal/conference/monograph culture that we have grown used to. Or at least not in the same way as in the past, and particularly not the STEM crowd.

So, what does all this have to do with googlesoft?

It has to do with expectations of simplicity, it has to do with the desire to find rather than search, it has to do with convenience and most of all, it has to do with “good enough.”

Librarians place a high value on subject classifications, controlled vocabularies and all those parenthood issues. And they’re incredibly powerful tools to make our databases easier and more useful. But, when you get right down to it, formal human-generated subject classifications aren’t the only strategies for deciding what a particular document is about. There are informal human-generated classifications that can be useful (i.e., folksonomies) as well as automated text mining subject classification methods. That is certainly an active area of research and development and it’s not hard to imagine that a lot of progress is going to be made in the next decade. And certainly, these automated and informal methods are going to be an awful lot cheaper than formal human ones. That’s going to matter, because it’s also very important that googlesoft remain free to users. And text mining isn’t the only way to decide what a document is about. There is also relevance ranking via links, a popular method that googlesoft et al. already use to find the most relevant documents in a search. Will formal, human-created article metadata disappear completely in just 10 years? I doubt it, but we will definitely start to see the shift in that time frame. So, the first reason I think that subscription A&I databases are in trouble is that I believe that ultimately a “good enough” system of automated subject classification will be devised that will work in tandem with user-generated tagging and keyword assignment and, where necessary, human-generated formal classification. (Remember, I’m talking A&I services here, not book cataloguing, which I don’t think will be affected in the same way.)

The second reason is that I think our users like using the free ones, and that they will continue to like the ones that they’ve grown up with and used in elementary school and high school. They’re quick and easy, and they mostly return fairly relevant hits for most clearly defined topics. It’s just easier to find “good enough” and that is not necessarily a bad thing. And this is a trend that will only get more pronounced over time as the expectation comes around to quicker, easier, more integrated, more connected, more open. Our patrons will increasingly get addicted to those things long before we see them; it’ll happen in high school. It’ll be a huge challenge for the subscription database vendors to compete with googlesoft in the coolness, openness and ease of use categories. And it will be our job to make sure our students understand how to use these search engines effectively, just as it has been our job to make sure they use current products effectively.

Think for a minute. Compare the revenue and market capitalization of Google and Microsoft versus Elsevier and Thomson. (Take a look here.) Who has the resources to radically improve their products, to acquire metadata, to market and promote, to win this particular battle of free vs. fee?

Where are the publishers in all this? They want the best and widest distribution of the metadata for their publications. Whether OA or subscription-based, eyeballs looking at documents, creating impact, that is what is going to drive their business model. That is how they will justify themselves to their funders, be they governments, libraries, authors, whatever. The publishers are probably even now starting to realize that it really doesn’t matter if someone finds your document through INSPEC or Google Scholar, as long as they find it and recognize the value you as a publisher provide. Certainly, there have been studies that show that open access documents have a greater impact than non-OA; it would seem to follow that more widely available and searchable metadata would also have a greater impact for the author and publisher. Subscription A&I databases are potentially in trouble because content publishers will gladly distribute their metadata to anyone and everyone who wants it because it is impact that drives their business model.

Another thing that we must remember – as librarians, our loyalty is absolutely to our patrons, not the A&I or content publishers. Obviously, we want those organizations to do well enough to continue to be able to provide their products to us, but really that is our only interest in their survival as organizations. We value them for what they provide for our users (of course, it’s quite complicated here, as I certainly value and appreciate scholarly societies very differently than commercial publishers). Over the decades, the organizations that have helped us to provide products and services to our patrons have evolved and changed as we and our users continue to evolve and change. If today we spend a fraction on serials binding compared to 10 or 20 years ago, well, we make our decisions based on our needs and the needs of our users, not the needs of our vendors. As librarians interested in free and open access to scholarly output, we enthusiastically support the Open Access movement. Good quality free discovery tools are just as much a part of the goal of providing access to that output as good quality free journals. Just as I mostly don’t care what the business model is of companies that provide OA journals (i.e., scholarly society, commercial publisher, dedicated OA publisher, somebody in their basement), I also mostly don’t care what the business model is of companies that provide freely available search engines. Can an A&I company add enough value to the metadata to make it worth paying for, no matter what? Sure, look at SciFinder as a perfect example. Subscription A&I databases are potentially in trouble because librarians’ loyalty to them is contingent on the value they add to the information discovery process.

So, to sum up, my real goal is to serve my user community as best I can and if in the longer term I see an opportunity to maximize my expenditures on content or infrastructure by minimizing my expenditures on discovery tools, I will seize it. What’s the time frame for me to make this kind of shift? I think that in the next decade we will certainly start to see expenditures on A&I databases diminish as free alternatives get better and, more importantly, are perceived (by our users and, ultimately, by us too) as equivalent to the more expensive alternatives. The A&I databases that survive this shake-out will be the ones that find ways to very significantly add value to raw metadata.

As usual, I realize prognostication is a risky business at best and I may be proven completely wrong on all of this (maybe even tomorrow!), so all disagreement, debate, comments and feedback are appreciated, as a comment here or email to jdupuis at yorku dot ca.

Next up: Instruction. (Hopefully much sooner than 14 months.)


Roddy said...

I completely agree with you when you say: "It’s just easier to find “good enough” and that is not necessarily a bad thing." 'Good enough' is fine for many searchers, and Google etc is a fantastic service.

However, despite this, I don't think that the A&I services are going to go belly up in the near future.

For example, CSA records are now searchable via Google Scholar. If you subscribe to CSA, you can find results via Scholar, and often click through to the full text. This arrangement makes Scholar work better, and benefits subscribers to CSA.

CSA have also produced a new product called Illustrata -
This makes tables and figures embedded in journal articles searchable and accessible. This is a completely new idea, and worth checking out.

CSA now have a suite of connected/semi-connected products - PapersInvited, CoS, RefWorks, federated search, Ulrichs, etc. - which can add value to the basic search process.

I've used CSA as an example here, but the other services are developing in various ways as well.

These added value services may not be of much use to those who just want 'good enough'. But for others, they are often worth the expense.

So - Google is great, but A&I can be of service too.


John Dupuis said...


Points well taken. You have a point that there's a lot of innovation in the A&I industry going on. Similarly, I don't expect them all to run up a white flag. On the other hand, it'll be interesting to see if they can innovate faster than Google et al. I certainly paint a kind of "worst case" scenario here for the A&I services; only time will tell what really happens, if it's not as drastic as I imagine it could be or if it's even more so. As I'm writing these things, I usually ping pong in between "I'm being way too pessimistic here" and "Oh boy, things are going to change way more and in much stranger ways than I can imagine," never knowing which extreme I'll eventually settle on.

I see these thought experiments as a way of preparing myself for the change that does come, to be able to avoid pitfalls, dodge fads and seize opportunities. If it gets people thinking, myself included, I'm happy.

It's also worth noting that Google Scholar integrates with RefWorks, BibTeX and a whole bunch of other tools too.