Monday, October 1, 2012

From the bX data lab: Topic visualization and usage data

Experimenting and testing are an important part of our bX work. In our bX data lab, we work with data samples to check the quality of our services, improve them and evaluate new ideas. I recently conducted a small experiment on topic popularity and the relationships between topics and subtopics, an ideal subject for trying out some visualization software. Here are the results.

First, I took the 20 most used articles from each of the medical categories in the monthly Hot Articles indexes from March to July 2012 and extracted the articles’ medical subject headings (MeSH). Not all the articles had subject headings, so my final sample consisted of about 300 articles. I then looked for the most popular topic, which turned out to be “Obesity”. From all articles with this topic, I extracted all main subject headings and used Gource and ScreenR to create this little video, which visualizes the currently most popular aspects of Obesity.
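The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the bX data lab: the article identifiers, the sample headings and the function names are all made up for the example, and each heading is assumed to count once per article.

```python
from collections import Counter

# Hypothetical sample: article id -> list of MeSH main headings.
# The real sample had about 300 articles; these values are invented.
articles = {
    "a1": ["Obesity", "Diet"],
    "a2": ["Obesity", "Exercise"],
    "a3": ["Diabetes Mellitus", "Obesity"],
    "a4": [],  # articles without subject headings contribute nothing
}


def most_popular_topic(articles):
    """Return (heading, count) for the heading found in the most articles."""
    counts = Counter()
    for headings in articles.values():
        counts.update(set(headings))  # count each heading once per article
    return counts.most_common(1)[0]


def co_occurring(articles, topic):
    """Count the other headings on articles that carry `topic`."""
    counts = Counter()
    for headings in articles.values():
        if topic in headings:
            counts.update(h for h in set(headings) if h != topic)
    return counts


top_topic, top_count = most_popular_topic(articles)
related = co_occurring(articles, top_topic)
```

The `related` counter holds the headings that appear alongside the top topic, which is the kind of association the visualization is meant to show.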

Why Gource? Some time ago I watched Tony Hirst’s visualization of OpenURL referrals, made with Gource, and really liked the idea. Gource is open-source software available from its Google Code web page. It is actually a visualization tool for software version control and was not meant for my purpose, but it is quite easy to tweak it to do (almost) what I wanted. Gource requires a very simple input file in plain text format that can be created from a log file or from a spreadsheet. The format for the input file is:

<unix timestamp>|<User>|<Added or modified (A or M)>|<directory/file name>

I had to repurpose the input parameters, since my little project is about popular topic associations. For the user, I took a number. It was not really important which number, so I used a bX internal article number. Instead of a directory and file name, I used a heading and subheading. The latter I had to tweak, because Gource tends to show all labels on top of each other if there are no directories to separate them, so I grouped the headings and added a directory for each group. My visualization does not show how the topic changes over time; rather, it shows the topic associations that are popular at present. I therefore repurposed the timestamp as well and used it to make the directories appear sequentially, rather than all at the same time, for easier viewing. And that’s about it. The Gource commands are easy enough and can be run from the command line on a Windows computer. A nice help screen is available by running gource -H. To produce the video, I used ScreenR to capture it from the screen.
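To make the repurposing more concrete, here is a minimal Python sketch of how such an input file could be generated. It is an illustration under stated assumptions, not the script I actually used: the function name, the sample rows, the group names, the starting timestamp and the one-minute step between groups are all hypothetical, and the path layout (group/heading/subheading) is just one way of arranging the labels.

```python
def build_gource_log(rows, start=1349049600, step=60):
    """Build Gource custom-log lines from (article, group, heading, subheading).

    Assumptions for this sketch: `start` is an arbitrary Unix timestamp, and
    every new group is shifted by `step` seconds so that its directory shows
    up a little later than the previous one.
    """
    group_order = {}  # group name -> position in order of first appearance
    entries = []
    for article, group, heading, subheading in rows:
        offset = group_order.setdefault(group, len(group_order))
        timestamp = start + offset * step
        # The article number plays the role of the user; "A" marks an added file.
        path = f"{group}/{heading}/{subheading}"
        entries.append((timestamp, f"{timestamp}|{article}|A|{path}"))
    # Keep the log in ascending time order (Python's sort is stable,
    # so rows within one group keep their original order).
    entries.sort(key=lambda entry: entry[0])
    return [line for _, line in entries]


# Hypothetical rows: bX article number, heading group, heading, subheading.
sample_rows = [
    (101, "Therapy", "Obesity", "diet therapy"),
    (102, "Causes", "Obesity", "etiology"),
    (103, "Therapy", "Obesity", "surgery"),
]

log_lines = build_gource_log(sample_rows)
```

Writing `log_lines` to a text file, one entry per line, gives a file in the format shown above. Recent Gource versions should be able to read it with something like gource --log-format custom followed by the file name; check the help screen of your version for the exact options.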

My little experiment includes data from only a few months and for only one topic. But it does show that analysing topics based on popularity can be a very interesting exercise, especially if it draws on usage data for a much longer time period. Key topics of interest and the items that users view most will change over time, and a good visualization tool can show how interest in a topic, and its associations with other topics, changes and evolves. Visualization tools can also uncover trends and correlations and reveal 'hot' topics, both past and present.

A good example of uncovering connections and trends is Google Correlate, which shows correlations between different search keywords and real-life data. Another example is the Google Flu map, which estimates where the next flu outbreak will occur, all based on usage data. Unlike the Google data, our bX data is purely focussed on the use of scholarly material and can therefore identify trends in scholarly discourse. Fascinating stuff!
