
In previous posts (see here for my previous Bibtex posts) I have been talking about a research project that I have in mind. One of the things I have to do is download quite a few bibtex files from various journals. Easy, you might think, and you’d be right to think so. It should be easy, of course. Go to the journal home page, select the papers you want, hit the button to download those papers as a bibtex file, and you are done.

Except, it is not quite as easy as that. I purposefully won’t mention any journals or publishers but here are some of the issues that I have come across.

  1. Some of the journals/publishers are too restrictive in that you cannot specify exactly what you need. Instead, you have to click on a particular volume/issue and then download that. If you want a particular year (or volume), then you might have to download six, or twelve, separate files.
  2. Some journals/publishers will not allow you (at least as far as I can see) to download bibtex at all, only CSV (Comma Separated Values).
  3. As far as I know, bibtex is designed to be a text based system, yet when you download some files you get all sorts of whitespace and non-printable characters in there. At best these do not display correctly. At worst they stop you being able to process the files. For example, I use JabRef and, on occasion, it gives me errors as it cannot process the given text file. This is frustrating, to say the least, as you have to track down the problem and manually edit the file.

Dealing with the first point is easy, just much more time consuming than it could/should be. The second point is just frustrating. It means that you have to write another parser to convert from CSV (e.g. as exported for Excel) to bibtex. Not a major task, but an additional process that should not be necessary. The final point is the most frustrating as, in my mind, this should not be an issue. Perhaps I am wrong (and I am willing to be corrected) but the whole idea of bibtex (and latex actually) is that it is text based, and having strange characters in the file goes against the high level aims.
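On the second point, a converter of this kind can be quite short. The following is only a minimal Python sketch, not a general solution: the column headers in `COLUMN_TO_FIELD` are hypothetical (every publisher names its CSV columns differently), every row is assumed to be a journal article, and the citation keys (`entry1`, `entry2`, ...) are placeholders that you would probably want to replace.

```python
import csv
import io

# Hypothetical mapping from CSV column headers to bibtex field names.
# Real publisher exports use different headers, so adjust this to match.
COLUMN_TO_FIELD = {
    "Authors": "author",
    "Title": "title",
    "Journal": "journal",
    "Year": "year",
    "Volume": "volume",
    "Pages": "pages",
}


def csv_to_bibtex(csv_text):
    """Convert CSV text (one paper per row) into a string of @article entries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    for n, row in enumerate(reader, start=1):
        # Placeholder citation key; assumes every row is a journal article.
        lines = [f"@article{{entry{n},"]
        for column, field in COLUMN_TO_FIELD.items():
            value = (row.get(column) or "").strip()
            if value:  # skip columns that are missing or empty
                lines.append(f"  {field} = {{{value}}},")
        lines.append("}")
        entries.append("\n".join(lines))
    return "\n\n".join(entries) + "\n"
```

Note that this does nothing about characters that are special to latex (such as & or %) in titles, so the output may still need a little hand editing.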

Anyhow, to get around the third point you need to somehow clean the text file. You’d think this might be an easy task but a quick Google search shows that this is far from the truth. There is not one recognised way of doing it. If you are literate across many platforms/languages you might be able to go via Unix/Linux, perhaps with a little PHP, perhaps using sed, awk or some other editor. To the casual user, how to resolve this issue is far from obvious. Actually, I might be being a little harsh, as what is a corrupt file on one system might not be a corrupt file on another system. You only need to consider the way that carriage returns and line feeds are treated differently on Unix and Windows to start having some idea of the issues involved.
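Before cleaning anything, it can help to see exactly which characters are causing the trouble and where they are. Here is a minimal Python sketch that reports them. It is an illustration rather than a recognised method: the function name is made up, and the choice of what counts as "suspect" (Unicode control and format characters, plus the non-breaking space) is my own assumption about what typically upsets bibtex tools.

```python
import unicodedata


def find_suspect_chars(text):
    """Return a list of (line, column, code point) for each suspect character."""
    allowed = {"\r", "\t"}  # ordinary whitespace that should be left alone
    hits = []
    # Split on line feeds only, so that other control characters are
    # reported rather than silently treated as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in allowed:
                continue
            # "Cc" = control characters, "Cf" = invisible format characters
            # (e.g. zero-width space). The non-breaking space (U+00A0) is
            # included as an assumption, since it often looks like a space.
            if unicodedata.category(ch) in ("Cc", "Cf") or ch == "\u00a0":
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits
```

Reading the file with something like `open("papers.bib", encoding="utf-8", errors="replace", newline="").read()` (the file name is just an example) and passing the result to `find_suspect_chars` gives a list of positions to look at in an editor.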

So, what did I do?

I definitely wanted a Windows solution to save me having to transfer the file to other systems, write PHP programs etc. and I ideally did not want to have to download software, play around with regular expressions etc.

In the end I came up with a solution that is probably not that elegant but it is easy to do, requires no additional software and is relatively quick. Here is what I did.

  1. Open the bib file in Wordpad
  2. Copy the text into Excel
  3. Use the CLEAN() function to clean the text
  4. Copy the clean text
  5. Paste it into an empty bibtex file (e.g. if you open JabRef, you can paste into the empty pane)

Actually, when I was playing around with this, I could sometimes get away without going via Excel (i.e. open in Wordpad, then copy and paste into an empty bibtex file). I assume that the act of copying/pasting somehow strips off the strange characters. I would be interested to know why this sometimes happens, but that can wait for another time. At the moment, I am just pleased that I have some (semi-)automatic way of (semi-)cleaning up bibtex files that contain strange characters.
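For anyone who does not mind running a short script, the Excel steps above can be roughly approximated in Python. Excel's CLEAN() function removes the non-printing characters with ASCII codes 0 to 31. The sketch below is not an exact equivalent: it works on the whole file rather than cell by cell, so it keeps line feeds and tabs, it normalises Windows and old Mac line endings to a single line feed, and (as an assumption on my part) it also drops invisible Unicode format characters such as the byte order mark and zero-width space. The function name is hypothetical.

```python
import unicodedata


def clean_bibtex_text(text):
    """Strip control and invisible format characters, keeping newlines and tabs."""
    # Normalise CR LF (Windows) and lone CR to a single line feed.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for ch in text:
        if ch in ("\n", "\t"):
            kept.append(ch)  # keep the file's line structure and indentation
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # drop control ("Cc") and format ("Cf") characters
        else:
            kept.append(ch)
    return "".join(kept)
```

To use it, read the downloaded file with `newline=""` so that Python does not translate the line endings first, pass the text through `clean_bibtex_text`, and write the result to a new file rather than overwriting the original.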

 

3 Responses

  1. Hi Graham

    The best workflow I have found is to use JabRef for managing references off-line and CiteULike for on-line use. To be more specific, when I have to search for new literature I first find it using e.g. Google Scholar or Web of Science. Next, I post each article to my CiteULike account (this can be done with one mouse click), where I give it a specific tag. Finally, I export all my entries to BibTeX (CiteULike has powerful exporting facilities) and open the file in JabRef. Normally, using this procedure, I do not have to write any BibTeX reference manually.

    Best Lars

  2. … although what I am trying to do is download (for example) a complete issue for a given journal. With Elsevier it is easy. Other publishers are not so “friendly” as they don’t allow, for example, “Select All” (so you have to manually select all the files you need).

    Others will only allow you to click on a Volume/Issue, so you have to do each issue separately.

    But I do like the idea of going via CiteULike as it may resolve some of the other issues I am experiencing.

    Graham