In previous posts (see here for my previous Bibtex posts) I have been talking about a research project that I have in mind. One of the things I have to do is download quite a few bibtex files from various journals. Easy, you might think, and you’d be right in thinking that it should be. Go to the journal home page, select the papers you want, hit the button to download those papers as a bibtex file, and you are done.
Except, it is not quite as easy as that. I purposefully won’t mention any journals or publishers but here are some of the issues that I have come across.
- Some of the journals/publishers are too restrictive in that you cannot specify exactly what you need. Instead, you have to click on a particular volume/issue and then download that. If you want a particular year (or volume), then you might have to download six, or twelve, separate files.
- Some journals/publishers will not allow you (at least as far as I can see) to download bibtex, only CSV (comma-separated values).
- As far as I know, bibtex is designed to be a text based system, yet when you download some files you get all sorts of whitespace and non-printable characters in there. At best these do not display correctly. At worst it stops you being able to process the files. For example, I use Jabref and, on occasions, it gives me errors as it cannot process the given text file. This is frustrating to say the least as you have to track down the problem and manually edit the file.
The first point is easy to deal with, just much more time consuming than it could/should be. The second point is just frustrating. It means that you have to write another parser to convert from CSV (e.g. Excel) to bibtex. Not a major task, but an additional process that should not be necessary. The final point is the most frustrating as, in my mind, this should not be an issue. Perhaps I am wrong (and I am willing to be corrected) but the whole idea of bibtex (and latex actually) is that it is text based, and having strange characters in the file goes against the high level aims.
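To give an idea of what such a converter might look like, here is a minimal Python sketch. The column headers (Authors, Title, Journal, Year, Volume, Pages) are assumptions made for illustration, as every publisher's CSV export is laid out differently, and the citation key is built in a deliberately crude way. It also does not escape characters such as & or %, so treat it as a starting point rather than a finished tool.

```python
import csv
import io
import re

# Hypothetical mapping from CSV column headers to BibTeX field names.
# Real exports will use different headers, so adjust this to suit.
COLUMN_TO_FIELD = {
    "Authors": "author",
    "Title": "title",
    "Journal": "journal",
    "Year": "year",
    "Volume": "volume",
    "Pages": "pages",
}


def make_key(row, index):
    """Build a crude citation key, e.g. 'smith2012_1', from first author and year."""
    authors = row.get("Authors") or ""
    first_author = authors.split(",")[0].split(" and ")[0]
    surname = re.sub(r"[^A-Za-z]", "", first_author).lower() or "anon"
    year = (row.get("Year") or "").strip()
    return f"{surname}{year}_{index}"


def csv_to_bibtex(csv_text):
    """Turn CSV text (with a header row) into a string of @article entries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    for index, row in enumerate(reader, start=1):
        lines = [f"@article{{{make_key(row, index)},"]
        for column, field in COLUMN_TO_FIELD.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty or missing columns
                lines.append(f"  {field} = {{{value}}},")
        lines.append("}")
        entries.append("\n".join(lines))
    return "\n\n".join(entries) + "\n"


if __name__ == "__main__":
    sample = (
        "Authors,Title,Journal,Year,Volume,Pages\n"
        '"Smith, J.",A Title,Some Journal,2012,5,1-10\n'
    )
    print(csv_to_bibtex(sample))
```

The index is appended to each key only to keep keys unique when the same author and year appear more than once.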
Anyhow, to get around the third point you need to somehow clean the text file. You’d think this might be an easy task but a quick Google search shows that this is far from the truth. There is not one recognised way of doing it. If you are literate across many platforms/languages you might be able to go via Unix/Linux, perhaps with a little PHP, perhaps using sed, or awk or some other editor. To the casual user, how to resolve this issue is far from obvious. Actually, I might be being a little harsh, as what is a corrupt file on one system might not be a corrupt file on another system. You only need to consider the way that carriage returns and line feeds are treated differently on Unix and Windows to start having some idea of the issues involved.
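Before cleaning anything, it can help to see where the offending characters actually are. The following is a minimal Python sketch, not a recognised tool. The function names are my own, and it assumes that "strange" means an ASCII control character (codes 0 to 31, plus 127), with tabs and line endings treated as harmless.

```python
def normalise_newlines(text):
    """Convert Windows (CRLF) and lone CR line endings to Unix-style LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def find_suspect_characters(text):
    """Return (line_number, column, code_point) for each control character.

    Tabs and line endings are not reported.
    """
    hits = []
    lines = normalise_newlines(text).split("\n")
    for line_number, line in enumerate(lines, start=1):
        for column, ch in enumerate(line, start=1):
            code = ord(ch)
            if (code < 32 and ch != "\t") or code == 127:
                hits.append((line_number, column, code))
    return hits


if __name__ == "__main__":
    dirty = "@article{a,\r\n  title = {Bad\x00 one},\r\n}\r\n"
    for line_number, column, code in find_suspect_characters(dirty):
        print(f"line {line_number}, column {column}: character code {code}")
```

Normalising the line endings first means that a file saved on Windows and the same file saved on Unix are reported in the same way.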
So, what did I do?
I definitely wanted a Windows solution to save me having to transfer the file to other systems, write PHP programs etc. and I ideally did not want to have to download software, play around with regular expressions etc.
In the end I came up with a solution that is probably not that elegant but it is easy to do, requires no additional software and is relatively quick. Here is what I did.
- Open the bib file in Wordpad
- Copy the text into Excel
- Use the CLEAN() function to clean the text
- Copy the clean text
- Paste it into an empty bibtex file (e.g. if you open Jabref, you can paste into the empty pane)
Actually, when I was playing around with this, I could sometimes get away without going via Excel (i.e. open in Wordpad, then copy and paste into an empty bibtex file). I assume that the act of copying/pasting somehow strips off the strange characters. I would be interested to know why this sometimes happens, but that can wait for another time. At the moment, I am just pleased that I have some (semi-)automatic way of (semi-)cleaning up bibtex files that contain strange characters.
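For anyone who does have Python to hand, the Excel step can be approximated in a few lines. This is a sketch under stated assumptions rather than an exact copy of what Excel does: CLEAN() is documented as removing the first 32 non-printing characters of 7-bit ASCII (codes 0 to 31), so the code below does the same thing line by line, much as pasting into Excel puts each line into its own cell. Turning tabs into spaces is my own choice, and the file names in the usage note are hypothetical.

```python
def clean_line(line):
    """Drop characters with codes 0 to 31, roughly what CLEAN() does to a cell.

    Tabs are turned into spaces first (an assumption, not Excel's behaviour).
    """
    line = line.replace("\t", " ")
    return "".join(ch for ch in line if ord(ch) >= 32)


def clean_bibtex(text):
    """Clean every line of a bibtex file, keeping the line structure intact."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(clean_line(line) for line in lines)


def clean_file(source_path, target_path):
    """Read a bib file, clean it, and write the result to a new file."""
    # errors="replace" stops undecodable bytes from aborting the read.
    with open(source_path, encoding="utf-8", errors="replace") as source:
        cleaned = clean_bibtex(source.read())
    with open(target_path, "w", encoding="utf-8", newline="\n") as target:
        target.write(cleaned)
```

Calling clean_file("dirty.bib", "clean.bib") would write a cleaned copy and leave the original untouched. Like CLEAN(), this only removes codes 0 to 31, so other odd characters (for example non-breaking spaces) would still need separate handling.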