Bibtex parser: Mashup no more

Johnny Depp - Mashup by ~razorwalk on deviantART (http://razorwalk.deviantart.com/art/Johnny-Depp-Mashup-269028630). Downloaded 12 Apr 2013. labelled as free to reuse

In some (many!) posts I have talked about parsing bibtex files. Till now, I have been using PHP, Excel – in fact anything I could find that meant that I did not have to write my own software. I suppose I was producing a form of mashup.

All this seems to work, but the problem with mashups is that they are just that. When you return to one, it is quite difficult to follow what you were doing unless you are super organised (which I try to be). Going back to a project, it seems to take a long time to work out what each part of the mashup is doing, let alone the order in which each element needs to be executed.

Returning to the Bibtex project, I finally decided to take the bull by the horns and write a parser in C++. That is probably a bit of an exaggeration, as I will only write a parser that can parse Bibtex to the standard I require for this project. So I don’t have to worry about getting everything working correctly, just as long as I can extract the parts that I need.

I am part way through the development. For the first time I have really taken to the C++ Standard Template Library. Not sure I fully understand it, but I understand enough to have got it working.

I am also trying to write the project as a series of classes, as I have some other ideas, beyond this project, where I think I will be able to call on some of these classes. The aim, as with most computer scientists, is to have a software framework that we can reuse for a variety of projects and that will (hopefully) lead to work that we can publish.

 

Bibtex: Display papers by a given author

For a while I have been blogging on various bibtex related topics. My interest in parsing bibtex files started a long time ago, when I was looking for a way to automatically generate my publications web page, rather than having to update the raw HTML every time I published a new paper. There are a few free options around which enable you to do this, but all those I looked at had some shortcomings, as far as I was concerned.

In the end, I found a PHP system from Andreas Classen. This system does just about everything I want it to do and, being written in PHP, it is also extendible. The other benefit, although it did not seem like it at the time, was the fact that it forced me to learn PHP. I am really glad I did now as it has proven to be invaluable in many projects that I am working on.

I have already made a number of changes to the system supplied by Andreas, just so that I can do things that were not available out of the box. Not that this was a problem. Andreas was not selling a software package and it was good of him to make the software he had written open source.

The approach I have taken is to split lots of the functions into smaller functions and also put them into a class, just to try and make things easier and tidier. The overall aim is that I should be able to string these functions together (in true mash up style) to provide additional functionality.

As an example of this, I have recently added some functionality (albeit, quite specific)  that was not available before now in the original system, and would not have been easy to incorporate into the software as it arrived.

This new functionality enables you to show all the papers that have been published by a single author. That is, I take a bibtex file and generate one record for each author/paper pair. So if you have a paper that is written by two authors, this will generate two records: one for each author.

The process is as follows:

  1. Read in the bibtex file into an array.
  2. Process each record and extract each author as a separate record (along with some information about the paper). All of this is placed into another array.
  3. Sort this new array by the author family name.
  4. Once you have this array, you can display it as you see fit.
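As a rough sketch of steps 2 and 3 (this is not the code that runs on the site), the idea might look like this in PHP. The function name buildAuthorRecords, the shape of the $entries array and the sample names are all made up for illustration, and the sketch assumes authors are already written as "Family, Given".

```php
<?php
// Sketch only: one record per author/paper pair, sorted by family name.
// Assumes each entry has "author" and "title" keys (a hypothetical layout)
// and that authors are written as "Family, Given" joined by " and ".
function buildAuthorRecords(array $entries): array
{
    $records = array();
    foreach ($entries as $entry) {
        // Flatten line breaks and tabs so " and " can be used as a delimiter.
        $authorField = str_replace(array("\n", "\r", "\t"), " ", $entry["author"]);
        foreach (explode(" and ", $authorField) as $author) {
            $parts = explode(",", $author, 2);
            $records[] = array(
                "family" => trim($parts[0]),
                "given"  => isset($parts[1]) ? trim($parts[1]) : "",
                "title"  => $entry["title"],
            );
        }
    }
    // Sort by family name, ignoring case so that "van ..." is not pushed
    // behind every capitalised name.
    usort($records, function ($a, $b) {
        return strcasecmp($a["family"], $b["family"]);
    });
    return $records;
}

// Made-up sample data.
$entries = array(
    array("author" => "Alpha, A. and van Gamma, C.", "title" => "Paper One"),
    array("author" => "Beta, B.", "title" => "Paper Two"),
);
foreach (buildAuthorRecords($entries) as $record) {
    echo $record["family"] . ", " . $record["given"] . ": " . $record["title"] . "\n";
}
```

A paper with two authors produces two records here, which is exactly what makes the per-author listing possible.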

You can see what I do with it here. Note, this is not a page at my web site, but this functionality was developed for the MISTA conference series that I have chaired since 2003.

If you look at this page, you’ll also see I generate a separate menu for each author so that somebody could explore a specific author without having to continually scroll through the page. It also makes it easier to see how many publications a particular person has published.

Whether the system is really needed for the MISTA conference series is open to debate. The conference has published around 500 papers so it is possible to simply scan the papers, but as the conference publishes even more papers it will become increasingly useful, rather than having to look through every paper.

Probably of more use, at least to me, is that the various PHP support functions I have developed now enable me to build even more bibtex functionality, which makes various other projects I have in mind a realistic possibility.

 

Endnote: Thumbs up to Technical Support

I have recently been posting about bibtex and some of the problems I was having, and some of the solutions I had come up with. As the research project I have in mind develops, I do envisage more problems on the horizon, so I thought I might look around for something a little more sustainable. One of the issues I have been having is trying to get various references from various journals. Some of the publishers are very good at enabling you to do this. Others, not so good. Some do not even give you the option of downloading citations. I plan to blog about some of the issues I encountered later, but I need to do more investigation to be sure of my facts.

For the project I am working on, I need to download a set of papers, from various journals. I don’t want to go into too many details here, as that will wait until the research is done and the paper is submitted and (hopefully) accepted.

A lot of my research had been done using Thomson Reuters and their Web of Science product. It was while looking more deeply at their various products that I came across Endnote. I had heard of this product before but I have never used it and I did not know that it was part of the Thomson Reuters family of products.

Endnote actually has two flavours. A web based product that is free to use (at least for me, because I have an institutional subscription to Web of Science) and a desk based version that is available for Windows and Macs.

I started off using the web based version and as all the papers that I am interested in are in ISI ranked journals, I could access everything I needed to from within Web of Science. This meant that I did not have to go to individual journals or publishers, which was a big plus for me as I could do all my paper searching from one place. Moreover, importing to Endnote Web was very easy.

I soon came across a problem though. Endnote Web has a limit on the number of citations that you can store. This limit is 25,000. I guess that this will not be a problem for most people but for me (or, at least, the research project), I would need a few more thousand for the current project.

If you are interested in the various limits and comparisons for the products, they can be seen here. At the time of writing, the web page was saying that the latest desk based version was Endnote X5, when in fact the latest version is X6.

In other pages I have seen, it states that if you buy Endnote X6 you can store 100,000 references in Endnote Web. Actually the limit is 50,000. In any case, 50,000 would be good enough for my project.

The result of all my investigations was that I shelled out the £66 (GBP) for a copy of Windows Endnote X6. As well as being able to store 50,000 references within Endnote Web, there is also a sync facility so that both the desktop and the web versions mirror each other as far as the data is concerned.

But there was a problem with the sync facility. I have just over 35,000 records in the web based version and when I tried to sync, it did not work (I won’t bore you with the details of the error message which, I later learned, was just a generic message). I tried various things but eventually raised a support ticket with Thomson Reuters.

They responded, and kept me informed every couple of days that they were still looking at the problem. Then one of their support personnel (I won’t name them for fear of embarrassment) contacted me and arranged a skype call. We actually had the call as they were driving home from work on (their) Friday evening. I was very impressed that somebody would do this. They also promised to continue to look at the problem once they got home.

I think that this is amazing service. The problem has not been resolved yet but I do know that their technical support is taking the problem seriously, which is as much as you can hope for.

I am hoping that the problem will be sorted out soon, and then I can move onto the next stage of the research project. But, at the moment, there is a big thumbs up for Thomson Reuters technical support. Thank you.

Finally, I have given what links I can above, but it is difficult to give links to Endnote Web as to access it I have to login via my institutional subscription. I am not sure if the product is (freely) available to the general public. Perhaps if you use it as an individual you could post a reply letting others know how you access it, and whether it costs anything? Perhaps the only way is to buy Endnote X6 and then you get an Endnote Web account as a matter of course?

Easily converting ris-citations to bibtex and some reflections

I have recently done a series of blogs on parsing Bibtex files. When I post a blog I tweet it with a #bibtex hashtag. Every so often I search for #bibtex in Twitter just to see what others are saying.

A recent search threw up an interesting web page. The blog post is interesting in that it shows another way of parsing citation information and getting it into bibtex (my posts have focussed on getting information out of bibtex). In this case it is the problem of getting RIS (Research Information Systems)  formatted citations into bibtex format.
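To give a feel for the task (this is not the method from the post I found), here is a minimal PHP sketch that turns one RIS record into a bibtex entry. It assumes the usual RIS line layout of a two-character tag, two spaces, a hyphen and the value, and it only looks at a handful of tags. The function name risToBibtexSketch, the citation key and the sample record are all made up.

```php
<?php
// Sketch only: converts a single RIS record into a BibTeX @article entry.
// Only the AU, TI/T1, JO/JF and PY tags are handled; the entry type is
// always @article, whatever the TY tag says.
function risToBibtexSketch(string $ris, string $key): string
{
    $map = array(
        "TI" => "title", "T1" => "title",
        "JO" => "journal", "JF" => "journal",
        "PY" => "year",
    );
    $authors = array();
    $fields = array();
    foreach (preg_split('/\r\n|\r|\n/', $ris) as $line) {
        // RIS lines look like "AU  - Alpha, A."
        if (!preg_match('/^([A-Z][A-Z0-9])  - ?(.*)$/', $line, $m)) {
            continue;
        }
        $tag = $m[1];
        $value = trim($m[2]);
        if ($tag === "AU") {
            $authors[] = $value;          // one AU line per author
        } elseif (isset($map[$tag])) {
            $fields[$map[$tag]] = $value;
        }
    }
    $out = "@article{" . $key . ",\n";
    $out .= "  author = {" . implode(" and ", $authors) . "},\n";
    foreach ($fields as $name => $value) {
        $out .= "  " . $name . " = {" . $value . "},\n";
    }
    return $out . "}\n";
}

// Made-up sample record.
$ris = "TY  - JOUR\nAU  - Alpha, A.\nAU  - Beta, B.\nTI  - A made-up title\n"
     . "JO  - Made-up Journal\nPY  - 2013\nER  - \n";
echo risToBibtexSketch($ris, "alpha2013");
```

Real RIS exports carry many more tags (pages, volume, abstracts and so on), so a proper converter would need a much longer mapping table.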

What is interesting about the post (apart from the fact that it provides a solution to the problem, especially if you face the problem!) is that it talks about disturbing workflow and how you can automate the task. Even if you can find a solution to the problem you face, it often means having to go through many different stages, perhaps using different software packages and/or operating systems (in fact, I commented on this in my post on biber and biblatex). Once you have a solution that works, when you need to return to it a few days/weeks/months/years later you may need to spend a lot of time working out what you actually did, unless you have documented it (or have a very good memory).

As I delve more into bibtex, and how to manipulate (or parse) the data, I am gradually forming the opinion that there are standards in place but either they are not strict enough or people do not use them. For me, a good example is the use of names in bibtex (see here and here for a discussion).

The problem (or at least one of the problems) is that bibtex has a structured layout but within each element of that structure you can enter what you like, whether that is names not in the correct format or strange characters that do not display on every device.

Perhaps there is a standards agency that looks after these things, and perhaps I am missing something, but it seems to me that there are some areas for improvement in this important area.

 

Bibtex: How to enter names

My recent bibtex posts have drawn a few comments, which I am very grateful for. I have already described one of these comments in a post I uploaded yesterday, where somebody had suggested that I look at biber and biblatex.

 

Another comment I received was made in the post about Parsing Bibtex Authors. I received the following comment:

 

… snip

Incidentally, why do you have
“T. van Woensel” instead of “van Woensel, Tom” in your bibtex entry?
If you want to the first names abbreviated then use the appropriate bibtex style.

 

I tried the format suggested above in my own bibtex file and I am pleased to say that the bibtex parser supplied by Andreas Classen does parse things as you would expect. That is:

“van Woensel, T.” displays correctly as “van Woensel, T.”, whereas typing the name in as “T. van Woensel” gives “Woensel, T va”.

 

This is good (and apologies to Andreas if I ever gave the impression that his parser was somehow flawed).

This is all well and good, but ONLY if everybody types the names in the correct format given the structure of the names. In my experience, this is not the case, and so the parser has to somehow cope when people do not follow the standard way of doing things.

As an example, I have just looked at a bibtex file that I downloaded from a leading journal’s web site and one of the author’s names is “Joyce van Loon” which would not parse correctly.
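One way a parser could cope with names typed without a comma is to borrow the convention that, as far as I understand it, bibtex itself uses: the family name starts at the first word that begins with a lower-case letter (the “von” part), or at the final word if there is no such word. The PHP below is a simplified sketch of that rule. The function name splitNameSketch is made up, it is not part of the parser from Andreas, and it ignores braces, “jr” parts and accented characters.

```php
<?php
// Sketch of the "First von Last" convention for names without a comma.
// Simplified: no brace handling, no "jr" part, ASCII letters only.
function splitNameSketch(string $name): array
{
    $name = trim($name);
    if (strpos($name, ",") !== false) {
        // Already in "von Last, First" form.
        $parts = explode(",", $name, 2);
        return array("family" => trim($parts[0]), "given" => trim($parts[1]));
    }
    $words = preg_split('/\s+/', $name);
    $last = count($words) - 1;
    // By default the family name is just the final word.
    $start = $last;
    for ($i = 0; $i < $last; $i++) {
        // The first lower-case word (if any) marks where the family name begins.
        if (preg_match('/^[a-z]/', $words[$i])) {
            $start = $i;
            break;
        }
    }
    return array(
        "family" => implode(" ", array_slice($words, $start)),
        "given"  => implode(" ", array_slice($words, 0, $start)),
    );
}

print_r(splitNameSketch("Joyce van Loon"));
print_r(splitNameSketch("van Woensel, T."));
```

Note that this only helps when the particle is typed in lower case. A name entered as “Van Loon” or “De Smet” without a comma would still be split in the wrong place.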

 

In case you want to read more, in a previous blog I pointed towards Norman Walsh’s web page as a good explanation of bibtex author formats. A recent forum entry I looked at pointed towards a slightly different Norman Walsh page.

 

Bibtex is not the only solution apparently

Recently I have been blogging quite a lot about bibtex (see here) and how I can parse bibtex files. When I tweeted about my most recent blog post, I received a reply asking whether I had looked at biber and, instead of parsing the bibtex file, whether I would be better off parsing the bcf file that biber produces as, if nothing else, it is in XML and that will be a lot easier to parse. And, apparently, the names are already split into given name and family name.

I have had a quick look at biber (and the underlying package biblatex). They certainly look impressive but the learning curve seems to be quite high. From the VERY quick read, you have to install quite a few packages before it will work. This reminds me of the time I first installed WinEdt (a latex system). As an aside, I note that my version of WinEdt is 5.5 and the latest release is version 7. I wonder if it is worth upgrading?

When you install WinEdt, you also have to install the MiKTeX system. I remember when I did this a few years ago it was painful (I think it is simplified now) and if you did not do things in the correct order then WinEdt failed to work at all. Having said that, WinEdt is a great piece of software and it is my program of choice for writing scientific articles in Latex and, of course, Latex uses Bibtex and it all works.

If I install biblatex/biber will this work with all style files, class files etc. that are supplied by academic publishers or will I forever be trying to work around the system? Indeed, is it possible to use bibtex, and just use the new system when you want?

Maybe it will all work seamlessly (and I know that I should look at the documentation, but there is quite a lot of it) but I have been caught too many times where you do a quick install and then you are trying to get things to work for hours, if not days. I am not saying that biblatex/biber fall into this category, but I need to be sure that installing this system gives me benefits above and beyond what I already have.

As I have reported in previous posts, I now just about have the parsing of names sorted out. However, I do know that I am just about to face problems with strange characters, which are usually the result of mathematical symbols or accents on names. If a new package is able to deal with those, so I don’t have to edit the file manually, or write yet another parser, then it might be worth installing.

But is it worth installing, just to find out? I’m not sure yet. I need to do some more digging around.

 

 

Downloading Bibtex Files

In previous posts (see here for my previous Bibtex posts) I have been talking about a research project that I have in mind. One of the things I have to do is download quite a few bibtex files from various journals. Easy you might think, and you’d be right in thinking that. It should be easy, of course. Go to the journal home page, select the papers you want, hit the button to download those papers as a bibtex file, and you are done.

Except, it is not quite as easy as that. I purposefully won’t mention any journals or publishers but here are some of the issues that I have come across.

  1. Some of the journals/publishers are too restrictive in that you cannot specify exactly what you need. Instead, you have to click on a particular volume/issue and then download that. If you want a particular year (or volume), then you might have to download six, or twelve, separate files.
  2. Some journals/publishers will not allow you (at least as far as I can see) to download bibtex, but only CSV (Comma Separated Values).
  3. As far as I know, bibtex is designed to be a text based system, yet when you download some files you get all sorts of whitespace and non-printable characters in there. At best these do not display correctly. At worst it stops you being able to process the files. For example, I use Jabref and, on occasions, it gives me errors as it cannot process the given text file. This is frustrating to say the least as you have to track down the problem and manually edit the file.

To deal with the first point is easy, just much more time consuming than it could/should be. The second point is just frustrating. It means that you have to write another parser to convert from CSV (e.g. Excel) to bibtex. Not a major task, but an additional process that should not be necessary. The final point is the most frustrating as, in my mind, this should not be an issue. Perhaps I am wrong (and I am willing to be corrected) but the whole idea of bibtex (and latex actually) is that it is text based, and having strange characters in the file goes against the high level aims.
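For the second point, a CSV to bibtex converter can be quite short. The PHP below is only a sketch: the column headings (key, author, title, journal, year) are my own assumption, as every publisher seems to use a different layout, and the function name csvToBibtexSketch and the sample row are made up. It also assumes that no field contains a line break.

```php
<?php
// Sketch only: turns CSV text with a header row into BibTeX @article entries.
// Assumed (hypothetical) column headings: key, author, title, journal, year.
function csvToBibtexSketch(string $csv): string
{
    $lines = preg_split('/\r\n|\r|\n/', trim($csv));
    // The first line holds the column headings.
    $header = str_getcsv(array_shift($lines), ",", "\"", "\\");
    $out = "";
    foreach ($lines as $line) {
        if (trim($line) === "") {
            continue; // skip blank lines
        }
        // Pair each heading with the value in the same column.
        $row = array_combine($header, str_getcsv($line, ",", "\"", "\\"));
        $out .= "@article{" . $row["key"] . ",\n";
        foreach (array("author", "title", "journal", "year") as $field) {
            $out .= "  " . $field . " = {" . $row[$field] . "},\n";
        }
        $out .= "}\n\n";
    }
    return $out;
}

// Made-up sample data.
$csv = "key,author,title,journal,year\n"
     . "alpha2013,\"Alpha, A. and Beta, B.\",\"A made-up title\",\"Made-up Journal\",2013\n";
echo csvToBibtexSketch($csv);
```

The author field has to be quoted in the CSV because it contains commas, which is one more thing that can go wrong in a downloaded file.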

Anyhow, to get around the third point you need to somehow clean the text file. You’d think this might be an easy task but a quick google shows that this is far from the truth. There is not one recognised way of doing it. If you are literate across many platforms/languages you might be able to go via Unix/Linux, perhaps with a little PHP, perhaps using sed, or awk or some other editor. To the casual user, how to resolve this issue is far from obvious. Actually, I might be being a little harsh, as what is a corrupt file on one system might not be a corrupt file on another system. You only need to consider the way that carriage returns and line feeds are treated differently on Unix and Windows to start having some idea of the issues involved.

So, what did I do?

I definitely wanted a Windows solution to save me having to transfer the file to other systems, write PHP programs etc. and I ideally did not want to have to download software, play around with regular expressions etc.

In the end I came up with a solution that is probably not that elegant but it is easy to do, requires no additional software and is relatively quick. Here is what I did.

  1. Open the bib file in Wordpad
  2. Copy the text into Excel
  3. Use the CLEAN() function to clean the text
  4. Copy the clean text
  5. Paste it into an empty bibtex file (e.g. if you open Jabref, you can paste into the empty pane)

Actually, when I was playing around with this, I could sometimes get away without going via Excel (i.e. open in Wordpad, then copy and paste into an empty bibtex file). I assume that the act of copying/pasting somehow strips off the strange characters. I would be interested to know why this sometimes happens, but that can wait for another time. At the moment, I am just pleased that I have some (semi-)automatic way of (semi-)cleaning up bibtex files that contain strange characters.
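For anybody who does not mind a little scripting, roughly the same clean-up can be sketched in a few lines of PHP. This is not what I actually did (I used the Wordpad/Excel route above), and the function name cleanBibtexText is made up. It assumes the file is ASCII or UTF-8, and it keeps tabs and line breaks so that the layout of the bibtex file survives.

```php
<?php
// Sketch only: strips non-printable characters from BibTeX text.
function cleanBibtexText(string $text): string
{
    // Normalise Windows (CR LF) and old Mac (CR) line endings to LF.
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    // Turn UTF-8 non-breaking spaces (bytes C2 A0) into ordinary spaces.
    $text = str_replace("\xC2\xA0", " ", $text);
    // Remove the remaining ASCII control characters, but keep tab (0x09)
    // and line feed (0x0A).
    return (string) preg_replace('/[\x00-\x08\x0B-\x1F\x7F]/', '', $text);
}

$dirty = "@article{a,\r\n  title = {A\x00B\x07},\r\n}";
echo cleanBibtexText($dirty);
```

Accented characters are left alone here, as their UTF-8 bytes all sit above the range being removed.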

 

Parsing Bibtex Authors: How I Do It

I mentioned (see here for a list of previous bibtex posts) that I was facing a challenge of parsing bibtex authors. One of the problems with the current system that I use is that, although it works pretty well, it does not handle all cases correctly. To give you an example, if I have:

T. van Woensel (who is actually one of my co-authors)

in my bibtex file, this parses as

Woensel, T. va

whereas, it should parse as

T. van Woensel

That is, the family name is actually “van Woensel”, and not “Woensel”.

In a bibtex file, you would place the name within braces, and the bibtex parsers that are built into (say) latex would treat the text as one element. So, you would type this into your bibtex file.

T. {van Woensel}

But the current parser I have, although not displaying the braces, does not treat the elements inside the braces as one element.
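To illustrate what “treating the braces as one element” could mean, here is a small PHP sketch that splits a name into words but keeps anything inside a pair of braces together. The function name tokeniseName is made up, and the sketch ignores nested braces and braces in the middle of a word.

```php
<?php
// Sketch only: split a name on whitespace, keeping {braced groups} intact.
function tokeniseName(string $name): array
{
    // Either a whole {...} group, or a run of non-space characters.
    preg_match_all('/\{[^{}]*\}|\S+/', $name, $matches);
    $tokens = array();
    foreach ($matches[0] as $token) {
        // Drop the braces themselves; they are only there as protection.
        $tokens[] = trim($token, "{}");
    }
    return $tokens;
}

print_r(tokeniseName("T. {van Woensel}"));
```

With the braces, “van Woensel” comes back as a single token, so whatever logic picks the family name sees it as one word.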

This gave me an opportunity to try out using regular expressions. I must admit that I do not have a lot of experience of using these, although I do recognise that they are very powerful and I suppose I have used them in various guises, and on various operating systems, over the years.

Anyhow, here is the idea.

Step 1

I take each bibtex author string and split it into the various authors using the PHP explode function. That is:

$authors = explode(" and ", str_replace(array("\n","\r","\t"), " ", $entry["author"]));

This line is actually taken from formatAuthors, which is part of the code supplied by Andreas Classen.

This results in the bibtex author string being split into separate authors (as we use " and " as a delimiter, which is the standard delimiter in bibtex author names). The various authors are stored in an array called $authors.

 

Step 2

Next, we iterate through each author

foreach($authors as $author) {

// stuff (see below)

}

 

Step 3

This step is an extension to what Andreas does. I now protect certain names. For example, if I want to look for “van xxxx”, I use the following replacement regular expression.

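$author = preg_replace('/\b([Vv]an) +/', '$1xxxyyyzzz', $author);

If the name was “van Woensel”, this will make it “vanxxxyyyzzzWoensel”. I use xxxyyyzzz, as this is unlikely to appear in somebody’s name. The \b (word boundary) stops the pattern from also matching the end of a longer word, such as the given name “Ivan”.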

 

The other names I currently protect are “Vanden” and “De”. That is, I would add these two lines:

 

// \b is a word boundary, so "De" cannot match the end of a longer word such as "Clyde"
$author = preg_replace('/\b([Vv]anden) +/', '$1xxxyyyzzz', $author);
$author = preg_replace('/\b([Dd]e) +/', '$1xxxyyyzzz', $author);

Of course, we could put the various names into an array and iterate through them, which would make it easier to maintain. In my actual implementation, I actually put all this messy regular expression stuff into a function (and use an array to hold the various names), just to keep it all out of the way of the main parsing function.
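The array-driven version might look something like the PHP below. This is a sketch written for this post rather than a copy of my actual function, so the names protectNames, unprotectNames and NAME_PLACEHOLDER are made up; the placeholder text and the list of particles are the ones described above.

```php
<?php
// Sketch only: protect multi-word family names before formatting,
// then restore them afterwards.
const NAME_PLACEHOLDER = "xxxyyyzzz";

function protectNames(string $author, array $particles): string
{
    foreach ($particles as $particle) {
        $first = substr($particle, 0, 1);
        // Build e.g. /\b([Vv]an) +/ so either capitalisation is matched.
        // \b stops the pattern matching the end of a longer word.
        $pattern = '/\b([' . strtoupper($first) . strtolower($first) . ']'
                 . preg_quote(substr($particle, 1), '/') . ') +/';
        $author = preg_replace($pattern, '$1' . NAME_PLACEHOLDER, $author);
    }
    return $author;
}

function unprotectNames(string $author): string
{
    // Every placeholder becomes a space again, whichever particle it followed.
    return str_replace(NAME_PLACEHOLDER, " ", $author);
}

$particles = array("van", "Vanden", "De");
$protected = protectNames("T. van Woensel", $particles);
echo $protected . "\n";
echo unprotectNames($protected) . "\n";
```

Adding another particle is then just a matter of adding a word to the array, and the unprotect step does not need to know the list at all.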

 

Step 4

I now call the formatAuthors function supplied by Andreas. That is:

$author = formatAuthors($author);

Continuing with the example above, this would now return

vanxxxyyyzzzWoensel, T

 

Step 5

Now, of course, I have to strip out the protection, so I do another regular expression replacement. That is:

$author = preg_replace('/([Vv]an)xxxyyyzzz/', '$1 ', $author);

Which just replaces xxxyyyzzz with a space.

Looking back at step 3, I would also have to unprotect “Vanden” and “De”. That is:

$author = preg_replace('/([Vv]anden)xxxyyyzzz/', '$1 ', $author);
$author = preg_replace('/([Dd]e)xxxyyyzzz/', '$1 ', $author);

 

This achieves what I set out to do. But, actually, the reason I wanted my own parser is so that I could split the authors into the family names and their given names. Now that I explode the authors before formatting them I do have each author in a separate string. Carrying out the above step sorts out the issue (well, at least one of the issues) with family names that comprise more than one word. After the above procedure I have each name in the format

family name, initial(s)

It is an easy task to use string manipulation to split the name into family and given names. This is how I do it.

 

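$len = strlen($author);
$commaPos = strpos($author, ',');
// Everything before the comma is the family name.
$family = substr($author, 0, $commaPos);
// Skip the comma and the space after it; the rest is the given name(s).
$given = substr($author, $commaPos + 2, $len - ($commaPos + 2));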

I use $len and $commaPos just to make the code a little easier to follow. Of course, you might just want to incorporate the various string functions within the calls to substr.
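One case the string functions do not cover is a name with no comma in it at all, where strpos returns false. A slightly more defensive variant, sketched below with the made-up function name splitFamilyGiven, uses explode with a limit of two so that a missing comma simply gives an empty given name.

```php
<?php
// Sketch only: split "family, given" into its two parts.
// If there is no comma, the whole string is treated as the family name.
function splitFamilyGiven(string $author): array
{
    // The limit of 2 means only the first comma is used as the separator.
    $parts = explode(",", $author, 2);
    $family = trim($parts[0]);
    $given = isset($parts[1]) ? trim($parts[1]) : "";
    return array($family, $given);
}

list($family, $given) = splitFamilyGiven("van Woensel, T");
echo $family . " / " . $given . "\n";
```

Using trim also means it does not matter whether there is a space after the comma.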

 

The result of all the above is that I can now parse author strings from bibtex, resolve certain names that did not work before and also split up family and given names. This is important for the research project I have in mind, more of which later.

 

 

 

Some of the Challenges in Parsing Bibtex Authors

In previous posts (see here) I have been talking about a system that I have been working on that enables the publications area of my web site to be driven by a bibtex file. It all seems to work pretty well and maintaining my web site is now a lot easier than it used to be. Another advantage (and one not to be sniffed at) is the fact that my web site has a uniform presentation. I also hope that once I have the various functions that I plan to implement I will be able to do things like present my papers sorted by journal name (so that you can see how many papers I have published in a given journal) and also publish a list of my co-authors. All of this, and more, should be quite easy once the main foundations are in place.
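As an indication of how little code the journal listing should need once the bibtex file has been read into an array, here is a PHP sketch. The function name countPapersByJournal, the layout of the $entries array and the journal names are all made up for illustration.

```php
<?php
// Sketch only: count how many entries there are for each journal.
// Assumes each entry is an array that may have a "journal" key.
function countPapersByJournal(array $entries): array
{
    $counts = array();
    foreach ($entries as $entry) {
        if (!isset($entry["journal"])) {
            continue; // e.g. conference papers and book chapters
        }
        $journal = trim($entry["journal"]);
        $counts[$journal] = ($counts[$journal] ?? 0) + 1;
    }
    ksort($counts); // alphabetical order by journal name
    return $counts;
}

// Made-up sample data.
$entries = array(
    array("journal" => "Journal B", "title" => "Paper One"),
    array("journal" => "Journal A", "title" => "Paper Two"),
    array("journal" => "Journal B", "title" => "Paper Three"),
    array("booktitle" => "Some Conference", "title" => "Paper Four"),
);
foreach (countPapersByJournal($entries) as $journal => $count) {
    echo $journal . ": " . $count . "\n";
}
```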

As I progress with this project, and I try to extend the system, there are various challenges that I keep coming up against. One of these is parsing author names.

The system I am developing is based on the one developed by Andreas Classen. It actually does a fantastic job of parsing author names. You simply pass a string of authors from the bibtex file to a function called formatAuthors, and you are returned a string that presents the authors in a standard way. Indeed, this is the method I use on my current web site (see here, but bear in mind the method may have changed by the time you read this).

I have recently been trying to write my own parser, for a research idea that I have. It is not easy! Just to give you a few examples of the issues that have to be addressed:

  • In bibtex, you have various ways that you can specify names. Norman Walsh does a better job than I could in describing some of the ways that names can be provided to bibtex.
  • The formatAuthors function provided by Andreas, as I said, does an excellent job but it is lacking in some ways.  For example, it does not deal with people who have two family names, such as some people from Belgium and Holland who are often called “van xxxxx” or “de xxxxx”.
  • When I was experimenting recently, I saw an author who was called “Billy James III”. The “III” causes problems, in the same way that the Belgium/Holland names do. There would be similar problems with people who have “jr” at the end of their name.
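For the last of these, one possible approach is to peel a suffix off the end of the name before doing anything else with it. The PHP below is only a sketch of that idea: the function name splitSuffixSketch is made up, the list of suffixes is deliberately short, and it is not something that formatAuthors does.

```php
<?php
// Sketch only: separate a trailing suffix such as "III" or "jr" from a name.
function splitSuffixSketch(string $name): array
{
    $name = trim($name);
    // A short, hypothetical list of suffixes; matched case-insensitively
    // at the very end of the name, after a space or comma.
    $pattern = '/[ ,]+(Jr\.?|Sr\.?|II|III|IV)$/i';
    if (preg_match($pattern, $name, $m)) {
        return array(
            "name"   => preg_replace($pattern, '', $name),
            "suffix" => $m[1],
        );
    }
    return array("name" => $name, "suffix" => "");
}

print_r(splitSuffixSketch("Billy James III"));
```

The remaining name can then go through the normal parsing, with the suffix added back on at the end if it is needed for display.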

None of these problems are insurmountable. Indeed, the bibtex style files do a great job of handling all types of names. The challenge for me is to try and develop a parser that is able to deal with anything I throw at it, and which will deal with thousands of names, rather than the ones I can easily check from the bibtex file that comprises just my own publications.

Anything that I do, will be based on the formatAuthors function from Andreas but I think it needs a little tweaking just to try and deal with a few more cases.

 

Update: Displaying bibtex on web site

A while ago (see here for my Bibtex posts) I commented that I was working on a system where I could take a bibtex file and display that on my web site. The result can be seen here. The system works pretty well in that my web site (at least this part of it) is driven from a bibtex file.

I have also implemented a system where you can freely download my papers, but you have to supply your email address. This is for a number of reasons. Firstly it is useful to know how many of my papers are being downloaded. Secondly, it is interesting to know what papers people are interested in. Thirdly, it might be useful to collect email addresses to let people know about conferences, new publications etc.

Since the system went live, 108 of my papers have been downloaded. Some of the downloads were done by me, just testing the system or making sure newly added papers could be downloaded, but around 100 downloads is pretty good.

In my previous post I said that all I needed to do was keep my bibtex file up to date. Actually, I have a few ideas as to what I want to do with the system. I have started to work on some of those ideas, which I’ll explain in a future blog.

One of the reasons that I want to extend the system, and also make it easier to maintain, is for a research project that I have in mind. If I decide to go ahead with that, I’ll need to do a lot more bibtex manipulation than I do at the moment, and anything that I can do to make my life that little bit easier will be well worth the implementation effort that is required.

Whatever I do, I still owe a big debt to Andreas who was good enough to provide the code that I initially used, and still draw on very heavily.