Bibtex: Display papers by a given author

For a while I have been blogging on various bibtex related topics. My interest in parsing bibtex files started a long time ago, when I was looking for a way to automatically generate my publications web page, rather than having to update the raw HTML every time I published a new paper. There are a few free options around which enable you to do this, but all those I looked at had some shortcomings, as far as I was concerned.

In the end, I found a PHP system from Andreas Classen. This system does just about everything I want it to do and, being written in PHP, it is also extendible. The other benefit, although it did not seem like it at the time, was the fact that it forced me to learn PHP. I am really glad I did now as it has proven to be invaluable in many projects that I am working on.

I have already made a number of changes to the system supplied by Andreas, just so that I can do things that were not available out of the box. Not that this was a problem. Andreas was not selling a software package and it was good of him to make the software he had written open source.

The approach I have taken is to split many of the functions into smaller functions and put them into a class, to make things easier and tidier. The overall aim is that I should be able to string these functions together (in true mash up style) to provide additional functionality.

As an example of this, I have recently added some (admittedly quite specific) functionality that was not available in the original system, and would not have been easy to incorporate into the software as it arrived.

This new functionality enables you to show all the papers that have been published by a single author. That is, I take a bibtex file and generate one record for each author/paper pair. So if you have a paper that is written by two authors, this will generate two records, one for each author.

The process is as follows:

  1. Read the bibtex file into an array.
  2. Process each record and extract each author as a separate record (along with some information about the paper). All of this is placed into another array.
  3. Sort this new array by the author family name.
  4. Once you have this array, you can display it as you see fit.
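
As a rough sketch of the four steps above (not the actual code behind the MISTA page), something like the following would do. The function name buildAuthorRecords, the array keys and the sample entries are all made up for this illustration, and each author is assumed to already be in "family, given" form.

```php
<?php
// Hypothetical sketch: $entries stands in for the array produced by the
// bibtex reader (step 1); the 'author' and 'title' keys are assumptions.
function buildAuthorRecords(array $entries): array
{
    $records = [];
    foreach ($entries as $entry) {
        // Step 2: one record per author/paper pair.
        foreach (explode(' and ', $entry['author']) as $author) {
            // Assumes "family, given" form: everything before the first
            // comma is treated as the family name.
            $parts = explode(',', $author, 2);
            $records[] = [
                'family' => trim($parts[0]),
                'given'  => trim($parts[1] ?? ''),
                'title'  => $entry['title'],
            ];
        }
    }
    // Step 3: sort by family name, ignoring case.
    usort($records, function (array $a, array $b): int {
        return strcasecmp($a['family'], $b['family']);
    });
    return $records;
}

// Made-up sample data.
$entries = [
    ['author' => 'Smith, J. and Jones, A.', 'title' => 'Paper one'],
    ['author' => 'Jones, A.', 'title' => 'Paper two'],
];

// Step 4: display the records however you see fit.
foreach (buildAuthorRecords($entries) as $record) {
    echo $record['family'] . ', ' . $record['given'] . ': ' . $record['title'] . "\n";
}
```

A paper with two authors appears twice in the result, once under each author, which is exactly the author/paper record described above.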

You can see what I do with it here. Note, this is not a page at my web site, but this functionality was developed for the MISTA conference series that I have chaired since 2003.

If you look at this page, you’ll also see I generate a separate menu for each author so that somebody could explore a specific author without having to continually scroll through the page. It also makes it easier to see how many publications a particular person has published.
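
Counting the publications for each author is a small step once the author/paper records exist. The sketch below is only an illustration: countPapersByAuthor and the record keys are names I have invented here, and the sample records are made up.

```php
<?php
// Hypothetical sketch: $records is a list of author/paper records, each with
// 'family', 'given' and 'title' keys (the key names are assumptions).
function countPapersByAuthor(array $records): array
{
    $counts = [];
    foreach ($records as $record) {
        $name = $record['family'] . ', ' . $record['given'];
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }
    // Sort the menu alphabetically by author, ignoring case.
    ksort($counts, SORT_FLAG_CASE | SORT_STRING);
    return $counts;
}

// Made-up sample records.
$records = [
    ['family' => 'Smith', 'given' => 'J.', 'title' => 'Paper one'],
    ['family' => 'Jones', 'given' => 'A.', 'title' => 'Paper one'],
    ['family' => 'Jones', 'given' => 'A.', 'title' => 'Paper two'],
];

// One menu line per author, with the number of papers in brackets.
foreach (countPapersByAuthor($records) as $name => $count) {
    echo $name . ' (' . $count . ")\n";
}
```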

Whether the system is really needed for the MISTA conference series is open to debate. The conference has published around 500 papers so it is possible to simply scan the papers, but as the conference publishes even more papers it will become increasingly useful, rather than having to look through every paper.

Probably of more use, at least to me, is that the various PHP support functions that I have developed now enable me to develop even more bibtex functionality, which makes various other projects I have in mind a realistic possibility.

 

Easily converting ris-citations to bibtex and some reflections

I have recently done a series of blogs on parsing Bibtex files. When I post a blog I tweet it with a #bibtex hashtag. Every so often I search for #bibtex in Twitter just to see what others are saying.

A recent search threw up an interesting web page. The blog post is interesting in that it shows another way of parsing citation information and getting it into bibtex (my posts have focussed on getting information out of bibtex). In this case it is the problem of getting RIS (Research Information Systems) formatted citations into bibtex format.

What is interesting about the post (apart from the fact that it provides a solution to the problem, especially if you face the problem!) is that it talks about disturbing workflow and how you can automate the task. Even if you can find a solution to the problem you face, it often means having to go through many different stages, perhaps using different software packages and/or operating systems (in fact, I commented on this in my post on biber and biblatex). Once you have a solution that works, when you need to return to it a few days/weeks/months/years later, unless you have documented it (or have a very good memory), you may need to spend a lot of time working out what you actually did.

As I delve more into bibtex, and how to manipulate (or parse) the data, I am gradually forming the opinion that there are standards in place but either they are not strict enough or people do not use them. For me, a good example is the use of names in bibtex (see here and here for a discussion).

The problem (or at least one of the problems) is that bibtex has a structured layout but within each element of that structure you can enter what you like, whether that is names not in the correct format or strange characters that do not display on every device.

Perhaps there is a standards agency that looks after these things, and perhaps I am missing something, but it seems to me that there are some areas for improvement in this important area.

 

Bibtex: How to enter names

My recent bibtex posts have drawn a few comments, which I am very grateful for. I have already described one of these comments in a post I uploaded yesterday, where somebody had suggested that I look at biber and biblatex.

 

Another comment I received was made in the post about Parsing Bibtex Authors. I received the following comment:

 

… snip

Incidentally, why do you have
“T. van Woensel” instead of “van Woensel, Tom” in your bibtex entry?
If you want to the first names abbreviated then use the appropriate bibtex style.

 

I tried the format suggested above in my own bibtex file and I am pleased to say that the bibtex parser supplied by Andreas Classen does parse things as you would expect. That is:

“van Woensel, T.” displays correctly as “van Woensel, T.”, whereas typing the name in as “T. van Woensel” gives “Woensel, T va”.

 

This is good (and apologies to Andreas if I ever gave the impression that his parser was somehow flawed).

However, this is all well and good ONLY if everybody types the names in the correct format given the structure of the names. In my experience, this is not the case, and so the parser has to somehow cope when people do not follow the standard way of doing things.

As an example, I have just looked at a bibtex file that I downloaded from a leading journal’s web site and one of the authors’ names is “Joyce van Loon”, which would not parse correctly.
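
To make the different formats concrete: bibtex recognises three ways of writing a name ("First von Last", "von Last, First" and "von Last, Jr, First") and tells them apart by the number of commas. A minimal sketch of that check is below. The function name bibtexNameForm is hypothetical, and commas inside nested braces are not handled.

```php
<?php
// Classify a single bibtex name by the number of commas it contains.
// Hypothetical helper, for illustration only.
function bibtexNameForm(string $name): string
{
    // Ignore commas inside (non-nested) brace groups.
    $outside = preg_replace('/\{[^{}]*\}/', '', $name);
    switch (substr_count($outside, ',')) {
        case 0:
            return 'First von Last';
        case 1:
            return 'von Last, First';
        case 2:
            return 'von Last, Jr, First';
        default:
            return 'unknown';
    }
}

echo bibtexNameForm('Joyce van Loon'), "\n";
echo bibtexNameForm('van Woensel, Tom'), "\n";
```

A parser that wants to cope with names as people actually type them would need a different code path for each of these forms.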

 

In case you want to read more, in a previous blog I pointed towards Norman Walsh’s web page as a good explanation of bibtex author formats. A recent forum entry I looked at pointed towards a slightly different Norman Walsh page.

 

Parsing Bibtex Authors: How I Do It

I mentioned (see here for a list of previous bibtex posts) that I was facing a challenge of parsing bibtex authors. One of the problems with the current system that I use is that, although it works pretty well, it does not handle all cases correctly. To give you an example, if I have:

T. van Woensel (who is actually one of my co-authors)

in my bibtex file, this parses as

Woensel, T. va

whereas, it should parse as

T. van Woensel

That is, the family name is actually “van Woensel”, and not “Woensel”.

In a bibtex file, you would place the name within braces, and the bibtex parsers that are built into (say) latex would treat the text as one element. So, you would type this into your bibtex file.

T. {van Woensel}

But the current parser I have, although not displaying the braces, does not treat the elements inside the braces as one element.
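
One way a parser could honour the braces is to treat anything inside a pair of braces as a single token when splitting a name into words. This is only a sketch of that idea, not something taken from the existing parser: tokenizeName is a hypothetical name and nested braces are not handled.

```php
<?php
// Split a name into tokens on whitespace, keeping {braced groups} together.
// Sketch only: nested braces are not handled.
function tokenizeName(string $name): array
{
    // Either a complete brace group, or a run of characters that contains
    // no whitespace and no braces.
    preg_match_all('/\{[^{}]*\}|[^\s{}]+/', $name, $matches);
    // Drop the protective braces from each token.
    return array_map(function (string $token): string {
        return trim($token, '{}');
    }, $matches[0]);
}

print_r(tokenizeName('T. {van Woensel}'));
```

For "T. {van Woensel}" this gives two tokens, "T." and "van Woensel", so the family name stays in one piece.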

This gave me an opportunity to try out using regular expressions. I must admit that I do not have a lot of experience of using these, although I do recognise that they are very powerful and I suppose I have used them in various guises, and on various operating systems, over the years.

Anyhow, here is the idea.

Step 1

I take each bibtex author string and split it into the various authors using the PHP explode function. That is:

$authors = explode(" and ", str_replace(array("\n", "\r", "\t"), " ", $entry["author"]));

This line is actually taken from formatAuthors, which is part of the code supplied by Andreas Classen.

This results in the bibtex author string being split into separate authors (as we use " and " as a delimiter, which is the standard delimiter in bibtex author names). The various authors are stored in an array called $authors.
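
If the author field is spread over several lines or padded with extra spaces, a slightly more defensive version of the split might look like the sketch below. The function name splitAuthors is my own for this example. It collapses every run of whitespace into a single space before exploding, so each author comes back already trimmed.

```php
<?php
// Split a bibtex author field into individual authors.
// Hypothetical helper, for illustration only.
function splitAuthors(string $authorField): array
{
    // Collapse newlines, tabs and repeated spaces into single spaces.
    $normalised = trim(preg_replace('/\s+/', ' ', $authorField));
    // " and " (with the surrounding spaces) separates the authors.
    return explode(' and ', $normalised);
}

print_r(splitAuthors("T. van Woensel and\n\tJ. Smith"));
```

Note that, like the one-line version, this matches only a lower case " and ".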

 

Step 2

Next, we iterate through each author:

foreach($authors as $author) {

// stuff (see below)

}

 

Step 3

This step is an extension to what Andreas does. I now protect certain names. For example, if I want to look for “van xxxx”, I use the following replacement regular expression.

$author = preg_replace('/([Vv]an) +/', '$1xxxyyyzzz', $author);

If the name was “van Woensel”, this will make it “vanxxxyyyzzzWoensel”. I use xxxyyyzzz, as this is unlikely to appear in somebody’s name.

 

The other names I currently protect are “Vanden” and “De”. That is, I would add these two lines:

 

$author = preg_replace('/([Vv]anden) +/', '$1xxxyyyzzz', $author);
$author = preg_replace('/([Dd]e) +/', '$1xxxyyyzzz', $author);

Of course, we could put the various names into an array and iterate through them, which would make it easier to maintain. In my actual implementation, I put all this messy regular expression stuff into a function (and use an array to hold the various names), just to keep it all out of the way of the main parsing function.
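
A sketch of what such a function might look like is below. It is not my actual implementation: the names protectNames, NAME_GLUE and NAME_PREFIXES are made up for this example. It also makes two changes of its own to the patterns shown above. It adds a word boundary (\b), so that a given name such as "Ivan" or "Claude" is not mistaken for a prefix, and it matches the prefixes case-insensitively.

```php
<?php
// Placeholder that is unlikely to appear in a real name.
const NAME_GLUE = 'xxxyyyzzz';

// Prefixes to protect; extend this list as new cases turn up.
const NAME_PREFIXES = ['van', 'vanden', 'de'];

// Glue each protected prefix to the word that follows it.
function protectNames(string $author): string
{
    // Build one alternation, e.g. van|vanden|de, escaping each prefix.
    $alternatives = implode('|', array_map(function (string $prefix): string {
        return preg_quote($prefix, '/');
    }, NAME_PREFIXES));

    // \b stops the pattern firing inside longer words such as "Ivan";
    // the i flag covers "Van", "van" and so on.
    $pattern = '/\b(' . $alternatives . ') +/i';

    // ${1} keeps the prefix exactly as typed; the spaces become the glue.
    return preg_replace($pattern, '${1}' . NAME_GLUE, $author);
}

echo protectNames('T. van Woensel'), "\n";
```

Building a single pattern from the array, rather than looping over one pattern per prefix, means a name such as "van de Velde" has both of its prefixes protected in the same pass.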

 

Step 4

I now call the formatAuthors function supplied by Andreas. That is:

$author = formatAuthors($author);

Continuing with the example above, this would now return

vanxxxyyyzzzWoensel, T

 

Step 5

Now, of course, I have to strip out the protection, so I do another regular expression replacement. That is:

$author = preg_replace('/([Vv]an)xxxyyyzzz+/', '$1 ', $author);

Which just replaces xxxyyyzzz with a space.

Looking back at step 3, I would also have to unprotect “Vanden” and “De”. That is:

$author = preg_replace('/([Vv]anden)xxxyyyzzz+/', '$1 ', $author);
$author = preg_replace('/([Dd]e)xxxyyyzzz+/', '$1 ', $author);
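
To see steps 3 to 5 working together, here is a small self-contained sketch. Please treat it as an illustration only. The function naiveFormat is a hypothetical stand-in for formatAuthors (I am not reproducing the code from Andreas here, and his function may well behave differently), and protect, unprotect and GLUE are names made up for this example. The protect pattern also adds a word boundary and a case-insensitive flag, which the patterns above do not have.

```php
<?php
const GLUE = 'xxxyyyzzz';

// Step 3: glue a protected prefix to the word that follows it.
function protect(string $author): string
{
    return preg_replace('/\b(van|vanden|de) +/i', '${1}' . GLUE, $author);
}

// Step 5: turn every placeholder back into a single space.
function unprotect(string $author): string
{
    return str_replace(GLUE, ' ', $author);
}

// Hypothetical stand-in for formatAuthors: it treats the last word as the
// family name and reduces every other word to its first letter.
function naiveFormat(string $author): string
{
    $words  = preg_split('/\s+/', trim($author));
    $family = array_pop($words);
    $initials = array_map(function (string $word): string {
        return substr($word, 0, 1);
    }, $words);
    return $family . ', ' . implode(' ', $initials);
}

// Step 4 sits between the protect and unprotect calls.
echo unprotect(naiveFormat(protect('T. van Woensel'))), "\n";
```

Because the placeholder is the same for every prefix, a single str_replace is enough to undo the protection, whichever prefix was involved.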

 

This achieves what I set out to do. But, actually, the reason I wanted my own parser is so that I could split the authors into the family names and their given names. Now that I explode the authors before formatting them I do have each author in a separate string. Carrying out the above step sorts out the issue (well, at least one of the issues) with family names that comprise more than one word. After the above procedure I have each name in the format

family name, initial(s)

It is an easy task to use string manipulation to split the name into family and given names. This is how I do it.

 

$len = strlen($author);
$commaPos = strpos($author, ',');
$family = substr($author, 0, $commaPos);
$given = substr($author, $commaPos + 2, $len - $commaPos - 2);

I use $len and $commaPos just to make the code a little easier to follow. Of course, you might just want to incorporate the various string functions within the calls to substr.
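
One thing the four lines above do not check is the case where there is no comma at all, in which case strpos returns false. A sketch of a more forgiving version, using explode with a limit of two, is below. The function name splitName and the array keys are my own choices for this example.

```php
<?php
// Split "family, given" into its two parts.
// Hypothetical helper, for illustration only.
function splitName(string $author): array
{
    // A limit of 2 means only the first comma is used as the separator.
    $parts = explode(',', $author, 2);
    return [
        'family' => trim($parts[0]),
        // No comma means no given part was found.
        'given'  => isset($parts[1]) ? trim($parts[1]) : '',
    ];
}

print_r(splitName('van Woensel, T'));
```

Trimming both parts also means it does not matter whether the comma is followed by one space, several, or none.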

 

The result of all the above is that I can now parse author strings from bibtex, resolve certain names that did not work before and also split up family and given names. This is important for the research project I have in mind, more of which later.

 

 

 

Some of the Challenges in Parsing Bibtex Authors

In previous posts (see here) I have been talking about a system that I have been working on that enables the publications area of my web site to be driven by a bibtex file. It all seems to work pretty well and maintaining my web site is now a lot easier than it used to be. Another advantage (and one not to be sniffed at) is the fact that my web site has a uniform presentation. I also hope that once I have the various functions that I plan to implement I will be able to do things like present my papers sorted by journal name (so that you can see how many papers I have published in a given journal) and also publish a list of my co-authors. All of this, and more, should be quite easy once the main foundations are in place.

As I progress with this project, and I try to extend the system, there are various challenges that I keep coming up against. One of these is parsing author names.

The system I am developing is based on the one developed by Andreas Classen. It actually does a fantastic job of parsing author names. You simply pass a string of authors from the bibtex file to a function called formatAuthors, and you are returned a string that presents the authors in a standard way. Indeed, this is the method I use on my current web site (see here, but bear in mind the method may have changed by the time you read this).

I have recently been trying to write my own parser, for a research idea that I have. It is not easy! Just to give you a few examples of the issues that have to be addressed:

  • In bibtex, you have various ways that you can specify names. Norman Walsh does a better job than I could in describing some of the ways that names can be provided to bibtex.
  • The formatAuthors function provided by Andreas, as I said, does an excellent job but it is lacking in some ways. For example, it does not deal with people whose family name has more than one word, such as some people from Belgium and Holland who are often called “van xxxxx” or “de xxxxx”.
  • When I was experimenting recently, I saw an author who was called “Billy James III”. The “III” causes problems, in the same way that the Belgium/Holland names do. There would be similar problems with people who have “jr” at the end of their name.
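
As an illustration of the last point, one possible way of dealing with "III" or "jr" is to peel the suffix off before the rest of the name is parsed. This is only a sketch. The function name splitSuffix and the list of suffixes are assumptions of mine (the list is certainly incomplete), and it only covers names written in "First Last" order with no commas.

```php
<?php
// Suffixes to look for; this list is an assumption and certainly incomplete.
const NAME_SUFFIXES = ['jr', 'sr', 'ii', 'iii', 'iv'];

// Separate a trailing suffix such as "III" or "Jr." from the rest of a name
// written in "First Last" order.
function splitSuffix(string $name): array
{
    $words = preg_split('/\s+/', trim($name));
    // Compare the last word in lower case, without any trailing full stop.
    $last = strtolower(rtrim(end($words), '.'));
    if (count($words) > 1 && in_array($last, NAME_SUFFIXES, true)) {
        $suffix = array_pop($words);
        return ['name' => implode(' ', $words), 'suffix' => $suffix];
    }
    return ['name' => implode(' ', $words), 'suffix' => ''];
}

print_r(splitSuffix('Billy James III'));
```

Once the suffix has been separated, the remaining "Billy James" can go through the normal parsing, and the suffix can be put back wherever the output format wants it.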

None of these problems are insurmountable. Indeed, the bibtex style files do a great job of handling all types of names. The challenge for me is to try and develop a parser that is able to deal with anything I throw at it, and which will deal with thousands of names, rather than the ones I can easily check from the bibtex file that comprises just my own publications.

Anything that I do will be based on the formatAuthors function from Andreas, but I think it needs a little tweaking just to try and deal with a few more cases.