
I mentioned previously (see here for a list of previous bibtex posts) that I was facing the challenge of parsing bibtex authors. One of the problems with the system I currently use is that, although it works pretty well, it does not handle all cases correctly. To give you an example, if I have:

T. van Woensel (who is actually one of my co-authors)

in my bibtex file, this parses as

Woensel, T. va

whereas it should parse as

van Woensel, T.

That is, the family name is actually “van Woensel”, and not “Woensel”.

In a bibtex file, you would place the name within braces, and the bibtex parsers that are built into (say) LaTeX would treat the braced text as one element. So, you would type this into your bibtex file:

T. {van Woensel}

But the current parser I have, although not displaying the braces, does not treat the elements inside the braces as one element.
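To illustrate what "treating the braces as one element" could look like, here is a minimal sketch. It is not the approach taken in the steps below, and the function name tokenizeName is made up for illustration. It assumes braces are not nested and that braced groups are separated from other words by spaces.

```php
<?php
// Hypothetical helper: split a name into words, keeping {braced groups} whole.
function tokenizeName(string $name): array
{
    // Match either a (non-nested) braced group or a run of non-space characters.
    preg_match_all('/\{[^{}]*\}|\S+/', $name, $matches);

    // Drop the surrounding braces from the tokens that had them.
    return array_map(function ($token) {
        return trim($token, '{}');
    }, $matches[0]);
}

print_r(tokenizeName('T. {van Woensel}'));
```

Here the braced family name comes back as the single token "van Woensel", rather than as two separate words.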

This gave me an opportunity to try out regular expressions. I must admit that I do not have a lot of experience of using these, although I do recognise that they are very powerful and I suppose I have used them in various guises, and on various operating systems, over the years.

Anyhow, here is the idea.

Step 1

I take each bibtex author string and split it into the various authors using the PHP explode function. That is:

$authors = explode(" and ", str_replace(array("\n","\r","\t")," ", $entry["author"]));

This line is actually taken from formatAuthors, which is part of the code supplied by Andreas Classen.

This results in the bibtex author string being split into separate authors (as we use “ and ” as a delimiter, which is the standard delimiter in bibtex author names). The various authors are stored in an array called $authors.
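As a small worked example of this step, the sketch below runs the same split on a made-up author field (the names other than T. van Woensel are invented). The extra trim is my assumption rather than part of the original line: it tidies up the stray space left behind when a line break sits next to an " and ".

```php
<?php
// Hypothetical author field, with a line break in the middle as bibtex files often have.
$entry = array("author" => "T. van Woensel and A. N. Other\n and J. Smith");

// Replace line breaks and tabs with spaces, then split on the " and " delimiter.
$flat = str_replace(array("\n", "\r", "\t"), " ", $entry["author"]);

// trim() removes the leftover space around each author (an addition for illustration).
$authors = array_map('trim', explode(" and ", $flat));

print_r($authors);
```

Note that a name which itself contains " and " (say, an institution written inside braces) would still be split by this approach.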

 

Step 2

Next, we iterate through each author:

foreach ($authors as $author) {
    // stuff (see below)
}

 

Step 3

This step is an extension to what Andreas does. I now protect certain names. For example, if I want to look for “van xxxx”, I use the following replacement regular expression.

$author = preg_replace('/([Vv]an) +/', '$1xxxyyyzzz', $author);

If the name was “van Woensel”, this will make it “vanxxxyyyzzzWoensel”. I use xxxyyyzzz, as this is unlikely to appear in somebody’s name.

 

The other names I currently protect are “Vanden” and “De”. That is, I would add these two lines:

 

$author = preg_replace('/([Vv]anden) +/', '$1xxxyyyzzz', $author);
$author = preg_replace('/([Dd]e) +/', '$1xxxyyyzzz', $author);

Of course, we could put the various names into an array and iterate through them, which would make it easier to maintain. In my actual implementation, I put all this messy regular expression stuff into a function (and use an array to hold the various names), just to keep it all out of the way of the main parsing function.
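A rough sketch of the array idea might look like the following. The function name protectParticles and the $guard parameter are made up for illustration. It also departs from the patterns above in two ways, both of which are assumptions on my part: a \b word boundary is added so that "de" does not match the end of a given name such as "Claude", and the i flag replaces the [Vv] style character classes.

```php
<?php
// Hypothetical helper: glue each listed particle to the word that follows it.
function protectParticles(string $author, array $particles, string $guard = 'xxxyyyzzz'): string
{
    foreach ($particles as $particle) {
        // \b means only a whole word is matched; the i flag ignores case.
        $pattern = '/\b(' . preg_quote($particle, '/') . ') +/i';
        // ${1} puts back the particle as it was written, followed by the placeholder.
        $author = preg_replace($pattern, '${1}' . $guard, $author);
    }
    return $author;
}

$particles = array('van', 'vanden', 'de');
echo protectParticles('T. van Woensel', $particles), "\n";
echo protectParticles('Claude de Smet', $particles), "\n";
```

Adding a new name to protect is then a matter of adding one more entry to $particles.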

 

Step 4

I now call the formatAuthors function supplied by Andreas. That is:

$author = formatAuthors($author);

Continuing with the example above, this would now return

vanxxxyyyzzzWoensel, T

 

Step 5

Now, of course, I have to strip out the protection, so I do another regular expression replacement. That is:

$author = preg_replace('/([Vv]an)xxxyyyzzz+/', '$1 ', $author);

This just replaces xxxyyyzzz with a space.

Looking back at step 3, I would also have to unprotect “Vanden” and “De”. That is:

$author = preg_replace('/([Vv]anden)xxxyyyzzz+/', '$1 ', $author);
$author = preg_replace('/([Dd]e)xxxyyyzzz+/', '$1 ', $author);
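Since the placeholder is chosen so that it never appears in a real name, one possible simplification is to undo every protected name in a single call, without listing the names again. This is only a sketch, and the function name unprotectName is made up for illustration.

```php
<?php
// Hypothetical helper: turn every placeholder back into a single space.
function unprotectName(string $author, string $guard = 'xxxyyyzzz'): string
{
    // A plain string replacement is enough, as the placeholder is a fixed string.
    return str_replace($guard, ' ', $author);
}

echo unprotectName('vanxxxyyyzzzWoensel, T'), "\n";
```

The trade-off is that this relies entirely on the placeholder being unique, whereas the regular expressions above only touch a placeholder that follows one of the protected names.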

 

This achieves what I set out to do. But, actually, the reason I wanted my own parser is so that I could split each author into family name and given names. Now that I explode the authors before formatting them, I have each author in a separate string. Carrying out the above steps sorts out the issue (well, at least one of the issues) with family names that comprise more than one word. After the above procedure I have each name in the format:

family name, initial(s)

It is an easy task to use string manipulation to split the name into family and given names. This is how I do it.

 

$len = strlen($author);
$commaPos = strpos($author, ',');
$family = substr($author, 0, $commaPos);
$given = substr($author, $commaPos + 2, $len - $commaPos - 2);

I use $len and $commaPos just to make the code a little easier to follow. Of course, you might just want to incorporate the various string functions within the calls to substr.
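An alternative sketch of the same split uses explode with a limit of 2, which avoids the position arithmetic and also copes with a name that has no comma at all. The function name splitName and the choice to treat a comma-less string as a family name are assumptions made for illustration.

```php
<?php
// Hypothetical helper: split "family, given" at the first comma.
function splitName(string $author): array
{
    // The limit of 2 means any later commas stay inside the given-name part.
    $parts = explode(',', $author, 2);
    $family = trim($parts[0]);

    // With no comma, treat the whole string as the family name.
    $given = isset($parts[1]) ? trim($parts[1]) : '';

    return array('family' => $family, 'given' => $given);
}

$name = splitName('van Woensel, T');
echo $name['family'], ' | ', $name['given'], "\n";
```

Using trim rather than a fixed offset of 2 also means the split does not depend on there being exactly one space after the comma.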

 

The result of all the above is that I can now parse author strings from bibtex, resolve certain names that did not work before and also split up family and given names. This is important for the research project I have in mind, more of which later.

 

 

 

2 Responses

  1. I set the problem of parsing bibtex author names as an exercise in test-driven development for a software testing course.

    Some of the test cases might be interesting. The test cases don’t cover everything.

    http://user.it.uu.se/~justin/Teaching/Testing/bibtex_lab.pdf

    It is a good exercise in TDD.

    Incidentally, why do you have
    “T. van Woensel” instead of “van Woensel, Tom” in your bibtex entry?
    If you want the first names abbreviated then use the appropriate bibtex style.

  2. Justin

    Many thanks. I took a look at your site. It looks interesting.

    Good point about the way the name is presented. I could change my personal bibtex file but when I download files from publishers they often present them in the way I have done them as well.

    Perhaps I do need to read up on some of the “standards” though.

    Thanks again.

    Graham