In previous posts (see here) I have described a system I am working on that drives the publications area of my web site from a bibtex file. It all seems to work pretty well, and maintaining my web site is now a lot easier than it used to be. Another advantage (and one not to be sniffed at) is that my web site has a uniform presentation. Once I have implemented the various functions I have planned, I also hope to be able to present my papers sorted by journal name (so that you can see how many papers I have published in a given journal) and to publish a list of my co-authors. All of this, and more, should be quite easy once the main foundations are in place.
As I progress with this project, and I try to extend the system, there are various challenges that I keep coming up against. One of these is parsing author names.
The system I am developing is based on the one developed by Andreas Classen, which does a fantastic job of parsing author names. You simply pass a string of authors from the bibtex file to a function called formatAuthors, and it returns a string that presents the authors in a standard way. Indeed, this is the method I use on my current web site (see here, but bear in mind the method may have changed by the time you read this).
I have recently been trying to write my own parser, for a research idea that I have. It is not easy! Just to give you a few examples of the issues that have to be addressed:
- In bibtex, there are various ways to specify names. Norman Walsh does a better job than I could of describing some of the ways that names can be provided to bibtex.
- The formatAuthors function provided by Andreas does, as I said, an excellent job, but it is lacking in some ways. For example, it does not deal with people whose family name has two parts, such as some people from Belgium and the Netherlands whose family names take the form “van xxxxx” or “de xxxxx”.
- When I was experimenting recently, I saw an author who was called “Billy James III”. The “III” causes problems, in the same way that the Belgian and Dutch names do. There would be similar problems with people who have “Jr” at the end of their name.
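To make these issues concrete, here is a minimal sketch in Python of how such names might be pulled apart. It is not the formatAuthors function from Andreas, nor what bibtex itself does internally; the names split_authors, parse_name and SUFFIXES are hypothetical ones I have made up for illustration. It assumes that authors are separated by the word "and", that a lowercase word such as "van" or "de" marks the start of the family name, and that names may be written as "First von Last", "von Last, First" or "von Last, Jr, First". Peeling a trailing "III" or "Jr" off a name that has no commas is my own heuristic, not a bibtex rule.

```python
import re

# Hypothetical suffix list for this sketch; extend as needed.
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}


def split_authors(field):
    """Split an author field on the word 'and' (brace protection is ignored here)."""
    return [a for a in re.split(r"\s+and\s+", field.strip()) if a]


def _split_von_last(words):
    """Split words into (before, von, last), treating lowercase words as particles."""
    # The final word always belongs to the family name, so it is never a particle.
    lower = [i for i, w in enumerate(words[:-1]) if w[0].islower()]
    if not lower:
        return words[:-1], [], words[-1:]
    return words[:lower[0]], words[lower[0]:lower[-1] + 1], words[lower[-1] + 1:]


def parse_name(name):
    """Return the 'first', 'von', 'last' and 'jr' parts of one author name."""
    parts = [p.strip() for p in name.split(",")]
    jr = []
    if len(parts) == 1:
        # "First von Last"
        words = parts[0].split()
        # Heuristic (an assumption, not a bibtex rule): peel off a trailing suffix.
        if len(words) > 2 and words[-1].lower() in SUFFIXES:
            jr, words = words[-1:], words[:-1]
        first, von, last = _split_von_last(words)
    else:
        # "von Last, First" or "von Last, Jr, First"
        before, von, last = _split_von_last(parts[0].split())
        if von:
            von = before + von
        else:
            last = before + last  # everything before the comma is family name
        if len(parts) >= 3:
            jr = parts[1].split()
        first = parts[-1].split()
    return {
        "first": " ".join(first),
        "von": " ".join(von),
        "last": " ".join(last),
        "jr": " ".join(jr),
    }


for author in split_authors("Ludwig van Beethoven and Billy James III"):
    print(parse_name(author))
```

This sketch will still get some names wrong. A capitalised particle (as in "Van xxxxx") is treated as a first name, names protected by braces are not handled at all, and the suffix heuristic would misfire on anybody whose family name happens to be in the suffix list.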
None of these problems are insurmountable. Indeed, the bibtex style files do a great job of handling all types of names. The challenge for me is to try and develop a parser that is able to deal with anything I throw at it, and which will deal with thousands of names, rather than the ones I can easily check from the bibtex file that comprises just my own publications.
Anything that I do will be based on the formatAuthors function from Andreas, but I think it needs a little tweaking to deal with a few more cases.