Dear colleagues,

thanks to all those who came to our compu-salon, episode 2. Here's a brief summary of what we covered, with pointers, etc.

The next salon will be this Friday, Feb 3, at 4:30pm. We'll introduce Python numerical libraries such as numpy, and we'll discuss interfacing Python with C (for performance). I'm not sure yet what example we'll use, perhaps something as simple as plotting the Mandelbrot set.

I've received some good tips regarding the structure and style of these presentations, so episode 3 should be a little easier to follow.

I'll send out a reminder on Friday morning. And remember, Simple is Better than Complex!

Michele

Compu-salon ep. 2 (2012/01/27) in summary

Also at www.vallis.org/salon.

In this second meeting, we talked about:

Installing Python packages

Sorry if this is verbose, but I assure you that you want the detail here. A previously messy situation (many special cases on different platforms, etc.) is now much better. There is now one widely recommended recipe:

Keep one (or more) Python library installations isolated in "virtual environments", which are private to each user and possibly to each of her projects, thus avoiding conflicts between old and newly installed packages, preserving reproducibility of research, etc. Virtual environments are created with

$ virtualenv DIRNAME/ENVNAME

(for instance, DIRNAME could be $HOME/env and ENVNAME could be cs) and activated with

$ source DIRNAME/ENVNAME/bin/activate

which you can alias to something quick inside your shell startup file (.login for csh/tcsh, .bash_profile for bash). If you use csh/tcsh rather than bash, the above should be source DIRNAME/ENVNAME/bin/activate.csh. (Throughout this tutorial, I will generally use bash-relevant commands, since bash is POSIX compatible and the shipped default on many Unix-like systems, including OS X.)

Unfortunately there's a bootstrap problem if virtualenv itself is not installed already... on OS X, you can do sudo easy_install virtualenv.

To actually install the packages, you'd first activate a virtualenv, and then use pip install XXX. pip, which is distributed with virtualenv, is very smart: it works from sources (the most reliable way), it knows how to download packages from a standard repository (PyPI), it knows to install within the virtualenv, it will try to install dependencies if needed, it can remove packages, and it provides humane error messages.

For instance, to install Michele's favorite Python shell, as well as the arbitrary-precision math library that we used last week, and for good measure numpy, do:

# ...do after source $HOME/env/cs/bin/activate...
$ pip install ipython
$ pip install mpmath
$ pip install numpy
# for extra goodness, do "easy_install readline" after installing ipython
# (but that may not be required with Homebrew's Python 2.7)
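To confirm that Python is really running from the virtual environment (and so that pip installed into it rather than into the system Python), you can ask the interpreter itself. This is only a minimal sketch: the helper name in_virtualenv is made up for illustration, and it relies on the fact that older virtualenv releases set sys.real_prefix, while the built-in venv module and recent virtualenv releases make sys.base_prefix differ from sys.prefix.

```python
import sys

def in_virtualenv():
    """Return True if this interpreter appears to run inside a virtual environment.

    Hypothetical helper for illustration, not part of virtualenv itself.
    """
    # older virtualenv releases record the system Python in sys.real_prefix
    if hasattr(sys, 'real_prefix'):
        return True
    # venv and recent virtualenv leave sys.base_prefix pointing at the
    # system Python, while sys.prefix points at the environment
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix

print(in_virtualenv())
print(sys.prefix)   # the directory that this interpreter installs into
```

After source $HOME/env/cs/bin/activate, sys.prefix should point to $HOME/env/cs.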

There's one remaining case: Python packages that have non-Python dependencies (e.g., C libraries, command-line utilities...). For example, matplotlib needs pkg-config to install correctly. Instead of doing case-by-case installations, on OS X we're going to use the lightweight Homebrew system. (On Linux, you should get by with apt-get or yum.) Again, there's a bootstrap problem: install Homebrew by doing

$ /usr/bin/ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"

This cool one-liner only works in bash... in csh/tcsh, you can do

$ curl -fsSLO raw.github.com/mxcl/homebrew/go
$ /usr/bin/ruby go

Then you can do

$ brew install pkg-config
$ pip install matplotlib

Ta-da! (As my 2-year-old son loves to say...) There are other distribution systems for OS X such as Fink and MacPorts, but they have problems (Fink distributes binaries but these are not updated often enough; MacPorts is heavyweight because it wants to install its own version of everything). However, you can certainly use them if you wish. Just Google all your problems.

How LaTeX/BibTeX handle citations

  1. run latex once (say, on manuscript.tex); all requested citations get dumped into manuscript.aux, which also contains the location of the bibliography file (given, say, as \bibliography{manuscript})
  2. make sure you have the required BibTeX entries in manuscript.bib
  3. run bibtex (on manuscript), which looks for the citations inside manuscript.bib, and writes the resulting \bibitem{}'s to manuscript.bbl
  4. run latex again, which processes the \bibitem{}'s into the compiled paper, and dumps the citation handles into manuscript.aux
  5. run latex again, which finally gets the correct citation handles into the compiled paper
  6. [phew]
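The run sequence itself (steps 1 and 3 to 5) can also be scripted. Here is a rough sketch, assuming that latex and bibtex are on your PATH and that the file is called manuscript.tex; the function names build_commands and build are invented for this example. It uses subprocess.call rather than check_call because these tools can return a nonzero exit code for mere warnings.

```python
import subprocess

def build_commands(stem):
    """Return the commands for the latex/bibtex cycle on stem.tex, in order."""
    return [['latex', stem],    # step 1: citations are dumped into stem.aux
            ['bibtex', stem],   # step 3: \bibitem's are written to stem.bbl
            ['latex', stem],    # step 4: \bibitem's go into the compiled paper
            ['latex', stem]]    # step 5: citation handles are resolved

def build(stem):
    """Run the full cycle; return the exit code of each command."""
    codes = []
    for cmd in build_commands(stem):
        # a nonzero code means the tool reported a problem (possibly only a warning)
        codes.append(subprocess.call(cmd))
    return codes

# usage, from the directory that contains manuscript.tex and manuscript.bib:
# build('manuscript')
```

Step 2 (having the right entries in manuscript.bib) is deliberately left out of this sketch.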

We're going to automate step 2 with a Python script, by getting all BibTeX entries from ads.harvard.edu. We'll assume that the citation keys are already given in the ADS format. For that we need...

Regular expressions!

The manual for the Python re module is actually a pretty good introduction to regular expressions. A few of the features we use are

  • . can be anything (* is anything in shells, not in regexps), but it matches \n (a newline) only with the flag re.DOTALL
  • * (0 or more), + (1 or more), ? (0 or one)
  • (...) denote groups collected for use later
  • special characters need to be escaped with \; to avoid escaping the escapes, use raw Python strings such as r"\(.*\)" (this would match anything within parentheses, while normally parentheses denote regexp groups)
  • the Python function re.search(pattern,string,[flags]) returns None for no match, or a match object that can be queried with its method group().

For instance:

>>> import re
>>> mo = re.search(r'\\citation\{(.*)\}',r'\citation{foo,author={me}}')
>>> mo is None  # match successful!
False
>>> mo.group(0) # note the escaped backslash
'\\citation{foo,author={me}}'
>>> mo.group(1)
'foo,author={me}'
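Two more of these features in action: the difference between * and *? (greedy versus shortest match), and the effect of re.DOTALL on multi-line text. The strings here are made up for illustration; two_lines just mimics the shape of a BibTeX entry.

```python
import re

text = 'f(a) + g(b)'

# greedy: .* grabs as much as possible, up to the LAST closing parenthesis
greedy = re.search(r'\((.*)\)', text).group(1)    # 'a) + g(b'

# non-greedy: .*? grabs as little as possible, up to the FIRST one
lazy = re.search(r'\((.*?)\)', text).group(1)     # 'a'

two_lines = '@ARTICLE{key,\n  title = {T}\n}'

# without re.DOTALL, . cannot cross the newline, so no } is ever reached
no_dotall = re.search(r'@.*\}', two_lines)        # None

# with re.DOTALL, . matches \n too, so the match runs to the last }
with_dotall = re.search(r'@.*\}', two_lines, re.DOTALL)
```

Here with_dotall.group(0) is the whole of two_lines.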

We also need...

HTTP requests

When you type a URL into your web browser, you're usually requesting a specific file within a directory hierarchy. But sometimes you're calling a program (known as a CGI script, for "common gateway interface"), which runs on your input, generates a page, and returns it to you. Sometimes the program call is obvious in the URL

http://adsabs.harvard.edu/cgi-bin/basic_connect?qsearch=vallisneri+2005

but sometimes it looks like a filename (this is known as a RESTful call)

http://adsabs.harvard.edu/abs/2005PhRvD..72h4027B

These calls can be executed programmatically from Python. After installing the lovely requests package (pip install requests), you can do

>>> import requests
>>> r = requests.get('http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=2005PhRvD..72h4027B&data_type=BIBTEX&db_key=AST')
>>> r.ok
True
>>> r.text      # contains all the text returned by the query, as a unicode string
u'Query Results from the ADS Database [...]'
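The part of the URL after the ? is just a list of key=value pairs joined by &. As a small sketch of how such a query string is put together and taken apart, here is the same ADS URL built with the standard library. This is written for Python 3, where urlencode lives in urllib.parse (in Python 2 it is urllib.urlencode); the variable names are mine. The requests package can do the same encoding for you if you pass the dictionary as requests.get(base, params=params).

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'http://adsabs.harvard.edu/cgi-bin/nph-bib_query'   # the CGI script
params = {'bibcode': '2005PhRvD..72h4027B',                 # its parameters
          'data_type': 'BIBTEX',
          'db_key': 'AST'}

# join the pairs as key=value&key=value, escaping special characters
url = base + '?' + urlencode(params)
print(url)

# the reverse operation: recover the parameters from a URL
query = parse_qs(urlsplit(url).query)
print(query['bibcode'])
```

No network access is needed to run this; it only manipulates strings.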

Cool, now we have all the ingredients to make our ADS-grabbing tool. You could try to roll your own before looking at the solution below; or take my code, improve it, and share it with everybody.

Want homework?

Try extending the ADS tool (code below) to INSPIRE. Add handling for citations given as surname:adskey. For many extra points (not that I'm awarding them), try Google Scholar; use BeautifulSoup or pyquery to parse the resulting HTML.

Python code follows!

I include a license at the top since my colleague Davide Gerosa improved and published the ADS tool. I highly recommend his version.

# copyright 2012, 2017 Michele Vallisneri

# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
# IN THE SOFTWARE.

import sys, os, re
import requests

# this script will query ADS to collect the citations used in a LaTeX file
# into a BibTeX file

# you can add your own command-line option parsing and help line
# using getopt or optparse; we'll assume a command-line single argument FILE
# (given by sys.argv[1]), read from FILE.aux, and write to FILE.bib

auxfile = sys.argv[1] + '.aux'
bibfile = sys.argv[1] + '.bib'

# FIRST, we'll collect all citation keys from FILE.aux;
# citations will look like \citation{2004PhRvD..69j4017P,2004PhRvD..69j4017P}

cites = set()   # start with an empty set (like list, not ordered, no repetitions)

for line in open(auxfile,'r'):              # Python idiom—loop over every line from file
    m = re.search(r'\\citation\{(.*)\}',line)   # match \citation{...}, collect the ...
                                                # note that we escape \, {, and }
    if m:
        cites.update(m.group(1).split(','))     # if there's a match, split string by commas
                                                # add the resulting keys to set

# check: print "Seek:", cites

# SECOND, we'll check what refs we have already in FILE.bib, to avoid
# repetitive querying of ADS; references will look like
# @TYPE{key,
# ...

haves = []

if os.path.isfile(bibfile):                 # the bibfile exists...
    for line in open(bibfile,'r'):
        m = re.search(r'@.*?\{(.*),',line)  # .*\{ means "any # of any char followed by {";
                                            # .*?\{ means "the shortest string matching
                                            #              any # of any char followed by {"
        if m:
            haves.append(m.group(1))        # remember: append item to list
                                            #           extend list with list
                                            #           add item to set
                                            #           update set with list

# check: print "Have:", haves

# THIRD, we'll query ADS for all the keys that are not in haves,
# and write the resulting bibtex entries to FILE.bib

bibtex = open(bibfile,'a')      # open for appending (very C-like)

for c in cites:
    if c not in haves:
        r = requests.post('http://adsabs.harvard.edu/cgi-bin/nph-bib_query',    # CGI script
                          data = {'bibcode': c, 'data_type': 'BIBTEX'} )        # CGI parameters (note pretty indent)

# we could also have done a (more restrictive) GET HTTP request
# r = requests.get('http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=%s&data_type=BIBTEX' % c)
# where % yields the Python %-formatting for strings

        m = None
        if r.ok:                                            # ADS answered...
            m = re.search(r'(@.*\})',r.content,re.DOTALL)   # get everything after the first @
                                                            # until the last }
        if m:                                               # found it!
            bibtex.write(m.group(1) + '\n')                 # write to file with extra newline
            # check: print "Found:", c
        else:
            bibtex.write('@MISC{%s,note="{%s not found in ADS!}"}\n' % (c,c))
                                                            # record not found (or no BibTeX in the reply),
                                                            # we'll write a useful note in bibtex
            # check: print "Not found:", c

bibtex.close()                              # close file and flush write buffer

# FOURTH, we're done!