Dear Salonnières,
thank you so much for attending compu-salon episode 7, CGI-scripting the web. Here's a summary of our discussion.
This week (Mar 16) we'll talk about running parallel computations in Python, using a simple high-level framework. I may also show you a little MPI. Chad is providing a simple (I hope) example in GW data analysis. Since the Tapir seminar is at 3pm, I will let you finish asking your questions, and we will start at 4:15pm.
I'll send out a reminder on Friday morning. Until then!
Michele
Also at www.vallis.org/salon, now in beautiful, prettified html.
How does the web work? In the words of the Python docs,
When a user enters a web site, their browser makes a connection to the site’s web server (this is called the request). The server looks up the file in the file system and sends it back to the user’s browser, which displays it (this is the response). This is roughly how the underlying protocol, HTTP, works.
However, web servers are not limited to serving files:
Dynamic web sites are not based on files in the file system, but rather on programs which are run by the web server when a request comes in, and which generate the content that is returned to the user. They can do all sorts of useful things, like display the postings of a bulletin board, show your email, configure software, or just display the current time. These programs can be written in any programming language the server supports. Since most servers support Python, it is easy to use Python to create dynamic web sites.
These programs run on web servers, and they should not be confused with the programs that are downloaded as part of a webpage, and then run locally in the browser (i.e., on your computer). Such is the case of Java and Flash programs, which run in a sandbox and display output in "applet-like" rectangles, and of JavaScript, which has more organic access to the browser, and can modify on the fly the webpage that contains it. Another recent and scientist-friendly example is Wolfram's Computable Document Format, which can run snippets of Mathematica code in the browser.
In fact, many modern examples of the "web 2.0" (such as Google Maps and Google Mail) use both techniques together, running browser code that handles the interface and requests data from database backends running on the server. Today, however, we'll stick to running on the server, and we'll discuss the simplest and most compatible way to do so: CGI, the Common Gateway Interface.
A CGI program is run by requesting from the server a file that has a special file ending (usually .cgi) or that sits in a specially designated directory (usually cgi-bin). When that happens, instead of sending back the contents of the file, the web server executes it, and sends the output back to the client. Before executing the program, the server places a number of useful pieces of information in environment variables. The output of the CGI program must begin with a header (such as Content-Type: text/html, plus a blank line) specifying the format of the data that follows.
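To make the mechanics concrete, here is a minimal sketch of what the server does with a CGI program; it is not how Apache is actually implemented. It is written for Python 3 (unlike the Python 2 scripts in these notes), and run_cgi and the tiny SCRIPT are made up for illustration. The sketch puts the query in an environment variable, runs the program, and splits its output at the first blank line into header and body.

```python
import os
import subprocess
import sys

# a stand-in CGI program (hypothetical): header, blank line, then the body
SCRIPT = r'''
import os
print("Content-Type: text/plain")
print()
print("you asked for: " + os.environ["QUERY_STRING"])
'''

def run_cgi(script_source, query_string):
    """Run a CGI-style program the way a server would, return (headers, body)."""
    # the server passes request information through environment variables
    env = dict(os.environ, REQUEST_METHOD="GET", QUERY_STRING=query_string)
    result = subprocess.run([sys.executable, "-c", script_source],
                            env=env, capture_output=True, text=True, check=True)
    # the first blank line separates the header from the data that follows
    headers, _, body = result.stdout.partition("\n\n")
    return headers, body

headers, body = run_cgi(SCRIPT, "alpha=60")
print(headers)
print(body)
```

A real server would then forward the header and body to the browser over HTTP; here we just print them.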
Let's see a basic CGI script in Python that prints out the value of all the environment variables:
#!/usr/bin/env python
# this "shebang" sequence tells the operating system that this script
# must be run with Python
# enable the reporting of Python errors in the output of the CGI script
import cgitb
cgitb.enable()
# we'll also need the os module, which grants access to the
# environment variable through the Python dictionary os.environ
import os
# print the CGI header (in this case, announcing plain text)
print "Content-Type: text/plain"
print
# iterate over dictionary
for key,value in os.environ.items():
    print '{0:20}: {1}'.format(key,value)
...where in the last line we have used Python's new recommended string formatting syntax.
If you do run this program as a CGI script, you'll discover that the web server learns a lot about the browser and operating system of the requester. The CGI script also gains access to information about the web server itself, and its capabilities.
However, a useful script needs parameters. How does it get them?
The answer is that CGI scripts are usually called from forms embedded inside HTML pages. A form is an HTML construct that can contain any number of inputs (text fields, checkboxes, menu dropdowns, file uploads, etc.), as well as a submit button. The CGI script that is called upon submit is specified as the action attribute of the form itself.
Here's the full HTML5 code for the webpage that we used in our salon, which specifies parameters for the sensitivity of a LISA-like space-based gravitational-wave observatory:
<!DOCTYPE html> <!-- we're writing html5; and this is a comment -->
<html> <!-- main html container -->
<head> <!-- the head contains metadata -->
<title>A simple cgi-bin example</title> <!-- all html _tags_ appear in pairs... -->
<meta charset="utf-8" /> <!-- except for self-closing ones! -->
</head>
<body> <!-- the body contains the main webpage content -->
<!-- here's the form with the CGI script URL -->
<form action="example.cgi" method="get">
<!-- several text inputs follow, with plain text labels -->
angle [deg]: <input type="text" name="alpha" value="60"> <br/>
L [m]: <input type="text" name="L" value="5e9"> <br/>
displ. noise [m/s^2/Hz^1/2]: <input type="text" name="dis" value="3e-15"> <br/>
pos noise [m/Hz^1/2]: <input type="text" name="pos" value="2e-11"> <br/>
<!-- here's the submit button -->
<input type="submit" value="Compute">
</form>
</body>
</html>
The inputs define the name and default value of the variables that are passed to the script. This can happen in two ways. With the GET method, the parameters are included as the final part of the URL (such as example.cgi?alpha=60&L=5e9&dis=3e-15&pos=2e-11), and the server places this string (the part after the question mark) in the environment variable QUERY_STRING before calling the CGI script. With the POST method, the parameters are inserted in a message body, which the server sends to the CGI script as standard input. It may seem that retrieving the parameters could be a tedious job, but luckily the Python standard library comes to the rescue with the cgi module.
By the way, the official recommendation in the HTML specifications is that GET should be used only when form processing is "idempotent" (e.g., retrieving data, as opposed to making a purchase). But POST may be necessary even in that case if the list of parameters gets too long.
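To see what the cgi module saves us from, here is a rough sketch of retrieving the parameters by hand. It is written for Python 3 and uses urllib.parse from the standard library rather than cgi; read_params is a hypothetical helper, and it assumes plain url-encoded forms with ASCII values (no file uploads), keeping only the first value given for each name.

```python
import io
from urllib.parse import parse_qs, urlencode

def read_params(environ, stdin):
    """Return the form parameters as a dict of name -> first value."""
    if environ.get("REQUEST_METHOD", "GET") == "POST":
        # POST: the server passes the message body on standard input,
        # and its length in CONTENT_LENGTH
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the server copies everything after "?" into QUERY_STRING
        raw = environ.get("QUERY_STRING", "")
    # parse_qs returns lists, since a name may appear more than once
    return {name: values[0] for name, values in parse_qs(raw).items()}

# build the same string a browser would send for the form above
query = urlencode({"alpha": "60", "L": "5e9", "dis": "3e-15", "pos": "2e-11"})

via_get = read_params({"REQUEST_METHOD": "GET", "QUERY_STRING": query},
                      io.StringIO(""))
via_post = read_params({"REQUEST_METHOD": "POST",
                        "CONTENT_LENGTH": str(len(query))},
                       io.StringIO(query))
print(query)
print(via_get == via_post)
```

In a real script you would pass os.environ and sys.stdin; the fake dictionaries and io.StringIO are only there so the sketch runs without a web server.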
cgi works in the same way whether the CGI script has been called with GET or POST: it places all the form inputs into a dictionary-like object that is returned by cgi.FieldStorage(). Thus, if the html form given above is passed as is to a CGI script, the object (say, form) would contain the keys 'alpha', 'L', 'dis', 'pos' (but not the submit button, which has no name attribute), with form['alpha'].value = '60' (the string, not the number), form['L'].value = '5e9', and so on.
A special case is the form input that lets the user upload a file (<input type="file" name="FIELDNAME">); in that case form['FIELDNAME'].file would be an open file-like object (so it tests as true), form['FIELDNAME'].filename would be the name of the file on the user's computer, and form['FIELDNAME'].value would be set to the contents of the file itself.
It's really quite simple... in the Python script below, which we built in real time at the salon, we collect the parameters from the form given above, we generate a plot of the LISA sensitivity, and we include it inline (as SVG) in the webpage returned to the user.
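The inline-SVG step can be tried in isolation, without matplotlib. Here is a small sketch in Python 3: SAMPLE_SVG is a made-up stand-in for the plot that a plotting library would write, and strip_preamble is a hypothetical helper. The idea is simply to drop the XML declaration and DOCTYPE, which do not belong in the middle of an html page, and keep everything from the <svg tag onward.

```python
# a made-up SVG file, with the kind of preamble that plotting libraries write
SAMPLE_SVG = '''<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg" width="432pt" height="288pt">
  <circle cx="50" cy="50" r="40" />
</svg>
'''

def strip_preamble(svg_text):
    """Return the SVG text starting at the opening <svg tag."""
    start = svg_text.find("<svg")
    if start < 0:
        raise ValueError("no <svg> element found")
    return svg_text[start:]

# embed the bare <svg> element in a minimal html page
page = ("<!DOCTYPE html>\n<html><body>\n"
        + strip_preamble(SAMPLE_SVG)
        + "</body></html>\n")
print(page)
```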
Setting up the web server to run CGI scripts is the kind of detail from which I wanted you to be insulated in this salon. But if things don't work, I want you to be able to go to your system administrator and know what to ask them. So here goes.
Most Linux hosts, such as www.tapir.caltech.edu, include the web server Apache (httpd), which may already be set up to run CGI scripts in the user directories http://www.tapir.caltech.edu/~USERNAME. Technically, files ending in .cgi should be registered as CGI scripts (with the option AddHandler cgi-script .cgi in httpd.conf); next, AllowOverride Options should be specified for home directories, so that CGI scripts can be activated by dropping an .htaccess file containing Options +ExecCGI in the directory that contains the scripts. Furthermore, suEXEC and mod_userdir should be enabled, so that CGI scripts are run as the user that owns them.
This raises a question about the security of CGI scripts. Web servers are regularly patched, so it's unlikely that they are vulnerable to a buffer overflow attack; the same is true for Python and the cgi module. So you should be OK if you remember that the script runs as you, and therefore has the same level of access to the system that you have. Thus, you should never run as code (e.g., with eval) the strings provided in a form, and you should never access files in your system on the basis of filenames provided in the form.
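As a sketch of what this advice looks like in practice (Python 3; parse_positive_float, lookup_file, and the ALLOWED_FILES table with its paths are all invented for illustration): convert form strings with float(), which only parses numbers, and let the form select files by a short key that you look up in a fixed table.

```python
import math

def parse_positive_float(text, name):
    """Convert a form string to a finite positive float, or raise ValueError."""
    value = float(text)  # float() only parses numbers; it never runs code
    if not math.isfinite(value) or value <= 0:
        raise ValueError("{0} must be a finite positive number".format(name))
    return value

# hypothetical table: the form sends a short key, never a path
ALLOWED_FILES = {"lisa": "data/lisa.txt", "ligo": "data/ligo.txt"}

def lookup_file(key):
    """Map a key from the form to a file chosen by us, or raise ValueError."""
    if key not in ALLOWED_FILES:
        raise ValueError("unknown dataset: {0!r}".format(key))
    return ALLOWED_FILES[key]

print(parse_positive_float("5e9", "L"))

# hostile inputs are rejected instead of being executed or opened
for bad_number in ("__import__('os').system('ls')", "nan", "-1"):
    try:
        parse_positive_float(bad_number, "L")
    except ValueError as err:
        print("rejected:", err)

try:
    lookup_file("../../etc/passwd")
except ValueError as err:
    print("rejected:", err)
```

The positivity check suits lengths and noise levels like the ones in our form; other parameters would need their own ranges.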
If you must set up the web server yourself on OS X, here are a few steps:
1. Register .cgi files as CGI scripts by editing /private/etc/apache2/httpd.conf and uncommenting the line AddHandler cgi-script .cgi. You'll need to be an admin and use sudo to edit the file.
2. Allow overrides in your user web directory (which is $HOME/Sites, and can be seen as http://$HOSTNAME/~vallis) by changing the AllowOverride line to AllowOverride Options.
3. After making these changes, run sudo apachectl restart to make the webserver see them.
4. Drop an .htaccess file containing the line Options +ExecCGI inside your $HOME/Sites.
There's one more unfortunate issue: the OS X web server, as distributed, is not configured for suexec. As we said above, this is desirable so that your script can read and write your files without you having to change permissions, etc. It is also a good idea security-wise.
To enable suexec, you need to recompile the web server! For once I will give you a canned recipe, adapted from Sam Ruby, which should be good for OS X 10.6–10.8:
$ httpd -v
# modify the following lines depending on what version of Apache you see
$ wget http://archive.apache.org/dist/httpd/httpd-2.2.20.tar.gz
$ tar xzf httpd-2.2.20.tar.gz
$ cd httpd-2.2.20
$ ./configure --enable-suexec --with-suexec-docroot=/Library/WebServer/Documents --with-suexec-gidmin=20 --with-suexec-uidmin=501 --with-suexec-logfile=/var/log/apache2/suexec_log --with-suexec-caller=_www --with-suexec-userdir=Sites
$ make
$ sudo cp support/suexec /usr/bin
$ sudo chown root:_www /usr/bin/suexec
$ sudo chmod 4750 /usr/bin/suexec
$ cd modules/generators
$ sudo apxs -i -a -c mod_suexec.c
$ cd ../../.. && rm -rf httpd-2.2.20
$ sudo apachectl restart
Here's example.cgi, to be used with the html form given above.
#!/usr/bin/env python
import cgi, sys, tempfile, string, math
# enable "traceback" error reporting
import cgitb; cgitb.enable()
# if we have installed packages in non-standard system locations,
# we must update the PYTHONPATH so that the Python interpreter will find them
sys.path.append('/Users/vallis/env/cs/lib/python2.6/site-packages')
import numpy as N
import matplotlib.pyplot as P
# this time we return html
print "Content-Type: text/html"
print
# in Python, triple quotes let us break strings across multiple lines
print """
<!DOCTYPE html>
<html>
<head>
<title>Response</title>
<meta charset="utf-8" />
</head>
<body>
"""
# let cgi parse the inputs
form = cgi.FieldStorage()
# distribute the values to local variables
# (and apply float to convert strings to numbers)
alpha, L, dis, pos = [float(form[key].value) for key in ('alpha','L','dis','pos')]
# a vector of frequencies for the x-axis
f = 10**N.linspace(-5,-1,100)
# compute sensitivity
c = 299792458
resp = N.sqrt(1 + (4 * f**2 * L**2)/(0.41**2 * c**2))
S_dis = 2 * dis / (2*math.pi*f)**2
S_pos = pos
S_h = (math.sqrt(5)/math.sin(alpha * (math.pi / 180)) * resp *
N.sqrt(S_dis**2 + S_pos**2) / L)
# plot as log-log, add labels
P.loglog(f,S_h)
P.xlabel('f [Hz]')
P.ylabel('S_h [Hz^-1/2]')
# set a reasonable figure size (in inches); to get a pixel size,
# multiply by 72 dpi
fig = P.gcf()
fig.set_size_inches(6,4)
# to avoid saving to a local file that may be overwritten by other callers,
# or that may take space in the server filesystem, we write SVG to a temporary
# file, then include the file (except for the preamble) in the html
with tempfile.TemporaryFile() as tmp:   # will remove the temp file automatically
    P.savefig(tmp,format='svg')
    tmp.seek(0); svg = tmp.read()       # go back to the beginning of the file, read it
start = svg.find('<svg')                # look for the end of the preamble
print svg[start:]                       # and print the rest of the file
# complete the html file
print """
</body>
</html>
"""