Salon2 Lecture 2 - Feb 5, 2015

1 Python language review

Here are the topics we touched, with some weblinks to tutorials or discussions:

2 Choosing good third-party packages

Possible considerations:

  • Use effort commensurate to need
  • Start with googling
  • Watch "external" signs of reliability:
    • Decent website
    • Documentation
    • Funding if any
    • Support resources
    • Updates/releases
    • Version number >= 1.0?
    • Reasonable dependencies
  • Download/install, try tutorial
  • What are the alternatives?
  • When everything fails, write your own

3 Numpy

Python lists (and nested lists) are not usually suitable for mathematical work; they do not understand arithmetics natively, they are not efficient (because they require levels of indirection), and they require looping in Python, which is again not efficient.

By contrast, working with mathematical vectors and arrays requires fast operations on large memory blocks that contain homogeneous types (all ints, doubles, etc.). This is also the memory layout that's expected by C/Fortran programs, so interacting with external legacy or performance codes requires this same "buffer" paradigm.

This is what's provided by numpy, which is a building block and a basic infrastructure for most scientific applications in Python.

See Johansson's numpy tutorial; also on scipy, Python's "ecosystem" open-source scientific software, and on matplotlib.

3.1 Game of life with numpy

In [15]:
import numpy as np
import matplotlib.pyplot as pp
%matplotlib inline
In [5]:
import scipy

We'll represent the grid with a numpy array of integers. Let's start by initializing it randomly. numpy.random.rand yields reals in [0,1], so we round them to 0 (dead) or 1 (alive).

In [6]:
nsize = 32
In [7]:
init = np.floor(np.random.rand(nsize,nsize) + 0.5)

Or we could convert a boolean array (which will be True/False with 50% probability) to integers...

In [8]:
(np.random.rand(nsize,nsize) > 0.5) * 1
Out[8]:
array([[0, 0, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 1, 0, 1],
       [1, 0, 1, ..., 0, 1, 1],
       ..., 
       [1, 1, 0, ..., 1, 1, 1],
       [0, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 1, 1, 0]])

By the way, here's how you can count true elements in a boolean array:

In [9]:
np.sum(np.random.rand(nsize,nsize) > 0.5)
Out[9]:
517

Now we'd like to plot the array. imshow from matplotlib.pyplot seems to work well.

In [11]:
def plotgrid(x):
    # turn off interpolation to get hard cell edges; put (0,0) at bottom left;
    # use a gray colormap that will yield black/white
    pp.imshow(x,interpolation='none',origin='bottom',cmap=pp.cm.gray)
In [12]:
plotgrid(init)

Time to implement the evolution in the Game of Life; how to compute the number of neighbors? We could loop brute force, using modulo arithmetic to implement wrap-around boundary conditions...

In [19]:
def lifestep_brute(x):
    newx = x.copy()
    
    m, n = x.shape
    for i in range(m):
        for j in range(n):
            neighbors = (x[(i+1) % m,(j-1) % n] + x[(i+1) % m,j] + x[(i+1) % m,(j+1) % n] +
                         x[i,        (j-1) % n]                  + x[i,        (j+1) % n] +
                         x[(i-1) % m,(j-1) % n] + x[(i-1) % m,j] + x[(i-1) % m,(j+1) % n])
            
            if x[i,j] == 1 and (neighbors < 2 or neighbors > 3):
                newx[i,j] = 0
            elif x[i,j] == 0 and neighbors == 3:
                newx[i,j] = 1
    
    return newx

This works OK, although it's not especially fast...

In [94]:
# let's make a function to run some steps with a choice of "stepper" 
def runiter(stepper,iters=100,nsize=64):
    x = np.floor(np.random.rand(nsize,nsize) + 0.5)
    for i in range(iters):
        x = stepper(x)
In [23]:
%timeit runiter(lifestep_brute,100)
1 loops, best of 3: 1.02 s per loop

We can plot a few frames using matplotlib.pyplot.subplot, which works just like matlab's.

In [24]:
pp.figure(figsize=(12,12))
x = np.floor(np.random.rand(64,64) + 0.5)

for i in range(16):
    pp.subplot(4,4,i+1)
    plotgrid(x)

    x = lifestep(x)

In numpy, it's always more efficient to work with full arrays; however in this case this means that we need to do something special on the boundaries to get the wraparound effect. That's a lot of special cases to write in our code (the four edges, plus the four corners, all require different indexing).

We could make an extended copy of the array by pasting the last column to the left of the first, the first column to the right of last, etc. Or, we look in the numpy manual and find the function roll that yields a rolled-with-wraparound copy of a matrix. (numpy.lookfor, and in this case numpy.lookfor('shift') is a good way to find these things!)

In [36]:
def lifestep_roll(x):
    newx = x.copy()
    
    neighbors = 0
    for di in [(1,1),(1,0),(1,-1),(0,1),(0,-1),(-1,1),(-1,0),(-1,-1)]:
        neighbors += np.roll(np.roll(x,di[0],axis=0),di[1],axis=1)
    
    # notice the numpy *fancy indexing* to work on selected items of an array
    # based on some conditions
    newx[(x == 1) & (neighbors < 2)] = 0
    newx[(x == 1) & (neighbors > 3)] = 0
    newx[(x == 0) & (neighbors == 3)] = 1    

    return newx

It's 60 times faster!

In [37]:
%timeit runiter(lifestep_roll,100)
100 loops, best of 3: 17.6 ms per loop

...and somebody who's mathematically inclined may realize that the neighbor counting operation is a 2D convolution, which must certainly be implemented in one of these libraries. We do need to make sure to pass the right options to get back a matrix of the right size, and to impose the periodic boundary conditions.

In [41]:
import scipy.signal

kernel = np.array([[1,1,1],[1,0,1],[1,1,1]])

def lifestep_conv(x):
    newx = x.copy()
    
    neighbors = scipy.signal.convolve2d(x,kernel,boundary='wrap',mode='same')
    
    newx[(x == 1) & (neighbors < 2)] = 0
    newx[(x == 1) & (neighbors > 3)] = 0
    newx[(x == 0) & (neighbors == 3)] = 1    

    return newx

For this size of matrix, the timing is actually comparable to the rolling version.

In [43]:
%timeit runiter(lifestep_conv,100)
100 loops, best of 3: 17 ms per loop

Last, somebody who's very mathematically inclined may realize that the convolution can be performed efficiently with FFTs. It takes some trial and error to figure out some necessary adjustments.

In [88]:
import numpy.fft as npf

def fft_convolve2d(x,y):
    """2D convolution using FFT, adapted from Tristan Hearn's example."""
    
    fr  = npf.fft2(x)
    fr2 = npf.fft2(y,s=fr.shape)
    
    convolve = np.real(npf.ifft2(fr * fr2))

    # the result needs to be shifted around
    m, n = y.shape
    convolve = np.roll(convolve,-int((m-1)/2),axis=0)
    convolve = np.roll(convolve,-int((n-1)/2),axis=1)

    return convolve.round()
In [89]:
def lifestep_fftconv(x):
    newx = x.copy()
    
    neighbors = fft_convolve2d(x,kernel)
    
    newx[(x == 1) & (neighbors < 2)] = 0
    newx[(x == 1) & (neighbors > 3)] = 0
    newx[(x == 0) & (neighbors == 3)] = 1    

    return newx

This is slower!

In [93]:
%timeit runiter(lifestep_fftconv,100)
10 loops, best of 3: 41.7 ms per loop

We should really make an animation! matplotlib offers the tools to do so, but the interface is not the best. It requires that we make a figure, then provide a function to modify it for each frame. See for instance this blog post.

In [107]:
frames = []
x = np.floor(np.random.rand(256,256) + 0.5)

# we collect the frames in a list
for i in range(250):
    frames.append(x)
    x = lifestep_conv(x)

This will take a few seconds...

In [108]:
import matplotlib.animation as ma

# default resolution is 72 dpi
figure = pp.figure(figsize=(5,5))
image  = pp.imshow(frames[i],interpolation='none',cmap=pp.cm.gray)

# the animate function needs to change the elements of the figure,
# and return an iterable object containing the parts of the figure that have changed
def animate(i):
    image.set_array(frames[i])
    return [image]

# the API is not the best though...
anim = ma.FuncAnimation(figure,animate,frames=len(frames),
                        interval=20,blit=True)

anim.save('life.mp4',fps=30);

You'll have to open the animation and watch it outside the notebook. The current version of iPython notebook has trouble embedding movies. Can you see some gliders rolling by?

In [111]:
!open life.mp4

3.2 Homework (for the ambitious)

Modify our code to implement Stephan Rafler's SmoothLife floating-point generalization of the Game of Life. See his paper.

4 h5py (for the HDF file format)

We went over this very quickly in class, so I will just paste in the code from last year. If you don't have h5py, try conda or pip.

It's HDF5, a "data model, library, and file format", "portable and extensible", that "supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data". h5py is the Python library to read/write HDF5; it can be used to get efficient numpy views on very large datafiles.

  • http://www.hdfgroup.org
  • http://www.h5py.org

4.1 Playing with LIGO data

LIGO has recently released its flagship "S5" data: https://losc.ligo.org

Following the tracks of the webinar at https://dcc.ligo.org/public/0113/G1400509/004/losc_webinar.pdf. We'll look for "H1" data starting at time 843894784 and ending at 843898880.

In [5]:
!wget https://losc.ligo.org/archive/data/S5/843055104/H-H1_LOSC_4_V1-843894784-4096.hdf5
--2014-09-10 18:26:45--  https://losc.ligo.org/archive/data/S5/843055104/H-H1_LOSC_4_V1-843894784-4096.hdf5
Resolving losc.ligo.org... 131.215.114.49
Connecting to losc.ligo.org|131.215.114.49|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: https://losc.ligo.org/archive/data/S5/843055104/H-H1_LOSC_4_V1-843894784-4096.hdf5/ [following]
--2014-09-10 18:26:45--  https://losc.ligo.org/archive/data/S5/843055104/H-H1_LOSC_4_V1-843894784-4096.hdf5/
Connecting to losc.ligo.org|131.215.114.49|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128652813 (123M) [application/x-hdf]
Saving to: 'H-H1_LOSC_4_V1-843894784-4096.hdf5'

100%[======================================>] 128,652,813 39.7MB/s   in 3.1s   

2014-09-10 18:26:49 (39.7 MB/s) - 'H-H1_LOSC_4_V1-843894784-4096.hdf5' saved [128652813/128652813]


In [7]:
!pip install h5py
Downloading/unpacking h5py
  Downloading h5py-2.3.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl (4.3MB): 4.3MB downloaded
Installing collected packages: h5py
Successfully installed h5py
Cleaning up...

In [1]:
import h5py
In [2]:
data = h5py.File('H-H1_LOSC_4_V1-843894784-4096.hdf5','r')
In [4]:
data
Out[4]:
<HDF5 file "H-H1_LOSC_4_V1-843894784-4096.hdf5" (mode r)>

HDF5 file objects have a tree-like structure, and in the Python API tree nodes at every level behave as Python dictionaries. Remember the Python dict methods, which return iterators (technically, dictionary views) that you can loop over:

  • keys(), yielding the keys indexing the dictionary,
  • values(), yielding the values in the same order,
  • items(), yielding (key,value) pairs.

If you need a proper list of any of these, you can always feed the iterators to the list() constructor.

In [5]:
list(data.items())
Out[5]:
[('meta', <HDF5 group "/meta" (8 members)>),
 ('quality', <HDF5 group "/quality" (3 members)>),
 ('strain', <HDF5 group "/strain" (1 members)>)]

h5py implements the somewhat peculiar visit method that walks down the HDF tree in full.

In [6]:
def printname(name):
    print("%-35s %s" % (name,data[name]))
In [7]:
data.visit(printname)
meta                                <HDF5 group "/meta" (8 members)>
meta/Description                    <HDF5 dataset "Description": shape (), type "|S33">
meta/DescriptionURL                 <HDF5 dataset "DescriptionURL": shape (), type "|S21">
meta/Detector                       <HDF5 dataset "Detector": shape (), type "|S2">
meta/Duration                       <HDF5 dataset "Duration": shape (), type "<i8">
meta/GPSstart                       <HDF5 dataset "GPSstart": shape (), type "<i8">
meta/Observatory                    <HDF5 dataset "Observatory": shape (), type "|S1">
meta/Type                           <HDF5 dataset "Type": shape (), type "|S16">
meta/UTCstart                       <HDF5 dataset "UTCstart": shape (), type "|S19">
quality                             <HDF5 group "/quality" (3 members)>
quality/detail                      <HDF5 group "/quality/detail" (0 members)>
quality/injections                  <HDF5 group "/quality/injections" (3 members)>
quality/injections/InjDescriptions  <HDF5 dataset "InjDescriptions": shape (6,), type "|S33">
quality/injections/InjShortnames    <HDF5 dataset "InjShortnames": shape (6,), type "|S8">
quality/injections/Injmask          <HDF5 dataset "Injmask": shape (4096,), type "<u4">
quality/simple                      <HDF5 group "/quality/simple" (3 members)>
quality/simple/DQDescriptions       <HDF5 dataset "DQDescriptions": shape (18,), type "|S71">
quality/simple/DQShortnames         <HDF5 dataset "DQShortnames": shape (18,), type "|S15">
quality/simple/DQmask               <HDF5 dataset "DQmask": shape (4096,), type "<u4">
strain                              <HDF5 group "/strain" (1 members)>
strain/Strain                       <HDF5 dataset "Strain": shape (16777216,), type "<f8">

There's useful information in "meta":

In [8]:
data['meta/GPSstart'].value, data['meta/Duration'].value
Out[8]:
(843894784, 4096)

"strain" seems to be the main structure containing data...

In [9]:
strain = data['strain/Strain']

...which looks like a numpy array...

In [10]:
type(strain.value), strain.value
Out[10]:
(numpy.ndarray,
 array([  2.86956749e-17,   3.12350546e-17,   3.37783919e-17, ...,
         -1.12435587e-16,  -1.08725030e-16,  -1.04909780e-16]))

But it has also attributes:

In [12]:
list(strain.attrs.items())
Out[12]:
[('Npoints', 16777216),
 ('Xunits', b'second'),
 ('Xlabel', b'GPS time'),
 ('Xspacing', 0.000244140625),
 ('Xstart', 843894784),
 ('Yunits', b''),
 ('Ylabel', b'Strain')]

We use these to make an array of timestamps.

In [16]:
tstart = strain.attrs['Xstart']
dt     = strain.attrs['Xspacing']
tend   = tstart + strain.attrs['Npoints'] * dt

# according to C convention, "tend" is not actually included in this array
ts = np.arange(tstart,tend,dt)
In [17]:
len(ts), len(strain.value)
Out[17]:
(16777216, 16777216)

Let's plot the very beginning of the data.

In [18]:
pp.plot(ts[:10000],
        strain.value[:10000])
Out[18]:
[<matplotlib.lines.Line2D at 0x12261b7b8>]

There's also "quality", which (we learn from https://losc.ligo.org/archive/dataset/S5) describes the quality of the data with a 1-Hz bitmask array.

In [19]:
quality = data['quality/simple']
In [23]:
list(quality.items())
Out[23]:
[('DQDescriptions', <HDF5 dataset "DQDescriptions": shape (18,), type "|S71">),
 ('DQShortnames', <HDF5 dataset "DQShortnames": shape (18,), type "|S15">),
 ('DQmask', <HDF5 dataset "DQmask": shape (4096,), type "<u4">)]

These are the meanings of the bits:

In [28]:
for bit,(name,desc) in enumerate(zip(quality['DQShortnames'].value,quality['DQDescriptions'].value)):
    print('%2d %-16s %s' % (bit, name.decode(), desc.decode()))
 0 DATA             Science data present
 1 CBCHIGH_CAT1     Category-1 checks passed for CBC high-mass search
 2 CBCHIGH_CAT2     Category-2 and 1 checks passed for CBC high-mass search
 3 CBCHIGH_CAT3     Category-3 and 2 and 1 checks passed for CBC high-mass search
 4 CBCHIGH_CAT4     Category-4,3,2,1 checks passed for CBC high-mass search
 5 CBCLOW_CAT1      Category-1 checks passed for CBC low-mass search
 6 CBCLOW_CAT2      Category-2 and 1 checks passed for CBC low-mass search
 7 CBCLOW_CAT3      Category-3 and 2 and 1 checks passed for CBC low-mass search
 8 CBCLOW_CAT4      Category-4, veto active for CBC low-mass search
 9 BURST_CAT1       Category-1 checks passed for burst search
10 BURST_CAT2       Category-2 and 1 checks passed for burst search
11 BURST_CAT3       Category-3 and 2 and 1 checks passed for burst search
12 BURST_CAT2E      Category-2E and 1 subsecond-event checks passed for burst search
13 BURST_CAT3E      Category-3E, 2E, 1 subsecond-event checks passed for burst search
14 CW_CAT1          Category-1 checks passed for continuous-wave search
15 STOCH_CAT1       Category-1 checks passed for stochastic search
16 STOCH_CAT2_H1L1  Category-2 and 1 checks passed for stochastic search, H1-L1 combination
17 STOCH_CAT2_H2L1  Category-2 and 1 checks passed for stochastic search, H2-L1 combination

A note on the decode, which is needed in Python 3 to avoid displaying strings as b'...'. The issue is that in Python 3 native strings are unicode, but the LOSC files contain simple ASCII strings (where each character is 8 bits), which in Python 3 correspond to bytes object, not quite a string. The bytes method decode(), with the implied argument 'UTF-8', uses a unicode encoding to make a unicode string out of a sequence of bytes. Its inverse is encode().

Back to LIGO. How much data at "CBCHIGH" categories 1 through 4? CBCHIGH denotes the search for high-mass binary inspirals. As for the categories,

  • CAT1 = data that passes basic quality checks
  • CAT2 = data that does not have obvious corruption or noisy features with clear correlations with instrument events and failure modes
  • CAT3 = data that does not have even less obvious noisy features with unclear instrumental associations
  • CAT4 = data that has been cleaned of glitchy features that result in excess triggers

A LIGO search would use CAT2 data; an upper-limit analysis would use CAT3.

In [31]:
dqarray = quality['DQmask'].value

for cbchigh_cat in [1,2,3,4]:
    bitmask = 1 << cbchigh_cat
    print("%s seconds at CAT %d" % (sum(dqarray & bitmask > 0),cbchigh_cat))
4096 seconds at CAT 1
4029 seconds at CAT 2
3391 seconds at CAT 3
3391 seconds at CAT 4

We are going to work with CAT2 data. Let's plot which data is CAT2 in "raster" fashion, reshaping a vector to a 2D matrix. We'll use matplotlib.pyplot.matshow the latter. We want a discrete rather than continuous colormap (i.e., 1 = red, 2 = yellow, 3 = green, 4 = blue). Some quick googling shows that this can be achieved with ListedColormap and BoundaryNorm.

In [32]:
dqplot = np.zeros(dqarray.shape)

for cbchigh_cat in [1,2,3,4]:
    bitmask = 1 << cbchigh_cat
    dqplot[dqarray & bitmask > 0] = cbchigh_cat
In [33]:
import matplotlib.colors as mpc

cmap = mpc.ListedColormap(['red','yellow','green','blue'])
norm = mpc.BoundaryNorm([0,1,2,3],cmap.N)
    
pp.matshow(dqplot.reshape((16,256)),cmap=cmap,norm=norm,aspect=1.5);

We use numpy masked arrays to obtain the segments of CAT2 data. * http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html

In [35]:
dq2 = (dqarray & (1 << 2)) > 0

masked = np.ma.masked_array(np.zeros(4096),~dq2)
segs = np.ma.flatnotmasked_contiguous(masked)
print(segs)
[slice(0, 251, None), slice(253, 385, None), slice(387, 517, None), slice(519, 645, None), slice(647, 651, None), slice(652, 743, None), slice(745, 868, None), slice(870, 945, None), slice(947, 975, None), slice(977, 986, None), slice(988, 1012, None), slice(1014, 1055, None), slice(1057, 1083, None), slice(1085, 1100, None), slice(1102, 1130, None), slice(1132, 1202, None), slice(1203, 1384, None), slice(1386, 1548, None), slice(1550, 1827, None), slice(1828, 1854, None), slice(1856, 1881, None), slice(1883, 1961, None), slice(1963, 1987, None), slice(1989, 2024, None), slice(2026, 2362, None), slice(2364, 2451, None), slice(2453, 2590, None), slice(2592, 2880, None), slice(2882, 2883, None), slice(2885, 2910, None), slice(2912, 3035, None), slice(3037, 3426, None), slice(3428, 3611, None), slice(3613, 3997, None), slice(3999, 4038, None), slice(4040, 4096, None)]

This will get us the strain data for segment n.

In [36]:
def segdata(n):
    start, end = segs[n].start * 4096, segs[n].stop * 4096
    
    return ts[start:end], strain.value[start:end]

Let's look at the PSD for the first few segment, using psd from matplotlib.mlab.

In [37]:
import matplotlib.mlab as mp
In [38]:
def plotpsd(i):
    segt, segd = segdata(i)
    psd, freqs = mp.psd(segd,NFFT=2048,Fs=1.0/dt)

    pp.loglog(freqs[1:],np.sqrt(psd[1:]))
    pp.xlabel('$f/\mathrm{Hz}$'); pp.ylabel('$S(f)/\mathrm{Hz}^{1/2}$')
    pp.axis([10,1000,1e-24,1e-14]);
    pp.title('t = [%s,%s]' % (segt[0],segt[-1]))
In [40]:
print(len(segs))
36

In [41]:
pp.figure(figsize=(12,16))

for s in range(8):
    pp.subplot(4,2,s+1)
    plotpsd(s)

pp.tight_layout()

To see if something goes "ping", we obtain a spectrogram using matplotlib.mlab.specgram. We examing segments of 4096 points = 1 sec.

We pick a prettier colormap than the default at http://matplotlib.org/examples/color/colormaps_reference.html

In [42]:
def plotspec(i):
    segt, segd = segdata(i)
    pxx, freqs, t = mp.specgram(segd,NFFT=4096,Fs=1.0/dt)

    # normalize each frequency row by its median (effectively to whiten data) 
    pxx = np.array([row / np.median(row) for row in pxx])
    
    pp.pcolormesh(t,freqs,np.log10(pxx),cmap=pp.get_cmap('Oranges'))
    pp.xlabel('$t/\mathrm{s}$'); pp.ylabel('$S(f)/\mathrm{Hz}^{1/2}$')
    pp.axis([0,t[-1],freqs[0],freqs[-1]])
    pp.title('t = [%s,%s]' % (segt[0],segt[-1]))
In [43]:
pp.figure(figsize=(12,16))

for s in range(16,24):
    pp.subplot(4,2,s-16+1)
    plotspec(s)

pp.tight_layout()