Monday, December 17, 2012

lit: Tully-Fisher relation in SDSS bands

Pizagno et al., attention-worthy. After skimming through it, I had an idea of running GALFIT through my sample and getting better values of b/a ratios and PA's. I have a pretty good value of the sky, so why not model it?

Sunday, December 16, 2012

Matplotlib: errorbar throws a ValueError: too many values to unpack

I've been working on the errorbars in Matplotlib, and kept getting this error.
To avoid it, check your array dimensions: errorbar(), unlike scatterplot(), keeps on insisting on (x, ) error array dimensions.
1D arrays (vectors) in Numpy can be of shape (x, 1) or (x, ). That's usually interchangeable, but in cases like that, np.reshape(x, (x.shape[0], )) does the trick.
I think hope there is a deep underlying reason for that.

Matplotlib: zorder kwarg

One can pass zorder kwargs in order to change the depth preference of plots/errorbars/etc:
scatter_kwargs = {"zorder":100}
error_kwargs = {"lw":.5, "zorder":0}

errorbar(X,Y,yerr=ERR,fmt=None, marker=None, mew=0,**error_kwargs )

Matplotlib: no edge for marker

"edgecolor = None" sets the point edge colour to the default value, which is black. "edgecolor = "none"" removes the border altogether. I can't stand stuff like this.

Tuesday, December 4, 2012

The select for my TF sample

SELECT m.califa_id,, h.hubtype FROM mothersample as m, nadine as n, morph as h, flags_new as f where ( > 0.3) and n.califa_id = m.califa_id and h.hubtype = 'S' and n.califa_id = h.califa_id and f.sum = 0 and f.califa_id = n.califa_id
edit: here's the new one, selecting galaxies with available kinematics data from CALIFA: SELECT distinct h.rowid,,,, f.sum,, h.hubtype FROM nadine as n, flags_new as f, morph as h, tf_maps_list as j where and h.hubtype='S' and h.califa_id = n.califa_id and > 0.3 and < 0.9 and f.sum = 0 and f.califa_id = n.califa_id

Pyfits clobber=True keyword

Permits overwriting existing file, if set. e.g.: hdu.writeto(filename+'.fits', clobber=True)

Monday, November 26, 2012

Scipy: installing from source

My laptop Linux is old and rotten, so I compile newer versions of Scipy from source. It's being said that compiling blas and lapack libraries is notoriously difficult, but it was not, using this guide (with minor filename changes).

some Python scripts from Durham

A pile of interesting astrophysics scripts, especially the one creating images from GALFIT output files and the other plotting Petrosian quantities for different n.

Sunday, November 25, 2012

awk: making a LaTeX table from a csv file

I found this while digging through my master's notes, potentialy very useful. awk ' {print $1," & ", $2, " & ", $3, " & ", $4, " & ", $13, " & ", $14} ' galaxies_Cat.txt > table.txt

Converting IDL color table to Python: Matplotlib colour map from rgb array

We have our own colour table, mostly used for kinematics or similar spatial plots. There was some Python code to access it (I think), but it used a look-up table, and didn't look concise enough.
M. from MPIA wrote a short IDL script that basically takes the rgb distribution vectors across the colour table length, interpolates it to 256 bins and creates a callable colour table.
I thought it would be easy to rewrite it. Matplotlib's documentation was quite incomprehensible. It uses a set of tuples to define a colourmap, which is neat and gives you a lot of control, but I had a different input of colour vectors, which was an array of pre-defined rgb values. colors.ListedColormap did the trick, so here is a script to make a custom colour map from a rgb array with matplotlib.

Friday, November 16, 2012

Seeing structure in the data: histogram bin widths

It's easy to lie with histograms. Choosing a small histogram bin width brings out spurious peaks and you can claim to see signal while there are only random sampling fluctuations. Choosing a large histogram bin width lets you smooth over the data distribution, hiding peaks or gaps (see an example using the Cauchy distribution here). Choosing slightly different histogram widths for two distributions you are comparing can lead to wildly different histogram shapes, and thus to a conclusion the two datasets are not similar.
This problem was sitting at the back of my head for quite some time: we are making interesting claims about some property distributions in our sample, but to what extent are our histograms, i.e. data density distribution plots, robust? How can we claim the histograms represent the true structure (trends, peaks, etc) of the data distribution, while the binning is selected more or less arbitrarily? I think the data should determine how it is represented, not our preconceptions of its distribution and variance.

I tried to redo some of our histograms using Knuth's rule and astroML. Knuth's rule is a Bayesian procedure which creates a piecewise-constant model of data density (bluntly, a model of the histogram). Then the algorithm looks for the best balance between the likelihood of this model (which favours the larger number of bins) and the model prior probability, which punishes complexity (thus preferring a lower number of bins).
Conceptually, think of the bias-variance tradeoff: you can fit anything in the world with a polynomial of a sufficiently large degree, but there's big variance in your model, so its predictive power is negligible. Similarly, a flat line gives you zero variance, but is most probably not a correct representation of your data.
The histogram below, left shows our sample's redshift distribution histogram I found in our article draft. The right one was produced using the Knuth's rule. They might look quite similar, but the left one has many spurious peaks, while the larger bumps on the right one show the _real_ structure of our galaxies distribution in redshift space! The peak near 0.025 corresponds to the Coma galaxy cluster, while the peak at roughly 0.013 shows both the local large-scale structure (Hydra, maybe?) _and_ the abundance of smaller galaxies which do make it into our sample at lower redshifts.
Also consider this pair of histograms:
They show the axis ratio (b/a) of our sample of galaxies. The smaller this ratio is, the more inclined we assume the galaxy to be, with some caveats. The left one was produced using matplotlib's default bin number, which is 10, at least in Matlab after which matplotlib is modelled.I think are sqrt(n) or some other estimate.
The right one shows the histogram produced using Knuth's rule. It shows the real data distribution structure: the downward trend starting at the left shows that we have more inclined, disk galaxies in our sample (which is true). The bump at the right, at b/a = 0.7, is the population of rounder elliptical galaxies. The superposition of these two populations is shown much more clearly in the second histogram, and we can make some statistical claims about it, instead of just trying to find a pattern and evaluate it visually. Which is good, because we the humans tend to find patterns everywhere and the noise in astronomical datasets is usually large.

Thursday, November 15, 2012

location of Python module file

Tuesday, November 6, 2012

ds9 to NumPy pixel coordinates

I have to crop several images that have defects in them. I will inspect the images using ds9, and crop them with NumPy. Rather, I will not crop the images themselves, but use only part of them for growth curve analysis.
I'll have to update the galaxy center pixel coordinates, as the WCS information is taken from the original SDSS images. They would not have to change if the cropped part is below and right the original center.
I don't have a paper and a pencil at the moment, so I'm thinking out loud here. NumPy indexing goes like this, y first:
   +------------------> x 

ds9, however, uses the standard mathematical notation: (x, y), where x goes left to right, and y goes up from the lower left corner:
y A
  + -------------------------> x

This array had its (0, 0) and (10, 20) pixels set to 1:
[[ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

And here is the corresponding ds9 view:
tl;dr: if you get a pair of pixel coordinates from ds9 and want to use them in NumPy, you have to swap the coordinate numbers and flip the y-axis, i.e. subtract the ds9 y coordinate from the total image height (image.shape[0]).

Matplotlib: XKCDify, bimodal Gaussians

I tinkered with xkcdify last night, it's a brilliant script. I learned something about bimodal Gaussians too (i.e. functions that are superpositions of two single Gaussians with sufficiently different means and sufficiently small standard deviations). The source code of the plot is here: I modified the XKCDify script itself a little, the rotated labels looked somewhat ugly, when present on several subplots. Might be cool to set the text angle at random.
Someday I'll fork it and add the title and histograms.

Monday, November 5, 2012

Matplotlib: intensity map with hexbin

While looking for unrelated stuff, I found this set of matplotlib examples. This is going to be handy when I'm plotting the output of a spectra fitting tool next week.

Thursday, November 1, 2012

SQLite: one more query

in case my computer freezes again: SELECT r.califa_id, r.el_mag, 0.396*r.el_hlma, g.el_hlma, f.sum, g.gc_sky - r.gc_sky FROM r_tot as r, gc_flags as f, gc as g where 0.396*r.el_hlma > 40 and r.califa_id = f.califa_id and r.califa_id = g.califa_id

Thursday, October 25, 2012

Histogram widths: Bayesian blocks

One of frustrating things is getting the histogram widths right: it has always been an arbitrary procedure, which can be misleading. Here is a astroML implementation of a rigorous procedure to determine the fixed or flexible width histogram bars.
The utility of the Bayesian blocks approach goes beyond simple data representation, however: the bins can be shown to be optimal in a quantitative sense, meaning that the histogram becomes a powerful statistical measure.

Wednesday, October 24, 2012

Photometry: linear regression fitting of sky

Yesterday we agreed that the photometry procedure should have as little arbitrary procedures, constants, etc., as possible. So I'm back to the photometry measurements again...but that's a good thing, as I didn't feel entirely happy about them.
The idea I had this morning on the train was simple -- I don't know what conceptual block prevented me from going this way sooner.
Basically, having a cumulative flux curve (the growth curve) is not generally helpful, as the growth of flux outside is non-linear (depends on geometry as well as on the sky value). However, if I normalise the flux profile to _flux_per_pixel, it should theoretically flatten far away from the galaxy. The slope of a linear regression fit should show the quality of the sky value -- the level of variations. If the sky slope is negative and pretty large, then we probably still are within the galaxy.
If the slope is reasonably small (here go the arbitrary parameters again..), simply taking a mean of all measurements within the ring would give a reasonably good sky value.
The catch is getting the width of the elliptical ring used for fitting right. (I can get its distance by waving my hands, taking the maximum distance from my previous measurements, multiplying it by pi/2 or something. We're testing for it anyway).
However, the width of this ring is a tradeoff between accuracy and precision. Taking a wider ring would surely help to reduce the scatter due to random noise, arbitrary gradients and so. However, the possibility to get a (poorly) masked region or some sky artifact, etc. inside this ring also increases.
I tested it a bit using scipy.linalg routines, so far the slope was below 10^-4 counts.
The growth curve itself is useful as a sky subtraction quality check.

Monday, October 22, 2012

awk one-liner

I've been using this for a long time, as most of data I use still comes in unruly, mis-formatted csv files.
awk 'BEGIN {FS =","}; {print $1, $6, $7, $8, $9}' catalogue.csv > cat.csv

Thursday, October 18, 2012

SQLite: some average values

For copying and pasting:SELECT avg(zpt), avg(ext_coeff), avg(airmass) FROM u_tsfieldParams
select z.califa_id, z.z_mag - m.petroMag_z, z.z_mag, m.petroMag_z from z_test as z, mothersample as m where m.califa_id = z.califa_id

Monday, October 15, 2012

Python: script to zip multiple files by filename, flatten structure

zip: flatten directory structure

zip -j zipfile path/to/file1 /path/to/file2

SQLITE: query for multiple quantities from several tables

Nothing fancy, just putting it out there for reuse.

SELECT g.califa_id, g.r_mag, g.el_hlma, l.hlr, m.isoA_r,, g.flags, r.mean_sky, 
r.gc_sky  FROM gc as g, gc_r as r, lucie as l, mothersample as m where (g.califa_id 
= r.califa_id) and (g.califa_id = m.califa_id) and (g.califa_id = l.califa_id)

and (g.el_hlma > 25) order by g.el_hlma desc

Thursday, October 11, 2012

orientation angles of SDSS images wrt North

I was happy using the position angles of galaxies relative to SDSS image's y axis, but people in the collaboration needed the absolute position angles with respect to the North. I used astroLib's astWCS.getRotationDeg() function, in this script. Didn't test yet, I don't know if that makes sense.

A wrapper for wrapper for kcorrect

A little script I cobbled together in order to feed the data to kcorrect, and save its output (absolute ugriz magnitudes and stellar masses, in this case).

SQLITE: useful query: matching two tables by ra, dec

I wanted to cross-match two tables, one of which had IDs, another -- only ra, dec coordinates. THe second one had extinction for all SDSS bands, and it's a pain to go and do a SDSS query for a list of objects again. So:

SELECT m.CALIFA_ID,  round(m.ra, 5), round(m.dec, 5), round(s.ra, 5), round(s.dec, 5)
 from mothersample as m, sdss_match as s where (round(m.ra, 5) = round(s.ra, 5) 
and round(m.dec, 5) = round(s.dec, 5)) order by CALIFA_ID asc
or rather, the actual useful query:

SELECT m.rowid, m.CALIFA_ID,  round(m.ra, 5), round(m.dec, 5), s.petroMagErr_u, 
 s.petroMagErr_g, s.petroMagErr_r, s.petroMagErr_i, s.petroMagErr_z, s.extinction_u, 
s.extinction_g, s.extinction_r, s.extinction_i, s.extinction_z

from mothersample as m, sdss_match as s where (round(m.ra, 5) = round(s.ra, 5) and 
round(m.dec, 5) = round(s.dec, 5)) order by m.rowid 

NED positional query -- a script to export ra, dec from database

It's a general purpose script I quickly put together to export entries from database to csv. Now it does it the retarded way, as I have to use NED Batch Jobs system.

Wednesday, October 10, 2012

SQLite: field names and leading space

Just spent a few minutes, frustrated by
sqlite3.OperationalError: no such column: el_hlma
. The column name had a leading space before it, that's why the query failed.

Merging two csv files -- replacing some lines with those from the second

I had to redo the photometry for some galaxies, and afterwards I was left with two csv files. Some of the entries in the first one had to be replaced with those from the second. I did it with vim, first, because I like it, but I'm not a typist. I had it easy, because the entries started with galaxy IDs, which in a way correspond to their line numbers.

Tuesday, October 9, 2012

NumPy: around() and a way around it

Sunday, October 7, 2012

SDSS CASjobs -- looking for objects within a given radius of a coordinate pair

There's an inbuilt function for that:

select p.ra,p.dec,p.b,p.l,p.objID,,p.rerun,p.camcol,p.field,p.obj,p.type,p.flags,p.fiberMag_r,p.petroMag_u,p.petroMag_g,p.petroMag_r,p.petroMag_i,p.petroMag_z,
       p.petroRad_r,p.petroR50_r,p.petroR90_r,p.isoA_g,p.isoB_g,p.isoA_r,p.isoB_r,p.isoPhi_r,p.specObjID into mydb.MiceB from PhotoObjAll as p,
dbo.fGetNearbyObjEq(191.5, 30.7, 1.0)
WHERE p.objID = n.objID 

      ( flags & (dbo.fPhotoFlags('NOPETRO') +
       +dbo.fPhotoFlags('TOO_FEW_GOOD_DETECTIONS')) ) = 0

Friday, September 28, 2012

SQLITE -- a useful join

Just putting it out here:
SELECT g.CALIFA_ID, g.el_major_axis*0.396, g.el_mag, g.circ_hlr*0.396, g.circ_mag, g.gc_sky,, FROM gc_new as g, nadine as n where n.CALIFA_ID = g.CALIFA_ID order by g.rowid asc

Monday, September 24, 2012

Deploying my inpainting code on other machines

It's quite a mess, since I have limited permissions on each of them, and their environments vary. Also, no git.
python (developed on 2.6.5)
First, I need pyfits:
mkdir python
cd python

Second, pyfits need setuptools:

I want to install to a custom location (the ./python directory tree), hence I have to create directory trees for both of them:
mkdirhier [ABSOLUTE PATH TO]/python/lib/python2.7/site-packages/
mkdirhier [ABSOLUTE PATH TO]/python/lib64/python2.7/site-packages/

The .bashrc file's PYTHONPATH variable must be adjusted:
vi ~/.bashrc
export PYTHONPATH=[ABSOLUTE PATH TO]/python/lib/python2.7/site-packages:[ABSOLUTE PATH TO]/python/lib64/python2.7/site-packages/

Do not forget to restart bash:
Then, installing setuptools:
sh setuptools-0.6c11-py2.7.egg --prefix='[PATH TO]/python'
python install --prefix='[PATH TO]/python'

The file structure:
../data/SDSS/u/ #griz
../data/filled_u #griz

I need AstLib as well:
tar -xvzj astLib-0.6.1.tar.gz
cd astLib
python install --prefix='[PATH TO]/python/'

...and all the hell breaks loose here, because my home is mounted across the network, hence $PYTHONPATH points to a couple of different Python versions. Separate ~.bashrc.$HOSTNAME files worked:
Then should be edited, or even copied to allow several different processes work in different bands and galaxy sample slices.

Friday, September 7, 2012

making posters with LaTeX, rendering matplotlib images of right size

I'm doing a poster presentation next week in a conference, and I didn't want to use our default poster templates -- they just don't look pretty enough for me. So, I cooked my own, or rather lobotomised the baposter template by Brian Amsberg. It's quite customisable and well-documented. I was a web developer once so I get some masochistic satisfaction out of positioning, aligning and resizing stuff anyway.
Some things I learned underway, from matplotlib cookbook mostly:
For an A0 poster, getting image sizes and font sizes right is important. While compiling my .tex document, I inserted
command next to relative to text scaled images. It made LaTeX print out the image size in points in its hilarious output stream.

fig_width_pt = 628.62  # Get this from LaTeX using \showthe\columnwidth
inches_per_pt = 1.0/72.27               # Convert pt to inches
golden_mean = (sqrt(5)-1.0)/2.0         # Aesthetic ratio
fig_width = fig_width_pt*inches_per_pt  # width in 
fig_height = fig_width*golden_mean+0.3       # height in inches
fig_size = [fig_width,fig_height]    
print fig_size

fig_size was later set as a matplotlib parameter. If you specify .pdf as the filename extension, matplotlib generates a crisp file to be used as an image by LaTeX.

Sunday, September 2, 2012

Getting a single file from git stash

Sometimes I do want to keep some local changes, even though stash is such a good way to bury and forget them.
A solution from stackoverflow:

git checkout stash@{0} -- [filename]

opening images with ds9 from a python script

I did it the old-fashioned way (os.system is going to be deprecated):

  import os

  os.system("~/ds9  -zoom 0.3 -scale mode 99.5 -file "+ 

GalaxyParameters.getSDSSUrl(listFile, dataDir, i) +" -file  "+ 

GalaxyParameters.getMaskUrl(listFile, dataDir, simpleFile, i) +" -match frames")

Monday, August 27, 2012

Decent scientific plots with matplotlib

I think matplotlib is poorly documented and too object-oriented to be immediately usable. Trying to be too many things at once. Besides, the default values and default behaviour is kooky sometimes.
For instance, histogram bar widths: they are different for different datasets, and you cannot compare two distributions if that's the case. You have to resort to hacks like this:

  hist, bins = np.histogram(data, bins = 10)
And I think it's a bloody hack, calling histogram method from some other module in order to be able to call matplotlib's histogram plotting routine. Nevertheless. It's flexible and, since I already use Python to pull my data from databases, I decided to give it a try when I had to prepare some plots for a review poster.
So, what makes an ugly plot look decent? First of all, ticks:

      minor_locator = plt.MultipleLocator(plotTitles.yMinorTicks)
      Ymajor_locator = plt.MultipleLocator(plotTitles.yMajorTicks)  
      major_locator = plt.MultipleLocator(plotTitles.xMajorTicks)      
      Xminor_locator = plt.MultipleLocator(plotTitles.xMinorTicks)   
They have to be set separately for each axis, I pass them as parameters from a wrapper class. Then, hatched bars.
Some parameters, redefining matplotlib's defaults (these plots are for printouts, so font sizes are big):

params = {'backend': 'ps',
          'axes.labelsize': 10,
          'text.fontsize': 10,
          'legend.fontsize': 10,
          'xtick.labelsize': 8,
          'ytick.labelsize': 8,
          'text.usetex': True, 
          'font': 'serif',
          'font.size': 16,
          'ylabel.fontsize': 20}
If you'd like to have different symbols (markers) in a scatterplot, this is useful:

      markers = itertools.cycle('.^*')
      p1 = ax.plot([0],[1], linestyle='none',, markersize=8, color=gd.colour, mec=gd.colour, alpha = 1) 
But the double symbols in the legend, why is that? No sense.
I would not claim the plots shown are publication-quality (just look at slightly differing histogram widths). But they look way better than default plots one would get with matplotlib.

RGB colours in matplotlib plots

LaTeX accepts un-normalised (like 0-65-122) rgb values as colour arguments in, say, .cls files:
\definecolor{AIPblue} {RGB}{0, 65, 122}
matplotlib insists on getting something like that:

#colour settings
ASColour = '#666666'
or normalised (0 - 1) rgb values tuple. It's easy to normalise, but it's also easy to convert rgb values to hex.

Forcing capital letters in Bibtex bibliography lists

Some bibliography styles eat the capital letters in article titles. To avoid this, one should include the letter in question in curly brackets, like {T}hat.

Saturday, August 25, 2012

Generating matplotlib plots via ssh on a remote server

If you're running a plotting script via ssh, you're likely to get an error message of this sort:

_tkinter.TclError: no display name and no $DISPLAY environment variable
One possible solution is to use the Agg backend.

Friday, August 24, 2012

Failed optimisation: numpy.roll(), masked arrays

I've spent yesterday trying to optimise the inpainting code -- it's quite slow, taking up to 10 minutes for 1 galaxy, and with 900 galaxies, I can't really afford that. Or I should get access to more computers, then prepare the working environment, copy files, install Python libraries, screw something up along the way. That's boring and teaches nothing new.
Besides, there's really some room for improvement: those loops and exceptions are ugly and probably avoidable, and looping over array many times could be sped up using some numpy kung-fu.
I wrote a prototype that juggles with the masks, uses some boolean manipulations and numpy.roll(). Near the edges the algorithm reverts to traditional loops over indices.
However, it's still a slow. I ran a profiler: the loops shifting the array take 0.5s each, and so I win nothing in the long run. Array operations are expensive, and when an array is indexed repeatedly within a loop, the code gets slow inevitably. But there's no way I can escape recursion, because I have to pick up a new set of pixels every time.
I still have some ideas, but I want to have some time to analyse the results.

Thursday, August 23, 2012

Morning coffee read -- ps

ps reference. this command was pretty mysterious to me, with several types of option syntax.

Wednesday, August 22, 2012

The new inpainting script

It's ridiculous in places and still a bit slow, but I'm quite happy with the idea underneath it.
It maps the number of all finite (not NaNs, not infs, not having indices beyond the array edges) neighbours of each pixel in the given array. Then it loops through the masked pixels which have the _largest_ number of adjacent neighbours available, picks the closest neighbours in a 5*5 matrix around each pixel, fills the masked pixel with distance-weighted Gaussian average of said neighbours, then updates the neighbour count, loops through the pixels which have the largest number (out of 4) of neighbours again, (...).
I realised I should do inpainting this way while riding my bicycle from cbase. My previous implementation simply looped over indices along one axis, disregarding the number of pixels used for filling the given one. Hence ugly flux shape deformations when a masked region is directly above or below the galaxy in question.
But now it seems to be fine! So finally I get to do some science, instead of coding 24 if's and exceptions in one function.

randomize contents of a list

The new inpainting script left some ringing artefacts, I think they might come from np.where() results ordering (and the resulting imposed order of inpainting). I've tried numpy.random.shuffle(). It's still running, so I have yet to see whether this worked.
Updated -- no, it didn't. The weird lines and gradients are the edge effects of rectangular mask shapes, which I didn't like from the beginning. On the other hand, given that the inpainted values vary by ~4 counts (the contrast is stretched in this image), I'm not going to worry about this.

Monday, August 20, 2012

SciPy: spatial binary tree

I just realised that scipy.spatial.KDTree works in _data_ space, not in _indices_ space, which makes total sense now that I think of it. Should go get me some breakfast, as I'm starting to overcomplicate things.

Numpy: getting indices of n-dimensional array

just found np.ndenumerate. I should absolutely read some numpy documentation, there's lots of good hidden stuff there.

Inpainting, optimisation, nearest neighbours in Numpy

After looking at my ~900 inpainted images and rejecting ~130 of them outright, I'd like to rewrite my inpainting code. It creates weird artifacts or distorted galaxy views under some circumstances, because it iterates _along_ the axes.
My current idea is to do this recursively: first fill (using the same Gaussian kernel) the pixels that have the nearest 3 4 neighbouring pixels (out of 4 possible) available (not masked), then repeat it again, then go to those that have 3 neighbours, rinse and repeat.
A similar, but possibly a better approach would be to include the diagonal neighbours in the calculation as well (those who have a common vertex, but do not lie side by side). Start with those that have 8 neighbours, (...).
Here's some food for thought:,,

Sunday, August 19, 2012

batch command line image resizing on remote server

I had the images I've generated on a remote server and wanted to copy them to my laptop to work on Sunday. However, the zipfile was 1.4 GB large. I used parts of this shell script, luckily, ImageMagick was installed on the server.

Batch conversion of fits files to .png images, dynamic range and histogram equalisation

My sample consists of almost a thousand galaxies. It's that inconvenient sample size where you would like to inspect each result visually, but it already sucks. First of all, I wanted to see if the inpainting code makes sense and leaves no strange artifacts. To avoid opening each .fits file manually with ds9, I scripted some quick loop to save all the fits images as .png files. It uses scipy.imsave():

scipy.misc.imsave('img/output/filled/'+ID+'_imsave.jpg', outputImage)
However, I wanted to stretch the pixel intensity values and increase the dynamic range of the output images. This code was useful. It tries to make all intensities equally common and hence increases the contrast.

Friday, August 17, 2012

matplotlib: missing x, y axis labels

There must be some reason why it was so, but I had

plt.ylabel = plotTitles.ylabel 

instead of


in my pyplot wrapper.

Getting a fits header keyword value from a list of files

I wanted to get the sky value estimates from SDSS fits headers (fpC). There were 939 of them. It was amusing to see how many SDSS images of large galaxies didn't have the sky estimated -- should probably look at why that was the case. I remember writing something similar before, but it was easier to quickly script it.

#a rough excerpt from ellipse_photometry module. 
#loops through filenames, extracts fits header keyword values.

  sky = np.empty((939, 2))
  for i in range(0, 939):    
      fileName = GalaxyParameters.getSDSSUrl(listFile, dataDir, i)
      head =[0].header
      sky[i:, 0] = i+1
      sky[i:, 1] = head['SKY']
      print head['SKY']
    except IOError as e:
      sky[i:, 0] = i+1
      sky[i:, 1] = 0
    except KeyError as err:
      sky[i:, 0] = i+1
      sky[i:, 1] = -1
  np.savetxt('SDSS_skies.txt', sky, fmt = ['%i', '%10.5f'])

Drawing a diagonal (or any arbitrarily inclined and offset) line in Matplotlib

I wanted to overplot a simple 45 deg line on my data. As far as I know, Matplotlib doesn't accept equations, only sets of points, so here's a clumsy solution to it, based on this:

    import pylab as pl
    import numpy as np
    x1,x2,n,m,b = min(data),max(data).,1000,1.,0. 
# 1000 -- a rough guess of the sufficient length of this line.
#you can actually calculate it using the Pythagorean theorem.
#1, here -- the slope, 0 -- the offset.

    x = np.r_[x1:x2:n*1j] #
    pl.plot(x,m*x + b); pl.grid();

Thursday, August 16, 2012

Image intensity scaling

I wanted to make some images, showing the output of the photometric analysis, i.e. the radii where the growth curves flattened, etc. scipy.misc.imsave() bytescales the image, so here's some documentation on image intensity scaling I'm currently digesting: how ds9's 'scale --> 99%' works: it's outlier clipping.

Bulges and ellipticals

I read an interesting article on the train, Galaxy Bulges and Elliptical Galaxies, lecture notes by D. Gadotti. Dimitri had helped us a lot with the MICE decomposition.
He sums up a great deal of recent research regarding the structure of galaxies. A few main points:
  • the components that we call bulges are not necessarily old 'scaled down' ellipticals as we often hear: there are classical bulges, disk-like bulges, 'box-peanuts', which may coexist or overlap.
  • 'box-peanuts' are thought to be the inner parts of bars, not separate components
  • classical bulges are kinematically hot, i.e. supported by their velocity dispersion. disk-like bulges are kinematically cold, i.e. supported by their rotation velocity.
  • there's an interesting section on photometric bulge recognition, and why the arbitrary cut on n=2 is what it is, arbitrary. I tend to believe that any classification has to be data-driven, so I'll keep the alternative suggestions in mind.
  • another interesting part is about the scaling relations, i.e. the fundamental plane and F-J relation. I didn't know that the FP can be derived directly from the virial theorem!
  • F-J and L-size relations seem to be nonlinear, according to the data.
The only thing I'm not sure about is the use of 2D K-S test, an enlightening read otherwise.

Wednesday, August 15, 2012

Git -- non-fast-forward updates were rejected

I got the following error message:

To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes before pushing again.  See the 'Note about
fast-forwards' section of 'git push --help' for details.
Stackoverflow helped, as usual. As I am the sole developer of my can of worms, I just added the '-f' flag. It's nice to do the last commit of the photometry code which I've been working on for the last 3 months.

Git -- fixing a detached head

'fixing' as in 're-attaching'. There's nothing intrinsically wrong with having a detached head.
I returned from the state of detached head this way.

'In Git commits form a directed acyclic graph where each commit points to one or more parent commits'.

Geek language, you're a thing of beauty.

Monday, August 13, 2012

Python: convert a numpy array to list

One more little trick I didn't know:

  inputImage = np.zeros(inputImage.shape)
  for i in range(0, inputImage.shape[0]):
    for j in range(0, inputImage.shape[1]):  
      inputIndices[i, j] = (i, j)
  inputIndices =  inputIndices.tolist() 

Python: removing duplicate elements from a list

python has a lovely set type, which makes removing duplicate elements (in my case, tuples) extremely easy, e.g.:

out = [(12, 3), (11, 4), (12, 3), (44, -5)]
out = list(set(out)))

Thursday, August 9, 2012

unix: show sizes of all files and directories, without subdirectories

simple but handy:
du -ah --max-depth=1

numpy array slices: inclusivity vs.exclusivity

I was banging my head at the desk for the last fifteen minutes: moving averaging code didn't work where it was supposed to. Look at this snippet:
fluxData[distance-5:distance, 3]
The idea was to average an array over a running 5 element width annulus, counting back from the distance'th element.
I normally know this, but Numpy's array slicing does not include the last element, i.e. distance'th row was excluded from the count.
For instance, fluxData[distance-4:distance+1, 3] looks stupid, but somehow worked.

Inpainting again

I'm running the inpainting script once more: it took 75 iterations to fix one file. That is, all night. I don't know why I haven't done this before, as this unfinished business was frustrating me throughout the holidays.

Saturday, May 26, 2012

SDSS axis ratio measures

I copy it verbatim from here, to have a reference ready.

The model fits yield an estimate of the axis ratio and position angle of each object, but it is useful to have model-independent measures of ellipticity. In the data released here, frames provides two further measures of ellipticity, one based on second moments, the other based on the ellipticity of a particular isophote. The model fits do correctly account for the effect of the seeing, while the methods presented here do not.

The first method measures flux-weighted second moments, defined as:

  • Mxx = x^2/r^2
  • Myy = y^2/r^2
  • Mxy = x^y/r^2

In the case that the object's isophotes are self-similar ellipses, one can show:

  • Q = Mxx - Myy = [(a-b)/(a+b)]cos2φ
  • U = Mxy = [(a-b)/(a+b)]sin2φ

where a and b are the semi-major and semi-minor axes, and φ is the position angle. Q and U are Q and U in PhotoObj and are referred to as "Stokes parameters." They can be used to reconstruct the axis ratio and position angle, measured relative to row and column of the CCDs. This is equivalent to the normal definition of position angle (East of North), for the scans on the Equator. The performance of the Stokes parameters are not ideal at low S/N. For future data releases, frames will also output variants of the adaptive shape measures used in the weak lensing analysis of Fischer et al. (2000), which are closer to optimal measures of shape for small objects.

Isophotal Quantities

A second measure of ellipticity is given by measuring the ellipticity of the 25 magnitudes per square arcsecond isophote (in all bands). In detail, frames measures the radius of a particular isophote as a function of angle and Fourier expands this function. It then extracts from the coefficients the centroid (isoRowC,isoColC), major and minor axis (isoA,isoB), position angle (isoPhi), and average radius of the isophote in question (Profile). Placeholders exist in the database for the errors on each of these quantities, but they are not currently calculated. It also reports the derivative of each of these quantities with respect to isophote level, necessary to recompute these quantities if the photometric calibration changes.

Many more about those second order moments, and why they are called 'Stokes' parameters' can be found in references herein.

Thursday, May 24, 2012

Including source code in LaTeX document

It's possible to just paste it, by using \begin{verbatim}/\end{verbatim} commands. However, updating the source from the script directly is way neater . I use the listings package to do this.

Counting occurences of a particular word in a file

i had to calculate how many objects NED database found near determined positions had no definite redshifts. I parsed the NED output files using by Min-Su Shin.

NED writes out None for quantities it does not find, so I used grep -o None ned_output.txt | wc -w

Wednesday, May 23, 2012

NED batch job form

It's amazing what contrived workarounds are constructed in public astronomical databases in order to avoid letting people see and harness the SQL underneath. Here's a setup file for NED batch jobs tool I made to work finally:

OUTPUT_SORTED_BY Distance_to_search_center
177.34806988d; -1.08377011d; 0.05

REDSHIFT Unconstrained

 Galaxies X GClusters X Supernovae _ QSO X AbsLineSys _ GravLens _
 Radio _ Infrared _ EmissnLine X UVExcess X Xray X GammaRay _
 GPairs _ GTriples _ GGroups _

 Galaxies _ GClusters _ Supernovae _ QSO _ AbsLineSys _ GravLens _
 Radio _ Infrared _ EmissnLine _ UVExcess _ Xray _ GammaRay _
 GPairs _ GTriples _ GGroups _


HyperLeda sql field names

Somehow I keep losing this.

Left outer joins in SDSS

While I was working on our survey sample characterisation paper, I had to redo the sample selection again. There's an interesting effect in diameter limiter surveys: they should theoretically include more highly-inclined disk galaxies near the lower diameter limit, as their total brightness is projected onto a smaller area.

I was selecting two samples from the SDSS DR7: one which was only redshift-selected, and another which had no cuts in redshift, but was diameter-limited. I wanted to keep the redshift information for the diameter-limited sample (wherein not all galaxies have spectroscopic redshifts), if it was available for a particular galaxy. Hence, outer joins.

Tuesday, April 24, 2012

numpy: combining 'less' and 'more' operators in where expression, bitwise AND

This caused a major problem last night, so I'm putting it here.

Efficiency 101: reading large FITS table with Python

Last week I hastily wrote some code to search the NSAtlas. It was almost ok for a single galaxy: the array is huge, and iterating through it 1000 took 1 minute for each loop, so I would have to wait all day for the full query to complete.

I've changed the 'if' loop to numpy.where statements and used the collections module: here. It is almost usable, though it takes 15 minutes to search for some variable for 1000 galaxies, and the machine slows down to an almost complete halt.

There is a significant overhead associated with opening and closing files, especially as the one in question was quite huge (0.5 GB) in this case. Not calling the function from within a loop, but passing an array of search arguments and then looping through it within the function reduced the running time to 16 seconds. A 60-fold improvement in two minutes.

I'd also like to mention a fine module called collections, which helped me find an overlap between two lists, like that:

      a_multiset = collections.Counter(list(ras[0]))
      b_multiset = collections.Counter(list(decs[0]))
      overlap = list((a_multiset & b_multiset).elements())

Wednesday, April 18, 2012

NSAtlas, surface photometry

I wrote some [terribly inefficient, as of now] code for searching NASA Sloan Atlas, given a ra and dec of a galaxy. Just like I expected, my photometry results (SDSS r band) fall somewhere in between the original SDSS photometry and Sloan Atlas (they re-did the photometry, using more sophisticated sky-fitting techniques). And I am still unsure if mine makes sense. Just a good point illustrating the lesson from the previous post: due to the sheer number of pixels in the outskirts of the galaxy, the fine systematic differences on where to stop integrating give rise to significant variations in the final brightness.

Monday, April 16, 2012

Surface photometry

I used to think that galaxy photometry is difficult only in that sense that it necessarily includes errors. As galaxy edges are diffuse, it's impossible to decide where the limit is, so any estimate is subjective and uncertain. I used to think that those are only second-order errors, important in statistical sense (every approach brings its own systematics), but not for individual galaxies. Last week I was struck by a simple plot I made -- a graph showing cumulative flux, flux per pixel and flux at a ring of given distance vs. distance from the galaxy center. The cumulative flux curve does not flatten or grow linear as one would be inclined to think -- as the number of pixels at each distance increases quadratically, their contributed flux increases as well. Mean flux per pixel, however, does flatten (and gives us a rough estimate of the sky value). It might just be a pretty good measure, but I'm worried about gradient in SDSS sky.

Friday, April 13, 2012

SDSS database utilities

I've digested some available scripts for SDSS into my code, namely, Min-Su Shin's code that fetches SDSS ugriz magnitudes (actually, not necessarily, any query can do) from SDSS database, and Tamas Budavari's The first one was disemboweled to be a callable function instead of standalone command-line tool. It's in github.

Thursday, April 12, 2012

Latex degree symbol

Kelvins are official and basic, but I needed to make my students be aware of the conversion between Kelvin and Celsius scales. I'll probably remember that, but will keep it here for digital posterity: $^{\circ}$

Wednesday, April 4, 2012

Creating a discrete Gaussian kernel with Python

Discrete Gaussian kernels are often used for convolution in signal processing, or, in my case, weighting. I used some hardcoded values before, but here's a recipe for making it on-the-fly.

def gauss_kern(size, sizey=None):
    """ Returns a normalized 2D gauss kernel array for convolutions """
    size = int(size)
    if not sizey:
        sizey = size
        sizey = int(sizey)
    x, y = mgrid[-size:size+1, -sizey:sizey+1]
    g = exp(-(x**2/float(size)+y**2/float(sizey)))
    return g / g.sum()

Note that 'size' here refers to (shape[0]/2 - 1), i.e. the resulting kernel side is size*2 + 1 long. I need to use a kernel with a central value of 1, so I adjusted the normalisation accordingly:

def gauss_kern(size, sizey=None):
    """ Returns a normalized 2D gauss kernel array for convolutions """
    size = int(size)
    if not sizey:
        sizey = size
        sizey = int(sizey)
    x, y = scipy.mgrid[-size:size+1, -sizey:sizey+1]
    g = scipy.exp(-(x**2/float(size)+y**2/float(sizey)))
    return g / g.max()

Tuesday, April 3, 2012

A quick way to check if a list of files is complete

That's a pretty dumb way to do it, but I'm tired. It involves touch, however, it doesn't check if new files are created.

# -*- coding: utf-8 -*-
import os
import csv
import string

def touch(fname, times = None):
    with file(fname, 'a'):
        os.utime(fname, times)

for i in range(0, 939):
  with open('../data/maskFilenames.csv', 'rb') as f:
    mycsv = csv.reader(f)
    mycsv = list(mycsv)
    fname = string.strip(mycsv[i][1])
    print fname

Check if files in a list exist with Python

I needed to test whether a large list of galaxies have corresponding mask files in a certain directory. The tentative mask filenames were created by a code that cross-correlated internal survey galaxy IDs with their respective NED names.

  wrongFilenames = []
  for i in range(0, noOfGalaxies):
    maskFile = GalaxyParameters.getMaskUrl(listFile, dataDir, simpleFile, i)
    print i, 'maskfile: ', maskFile  
      f = open(maskFile)
      print "maskfile exists:", maskFile
    except IOError as e:

Creating masks with GIMP

Last time I had to make some really fine masking, and I resorted to GIMP. At least under Linux it can read and save fits files, albeit it treats them as _images_, not _data_, which I consider fundamentally wrong in many cases. So, you open the fits file in GIMP, draw something white or black, using any tool and save it. However, there are many pixels that are not 0 or 1, meaning that the image cannot be used as mask. Here's a script to normalise it back. It is cautious, meaning all pixels that are not equal to 0 will be treated as ones (and your mask may get bigger than you were expecting, at least theoretically).

masked = np.where(mask != 0) mask[masked] = mask[masked]/mask[masked]

Monday, April 2, 2012

WCS to pixel values and vice versa in Python

It's often necessary to convert pixel coordinates within a binary image to physical(ra, dec) sky coordinates. For instance, one might need to find brightness at the central pixel of the galaxy, calculate flux, any photometry-related task would do, actually. I used astLib.astWCS for this purpose, though I think pywcs might work just as well.
from astLib import astWCS


center = WCS.wcs2pix(ra, dec)
print center
print 'brightness at the galaxy center', img[center[1], center[0]] 
#thanks, numpy -- python indexes arrays ass-backward (or, rather, column-backward) for some reason. One have to be careful not to swap them.

Getting tsField files from SDSS

I'm always looking for this or that little piece of code I really wrote sometime ago but can't remember where I put it. Then I end up rewriting it because it is faster than grepping entire work directory and sorting through the offal. I wrote about batch downloading of SDSS images, now it seems I need a specific pipeline parameter from all 900 tsField files. Here's a oneliner (from a multi-line blob of code I'm now keeping at GitHub).

After downloading (or just reading the parameter from header and moving happily along, I'm currently figuring that out)-- no downloading. I'm reading the value from header via http, as pyfits are capable of that.

Friday, March 23, 2012

M. Kramer on pulsars

Michael Kramer (MPIfR/JBCA) gave a captivating and eloquent talk this morning on using pulsars as tests for GR and as constraints on alternative theories of gravity. It's possible to use pulsar data to measure post-Keplerian parameters of the only known double pulsar system, and GR passes all the tests with flying colours. Measurements of other binary pulsar systems (PS + a white dwarf, for example) very probably put an end to MOND and other TeVeS gravity theories.

Pulsar timing is precise to 10^-18 s, more precise than atomic clocks and it's apparently possible to notice irregularities in the terrestrial atomic timescale. Perihelion precession of pulsar binary systems is of order of degrees/year, meaning that Newtonian dynamics are completely inadequate here (one would be hopelessly lost soon, if she tried to navigate her spaceship in the system using Newton's dynamics. The stars would be out of place).

This image, showing the change in revolution time in binary pulsar system, came up. It's the strongest proof of gravitational wave existence so far, showing how the system radiates its energy away as ripples in the spacetime. The uncertainties are too small to picture.

Interesting results coming up, something to keep an eye on.

Thursday, March 22, 2012

SDSS images batch downloading, getting key values from .fits headers

I'm sure I'm probably a 1000th person writing scripts to download a bunch of SDSS images (here and here are the two examples I know of). It takes a list of (awk-generated) SDSS parameters, uses wget to download them, reads some value from the .fits file header and writes it out. A colleague asked me to get a list of such values, so here it goes.
I'm using a couple of string manipulation functions from sdsspy, I felt it's really pointless to rewrite those.

import os
import pyfits
import csv
import numpy
import string
import gzip

#awk < sdss_database_dump.csv 'BEGIN { FS="," }; { print $1,",",$2,",",$3,",",$8,",",$9,",",$10,",",$11}' > list.txt

dataFile = 'list.txt'

#the next three functions are from sdsspy library (

def run2string(runs):
    rs = run2string(runs)
    Return the string version of the run.  756->'000756'
    Range checking is applied.
    return tostring(runs,0,999999)

def field2string(fields):
    fs = field2string(field)
    Return the string version of the field.  25->'0025'
    Range checking is applied.
    return tostring(fields,0,9999)

def tostring(val, nmin=None, nmax=None):
    if not numpy.isscalar(val):
        return [tostring(v,nmin,nmax) for v in val]
    if isinstance(val, (str,unicode)):
 nlen = len(str(nmax))
 vstr = str(val).zfill(nlen)
 return vstr
    if nmax is not None:
        if val > nmax:
            raise ValueError("Number ranges higher than max value of %s\n" % nmax)
    if nmax is not None:
        nlen = len(str(nmax))
        vstr = str(val).zfill(nlen)
        vstr = str(val)
    return vstr        

csvReader = csv.reader(open(dataFile, "rU"), delimiter=',')
f = csv.writer(open('angles.txt', 'w'), delimiter=',')
for row in csvReader:
      print '********************************', row[0]      
      ID = string.strip(row[0])
      ra = string.strip(row[1])
      dec = string.strip(row[2])
      run = string.strip(row[3])
      rerun = string.strip(row[4])
      camcol = string.strip(row[5])
      field = string.strip(row[6])
      runstr = run2string(run)
      field_str = field2string(field)
      print 'wget'+run+'/'+rerun+'/corr/'+camcol+'/fpC-'+runstr+'-r'+camcol+'-'+field_str+'.fit.gz'
      print 'uncompressing..'
      gz ='fpC-'+runstr+'-r'+camcol+'-'+field_str+'.fit.gz')
      imgFile =, mode='readonly')
      print 'getting header info...'
      head = imgFile[0].header
      angle = head['SPA']
      info = (int(ID), angle)  

Tuesday, March 20, 2012

the end of MICE

I'm about to finish running GALFIT on the HST F606 MICE image (preview here).
I've used GALFIT before, while writing my MSc thesis, but oh, those were faint, redshifted and unremarkable SDSS galaxies. That was simple -- I used 5 components once, because a galaxy had a LINER AGN.
The MICE are monsters. They've apparently already passed through each other once, and have highly disturbed tidal features, clumps of active star formation, dust lanes, you name it. I've never used GALFIT to model spiral arms, too, so that was an interesting experience.
Here are they:

I started simple, first masking out each galaxy and fitting the other in turn, then fitting both simultaneously (that's a better way to do it anyway). After running about 150 models I decided to mask out the dark dust lanes in galaxy A -- I couldn't fit them using truncation functions, as GALFIT crashes (on my 64bit SUSE, at least) while attempting that and I don't think the science goal required this. I have some reservations about the decomposition, as I think the image has some flat-fielding issues, besides, the galaxies take up a large part of the overall image, so I have doubts about the determined sky value as well.
In the end I had 91 free parameter, and I think more might have made sense, but that was past the point of diminishing returns. I stopped at Chi^2 = 1.877, and guess the main reason why the model would fit the image so poorly were the bright and irregular cores of the galaxies -- or possible higher order flat-field curvature. The other objects were masked out.
Here is the model -- at ds9's 99.5 scale and heat colourmap (this is handy: ds9 -scale mode 99.5 -cmap Heat -medatacube ~/workspace/MICE/galfit/subcomps.fits). The long, bright tails are not well visible with these settings, but they weren't the goal.
I am not really happy with the residual picture, as it tells me there are some big, obscured components lurking underneath. But that'll have to do for the moment, as I have other assignments. I've been doing this for other people in the collaboration, so I don't think I can publish the fit logs here, unfortunately.

Friday, March 9, 2012

Healing holes in arrays in Python: interpolation vs. inpainting

I had solved that interpolation task I was working on previously, by the way. I post my approach here, in case someone finds it useful.

If the bad pixel regions are big, standard methods of interpolation are not applicable. Interpolation works under assumption that the regions are contiguous, whereas this is not necessarily the case. Besides, none of interpolation implementations in python seem to work with masked arrays, and I've tried a metric ton of them. miceuz, who was sitting on this task half night as well, asked about it on stackoverflow, and someone mentioned the right keyword: inpainting.

That's still inventing the data, but the option is having unquantifiable errors in photometry if we leave out the bad pixels or don't mask the bad areas at all.

OpenCV is a huge piece of software, and using it to fill those holes is tantamount to using a hammer to crack a nut. However, I reused this tiny piece of code, adding a [hardcoded ,)] Gaussian weights kernel as I needed to use inverse distance weighting for interpolation. Here's the code:

import numpy as np
import pyfits
from scipy import ndimage
import inpaint

maskedImg =, mask = cutoutMask)
NANMask =  maskedImg.filled(np.NaN)

badArrays, num_badArrays = sp.ndimage.label(cutoutMask)
print num_badArrays
data_slices = sp.ndimage.find_objects(badArrays)

filled = inpaint.replace_nans(NANMask, 5, 0.5, 2, 'iwd')
hdu = pyfits.PrimaryHDU(filled)

inpaint module boils down to this -- I used the code linked above, just lobotomised the Cython:

# -*- coding: utf-8 -*-

"""A module for various utilities and helper functions"""

import numpy as np
#cimport numpy as np
#cimport cython

DTYPEf = np.float64
#ctypedef np.float64_t DTYPEf_t

DTYPEi = np.int32
#ctypedef np.int32_t DTYPEi_t

#@cython.boundscheck(False) # turn of bounds-checking for entire function
#@cython.wraparound(False) # turn of bounds-checking for entire function
def replace_nans(array, max_iter, tol,kernel_size=1,method='localmean'):
    """Replace NaN elements in an array using an iterative image inpainting algorithm.
The algorithm is the following:
1) For each element in the input array, replace it by a weighted average
of the neighbouring elements which are not NaN themselves. The weights depends
of the method type. If ``method=localmean`` weight are equal to 1/( (2*kernel_size+1)**2 -1 )
2) Several iterations are needed if there are adjacent NaN elements.
If this is the case, information is "spread" from the edges of the missing
regions iteratively, until the variation is below a certain threshold.
array : 2d np.ndarray
an array containing NaN elements that have to be replaced
max_iter : int
the number of iterations
kernel_size : int
the size of the kernel, default is 1
method : str
the method used to replace invalid values. Valid options are
`localmean`, 'idw'.
filled : 2d np.ndarray
a copy of the input array, where NaN elements have been replaced.
#    cdef int i, j, I, J, it, n, k, l
#    cdef int n_invalids
    filled = np.empty( [array.shape[0], array.shape[1]], dtype=DTYPEf)
    kernel = np.empty( (2*kernel_size+1, 2*kernel_size+1), dtype=DTYPEf )
#    cdef np.ndarray[np.int_t, ndim=1] inans
#    cdef np.ndarray[np.int_t, ndim=1] jnans
    # indices where array is NaN
    inans, jnans = np.nonzero( np.isnan(array) )
    # number of NaN elements
    n_nans = len(inans)
    # arrays which contain replaced values to check for convergence
    replaced_new = np.zeros( n_nans, dtype=DTYPEf)
    replaced_old = np.zeros( n_nans, dtype=DTYPEf)
    # depending on kernel type, fill kernel array
    if method == 'localmean':
        print 'kernel_size', kernel_size       
        for i in range(2*kernel_size+1):
            for j in range(2*kernel_size+1):
                kernel[i,j] = 1
        print kernel, 'kernel'

    elif method == 'idw':
        kernel = np.array([[0, 0.5, 0.5, 0.5,0],
                  [0, 0.5, 0.5 ,0.5 ,0]])
        print kernel, 'kernel'      
        raise ValueError( 'method not valid. Should be one of `localmean`.')
    # fill new array with input elements
    for i in range(array.shape[0]):
        for j in range(array.shape[1]):
            filled[i,j] = array[i,j]

    # make several passes
    # until we reach convergence
    for it in range(max_iter):
        print 'iteration', it
        # for each NaN element
        for k in range(n_nans):
            i = inans[k]
            j = jnans[k]
            # initialize to zero
            filled[i,j] = 0.0
            n = 0
            # loop over the kernel
            for I in range(2*kernel_size+1):
                for J in range(2*kernel_size+1):
                    # if we are not out of the boundaries
                    if i+I-kernel_size < array.shape[0] and i+I-kernel_size >= 0:
                        if j+J-kernel_size < array.shape[1] and j+J-kernel_size >= 0:
                            # if the neighbour element is not NaN itself.
                            if filled[i+I-kernel_size, j+J-kernel_size] == filled[i+I-kernel_size, j+J-kernel_size] :
                                # do not sum itself
                                if I-kernel_size != 0 and J-kernel_size != 0:
                                    # convolve kernel with original array
                                    filled[i,j] = filled[i,j] + filled[i+I-kernel_size, j+J-kernel_size]*kernel[I, J]
                                    n = n + 1*kernel[I,J]
                                    print n

            # divide value by effective number of added elements
            if n != 0:
                filled[i,j] = filled[i,j] / n
                replaced_new[k] = filled[i,j]
                filled[i,j] = np.nan
        # check if mean square difference between values of replaced
        #elements is below a certain tolerance
        print 'tolerance', np.mean( (replaced_new-replaced_old)**2 )
        if np.mean( (replaced_new-replaced_old)**2 ) < tol:
            for l in range(n_nans):
                replaced_old[l] = replaced_new[l]
    return filled

def sincinterp(image, x,  y, kernel_size=3 ):
    """Re-sample an image at intermediate positions between pixels.
This function uses a cardinal interpolation formula which limits
the loss of information in the resampling process. It uses a limited
number of neighbouring pixels.
The new image :math:`im^+` at fractional locations :math:`x` and :math:`y` is computed as:
.. math::
im^+(x,y) = \sum_{i=-\mathtt{kernel\_size}}^{i=\mathtt{kernel\_size}} \sum_{j=-\mathtt{kernel\_size}}^{j=\mathtt{kernel\_size}} \mathtt{image}(i,j) sin[\pi(i-\mathtt{x})] sin[\pi(j-\mathtt{y})] / \pi(i-\mathtt{x}) / \pi(j-\mathtt{y})
image : np.ndarray, dtype np.int32
the image array.
x : two dimensions np.ndarray of floats
an array containing fractional pixel row
positions at which to interpolate the image
y : two dimensions np.ndarray of floats
an array containing fractional pixel column
positions at which to interpolate the image
kernel_size : int
interpolation is performed over a ``(2*kernel_size+1)*(2*kernel_size+1)``
submatrix in the neighbourhood of each interpolation point.
im : np.ndarray, dtype np.float64
the interpolated value of ``image`` at the points specified
by ``x`` and ``y``
    # indices
#    cdef int i, j, I, J
    # the output array
    r = np.zeros( [x.shape[0], x.shape[1]], dtype=DTYPEf)
    # fast pi
    pi = 3.1419
    # for each point of the output array
    for I in range(x.shape[0]):
        for J in range(x.shape[1]):
            #loop over all neighbouring grid points
            for i in range( int(x[I,J])-kernel_size, int(x[I,J])+kernel_size+1 ):
                for j in range( int(y[I,J])-kernel_size, int(y[I,J])+kernel_size+1 ):
                    # check that we are in the boundaries
                    if i >= 0 and i <= image.shape[0] and j >= 0 and j <= image.shape[1]:
                        if (i-x[I,J]) == 0.0 and (j-y[I,J]) == 0.0:
                            r[I,J] = r[I,J] + image[i,j]
                        elif (i-x[I,J]) == 0.0:
                            r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(j-y[I,J]) )/( pi*(j-y[I,J]) )
                        elif (j-y[I,J]) == 0.0:
                            r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(i-x[I,J]) )/( pi*(i-x[I,J]) )
                            r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(i-x[I,J]) )*np.sin( pi*(j-y[I,J]) )/( pi*pi*(i-x[I,J])*(j-y[I,J]))
    return r
#cdef extern from "math.h":
 #   double sin(double)
The thing that's bugging me is that this algorithm doesn't traverse around the edges of the hole, rather, it indexes the bad pixels incrementally. However, it does several iterations of smoothing, so maybe it's ok. At least, there are no visible spooky artifacts.

Thursday, March 8, 2012

Mice and masks

I'm currently running GALFIT on HST f606W Mice galaxies image, trying to find point sources within the nucleii, decompose them into structural elements, find bulge shapes, etc. GALFIT is only as smart as its input, so I'm still preparing.

I masked other sources, spurious objects, trails using SeXtractor (it's a pain to compile it from source). However, it's way better to fit only a single galaxy at a time, hence I had to create mask images for them separately.

Doing it the sensible way failed, as the C script would segfault or produce a blank file -- it's probably has to do with ds9 output, but I didn't want to dive into it. Instead, I wrote a simple Python script to mask out rectangular regions around each galaxy.

More elaborate shapes could be made by reusing this script, for instance. When exporting region vertices with ds9, take care to use image coordinates instead of physical, otherwise your mask will look wonky. Here are the Mice galaxies in all their false-colour glory, the image I use for science and one of the final masks.

And here goes the essence of that Python script, it's trivial:

maskA = np.zeros((image.shape[0], image.shape[1]))

y = np.zeros(4)
x = np.zeros(4)

for i in range(0, 4):
  y[i], x[i] = poly[i]

for i in range(image.shape[0]):
  for j in range(image.shape[1]):
    if (min(y) < i and i < max(y)) and (min(x) < j and j < max(x)):
      maskA[i,j] = 1

poly is a list of (x, y) tuples. Now that I think about it, there's an option in GALFIT to fit only a section of an image -- but masking out seems to be more appropriate, as the galaxies are really close together.

Friday, February 24, 2012

Reionisation, healing masked arrays

This morning Benedetta Ciarli (MPA) gave an interesting talk about cosmic reionisation, modelling they're doing and observations with LOFAR, where they might glimpse the 21 cm line forest.
Two take-home messages were:
  • Helium fraction must be included in reionisation simulations
  • _If_ (apparently, it's a big if) LOFAR finds a high-redshift radio-loud quasar, they might see the absorbtion features along the line of sight, and obtain better constraints on reionisation parameters.

Later on I was solving a problem that preoccupies me these weeks ahead of the thesis comittee meeting: automated photometry of CALIFA (actually, any SDSS) galaxies, using growth curve analysis for flux extraction. The field stars and other irrelevant objects were already masked by N., but the holes left in the image arrays have to be interpolated, and that's not a trivial task, apparently. Python's interp2d fails on arrays this big (standard SDSS frames, I won't do manual cutouts), besides, I'm not sure it is suitable for this goal, as well as all similar python/numpy routines.

My current approach is finding contiguous bad (=masked) regions in the image array, forming smaller sub-arrays and filling the masked regions within them with the mean of a circular aperture around. That's obviously a naive approach to take -- once I get it running, I'll be able to do more realistic interpolation, e.g. using the galaxy's Sersic index.

It was an interesting detour through numpy array masking, motion detection, image processing and acquainting myself with the slice object. I kind of enjoy diving into those gory details instead of using some nice high level routine -- I'm reinventing the wheel, obviously, but that's more scientific than using black box methods.

Finding contiguous regions within an array, apparently, is not that trivial as I first expected, but the solution I used was really helpful -- stackoverflow rocks as usual.