Monday, November 30, 2009

R Packages on Riemann

I asked Michael Jennings to install R on riemann so I can play with this SVM-light thing for the target selection. He installed R but asked me to install any add-on packages into my home directory. Here is how you do this:
  1. Create a directory where you want the R add-on packages to be stored (i.e. /home/jessica/R/library)
  2. Tell R to look for libraries in this location by adding this to your .path file: export R_LIBS=/home/jessica/R/library
  3. Source your .bashrc file to update path changes
  4. Open R
  5. Type the command: install.packages("e1071", lib = '/home/jessica/R/library', repos = "") where "e1071" is the name of the package you want to install, lib is the directory you created to store the packages, and repos is the location to download the packages from.
  6. Test to make sure it was successful by closing R, re-opening and importing package you just installed: library("e1071")
And now you have successfully installed a R package on Riemann. Thanks to Michael for working so quickly. Now, to try this SVM thing on the quasar targeting...

Here's the code (also in like:svmcode.r)

qso = readFITS("smallqso.fits")
star = readFITS("smallstar.fits")
x2 <- as.matrix(star$col[[1]],ncol=5)
x1<- as.matrix(qso$col[[1]],ncol=5)
x <- rbind(x1,x2)
y1= seq(1,smallsize)*0+1
y2= seq(1,smallsize)*0

fy = factor(y)
m = svm(x,fy)
save(list = ls(), file = "./svmcode.Rdata")

Where smallqso.fits and smallstar.fits are subsets of the qso/star templates used in the likelihood calculation. In this run I am using 100,001 objects for these small files, but that run is taking 2+ days so far on my laptop, so I think I'm going to try to do it with less points for this run on riemann (30,000). To load this session image, type: load("./svmcode.Rdata")

To run this file as a script type the following (for some reason the qsub way of running a script isn't working):
R CMD BATCH ./svmcode.r svmcode.out 2> svmcode.err &

Thursday, November 26, 2009

Happy Thanksgiving

Happy Thanksgiving Everyone!

Wednesday, November 25, 2009

Likely Testing

I'm trying to track down why the likelihood method is missing quasars that other targeting methods are finding.

The plot of human-confirmed QSOs that were in the target list for the likelihood method. The magenta are targeted by likelihood, the cyan are missed by likelihood, the white are targeted only by likelihood (click on image to enlarge):

Color-Color (u-g vs g-r)
of likelihood selected objects (magenta),
likelihood missed objects (cyan),
likelihood only objects (white)

The reason this plot looks different than yesterdays is because I was accidentally plotting both QSOs and stars on the plot yesterday. The above human-confirmed QSOs.

Below is a log-log plot of the likelihood selection variables L_everything vs L_qso for the likelihood-selected QSOs (magenta), the likelihood-missed QSOs (cyan), the likelihood-selected stars (red), and the likelihood-missed stars (green) you can see the value of L_ratio we cut on is the slope of the dividing line between these populations (click on image to enlarge):

L_everything vs L_qso

I've looked at the individual fluxes of the likelihood selected/missed QSOs to see if there an obvious place in flux-color space where we can get more of the missing objects. They are all intermixed.

David Schlegel suggested increasing the 'added errors' in the likelihood calculation to see if this improves things. I am re-running the likelihoods on these ~1400 objects (above) with larger errors. The hope is that because we are have such a sparse sample of the QSOs, that by increasing the errors we'll allow for objects that are "farther away" in flux-space to be selected. The errors are currently on the order of ~1%. I am re-running with 10%, however David suggests that I do it with 5%, so I'll run with that too.

Tuesday, November 24, 2009

Commissioning Quasars

I am trying to answer these questions about the likelihood targets in the commissioning data. Today I took the collated targets for the QSO commissioning region from the wiki (here) and used spherematch to align it with the truth table that Nic Ross has provided me (based on visually inspected spectra). For truth table see the Nov 20th email from Ross with subject line BOSS Quasars "Truth Table." Both are also on riemann in the directory:
Target bitmasks from the wiki:
# restrictive qso selection
# permissive qso selection
# known qso between [2.2,9.99]
# known qso outside of miz range. never target
# Neural Net that match to sweeps/pass cuts
maskbits BOSS_TARGET1 14 QSO_NN
# UKIDSS stars that match sweeps/pass flag cuts
# kde targets from the stripe82 coadd
# likelihood method
These are bitmasks, so in IDL to select items that are only selected by the likelihood method we would do the following:
LIKEONLY = where(truthmatch.BOSS_TARGET1 EQ ishft(long(1),17))

To select items which are selected by the likelihood (and others):

LIKE = where((truthmatch.BOSS_TARGET1 AND ishft(long(1),17)) GT 0)

To select items which are not selected by the likelihood (selected by others):

NOTLIKE = where((truthmatch.BOSS_TARGET1 AND ishft(long(1),17)) EQ 0)

To select objects with high redshift:

HIGHZ = where(truthmatch.Z_PERSON GT 3)

Below is a color-color plot of the above selections on top of the templates used for the likelihood selection. Red is everything. Green is QSOs. Magenta are objects that the likelihood method selected (LIKE). Cyan are objects that only the likelihood method selected (LIKEONLY). White are objects that have a high redshift (HIGHZ):

Color-Color (u-g vs g-r)
of likelihood selected objects

Below are objects that are QSOs but were missed by the likelihood (NOTLIKE). Need to figure out why.

Color-Color (u-g vs g-r)
of not-likelihood QSO objects

Monday, November 23, 2009

Revisiting Likelihood

Nick, David and Adam want me to check that the single epoch likelihood computation is correct. I also need to address some long-standing questions about the likelihood method including:
1. Why are we getting z > 3 when we aren't targeting these objects?
2. What is the proper normalization of the likelihood objects?
And several other questions posted a while back and never answered.

The likelihood code is at the following location on the repository:

I've also downloaded it here on riemann:

The varcat files are here (on riemann):

The star/qso template files are here on the repository:

I've also downloaded it here on riemann:

Myers sent me a fits file that has the single epoch targets with the likelihoods he computed. He wants me to check that they look reasonable. The file is in an email sent Nov 18/19 from Adam Myers with the subject line: Likelihood Method.

Below is an email I sent to Myers today regarding this check:

From: Jessica Ann Kirkpatrick
Date: Mon, Nov 23, 2009 at 4:15 PM
Subject: Re: Likelihood method
To: Nic Ross
Cc: Joseph Hennawi , Adam Myers, David Schlegel

when I compute the likelihoods I get the same values as you do for
like_ratio, like_everything, like_QSO_Z (well, within expected
rounding errors).

One thing that I had a question about is how like_rank is numbered. I
was assuming that like_rank = 0 would be the target with the largest
like_ratio, but it seems it is the reverse (like_rank = 0 is the
lowest like_ratio object).

So we are sorting target selection by largest ranking to smallest?

I've attached a color-color (ug, gr) plot of the 10,000 highest ranked
likelihood objects (cyan) on top of "everything" objects (red) and
qsos (green). You can see that these targets fall in the quasar zone.

Color-Color (ug, gr) plot of the 10,000 highest ranked
likelihood objects (cyan) on top of "everything" objects (red) and
qsos (green). You can see that these targets fall in the quasar zone.

Useful Mac Tips of the Day (via Demitri):
If you want to ssh into your Mac, go to
  • System Preferences -> Sharing -> Remote Login
  • Turn Remote Login on. Then find out your IP address by typing the following into a web browser:
You can then ssh/scp into your Mac in the normal way using jessica@myipaddress!

Also... Alexia will be happy to know that I finally implemented different terminal colors for different computers that I am logged into (so much less confusing):

Top terminal window (green) is riemann,
lower terminal window (black) is my Macbook.

And... Demitri showed me how to turn any mp3 into a ring tone using Garage Band. What a great day for cool tricks! Now my alarm clock will be the beginning of James Brown's Sex Machine: "Get-up, get-on-up. Get-up, get-on-up."

Friday, November 20, 2009

Mock Reconstruction

I ran a full reconstruction on the mock data over the past couple days. There are some problems with this run (which I discovered after I started it running, but decided to just let it finish). The first is that I have not fixed the distribution of the randoms in the 2D correlation function to go as cosine for the declinations (they are flat). The second is that I discovered the mock catalog I was using was not dark matter halos but actually LRGs and to the clustering properties are nonsensical. I have since gotten a dark matter halo catalog from Martin White.

Here are the 2D and 3D correlation function for the various redshift bins:

2D correlation functions for redshift bins

3D correlation functions for redshift bins

As you can see the 3D correlation function are no longer fluctuating around zero as they were on the Sloan data.

The reconstruction is a bit puzzling. Below is a histogram of the comoving distances (from the center of the sphere) of the spectroscopic (yellow), photometric (green) data sets as well as the reconstructed phi (redshift distribution (pink)). You will noticed that while the histograms don't match, the bumps in the photometric/spectroscopic histograms seem to match the bumps in phi. I was thinking that because these photometric/spectroscopic histograms include both the geometry and the "selection function" so to speak (I didn't actually apply a selection function, so it is just the natural clustering of the objects). Whereas phi (I think) is only supposed to be the selection function, not the geometry. I was wondering if perhaps if I subtract the geometry from the p/s histograms (which is proportional to r^2) then we would basically be left with these same wiggles? I've asked Alexia about this and I am waiting for a response.

Thursday, November 19, 2009

Catalog Woes

It turns out my mock catalog is probably not what I want to be using. I talked to Alexia and Martin White, and I am using a LRG catalog, but I want to be using a halo catalog. This probably means that my clustering properties are nonsensical (because I am generating galaxies based on galaxies, not on halos). Martin said he is going to give me a better catalog to use.

In the mean time I am running on the catalog I have. The reconstruction should work as long as both the photometric and spectroscopic data sets are generated in the same way... even if they don't accurately represent any real distribution on the sky.

The reconstruction is currently running, and I should have it done later today.

I also checked my 3D slice against Alexia's to make sure they match, and they do:

My and Alexia's 3D Correlation functions on
a redshift slice of the mock data

Wednesday, November 18, 2009

Running Scripts (Update)

I want to utilize the fact that I can run the correlation function on multiple nodes of riemann. Eric helped me learn to use qsub to run a script on multiple nodes. I've got it working as follows:

In the file on riemann 091118jess.script there is the following:

#PBS -l nodes=1
#PBS -N jessjob1
#PBS -j oe
cd /home/jessica/repository/ccpzcalib/Jessica/pythonJess

where runcode.script contains the following:

I make both runcode.script and 091118jess.script executables, and then type in the terminal:
qsub 091118jess.script

I then exit from riemann/insure (didn't close it using x button, not sure if this makes a difference).

When I log back in the code is still running. (check using qstat)

Now I need to integrate this into the python code so that qsubs are generated for each correlation function for each redshift slice. This should make things go a lot faster.

The email from Eric:
From: Eric Huff
Date: Wed, Nov 18, 2009 at 3:06 PM
Subject: qsub
To: Jessica Kirkpatrick

Here's a script example:

#PBS -l nodes=1
#PBS -l walltime=24:00:00
#PBS -m bea
ulimit -s unlimited
export PHOTO_DATA=/clusterfs/riemann/raid006/bosswork/groups/boss/photo/data
export PHOTO_REDUX=/clusterfs/riemann/raid006/bosswork/groups/boss/photo/redux
cd /home/emhuff/Coadds/Code/Coadd_Data_Root/Scripts
echo `hostname`
/home/emhuff/Coadds/Code/Coadd_Data_Root/Scripts/r.out 380720 6 0 00

And here's some online documentation for qsub:

I'm running the full reconstruction on the mock data, I might have a fully working code tomorrow. Woo hoo!

Degrees vs. Radians

Turns out the "List length exceeded MaxList" error was due to the randoms not falling in the same region as the data. I had forgotten to convert my data from radians to degrees in Python (but it was being converted in the C program) and so the C-converted randoms (in x, y, z) did not match the Python-converted data (in x, y, z).

I've fixed this, and get the following 3D correlation functions for my redshift slices of the Mock data (and they look pretty darn good)!

My 3D Correlation function on Mock Data

This makes me want to make the bold statement that all the problems I was having with my 3D correlation functions on the Sloan data was a problem with the data, (not my code)... but I need to check this again because the 2D correlation functions looked ok on the Sloan data. This could be an issue of the photo-z's just being completely wrong, (David keeps saying he doesn't trust the CAS) so the 2D correlation function (which only uses ra and dec) is fine, but once we start using the photo-z's things get messed up.

Another thought I had was that I need to fix this declination random cosine population issue in the 2D correlation function, currently the declinations are being populated using a flat distribution.

Oh today is a good day indeed!

Énergie Noire

I think from now on I am going to refer to dark energy by it's French name. Énergie Noire just sounds so much more exotic.

There is a BigBOSS Meeting for the rest of the week at LBNL therefore I don't know how much I'll get done research-wise, but there is also a looming target selection deadline, so I need to do some stuff there too.

Tuesday, November 17, 2009

Exceeded Maxlist

I am getting a strange error when I run the 3D Correlation function. The error is from Martin White's meshing algorithm (nearmesh):

"List length exceeded MaxList=234700"

It think it happens when a point is near any mesh within the grid. But I am not sure exactly, I didn't write this section of the code, and I've never gotten this error before.

I wrote Alexia about it, and I am hoping she'll have some experience with this problem.

Below is the email I send her today:

From: Jessica Ann Kirkpatrick
Date: Tue, Nov 17, 2009 at 11:24 PM
Subject: halo files and numbers
To: Alexia Schulz

the file I am using is from here:

It was under a section that used to be there called LRG but now seems to be gone. See this blog entry.

The number of halos in the file is 102496.

When I run halolist with mmin = 5e10 I get 3811293 galaxies (for photometric data set)
When I run halolist with mmin = 5e11 I get 378932 galaxies (for spectroscopic data set)

output when I run halolist is below:

In [324]: #make galaxies
In [325]: mmin=5.e10 #for gal
In [326]: h=gals.halolist(mockLRGfile,mmin)
1.30144190788 seconds wall time to readin
1.3 seconds processor time to readin
33.0253970623 seconds wall time
30.84 seconds processor time

In [327]: g=gals.galist(h,nbins,outfile=photofile)
halo number is 102496
number of halos with 1 or more gals 102496
176.569406033 seconds wall time
168.58 seconds processor time

In [328]: mmin=5.e11 #for spec
In [329]: h=gals.halolist(mockLRGfile,mmin)
10.7454509735 seconds wall time to readin
1.66 seconds processor time to readin
37.9757530689 seconds wall time
27.46 seconds processor time

In [330]: g=gals.galist(h,nbins,outfile=specfile)
halo number is 102496
number of halos with 1 or more gals 101718
27.7532730103 seconds wall time
26.72 seconds processor time

I'm getting the error:
List length exceeded MaxList=234700
on one of my runs, has this ever happened to you?

It seems to be coming from deep in the calculation in the function meshnear.

I am getting this error when I try to run on redshift slices which are further away (and thus have higher statistics).

Tomorrow the BigBOSS meeting has a break from 12-1. Can I call you sometime in that window?


I thought perhaps a problem is that the randoms are not overlaping with the data (and so the nearmesh would mess up because the randoms are not in the same region as the data).

And this seems to be part of the problem:

Need to check these to see where the problem is. I think there might be a degrees vs radians issue. I'll look at this tomorrow.

Monday, November 16, 2009

Input files

I've decided to use put files for the arguments to my correlation functions (for purposes of documentation and also to make debugging easier). I used to do this with DRIFT data analysis, and liked it a lot.

The main problem is that when I debug using Xcode, you type in the arguments by hand. This is annoying because the masks change everytime I run a different slice. This isn't an issue when I am not debugging because I simply have python feed the arguments into the C program (that does the correlation).

Essentially I have changed the correlation function code so that it's only input parameter is a file name which has all the arguments in it. I think this should also help with replicating old runs as well as general organization.

Friday, November 6, 2009

Mock Slice

The next step in trying to figure out what is wrong with my 3D correlation functions is to try to actually do a full reconstruction on the mock data. I'm starting with taking a redshift slice of the mock data and calculating a 3D correlation function on the slice to see if I get a logical result. This appears to be working:

Matching 3D correlation functions

To be more explicit about what I am doing to create the above. I am first taking the mock data box and selecting a "redshift slice" of the data:
For my correlation function:

Redshift slice (green) of all data points in
comoving distance range of 350-450 Mpc/h

What this looks like in Cartesian coordinates:

It is hard to see in 2D projections,
but this is essentially a thick spherical shell with inner radius
350 Mpc/h and outer radius 450 Mph/h

For Alexia's correlation function I am taking a slice in the z dimension:

Redshift slice (green) for Alexia's data

I chose the slice such that both mine and Alexia's data subsets had approximately the same number of objects (~100,000).

The distributions of the randoms continue to match the data:
For my correlation function:

Number of randoms (blue) and data points (red)
for different dimensions (ra, dec, redshift, x, y, z)

This gives me the confidence to try to do the full reconstruction on this mock data set, as that would simply be just repeating the above on several redshift bins.

I seem to be running Alexia's and my correlation functions side by side a lot to compare the results, I have made the following scripts:

svn +x *.script
./091105jess.script 091105jess.out 2> 091105jess.err &
./091105alexia.script 091105alexia.out 2> 091105alexia.err &

Cool link of the day:
Now my parents can understand xkcd:

Update (from Kevin):
As far as is concerned, is useful,
but is funny,
and exists.

Thursday, November 5, 2009

Data Fixed?

I wanted to see if the changes I made to the declination distribution of the randoms fixed the problem with the Sloan data 3D correlation function. It appears it has not.

The correlation function is going negative ~5 Mpc/h:

As per Nic Ross's advice I am printing out the individual elements of the correlation function:

bin dd dr rd rr
0 317762.000000 216879.000004 217092.400004 226377.000004
1 1438272.000000 1190313.799866 1189213.999867 1244237.199854
2 2708480.000000 2477950.600010 2481026.000013 2551443.400078
3 3778250.000000 3627401.601080 3635341.801088 3668580.801119
4 4493904.000000 4437785.801835 4441650.801839 4424500.401823
5 4827204.000000 4858547.602227 4850951.402220 4790191.802163
6 4832844.000000 4926666.002291 4915620.202280 4845375.602215
7 4632790.000000 4742744.802119 4739360.602116 4702071.602081
8 4356652.000000 4461201.001857 4459481.601855 4483517.001878
9 4157056.000000 4233592.001645 4233477.601645 4276671.401685
10 4014424.000000 4070859.001493 4070540.001493 4105703.401526
11 3903134.000000 3944458.201376 3945670.001377 3970457.201400
12 3811114.000000 3847011.401285 3849257.601287 3867510.001304
13 3745984.000000 3769143.601212 3771384.201215 3788595.401231
14 3684692.000000 3707139.601155 3706282.401154 3723841.201170
15 3632222.000000 3651903.001103 3652026.201103 3667982.201118
16 3582504.000000 3600661.601056 3600676.801056 3617006.401071
17 3541194.000000 3554850.401013 3556219.601014 3567875.401025
18 3505996.000000 3511868.800973 3511927.400973 3523757.600984
19 3464186.000000 3468078.800932 3468489.800932 3482728.600946
20 3418956.000000 3425039.800892 3424701.000892 3442556.200908
21 3376304.000000 3386133.600856 3386409.600856 3406330.600875
22 3348124.000000 3351585.600824 3351850.400824 3371267.600842
23 3312086.000000 3319740.000794 3317782.400792 3335477.000809
24 3278338.000000 3285228.400762 3283852.800761 3299746.600775
25 3237588.000000 3252288.800731 3251628.800731 3266714.800745
26 3207848.000000 3220719.400702 3217525.800699 3234663.000715
27 3184346.000000 3189389.600673 3187092.600670 3199858.800682
28 3154432.000000 3160154.200645 3157859.200643 3168571.000653
29 3127406.000000 3131088.200618 3127228.000615 3137800.800625
30 3093844.000000 3098019.600587 3096805.400586 3105624.200595
31 3064538.000000 3064802.600557 3064300.800556 3072282.600563
32 3036138.000000 3035210.800529 3032522.400526 3040139.800534
33 2999470.000000 3000089.200496 3000837.800497 3008239.000504
34 2962448.000000 2968977.200467 2970271.600468 2979345.000477
35 2927012.000000 2934725.000435 2935492.800436 2946856.400447
36 2901310.000000 2903968.800407 2902961.000406 2913088.800415
37 2863636.000000 2869294.400374 2869514.600375 2881526.600386
38 2821798.000000 2833074.200341 2834320.800342 2847371.800354
39 2796586.000000 2803686.200313 2803161.600313 2813829.200323

Because the dr an rd are being subtracted from rr and dd in the calculation of the correlation function, it is possible for the correlation function to go negative if the objects are more correlated with the random data than with themselves. This seems very strange to me, but it is possible that because we don't trust the photo-z's from the data set I am using that maybe this is causing the problem?

I think the next step is to run the reconstruction purely on the mock data to see if that works, and this should isolate if this is a bad data issue or a bad code issue.

Wednesday, November 4, 2009

Populating the Sphere

I am trying to populate the randoms in the same way that the data is populated over the sphere. This means that in right ascension I populate the randoms with a uniform distribution over the range of the data values. However, for declination I populate using a cosine function (because we want to density per volume element to be constant). I am doing the following:
oo = 1;
thisdec = drand48()*mask; // Choose a declination inside the mask
theta = drand48(); //Choose a random value for sin(thisdec)
/*If theta is "below the sin(thisdec) curve",
keep thisdec, otherwise repeat above process. */
if(theta < sin(thisdec){
Rb[nn].pos[i] = thisdec;
oo = 0;


This uses a rejection sampling method to populate the declinations. It seem to work, here are plots of the random distributions compared with the data, you can see that the distributions match:

Matching distributions of data (red) and randoms (blue)

Just to make sure that the data still falls in the same regions as the randoms here are more plots:

The data (red) falls in the same physical region as the randoms (blue)

Now my correlation function matches Alexia's again (as opposed to my October 30th post):

Matching 3D correlation functions!

Tuesday, November 3, 2009

Stripe 82 Histograms

Here are histograms of the random numbers versus data for Stripe 82. As you can see it seems like the randoms are matching the data pretty well:


Spherical Coordinates

Therefore I don't think this is what is causing the problem in my calculation on the data. I coded up the correlation function to print out the components as per Nic's suggestion (see yesterday's post). Hopefully this will help me isolate the problem.

Monday, November 2, 2009

Populating Randoms

I talked to Nic Ross and Mark Davis about this issue of populating the random declinations in the correlation functions. They both said that this is a well known issue and that (obviously) I want to populate the random declinations with a cosine function. Nic doesn't think this should matter much with my Stripe 82 3D correlation functions because the dec range is small and therefore is ~flat. Still, I am going to see if this matters.

Nic also recommended that I print out the individual elements of the correlation function to see if there is one in particular that isn't behaving properly. By "elements" mean that the 3D correlation function is calculated in the following way:

xi = (dd-2*dr+rr)/(rr);

where dd is the data correlated with itself, dr is the data correlated with the randoms and rr is the randoms correlated with themselves, so I am doing that now too.