Showing posts with label testing. Show all posts

Monday, June 6, 2011

Testing QSO-Galaxy Cross CF

I spent the last couple days making sure that I trusted the correlation function code I had developed for this project. I compared it to the results from a (slow) sphere-match code on a small data set.

The following is the testing code (also at ../logs/cross_corr_check.idl). In IDL, using spherematch:

; Read in data/random files
readcol,'cfht_data_tiny.dat',rad1,decd1, format = ('D,D')
readcol, 'qso_data_tiny.dat',rad2,decd2,cd, format = ('D,D,D') ;cd is comoving distance of qsos
readcol,'cfht_randoms_tiny.dat',rar1,decr1, format = ('D,D')

; Find number of objects in files
nr1 = n_elements(rar1)
nd1 = n_elements(rad1)
nd2 = n_elements(rad2)

; Correlate out to a large theta to get all pairs
thetamax = 50.0

print,'Starting cross-correlation:'
print,'Estimating DD...'
spherematch,rad1,decd1,rad2,decd2,thetamax,ind1a,ind1b,dist_dd,maxmatch=0

; Convert from angular separation to comoving distance separation
this_dd = 2*sin(dist_dd*!pi/360)*cd[ind1b]

;Bins go from 0.1 to 10 with 15 bins.
corrmin = 0.1D
corrmax = 10.0D
nbins = 15

; Find bins lower, upper and centers
bins_lower = (corrmax-corrmin)/(nbins)*findgen(nbins)+corrmin
bins_upper = (corrmax-corrmin)/(nbins)*(findgen(nbins)+1)+corrmin
rmean = fltarr(nbins)
for i = 0,(nbins-1) do rmean[i] = (bins_lower[i]+bins_upper[i])/2.

; Bin the DD separation distances
dd = fltarr(nbins)
; (total() of a boolean array counts matches and, unlike n_elements(where(...)), gives 0 for empty bins)
for i = 0,(nbins-1) do dd[i] = total(this_dd gt bins_lower[i] AND this_dd le bins_upper[i])

print,'Estimating DR...'
spherematch,rar1,decr1,rad2,decd2,thetamax,ind1,ind2,dist_dr1,maxmatch=0

this_dr = 2*sin(dist_dr1*!pi/360)*cd[ind2]
dr = fltarr(nbins)
for i = 0,(nbins-1) do dr[i] = total(this_dr gt bins_lower[i] AND this_dr le bins_upper[i])

; DD/DR estimator: w = (DD/DR)*(Nrand/Ndata) - 1
corr1 = (dd/dr)*(double(nr1)/nd1) - 1.0

for i = 0,(nbins-1) do print, rmean[i], corr1[i]

Separation (Mpc/h)   omega
0.430000 -0.115686
1.09000 -0.104478
1.75000 -0.120804
2.41000 -0.0914845
3.07000 -0.0393971
3.73000 -0.0268416
4.39000 0.0134841
5.05000 0.0596094
5.71000 0.0227162
6.37000 0.102554
7.03000 0.0929233
7.69000 0.0900670
8.35000 0.0591398
9.01000 0.0284724
9.67000 0.0598689

(Note that these "tiny" files only have 6 QSOs and 9,000 galaxies, so the correlation function values are very noisy; this was just a test, and I used small files to keep the runtime manageable.)
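For readers without IDL, the same check can be sketched in plain numpy. This is an illustrative brute-force O(N²) pair count (function and argument names are mine, not the project code), using the same chord conversion s = 2 sin(θ/2)·D_c and DD/DR estimator as the IDL above:

```python
import numpy as np

def cross_corr_ddr(ra_d, dec_d, ra_q, dec_q, dist_q, ra_r, dec_r, bins):
    """Brute-force DD/DR cross-correlation (sketch, not the project code)."""
    def chord_seps(ra1, dec1, ra2, dec2, d2):
        # Angular separation from the spherical dot product ...
        p1, t1 = np.radians(ra1)[:, None], np.radians(dec1)[:, None]
        p2, t2 = np.radians(ra2)[None, :], np.radians(dec2)[None, :]
        cosang = np.sin(t1)*np.sin(t2) + np.cos(t1)*np.cos(t2)*np.cos(p1 - p2)
        theta = np.arccos(np.clip(cosang, -1.0, 1.0))
        # ... converted to a chord at the QSO's comoving distance:
        # s = 2 sin(theta/2) * D_c (same formula as the IDL code above)
        return (2.0*np.sin(theta/2.0)*d2[None, :]).ravel()

    dd, _ = np.histogram(chord_seps(ra_d, dec_d, ra_q, dec_q, dist_q), bins=bins)
    dr, _ = np.histogram(chord_seps(ra_r, dec_r, ra_q, dec_q, dist_q), bins=bins)
    # Simple DD/DR estimator: w = (DD/DR)*(Nrand/Ndata) - 1
    return dd/dr*(len(ra_r)/float(len(ra_d))) - 1.0
```

Note that `np.histogram` takes bin edges, so 16 edges from 0.1 to 10 reproduce the 15 bins used above.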

For comparison, I also have the python/C code, which runs much faster (../Jessica/qsobias/Correlate/runCorrelation.py):

import numpy as N
from pylab import *
from correlationFunctions import *

#------------------------------------------------------------------------
# Create file names (tiny catalogs)
#------------------------------------------------------------------------
workingDir = 'tinyrun'
makeworkingdir(workingDir)
galaxyDataFile, qsoDataFile, randomDataFile, corr2dCodefile, argumentFile, runConstantsFile = makeFileNamesTiny(workingDir)

oversample = 5. # Amount that randoms should be oversampled
corrBins = 25.0 # Number of correlation bins (+1)
mincorr = 0.1 # (Mpc/h comoving distance separation) Must be greater than zero if log-binning
maxcorr = 10.0 # (Mpc/h comoving distance separation)
convo = 180./pi # conversion from radians to degrees
tlogbin = 1 # = 0 for uniform spacing, = 1 for log spacing in theta

#------------------------------------------------------------------------
# Write run constants to a file
#------------------------------------------------------------------------
writeRunConstantsToFile(runConstantsFile, galaxyDataFile, qsoDataFile, \
    randomDataFile, corr2dCodefile, argumentFile, oversample, corrBins, \
    mincorr, maxcorr, tlogbin)

#------------------------------------------------------------------------
# Compute the Angular Correlation Function
#------------------------------------------------------------------------
runcrossCorrelation(workingDir, argumentFile, corr2dCodefile, galaxyDataFile, \
    qsoDataFile, randomDataFile, mincorr, maxcorr, corrBins, tlogbin)


# separation (Mpc/h)   crossw
0.4300000000 -0.1156862745
1.0900000000 -0.1044776119
1.7500000000 -0.1208039566
2.4100000000 -0.0914845135
3.0700000000 -0.0393970538
3.7300000000 -0.0268417043
4.3900000000 0.0134841235
5.0500000000 0.0596093513
5.7100000000 0.0227161938
6.3700000000 0.1025539385
7.0300000000 0.0929232804
7.6900000000 0.0900670231
8.3500000000 0.0591397849
9.0100000000 0.0284723490
9.6700000000 0.0598689436


As you can see, the correlation functions match!
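The by-eye comparison can also be made programmatic. Here is a quick numpy check on the two omega columns printed above; they agree to the precision the IDL run prints:

```python
import numpy as np

# omega values from the IDL run (6 significant figures, copied from above)
idl_w = np.array([-0.115686, -0.104478, -0.120804, -0.0914845, -0.0393971,
                  -0.0268416, 0.0134841, 0.0596094, 0.0227162, 0.102554,
                  0.0929233, 0.0900670, 0.0591398, 0.0284724, 0.0598689])
# crossw values from the python/C run (copied from above)
c_w = np.array([-0.1156862745, -0.1044776119, -0.1208039566, -0.0914845135,
                -0.0393970538, -0.0268417043, 0.0134841235, 0.0596093513,
                0.0227161938, 0.1025539385, 0.0929232804, 0.0900670231,
                0.0591397849, 0.0284723490, 0.0598689436])
# agreement to within the rounding of the IDL printout
assert np.allclose(idl_w, c_w, atol=1e-6)
```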

Wednesday, March 2, 2011

Mask Help From Martin White


Hi Jessica,

I don't remember exactly what I did to make the mask you're using, but those plots look more or less right; I'm not sure what you were keying on. In addition to dropping regions below 75% completeness, I also threw away regions whose area was too small to reliably estimate a completeness -- that leads to small "gaps" in the mask. And the mask was made on the plates observed up until June (I think) last year and only included the higher-z (CMASS) sample. Which part of the zoom-in plot is worrying you? I assume that's the one you're worried about?

Of course, those files are very out of date now compared to the current situation and should probably be redone -- quite a lot has changed since then.

Martin.


~~~~~~~~
Martin,
The actual concern stems from the fact that the ra/dec distribution of the randoms I generate using this mask does not match (within expected noise) the distribution of the data. This caused me to wonder if perhaps I am applying the mask incorrectly. This is my first time working with mangle.

Here are plots of the masked data and masked randoms along with the distribution histograms. Note that both of these data sets have the same number of objects: 139874. In these plots the masked BOSS data is white and the masked randoms are green.

You'll notice that there are ra/dec regions where the randoms or the data are overdense by far more than you would expect from Poisson noise. I've shown this by binning both coarsely (50 bins) and finely (1000 bins); in the fine binning you can see what the noise levels are.

[plots of masked data and masked randoms with ra/dec histograms]
The way I am generating these randoms is as follows (code is at bottom of email):
1) Read in the BOSS mask
2) Generate a set of random ra and dec values
3) Check whether each random ra/dec is in the mask
4) Cut the set of randoms to only the items inside the mask
5) Get the polygon weight for each random in the mask
6) Generate a random "test weight" (a value between 0 and 1) for each random
7) If the "test weight" is less than or equal to the polygon weight for that random, keep the random
8) Otherwise throw out the random; this ensures that the completeness of the randoms matches that of the BOSS data
9) Repeat the above process until you get the desired number of randoms

I've talked to both David and Shirley about this and they seem to think this is the correct way to generate the randoms. So I am confused as to why the distributions don't match. I would expect the clustering of the BOSS data to appear on much smaller scales than the bin sizes of the attached histograms.

Any insight you have on this problem would be wonderful.

Also, it would be useful to have a mask for the current BOSS data set, as you said the one I have is quite out of date. Is this mask continuously generated? Where could I download it? Or do you need to do it manually?

Thank you again for any ideas you have.
Jessica


Here is the code to generate the randoms:
; Read in BOSS Mask

bossmask = "/clusterfs/riemann/raid001/jessica/boss/bossX.002.ply"
read_mangle_polygons, bossmask, polygons, id

;Ignore mangle polygons where the pixel weight is zero
bosspolygons = polygons[where(polygons.weight GT 0)]

;Generate a set of random ra/dec:
randomdec = 90.0-acos(RANDOMU(S, 10*randomsize)*2.0-1.0)*180./!pi
randomra = RANDOMU(S, 10*randomsize)*360.

;Select objects in mask
results = is_in_window(ra=randomra,dec=randomdec,bosspolygons,in_polygon=polynum)
inBossMask = where(results EQ 1)
polys = polynum[inBossMask]

;Get weights of the randoms
randomweights = bosspolygons[polys].weight

;Cut the Ra/Dec such that it is in the BOSS Mask
rsra = randomra[inBossMask]
rsdec = randomdec[inBossMask]

; Keep or throw out a random based on the weight of the pixel it is generated in:
; Assign a random test weight, only keep randoms whose test weight
; is less than or equal to the weight of that pixel (randomweights).
; So if the weight of a pixel is 0.8 then 80% of randoms generated in that pixel will be kept.
weighttest = RANDOMU(S, n_elements(randomweights))
isobserved = where(weighttest LE randomweights)

;Randoms inside mask, weighted properly
randomra = rsra[isobserved]
randomdec = rsdec[isobserved]

thisra = randomra
thisdec = randomdec

(repeat until you get the desired number of randoms)
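For reference, the rejection loop above can be sketched in Python. This is a stand-in only: `polyid_and_weight` is a hypothetical callback playing the role of the mangle lookup, not a real mangle API.

```python
import numpy as np

def generate_randoms(n_target, polyid_and_weight, rng=None):
    """Rejection-sample mask-weighted randoms (sketch of the steps above).

    `polyid_and_weight(ra, dec)` is a hypothetical stand-in for the
    mangle lookup: it returns (polygon ids, completeness weights),
    with weight 0 for points outside the mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    kept_ra, kept_dec = [], []
    n_kept = 0
    while n_kept < n_target:
        n = 10*n_target                       # oversample each pass
        ra = rng.uniform(0.0, 360.0, n)
        # uniform on the sphere: dec = 90 - acos(2u - 1), in degrees
        dec = 90.0 - np.degrees(np.arccos(rng.uniform(-1.0, 1.0, n)))
        _, w = polyid_and_weight(ra, dec)
        # keep a random with probability equal to its polygon's completeness
        keep = (w > 0) & (rng.uniform(0.0, 1.0, n) <= w)
        kept_ra.append(ra[keep]); kept_dec.append(dec[keep])
        n_kept += keep.sum()
    return np.concatenate(kept_ra)[:n_target], np.concatenate(kept_dec)[:n_target]
```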

----------

Jessica,

Ok, I'm not sure that I understand that and I don't know much (i.e. anything) about the IDL versions of Mangle routines. We can try a few spot tests quickly though. First, a more up-to-date mask. I don't have the latest mask (Will has taken over generating those) but I have one that's fairly close. I put it on Riemann in the BOSS subdirectory of my home directory:

/home/mwhite/BOSS/bossX.004.ply

For that file I just generated a bunch of random points and asked what the polygon ID and weights were. See if your code agrees with this:

# RA DEC PolyID Weight
120.22085116 44.18548740 2395 0.8810
115.07126735 41.20294362 2276 0.0000
111.08122943 41.50788940 2458 0.9365
122.77802060 40.69059229 2520 0.9615
120.27949275 40.36484626 2501 0.8905
110.35797556 40.43283558 2453 0.0000
125.63829796 40.93875235 2542 0.9833
117.66991611 40.36353309 2324 0.7857
112.57187913 41.52219305 2452 0.9158
123.30608484 40.38555016 2522 0.9130

If you agree then I suspect everything is fine. If not, we can figure out what the differences are and try to fix them.

By the way ... there is a command ("ransack" I think) for generating random points which is part of the Mangle software package. And Will is supplying random catalogs now as part of his catalog efforts. They are available off the LSS pages on the BOSS Wiki.

Martin

-----

Martin,
Thank you for helping me debug this.

I get the same numbers as you below for polyID and weight. I guess I'll try using Will's random catalogs and see if that helps things.

I appreciate your time.
Jessica

----

Here is the code I used for the test:

testra = [120.22085116, 115.07126735, 111.08122943, 122.77802060, 120.27949275, 110.35797556, 125.63829796, 117.66991611, 112.57187913, 123.30608484]
testdec = [44.18548740, 41.20294362, 41.50788940, 40.69059229, 40.36484626, 40.43283558, 40.93875235, 40.36353309, 41.52219305, 40.38555016]
testpolyID = [2395, 2276, 2458, 2520, 2501, 2453, 2542, 2324, 2452, 2522]
testweight = [0.8810, 0.0000, 0.9365, 0.9615, 0.8905, 0.0000, 0.9833, 0.7857, 0.9158, 0.9130]

bossmask = "/clusterfs/riemann/raid001/jessica/boss/bossX.004.ply"
read_mangle_polygons, bossmask, polygons, id
results = is_in_window(ra=testra,dec=testdec,polygons,in_polygon=polynum)
weight = polygons[polynum].weight
polyID = id[polynum]

Friday, June 25, 2010

Coding Wisdom from David Hogg

David Hogg came to Scicoder to talk to us about Test-Driven Coding. He is pretty awesome.

Testing Your Code
1 bug / 1000 lines of code.

Ways to deal with this:
1) Test-Driven Programming
2) Open-Source
3) Do Science with Code

Extreme Coding
1) Pair coding (coding together)
2) Stand up meetings (don't talk -- do)
3) Test-Driven Programming (see below)
4) Minimal Implementation (re-factor frequently) only code things that you are actually going to use.

Test-Driven Programming

1) First write a test function to perform all possible tests you can imagine for your function.
2) Then write the function and see how many of the tests it passes...
3) Keep modifying your function until it passes all the tests
4) You can't use the code or check it into the repository unless it passes all tests
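A toy Python illustration of this test-first loop (the function and its tests are hypothetical, not Hogg's examples):

```python
import math

# Step 1: write the tests first, for a function that doesn't exist yet.
def test_chord_separation():
    assert chord_separation(0.0, 100.0) == 0.0   # zero angle -> zero separation
    # small-angle limit: s ~ theta(rad) * D
    assert abs(chord_separation(0.1, 100.0) - math.radians(0.1)*100.0) < 1e-4
    assert abs(chord_separation(180.0, 100.0) - 200.0) < 1e-9  # 180 deg -> diameter

# Steps 2-3: write (and keep fixing) the function until the tests pass.
def chord_separation(theta_deg, dist):
    """Chord length at distance `dist` subtending an angle of `theta_deg` degrees."""
    return 2.0*math.sin(math.radians(theta_deg)/2.0)*dist

test_chord_separation()   # step 4: only commit once this runs cleanly
```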

Functional Testing
1) Generate fake data + garbage
2) Run code
3) Did you get back what you put in?
4) Put assert statements in your code, such that if these assertions fail your code fails and spits out an error.
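A minimal sketch of this functional-testing loop in Python (toy estimator and made-up numbers, purely for illustration):

```python
import random

def estimate_mean(xs):
    """Toy 'science code' whose answer we can check against the input."""
    return sum(xs)/len(xs)

# 1) Generate fake data with a known answer
random.seed(0)
fake = [random.gauss(5.0, 1.0) for _ in range(10000)]
# 2) Run the code and 3) check that you get back what you put in
recovered = estimate_mean(fake)
# 4) Assert, so the code fails loudly if the check fails
assert abs(recovered - 5.0) < 0.1, "estimator did not recover the input mean"
```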

You have to be careful to generate fake data that covers the entire spectrum of properties of your real data --- otherwise you will not be testing all cases or making all possible assertions.


But at the end of the day... what we really care about is this:
How do I know that my results are correct? ← (Probably not answerable)
or in real life...
Why do I think that my student's results are correct?
- You do science with real data.
- You get results that you think you can publish and people will believe you (seriously this is what he said).

The Blanton et al. luminosity function paper was originally a test by Mike Blanton to see if the SDSS software pipeline was working, and it turned into one of the most highly cited papers out of the Sloan Survey. Thus testing really does matter.

On a side note, a link to my blog made it into Hogg's blog, and thus the circle is now complete.