Below are the resulting 3D correlation functions from the run on the big box (described in the previous blog post):
3D Correlation Functions on large mock data box
The first thing I notice is that they don't match as well as the functions matched on the smaller box. Following up on my concern from yesterday about the distribution of the random declinations:
The red line is the mock data and the blue line is the random data.
As you can see I am not properly simulating the declination randoms. Adam and I actually got in an argument/discussion about this on Tuesday. He was concerned that randomly populating in a ra, dec, redshift mask and then converting to x, y, z, was not the same as randomly populating in x, y, z but constraining the points to be within the mask. The above plot seems to point to this being true. My question is if this is also true with the Sloan data, or is the above an artifact of the fact that this mock data is actually in x, y, z coordinates, and I am applying a conversion/cut to try to simulate what is happening with the data. But with the Sloan data the true coordinate system is ra/dec/redshift, and so I would expect the point to be evenly distributed in that coordinate system, and slightly warped in x, y, z... but maybe this example shows I am wrong about this. Next step is to make these plots for Sloan data.
I am performing several tests to figure out what is wrong with my 3D correlation functions. The first is to run on a bigger mock data set.
I had a discussion with Nic Ross today about the proper way to deal with populating randoms. I was starting to fear that I was doing something wrong in that I force the randoms to have the same distribution in the "redshift" dimension. I was told to do this by David and Nikhil, but wondered if this could be the cause of the problem. Nic assured me that this was the right thing to do, and suggested I read the following papers about correlation functions:
Nic looked at my 3D correlation functions and agrees that there is something fishy about them.
In terms of the bigger run: I am doing the same comparison of Alexia's and my 3D correlation functions that I did yesterday, but on a much bigger box/sphere. My data is contained in a sphere with radius 471.0 Mpc/h (half of the box). Alexia's data is the entire box.
Some plots of the data masks:
I am worried that some geometry effects will come into play when I compare the correlation functions, because when you look at a histogram of the distribution of the declinations, it is not flat. You would expect this, since there are more objects near the center of the sphere, where dec is ~pi/2, than near the top or bottom of the sphere. However, when I populate the randoms I think I populate them uniformly... need to check this.
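To make the concern concrete, here is a minimal sketch of the difference between drawing the angle uniformly and drawing it uniformly per unit solid angle. It assumes "dec" here is a polar angle running from 0 to pi (so the equator sits at pi/2, as in the histogram above). The function names are hypothetical and this is not the code behind the plots.

```python
import math
import random

def polar_angle_flat(rng, theta_min, theta_max):
    """Polar angle drawn uniformly in angle (NOT uniform on the sphere)."""
    return rng.uniform(theta_min, theta_max)

def polar_angle_solid_angle(rng, theta_min, theta_max):
    """Polar angle drawn uniformly per unit solid angle.

    The area element on a sphere is sin(theta) dtheta dphi, so cos(theta)
    must be uniform between cos(theta_max) and cos(theta_min).
    """
    u = rng.uniform(math.cos(theta_max), math.cos(theta_min))
    return math.acos(u)

def frac_near_equator(angles, half_width=0.25):
    """Fraction of angles within half_width radians of pi/2."""
    return sum(abs(a - math.pi / 2) < half_width for a in angles) / len(angles)

rng = random.Random(42)
n = 20000
flat = [polar_angle_flat(rng, 0.0, math.pi) for _ in range(n)]
area = [polar_angle_solid_angle(rng, 0.0, math.pi) for _ in range(n)]

# The solid-angle draw piles up near the equator; the flat draw does not.
print(frac_near_equator(flat), frac_near_equator(area))
```

If the randoms are drawn the first way while the data follow the second, the two declination histograms will disagree in roughly the way described above.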
This takes a while to run because there are 300k objects in each box. Results to follow...
I spent the rest of the day yesterday and all day today trying to reproduce the plot I made on my October 12th post of Alexia's and my matching 3D correlation functions:
My working 3D correlation function!
I figured this would be a good place to start, and then expand from here on the full-reconstruction of the mock-data. However, when I re-ran the code that I *thought* produced this plot, Alexia's correlation function looked very different. I then spent a day and a half trying to reproduce this plot.
Lessons learned by this experience:
I should store the code I use to make the plots for a given day's blog in the same place I store the images for that day's blog in some sort of log file.
Alexia's code (at least the version I have) assumes you are populating the data in a 1x1x1 box. If you put the data into a box smaller than 1x1x1, you need to do something to the randoms, because right now they populate a 1x1x1 box.
If you updated the repository more often you wouldn't get this problem, because you would have older versions of your code saved there. So you need to update the repository every day as a back-up for the log file mentioned above.
And now I have reproduced the plot from above again (and hopefully learned a lesson about note-keeping):
The reason it looks slightly different from above is because I am randomly seeding the random number generator.
I had been planning to buy a new computer for a while now, but I decided yesterday's depression would be as good a reason as any to buy one today. So... I am a proud new owner of a Macbook Pro. For those of you who aren't Mac people, let me just say that Macs are amazing. I've been a big fan for years now. Today I became an even bigger fan when I discovered how easy it is to transfer your old computer's data to your new computer on a Mac. It is really as simple as buying a firewire cable and hooking them up to each other. Everything gets transferred! From your settings, to your bookmarks, to your applications. It is amazing. I only had to re-install one thing. And I was transferring from a computer with Leopard to a computer with Snow-Leopard (different versions of the operating system)! This would have taken a week out of my life if I had been doing it on a Windows machine.
I talked to Alexia today and she agrees that my 3D correlation functions look funky. It is possible that the reason the reconstruction isn't working is because there is still a problem with these functions. The plan is to go back to comparing the two functions with each other and doing a reconstruction on a mock data box (instead of the Sloan data) so that I can continue to compare to Alexia's working function. David thinks it will be necessary to compare my answer to a simulation anyway, so that this exercise isn't a waste of time.
I am so frustrated! I don't know what I am doing wrong. It doesn't match at all. I am having one of those moments where I feel like nothing I do ever works and I am a complete failure as a grad student. I mean, I'm in my 5th year of my PhD and don't have a single paper published, nor a working project. Everything that I am doing that does work is someone else's code (i.e. Alexia or David) and therefore has nothing to do with my talent or skills. Rage Rage Rage!
Here is a histogram of the "redshifts" (converted to units of comoving line-of-sight distance (Gpc/h) away from the observer) of the photometric data set (based on the Sloan photo-z's):
This is what the reconstruction should look like. However, when I do the reconstruction I get the following:
(The green is the reconstruction)
This paints a better picture than I actually have, because:
1) The normalization doesn't work and so I am tuning the normalization to match the answer (can't do this when I don't know the answer).
2) This is with really coarse binning. If I use finer binning (which is what we would ideally want to do), I get worse results:
What is up with the reconstruction going to zero (~1, 1.6, 1.9 Gpc/h)? I am so confused about that. Something that Adam and I discussed last night was that we would hope that this method would work at least as well as simply taking a histogram of the redshifts of the spectroscopic data set. In this case the spectroscopic data is actually from the same data as the photometric set, so the redshift distributions of the two sets are almost identical:
The fact that the reconstruction is significantly worse than this is really disheartening. I give up for today. I'm going to work on likelihood stuff in rebellion! (and post a sad facebook status message so people feel sorry for me and make me feel better)
Here are the angular cross-correlations between the photometric data set and the binned spectroscopic data sets (12 -- one for each bin):
I also calculate the 3D auto-correlation function of spectroscopic data with itself (12 -- one for each bin):
The 3D correlation functions don't look very good to me. I would expect them to have a similar shape to the 2D correlation function, but I actually had to plot these on a normal plot (not log-log) because they were going negative. Alexia -- what do these look like for you on the mock data? I guess I should calculate them on the mock data myself.
In science it seems that we are supposed to somehow organically learn how to use Linux computers. I don't know how people do this. I feel like I know only 15 commands because I use them every day. Everything else I need to do on Linux I either have to ask others to help me with, or I look it up when I need it and then promptly forget. All books on the subject seem to give me more information than I need. Is there a 'Practical Linux for Dummies' book?
However, one of the MANY benefits of dating a computer scientist (for those who haven't tried it, I highly recommend it) is that Adam knows how to do almost anything I would want to do on a computer. I've lost track of how many times he has helped me.
The latest example of this is running a script remotely without the need to stay logged into the computer. I know this is a basic task, but every time I've tried to get someone to show me how to do it, I've never gotten a clear-cut answer.
So this is how I did it (I am writing it here so that when I promptly forget how to do this, it will be easy to remind myself):
1) At the top of your python code (after the import statements) put the following line: if __name__ == '__main__':
2) Indent the rest of your code under that line (as python requires for the body of any function or if block)
3) Make sure your code doesn't have any plotting in it.
4) Save your code as a python script (i.e. code2run.py)
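Steps 1 to 4 amount to a script shaped roughly like the sketch below. The function names, the placeholder computation, and the output filename are all hypothetical. The point is only the layout: imports first, then everything that runs sitting under the `__main__` guard (here wrapped in a `main()` function), with results written to a file instead of plotted.

```python
import math

def compute_results(n_bins):
    """Stand-in for the real calculation; returns one number per bin."""
    return [math.sqrt(i) for i in range(n_bins)]

def save_results(values, path):
    """Write results to a text file so nothing needs a display."""
    with open(path, "w") as handle:
        for value in values:
            handle.write(f"{value:.6f}\n")

def main(path="results.txt"):
    """Run the calculation and save it; plotting happens later, elsewhere."""
    values = compute_results(n_bins=12)
    save_results(values, path)
    return values

if __name__ == '__main__':
    main()
```

Saving numbers to disk and plotting them afterwards on a machine with a display keeps the remote run free of any plotting calls.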
The segmentation fault was due to a stupid hard-coded directory I had in my code. I was writing out the random data points to make sure they were in the same region as the data, but I forgot that I had put in the exact directory. When I uploaded it to riemann and tried to run it, the code was trying to write files to a directory that didn't exist. Silly silly me.
Princeton (and Alexia) makes everything better! It seems I have finally got a working 3D correlation function. I don't know why it took me so long to get this thing to work; it seems like it should be simple enough to do. Anyway, here is a summary of what I've done...
The inputs to the function are a set of mock "data" points in Cartesian coordinates. For Sloan data, these will be converted from ra/dec (spherical coordinates) to Cartesian in python. There are also mask inputs, both in spherical and Cartesian coordinates.
For the mock data I simply applied a mask cut on ra/dec/redshift and then converted those points back to x/y/z:
The mask in ra/dec/redshift is a contiguous box.
Converted to x/y/z
The mask has the minimum and maximum values that the data can fall in for each coordinate. For example in Cartesian coordinates the mask contains:
minX = 173.568011004
maxX = 449.984440618
minY = -289.251614958
maxY = 289.251614958
minZ = -190.178217783
maxZ = 190.178217783
The data is then scaled down to a 1x1x1 box (this is what the Alexia/Martin correlation function calculation code is expecting). This is done by doing the following to each dimension of each data point (i):
where dataX[i] is the x value of the ith data point, minX is the minimum value that x can be (from the mask), padding is how much padding you want around the edge of your data (this prevents power from being "wrapped" around as the correlation calculation uses periodic boundary conditions), rmax is the maximum distance you are calculating the correlation function out to, and maxBoxside is the length of the longest side of the databox [i.e. max(maxX - minX, maxY - minY, maxZ - minZ)].
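The expression itself isn't spelled out in text here, so the sketch below shows just one plausible form consistent with the variable names described above, not necessarily the exact one used. It assumes `padding` is a multiplier on `rmax` (so the empty margin on each side is `padding * rmax`) and that each coordinate is shifted to start at the margin before dividing by the padded side length. The function name is hypothetical.

```python
def scale_to_unit_box(values, min_value, max_boxside, rmax, padding):
    """Map one coordinate into [0, 1] with an empty margin on each side.

    ASSUMPTION: margin = padding * rmax, applied on both sides of the
    longest box side. The same max_boxside is used for x, y and z so that
    distances are scaled identically in every dimension.
    """
    margin = padding * rmax
    side = max_boxside + 2.0 * margin
    return [(v - min_value + margin) / side for v in values]

# Mask limits quoted above.
min_x, max_x = 173.568011004, 449.984440618
min_y, max_y = -289.251614958, 289.251614958
min_z, max_z = -190.178217783, 190.178217783
max_boxside = max(max_x - min_x, max_y - min_y, max_z - min_z)

# rmax and padding are illustrative values, not the ones used in the run.
rmax, padding = 20.0, 1.0
xs = scale_to_unit_box([min_x, max_x], min_x, max_boxside, rmax, padding)
ys = scale_to_unit_box([min_y, max_y], min_y, max_boxside, rmax, padding)
print(xs, ys)
```

With a scaling like this, the correlation function bins would also have to be divided by the same padded side length, so that the bins and the points are in the same units.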
Because the data falls within a contiguous ra/dec/redshift mask, I populate the randoms in the ra/dec/redshift mask and then convert them to x/y/z using the same conversion as I do on the data. I then apply the same scaling (as described in the previous paragraph) to the x/y/z randoms. The result is data which falls on top of randoms and is contained in a padded 1x1x1 box:
As you can see all the data falls between 0 and 1 and the data falls on top of the randoms.
It is hard to see from the above plots but I would also like the redshift distribution of the randoms to follow that of the data. This is done by binning the data into redshift bins (20 in the example I am plotting here) and then for every data point in a particular bin, I generate 10 random points in the same bin.
Histogram of the number of points in each redshift bin. I multiplied the data by 10 so that the scale is the same as the randoms.
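The binning scheme described above can be sketched as follows. This is a minimal illustration with hypothetical function names and a made-up redshift distribution. It assumes the randoms are drawn uniformly within each bin, and it only handles the redshift coordinate (ra and dec would be drawn separately).

```python
import random

def randoms_matching_redshift(data_z, z_min, z_max, n_bins=20, per_point=10, seed=None):
    """For every data point in a redshift bin, draw per_point randoms in that bin.

    ASSUMPTION: randoms are uniform within each bin, so the random histogram
    is per_point times the data histogram, bin by bin.
    """
    rng = random.Random(seed)
    width = (z_max - z_min) / n_bins
    random_z = []
    for z in data_z:
        # Clamp so that z == z_max lands in the last bin.
        i = min(int((z - z_min) / width), n_bins - 1)
        lo = z_min + i * width
        for _ in range(per_point):
            random_z.append(rng.uniform(lo, lo + width))
    return random_z

def histogram(values, z_min, z_max, n_bins):
    """Counts per equal-width bin, using the same binning as above."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - z_min) / width), n_bins - 1)] += 1
    return counts

rng = random.Random(0)
# Made-up "data" redshifts, only to exercise the function.
data_z = [rng.triangular(0.1, 0.5, 0.4) for _ in range(1000)]
rand_z = randoms_matching_redshift(data_z, 0.1, 0.5, seed=1)
data_hist = histogram(data_z, 0.1, 0.5, 20)
rand_hist = histogram(rand_z, 0.1, 0.5, 20)
print(data_hist)
print(rand_hist)
```

Finer bins make the randoms track the data more closely, at the cost of also copying more of the data's noise into the randoms.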
Once I have both the data and the randoms in a padded 1x1x1 box (making sure the randoms follow the same redshift distribution as the data), then the 3D correlation function can be calculated. This calculation is done in Cartesian coordinates, but this shouldn't matter because we are just looking at distances of points from each other and so as long as each side of the box is in the same units we are good.
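For reference, here is a brute-force sketch of a pair-count correlation function on points in a unit box. It uses the simple "natural" estimator, xi = DD/RR - 1 with pair counts normalised by the number of pairs. That choice is an assumption for illustration only: the Alexia/Martin code may use a different estimator, and this sketch ignores periodic boundary conditions. The function names and test points are hypothetical.

```python
import math
import random

def pair_counts(points, edges):
    """Count unique pairs of points whose separation falls in each bin."""
    counts = [0] * (len(edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            for k in range(len(counts)):
                if edges[k] <= r < edges[k + 1]:
                    counts[k] += 1
                    break
    return counts

def xi_natural(data, randoms, edges):
    """Natural estimator: xi = (DD / RR) * (Nr (Nr - 1)) / (Nd (Nd - 1)) - 1."""
    dd = pair_counts(data, edges)
    rr = pair_counts(randoms, edges)
    nd, nr = len(data), len(randoms)
    norm = (nr * (nr - 1)) / (nd * (nd - 1))
    return [norm * d / r - 1.0 if r > 0 else float("nan") for d, r in zip(dd, rr)]

rng = random.Random(3)

def uniform_points(n):
    """Unclustered points in the unit box, so xi should scatter around zero."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]

edges = [0.05, 0.10, 0.15, 0.20]  # separations in box units
xi = xi_natural(uniform_points(300), uniform_points(600), edges)
print(xi)
```

Because only separations enter, the estimator does not care which coordinate system the points came from, as long as all three axes share the same units.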
Here is a comparison of my 3D correlation function with Alexia's working 3D function. The reason they don't fall exactly on top of each other is that Alexia's is calculated on different data points (in the same mock catalog), because hers requires a Cartesian mask for the data:
My working 3D correlation function!
Now I get to run it on the Sloan data and see if the reconstruction still fails or if that fixes my problem. If it works, I am done with my PhD thesis (well not really, but it would be huge progress). Let's keep our fingers crossed!
Oh it has been a wonderful day indeed. Alexia makes everything work. I love the IAS, I am so much more productive here (I should just move here).
I've implemented the third bullet point from my posting from earlier today. The distribution of the redshifts of my random data points is now the same as the redshift distribution of the data points:
I multiplied the number of data points by 10 so that they match in scale to the randoms (there are 10x as many randoms as data). This doesn't seem to improve the matching of the correlation functions that much however:
Very productive day, and a great way to end the week. Happy weekend everyone!
I ran into David Weinberg at tea at the IAS today. We started talking about the Newman Project. He suggested that after we get the method working on Stripe 82 using the LRGs as our "spectroscopic sample" and the rest of the galaxies as our "photometric sample" we could compare the reconstructed distribution with spectroscopic "follow-up" surveys such as COSMOS, Vimas?, and DEEP.
Then it would be interesting to break down the main sample (i.e. the rest of the galaxies) by colors and do redshift distribution reconstruction on each type of galaxy. We could also break these into photometric redshift slices and then look at the reconstruction in each slice and break those down by colors, to see if perhaps there is a different spread in the colors.
I thought these were really good ideas and ones that I hadn't thought of before. Need to talk to Schlegel/Nikhil and see what they think. Oh exciting!
After banging my head against the wall for an hour (and talking to Alexia) I figured out the problem with my mask/conversion from yesterday. I was using a definition I found here to do the conversion from ra, dec, redshift to x, y, z for the Sloan data. However, on the mock data I was using a simple coordinate change from spherical to Cartesian, as found in most physics textbooks like here. This was because I don't actually need to get redshifts (we want ra, dec and comoving distance) and therefore the conversion is less complicated. I had changed this function in my python code, but not in the correlation function code (which is in C), and so one set of code was doing one conversion and the other was doing something different.
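For reference, the textbook spherical-to-Cartesian change looks like the sketch below. The function names are hypothetical, and the ra/dec wrapper assumes angles in degrees with dec measured from the equator. Whether the actual code uses that convention or a polar angle measured from the pole isn't stated here.

```python
import math

def spherical_to_cartesian(r, theta, phi):
    """Textbook (physics) convention: theta is the polar angle from +z,
    phi is the azimuth from +x, both in radians."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return x, y, z

def radec_to_cartesian(r, ra_deg, dec_deg):
    """ASSUMPTION: ra/dec in degrees with dec measured from the equator,
    so the polar angle is theta = 90 deg - dec and the azimuth is ra."""
    theta = math.radians(90.0 - dec_deg)
    phi = math.radians(ra_deg)
    return spherical_to_cartesian(r, theta, phi)

# Illustrative point: comoving distance 450, ra = 30 deg, dec = 10 deg.
x, y, z = radec_to_cartesian(450.0, 30.0, 10.0)
print(x, y, z, math.sqrt(x * x + y * y + z * z))
```

Whichever convention is chosen, the python side and the C side need to use the same one, which is exactly the mismatch described above.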
I also had an issue that I was putting the data into a box of dimensions 1x1x1 (because this is what the correlation function was expecting) but had forgotten to do the same thing to the randoms. This is probably what was causing the correlation function to be way off.
I have corrected both of these problems, and now my random data falls in the same region as my masked data. Woo hoo!
The correlation function matches (sort of). There seems to be an issue with the boxsides matching up, which I need to figure out:
If I divide my correlation function binning by the size of my box then they match up pretty well. I think this comes down to the fact that Alexia's code thinks the boxside is 1Mpc, and my code thinks the boxside is ~1000Mpc. I need to fix this in the code itself, but for now this makes me happy that they seem to match up well:
Remaining Problems/To dos:
You'll notice that the random data has negative values. This is probably due to the randoms covering a slightly larger area than the actual data, and thus when I try to fit it into a 1x1x1 box (whose sides are determined by the actual data) it spills out slightly and thus goes slightly negative. This will cause problems when calculating the correlation function because the calculation expects the data to be in a 1x1x1 box which goes from 0 to 1. I need to fix this in both the randoms and the data, perhaps by shifting them both slightly?
The x-axes of the correlation functions aren't matching. This is probably due to different versions of the code disagreeing about the size of the box. This needs to be corrected.
I am currently populating the random redshifts uniformly within the redshift range. However, because the density of objects evolves with redshift, I should actually be populating the random data with the same redshift function as the data. I had this implemented in an older version of the code, but it was never tested, so now I need to add that back into the code here.
I am back to working on the Newman project. I decided to follow the idea from my Bad Blogger posting: take a chunk of mock data, apply a mask in ra, dec, and comoving distance, then convert that data into x, y, z and feed it into my code (which is what I will be doing with the Sloan data).
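A mask cut like this can be sketched as follows. The function names, mask limits, and points are illustrative only, and the sketch assumes the observer sits at the origin of the mock box with dec measured from the equator.

```python
import math

def cartesian_to_radec(x, y, z):
    """Inverse of the usual conversion; returns (ra_deg, dec_deg, r).

    ASSUMPTION: observer at the origin, dec measured from the equator,
    ra wrapped into [0, 360).
    """
    r = math.sqrt(x * x + y * y + z * z)
    ra = math.degrees(math.atan2(y, x)) % 360.0
    dec = math.degrees(math.asin(z / r)) if r > 0 else 0.0
    return ra, dec, r

def in_mask(point, ra_range, dec_range, r_range):
    """True if the point lies inside a contiguous ra/dec/distance box."""
    ra, dec, r = cartesian_to_radec(*point)
    return (ra_range[0] <= ra <= ra_range[1]
            and dec_range[0] <= dec <= dec_range[1]
            and r_range[0] <= r <= r_range[1])

# Illustrative mask limits and points, not the ones used for the plots.
points = [
    (300.0, 50.0, 20.0),    # inside the mask
    (300.0, -50.0, 20.0),   # ra wraps to ~350 deg, outside
    (100.0, 10.0, 5.0),     # too close
    (350.0, 100.0, 300.0),  # too far
]
masked = [p for p in points if in_mask(p, (0.0, 40.0), (-25.0, 25.0), (200.0, 450.0))]
print(masked)
```

A mask that straddles ra = 0 would need the ra test adjusted, since the wrapped values jump from just under 360 back to 0.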
Here is my masked data:
ra and dec mask; ra and comoving distance mask
These masks are funny shapes in x, y, and z (as to be expected):
Because the data is contiguous in ra, dec, comoving coordinates, I make the randoms in these coordinates, and when I plot the data and the randoms you can see they fall in the same regions:
However when I translate these randoms into Cartesian coordinates -- using the same algorithm I used to create the data -- I get the following problem:
This is very confusing to me. I must be doing something wrong in the conversion, but I've checked this several times, so I don't know why it would be different now. AAAAAAAHHHHH.
Today there was a meeting at LBNL to introduce people to the BOSS Data pipeline. Some useful links (apologies, those not part of the SDSS collaboration won't be able to view these without membership):
Photometry information: http://www.sdss.org/dr7/algorithms/photometry.html (photometry will be essentially the same as SDSS-II)
I had a talk with Eric Huff today about the Newman Project. He is interested in possibly using a similar technique on background galaxies and the foreground reconstructed mass distribution (from weak lensing), using these two sets as the "photometric" and "spectroscopic" data sets, as a possible way to dig out errors in the redshift distribution of the background galaxies. It is an interesting idea; however, because the two sets do not overlap in redshift space I am not sure if we can do this.
He was also asking about some way to use the photo-z information of the objects in my photometric data in this method, to somehow improve on the photo-zs instead of disregarding that information entirely. I was thinking that perhaps we could use the photo-z information in the binning somehow. Or take the photo-z distribution as a starting place for reconstruction. This might be something to think about if the reconstruction continues to not work on its own.
I got the idea for this blog from David Hogg (through Alexia Schulz pointing his blog out to me). The idea is to be accountable to do research every day, and to briefly write up what I have done here.
1) I must post regularly.
2) I must write only about research, no personal stuff, no administrative work, no excuses.
3) I must actually tell people about this blog, so that I am accountable to someone.