These are a version of the notes I made for a group seminar at ICTS. I am posting them here for anyone interested, or looking for more detailed documentation of the astroML correlation functions.

What are BAO and the BAO Peak?

Say there is a big box far away from you, and you know its actual dimensions. You can then calculate the distance to the box from its apparent angular size.
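In the small-angle approximation this is just trigonometry: an object of known physical size $L$ that subtends an angle $\theta$ (in radians) sits at a distance

$d \approx \frac{L}{\theta}$,

so a "standard ruler" of known size plus a measured angle yields a distance.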

A similar trick can be applied in cosmology to measure distances to galaxies, but there is one slight caveat - spacetime is curved. This curvature acts like a lens, distorting the images we observe. So if an image appears small, the object could really be far away, or the lens could have distorted the image to make a nearby object look small.

We therefore need a distance ruler that is not affected by such effects, which means we need an object of known size at a particular redshift $z$, or a population of objects whose size we know as a function of redshift. Baryon Acoustic Oscillations (BAO) accomplish both of these for us. Given a certain redshift, one could calculate the correlation function between galaxies at that redshift, and we would expect a peak at some length scale. Similarly, we can calculate the correlation function in redshift space and expect to find a peak.
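To be concrete, the spatial two-point correlation function $\xi(r)$ is defined through the joint probability of finding galaxies in two volume elements $dV_1$ and $dV_2$ separated by a distance $r$,

$dP = \bar{n}^2 \left[1 + \xi(r)\right] dV_1 \, dV_2$,

where $\bar{n}$ is the mean number density. $\xi(r) > 0$ means an excess of pairs over a uniform distribution at that separation; the BAO feature is a localized bump in $\xi(r)$.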

[Figure: the two-point correlation function in redshift space (redshift converted to comoving distance). The BAO peak is visible around 100 Mpc.]

But why do we expect to see a BAO peak at all? Imagine an overdense region in the early universe. This region pulls matter towards it through gravity, but there is also an outward pressure from the photons coupled to the baryonic matter. This competition sets up oscillations, like sound waves in air, which propagate outwards.

But after a point, electrons and protons combine to form hydrogen atoms, and the photons 'decouple' from the matter. The photons then free-stream, eventually reaching us after billions of years as the cosmic microwave background (CMB). The sound wave, meanwhile, stalls: with no photon pressure left to drive it, a fair portion of matter is left in a shell at a fixed comoving radius (the sound horizon at decoupling). This excess is what we detect as the BAO peak.

The Landy-Szalay Estimator for the two-point angular correlation function

tl;dr - Stephen Landy and Alexander Szalay, in their 1993 paper, showed that the best estimator for the two-point angular correlation function (2pacf) is

$w(\theta) = \frac{DD-2DR+RR}{RR}$.

The 2pacf $w(\theta)$ is the relative probability of finding a pair of galaxies separated by a certain angular distance, with respect to that for a uniform distribution. Estimators for $w(\theta)$ are generally built out of $DD$, $DR$ and $RR$ (data-data, data-random and random-random pair counts), all suitably normalized with respect to the geometry from which the data are taken.

There are a lot of calculations in the paper, such as computing the variance and bias of the estimator and comparing it to other estimators. But the takeaway formula from the paper, for implementation in code, is the following:

$1 + w(\theta) = d - 2x + 2$

where $d = \frac{2DD}{G_p(\theta)\,n(n-1)}$ and $x = \frac{DR}{G_p(\theta)\,n n_r}$. Here $n$ and $n_r$ are the numbers of data and random points, $DD$ and $DR$ are the raw pair counts in the bin at $\theta$, and $G_p(\theta)$ is the normalized probability that a randomly chosen pair of positions in the survey geometry is separated by $\theta$.
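As a minimal sketch of this takeaway formula (the function name and the convention that $DD$ and $DR$ are raw per-bin pair counts are mine, not from the paper):

```python
import numpy as np

def w_landy_szalay(DD, DR, n, n_r, Gp):
    """Landy & Szalay (1993) takeaway formula: 1 + w = d - 2x + 2.

    DD, DR : raw (unnormalized) pair counts per angular bin
    n, n_r : numbers of data and random points
    Gp     : G_p(theta) evaluated for each bin
    """
    d = 2.0 * DD / (Gp * n * (n - 1))   # normalized data-data counts
    x = DR / (Gp * n * n_r)             # normalized data-random counts
    return d - 2.0 * x + 1.0            # w(theta) = (d - 2x + 2) - 1
```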

Correlation Functions in astroML

Here, I describe the various functions in astroML.correlation. astroML.correlation (source code here) contains correlation functions written with astronomy/cosmology applications in mind; the package astroML as a whole contains a lot of tools for machine learning applications in astronomy.

two_point(data, bins, method='standard', data_R=None, random_state=None)

This is, in a sense, the primary function of the module. All the other functions end up calling it, so understanding it is the key.

Let's describe each argument that the function takes:

  • data - The dataset for which the correlation function $w(s)$ is to be computed. data can be a one-dimensional or two-dimensional array; a one-dimensional array is converted into a two-dimensional one by adding a redundant axis using numpy.newaxis. Throughout, n_samples, n_features = data.shape.

  • bins - The explicit bin edges over which $w(s)$ is to be calculated. bins has to be a one-dimensional array; len(bins) edges define len(bins) - 1 bins.

  • method - Optional argument. Can take the value 'standard', which uses the estimator $\frac{DD}{RR} - 1$, or 'landy-szalay', which uses the Landy-Szalay estimator $\frac{DD-2DR+RR}{RR}$. We shall use method='landy-szalay' for all our purposes.

  • data_R - Optional argument. The random dataset to be used in the estimator. data_R should have the same n_features as data. If not given, the program uses as data_R a version of data shuffled along all but one axis.

  • random_state - Optional argument. The seed of the random number generator to be used throughout the program.

The program returns an array corr containing the estimate of the correlation function, one value per bin (so its length is len(bins) - 1).
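Before going into the internals, a quick usage sketch (the toy dataset and bin choices here are mine, purely for illustration; for uniform random points the result should scatter around zero):

```python
import numpy as np
from astroML.correlation import two_point

rng = np.random.RandomState(42)
data = rng.random_sample((1000, 3))   # 1000 points in a unit cube
bins = np.linspace(0.02, 0.5, 11)     # 11 edges -> 10 distance bins

# corr has len(bins) - 1 entries, one per bin; NaN where RR == 0
corr = two_point(data, bins, method='landy-szalay', random_state=0)
```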

The following is the algorithmic flow of the program:

  • Check the consistency conditions on data, method, bins, and data_R.

  • Define factor = len(data_R) * 1. / len(data). This will be useful to normalize the estimator later.

  • Create KDTree data structures (from scikit-learn) to represent data and data_R. This data structure allows for fast pair counting.

  • The KDTree data structure has a method two_point_correlation(data_1, bins) which returns the pair counts between the tree's own points and the dataset data_1 for the distances in bins. This means that if we build the KDTree on data and put data_1 = data, we will get counts_DD (not equal to $DD$; explained below) to be used in the estimator. Similarly, if we use data_1 = data_R, we get counts_DR, and so on.

  • An important point to note is that two_point_correlation(data_1, bins) returns the cumulative number of pairs lying within a sphere of radius equal to each distance in bins. This is not what we need - we want the number of pairs lying within a spherical shell around each distance value.

To get the per-bin counts we take np.diff of the cumulative counts; the normalization enters later through factor. With N_D = len(data) and N_R = len(data_R) (so factor = N_R / N_D), we assign

$DD$ = np.diff(counts_DD)

$RR$ = np.diff(counts_RR)

$DR$ = np.diff(counts_DR)

As $RR$ appears in the denominator, we find all bins where $RR$ is zero and set them to 1 to avoid division by zero. Later, we set those locations in the final correlation function array to numpy.nan.

  • The correlation function is now calculated by the Landy-Szalay estimator formula (a condensed sketch of the whole flow follows after the formula):

corr = (factor ** 2 * DD - 2 * factor * DR + RR) / RR
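Putting the steps above together, here is a condensed sketch of the core logic (my paraphrase of the flow described above, not a verbatim copy of the astroML source):

```python
import numpy as np
from sklearn.neighbors import KDTree

def two_point_ls(data, data_R, bins):
    factor = len(data_R) * 1.0 / len(data)

    KDT_D = KDTree(data)      # tree over the data
    KDT_R = KDTree(data_R)    # tree over the random catalog

    # Cumulative pair counts within each radius in `bins`
    counts_DD = KDT_D.two_point_correlation(data, bins)
    counts_DR = KDT_R.two_point_correlation(data, bins)
    counts_RR = KDT_R.two_point_correlation(data_R, bins)

    # np.diff converts "pairs within r" into "pairs in each shell"
    DD = np.diff(counts_DD)
    DR = np.diff(counts_DR)
    RR = np.diff(counts_RR)

    RR_zero = (RR == 0)
    RR[RR_zero] = 1           # avoid division by zero; masked below

    corr = (factor ** 2 * DD - 2 * factor * DR + RR) / RR
    corr[RR_zero] = np.nan
    return corr
```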

bootstrap_two_point(data, bins, method='standard', Nbootstraps=10, return_bootstraps=False, random_state=None)

Bootstrapping is a process used to estimate the error on a measured quantity. It is pretty straightforward: we use one realization of the quantity to be measured as the baseline, and then take the standard deviation (with delta degrees of freedom ddof = 1) over multiple realizations (or 'bootstraps') of the quantity as the error bars on the baseline.
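The idea in isolation, with a toy quantity (the mean of a sample) standing in for the correlation function:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # toy dataset

baseline = x.mean()                             # one realization = baseline

# Resample with replacement and re-estimate, many times
boots = np.array([rng.choice(x, size=x.size, replace=True).mean()
                  for _ in range(100)])

err = boots.std(ddof=1)                         # error bar on the baseline
```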

The two extra arguments in this function, as compared to two_point, are both optional.

  • Nbootstraps gives the number of realizations over which the error is calculated. Nbootstraps $\ge$ 2 is required.

  • return_bootstraps takes value True or False and decides whether the function should return all the bootstraps.

The function returns corr and corr_err, plus all the bootstraps if return_bootstraps=True.
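A usage sketch, reusing data and bins from the earlier example and following the keyword names exactly as in the signature quoted above:

```python
from astroML.correlation import bootstrap_two_point

# Estimate and error bars
corr, corr_err = bootstrap_two_point(data, bins, method='landy-szalay',
                                     Nbootstraps=20)

# With return_bootstraps=True, the individual realizations come back too
corr, corr_err, bootstraps = bootstrap_two_point(
    data, bins, method='landy-szalay',
    Nbootstraps=20, return_bootstraps=True)
```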

two_point_angular(ra, dec, bins, method='standard', random_state=None)

This is the implementation of the angular two-point correlation function in astroML. It merely converts angular coordinates into 3D (x, y, z) coordinates and passes them as data into two_point. The algorithmic flow of this function is as follows:

  • As opposed to data in two_point, two_point_angular takes one-dimensional arrays of Right Ascension ra and Declination dec as input. These are converted into Euclidean (x, y, z) coordinates using the ra_dec_to_xyz(ra, dec) function, and the output is used as data.

  • Notice that there is no argument here corresponding to data_R in two_point. That is because the program generates a uniform distribution of RA, ra_R, in the range [min(ra), max(ra)], and of Dec, dec_R, in the range [min(dec), max(dec)]. The sizes of ra_R and dec_R are twice that of ra (or dec). data_R is then constructed using ra_dec_to_xyz(ra_R, dec_R).

  • The angular bins entered are also transformed into Euclidean (chord) distances. The function then returns the following (a sketch of these conversions appears after the snippet):

return two_point(data, bins_transform, method=method, data_R=data_R, random_state=rng)
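The two conversions amount to placing points on the unit sphere and replacing angles with chord lengths. A sketch (the `_sketch` names are mine; astroML has its own helpers, ra_dec_to_xyz among them):

```python
import numpy as np

def ra_dec_to_xyz_sketch(ra, dec):
    """(x, y, z) on the unit sphere from RA/Dec in degrees."""
    ra, dec = np.radians(ra), np.radians(dec)
    return np.cos(ra) * np.cos(dec), np.sin(ra) * np.cos(dec), np.sin(dec)

def angular_to_euclidean_sketch(theta_deg):
    """Chord length on the unit sphere subtended by angle theta."""
    return 2.0 * np.sin(0.5 * np.radians(theta_deg))
```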

bootstrap_two_point_angular(ra, dec, bins, method='standard', Nbootstraps=10, random_state=None)

This is defined in a similar fashion to the other bootstrap procedure, with the sole exception that, rather than taking a single realization of the estimator as the baseline, we take the baseline to be the mean of all realizations.
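A usage sketch with synthetic coordinates (the toy RA/Dec ranges, the bin choice, and the assumption that the function returns corr, corr_err and the bootstraps are mine):

```python
import numpy as np
from astroML.correlation import bootstrap_two_point_angular

rng = np.random.RandomState(0)
ra = 360 * rng.random_sample(500)         # degrees
dec = -30 + 60 * rng.random_sample(500)   # degrees
bins = np.linspace(0.1, 10, 11)           # angular bins in degrees

corr, corr_err, bootstraps = bootstrap_two_point_angular(
    ra, dec, bins, method='landy-szalay', Nbootstraps=10)
```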

My Comments

  • The factor in two_point is a bit fishy. It could be that this is causing an apparent problem in the normalization.

  • I am not very comfortable with the ddof = 1 used in numpy.std, but at first glance that seems to be a property of the distribution/error estimate (it is the standard Bessel correction for a sample standard deviation).