These are actually a version of the notes I made up for a group seminar at ICTS. Just posting it here for anyone interested, or looking for more detailed documentation of the astroML correlation functions.
What is BAO and BAO Peak?
Say there is a big box far away from you. You know the actual dimensions of the box. You can then calculate the distance to the box by knowing its apparent angular size.
A similar trick can be applied in cosmology to measure distances to galaxies, but there is one slight caveat - spacetime is curved. This curvature acts like a lens, distorting the images produced. What that means is - if we notice that the image appears small, it could be far away, or it could be that the lens distorted the image to make it look small when it actually is nearby.
We then need a distance ruler which is not affected by such effects, which means we need an object of known size for a particular redshift $z$, or a population of objects whose redshift variation of size we know. Baryon Acoustic Oscillations (BAO) accomplish both these things for us. One could calculate, given a certain redshift, the correlation function between galaxies at that redshift, and we would expect there to be a peak at some length scales. Similarly, we can calculate the correlation function in the redshift space, and expect to find a peak.

The above figure shows the 2-point correlation function in the redshift space (redshift converted to comoving distance).
But why do we expect to see a BAO peak at all? Imagine an overdense region in the early universe. This region will pull matter towards it due to its gravity, but there will also be an outward pressure due to radiation-matter interaction. This competition creates oscillations like the sound waves produced in air, and these oscillations propagate outwards.
But after a point, electrons and protons combine to form the Hydrogen atom, and the photons are ‘decoupled’ from the matter. This causes the photons to freestream, and eventually reach us after millions and bliions of years as the cosmic microwave background (CMB). This means that a fair portion of matter is left at a fixed radius. This excess is what we detect as the BAO peak.
The Landy-Szalay Estimator for the two-point angular correlation function
tl;dr - Stephen Landy and Alexander Szalay in their 1993 paper said that the best estimator for 2pacf is
$w(\theta) = \frac{DD-2DR+RR}{RR}$.
The 2pacf $w(\theta)$ is the relative probability of finding a pair of galaxies separated by a certain angular distance with respect to that for an uniform distribution. The estimators for for $w(\theta)$ are generally built out of $DD$, $DR$ and $RR$, all suitably normalized with regard to the geometry from which data is taken.
There are lot of calculations mentioned in the paper, like calculating variance and bias of the estimator and comparing it to other estimators. But the takeaway formula from the paper to implement code is the following :-
$1 + w(\theta) = d - 2x + 2$
where $d = \frac{2DD}{G_p(\theta)n(n-1)} $ and $x = \frac{DR}{G_p(\theta)nn_r}$
Correlation Functions in astroML
Here, I describe various functions in astroML.correlation. astroML.correlation (source code here contains correlation functions written with astronomy/cosmology applications in mind. The package astroML contains a lot of tools that can be used for machine learning applications in astronomy.
two_point(data, bins, method='standard', data_R=None, random_state=None)
This is the primary function of the module in a sense. All the other functions involve calls of this function, so understanding this is the key.
Lets describe each argument that the function takes :-
- 
    data- The dataset for which correlation function $w(s)$ has to be found.datacan be a one-dimensional or two-dimensional array. Ifdatais a one-dimensional array, it is converted into a two-dimensional array by adding a redundant axis usingnumpy.newaxis. Also,n_samples, n_features = data.shape
- 
    bins- The explicit values of the bins over which $w(s)$ has to be calculated.binshas to be a one-dimensional array.
- 
    method- Optional argument. Can take valuestandardwhich will use estimator $\frac{DD}{RR} - 1$, orlandy-szalaywhich will use the Landy-Szalay estimator $\frac{DD-2DR+RR}{RR}$. We shall usemethod=landy-szalayfor all our purposes.
- 
    data_R- Optional argument. The random dataset to be used in the estimator. Thedata_Rshould have samen_featuresasdata. If not given, the program chooses a version ofdatashuffled along all but one axis asdata_R.
- 
    random_state- Optional argument. The seed of the random number generator to be used throughout the program.
The program returns array corr containing the estimate of the correlation function.
The following is the alogrithmic flow of the program :-
- 
    Check for consistency conditions of data,method,bins, anddata_R.
- 
    Define factor = len(data_R) * 1. / len(data). This will be useful to normalize the estimator later.
- 
    Create the KDTreedata structure (present inscikit-learn) to representdataanddata_R. This data structure allows for fast computation of correlation functions.
- 
    KDTreedata structure has an attribute calledtwo_point_correlation(data_1,bins)which provides the correlation between theKDTreedata and the datasetdata_1for the distances inbins. This means if we use theKDTreedata structure corresponding todataand putdata_1 = data, we will getcounts_DD(not equal to $DD$; explained below) to be used in the estimator. Similarly, if we usedata_1 = data_R, we getcounts_DR, and so on.
- 
    An important point to note is that two_point_correlation(data_1,bins)returns the total number of galaxy pairs lying within a sphere of radius equal to the distance given inbins. This is not what we need - we would like to have the number of galaxy pairs lying within a spherical shell of some thickness around the distance value.
To get $DD$ we just take np.diff(counts_DD) and normalize it by the total number of pairs. With N_D = len(data) and N_R = len(data_R), we assign
$DD$ = np.diff(counts_DD)
$RR$ = np.diff(counts_RR)
$DR$ = np.diff(counts_DR)
As $RR$ appears in the denominator, we find all places where $RR$ is zero and set it to 1 to avoid problems. Later we will set all such locations in the final correlation function array to numpy.nan.
- The correlation function is now calculated by the formula (Landy-Szalay Estimator)
corr = (factor ** 2 * DD - 2 * factor * DR + RR) / RR
bootstrap_two_point(data, bins, method='standard',Nbootstraps=10, return_bootstraps=False, random_state=None)
Bootstraping is a process used to identify errors in the estimation of a particular quantity. It is pretty straightforward - we use one realization of the quantity to be measured as the baseline, and then take the standard deviation (with delta degrees of freedom = 1) over multiple realizations (or ‘bootstraps’) of the quantity as the errorbars on the baseline.
The two extra arguments in this function as compared to two_point are both optional arguments.
- 
    Nbootstrapsgives the the number of realizations over which the error will be calculated.Nbootstraps$\ge$ 2.
- 
    return_bootstrapstakes valueTrueorFalseand decides whether the function should return all the bootstraps.
The function returns corr, corr_err and all the bootstraps (given condition).
two_point_angular(ra, dec, bins, method='standard', random_state=None)
This is an implementation of the angular two-point correlation function in astroML, which merely converts angular coordinates into 3D (x,y,z) coordinates, and passes this as data into two_point. The algorithmic flow of this function is as follows :-
- 
    As opposed to dataintwo_point,two_point_angulartakes one-dimensional arrays of Right Ascensionraand Declinationdecas input.raanddecis converted into euclidean(x,y,z)using thera_dec_to_xyz(ra,dec)function. The output of this function is used asdata
- 
    Notice that there is no argument here which corresponds to data_Rintwo_point. That is because the program generates uniform distributions of RAra_Rin range[min(ra), max(ra)]Decdec_Rin th e range[(min(dec), max(dec)]. The size ofra_Randdec_Ris twice that of thera(ordec).data_Ris constructed usingra_dec_to_xyz(ra_R,dec_R).
- 
    The angular binsentered are also transformed into euclidean distances. The function then returns the follwing
return two_point(data, bins_transform, method=method, data_R=data_R, random_state=rng)
bootstrap_two_point_angular(ra, dec, bins, method='standard',Nbootstraps=10,random_state=None)
This is defined in a similar fashion as the other bootstrap procedure, with the sole exception that, rather than considering a single realization of the estimator as the baseline, we take the baseline to be the mean of all realizations.
My Comments
- 
    The factorintwo_pointis a bit fishy. It could be that this is causing an apparent problem in the normalization.
- 
    I am not very comfortable with the ddof = 1used in the numpy.std, but at first glance, that seems to be a property of the distribution/error