Application

Proposal title

Astronomy Using Unevenly Sampled Data

Organization

OpenAstronomy(StingraySoftware)

Summary

Analyzing unevenly sampled data cannot be done with methods of the Fourier domain. We use the Lomb Scargle method to account for the unevenness of the data. I will be implementing the Lomb Scargle cross spectrum and power spectrum from scratch. Then I will be creating wrapper classes and helper functions that will provide a sensible API to the user. The Lomb Scargle cross spectrum implementation can be optimized by using JAX. Reason for choosing to do the power spectrum from scratch is that it allows us to get a raw PSD to which we can apply stingray’s methods of normalisation and astropy’s implementation does not have that option, and their normalisations are different from the ones used in stingray, and we can use the \(\mathcal{O} (N\log{}N)\) algorithm proposed by Press and Rybicki. There would also be a need of helper functions to create objects of these classes automatically from various input types similar to the existing PSDs. Extra parameter of maximum frequency is to be given as it cannot be inferred from the data as equal to 1/time between two data points. Further the frequency resolution cannot be inferred due to the same reason. The cross spectrum class must be compatible with statistic functions already existing in stingray such as the time lag, phase lag and coherence (which exist only for cross spectra) (if this turns out to not be possible then create new similar statistic functions).

Contributor Information

Blog : https://pupperemeritus.github.io/gsoc_blog/
Name : Sri Guru Datta Pisupati
Email : pupperemeritus9@protonmail.com
Time-zone : +05:30 (IST)
Slack : Sri Guru Datta Pisupati
Github : pupperemeritus
PR Links :

Background

I am a second year undergraduate student of Chaitanya Bharathi Institute of Technology studying Artificial Intelligence and Data Science. I have a lot of hobbies out of which some of them include astrophotography, Guitar(Rock and Pop Trinity Grade 8) and chess. Furthermore, I originally wanted to study a course in Computational Astronomy/Astrophysics as it is the confluence of all my interests but could not do it due to lack of availability of such a course in my country. I am passionate about using, propagating and contributing to open source software. I also like to do astrophotography in my free time, though I cannot really produce great images as I am limited by a Bortle 9 sky and lack of a tracker.

Relevant Experience

Created a discord bot that uses AI to determine the mood of chat messages
Participated in my college’s Hacktober fest and came 5th in my panel.

Interest in Open Astronomy

I’ve always been interested in astronomy. I also delved into astrophotography recently. Furthermore, I originally wanted to study astrophysics/astronomy along with computer science as my major and minor but the institutes in my country neither offer such flexibility nor the courses in one institute. Not only that, but I have done a few courses on Coursera on astronomy such as Data Driven Astronomy and Analyzing The Universe. Furthermore, I am passionate about using and propagating open source software, and I feel this is a multi-faceted way of giving back to the community that I like so much. Likewise, I feel my skill set is best suited for OpenAstronomy compared to any other project as I have a decent amount of domain as well as programming and software design knowledge. I am always looking to learn more. I view research work and work that aids in other’s research as a noble pursuit. Furthermore, I hope OpenAstronomy will not only be my journey into the world of FOSS but also to research, and it will help me gain experience for higher studies.

Detailed Description

The project involves time series analysis of unevenly sampled data. The goal is to create the classes that generates Lomb-Scargle Power spectrum and Crossspectrum of the data respectively and match these implementations with the structure of existing functionalities of Powerspectrum and Crosspectrum classes as well as the normalisations performed by them.

Time-lag, phase lag and coherence of light curves are also to be found out in the case of cross spectra.

The implementation of cross spectrum is to be done from scratch in Python as it only exists in MATLAB and FORTRAN courtesy of Dr. Jeffrey D. Scargle (Studies in astronomical time series analysis. III-Fourier transforms, autocorrelation functions, and cross-correlation functions of unevenly spaced data).

The implementation of power spectrum may not be required since a compatible implementation exists in scipy.signal.LombScargle. We can refer to the original paper by Jeffrey D. Scargle Studies in astronomical time series analysis. II-Statistical aspects of Spectral Analysis of Unevenly Spaced Data. But it is useful to implement it from scratch as we can use the optimizations introduced by William H. Press and George B. Rybicki as shown below.

A deep dive into Lomb Scargle Method and various optimizations

Referring Review paper by Jacob T. VanderPlas

The Lomb Scargle method uses the mean of the sampling intervals to account for the uneven sampling intervals. Lomb Scargle method is equivalently interpreted as a Fourier method as well as a least squares method. The generalized periodogram form ensures time shift invariance.

The least squares approach works by minimizing the chi^2 test statistic of a function \(y(t;f) = A_f \sin(2\pi f(t-\phi_f))\) where amplitude \(A_f\) and phase \(\phi_f\) vary with frequency.

\[\chi^2 = \sum_n (y_n - y(t_n;f))^2\]

The periodogram using this approach is given by \(P(f) = \frac{\chi^2-\chi^2(f)}{2}\) where \(\chi^2(f)\) is the \(\chi^2\) test statistic for the given frequency and \(\chi^2\) is the non varying reference model.

The periodogram can also be derived by extending the classical Schuster periodogram by generalizing it to unevenly sampled data which comes from the fourier domain of methods.

Given by \(P(f) = \frac{1}{2} [{ \frac{ (\sum_n{g_n\cos(2\pi f[t_n-\tau])})^2} {\sum_n\cos^2(2\pi f[t_n-\tau])} + \frac{(\sum_ng_n\sin(2\pi f[t_n-\tau]))^2} {\sum_n\sin^2(2\pi f[t_n-\tau]) }}]\) where \(\ \tau =\frac{1}{4\pi f} \tan^{-1} {\frac{\sum_n\sin(4\pi f_n)}{\sum_n{\cos(4\pi f_n)}}}\)

We can modify the least squares approach to account for Gaussian errors. Where \(\chi^2\) test statistic is modified as follows

\[\chi^2(f) = \sum_n \frac{(y_n - y_{model}(t_n;f))^2}{\sigma_n^2}\]

Another important modification made is to add an offset term to the sinusoidal model at each frequency in the least squares approach.

\[y_{model}(t;f) = y_0(f) + A_f \sin(2\pi f(t-\tau))\]

The \(y_{model}\) can be further generalized by adding K - 1 terms to it.

\[y_{model}(t;f) = A_f^0 + \sum{K}_{k=1} A_f^{k} \sin(2\pi kf(t-\phi_f^k))\]

These implementations are \(\mathcal{O}(N^2)\)

The above can be further optimized by using the optimizations introduced by William H. Press and George B. Rybicki which brings down the overall time complexity of the algorithm to \(\mathcal{O}(N\log{}N)\)

The optimizations basically include the following:

Simplifications of the following trigonometric sums if we define
\[S_y = \sum_j y_j \sin(\omega t_j)\] \[C_y = \sum_j y_j \cos(\omega t_j)\] \[S_2 = \sum_j \sin(2 \omega t_j)\] \[C_2 = \sum_j \cos(2 \omega t_j)\]
Then we can replace
\[\sum_j y_j \cos(\omega (t_j - \tau)) = C_y \cos(\omega \tau) + S_y \sin(\omega \tau)\] \[\sum_j y_j \sin(\omega (t_j - \tau)) = S_y \sin(\omega \tau) - C_y \sin(\omega \tau)\] \[\sum_j \cos^2(\omega (t_j - \tau)) = 0.5 (N + C_2 \cos(2 \omega \tau) + S_2 \sin(2 \omega \tau))\] \[\sum_j \sin^2(\omega (t_j - \tau)) = 0.5 (N - C_2 \cos(2 \omega \tau) - S_2 \sin(2 \omega \tau))\]
Where \(\omega\ is\ 2 \pi f\) and N is the number of samples.
If the \(t_j\)’s were evenly spaced then the above calculations could be performed by FFTs
Therefore, we extirpolate the values at evenly spaced \(t_j\)’s by using extirpolation weight which is the same as interpolation weight

\(g(t) = \sum_n w_n(t) g(\hat{t}_n)\) where \(w_n(t)\) are interpolation weights and \(\hat{t}_n\) is the extirpolated evenly spaced points
Our sum of interest

\(\sum_n y_n g(t_n) \approx \sum_j y_j[\sum_k(w_k(t_j) g(\hat{t}_k))] = \sum_k\sum_j h_j w_k(t_j) \equiv \sum_k \hat{h}_kg(\hat{t}_k)\) where \(\hat{h}_k = \sum_j h_j w_k(t_j)\)

The following are the practical considerations when implementing the method

Frequency limits and grid spacing must be chosen carefully. The low frequency limit is often chosen to be 0 for the sake of convenience but can be set to \(\frac{1}{T}\) where T is the total span of the time series. The high frequency limit can be calculated by finding the true Nyquist frequency or pseudo Nyquist limit based on careful scrutiny of the window function.
The interaction of the signal, convolution caused by survey window and noise in the data, the largest peak in the periodogram may correspond to an alias of the true frequency
The peak may correspond to a higher harmonic of the fundamental frequency which itself is subject to the above aliasing
The cross spectrum is computed using the Lomb Scargle Fourier Transform.
\[P_{XY} = FT_x(\omega)FT_y^{*}(\omega)\] \[FT_x(\omega) = F_0 \sum A X_n \cos(\omega t_n') + i B X_n \sin(\omega t_n')\]
Which can be simplified by taking \(X_i = 1\)
\[FT_x(\omega) = F_0 \sum A \cos(\omega t_n') + i B \sin(\omega t_n')\]
We can compute this quite quickly using FFTs

My approach

Proposed Structure

Filename	Description
utils.py	helper functions
LombScargle.py	Implementation of the power spectrum and cross spectrum classes

Additions to utils.py

Helper functions like ACF, CCF that work with unevenly sampled data.

Lomb Scargle Cross Spectrum class

Wraps the Lomb Scargle cross spectrum in our own implementation. Handles input of two light curves, eventlists, timearray and light curve iterable. Needs to perform data validity checks and normalisation(leahy/absolute rms/fractional rms).

Class parameters

Parameter	Description	Default
data1	light curve or eventlist	None
data2	light curve or eventlist	None
power_type	real or complex power	all
norm	Normalisation methods(frac RMS, abs RMS, leahy)	Fractional RMS
dt	time resolution only needed for eventlist objects	Automatically selected sensible value
gti	time intervals which have good data (if not none overrides common gtis from both light curves)	Common gti from both light curves
skip_checks	flag to skip checks	False
maxfreq	maximum frequency to check for	Automatically chosen sensible value
df	frequency resolution	Automatically chosen sensible value

Class attributes

Attribute	Description
freq	array of mid bin frequencies that the lsft samples
power	array of cross spectra (complex numbers)
power error	uncertainties in power
m	the number of averages cross spectra amplitudes in each bin
n	the number of data points/time bins in one segment of the light curves
k	the rebinning scheme if object has been rebinned otherwise is set to 1
nphots1	the total number of photons in light curve 1
nphots2	the total number of photons in light curve 2

NOTE : This is a rough outline and the attributes and parameters may change depending on further research at the time of implementation

Lomb Scargle Power Spectrum

Using our own implementation of Press and Rybicki’s fast Lomb Scargle power spectrum. Handles input of one light curve. Helper functions that allow creation of PSD from timearray, eventlists and light curve iterable must be created. And must perform validity checks and normalisation(leahy/absolute rms/fractional rms).

Class parameters

Parameter	Description	Default
data	light curve or eventlist	None
power_type	real or all powers	all
norm	Normalisation methods(frac RMS, abs RMS, leahy)	Fractional RMS
dt	time resolution only needed for eventlist objects	Automatically chosen sensible value
gti	time intervals which have good data (if not None overrides existing gti in light curve)	None
skip_checks	flag to skip checks	False
df	frequency resolution	Automatically chosen sensible value
maxfreq	maximum frequency to check for	Automatically chosen sensible value

Class attributes

Attribute	Description
freq	array of mid bin frequencies that the lsft samples
power	array of cross spectra (complex numbers)
power error	uncertainties in power
m	the number of averages cross spectra amplitudes in each bin
n	the number of data points/time bins in one segment of the light curves
k	the rebinning scheme if object has been rebinned otherwise is set to 1
nphots	the total number of photons in light curve

NOTE : This is a rough outline and the attributes and parameters may change depending on further research at the time of implementation

Deliverables

Implementation of Lomb Scargle Cross Spectrum along with helper functions, statistical functions, user facing wrapper class, tests and documentation
Implementation of Lomb Scargle Power Spectrum along with helper functions, statistical functions, user facing wrapper class, tests and documentation

Description and Timeline

Period	Description
May 4th - 28th	Getting to know mentors, reading documentation, setup development environment, contributing fixes to smaller issues to get familiar with the codebase and getting up to speed with working on the project.
May 29th - June 4th	Implementing the Lomb Scargle Crossspectrum
June 5th - June 19th	Implementing the Lomb Scargle Crossspectrum class
June 19th - July 2nd	Implementing Lomb Scargle Power Spectrum class
July 3rd - July 16th	Implementing the helper functions
July 14th	Midterm evaluation deadline
July 31st - August 13th	JAX optimization, tests and documentation : tests for the wrapper classes as well as the core implementation , documentation for the API and how to use the class
August 13th - August 21st	Buffer period and improving code quality
August 21st - August 28th	Final evaluation week
August 28th - September 4th	Mentor final evaluation week
September 5th - September 24th	Buffer period, comprehensive tests and documentation

GSoC

I have not participated previously in GSoC. I am not applying to any other projects as this project has piqued my interest the most and aligns the most with my knowledge and skill set.

Schedule availability

I don’t really have any plans for holidays during the time of the program. There are barely any holidays between my 4th and 5th semesters which would be around 10-15 days in late August early September. I would be willing to work during these days as well if need be. I would have slightly reduced output from 13th July to Mid/Late August due to semester end exams. My window of maximum productivity would be from May 29th ~ June 25th. I would not like to put behind any work for the buffer period as it clashes with my semester end exams. I would most likely be to work at full productivity from roughly 11th August onward assuming exam schedule would be similar to current semester. Likewise, I would try and cover the work of later weeks early if any unforeseen circumstances arise.