1. Introduction
Ubiquitous galaxy mergers are expected to force their resident supermassive black holes to merge (Berczik et al. 2006; Pearson et al. 2019). During such mergers and the preceding inspiral phases, the black hole pairs are expected to emit gravitational waves (GWs) in the nanohertz frequency range (Burke-Spolaor et al. 2019; Susobhanan et al. 2020). Pulsar Timing Arrays (PTAs: Hobbs & Dai 2017) aim to detect such GWs by accurately timing the arrival of pulses from an ensemble of millisecond pulsars (MSPs), which are very precise celestial clocks (Hobbs et al. 2020). The most promising PTA sources include isolated supermassive black hole binaries (SMBHBs) emitting continuous GWs and an astrophysical stochastic GW background formed from an ensemble of many unresolved SMBHBs (Burke-Spolaor et al. 2019). The rapidly maturing PTA efforts are soon expected to open an additional window to the GW astronomy landscape inaugurated by the LIGO-Virgo collaboration (Abbott et al. 2019).
At present, there exist three advanced PTA experiments, namely the Parkes Pulsar Timing Array (PPTA: Hobbs 2013; Kerr et al. 2020), the European Pulsar Timing Array (EPTA: Kramer & Champion 2013; Desvignes et al. 2016), and the North American Nanohertz Observatory for Gravitational Waves (NANOGrav: McLaughlin 2013; Alam et al. 2020a, 2020b). Additionally, PTA efforts are gaining momentum in India, China, and South Africa (Joshi et al. 2018; Lee 2016; Bailes et al. 2018), and these collaborations are referred to as the emerging PTAs. The International Pulsar Timing Array (IPTA) consortium combines data and resources from various PTA efforts to enable faster detection of nanohertz GWs (Hobbs et al. 2010; Perera et al. 2019).
The Indian Pulsar Timing Array (InPTA) experiment, operational since 2015 (Joshi et al. 2018), aims to use the unique strengths of the Giant Metrewave Radio Telescope (GMRT: Swarup et al. 1991), especially after its recent upgrade (uGMRT: Gupta et al. 2017), along with the Ooty Radio Telescope (ORT: Swarup et al. 1971; Naidu et al. 2015) to complement the other PTA experiments. The uGMRT, with its ability to observe below 1 GHz, is an ideal instrument to characterise interstellar medium effects such as dispersion measure (DM) variations of PTA pulsars, which is necessary to achieve the nanosecond timing precision required for the first detection of nanohertz GWs (Joshi et al. 2018).
The first step in using uGMRT and ORT data for InPTA science goals is to reduce them to an archive format (Hotan, van Straten, & Manchester 2004), a pulsar data format widely used among other PTAs. These data can then be further processed using well-known software to derive various astrophysically relevant quantities including the pulse time of arrival (TOA) and the DM (van Straten, Demorest, & Osłowski 2012). This calls for homogeneity in data reduction practices to avoid non-uniformity in the data products used for PTA analysis, which can introduce systematic errors. In this paper, we describe a uGMRT pulsar data analysis pipeline named ‘Pipeline for the Indian Pulsar Timing Array’ (pinta), developed for the InPTA experiment to address these concerns as well as to improve the efficiency, reliability, and user friendliness of the data reduction process and to ensure faster turnaround time from observations to PTA analysis. We have developed pinta with the intention to commission it as a standard pipeline at the GMRT observatory to be used by the wider pulsar community. This can help avoid the transfer of large data files by enabling data reduction at the observatory itself.
For the pipeline to be useful to a wider community, we also discuss how to interpret the uGMRT observation frequency settings. We also present the results of our astronomical experiments carried out to validate the definition of the observing frequency in the engineering specifications of the uGMRT backend hardware and software. Using the same experiment, we also ascertained the instrumental delays between various back-end modes used at uGMRT measured through engineering tests. These delays form a crucial piece of information, not only for combining data from multiple bands in the InPTA analysis but also for other simultaneous multi-frequency observations which use different back-end modes of uGMRT.
The outline of this paper is as follows. A detailed description of the uGMRT raw data as well as the workflow and usage of pinta is provided in Section 2. Details of the uGMRT observation frequency settings and the astronomical experiments which were used to validate these settings are presented in Section 3. The performance and RFI mitigation characteristics of pinta are reported in Section 4. The ability of pinta to generate data products from which high-precision TOAs can be derived is demonstrated in Section 5 using PSR J1909$-$3744 as an example. A summary of the pinta pipeline discussed in this paper is given in Section 6, and our future plans for the development of InPTA-relevant codes including pinta are summarised in Section 7.
2. Description of the pipeline
pinta accepts uGMRT raw pulsar timing data as input, performs RFI mitigation and folding, and provides the partially folded pulse profile in the Timer archive format (van Straten & Bailes 2011) as its output. In what follows, we give a detailed description of the uGMRT raw data and the workflow of the pinta pipeline.
The thirty GMRT antennas are divided into groups to form multiple subarrays. Each subarray is phased to form voltage beams for two polarisations, and the gains of the two polarisations are equalised during phasing. These voltage beams are then digitised and Fourier transformed (no polyphase filter is employed) to form power spectra across a certain number of frequency channels (Reddy et al. 2017). For the phased array (PA) mode that we use in our InPTA timing observations, the spectral powers from the two polarisations are added to form the total intensity I without applying any calibration, and the result is integrated while maintaining the spectral and time resolution required for the observation, specified in terms of the number of channels $N_{\text{chan}}$ and the sampling time $T_{\text{smpl}}$. Note that the two polarisation voltages can also be combined to compute the Stokes parameters (I, Q, U, V: Hamaker, Bregman, & Sault 1996). While the recording of the full Stokes data is possible at uGMRT, the implementation of its reduction in the pipeline described here is currently being developed and tested. In addition, a real-time coherent dedispersion observing mode is employed to process the voltages to form and record the coherently dedispersed phased array (CDPA) raw data stream (De & Gupta 2016). Lastly, an incoherent array (IA) data stream can be formed by incoherently adding the spectral powers from different antennas.
The PA and the CDPA total intensity modes are used for InPTA observations discussed in this paper. The CDPA mode is primarily used at the lower frequency bands where the effect of interstellar dispersion is prominent. The raw data stream from either of these modes, namely a data cube of spectral intensities at $N_{\text{chan}}$ frequency channels for each time sample, is stored as 16-bit integers in a binary raw data file, and the timestamp (in Indian Standard Time) at the start of the observation is saved as a separate ASCII file. An example timestamp file is shown below.
#Start time and date
IST Time: 19:59:57.633098240
Date: 25:08:2018
#Start ACQ SEQ NO = 17
pinta converts the timestamp given in the timestamp file to MJD using astropy (Price-Whelan et al. 2018). Note that the raw data files do not store any of the metadata required for downstream processing, which must instead be provided to the pipeline through a separate file.
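The conversion of the timestamp file to MJD can be illustrated with a minimal sketch. pinta itself relies on astropy for this step; the version below substitutes the Python standard library so that it is self-contained, and it is not the pipeline's own code. The function name `parse_timestamp`, the assumption that the date is written as day:month:year, and the neglect of leap seconds are all ours.

```python
from datetime import date, timedelta

IST_OFFSET_S = 5 * 3600 + 30 * 60   # IST = UTC + 05:30
MJD_EPOCH = date(1858, 11, 17)      # MJD 0 corresponds to 1858-11-17 00:00 UTC


def parse_timestamp(text):
    """Return (integer MJD, seconds past UTC midnight) from the text of a timestamp file.

    Hypothetical helper: assumes 'IST Time: hh:mm:ss.s' and 'Date: dd:mm:yyyy' lines.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip comment lines
        key, _, value = line.partition(":")   # split at the first colon only
        fields[key.strip()] = value.strip()

    hh, mm, ss = fields["IST Time"].split(":")
    day, month, year = (int(x) for x in fields["Date"].split(":"))

    # Work in seconds so that the sub-microsecond digits are not truncated.
    utc_seconds = int(hh) * 3600 + int(mm) * 60 + float(ss) - IST_OFFSET_S
    obs_date = date(year, month, day)
    if utc_seconds < 0:                       # UTC falls on the previous calendar day
        utc_seconds += 86400.0
        obs_date -= timedelta(days=1)

    return (obs_date - MJD_EPOCH).days, utc_seconds


example = """#Start time and date
IST Time: 19:59:57.633098240
Date: 25:08:2018
#Start ACQ SEQ NO = 17"""
mjd_day, utc_sec = parse_timestamp(example)
print(mjd_day, utc_sec / 86400.0)   # integer MJD and fractional day
```

Keeping the integer day and the seconds of day separate avoids losing precision in a single floating-point MJD; for the example file above the integer part is MJD 58355.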
Reduction of PTA data involves processing a large number of such high-volume data sets (obtained from different MSPs at different epochs in separate bands) through complex processing steps. To keep such batch processing jobs efficient and to avoid premature run-time failures, a set of checks is performed on all the relevant files and folders, and the processing is initiated only if all the checks pass. If any check fails, an informative error message is shown to enable easier troubleshooting.
The data processing workflow of pinta is illustrated in Figure 1. pinta uses two separate packages for Radio Frequency Interference (RFI) mitigation, namely gptool (A. Chowdhury & Y. Gupta, in preparation) and RFIClean (Maan, van Leeuwen, & Vohl 2020). Brief descriptions of these packages are given below.
2.1. Details of gptool
gptool is both an RFI mitigation and a data reduction tool for the beamformer data from GMRT. It mitigates both narrow-band spectral line RFI and broadband bursty time-domain RFI. For narrow-band RFI, it offers a choice of two options for flagging RFI-affected frequency channels: (a) it derives a median band shape and flags channels for which the median absolute deviation (MAD) exceeds a defined threshold or (b) it checks for a drop in mean-to-RMS ratio for each channel below a specified threshold to identify channels corrupted by RFI. Our pinta pipeline employs both of these methods available in gptool. For identifying broadband bursty RFI, gptool once again offers two options for removal of outlier time samples, based on different ways of estimating central tendency and variability in the histogram of the frequency-collapsed time series. In the first method, a standard median and MAD-based scheme is employed to identify RFI-contaminated time samples. However, when strong RFI is present for a significant duration of the observation time block, the histogram may deviate from unimodality, affecting the robustness of median and MAD estimates. In such cases, the major mode and the full width at half maximum around the major mode provide robust estimates of the central tendency and variability of the underlying distribution, and a novel scheme for broadband RFI mitigation has been implemented in gptool based on these statistics. This novel scheme has been found to give superior results, and hence is used in our pipeline. For further handling of the channels and time samples that are flagged as RFI by gptool, it offers two options to the user: either to replace the existing values by zero or to replace the existing values by a local median. In our pipeline, we use the replace by the local median option as it is known to give better results. Both the RFI mitigated and unmitigated data can then be dedispersed and folded to the ephemeris of the observed pulsar. 
When gptool is run in the interactive mode, the time-series, folded profile, and the band-shape are displayed as the tool processes the raw data. pinta uses the non-interactive mode of gptool, where the RFI mitigated data, in the same format as the raw input data, is written to an output file along with estimated statistics in auxiliary files without performing dedispersion or folding. gptool provides an option for the removal of a baseline computed by dedispersing the data to zero DM, useful for broadband RFI mitigation, and an option for flattening the variations of the band shape across the observing bandwidth by renormalising the output of each frequency channel to the same mean value. The parameters for RFI removal and the selected modes are specified with a configuration file, named gptool.in. gptool has also been extensively used for RFI mitigation at the uGMRT for many other pulsar projects since the beginning of the wide-band observations with the uGMRT (Pleunis et al. 2020).
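The standard median and MAD-based scheme for flagging outlier time samples, followed by replacement with a local median, can be sketched as follows. This is a generic illustration of the technique and not gptool's implementation; the function names, the default threshold, and the window size are hypothetical, and the mode-based scheme actually adopted in our pipeline is not reproduced here.

```python
from statistics import median


def flag_outliers(series, threshold=3.0):
    """Flag samples deviating from the median by more than threshold * scaled MAD."""
    med = median(series)
    mad = median([abs(x - med) for x in series])
    sigma = 1.4826 * mad            # MAD scaled to match a Gaussian standard deviation
    if sigma == 0:
        return [False] * len(series)
    return [abs(x - med) > threshold * sigma for x in series]


def replace_with_local_median(series, flags, half_width=5):
    """Replace flagged samples by the median of unflagged neighbours in a local window."""
    cleaned = list(series)
    for i, bad in enumerate(flags):
        if not bad:
            continue
        lo, hi = max(0, i - half_width), min(len(series), i + half_width + 1)
        good = [series[j] for j in range(lo, hi) if not flags[j]]
        if good:                    # leave the sample untouched if no clean neighbour exists
            cleaned[i] = median(good)
    return cleaned


# Frequency-collapsed time series with one bursty outlier (made-up numbers).
series = [10, 11, 9, 10, 500, 10, 12, 9, 11, 10]
flags = flag_outliers(series)
print(replace_with_local_median(series, flags))
```

When strong RFI persists for much of a time block, the median and MAD in `flag_outliers` become biased, which is the situation in which the mode-based statistics described above are preferred.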
2.2. Details of RFIClean
RFIClean excises periodic RFI in the Fourier domain and then mitigates narrow-band spectral line RFI and broadband bursty time-domain RFI using robust statistics. Periodic RFI can severely limit the efficacy of conventional RFI mitigation techniques. There are many terrestrial sources of periodic interference, the most infamous being the household 50/60 Hz power lines. RFIClean identifies and mitigates periodic interference in the time series of individual frequency channels using Fourier domain analysis. After the excision of periodic interference, RFIClean uses the more conventional threshold-based techniques to identify the time samples and frequency channels contaminated by broadband bursts and narrow-band RFI, respectively. The identified time samples and frequency channels are replaced by mean values, computed robustly in the local regions around the affected samples. RFIClean has been extensively and successfully tested against any artefacts which might get incorporated in the data during the periodic RFI excision and might be relevant to the PTA analysis. The details of these tests can be found in Maan et al. (2020). Before inclusion in pinta, RFIClean was also independently tested as a stand-alone programme using InPTA data and was found to significantly enhance the quality of the reduced data and the timing analysis. For some pulsars with their spin frequency or any of its harmonics unfavourably close to 50 Hz, detection of the pulsar signal at several epochs was possible only after RFIClean’s mitigation of the periodic and other RFI. RFIClean has also been used in several other completed and ongoing projects (e.g., Maan et al. 2019; Oostrum et al. 2020), including in timing experiments and searches for fast radio bursts (Sosa Fiscella et al. 2020; Pastor-Marazuela et al. 2020).
We note here an important difference between gptool and RFIClean: gptool performs band shape normalisation on the raw data while RFIClean retains the original band shape. Thus, noticeable differences in the shape of the band-averaged profiles can occur between the two branches of the pipeline, especially in wide-band observations of pulsars exhibiting significant profile evolution with frequency and interstellar scintillation. Therefore, we advocate the use of separate templates for generating TOAs from profiles obtained through gptool and RFIClean, especially for high precision pulsar timing applications such as PTAs. In addition, the use of frequency-dependent two-dimensional templates may also help mitigate this issue (Pennucci 2019).
gptool accepts uGMRT raw data as input and writes the output in the same format. The conversion to the filterbank format is carried out by a version of the filterbank command provided by the sigproc package (Lorimer 2011), customised for uGMRT and distributed along with pinta. On the other hand, RFIClean accepts input either in uGMRT raw data format or in the sigproc-filterbank format and outputs a sigproc-filterbank file.
It may be illuminating to compare and contrast the RFI mitigation methods available in pinta with those available in the CoastGuard data analysis package (Lazarus et al. 2016) developed for the PSRIX backend of the Effelsberg 100-m Radio Telescope. CoastGuard provides four algorithms to find and mask or replace channels, sub-integrations, and phase bins in the folded profile contaminated with RFI. The major difference between the RFI mitigation algorithms available in pinta and CoastGuard is that the former act on raw data, whereas the latter act on folded profile archives. The mitigation of periodic RFI, such as that generated by power distribution lines, which is implemented in RFIClean, is not possible in the folded profiles. In addition, the time-domain bursty RFI removed by gptool and RFIClean typically occurs at GMRT on timescales much shorter than our sub-integration interval of 10 s. These are our main reasons for opting for RFI removal in the raw data rather than folded profiles in our analysis.
While both the RFI mitigation packages have been well tested, the possibility of discovering new artefacts in the future cannot be ruled out. Hence, to avoid the need of reanalysing all the data in such an unlikely future situation, we have designed pinta such that it allows the user to process the data in two separate branches, one for each RFI mitigation package, and produces two separate outputs. Availability of data reduced by two independent parts of the pipeline facilitates detailed comparisons and the choice of the optimal RFI mitigation method. The RFI-mitigated filterbank files are folded using dspsr (van Straten & Bailes 2011) and saved in the Timer format, significantly reducing the data volume. Finally, a period and DM search is performed on the resulting profile archive using the pdmp command provided by psrchive, producing a summary document in the postscript format. This file is used as a visual check to ensure that the pulsar has been detected and that the analysis has finished successfully.
2.3. Usage
The pinta pipeline can be invoked from the command line with the following syntax.
$ pinta [--help] [--test] [--no-gptool]
        [--no-rficlean] [--nodel] [--retain-aux]
        [--log-to-file] [--gptdir <…>]
        [--pardir <…>] [--rficconf <…>]
        <input_dir> <working_dir>
pinta requires specifying two mandatory parameters and a few other optional parameters as inputs as listed below.
1. Input directory (input_dir): The directory where the raw data files and the corresponding timestamp files are stored.
2. Working directory (working_dir): The output files, as well as all the intermediate products, will be written to this directory. This directory must contain a file named pipeline.in as specified in Section 2.5, and the user must have ‘read’ and ‘write’ permissions for this directory. The working directory can be the same as the input directory.
3. gptool configuration directory (gpt_dir): This directory should contain the configuration files required to run gptool, named gptool.in.xxx where ‘xxx’ represents the local oscillator frequency of the uGMRT band.
4. Pulsar ephemeris directory (par_dir): This directory should contain the pulsar ephemeris (.par) files in the tempo2 format, required for folding the data. Each ephemeris file should be named JNAME.par where ‘JNAME’ is the name of the pulsar in the J2000 epoch.
5. RFIClean configuration file (rficconf): This file contains the settings and flags required to run RFIClean for pinta.
In addition, we shall refer to the directory from which pinta is invoked and the directory where the pinta script is stored as the current directory (current_dir) and script directory (script_dir), respectively.
Note that both working_dir and the current_dir require write access. The input_dir and working_dir are mandatory positional arguments to be passed to pinta, while gpt_dir, par_dir, and rficconf are by default read from a configuration file, detailed in the next subsection. gpt_dir, par_dir, and rficconf can be explicitly specified in the command line through the --gptdir, --pardir, and --rficconf options, respectively. The various options and command line arguments are summarised in Table 1.
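The command line interface described above can be mirrored with a short argparse declaration. This is a sketch of how such an interface could be declared, not the parser used inside pinta; only the option and argument names are taken from the usage synopsis, while the function name `build_parser` and the use of `None` to mean "read the value from the configuration file" are our assumptions.

```python
import argparse


def build_parser():
    """Declare the options listed in the usage synopsis (argparse adds --help itself)."""
    p = argparse.ArgumentParser(prog="pinta")
    # Boolean switches from the synopsis; their meanings are summarised in Table 1.
    for flag in ("--test", "--no-gptool", "--no-rficlean",
                 "--nodel", "--retain-aux", "--log-to-file"):
        p.add_argument(flag, action="store_true")
    # Optional overrides; None stands for "fall back to the configuration file".
    p.add_argument("--gptdir", default=None)
    p.add_argument("--pardir", default=None)
    p.add_argument("--rficconf", default=None)
    # Mandatory positional arguments.
    p.add_argument("input_dir")
    p.add_argument("working_dir")
    return p


# Hypothetical invocation: skip the gptool branch and override the ephemeris directory.
args = build_parser().parse_args(
    ["--no-gptool", "--pardir", "/path/to/par", "/path/to/raw", "/path/to/work"]
)
print(args.no_gptool, args.pardir, args.input_dir, args.working_dir)
```

Declaring input_dir and working_dir as positionals makes argparse reject an invocation that omits either of them, which matches their mandatory status.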
2.4. The configuration file
The pinta configuration file stores the default settings required to run the pipeline, such as the gpt_dir, par_dir, and rficconf, in YAML format. This file should be named pinta.yaml and stored in the script_dir.
A sample configuration file is shown below.
pinta:
  pardir: /path/to/pulsar/ephemeris/dir/
  gptdir: /path/to/gptool/config/dir/
  rficconf: /path/to/rfiClean/config/file/
2.5. The pipeline.in file
Since the raw input data files do not contain any of the metadata required for downstream processing, such as the number of channels and the bandwidth, this information must be provided separately. pinta accepts this information through a space-separated ASCII file named pipeline.in stored in the working_dir. Each row in pipeline.in corresponds to one raw data file and the various columns are described in Table 2. Rows starting with ‘#’ are treated as comments and ignored. pinta processes the rows in the pipeline.in file serially until all rows are processed successfully or a validation criterion is not met.
An example pipeline.in file is shown in Figure 2.
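The row handling described above (space-separated fields, ‘#’ comments ignored) can be sketched as below. The function name `read_pipeline_rows` is hypothetical, the skipping of blank lines is our assumption, and the fields are returned as uninterpreted strings because the column definitions of Table 2 are not reproduced here.

```python
def read_pipeline_rows(text):
    """Return the non-comment rows of a pipeline.in file as lists of string fields."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                # blank lines and '#' comments carry no observation
        rows.append(line.split())   # fields are separated by whitespace
    return rows


# Placeholder content: the field values are illustrative, not real column entries.
sample = "# comment row\nfieldA1 fieldA2 fieldA3\nfieldB1 fieldB2 fieldB3\n"
for row in read_pipeline_rows(sample):
    print(row)
```

Each returned row would then be validated and processed in order, stopping at the first row that fails validation.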
2.6. Storage requirements
The uGMRT raw data file generated by an hour-long observation is typically of the order of a hundred gigabytes. A uGMRT raw data file contains, for each time sample, $N_{\text{pol}}$ polarisation intensities/correlations in $N_{\text{chan}}$ frequency channels represented as 16-bit (2-byte) integers. In general, the size of the raw data file for an observation duration $T_{\text{obs}}$ and sampling time $T_{\text{smpl}}$ is given by
$$\text{Raw data file size} = 2\,\text{bytes} \times N_{\text{pol}} \times N_{\text{chan}} \times \frac{T_{\text{obs}}}{T_{\text{smpl}}}. \tag{1}$$
The intermediate products generated by the pipeline, namely, .gpt and .fil files, will have roughly the same size as the input file along with a small header which stores observation metadata. The output archive files are typically smaller, of the order of hundreds of megabytes in size, since we fold the raw data over longer sub-integrations. The size of the output archive, excluding the header, scales with the number of stored values as
$$\text{Archive size} \propto N_{\text{pol}} \times N_{\text{chan}} \times N_{\text{bin}} \times \frac{T_{\text{obs}}}{T_{\text{subint}}}, \tag{2}$$
where $T_{\text{subint}}$ is the duration of a sub-integration and $N_{\text{bin}}$ is the number of phase bins in the profile. In our analysis, we typically use $T_{\text{subint}}=10$ s. In general, the maximum amount of disk space required by pinta is less than four times the total size of the raw data files, while preserving all intermediate files (i.e., using the --nodel option). If the --nodel option is not used, the maximum amount of disk space required is approximately the size of the largest raw data file.
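As a worked illustration of these storage estimates, the sketch below evaluates the raw file size (one 16-bit integer per polarisation, channel, and time sample) and the number of values held in the folded archive. The function names and the observation settings in the example are hypothetical, and the archive is expressed as a count of values because the number of bytes per stored value is not assumed here.

```python
def raw_file_size_bytes(n_pol, n_chan, t_obs, t_smpl, bytes_per_value=2):
    """Raw file size: one 16-bit integer per polarisation, channel and time sample."""
    n_samples = round(t_obs / t_smpl)
    return bytes_per_value * n_pol * n_chan * n_samples


def archive_value_count(n_pol, n_chan, n_bin, t_obs, t_subint):
    """Number of values stored in the folded archive (header excluded)."""
    n_subint = round(t_obs / t_subint)
    return n_pol * n_chan * n_bin * n_subint


# Hypothetical settings: 1 h of total-intensity data, 1024 channels, 40.96 us sampling,
# folded into 512 phase bins with 10 s sub-integrations.
raw = raw_file_size_bytes(n_pol=1, n_chan=1024, t_obs=3600.0, t_smpl=40.96e-6)
values = archive_value_count(n_pol=1, n_chan=1024, n_bin=512, t_obs=3600.0, t_subint=10.0)
print(raw / 1e9, "GB raw;", values, "archive values")
```

For these illustrative settings the raw file comes to 180 GB, in line with the hundred-gigabyte scale quoted above, while the folded archive holds roughly 1.9 × 10^8 values.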
3. Interpretation of observatory frequency settings
The GMRT Wide-band Back-end (GWB; Reddy et al. 2017) provides three different observation modes, namely IA, PA, and CDPA, as described in Section 2. The settings used during a pulsar observation depend on the band of observation and the mode of the observatory back-end. These settings are required for data reduction using pinta and are communicated to the pipeline through a pipeline.in file as mentioned in Section 2.5. As the frequency labelling of the pulsar data cube varies with the back-end mode used, these labels need to be determined and encoded in pinta in a manner which simplifies the specification of observation settings for the user.
The times of arrival (TOAs) of a pulsar pulse recorded simultaneously in two bands A and B, using backend modes P and Q, respectively, are related by
$$t_{AP} - t_{BQ} = \Delta_{PQ} + \mathcal{D} \times \text{DM} \times \left(F_{1A}^{-2} - F_{1B}^{-2}\right), \tag{3}$$
where $t_{AP}$ and $t_{BQ}$ are the TOAs, $\Delta_{PQ}$ is the relative instrumental offset between modes P and Q, $\mathcal{D}$ is the dispersion measure constant, DM is the dispersion measure of the pulsar at the epoch of observation, and $F_{1A}$ and $F_{1B}$ are the frequency labels of the channels to which the signals in bands A and B are dedispersed. Both the offsets $\Delta_{PQ}$ and the frequency labels $F_{1X}$ (where X represents the band of observation) are crucial for performing precision pulsar timing using uGMRT. These are defined as part of the engineering specifications of the GWB hardware and software (Reddy et al. 2017; De & Gupta 2016). Engineering tests with standard inputs to the hardware were carried out to verify these definitions and revealed that there is no offset between time series in IA and PA mode, whereas an offset of one buffer (256 Mbytes) exists between IA/PA and CDPA modes. This offset is 0.67108864 s for 200 and 400 MHz bandwidths and 1.34217728 s for 100 MHz bandwidth, and this was verified up to 5 ns precision in engineering tests. Likewise, the frequency definitions were worked out from engineering considerations and tested in an engineering sense with fixed frequency tones. Although the precision of astronomical tests is not likely to be high due to system noise and coarser sampling, such tests with wide-band radio emission are also needed to gain confidence, particularly for coherently dedispersed data. In this section, we describe the astronomical tests carried out to validate the frequency labelling $F_{1X}$ to be encoded in pinta and to determine the offsets $\Delta_{PQ}$.
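The dispersive part of the lag between two bands can be evaluated with a few lines of code, which also shows how sensitive the lag is to the DM. This is a minimal sketch under stated assumptions: the numerical value adopted for the dispersion constant is the conventional one and is not quoted in this paper, the sign convention for the instrumental offset is ours, and the frequency labels of 400 and 1400 MHz in the example are hypothetical stand-ins for the two bands.

```python
DISPERSION_CONSTANT = 4.148808e3   # MHz^2 pc^-1 cm^3 s (conventional value, assumed here)


def dispersive_delay(dm, freq_mhz):
    """Cold-plasma dispersion delay in seconds relative to infinite frequency."""
    return DISPERSION_CONSTANT * dm / freq_mhz ** 2


def expected_lag(dm, f_a_mhz, f_b_mhz, offset_s=0.0):
    """Arrival-time difference t_A - t_B for the same pulse seen in bands A and B."""
    return offset_s + dispersive_delay(dm, f_a_mhz) - dispersive_delay(dm, f_b_mhz)


# Crab-like DM with hypothetical frequency labels for the two bands.
lag = expected_lag(dm=56.7, f_a_mhz=400.0, f_b_mhz=1400.0)
shift = expected_lag(56.71, 400.0, 1400.0) - lag   # effect of a 0.01 pc cm^-3 DM change
print(f"lag = {lag:.4f} s, change for dDM = 0.01: {shift * 1e6:.0f} us")
```

For these illustrative values the lag is about 1.35 s, and a DM change of 0.01 $\mathrm{pc\,cm}^{-3}$ shifts it by roughly 0.24 ms. The dispersive term therefore changes with DM from epoch to epoch, whereas a fixed instrumental offset does not, which is what allows the two contributions to be separated.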
3.1. Calibration experiment
The required frequency labelling and the instrumental offsets were validated using observations of the Crab pulsar (PSR J0534+2200) and PSR J0332+5434. The former is a bright pulsar with a 33.7 ms period and a relatively high DM (56.7 $\mathrm{pc\,cm}^{-3}$: Lyne et al. 2014). The DM of the Crab pulsar varies from epoch to epoch, and this pulsar exhibits sporadic intense pulses, called giant pulses (GPs; Lundgren et al. 1995; Hankins et al. 2003), typically once every four minutes at uGMRT frequencies and sensitivity. The GPs provide a time marker, which is a strong function of frequency due to interstellar dispersion. Moreover, the arrival times of this marker across different frequencies vary with epoch due to DM variations. Thus, GPs provide a sensitive probe to validate the assumed frequency labels for the spectral data. PSR J0332+5434, with a flux density of $\sim$1 500 mJy at 408 MHz, is the brightest pulsar in the northern hemisphere at metre-centimetre wavelengths, with a period of 714 ms and a DM of 26.76 $\mathrm{pc\,cm}^{-3}$ (Lorimer et al. 1995; Hassall et al. 2012). Bright single pulses with pulse-to-pulse intensity variations interspersed with pulse nulls are seen in this pulsar (see Figure 3a).
The GWB can simultaneously be used in its different modes of operation in different bands using any combination of the four beams provided (Gupta et al. 2017; Reddy et al. 2017). This capability was exploited to record data on GPs from the Crab pulsar and single pulses from PSR J0332+5434 in IA, PA, and CDPA modes of GWB using different frequency bands available with the uGMRT. For the Crab pulsar, the GPs were first identified in IA, PA, and CDPA mode data at both Band 3 and Band 5. We investigated the cross-correlation in the recorded time series around the identified GPs from different modes and frequency bands to determine the lag in the arrival times of the GPs. This lag, recorded for example with PA in Band 5 and CDPA in Band 3, depends on the DM of the pulsar (specified up to a precision of 0.001 $\mathrm{pc\,cm}^{-3}$) and the frequency labelling used for the two bands, as given by Equation (3). As the DM time series of this pulsar is known to the required precision from independent measurements (Lyne, Pritchard, & Graham Smith 1993; Lyne et al. 2014) made public by the Jodrell Bank Observatory, the expected lag in the arrival times of identified GPs was calculated from the DM nearest to the epoch of observations. Hence, any difference between the expected and measured lags is due to either (a) incorrect frequency labelling or (b) a relative time offset between the two modes. As the DM of this pulsar varies over a timescale of one month, two observations separated by one month will yield different delays due to frequency labelling, whereas the relative instrumental delay is expected to be constant. Thus, both the frequency labelling and relative offsets can be simultaneously determined by two such observations. We check these results for consistency using a similar analysis with PSR J0332+5434.
3.2. Calibration observations and results
Calibration observations were carried out on 2019 December 16 (MJD 58832), 2020 January 24 (MJD 58871), and 2020 May 22 (MJD 58991). The estimated lags for one combination of modes on 2020 January 24 are shown in Figure 3a and b. The relative offsets and frequency labelling were then determined by matching the measured and expected lags, given by Equation (3), and the estimated relative offsets for different modes are tabulated in Table 3. While the uncertainty on measurements of these relative pipeline delays ranges from 10 to 80 $\upmu$ s due to coarser sampling and system noise, these measurements are consistent with the engineering measurements. The relative pipeline delays measured as a result of tests conducted in the first two epochs were corrected in the software by the GMRT engineering team in 2020 April. This was verified in the tests conducted on 2020 May 22, as can be seen from Table 3.
The frequency labelling $F_{1X}$ for the different modes is expressed in terms of the value of the highest frequency channel in the following expressions:
For IA and PA,
and for CDPA,
Here, $F_{\text{LO}}$ refers to the Local Oscillator (LO) frequency (MHz) used for the observations, $\Delta F$ is the acquisition bandwidth (typically 100 or 200 MHz), and $N_{\text{chan}}$ denotes the number of channels or sub-bands across the band. The expression is different for each side band, denoted by USB or LSB. When $F_{\text{LO}}$ is chosen at the lowest edge of the band being used, this is called the upper side band (USB), where frequencies are ordered from lowest to highest. The reverse order of frequencies is used in the lower side band (LSB), with $F_{\text{LO}}$ chosen at the highest edge of the band. Equations (4a)–(4b) are in agreement with what is expected from the implementation of the IA, PA, and CDPA pipelines in GWB (Reddy et al. 2017; De & Gupta 2016).
These equations are implemented in pinta to make it simpler for the user to use our data reduction pipeline. The user specifies the LO frequency, the side band, the acquisition bandwidth, and the number of sub-bands/channels in the pipeline.in file using the same values as specified for the backend observation setup. The relative offsets determined in these experiments are not coded in pinta, but are included as jumps while performing any timing analysis of the uGMRT data.
4. Performance
To validate the pipeline and investigate its performance, we performed a series of tests using a variety of uGMRT data sets with varying data volume and observation frequencies.
To gauge the computational performance of pinta, we sliced the raw data files from ten different observations (the details of these data sets are given in Table 4) into file sizes of 1 GiB, 2 GiB, 4 GiB, 8 GiB, 16 GiB, and 32 GiB, processed each slice separately in pinta, and in each case recorded the execution time of each component of pinta as well as the total execution time. The result of this exercise is shown in Figure 4 where the ratio of the execution time to the observation duration (observe-to-reduce time ratio) is plotted against the observation duration. Each point in Figure 4 represents the median of ten test cases, and the error bar represents the corresponding median absolute deviation. This plot shows the observe-to-reduce ratio to be approximately between 1.5 and 3 and that it is not strongly dependent on the data volume. This behaviour is desirable and the observe-to-reduce ratio can indeed be improved to be better than real-time by optimising and parallelising the pipeline, which we plan to do in the future. Such improvements can in principle allow pinta to be deployed as a real-time observatory pipeline for pulsar data reduction. We also note that the observe-to-reduce ratio while using only one of the two branches is close to or better than real-time.
To ensure the reliability of the pipeline, these tests were repeated by multiple users on the same data sets mentioned above using different command line options, and the results were compared with each other as well as with results obtained by running the various data reduction codes used in pinta directly to ensure that the results are reproducible.
4.1. RFI mitigation
RFI mitigation is one of the most important processing steps in the pinta pipeline. To illustrate the RFI mitigation in the pipeline, we present here a study of ten different data sets (see Table 4) with varying levels of RFI. Data segments were selected from observations in uGMRT Bands 3, 4, and 5 spanning MJD 58260–58389, with a total length of 11 544 s. The data quality of each segment before and after the pinta RFI mitigation was studied. The rfifind command of PRESTO (Ransom 2011) was used to report the percentage of good intervals in the data. The percentage of good intervals gained after the RFI mitigation is shown (in red) in Figure 5. This study gives an indication of the typical RFI mitigation available in the pipeline, and we see from Figure 5 that the degree of improvement after applying RFI mitigation varies greatly from data set to data set, which is expected since the RFI environment itself is highly variable. Data set 3 is of specific interest, as the percentage of good intervals more than doubles after applying RFI mitigation, and the pulsar was detected in this data set only after applying RFI mitigation.
To further illustrate the efficacy of the RFI mitigation available in pinta, we show in Figure 6 pulse profiles generated using gptool, using RFIClean, and without performing any RFI mitigation for two observations. The profiles without any RFI mitigation are produced by running pinta with the --no-gptool --no-rficlean options. The signal-to-noise ratios (SNRs) quoted in Figure 6 are computed using the pdmp command of PSRCHIVE. In light of the caveat regarding band shape normalisation discussed in Section 2, we have chosen two observations without significant interstellar scintillation in order to show a fair comparison between gptool and RFIClean.
Figure 6 shows the gain in profile SNR for both data sets while using RFI mitigation. Nevertheless, it should be noted that the SNRs for J2124 $-$ 3358 reported by pdmp may be inaccurate due to its large duty cycle. This does not affect our comparison between the RFI mitigated and non-RFI mitigated data sets as it is clear from the bottom panel of Figure 6b that the RFI mitigated profiles agree with each other better than with the non-RFI mitigated profile, indicating a reduction in the noise level.
5. Timing of PSR J1909–3744
In this section, we demonstrate the capability of pinta to generate profiles from which high-precision TOAs can be derived. We use PSR J1909 $-$ 3744 as an example for this purpose.
The data presented in this section were obtained as part of the InPTA campaign from 2020 April to 2020 October with a cadence of $\sim$ 15 d. The observations were carried out by splitting the 30 uGMRT antennas into two phased subarrays, where the innermost 8 antennas were used in Band 3 (300–500 MHz) and 16 of the outer antennas were used in Band 5 (1 260–1 460 MHz). The pulsar was observed simultaneously in both bands in each epoch, with 200 MHz bandwidth and 1 024 frequency channels in each band. The Band 3 data were coherently dedispersed to the known DM of the pulsar and were recorded at 20.48 $\upmu$ s sampling time, whereas Band 5 data were obtained using the PA mode with a sampling time of 40.96 $\upmu$ s. The data were processed using pinta, and the TOAs were extracted from the resulting Timer archives using PSRCHIVE after time and frequency collapsing the folded profiles. The resulting TOAs were fit using TEMPO2 (Hobbs, Edwards, & Manchester 2006) using the pulsar ephemeris available in the NANOGrav 12.5 yr data set (Alam et al. 2020a), as our data span is too short to provide a reliable timing solution. Post-fit residuals after fitting for the pulsar rotational parameters (F0, F1) and DM are plotted in Figure 7. We do not use any time offsets between the two bands, as such offsets have been corrected in the GWB software since 2020 April based on the results mentioned in Section 3.1. The corresponding pre-fit and post-fit parameters, along with the RMS timing residual values, are listed in Table 5. A more thorough timing solution of these data using frequency-resolved TOAs, DM corrections, and rigorous noise analysis will be published elsewhere.
From Table 5, we note that the uGMRT observations processed using pinta are able to produce an RMS post-fit timing residual of 1.46 $\upmu$ s. This demonstrates that the data products produced using pinta can indeed be used for high-precision timing applications such as PTAs. We expect to further reduce the RMS timing residual after applying DM corrections, which are discussed elsewhere (Krishnakumar et al. 2021).
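For reference, an RMS timing residual of this kind can be computed from post-fit residuals and TOA uncertainties as sketched below. This is an illustrative assumption on our part: the sketch uses an inverse-variance weighted RMS about the weighted mean, which is a common convention in pulsar timing, but the exact definition behind the value in Table 5 is whatever TEMPO2 reports and is not restated here. The function name is hypothetical.

```python
import math


def weighted_rms(residuals_us, toa_errors_us):
    """Inverse-variance weighted RMS of residuals about their weighted mean.

    Both inputs are in microseconds and must have the same length.
    """
    if len(residuals_us) != len(toa_errors_us) or not residuals_us:
        raise ValueError("inputs must be non-empty and of equal length")

    weights = [1.0 / err**2 for err in toa_errors_us]
    weight_sum = sum(weights)
    mean = sum(w * r for w, r in zip(weights, residuals_us)) / weight_sum
    variance = sum(
        w * (r - mean) ** 2 for w, r in zip(weights, residuals_us)
    ) / weight_sum
    return math.sqrt(variance)
```

With equal TOA uncertainties, this reduces to the ordinary RMS about the mean.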
6. Summary and discussion
We have developed a pipeline, named pinta, to reduce uGMRT pulsar timing raw data for the InPTA experiment. It reduces the raw data input to RFI-mitigated folded profile archives. Since the uGMRT raw data do not contain any metadata such as the observation settings, these settings are provided to the pipeline via an ASCII input file named pipeline.in, whose contents are summarised in Table 2. pinta performs RFI mitigation using two different packages, namely gptool and RFIClean, running them in two different branches which produce two different output archives. pinta provides various command line options to control how these two branches are run, and these are summarised in Table 1.
It is crucial to use the correct interpretation of the observatory frequency settings while performing the data reduction. We performed validation and calibration experiments using GPs from the Crab pulsar and single pulses from the bright pulsar J0332+5434 to ensure that our interpretation of the observation frequency for the IA, PA, and CDPA pipelines of uGMRT matches what is given in Equations (4a) and (4b). These experiments also allowed us to measure the instrumental delays between the IA, PA, and CDPA pipelines of uGMRT, which are consistent with the instrumental delays expected from engineering considerations.
To characterise the computational performance of pinta, we conducted a number of tests using different data sets. These tests showed that the net observe-to-reduce time ratio of pinta is approximately 2, while the observe-to-reduce time ratio of individual branches is less than 1.5. These results motivate us to pursue a real-time observe-to-reduce time ratio by applying parallelisation techniques to the pipeline. We also conducted tests to investigate the RFI mitigation efficacy of pinta on the same data sets, the results of which are shown in Figure 5. We observe that the RFI mitigation gains seen in different data sets, having different RFI characteristics, vary significantly as expected, with some data sets yielding up to $\sim 10\%$ gain after RFI mitigation. We also demonstrate improvements in the significance of pulse profiles by using the different RFI mitigation paths in pinta, which further underlines their importance in the pipeline. These results substantiate the addition of RFI mitigation tools in pinta. To demonstrate the ability of pinta to generate data products from which high-precision TOAs can be derived, we showed the timing of uGMRT observations of PSR J1909 $-$ 3744, for which we are able to produce timing residuals with RMS of the order of 1 $\upmu$ s.
7. Future scope
Our plans for the future development of pinta include the improvement of its computational efficiency to achieve better than real-time performance. This may be achieved by (a) running the two branches of the pipeline in parallel instead of serially, (b) modifying the filterbank program to use GPUs, and (c) utilising the GPU processing option in dspsr.
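Option (a) could take roughly the following shape. This is a sketch of the general idea only, not a description of planned pinta code: the function name and the branch callables are hypothetical placeholders, and the use of threads assumes that each branch spends most of its time waiting on external programs rather than in Python itself.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branches_in_parallel(branches, raw_file):
    """Run each branch callable on the same raw file concurrently.

    `branches` maps a branch name to a callable taking the raw file path
    and returning that branch's result (for example, an output archive
    path). Both the names and the callables are placeholders.
    """
    if not branches:
        return {}

    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {
            name: pool.submit(func, raw_file)
            for name, func in branches.items()
        }
        # result() blocks until the branch finishes and re-raises any
        # exception raised inside it.
        return {name: future.result() for name, future in futures.items()}
```

With two branches of comparable cost and sufficient I/O and CPU headroom, the wall-clock time of such a scheme approaches that of the slower branch rather than the sum of both.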
Similar pipelines for reducing the data obtained using the legacy GMRT and the ORT are also under development, with a high level of compatibility with pinta as a design goal. In addition, we plan to develop the ‘InPTA Data Management System’, a database for tracking metadata associated with the observations and data analysis of the InPTA experiment, which will be tightly integrated with pinta as well as the legacy GMRT and ORT pipelines.
Acknowledgements
We are grateful to the anonymous referee for a detailed perusal and constructive feedback on the manuscript. We thank the staff of the GMRT who made our observations possible. GMRT is run by the National Centre for Radio Astrophysics of the Tata Institute of Fundamental Research. BCJ, YG, and AB acknowledge the support of the Department of Atomic Energy, Government of India, under project # 12-R&D-TFR-5.02-0700. AS, AG, and LD acknowledge the support of the Department of Atomic Energy, Government of India, under project # 12-R&D-TFR-5.02-0200. MPS acknowledges funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 694745). AC acknowledges the funding received from Department of Science and Technology, Government of India, WOS-A scheme, file no. SR/WOS-A/PM-26/2018.