.. _partitioning-guide: Partitioning Guide ================== `Measurement Set v4.0 `_ specifies a series of datasets with ``time``, ``baseline_id`` and ``frequency`` coordinates where ``time`` and ``frequency`` have associated ``integration_time`` and ``channel_width`` attributes. In the best case, this represents monotonic, equidistant values along ``time`` and ``frequency`` and the standard quadratic relation between antennas in the case of ``baseline_id``. Observational data recorded directly off an interferometer and stored for archival purposes will commonly follow a ``(time, baseline_id, frequency)`` ordering. The usefulness of this representation and ordering is that it is simple and easy for software to reason about. This is desirable as it simplifies our software. The challenge in converting from MSv2 to MSv4 is formulating a partitioning strategy to handle irregularity in an MSv2 dataset. Measurement Set v2.0 irregularity --------------------------------- By contrast the `Measurement Set v2.0 `_ is a tabular format that does not enforce any notion of regularity (although much software assumes it). The ``TIME`` and ``INTERVAL`` columns in the MAIN MSv2 table describe the midpoint in time at which a sample was measured and the amount of time (integration time) taken to measure the sample, while the ``ANTENNA1`` and ``ANTENNA2`` columns define the baseline along which the sample was measured. ``TIME``, ``ANTENNA1`` and ``ANTENNA2`` are *keys* in the tabular MAIN table and there is no requirement that the measurements they index are ordered, or even form a regular ``(time, baseline_id)`` grid. Additionally, the ``DATA_DESC_ID`` column establishes a relation to the ``SPECTRAL_WINDOW::CHAN_FREQ`` and ``SPECTRAL_WINDOW::CHAN_WIDTH`` columns representing the frequency centroid and bandwidth of the sample, respectively. The challenge that MSv2 poses to radio astronomy software in the worst case is that it can represent overlapped or disjoint measurements in time and frequency for one or more baselines. However, most observational data is well-behaved: Measurements are commonly ordered by ``TIME, ANTENNA1, ANTENNA2`` and ``CHAN_FREQ`` commonly increases monotically with equidistant values (i.e. ``CHAN_WIDTH`` values are uniform) but this cannot always be assumed. Any regularity in an MSv2 MS is achieved through convention rather than enforcement. Choosing a partitioning strategy -------------------------------- By default, MSv2 measurements are partitioned by ``DATA_DESC_ID``, ``OBSERVATION_ID``, ``PROCESSOR_ID`` and the ``STATE::OBS_MODE`` (via ``STATE_ID``) columns. .. autodata:: xarray_ms.backend.msv2.structure.DEFAULT_PARTITION_COLUMNS For example, it follows from the previous section that, in order to achieve regularity in frequency, *partition* MSv2 measurements by the ``DATA_DESC_ID`` column. Partitioning always uses these columns, but additional columns can be selected if finer grained partitioning is required: .. autodata:: xarray_ms.backend.msv2.structure.VALID_PARTITION_COLUMNS Note that ``OBS_MODE`` and ``SUB_SCAN_NUMBER`` are columns in the ``STATE`` subtable, while ``SOURCE_ID`` is a column of the ``FIELD`` subtable. Partitioning on these columns is achieved by joining on the ``STATE_ID`` and ``FIELD_ID`` columns, respectively. Within these partitions, measurements are sorted by ``TIME``, ``ANTENNA1`` and ``ANTENNA2`` to form a grid. .. _time-partitioning: Partitioning in time ++++++++++++++++++++ Compared to frequency, achieving regularity in time requires more thought as it depends on identifying partitions of MSv2 where data: 1. contains monotically increasing ``TIME`` (after ordering). 2. is dumped with a uniform ``INTERVAL``. 3. ideally contains no gaps: i.e. ``(TIME - INTERVAL)[1:] == (TIME + INTERVAL)[:-1]``. For example, ``OBS_MODE`` specifying ``STATE::OBS_MODE`` via ``STATE_ID`` is a good default partitioner, as it represents a shift in the interferometer's mode of operation: It identifies when the interferometer is e.g. slewing/observing a calibrator/observing a target. Other valid partitioning columns are: - ``FIELD_ID``: Observing a field for a period of time. - ``SOURCE_ID``: Observing a source within a field for a period of time. - ``SCAN_NUMBER``: A coarse, logical number (i.e. scan) associated with the data. - ``SUB_SCAN_NUMBER``: A finer, logical number (i.e. scan) associated with the data. This specifies ``STATE::SUB_SCAN_NUMBER`` (via ``STATE_ID``). - ``STATE_ID``: The state of an interferometer. as these columns frequently identify measurement groupings where the interferometer is consistently dumping. .. code-block:: python import xarray_ms import xarray # Also partition by SCAN_NUMBER and FIELD_ID dt = xarray.open_datatree(ms, partition_schema=["SCAN_NUMBER", "FIELD_ID"]) .. _missing-baselines: Missing Baselines ----------------- Baselines can be missing for distinct ``TIME`` values. This can occur when Measurement Sets are passed through the CASA ``split`` task with ``keepflags=False`` set, for instance. Having all baselines present can be useful for simplifying calibration algorithms and cases where auto-correlations are requested, but none are present in the data. ``xarray-ms`` will impute these missing data points with default values (``nan`` in the case of data, ``1`` in the case of flags). Irregular Grid Warnings ----------------------- Given the specified partitioning schema, ``xarray-ms`` will partition the MSv2 by the supplied columns and attempt to establish a regular ``(time, baseline_id, frequency)`` grid. If this is not possible, three classes of warning can be issued, related to each of the three dimensions. :class:`~xarray_ms.errors.IrregularTimeGridWarning` +++++++++++++++++++++++++++++++++++++++++++++++++++ This warning is raised when it is impossible to identify a unique ``INTERVAL`` value for a partition. This is required to assign a single ``integration_time`` attribute to the ``time`` coordinate. The above check is relaxed slightly by excluding the last time in the partition (to handle averaged data) and by allowing a degree of jitter in the ``INTERVAL`` column. Generally, this happens if the requested partitioning schema does not satisfy the criteria described in :ref:`time-partitioning`. The solution is to experiment with other partitioning columns. Should the user wish to continue with this case, ``xarray-ms`` sets ``integration_time=nan`` and adds ``(time, baseline_id)``-shaped, ``TIME`` and ``INTEGRATION_TIME`` columns. Downstream applications should account for this. :class:`~xarray_ms.errors.IrregularChannelGridWarning` ++++++++++++++++++++++++++++++++++++++++++++++++++++++ This warning is raised when it is impossible to identify a unique ``CHAN_WIDTH`` value for the partition. This is required to assign a single ``channel_width`` attribute to the ``frequency`` coordinate. Should the user wish to continue with this case ``xarray-ms`` sets ``channel_width=nan`` and adds ``(frequency,)``-shaped ``CHANNEL_WIDTH`` columns. Downstream application should account for this. :class:`~xarray_ms.errors.IrregularBaselineGridWarning` +++++++++++++++++++++++++++++++++++++++++++++++++++++++ This warning is raised when baselines were missing for a particular timestep. This is a relatively benign warning as ``xarray-ms`` will impute missing values (See :ref:`missing-baselines`).