As surveys grow, the challenge is how to explore and interpret the increasing quantity of data. For this, removing the observational biases and reducing the dimensionality of the data are fundamental. A promising avenue to do this is a self-supervised deep learning algorithm called contrastive learning.
Contrastive learning is especially effective with noisy or biased data and produces a representation space that organizes the data by similarity. To achieve this, it maximizes the similarity between different views of the same object while minimizing the similarity between different objects. This thesis describes the fundamental components and different implementations of contrastive learning. We review how this method has been applied in the field of astrophysics and discuss its potential for use with astronomical data.
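To make the objective concrete, the following is a minimal sketch of one common contrastive loss, NT-Xent (normalized temperature-scaled cross-entropy), written with NumPy. The text does not state which loss is used in this work, so the choice of NT-Xent, the function name `nt_xent_loss` and the `temperature` value are assumptions made for illustration only.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Illustrative NT-Xent loss (an assumed choice, not confirmed by the text).

    z_a[i] and z_b[i] are embeddings of two views of the same object i.
    Every other embedding in the batch is treated as a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    # Normalize to unit length so the dot product is the cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # An embedding must not be compared with itself.
    np.fill_diagonal(sim, -np.inf)
    # Row i (first view) has its positive at i + n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the positive pair.
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = nt_xent_loss(z, z + 0.01 * rng.normal(size=z.shape))
loss_unrelated = nt_xent_loss(z, rng.normal(size=z.shape))
```

In this toy case the loss is lower when the two inputs are near-identical views of the same objects than when they are unrelated, which is the behaviour the optimization exploits.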
Integral field spectroscopic (IFS) galaxy surveys are a good example of how datasets have grown in complexity and in volume. To minimize the dependence of this type of data on its inherent observational effects, we use contrastive learning. Specifically, we use physically resolved stellar properties of galaxies: their V-band reconstructed image, luminosity-weighted age and metallicity, and kinematics. To generate the different views required for the contrastive model, we apply transformations that mimic the diversity of the observing conditions, obtaining representations of the data that are invariant to them. This allows us to analyze the distribution of common physical structures in the representation space and thus trace different stages of galaxy evolution and formation paths. When only the information related to the internal structures is considered, galaxies cluster into two well-known groups: rotating main-sequence discs and massive slow rotators. If the information on the integrated physical properties is preserved, a third group of quenched, rotation-dominated galaxies naturally emerges.
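As an illustration of how two views of the same galaxy might be produced, the sketch below applies a random rotation, a random flip and additive noise to a stack of property maps. The specific transformations, the `(channels, height, width)` layout, the function name `random_view` and the `noise_sigma` value are all hypothetical; the text only states that the transformations mimic the diversity of observing conditions.

```python
import numpy as np

def random_view(maps, rng, noise_sigma=0.05):
    """Return one randomly transformed view of a stack of square maps.

    maps: array of shape (channels, height, width), e.g. image, age,
    metallicity and kinematics maps (an assumed layout).
    """
    # Random rotation by a multiple of 90 degrees (sky orientation is arbitrary).
    view = np.rot90(maps, k=int(rng.integers(0, 4)), axes=(1, 2))
    # Random horizontal flip.
    if rng.random() < 0.5:
        view = np.flip(view, axis=2)
    # Additive Gaussian noise as a crude stand-in for varying signal-to-noise.
    return view + rng.normal(0.0, noise_sigma, size=view.shape)

rng = np.random.default_rng(1)
galaxy = rng.normal(size=(4, 32, 32))  # synthetic stand-in, not real data
view_1 = random_view(galaxy, rng)
view_2 = random_view(galaxy, rng)
```

The pair `view_1`, `view_2` would then be passed through the encoder and treated as a positive pair, so that the learned representation becomes insensitive to the applied transformations.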
To find complex relations between the spatially resolved structures of galaxies and their accretion histories, we combine high-dimensional IFS data, deep learning and numerical simulations to infer the evolutionary paths of galaxies. For this, we generate 10,000 simulated galaxies from the TNG50 hydro-cosmological simulation to compare with the 10,000 galaxies observed in MaNGA (Mapping Nearby Galaxies at APO), thus producing a mock MaNGA sample. We then analyze the implications of the procedure used to emulate observations and evaluate how well the simulated galaxies reproduce the properties of the observed ones. In general, the forward-modelled sample recovers the trends in age, chemical composition and kinematics of the observed sample. However, some of the discrepancies found require further investigation and may lead to a better physical understanding.
This thesis discusses the potential of self-supervised learning tools to map complex galaxy data to a representation space that organizes them by similarity. This allows us to perform an unsupervised clustering of their observable parameters that minimizes observational biases, recovering physically motivated classes of galaxies in a fully data-driven way. Furthermore, this space can be used as a common ground for comparison with simulations. For this purpose, we generate a forward-modelled galaxy dataset that emulates IFS observations and provide a preliminary analysis of the empirical and predicted datasets. This thesis therefore represents the first steps towards probing the physical processes modelled by hydro-cosmological simulations to unprecedented scales through the use of a self-supervised framework and uneven IFS data.