CAMPA - Conditional Autoencoder for Multiplexed Pixel Analysis

CAMPA is a framework for quantitative analysis of subcellular multi-channel imaging data. It consists of a workflow that generates consistent subcellular landmarks (CSLs) using a conditional variational autoencoder (cVAE). The output of the CAMPA workflow is an AnnData object that contains interpretable per-cell features summarising the molecular composition and spatial arrangement of CSLs inside each cell.

Campa title figure

Manuscript

Please see our preprint “Learning consistent subcellular landmarks to quantify changes in multiplexed protein maps” (Spitzer, Berry et al. (2022)) to learn more.

Contributing

We welcome any contributions! Before you start, check out our contributing guide.

Installation

CAMPA is developed for and tested on Python 3.9.

Installation with pip

pip install campa

Overview

CAMPA is a framework for analysis of subcellular multi-channel imaging data. It consists of a workflow that generates consistent subcellular landmarks (CSLs) and several plotting functions. The output of the CAMPA workflow is an anndata.AnnData object that contains per-cell features describing the molecular composition and spatial arrangement of CSLs inside each cell.

CAMPA provides a high-level API for easy access to the workflow, low-level class-based access to all functions for more detailed control, and a CLI for all steps except the manual cluster annotation step. Depending on your data size, you might want to run long-running steps of the CAMPA workflow on an HPC cluster.

CAMPA config

To find data and experiment folders and dataset-specific parameters, CAMPA uses a config, campa.constants.campa_config. Config values can be set directly in memory. For more information on this, see the Setup and download data tutorial.

In order to provide a persistent config, CAMPA attempts to read config values from a campa.ini file in the following locations:

  • the current directory

  • $HOME/.config/campa

and will use the first file that it finds.

There is an example config file in campa/campa.ini.example. Create campa.ini with system-specific paths to experiment and data folders by running:

campa setup

This creates a default config in $HOME/.config/campa/campa.ini. Note that for following the tutorials it is not necessary to complete this step and write a config file to disk, as all tutorials set the necessary config values at the start.
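
Alternatively, the same values can be set and persisted programmatically. The following is a minimal sketch using the config object described above (paths are placeholders):

from campa.constants import campa_config

# set paths in memory (placeholders -- replace with your own locations)
campa_config.EXPERIMENT_DIR = "/path/to/experiments"
campa_config.BASE_DATA_DIR = "/path/to/data"

# persist the config so that future sessions pick it up automatically
campa_config.write("/path/to/campa.ini")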

Data format

The pixel-level multi-channel data is represented as a campa.data.MPPData (Multiplexed Pixel Profile Data) object.

It is represented on disk by a folder containing npy and csv files. Let \(n_o\) be the number of objects (cells) in the data, \(n_p\) the number of pixels, and \(m\) the number of measurements per pixel (e.g. the number of protein channels). The data is saved in:

  • mpp.npy: \(n_p \times m\), per-pixel values

  • x.npy, y.npy: \(n_p \times 1\), spatial coordinates of each pixel

  • obj_ids.npy: \(n_p \times 1\), object id (cell) that each pixel belongs to

  • channels.csv: \(m\), protein channel names

  • metadata.csv: \(n_o\), metadata information (perturbation, cell cycle, etc) for each object (cell)

During processing of the data, additional numpy files might be created. For example, after training a cVAE with \(c\) conditions and clustering its latent space of dimension \(l\) into \(k\) CSLs, the results directory will also contain:

  • conditions.npy: \(n_p \times c\), conditions used to train the cVAE

  • latent.npy: \(n_p \times l\), latent space of the cVAE used for clustering CSLs

  • clustering.npy: \(n_p \times k\), clustering of latent.npy

  • cluster_annotation.csv: \(k\), mapping of clusters to cluster names

For more information on the data representation, see also the Data representation tutorial. More data formats, including reading data from (multiple) image files will be supported in the future.
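
As an illustration of this layout, the files in such a folder can be inspected directly with numpy and pandas (a minimal sketch; the folder path is a placeholder pointing into the example data):

import numpy as np
import pandas as pd

data_dir = "example_data/184A1_unperturbed/I09"  # placeholder path

mpp = np.load(f"{data_dir}/mpp.npy")          # n_p x m per-pixel values
x = np.load(f"{data_dir}/x.npy")              # n_p spatial x coordinates
y = np.load(f"{data_dir}/y.npy")              # n_p spatial y coordinates
obj_ids = np.load(f"{data_dir}/obj_ids.npy")  # n_p object (cell) ids
channels = pd.read_csv(f"{data_dir}/channels.csv")  # m protein channel names
metadata = pd.read_csv(f"{data_dir}/metadata.csv")  # n_o rows of per-object metadata

# all pixel-level arrays share the same first dimension
assert len(mpp) == len(x) == len(y) == len(obj_ids)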

Data config

To tell CAMPA how to load specific data and set up the conditions contained in specific datasets, a data_config is used. All available data configs are in campa.constants.CAMPAConfig.data_configs. E.g., for the example data used in the tutorials, the data_config is identified by ExampleData and points to the example data config file.

The data config file is specific per dataset but has to contain the following variables:

  • DATA_DIR: path to the folder containing the data; can use campa.constants.CAMPAConfig.BASE_DATA_DIR to be device-agnostic.

  • DATASET_DIR: path to the folder where training/testing datasets derived from this data should be stored.

  • OBJ_ID: name of column in metadata.csv that contains a unique object identifier.

  • CHANNELS_METADATA: name of CSV file containing channels metadata (relative to DATA_DIR). Is expected to contain channel names in column “name”.

  • CONDITIONS: dictionary of conditions to be used for cVAE models. Keys are column names in metadata.csv, and values are all possible values for this condition. This will be used to convert conditions to one-hot encoded vectors.

To use a different dataset, simply create a new constants file with dataset-specific settings and register it with campa.constants.campa_config using campa.constants.CAMPAConfig.add_data_config(), as sketched below.
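
As a sketch (the file name, dataset paths, and condition values here are illustrative, not part of CAMPA), a hypothetical MyData_constants.py could look like this:

# MyData_constants.py -- hypothetical data config, adapt to your dataset
import os

from campa.constants import campa_config

# build paths from BASE_DATA_DIR to stay device-agnostic
DATA_DIR = os.path.join(campa_config.BASE_DATA_DIR, "my_dataset")
DATASET_DIR = os.path.join(campa_config.BASE_DATA_DIR, "my_dataset", "datasets")
OBJ_ID = "mapobject_id"
CHANNELS_METADATA = "channels_metadata.csv"  # relative to DATA_DIR, needs a "name" column
CONDITIONS = {
    # keys are columns in metadata.csv, values are all possible values
    "perturbation": ["normal", "drugA", "drugB"],
    "cell_cycle": ["M", "G1", "S", "G2"],
}

The new data config is then registered with:

from campa.constants import campa_config

campa_config.add_data_config("MyData", "/path/to/MyData_constants.py")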

Workflow

CAMPA contains a high-level API that can be easily used to create datasets, train models, and extract features. Settings for the different stages of the workflow are communicated via parameter files. These are Python files, usually containing a dictionary of settings, that are used by the individual steps; see the sketch below. You can find a complete set of example parameter files here.
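
For instance, the data parameter file used in the tutorials is a Python module defining a data_params dictionary. A shortened sketch (see the example parameter files for the full set of keys):

# example_data_params.py (shortened sketch)
data_params = {
    "dataset_name": "184A1_test_dataset",
    "data_config": "ExampleData",
    "data_dirs": ["184A1_unperturbed/I09", "184A1_unperturbed/I11"],
    "condition": ["perturbation_duration_one_hot", "cell_cycle_one_hot"],
    "subsample": True,
    "subsample_kwargs": {"frac": 0.1},
    "neighborhood": True,
    "neighborhood_size": 3,
    "normalise": True,
    "normalise_kwargs": {"percentile": 98.0, "rescale_values": []},
    "seed": 42,
}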

The workflow consists of the following steps:

  • Set up the config and download data by following along with the Setup and download data tutorial.

  • Create a subsampled pixel-level dataset for neural network training. This is done either by using the API function campa.data.create_dataset() or by using the CLI:

    campa create_dataset ...
    

    For more information, see the Dataset for training models tutorial.

  • Train a conditional variational autoencoder to generate a condition-independent latent representation. This is done either by using the API function campa.tl.run_experiments() or by using the CLI:

    campa train ...
    

    For more information, see the Train and evaluate models tutorial.

  • Cluster cVAE latent representation into CSLs. This is done in three steps:

    • First, the data is subsampled and clustered, because we would like the clustering to be interactive and feasible to compute on a laptop. If you have more time or access to GPUs, you could also consider skipping the subsampling step and clustering all data directly. Use the API function campa.tl.create_cluster_data() or the CLI:

      campa cluster <EXPERIMENT> create ...
      

      Optionally, after this step a manual re-clustering or annotation of clusters can be done. See the Cluster data into CSLs tutorial for more details.

    • To project the clustering to the entire dataset, the model needs to be used to predict the latent representation on all data. It is recommended to run this step in a script, as this might take a while for large datasets. Use the API function campa.tl.prepare_full_dataset() or the CLI:

      campa cluster <EXPERIMENT> prepare-full ...
      
    • Finally, the clustering can be projected to the entire dataset. Use the API function campa.tl.project_cluster_data() or the CLI:

      campa cluster <EXPERIMENT> project ...
      

    For more information, see the Cluster data into CSLs tutorial.

  • Extract features from CSLs to quantitatively compare molecular intensity differences and spatial re-localisation of proteins in different conditions. Use the API function campa.tl.extract_features() or the CLI:

    campa extract_features ...
    

    For more information, see the Extract features from CSLs tutorial.
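
Put together, the API calls mirror these CLI steps. The following is a schematic sketch, not runnable as-is: the experiment directory, the cluster-data directory, and the feature_params attribute name are assumptions based on the CLI documentation and the data_params convention.

from campa.data import create_dataset
from campa.tl import (
    create_cluster_data,
    extract_features,
    prepare_full_dataset,
    project_cluster_data,
    run_experiments,
)
from campa.utils import load_config

# 1. create the training dataset from a data parameter file
create_dataset(load_config("params/example_data_params.py").data_params)

# 2. train and evaluate the models defined by an experiment config
#    (exps holds experiment definitions built from the config; see campa.tl.Experiment)
run_experiments(exps, mode="all")

# 3. cluster the (subsampled) latent space, then project it to the full dataset
create_cluster_data("example_experiment")  # experiment directory (placeholder)
prepare_full_dataset("example_experiment")
project_cluster_data("example_experiment", "aggregated/sub-0.005")  # cluster-data dir, as in the CLI

# 4. extract per-cell features from the CSLs
extract_features(load_config("params/example_feature_params.py").feature_params)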

API

High-level functions for processing data, training and evaluating models and plotting results.

Data

data.create_dataset(params)

Create a NNDataset.

data.load_example_data([data_dir])

Download example data to data_dir.

data.load_example_experiment([experiment_dir])

Download example experiment to experiment_dir.

Tools

tl.run_experiments(exps[, mode])

Execute experiments.

tl.create_cluster_data(experiment_dir[, ...])

Create (subsampled) data for clustering.

tl.prepare_full_dataset(experiment_dir[, ...])

Prepare all data for clustering by predicting cluster-rep.

tl.project_cluster_data(experiment_dir, ...)

Project existing clustering to new data.

tl.load_full_data_dict(exp[, keys, ...])

Load mpp_datas used in experiment in a dict.

tl.get_clustered_cells(mpp_datas, cl, ...[, ...])

Get num_objs example cells for each mpp_data.

tl.add_clustering_to_adata(data_dir, ...[, ...])

Add clustering to adata.

tl.extract_features(params)

Extract features from clustered dataset using FeatureExtractor.

tl.thresholded_count(df[, threshold])

Count largest CSL objects per cell.

tl.thresholded_median(df[, threshold])

Calculate median area of large CSL objects per cell.

Plotting

pl.plot_mean_intensity(adata[, groupby, ...])

Show per cluster intensity of each channel.

pl.get_intensity_change(adata, groupby[, ...])

Get data for plotting intensity comparison with plot_intensity_change().

pl.plot_intensity_change(adata, ...[, ...])

Plot mean intensity differences between perturbations or clusters.

pl.plot_mean_size(adata[, groupby_row, ...])

Plot mean cluster sizes per cell, grouped by different columns in obs.

pl.plot_size_change(adata[, groupby_row, ...])

Plot mean size differences between perturbations or clusters.

pl.plot_object_stats(adata, group_key[, ...])

Barplot of object statistics.

pl.plot_co_occurrence(adata, cluster1, ...)

Plot co-occurrence for one cluster-cluster pair.

pl.plot_co_occurrence_grid(adata, condition)

Plot co-occurrence for all cluster-cluster pairs in a grid.

pl.annotate_img(img[, annotation, from_col, ...])

Annotate cluster image.

Other

constants.campa_config

Configuration for CAMPA.

utils.load_config(config_file)

Load configuration file and return configuration object.

utils.merged_config(config1, config2)

Update config1 with config2.

utils.init_logging([level])

Set up logging for CAMPA.

Classes

Classes representing data, experiment, and evaluation objects.

constants.CAMPAConfig()

Configuration for CAMPA.

data.MPPData(metadata, channels, data, ...)

Pixel-level data representation.

data.NNDataset(dataset_name[, data_config])

Dataset for training and evaluation of neural networks.

tl.Experiment(config)

Experiment stored on disk with neural network.

tl.Estimator(exp)

Neural network estimator.

tl.Predictor(exp[, batch_size])

Predict results from trained model.

tl.ModelComparator(exps[, save_dir])

Compare experiments.

tl.Cluster(config[, cluster_mpp, save_config])

Cluster data.

tl.FeatureExtractor(exp, data_dir, cluster_name)

Extract features from clustering.

tl.BaseAEModel(**kwargs)

Base class for AE and VAE models.

tl.VAEModel(**kwargs)

VAE with simple Gaussian prior (trainable with KL loss).

tl.ModelEnum(value)

Neural network models.

tl.LossEnum(value)

Loss functions.

CLI

Command line interface for CAMPA.

Use this to execute long-running steps of the CAMPA workflow (e.g. clustering large data or calculating co-occurrence features) in a script on e.g. a HPC system.

See the Workflow documentation for a short introduction to the CLI commands and the Tutorials for the CLI commands needed to reproduce the example workflow.

Basic usage

CLI for CAMPA - conditional autoencoder for multiplexed image analysis.

usage: campa <command> [<args>]

Available subcommands are:
setup               Create configuration file ``campa.ini``
create_dataset      Create NNDataset using parameter file
train               Train and evaluate models defined by experiment config
cluster             Cluster data and project clustering
extract_features    Extract features from clustered dataset
Positional Arguments
command

Subcommand to run.

campa setup

Create configuration file campa.ini.

usage: campa setup [-h] [--quiet]
Named Arguments
--quiet

create default configuration file without asking for user input.

Default: False

campa create_dataset

Create NNDataset using parameter file

usage: campa create_dataset [-h] params
Positional Arguments
params

path to data_params.py

campa train

Train and evaluate models defined by experiment config

usage: campa train [-h] (--config CONFIG | --experiment-dir EXPERIMENT_DIR)
                   [--exp-name [EXP_NAME ...]]
                   {all,train,evaluate,trainval,compare}
Positional Arguments
mode

Possible choices: all, train, evaluate, trainval, compare

Default: “all”

Named Arguments
--config

path to experiment_config.py

--experiment-dir

experiment_dir containing experiment folders. Relative to EXPERIMENT_DIR

--exp-name

Select names of experiments to run. If not specified, all available experiments are run

campa cluster

Cluster data and project clustering.

usage: campa cluster [-h] experiment-dir {create,prepare-full,project} ...
Positional Arguments
experiment-dir

relative to EXPERIMENT_DIR

Sub-commands
create

Create (subsampled) data for clustering. Uses all data used to train exp

campa cluster create [-h] [--subsample] [--frac FRAC] [--save-dir SAVE_DIR]
                     [--cluster]
Named Arguments
--subsample

Subsample the data

Default: False

--frac

Fraction of pixels to use for clustering

Default: 0.005

--save-dir

directory to save subsampled cluster data, relative to experiment-dir. Default is aggregated/sub-FRAC

--cluster

use cluster parameters in Experiment config to cluster the subsetted data.

Default: False

prepare-full

Prepare full data for clustering. Predicts cluster-rep.

campa cluster prepare-full [-h] [--save-dir SAVE_DIR]
                           [--data-dirs DATA_DIRS [DATA_DIRS ...]]
Named Arguments
--save-dir

directory to save prepared full data to, relative to experiment-dir.

Default: “aggregated/full_data”

--data-dirs

Data directories to prepare. Defaults to experiment data directories

project

Project existing clustering.

campa cluster project [-h] [--save-dir SAVE_DIR] [--data-dir DATA_DIR]
                      [--cluster-name CLUSTER_NAME]
                      cluster-data-dir
Positional Arguments
cluster-data-dir

directory in which clustering is stored relative to experiment-dir. Usually in aggregated/sub-FRAC

Named Arguments
--save-dir

directory in which data to be projected is stored, relative to experiment-dir.

Default: “aggregated/full_data”

--data-dir

data to project. If not specified, project all data_dirs in save_dir

--cluster-name

name of clustering to project

Default: “clustering”
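
Putting the three subcommands together, a typical clustering run could look like this (example_experiment is a placeholder experiment directory; flags as documented above):

campa cluster example_experiment create --subsample --frac 0.005 --cluster
campa cluster example_experiment prepare-full --save-dir aggregated/full_data
campa cluster example_experiment project aggregated/sub-0.005 --save-dir aggregated/full_data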

campa extract_features

Extract features from clustered dataset.

usage: campa extract_features [-h] params
Positional Arguments
params

path to feature_params.py

Tutorials

This section contains several detailed tutorials showcasing different aspects of CAMPA. See the Workflow documentation for a short introduction to the CAMPA workflow of creating and analysing CSLs.

These tutorials will be using an example 4i dataset to generate and analyse CSLs for unperturbed and Meayamycin-perturbed cells.

Setup and download data

This tutorial shows how to set up CAMPA and download an example dataset. To follow along with this and the following tutorials, please execute the following steps first:

  • install CAMPA (pip install campa)

  • download the tutorials to a new folder, referred to as CAMPA_DIR in the following

  • navigate to CAMPA_DIR in the terminal and start this notebook with jupyter notebook setup.ipynb

Note that the following notebooks assume that you will run them from the same folder as this notebook (CAMPA_DIR). If this is not the case, adjust CAMPA_DIR at the top of each notebook to point to the folder that you run this notebook in.

[1]:
from pathlib import Path

# set CAMPA_DIR to the current working directory
CAMPA_DIR = Path.cwd()
print(CAMPA_DIR)
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test
Download parameter files

Before configuring CAMPA, we need to ensure that all parameter files for configuring the different CAMPA steps are present in the params subfolder. Note that in general, these files don't need to be in a folder named params, but the following tutorials will follow this convention. Let us download the necessary parameter files from the GitHub repository.

[2]:
import glob

import requests

# ensure params folder exists
(CAMPA_DIR / "params").mkdir(parents=True, exist_ok=True)

# download parameter files from git
for param_file in [
    "ExampleData_constants",
    "example_data_params",
    "example_experiment_params",
    "example_feature_params",
]:
    r = requests.get(f"https://raw.github.com/theislab/campa/main/notebooks/params/{param_file}.py")
    with open(CAMPA_DIR / "params" / f"{param_file}.py", "w") as f:
        f.write(r.text)

print(f'Files in {CAMPA_DIR / "params"}: {glob.glob(str(CAMPA_DIR / "params" / "*"))}')
Files in /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params: ['/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_experiment_params.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_data_params.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_feature_params.py']
Set up CAMPA config

CAMPA has one main config file: campa.ini. The overview describes how you can create this config file from the command line, but here we will see how we can create a config from within the campa module using the config file representation campa.constants.campa_config.

[3]:
from campa.constants import campa_config

print(campa_config)
2022-11-25 09:57:06.641175: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 09:57:24.354282: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 09:57:27.507035: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING: EXPERIMENT_DIR is not initialised. Please create a config with "campa setup" or set campa_config.EXPERIMENT_DIR manually.
WARNING: BASE_DATA_DIR is not initialised. Please create a config with "campa setup" or set campa_config.BASE_DATA_DIR manually.
CAMPAConfig (fname: None)
EXPERIMENT_DIR: None
BASE_DATA_DIR: None
CO_OCC_CHUNK_SIZE: None

If you have not yet set up a config, this should look pretty empty. The lines WARNING: EXPERIMENT_DIR is not initialised and WARNING: BASE_DATA_DIR is not initialised are expected in this case and alert us that we need to set EXPERIMENT_DIR and BASE_DATA_DIR so that CAMPA knows where experiments and data are stored.

Let us set the EXPERIMENT_DIR and the BASE_DATA_DIR, and add the ExampleData data config. Here, we set the data and experiments paths relative to CAMPA_DIR defined above.

[4]:
# point to example data folder in which we will download the example data
campa_config.BASE_DATA_DIR = CAMPA_DIR / "example_data"
# experiments will be stored in example_experiments
campa_config.EXPERIMENT_DIR = CAMPA_DIR / "example_experiments"
# add ExampleData data_config (pointing to ExampleData_constants file that we just downloaded)
campa_config.add_data_config("ExampleData", CAMPA_DIR / "params/ExampleData_constants.py")
# set CO_OCC_CHUNK_SIZE (a parameter making co-occurrence calculation more memory efficient)
campa_config.CO_OCC_CHUNK_SIZE = 1e7

print(campa_config)
CAMPAConfig (fname: None)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

We can now save the config to quickly load it later on. Here, we store the config in the params directory in the current folder.

[5]:
# save config
campa_config.write(CAMPA_DIR / "params" / "campa.ini")
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini

By default, campa looks for config files in the current directory and $HOME/.config/campa, but loading a config from any other file is also easy:

[6]:
# read config from non-standard location by setting campa_config.config_fname
campa_config.config_fname = CAMPA_DIR / "params" / "campa.ini"
print(campa_config)
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

Download example dataset

To follow along with the workflow tutorials, you need to download the example dataset.

Here, we store the example data in the BASE_DATA_DIR just configured in the config.

[7]:
from campa.data import load_example_data

example_data_path = load_example_data(Path(campa_config.BASE_DATA_DIR).parent)
print("Example data downloaded to: ", example_data_path)
Path or dataset does not yet exist. Attempting to download...
{'x-amz-id-2': 'HSPvG563oJllNzdrsV13AQCjHZ7P9FyV0mTfxhkmBn5sm1orzTIridTerZSrwwqhhJja8adJlLA=', 'x-amz-request-id': 'D1AWZ3CZHG6SQ9A8', 'Date': 'Fri, 25 Nov 2022 09:07:47 GMT', 'x-amz-replication-status': 'COMPLETED', 'Last-Modified': 'Fri, 28 Oct 2022 11:44:27 GMT', 'ETag': '"6300ee9228b5e78480a3a5a540e85730"', 'x-amz-tagging-count': '1', 'x-amz-server-side-encryption': 'AES256', 'Content-Disposition': 'attachment; filename="example_data.zip"', 'x-amz-version-id': 'WbEd4ye51WteRY2_BZaTchKIFVKkAxuw', 'Accept-Ranges': 'bytes', 'Content-Type': 'application/zip', 'Server': 'AmazonS3', 'Content-Length': '126837954'}
attachment; filename="example_data.zip"
Guessed filename: example_data.zip
Downloading... 126837954
123866it [00:04, 28644.04it/s]
Example data downloaded to:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data

The example data is now stored in your campa_config.BASE_DATA_DIR folder.

The data is represented as an MPPData object. For more information on this class and the data representation on disk see the Data representation tutorial.

Data representation

MPPData is the main data representation for pixel-level data used by CAMPA. See also the overview for more information about the on-disk file format.

When possible, MPPData lazily loads the source numpy files (using numpy.memmap). If lazy loading fails, the file is loaded into memory.
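
This mirrors numpy's memory mapping. As an illustration of the mechanism only (not CAMPA's internal code):

import numpy as np

# memory-map the on-disk array instead of reading it all into RAM
mpp = np.load("mpp.npy", mmap_mode="r")

# data is only read from disk when a slice is actually accessed
first_pixel = mpp[0]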

[1]:
from pathlib import Path

from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt

from campa.data import MPPData, load_example_data
from campa.utils import init_logging
from campa.constants import campa_config

# init logging with level INFO=20, WARNING=30
init_logging(level=30)
# ensure that example data is downloaded
load_example_data()
# read correct campa_config -- created with setup.ipynb
CAMPA_DIR = Path.cwd()
campa_config.config_fname = CAMPA_DIR / "params/campa.ini"
print(campa_config)
2022-11-25 10:19:56.178552: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 10:19:57.965331: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 10:19:58.277749: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

The MPPData class

We can construct an MPPData instance from numpy and csv files (as described in the overview) using MPPData.from_data_dir.

Note that the data_dir provided to the function should be relative to base_dir, which can either be provided directly to the function or otherwise defaults to data_config.DATA_DIR.

[2]:
mpp_data = MPPData.from_data_dir("184A1_unperturbed/I09", data_config="ExampleData")
print(mpp_data)
MPPData for ExampleData (247872 mpps with shape (1, 1, 35) from 20 objects). Data keys: ['y', 'obj_ids', 'x', 'mpp', 'labels'].

All data is accessible through MPPData.data, and there are several convenience attributes for faster access. Attributes x and y are spatial coordinates of each data point (pixel), obj_ids map pixels to objects (cells), and mpp contains the multiplexed pixel profiles, a #pixels x #channels matrix.

[3]:
print(mpp_data.data("x"))
print(mpp_data.x)
print(mpp_data.y)
print(mpp_data.mpp.shape)
[1448 1452 1453 ... 1768 1769 1770]
[1448 1452 1453 ... 1768 1769 1770]
[ 106  106  106 ... 1832 1832 1832]
(247872, 1, 1, 35)

Note that the shape of the mpp_data is #pixels x 1 x 1 x #channels. This is because we can add a square neighbourhood around each pixel to the mpp_data.

The centre mpp can be accessed using MPPData.center_mpp:

[4]:
print(mpp_data.center_mpp)
[[142 108 166 ... 110 121 144]
 [147 113 136 ... 109 127 146]
 [157 121 118 ... 105 126 145]
 ...
 [180 111 139 ... 109 123 255]
 [157 105 156 ... 115 128 249]
 [137 102 178 ... 113 140 234]]

Metadata is stored in MPPData.metadata and channel information in MPPData.channels.

[5]:
display(mpp_data.metadata)
display(mpp_data.channels)
mapobject_id plate_name well_name well_pos_y well_pos_x tpoint zplane label is_border mapobject_id_cell ... perturbation secondary_only siRNA perturbation_duration LocalDensity_Nuclei_800 TR_factor TR_norm TR TR_factor_DMSO-unperturbed TR_norm_DMSO-unperturbed
0 324138 plate01 I09 0 2 0 0 3 0 324079 ... normal False NaN normal 0.000116 0.974852 409.119591 419.673544 0.974852 409.119591
1 291041 plate01 I09 0 3 0 0 12 0 290975 ... normal False NaN normal 0.000190 0.974852 432.410658 443.565445 0.974852 432.410658
2 383793 plate01 I09 0 4 0 0 19 0 383757 ... normal False NaN normal 0.000118 0.974852 366.781820 376.243596 0.974852 366.781820
3 383341 plate01 I09 1 3 0 0 49 0 383287 ... normal False NaN normal 0.000149 0.974852 378.254582 388.012319 0.974852 378.254582
4 345908 plate01 I09 1 4 0 0 15 0 345871 ... normal False NaN normal 0.000118 0.974852 308.057232 316.004105 0.974852 308.057232
5 205776 plate01 I09 2 1 0 0 8 0 205742 ... normal False NaN normal 0.000128 0.974852 348.677269 357.672008 0.974852 348.677269
6 205790 plate01 I09 2 1 0 0 22 0 205756 ... normal False NaN normal 0.000149 0.974852 417.591761 428.364268 0.974852 417.591761
7 256546 plate01 I09 2 3 0 0 8 0 256502 ... normal False NaN normal 0.000138 0.974852 236.845799 242.955649 0.974852 236.845799
8 231620 plate01 I09 2 4 0 0 31 0 231520 ... normal False NaN normal 0.000123 0.974852 117.952101 120.994881 0.974852 117.952101
9 359378 plate01 I09 3 1 0 0 19 0 359333 ... normal False NaN normal 0.000145 0.974852 358.652196 367.904255 0.974852 358.652196
10 359391 plate01 I09 3 1 0 0 32 0 359346 ... normal False NaN normal 0.000126 0.974852 224.221344 230.005525 0.974852 224.221344
11 359393 plate01 I09 3 1 0 0 34 0 359348 ... normal False NaN normal 0.000122 0.974852 500.750849 513.668590 0.974852 500.750849
12 259784 plate01 I09 3 3 0 0 29 0 259737 ... normal False NaN normal 0.000135 0.974852 339.395416 348.150713 0.974852 339.395416
13 331833 plate01 I09 4 0 0 0 2 0 331804 ... normal False NaN normal 0.000132 0.974852 375.379685 385.063259 0.974852 375.379685
14 366486 plate01 I09 4 2 0 0 33 0 366427 ... normal False NaN normal 0.000173 0.974852 339.261575 348.013419 0.974852 339.261575
15 366493 plate01 I09 4 2 0 0 40 0 366434 ... normal False NaN normal 0.000140 0.974852 322.527668 330.847832 0.974852 322.527668
16 200792 plate01 I09 4 3 0 0 50 0 200738 ... normal False NaN normal 0.000155 0.974852 270.571286 277.551145 0.974852 270.571286
17 248082 plate01 I09 5 4 0 0 17 0 248037 ... normal False NaN normal 0.000156 0.974852 244.189292 250.488581 0.974852 244.189292
18 248102 plate01 I09 5 4 0 0 37 0 248057 ... normal False NaN normal 0.000147 0.974852 502.765703 515.735421 0.974852 502.765703
19 248104 plate01 I09 5 4 0 0 39 0 248059 ... normal False NaN normal 0.000086 0.974852 461.598482 473.506220 0.974852 461.598482

20 rows × 44 columns

name
channel_id
0 00_EU
1 01_CDK9_pT186
2 01_PABPC1
3 02_CDK7
4 03_CDK9
5 03_RPS6
6 05_GTF2B
7 05_Sm
8 07_POLR2A
9 07_SETD1A
10 08_H3K4me3
11 09_CCNT1
12 09_SRRM2
13 10_H3K27ac
14 10_POL2RA_pS2
15 11_KPNA2_MAX
16 11_PML
17 12_RB1_pS807_S811
18 12_YAP1
19 13_PABPN1
20 13_POL2RA_pS5
21 14_PCNA
22 15_SON
23 15_U2SNRNPB
24 16_H3
25 17_HDAC3
26 17_SRSF2
27 18_NONO
28 19_KPNA1_MAX
29 20_ALYREF
30 20_SP100
31 21_COIL
32 21_NCL
33 00_DAPI
34 07_H2B
Export and saving
Save mpp data

The MPPData object can be stored on disk using MPPData.write.
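
For example (a sketch; the target directory is a placeholder, and it is assumed here that MPPData.write takes the directory to write to -- see the API reference for the exact arguments):

# write the npy/csv representation of this MPPData to a folder (placeholder path)
mpp_data.write("example_data/184A1_unperturbed/I09_subset")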

Export as csv

Using MPPData.extract_csv, we can extract the MPPData into a pandas DataFrame.

[6]:
mpp_data.extract_csv(data="mpp", obs=["well_name"])
[6]:
name 00_EU 01_CDK9_pT186 01_PABPC1 02_CDK7 03_CDK9 03_RPS6 05_GTF2B 05_Sm 07_POLR2A 07_SETD1A ... 20_ALYREF 20_SP100 21_COIL 21_NCL 00_DAPI 07_H2B x y well_name mapobject_id
0 142 108 166 174 106 121 115 113 114 111 ... 134 104 110 110 121 144 1448 106 I09 324138
1 147 113 136 153 115 125 109 125 110 109 ... 127 110 111 109 127 146 1452 106 I09 324138
2 157 121 118 139 116 114 120 123 116 106 ... 128 106 104 105 126 145 1453 106 I09 324138
3 161 113 128 124 116 119 115 118 113 109 ... 128 111 111 108 132 138 1454 106 I09 324138
4 158 108 127 114 114 133 113 122 116 112 ... 137 108 106 117 131 161 1465 106 I09 324138
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
247867 209 112 157 121 106 147 152 108 118 114 ... 184 106 111 115 125 255 1766 1832 I09 248104
247868 161 113 158 126 111 137 122 132 116 115 ... 190 102 113 114 120 239 1767 1832 I09 248104
247869 180 111 139 128 103 135 109 123 135 108 ... 173 105 108 109 123 255 1768 1832 I09 248104
247870 157 105 156 121 104 156 117 125 135 113 ... 167 109 117 115 128 249 1769 1832 I09 248104
247871 137 102 178 125 107 152 120 118 124 116 ... 162 106 114 113 140 234 1770 1832 I09 248104

247872 rows × 39 columns

Convert to adata

We can convert the MPPData to an AnnData object, and use scanpy functions for further analysis.

Let us create an adata object for one cell. For that, we first need to subset mpp_data to one cell, and then use MPPData.get_adata to create the AnnData object.

[7]:
# subset to one cell
mpp_data_sub = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[:1], copy=True)

# get adata object from information in MPPData
adata = mpp_data_sub.get_adata()
print(adata)
AnnData object with n_obs × n_vars = 12494 × 35
    obs: 'mapobject_id', 'plate_name', 'well_name', 'well_pos_y', 'well_pos_x', 'tpoint', 'zplane', 'label', 'is_border', 'mapobject_id_cell', 'plate_name_cell', 'well_name_cell', 'well_pos_y_cell', 'well_pos_x_cell', 'tpoint_cell', 'zplane_cell', 'label_cell', 'is_border_cell', 'is_mitotic', 'is_mitotic_labels', 'is_polynuclei_HeLa', 'is_polynuclei_HeLa_labels', 'is_polynuclei_184A1', 'is_polynuclei_184A1_labels', 'is_SBF2_Sphase_labels', 'is_SBF2_Sphase', 'Heatmap-48', 'cell_cycle', 'description', 'dimensions', 'id', 'cell_type', 'EU', 'duration', 'perturbation', 'secondary_only', 'siRNA', 'perturbation_duration', 'LocalDensity_Nuclei_800', 'TR_factor', 'TR_norm', 'TR', 'TR_factor_DMSO-unperturbed', 'TR_norm_DMSO-unperturbed'
    var: 'name'
    obsm: 'spatial'

The resulting adata object contains pixels as obs and channels as var. It also contains all mpp_data.metadata columns.

We can now use all scanpy functions on this object, e.g. spatial plotting:

[8]:
import scanpy as sc

sc.pl.embedding(adata, basis="spatial", color=["00_DAPI"])
_images/notebooks_mpp_data_15_0.png
Visualisation

We can visualise several channels of either the whole data or of individual objects (e.g. cells). First, let us plot all objects contained in MPPData with their spatial coordinates x and y using MPPData.get_img.

If you would like to get an image with specific channels, you can specify the channel_ids. The MPPData.get_channel_ids function helps map channel names to ids.

Note that as the mpp values might be larger than 255 for outliers, we use np.clip to bring all values into the range 0-255 for plotting.

[9]:
channel_ids = mpp_data.get_channel_ids(["09_SRRM2", "11_PML", "21_NCL"])
print(channel_ids)

img, (offset_x, offset_y) = mpp_data.get_img(
    channel_ids=channel_ids,
)
_ = plt.imshow(np.clip(img, 0, 255))
[12, 16, 32]
_images/notebooks_mpp_data_17_1.png

This function returns the reconstructed image and offset information that needs to be subtracted from x and y before they can be used to index the image.

Note that the offset information is only returned when the img_size argument to MPPData.get_img is not defined.

[10]:
print(offset_x, offset_y)

print("Location of first data point in img:", mpp_data.x[0] - offset_x, mpp_data.y[0] - offset_y)
print("Value of img at that location:", img[mpp_data.y[0] - offset_y, mpp_data.x[0] - offset_x])
print("Value of fist data point in mpp for channel_ids:", mpp_data.center_mpp[0, channel_ids])
182 106
Location of first data point in img: 1266 0
Value of img at that location: [104 107 110]
Value of first data point in mpp for channel_ids: [104 107 110]

Now let’s plot a single object (cell). For that, MPPData.get_object_img is used.

[11]:
img = mpp_data.get_object_img(mpp_data.unique_obj_ids[0], channel_ids=channel_ids, img_size=255)
_ = plt.imshow(np.clip(img, 0, 255))
_images/notebooks_mpp_data_21_0.png

We can also plot all channels of this cell:

[12]:
img = mpp_data.get_object_img(mpp_data.unique_obj_ids[0], img_size=255)
print(img.shape)
num_cols = 5
num_channels = len(mpp_data.channels)
num_rows = num_channels // num_cols

fig, axes = plt.subplots(ncols=num_cols, nrows=num_rows, figsize=(int(3 * num_cols), int(3 * num_rows)))
i = 0
for row in range(num_rows):
    for col in range(num_cols):
        if i >= num_channels:
            break
        axes[row][col].imshow(img[:, :, i], interpolation="none")
        axes[row][col].set_title(mpp_data.channels.iloc[i]["name"])
        axes[row][col].axis("off")
        i += 1
plt.tight_layout()
plt.show()
(255, 255, 35)
_images/notebooks_mpp_data_23_1.png

One can also get a list of images for all objects in MPPData using MPPData.get_object_imgs. Let's get images for all available cells and plot the first six:

[13]:
img = mpp_data.get_object_imgs(channel_ids=channel_ids, img_size=255)
num_rows, num_cols = 2, 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(int(3 * num_cols), (int(3 * num_rows))))
i = 0
for row in range(num_rows):
    for col in range(num_cols):
        axes[row][col].imshow(np.clip(img[i], 0, 255), interpolation="none")
        axes[row][col].axis("off")
        i += 1
plt.tight_layout()
plt.show()
plt.close()
_images/notebooks_mpp_data_25_0.png
Filter and subset data

There are several functions that allow the modification of MPPData objects.

By channels

We can subset mpp_data to selected channels. Note that this operation is performed in-place and will update the fields mpp_data.mpp and mpp_data.channels.

[14]:
# subset channels
channels = [
    "01_CDK9_pT186",
    "01_PABPC1",
    "02_CDK7",
    "03_CDK9",
    "03_RPS6",
    "00_DAPI",
    "07_H2B",
]

mpp_data.subset_channels(channels)

print(mpp_data.mpp.shape)
display(mpp_data.channels)
(247872, 1, 1, 7)
name
channel_id
0 01_CDK9_pT186
1 01_PABPC1
2 02_CDK7
3 03_CDK9
4 03_RPS6
5 00_DAPI
6 07_H2B
By objects (cells)

To filter mpp_data based on specific objects (cells), we use the MPPData.subset function. Several filters for subsetting can be defined:

  • subset to random fraction / number of objects

  • filtering by object ids

  • filtering by values in MPPData.metadata

  • filtering by condition values

Subset to random fraction / number of objects

Using the arguments frac and num, we can subset the MPPData to a random selection of objects.

[15]:
# Subset fraction of objects:
mpp_data_sub = mpp_data.subset(frac=0.05, copy=True)
print(f"Subset fraction: {len(mpp_data_sub.unique_obj_ids)/len(mpp_data.unique_obj_ids)}")

# Subset specific number of objects:
mpp_data_sub = mpp_data.subset(num=10, copy=True)
print(f"Subset fraction: {len(mpp_data_sub.unique_obj_ids)/len(mpp_data.unique_obj_ids)}")
Subset fraction: 0.05
Subset fraction: 0.5
filtering by object ids

Use obj_ids to select specific objects.

[16]:
# subset to one cell
mpp_data_sub = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[:1], copy=True)
print(mpp_data.unique_obj_ids)
print(mpp_data_sub.unique_obj_ids)
[200792 205776 205790 231620 248082 248102 248104 256546 259784 291041
 324138 331833 345908 359378 359391 359393 366486 366493 383341 383793]
[200792]
filtering by values in MPPData.metadata

MPPData.metadata contains object-level metadata such as cell cycle state or perturbation status. We can use this information to subset the MPPData.

For that, we need to specify the name of the column in the metadata table and a set of allowed entries for that column. The entry NO_NAN is a special token selecting all values except NaN.

We can also provide several columns together with their entries, separated by a comma, and the resulting MPPData object will be computed as a combination of all provided conditions.

[17]:
# filter by NO_NAN cellcycle stage
# perform operation inplace
print(
    "Cell cycle entries before subsetting by NO_NAN values:",
    (mpp_data.metadata.cell_cycle).unique(),
)
mpp_data.subset(cell_cycle="NO_NAN")
print(
    "Cell cycle entries after subsetting to NO_NAN values:",
    np.unique(mpp_data.metadata.cell_cycle),
)
Cell cycle entries before subsetting by NO_NAN values: [nan 'G1' 'G2' 'S']
Cell cycle entries after subsetting to NO_NAN values: ['G1' 'G2' 'S']
By pixels

We can also subset the MPPData by selecting random pixels. For this, the function MPPData.subsample can be used. This is an important step when selecting training data for the cVAE, as we would like to choose samples from all cells, but not necessarily all pixels of each cell (to keep the size of the training data manageable).

[18]:
mpp_data_sub = mpp_data.subsample(frac_per_obj=0.05)

print(f"Subset fraction: {mpp_data_sub.mpp.shape[0]/mpp_data.mpp.shape[0]}")
Subset fraction: 0.04996453416108649
Normalisation

To prepare the data, pixel values can be normalised, e.g. shifted and re-scaled so that they range between 0 and 1. For this, we can use MPPData.normalise, which can do the following:

  • background normalisation: shift all values by a fixed background value / one background value per channel.

  • percentile normalisation

Note that this function can modify the rescale_values argument as a side effect. If an empty list is passed as rescale_values, it will be filled with the percentiles calculated on this data. If rescale_values contains previously calculated percentiles, they will be used to re-scale the data instead of calculating new percentiles.

To demonstrate this, let us split the MPPData in two, and first calculate the percentile on the first half.

[19]:
mpp_data1 = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[:6], copy=True)
mpp_data2 = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[6:], copy=True)

rescale_values = []
mpp_data1.normalise(percentile=95, rescale_values=rescale_values)
print("Rescale values for mpp_data1:", rescale_values)
Rescale values for mpp_data1: [144.0, 166.0, 319.0, 181.0, 151.0, 167.0, 361.0]

Now, let us use the rescale_values to scale mpp_data2:

[20]:
mpp_data2.normalise(percentile=95, rescale_values=rescale_values)

print("Median in mpp_data1:", np.median(mpp_data1.center_mpp, axis=0))
print("Median in mpp_data2:", np.median(mpp_data2.center_mpp, axis=0))
Median in mpp_data1: [0.8125     0.76506025 0.63009405 0.7016575  0.8476821  0.83233535
 0.72853184]
Median in mpp_data2: [0.8055556  0.7409639  0.6238245  0.70718235 0.8344371  0.83832335
 0.8504155 ]
Prepare for VAE training

In order to prepare data for VAE training, MPPData contains several functions for:

  • adding neighbourhood information

  • adding conditions

  • splitting data into train/val/test

See also the Dataset for training models tutorial.

Neighbourhood information

For the CAMPA workflow, it is advantageous to add a small local neighbourhood to each pixel. This results in a more stable latent space. To add neighbourhood information, use the MPPData.add_neighborhood function. Note that this only works before we do any random subsampling of pixels.

It is also possible to do subsampling and add neighbourhood information at the same time using the add_neighborhood argument of the MPPData.subsample function. This is more time-efficient, as the neighbours only need to be computed for the subsampled pixels.

[21]:
mpp_data_neighb = mpp_data.add_neighborhood(size=3, copy=True)

print(f"MPP dataset before adding the neighbourhood, has neighbour data: {mpp_data.has_neighbor_data}")
print(f"Representation of one MPP before adding the neighbourhood (first channel):\n {mpp_data.mpp[0,:, :, 0]}")

print(f"MPP dataset after adding the neighbourhood, has neighbor data: {mpp_data_neighb.has_neighbor_data}")
print(f"Representation of one mpp after adding the neighbourhood (first channel):\n {mpp_data_neighb.mpp[0, :, :, 0]}")
MPP dataset before adding the neighbourhood, has neighbour data: False
Representation of one MPP before adding the neighbourhood (first channel):
 [[117]]
MPP dataset after adding the neighbourhood, has neighbor data: True
Representation of one mpp after adding the neighbourhood (first channel):
 [[117 117 117]
 [117 117 117]
 [111 116 123]]
[22]:
# subsample and add neighbourhood

mpp_data_sub = mpp_data.subsample(frac_per_obj=0.05, add_neighborhood=True, neighborhood_size=3)

print(f"Subset fraction: {mpp_data_sub.mpp.shape[0]/mpp_data.mpp.shape[0]}")

print(f"MPP representation before adding the neighbourhood (first channel)\n: {mpp_data.mpp[0,:, :, 0]}")
print(f"MPP representation after adding the neighbourhood (first channel)\n: {mpp_data_sub.mpp[0, :, :, 0]}")
Subset fraction: 0.04996453416108649
MPP representation before adding the neighbourhood (first channel)
: [[117]]
MPP representation after adding the neighbourhood (first channel)
: [[114 126 109]
 [121 124 122]
 [125 127 119]]

Note that since a random subsampling was performed, the first value in the mpp_data does not necessarily correspond to the first element in mpp_data_sub.

Add conditions to MPPData

The CAMPA framework allows you to learn a condition-independent latent representation of input pixel profiles. For this, we are training a conditional VAE. Here, we are going to prepare the MPPData for this by adding a special condition vector, MPPData.conditions, to the object. This could represent e.g. perturbation or cell cycle.

MPPData.add_conditions prepares and adds this condition vector, utilising cell-level information from the MPPData.metadata table. Before categorical conditions can be added, they need to be described in data_config.CONDITIONS, to enable a mapping from categorical values to numbers.

cond_desc describes the conditions that should be added. It is a list of condition descriptions. Each condition is calculated separately and concatenated to form the resulting MPPData.conditions vector. Condition descriptions have the following format: "{condition}(_{postprocess})".

Condition values are obtained as follows:

  1. look up condition in MPPData.metadata. If condition is described in data_config.CONDITIONS, map it to numerical values. Note that if there is an entry UNKNOWN in data_config.CONDITIONS, all unmapped values will be mapped to this class. If condition is not described in data_config.CONDITIONS, values are assumed to be continuous and stored as they are in the condition vector.

  2. post-process conditions. postprocess can be one of the following values:

    • lowhigh_bin_2: Only for continuous values. Removes middle values: bins all values into four quantile bins, encodes values in the lowest bin as one class and values in the highest bin as a second class (one-hot encoded), and sets all values in between to NaN.

    • bin_3: Only for continuous values. Bins values at the .33 and .66 quantiles and one-hot encodes each value.

    • zscore: Only for continuous values. Normalises values by mean and std.

    • one_hot: Only for categorical values. One-hot encodes values.

For categorical descriptions, it is possible to pass a list of condition descriptions. This will return a unique one-hot encoded vector combining multiple conditions.

We can also provide a cond_params dict to the function, to enable the use of precomputed quantiles or mean/std values. If no values are provided in cond_params, the computed quantiles or mean/std values are added to cond_params, allowing us to use these values to process other MPPData objects.

In the following, let us look at some examples of how to add conditions.

Continuous conditions

Let us use the transcription rate condition TR for this example. Remember that every condition we are adding needs to be present in the MPPData.metadata table.

For this dataset, we computed the transcription rate as mean EU molecular intensity per cell over a period of 30 minutes. This gives an estimate of the amount of new RNA transcripts.

Let us again split the MPPData and first add conditions on the first half:

[23]:
mpp_data1 = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[:6], copy=True)
mpp_data2 = mpp_data.subset(obj_ids=mpp_data.unique_obj_ids[6:], copy=True)

cond_params = {}
mpp_data1.add_conditions(["TR_norm_lowhigh_bin_2"], cond_params=cond_params)

print("cond_params for mpp_data1:", cond_params)
cond_params for mpp_data1: {'TR_norm_lowhigh_bin_2_quantile': [339.39541649462217, 417.5917605897343]}

This adds a #pixels x 2 conditions vector to mpp_data1. The two columns encode low-quantile TR cells and high-quantile TR cells. Cells with a TR between the low and high quantiles have NaN values.

As conditions are the same for all pixels in the same cell, we can get an overview over the assigned conditions looking at one value per object id:

[24]:
for obj_id, i in zip(*np.unique(mpp_data1.obj_ids, return_index=True)):
    # get TR for comparison
    TR = float(mpp_data1.metadata[mpp_data1.metadata["mapobject_id"] == obj_id]["TR"])
    # print condition and TR for this cell
    print("obj_id:", obj_id, "condition:", mpp_data1.conditions[i], "TR:", TR)
obj_id: 205776 condition: [nan nan] TR: 357.67200760901375
obj_id: 205790 condition: [1. 0.] TR: 428.3642683585313
obj_id: 248082 condition: [1. 0.] TR: 250.4885806253832
obj_id: 248102 condition: [0. 1.] TR: 515.7354206855616
obj_id: 259784 condition: [1. 0.] TR: 348.1507131885012
obj_id: 291041 condition: [0. 1.] TR: 443.565445026178

We can now use cond_params to use the same quantiles when adding conditions to mpp_data2.

[25]:
mpp_data2.add_conditions(["TR_norm_lowhigh_bin_2"], cond_params=cond_params)

# show resulting conditions
for obj_id, i in zip(*np.unique(mpp_data2.obj_ids, return_index=True)):
    # get TR for comparison
    TR = float(mpp_data2.metadata[mpp_data2.metadata["mapobject_id"] == obj_id]["TR"])
    # print condition and TR for this cell
    print("obj_id:", obj_id, "condition:", mpp_data2.conditions[i], "TR:", TR)
obj_id: 345908 condition: [1. 0.] TR: 316.0041054648299
obj_id: 359378 condition: [nan nan] TR: 367.90425531914894
obj_id: 359393 condition: [0. 1.] TR: 513.66859045505
obj_id: 366493 condition: [1. 0.] TR: 330.84783201527136
obj_id: 383341 condition: [nan nan] TR: 388.0123185637892
obj_id: 383793 condition: [nan nan] TR: 376.2435963681996
Categorical conditions

Categorical conditions are e.g. perturbation or cell cycle. They need to be defined in data_config.CONDITIONS to enable mapping from string values to classes.

Let's check which conditions are defined for this example. Remember that you can add conditions by editing the data_config file.

[26]:
print("data_config:", mpp_data.data_config_name)
print("data_config fname:", campa_config.data_configs[mpp_data.data_config_name.lower()])
print(mpp_data.data_config.CONDITIONS)
data_config: ExampleData
data_config fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py
{'perturbation_duration': ['AZD4573-120', 'AZD4573-30', 'CX5461-120', 'DMSO-120', 'DMSO-720', 'Meayamycin-720', 'TSA-30', 'Triptolide-120', 'Wnt-C59-720', 'normal'], 'cell_cycle': ['M', 'G1', 'S', 'G2'], 'well_name': ['H06', 'H07', 'I03', 'I08', 'I09', 'I10', 'I11', 'I12', 'I13', 'I14', 'I16', 'I17', 'I18', 'I20', 'J06', 'J07', 'J08', 'J09', 'J10', 'J12', 'J13', 'J14', 'J15', 'J16', 'J18', 'J20', 'J21', 'J22'], 'siRNA': ['scrambled', 'SBF2']}

Let's use the cell_cycle condition first. Using just cell_cycle as condition will add values between 0-3 (1-3 for this example, as there are no M-stage cells in this data).

[27]:
mpp_data.add_conditions(["cell_cycle"])

print(mpp_data.conditions.shape)
print(np.unique(mpp_data.conditions))
(153669, 1)
[1 2 3]

Using cell_cycle_one_hot will one-hot encode the cell-cycle stage:

[28]:
mpp_data.add_conditions(["cell_cycle_one_hot"])

print(mpp_data.conditions.shape)
print(mpp_data.conditions)
(153669, 4)
[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 ...
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
Combine multiple categorical conditions

We can also combine multiple categorical conditions into one one-hot encoded vector. Here, let us combine cell-cycle and perturbation-duration into one condition. There are 4 cell-cycle values and 10 perturbation values. This means that there are \(4 \times 10 = 40\) unique cell-cycle-perturbation combinations, which will result in a #pixels x 40 conditions vector.

[29]:
mpp_data.add_conditions([["cell_cycle_one_hot", "perturbation_duration_one_hot"]])
print(mpp_data.conditions.shape)
print(mpp_data.conditions)
(153669, 40)
[[0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]]
Combine multiple conditions (concatenation)

Add more condition descriptions to the cond_desc list to combine multiple conditions. Here, we combine the TR value with the one-hot encoded cell-cycle.

The first value is the TR, and the following 4 encode the cell cycle stage.

[30]:
mpp_data.add_conditions(["TR", "cell_cycle_one_hot"])
print(mpp_data.conditions.shape)
print(mpp_data.conditions)
(153669, 5)
[[443.56544503   0.           1.           0.           0.        ]
 [443.56544503   0.           1.           0.           0.        ]
 [443.56544503   0.           1.           0.           0.        ]
 ...
 [515.73542069   0.           0.           1.           0.        ]
 [515.73542069   0.           0.           1.           0.        ]
 [515.73542069   0.           0.           1.           0.        ]]
Train-val-test split

We can use MPPData.train_val_test_split to split the data into train/val/test sets.

[31]:
train, val, test = mpp_data.train_val_test_split(train_frac=0.8, val_frac=0.1)

print("Train", train)
print("Val", val)
print("Test", test)
Train MPPData for ExampleData (102400 mpps with shape (1, 1, 7) from 9 objects). Data keys: ['y', 'obj_ids', 'x', 'mpp', 'labels', 'conditions'].
Val MPPData for ExampleData (13668 mpps with shape (1, 1, 7) from 1 objects). Data keys: ['y', 'obj_ids', 'x', 'mpp', 'labels', 'conditions'].
Test MPPData for ExampleData (37601 mpps with shape (1, 1, 7) from 2 objects). Data keys: ['y', 'obj_ids', 'x', 'mpp', 'labels', 'conditions'].

Dataset for training models

An NNDataset is a collection of MPPData objects (one each for train/val/test). It is created using a parameter file defining its properties.

This notebook shows:

  • how to create a dataset from parameters

  • how to iterate over an existing dataset

[1]:
from pathlib import Path
import os

from campa.data import NNDataset, create_dataset, load_example_data
from campa.utils import load_config, init_logging
from campa.constants import campa_config

# init logging with level INFO=20, WARNING=30
init_logging(level=30)
# ensure that example data is downloaded
load_example_data()
# read correct campa_config -- created with setup.ipynb
CAMPA_DIR = Path.cwd()
campa_config.config_fname = CAMPA_DIR / "params/campa.ini"
print(campa_config)
2022-11-25 10:35:43.172656: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 10:35:44.806362: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 10:35:45.313484: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

Create NNDataset

NNDataset is created with a config file that specifies from which folders to take the data and how to preprocess the data.

Here, we will use example_data_params.py to create an example dataset and save it to DATASET_DIR/184A1_test_dataset. The following code assumes that you have a copy of this file in a folder called params in the current directory (where this notebook is stored). See create_dataset for documentation of the data parameters.

Alternatively, the NNDataset can also easily be created with the CLI (make sure that you run this command from the folder where your campa.ini file is stored, so that CAMPA knows the correct data paths):

cd CAMPA_DIR/params
campa create_dataset example_data_params.py

First, let us load the data parameters. This dataset uses data from two unperturbed and two Meayamycin-perturbed wells. It contains one-hot-encoded condition labels for the perturbation and for the cell cycle. All cells are used, but only 10% of all pixels. The data is corrected for background fluorescence signal and percentile-normalised, and \(3 \times 3\) pixel neighbourhoods are used as inputs to the cVAE.

The dataset will be saved in the DATASET_DIR defined by the ExampleData data config.
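
For orientation, below is a minimal sketch of what such a data parameter file could look like. It assumes that the file defines a module-level data_params dictionary (matching how the loaded config is accessed as config.data_params below); the values shown are a subset of the full dictionary printed in the next cell.

# example_data_params.py -- a minimal sketch of a data parameter file
# (assumption: a module-level `data_params` dictionary; see create_dataset
# for documentation of all keys)
data_params = {
    "dataset_name": "184A1_test_dataset",  # saved to DATASET_DIR/<dataset_name>
    "data_config": "ExampleData",
    "data_dirs": ["184A1_unperturbed/I09", "184A1_meayamycin/I12"],  # wells to load
    "condition": ["perturbation_duration_one_hot", "cell_cycle_one_hot"],
    "split_kwargs": {"train_frac": 0.7, "val_frac": 0.2},
    "subsample": True,
    "subsample_kwargs": {"frac": 0.1},  # keep 10% of all pixels
    "neighborhood": True,
    "neighborhood_size": 3,  # 3x3 neighbourhoods as cVAE inputs
    "normalise": True,
    "normalise_kwargs": {"background_value": "mean_background", "percentile": 98.0},
    "seed": 42,
}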

[2]:
# read config from params folder in current directory.
config = load_config("params/example_data_params.py")
print(config.data_params)
{'dataset_name': '184A1_test_dataset', 'data_config': 'ExampleData', 'data_dirs': ['184A1_unperturbed/I09', '184A1_unperturbed/I11', '184A1_meayamycin/I12', '184A1_meayamycin/I20'], 'channels': ['01_CDK9_pT186', '01_PABPC1', '02_CDK7', '03_CDK9', '03_RPS6', '05_GTF2B', '05_Sm', '07_POLR2A', '07_SETD1A', '08_H3K4me3', '09_CCNT1', '09_SRRM2', '10_H3K27ac', '10_POL2RA_pS2', '11_KPNA2_MAX', '11_PML', '12_RB1_pS807_S811', '12_YAP1', '13_PABPN1', '13_POL2RA_pS5', '14_PCNA', '15_SON', '15_U2SNRNPB', '16_H3', '17_HDAC3', '17_SRSF2', '18_NONO', '19_KPNA1_MAX', '20_ALYREF', '20_SP100', '21_COIL', '21_NCL', '00_DAPI', '07_H2B'], 'condition': ['perturbation_duration_one_hot', 'cell_cycle_one_hot'], 'condition_kwargs': {'cond_params': {}}, 'split_kwargs': {'train_frac': 0.7, 'val_frac': 0.2}, 'test_img_size': 225, 'subset': True, 'subset_kwargs': {'frac': None, 'nona_condition': True, 'cell_cycle': 'NO_NAN'}, 'subsample': True, 'subsample_kwargs': {'frac': 0.1, 'frac_per_obj': None, 'num': None, 'num_per_obj': None}, 'neighborhood': True, 'neighborhood_size': 3, 'normalise': True, 'normalise_kwargs': {'background_value': 'mean_background', 'percentile': 98.0, 'rescale_values': []}, 'seed': 42}

Now we can create the dataset.

[3]:
create_dataset(config.data_params)

Use NNDataset

After creating the dataset, we can inspect it using NNDataset.

[4]:
dataset_name = "184A1_test_dataset"
ds = NNDataset(dataset_name, data_config="ExampleData")
print(ds)
NNDataset for ExampleData (shape (3, 3, 34)). train: 50310, val: 12340, test: 4541

The dataset was saved in the DATASET_DIR defined by the data_config:

[5]:
nn_dataset_fpath = os.path.join(ds.data_config.DATASET_DIR, dataset_name)
print(nn_dataset_fpath)
print(os.listdir(os.path.join(ds.data_config.DATASET_DIR, dataset_name)))
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data/datasets/184A1_test_dataset
['val', 'val_imgs', 'test_imgs', 'train', 'params.json', 'test']

An NNDataset consists of a train/val/test split in NNDataset.data, represented as MPPData objects.

[6]:
print(ds.data.keys())
print(ds.data["train"])
dict_keys(['train', 'val', 'test'])
MPPData for ExampleData (50310 mpps with shape (3, 3, 34) from 35 objects). Data keys: ['x', 'obj_ids', 'y', 'mpp', 'labels', 'conditions'].

In addition, NNDataset has val/test images in NNDataset.imgs, also represented as MPPData. These objects contain all pixels of the val and test cells, so that entire cell images can be plotted from these splits.

[7]:
import matplotlib.pyplot as plt

print(ds.imgs.keys())
plt.imshow(ds.imgs["test"].get_object_img(ds.imgs["test"].obj_ids[0], channel_ids=0, img_size=255)[:, :, 0])
dict_keys(['val', 'test'])
[7]:
<matplotlib.image.AxesImage at 0x7f4f3a5f2d00>
_images/notebooks_nn_dataset_13_2.png

NNDataset also provides x and y, which return numpy arrays for the VAE input and output. x is either the multiplexed pixel profiles alone, with shape (number of pixels, neighbourhood size, neighbourhood size, channels), or a tuple of the multiplexed pixel profiles and the condition labels.

[8]:
# get x without condition
x = ds.x("val", is_conditional=False)
print(x.shape)

# get x with condition
x, c = ds.x("train", is_conditional=True)
print(x.shape, c.shape)
(12340, 3, 3, 34)
(50310, 3, 3, 34) (50310, 14)

For neural network training, we use the NNDataset.get_tf_dataset method, which returns a TensorFlow dataset.

[9]:
# dataset can return a tf dataset for using with e.g. keras
tf_ds = ds.get_tf_dataset(split="train", is_conditional=True)
print(tf_ds)

for x, y in tf_ds.take(1):
    # x here contains x, condition
    print(len(x), x[0].shape, x[1].shape)
    print(y.shape)
<FlatMapDataset element_spec=((TensorSpec(shape=(3, 3, 34), dtype=tf.float32, name=None), TensorSpec(shape=(14,), dtype=tf.float32, name=None)), TensorSpec(shape=(34,), dtype=tf.float32, name=None))>
2 (3, 3, 34) (14,)
(34,)
2022-11-25 10:36:08.102849: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 10:36:08.717114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30975 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:37:00.0, compute capability: 7.0
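
As a sketch of how such a dataset could be fed to a Keras model (CAMPA's Estimator does this internally during training; model below is a hypothetical compiled keras.Model that accepts (mpp, condition) input tuples and is not defined in this notebook):

# batch the dataset before training; shuffle and batch are standard
# tf.data.Dataset operations
batched_ds = tf_ds.shuffle(buffer_size=10_000).batch(128)
# model.fit(batched_ds, epochs=10)  # `model` is hypothetical here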

For more information on the training and evaluation process, see the tutorial on training cVAE models.

Train and evaluate models

This notebook shows how to train models, predict with them, and cluster their latent representations. Instead of executing each step here, the CLI can be used for training models:

cd CAMPA_DIR/params
campa train all --config example_experiment_params.py

For evaluation or comparison only, use

cd CAMPA_DIR/params
campa train compare --experiment-dir test

Before running this tutorial, make sure you have created the NNDataset following the create NNDataset tutorial. The models trained here will be saved in EXPERIMENT_DIR/test, with EXPERIMENT_DIR being the custom experiment path set up in campa_config.

[1]:
from pathlib import Path
import os

from campa.tl import (
    Cluster,
    Estimator,
    Predictor,
    Experiment,
    ModelComparator,
    run_experiments,
)
from campa.data import MPPData
from campa.utils import init_logging
from campa.constants import campa_config

# init logging with level INFO=20, WARNING=30
init_logging(level=30)
# read correct campa_config -- created with setup.ipynb
CAMPA_DIR = Path.cwd()
campa_config.config_fname = CAMPA_DIR / "params/campa.ini"
print(campa_config)
2022-11-25 11:07:57.922828: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 11:07:59.660494: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 11:07:59.960552: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

Experiment class handles config files

For training and evaluating models, an experiment_params.py file is used. This file defines parameters for multiple models/experiments, making it easy to train several models at the same time. Each parameter dictionary contains the following sections:

  • experiment (where to save experiment)

  • data (which dataset to use for training)

  • model (model class definition)

  • training (training hyper-parameters)

  • evaluation (evaluation on val/test split)

  • cluster (clustering on val/test split)

For more information on the structure of the experiment parameter dictionary, see the documentation of Experiment. The Experiment class is initialised from a parameter dictionary for one specific experiment and is passed to specific classes for training (Estimator), evaluation (Predictor), and clustering (Cluster).
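
As a minimal sketch, assuming Experiment accepts such a parameter dictionary directly, an experiment could be set up and handed to an Estimator like this (the values are illustrative; compare the full config printed for the VAE experiment below, where e.g. model_cls is stored as a ModelEnum rather than a plain string):

# a minimal sketch with illustrative values -- not the full config
config = {
    "experiment": {"dir": "test", "name": "VAE", "save_config": True},
    "data": {"data_config": "ExampleData", "dataset_name": "184A1_test_dataset"},
    "model": {"model_cls": "VAEModel", "model_kwargs": {"latent_dim": 16}},
    "training": {"learning_rate": 0.001, "epochs": 10, "batch_size": 128},
    "evaluation": {"split": "val"},
    "cluster": {"cluster_method": "leiden", "leiden_resolution": 0.2},
}
exp = Experiment(config)
est = Estimator(exp)  # training is then run with est.train_model()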

Here, we are going to use an example experiment config that creates three models:

  • condVAE: a cVAE model trained on the example dataset created in the NNDataset tutorial, using perturbation (unperturbed or Meayamycin) and cell cycle as conditions

  • VAE: a VAE model trained on the same example dataset

  • MPPleiden: a non-trainable model that is used to create a direct pixel clustering, to compare with the cVAE latent space clustering

First, we create the Experiments from the example config file:

[2]:
# get Experiments from config
exps = Experiment.get_experiments_from_config("params/example_experiment_params.py")

Experiments are saved in EXPERIMENT_DIR/<experiment_dir>/<experiment_name>, where EXPERIMENT_DIR is the directory set up in campa_config, and <experiment_dir> and <experiment_name> are defined in the experiment config. The experiment config can be accessed with Experiment.config and is stored as config.json in the experiment folder.

[3]:
exp = exps[0]
print("Experiment name:", exp.name)
print("Experiment is stored in:", exp.full_path)
print("Experiment config:", exp.config)
Experiment name: VAE
Experiment is stored in: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/VAE
Experiment config: {'experiment': {'dir': 'test', 'name': 'VAE', 'save_config': True}, 'data': {'data_config': 'ExampleData', 'dataset_name': '184A1_test_dataset', 'output_channels': None}, 'model': {'model_cls': <ModelEnum.VAEModel: 'VAEModel'>, 'model_kwargs': {'num_neighbors': 3, 'num_channels': 34, 'num_output_channels': 34, 'latent_dim': 16, 'encoder_conv_layers': [32], 'encoder_conv_kernel_size': [1], 'encoder_fc_layers': [32, 16], 'decoder_fc_layers': []}, 'init_with_weights': False}, 'training': {'learning_rate': 0.001, 'epochs': 10, 'batch_size': 128, 'loss': {'decoder': <LossEnum.SIGMA_MSE: 'sigma_vae_mse'>, 'latent': <LossEnum.KL: 'kl_divergence'>}, 'loss_weights': {'decoder': 1}, 'loss_warmup_to_epoch': {}, 'metrics': {'decoder': <LossEnum.MSE_metric: 'mean_squared_error_metric'>, 'latent': <LossEnum.KL: 'kl_divergence'>}, 'save_model_weights': True, 'save_history': True, 'overwrite_history': True}, 'evaluation': {'split': 'val', 'predict_reps': ['latent', 'decoder'], 'img_ids': 2, 'predict_imgs': True, 'predict_cluster_imgs': True}, 'cluster': {'predict_cluster_imgs': True, 'cluster_name': 'clustering', 'cluster_rep': 'latent', 'cluster_method': 'leiden', 'leiden_resolution': 0.2, 'subsample': None, 'subsample_kwargs': {}, 'som_kwargs': {}, 'umap': True}}

Running experiments with the high-level API

The high-level API contains a run_experiments() function that wraps training, evaluation, clustering, and comparison of models in one call.

[4]:
run_experiments(exps, mode="trainval")
Running experiment for ['VAE', 'CondVAE_pert-CC', 'MPPleiden'] with mode trainval
Training model for VAE
2022-11-25 11:08:48.091386: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 11:08:55.819611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30975 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:37:00.0, compute capability: 7.0
Epoch 1/10
2022-11-25 11:09:04.047735: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8202
394/394 [==============================] - 14s 22ms/step - loss: -54.6142 - decoder_loss: -60.4548 - latent_loss: 5.8410 - decoder_mean_squared_error: 0.1186 - latent_kl_loss: 5.8438 - val_loss: -1461.2408 - val_decoder_loss: -1469.6049 - val_latent_loss: 8.3641 - val_decoder_mean_squared_error: 0.0308 - val_latent_kl_loss: 8.3667
Epoch 2/10
394/394 [==============================] - 8s 17ms/step - loss: -1727.2487 - decoder_loss: -1735.4847 - latent_loss: 8.2352 - decoder_mean_squared_error: 0.0265 - latent_kl_loss: 8.2305 - val_loss: -1775.4342 - val_decoder_loss: -1784.1416 - val_latent_loss: 8.7079 - val_decoder_mean_squared_error: 0.0268 - val_latent_kl_loss: 8.7105
Epoch 3/10
394/394 [==============================] - 8s 17ms/step - loss: -1899.6907 - decoder_loss: -1907.9985 - latent_loss: 8.3079 - decoder_mean_squared_error: 0.0245 - latent_kl_loss: 8.3114 - val_loss: -1812.8528 - val_decoder_loss: -1821.5256 - val_latent_loss: 8.6733 - val_decoder_mean_squared_error: 0.0263 - val_latent_kl_loss: 8.6758
Epoch 4/10
394/394 [==============================] - 8s 17ms/step - loss: -1983.5101 - decoder_loss: -1991.8229 - latent_loss: 8.3129 - decoder_mean_squared_error: 0.0235 - latent_kl_loss: 8.3082 - val_loss: -1779.9838 - val_decoder_loss: -1788.4822 - val_latent_loss: 8.4980 - val_decoder_mean_squared_error: 0.0267 - val_latent_kl_loss: 8.5003
Epoch 5/10
394/394 [==============================] - 8s 17ms/step - loss: -2030.0353 - decoder_loss: -2038.2104 - latent_loss: 8.1763 - decoder_mean_squared_error: 0.0231 - latent_kl_loss: 8.1751 - val_loss: -1871.1638 - val_decoder_loss: -1879.5934 - val_latent_loss: 8.4300 - val_decoder_mean_squared_error: 0.0257 - val_latent_kl_loss: 8.4323
Epoch 6/10
394/394 [==============================] - 8s 17ms/step - loss: -2055.1208 - decoder_loss: -2063.1494 - latent_loss: 8.0277 - decoder_mean_squared_error: 0.0228 - latent_kl_loss: 8.0278 - val_loss: -1832.2227 - val_decoder_loss: -1840.4135 - val_latent_loss: 8.1906 - val_decoder_mean_squared_error: 0.0262 - val_latent_kl_loss: 8.1927
Epoch 7/10
394/394 [==============================] - 8s 17ms/step - loss: -2072.0718 - decoder_loss: -2079.9519 - latent_loss: 7.8805 - decoder_mean_squared_error: 0.0226 - latent_kl_loss: 7.8791 - val_loss: -1896.3907 - val_decoder_loss: -1904.4248 - val_latent_loss: 8.0340 - val_decoder_mean_squared_error: 0.0254 - val_latent_kl_loss: 8.0361
Epoch 8/10
394/394 [==============================] - 8s 17ms/step - loss: -2082.7478 - decoder_loss: -2090.5005 - latent_loss: 7.7527 - decoder_mean_squared_error: 0.0225 - latent_kl_loss: 7.7545 - val_loss: -1867.6969 - val_decoder_loss: -1875.5791 - val_latent_loss: 7.8826 - val_decoder_mean_squared_error: 0.0258 - val_latent_kl_loss: 7.8845
Epoch 9/10
394/394 [==============================] - 8s 17ms/step - loss: -2091.8447 - decoder_loss: -2099.4902 - latent_loss: 7.6441 - decoder_mean_squared_error: 0.0224 - latent_kl_loss: 7.6453 - val_loss: -1876.2723 - val_decoder_loss: -1884.0425 - val_latent_loss: 7.7702 - val_decoder_mean_squared_error: 0.0257 - val_latent_kl_loss: 7.7720
Epoch 10/10
394/394 [==============================] - 8s 17ms/step - loss: -2095.2839 - decoder_loss: -2102.7478 - latent_loss: 7.4648 - decoder_mean_squared_error: 0.0224 - latent_kl_loss: 7.4653 - val_loss: -1898.8521 - val_decoder_loss: -1906.3774 - val_latent_loss: 7.5262 - val_decoder_mean_squared_error: 0.0254 - val_latent_kl_loss: 7.5278
Evaluating model for VAE
97/97 [==============================] - 0s 1ms/step
97/97 [==============================] - 0s 1ms/step
319/319 [==============================] - 0s 1ms/step
319/319 [==============================] - 0s 1ms/step
Clustering results for VAE
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/VAE/results_epoch010/val/clustering.npy
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
Training model for CondVAE_pert-CC
Epoch 1/10
394/394 [==============================] - 11s 21ms/step - loss: -96.7190 - decoder_loss: -108.3478 - latent_loss: 11.6285 - decoder_mean_squared_error: 0.0962 - latent_kl_loss: 11.6341 - val_loss: -1121.3718 - val_decoder_loss: -1135.7074 - val_latent_loss: 14.3359 - val_decoder_mean_squared_error: 0.0355 - val_latent_kl_loss: 14.3381
Epoch 2/10
394/394 [==============================] - 9s 20ms/step - loss: -1598.8888 - decoder_loss: -1611.3558 - latent_loss: 12.4673 - decoder_mean_squared_error: 0.0281 - latent_kl_loss: 12.4833 - val_loss: -1597.4363 - val_decoder_loss: -1609.7793 - val_latent_loss: 12.3429 - val_decoder_mean_squared_error: 0.0285 - val_latent_kl_loss: 12.3442
Epoch 3/10
394/394 [==============================] - 9s 20ms/step - loss: -1869.5409 - decoder_loss: -1880.0975 - latent_loss: 10.5565 - decoder_mean_squared_error: 0.0248 - latent_kl_loss: 10.5569 - val_loss: -1745.0593 - val_decoder_loss: -1755.3724 - val_latent_loss: 10.3129 - val_decoder_mean_squared_error: 0.0267 - val_latent_kl_loss: 10.3140
Epoch 4/10
394/394 [==============================] - 9s 20ms/step - loss: -2038.8110 - decoder_loss: -2047.9432 - latent_loss: 9.1308 - decoder_mean_squared_error: 0.0230 - latent_kl_loss: 9.1282 - val_loss: -1810.5372 - val_decoder_loss: -1819.9731 - val_latent_loss: 9.4353 - val_decoder_mean_squared_error: 0.0260 - val_latent_kl_loss: 9.4362
Epoch 5/10
394/394 [==============================] - 9s 20ms/step - loss: -2131.1963 - decoder_loss: -2139.6418 - latent_loss: 8.4446 - decoder_mean_squared_error: 0.0220 - latent_kl_loss: 8.4430 - val_loss: -1829.8097 - val_decoder_loss: -1838.4362 - val_latent_loss: 8.6265 - val_decoder_mean_squared_error: 0.0259 - val_latent_kl_loss: 8.6272
Epoch 6/10
394/394 [==============================] - 9s 20ms/step - loss: -2179.8435 - decoder_loss: -2187.7412 - latent_loss: 7.8981 - decoder_mean_squared_error: 0.0215 - latent_kl_loss: 7.8999 - val_loss: -1886.1505 - val_decoder_loss: -1894.4392 - val_latent_loss: 8.2890 - val_decoder_mean_squared_error: 0.0253 - val_latent_kl_loss: 8.2896
Epoch 7/10
394/394 [==============================] - 9s 20ms/step - loss: -2213.2449 - decoder_loss: -2220.8142 - latent_loss: 7.5693 - decoder_mean_squared_error: 0.0212 - latent_kl_loss: 7.5680 - val_loss: -1909.7399 - val_decoder_loss: -1917.5172 - val_latent_loss: 7.7776 - val_decoder_mean_squared_error: 0.0251 - val_latent_kl_loss: 7.7781
Epoch 8/10
394/394 [==============================] - 9s 20ms/step - loss: -2238.7202 - decoder_loss: -2246.0688 - latent_loss: 7.3495 - decoder_mean_squared_error: 0.0210 - latent_kl_loss: 7.3519 - val_loss: -1905.4167 - val_decoder_loss: -1913.1265 - val_latent_loss: 7.7099 - val_decoder_mean_squared_error: 0.0251 - val_latent_kl_loss: 7.7103
Epoch 9/10
394/394 [==============================] - 9s 20ms/step - loss: -2288.4106 - decoder_loss: -2295.5720 - latent_loss: 7.1623 - decoder_mean_squared_error: 0.0205 - latent_kl_loss: 7.1604 - val_loss: -2012.9708 - val_decoder_loss: -2020.5276 - val_latent_loss: 7.5560 - val_decoder_mean_squared_error: 0.0239 - val_latent_kl_loss: 7.5564
Epoch 10/10
394/394 [==============================] - 9s 20ms/step - loss: -2373.5266 - decoder_loss: -2380.6680 - latent_loss: 7.1418 - decoder_mean_squared_error: 0.0197 - latent_kl_loss: 7.1406 - val_loss: -2084.1760 - val_decoder_loss: -2091.7673 - val_latent_loss: 7.5919 - val_decoder_mean_squared_error: 0.0232 - val_latent_kl_loss: 7.5923
Evaluating model for CondVAE_pert-CC
97/97 [==============================] - 0s 1ms/step
97/97 [==============================] - 0s 1ms/step
319/319 [==============================] - 0s 1ms/step
319/319 [==============================] - 0s 1ms/step
Clustering results for CondVAE_pert-CC
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
Clustering results for MPPleiden
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading

This should have created the three trained experiments in EXPERIMENT_DIR/test (test is the <experiment_dir> defined in the config):

[5]:
os.listdir(os.path.join(campa_config.EXPERIMENT_DIR, "test"))
[5]:
['CondVAE_pert-CC', 'VAE', 'MPPleiden']

Running experiments with Estimator and Predictor

Now, we will be using the cVAE experiment to show how to use the Estimator and Predictor classes. Note that if you ran the command above, this model is already trained, and the below commands will re-train it.

Neural network training with Estimator

The Estimator class handles model setup, training, and prediction. It is instantiated from an Experiment.

[6]:
exp = exps[1]
print("Experiment name:", exp.name)
print("Experiment is stored in:", exp.full_path)
Experiment name: CondVAE_pert-CC
Experiment is stored in: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC
[7]:
est = Estimator(exp)

The Estimator.train_model() function is used to train the experiment's model.

[8]:
est.train_model()
Epoch 1/10
394/394 [==============================] - 10s 21ms/step - loss: -2423.9329 - decoder_loss: -2431.0073 - latent_loss: 7.0749 - decoder_mean_squared_error: 0.0192 - latent_kl_loss: 7.0753 - val_loss: -2109.7830 - val_decoder_loss: -2117.1072 - val_latent_loss: 7.3246 - val_decoder_mean_squared_error: 0.0231 - val_latent_kl_loss: 7.3250
Epoch 2/10
394/394 [==============================] - 9s 20ms/step - loss: -2481.6553 - decoder_loss: -2488.6077 - latent_loss: 6.9536 - decoder_mean_squared_error: 0.0187 - latent_kl_loss: 6.9541 - val_loss: -2157.7593 - val_decoder_loss: -2165.0781 - val_latent_loss: 7.3189 - val_decoder_mean_squared_error: 0.0226 - val_latent_kl_loss: 7.3193
Epoch 3/10
394/394 [==============================] - 9s 21ms/step - loss: -2531.1716 - decoder_loss: -2538.0715 - latent_loss: 6.9019 - decoder_mean_squared_error: 0.0183 - latent_kl_loss: 6.9030 - val_loss: -2232.9197 - val_decoder_loss: -2240.2471 - val_latent_loss: 7.3282 - val_decoder_mean_squared_error: 0.0217 - val_latent_kl_loss: 7.3285
Epoch 4/10
394/394 [==============================] - 9s 20ms/step - loss: -2570.6665 - decoder_loss: -2577.5518 - latent_loss: 6.8846 - decoder_mean_squared_error: 0.0179 - latent_kl_loss: 6.8844 - val_loss: -2267.1956 - val_decoder_loss: -2274.5400 - val_latent_loss: 7.3443 - val_decoder_mean_squared_error: 0.0213 - val_latent_kl_loss: 7.3445
Epoch 5/10
394/394 [==============================] - 9s 20ms/step - loss: -2596.6277 - decoder_loss: -2603.4556 - latent_loss: 6.8281 - decoder_mean_squared_error: 0.0177 - latent_kl_loss: 6.8282 - val_loss: -2311.4065 - val_decoder_loss: -2318.5420 - val_latent_loss: 7.1355 - val_decoder_mean_squared_error: 0.0209 - val_latent_kl_loss: 7.1357
Epoch 6/10
394/394 [==============================] - 9s 20ms/step - loss: -2613.1116 - decoder_loss: -2619.9062 - latent_loss: 6.7942 - decoder_mean_squared_error: 0.0176 - latent_kl_loss: 6.7924 - val_loss: -2265.0884 - val_decoder_loss: -2272.2742 - val_latent_loss: 7.1861 - val_decoder_mean_squared_error: 0.0213 - val_latent_kl_loss: 7.1862
Epoch 7/10
394/394 [==============================] - 9s 20ms/step - loss: -2622.1897 - decoder_loss: -2628.9617 - latent_loss: 6.7705 - decoder_mean_squared_error: 0.0175 - latent_kl_loss: 6.7707 - val_loss: -2274.3088 - val_decoder_loss: -2281.4370 - val_latent_loss: 7.1281 - val_decoder_mean_squared_error: 0.0212 - val_latent_kl_loss: 7.1281
Epoch 8/10
394/394 [==============================] - 9s 20ms/step - loss: -2625.8840 - decoder_loss: -2632.5740 - latent_loss: 6.6894 - decoder_mean_squared_error: 0.0175 - latent_kl_loss: 6.6884 - val_loss: -2317.0896 - val_decoder_loss: -2324.1270 - val_latent_loss: 7.0367 - val_decoder_mean_squared_error: 0.0208 - val_latent_kl_loss: 7.0367
Epoch 9/10
394/394 [==============================] - 9s 20ms/step - loss: -2630.3965 - decoder_loss: -2637.0344 - latent_loss: 6.6381 - decoder_mean_squared_error: 0.0175 - latent_kl_loss: 6.6368 - val_loss: -2297.4724 - val_decoder_loss: -2304.4858 - val_latent_loss: 7.0141 - val_decoder_mean_squared_error: 0.0210 - val_latent_kl_loss: 7.0141
Epoch 10/10
394/394 [==============================] - 9s 20ms/step - loss: -2634.0779 - decoder_loss: -2640.6655 - latent_loss: 6.5884 - decoder_mean_squared_error: 0.0174 - latent_kl_loss: 6.5918 - val_loss: -2286.9790 - val_decoder_loss: -2294.0789 - val_latent_loss: 7.1011 - val_decoder_mean_squared_error: 0.0210 - val_latent_kl_loss: 7.1011
[8]:
loss decoder_loss latent_loss decoder_mean_squared_error latent_kl_loss val_loss val_decoder_loss val_latent_loss val_decoder_mean_squared_error val_latent_kl_loss
epoch
0 -2423.932861 -2431.007324 7.074893 0.019210 7.075301 -2109.782959 -2117.107178 7.324591 0.023058 7.324974
1 -2481.655273 -2488.607666 6.953554 0.018704 6.954084 -2157.759277 -2165.078125 7.318946 0.022575 7.319310
2 -2531.171631 -2538.071533 6.901866 0.018289 6.902996 -2232.919678 -2240.247070 7.328206 0.021735 7.328503
3 -2570.666504 -2577.551758 6.884612 0.017950 6.884366 -2267.195557 -2274.540039 7.344334 0.021303 7.344531
4 -2596.627686 -2603.455566 6.828085 0.017743 6.828167 -2311.406494 -2318.541992 7.135535 0.020870 7.135660
5 -2613.111572 -2619.906250 6.794180 0.017608 6.792426 -2265.088379 -2272.274170 7.186080 0.021294 7.186206
6 -2622.189697 -2628.961670 6.770476 0.017539 6.770661 -2274.308838 -2281.437012 7.128055 0.021195 7.128136
7 -2625.884033 -2632.573975 6.689381 0.017503 6.688420 -2317.089600 -2324.126953 7.036691 0.020774 7.036706
8 -2630.396484 -2637.034424 6.638138 0.017471 6.636763 -2297.472412 -2304.485840 7.014103 0.020958 7.014116
9 -2634.077881 -2640.665527 6.588428 0.017442 6.591764 -2286.979004 -2294.078857 7.101054 0.021028 7.101066

This saves the weights of the best model in the experiment directory.

[9]:
print(os.listdir(exp.full_path))
['checkpoint', 'results_epoch010', 'weights_epoch010.index', 'config.json', 'weights_epoch010.data-00000-of-00001', 'history.csv']

Predict val split and images with Predictor

The Predictor class evaluates trained models and predicts on new data. It is instantiated with an Experiment.

[10]:
pred = Predictor(exp)
pred.evaluate_model()
97/97 [==============================] - 0s 1ms/step
97/97 [==============================] - 0s 2ms/step
319/319 [==============================] - 0s 1ms/step
319/319 [==============================] - 0s 1ms/step

This function evaluates the model on the val and val_imgs splits and stores the results in a results folder inside the experiment directory.

[11]:
results_folder = os.path.join(pred.exp.full_path, f"results_epoch{pred.est.epoch:03d}")
print("Results folder", results_folder)
print(os.listdir(results_folder))
Results folder /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC/results_epoch010
['val', 'val_imgs']

The results are stored as MPPData objects. Note that it is important to define data_config when initialising the MPPData, to ensure that the data is found.

[12]:
print(MPPData.from_data_dir(os.path.join(results_folder, "val"), data_config="ExampleData"))
MPPData for ExampleData (12340 mpps with shape (3, 3, 34) from 8 objects). Data keys: ['obj_ids', 'x', 'y', 'mpp', 'conditions', 'labels', 'latent'].

Cluster the resulting latent space with Cluster

To get a quick overview of the learned latent space and its clustering, we can use the Cluster class to cluster the evaluation split of the data. To generate the final clustering on the entire dataset, have a look at the clustering tutorial.

[13]:
cl = Cluster.from_exp_split(exps[1])
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC/results_epoch010/val/clustering.npy

Cluster the val split of the dataset. You can change the resolution of the clustering by setting the config["leiden_resolution"] parameter.

This creates npy files in the experiment results folder containing the cluster assignment of every data point in the validation split.
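
For example, a coarser or finer clustering could be created by overriding the resolution before calling create_clustering. A sketch, assuming cl.config behaves like a plain dictionary (this tutorial keeps the default 0.2, printed below):

# a sketch: 0.5 is an arbitrary illustrative value, not used in this tutorial
cl.config["leiden_resolution"] = 0.5
cl.create_clustering()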

[14]:
print(cl.config["leiden_resolution"])
cl.create_clustering()
0.2
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading

We can again load the evaluated val split using MPPData, this time including the clustering.

[15]:
mpp_data = MPPData.from_data_dir(os.path.join(results_folder, "val"), data_config="ExampleData", keys=["clustering"])
print("clustering:", mpp_data.data("clustering"))
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC/results_epoch010/val/clustering.npy
clustering: ['0' '0' '2' ... '0' '1' '2']

Using this clustering, we can predict the clusters of the val_imgs split:

[16]:
# predict cluster images
_ = cl.predict_cluster_imgs(exps[1])
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
Plot results using ModelComparator

The ModelComparator class is a convenience class for quickly comparing different models. Below, we will compare the condVAE, VAE, and MPPleiden experiments that we just trained.

Note that due to the stochastic nature of neural network training (e.g. different random initialisations), your outputs could look slightly different from the outputs in the documentation. You should see the same trends, though.

[17]:
# get saved experiments from dir
exps = Experiment.get_experiments_from_dir("test")
comp = ModelComparator(exps)
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC/results_epoch010/val/clustering.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/CondVAE_pert-CC/results_epoch010/val_imgs/clustering.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/VAE/results_epoch010/val/clustering.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/VAE/results_epoch010/val_imgs/clustering.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/MPPleiden/results_epoch000/val/clustering.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test/MPPleiden/results_epoch000/val_imgs/clustering.npy

The loss and MSE summary plots show how well the VAE and cVAE models reconstruct the input. Both models appear to be well trained, as indicated by the plateauing loss curves.

[18]:
comp.plot_history(values=["val_loss", "val_decoder_loss"])
comp.plot_final_score(score="val_decoder_loss", fallback_score="val_loss", save_prefix="decoder_loss_")
comp.plot_per_channel_mse()
_images/notebooks_train_34_0.png
_images/notebooks_train_34_1.png
_images/notebooks_train_34_2.png

The two example images show that both models can predict the original inputs well.

[19]:
comp.plot_predicted_images(
    img_ids=[
        0,
        1,
    ],
    img_size=225,
)
_images/notebooks_train_36_0.png
_images/notebooks_train_36_1.png

The example Leiden clustering (here with resolution 0.2, as set in the experiment_params.py) shows some differences between the models. Even on these two example cells, the MPPleiden experiment (direct pixel clustering) appears less consistent across cells. The condVAE clustering shows distinct clusters at the periphery of some of the detected clusters. This is due to training with a small local neighbourhood on the very limited data in this toy example (only 10 cells from each perturbation). To remove this effect and train without a local neighbourhood, set num_neighbors in the model definition in the experiment_params and in the dataset definition in the data_params to 1 (a sketch of the relevant parameter changes follows the plot below).

[20]:
comp.plot_cluster_images(
    img_ids=[
        0,
        1,
    ],
    img_size=225,
    img_channel="00_DAPI",
)
_images/notebooks_train_38_0.png
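
To train without a local neighbourhood, matching changes are needed in both the dataset and the model definition. A sketch of the relevant keys, based on the data_params and experiment config printed earlier (whether the neighborhood flag should also be disabled is an assumption here):

# in the data parameter file (dataset definition), a sketch:
data_params = {
    # ... other keys as before ...
    "neighborhood": False,   # assumption: disable neighbourhood extraction
    "neighborhood_size": 1,  # single-pixel inputs
}

# in the experiment parameter file (model definition), a sketch:
model_kwargs = {
    # ... other keys as before ...
    "num_neighbors": 1,
}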

The pixel-level UMAP representations of the learned latent spaces and of the original molecular pixel profiles show that the condVAE_pert-CC model integrates the two perturbations best. In the two other UMAPs, several clusters occur exclusively in one perturbation.

[21]:
comp.plot_umap(
    channels=["15_SON", "18_NONO", "11_PML", "21_NCL", "16_H3", "21_COIL", "02_CDK7", "01_PABPC1", "00_DAPI"]
)
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/notebooks_train_40_1.png
_images/notebooks_train_40_2.png
_images/notebooks_train_40_3.png

Cluster data into CSLs

Using the trained condVAE_pert-CC model, this tutorial looks at how to cluster the learned latent representation into CSLs. In order to ensure that the clusters and their annotations are reproducible, we will use a pretrained model in this example. You can also use the model that you trained in the Train and evaluate models tutorial, but due to the randomness of training neural networks, the latent space might be different, resulting in different clusters.

In this tutorial we will:

  • create and cluster a smaller subset of the large dataset

  • interactively cluster the data

  • query the Human Protein Atlas to get candidate annotations

  • annotate the data and plot the results on example cells

  • prepare the entire dataset for clustering

  • project the clustering to the entire dataset

[1]:
from pathlib import Path
import os

from IPython.display import display
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt

from campa.pl import annotate_img
from campa.tl import (
    Cluster,
    Experiment,
    create_cluster_data,
    get_clustered_cells,
    load_full_data_dict,
    prepare_full_dataset,
    project_cluster_data,
    add_clustering_to_adata,
)
from campa.data import MPPData, load_example_data, load_example_experiment
from campa.utils import init_logging
from campa.constants import campa_config

# init logging with level INFO=20, WARNING=30
init_logging(level=30)
# ensure that example data is downloaded
load_example_data()
# read correct campa_config -- created with setup.ipynb
CAMPA_DIR = Path.cwd()
campa_config.config_fname = CAMPA_DIR / "params/campa.ini"
print(campa_config)
2022-11-25 11:51:22.297858: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 11:51:22.485116: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 11:51:22.529442: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

First, we need to download the pretrained model:

[2]:
example_experiment_folder = load_example_experiment(campa_config.EXPERIMENT_DIR)
print("Example experiment downloaded to:", example_experiment_folder)
Example experiment downloaded to: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained

Prepare a subset of the data

First, the data is subsampled, because we would like the clustering to be interactive and feasible to compute on a laptop. If you have more time or access to GPUs, you could also consider skipping the subsampling step and clustering all data directly.

Subsetting and clustering the data can be done with the high-level API, using create_cluster_data. Alternatively, the CLI can be used:

cd CAMPA_DIR/params
campa cluster test/CondVAE_pert-CC create --subsample --frac 0.1 --save-dir aggregated/sub-0.1 --cluster

If you would like to cluster the data directly as well, use cluster=True in the call to create_cluster_data. Here, we defer the clustering to the interactive clustering and annotation step below.

[3]:
create_cluster_data("test_pre_trained/CondVAE_pert-CC", subsample=True, frac=0.1, save_dir="aggregated/sub-0.1")
WARNING:Cluster:Could not load MPPData from test_pre_trained/CondVAE_pert-CC/aggregated/sub-0.1
WARNING:Cluster:Could not load MPPData from test_pre_trained/CondVAE_pert-CC/aggregated/sub-0.1
2022-11-25 11:51:35.375731: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 11:51:35.990505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30975 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:37:00.0, compute capability: 7.0
2022-11-25 11:51:37.894959: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8202
525/525 [==============================] - 2s 1ms/step
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading

This has created a subset of 10% of all pixels in aggregated/sub-0.1, which again is readable as MPPData. Note that in order to load the MPPData correctly here, we have to define the data_config and set base_dir to the EXPERIMENT_DIR, because by default MPPData looks for data relative to the DATA_DIR defined in the data_config.

[4]:
cluster_data_dir = "test_pre_trained/CondVAE_pert-CC/aggregated/sub-0.1"
print(os.listdir(os.path.join(campa_config.EXPERIMENT_DIR, cluster_data_dir)))
print(MPPData.from_data_dir(cluster_data_dir, data_config="ExampleData", base_dir=campa_config.EXPERIMENT_DIR))
['channels.csv', 'labels.npy', 'latent.npy', 'y.npy', 'conditions.npy', 'umap.npy', 'mpp.npy', 'obj_ids.npy', 'cluster_params.json', 'metadata.csv', 'x.npy']
MPPData for ExampleData (67084 mpps with shape (3, 3, 34) from 46 objects). Data keys: ['obj_ids', 'x', 'y', 'mpp', 'labels', 'latent', 'conditions'].

Prepare full dataset for projecting the clustering to it

To project the clustering to the entire dataset, the model needs to predict the latent representation for all data. It is recommended to run this step in a script, as it might take a while for large datasets:

cd CAMPA_DIR/params
campa cluster test/CondVAE_pert-CC prepare-full --save-dir aggregated/full_data

This script uses the prepare_full_dataset function of the high-level API.

[5]:
prepare_full_dataset("test_pre_trained/CondVAE_pert-CC", save_dir="aggregated/full_data")
iterating over data dirs ['184A1_unperturbed/I09', '184A1_unperturbed/I11', '184A1_meayamycin/I12', '184A1_meayamycin/I20']
1201/1201 [==============================] - 1s 1ms/step
1373/1373 [==============================] - 2s 1ms/step
1412/1412 [==============================] - 2s 1ms/step
1257/1257 [==============================] - 1s 1ms/step

This saves the predicted data to aggregated/full_data. As usual, this can be loaded using MPPData.

[6]:
full_data_dir = "test_pre_trained/CondVAE_pert-CC/aggregated/full_data"
print(os.listdir(os.path.join(campa_config.EXPERIMENT_DIR, full_data_dir)))
# load MPPData from one data_dir
print(
    MPPData.from_data_dir(
        os.path.join(full_data_dir, "184A1_unperturbed/I09"),
        data_config="ExampleData",
        base_dir=campa_config.EXPERIMENT_DIR,
    )
)
['184A1_unperturbed', '184A1_meayamycin']
MPPData for ExampleData (153669 mpps with shape (1, 1, 35) from 12 objects). Data keys: ['obj_ids', 'x', 'y', 'mpp', 'labels', 'latent'].

Interactive clustering and annotation

Now we can cluster our data. This is done by getting an anndata object from Cluster.cluster_mpp using MPPData.get_adata and clustering it with scanpy.

Note that to reproduce the clustering, we will use the downloaded subsampled and projected data in aggregated/sub-pre.

[7]:
# load cl
cluster_data_dir = "test_pre_trained/CondVAE_pert-CC/aggregated/sub-pre"
cl = Cluster.from_cluster_data_dir(cluster_data_dir)
# get adata object
adata = cl.cluster_mpp.get_adata(X="mpp", obsm={"X_latent": "latent", "X_umap": "umap"})

adata contains pixels as obs and protein channels as var. The cVAE latent space is stored in obsm['X_latent'].

[8]:
print("obs:", adata.obs.index)
print("var:", adata.var.index)
print("X_latent shape:", adata.obsm["X_latent"].shape)
obs: Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '67074', '67075', '67076', '67077', '67078', '67079', '67080', '67081',
       '67082', '67083'],
      dtype='object', length=67084)
var: Index(['01_CDK9_pT186', '01_PABPC1', '02_CDK7', '03_CDK9', '03_RPS6',
       '05_GTF2B', '05_Sm', '07_POLR2A', '07_SETD1A', '08_H3K4me3', '09_CCNT1',
       '09_SRRM2', '10_H3K27ac', '10_POL2RA_pS2', '11_KPNA2_MAX', '11_PML',
       '12_RB1_pS807_S811', '12_YAP1', '13_PABPN1', '13_POL2RA_pS5', '14_PCNA',
       '15_SON', '15_U2SNRNPB', '16_H3', '17_HDAC3', '17_SRSF2', '18_NONO',
       '19_KPNA1_MAX', '20_ALYREF', '20_SP100', '21_COIL', '21_NCL', '00_DAPI',
       '07_H2B'],
      dtype='object', name='name')
X_latent shape: (67084, 16)

Cluster the latent space. Here, different clusterings with different resolutions could be created and compared (see the sketch after the next cell).

[9]:
# cluster the latent space
sc.pp.neighbors(adata, use_rep="X_latent")
sc.tl.leiden(adata, resolution=0.2, key_added="clustering_res0.2", random_state=0)
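
Several resolutions could also be computed side by side and compared; a minimal sketch (only resolution 0.2 is used in the remainder of this tutorial):

# a sketch: create clusterings at several resolutions for comparison
for res in [0.2, 0.5, 1.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f"clustering_res{res}", random_state=0)
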
[10]:
# write clustering to disk
np.save(
    os.path.join(campa_config.EXPERIMENT_DIR, cluster_data_dir, "clustering_res0.2"), adata.obs["clustering_res0.2"]
)

Explore clustered data

The aim of this step is to annotate the clustering to aid interpretability of the results. In this process, CSLs corresponding to the same biological structure may be merged into a single annotated CSL. The annotation is done iteratively and considers the following factors:

  • Presence of canonical organelle markers among the top enriched channels in each CSL in unperturbed (control) cells. If no canonical markers are present, consider the CSL as ‘background’ (i.e. nucleoplasm/cytoplasm).

  • Spatial distribution of CSLs compared to the spatial distribution of canonical organelle markers in unperturbed (control) cells.

  • Human Protein Atlas subcellular localisation (The Human Protein Atlas, Thul et al. 2017) of the top enriched channels in each CSL, weighted by z-scored channel intensity.

Below, we show how to visualise and analyse the clusters for each of these aspects.

Load the Cluster object, export the Cluster.cluster_mpp with MPPData.get_adata, and add the just-created clustering to the exported adata object with add_clustering_to_adata.

[11]:
# load cl
cluster_data_dir = "test_pre_trained/CondVAE_pert-CC/aggregated/sub-pre"
cl = Cluster.from_cluster_data_dir(cluster_data_dir)
# get adata object
adata = cl.cluster_mpp.get_adata(X="mpp", obsm={"X_latent": "latent", "X_umap": "umap"})
# add clustering and colormap (from cluster_name_annotation.csv) to adata
cl.set_cluster_name("clustering_res0.2")
add_clustering_to_adata(
    os.path.join(campa_config.EXPERIMENT_DIR, cluster_data_dir), "clustering_res0.2", adata, cl.cluster_annotation
)
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/sub-pre/clustering_res0.2.npy

The following is a UMAP of the latent space coordinates of each pixel. It is useful to check that the clustering algorithm is doing something sensible, and that the conditions used in the autoencoder don’t end up in different regions of the latent space UMAP (which would indicate that the conditional autoencoder was not able to generate a condition-independent representation). Channel intensities can also be visualised here, which is useful if you have markers for known structures (e.g. NCL, H3, SON).

[12]:
plt.rcParams["figure.figsize"] = [6, 4]
sc.pl.umap(
    adata,
    color=["perturbation_duration", "cell_cycle", "clustering_res0.2", "15_SON", "21_NCL", "16_H3"],
    vmax="p99",
    ncols=3,
)
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/notebooks_cluster_22_1.png

To begin to understand the identity of each cluster, it can be helpful to visualise how the intensity of each channel is distributed across the clusters:

[13]:
cluster_name = "clustering_res0.2"
pixel_values_annotated = pd.concat(
    [
        pd.DataFrame(adata.X, columns=adata.var_names).reset_index(drop=True),
        adata.obs[[cluster_name]].reset_index(drop=True),
    ],
    axis=1,
)
_ = sns.clustermap(
    pixel_values_annotated.groupby(cluster_name).aggregate("mean"),
    z_score=1,
    cmap="vlag",
    figsize=[14, 9],
    vmin=-3,
    vmax=3,
    method="ward",
)
_images/notebooks_cluster_24_0.png

For example, the plot above reveals that PML-body markers PML and SP100 exclusively localise to cluster 5, which is therefore very likely to correspond to PML bodies.

To identify clusters, it can also be very helpful to plot the clusters directly over some example cells. Using load_full_data_dict and get_clustered_cells, we can project the clustering onto a few random cells from each well of the experiment.

[14]:
# NOTE: this may take a couple of minutes
# load data
exp = Experiment.from_dir("test_pre_trained/CondVAE_pert-CC")
mpp_datas = load_full_data_dict(exp)
# project clustering to some example cells
example_cells = {}
example_cells.update(get_clustered_cells(mpp_datas, cl, "clustering_res0.2", num_objs=3))
184A1_unperturbed/I09
184A1_unperturbed/I11
184A1_meayamycin/I12
184A1_meayamycin/I20
[15]:
fig, axes = plt.subplots(4, 3, figsize=(25, 30))
for j, data_dir in enumerate(
    ["184A1_unperturbed/I09", "184A1_unperturbed/I11", "184A1_meayamycin/I12", "184A1_meayamycin/I20"]
):
    for i in range(3):
        axes[j, i].imshow(example_cells["clustering_res0.2_colored"][data_dir][i])

for ax in axes.flat:
    ax.axis("off")
_images/notebooks_cluster_28_0.png

Get subcellular localisation from HPA

We can also query the Human Protein Atlas to get subcellular locations of the proteins measured by the most enriched channels in each cluster. For this, we will use Cluster.get_hpa_localisation. This function returns a dictionary of results for each cluster in our data. Here, we set max_num_channels to 3 to get location information from the top 3 most enriched channels per cluster (weighted by z-scored channel intensity).

[16]:
cluster_name = "clustering_res0.2"
results = cl.get_hpa_localisation(
    cluster_name=cluster_name, thresh=1, max_num_channels=3, limit_to_groups={"perturbation_duration": "normal"}
)

For each cluster, the returned dictionary contains the genes that were used to query HPA for subcellular locations, together with the query results. The results are empty if no enriched channels were found for a cluster or if HPA returned no results.

[17]:
# look at results for cluster 1
display(results["1"]["hpa_data"])
print(results["1"]["subcellular_locations"])
Ensembl Gene Gene synonym Reliability (IF) Subcellular main location Subcellular additional location gene_weights
H2B ENSG00000184678 H2BC21 [H2B, H2B.1, H2B/q, H2BE, H2BFQ, HIST2H2BE] Approved [Nucleoplasm] [Cytosol] 2.114632
KPNA2 ENSG00000182481 KPNA2 [IPOA1, QIP2, RCH1, SRP1alpha] Enhanced [Nucleoplasm] [Cytosol] 1.662806
Nucleoplasm    3.777438
dtype: float64
[18]:
# plot HPA annotation of clusters
fig, axes = plt.subplots(1, len(results), squeeze=False, sharey=True, figsize=(10, 4))
for ax, (idx, res) in zip(axes.flat, results.items()):
    if res["subcellular_locations"] is not None:
        res["subcellular_locations"].plot(kind="bar", ax=ax)
    ax.set_title(idx)
plt.suptitle(f"Subcellular locations by HPA for clustering {cluster_name}")
[18]:
Text(0.5, 0.98, 'Subcellular locations by HPA for clustering clustering_res0.2')
_images/notebooks_cluster_33_1.png

Annotate clustering

Now, we are ready to annotate the clustering with the names of known structures. We will use the HPA subcellular locations, the example images showing the spatial distribution of clusters in cells, and the channel intensities in each cluster plotted above to make a final assignment of clusters to names. Note that this annotation might vary slightly when rerunning this experiment and clustering, due to the inherent randomness of neural network training. As misannotations can introduce a bias in subsequent analysis of the CSLs, we recommend annotating only those CSLs that you are confident correspond to the annotated structure.

In this example, we map multiple clusters to “Nucleoplasm” because we are not especially interested in the different Nucleoplasm ‘flavours’ identified, which tend to occur heterogeneously between cells. This merging is entirely optional, and we recommend keeping clusters separate unless you are sure you want to merge them. Another possibility is to name them ‘Nucleoplasm 1’, ‘Nucleoplasm 2’, etc.

For the annotation, we create a dictionary mapping Leiden cluster names to annotated names and use Cluster.add_cluster_annotation to add the annotation to the Cluster object. Annotations and their colour maps are stored in a csv file in the cluster_data_dir.

[19]:
annotation = {
    # No enriched markers -> nucleoplasm
    "0": "Nucleoplasm",
    # Nucleoplasm (consistent with HPA label)
    "1": "Nucleoplasm",
    # Nucleolus, due to appearance of NCL marker in this cluster.
    # HPA assigns either Nucleoplasm, Nucleoli or Nucleoli rim label
    "2": "Nucleolus",
    # No enriched markers -> nucleoplasm
    "3": "Nucleoplasm",
    # HPA label: Nuclear speckles
    "4": "Nuclear speckles",
    # HPA label: nuclear bodies (= PML bodies).
    "5": "PML bodies",
    # HPA label: cytosol
    # This is a PABPC-1 enriched cluster only occurring on the Meayamycin perturbation (see UMAP)
    # This occurs here, because we did not use enough training data to train a completely perturbation-independent embedding of the data.
    # Let us assign this small cluster of Nucleoplasm for now.
    # Note that in the manuscript, we don't have perturbation-specific clusters in the cVAE output.
    "6": "Nucleoplasm",
}

cl.set_cluster_name("clustering_res0.2")
cl.add_cluster_annotation(annotation, "annotation")

The resulting cluster annotation data frame is stored in cluster_data_dir/clustering_res0.2_annotation.csv.

[20]:
# check out the resulting cluster annotation data frame - this is stored in cluster_data_dir/clustering_res0.2_annotation.csv
display(cl.cluster_annotation)
cluster_dir = os.path.join(campa_config.EXPERIMENT_DIR, cl.config["cluster_data_dir"])
print(cluster_dir)
print(os.listdir(cluster_dir))
clustering_res0.2 clustering_res0.2_colors annotation annotation_colors
index
0 0 #1f77b4 Nucleoplasm #f7b6d2
1 1 #ffbb78 Nucleoplasm #f7b6d2
2 2 #d62728 Nucleolus #d62728
3 3 #8c564b Nucleoplasm #f7b6d2
4 4 #f7b6d2 Nuclear speckles #1f77b4
5 5 #bcbd22 PML bodies #9edae5
6 6 #9edae5 Nucleoplasm #f7b6d2
7 #ffffff NaN #ffffff
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/sub-pre
['channels.csv', 'labels.npy', 'latent.npy', 'y.npy', 'conditions.npy', 'umap.npy', 'mpp.npy', 'pynndescent_index.pickle', 'obj_ids.npy', 'cluster_params.json', 'clustering_res0.2_annotation.csv', 'clustering_res0.2.npy', 'metadata.csv', 'x.npy']
[21]:
# check out cell colored by new annotation
fig, axes = plt.subplots(4, 3, figsize=(25, 30))
for j, data_dir in enumerate(
    ["184A1_unperturbed/I09", "184A1_unperturbed/I11", "184A1_meayamycin/I12", "184A1_meayamycin/I20"]
):
    for i in range(3):
        axes[j, i].imshow(
            annotate_img(
                example_cells["clustering_res0.2"][data_dir][i],
                annotation=cl.cluster_annotation,
                from_col=cl.config["cluster_name"],
                to_col="annotation",
                color=True,
            )
        )

for ax in axes.flat:
    ax.axis("off")
_images/notebooks_cluster_38_0.png
[22]:
# legend
plt.rcParams["figure.figsize"] = [3, 5]
df = cl.cluster_annotation.groupby("annotation")["annotation_colors"].first()
plt.barh(y=range(len(df)), width=1, color=df)
_ = plt.yticks(range(len(df)), df.index, rotation=0)
_ = plt.xticks([])
_images/notebooks_cluster_39_0.png
Project clustering to full data

After generating a clustering on a subset of the data, we can now project it to the entire dataset using the high-level API function project_cluster_data. Alternatively, the same step can be run via the CLI:

cd CAMPA_DIR/params
campa cluster test/CondVAE_pert-CC project aggregated/sub-pre --save-dir aggregated/full_data --cluster-name clustering_res0.2
[23]:
project_cluster_data(
    "test_pre_trained/CondVAE_pert-CC",
    cluster_data_dir="aggregated/sub-pre",
    cluster_name="clustering_res0.2",
    save_dir="aggregated/full_data",
)
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/sub-pre/clustering_res0.2.npy
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading
WARNING:MPPData:Saving partial keys of mpp data without a base_data_dir to enable correct loading

This creates npy files containing the clustering for the full data:

[24]:
# check out the resulting clustering using one data_dir
full_data_dir = "test_pre_trained/CondVAE_pert-CC/aggregated/full_data"
mpp_data = MPPData.from_data_dir(
    os.path.join(campa_config.EXPERIMENT_DIR, full_data_dir, "184A1_unperturbed/I09"),
    data_config="ExampleData",
    keys=["clustering_res0.2"],
)
print("clustering:", mpp_data.data("clustering_res0.2"))
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/clustering_res0.2.npy
clustering: ['3' '3' '3' ... '3' '3' '3']
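As a quick, optional sanity check, you could tabulate how many pixels of this data_dir were assigned to each cluster:

import numpy as np

# count pixels per cluster id (labels are strings, as printed above)
values, counts = np.unique(mpp_data.data("clustering_res0.2"), return_counts=True)
print(dict(zip(values, counts)))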

We are now ready to extract molecular, morphological, and size features from the CSLs. This is described in the Extract features from CSLs tutorial.

Extract features from CSLs

After creating and annotating CSLs, features can be extracted from each cell to quantitatively compare molecular intensity differences and spatial re-localisation of proteins in different conditions. CAMPA can extract the following features:

  • Intensity: per-cluster mean and size features. Needs to be calculated first to set up the adata.

  • Co-occurrence: spatial co-occurrence between pairs of clusters at different distances.

  • Object stats: number and area of connected components per cluster.

The features are saved as an AnnData object and can be used to compare molecular abundance within CSLs and spatial co-occurrence of CSLs in different conditions (e.g. perturbations).

Please make sure that you clustered the data and projected the result to the entire example dataset as described in the Cluster data into CSLs tutorial before running this tutorial.

[1]:
from pathlib import Path
import os

from IPython.display import display
import pandas as pd
import anndata as ad

from campa.pl import (
    plot_mean_size,
    plot_object_stats,
    plot_co_occurrence,
    plot_mean_intensity,
    get_intensity_change,
    plot_intensity_change,
    plot_co_occurrence_grid,
)
from campa.tl import Experiment, extract_features, FeatureExtractor
from campa.utils import load_config, init_logging, merged_config
from campa.constants import campa_config

# init logging with level INFO=20, WARNING=30
init_logging(level=30)
# read correct campa_config -- created with setup.ipynb
CAMPA_DIR = Path.cwd()
campa_config.config_fname = CAMPA_DIR / "params/campa.ini"
print(campa_config)
2022-11-25 14:34:34.778284: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 14:34:52.031324: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 14:34:55.054866: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

Extract features

To extract features, the high-level API function extract_features is used.

Extracting co-occurrence scores can take a long time, and it is recommended to use the CLI to run the feature extraction in a script:

cd CAMPA_DIR/params
campa extract_features example_feature_params.py

To define which features should be extracted, a parameter dictionary is used. All parameters that can be set in this dictionary are documented in extract_features. Here, we are going to use an example feature params file that extracts intensity, co-occurrence, and object features (object size, circularity, etc.) from the test_pre_trained/CondVAE_pert-CC experiment that we clustered in the clustering tutorial.
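If you want to write your own feature params file, its structure can be inferred from how it is loaded in the next cell: the module defines a feature_params base dictionary and a variable_feature_params list of overrides. A minimal sketch (the values mirror the cur_params printed below; treat the exact file layout as an assumption, and see example_feature_params.py for the authoritative version):

# sketch of a minimal feature params file, e.g. my_feature_params.py
feature_params = {
    "experiment_dir": "test_pre_trained/CondVAE_pert-CC",
    "cluster_name": "clustering_res0.2",
    "cluster_dir": "aggregated/sub-pre",
    "cluster_col": "annotation",
    "data_dirs": ["184A1_unperturbed/I09", "184A1_unperturbed/I11"],
    "save_name": "features_annotation.h5ad",
    "features": ["intensity", "co-occurrence", "object-stats"],
    "co_occurrence_params": {"min": 2.0, "max": 60.0, "nsteps": 5, "logspace": True},
    "object_stats_params": {
        "features": ["area", "circularity", "elongation", "extent"],
        "channels": [],
    },
}
# each entry overrides keys of feature_params for one extraction run
variable_feature_params = [{}]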

[2]:
# load parameter dictionary
params = load_config("params/example_feature_params.py")
# just use the first variable_params configuration here
for variable_params in params.variable_feature_params[:1]:
    cur_params = merged_config(params.feature_params, variable_params)
print(cur_params)
{'experiment_dir': 'test_pre_trained/CondVAE_pert-CC', 'cluster_name': 'clustering_res0.2', 'cluster_dir': 'aggregated/sub-pre', 'cluster_col': 'annotation', 'data_dirs': ['184A1_unperturbed/I09', '184A1_unperturbed/I11', '184A1_meayamycin/I12', '184A1_meayamycin/I20'], 'save_name': 'features_annotation.h5ad', 'force': False, 'features': ['intensity', 'co-occurrence', 'object-stats'], 'co_occurrence_params': {'min': 2.0, 'max': 60.0, 'nsteps': 5, 'logspace': True, 'num_processes': None}, 'object_stats_params': {'features': ['area', 'circularity', 'elongation', 'extent'], 'channels': []}}

Using these parameters, we can now extract the features. The extracted features will be saved to cur_params['save_name'] in each data directory in experiment_dir/aggregated/full_data.

Note that this step will take ~10 minutes to complete. For a faster result, you can turn off the computation of the co-occurrence features by setting cur_params['features'] = ['intensity', 'object-stats'], as shown below.
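For example, modify the parameter dictionary before calling extract_features (a minimal sketch; the remaining parameters stay as loaded above):

# faster run: skip the expensive co-occurrence computation
cur_params["features"] = ["intensity", "object-stats"]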

[3]:
extract_features(cur_params)
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/clustering_res0.2.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I11/clustering_res0.2.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_meayamycin/I12/clustering_res0.2.npy
Cannot read with memmap:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_meayamycin/I20/clustering_res0.2.npy
Explore and plot extracted features

Features are stored in AnnData objects with obs=cells and vars=channels. Intensity information is stored as layers, co-occurrence scores as obsm matrices (obs x distances for each cluster-cluster pair), and object features as matrices in uns.
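As a quick orientation, the three feature types can be accessed like this (a sketch assuming adata is one of the feature AnnData objects created by extract_features, with the CSL names annotated in this tutorial):

# per-CSL mean intensities: one layer per CSL (cells x channels)
adata.layers["intensity_Nucleolus"]
# co-occurrence scores for one CSL pair (cells x distance intervals)
adata.obsm["co_occurrence_Nucleolus_Nucleoplasm"]
# per-object statistics (one row per connected component)
adata.uns["object_stats"]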

The FeatureExtractor class loads this AnnData object and provides convenience functions to access feature information.

[4]:
# load features for each data_dir
exp = Experiment.from_dir("test_pre_trained/CondVAE_pert-CC")
extrs = [
    FeatureExtractor.from_adata(
        os.path.join(exp.full_path, "aggregated/full_data", data_dir, "features_annotation.h5ad")
    )
    for data_dir in exp.data_params["data_dirs"]
]

extrs[0].adata
[4]:
AnnData object with n_obs × n_vars = 12 × 34
    obs: 'mapobject_id', 'plate_name', 'well_name', 'well_pos_y', 'well_pos_x', 'tpoint', 'zplane', 'label', 'is_border', 'mapobject_id_cell', 'plate_name_cell', 'well_name_cell', 'well_pos_y_cell', 'well_pos_x_cell', 'tpoint_cell', 'zplane_cell', 'label_cell', 'is_border_cell', 'is_mitotic', 'is_mitotic_labels', 'is_polynuclei_HeLa', 'is_polynuclei_HeLa_labels', 'is_polynuclei_184A1', 'is_polynuclei_184A1_labels', 'is_SBF2_Sphase_labels', 'is_SBF2_Sphase', 'Heatmap-48', 'cell_cycle', 'description', 'dimensions', 'id', 'cell_type', 'EU', 'duration', 'perturbation', 'secondary_only', 'siRNA', 'perturbation_duration', 'LocalDensity_Nuclei_800', 'TR_factor', 'TR_norm', 'TR', 'TR_factor_DMSO-unperturbed', 'TR_norm_DMSO-unperturbed', 'obj_id_int'
    uns: 'clusters', 'co_occurrence_params', 'object_stats', 'object_stats_params', 'params'
    obsm: 'co_occurrence_Nuclear speckles_Nuclear speckles', 'co_occurrence_Nuclear speckles_Nucleolus', 'co_occurrence_Nuclear speckles_Nucleoplasm', 'co_occurrence_Nuclear speckles_PML bodies', 'co_occurrence_Nucleolus_Nuclear speckles', 'co_occurrence_Nucleolus_Nucleolus', 'co_occurrence_Nucleolus_Nucleoplasm', 'co_occurrence_Nucleolus_PML bodies', 'co_occurrence_Nucleoplasm_Nuclear speckles', 'co_occurrence_Nucleoplasm_Nucleolus', 'co_occurrence_Nucleoplasm_Nucleoplasm', 'co_occurrence_Nucleoplasm_PML bodies', 'co_occurrence_PML bodies_Nuclear speckles', 'co_occurrence_PML bodies_Nucleolus', 'co_occurrence_PML bodies_Nucleoplasm', 'co_occurrence_PML bodies_PML bodies', 'size'
    layers: 'intensity_Nuclear speckles', 'intensity_Nucleolus', 'intensity_Nucleoplasm', 'intensity_PML bodies'

The AnnData object contains all feature information

[5]:
extr = extrs[0]
print("AnnData read from", extr.fname)
print(extr.adata)
AnnData read from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/features_annotation.h5ad
AnnData object with n_obs × n_vars = 12 × 34
    obs: 'mapobject_id', 'plate_name', 'well_name', 'well_pos_y', 'well_pos_x', 'tpoint', 'zplane', 'label', 'is_border', 'mapobject_id_cell', 'plate_name_cell', 'well_name_cell', 'well_pos_y_cell', 'well_pos_x_cell', 'tpoint_cell', 'zplane_cell', 'label_cell', 'is_border_cell', 'is_mitotic', 'is_mitotic_labels', 'is_polynuclei_HeLa', 'is_polynuclei_HeLa_labels', 'is_polynuclei_184A1', 'is_polynuclei_184A1_labels', 'is_SBF2_Sphase_labels', 'is_SBF2_Sphase', 'Heatmap-48', 'cell_cycle', 'description', 'dimensions', 'id', 'cell_type', 'EU', 'duration', 'perturbation', 'secondary_only', 'siRNA', 'perturbation_duration', 'LocalDensity_Nuclei_800', 'TR_factor', 'TR_norm', 'TR', 'TR_factor_DMSO-unperturbed', 'TR_norm_DMSO-unperturbed', 'obj_id_int'
    uns: 'clusters', 'co_occurrence_params', 'object_stats', 'object_stats_params', 'params'
    obsm: 'co_occurrence_Nuclear speckles_Nuclear speckles', 'co_occurrence_Nuclear speckles_Nucleolus', 'co_occurrence_Nuclear speckles_Nucleoplasm', 'co_occurrence_Nuclear speckles_PML bodies', 'co_occurrence_Nucleolus_Nuclear speckles', 'co_occurrence_Nucleolus_Nucleolus', 'co_occurrence_Nucleolus_Nucleoplasm', 'co_occurrence_Nucleolus_PML bodies', 'co_occurrence_Nucleoplasm_Nuclear speckles', 'co_occurrence_Nucleoplasm_Nucleolus', 'co_occurrence_Nucleoplasm_Nucleoplasm', 'co_occurrence_Nucleoplasm_PML bodies', 'co_occurrence_PML bodies_Nuclear speckles', 'co_occurrence_PML bodies_Nucleolus', 'co_occurrence_PML bodies_Nucleoplasm', 'co_occurrence_PML bodies_PML bodies', 'size'
    layers: 'intensity_Nuclear speckles', 'intensity_Nucleolus', 'intensity_Nucleoplasm', 'intensity_PML bodies'
Intensity features

Intensity features are the mean intensity of channels in each cluster (CSL).

Intensity information for each CSL is contained in a separate layer in FeatureExtractor.adata.layers. Overall (per cell) intensity information is stored in FeatureExtractor.adata.X. In addition to the mean intensity per CSL, the adata also contains the size of each CSL per cell in FeatureExtractor.adata.obsm['size']

[6]:
# intensity per CSL
print("Intensity per CLS stored in", extr.adata.layers)

# whole cell intensity
print("Whole cell intensity stored in X with shape", extr.adata.X.shape)

# size of CSLs per cell
display(extr.adata.obsm["size"])
Intensity per CSL stored in Layers with keys: intensity_Nuclear speckles, intensity_Nucleolus, intensity_Nucleoplasm, intensity_PML bodies
Whole cell intensity stored in X with shape (12, 34)
      all  Nuclear speckles  Nucleolus  Nucleoplasm  PML bodies
0   13668            1237.0       3090         9058         283
1   14816            1385.0       3457         9477         497
2   13048            1318.0       2384         8424         922
3   13478             890.0       2629         9475         484
4   22785            1202.0       5749        15396         438
5    6876              35.0         50         6580         211
6   10961            1081.0       1918         7760         202
7   10340             738.0       2120         7243         239
8    9010               0.0        367         8460         183
9   14668             841.0       2898        10590         339
10  10472             616.0       2091         7414         351
11  13547            1363.0       2807         9003         374

It is possible to export the intensity information in one csv file using FeatureExtractor.extract_intensity_csv. The resulting csv file contains the intensity per CSL for each channel as columns, with CSLs stacked on top of each other, as well as additionally defined columns. This saves a csv file in experiment_dir/aggregated/full_data/data_dir/export.

[7]:
for extr in extrs:
    extr.extract_intensity_csv(obs=["well_name", "perturbation_duration", "TR"])
[8]:
extr = extrs[0]
# check if results are stored
save_dir = os.path.join(os.path.dirname(extr.fname), "export")
print("csv exported to", save_dir)
print([n for n in os.listdir(save_dir) if "intensity" in n])

display(pd.read_csv(os.path.join(save_dir, "intensity_features_annotation.csv"), index_col=0))
csv exported to /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/export
['intensity_features_annotation.csv']
01_CDK9_pT186 01_PABPC1 02_CDK7 03_CDK9 03_RPS6 05_GTF2B 05_Sm 07_POLR2A 07_SETD1A 08_H3K4me3 ... 21_COIL 21_NCL 00_DAPI 07_H2B size cluster mapobject_id well_name perturbation_duration TR
0 0.220263 0.168692 0.292341 0.227046 0.344410 0.400212 0.351876 0.268362 0.197742 0.309594 ... 0.347049 0.205577 0.420119 0.692879 13668.0 all 205776 I09 normal 357.672008
1 0.333146 0.382737 0.403504 0.367799 0.526189 0.556079 0.402247 0.302671 0.285122 0.310469 ... 0.350691 0.278747 0.379324 0.583052 14816.0 all 205790 I09 normal 428.364268
2 0.360394 0.480734 0.525491 0.355149 0.551694 0.361255 0.403097 0.348874 0.319844 0.488086 ... 0.413549 0.186384 0.419490 0.477315 13048.0 all 248082 I09 normal 250.488581
3 0.374549 0.272541 0.386451 0.359641 0.469129 0.475133 0.340987 0.407483 0.261632 0.405096 ... 0.317883 0.197697 0.611018 0.500728 13478.0 all 248102 I09 normal 515.735421
4 0.274677 0.239059 0.358933 0.218737 0.355724 0.362768 0.262636 0.304999 0.315362 0.348737 ... 0.267697 0.157225 0.522850 0.360451 22785.0 all 259784 I09 normal 348.150713
5 0.545745 0.589634 0.538518 0.790598 0.812851 0.687019 0.353689 0.564811 0.273066 0.620842 ... 0.586261 0.291767 0.744431 0.661208 6876.0 all 291041 I09 normal 443.565445
6 0.225679 0.147178 0.287102 0.215611 0.330553 0.347149 0.258178 0.152336 0.079622 0.224427 ... 0.339826 0.172255 0.536892 0.787214 10961.0 all 345908 I09 normal 316.004105
7 0.297851 0.296048 0.432829 0.343467 0.484858 0.489976 0.353426 0.433215 0.270189 0.437530 ... 0.362085 0.221666 0.534294 0.665742 10340.0 all 359378 I09 normal 367.904255
8 0.345589 0.388527 0.627053 0.528497 0.576135 0.671750 0.336406 0.337385 0.116071 0.348818 ... 0.568719 0.287813 0.669855 0.795703 9010.0 all 359393 I09 normal 513.668590
9 0.187118 0.153416 0.287734 0.215821 0.270915 0.352462 0.278277 0.222176 0.141322 0.275406 ... 0.307268 0.177784 0.418376 0.527903 14668.0 all 366493 I09 normal 330.847832
10 0.462329 0.304857 0.561744 0.448209 0.551895 0.686498 0.407504 0.552158 0.333329 0.504888 ... 0.478053 0.263166 0.483056 0.613083 10472.0 all 383341 I09 normal 388.012319
11 0.294427 0.227145 0.373680 0.315884 0.373401 0.480220 0.350587 0.404187 0.261491 0.322150 ... 0.338741 0.187825 0.408714 0.661022 13547.0 all 383793 I09 normal 376.243596
12 0.507860 0.116696 0.415587 0.411071 0.269115 0.466287 0.647966 0.425061 0.430787 0.434477 ... 0.551629 0.068284 0.344541 0.637030 1237.0 Nuclear speckles 205776 I09 normal 357.672008
13 0.623703 0.458236 0.522437 0.621646 0.584039 0.634695 0.659342 0.454861 0.618518 0.423877 ... 0.551859 0.140339 0.320592 0.544304 1385.0 Nuclear speckles 205790 I09 normal 428.364268
14 0.682698 0.406474 0.678200 0.579281 0.511746 0.443788 0.689661 0.466248 0.635434 0.669755 ... 0.534957 0.111079 0.382885 0.477689 1318.0 Nuclear speckles 248082 I09 normal 250.488581
15 0.829828 0.228499 0.542535 0.664138 0.394651 0.567538 0.591148 0.610244 0.561811 0.564465 ... 0.539236 0.097210 0.560445 0.503501 890.0 Nuclear speckles 248102 I09 normal 515.735421
16 0.587949 0.200030 0.538934 0.374302 0.298699 0.427932 0.459766 0.409769 0.776006 0.492420 ... 0.392491 0.074476 0.402964 0.293739 1202.0 Nuclear speckles 259784 I09 normal 348.150713
17 0.879115 0.505580 0.859152 1.269043 0.678345 0.870895 0.349357 0.862588 0.409897 0.872185 ... 0.627324 0.347079 0.843770 0.665566 35.0 Nuclear speckles 291041 I09 normal 443.565445
18 0.436311 0.122306 0.363612 0.341447 0.301133 0.391417 0.435425 0.200541 0.132738 0.296614 ... 0.507728 0.064535 0.437343 0.697406 1081.0 Nuclear speckles 345908 I09 normal 316.004105
19 0.613344 0.258710 0.613882 0.534657 0.417986 0.559939 0.628316 0.611163 0.541085 0.596809 ... 0.556250 0.100741 0.479679 0.628083 738.0 Nuclear speckles 359378 I09 normal 367.904255
20 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.0 Nuclear speckles 359393 I09 normal 513.668590
21 0.380903 0.128601 0.431963 0.363165 0.245710 0.391194 0.509589 0.299370 0.280922 0.379034 ... 0.510762 0.057614 0.350658 0.522174 841.0 Nuclear speckles 366493 I09 normal 330.847832
22 0.790890 0.239204 0.744155 0.655538 0.477792 0.850773 0.645294 0.733961 0.603453 0.685157 ... 0.707202 0.171665 0.440622 0.598995 616.0 Nuclear speckles 383341 I09 normal 388.012319
23 0.665019 0.190913 0.559104 0.586467 0.304415 0.566344 0.647416 0.622210 0.646608 0.465139 ... 0.533937 0.074984 0.372200 0.629311 1363.0 Nuclear speckles 383793 I09 normal 376.243596
24 0.148427 0.180788 0.273318 0.162722 0.453657 0.528312 0.234614 0.166509 0.119577 0.216034 ... 0.211342 0.629230 0.452858 0.596601 3090.0 Nucleolus 205776 I09 normal 357.672008
25 0.321246 0.384466 0.466361 0.349830 0.666562 0.809406 0.360467 0.256026 0.224605 0.293607 ... 0.297820 0.834984 0.452600 0.564111 3457.0 Nucleolus 205790 I09 normal 428.364268
26 0.265196 0.448149 0.616294 0.272459 0.676786 0.494530 0.300528 0.244217 0.186101 0.346716 ... 0.253866 0.547749 0.372457 0.361568 2384.0 Nucleolus 248082 I09 normal 250.488581
27 0.307583 0.260094 0.398412 0.283575 0.610882 0.663772 0.293926 0.322884 0.197018 0.339456 ... 0.235548 0.571796 0.659128 0.423576 2629.0 Nucleolus 248102 I09 normal 515.735421
28 0.169432 0.259798 0.308006 0.139552 0.434192 0.421139 0.183558 0.191187 0.220543 0.250773 ... 0.205292 0.416968 0.595195 0.300878 5749.0 Nucleolus 259784 I09 normal 348.150713
29 0.288970 0.573400 0.178885 0.152212 0.909482 1.961381 0.239195 0.112812 0.505988 0.202771 ... 0.254875 0.082196 0.316999 0.376961 50.0 Nucleolus 291041 I09 normal 443.565445
30 0.186602 0.144543 0.285613 0.146025 0.439281 0.470996 0.212429 0.097311 0.060434 0.127205 ... 0.247783 0.626981 0.597580 0.696738 1918.0 Nucleolus 345908 I09 normal 316.004105
31 0.235341 0.244843 0.435787 0.263856 0.556367 0.651079 0.288264 0.285059 0.173408 0.317414 ... 0.269057 0.652266 0.578100 0.589497 2120.0 Nucleolus 359378 I09 normal 367.904255
32 0.327060 0.284533 0.684505 0.439499 0.662197 0.902955 0.298708 0.255313 0.147835 0.302531 ... 0.402876 0.767738 0.690029 0.678832 367.0 Nucleolus 359393 I09 normal 513.668590
33 0.134715 0.196707 0.288979 0.158024 0.406262 0.498932 0.196501 0.158980 0.107610 0.208898 ... 0.218411 0.617844 0.483330 0.471182 2898.0 Nucleolus 366493 I09 normal 330.847832
34 0.402041 0.254182 0.602489 0.401866 0.629081 0.923648 0.343465 0.452137 0.271878 0.422481 ... 0.367254 0.704034 0.567679 0.560376 2091.0 Nucleolus 383341 I09 normal 388.012319
35 0.153056 0.246593 0.303968 0.178948 0.529094 0.628579 0.212314 0.205966 0.128629 0.206617 ... 0.205561 0.610992 0.439678 0.545569 2807.0 Nucleolus 383793 I09 normal 376.243596
36 0.200603 0.172094 0.278136 0.217488 0.319484 0.345095 0.347182 0.278014 0.187471 0.319911 ... 0.360972 0.082673 0.418794 0.731661 9058.0 Nucleoplasm 205776 I09 normal 357.672008
37 0.283121 0.369165 0.358578 0.332112 0.467734 0.446251 0.373474 0.290102 0.248082 0.294915 ... 0.336123 0.101634 0.361606 0.594361 9477.0 Nucleoplasm 205790 I09 normal 428.364268
38 0.327386 0.511329 0.464606 0.326807 0.526733 0.306853 0.374261 0.346206 0.301344 0.482867 ... 0.369829 0.102682 0.440043 0.506757 8424.0 Nucleoplasm 248082 I09 normal 250.488581
39 0.344895 0.283193 0.364815 0.347052 0.440229 0.412741 0.328092 0.408369 0.247858 0.400447 ... 0.315755 0.105059 0.604984 0.521843 9475.0 Nucleoplasm 248102 I09 normal 515.735421
40 0.283434 0.233945 0.359366 0.233103 0.332107 0.332370 0.274446 0.334147 0.308211 0.369987 ... 0.279331 0.067981 0.505745 0.389373 15396.0 Nucleoplasm 259784 I09 normal 348.150713
41 0.557869 0.593891 0.549823 0.810057 0.817546 0.689968 0.358099 0.577190 0.274604 0.632787 ... 0.591301 0.297003 0.746232 0.667527 6580.0 Nucleoplasm 291041 I09 normal 443.565445
42 0.206056 0.151637 0.273977 0.212636 0.307834 0.309180 0.243158 0.157471 0.077308 0.236853 ... 0.337918 0.074267 0.535696 0.821609 7760.0 Nucleoplasm 345908 I09 normal 316.004105
43 0.279551 0.315488 0.408583 0.340943 0.472250 0.429418 0.341596 0.454079 0.265506 0.451848 ... 0.366214 0.108504 0.526859 0.691131 7243.0 Nucleoplasm 359378 I09 normal 367.904255
44 0.346315 0.393278 0.624833 0.531133 0.573757 0.660094 0.337632 0.340097 0.114176 0.349965 ... 0.575995 0.267697 0.667318 0.803276 8460.0 Nucleoplasm 359393 I09 normal 513.668590
45 0.186315 0.144893 0.274057 0.218382 0.237644 0.308870 0.280488 0.232734 0.138392 0.282669 ... 0.313387 0.070174 0.405350 0.542516 10590.0 Nucleoplasm 366493 I09 normal 330.847832
46 0.434373 0.327475 0.520774 0.429435 0.538027 0.597603 0.396629 0.551351 0.316677 0.504983 ... 0.479980 0.149720 0.464575 0.629142 7414.0 Nucleoplasm 383341 I09 normal 388.012319
47 0.277232 0.228711 0.361804 0.313707 0.337872 0.420433 0.346364 0.427315 0.241685 0.331130 ... 0.345306 0.077412 0.404472 0.698643 9003.0 Nucleoplasm 383793 I09 normal 376.243596
48 0.376789 0.155016 0.415987 0.430936 0.278501 0.476820 0.488239 0.386589 0.361335 0.455047 ... 0.488911 0.113721 0.435417 0.746947 283.0 PML bodies 205776 I09 normal 357.672008
49 0.560107 0.419122 0.491514 0.465892 0.503227 0.669168 0.525055 0.442682 0.483283 0.408307 ... 0.435638 0.172683 0.371145 0.607132 497.0 PML bodies 205790 I09 normal 428.364268
50 0.447397 0.391609 0.628699 0.507518 0.513414 0.395720 0.522121 0.476079 0.383548 0.641615 ... 1.052344 0.124410 0.405646 0.507067 922.0 PML bodies 248082 I09 normal 250.488581
51 0.481626 0.212621 0.458021 0.459344 0.401864 0.501972 0.389035 0.476831 0.330262 0.559590 ... 0.399737 0.163967 0.560805 0.501328 484.0 PML bodies 248102 I09 normal 515.735421
52 0.488531 0.253733 0.518198 0.326205 0.312413 0.486280 0.344460 0.486728 0.547123 0.493350 ... 0.335386 0.112061 0.503552 0.308843 438.0 PML bodies 259784 I09 normal 348.150713
53 0.173208 0.474685 0.217995 0.255678 0.665857 0.262566 0.244024 0.236489 0.147212 0.305708 ... 0.500809 0.168967 0.773082 0.530786 211.0 PML bodies 291041 I09 normal 443.565445
54 0.223314 0.134024 0.395997 0.317227 0.328408 0.392892 0.321032 0.219565 0.066454 0.283898 ... 0.388524 0.195341 0.539376 0.805573 202.0 PML bodies 345908 I09 normal 316.004105
55 0.432739 0.276400 0.582298 0.535775 0.439132 0.680150 0.441125 0.565647 0.434058 0.577243 ... 0.462582 0.204928 0.539687 0.688922 239.0 PML bodies 359378 I09 normal 367.904255
56 0.349195 0.377453 0.614474 0.585159 0.513443 0.746960 0.355320 0.376605 0.139947 0.388643 ... 0.564927 0.255292 0.746669 0.679955 183.0 PML bodies 359393 I09 normal 513.668590
57 0.179455 0.111113 0.346545 0.264373 0.215761 0.366014 0.334439 0.241099 0.174705 0.359996 ... 0.370868 0.075611 0.438038 0.570519 339.0 PML bodies 366493 I09 normal 330.847832
58 0.835362 0.244223 0.864254 0.756968 0.515069 0.863117 0.601384 0.845983 0.577073 0.677438 ... 0.695252 0.193638 0.443788 0.612599 351.0 PML bodies 383341 I09 normal 388.012319
59 0.418801 0.175524 0.507001 0.409923 0.311571 0.492064 0.408276 0.540597 0.331909 0.451974 ... 0.468891 0.080946 0.411487 0.737467 374.0 PML bodies 383793 I09 normal 376.243596

60 rows × 40 columns
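Because the CSLs are stacked on top of each other, rows for a single CSL can be recovered by filtering on the cluster column; a small sketch reusing save_dir from above:

# select the rows belonging to one CSL from the stacked export
df = pd.read_csv(os.path.join(save_dir, "intensity_features_annotation.csv"), index_col=0)
nucleolus = df[df["cluster"] == "Nucleolus"]
print(nucleolus.shape)  # (12, 40): 12 cells, 40 columns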

We can compare CSL intensities across discrete conditions using a dot plot.

For this, we first combine all intensity information in one adata, adding one observation per CSL, using FeatureExtractor.get_intensity_adata.

[9]:
# get combined adata for dotplots
adatas = [extr.get_intensity_adata() for extr in extrs]
adata_intensity = ad.concat(adatas, index_unique="-")

adata_intensity
[9]:
AnnData object with n_obs × n_vars = 230 × 34
    obs: 'mapobject_id', 'plate_name', 'well_name', 'well_pos_y', 'well_pos_x', 'tpoint', 'zplane', 'label', 'is_border', 'mapobject_id_cell', 'plate_name_cell', 'well_name_cell', 'well_pos_y_cell', 'well_pos_x_cell', 'tpoint_cell', 'zplane_cell', 'label_cell', 'is_border_cell', 'is_mitotic', 'is_mitotic_labels', 'is_polynuclei_HeLa', 'is_polynuclei_HeLa_labels', 'is_polynuclei_184A1', 'is_polynuclei_184A1_labels', 'is_SBF2_Sphase_labels', 'is_SBF2_Sphase', 'Heatmap-48', 'cell_cycle', 'description', 'dimensions', 'id', 'cell_type', 'EU', 'duration', 'perturbation', 'secondary_only', 'siRNA', 'perturbation_duration', 'LocalDensity_Nuclei_800', 'TR', 'obj_id_int', 'size', 'cluster'

adata_intensity contains cell-CSL pairs as observations and channels as columns

[10]:
print(adata_intensity.var_names)
print(adata_intensity.obs[["mapobject_id", "cluster"]])
Index(['01_CDK9_pT186', '01_PABPC1', '02_CDK7', '03_CDK9', '03_RPS6',
       '05_GTF2B', '05_Sm', '07_POLR2A', '07_SETD1A', '08_H3K4me3', '09_CCNT1',
       '09_SRRM2', '10_H3K27ac', '10_POL2RA_pS2', '11_KPNA2_MAX', '11_PML',
       '12_RB1_pS807_S811', '12_YAP1', '13_PABPN1', '13_POL2RA_pS5', '14_PCNA',
       '15_SON', '15_U2SNRNPB', '16_H3', '17_HDAC3', '17_SRSF2', '18_NONO',
       '19_KPNA1_MAX', '20_ALYREF', '20_SP100', '21_COIL', '21_NCL', '00_DAPI',
       '07_H2B'],
      dtype='object')
               mapobject_id     cluster
0-all-0              205776         all
1-all-0              205790         all
2-all-0              248082         all
3-all-0              248102         all
4-all-0              259784         all
...                     ...         ...
4-PML bodies-3       231218  PML bodies
5-PML bodies-3       270030  PML bodies
6-PML bodies-3       276005  PML bodies
7-PML bodies-3       287615  PML bodies
8-PML bodies-3       294517  PML bodies

[230 rows x 2 columns]

Using this combined adata, we can plot the mean intensity of each channel in each CSL and the size of each CSL in the unperturbed cells using plot_mean_intensity and plot_mean_size

[11]:
plot_mean_intensity(
    adata_intensity,
    groupby="cluster",
    limit_to_groups={"perturbation": "normal"},
    dendrogram=False,
    layer=None,
    standard_scale="var",
    cmap="bwr",
    vmin=-4,
    vmax=4,
)
plot_mean_size(
    adata_intensity,
    groupby_row="cluster",
    groupby_col="perturbation_duration",
    normby_row="all",
    vmax=0.3,
)
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa/campa/pl/_intensity_features.py:222: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  c: adata[adata.obs[groupby_col] == c].obs.groupby(groupby_row).mean()["size"]
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa/campa/pl/_intensity_features.py:222: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  c: adata[adata.obs[groupby_col] == c].obs.groupby(groupby_row).mean()["size"]
_images/notebooks_extract_features_20_1.png
_images/notebooks_extract_features_20_2.png

Now, let us visualise the log2-fold change in intensity in the Meayamycin perturbation compared to unperturbed cells with plot_intensity_change. This plots a dot plot of clusters by channels. The colour of each dot is the log2-fold change in intensity compared to unperturbed cells. The size of each dot indicates the p-value: small dots are non-significant intensity changes, large dots are significant (p < alpha). For the sake of speed, p-values are determined here using a t-test; for more accurate p-values, please use pval='mixed_model', which will include the well as a random effect.

The first plot shows the log2-fold change, and the second plot shows the relative log2-fold change per CSL, obtained by dividing the values by the “all” column (norm_by_group='all').

[12]:
res = get_intensity_change(
    adata_intensity,
    groupby="cluster",
    reference_group="perturbation_duration",
    reference=["normal"],
    limit_to_groups={"perturbation_duration": ["normal", "Meayamycin-720"]},
    color="logfoldchange",
    size="pval",
    pval="ttest",
)
plot_intensity_change(**res, adjust_height=True, figsize=(15, 5), vmin=-2, vmax=2, dendrogram=True)

res = get_intensity_change(
    adata_intensity,
    groupby="cluster",
    reference_group="perturbation_duration",
    reference=["normal"],
    limit_to_groups={"perturbation_duration": ["normal", "Meayamycin-720"]},
    color="logfoldchange",
    size="pval",
    pval="ttest",
    norm_by_group="all",
)
plot_intensity_change(**res, adjust_height=True, figsize=(15, 5), vmin=-2, vmax=2)
WARNING: dendrogram data not found (using key=dendrogram_cluster). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:749: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa/campa/pl/_intensity_features.py:459: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
  _, pvals = scipy.stats.ttest_ind(cur_ref_expr, g_expr, axis=0)
/home/icb/hannah.spitzer/miniconda3/envs/campa/lib/python3.9/site-packages/scanpy/plotting/_dotplot.py:749: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)
_images/notebooks_extract_features_22_2.png
_images/notebooks_extract_features_22_3.png
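For more robust p-values, the call is identical apart from the pval argument; a sketch of the mixed-model variant mentioned above (expect a noticeably longer runtime):

res = get_intensity_change(
    adata_intensity,
    groupby="cluster",
    reference_group="perturbation_duration",
    reference=["normal"],
    limit_to_groups={"perturbation_duration": ["normal", "Meayamycin-720"]},
    color="logfoldchange",
    size="pval",
    pval="mixed_model",  # includes well as a random effect
)
plot_intensity_change(**res, adjust_height=True, figsize=(15, 5), vmin=-2, vmax=2)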
Co-occurrence scores

Co-occurrence scores are calculated for each cluster-cluster pair. They are stored in adata.obsm['co_occurrence_{cluster1}_{cluster2}'] as an n cells × distances matrix. The distances used can be found in adata.uns['co_occurrence_params'].

[13]:
extr = extrs[0]
display(extr.adata.obsm["co_occurrence_Nucleolus_Nuclear speckles"])

print(extr.adata.uns["co_occurrence_params"])
           0         1         2         3
0   0.020466  0.129805  0.733902  1.106458
1   0.118672  0.361029  0.867357  1.069965
2   0.062864  0.266752  0.787869  1.159926
3   0.035031  0.120766  0.550169  1.213713
4   0.000458  0.060348  0.623832  1.188438
5   0.000000  0.000000  0.129918  1.871468
6   0.043499  0.271834  0.901377  1.108016
7   0.042482  0.246719  0.788401  1.192697
8   0.000000  0.000000  0.000000  0.000000
9   0.011935  0.181210  1.058228  1.093186
10  0.114000  0.454269  0.865092  1.167536
11  0.019074  0.189035  0.644840  1.052670
{'interval': array([ 2.       ,  4.6806946, 10.954452 , 25.63722  , 60.       ],
      dtype=float32)}
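The printed intervals appear to be a log-spaced grid built from the min, max, and nsteps values in co_occurrence_params (logspace=True); a quick check, assuming numpy's logspace convention:

import numpy as np

# reconstruct the distance grid: 5 log-spaced points between 2 and 60
interval = np.logspace(np.log10(2.0), np.log10(60.0), num=5)
print(interval)  # approx. [ 2.    4.68  10.95  25.64  60.  ]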

It is possible to export the co-occurrence information in one csv file for each CSL-CSL pair using FeatureExtractor.extract_co_occurrence_csv. The resulting csv file contains the co-occurrence scores for each distance interval as columns and cells as rows, as well as additionally defined columns. This saves one csv file per CSL-CSL pair in experiment_dir/aggregated/full_data/data_dir/export.

[14]:
for extr in extrs:
    extr.extract_co_occurrence_csv(obs=["well_name", "perturbation_duration", "TR"])
[15]:
extr = extrs[0]
# check if results are stored
save_dir = os.path.join(os.path.dirname(extr.fname), "export")
print("csv exported to", save_dir)
print([n for n in os.listdir(save_dir) if "co_occurrence" in n])

display(pd.read_csv(os.path.join(save_dir, "co_occurrence_Nucleoplasm_Nucleolus_features_annotation.csv"), index_col=0))
csv exported to /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/export
['co_occurrence_PML bodies_Nuclear speckles_features_annotation.csv', 'co_occurrence_Nucleoplasm_Nuclear speckles_features_annotation.csv', 'co_occurrence_Nucleoplasm_Nucleolus_features_annotation.csv', 'co_occurrence_Nucleolus_PML bodies_features_annotation.csv', 'co_occurrence_Nuclear speckles_Nuclear speckles_features_annotation.csv', 'co_occurrence_PML bodies_Nucleolus_features_annotation.csv', 'co_occurrence_PML bodies_Nucleoplasm_features_annotation.csv', 'co_occurrence_Nucleolus_Nucleolus_features_annotation.csv', 'co_occurrence_Nuclear speckles_PML bodies_features_annotation.csv', 'co_occurrence_PML bodies_PML bodies_features_annotation.csv', 'co_occurrence_Nucleoplasm_PML bodies_features_annotation.csv', 'co_occurrence_Nuclear speckles_Nucleoplasm_features_annotation.csv', 'co_occurrence_Nucleolus_Nucleoplasm_features_annotation.csv', 'co_occurrence_Nuclear speckles_Nucleolus_features_annotation.csv', 'co_occurrence_Nucleolus_Nuclear speckles_features_annotation.csv', 'co_occurrence_Nucleoplasm_Nucleoplasm_features_annotation.csv']
    2.00-4.68  4.68-10.95  10.95-25.64  25.64-60.00  mapobject_id well_name perturbation_duration          TR
0    0.217142    0.451038     0.831327     1.048945        205776       I09                normal  357.672008
1    0.185153    0.335840     0.631343     1.033807        205790       I09                normal  428.364268
2    0.217809    0.353777     0.604462     1.113828        248082       I09                normal  250.488581
3    0.247925    0.441065     0.739990     1.001092        248102       I09                normal  515.735421
4    0.161814    0.340358     0.684436     1.029913        259784       I09                normal  348.150713
5    0.811214    1.009496     1.028775     0.992664        291041       I09                normal  443.565445
6    0.217856    0.447180     0.789221     1.093390        345908       I09                normal  316.004105
7    0.204482    0.405060     0.791247     1.076277        359378       I09                normal  367.904255
8    0.545410    0.831929     1.024679     1.010255        359393       I09                normal  513.668590
9    0.194497    0.413888     0.750494     1.032792        366493       I09                normal  330.847832
10   0.236510    0.432413     0.801058     1.047899        383341       I09                normal  388.012319
11   0.183172    0.370742     0.700137     1.014940        383793       I09                normal  376.243596

We can plot co-occurrence scores by using plot_co_occurrence or plot_co_occurrence_grid. First, we need to combine all adata objects into one. For this, we can use anndata.concat.

[16]:
# get combined adata
adata_co_occ = ad.concat([extr.adata for extr in extrs], index_unique="-", uns_merge="same")

print("co-occurrence scores:", adata_co_occ.obsm)
co-occurrence scores: AxisArrays with keys: co_occurrence_Nuclear speckles_Nuclear speckles, co_occurrence_Nuclear speckles_Nucleolus, co_occurrence_Nuclear speckles_Nucleoplasm, co_occurrence_Nuclear speckles_PML bodies, co_occurrence_Nucleolus_Nuclear speckles, co_occurrence_Nucleolus_Nucleolus, co_occurrence_Nucleolus_Nucleoplasm, co_occurrence_Nucleolus_PML bodies, co_occurrence_Nucleoplasm_Nuclear speckles, co_occurrence_Nucleoplasm_Nucleolus, co_occurrence_Nucleoplasm_Nucleoplasm, co_occurrence_Nucleoplasm_PML bodies, co_occurrence_PML bodies_Nuclear speckles, co_occurrence_PML bodies_Nucleolus, co_occurrence_PML bodies_Nucleoplasm, co_occurrence_PML bodies_PML bodies, size

With plot_co_occurrence we can plot one cluster-cluster pair. With condition we define the grouping of scores; each group is displayed as a separate line in the co-occurrence plot.

The co-occurrence plot shows the calculated co-occurrence scores for each distance interval. Here, we show the mean co-occurrence values and their 95% confidence interval obtained through bootstrapping.

[17]:
# plot mean co-occurrence scores
condition = "perturbation_duration"
condition_values = None

# for one cluster-cluster pairing
plot_co_occurrence(adata_co_occ, "Nucleolus", "Nuclear speckles", condition, condition_values, ci=95)
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa/campa/pl/_spatial_features.py:70: FutureWarning:

The `ci` parameter is deprecated. Use `errorbar=('ci', 95)` for the same effect.

  g = sns.lineplot(data=scores, y="score", x="distance", hue=condition, ax=ax, **kwargs)
_images/notebooks_extract_features_31_1.png

With plot_co_occurrence_grid we can plot an overview of all cluster-cluster pairs.

[18]:
fig, axes = plot_co_occurrence_grid(adata_co_occ, condition, condition_values, legend=False, ci=95, figsize=(20, 20))
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa/campa/pl/_spatial_features.py:70: FutureWarning:

The `ci` parameter is deprecated. Use `errorbar=('ci', 95)` for the same effect.

  g = sns.lineplot(data=scores, y="score", x="distance", hue=condition, ax=ax, **kwargs)

(this warning is repeated once per subplot)
_images/notebooks_extract_features_33_1.png
Object statistics

Object statistics are features extracted from the connected components of each cluster in each cell. Possible features are area, circularity, elongation, and extent of the connected components. Features are calculated per component/region and stored in adata.uns['object_stats'].

[19]:
display(extrs[0].adata.uns["object_stats"])
     area  circularity  elongation    extent  mapobject_id        clustering
0    9045     0.043528    0.227821  0.471585        205776       Nucleoplasm
1     978     0.638093    0.116856  0.735338        205776         Nucleolus
2      29     1.000000    0.096517  0.805556        205776         Nucleolus
3     256     0.541414    0.534975  0.561404        205776  Nuclear speckles
4      38     1.000000    0.321553  0.791667        205776        PML bodies
..    ...          ...         ...       ...           ...               ...
346    98     0.810891    0.307977  0.753846        383793  Nuclear speckles
347   125     0.906486    0.290123  0.694444        383793  Nuclear speckles
348    14     1.000000    0.249786  0.700000        383793  Nuclear speckles
349    17     0.936106    0.524493  0.472222        383793        PML bodies
350    54     0.974758    0.269712  0.600000        383793        PML bodies

351 rows × 6 columns

To aggregate this information to the per-cell level, FeatureExtractor.get_object_stats is used. This aggregates the data with the provided aggregation function and stores the result in adata.obsm['object_stats_agg']. In addition, objects with an area below area_threshold can be filtered out prior to aggregation.
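To get a feel for what get_object_stats does, here is a rough hand-rolled equivalent using pandas on the per-object table shown above (a sketch only; use get_object_stats for the actual workflow):

# approximate manual aggregation: median object area per cell and CSL,
# after dropping small objects (mirrors area_threshold=10 below)
stats = extrs[0].adata.uns["object_stats"]
agg = (
    stats[stats["area"] >= 10]
    .groupby(["mapobject_id", "clustering"])["area"]
    .median()
)
print(agg.head())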

[20]:
# aggregate object statistics using median
for extr in extrs:
    _ = extr.get_object_stats(area_threshold=10, agg=["median"])

# combined adatas for plotting
adata_object_stats = ad.concat([extr.adata for extr in extrs], index_unique="-", uns_merge="same")

adata_object_stats contains the aggregated per-cell object stats in obsm:

[21]:
adata_object_stats.obsm["object_stats_agg"]
[21]:
area_median|Nuclear speckles area_median|Nucleolus area_median|Nucleoplasm area_median|PML bodies circularity_median|Nuclear speckles circularity_median|Nucleolus circularity_median|Nucleoplasm circularity_median|PML bodies elongation_median|Nuclear speckles elongation_median|Nucleolus elongation_median|Nucleoplasm elongation_median|PML bodies extent_median|Nuclear speckles extent_median|Nucleolus extent_median|Nucleoplasm extent_median|PML bodies count|Nuclear speckles count|Nucleolus count|Nucleoplasm count|PML bodies
0-0 87.0 978.0 9045.0 42.5 0.861312 0.638093 0.043528 1.000000 0.285864 0.116856 0.227821 0.310047 0.633333 0.735338 0.471585 0.742063 11.0 3.0 1.0 6.0
1-0 97.5 438.0 9457.0 49.0 0.732411 0.411849 0.036234 0.959830 0.412936 0.337495 0.354512 0.253441 0.617695 0.551043 0.426221 0.711111 12.0 3.0 1.0 7.0
2-0 89.0 2365.0 4194.5 76.0 0.834565 0.131215 0.236317 0.957019 0.222408 0.349102 0.403929 0.199730 0.662551 0.532658 0.375361 0.714583 10.0 1.0 2.0 12.0
3-0 42.0 2593.0 13.0 36.0 0.863145 0.094925 1.000000 1.000000 0.355378 0.203823 0.307020 0.175429 0.607143 0.449004 0.501330 0.759736 9.0 1.0 3.0 12.0
4-0 72.0 1157.0 15376.0 57.0 0.905092 0.381709 0.040231 1.000000 0.381959 0.233705 0.330665 0.175379 0.610185 0.600415 0.513355 0.777778 16.0 3.0 1.0 9.0
5-0 32.0 18.0 6580.0 97.5 1.000000 1.000000 0.297695 0.858740 0.068028 0.226034 0.343594 0.125506 0.761905 0.666667 0.679051 0.799632 1.0 2.0 1.0 2.0
6-0 66.0 959.0 7758.0 27.0 0.961701 0.432387 0.051205 1.000000 0.244032 0.460579 0.261960 0.190607 0.716852 0.549069 0.502331 0.771429 14.0 2.0 1.0 6.0
7-0 65.0 1055.5 7234.0 38.0 0.885201 0.598381 0.054602 1.000000 0.366449 0.562931 0.310322 0.117958 0.635417 0.445033 0.502780 0.727273 11.0 2.0 1.0 5.0
8-0 0.0 78.0 8453.0 82.5 0.000000 0.512022 0.147361 0.884891 0.000000 0.290167 0.271034 0.148813 0.000000 0.537500 0.635659 0.675196 0.0 3.0 1.0 2.0
9-0 72.0 1444.5 10587.0 63.0 0.853768 0.380567 0.052200 0.993804 0.405150 0.681782 0.234737 0.123254 0.600000 0.510217 0.542367 0.750000 11.0 2.0 1.0 5.0
10-0 39.0 1043.0 7398.0 58.5 0.932875 0.436683 0.048323 1.000000 0.388201 0.458571 0.185267 0.236344 0.571429 0.477905 0.556324 0.738112 13.0 2.0 1.0 6.0
11-0 98.0 2805.0 8993.0 41.0 0.845314 0.239826 0.043590 0.988189 0.307977 0.211277 0.140934 0.336398 0.695378 0.572449 0.498062 0.696429 11.0 1.0 1.0 9.0
0-1 33.0 646.0 8387.0 90.0 0.984769 0.475397 0.056057 0.933442 0.402862 0.338597 0.085823 0.272680 0.565657 0.630244 0.541796 0.752778 7.0 3.0 1.0 2.0
1-1 40.0 4164.0 13403.0 41.0 1.000000 0.173680 0.036768 1.000000 0.376558 0.306077 0.159526 0.068881 0.641327 0.484186 0.510882 0.765152 20.0 1.0 1.0 10.0
2-1 88.0 1151.0 8304.0 23.0 0.742052 0.106195 0.064869 1.000000 0.315780 0.619785 0.103333 0.146457 0.599495 0.414924 0.572690 0.666667 8.0 1.0 1.0 9.0
3-1 64.0 2252.0 7856.0 43.0 0.857538 0.173152 0.059856 1.000000 0.267228 0.202500 0.245970 0.137212 0.653061 0.446825 0.508282 0.760000 9.0 1.0 1.0 7.0
4-1 91.0 2868.0 7387.0 51.0 0.756498 0.221948 0.035140 1.000000 0.312715 0.171164 0.349126 0.112509 0.606981 0.510684 0.467295 0.741667 10.0 1.0 1.0 6.0
5-1 31.0 1652.0 9770.0 47.0 0.689846 0.241960 0.050961 1.000000 0.211355 0.277617 0.288946 0.253809 0.574074 0.494759 0.549278 0.763788 5.0 1.0 1.0 6.0
6-1 57.0 873.0 22.0 27.0 0.863797 0.158189 0.751116 1.000000 0.337476 0.424378 0.370224 0.182968 0.631944 0.467984 0.418750 0.750000 9.0 2.0 4.0 8.0
7-1 54.0 659.0 6997.0 26.0 0.856820 0.465333 0.074767 1.000000 0.441801 0.482696 0.220509 0.162121 0.604167 0.591435 0.538396 0.722222 5.0 3.0 1.0 3.0
8-1 68.0 1110.5 8052.0 49.5 0.848859 0.808536 0.096679 1.000000 0.295531 0.124259 0.340513 0.243407 0.656250 0.757819 0.515559 0.756818 5.0 2.0 1.0 6.0
9-1 126.0 584.0 5990.0 57.5 0.752271 0.233462 0.068063 1.000000 0.353340 0.384750 0.112466 0.169162 0.641026 0.565739 0.548836 0.692956 5.0 2.0 1.0 4.0
10-1 77.5 48.0 12283.0 58.0 0.832699 0.389967 0.035018 1.000000 0.331494 0.679027 0.271263 0.087320 0.600069 0.325420 0.515464 0.714286 10.0 6.0 1.0 9.0
11-1 24.0 1602.0 12253.0 36.5 0.871560 0.371284 0.063537 1.000000 0.317529 0.486609 0.322754 0.254069 0.510204 0.526215 0.580024 0.722222 7.0 2.0 1.0 8.0
12-1 82.0 2723.0 8338.0 60.0 0.901766 0.418904 0.058905 1.000000 0.311033 0.392981 0.317050 0.219982 0.656250 0.702528 0.486891 0.732143 9.0 1.0 1.0 7.0
13-1 59.0 3417.0 10921.0 39.5 0.875924 0.156722 0.046924 1.000000 0.284057 0.460068 0.246043 0.162387 0.589927 0.483173 0.512867 0.755102 12.0 1.0 1.0 4.0
0-2 81.0 506.0 6095.0 59.5 0.874082 0.692071 0.061450 0.813102 0.271903 0.204649 0.566809 0.318936 0.640152 0.709939 0.376095 0.680556 6.0 3.0 1.0 2.0
1-2 112.0 1028.0 18268.0 47.0 0.684060 0.651009 0.057497 1.000000 0.461039 0.224796 0.272422 0.176686 0.534722 0.668787 0.555934 0.750000 9.0 3.0 1.0 11.0
2-2 158.0 2256.0 11749.0 58.0 0.772462 0.298946 0.072532 1.000000 0.262487 0.553125 0.176129 0.160893 0.670455 0.502674 0.568821 0.805556 7.0 1.0 1.0 7.0
3-2 231.5 2455.0 9577.0 50.0 0.704407 0.394730 0.070778 1.000000 0.377610 0.272570 0.451540 0.159108 0.657225 0.631430 0.454878 0.691358 6.0 1.0 1.0 5.0
4-2 151.0 3289.0 11322.0 40.0 0.767014 0.349287 0.048491 1.000000 0.311007 0.651149 0.319685 0.172907 0.633250 0.413399 0.471691 0.714286 12.0 1.0 1.0 7.0
5-2 205.5 16.0 11118.0 21.0 0.662480 1.000000 0.086700 1.000000 0.447544 0.193111 0.178973 0.181705 0.624603 0.601638 0.580514 0.718182 4.0 3.0 1.0 14.0
6-2 39.0 4629.0 18873.0 66.0 0.882109 0.277760 0.062439 1.000000 0.279717 0.502306 0.069330 0.102909 0.638826 0.495345 0.577209 0.738905 8.0 1.0 1.0 8.0
7-2 424.5 937.5 10973.0 73.5 0.552564 0.664226 0.079359 0.957464 0.440302 0.258909 0.249811 0.271398 0.596174 0.643831 0.513910 0.668182 4.0 2.0 1.0 8.0
8-2 86.0 1448.0 12608.0 80.0 0.748017 0.390063 0.051236 0.974549 0.345564 0.475289 0.208464 0.180974 0.547619 0.588658 0.555125 0.688312 11.0 2.0 1.0 5.0
9-2 109.0 2110.0 8335.0 47.0 0.819828 0.562741 0.081130 0.950839 0.330001 0.478976 0.084208 0.354715 0.644192 0.793233 0.521851 0.679457 6.0 1.0 1.0 4.0
10-2 76.0 2238.0 12062.0 37.0 0.715299 0.618056 0.077079 1.000000 0.481178 0.283293 0.247034 0.130520 0.633971 0.744016 0.574709 0.750000 8.0 1.0 1.0 9.0
0-3 26.0 38.5 10357.0 70.5 0.715492 0.848255 0.073592 0.955768 0.555150 0.312634 0.151869 0.268488 0.496795 0.592727 0.543047 0.671364 7.0 4.0 1.0 6.0
1-3 159.0 1142.0 8944.0 76.0 0.823861 0.261123 0.058375 0.995851 0.194221 0.462433 0.130762 0.143551 0.672059 0.454495 0.507951 0.773810 8.0 2.0 1.0 5.0
2-3 193.0 450.0 9745.0 34.0 0.802449 0.489826 0.052970 0.995891 0.389129 0.206004 0.216125 0.355088 0.642857 0.573980 0.541389 0.686012 7.0 3.0 1.0 8.0
3-3 130.0 780.0 15704.0 66.0 0.617096 0.618706 0.038042 0.990273 0.371340 0.227953 0.140536 0.143821 0.669643 0.633333 0.523991 0.686065 11.0 5.0 1.0 10.0
4-3 0.0 177.0 12530.0 25.5 0.000000 0.588240 0.092439 1.000000 0.000000 0.315782 0.190005 0.255248 0.000000 0.650735 0.617972 0.600446 0.0 3.0 1.0 6.0
5-3 115.5 726.0 12691.0 78.0 0.813633 0.541090 0.056721 0.915472 0.306740 0.242757 0.298018 0.127670 0.662764 0.583153 0.511322 0.702040 10.0 4.0 1.0 6.0
6-3 337.0 1795.0 12922.0 75.0 0.605931 0.522038 0.058115 0.946450 0.430918 0.306800 0.101516 0.095062 0.540936 0.620238 0.524432 0.705128 5.0 2.0 1.0 11.0
7-3 164.0 1272.5 18073.0 47.0 0.685409 0.676682 0.042065 1.000000 0.373616 0.197385 0.228095 0.141581 0.622322 0.655476 0.515988 0.750000 12.0 4.0 1.0 9.0
8-3 154.5 3365.0 13819.0 52.0 0.788247 0.321839 0.062378 1.000000 0.260891 0.695200 0.194789 0.192065 0.697316 0.449866 0.532073 0.757716 8.0 1.0 1.0 6.0

It is possible to export the object stats in a csv file using FeatureExtractor.extract_object_stats_csv. The resulting csv file contains the aggregated object stats matrix from adata_object_stats.obsm['object_stats_agg'] as well as additionally defined columns. This saves a csv file in experiment_dir/aggregated/full_data/data_dir/export.

[22]:
for extr in extrs:
    extr.extract_object_stats_csv(obs=["well_name", "perturbation_duration", "TR"])
[23]:
extr = extrs[0]
# check if results are stored
save_dir = os.path.join(os.path.dirname(extr.fname), "export")
print("csv exported to", save_dir)
print([n for n in os.listdir(save_dir) if "object_stats" in n])
csv exported to /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments/test_pre_trained/CondVAE_pert-CC/aggregated/full_data/184A1_unperturbed/I09/export
['object_stats_features_annotation.csv']

plot_object_stats can be used to plot a box-plot overview of the object stats. Again, we can define the grouping using group_key.

[24]:
plot_object_stats(adata_object_stats, group_key="perturbation_duration", figsize_mult=(4, 4))
_images/notebooks_extract_features_44_0.png