Setup and download data

This tutorial shows how to set up CAMPA and download an example dataset. To follow along with this and the subsequent tutorials, first complete the following steps:

  • install CAMPA (pip install campa)

  • download the tutorials to a new folder, referred to as CAMPA_DIR in the following

  • navigate to CAMPA_DIR in the terminal and start this notebook with jupyter notebook setup.ipynb

Note that the subsequent notebooks assume that you run them from the same folder as this notebook (CAMPA_DIR). If that is not the case, set CAMPA_DIR at the top of each notebook to the folder in which you ran this notebook.

[1]:
from pathlib import Path

# set CAMPA_DIR to the current working directory
CAMPA_DIR = Path.cwd()
print(CAMPA_DIR)
/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test

Download parameter files

Before configuring CAMPA, we need to ensure that all parameter files for configuring and running the different CAMPA steps are present in the params subfolder. In general, these files do not need to be in a folder named params, but the following tutorials follow this convention. Let us download the necessary parameter files from the GitHub repository.

[2]:
import glob

import requests

# ensure params folder exists
(CAMPA_DIR / "params").mkdir(parents=True, exist_ok=True)

# download parameter files from git
for param_file in [
    "ExampleData_constants",
    "example_data_params",
    "example_experiment_params",
    "example_feature_params",
]:
    r = requests.get(f"https://raw.github.com/theislab/campa/main/notebooks/params/{param_file}.py")
    # fail early instead of writing an HTTP error page to disk
    r.raise_for_status()
    with open(CAMPA_DIR / "params" / f"{param_file}.py", "w") as f:
        f.write(r.text)

print(f'Files in {CAMPA_DIR / "params"}: {glob.glob(str(CAMPA_DIR / "params" / "*"))}')
Files in /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params: ['/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_experiment_params.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_data_params.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py', '/home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/example_feature_params.py']
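Since the download loop writes whatever the server returns, it can be worth confirming that every parameter file arrived and is non-empty before moving on. The following is a minimal sketch of such a check; the helper name missing_param_files is hypothetical and not part of CAMPA.

```python
from pathlib import Path

# names of the parameter files downloaded above (without the .py suffix)
PARAM_FILES = [
    "ExampleData_constants",
    "example_data_params",
    "example_experiment_params",
    "example_feature_params",
]


def missing_param_files(params_dir, names=PARAM_FILES):
    """Return the names whose .py file is absent or empty in params_dir."""
    params_dir = Path(params_dir)
    missing = []
    for name in names:
        path = params_dir / f"{name}.py"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

After a successful download, missing_param_files(CAMPA_DIR / "params") should return an empty list.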

Set up CAMPA config

CAMPA has one main config file: campa.ini. The overview describes how to create this config file from the command line; here we create a config from within the campa module using campa.constants.campa_config, the Python representation of the config file.

[3]:
from campa.constants import campa_config

print(campa_config)
2022-11-25 09:57:06.641175: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 09:57:24.354282: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 09:57:27.507035: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING: EXPERIMENT_DIR is not initialised. Please create a config with "campa setup" or set campa_config.EXPERIMENT_DIR manually.
WARNING: BASE_DATA_DIR is not initialised. Please create a config with "campa setup" or set campa_config.BASE_DATA_DIR manually.
CAMPAConfig (fname: None)
EXPERIMENT_DIR: None
BASE_DATA_DIR: None
CO_OCC_CHUNK_SIZE: None

If you have not yet set up a config, this should look pretty empty. The lines WARNING: EXPERIMENT_DIR is not initialised and WARNING: BASE_DATA_DIR is not initialised are expected in this case and alert us that we need to set EXPERIMENT_DIR and BASE_DATA_DIR so that CAMPA knows where experiments and data are stored.

Let us set the EXPERIMENT_DIR and the BASE_DATA_DIR, and add the ExampleData data config. Here, we set the data and experiments paths relative to CAMPA_DIR defined above.

[4]:
# point to example data folder in which we will download the example data
campa_config.BASE_DATA_DIR = CAMPA_DIR / "example_data"
# experiments will be stored in example_experiments
campa_config.EXPERIMENT_DIR = CAMPA_DIR / "example_experiments"
# add ExampleData data_config (pointing to ExampleData_constants file that we just downloaded)
campa_config.add_data_config("ExampleData", CAMPA_DIR / "params/ExampleData_constants.py")
# set CO_OCC_CHUNK_SIZE (a parameter making co-occurrence calculation more memory efficient)
campa_config.CO_OCC_CHUNK_SIZE = 1e7

print(campa_config)
CAMPAConfig (fname: None)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py

We can now save the config to quickly load it later on. Here, we store the config in the params directory in the current folder.

[5]:
# save config
campa_config.write(CAMPA_DIR / "params" / "campa.ini")
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
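To see what was written, the saved file can be inspected with the Python standard library, assuming it is a regular INI file as the .ini extension suggests. The sketch below makes no assumption about which sections or keys CAMPA writes; read_ini is a hypothetical helper, not a CAMPA function.

```python
import configparser
from pathlib import Path


def read_ini(path):
    """Return {section: {key: value}} for an INI file.

    The DEFAULT section is listed first. Values of the other sections
    also contain the keys inherited from DEFAULT, which is how
    configparser resolves lookups. Keys are lowercased by configparser.
    """
    # interpolation=None keeps literal '%' characters in paths intact
    parser = configparser.ConfigParser(interpolation=None)
    with open(Path(path)) as f:
        parser.read_file(f)
    return {name: dict(parser[name]) for name in parser}
```

For example, read_ini(CAMPA_DIR / "params" / "campa.ini") returns a nested dictionary that can be printed or compared against the values set above.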

By default, CAMPA looks for config files in the current directory and in $HOME/.config/campa, but loading a config from any other file is also easy:

[6]:
# read config from non-standard location by setting campa_config.config_fname
campa_config.config_fname = CAMPA_DIR / "params" / "campa.ini"
print(campa_config)
Reading config from /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini
CAMPAConfig (fname: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/campa.ini)
EXPERIMENT_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_experiments
BASE_DATA_DIR: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data
CO_OCC_CHUNK_SIZE: 10000000.0
data_config/exampledata: /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/params/ExampleData_constants.py
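The default lookup described above (current directory, then $HOME/.config/campa) can be pictured as a first-match search over a list of folders. The sketch below only illustrates that idea and is not CAMPA's own code; the function find_config, the file name campa.ini in each location, and the precedence order are assumptions.

```python
from pathlib import Path


def find_config(search_dirs, fname="campa.ini"):
    """Return the first existing search_dir / fname, or None if none exists."""
    for directory in search_dirs:
        candidate = Path(directory) / fname
        if candidate.is_file():
            return candidate
    return None
```

Calling find_config([Path.cwd(), Path.home() / ".config" / "campa"]) mimics the default locations. If it returns None, set campa_config.config_fname explicitly as shown in the cell above.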

Download example dataset

To follow along with the workflow tutorials, you need to download the example dataset.

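Here, we store the example data in the BASE_DATA_DIR set in the config above. Note that the call below passes the parent of BASE_DATA_DIR; judging from the printed output, load_example_data places the example_data folder inside the directory it is given.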

[7]:
from campa.data import load_example_data

example_data_path = load_example_data(Path(campa_config.BASE_DATA_DIR).parent)
print("Example data downloaded to: ", example_data_path)
Path or dataset does not yet exist. Attempting to download...
{'x-amz-id-2': 'HSPvG563oJllNzdrsV13AQCjHZ7P9FyV0mTfxhkmBn5sm1orzTIridTerZSrwwqhhJja8adJlLA=', 'x-amz-request-id': 'D1AWZ3CZHG6SQ9A8', 'Date': 'Fri, 25 Nov 2022 09:07:47 GMT', 'x-amz-replication-status': 'COMPLETED', 'Last-Modified': 'Fri, 28 Oct 2022 11:44:27 GMT', 'ETag': '"6300ee9228b5e78480a3a5a540e85730"', 'x-amz-tagging-count': '1', 'x-amz-server-side-encryption': 'AES256', 'Content-Disposition': 'attachment; filename="example_data.zip"', 'x-amz-version-id': 'WbEd4ye51WteRY2_BZaTchKIFVKkAxuw', 'Accept-Ranges': 'bytes', 'Content-Type': 'application/zip', 'Server': 'AmazonS3', 'Content-Length': '126837954'}
attachment; filename="example_data.zip"
Guessed filename: example_data.zip
Downloading... 126837954
123866it [00:04, 28644.04it/s]
Example data downloaded to:  /home/icb/hannah.spitzer/projects/pelkmans/software_new/campa_notebooks_test/example_data

The example data is now stored in your campa_config.BASE_DATA_DIR folder.
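A quick way to confirm that the download and extraction produced files is to summarise the folder contents by file extension. This is a minimal sketch using only the standard library; count_files_by_suffix is a hypothetical helper, and no particular file types are assumed for the example dataset.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(root):
    """Count regular files below root, grouped by file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # files without an extension are grouped under "<none>"
            counts[path.suffix or "<none>"] += 1
    return dict(counts)
```

For example, count_files_by_suffix(campa_config.BASE_DATA_DIR) returns a dictionary mapping each extension to the number of files found; an empty dictionary indicates that nothing was extracted.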

The data is represented as an MPPData object. For more information on this class and on how the data is represented on disk, see the Data representation tutorial.