Set up datasets for scSpecies

This tutorial demonstrates how to setup anndata scRNA-seq datasets such that they can be integrated by scSpecies.

We start by downloading three liver cell datasets from mice, humans and hamsters.

[ ]:
from scspecies.preprocessing import download_datasets
import os

path = os.path.abspath('').replace('\\', '/')+'/'
download_datasets()
Downloading human_liver.h5ad …
human_liver.h5ad downloaded. Size: 127.02 MB
Downloading mouse_liver.h5ad …
mouse_liver.h5ad downloaded. Size: 150.69 MB
Downloading hamster_liver.h5ad …
hamster_liver.h5ad downloaded. Size: 28.44 MB
All datasets have been downloaded to the ./data directory.
As a first step we load the context and target datasets as .h5ad files.
Custom datasets must have raw counts saved as Compressed Sparse Row sparse matrix of dtype ‘float32’.
As context dataset we will use the mouse_liver.h5ad file, which contains mouse liver cell samples.
As target datasets we will use the human_liver.h5ad file, which contains human liver cell samples and the hamster_liver.h5ad file, which contains hamster liver cell samples.

NOTE: The three datasets are subsampled from the mouse, hamster and human liver cell atlas. It is not possible to reproduce exact results of the publication with these datasets as they contain different gene sets and cells.

First, gene sets were reduced to 4000 highly variable genes that are expressed in more than 2.5% of cells.
Second, cells belonging to large cell types were randomly sampled to contain 1500 samples.
Third, unlabeled cells and cells with a labeling conflicts between fine and coarse labels were removed.
Lastly, we only included cells obtained via CITE-seq and scRNA-seq.

The full datasets can be accessed via https://www.livercellatlas.org/.

[ ]:
import anndata as ad

adata_mouse = ad.read_h5ad(path+"data/mouse_liver.h5ad")
adata_human = ad.read_h5ad(path+"data/human_liver.h5ad")
adata_hamster = ad.read_h5ad(path+"data/hamster_liver.h5ad")

# Forward slashes are not allowed in keys and have to be replaced.
def sanitize_obs_keys(adata, repl="/", new="-"):
    for col in adata.obs.columns:
        dtype = adata.obs[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            adata.obs[col] = adata.obs[col].cat.rename_categories(
                {cat: str(cat).replace(repl, new) for cat in adata.obs[col].cat.categories}
            )
        else:
            adata.obs[col] = adata.obs[col].astype(str).str.replace(repl, new)
    return adata

adata_human = sanitize_obs_keys(adata_human)
adata_mouse = sanitize_obs_keys(adata_mouse)
adata_hamster = sanitize_obs_keys(adata_hamster)
Next, we specify the .obs keys under which the cell and batch labels for the context and target dataset are stored.
For the target dataset, cell type annotation is not required.
However, we will use existing annotation for visualization and to compute performance metrics. When the target cell annotation is unknown this can be indicated by target_cell_key = None.

NOTE: For precise metric calculation annotation should match across datasets.

[3]:
mouse_batch_key = 'batch'
mouse_cell_key = 'cell_type_fine'

human_batch_key = 'batch'
human_cell_key = 'cell_type_fine'

hamster_batch_key = 'batch'
hamster_cell_key = 'cell_type_coarse'
Next, we specify the gene naming convention of the datasets.
scSpecies can match homologous genes between arbitrary genomes via the NCBI Taxon ID using the package mygene.

A list of commonly used model species is provided below:

Species

NCBI Taxonomy ID

human

9606

mouse

10090

hamster

10036

rat

10116

zebrafish

7955

fruit fly

7227

chicken

9031

pig

9823

rhesus macaque

9544

rabbit

9986

Note: The Liver cell atlas has already mapped the gene names to mouse or human gene symbols. We will therefore set the Taxonomy ID to 10090 for the hamster dataset.

[4]:
mouse_NCBI_Taxon_ID = 10090
print('\nFirst mouse  gene names: ', adata_mouse.var_names[0:5])

human_NCBI_Taxon_ID = 9606
print('\nFirst human gene names: ', adata_human.var_names[0:5])

hamster_NCBI_Taxon_ID = 10090 #Gene symbols follow the mouse convention in the hamster liver cell dataset.
print('\nFirst hamster gene names: ', adata_hamster.var_names[0:5])

First mouse  gene names:  Index(['Sox17', 'Mrpl15', 'Sntg1', 'Adhfe1', 'Snhg6'], dtype='object')

First human gene names:  Index(['RP11-54O7.17', 'HES4', 'ISG15', 'TNFRSF18', 'TNFRSF4'], dtype='object')

First hamster gene names:  Index(['Tcf7l2', 'Adra2a', 'Dusp5', 'Mxi1', 'Add3'], dtype='object')
Next, we create a muon.MuData file (https://muon.readthedocs.io/en/latest/) which scSpecies uses during training.
Each species will occupy one modality.
We instantiate a preprocessing class and register context and target anndata.AnnData datasets.
This class translates the gene names, performs the data-level nearest neighbor search on homologous genes,
one-hot-encodes experimental batch effects, computes the library encoder prior parameters.
We recommend to subset to the gene sets of interest before inputting them into the preprocessing class.
Gene sets do not have to be of similar sizes. To demonstrate this, we randomly subset the hamster dataset to 3500 genes. The other datasets have 4000 genes.

NOTE: scSpecies needs integer count values without any preprocessing transformations such as log1p or normalisation.

[ ]:
from scspecies.preprocessing import create_mdata

import warnings
import numpy as np
warnings.filterwarnings("ignore", category=UserWarning)

preprocess = create_mdata(
    adata=adata_mouse,
    batch_key=mouse_batch_key,
    cell_key=mouse_cell_key,
    dataset_name='mouse',
    NCBI_Taxon_ID=mouse_NCBI_Taxon_ID,
    )

preprocess.setup_target_adata(
    adata=adata_human,
    batch_key=human_batch_key,
    cell_key=human_cell_key,
    dataset_name='human',
    NCBI_Taxon_ID=human_NCBI_Taxon_ID,
    )

adata_hamster = adata_hamster[:, np.random.choice(4000, size=3500, replace=False)]

preprocess.setup_target_adata(
    adata=adata_hamster,
    batch_key=hamster_batch_key,
    cell_key=hamster_cell_key,
    dataset_name='hamster',
    NCBI_Taxon_ID=mouse_NCBI_Taxon_ID,
    )

mdata = preprocess.return_mdata(save=True, save_name='liver_atlas')
Registering experimental batches for the mouse dataset. Kept 34, removed 34.
Compute prior parameters for the library encoder.
Filtering cells. Kept 37717, removed 0.
Done!
------------------------------------------------------------------------------------------
Registering experimental batches for the human dataset. Kept 30, removed 30.
Compute prior parameters for the library encoder.
Filtering cells. Kept 34993, removed 0.
Translating homologous gene names between mouse context and human target dataset.
Could map 3535 of 4000 target gene symbols to context species gene symbols
Found 1785 shared homologous genes between context and target dataset
Perform the data-level nearest neigbor search on the homologous gene set.
Evaluating data level NNS and calculating cells with the highest agreement for context labels key cell_type_fine.
Done!
------------------------------------------------------------------------------------------
Registering experimental batches for the hamster dataset. Kept 2, removed 2.
Compute prior parameters for the library encoder.
Filtering cells. Kept 5952, removed 3.
Translating homologous gene names between mouse context and hamster target dataset.
Could map 3500 of 3500 target gene symbols to context species gene symbols
Found 1464 shared homologous genes between context and target dataset
Perform the data-level nearest neigbor search on the homologous gene set.
Evaluating data level NNS and calculating cells with the highest agreement for context labels key cell_type_coarse.
Done!
------------------------------------------------------------------------------------------
Saved mdata to data/liver_atlas.h5mu.

The final mdata object contains the data set names as modality keys.

The create_mdata class has added additional keys to the adata input corresponding to the preprocessing steps.

mdata.mod[key].obs['library_log_mean'] - Library size encoder prior parameter (mu).
mdata.mod[key].obs['library_log_std'] - Library size encoder prior parameter (sigma).
mdata.mod[key].obs['dataset'] - Datset/species name.
mdata.mod[key].uns['metadata'] - Dataset metadata information, e.g. cell and batch annotation keys.
mdata.mod[key].uns['batch_dict']- Dictionary containing the experimental batches each cell types contains samples in.
mdata.mod[key].obsm['batch_label_enc'] - Onehot-encoded batch labels.
mdata.mod[target_key].obsm['ind_nns_hom_genes'] - Indices for candidate cells of the data-level nearest neighbor search.
[8]:
print(mdata)
MuData object with n_obs × n_vars = 78662 × 11500
  3 modalities
    mouse:      37717 x 4000
      obs:      'cell_type_coarse', 'batch', 'cell_type_fine', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes'
      uns:      'metadata', 'batch_dict'
      obsm:     'batch_label_enc'
    human:      34993 x 4000
      obs:      'cell_type_coarse', 'batch', 'cell_type_fine', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes', 'top_percent_cell_type_fine', 'pred_nns_cell_type_fine'
      var:      'var_names_transl'
      uns:      'metadata', 'batch_dict'
      obsm:     'batch_label_enc', 'ind_neigh_nns'
    hamster:    5952 x 3500
      obs:      'cell_type_coarse', 'batch', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes', 'top_percent_cell_type_coarse', 'pred_nns_cell_type_coarse'
      var:      'var_names_transl'
      uns:      'metadata', 'batch_dict'
      obsm:     'ind_nns_hom_genes', 'batch_label_enc', 'ind_neigh_nns'