Set up datasets for scSpecies
This tutorial demonstrates how to setup anndata scRNA-seq datasets such that they can be integrated by scSpecies.
We start by downloading three liver cell datasets from mice, humans and hamsters.
[ ]:
from scspecies.preprocessing import download_datasets
import os
path = os.path.abspath('').replace('\\', '/')+'/'
download_datasets()
Downloading human_liver.h5ad …
human_liver.h5ad downloaded. Size: 127.02 MB
Downloading mouse_liver.h5ad …
mouse_liver.h5ad downloaded. Size: 150.69 MB
Downloading hamster_liver.h5ad …
hamster_liver.h5ad downloaded. Size: 28.44 MB
All datasets have been downloaded to the ./data directory.
.h5ad files.Compressed Sparse Row sparse matrix of dtype ‘float32’.mouse_liver.h5ad file, which contains mouse liver cell samples.human_liver.h5ad file, which contains human liver cell samples and the hamster_liver.h5ad file, which contains hamster liver cell samples.NOTE: The three datasets are subsampled from the mouse, hamster and human liver cell atlas. It is not possible to reproduce exact results of the publication with these datasets as they contain different gene sets and cells.
The full datasets can be accessed via https://www.livercellatlas.org/.
[ ]:
import anndata as ad
adata_mouse = ad.read_h5ad(path+"data/mouse_liver.h5ad")
adata_human = ad.read_h5ad(path+"data/human_liver.h5ad")
adata_hamster = ad.read_h5ad(path+"data/hamster_liver.h5ad")
# Forward slashes are not allowed in keys and have to be replaced.
def sanitize_obs_keys(adata, repl="/", new="-"):
for col in adata.obs.columns:
dtype = adata.obs[col].dtype
if isinstance(dtype, pd.CategoricalDtype):
adata.obs[col] = adata.obs[col].cat.rename_categories(
{cat: str(cat).replace(repl, new) for cat in adata.obs[col].cat.categories}
)
else:
adata.obs[col] = adata.obs[col].astype(str).str.replace(repl, new)
return adata
adata_human = sanitize_obs_keys(adata_human)
adata_mouse = sanitize_obs_keys(adata_mouse)
adata_hamster = sanitize_obs_keys(adata_hamster)
.obs keys under which the cell and batch labels for the context and target dataset are stored.target_cell_key = None.NOTE: For precise metric calculation annotation should match across datasets.
[3]:
mouse_batch_key = 'batch'
mouse_cell_key = 'cell_type_fine'
human_batch_key = 'batch'
human_cell_key = 'cell_type_fine'
hamster_batch_key = 'batch'
hamster_cell_key = 'cell_type_coarse'
mygene.A list of commonly used model species is provided below:
Species |
NCBI Taxonomy ID |
|---|---|
human |
9606 |
mouse |
10090 |
hamster |
10036 |
rat |
10116 |
zebrafish |
7955 |
fruit fly |
7227 |
chicken |
9031 |
pig |
9823 |
rhesus macaque |
9544 |
rabbit |
9986 |
Note: The Liver cell atlas has already mapped the gene names to mouse or human gene symbols. We will therefore set the Taxonomy ID to 10090 for the hamster dataset.
[4]:
mouse_NCBI_Taxon_ID = 10090
print('\nFirst mouse gene names: ', adata_mouse.var_names[0:5])
human_NCBI_Taxon_ID = 9606
print('\nFirst human gene names: ', adata_human.var_names[0:5])
hamster_NCBI_Taxon_ID = 10090 #Gene symbols follow the mouse convention in the hamster liver cell dataset.
print('\nFirst hamster gene names: ', adata_hamster.var_names[0:5])
First mouse gene names: Index(['Sox17', 'Mrpl15', 'Sntg1', 'Adhfe1', 'Snhg6'], dtype='object')
First human gene names: Index(['RP11-54O7.17', 'HES4', 'ISG15', 'TNFRSF18', 'TNFRSF4'], dtype='object')
First hamster gene names: Index(['Tcf7l2', 'Adra2a', 'Dusp5', 'Mxi1', 'Add3'], dtype='object')
muon.MuData file (https://muon.readthedocs.io/en/latest/) which scSpecies uses during training.anndata.AnnData datasets.NOTE: scSpecies needs integer count values without any preprocessing transformations such as log1p or normalisation.
[ ]:
from scspecies.preprocessing import create_mdata
import warnings
import numpy as np
warnings.filterwarnings("ignore", category=UserWarning)
preprocess = create_mdata(
adata=adata_mouse,
batch_key=mouse_batch_key,
cell_key=mouse_cell_key,
dataset_name='mouse',
NCBI_Taxon_ID=mouse_NCBI_Taxon_ID,
)
preprocess.setup_target_adata(
adata=adata_human,
batch_key=human_batch_key,
cell_key=human_cell_key,
dataset_name='human',
NCBI_Taxon_ID=human_NCBI_Taxon_ID,
)
adata_hamster = adata_hamster[:, np.random.choice(4000, size=3500, replace=False)]
preprocess.setup_target_adata(
adata=adata_hamster,
batch_key=hamster_batch_key,
cell_key=hamster_cell_key,
dataset_name='hamster',
NCBI_Taxon_ID=mouse_NCBI_Taxon_ID,
)
mdata = preprocess.return_mdata(save=True, save_name='liver_atlas')
Registering experimental batches for the mouse dataset. Kept 34, removed 34.
Compute prior parameters for the library encoder.
Filtering cells. Kept 37717, removed 0.
Done!
------------------------------------------------------------------------------------------
Registering experimental batches for the human dataset. Kept 30, removed 30.
Compute prior parameters for the library encoder.
Filtering cells. Kept 34993, removed 0.
Translating homologous gene names between mouse context and human target dataset.
Could map 3535 of 4000 target gene symbols to context species gene symbols
Found 1785 shared homologous genes between context and target dataset
Perform the data-level nearest neigbor search on the homologous gene set.
Evaluating data level NNS and calculating cells with the highest agreement for context labels key cell_type_fine.
Done!
------------------------------------------------------------------------------------------
Registering experimental batches for the hamster dataset. Kept 2, removed 2.
Compute prior parameters for the library encoder.
Filtering cells. Kept 5952, removed 3.
Translating homologous gene names between mouse context and hamster target dataset.
Could map 3500 of 3500 target gene symbols to context species gene symbols
Found 1464 shared homologous genes between context and target dataset
Perform the data-level nearest neigbor search on the homologous gene set.
Evaluating data level NNS and calculating cells with the highest agreement for context labels key cell_type_coarse.
Done!
------------------------------------------------------------------------------------------
Saved mdata to data/liver_atlas.h5mu.
The final mdata object contains the data set names as modality keys.
The create_mdata class has added additional keys to the adata input corresponding to the preprocessing steps.
mdata.mod[key].obs['library_log_mean'] - Library size encoder prior parameter (mu).mdata.mod[key].obs['library_log_std'] - Library size encoder prior parameter (sigma).mdata.mod[key].obs['dataset'] - Datset/species name.mdata.mod[key].uns['metadata'] - Dataset metadata information, e.g. cell and batch annotation keys.mdata.mod[key].uns['batch_dict']- Dictionary containing the experimental batches each cell types contains samples in.mdata.mod[key].obsm['batch_label_enc'] - Onehot-encoded batch labels.mdata.mod[target_key].obsm['ind_nns_hom_genes'] - Indices for candidate cells of the data-level nearest neighbor search.[8]:
print(mdata)
MuData object with n_obs × n_vars = 78662 × 11500
3 modalities
mouse: 37717 x 4000
obs: 'cell_type_coarse', 'batch', 'cell_type_fine', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes'
uns: 'metadata', 'batch_dict'
obsm: 'batch_label_enc'
human: 34993 x 4000
obs: 'cell_type_coarse', 'batch', 'cell_type_fine', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes', 'top_percent_cell_type_fine', 'pred_nns_cell_type_fine'
var: 'var_names_transl'
uns: 'metadata', 'batch_dict'
obsm: 'batch_label_enc', 'ind_neigh_nns'
hamster: 5952 x 3500
obs: 'cell_type_coarse', 'batch', 'dataset', 'library_log_mean', 'library_log_std', 'n_genes', 'top_percent_cell_type_coarse', 'pred_nns_cell_type_coarse'
var: 'var_names_transl'
uns: 'metadata', 'batch_dict'
obsm: 'ind_nns_hom_genes', 'batch_label_enc', 'ind_neigh_nns'