scspecies.models module

The main scSpecies class facilitating architecture alignment and subsequent analyses.

class scspecies.models.scSpecies[source]

Bases: object

The scSpecies cross-species architecture alignment framework built on scVI.

This class implements end-to-end preprocessing, variational encoding, decoding, and alignment for a “context” dataset (e.g., mouse) and a “target” dataset (e.g., human). It supports:

Training scVI models on context and target (latent or intermediate alignment).
Library size encoding and negative-binomial / zero-inflated NB likelihoods.
Establishing a direct correspondence between target and context cells via a likelihood-based similarity measure.
Latent-space nearest-neighbor label transfer based on the similarity measure.
Log-fold-change computation of homologous genes.

Parameters:

device (str) – PyTorch device identifier (‘cpu’, ‘mps’ or ‘cuda’).
mdata (mu.MuData) – Multi-modal container holding context and target AnnData objects, set up by the create_mdata class.
directory (str) – Base path for saving model parameters, data, and figures.
random_seed (int, default=369963) – Seed for NumPy and PyTorch RNGs.
context_key (str, default='mouse') – Key in mdata.mod for the context dataset.
target_key (str, default='human') – Key in mdata.mod for the target dataset.
context_optimizer (torch.optim.Optimizer class, default=torch.optim.Adam) – Optimizer constructor for the context model.
target_optimizer (torch.optim.Optimizer class, default=torch.optim.Adam) – Optimizer constructor for the target model.
context_hidden_dims_enc_outer (list[int], default=[300]) – Hidden layer sizes for the context outer encoder.
target_hidden_dims_enc_outer (list[int], default=[300]) – Hidden layer sizes for the target outer encoder.
hidden_dims_enc_inner (list[int], default=[200]) – Hidden layer sizes for the inner encoder.
context_hidden_dims_l_enc (list[int], default=[200]) – Hidden layer sizes for the context library encoder.
target_hidden_dims_l_enc (list[int], default=[200]) – Hidden layer sizes for the target library encoder.
context_hidden_dims_dec (list[int], default=[200, 300]) – Hidden layer sizes for the context decoder.
target_hidden_dims_dec (list[int], default=[200, 300]) – Hidden layer sizes for the target decoder.
context_layer_order (list) – Layer specification list passed to create_structure for the context model.
target_layer_order (list) – Layer specification list passed to create_structure for the target model.
b_s (int, default=128) – Batch size for training and inference.
context_data_distr ({'nb', 'zinb'}, default='zinb') – Observation model for context counts.
target_data_distr ({'nb', 'zinb'}, default='zinb') – Observation model for target counts.
lat_dim (int, default=10) – Dimensionality of the latent space.
context_dispersion ({'dataset', 'batch', 'cell'}, default='batch') – Dispersion parameterization strategy for the context model.
target_dispersion ({'dataset', 'batch', 'cell'}, default='batch') – Dispersion parameterization strategy for the target model.
alignment ({'inter', 'latent'}, default='inter') – Alignment mode between context and target: 'inter' aligns at the outer encoder output space, 'latent' aligns at the latent space.
k_neigh (int, default=25) – Number of neighbor candidates for alignment from the data-level NNS.
top_percent (float, default=20) – Percentile cutoff for selecting top-agreement neighbors.
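context_beta_* (floats and ints) – KL weight ramp for the context model: context_beta_start, context_beta_max, context_beta_epochs_raise.
target_beta_* (floats and ints) – KL weight ramp for the target model: target_beta_start, target_beta_max, target_beta_epochs_raise.
eta_* (floats and ints) – Alignment weight ramp: eta_start, eta_max, eta_epochs_raise.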
use_lib_enc (bool, default=True) – Whether to include a library-size encoder.

__init__(device, mdata, directory, random_seed=369963, context_key='mouse', target_key='human', context_optimizer=torch.optim.Adam, target_optimizer=torch.optim.Adam, context_hidden_dims_enc_outer=[300], target_hidden_dims_enc_outer=[300], hidden_dims_enc_inner=[200], context_hidden_dims_l_enc=[200], target_hidden_dims_l_enc=[200], context_hidden_dims_dec=[200, 300], target_hidden_dims_dec=[200, 300], lat_dim=10, context_layer_order=['linear', 'layer_norm', ('act', torch.nn.ReLU), ('dropout', 0.1)], target_layer_order=['linear', 'layer_norm', ('act', torch.nn.ReLU), ('dropout', 0.1)], use_lib_enc=True, b_s=128, context_data_distr='zinb', target_data_distr='zinb', context_dispersion='batch', target_dispersion='batch', alignment='inter', k_neigh=25, top_percent=20, context_beta_start=0.1, context_beta_max=1, context_beta_epochs_raise=10, target_beta_start=0.1, target_beta_max=1, target_beta_epochs_raise=10, eta_start=10, eta_max=25, eta_epochs_raise=10)[source]
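The *_start, *_max and *_epochs_raise arguments describe weight ramps for the KL term (beta) and the alignment term (eta). The sketch below shows one way such a ramp could behave. The linear shape and the helper name ramp_weight are assumptions made for illustration and are not taken from the scSpecies source.

```python
def ramp_weight(epoch, start, maximum, epochs_raise):
    """Hypothetical helper: raise a loss weight from `start` to `maximum`.

    ASSUMPTION: the ramp is linear in the epoch index. The actual
    schedule used by scSpecies may differ.
    """
    if epochs_raise <= 0:
        return float(maximum)
    # Fraction of the ramp that has been completed, clipped to [0, 1].
    progress = min(max(epoch / epochs_raise, 0.0), 1.0)
    return start + (maximum - start) * progress


# Defaults from the constructor signature:
# beta goes from 0.1 to 1 and eta from 10 to 25, each over 10 epochs.
beta_schedule = [ramp_weight(e, 0.1, 1.0, 10) for e in range(12)]
eta_schedule = [ramp_weight(e, 10.0, 25.0, 10) for e in range(12)]
```

Under this assumption the weight stays at its maximum once epochs_raise epochs have passed, which would match the "increase over initial epochs" wording of raise_beta and raise_eta.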

compute_logfold_change(eval_cell_types=None, eps=1e-06, lfc_delta=1, samples=50000, target_cell_key=None, b_s=128, confidence_level=0.9)[source]

Monte Carlo estimation of per-gene Log2-fold-changes and associated probabilities.

For each specified cell type (or the intersection of context/target types), samples from the scVI posterior, computes the ratio of target vs. context expression for each homologous gene, and aggregates:

Median Log2-fold-change (on normalized decoder space).
Probability(abs(Log2Fc) > lfc_delta).
Mean gene expression on normalized decoder space and NB parameter space.

Parameters:

eval_cell_types (sequence of str, optional) – Cell types to include; defaults to the intersection of context and target types.
eps (float, default=1e-6) – Small constant added before taking the log so that very low expression values do not produce large LFC values.
lfc_delta (float, default=1) – Threshold for computing the probability of large fold-changes.
target_cell_key (str or None) – Column name in .obs specifying inferred cell type labels for the target dataset.
samples (int, default=50000) – Total number of Monte Carlo draws per cell.
b_s (int, default=128) – Batch size for sampling iterations.
confidence_level (float, default=0.9) – Outlier filtering threshold for latent space.

Returns:

lfc_dict – Dictionary with cell-wise data frames containing the keys:

‘rho_median_context’ : Median context normalized gene expression
‘mu_median_context’ : Median context expected value gene expression
‘rho_median_target’ : Median target normalized gene expression
‘mu_median_target’ : Median target expected value gene expression
‘lfc’ : Median Log2 fold-change of the relative expression parameter rho
‘p’ : Probability of Log2 fold-change values greater than lfc_delta
‘lfc_rand’ : Median Log2 fold-change of the relative expression parameter rho on permuted data
‘p_rand’ : Probability of Log2 fold-change values greater than lfc_delta on permuted data

Return type:

dict of str to pd.DataFrame
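The aggregation step can be illustrated with a minimal NumPy sketch. It assumes that decoded normalized expression samples are already available as arrays of shape (samples, genes). The function name summarize_lfc and the toy values are hypothetical, and the sketch omits posterior sampling, outlier filtering and the permuted baseline.

```python
import numpy as np


def summarize_lfc(rho_target, rho_context, eps=1e-6, lfc_delta=1.0):
    """Hypothetical helper: per-gene median log2 fold-change and P(|LFC| > delta).

    rho_target, rho_context: arrays of shape (samples, genes) holding
    normalized expression for paired Monte Carlo draws.
    """
    # eps keeps the ratio close to 1 when both values are near zero.
    lfc_samples = np.log2(rho_target + eps) - np.log2(rho_context + eps)
    lfc = np.median(lfc_samples, axis=0)
    # Fraction of draws whose absolute fold-change exceeds the threshold.
    p = np.mean(np.abs(lfc_samples) > lfc_delta, axis=0)
    return lfc, p


# Toy data: gene 0 is 4x higher in target, gene 1 is unchanged,
# gene 2 is 4x higher in target but at negligible expression.
rho_context = np.tile(np.array([0.01, 0.01, 1e-9]), (4, 1))
rho_target = np.tile(np.array([0.04, 0.01, 4e-9]), (4, 1))
lfc, p = summarize_lfc(rho_target, rho_context)
```

In this toy example gene 2 has the same raw ratio as gene 0, but eps pulls its fold-change towards zero because both expression values are far below eps.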

encode(x, s, encoder_outer=None, encoder_inner=None, lib_encoder=None)[source]

Encode data into biological and/or library latent variables.

Parameters:

x (Tensor, shape (n_cells, n_genes)) – Raw or log-transformed count matrix.
s (Tensor, shape (n_cells, n_batches)) – One-hot encoded batch labels.
encoder_outer (nn.Module, optional) – Outer encoder; if None, skips z/inter outputs.
encoder_inner (nn.Module, optional) – Inner encoder; if None, skips z/inter outputs.
lib_encoder (nn.Module, optional) – Library encoder; if None, skips l_mu/l_sig outputs.

Return type:

Union[Tuple[ndarray, ndarray, ndarray], Tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]

Returns:

Depending on the provided encoders:
(z_mu, z_sig, inter) if lib_encoder is None.
(l_mu, l_sig) if only lib_encoder is provided.
(z_mu, z_sig, inter, l_mu, l_sig) if all provided.

generate_homologous_samples(samples=5000, target_cell_key=None, b_s=128, confidence_level=0.9)[source]

Decode homologous normalized expression profiles for context and target species by Monte Carlo sampling.

Parameters:

target_cell_key (str or None) – Column name in .obs specifying inferred cell type labels for the target dataset.
samples (int, default=5000) – Total number of decoded samples to return per cell type.
b_s (int, default=128) – Batch size for decoding iterations.
confidence_level (float, default=0.9) – Quantile threshold used in filter_outliers to remove extreme latent embeddings.

Return type:

Tuple[Dict[str, ndarray], Dict[str, ndarray]]

Returns:

target_rho_dict (dict of str to ndarray of shape (samples, genes)) – Decoded normalized expression (rho) for shared cell types in the target species.
context_rho_dict (dict of str to ndarray of shape (samples, genes)) – Decoded normalized expression (rho) for shared cell types in the context species.

get_representation(eval_model, save_intermediate=False, save_libsize=False)[source]

Compute and store biological latent and/or library latent representations for a dataset.

Parameters:

eval_model ({'context','target'}) – Which dataset to encode.
save_intermediate (bool, default=False) – If True, store the outer encoder output in .obsm[‘inter’].
save_libsize (bool, default=False) – If True, store library mean/log-std in .obsm[‘l_mu’]/[‘l_sig’].

initialize(initialize='both')[source]

Instantiate or reinstantiate context and/or target encoder and decoder modules.

Parameters:

initialize ({'context', 'context_decoder', 'target', 'both'}, default='both') – Which sub-model(s) to initialize.

load(models='both', save_key='')[source]

Load previously saved configs, optimizers, and weights.

Parameters:

models ({'context', 'target', 'both'}, default='both') – Which sub-models to load.
save_key (str, default='') – Suffix of the filenames to load.

ret_pred_df(pred_key, target_label_key, context_label_key)[source]

Compute a normalized confusion matrix (%) and balanced accuracy for label transfer.

This evaluates how well the predicted context-derived labels match the true labels on the target dataset.

Parameters:

pred_key (str) – Key in self.mdata.mod[target_key].obs under which predicted labels are stored.
target_label_key (str) – Key in self.mdata.mod[target_key].obs for the ground-truth labels.
context_label_key (str) – Key in self.mdata.mod[context_key].obs for the reference context labels.

Return type:

Tuple[DataFrame, float]

Returns:

df (pd.DataFrame) – Confusion matrix (in percent) with

index : sorted labels of target_label_key
columns : sorted labels of context_label_key
values : percentage of cells with true label = row and predicted label = column
bas (float) – Balanced accuracy score computed only over the subset of cells whose true labels also appear in the context set.
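The two return values can be illustrated with a small NumPy sketch. It assumes that percentages are normalized per row (per true label) and that balanced accuracy is the mean per-class recall over true labels that also occur in the context label set. The function name confusion_percent and the toy labels are hypothetical. The documented method returns a DataFrame, while this sketch returns plain arrays.

```python
import numpy as np


def confusion_percent(true_labels, pred_labels, context_labels):
    """Hypothetical helper: row-normalized confusion matrix and balanced accuracy."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    rows = sorted(set(true_labels.tolist()))   # true target labels
    cols = sorted(set(context_labels))         # labels available in context
    counts = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            counts[i, j] = np.sum((true_labels == r) & (pred_labels == c))
    # ASSUMPTION: each row is normalized to sum to 100 percent.
    row_sums = counts.sum(axis=1, keepdims=True)
    percent = 100.0 * counts / np.where(row_sums == 0, 1, row_sums)
    # Balanced accuracy: mean recall over true labels that exist in context.
    shared = [r for r in rows if r in cols]
    recalls = [np.mean(pred_labels[true_labels == r] == r) for r in shared]
    return rows, cols, percent, float(np.mean(recalls))


true = ["B", "B", "T", "T", "T", "T", "NK"]
pred = ["B", "T", "T", "T", "T", "B", "T"]
rows, cols, percent, bas = confusion_percent(true, pred, ["B", "T"])
```

The "NK" row still appears in the matrix, but it does not enter the balanced accuracy because no context cell carries that label.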

return_similarity_df(max_sample_targ=2000, max_sample_cont=50, scale='none')[source]

Compute and return similarity scores between target and context cell types by sampling from latent cell type distributions and calculating likelihood differences. The modal value of the resulting distribution is used as the similarity score.

Parameters:

max_sample_targ (int, default=2000) – Number of samples from the target cell types.
max_sample_cont (int, default=50) – Number of samples from the context cell types per target cell.
scale ({'min_max', 'max', 'none'}, default='none') – Scaling strategy across rows: min-max normalization, max-based inversion, or no scaling.

Returns:

df – Similarity scores with - index: target cell types, - columns: context cell types.

Return type:

DataFrame

save(models='both', save_key='')[source]

Serialize model configuration, optimizers, and context and/or target scVI weights to disk.

Parameters:

models ({'context', 'target', 'both'}) – Which sub-models to save.
save_key (str) – Suffix for filenames.

save_mdata(save_key)[source]

Write the assembled MuData object to .h5mu.

Parameters:

save_key (str) – Suffix for the data filename.

similarity_metric(target_ind, context_ind, b_s=None, b_sc=None, display=True)[source]

Compute negative log-likelihood based similarity scores for target and context cells specified by their indices.

Parameters:

target_ind (array of integers) – Target cell indices in self.mdata[target_key].X, shape (n_target, 1).
context_ind (array of integers) – Context cell neighbors in self.mdata[context_key].X, shape (n_target, k). For each entry along the first axis, the similarity of its k candidates is calculated.
b_s (int, optional) – Batch size for target.
b_sc (int, optional) – Chunk size for context neighbors.
display (bool) – If True, prints progress.

Returns:

similarities – Contains the similarity scores between the target cells and their k context candidates, shape (n_target, k).

Return type:

ndarray

similarity_metric_on_latent_space(precompute_neighbors=True)[source]

Compute similarity scores between the context and target datasets, either for a precomputed set of neighbors obtained from a latent space neighborhood search (to speed up computation) or for all context-target pairs. The latter should only be done for small datasets.

Parameters:

precompute_neighbors (bool) – If True, precomputes a set of 250 Euclidean neighbors on the aligned latent space.

Return type:

Tuple[ndarray, ndarray]

Returns:

similarities (ndarray of shape (target.n_obs, 250) or (target.n_obs, context.n_obs))
context_ind (ndarray of shape (target.n_obs, 250) or (target.n_obs, context.n_obs))
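The neighbor precomputation can be pictured as a Euclidean nearest-neighbor search from each target cell to the context cells in the shared latent space. The sketch below uses a brute-force NumPy search on toy 2-D embeddings. The function name latent_neighbors is hypothetical, and the actual search routine used by scSpecies is not specified here.

```python
import numpy as np


def latent_neighbors(target_z, context_z, k):
    """Hypothetical helper: k nearest context cells for every target cell.

    target_z: (n_target, lat_dim), context_z: (n_context, lat_dim).
    Returns indices and distances, each of shape (n_target, k).
    """
    # Pairwise Euclidean distances via broadcasting, shape (n_target, n_context).
    diff = target_z[:, None, :] - context_z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    k = min(k, context_z.shape[0])
    # Sort each row and keep the k closest context cells.
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)


target_z = np.array([[0.0, 0.0], [10.0, 10.0]])
context_z = np.array([[0.0, 1.0], [9.0, 10.0], [0.0, 3.0], [10.0, 12.0]])
context_ind, neighbor_dist = latent_neighbors(target_z, context_z, k=2)
```

Restricting the likelihood-based similarity to such a candidate set reduces the number of evaluated pairs from target.n_obs × context.n_obs to target.n_obs × k.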

train_context(epochs=40, raise_beta=True, save_model=True, train_decoder_only=False, save_key='')[source]

Pretrain the context scVI model on the context dataset.

Parameters:

epochs (int, default=40) – Number of training epochs.
raise_beta (bool, default=True) – If True, increase KL weight over initial epochs.
save_model (bool, default=True) – If True, save model parameters after training.
train_decoder_only (bool, default=False) – If True, freeze encoders and train only the decoder.
save_key (str, default='') – Filename suffix when saving.

train_target(epochs=40, save_model=True, raise_beta=True, raise_eta=True, save_key='')[source]

Train the target-side scVI model, optionally aligning to context.

Parameters:

epochs (int, default=40) – Number of training epochs.
save_model (bool, default=True) – Save parameters after training.
raise_beta (bool, default=True) – If True, increase KL weight over initial epochs.
raise_eta (bool, default=True) – If True, increase alignment weight over initial epochs.
save_key (str, default='') – Suffix for saved files.

transfer_labels_cell(target_ind, context_obs_transfer)[source]

Calculate similarity scores between a single target cell, specified by its index in self.mdata[target_key].X, and all context cells. Transfers the labels specified in context_obs_transfer and returns a dataframe of context cells sorted by similarity score.

Parameters:

target_ind (int) – Target cell index.
context_obs_transfer (str or List of str) – Observation key(s) from the context dataset to return as columns in the output (e.g., ‘cell_type’).

Returns:

Context labels, source indices, and similarity scores with the specified target cell.

Return type:

DataFrame

transfer_labels_data(context_obs_transfer, top_neigh=25, write_sim=False)[source]

Assign context-derived labels via similarity scores to each target cell by majority vote among its top candidates.

For each observation key in context_obs_transfer, finds the top_neigh most similar context cells (based on decoder likelihood in latent space), takes the most frequent label among those neighbors, and writes it into self.mdata.mod[target_key].obs[‘pred_sim_<obs_key>’]. When target cell annotation is unknown, the inferred values of the last entry in context_obs_transfer will serve as a replacement for target cell annotation in downstream analyses.

Parameters:

context_obs_transfer (List of str or str) – One or more keys in self.mdata.mod[context_key].obs whose values to transfer.
top_neigh (int, default=25) – Number of nearest neighbors to consider for the majority vote.
write_sim (bool, default=False) – If True, also stores raw similarity scores and neighbor indices in self.mdata.mod[target_key].obsm[‘similarities’] and [‘similarities_ind’].
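The majority vote can be sketched with NumPy and collections.Counter. The sketch assumes that a larger score means a more similar context cell. That ordering is an assumption and should be checked against the actual scores, since they are described as negative log-likelihood based. The function name majority_vote and the toy values are hypothetical.

```python
from collections import Counter

import numpy as np


def majority_vote(similarities, context_ind, context_labels, top_neigh=25):
    """Hypothetical helper: most frequent label among the top candidates.

    similarities, context_ind: arrays of shape (n_target, k).
    context_labels: one label per context cell.
    """
    context_labels = np.asarray(context_labels)
    # ASSUMPTION: higher score = more similar, so sort in descending order.
    order = np.argsort(-similarities, axis=1)[:, :top_neigh]
    top_ind = np.take_along_axis(context_ind, order, axis=1)
    preds = []
    for row in top_ind:
        votes = Counter(context_labels[row].tolist())
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)


context_labels = ["B", "B", "T", "T", "T"]
context_ind = np.array([[0, 1, 2, 3], [1, 2, 3, 4]])
similarities = np.array([[0.9, 0.8, 0.1, 0.7], [0.2, 0.9, 0.8, 0.7]])
pred = majority_vote(similarities, context_ind, context_labels, top_neigh=3)
```

If lower scores indicate higher similarity in practice, the sort order in the sketch has to be reversed.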

scspecies.models.neighbors_workaround(adata, use_rep=None, n_neighbors=15, metric='euclidean')[source]

Compute the k-nearest-neighbors graph manually and store it in adata. Replacement for sc.pp.neighbors on M1/M2 chips to avoid kernel crashes.

Parameters:

adata (ad.AnnData) – Annotated data object.
use_rep (str) – Key in adata.obsm to use for neighbor search (e.g. ‘X_pca’), or None to use adata.X.
n_neighbors (int) – Number of nearest neighbors to use.
metric (str or None) – Distance metric to use (default ‘euclidean’).

Returns:

The same adata, with:

obsp[‘distances’] : sparse matrix of neighbor distances
obsp[‘connectivities’] : sparse binary connectivity matrix
uns[‘neighbors’] : dict of params & key names

Return type:

AnnData
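The structure of the stored graph can be illustrated with a brute-force NumPy sketch. It uses dense arrays for brevity, whereas the documented function stores sparse matrices in adata.obsp. Excluding each cell from its own neighbor list is an assumption, and the function name knn_graph_dense is hypothetical.

```python
import numpy as np


def knn_graph_dense(X, n_neighbors):
    """Hypothetical helper: dense kNN distance and binary connectivity matrices.

    X: array of shape (n_obs, n_features); requires n_neighbors < n_obs.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances, shape (n_obs, n_obs).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # ASSUMPTION: a cell is not counted as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    distances = np.zeros((n, n))
    connectivities = np.zeros((n, n))
    distances[rows, cols] = dist[rows, cols]   # distance to each kept neighbor
    connectivities[rows, cols] = 1.0           # 1 marks a neighbor edge
    return distances, connectivities


X = np.array([[0.0], [1.0], [3.0], [7.0]])
distances, connectivities = knn_graph_dense(X, n_neighbors=1)
```

Each row of both matrices has exactly n_neighbors non-zero entries, which is why a sparse representation is the natural storage format for larger datasets.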