scspecies package

Submodules

scspecies.models module

class scspecies.models.Decoder[source]

Bases: Module

Decoder mapping latent and label inputs back to data distribution parameters.

Parameters:

param_dict (dict) – Dictionary with keys: - ‘lat_dim’, ‘label_dim’, ‘data_dim’ (int): Latent, label, and data dims. - ‘dims_dec’ (list of int): Hidden layer sizes. - ‘layer_order’ (list): For create_structure. - ‘data_distr’ (str): ‘nb’ or ‘zinb’. - ‘dispersion’ (str): One of ‘dataset’, ‘batch’, ‘cell’. - ‘dispersion’ and ‘data_distr’ control parameter layers. - ‘homologous_genes’ (list of int): Indices for homologous genes.

model

Feed-forward network for decoder.

Type:

nn.Sequential

rho_pre

Linear layer for relative expression logits.

Type:

nn.Linear

log_alpha

Dispersion parameter(s) depending on dispersion.

Type:

Parameter or nn.Linear

pi_nlogit

Zero-inflation logits if data_distr == ‘zinb’.

Type:

nn.Linear, optional

__init__(param_dict)[source]
static __new__(cls, *args, **kwargs)
Return type:

Any

calc_nlog_likelihood(dec_outp, library, x, eps=1e-07)[source]

Compute negative log-likelihood under the NB or ZINB model.

Parameters:
  • dec_outp (list of torch.Tensor) – [alpha, rho] or [alpha, rho, pi_nlogit] depending on distribution.

  • library (torch.Tensor) – Library size factor.

  • x (torch.Tensor) – Observed count data.

  • eps (float) – Numerical stability constant.

Returns:

Negative log-likelihood per sample.

Return type:

torch.Tensor
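The likelihood computation can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: it assumes the NB mean is library * rho, that alpha is an inverse dispersion, and that zero inflation is given as a dropout probability pi rather than as the pi_nlogit logits. The function names are hypothetical.

```python
import math

def nb_log_prob(x, mu, alpha, eps=1e-7):
    """Log-probability of count x under NB(mean=mu, inverse dispersion=alpha)."""
    return (
        math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
        + alpha * math.log(alpha / (alpha + mu) + eps)
        + x * math.log(mu / (alpha + mu) + eps)
    )

def zinb_log_prob(x, mu, alpha, pi, eps=1e-7):
    """Log-probability under a zero-inflated NB with dropout probability pi."""
    nb = nb_log_prob(x, mu, alpha, eps)
    if x == 0:
        # A zero can come from the dropout component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb) + eps)
    return math.log(1.0 - pi + eps) + nb

def nll_per_cell(x_row, rho_row, library, alpha, pi=None):
    """Negative log-likelihood of one cell, summed over its genes."""
    total = 0.0
    for x, rho in zip(x_row, rho_row):
        mu = library * rho  # assumption: mean = library size * relative expression
        if pi is None:
            total -= nb_log_prob(x, mu, alpha)
        else:
            total -= zinb_log_prob(x, mu, alpha, pi)
    return total
```

The eps terms mirror the role of the eps argument above: they keep the logarithms finite when a probability reaches zero.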

decode(z, label_inp)[source]

Decode latent and label inputs to distribution parameters.

Parameters:
  • z (torch.Tensor) – Latent tensor of shape (batch_size, lat_dim).

  • label_inp (torch.Tensor) – Label tensor of shape (batch_size, label_dim).

Returns:

outputs – [alpha, rho] or [alpha, rho, pi_nlogit].

Return type:

list of torch.Tensor

decode_homologous(z, label_inp)[source]

Decode the latent variables and label input into gene expression for homologous genes. This method is used to assess and compare the log2-fold change between species.

Parameters:
  • z (Tensor)

  • label_inp (Tensor)

Returns:

The decoded gene expression probabilities for homologous genes.

Return type:

Tensor

forward(z, label_inp, library, x)[source]

Compute mean negative log-likelihood loss.

Parameters:
  • z (torch.Tensor) – Latent representations.

  • label_inp (torch.Tensor) – Labels.

  • library (torch.Tensor) – Library size factors.

  • x (torch.Tensor) – Observed data.

Returns:

Mean negative log-likelihood over batch.

Return type:

torch.Tensor

class scspecies.models.Encoder_inner[source]

Bases: Module

Inner encoder module producing Gaussian latent parameters and sampling latent variables. Will be shared between the context and target scVI models.

Parameters:
  • device (str) – Device identifier for sampling (‘cpu’, ‘mps’ or ‘cuda’).

  • param_dict (dict) – Dictionary with keys: - ‘dims_enc_outer’ (list of int): Output dims of outer encoder. - ‘dims_enc_inner’ (list of int): Hidden layer sizes for inner encoder. - ‘lat_dim’ (int): Dimensionality of the latent space. - ‘layer_order’ (list): See create_structure.

model

Feed-forward network for intermediate representation.

Type:

nn.Sequential

mu

Linear layer mapping to latent mean.

Type:

nn.Linear

log_sig

Linear layer mapping to log-standard deviation.

Type:

nn.Linear

sampling_dist

Standard normal distribution for sampling latent representations.

Type:

Normal

__init__(device, param_dict)[source]
static __new__(cls, *args, **kwargs)
Return type:

Any

encode(inter)[source]

Compute latent mean and log-std from intermediate representation.

Parameters:

inter (torch.Tensor) – Intermediate features of shape (batch_size, dims_enc_inner[-1]).

Return type:

Tuple[Tensor, Tensor]

Returns:

  • mu (torch.Tensor) – Latent means of shape (batch_size, lat_dim).

  • log_sig (torch.Tensor) – Latent log-standard deviations of shape (batch_size, lat_dim).

forward(inter)[source]

Sample latent variable and compute KL-Divergence.

Parameters:

inter (torch.Tensor) – Intermediate features from outer encoder.

Return type:

Tuple[Tensor, Tensor]

Returns:

  • z (torch.Tensor) – Sampled latent tensor of shape (batch_size, lat_dim).

  • kl_div (torch.Tensor) – Scalar KL-Divergence across the batch.
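As an illustration of the sampling and KL step, the following sketch draws a latent vector with the reparameterization trick and evaluates the closed-form KL divergence to a standard normal prior for one cell. The function names are hypothetical, and how the library reduces the KL term over the batch (sum or mean) is not stated here, so that part is left out.

```python
import math
import random

def sample_latent(mu, log_sig, rng):
    """Reparameterized draw z = mu + exp(log_sig) * eps with eps ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sig)]

def kl_to_standard_normal(mu, log_sig):
    """KL( N(mu, diag(sig^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (m * m + math.exp(2.0 * ls) - 1.0) - ls
        for m, ls in zip(mu, log_sig)
    )

rng = random.Random(0)
z = sample_latent([0.5, -1.0], [-0.2, 0.1], rng)
kl = kl_to_standard_normal([0.5, -1.0], [-0.2, 0.1])
```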

class scspecies.models.Encoder_outer[source]

Bases: Module

Outer encoder module that concatenates data and labels, then applies a feed-forward network. Will be reinitialized by scSpecies after pre-training on a context dataset.

Parameters:

param_dict (dict) – Dictionary with keys: - ‘data_dim’ (int): Dimensionality of input data. - ‘label_dim’ (int): Dimensionality of input labels. - ‘dims_enc_outer’ (list of int): Hidden layer sizes after concatenation. - ‘layer_order’ (list): See create_structure for format.

model

The feed-forward network created by create_structure.

Type:

nn.Sequential

__init__(param_dict)[source]
static __new__(cls, *args, **kwargs)
Return type:

Any

forward(data, label_inp)[source]

Forward pass through the outer encoder layers.

Parameters:
  • data (torch.Tensor) – Input data tensor of shape (batch_size, data_dim).

  • label_inp (torch.Tensor) – Input label tensor of shape (batch_size, label_dim).

Returns:

Encoded representation of shape (batch_size, dims_enc_outer[-1]).

Return type:

torch.Tensor

class scspecies.models.Library_encoder[source]

Bases: Module

Encoder for library size factor, modeling a 1D log-normal distribution.

Parameters:
  • device (str) – Device identifier for sampling.

  • param_dict (dict) – Dictionary with keys: - ‘data_dim’, ‘label_dim’ (int): Input dims. - ‘dims_l_enc’ (list of int): Hidden layer sizes. - ‘lib_mu_add’ (float): Offset added to the mean. - ‘layer_order’ (list): For create_structure.

model

Feed-forward network for concatenated input.

Type:

nn.Sequential

mu

Layer mapping to log-mean of library.

Type:

nn.Linear

log_sig

Layer mapping to log-std of library.

Type:

nn.Linear

sampling_dist

Standard normal for sampling.

Type:

Normal

mu_add

Added to the decoded mean.

Type:

float

__init__(device, param_dict)[source]
static __new__(cls, *args, **kwargs)
Return type:

Any

encode(data, label_inp)[source]

Compute library log-mean and log-std from inputs.

Parameters:
  • data (torch.Tensor) – Data tensor of shape (batch_size, data_dim).

  • label_inp (torch.Tensor) – Label tensor of shape (batch_size, label_dim).

Return type:

Tuple[Tensor, Tensor]

Returns:

  • mu (torch.Tensor) – Adjusted log-mean of shape (batch_size, 1).

  • log_sig (torch.Tensor) – Log-std of shape (batch_size, 1).

forward(data, label_inp, prior_mu=None, prior_sig=None)[source]

Sample library factor and compute optional KL-Divergence to prior.

Parameters:
  • data (torch.Tensor) – Input data.

  • label_inp (torch.Tensor) – Input labels.

  • prior_mu (torch.Tensor, optional) – Precomputed prior mean parameter for KL-Divergence.

  • prior_sig (torch.Tensor, optional) – Precomputed prior std parameter for KL-Divergence.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

Returns:

  • l (torch.Tensor) – Sampled library factor of shape (batch_size, 1).

  • kl_div (torch.Tensor, optional) – KL-Divergence if prior_mu and prior_sig are not None.
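The optional KL term can be sketched with the closed form for two univariate normal distributions. Because the KL divergence between two log-normal distributions equals the KL divergence between their underlying normals, this also covers the log-normal library model. The helper name is hypothetical, and it assumes prior_sig is a standard deviation, as described above.

```python
import math

def kl_normal(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ) for scalar parameters."""
    return (
        math.log(sig_p / sig_q)
        + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2)
        - 0.5
    )
```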

class scspecies.models.Progress_Bar[source]

Bases: object

A console progress bar that tracks multiple metrics over training iterations of scSpecies.

Parameters:
  • epochs (int) – Total number of epochs.

  • steps_per_epoch (int) – Number of steps (batches) in each epoch.

  • metrics (list of str) – Names of metrics to track (e.g., [‘nELBO’, ‘nlog_likeli’]).

  • avg_over_n_steps (int, optional) – Number of recent steps over which to average metric values for display.

  • sleep_print (float, optional) – Interval in seconds between console updates.

__init__(epochs, steps_per_epoch, metrics, avg_over_n_steps=100, sleep_print=0.5)[source]
static format_number(number, min_length)[source]
ret_sign(number, min_length)[source]
update(values)[source]
scspecies.models.color_str(value, mode)[source]
scspecies.models.create_structure(structure, layer_order)[source]

Builds neural network architectures based on a list of layer sizes and operation order.

Parameters:
  • structure (list of int) – No. of neurons in each layer, including input and output dimensions. For example, [input_dim, hidden1_dim, …, output_dim]. Must have at least two entries.

  • layer_order (list) –

    Sequence of layer specifications. Each element must be either

    1: ‘linear’,

    Affine linear transformation

    2: ‘batch_norm’,

    Batch normalization

    3: ‘layer_norm’,

    Layer normalization

    4: (‘act’, activation: nn.Module or ‘PReLU’, [min_clip, max_clip] optional),

    Activation function. Unbounded activation functions should be clipped for numerical stability, example: (‘act’, torch.nn.ReLU(), [0, 6])

    5: (‘dropout’, dropout rate - float in [0, 1]),

    Dropout layer, example: (‘dropout’, 0.1)

Returns:

A sequential container of PyTorch layers in the specified order for each pair in structure.

Return type:

nn.Sequential
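To make the pairing of structure and layer_order concrete, here is a small sketch that expands both into a flat layer plan without building any PyTorch modules. It is an illustration only: expand_structure is a hypothetical helper, activations are given as plain strings, and it assumes normalization layers act on the output size of each pair.

```python
def expand_structure(structure, layer_order):
    """Return a flat list of layer descriptions, one group per consecutive size pair."""
    if len(structure) < 2:
        raise ValueError("structure needs at least an input and an output size")
    plan = []
    for n_in, n_out in zip(structure[:-1], structure[1:]):
        for spec in layer_order:
            name = spec if isinstance(spec, str) else spec[0]
            if name == "linear":
                plan.append(("linear", n_in, n_out))
            elif name in ("batch_norm", "layer_norm"):
                # Assumption: normalization follows the linear layer of the pair.
                plan.append((name, n_out))
            elif name == "act":
                clip = tuple(spec[2]) if len(spec) > 2 else None
                plan.append(("act", spec[1], clip))
            elif name == "dropout":
                plan.append(("dropout", float(spec[1])))
            else:
                raise ValueError(f"unknown layer specification: {spec!r}")
    return plan

plan = expand_structure(
    [100, 300, 200],
    ["linear", "layer_norm", ("act", "ReLU", [0, 6]), ("dropout", 0.1)],
)
```

With three entries in structure there are two consecutive pairs, so the four-element layer_order is applied twice and the plan holds eight entries.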

class scspecies.models.make_act_bounded[source]

Bases: Module

Wrapper module that applies an activation and clips its output.

Parameters:
  • act (nn.Module) – Activation function to apply.

  • min (float) – Lower bound for clipping.

  • max (float) – Upper bound for clipping.

__init__(act, min, max)[source]
static __new__(cls, *args, **kwargs)
Return type:

Any

forward(x)[source]
Return type:

Tensor
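The idea of a bounded activation can be shown on scalars. The sketch below wraps an arbitrary activation function and clips its output; it uses plain Python functions instead of nn.Module, and the names make_bounded, relu, and relu6 are hypothetical.

```python
def make_bounded(act, lower, upper):
    """Wrap an activation so that its output is clipped to [lower, upper]."""
    def bounded(x):
        return max(lower, min(upper, act(x)))
    return bounded

def relu(x):
    return max(0.0, x)

# Clipped ReLU, matching the ('act', torch.nn.ReLU(), [0, 6]) example above.
relu6 = make_bounded(relu, 0.0, 6.0)
```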

scspecies.models.neighbors_workaround(adata, use_rep=None, n_neighbors=15, metric='euclidean')[source]

Compute the k-nearest-neighbors graph manually and store it in adata. Replacement for sc.pp.neighbors on M1/M2 chips to avoid kernel crashes.

Parameters:
  • adata (ad.AnnData) – Annotated data object.

  • use_rep (str) – Key in adata.obsm to use for neighbor search (e.g. ‘X_pca’), or None to use adata.X.

  • n_neighbors (int) – Number of nearest neighbors to use.

  • metric (str or None) – Distance metric to use (default ‘euclidean’).

Returns:

The same adata, with:
  • obsp[‘distances’] : sparse matrix of neighbor distances

  • obsp[‘connectivities’] : sparse binary connectivity matrix

  • uns[‘neighbors’] : dict of params & key names

Return type:

AnnData

class scspecies.models.scSpecies[source]

Bases: object

The scSpecies cross-species architecture alignment framework built on scVI.

This class implements end-to-end preprocessing, variational encoding, decoding, and alignment for a “context” dataset (e.g., mouse) and a “target” dataset (e.g., human). It supports:

  • Training scVI models on context and target (latent or intermediate alignment).

  • Library size encoding and negative-binomial / zero-inflated NB likelihoods.

  • Establishing a direct correspondence between target and context cells via a likelihood-based similarity measure.

  • Latent-space nearest-neighbor label transfer based on the similarity measure.

  • Log-fold-change computation of homologous genes.

Parameters:
  • device (str) – PyTorch device identifier (‘cpu’, ‘mps’ or ‘cuda’).

  • mdata (mu.MuData) – Multi-modal container holding context and target AnnData objects, set up by the create_mdata class.

  • directory (str) – Base path for saving model parameters, data, and figures.

  • random_seed (int, default=369963) – Seed for NumPy and PyTorch RNGs.

  • context_key (str, default='mouse') – Key in mdata.mod for the context dataset.

  • target_key (str, default='human') – Key in mdata.mod for the target dataset.

  • context_optimizer (torch.optim.Optimizer classes) – Optimizer constructors for context and target models.

  • target_optimizer (torch.optim.Optimizer classes) – Optimizer constructors for context and target models.

  • context_hidden_dims_enc_outer (list[int]) – Hidden layer sizes for the outer encoders.

  • target_hidden_dims_enc_outer (list[int]) – Hidden layer sizes for the outer encoders.

  • hidden_dims_enc_inner (list[int]) – Hidden layer sizes for the inner encoder.

  • context_hidden_dims_l_enc (list[int]) – Hidden layer sizes for the library encoder.

  • target_hidden_dims_l_enc (list[int]) – Hidden layer sizes for the library encoder.

  • context_hidden_dims_dec (list[int]) – Hidden layer sizes for the decoder.

  • target_hidden_dims_dec (list[int]) – Hidden layer sizes for the decoder.

  • context_layer_order (list) – Layer specification lists for create_structure.

  • target_layer_order (list) – Layer specification lists for create_structure.

  • b_s (int, default=128) – Batch size for training and inference.

  • context_data_distr ({'nb', 'zinb'}) – Observation models for counts.

  • target_data_distr ({'nb', 'zinb'}) – Observation models for counts.

  • lat_dim (int, default=10) – Dimensionality of the latent space.

  • context_dispersion ({'dataset', 'batch', 'cell'}) – Dispersion parameterization strategy.

  • target_dispersion ({'dataset', 'batch', 'cell'}) – Dispersion parameterization strategy.

  • alignment ({'inter', 'latent'}) – Alignment mode between context and target. Either at the outer encoder output space or at the latent space.

  • k_neigh (int, default=25) – Number of neighbor candidates for alignment from the data-level nearest-neighbor search (NNS).

  • top_percent (float, default=20) – Percentile cutoff for selecting top-agreement neighbors.

  • context_beta_* (floats and ints) – Schedules for KL and alignment weight ramps.

  • target_beta_* (floats and ints) – Schedules for KL and alignment weight ramps.

  • eta_* (floats and ints) – Schedules for KL and alignment weight ramps.

  • use_lib_enc (bool, default=True) – Whether to include a library-size encoder.

__init__(device, mdata, directory, random_seed=369963, context_key='mouse', target_key='human', context_optimizer=torch.optim.Adam, target_optimizer=torch.optim.Adam, context_hidden_dims_enc_outer=[300], target_hidden_dims_enc_outer=[300], hidden_dims_enc_inner=[200], context_hidden_dims_l_enc=[200], target_hidden_dims_l_enc=[200], context_hidden_dims_dec=[200, 300], target_hidden_dims_dec=[200, 300], lat_dim=10, context_layer_order=['linear', 'layer_norm', ('act', torch.nn.ReLU), ('dropout', 0.1)], target_layer_order=['linear', 'layer_norm', ('act', torch.nn.ReLU), ('dropout', 0.1)], use_lib_enc=True, b_s=128, context_data_distr='zinb', target_data_distr='zinb', context_dispersion='batch', target_dispersion='batch', alignment='inter', k_neigh=25, top_percent=20, context_beta_start=0.1, context_beta_max=1, context_beta_epochs_raise=10, target_beta_start=0.1, target_beta_max=1, target_beta_epochs_raise=10, eta_start=10, eta_max=25, eta_epochs_raise=10)[source]
static average_slices(array, slice_sizes)[source]

Compute the mean of consecutive subarrays of a flat 2D array. Helper for compute_logfold_change.

Parameters:
  • array (ndarray, shape (sum(slice_sizes), n_features)) – The concatenated data.

  • slice_sizes (sequence of int) – Positive integers that sum to array.shape[0].

Returns:

stacked_means – The mean of each slice.

Return type:

ndarray, shape (len(slice_sizes), n_features)
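The slicing logic can be sketched with nested lists in place of NumPy arrays. The function name average_slices_sketch is hypothetical and the code only illustrates the described behavior.

```python
def average_slices_sketch(rows, slice_sizes):
    """Column-wise mean of consecutive row blocks of the given sizes."""
    if sum(slice_sizes) != len(rows):
        raise ValueError("slice_sizes must sum to the number of rows")
    means, start = [], 0
    for size in slice_sizes:
        block = rows[start:start + size]
        # zip(*block) iterates over the columns of the block.
        means.append([sum(col) / size for col in zip(*block)])
        start += size
    return means

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = average_slices_sketch(rows, [2, 1])
```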

compute_logfold_change(eval_cell_types=None, eps=1e-06, lfc_delta=1, samples=50000, target_cell_key=None, b_s=128, confidence_level=0.9)[source]

Monte Carlo estimation of per-gene Log2-fold-changes and associated probabilities.

For each specified cell type (or the intersection of context/target types), samples from the scVI posterior, computes the ratio of target vs. context expression for each homologous gene, and aggregates:

  • Median Log2-fold-change (on normalized decoder space),

  • Probability(abs(Log2Fc) > lfc_delta),

  • Mean gene expression on normalized decoder space and NB parameter space.

Parameters:
  • eval_cell_types (sequence of str, optional) – Cell types to include; defaults to the intersection of context and target types.

  • eps (float, default=1e-6) – Small constant added before log to prevent small gene expression patterns from returning large LFC values.

  • lfc_delta (float, default=1) – Threshold for computing the probability of large fold-changes.

  • target_cell_key (str or None) – Column name in .obs specifying inferred cell type labels for the target dataset.

  • samples (int, default=50000) – Total number of Monte Carlo draws per cell.

  • b_s (int, default=128) – Batch size for sampling iterations.

  • confidence_level (float, default=0.9) – Outlier filtering threshold for latent space.

Returns:

lfc_dict – Dictionary with cell-wise data frames containing the keys:

  • ‘rho_median_context’ : Median context normalized gene expression,

  • ‘mu_median_context’ : Median context expected value gene expression,

  • ‘rho_median_target’ : Median target normalized gene expression,

  • ‘mu_median_target’ : Median target expected value gene expression,

  • ‘lfc’ : Median Log2 fold-change of the relative expression parameter rho,

  • ‘p’ : Probability of Log2 fold-change values greater than lfc_delta,

  • ‘lfc_rand’ : Median Log2 fold-change of the relative expression parameter rho on permuted data,

  • ‘p_rand’ : Probability of Log2 fold-change values greater than lfc_delta on permuted data.

Return type:

dict of str to pd.DataFrame
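For a single gene, the aggregation step can be sketched as follows. The helper lfc_summary is hypothetical; it assumes paired Monte Carlo draws of the normalized expression rho for target and context and follows the eps and lfc_delta semantics described above.

```python
import math
import statistics

def lfc_summary(target_rho, context_rho, eps=1e-6, lfc_delta=1.0):
    """Median log2 fold-change and P(|LFC| > lfc_delta) from paired samples."""
    lfcs = [
        math.log2((t + eps) / (c + eps))
        for t, c in zip(target_rho, context_rho)
    ]
    median_lfc = statistics.median(lfcs)
    # Fraction of draws whose absolute fold-change exceeds the threshold.
    prob = sum(abs(v) > lfc_delta for v in lfcs) / len(lfcs)
    return median_lfc, prob

median_lfc, prob = lfc_summary(
    [4e-3, 8e-3, 2e-3, 4e-3],
    [1e-3, 1e-3, 1e-3, 1e-3],
)
```

Adding eps to numerator and denominator pulls the ratio toward 1 when both expression values are tiny, which is why very lowly expressed genes do not produce large fold-changes.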

create_directory(directory)[source]

Create project subdirectories for parameters, data, and figures.

Parameters:

directory (str) – Base output directory.

encode(x, s, encoder_outer=None, encoder_inner=None, lib_encoder=None)[source]

Encode data into biological and/or library latent variables.

Parameters:
  • x (Tensor, shape (n_cells, n_genes)) – Raw or log-transformed count matrix.

  • s (Tensor, shape (n_cells, n_batches)) – One-hot encoded batch labels.

  • encoder_outer (nn.Module, optional) – Outer encoder; if None, skips z/inter outputs.

  • encoder_inner (nn.Module, optional) – Inner encoder; if None, skips z/inter outputs.

  • lib_encoder (nn.Module, optional) – Library encoder; if None, skips l_mu/l_sig outputs.

Return type:

Union[Tuple[ndarray, ndarray, ndarray], Tuple[ndarray, ndarray, ndarray, ndarray, ndarray]]

Returns:

  • Depending on provided encoders

  • (z_mu, z_sig, inter) if lib_encoder is None.

  • (l_mu, l_sig) if only lib_encoder is provided.

  • (z_mu, z_sig, inter, l_mu, l_sig) if all provided.

static filter_outliers(data, confidence_level=0.9)[source]

Identify inlier and outlier rows based on the Mahalanobis distance.

Computes the Mahalanobis distance of each row in data from the multivariate mean, uses a chi-squared cutoff at the given confidence_level, and returns boolean masks. Helper for compute_logfold_change.

Parameters:
  • data (ndarray, shape (n_samples, n_features)) – Input points in feature space.

  • confidence_level (float, default=0.9) – Threshold percentile for declaring a point an inlier.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • inlier_mask (ndarray of bool, shape (n_samples,)) – True for rows whose Mahalanobis distance is below the threshold.

  • outlier_mask (ndarray of bool, shape (n_samples,)) – True for rows whose distance exceeds the threshold.
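A two-dimensional sketch of the Mahalanobis filter is given below. It is restricted to two features because the chi-squared quantile then has the closed form -2 * ln(1 - confidence_level), which avoids a SciPy dependency. The function name is hypothetical, and the use of the sample covariance (n - 1 denominator) is an assumption.

```python
import math

def filter_outliers_2d(points, confidence_level=0.9):
    """Inlier/outlier masks for 2D points via squared Mahalanobis distance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries (assumption: n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Chi-squared quantile for 2 degrees of freedom.
    threshold = -2.0 * math.log(1.0 - confidence_level)
    inliers = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        inliers.append(d2 < threshold)
    outliers = [not flag for flag in inliers]
    return inliers, outliers

cluster = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1), (0, 0)]
inlier_mask, outlier_mask = filter_outliers_2d(cluster + [(10, 10)])
```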

generate_homologous_samples(samples=5000, target_cell_key=None, b_s=128, confidence_level=0.9)[source]

Decode homologous normalized expression profiles for context and target species by Monte Carlo sampling.

Parameters:
  • target_cell_key (str or None) – Column name in .obs specifying inferred cell type labels for the target dataset.

  • samples (int, default=5000) – Total number of decoded samples to return per cell type.

  • b_s (int, default=128) – Batch size for decoding iterations.

  • confidence_level (float, default=0.9) – Quantile threshold used in filter_outliers to remove extreme latent embeddings.

Return type:

Tuple[Dict[str, ndarray], Dict[str, ndarray]]

Returns:

  • target_rho_dict (dict of str to ndarray of shape (samples, genes)) – Decoded normalized expression (rho) for shared cell types in the target species.

  • context_rho_dict (dict of str to ndarray of shape (samples, genes)) – Decoded normalized expression (rho) for shared cell types in the context species.

get_batch(array, step, *, perm=None, batch_size=None)[source]

Slice out a minibatch and move to device.

Parameters:
  • array (Tensor or sequence) – Data to batch (e.g., features, labels, indices).

  • step (int) – Batch index.

  • perm (sequence of int, optional) – Permutation for shuffling; if None, uses contiguous slices.

  • batch_size (int, optional) – Number of samples per batch; defaults to self.config_dict[‘b_s’].

Returns:

The selected batch, on the configured device if a Tensor.

Return type:

Tensor or sequence
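The batching logic can be sketched on a plain list. The sketch omits the transfer to the device and uses an explicit batch_size default instead of self.config_dict[‘b_s’]; the name get_batch_sketch is hypothetical.

```python
def get_batch_sketch(array, step, perm=None, batch_size=128):
    """Return the step-th minibatch, optionally following a permutation."""
    start, stop = step * batch_size, (step + 1) * batch_size
    if perm is None:
        # Contiguous slice; the last batch may be shorter.
        return array[start:stop]
    return [array[i] for i in perm[start:stop]]

data = list(range(10))
shuffled_order = list(reversed(range(10)))
```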

get_representation(eval_model, save_intermediate=False, save_libsize=False)[source]

Compute and store biological latent and/or library latent representations for a dataset.

Parameters:
  • eval_model ({'context','target'}) – Which dataset to encode.

  • save_intermediate (bool, default=False) – If True, store the outer encoder output in .obsm[‘inter’].

  • save_libsize (bool, default=False) – If True, store library mean/log-std in .obsm[‘l_mu’]/[‘l_sig’].

hmu(model_name, save_key)[source]
initialize(initialize='both')[source]

Instantiate or reinstantiate context and/or target encoder and decoder modules.

Parameters:

initialize ({'context', 'context_decoder', 'target', 'both'}, default='both') – Which sub-model(s) to initialize.

load(models='both', save_key='')[source]

Load previously saved configs, optimizers, and weights.

Parameters:
  • models ({'context', 'target', 'both'})

  • save_key (str)

static mode_histogram(x)[source]

Return the mid-point of the histogram bin with the highest count. Helper for .self.similarity_cell_types

Parameters:

x (np.ndarray) – Array of values for which to calculate the modal value.

Returns:

mode – modal value of the empirical distribution

Return type:

np.float32

static most_frequent(arr)[source]

Return the modal value of a 1D array. Helper for the label_transfer function.

Parameters:

arr (array-like)

Returns:

The value occurring most often.

Return type:

element
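A minimal sketch of the modal-value lookup, using collections.Counter from the standard library. The name most_frequent_sketch is hypothetical, and how the library breaks ties is not specified here.

```python
from collections import Counter

def most_frequent_sketch(arr):
    """Return the value that occurs most often in a 1D sequence."""
    return Counter(arr).most_common(1)[0][0]

neighbor_labels = ["B cell", "T cell", "B cell", "NK cell", "B cell"]
winner = most_frequent_sketch(neighbor_labels)
```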

opt(model_name, save_key)[source]
pkl(model_name, save_key)[source]
pth(model_name, save_key)[source]
ret_pred_df(pred_key, target_label_key, context_label_key)[source]

Compute a normalized confusion matrix (%) and balanced accuracy for label transfer.

This evaluates how well the predicted context-derived labels match the true labels on the target dataset.

Parameters:
  • pred_key (str) – Key in self.mdata.mod[target_key].obs under which predicted labels are stored.

  • target_label_key (str) – Key in self.mdata.mod[target_key].obs for the ground-truth labels.

  • context_label_key (str) – Key in self.mdata.mod[context_key].obs for the reference context labels.

Return type:

Tuple[DataFrame, float]

Returns:

  • df (pd.DataFrame) – Confusion matrix (in percent) with - index: sorted labels of target_label_key, - columns: sorted labels of context_label_key, - values: percentage of cells with true label = row and predicted label = column.

  • bas (float) – Balanced accuracy score computed only over the subset of cells whose true labels also appear in the context set.
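The two return values can be illustrated with plain dictionaries. The sketch assumes the confusion matrix is normalized per row (per true label) and that balanced accuracy is the mean per-class recall over true labels that also occur in the context set. Both function names are hypothetical.

```python
from collections import Counter, defaultdict

def confusion_percent(true_labels, pred_labels):
    """Row-normalized confusion matrix in percent as a nested dict."""
    counts = defaultdict(Counter)
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    return {
        t: {p: 100.0 * c / sum(row.values()) for p, c in row.items()}
        for t, row in counts.items()
    }

def balanced_accuracy(true_labels, pred_labels, context_labels):
    """Mean per-class recall over cells whose true label occurs in the context."""
    known = set(context_labels)
    hits, totals = Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t in known:
            totals[t] += 1
            hits[t] += int(t == p)
    return sum(hits[t] / totals[t] for t in totals) / len(totals)

true = ["B", "B", "T", "T", "T", "NK"]
pred = ["B", "T", "T", "T", "T", "B"]
df_like = confusion_percent(true, pred)
bas = balanced_accuracy(true, pred, context_labels=["B", "T"])
```

In this toy input the NK cell has no counterpart in the context labels, so it appears in the confusion matrix but does not enter the balanced accuracy.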

return_similarity_df(max_sample_targ=2000, max_sample_cont=50, scale='none')[source]

Compute and return similarity scores between target and context cell types by sampling from latent cell type distributions and calculating likelihood differences. Computes the modal value of the resulting distribution as similarity score.

Parameters:
  • max_sample_targ (int, default=2000) – Number of samples from the target cell types.

  • max_sample_cont (int, default=50) – Number of samples from the context cell types per target cell.

  • scale ({'min_max', 'max', 'none'}, default='none') – Scaling strategy across rows: min-max normalization, max-based inversion, or no scaling.

Returns:

df – Similarity scores with - index: target cell types, - columns: context cell types.

Return type:

DataFrame

save(models='both', save_key='')[source]

Serialize model configuration, optimizers, and context and/or target scVI weights to disk.

Parameters:
  • models ({'context', 'target', 'both'}) – Which sub-models to save.

  • save_key (str) – Suffix for filenames.

save_mdata(save_key)[source]

Write the assembled MuData object to .h5mu.

Parameters:

save_key (str) – Suffix for the data filename.

similarity_metric(target_ind, context_ind, b_s=None, b_sc=None, display=True)[source]

Compute negative log-likelihood based similarity scores for target and context cells specified by their indices.

Parameters:
  • target_ind (array of integers) – Target cell indices in self.mdata[target_key].X, shape (n_target, 1).

  • context_ind (array of integers) – Context cell neighbors in self.mdata[context_key].X, shape (n_target, k). Calculates the similarity of k candidates for a specific entry in the first axis.

  • b_s (int, optional) – Batch size for target.

  • b_sc (int, optional) – Chunk size for context neighbors.

  • display (bool) – If True, prints progress.

Returns:

similarities – Contains the similarity scores between the target cells and their k context candidates, shape (n_target, k).

Return type:

ndarray

similarity_metric_on_latent_space(precompute_neighbors=True)[source]

Compute similarity scores between the context and target datasets, either for a precomputed set of neighbors obtained from a latent space neighborhood search (to speed up computation) or for all context-target pairs. The latter should only be done for small datasets.

Parameters:

precompute_neighbors (bool) – If True, precomputes a set of 250 Euclidean neighbors on the aligned latent space.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • similarities (ndarray of shape (target.n_obs, 250) or (target.n_obs, context.n_obs))

  • context_ind (ndarray of shape (target.n_obs, 250) or (target.n_obs, context.n_obs))

train_context(epochs=40, raise_beta=True, save_model=True, train_decoder_only=False, save_key='')[source]

Pretrain the context scVI model on the context dataset.

Parameters:
  • epochs (int, default=40) – Number of training epochs.

  • raise_beta (bool, default=True) – If True, increase KL weight over initial epochs.

  • save_model (bool, default=True) – If True, save model parameters after training.

  • train_decoder_only (bool, default=False) – If True, freeze encoders and train only the decoder.

  • save_key (str, default='') – Filename suffix when saving.

train_target(epochs=40, save_model=True, raise_beta=True, raise_eta=True, save_key='')[source]

Train the target-side scVI model, optionally aligning to context.

Parameters:
  • epochs (int, default=40) – Number of training epochs.

  • save_model (bool, default=True) – Save parameters after training.

  • raise_beta (bool, default=True) – If True, increase KL weight over initial epochs.

  • raise_eta (bool, default=True) – If True, increase alignment weight over initial epochs.

  • save_key (str, default='') – Suffix for saved files.

transfer_labels_cell(target_ind, context_obs_transfer)[source]

Calculate similarity scores between a specific target cell, specified by its index in self.mdata[target_key].X, and all context cells. Transfers the labels specified in context_obs_transfer. Returns a dataframe of context cells sorted by similarity score.

Parameters:
  • target_ind (int) – Target cell index.

  • context_obs_transfer (str or List of str) – Observation key from context dataset to return as columns in the output (e.g., ‘cell_type’).

Returns:

Context labels, source indices, and similarity scores with the specified target cell.

Return type:

DataFrame

transfer_labels_data(context_obs_transfer, top_neigh=25, write_sim=False)[source]

Assign context-derived labels via similarity scores to each target cell by majority vote among its top candidates.

For each observation key in context_obs_transfer, finds the top_neigh most similar context cells (based on decoder likelihood in latent space), takes the most frequent label among those neighbors, and writes it into self.mdata.mod[target_key].obs[‘pred_sim_<obs_key>’]. When target cell annotation is unknown, the inferred values of the last entry in context_obs_transfer will serve as a replacement for target cell annotation in downstream analyses.

Parameters:
  • context_obs_transfer (List of str or str) – One or more keys in self.mdata.mod[context_key].obs whose values to transfer.

  • top_neigh (int, default=25) – Number of nearest neighbors to consider for the majority vote.

  • write_sim (bool, default=False) – If True, also stores raw similarity scores and neighbor indices in self.mdata.mod[target_key].obsm[‘similarities’] and [‘similarities_ind’].
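The majority vote for one target cell can be sketched as follows. The helper transfer_label is hypothetical and assumes that a higher similarity score means a better match; if the scores are stored as negative log-likelihoods with the opposite orientation, the sort order would need to be flipped.

```python
from collections import Counter

def transfer_label(similarities, candidate_labels, top_neigh=25):
    """Majority vote over the labels of the top_neigh most similar candidates."""
    ranked = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True,  # assumption: larger score = more similar
    )
    votes = Counter(candidate_labels[i] for i in ranked[:top_neigh])
    return votes.most_common(1)[0][0]

sims = [0.9, 0.1, 0.8, 0.7, 0.2]
labels = ["T cell", "B cell", "T cell", "B cell", "B cell"]
```

The example shows that the result depends on top_neigh: restricting the vote to the three best candidates favors one label, while using all five favors the other.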

static update_param(parameter, min_value, max_value, steps)[source]

Linearly increment parameter toward max_value over steps.

Parameters:
  • parameter (float) – Current parameter value.

  • min_value (float) – Starting value.

  • max_value (float) – Final cap.

  • steps (int) – Number of increments until max.

Returns:

Updated (and capped) parameter.

Return type:

float
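A plausible reading of this schedule is a fixed increment of (max_value - min_value) / steps per call, capped at max_value. The sketch below implements that reading; it is an assumption about the update rule, not a copy of the library code, and the name update_param_sketch is hypothetical.

```python
def update_param_sketch(parameter, min_value, max_value, steps):
    """Move parameter one linear step toward max_value and cap it there."""
    increment = (max_value - min_value) / steps
    return min(parameter + increment, max_value)

# Ramp a KL weight from 0.1 to 1.0 over 10 updates, then keep it capped.
beta = 0.1
for _ in range(15):
    beta = update_param_sketch(beta, 0.1, 1.0, 10)
```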

scspecies.plot module

scspecies.plot.is_bright(hex_color)[source]

Determine whether a hex RGB color is “bright” based on luminance, to choose black or white text for readability. Helper for label_transfer_acc.

Parameters:

hex_color (str) – Hex code (e.g. ‘#RRGGBB’).

Returns:

‘black’ if background is light, ‘white’ otherwise.

Return type:

str

scspecies.plot.label_transfer_acc(df_nns, df_sim, save_key=None)[source]

Compare balanced-accuracy of label-transfer by data-level NNs vs. scSpecies similarity-based label transfer and plot horizontal bar stacks of top-k context votes.

Parameters:
  • df_nns (pd.DataFrame) – Confusion-matrix-based accuracy of kNN transfers.

  • df_sim (pd.DataFrame) – Confusion-matrix-based accuracy using scSpecies similarity.

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.load_and_filter_pathways(gmt_path, adata, min_genes=5)[source]

Load pathway gene sets from a GMT file and filter to those with at least min_genes overlapping with adata.var_names.

Parameters:
  • gmt_path (str) – Path to the .gmt file.

  • adata (AnnData) – AnnData object with .var_names (genes).

  • min_genes (int) – Minimum number of overlapping genes to keep a pathway.

Returns:

filtered_pathways – Mapping of pathway names to lists of overlapping gene symbols.

Return type:

dict

scspecies.plot.plot_2D_representation(adata_concat, rep_key='X_umap', plot_annot='cell_type_fine', context_species='mouse', target_species='human', save_key=None)[source]

Scatter dataset representation of context vs. target in 2D (e.g., UMAP) with shared color mapping based on labels.

Parameters:
  • adata_concat (MuData) – Combined MuData with .obsm[rep_key] for both species.

  • rep_key (str, default='X_umap') – Key in .obsm for 2D coordinates.

  • plot_annot (str, default='cell_type_fine') – Observation key for the categorical annotation.

  • context_species (str, default='mouse')

  • target_species (str, default='human')

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.plot_lfc(lfc_dict, prob_delta=0.9, save_key=None)[source]

Scatter-plot Log2-Fold-change versus probability for each cell type, highlighting and annotating top up- and down-regulated genes.

Parameters:
  • lfc_dict (list) – List of LFC dataframes.

  • prob_delta (float, default=0.9) – Probability threshold for calling significant LFC.

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.plot_lfc_comparison(model, lfc_dict, save_key=None)[source]

Generate and display a grid of scatter plots comparing log₂‐fold changes estimated by scSpecies against LFC computed directly from the data.

Parameters:
  • model (scSpecies) – A trained and evaluated scSpecies model instance.

  • lfc_dict (dict of {str: pandas.DataFrame}) – Dictionary of LFC dataframes.

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.plot_prototype_sim_heatmap(df, save_key=None)[source]

Heatmap of prototype-similarity between target (rows) and context (columns) cell types, with top-2 matches annotated by rank.

Parameters:
  • df (pd.DataFrame) – Similarity matrix (target cell types × context cell types).

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.plot_similarity(adata_concat, df_neigbor, human_ind, rep_key='X_umap', plot_annot='cell_type_fine', context_species='mouse', target_species='human', save_key=None)[source]

Scatter dataset representation of context vs. target in 2D (e.g., UMAP) colored by similarity to a specified target cell.

Parameters:
  • adata_concat (MuData) – Combined MuData with .obsm[rep_key] for both species.

  • df_neigbor (pd.DataFrame) – DataFrame with columns [‘index’, ‘similarity_score’] for a single target cell.

  • human_ind (int) – Index of the target cell in adata_concat.

  • rep_key (str, default='X_umap') – Key in .obsm for 2D coordinates.

  • plot_annot (str, default='cell_type_fine') – Observation key for labeling the target cell.

  • context_species (str, default='mouse')

  • target_species (str, default='human')

  • save_key (str or None, default=None) – If a string, the plot will be saved to figures/{save_key}.png. If None, it will only be displayed.

scspecies.plot.progressive_moving_average(y, max_window=6000)[source]

Compute a moving average over a 1D array with a window that grows linearly (capped by max_window) to smooth early iterations more strongly and later ones less. Helper for plot_prototype_sim_history.

Parameters:
  • y (np.ndarray) – Input 1D array of values (e.g., losses or metrics over iterations).

  • max_window (int, default=6000) – Maximum size of the moving window.

Returns:

Smoothed values of the same shape as y.

Return type:

np.ndarray
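
A growing-window average can be sketched as follows. The exact growth rate used by scspecies is not stated here, so this sketch assumes a trailing window of size min(i + 1, max_window) at index i, and it works on plain Python lists rather than NumPy arrays. The name progressive_moving_average_sketch is hypothetical.

```python
def progressive_moving_average_sketch(y, max_window=6000):
    """Trailing moving average whose window at index i is min(i + 1, max_window)."""
    # Prefix sums let each window sum be computed in constant time.
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)

    smoothed = []
    for i in range(len(y)):
        window = min(i + 1, max_window)  # assumed growth rule
        start = i + 1 - window
        smoothed.append((prefix[i + 1] - prefix[start]) / window)
    return smoothed


demo_smoothed = progressive_moving_average_sketch([1.0, 3.0, 5.0, 7.0], max_window=2)
print(demo_smoothed)
```

The output has the same length as the input, matching the documented return shape.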

scspecies.plot.ret_sign(number)[source]

Return ‘+’ if number ≥ 0, else ‘-’. Helper for label_transfer_acc.

Parameters:

number (float)

Returns:

‘+’ or ‘-’

Return type:

str

scspecies.plot.return_palette(names, col_dict={})[source]

Build a color mapping for a list of labels, using predefined overrides and extending with Glasbey palette for unknowns.

Parameters:
  • names (sequence of str) – Labels to assign colors.

  • col_dict (dict, optional) – Predefined name→hex mappings.

Returns:

Mapping from each unique name in names to a hex color code.

Return type:

dict[str, str]
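
The override-then-extend logic can be sketched like this. The fallback list holds placeholder hex codes standing in for the Glasbey palette, and return_palette_sketch is a hypothetical name; the real function may assign colors in a different order.

```python
# Placeholder colors standing in for a Glasbey palette (assumption).
FALLBACK_COLORS = ["#d60000", "#8c3bff", "#018700", "#00acc6", "#97ff00"]


def return_palette_sketch(names, col_dict=None, fallback=FALLBACK_COLORS):
    """Map each unique name to a hex color, preferring predefined overrides."""
    overrides = col_dict or {}
    palette = {}
    next_color = 0
    for name in names:
        if name in palette:
            continue  # already assigned
        if name in overrides:
            palette[name] = overrides[name]
        else:
            # Cycle through the fallback colors for labels without an override.
            palette[name] = fallback[next_color % len(fallback)]
            next_color += 1
    return palette


demo_palette = return_palette_sketch(
    ["B cell", "T cell", "B cell", "NK"], {"T cell": "#000000"}
)
print(demo_palette)
```

The sketch uses None as the default for col_dict, so no dictionary is shared between calls.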

scspecies.preprocessing module

class scspecies.preprocessing.create_mdata[source]

Bases: object

Builder for MuData container that is used by scSpecies to align context & target AnnData datasets.

Handles downloading a gene-translation table from the mouse to the human genome, preprocessing a “context” AnnData and “target” AnnData objects from potentially multiple species, and saving the final MuData object.

__init__(adata, batch_key, cell_key, dataset_name='mouse', NCBI_Taxon_ID=10090, n_top_genes=None, min_non_zero_genes=0.025, min_cell_type_size=20, min_batch_size=20)[source]

Initialize and preprocess the context dataset.

Steps:

  1. One-hot encode experimental batches.

  2. Calculate library size encoder prior parameters for scVI.

  3. Subset to top HVGs and filter out cells with low expression patterns as well as rare cell types and batches (optionally).

Parameters:
  • adata (ad.AnnData) – AnnData used as a context in scSpecies.

  • batch_key (str) – Observation key for experimental batch labels.

  • cell_key (str) – Observation key for cell-type annotation.

  • dataset_name (str, optional) – Tag for the context dataset (default ‘mouse’).

  • NCBI_Taxon_ID (int, optional) – Taxonomy ID of the context species (default mouse - 10090).

  • n_top_genes (int or None, optional) – Number of HVGs to retain (None to skip) (default None).

  • min_non_zero_genes (float, optional) – Min fraction of nonzero genes per cell (default 0.025).

  • min_cell_type_size (int, optional) – Min cells per cell-type, cell types with fewer samples are removed (default 20).

  • min_batch_size (int, optional) – Min cells per batch for encoding; batches with fewer samples are removed (default 20).

Effects:
  • Ensures a `data/` directory exists.

  • Annotates `adata.uns[‘metadata’]` with context dataset info.

  • One-hot encodes batch labels, dropping any batches smaller than `min_batch_size`.

  • Computes per-batch library size prior parameters.

  • Subsets to top highly variable genes if `n_top_genes` is not None.

  • Filters out cells with low gene detection and rare cell-types.

  • Stores the processed AnnData in `self.dataset_collection`.

static compute_lib_prior_params(adata)[source]

Compute scVI library size prior parameters for each cell.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with raw counts in adata.X.

Effects:
  • Within each batch (from `adata.uns[‘metadata’][‘batch_key’]`), calculates the mean and standard deviation of log-total counts.

  • Stores values in `adata.obs[‘library_log_mean’]` and `adata.obs[‘library_log_std’]` as float32 columns.

Return type:

AnnData
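
The per-batch statistics can be sketched in plain Python. This sketch takes per-cell total counts and batch labels as lists instead of an AnnData object, assumes every cell has a positive total count, and uses the population standard deviation; whether scspecies uses the population or sample estimator is not stated here. The name library_prior_params_sketch is hypothetical.

```python
import math
import statistics
from collections import defaultdict


def library_prior_params_sketch(total_counts, batches):
    """Return per-cell (mean, std) of log total counts, computed within each batch."""
    log_totals = defaultdict(list)
    for total, batch in zip(total_counts, batches):
        log_totals[batch].append(math.log(total))

    # One (mean, std) pair per batch; pstdev is an assumption.
    stats = {
        batch: (statistics.fmean(values), statistics.pstdev(values))
        for batch, values in log_totals.items()
    }

    # Broadcast the batch statistics back to every cell.
    means = [stats[batch][0] for batch in batches]
    stds = [stats[batch][1] for batch in batches]
    return means, stds


demo_means, demo_stds = library_prior_params_sketch([100, 100, 1000], ["b1", "b1", "b2"])
print(demo_means, demo_stds)
```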

static encode_batch_labels(adata, min_batch_size=None)[source]

One‐hot encode experimental batch labels, excluding small batches.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix with batch labels in adata.obs[…].

  • min_batch_size (int) – Smallest batch size to keep; batches with fewer cells are removed, must be >= 0.

Effects:
  • Drops any batch categories with fewer than `min_batch_size` cells.

  • Fits a OneHotEncoder to remaining batch labels.

  • Saves the encoded batch matrix to `adata.obsm[‘batch_label_enc’]`.

  • Builds `adata.uns[batch_dict]`, mapping each cell-type (and ‘unknown’) to batch labels in which they have samples.

Return type:

AnnData
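
Dropping small batches before one-hot encoding can be sketched without scikit-learn. The function name encode_batches_sketch and the sorted column order are assumptions made for illustration; the library itself fits a OneHotEncoder, as listed above.

```python
from collections import Counter


def encode_batches_sketch(batches, min_batch_size=0):
    """Drop batches with fewer than min_batch_size cells, then one-hot encode the rest."""
    counts = Counter(batches)
    kept = sorted(b for b, n in counts.items() if n >= min_batch_size)
    column = {b: j for j, b in enumerate(kept)}

    # Mask of cells that survive the batch-size filter.
    keep_mask = [b in column for b in batches]

    encoded = []
    for b in batches:
        if b in column:
            row = [0] * len(kept)
            row[column[b]] = 1
            encoded.append(row)
    return kept, keep_mask, encoded


demo_kept, demo_mask, demo_encoded = encode_batches_sketch(
    ["a", "a", "b", "c", "c"], min_batch_size=2
)
print(demo_kept, demo_mask, demo_encoded)
```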

static filter_cells(adata, min_non_zero_genes, min_cell_type_size)[source]

Filter cells based on minimum non-zero gene fraction and cell‐type size.

Parameters:
  • adata (anndata.AnnData) – The annotated data matrix to filter.

  • min_non_zero_genes (float) – Minimum fraction of genes that must have nonzero counts in a cell.

  • min_cell_type_size (int) – Minimum number of cells required to retain any given cell‐type.

Effects:
  • Removes cells with fewer than `min_non_zero_genes * n_vars` detected genes.

  • If a cell-type key is set in `adata.uns[‘metadata’][‘cell_key’]`, discards any cell-types with fewer than `min_cell_type_size` cells.

Return type:

AnnData
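
The two filters can be sketched on a small count matrix given as nested lists. The order of the steps (gene-detection filter first, then the cell-type size filter on the surviving cells) is an assumption, and filter_cells_sketch is a hypothetical name.

```python
from collections import Counter


def filter_cells_sketch(counts, cell_types, min_non_zero_genes, min_cell_type_size):
    """Return indices of cells that pass both filters."""
    n_vars = len(counts[0])

    # Step 1: keep cells with at least min_non_zero_genes * n_vars detected genes.
    passed = [
        i
        for i, row in enumerate(counts)
        if sum(1 for value in row if value > 0) >= min_non_zero_genes * n_vars
    ]

    # Step 2: among the surviving cells, drop cell types that are too rare.
    type_sizes = Counter(cell_types[i] for i in passed)
    return [i for i in passed if type_sizes[cell_types[i]] >= min_cell_type_size]


demo_counts = [[1, 0, 2, 0], [0, 0, 0, 1], [3, 1, 0, 0], [1, 1, 1, 1]]
demo_types = ["hep", "hep", "hep", "kc"]
demo_keep = filter_cells_sketch(
    demo_counts, demo_types, min_non_zero_genes=0.5, min_cell_type_size=2
)
print(demo_keep)
```

Cell 1 is removed for low gene detection, and cell 3 is removed because its cell type has only one member.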

pred_labels_nns_hom_genes(adata, context_label_keys=None, k=25)[source]

Predicts target cell-type labels using data-level k-nearest neighbor search results over homologous genes shared with the context dataset. Additionally calculates the uncertainty score that scSpecies uses to decide which cells are aligned during fine-tuning.

Parameters:
  • adata (anndata.AnnData) – Target dataset that contains the neighbor indices in adata.obsm[‘ind_neigh_nns’].

  • context_label_keys (list of str) – Keys in the context dataset’s obs corresponding to categorical labels to be transferred (e.g., cell-type, tissue-type).

  • k (int) – Number of neighbors to consider for majority voting.

Effects:

For each key in `context_label_keys`, assigns:

  • adata.obs[‘pred_nns_<label_key>’]: predicted label (most frequent among neighbors).

  • adata.obs[‘top_percent_<label_key>’]: confidence score based on relative neighbor rank.

Return type:

AnnData
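
Majority voting over the first k neighbors can be sketched as below. The agreement fraction returned here is a simple stand-in and is not the documented top_percent score, whose exact definition is not given in this section; majority_vote_sketch is a hypothetical name.

```python
from collections import Counter


def majority_vote_sketch(neighbor_indices, context_labels, k=25):
    """Predict one label per target cell from its first k context neighbors."""
    predictions, agreement = [], []
    for row in neighbor_indices:
        considered = row[:k]
        votes = Counter(context_labels[j] for j in considered)
        label, count = votes.most_common(1)[0]
        predictions.append(label)
        # Fraction of considered neighbors that carry the winning label.
        agreement.append(count / len(considered))
    return predictions, agreement


demo_labels = ["hep", "hep", "kc", "kc", "kc"]
demo_neighbors = [[0, 1, 2], [2, 3, 4], [4, 0, 1]]
demo_pred, demo_agree = majority_vote_sketch(demo_neighbors, demo_labels, k=3)
print(demo_pred, demo_agree)
```

In the library, the neighbor indices would come from adata.obsm[‘ind_neigh_nns’].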

return_mdata(return_mdata=True, save=True, save_path=PosixPath('data'), save_name='mudata')[source]

Optionally save and/or return the assembled MuData object.

Parameters:
  • return_mdata (bool, optional) – If True, return the MuData object at the end (default True).

  • save (bool, optional) – If True, write the MuData object to disk (default True).

  • save_path (pathlib.Path, optional) – Directory in which to save the file; created if missing (default Path(“data”)).

  • save_name (str, optional) – Filename stem for the .h5mu file; ‘.h5mu’ is appended (default ‘mudata’).

Effects:
  • If `save` is True:

    • Ensures that save_path exists, creating it if necessary.

    • Writes the MuData assembled from self.dataset_collection to save_path/<save_name>.h5mu.

    • Prints messages about directory creation and file saving.

  • If `return_mdata` is True:

    • Returns the MuData object constructed from self.dataset_collection.

Return type:

MuData

setup_target_adata(adata, batch_key, cell_key=None, eval_nns_keys=None, dataset_name='human', NCBI_Taxon_ID=9606, n_top_genes=None, compute_log1p=True, nn_kwargs=None)[source]

Preprocess and align a target AnnData against the context.

Steps:

  1. One-hot encode experimental batches.

  2. Calculate library size encoder prior parameters for scVI.

  3. Subset to top HVGs and filter out cells with low expression patterns as well as rare cell types and batches (optionally).

  4. Translate target gene symbols to context homologs.

  5. Compute and evaluate data-level nearest neighbors on the shared homologous gene set.

Parameters:
  • adata (ad.AnnData) – Target dataset.

  • batch_key (str) – Observation key for experimental batch labels.

  • cell_key (str or None) – Observation key for cell types (None if unknown).

  • eval_nns_keys (List of str or None) – List of context dataset obs keys that should be transferred by scSpecies. Defaults to [cell_key].

  • dataset_name (str, optional) – Defaults to ‘human’.

  • NCBI_Taxon_ID (int, optional) – Taxonomy ID for the target species (default human - 9606).

  • n_top_genes (int or None, optional) – Number of HVGs to keep (None to skip) (default None).

  • compute_log1p (bool, optional) – Use log1p counts for neighbor search if True (default True).

  • nn_kwargs (dict, optional) – Args for sklearn.neighbors.NearestNeighbors. Defaults to {‘n_neighbors’: 250, ‘metric’: ‘cosine’}.

Effects:
  • Updates `adata.uns[‘metadata’]` with target dataset info.

  • Filters and one-hot encodes batch (and cell-type, if provided).

  • Computes library size prior parameters.

  • Calls `translate_gene_list` to add translated gene symbols in the context genome to `var_names_transl`.

  • Subsets to HVGs if `n_top_genes` is not None.

  • Filters out low-coverage cells and rare cell-types.

  • Identifies intersecting homologous genes with the context and performs a nearest-neighbor search on log1p (or raw) counts.

  • Stores neighbor indices in `adata.obsm[‘ind_neigh_nns’]`.

  • Calculates the percentage of neighbor label agreement and transfers labels based on the data-level nearest neighbor search.

  • Inserts the processed AnnData into `self.dataset_collection`.
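
The data-level neighbor search on shared homologous genes can be sketched in pure Python. The library uses sklearn.neighbors.NearestNeighbors with a cosine metric by default; this brute-force version only illustrates the idea on nested lists, and all function names and toy values are hypothetical.

```python
import math


def _select(matrix, genes, shared, log1p):
    """Reorder columns to the shared gene order and optionally apply log1p."""
    cols = [genes.index(g) for g in shared]
    transform = math.log1p if log1p else float
    return [[transform(row[c]) for c in cols] for row in matrix]


def _cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def cosine_neighbors_sketch(target, target_genes, context, context_genes,
                            n_neighbors=2, log1p=True):
    """Return the shared genes and, per target cell, its nearest context cells."""
    context_set = set(context_genes)
    shared = [g for g in target_genes if g in context_set]
    t = _select(target, target_genes, shared, log1p)
    c = _select(context, context_genes, shared, log1p)
    neighbors = []
    for cell in t:
        order = sorted(range(len(c)), key=lambda j: _cosine_distance(cell, c[j]))
        neighbors.append(order[:n_neighbors])
    return shared, neighbors


# Target gene names are assumed to be already translated to context symbols.
demo_shared, demo_nn = cosine_neighbors_sketch(
    target=[[10, 5, 0], [0, 2, 8]],
    target_genes=["Alb", "non_hom_1", "Cd3e"],
    context=[[0, 9, 1], [7, 0, 3], [1, 8, 0]],
    context_genes=["Cd3e", "Alb", "Xist"],
)
print(demo_shared, demo_nn)
```

Genes without a homolog (here the ‘non_hom_1’ placeholder) drop out of the intersection and do not affect the search.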

static subset_to_hvg(adata, n_top_genes)[source]

Subset dataset to the top highly variable genes using the Seurat method.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix to subset.

  • n_top_genes (int) – Number of top highly variable genes to select.

Effects:

Subsets `adata` to the top `n_top_genes` highly variable genes.

Return type:

AnnData

translate_gene_list(adata)[source]

Translate gene symbols in var_names of a target AnnData to homologous context-species symbols.

Downloads HOM_AllOrganism.rpt if it is not already present and the context-target species pair consists of human, mouse, rat, or zebrafish. Falls back to map_homologs_silent for unsupported species pairs.

Parameters:

adata (anndata.AnnData) – Target AnnData whose var_names will be translated.

Effects:
  • Prints a status message about which datasets are being translated.

  • Downloads and saves `HOM_AllOrganism.rpt` if not already present.

  • Reads the homology report into a DataFrame.

  • Filters the table to context and target species.

  • Computes a translated gene list via `get_key` or falls back to `map_homologs_silent` if species is not human, mouse, rat or zebrafish.

  • Sets `adata.var[‘var_names_transl’]` to the mapped names.

Return type:

AnnData

scspecies.preprocessing.download_datasets()[source]

Download liver cell .h5ad datasets into the ./data directory. Files that are already present are skipped.

Raises:

requests.HTTPError – If any of the dataset URLs returns a bad status.

scspecies.preprocessing.get_key(gene, homology_targsp_df, homology_context_df, i)[source]

Retrieve the homologous context gene symbol for a given target gene using homology tables from informatics.jax.org/downloads/reports/HOM_AllOrganism.rpt. Can only be used for mouse, rat, human, and zebrafish context-target dataset pairs.

Parameters:
  • gene (str) – Target-species gene symbol to look up in homology_targsp_df.

  • homology_targsp_df (pandas.DataFrame) – Homology table for the target species (columns include ‘Symbol’ and ‘DB Class Key’).

  • homology_context_df (pandas.DataFrame) – Homology table for the context species (same key column).

  • i (int) – Index of the gene in the original list, used to name unmapped genes.

Returns:

Context‐species gene symbol if found, otherwise ‘non_hom_<i>’.

Return type:

str
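
The lookup can be sketched with lists of dictionaries standing in for the two DataFrames. The sketch assumes that rows sharing a ‘DB Class Key’ are homologs and returns the first context match; the key values below are made up, and get_key_sketch is a hypothetical name.

```python
def get_key_sketch(gene, target_rows, context_rows, i):
    """Return the context symbol sharing a 'DB Class Key' with gene, else 'non_hom_<i>'."""
    class_keys = [row["DB Class Key"] for row in target_rows if row["Symbol"] == gene]
    for key in class_keys:
        for row in context_rows:
            if row["DB Class Key"] == key:
                return row["Symbol"]
    return f"non_hom_{i}"  # placeholder name for unmapped genes


# Made-up homology rows for illustration only.
demo_target_rows = [
    {"Symbol": "ALB", "DB Class Key": 101},
    {"Symbol": "CD3E", "DB Class Key": 202},
]
demo_context_rows = [{"Symbol": "Alb", "DB Class Key": 101}]

demo_translated = [
    get_key_sketch(gene, demo_target_rows, demo_context_rows, i)
    for i, gene in enumerate(["ALB", "CD3E", "XYZ"])
]
print(demo_translated)
```

Because the index i is part of the placeholder, unmapped genes keep unique names in the translated list.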

scspecies.preprocessing.map_homologs(gene_list, target_NCBI_Taxon_ID, context_NCBI_Taxon_ID)[source]

Maps a list of gene symbols from the target species to their homologous symbols of the context species using MyGeneInfo.

Parameters:
  • gene_list (list[str]) – Gene symbols in the target species to be translated.

  • target_NCBI_Taxon_ID (int) – NCBI Taxonomy ID of the target species.

  • context_NCBI_Taxon_ID (int) – NCBI Taxonomy ID of the source (context) species.

Returns:

Homologous gene symbols in the context species, with ‘non_hom_<i>’ for non-homologous genes.

Return type:

list[str]

scspecies.preprocessing.map_homologs_silent(gene_list, target_NCBI_Taxon_ID, context_NCBI_Taxon_ID)[source]

Same as map_homologs, but suppresses all console output, since map_homologs prints a statement for each gene.

Parameters:
  • gene_list (list[str]) – Gene symbols in the target species to be translated.

  • target_NCBI_Taxon_ID (int) – NCBI Taxonomy ID of the target species.

  • context_NCBI_Taxon_ID (int) – NCBI Taxonomy ID of the source (context) species.

Returns:

Homologous gene symbols in the context species, with ‘non_hom_<i>’ for non-homologous genes.

Return type:

list[str]
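
One common way to silence a chatty function is to redirect standard output while it runs. Whether map_homologs_silent is implemented this way is an assumption; noisy_mapper below is a hypothetical stand-in that prints once per gene and performs no real lookup.

```python
import contextlib
import io


def noisy_mapper(gene_list):
    """Hypothetical stand-in for a function that prints once per gene."""
    result = []
    for i, gene in enumerate(gene_list):
        print(f"mapping {gene}")
        result.append(f"non_hom_{i}")
    return result


def call_silently(func, *args, **kwargs):
    """Run func with stdout redirected to an in-memory buffer that is discarded."""
    with contextlib.redirect_stdout(io.StringIO()):
        return func(*args, **kwargs)


demo_result = call_silently(noisy_mapper, ["ALB", "CD3E"])
print(demo_result)
```

Only text written to sys.stdout is suppressed; output on stderr or from subprocesses would still appear.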

scspecies.preprocessing.set_random_seed(seed)[source]

Fix all relevant RNG seeds for reproducibility.

Parameters:

seed (int) – The seed value to use for Python, NumPy, random, and PyTorch.
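
A seeding helper of this kind typically looks like the sketch below. It seeds Python's random module and, only if they are installed, NumPy and PyTorch; the name set_random_seed_sketch is hypothetical, and the actual function may set additional options.

```python
import random


def set_random_seed_sketch(seed):
    """Seed Python's RNG and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed


# Re-seeding with the same value reproduces the same draw.
set_random_seed_sketch(0)
first_draw = random.random()
set_random_seed_sketch(0)
second_draw = random.random()
print(first_draw == second_draw)
```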

Module contents

scspecies - a tool for aligning latent representations of single-cell datasets from different species.