scspecies.preprocessing module

Functions and classes to set up datasets for scSpecies.

class scspecies.preprocessing.create_mdata[source]

Bases: object

Builder for the MuData container that scSpecies uses to align context and target AnnData datasets.

Handles downloading a gene-translation table from the mouse to the human genome, preprocessing a “context” AnnData and one or more “target” AnnData objects (potentially from several species), and saving the final MuData object.

__init__(adata, batch_key, cell_key, dataset_name='mouse', NCBI_Taxon_ID=10090, n_top_genes=None, min_non_zero_genes=0.025, min_cell_type_size=20, min_batch_size=20)[source]

Initialize and preprocess the context dataset.

Steps:

  1. One-hot encode experimental batches.

  2. Calculate library size encoder prior parameters for scVI.

  3. Optionally subset to the top HVGs and filter out cells with low expression, rare cell types, and small batches.

Parameters:
  • adata (ad.AnnData) – AnnData used as a context in scSpecies.

  • batch_key (str) – Observation key for experimental batch labels.

  • cell_key (str) – Observation key for cell-type annotation.

  • dataset_name (str, optional) – Tag for the context dataset (default ‘mouse’).

  • NCBI_Taxon_ID (int, optional) – Taxonomy ID of the context species (default mouse - 10090).

  • n_top_genes (int or None, optional) – Number of HVGs to retain (None to skip) (default None).

  • min_non_zero_genes (float, optional) – Minimum fraction of genes with nonzero counts per cell (default 0.025).

  • min_cell_type_size (int, optional) – Minimum number of cells per cell type; cell types with fewer cells are removed (default 20).

  • min_batch_size (int, optional) – Minimum number of cells per batch for encoding; batches with fewer cells are removed (default 20).
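The two cell-level thresholds above can be read as simple filters. The following is a minimal sketch in plain Python on toy data, not the library's implementation: the helper names `nonzero_fraction` and `frequent_labels` are hypothetical, and both the inclusive comparisons and the order of the two filters are assumptions.

```python
from collections import Counter


def nonzero_fraction(counts):
    """Fraction of genes with a nonzero count in one cell."""
    return sum(1 for c in counts if c != 0) / len(counts)


def frequent_labels(labels, min_size):
    """Labels occurring at least `min_size` times (inclusive bound is an assumption)."""
    freq = Counter(labels)
    return {label for label, n in freq.items() if n >= min_size}


# Toy data: 4 cells x 8 genes, one cell-type label per cell.
cells = [
    [3, 0, 1, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 4, 0, 0, 2, 0],
    [0, 5, 0, 0, 0, 0, 0, 1],
]
cell_types = ["hepatocyte", "hepatocyte", "hepatocyte", "kupffer"]

min_non_zero_genes = 0.25  # toy value; the documented default is 0.025
min_cell_type_size = 2     # toy value; the documented default is 20

# Filter 1: keep cells in which enough genes are detected.
expressed = [
    i for i, row in enumerate(cells)
    if nonzero_fraction(row) >= min_non_zero_genes
]

# Filter 2: among the remaining cells, drop rare cell types.
kept_types = frequent_labels([cell_types[i] for i in expressed], min_cell_type_size)
kept = [i for i in expressed if cell_types[i] in kept_types]
```

In this toy run the all-zero cell fails the detection filter, and the single remaining "kupffer" cell is then removed as a rare cell type.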

Effects:
  • Ensures a `data/` directory exists.

  • Annotates `adata.uns['metadata']` with context dataset info.

  • One-hot encodes batch labels, dropping any batches smaller than `min_batch_size`.

  • Computes per-batch library size prior parameters.

  • Subsets to top highly variable genes if `n_top_genes` is not None.

  • Filters out cells with low gene detection and rare cell-types.

  • Stores the processed AnnData in `self.dataset_collection`.
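The batch-encoding effect can be illustrated with a small dependency-free sketch. It assumes that a batch is kept when it has at least `min_batch_size` cells and that columns follow sorted batch names; `one_hot_batches` is a hypothetical helper, not a function of scSpecies.

```python
from collections import Counter


def one_hot_batches(batch_labels, min_batch_size):
    """Return (kept_row_indices, categories, one_hot_rows).

    Cells from batches with fewer than `min_batch_size` cells are dropped
    before encoding (the inclusive bound is an assumption).
    """
    sizes = Counter(batch_labels)
    categories = sorted(b for b, n in sizes.items() if n >= min_batch_size)
    column = {b: j for j, b in enumerate(categories)}

    kept, rows = [], []
    for i, batch in enumerate(batch_labels):
        if batch not in column:
            continue  # cell belongs to a dropped batch
        row = [0] * len(categories)
        row[column[batch]] = 1
        kept.append(i)
        rows.append(row)
    return kept, categories, rows


batches = ["b1", "b1", "b2", "b1", "b3", "b3"]
kept, categories, encoded = one_hot_batches(batches, min_batch_size=2)
```

Here batch "b2" has a single cell, so its cell is removed and only two indicator columns remain.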

static compute_lib_prior_params(adata)[source]

Compute per-batch scVI library size prior parameters and store them for each cell.

Parameters:

adata (anndata.AnnData) – Annotated data matrix with raw counts in adata.X.

Effects:
  • Within each batch (from `adata.uns['metadata']['batch_key']`), calculates the mean and standard deviation of log-total counts.

  • Stores values in `adata.obs['library_log_mean']` and `adata.obs['library_log_std']` as float32 columns.

Return type:

AnnData
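The per-batch statistics described above can be sketched without any dependencies. This is an illustration only: `library_prior_per_cell` is a hypothetical name, the use of the population standard deviation is an assumption (the documentation does not say which estimator is used), and every cell is assumed to have a nonzero total count.

```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def library_prior_per_cell(count_rows, batch_labels):
    """Per-cell (mean, std) of log total counts, computed within each batch."""
    # Log of the total count per cell; requires a nonzero total.
    log_totals = [math.log(sum(row)) for row in count_rows]

    by_batch = defaultdict(list)
    for value, batch in zip(log_totals, batch_labels):
        by_batch[batch].append(value)

    # Population standard deviation is an assumption of this sketch.
    stats = {b: (fmean(v), pstdev(v)) for b, v in by_batch.items()}

    # Broadcast the batch-level statistics back to every cell.
    means = [stats[b][0] for b in batch_labels]
    stds = [stats[b][1] for b in batch_labels]
    return means, stds


counts = [[1, 1], [2, 6], [5, 5]]  # total counts 2, 8, 10
batches = ["a", "a", "b"]
means, stds = library_prior_per_cell(counts, batches)
```

Cells in the same batch receive identical values, which is why the result can be stored as two per-cell columns.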

return_mdata(return_mdata=True, save=True, save_path=PosixPath('data'), save_name='mudata')[source]

Optionally save and/or return the assembled MuData object.

Parameters:
  • return_mdata (bool, optional) – If True, return the MuData object at the end (default True).

  • save (bool, optional) – If True, write the MuData object to disk (default True).

  • save_path (pathlib.Path, optional) – Directory in which to save the file; created if missing (default Path('data')).

  • save_name (str, optional) – Filename stem for the .h5mu file; ‘.h5mu’ is appended (default ‘mudata’).

Effects:
  • If `save` is True:

    • Ensures that `save_path` exists, creating it if necessary.

    • Writes the MuData assembled from `self.dataset_collection` to `save_path/<save_name>.h5mu`.

    • Prints messages about directory creation and file saving.

  • If `return_mdata` is True:

    • Returns the MuData object constructed from `self.dataset_collection`.

Return type:

MuData
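The path handling described for `save=True` can be sketched with `pathlib` alone. The helper `resolve_output_file` is hypothetical and only covers directory creation and filename construction; writing the MuData object itself is left out.

```python
import tempfile
from pathlib import Path


def resolve_output_file(save_path=Path("data"), save_name="mudata"):
    """Create `save_path` if missing and return the full .h5mu file path."""
    save_path = Path(save_path)
    if not save_path.exists():
        save_path.mkdir(parents=True)
        print(f"Created directory {save_path}")
    # The '.h5mu' suffix is appended to the filename stem.
    return save_path / f"{save_name}.h5mu"


# Usage in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = resolve_output_file(Path(tmp) / "results", "liver")
    created = target.parent.is_dir()
    filename = target.name
```

Passing only a stem keeps the file extension consistent regardless of what the caller supplies as `save_name`.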

setup_target_adata(adata, batch_key, cell_key=None, eval_nns_keys=None, dataset_name='human', NCBI_Taxon_ID=9606, n_top_genes=None, compute_log1p=True, nn_kwargs=None)[source]

Preprocess and align a target AnnData against the context.

Steps:

  1. One-hot encode experimental batches.

  2. Calculate library size encoder prior parameters for scVI.

  3. Optionally subset to the top HVGs and filter out cells with low expression, rare cell types, and small batches.

  4. Translate target gene symbols to context homologs.

  5. Compute and evaluate data-level nearest neighbors on the shared homologous gene set.

Parameters:
  • adata (ad.AnnData) – Target dataset.

  • batch_key (str) – Observation key for experimental batch labels.

  • cell_key (str or None) – Observation key for cell types (None if unknown).

  • eval_nns_keys (list of str or None) – List of context dataset obs keys that should be transferred by scSpecies. Defaults to [cell_key].

  • dataset_name (str, optional) – Tag for the target dataset (default ‘human’).

  • NCBI_Taxon_ID (int, optional) – Taxonomy ID for the target species (default human - 9606).

  • n_top_genes (int or None, optional) – Number of HVGs to keep (None to skip) (default None).

  • compute_log1p (bool, optional) – Use log1p counts for neighbor search if True (default True).

  • nn_kwargs (dict, optional) – Keyword arguments for sklearn.neighbors.NearestNeighbors. Defaults to {‘n_neighbors’: 250, ‘metric’: ‘cosine’}.

Effects:
  • Updates `adata.uns['metadata']` with target dataset info.

  • Filters and one-hot encodes batch (and cell-type, if provided).

  • Computes library size prior parameters.

  • Calls `translate_gene_list` to add translated gene symbols in the context genome to `var_names_transl`.

  • Subsets to HVGs if `n_top_genes` is not None.

  • Filters out low-coverage cells and rare cell-types.

  • Identifies intersecting homologous genes with the context and performs a nearest-neighbor search on log1p (or raw) counts.

  • Stores neighbor indices in `adata.obsm['ind_neigh_nns']`.

  • Calculates the percentage of neighbor label agreement and transfers labels based on the data-level nearest neighbor search.

  • Inserts the processed AnnData into `self.dataset_collection`.
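The data-level neighbor search and label agreement can be illustrated on toy counts that are already restricted to shared homologous genes. This is a brute-force, dependency-free sketch rather than the documented `sklearn.neighbors.NearestNeighbors` call; the helper names are hypothetical, and transferring labels by majority vote among the neighbors is an assumption.

```python
import math
from collections import Counter


def log1p_rows(rows):
    """Apply log(1 + x) to every count."""
    return [[math.log1p(x) for x in row] for row in rows]


def cosine_distance(u, v):
    """1 - cosine similarity; both vectors must be nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_context_cells(target_rows, context_rows, n_neighbors):
    """For each target cell, indices of the closest context cells."""
    neighbors = []
    for t in target_rows:
        ranked = sorted(
            range(len(context_rows)),
            key=lambda j: cosine_distance(t, context_rows[j]),
        )
        neighbors.append(ranked[:n_neighbors])
    return neighbors


# Toy counts over three shared genes.
context_counts = [[10, 0, 1], [8, 1, 0], [0, 9, 2], [1, 7, 3]]
context_labels = ["A", "A", "B", "B"]
target_counts = [[5, 0, 0], [0, 4, 1]]
target_labels = ["A", "B"]

neighbors = nearest_context_cells(
    log1p_rows(target_counts), log1p_rows(context_counts), n_neighbors=2
)

agreement, transferred = [], []
for own_label, idx in zip(target_labels, neighbors):
    neighbor_labels = [context_labels[j] for j in idx]
    # Percentage of neighbors that share the target cell's own label.
    agreement.append(100.0 * neighbor_labels.count(own_label) / len(neighbor_labels))
    # Majority vote among neighbors (an assumption of this sketch).
    transferred.append(Counter(neighbor_labels).most_common(1)[0][0])
```

With the documented default of 250 neighbors and realistic dataset sizes, an indexed search such as the scikit-learn estimator is the practical choice; the brute-force loop here only shows what is being computed.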

static subset_to_hvg(adata, n_top_genes)[source]

Subset dataset to the top highly variable genes using the Seurat method.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix to subset.

  • n_top_genes (int) – Number of top highly variable genes to select.

Effects:

  • Subsets `adata` to the top `n_top_genes` highly variable genes.

Return type:

AnnData

scspecies.preprocessing.download_datasets()[source]

Download the liver cell .h5ad datasets into the ./data directory, skipping files that are already present.

Raises:

requests.HTTPError – If any of the dataset URLs returns a bad status.

scspecies.preprocessing.set_random_seed(seed)[source]

Fix all relevant RNG seeds for reproducibility.

Parameters:

seed (int) – The seed value to use for Python's random module, NumPy, and PyTorch.
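A seeding helper of this kind typically looks like the sketch below. It is not the scSpecies implementation: `seed_everything` is a hypothetical name, and NumPy and PyTorch are imported lazily so the snippet still runs when they are not installed.

```python
import random


def seed_everything(seed):
    """Seed Python's random module and, if available, NumPy and PyTorch."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)  # seeds NumPy's legacy global generator
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same draws.
seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
```

Seeding alone does not guarantee bit-identical results on GPUs, where some operations are nondeterministic unless deterministic algorithms are also requested.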