5. Harmonizing data with scVI and scANVI

[ ]:
# If running in Colab, navigate to Runtime -> Change runtime type
# and ensure you're using a Python 3 runtime with GPU hardware accelerator
# installation in Colab can take several minutes

Open In Colab

[2]:
import sys
IN_COLAB = "google.colab" in sys.modules

def allow_notebook_for_test():
    print("Testing the harmonization notebook")

show_plot = True
test_mode = False
save_path = "data/"

if IN_COLAB:
    !pip install --quiet git+https://github.com/yoseflab/scvi@stable#egg=scvi[notebooks]
     |████████████████████████████████| 51kB 2.2MB/s
     |████████████████████████████████| 112kB 6.7MB/s
     |████████████████████████████████| 102kB 10.0MB/s
     |████████████████████████████████| 7.2MB 20.3MB/s
     |████████████████████████████████| 2.2MB 46.4MB/s
     |████████████████████████████████| 2.4MB 46.9MB/s
     |████████████████████████████████| 3.2MB 46.9MB/s
     |████████████████████████████████| 51kB 7.0MB/s
     |████████████████████████████████| 51kB 6.7MB/s
  Building wheel for scvi (setup.py) ... done
  Building wheel for loompy (setup.py) ... done
  Building wheel for numpy-groupies (setup.py) ... done
[3]:
import matplotlib
matplotlib.rcParams["pdf.fonttype"] = 42
matplotlib.rcParams["ps.fonttype"] = 42
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns

import numpy as np
import numpy.random as random
import pandas as pd
import scanpy as sc

from scvi.dataset.dataset import GeneExpressionDataset
from scvi.inference import UnsupervisedTrainer
from scvi.models import SCANVI, VAE

from umap import UMAP

# Control UMAP numba warnings
import warnings; warnings.simplefilter('ignore')

if IN_COLAB:
    %matplotlib inline

# Use GPU
use_cuda = True
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
/usr/local/lib/python3.6/dist-packages/anndata/_core/anndata.py:21: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import RangeIndex

The raw data is provided in the Seurat notebook and can be downloaded here

5.1. Introduction

This tutorial walks through the harmonization process, specifically making use of scVI and SCANVI, which are two tools that are applicable and useful for principled large-scale analysis of single-cell transcriptomics atlases. Data harmonization refers to the integration of two or more transcriptomics dataset into a single dataset on which any downstream analysis can be applied. The input datasets may come from very different sources and from samples with a different composition of cell types. scVI is a deep generative model that has been developed for probabilistic representation of scRNA-seq data and performs well in both harmonization and harmonization-based annotation, going beyond just correcting batch effects. SCANVI is a new method that is designed to harmonize datasets, while also explicitly leveraging any available labels to achieve more accurate annotation. SCANVI uses a semi-supervised generative model. The inference of both models (scVI, SCANVI) is done using neural networks, stochastic optimization, and variational inference and scales to millions of cells and multiple datasets. Furthermore, both methods provide a complete probabilistic representation of the data, which non-linearly controls not only for sample-to-sample bias, but also for other technical factors of variation such as over-dispersion, variable library size, and zero-inflation.

The following tutorial is designed to provide an overview of the data harmonization methods, scVI and SCANVI. This tutorial runs through two examples: 1) Tabula Muris dataset and 2) Human dataset (Seurat) Goals: - Setting up and downloading datasets - Performing data harmonization with scVI - Performing marker selection from differentailly expressed genes for each cluster - Performing differential expression within each cluster

5.1.1. Dataset loading

The cell below is used to load in two human PBMC dataset, one stimulated and one control.

[4]:
download_data = False
if IN_COLAB or download_data:
    !mkdir data
    !wget https://www.dropbox.com/s/79q6dttg8yl20zg/immune_alignment_expression_matrices.zip?dl=1 -O data/immune_alignment_expression_matrices.zip
    !cd data; unzip immune_alignment_expression_matrices.zip
--2020-05-07 16:04:31--  https://www.dropbox.com/s/79q6dttg8yl20zg/immune_alignment_expression_matrices.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/79q6dttg8yl20zg/immune_alignment_expression_matrices.zip [following]
--2020-05-07 16:04:32--  https://www.dropbox.com/s/dl/79q6dttg8yl20zg/immune_alignment_expression_matrices.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com/cd/0/get/A3R2-KcKdcUIFnBCRzDy04uUosN1jOCCh-hKXOxpOu9aQHkXJYL5Q_QbIJMclL16O4FpH0Njb9nkeQwe2NjIDzvSR7YC967DG6tsmycZMyFRRA/file?dl=1# [following]
--2020-05-07 16:04:32--  https://ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com/cd/0/get/A3R2-KcKdcUIFnBCRzDy04uUosN1jOCCh-hKXOxpOu9aQHkXJYL5Q_QbIJMclL16O4FpH0Njb9nkeQwe2NjIDzvSR7YC967DG6tsmycZMyFRRA/file?dl=1
Resolving ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com (ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com)... 162.125.9.6, 2620:100:601f:6::a27d:906
Connecting to ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com (ucb7f4572b462dfb76522b32d965.dl.dropboxusercontent.com)|162.125.9.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21329741 (20M) [application/binary]
Saving to: ‘data/immune_alignment_expression_matrices.zip’

data/immune_alignme 100%[===================>]  20.34M  65.1MB/s    in 0.3s

2020-05-07 16:04:33 (65.1 MB/s) - ‘data/immune_alignment_expression_matrices.zip’ saved [21329741/21329741]

Archive:  immune_alignment_expression_matrices.zip
  inflating: immune_control_expression_matrix.txt.gz
   creating: __MACOSX/
  inflating: __MACOSX/._immune_control_expression_matrix.txt.gz
  inflating: immune_stimulated_expression_matrix.txt.gz
  inflating: __MACOSX/._immune_stimulated_expression_matrix.txt.gz
[5]:
# This can take several minutes
from scvi.dataset.csv import CsvDataset
filename = 'immune_stimulated_expression_matrix.txt.gz' if IN_COLAB else 'immune_stimulated_expression_matrix.txt'
stimulated = CsvDataset(filename=filename,
                        save_path=save_path,sep='\t', new_n_genes=35635,
                        compression="gzip" if IN_COLAB else None)
filename = 'immune_control_expression_matrix.txt.gz' if IN_COLAB else 'immune_control_expression_matrix.txt'
control = CsvDataset(filename=filename,
                     save_path=save_path, sep='\t', new_n_genes=35635,
                     compression="gzip" if IN_COLAB else None)
[2020-05-07 16:04:34,849] INFO - scvi.dataset.csv | Preprocessing dataset
[2020-05-07 16:06:33,181] INFO - scvi.dataset.csv | Finished preprocessing dataset
[2020-05-07 16:06:35,072] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-05-07 16:06:35,074] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-05-07 16:06:52,642] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:06:53,315] INFO - scvi.dataset.dataset | Downsampled from 12875 to 12875 cells
[2020-05-07 16:06:53,317] INFO - scvi.dataset.dataset | Not subsampling. Expecting: 1 < (new_n_genes=35635) <= self.nb_genes
[2020-05-07 16:06:53,318] INFO - scvi.dataset.csv | Preprocessing dataset
[2020-05-07 16:08:53,830] INFO - scvi.dataset.csv | Finished preprocessing dataset
[2020-05-07 16:08:55,180] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-05-07 16:08:55,182] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-05-07 16:09:12,947] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:09:13,632] INFO - scvi.dataset.dataset | Downsampled from 13019 to 13019 cells
[2020-05-07 16:09:13,634] INFO - scvi.dataset.dataset | Not subsampling. Expecting: 1 < (new_n_genes=35635) <= self.nb_genes

5.2. Concatenate Datasets

[6]:
all_dataset = GeneExpressionDataset()
all_dataset.populate_from_datasets([control, stimulated])

# We subsample genes here so that computation works in Colab
if IN_COLAB:
    all_dataset.subsample_genes(4000)
[2020-05-07 16:09:13,643] INFO - scvi.dataset.dataset | Merging datasets. Input objects are modified in place.
[2020-05-07 16:09:13,644] INFO - scvi.dataset.dataset | Gene names and cell measurement names are assumed to have a non-null intersection between datasets.
[2020-05-07 16:09:13,715] INFO - scvi.dataset.dataset | Keeping 35635 genes
[2020-05-07 16:09:33,129] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:09:35,295] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-05-07 16:09:35,297] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-05-07 16:09:53,771] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:09:56,814] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-05-07 16:09:56,816] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-05-07 16:10:04,057] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-05-07 16:10:04,060] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-05-07 16:10:05,429] INFO - scvi.dataset.dataset | extracting highly variable genes using seurat_v3 flavor
Transforming to str index.
[2020-05-07 16:10:21,960] INFO - scvi.dataset.dataset | Downsampling from 35635 to 4000 genes
[2020-05-07 16:10:26,223] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:10:26,601] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2020-05-07 16:10:30,386] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-05-07 16:10:30,545] INFO - scvi.dataset.dataset | Downsampled from 25894 to 25894 cells

5.3. scVI (single-cell Variational Inference)

*scVI* is a hierarchical Bayesian model for single-cell RNA sequencing data with conditional distributions parametrized by neural networks. Working as a hybrid between a neural network and a bayesian network, scVI performs data harmonization. VAE refers to variational auto-encoders for single-cell gene expression data. scVI is similar to VAE as it tries to bring a more suitable structure to the latent space. While VAE allows users to make observations in a semi-supervised fashion, scVI is easier to train and specific cell-type labels for the dataset are not required in the pure unsupervised case.

5.3.1. Define the scVI model

  • First, we define the model and its hyperparameters:

    • n_hidden: number of units in the hidden layer

    • n_latent: number of dimensions in the shared latent space (how many dimensions in z)

    • n_layers: number of layers in the neural network

    • dispersion: ‘gene’: each gene has its own dispersion parameter; ‘gene-batch’: each gene in each batch has its own dispersion parameter

  • Then, we define a trainer using the model and the dataset to train it with

    • in the unsupervised setting, train_size=1.0 and all cells are used for training

[7]:
vae = VAE(all_dataset.nb_genes, n_batch=all_dataset.n_batches, n_labels=all_dataset.n_labels,
          n_hidden=128, n_latent=30, n_layers=2, dispersion='gene')

trainer = UnsupervisedTrainer(vae, all_dataset, train_size=1.0)
n_epochs = 100
trainer.train(n_epochs=n_epochs)
[2020-05-07 16:10:39,005] INFO - scvi.inference.inference | KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
[2020-05-07 16:10:39,006] INFO - scvi.inference.inference | KL warmup for 400 epochs
training: 100%|██████████| 100/100 [06:14<00:00,  3.75s/it]
[2020-05-07 16:16:53,965] INFO - scvi.inference.inference | Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.

5.4. Train the vae model for 100 epochs (this should take apporximately 12 minutes on a GPU)

If it is desired to save to model and take on the downstream analysis later, save the model, and comment out trainer.train(). Use the saved model to ensure that the down stream analysis cluster id are identical, but the result is robust to reruns of the model, although the exact numerical ids of the clusters might change

[ ]:
# trainer.train(n_epochs=100)
# torch.save(trainer.model.state_dict(),save_path+'harmonization.vae.allgenes.30.model.pkl')
[ ]:
# trainer.model.load_state_dict(torch.load(save_path+'harmonization.vae.allgenes.30.model.pkl'))
# trainer.model.eval()

5.5. Visualize the latent space

The latent space representation of the cells provides a way to address the harmonization problem, as all the cells are projected onto a joint latent space, inferred while controlling for their dataset of origin.

5.5.1. Obtain the latent space from the posterior object

First, the posterior object is obtained by providing the model that was trained on the dataset. Then, the latent space along with the labels is obtained.

[ ]:
full = trainer.create_posterior(trainer.model, all_dataset, indices=np.arange(len(all_dataset)))
latent, batch_indices, labels = full.sequential().get_latent()
batch_indices = batch_indices.ravel()

5.5.2. Use UMAP to generate 2D visualization

[ ]:
latent_u = UMAP(spread=2).fit_transform(latent)

5.5.3. Plot data colored by batch

[12]:
cm = LinearSegmentedColormap.from_list(
        'my_cm', ['deepskyblue', 'hotpink'], N=2)
fig, ax = plt.subplots(figsize=(5, 5))
order = np.arange(latent.shape[0])
random.shuffle(order)
ax.scatter(latent_u[order, 0], latent_u[order, 1],
           c=all_dataset.batch_indices.ravel()[order],
           cmap=cm, edgecolors='none', s=5)
plt.axis("off")
fig.set_tight_layout(True)

../_images/tutorials_harmonization_23_0.png
[ ]:
adata_latent = sc.AnnData(latent)
sc.pp.neighbors(adata_latent, use_rep='X', n_neighbors=30, metric='minkowski')
sc.tl.leiden(adata_latent, resolution=0.8)
clusters = adata_latent.obs.leiden.values.to_dense().astype(int)

5.5.4. plot clusters in 2D UMAP

[14]:
colors = ["#991f1f", "#ff9999", "#ff4400", "#ff8800", "#664014", "#665c52",
          "#cca300", "#f1ff33", "#b4cca3", "#0e6600", "#33ff4e", "#00ccbe",
          "#0088ff", "#7aa6cc", "#293966", "#0000ff", "#9352cc", "#cca3c9", "#cc2996"]

fig = plt.figure(figsize=(7, 7), facecolor='w', edgecolor='k')
for i, k in enumerate(np.unique(clusters)):
    plt.scatter(latent_u[clusters == k, 0], latent_u[clusters == k, 1], label=k,
                edgecolors='none', c=colors[k], s=5)
    plt.legend(borderaxespad=0, fontsize='large', markerscale=5, loc="upper right", bbox_to_anchor=(1.2, 1))
plt.axis('off')
fig.set_tight_layout(True)

../_images/tutorials_harmonization_26_0.png

Generate list of genes that is enriched for higher expression in cluster \(i\) compared to all other clusters

Here we compare the gene expression in cells from one cluster to all the other cells by * sampling mean parameter from the scVI ZINB model * compare pairs of cells from one subset v.s. the other * compute bayes factor based on the number of times the cell from the cluster of interest has a higher expression * generate DE genelist ranked by the bayes factor

[15]:
# change to output_file=True to get an Excel file with all DE information
de_res, de_clust = full.one_vs_all_degenes(cell_labels=clusters, n_samples=10000,
                                           M_permutation=10000, output_file=False,
                                           save_dir=save_path, filename='Harmonized_ClusterDE',
                                           min_cells=1)

# with open(save_path+'Harmonized_ClusterDE.pkl', 'wb') as f:
#     pickle.dump((de_res, de_clust), f)

# with open(save_path+'Harmonized_ClusterDE.pkl', 'rb') as f:
#     de_res, de_clust = pickle.load(f)


5.6. Find markers for each cluster

absthres is the minimum average number of UMI in the cluster of interest to be a marker gene

relthres is the minimum fold change in number of UMI in the cluster of interest compared to all other cells for a differentially expressed gene to be a marker gene

[ ]:
def find_markers(deres, absthres, relthres, ngenes):
    allgenes = []
    for i, x in enumerate(deres):
        markers = x.loc[
            (x["raw_mean1"] > absthres)
            & (x["raw_normalized_mean1"] / x["raw_normalized_mean2"] > relthres)
        ]
        if len(markers > 0):
            ngenes = np.min([len(markers), ngenes])
            markers = markers[:ngenes]
            allgenes.append(markers)
    if len(allgenes) > 0:
        markers = pd.concat(allgenes)
        return markers
    else:
        return pd.DataFrame(
            columns=["bayes_factor", "raw_mean1", "raw_mean2", "scale1", "scale2", "clusters"]
        )

[ ]:
clustermarkers = find_markers(de_res, absthres=0.5, relthres=2, ngenes=3)
[18]:
clustermarkers[['bayes_factor', 'raw_mean1', 'raw_mean2', 'scale1', 'scale2', 'clusters']]
[18]:
bayes_factor raw_mean1 raw_mean2 scale1 scale2 clusters
RPS13 3.701148 6.931537 3.878584 0.005961 0.002826 0
RPL35A 3.561951 5.304115 2.902222 0.004534 0.002187 0
RPL30 3.349852 4.857354 2.865836 0.004044 0.002023 0
HPSE 4.885323 0.938006 0.048769 0.000607 0.000027 1
GSTO1 4.747355 3.071604 0.450770 0.001583 0.000282 1
CD63 4.657600 7.800829 0.869967 0.004057 0.000471 1
TRAT1 2.434223 0.653009 0.176525 0.000605 0.000164 2
ZFP36L2 2.333329 1.178647 0.481154 0.001012 0.000495 2
PABPC1 2.026000 7.146336 3.131527 0.006392 0.002488 2
MS4A1 6.263994 0.746593 0.007679 0.000755 0.000042 3
IGLL5 5.093346 2.116022 0.004487 0.002248 0.000103 3
CD79A 5.045035 1.622836 0.016437 0.001592 0.000145 3
CD8A 3.803804 0.599408 0.058156 0.000696 0.000096 4
GZMH 3.651325 1.298471 0.056689 0.001745 0.000145 4
CD2 3.442277 1.204736 0.331336 0.001436 0.000312 4
GNLY 6.724225 15.801816 0.381276 0.016758 0.000729 5
CTSW 5.144079 0.754808 0.043502 0.000891 0.000065 5
CLIC3 4.998924 1.332799 0.044418 0.001490 0.000089 5
CNBP 3.748992 1.247217 0.519453 0.001228 0.000405 6
EIF1 3.607636 8.501465 4.868111 0.008121 0.002990 6
PAPOLA 3.550832 0.634446 0.263158 0.000661 0.000180 6
FCGR3A 6.569875 7.934952 0.195950 0.003251 0.000198 7
MS4A7 6.500782 7.377277 0.307061 0.002942 0.000238 7
MS4A4A 6.500782 2.226366 0.050806 0.000975 0.000052 7
ALDH2 2.575986 0.730159 0.053389 0.000200 0.000035 8
LYZ 2.438278 14.421164 1.275763 0.003706 0.000635 8
CPVL 2.389199 0.624339 0.059882 0.000179 0.000028 8
HBG2 18.420681 9.832580 0.005687 0.019221 0.000044 9
SNCA 18.420681 3.678733 0.029603 0.006022 0.000077 9
ALAS2 18.420681 5.592760 0.006388 0.009406 0.000050 9
IGJ 4.832960 2.014599 0.057382 0.001205 0.000056 10
DNAJC7 4.445082 0.503650 0.117754 0.000205 0.000074 10
EID1 4.427785 1.021898 0.215980 0.000421 0.000171 10
TUBB1 6.569875 1.733333 0.009195 0.002659 0.000020 11
SDPR 5.949637 2.708333 0.019943 0.003443 0.000064 11
PPBP 5.911796 17.058332 0.121634 0.014930 0.000270 11

5.7. Plotting known cluster unique genes

[ ]:
Markers = ["CD3D", "SELL", "CREM", "CD8B", "GNLY", "CD79A", "FCGR3A", "CCL2", "PPBP"]
[ ]:
def plot_marker_genes(latent_u, count, genenames, markers):
    nrow = (len(markers) // 3 + 1)
    figh = nrow * 4
    fig = plt.figure(figsize=(10, figh))
    for i, x in enumerate(markers):
        if np.sum(genenames == x)==1:
            exprs = count[:, genenames == x].ravel()
            idx = (exprs > 0)
            plt.subplot(nrow, 3, (i + 1))
            plt.scatter(latent_u[:, 0], latent_u[:, 1], c='lightgrey', edgecolors='none', s=5)
            plt.scatter(latent_u[idx, 0], latent_u[idx, 1], c=exprs[idx], cmap=plt.get_cmap('viridis_r'),
                        edgecolors='none', s=3)
            plt.title(x)
            fig.set_tight_layout(True)
[21]:
if len(clustermarkers) > 0:
    plot_marker_genes(latent_u[clusters >= 0, :], all_dataset.X[clusters >= 0, :],
                  all_dataset.gene_names,
                  np.asarray(Markers))
../_images/tutorials_harmonization_36_0.png

5.7.1. Here we plot the heatmap of average marker gene expression of each cluster

[ ]:
markergenes = ["CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY",
    "NKG7", "CCL5", "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A",
    "VMO1", "CCL2", "S100A9", "HLA-DQA1", "GPR183", "PPBP", "GNG11", "HBA2",
    "HBB", "TSPAN13", "IL3RA", "IGJ"]
[ ]:
percluster_exprs = []
marker_names = []
for marker in markergenes:
    if np.sum(all_dataset.gene_names == marker) == 1:
        mean = [np.mean(all_dataset.X[clusters == i, all_dataset.gene_names == marker]) for i in np.unique(clusters)]
        mean = np.asarray(mean)
        percluster_exprs.append(np.log10(mean / np.mean(mean) + 1))
        marker_names.append(marker)

[24]:
if len(percluster_exprs) > 0:
    percluster_exprs = pd.DataFrame(percluster_exprs, index=marker_names)
    sns.clustermap(percluster_exprs, row_cluster=False, col_cluster=True)
../_images/tutorials_harmonization_40_0.png

5.8. Plotting scVI discovered marker genes

Each row contains the top 3 marker gene expression of its corresponding cluster

[25]:
plot_marker_genes(latent_u[clusters >= 0, :], all_dataset.X[clusters >= 0, :],
                  all_dataset.gene_names, np.asarray(clustermarkers.index))

../_images/tutorials_harmonization_42_0.png

5.9. Compare list of genes that are differentially expressed in each clusters

[ ]:
# change to output_file=True to get an Excel file with all DE information
de_res_stim, de_clust_stim = full.within_cluster_degenes(cell_labels=clusters,
                                                         states=all_dataset.batch_indices.ravel() == 1,
                                                         output_file=False, batch1=[1], batch2=[0],
                                                         save_dir=save_path, filename='Harmonized_StimDE',
                                                         min_cells=1)

# with open(save_path+'Harmonized_StimDE.pkl', 'wb') as f:
#     pickle.dump((de_res_stim,de_clust_stim), f)

# with open(save_path+'Harmonized_StimDE.pkl', 'rb') as f:
#     de_res_stim,de_clust_stim = pickle.load(f)

[ ]:
genelist = []
for i, x in enumerate(de_clust_stim):
    de = de_res_stim[i].loc[de_res_stim[i]["raw_mean1"] > 1]
    de = de.loc[de["bayes_factor"] > 2]
    if len(de) > 0:
        de["cluster"] = np.repeat(x, len(de))
        genelist.append(de)


if len(genelist) > 0:
    genelist = pd.concat(genelist)
    genelist["genenames"] = list(genelist.index)
    degenes, nclusterde = np.unique(genelist.index, return_counts=True)

5.9.1. Genes that are differentially expressed in at least 10 of the clusters

[28]:
if len(genelist) > 0:
    print(", ".join(degenes[nclusterde > 11]))
ISG15
[ ]:
if len(genelist) > 0:
    cluster0shared = genelist.loc[genelist['genenames'].isin(degenes[nclusterde > 10])]
    cluster0shared = cluster0shared.loc[cluster0shared['cluster'] == 0]
[ ]:
def plot_marker_genes_compare(latent_u, count, genenames, markers, subset):
    nrow = len(markers)
    figh = nrow * 4
    fig = plt.figure(figsize=(8, figh))
    notsubset = np.asarray([not x for x in subset])
    for i, x in enumerate(markers):
        if np.sum(genenames == x) == 1:
            exprs = count[:, genenames == x].ravel()
            idx = (exprs > 0)
            plt.subplot(nrow, 2, (i * 2 + 1))
            plt.scatter(latent_u[subset, 0], latent_u[subset, 1], c='lightgrey', edgecolors='none', s=5)
            plt.scatter(latent_u[idx, 0][subset[idx]], latent_u[idx, 1][subset[idx]], c=exprs[idx][subset[idx]],
                        cmap=plt.get_cmap('viridis_r'), edgecolors='none', s=3)
            plt.title(x + ' control')
            fig.set_tight_layout(True)
            plt.subplot(nrow, 2, (i * 2 + 2))
            plt.scatter(latent_u[notsubset, 0], latent_u[notsubset, 1], c='lightgrey', edgecolors='none', s=5)
            plt.scatter(latent_u[idx, 0][notsubset[idx]], latent_u[idx, 1][notsubset[idx]],
                        c=exprs[idx][notsubset[idx]], cmap=plt.get_cmap('viridis_r'), edgecolors='none', s=3)
            plt.title(x + ' stimulated')
[31]:
plot_marker_genes_compare(latent_u, all_dataset.X, all_dataset.gene_names,
                          ["CD3D", "GNLY", "IFI6", "ISG15", "CD14", "CXCL10"], batch_indices == 0)
../_images/tutorials_harmonization_50_0.png
[32]:
if len(genelist) > 0:
    plot_marker_genes_compare(latent_u, all_dataset.X,
                          all_dataset.gene_names, cluster0shared.index,
                          batch_indices == 0)
../_images/tutorials_harmonization_51_0.png