Tutorials

Example datasets

Generating 3D Data

To generate example data, we will use the built-in pancreas dataset from scvelo. The process of obtaining the velocity vector components is detailed in the scvelo tutorials. We will focus on one key difference that enables generating three-dimensional data.

import scanpy as sc
import scvelo as scv
adata = scv.datasets.pancreas()
adata
AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'

The dataset already contains a UMAP embedding, but it is two-dimensional.

adata.obsm["X_umap"].shape
(3696, 2)

Using scanpy.tl.umap, we will create a three-dimensional UMAP embedding instead. This places the cells in 3D space, and the velocity vectors computed later take on the dimensionality of the specified embedding.

sc.tl.umap(adata, n_components=3)
adata.obsm["X_umap"].shape
(3696, 3)
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)
adata
AnnData object with n_obs × n_vars = 3696 × 2000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'highly_variable_genes', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'umap', 'log1p'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)
scv.tl.velocity_embedding(adata, basis="umap")
adata
AnnData object with n_obs × n_vars = 3696 × 2000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
    var: 'highly_variable_genes', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'velocity_gamma', 'velocity_qreg_ratio', 'velocity_r2', 'velocity_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'umap', 'log1p', 'velocity_params', 'velocity_graph', 'velocity_graph_neg'
    obsm: 'X_pca', 'X_umap', 'velocity_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity'
    obsp: 'distances', 'connectivities'

The velocity vectors have been successfully determined and are located in obsm as velocity_umap.
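
Before moving on, it can be worth confirming that the coordinates and the velocity vectors really form matching three-dimensional arrays. The sketch below is only an illustration: check_embedding_pair is a hypothetical helper, and the random arrays are stand-ins for adata.obsm["X_umap"] and adata.obsm["velocity_umap"].

```python
import numpy as np

def check_embedding_pair(coords, vectors, n_dims=3):
    """Return True if both arrays are (n_cells, n_dims) and contain only finite values."""
    coords = np.asarray(coords, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    same_shape = coords.shape == vectors.shape
    right_dims = coords.ndim == 2 and coords.shape[1] == n_dims
    finite = bool(np.isfinite(coords).all() and np.isfinite(vectors).all())
    return bool(same_shape and right_dims and finite)

# Synthetic stand-ins for adata.obsm["X_umap"] and adata.obsm["velocity_umap"]
rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))
vectors = rng.normal(size=(5, 3))
print(check_embedding_pair(coords, vectors))         # True
print(check_embedding_pair(coords, vectors[:, :2]))  # False: only two components
```

With a real dataset, the same helper could be called as check_embedding_pair(adata.obsm["X_umap"], adata.obsm["velocity_umap"]).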

Reducing the file size

Cell Journey is built with Dash, which has its limitations: the upload of very large files may be interrupted automatically. Files intended for Cell Journey should therefore be stripped of unnecessary data, especially large dense matrices. For the pancreas dataset, it is sufficient to keep var, obs, obsm, and the sparse X matrix.

import scanpy as sc
import os
adata_slim = sc.AnnData(X=adata.X, obs=adata.obs, var=adata.var, obsm=adata.obsm)
adata_slim.write("pancreas_slim.h5ad")

For comparison, we can also save the entire adata dataset.

adata.write("pancreas_full.h5ad")
full_dataset = os.stat("pancreas_full.h5ad")
full_dataset_size = full_dataset.st_size / (1024 ** 2)
slim_dataset = os.stat("pancreas_slim.h5ad")
slim_dataset_size = slim_dataset.st_size / (1024 ** 2)
print(f"Full dataset: {full_dataset_size:.2f} MB, slim dataset: {slim_dataset_size:.2f} MB")
Full dataset: 1756.84 MB, slim dataset: 14.64 MB
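
The gap between the two files is plausibly driven by dense matrices, which store every entry, including zeros. The following sketch illustrates the effect on a toy matrix; size_mb is a hypothetical helper, and the 5% density is an arbitrary assumption, not a property of the pancreas data.

```python
import numpy as np
from scipy import sparse

def size_mb(matrix):
    """Approximate in-memory size in MB of a dense array or a sparse matrix."""
    if sparse.issparse(matrix):
        m = matrix.tocsr()
        # A CSR matrix stores only the non-zero values plus two index arrays
        n_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
    else:
        n_bytes = np.asarray(matrix).nbytes
    return n_bytes / (1024 ** 2)

# Toy matrix: 1000 x 2000 with about 5% non-zero entries
toy_sparse = sparse.random(1000, 2000, density=0.05, format="csr",
                           random_state=0, dtype=np.float32)
toy_dense = toy_sparse.toarray()
print(f"sparse: {size_mb(toy_sparse):.2f} MB, dense: {size_mb(toy_dense):.2f} MB")
```

Applying such a helper to each entry of adata.layers would show which components are worth dropping before saving.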

Lineage tracing

In this section, we will prepare a dataset designed for cell lineage tracing. The data used originates from the study by C. Weinreb et al. (2020). The original files are hosted in Allon Klein’s lab GitHub repository.

All the required files can be downloaded automatically using the script provided below. Please ensure that you update the my_path variable to reflect your local directory structure.

import scanpy as sc
from urllib.request import urlretrieve
from pathlib import Path
from scipy.io import mmread
from pandas import read_csv
url_path = 'https://kleintools.hms.harvard.edu/paper_websites/state_fate2020/'
my_path = '/my/local/path/'

files = {
    'counts': 'stateFate_inVitro_normed_counts.mtx.gz',
    'genes': 'stateFate_inVitro_gene_names.txt.gz',
    'metadata': 'stateFate_inVitro_metadata.txt.gz',
    'clones': 'stateFate_inVitro_clone_matrix.mtx.gz'
}

Automated download:

for filename in files.values():
    urlretrieve(f'{url_path}{filename}', 
                f'{my_path}{filename}')
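
Because the count matrix is large, re-running the loop above downloads everything again. A possible variant, shown here as a sketch, creates the target directory if needed and skips files that already exist. download_missing and its fetch parameter are hypothetical names introduced for illustration; fetch defaults to urlretrieve.

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_missing(files, url_path, target_dir, fetch=urlretrieve):
    """Download each file into target_dir unless it already exists there.

    Returns the list of file names that were actually fetched.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    fetched = []
    for filename in files.values():
        destination = target_dir / filename
        if destination.exists():
            continue  # already downloaded in an earlier run
        fetch(f"{url_path}{filename}", str(destination))
        fetched.append(filename)
    return fetched
```

It would be called as download_missing(files, url_path, my_path). Note that a partially downloaded file from an interrupted run would also be skipped, so such files should be deleted manually.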

Once the download is complete, we load the counts, gene names, and metadata into an AnnData object.

X = mmread(Path(my_path, files['counts'])).tocsr()
genes = read_csv(Path(my_path, files['genes']), header=None, names=['gene_symbol'])
meta = read_csv(Path(my_path, files['metadata']), sep='\t')

adata = sc.AnnData(X=X)
adata.var_names = genes['gene_symbol'].values
adata.obs = meta
adata
AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y'

We then store the clonal information in obsm under the key Clones:

clones = mmread(Path(my_path, files['clones'])).tocsc()
adata.obsm['Clones'] = clones
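
To illustrate what such a matrix encodes, the sketch below looks up all cells that share a clone with a given cell. It assumes that rows are cells, columns are clones, and non-zero entries mark membership; cells_sharing_clone is a hypothetical helper and not part of Cell Journey.

```python
import numpy as np
from scipy import sparse

def cells_sharing_clone(clone_matrix, cell_index):
    """Return sorted indices of all cells sharing at least one clone with cell_index."""
    csr = sparse.csr_matrix(clone_matrix)
    clone_ids = csr[cell_index].indices      # clones the query cell belongs to
    if clone_ids.size == 0:
        return np.array([], dtype=int)       # the cell has no clonal label
    members = csr.tocsc()[:, clone_ids]      # keep only the columns of those clones
    return np.unique(members.nonzero()[0])

# Toy example: five cells, two clones
toy = sparse.csr_matrix(np.array([
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 0],
    [0, 1],
]))
print(cells_sharing_clone(toy, 0))  # [0 2]
print(cells_sharing_clone(toy, 3))  # []
```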

Next, we process the dataset following a standard Scanpy workflow. For a deeper dive into these steps, please refer to the Scanpy documentation. Please remember that the umap function must have the n_components parameter set to 3.

sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=20, n_pcs=30)
sc.tl.umap(adata, n_components=3)
adata
AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap'
    obsm: 'Clones', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

To ensure a seamless experience and optimal performance, we recommend removing any auxiliary data or dense matrices not utilized by Cell Journey.

del adata.obsm['X_pca']
del adata.uns
del adata.obsp
del adata.varm
adata
AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y'
    obsm: 'Clones', 'X_umap'

Additionally, we can rename cell types. The mapping is based on the results section of the aforementioned article.

mapping = {
    'Undifferentiated': 'Undifferentiated',
    'Monocyte': 'Monocyte',
    'Neutrophil': 'Neutrophil',
    'Mono': 'Monocyte',
    'Baso': 'Basophil',
    'Erythroid': 'Erythrocyte',
    'Mast': 'Mast cell',
    'Meg': 'Megakaryocyte',
    'Ccr7_DC': 'Ccr7+ migDC',
    'Lymphoid': 'Lymphoid precursor',
    'Eos': 'Eosinophil',
    'pDC': 'pDC'
}

adata.obs['Cell type annotation'] = adata.obs['Cell type annotation'].map(mapping)
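
Note that pandas' Series.map returns NaN for any label that is missing from the dictionary, so it can be useful to check the coverage before overwriting the column. A minimal sketch on toy labels follows; unmapped_labels is a hypothetical helper.

```python
import pandas as pd

def unmapped_labels(labels, mapping):
    """Return the sorted non-missing labels that have no key in mapping."""
    present = pd.Series(labels).dropna().unique()
    return sorted(set(present) - set(mapping))

toy_labels = pd.Series(['Baso', 'Meg', 'Neu', 'Baso'])
toy_mapping = {'Baso': 'Basophil', 'Meg': 'Megakaryocyte'}
print(unmapped_labels(toy_labels, toy_mapping))  # ['Neu']
print(toy_labels.map(toy_mapping).isna().sum())  # 1
```

For the dataset above, the check would be unmapped_labels(adata.obs['Cell type annotation'], mapping), run before the mapping is applied; an empty list means every annotation is covered.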

Finally, save the processed dataset:

adata.write(Path(my_path, 'clonal_data.h5ad'))

Recreating article figures

Pancreatic endocrinogenesis

  1. Upload data and select coordinates: Load pancreas.h5ad provided in the datasets directory.
  2. Upload data and select coordinates: Select X_umap(1), X_umap(2), and X_umap(3) as X, Y, and Z coordinates.
  3. Upload data and select coordinates: Select velocity_umap(1), velocity_umap(2), and velocity_umap(3) as U, V, and W coordinates.
  4. Upload data and select coordinates: Click Submit selected coordinates.
  5. Upload data and select coordinates: Set Target sum to 10000 and click Lognormalize.
  6. Global plot configuration: change Axes switch to Hide.
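
The Target sum and Lognormalize settings in step 5 are not described further here. In single-cell workflows, log-normalization conventionally means scaling each cell to a fixed total count and then applying log(1 + x). The sketch below shows that conventional operation under this assumption; it is not taken from the Cell Journey source, and lognormalize is a hypothetical name.

```python
import numpy as np

def lognormalize(counts, target_sum=10000):
    """Scale every row (cell) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero; empty cells stay all-zero
    return np.log1p(counts / totals * target_sum)

toy_counts = np.array([[1, 3], [0, 0], [5, 5]])
normalized = lognormalize(toy_counts)
# Undoing the log recovers row totals of approximately 10000, 0, and 10000
print(np.expm1(normalized).sum(axis=1))
```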

Figure 1 B

  1. Scatter plot: select clusters from the Select feature dropdown menu.
  2. Scatter plot: change Use custom color palette to ON, and paste #1DACD6 #FFAACC #66FF66 #0066FF #FF7A00 #FC2847 #FDFF00 #000000 into the Space-separated list of color hex values (max 20 colors) field.
  3. Streamline plot: change Color scale to Greys, and Line width to 10.0.
  4. Cell Journey (trajectory): click Generate grid.
  5. Cell Journey (trajectory): set Number of clusters to 8, Number of automatically selected features to 200, Tube segments to 25, Features activities shown in heatmap to Relative to first segment, and Highlight selected cells to Don't highlight.
  6. Click on a random cell from the Ngn3 low EP cluster. Try a few cells within the suggested area if the first one didn't result in an appropriate trajectory.
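
A palette string like the one in step 2 can be checked before pasting it into the field. The sketch below assumes six-digit hex codes and the stated limit of 20 colors; whether Cell Journey accepts other notations is not covered here, and parse_palette is a hypothetical helper.

```python
import re

HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

def parse_palette(text, max_colors=20):
    """Split a space-separated string of hex colors and validate every entry."""
    colors = text.split()
    if len(colors) > max_colors:
        raise ValueError(f"{len(colors)} colors given, at most {max_colors} allowed")
    invalid = [c for c in colors if not HEX_COLOR.fullmatch(c)]
    if invalid:
        raise ValueError(f"not six-digit hex colors: {invalid}")
    return colors

palette = parse_palette("#1DACD6 #FFAACC #66FF66 #0066FF #FF7A00 #FC2847 #FDFF00 #000000")
print(len(palette))  # 8
```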

Figure 2 A (SCATTER PLOT)

  1. Scatter plot: select clusters_coarse from the Select feature dropdown menu.
  2. Global plot configuration: change Legend: horizontal position and Legend: vertical position to obtain an optimal position, e.g. 0.50 and 0.30, respectively.

Figure 2 A (CONE PLOT)

  1. Cone plot: select rainbow from the Color scale dropdown menu.
  2. Cone plot: set Cone size to 12.00.

Figure 2 A (STREAMLINES)

  1. Streamline plot: set Grid size to 20, Number of steps to 500, Step size to 2.00, and Difference threshold to 0.001.
  2. Streamline plot: click Generate trajectories (streamlines and streamlets).
  3. Streamline plot: change Combine trajectories with the scatter plot to OFF.
  4. Streamline plot: change Line width to 4.0.
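
The steps above do not spell out how Number of steps, Step size, and Difference threshold interact. One common scheme for tracing streamlines is fixed-step Euler integration that stops early once a step barely moves the point. The sketch below shows that scheme purely as an illustration of how such parameters are typically used; it is not Cell Journey's implementation, and integrate_streamline and toward_origin are hypothetical names.

```python
import numpy as np

def integrate_streamline(start, field, n_steps=500, step_size=2.0, diff_threshold=0.001):
    """Euler-integrate a path through `field`, stopping early when a step barely moves."""
    points = [np.asarray(start, dtype=float)]
    for _ in range(n_steps):
        step = step_size * np.asarray(field(points[-1]), dtype=float)
        if np.linalg.norm(step) < diff_threshold:
            break  # the path has effectively stalled
        points.append(points[-1] + step)
    return np.vstack(points)

# Toy vector field pulling every point toward the origin
def toward_origin(p):
    return -0.25 * p

path = integrate_streamline([1.0, 1.0, 1.0], toward_origin)
print(path.shape[1], len(path) < 501)  # 3 True
```

In this toy field each step halves the distance to the origin, so the threshold ends the path long before the 500-step limit is reached.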

Figure 2 A (STREAMLETS)

  1. Repeat the steps for Figure 2 A (STREAMLINES).
  2. Streamline plot: change Show streamlines to Show streamlets.
  3. Streamline plot: set Streamlets length to 10.
  4. Streamline plot: click Update streamlets.
  5. Streamline plot: change Color scale to Reds.

Figure 2 A (SCATTER + VOLUME PLOT)

  1. Scatter plot: input Serping1 in the Modality feature field.
  2. Scatter plot: select Turbo from the Built-in continuous color scale dropdown menu.
  3. Scatter plot: change Add volume plot to continuous feature and Single color scatter when volume is plotted to ON.
  4. Scatter plot: select the second color from the left in the second row of the suggested colors (light grey box).
  5. Scatter plot: select linear from the Radial basis function dropdown menu.
  6. Scatter plot: change Point size to 1.00, Volume plot transparency cut-off quantile to 50, Volume plot opacity to 0.06, Gaussian filter standard deviation multiplier to 2.00, and Radius scaler to 1.300.

Figure 2 A (SCATTER + STREAMLINES)

  1. Repeat the steps for Figure 2 A (STREAMLINES).
  2. Streamline plot: change Combine trajectories with the scatter plot to ON.
  3. Streamline plot: set Subset current number of trajectories to 70 and click Confirm.
  4. Scatter plot: change Built-in continuous color scale to Balance.

Bone marrow mononuclear progenitors

  1. Upload data and select coordinates: Load bone_marrow.h5ad provided in the datasets directory.
  2. Upload data and select coordinates: Select RNA: X_umap(1), RNA: X_umap(2), and RNA: X_umap(3) as X, Y, and Z coordinates.
  3. Upload data and select coordinates: Select RNA: velocity_umap(1), RNA: velocity_umap(2), and RNA: velocity_umap(3) as U, V, and W coordinates.
  4. Upload data and select coordinates: Click Submit selected coordinates.
  5. Upload data and select coordinates: Select RNA modality, set Target sum to 10000, and click Lognormalize. Select ADT modality, set Target sum to 10000, and click Lognormalize.
  6. Global plot configuration: change Axes switch to Hide.

Figure 2 B (SCATTER + STREAMLETS + VOLUME PLOT)

  1. Global plot configuration: change Legend switch to Hide.
  2. Scatter plot: change Point size to 1.00.
  3. Scatter plot: select the second color from the left in the second row of the suggested colors (light grey box).
  4. Streamline plot: set Grid size to 25, Number of steps to 500, and Difference threshold to 0.001, then click Generate trajectories (streamlines and streamlets).
  5. Streamline plot: change Show streamlines to Show streamlets, set Streamlets length to 10, and click Update streamlets.
  6. Streamline plot: change Color scale to Jet.
  7. Scatter plot: select ADT from the Modality dropdown menu and input CD34 in the field below.
  8. Scatter plot: select Reds from the Built-in continuous color scale field.
  9. Scatter plot: change Add volume plot to continuous feature and Single color scatter when volume is plotted to ON.
  10. Scatter plot: set Volume plot transparency cut-off quantile to 55, Volume plot opacity to 0.04, Gaussian filter standard deviation multiplier to 3.00, and Radius scaler to 1.300.

Figure 2 B (SCATTER PLOT + TRAJECTORY + TUBE CELLS)

  1. Streamline plot: change Line width to 5.0.
  2. Cell Journey (trajectory): click Generate grid.
  3. Cell Journey (trajectory): set Tube segments to 5 and Highlight selected cells to Each segment separately.
  4. Global plot configuration: change Legend: horizontal position and Legend: vertical position to obtain an optimal position, e.g. 0.20 for both.

Figure 2 B (RNA MODALITY HEATMAP)

  1. Cell Journey (trajectory): click Generate grid.
  2. Cell Journey (trajectory): set Step size to 2.00, Tube segments to 20, Number of clusters to 8, Number of automatically selected features to 50, and Heatmap color scale to Inferno.
  3. Scatter plot: select RNA from the Modality dropdown menu.
  4. Click on a random cell from the center of the point cloud. Try a few cells within the suggested area if the first one didn't result in an appropriate trajectory.
  5. Cell Journey (trajectory): select Box plot from the Plot type dropdown menu, set Trendline to Median-based cubic spline.
  6. Find the HBB gene by hovering over the heatmap. Click on any segment to obtain the figure RNA: HBB ALONG TRAJECTORY.

Figure 2 B (ADT MODALITY HEATMAP)

  1. Cell Journey (trajectory): click Generate grid.
  2. Cell Journey (trajectory): set Step size to 2.00, Tube segments to 20, Number of clusters to 3, Number of automatically selected features to 10, and Heatmap color scale to Inferno.
  3. Scatter plot: select ADT from the Modality dropdown menu.
  4. Click on a random cell from the center of the point cloud. Try a few cells within the suggested area if the first one didn't result in an appropriate trajectory.
  5. Cell Journey (trajectory): select Box plot from the Plot type dropdown menu, set Trendline to Median-based cubic spline.
  6. Find CD34 by hovering over the heatmap. Click on any segment to obtain the figure ADT: CD34 ALONG TRAJECTORY.

Lineage tracing

  1. Upload data and select coordinates: Load the clonal_data.h5ad file generated in the Lineage tracing tutorial.
  2. Upload data and select coordinates: Select X_umap(1), X_umap(2), and X_umap(3) as X, Y, and Z coordinates.
  3. Skip U, V, and W coordinates. Cell Journey will display the warning "Please select all the required columns." This can be ignored, as trajectories are not integrated in this case.
  4. Upload data and select coordinates: Click Submit selected coordinates.
  5. Global plot configuration: change Axes switch to Hide.

Figure 3 A

  1. Scatter plot: select Cell type annotation from the Select feature dropdown menu.
  2. Scatter plot: change Use custom color palette to ON, and paste the following hex codes into the Space-separated list of color hex values (max 20 colors) field: #000000 #FF6666 #9933FF #E085C2 #9EE09E #FF8000 #F0C1E1 #B3FF66 #B30000 #FFBF80 #604020.

Figure 3 B

  1. Scatter plot: change Use custom color palette to ON, and paste #000000 #1F75FE #FF6666 into the Space-separated list of color hex values (max 20 colors) field.
  2. Scatter plot: set Point size to 1.0.
  3. Scatter plot (clonal data): change Click on a cell to find the clones to ON.
  4. Click on a cell at the edge of the neutrophil cluster. You may need to try a few times to obtain the same results as shown in the article.

Figure 3 C

  1. Follow the steps from Figure 3 B.
  2. Scatter plot: select Solar from the Built-in continuous color scale dropdown menu.
  3. Scatter plot: switch Reverse order of color scale to ON.
  4. Scatter plot: select the color black from the color picker.
  5. Scatter plot (volume plot): set Volume plot transparency cut-off quantile to 15, Volume plot opacity to 0.20, Radial basis function to gaussian, Smoothing level to 100, Gaussian filter standard deviation multiplier to 1, and Grid size to 40.
  6. Switch Add volume plot to continuous feature and Single color scatter when volume is plotted to ON.
  7. Click on a cell at the edge of the neutrophil cluster. You may need to try a few times to reproduce the article's results. Note that when the volume plot is ON, Cell Journey will not respond to clicking on cells. You must switch Add volume plot to continuous feature to OFF to retry the selection.