This post contains my notes on how to use a very large genomics dataset on an under-resourced computer - using TileDB and a combination of Python and R. I have been interested in TileDB for a while but never had the opportunity to use it at work. Luckily, one of my many personal projects involves querying a very large data set (LINCS 2020, 115 GB of data) which doesn’t fit in memory, and thus is perfect to give TileDB a proper go. This might be useful for folks working with larger-than-memory data, especially single-cell data.
There are many other ways this data could be represented on-disk and still be performant. I picked TileDB mostly because I can use the existing SOMA API, which was built for single-cell data but can be extended to other gene expression data.
Note
If you have single-cell data in anndata format, most of the data transformations being done in this post are not necessary and you can skip straight to Section 5.4.
There is a lot of rambling and concept-explaining in this post, not all of it necessary to understand the technical bits, so here is a diagram of what I will be doing:
Code
---
config:
look: handDrawn
theme: forest
---
flowchart LR
subgraph h5 is out-of-memory and BioC friendly
A[GTX] --> |Gene expression| B[h5]
end
D[csv] --> |Gene and Experiment metadata| C
subgraph anndata has chunked ingestion to TileDB
B --> C(SingleCellExperiment)
C --> E(anndata)
E --> F[(TileDB-SOMA)]
end
Diagram of how data will be moved between formats to finally get it into the DB. The intermediate steps are only necessary to keep memory use as low as possible.
Biological motivation
LINCS2 is a pretty unique dataset consisting of millions of chemical and genetic perturbations in multiple (cancer) cell lines (see Section 3.1). The chemical perturbation signatures have reportedly been used extensively for drug re-purposing, mechanism of action (MoA) identification, and model training. The genetic perturbation data, however, is more recent and has (i) not been used as much, and (ii) is not as well characterized.
From a data science / computational biology perspective, and in my view, the genetic perturbations have one key advantage:
There is a very clear ground truth - if a gene is knocked out, the expression of that gene should go down.
This means that if one were to create a model to predict the effect of a gene perturbation, the validation is quite straightforward. For chemical perturbations the situation is far more complex and making predictions is harder:
Not all compounds have a known target protein
Some compounds might have multiple targets owing to:
The target is a protein complex
Polypharmacology
Off-target effects
The MoA of a compound is not known (or is poorly annotated)
The pathways affected by the compound are not known
So whilst the chemical perturbations are very useful for Drug Discovery and Development, I have some doubts about how good the modelling of those can be.
But there is a catch: this is a very large dataset (by the standards of gene expression data). At 115 GB for the matrix alone (3,026,460 x 12,328), excluding the perturbation metadata - a measly 643.81 MB - this is too large for a regular laptop to handle in memory. So how to use this data in the most computationally efficient manner?
Computational solution
TileDB-SOMA, with its out-of-memory capabilities to facilitate data querying, and potentially ML, seems like a good solution. This is pretty important because I will be using a 115 GB dataset to build a DB and perform queries on a 16 GB RAM laptop. Normally this would require loading the data into memory, but I will show it doesn’t have to be that way.
In addition, it implements the CZI data standard for single-cell data (SOMA), which means I could leverage this API for some of the work.
Not that this is super relevant, but the laptop is running Ubuntu server edition, which means that those 16 GB stretch further than they would on a Desktop edition - the GUI (Gnome) takes away quite a bit of memory.
I will again use a mix of R - to take advantage of existing and mature genomics packages - and Python, because the TileDB-SOMA Python API is slightly more mature.
Some concepts
LINCS2
What is LINCS2? It’s a data set of gene expression profiles obtained after different perturbations:
It was generated in many cell lines
Includes chemical (small molecules) and genetic perturbations (CRISPR, shRNAs, but also over-expression)
The data came out in several iterations, with LINCS 2020 being the final release.
This is a pretty comprehensive data set:
The LINCS 2020 beta release data set contains 1.2 million signatures with 720,216 compound treatment GESs of 34,418 compounds, 34,171 gene over-expression on 4,040 genes, 318,208 gene knockdowns on 7,976 genes using shRNAs on 4,917 genes (177263) and CRISPR on 5,158 genes (140945). Out of 720,216 compound treatments, it includes 34,418 drug-like small molecules tested on 230 different cell lines at multiple concentrations and treatment times. To minimize redundancy of perturbagens having many signatures in different cell lines, dosage and treatment times, the ‘exemplar’ signature for each perturbagen in selected cell lines was assembled. These signatures are annotated from CLUE group and are generally picked based on TAS (Transcriptional Activity Score), such that the signature with the highest TAS is chosen as exemplar. The generated LINCS2 dataset contains moderated z-scores from DE analysis of 12,328 genes from 30,456 compound treatments of 58 cell lines corresponding to a total of 136,460 signatures. It is exactly the same as the reference database used for the Query Tool in CLUE website. Like the LINCS database, users have the option to assemble any custom collection from the original LINCS 2020 beta release dataset. source
L1000
What should also be noted is that the LINCS data is not a genome-wide gene expression survey. The method used is a bead-hybridization assay in which only 978 genes are directly measured - thus L1000 (I allow the creative license for the rounding up). The authors demonstrated that 1000 genes were enough to create a model which captured nearly all gene expression variation and could be used to estimate (predict) the expression of another 11,350 genes (see).
Why only measure 1000 genes instead of using RNA-seq?
This makes the method extremely cheap vs conventional sequencing. At the time Perturb-seq had just been published, and DRUG-seq (which I quite like!) was not a thing. We have to keep in mind that this dataset is made of ~3 million data points (a data point being a well with treated cells). At a very generous estimate of €50 per data point for a DRUG-seq type sequencing run, that’s about €150 million total. And that excludes any other costs associated with the experiments (staff, reagents, instrumentation). It’s a lot, so L1000, at around €2 per data point (~€6 million total), starts looking like a good deal.
The price advantage of L1000 was not enough for the method to take off, and to the best of my knowledge it hasn’t been used outside of LINCS and similar projects. Besides only measuring ~1000 genes, which need to be well expressed, another disadvantage is that the readout is noisier than sequencing. With decreasing sequencing prices, and novel single-cell perturbation methodologies coming into the space, the price advantage became less important.
The LINCS dataset itself is still pretty unique though.
CMAP
CMAP stands for connectivity map and is the database of non-redundant perturbation signatures, one per compound / RNAi per cell line, which allows researchers to match their own signatures against the database. A typical use case would be:
A gene expression signature for a compound of unknown activity is obtained, and using CMAP it can be matched against existing signatures. The closest matches will give clues as to the most likely MoA, class of compounds, or gene/protein targets. At least in theory. In practice the query results tend to be quite messy and require a lot of cleaning up before one can gain any meaningful insights - at least in my experience.
Note
A gene signature is basically a ranked list of up- and down-regulated genes. There are a number of methods (similarity metrics) which can be used to retrieve the best matches - quite a few of them implemented in signatureSearch. The method used in LINCS is a variation of the Kolmogorov-Smirnov enrichment statistic (ES). These methods tend to be computationally slow.
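As a toy illustration (my own simplified version, not the exact weighted statistic LINCS or signatureSearch use), an unweighted KS-style enrichment score walks down the ranked reference signature and measures how early the query genes pile up:

```python
import numpy as np

def ks_enrichment_score(ranked_genes, query_set):
    """Toy unweighted KS-style enrichment score.

    ranked_genes: gene ids ordered from most up- to most down-regulated
                  in the reference signature.
    query_set:    set of gene ids, e.g. the top up-regulated genes of
                  the query signature.
    """
    n = len(ranked_genes)
    hits = np.array([g in query_set for g in ranked_genes])
    n_hits = hits.sum()
    if n_hits == 0:
        return 0.0
    # increment the running sum on a hit, decrement on a miss; the ES
    # is the maximum deviation of the running sum from zero
    running = np.cumsum(np.where(hits, 1.0 / n_hits, -1.0 / (n - n_hits)))
    return running[np.argmax(np.abs(running))]

# query genes at the very top of the reference ranking -> ES of +1
print(ks_enrichment_score(["g1", "g2", "g3", "g4", "g5"], {"g1", "g2"}))
```

Doing a running sum over 12,328 genes for each of hundreds of thousands of reference signatures is why these queries are slow.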
LINCS2 in TileDB: why?
As I mentioned, I have a small project in the works that will use this dataset. Whilst there is a package, signatureSearchData, that contains most of the LINCS 2020 data and wrapper code to query it, sadly it is not a viable option for me:
I am interested in the genetic perturbations which are not part of the provided DB;
Whilst signatureSearchData provides instructions to create new DBs for the genetic perturbations, the data is loaded into memory and I only have 16GB RAM available.
Note
Yes, I could use a “cheap” AWS instance or whatever, but I would need something like an r8g.8xlarge instance (256 GB RAM), which is fairly expensive for a personal budget. Secondly, if you have ever worked with AWS, cheap or simple are not words you would associate with it.
TileDB, and especially their SOMA data model, serves my purposes quite well. TileDB is an unusual but powerful proposition: it has DB in its name, but it’s less a database and more a way of structuring and connecting different data modalities. It’s also the name of the company behind the tech. Their blurb is:
TileDB is foundational software designed by scientists for scientific discovery. TileDB structures all data types, including data that does not easily fit into relational databases. Built on a powerful shape-shifting array database, TileDB handles the complexities of non-traditional “unstructured” multimodal data, such as genomic variants, bulk and single-cell transcriptomics, proteomics, biomedical imaging, as well as the frontier data of the future.
Has libraries/functions to make single-cell data analysis quite straightforward
This last part caught my eye because it means that, in principle, any gene-expression-like dataset that can be fitted into an anndata or SingleCellExperiment can leverage existing APIs - removing the need to write a lot of new (redundant) code. If you look carefully, the data spec looks very much like that of Bioconductor’s SingleCellExperiment (itself an extension of SummarizedExperiment):
Data model for TileDB-SOMA
And it turns out that the LINCS data format, GCTx, is based on HDF5 and is also conceptually similar to SingleCellExperiment, and by extension to TileDB-SOMA. This means that I could use a series of data format conversions, using existing tools, to obtain a DB which allows out-of-memory querying.
Yay for the many redundant data formats in bioinformatics.
Create TileDB
Some good folks were kind enough to wrap the data in an R / Bioconductor SummarizedExperiment object with a delayed array backend. This is important because it reduces the need to load data into memory. However, this object doesn’t include all perturbations - the genetic perturbations are missing, especially those released in 2020, which correspond to the largest set.
Another reason why I can’t use this existing signatureSearchData dataset is that only level 5 data is included. These are single vectors of gene expression, one per perturbagen and per cell line - meaning that replicate values, or multiple concentrations in the case of compound perturbations, are no longer included. This is fine for biological interpretation, but not for QC and data exploration. My database will be built on level 4 data, which includes differential expression for individual replicates vs the plate median. I want that granularity.
So how to build the tiledbsoma from the raw data (h5 and plain text files)? Since I was not keen on re-inventing the wheel, I thought I could make use of the tutorial provided in the vignette to make a SummarizedExperiment and then convert it to tiledbsoma.
Basically:
gctx -> h5 -> singleCellExperiment -> tiledbsoma
The critical aspect is the conversion to h5, which will allow out-of-memory operations.
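The conversion itself is done with existing tools, but to make the idea concrete, here is a minimal Python/h5py sketch of the same principle: copy the GCTx matrix (stored under 0/DATA/0/matrix in the HDF5 hierarchy, per the GCTx spec) into a plain h5 file slab by slab, so only one chunk of rows sits in memory at a time. File names and chunk size here are hypothetical:

```python
import h5py

SRC = "level4.gctx"        # hypothetical input file
DST = "level4_matrix.h5"   # hypothetical output file
CHUNK = 10_000             # rows per slab; tune to available RAM

with h5py.File(SRC, "r") as src, h5py.File(DST, "w") as dst:
    mat = src["0/DATA/0/matrix"]  # note: check whether rows are genes
                                  # or samples against the META ids
    out = dst.create_dataset(
        "matrix", shape=mat.shape, dtype=mat.dtype, chunks=True
    )
    # stream the matrix slab by slab so it never sits in RAM whole
    for start in range(0, mat.shape[0], CHUNK):
        stop = min(start + CHUNK, mat.shape[0])
        out[start:stop, :] = mat[start:stop, :]
```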
1. Data download and prep
The data is stored in GEO, but the most convenient locations are the S3 download links provided by clue.io:
Before we can start any format conversion, there is of course some data cleaning involved. I’ve learned the hard way that to create level 4 databases, and possibly level 3, we need to obtain the column names from the gctx file - otherwise it’s a mess.
The data preparation step consists of collecting the row (sample) and column ids (genes) to ingest later. This is necessary to avoid issues with data points which are present in the full sample metadata table but not present in the matrix for a particular data level - missing data due to not passing QC, for example.
The column names are also formatted to make them computer-friendly - another one of those things that will cause headaches down the line if not done upfront.
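A sketch of what collecting and cleaning those ids could look like with h5py; per the GCTx spec the ids live under 0/META/ROW/id and 0/META/COL/id (the file name and the cleaning rule here are mine):

```python
import re
import h5py

def clean_id(raw: bytes) -> str:
    """Make an id computer-friendly: decode, strip whitespace and
    replace anything non-alphanumeric with an underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", raw.decode("utf-8").strip())

with h5py.File("level4.gctx", "r") as f:  # hypothetical file name
    row_ids = [clean_id(x) for x in f["0/META/ROW/id"][:]]
    col_ids = [clean_id(x) for x in f["0/META/COL/id"][:]]

# later, drop sample-metadata entries whose ids are absent from the
# matrix at this data level, e.g. wells that failed QC
present = set(col_ids)
```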
This is OK but only contains the matrix with the expression values, plus column and row names. Not of much use as is. So I’ll need to add some information about the perturbations (and the genes).
Now that the sce object has the assay data (gene expression matrix), colData (perturbation metadata) and rowData (gene IDs), we can finally save it as an anndata object. This is basically Python’s version of the SingleCellExperiment object, but it’s supported by TileDB-SOMA for chunked ingestion (goes easy on the memory):
Anndata object diagram
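The conversion from SingleCellExperiment is done with existing tooling, but to make the target object concrete, here is roughly what assembling the equivalent AnnData directly in Python would look like (file and column names are hypothetical):

```python
import anndata as ad
import h5py
import pandas as pd

# hypothetical inputs: the h5 matrix from the conversion step plus the
# perturbation (sample) and gene metadata tables
with h5py.File("level4_matrix.h5", "r") as f:
    X = f["matrix"][:]  # caveat: this loads X into RAM; for the full
                        # 115 GB matrix you would keep it on disk and
                        # work in backed mode instead

obs = pd.read_csv("instinfo.csv", index_col="sample_id")  # perturbations
var = pd.read_csv("geneinfo.csv", index_col="gene_id")    # genes

adata = ad.AnnData(X=X, obs=obs, var=var)  # X must be samples x genes
adata.write_h5ad("lincs_level4.h5ad")
```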
One would expect that chunked ingestion would also be possible from a SingleCellExperiment, but the R API is lagging behind:
However, the Python API does support chunked-ingests of H5AD files via AnnData’s “backed mode.” This means the data from the .X matrix is streamed directly from the H5AD file into the SOMA array on disk, without loading the entire matrix into RAM. https://github.com/single-cell-data/TileDB-SOMA/issues/4291#issuecomment-3434169342
So let’s do that extra step and save the anndata as h5ad:
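With the H5AD on disk, the chunked ingest quoted above is a single call on the Python side. A minimal sketch, with hypothetical paths and my own choice of measurement name:

```python
import tiledbsoma.io

# stream the H5AD into a new SOMA experiment on disk; .X is read
# chunk-wise ("backed mode") instead of being loaded into RAM whole
tiledbsoma.io.from_h5ad(
    experiment_uri="lincs2020_soma",  # hypothetical output location
    input_path="lincs_level4.h5ad",   # the file written above
    measurement_name="level_4",
)
```

Reading the obs slot back from the resulting experiment returns the perturbation metadata, now with a soma_joinid column added by SOMA: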
# A tibble: 6 × 33
soma_joinid obs_id bead_batch nearest_dose pert_dose pert_dose_unit pert_idose
<int> <chr> <fct> <dbl> <dbl> <fct> <fct>
1 0 ABY00… b15 NA NA "" ""
2 1 ABY00… b15 NA NA "" ""
3 2 ABY00… b15 NA NA "" ""
4 3 ABY00… b15 NA NA "" ""
5 4 ABY00… b15 NA NA "" ""
6 5 ABY00… b15 NA NA "" ""
# ℹ 26 more variables: pert_time <dbl>, pert_itime <fct>, pert_time_unit <fct>,
# cell_mfc_name <fct>, pert_mfc_id <chr>, det_plate <fct>, det_well <fct>,
# rna_plate <fct>, rna_well <fct>, count_mean <int>, count_cv <int>,
# qc_f_logp <dbl>, qc_iqr <dbl>, qc_slope <int>, pert_id <chr>,
# sample_id <chr>, pert_type <fct>, cell_iname <fct>, qc_pass <int>,
# dyn_range <dbl>, inv_level_10 <dbl>, build_name <lgl>, failure_mode <fct>,
# project_code <fct>, cmap_name <chr>, new_id <chr>
Code
toc()
6.229 sec elapsed
Now let’s get some expression data for a subset of the array:
Code
tic()
## format is interesting with columns for X, Y and data. Not a rectangular format as we would expect
experiment$ms$get("level_4")$X$get("data")$read(coords = list(0:10L, 0:10L))$tables()$concat()$to_data_frame() %>%
  tibble::as_tibble()
What we would normally want to do, though, is to filter the data with a combination of perturbation attributes and/or obtain the expression of certain genes. Here I was asking for the expression values of GRIN1 and GRIN2B in all the genetic perturbations of GRIN1 that pass experimental QC:
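The same kind of query can be expressed through the Python API with an axis query and value filters. A sketch against my experiment, with filter values guessed from the obs table above (the pert_type codes follow the LINCS metadata conventions, e.g. trt_sh / trt_xpr for knockdown / knockout):

```python
import tiledbsoma

# open the experiment created earlier ("lincs2020_soma" is my URI)
with tiledbsoma.Experiment.open("lincs2020_soma") as exp:
    query = exp.axis_query(
        measurement_name="level_4",
        obs_query=tiledbsoma.AxisQuery(
            # genetic perturbations of GRIN1 that pass QC; the exact
            # codes and values should be checked against the metadata
            value_filter="cmap_name == 'GRIN1' and "
                         "pert_type in ['trt_sh', 'trt_xpr'] and "
                         "qc_pass == 1",
        ),
        var_query=tiledbsoma.AxisQuery(
            # assumes the gene symbols ended up as the var index (var_id)
            value_filter="var_id in ['GRIN1', 'GRIN2B']",
        ),
    )
    adata = query.to_anndata(X_name="data")  # small result; fits in RAM
```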
There are some downsides though. The main one is that doing these operations, mostly on-disk, is very slow. Each of these transformations takes several hours to a day. This is not unexpected because we are talking about >100 GB of data being passed around in a fairly low-end system.
This brings me to the second point: it uses a lot of disk space. As far as I can tell none of these data formats, including TileDB, does any meaningful compression, so a 500 GB disk was not enough and I had to use a mounted remote disk. It is in my home network, but it still impacts speed massively.
Comments
I welcome any comments or suggestions at amjdomingues at gmail.com or ping me on Linkedin / Mastodon