Datasets

There is a growing need for data-driven approaches to problems in Raman spectroscopy. Yet the lack of large, high-quality datasets has been a major bottleneck in developing and validating new algorithms and pipelines.

To accelerate algorithmic and pipeline development, RamanSPy provides several large, well-curated Raman spectroscopic datasets contributed by researchers around the world for different modelling and predictive tasks. With RamanSPy, users can readily access these datasets via established data loading methods and experiment with them in a variety of applications.

Datasets are available in the ramanspy.datasets module.

Bacteria data

ramanspy.datasets.bacteria(dataset='train', folder=None) → Tuple[SpectralContainer, ndarray]

Raman spectra acquired from different bacterial and yeast isolates.

>80k spectra across 30+ isolates. Ideal for classification modelling.

Data from Ho, CS. et al. (2019).

Must be downloaded first. Provided by the authors on Dropbox.

Parameters:

dataset (str, default='train') –
Which bacteria dataset to load.

Available datasets are:
- 'train' - 60k spectra, 2k for each of 30 different reference bacterial and yeast isolates;
- 'val' - 3k spectra, 100 spectra for each of the reference isolates;
- 'test' - 3k spectra, 100 spectra for each of the reference isolates;
- 'clinical2018' - 12k spectra, 400 spectra for each of 30 patient isolates (distributed across 5 species);
- 'clinical2019' - 2.5k spectra, 100 spectra for each of 25 patient isolates (distributed across 5 species);
- 'labels' - The names of the species and antibiotics corresponding to the 30 classes.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location. Irrelevant if dataset='labels'.

Returns:

SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which bacteria species each data point corresponds to.

References

Ho, CS., Jean, N., Hogan, C.A. et al. Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning. Nat Commun 10, 4927 (2019).

Examples:

import ramanspy as rp

# Load training and testing datasets
X_train, y_train = rp.datasets.bacteria("train", folder="path/to/data")
X_test, y_test = rp.datasets.bacteria("test", folder="path/to/data")

# Load the names of the species and antibiotics corresponding to the 30 classes
y_labels, antibiotics_labels = rp.datasets.bacteria("labels")
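
The split sizes listed under Parameters can be sanity-checked with a little arithmetic. The sketch below is not part of RamanSPy: SPLITS and expected_spectra are hypothetical names, and the counts are the nominal figures quoted in the parameter list (the real files may differ slightly).

```python
# Nominal split sizes as quoted in the parameter list:
# (number of isolates, spectra per isolate) for each dataset name.
SPLITS = {
    "train": (30, 2000),
    "val": (30, 100),
    "test": (30, 100),
    "clinical2018": (30, 400),
    "clinical2019": (25, 100),
}


def expected_spectra(dataset: str) -> int:
    """Return the nominal number of spectra N for a given split (hypothetical helper)."""
    if dataset not in SPLITS:
        raise ValueError(f"Unknown dataset {dataset!r}; expected one of {sorted(SPLITS)}")
    n_isolates, per_isolate = SPLITS[dataset]
    return n_isolates * per_isolate


sizes = {name: expected_spectra(name) for name in SPLITS}
total = sum(sizes.values())
print(sizes)
print(total)
```

The total comes out just above 80,000, consistent with the ">80k spectra" figure in the summary. The same table could be used to check the first dimension of spectral_data after loading a split.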

Volumetric cell data

ramanspy.datasets.volumetric_cells(cell_type='THP-1', folder=None) → List[SpectralVolume]

A single volumetric scan of hiPSC cells.

Data from Kallepitis et al. (2017).

Must be downloaded first. Provided by authors on Zenodo.

Parameters:

cell_type (str, default='THP-1') –
The cell type to load. Supported cell types are:
- 'THP-1' - THP-1 cells (n=4);
folder (str, default=None) – Path to the folder containing the data. If None, will use the root location.

Returns:

A collection of volumetric data of the given cell type.

Return type:

list[SpectralVolume]

References

Kallepitis, C., Bergholt, M., Mazo, M. et al. Quantitative volumetric Raman imaging of three dimensional cell cultures. Nat Commun 8, 14843 (2017).

Examples:

import ramanspy as rp

cells_volume = rp.datasets.volumetric_cells(cell_type='THP-1', folder="path/to/data")

MDA-MB-231 cells data

ramanspy.datasets.MDA_MB_231_cells(dataset='train', folder=None) → Tuple[SpectralContainer, SpectralContainer]

170k pairs of low- and high-SNR data.

Ideal for developing and validating denoising models and algorithms.

Data from Horgan, C.C. et al. (2021).

Must be downloaded first. Provided by authors on Google Drive.

All spectra have a spectral dimensionality of 500, covering the range (500, 1800) cm⁻¹.

Parameters:

dataset (str, default='train') –
Which MDA-MB-231 dataset to load.

Available datasets are:
- 'train' - Just under 160k spectra.
- 'test' - Just under 13k spectra.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location.

Returns:

SpectralContainer – Low SNR input.
SpectralContainer – The corresponding high SNR target output.

References

Horgan, C.C., Jensen, M., Nagelkerke, A., St-Pierre, J.P., Vercauteren, T., Stevens, M.M. and Bergholt, M.S., 2021. High-Throughput Molecular Imaging via Deep-Learning-Enabled Raman Spectroscopy. Analytical Chemistry, 93(48), pp.15850-15860.

Examples:

import ramanspy as rp

# Load the low-SNR inputs and the corresponding high-SNR targets
low_snr, high_snr = rp.datasets.MDA_MB_231_cells(folder="path/to/data")
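
As an illustration of how a low-/high-SNR pair might be used to score a denoiser, here is a minimal, self-contained sketch. It uses synthetic stand-in data rather than the real dataset, assumes an evenly spaced axis over the documented range, and uses a simple moving average as a placeholder for a real model. The helper names moving_average and mse are hypothetical and not part of RamanSPy.

```python
import math
import random

random.seed(0)  # make the synthetic noise reproducible

N_BANDS = 500
# Assumption: evenly spaced axis over the documented 500-1800 cm^-1 range.
axis = [500 + i * (1800 - 500) / (N_BANDS - 1) for i in range(N_BANDS)]

# Synthetic stand-ins for one (low-SNR, high-SNR) pair: a single Gaussian band.
target = [math.exp(-((x - 1000.0) ** 2) / (2 * 40.0 ** 2)) for x in axis]
noisy = [y + random.gauss(0.0, 0.1) for y in target]


def moving_average(values, window=9):
    """Smooth with a centred moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


baseline_error = mse(noisy, target)
smoothed_error = mse(moving_average(noisy), target)
print(f"MSE before smoothing: {baseline_error:.5f}")
print(f"MSE after smoothing:  {smoothed_error:.5f}")
```

With the real data, the noisy and target lists would be replaced by rows taken from the spectral_data of the two returned containers, and the moving average by the model under evaluation.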

COVID-19 data

ramanspy.datasets.covid19(file) → Tuple[SpectralContainer, ndarray, ndarray]

Raman spectra acquired from patients with COVID-19 and healthy controls.

Data from Yin G. et al. (2021).

Must be downloaded first. Available on Kaggle.

Parameters:

file (str) – Path to the file containing the downloaded data.

Returns:

SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which group each data point corresponds to.
np.ndarray[string] of shape (B, ) – The names of the labels.

References

Yin G, Li L, Lu S, Yin Y, Su Y, Zeng Y, Luo M, Ma M, Zhou H, Orlandini L, Yao D. An efficient primary screening of COVID‐19 by serum Raman spectroscopy. Journal of Raman Spectroscopy. 2021 May;52(5):949-58.

Yin G, Li L, Lu S, Yin Y, Su Y, Zeng Y, Luo M, Ma M, Zhou H, Yao D, Liu G, Lang J. Data and code on serum Raman spectroscopy as an efficient primary screening of coronavirus disease in 2019 (COVID-19). figshare; 2020.

Examples:

import ramanspy as rp

# Load dataset
spectra, labels, label_names = rp.datasets.covid19(file="path/to/data")
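
The integer labels and the label names returned above are typically combined to recover a readable group name per spectrum. The sketch below assumes that each integer label indexes into the array of names. The values are hypothetical stand-ins, not the actual contents of the dataset.

```python
from collections import Counter

# Hypothetical stand-ins for the `labels` and `label_names` return values.
labels = [0, 1, 1, 0, 1, 1]
label_names = ["group_a", "group_b"]

# Assumption: integer label i refers to label_names[i].
named_labels = [label_names[i] for i in labels]

# Count how many spectra fall into each group.
group_counts = Counter(named_labels)
print(named_labels)
print(dict(group_counts))
```

Checking the group counts before modelling is a quick way to spot class imbalance.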

Adenine data

ramanspy.datasets.adenine(file=None, download=True) → Tuple[SpectralContainer, ndarray, ndarray]

Raman spectra acquired from samples representing different levels of adenine concentrations.

Data from Fornasaro, Stefano, et al. (2020).

Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the file to be loaded.

Available on Zenodo.

Parameters:

file (str, default=None) – Path to the file containing the downloaded data. Not used if download=True.
download (bool, default=True) – If True, will download the data from Zenodo, which may take some time. Otherwise, will look for the data at the path given by file.

Returns:

SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
pandas.DataFrame of shape (N, 8) – 8 additional features indicating sample collection parameters.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating the adenine concentration which each data point corresponds to.

References

Fornasaro S, Alsamad F, Baia M, Batista de Carvalho LA, Beleites C, Byrne HJ, Chiadò A, Chis M, Chisanga M, Daniel A, Dybas J. Surface enhanced Raman spectroscopy for quantitative analysis: results of a large-scale European multi-instrument interlaboratory study. Analytical chemistry. 2020 Feb 11;92(5):4053-64.

Examples:

import ramanspy as rp

# Load dataset
spectra, additional_features, labels = rp.datasets.adenine()
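
Because file is ignored when download=True, a small wrapper can make the choice between the two modes explicit. The helper below is a hypothetical sketch, not part of RamanSPy. It only builds the keyword arguments, so it runs without the library or a network connection.

```python
from typing import Optional


def loader_kwargs(file: Optional[str] = None) -> dict:
    """Build keyword arguments for a loader with `file` and `download` parameters.

    Hypothetical helper: download only when no local file is given.
    """
    if file is None:
        return {"file": None, "download": True}
    return {"file": str(file), "download": False}


print(loader_kwargs())
print(loader_kwargs("path/to/file"))

# Intended use (requires ramanspy and, for the first call, network access):
# spectra, additional_features, labels = rp.datasets.adenine(**loader_kwargs())
# spectra, additional_features, labels = rp.datasets.adenine(**loader_kwargs("path/to/file"))
```

The same pattern would apply to any loader that exposes this pair of parameters.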

Wheat lines data

ramanspy.datasets.wheat_lines(file=None, download=True) → Tuple[SpectralContainer, ndarray, ndarray]

Raman spectra acquired from groups of wheat lines:

'COM' - Commercial cultivar;

'COM - 125mM' - Commercial cultivar treated with 125mM NaCl;

'ML1 - 125mM' - Mutant Line 1 treated with 125mM NaCl;

'ML2 - 125mM' - Mutant Line 2 treated with 125mM NaCl.

Data from ŞEN A. et al. (2023).

Available on Zenodo.

Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the file to be loaded.

Parameters:

file (str, default=None) – Path to the file containing the downloaded data. Not used if download=True.
download (bool, default=True) – If True, will download the data from Zenodo, which may take some time. Otherwise, will look for the data at the path given by file.

Returns:

SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which group each data point corresponds to.
np.ndarray[string] of shape (B, ) – The names of the labels.

References

ŞEN A, Kecoglu I, Ahmed M, Parlatan U, Unlu M. Differentiation of advanced generation mutant wheat lines: Conventional techniques versus Raman spectroscopy. Frontiers in Plant Science. 2023;14.

Examples:

import ramanspy as rp

# Load dataset
spectra, labels, label_names = rp.datasets.wheat_lines()

RRUFF data

ramanspy.datasets.rruff(dataset: str, folder=None, download: bool = True) → Tuple[List[SpectralContainer], List[dict]]

Raman spectra acquired from various minerals.

Data from the RRUFF database.

Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the files to be loaded.

Parameters:

dataset (str) – The name of the RRUFF Raman dataset to load. Check available datasets here.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location. Irrelevant if download=True.
download (bool, optional, default=True) – Whether to download the specified dataset or load it from a local directory. If download=False, all .txt files from the directory provided via dataset will be loaded.

Returns:

list[Spectrum] – The Raman spectra provided.
list[dict] – List of metadata dictionaries, extracted from the header of the RRUFF data file.

References

Lafuente B, Downs R T, Yang H, Stone N (2015) The power of databases: the RRUFF project. In: Highlights in Mineralogical Crystallography, T Armbruster and R M Danisi, eds. Berlin, Germany, W. De Gruyter, pp 1-30.

Examples:

import ramanspy as rp

# Download the dataset from the Internet
spectra, metadata = rp.datasets.rruff('fair_oriented')

# Load the dataset from the given folder
spectra, metadata = rp.datasets.rruff('path/to/dataset/folder/fair_oriented', download=False)
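
To give a sense of what "metadata extracted from the header" might look like, the sketch below parses RRUFF-style text into a metadata dictionary and a list of data points. It assumes header lines of the form ##KEY=value followed by comma-separated shift and intensity pairs. The function parse_rruff_text and the sample values are hypothetical, and this is not necessarily how RamanSPy reads the files.

```python
def parse_rruff_text(text):
    """Split RRUFF-style text into a metadata dict and (shift, intensity) pairs."""
    metadata = {}
    points = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Header line, assumed to look like "##KEY=value".
            key, _, value = line[2:].partition("=")
            if key.strip().upper() != "END":
                metadata[key.strip()] = value.strip()
            continue
        # Data line, assumed to look like "shift, intensity".
        shift, intensity = (float(part) for part in line.split(","))
        points.append((shift, intensity))
    return metadata, points


# Placeholder content for illustration only (not a real RRUFF record).
sample = """##NAMES=ExampleMineral
##RRUFFID=R000000
100.0, 0.5
101.0, 0.75
##END=
"""

sample_metadata, sample_points = parse_rruff_text(sample)
print(sample_metadata)
print(sample_points)
```

Each dictionary in the list returned by the loader would then pair with the spectrum at the same index.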