Datasets
There is an increasing need for new data-driven approaches to solve problems in the field of Raman spectroscopy. Yet, the lack of large, high-quality datasets has been a major bottleneck in the development and validation of new algorithms and pipelines.
To accelerate algorithmic and pipeline development, RamanSPy provides several big, well-curated Raman spectroscopic datasets acquired from researchers around the world for different modelling and predictive tasks. With RamanSPy, users can readily access these datasets via established data loading methods and experiment with them in a variety of applications.
Datasets are available in the ramanspy.datasets module.
Bacteria data
- ramanspy.datasets.bacteria(dataset='train', folder=None) → Tuple[SpectralContainer, ndarray]
Raman spectra acquired from different bacterial and yeast isolates.
>80k spectra across 30+ isolates. Ideal for classification modelling.
Data from Ho, CS. et al. (2019).
Must be downloaded first. Provided by the authors on Dropbox.
- Parameters:
dataset (str, default='train') –
Which bacteria dataset to load.
Available datasets are:
- 'train' - 60k spectra, 2k for each of 30 different reference bacterial and yeast isolates;
- 'val' - 3k spectra, 100 spectra for each of the reference isolates;
- 'test' - 3k spectra, 100 spectra for each of the reference isolates;
- 'clinical2018' - 12k spectra, 400 spectra for each of 30 patient isolates (distributed across 5 species);
- 'clinical2019' - 2.5k spectra, 100 spectra for each of 25 patient isolates (distributed across 5 species);
- 'labels' - The names of the species and antibiotics corresponding to the 30 classes.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location. Irrelevant if dataset='labels'.
- Returns:
SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which bacteria species each data point corresponds to.
References
Ho, CS., Jean, N., Hogan, C.A. et al. Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning. Nat Commun 10, 4927 (2019).
Examples:
    import ramanspy as rp

    # Load training and testing datasets
    X_train, y_train = rp.datasets.bacteria("train", folder="path/to/data")
    X_test, y_test = rp.datasets.bacteria("test", folder="path/to/data")

    # Load the names of the species and antibiotics corresponding to the 30 classes
    y_labels, antibiotics_labels = rp.datasets.bacteria("labels")
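As a quick sanity check on the split sizes listed under Parameters, the per-isolate counts can be multiplied out. The sketch below uses only the numbers stated in this entry and does not touch RamanSPy or the data itself; the names `splits` and `sizes` are illustrative choices, not part of the library.

```python
# (number of isolates, spectra per isolate) for each split, as documented above
splits = {
    "train": (30, 2000),
    "val": (30, 100),
    "test": (30, 100),
    "clinical2018": (30, 400),
    "clinical2019": (25, 100),
}

# Total number of spectra in each split
sizes = {
    name: n_isolates * per_isolate
    for name, (n_isolates, per_isolate) in splits.items()
}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size} spectra")
print(f"total: {total} spectra")
```

The total across the five splits comes to 80,500 spectra, which agrees with the ">80k spectra" figure quoted for this dataset.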
Volumetric cell data
- ramanspy.datasets.volumetric_cells(cell_type='THP-1', folder=None) → List[SpectralVolume]
A single volumetric scan of hiPSC cells.
Data from Kallepitis et al. (2017).
Must be downloaded first. Provided by authors on Zenodo.
- Parameters:
cell_type (str, default='THP-1') –
The cell type to load. Supported cell types are:
'THP-1'- THP-1 cells (n=4);
folder (str, default=None) – Path to the folder containing the data. If None, will use the root location.
- Returns:
A collection of volumetric data of the given cell type.
- Return type:
list[SpectralVolume]
References
Kallepitis, C., Bergholt, M., Mazo, M. et al. Quantitative volumetric Raman imaging of three dimensional cell cultures. Nat Commun 8, 14843 (2017).
Examples:
    import ramanspy as rp

    # Load the volumetric scans for the selected cell type
    cells_volume = rp.datasets.volumetric_cells(cell_type='THP-1', folder="path/to/data")
MDA-MB-231 cells data
- ramanspy.datasets.MDA_MB_231_cells(dataset='train', folder=None) → Tuple[SpectralContainer, SpectralContainer]
170k pairs of low- and high-SNR data.
Ideal for developing and validating denoising models and algorithms.
Data from Horgan, C.C. et al. (2021).
Must be downloaded first. Provided by authors on Google Drive.
All data has a spectral dimensionality of 500, in the range (500, 1800) cm⁻¹.
- Parameters:
dataset (str, default='train') –
Which dataset to load.
Available datasets are:
- 'train' - Just under 160k spectra.
- 'test' - Just under 13k spectra.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location.
- Returns:
SpectralContainer – Low SNR input.
SpectralContainer – The corresponding high SNR target output.
References
Horgan, C.C., Jensen, M., Nagelkerke, A., St-Pierre, J.P., Vercauteren, T., Stevens, M.M. and Bergholt, M.S., 2021. High-Throughput Molecular Imaging via Deep-Learning-Enabled Raman Spectroscopy. Analytical Chemistry, 93(48), pp.15850-15860.
Examples:
    import ramanspy as rp

    # Load paired low-SNR inputs and high-SNR targets (training split by default)
    low_snr, high_snr = rp.datasets.MDA_MB_231_cells(folder="path/to/data")
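Because the function returns paired low- and high-SNR spectra, a denoising model can be scored by comparing its output against the high-SNR target. The sketch below computes a mean squared error on small synthetic stand-ins in plain Python. The `mse` helper and the toy data are assumptions made for illustration and are not part of RamanSPy.

```python
import random


def mse(predicted, target):
    """Mean squared error between two equally sized collections of spectra."""
    total, count = 0.0, 0
    for pred_spectrum, target_spectrum in zip(predicted, target):
        for p, t in zip(pred_spectrum, target_spectrum):
            total += (p - t) ** 2
            count += 1
    return total / count


random.seed(0)
n_spectra, n_bands = 4, 500  # 500 bands matches the documented spectral dimensionality

# Toy "high-SNR" targets with a single flat peak, and noisy "low-SNR" inputs
# (synthetic values, not real data)
high_snr = [
    [1.0 if 200 <= b < 220 else 0.0 for b in range(n_bands)]
    for _ in range(n_spectra)
]
low_snr = [[v + random.gauss(0.0, 0.1) for v in spectrum] for spectrum in high_snr]

# Baseline: error of the raw low-SNR input against the target
baseline = mse(low_snr, high_snr)
print(f"baseline MSE: {baseline:.4f}")
```

The error of the unprocessed low-SNR input serves as a baseline: a denoiser that is doing useful work should score below it on held-out pairs.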
COVID-19 data
- ramanspy.datasets.covid19(file) → Tuple[SpectralContainer, ndarray, ndarray]
Raman spectra acquired from patients with COVID-19 and healthy controls.
Data from Yin G. et al. (2021).
Must be downloaded first. Available on Kaggle.
- Parameters:
file (str, default=None) – Path to the file containing the downloaded data.
- Returns:
SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which group each data point corresponds to.
np.ndarray[string] of shape (B, ) – The names of the labels.
References
Yin G, Li L, Lu S, Yin Y, Su Y, Zeng Y, Luo M, Ma M, Zhou H, Orlandini L, Yao D. An efficient primary screening of COVID‐19 by serum Raman spectroscopy. Journal of Raman Spectroscopy. 2021 May;52(5):949-58.
Yin G, Li L, Lu S, Yin Y, Su Y, Zeng Y, Luo M, Ma M, Zhou H, Yao D, Liu G, Lang J. Data and code on serum Raman spectroscopy as an efficient primary screening of coronavirus disease in 2019 (COVID-19). figshare; 2020.
Examples:
    import ramanspy as rp

    # Load dataset
    spectra, labels, label_names = rp.datasets.covid19(file="path/to/data")
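The integer labels and the label names returned by the loader can be combined to inspect class balance. The sketch below assumes that each integer label indexes into the array of label names, which this entry does not state explicitly. The toy values, including the group names, are hypothetical stand-ins for the real return values.

```python
from collections import Counter

# Synthetic stand-ins for the `labels` and `label_names` return values
# (hypothetical values, not taken from the dataset)
labels = [0, 1, 1, 0, 1, 1]
label_names = ["control", "covid"]

# Assumption: integer label i refers to label_names[i]
named = [label_names[i] for i in labels]
counts = Counter(named)

for name, count in counts.items():
    print(f"{name}: {count} spectra")
```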
Adenine data
- ramanspy.datasets.adenine(file=None, download=True) → Tuple[SpectralContainer, ndarray, ndarray]
Raman spectra acquired from samples representing different levels of adenine concentrations.
Data from Fornasaro, Stefano, et al. (2020).
Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the file to be loaded.
Available on Zenodo.
- Parameters:
file (str, default=None) – Path to the file containing the downloaded data. Not used if download=True.
download (bool, default=True) – If True, will download the data from Zenodo. Otherwise, will look for the data in the path given by file. Note that if download=True, the data will be downloaded, which may take some time.
- Returns:
SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
pandas.DataFrame of shape (N, 8) – 8 additional features indicating sample collection parameters.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating the adenine concentration which each data point corresponds to.
References
Fornasaro S, Alsamad F, Baia M, Batista de Carvalho LA, Beleites C, Byrne HJ, Chiadò A, Chis M, Chisanga M, Daniel A, Dybas J. Surface enhanced Raman spectroscopy for quantitative analysis: results of a large-scale European multi-instrument interlaboratory study. Analytical chemistry. 2020 Feb 11;92(5):4053-64.
Examples:
    import ramanspy as rp

    # Load dataset (downloaded from Zenodo by default)
    spectra, additional_features, labels = rp.datasets.adenine()
Wheat lines data
- ramanspy.datasets.wheat_lines(file=None, download=True) → Tuple[SpectralContainer, ndarray, ndarray]
Raman spectra acquired from groups of wheat lines:
- 'COM' - Commercial cultivar;
- 'COM - 125mM' - Commercial cultivar treated with 125mM NaCl;
- 'ML1 - 125mM' - Mutant Line 1 treated with 125mM NaCl;
- 'ML2 - 125mM' - Mutant Line 2 treated with 125mM NaCl.
Data from ŞEN A. et al. (2023).
Available on Zenodo.
Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the file to be loaded.
- Parameters:
file (str, default=None) – Path to the file containing the downloaded data. Not used if download=True.
download (bool, default=True) – If True, will download the data from Zenodo. Otherwise, will look for the data in the path given by file. Note that if download=True, the data will be downloaded, which may take some time.
- Returns:
SpectralContainer with spectral_data of shape (N, B) – The Raman spectra provided in the selected dataset.
np.ndarray[int] of shape (N, ) – The corresponding labels - indicating which group each data point corresponds to.
np.ndarray[string] of shape (B, ) – The names of the labels.
References
ŞEN A, Kecoglu I, Ahmed M, Parlatan U, Unlu M. Differentiation of advanced generation mutant wheat lines: Conventional techniques versus Raman spectroscopy. Frontiers in Plant Science. 2023;14.
Examples:
    import ramanspy as rp

    # Load dataset (downloaded from Zenodo by default)
    spectra, labels, label_names = rp.datasets.wheat_lines()
RRUFF data
- ramanspy.datasets.rruff(dataset: str, folder=None, download: bool = True) → Tuple[List[SpectralContainer], List[dict]]
Raman spectra acquired from various minerals.
Data from the RRUFF database.
Can be downloaded directly through the function or downloaded separately. In the latter case, users just need to specify the location of the file to be loaded.
- Parameters:
dataset (str) – The name of the RRUFF Raman dataset to load. Check available datasets here.
folder (str, default=None) – Path to the folder containing the downloaded data. If None, will use the root location. Irrelevant if download=True.
download (bool, optional, default=True) – Whether to download the specified dataset or load it from a local directory. If download=False, all .txt files from the directory provided via dataset will be loaded.
- Returns:
list[Spectrum] – The Raman spectra provided.
list[dict] – List of metadata dictionaries, extracted from the header of the RRUFF data file.
References
Lafuente B, Downs R T, Yang H, Stone N (2015) The power of databases: the RRUFF project. In: Highlights in Mineralogical Crystallography, T Armbruster and R M Danisi, eds. Berlin, Germany, W. De Gruyter, pp 1-30.
Examples:
    import ramanspy as rp

    # Downloaded from the Internet
    rp.datasets.rruff('fair_oriented')

    # Loaded from the given folder
    rp.datasets.rruff('path/to/dataset/folder/fair_oriented', download=False)
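With download=False, the entry above states that all .txt files in the given directory are loaded. The sketch below shows what collecting those files could look like with pathlib. It is only an illustration of that description, not RamanSPy's actual implementation, and the `list_spectrum_files` helper and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_spectrum_files(folder):
    """Return the .txt files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.txt"))


# Build a throwaway directory with placeholder files to demonstrate the filter
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mineral_a.txt", "mineral_b.txt", "notes.md"):
        (Path(tmp) / name).write_text("placeholder")

    # Only the .txt files are picked up; notes.md is ignored
    found = [path.name for path in list_spectrum_files(tmp)]

print(found)
```

Checking the folder this way before calling the loader can help confirm that it contains only the spectrum files intended for loading.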
See also
Check relevant tutorials in the Datasets and metrics section for more information about how to use ramanspy to load data from the datasets built into the package.