| Title: | Optimal Distribution Preserving Down-Sampling of Bio-Medical Data |
|---|---|
| Description: | An optimized method for distribution-preserving class-proportional down-sampling of bio-medical data <doi:10.1371/journal.pone.0255838>. |
| Authors: | Jorn Lotsch [aut, cre] (ORCID: <https://orcid.org/0000-0002-5818-6958>), Sebastian Malkusch [aut] (ORCID: <https://orcid.org/0000-0001-6766-140X>), Alfred Ultsch [aut] (ORCID: <https://orcid.org/0000-0002-7845-3283>) |
| Maintainer: | Jorn Lotsch <[email protected]> |
| License: | GPL-3 |
| Version: | 1.6 |
| Built: | 2026-06-25 08:48:45 UTC |
| Source: | https://github.com/jornlotsch/opdisdownsampling |
Data set of 6 flow cytometry-based lymphoma markers from 55,843 cells from healthy subjects (class 1) and 55,843 cells from lymphoma patients (class 2).
    data("FlowcytometricData")
Size 111686 x 6, stored in FlowcytometricData$[Var_1,Var_2,Var_3,Var_4,Var_5,Var_6]
Classes 2, stored in FlowcytometricData$Cls
    data(FlowcytometricData)
    str(FlowcytometricData)
Functions for recovering the original seed value that produced the current random number generator state. Provides both R and C++ implementations with the C++ version offering significantly improved performance for large search spaces.
    get_seed(range = NULL, fallback_seed = 42, max_search = 2147483647,
             step_size = 50000, use_cpp = TRUE, ...)

    get_seed_cpp(range = NULL, fallback_seed = 42, max_search = 2147483647,
                 step_size = 50000, batch_size = 10000, verbose = TRUE)
| Argument | Description |
|---|---|
| range | Optional integer vector of specific seed values to search. If provided, only these seeds are tested instead of a systematic range search. |
| fallback_seed | Integer seed value to return if no matching seed is found during the search (default: 42). |
| max_search | Maximum seed value to search up to when performing a systematic range search. Must be a positive integer within the valid range for R's random number generator (default: 2147483647). |
| step_size | Step size for the systematic range search when no specific range is provided. Larger values speed up the search but may miss the target seed if it falls between steps (default: 50000). |
| use_cpp | Logical; if |
| batch_size | Integer specifying the number of seeds to process in each C++ batch operation. Larger batches reduce per-batch overhead but require more RAM. Only used in |
| verbose | Logical; if |
| ... | Additional arguments passed to |
The functions work by systematically testing seed values to find one that reproduces the current RNG state stored in .Random.seed. The search process:
- Tests each candidate seed by setting it and comparing the resulting RNG state
- Uses an efficient C++ implementation for faster processing of large search spaces
- Supports both targeted searching (via the range parameter) and systematic range searching
- Employs batched processing to optimize memory usage and performance
Performance Considerations:
The C++ implementation (get_seed_cpp()) provides significant performance improvements:
- Batch processing reduces overhead for large search spaces
- Optimized memory management prevents excessive RAM usage
- Native C++ random number generation matching R's implementation
- Progress reporting for long-running searches
Search Strategy:
- If range is provided: tests only the specified seed values
- If range is NULL: performs a systematic search from 1 to max_search in steps of step_size
- The search terminates immediately when a matching seed is found
- Returns fallback_seed if no match is found within the search parameters
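The search strategy above can be illustrated with a minimal base R sketch. The function name `find_seed_sketch` and its arguments are hypothetical and are not part of the package; the sketch only mirrors the described idea of setting each candidate seed and comparing the resulting `.Random.seed` with the current state. It assumes that no random numbers have been drawn since the seed was set, because drawing numbers advances the state.

```r
# Hypothetical helper, not the package's implementation: test candidate
# seeds against the current RNG state and return the first match.
find_seed_sketch <- function(candidates, fallback_seed = 42L) {
  if (!exists(".Random.seed", envir = globalenv())) {
    stop("No active RNG state: .Random.seed does not exist")
  }
  target <- get(".Random.seed", envir = globalenv())
  # Restore the caller's RNG state however the function exits
  on.exit(assign(".Random.seed", target, envir = globalenv()), add = TRUE)

  for (s in candidates) {
    set.seed(s)
    if (identical(get(".Random.seed", envir = globalenv()), target)) {
      return(as.integer(s))  # stop at the first matching seed
    }
  }
  as.integer(fallback_seed)  # no candidate reproduced the state
}

set.seed(123)
recovered <- find_seed_sketch(1:200)

set.seed(123)
not_found <- find_seed_sketch(1:10, fallback_seed = 42L)
```

In this sketch `recovered` holds the matching candidate, while `not_found` holds the fallback value because the true seed lies outside the tested candidates.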
Memory Management:
The C++ implementation uses batched processing controlled by batch_size to:
- Process large search ranges without excessive memory allocation
- Provide regular progress updates during long searches
- Allow interruption of long-running operations
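To give a feel for the size of the default search and for how batching divides it, the following sketch builds the candidate grid in plain R. It assumes the grid starts at 1 and advances by `step_size`, as stated under Search Strategy; how the C++ code actually lays out its batches is not documented here, so the `split()` call is only an illustration.

```r
# Default search parameters taken from the usage section
max_search <- 2147483647
step_size  <- 50000
batch_size <- 10000

# Candidate grid: 1, 1 + step_size, 1 + 2 * step_size, ... up to max_search
grid <- seq(1, max_search, by = step_size)
n_candidates <- length(grid)

# A seed such as 123 is not on this grid, so the grid search alone misses it
seed_on_grid <- 123 %in% grid

# Illustrative batching: consecutive groups of at most batch_size candidates
batches <- split(grid, ceiling(seq_along(grid) / batch_size))
n_batches <- length(batches)
```

With the defaults this yields 42,950 candidates in 5 batches. Seeds that are not of the form 1 + k * step_size are never tested, which is why passing an explicit `range` or a smaller `step_size` can be necessary.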
Returns an integer representing the seed value that reproduces the current random number generator state.
If no matching seed is found within the search parameters, returns the fallback_seed value.
- Requires an active RNG state (i.e., .Random.seed must exist)
- Large search ranges may take considerable time even with C++ optimization
- The search is deterministic but computationally intensive
- Consider using smaller step_size values if the initial search fails
    ## Basic seed recovery after generating random numbers
    set.seed(123)
    recovered_seed <- get_seed()
    print(recovered_seed)
Dataset of 30000 instances with 10 variables that are Gaussian mixtures and belong to classes Cls = 1, 2, or 3, with different means and standard deviations and class weights of 0.5, 0.4, and 0.1, respectively.
    data("GMMartificialData")
Size 30000 x 10, stored in GMMartificialData$[X1,X2,X3,X4,X5,X6,X7,X8,X9,X10]
Classes 3, stored in GMMartificialData$Cls
    data(GMMartificialData)
    str(GMMartificialData)
The package provides functions for optimal distribution-preserving down-sampling of large (bio-medical) data sets. It draws statistically representative subsets of data while preserving the class proportions and original data distribution.
    opdisDownsampling(Data, Cls, Size, Seed = "simple", nTrials = 1000,
                      TestStat = "ad", MaxCores = getOption("mc.cores", 2L),
                      PCAimportance = FALSE, JobSize = 0, verbose = FALSE)
| Argument | Description |
|---|---|
| Data | Numeric data as a vector, matrix, or data frame. Each row represents an instance, each column a variable. |
| Cls | Optional vector with class labels for each instance in |
| Size | The number (integer) or proportion (0 < Size < 1) of instances to draw from the dataset. The reduction is class proportional and aims to preserve the variable distributions. |
| Seed | Seed value. Options: |
| nTrials | Number of random sampling trials used to find the optimal subset (default: 1000). |
| TestStat | Character string defining the statistical test used to assess distribution similarity. Available options are: |
| MaxCores | Maximum number of CPU cores to use for parallel computing (default is value stored in |
| PCAimportance | Logical; if |
| JobSize | Integer specifying the number of trials to process in each chunk. If |
| verbose | Logical; if |
Chunked processing can be used to reduce memory usage when dealing with large datasets or high numbers of trials. Set JobSize = NULL to enable automatic memory-aware chunk-size calculation. The automatic chunking strategy considers:
- Data size, defined by the number of rows and columns
- Available system memory, detected on Linux systems
- Number of processor cores
- Number of trials to perform

Set JobSize = 0 to process all trials in a single batch. Set JobSize to a positive integer to manually define the number of trials processed per chunk.
Variable Selection Method:
If PCAimportance = TRUE, PCA-based variable selection is used. Variables are
ranked by their loadings in the first principal components, and variables with higher
importance scores are used for distribution comparisons.
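One way to picture a loading-based ranking is the base R sketch below. It is not the package's implementation: the number of components (`n_pc`), the use of absolute loadings, and the weighting by explained variance are all assumptions made for illustration.

```r
# Illustrative PCA-based variable ranking on the iris measurements.
# Assumptions (not taken from the package): two components, absolute
# loadings, each component weighted by its share of explained variance.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

n_pc <- 2
loadings  <- abs(pca$rotation[, seq_len(n_pc), drop = FALSE])
explained <- pca$sdev[seq_len(n_pc)]^2 / sum(pca$sdev^2)

# One importance score per variable: weighted sum of its absolute loadings
importance <- drop(loadings %*% explained)
ranked <- sort(importance, decreasing = TRUE)
print(ranked)
```

Variables at the top of `ranked` would then be the ones retained for the distribution comparisons under this assumed scoring.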
Returns a list with the following elements:
| Element | Description |
|---|---|
| ReducedData | Down-sampled data set (as data frame or matrix) including only the selected instances. |
| RemovedData | Data not included in the sample. |
| ReducedInstances | Row indices (or names) of the selected instances from the original data set. |
| RemovedInstances | Row indices (or names) of the unselected instances from the original data set. |
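The roles of nTrials and TestStat can be illustrated with a simplified base R sketch of trial-based, class-proportional selection. This is not the package's algorithm: the helper names `ks_stat` and `draw_proportional` are hypothetical, a hand-computed two-sample Kolmogorov-Smirnov distance is swapped in for the default "ad" statistic, and the selection rule (smallest worst-case distance over all variables) is an assumption.

```r
# Two-sample Kolmogorov-Smirnov distance: largest gap between the two ECDFs
ks_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  max(abs(ecdf(x)(z) - ecdf(y)(z)))
}

# Draw row indices so that each class keeps its share of the data.
# Simplification: per-class counts are rounded, so the total can differ
# slightly from size when the shares are not whole numbers.
draw_proportional <- function(cls, size) {
  n_per_class <- round(table(cls) * size / length(cls))
  unlist(lapply(names(n_per_class), function(k) {
    idx <- which(cls == k)
    idx[sample.int(length(idx), n_per_class[[k]])]
  }), use.names = FALSE)
}

set.seed(42)
dat <- iris[, 1:4]
cls <- as.integer(iris$Species)
n_trials <- 100

best_stat <- Inf
best_idx <- NULL
for (trial in seq_len(n_trials)) {
  idx <- draw_proportional(cls, size = 30)
  # Worst-case distance between subset and full data over all variables
  stat <- max(vapply(seq_along(dat),
                     function(j) ks_stat(dat[idx, j], dat[[j]]),
                     numeric(1)))
  if (stat < best_stat) {
    best_stat <- stat
    best_idx <- idx
  }
}

reduced <- dat[best_idx, ]
removed <- dat[-best_idx, ]
```

Here `best_idx` plays the role of ReducedInstances, and `reduced` and `removed` correspond to ReducedData and RemovedData in the returned list.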
Jorn Lotsch
Lotsch, J., Malkusch, S., Ultsch, A. (2021): Optimal distribution-preserving downsampling of large biomedical data sets. PLoS ONE 16(8): e0255838. doi:10.1371/journal.pone.0255838
    ## Example: Down-sample the Iris dataset to 50 points
    data(iris)
    Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
                                       Size = 50, Seed = 42, MaxCores = 1)

    ## Example: Down-sample with custom chunk size and verbose output
    data(iris)
    Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
                                       Size = 50, Seed = 42, MaxCores = 1,
                                       JobSize = 25, verbose = TRUE)

    ## Example: Use PCA-based variable selection
    data(iris)
    Iris_pca <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
                                  Size = 50, Seed = 42, PCAimportance = TRUE,
                                  MaxCores = 1)

    ## Example: Memory-efficient processing of large dataset with many trials
    ## Not run:
    # For large datasets, automatic chunking can reduce memory usage
    LargeDataSample <- opdisDownsampling(Data = large_dataset, Size = 0.1, Seed = 42,
                                         nTrials = 5000, JobSize = NULL, verbose = TRUE)
    ## End(Not run)