| Title: | Euclidean Distance-Optimized Data Transformation |
|---|---|
| Description: | A data transformation method which takes into account the special property of scale non-invariance with a breakpoint at 1 of the Euclidean distance. |
| Authors: | Jorn Lotsch [aut, cre] (ORCID: <https://orcid.org/0000-0002-5818-6958>), Alfred Ultsch [aut] |
| Maintainer: | Jorn Lotsch <[email protected]> |
| License: | GPL-3 |
| Version: | 0.3.5 |
| Built: | 2026-05-22 11:03:43 UTC |
| Source: | https://github.com/jornlotsch/edotrans |
The package provides the necessary functions for performing the EDO data transformation.
EDOtrans(Data, Cls, PlotIt = FALSE, FitAlg = "normalmixEM", Criterion = "LR", MaxModes = 8, MaxCores = getOption("mc.cores", 2L), Seed = "simple")EDOtrans(Data, Cls, PlotIt = FALSE, FitAlg = "normalmixEM", Criterion = "LR", MaxModes = 8, MaxCores = getOption("mc.cores", 2L), Seed = "simple")
Data |
the data as a numerical vector (1D). |
Cls |
the class information, if any, as a vector of similar length as instances in the data. |
PlotIt |
whether to plot the fit directly. |
FitAlg |
which fit algorithm to use: "ClusterRGMM" = GMM from ClusterR, "densityMclust" from mclust, "DO" from DistributionOptimization (slow), "MCMC" = NMixMCMC from mixAK, or "normalmixEM" from mixtools. |
Criterion |
which criterion should be used to establish the number of modes from the best GMM fit: "AIC", "BIC", "FM", "GAP", "LR" (likelihood ratio test), "NbClust" (from NbClust), "SI" (Silverman). |
MaxModes |
for automated GMM assessment: the maximum number of modes to be tried. |
MaxCores |
for automated GMM assessment: the maximum number of processor cores used under Unix. |
Seed |
Seed value for reproducibility. Options: |
Returns a list of transformed data and class assignments.
DataEDO |
the EDO transformed data. |
EDOfactor |
the factor by which each data value has been divided. |
Cls |
the class information for each data instance. |
Jorn Lotsch and Alfred Ultsch
Ultsch A, Lotsch J. Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans). BMC Bioinformatics. 2022 Jun 16;23(1):233, https://doi.org/10.1186/s12859-022-04769-w
## example 1 data(iris) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = iris$Species, Seed = 42) ## example 2 data(iris) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = as.integer(iris$Species), MaxModes = 1, Seed = 42) ## example 3 data(iris) set.seed(42) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = iris$Species)## example 1 data(iris) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = iris$Species, Seed = 42) ## example 2 data(iris) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = as.integer(iris$Species), MaxModes = 1, Seed = 42) ## example 3 data(iris) set.seed(42) IrisEDOdata <- EDOtrans(Data = as.vector(iris[,1]), Cls = iris$Species)
Data set of 4 flow cytometry-based lymphoma makers from 1559 cells from healthy subjects (class 1) and 1441 cells from lymphoma patients (class 2).
data("FACSdata")data("FACSdata")
Size 3000 x 4 , stored in FACSdata$[FS,CDa,CDb,CDd]
Original classes 2, stored in FACSdata$Cls
data(FACSdata) str(FACSdata)data(FACSdata) str(FACSdata)
Functions for recovering the original seed value that produced the current random number generator state. Provides both R and C++ implementations with the C++ version offering significantly improved performance for large search spaces.
get_seed(range = NULL, fallback_seed = 42, max_search = 2147483647, step_size = 50000, use_cpp = TRUE, ...) get_seed_cpp(range = NULL, fallback_seed = 42, max_search = 2147483647, step_size = 50000, batch_size = 10000, verbose = TRUE)get_seed(range = NULL, fallback_seed = 42, max_search = 2147483647, step_size = 50000, use_cpp = TRUE, ...) get_seed_cpp(range = NULL, fallback_seed = 42, max_search = 2147483647, step_size = 50000, batch_size = 10000, verbose = TRUE)
range |
Optional integer vector of specific seed values to search. If provided, only these seeds will be tested instead of systematic range searching. |
fallback_seed |
Integer seed value to return if no matching seed is found during the search process (default: 42). |
max_search |
Maximum seed value to search up to when performing systematic range searching. Must be a positive integer within the valid range for R's random number generator (default: 2147483647). |
step_size |
Step size for systematic range searching when no specific range is provided. Larger values speed up search but may miss the target seed if it falls between steps (default: 50000). |
use_cpp |
Logical; if |
batch_size |
Integer specifying the number of seeds to process in each C++ batch operation. Larger batches are more memory efficient but require more RAM. Only used in |
verbose |
Logical; if |
... |
Additional arguments passed to |
The functions work by systematically testing seed values to find one that reproduces the current RNG state stored in .Random.seed. The search process:
Tests each candidate seed by setting it and comparing the resulting RNG state
Uses efficient C++ implementation for faster processing of large search spaces
Supports both targeted searching (via range parameter) and systematic range searching
Employs batched processing to optimize memory usage and performance
Performance Considerations:
The C++ implementation (get_seed_cpp()) provides significant performance improvements:
Batch processing reduces overhead for large search spaces
Optimized memory management prevents excessive RAM usage
Native C++ random number generation matching R's implementation
Progress reporting for long-running searches
Search Strategy:
If range is provided: Tests only the specified seed values
If range is NULL: Performs systematic search from 1 to max_search in steps of step_size
Search terminates immediately when a matching seed is found
Returns fallback_seed if no match is found within the search parameters
Memory Management:
The C++ implementation uses batched processing controlled by batch_size to:
Process large search ranges without excessive memory allocation
Provide regular progress updates during long searches
Allow interruption of long-running operations
Returns an integer representing the seed value that reproduces the current random number generator state.
If no matching seed is found within the search parameters, returns the fallback_seed value.
Requires an active RNG state (i.e., .Random.seed must exist)
Large search ranges may take considerable time even with C++ optimization
The search is deterministic but computationally intensive
Consider using smaller step_size values if the initial search fails
## Basic seed recovery after generating random numbers set.seed(123) recovered_seed <- get_seed() print(recovered_seed)## Basic seed recovery after generating random numbers set.seed(123) recovered_seed <- get_seed() print(recovered_seed)
Dataset of 3000 instances with 3 variables that are Gaussian mixtures and belong to classes Cls = 1, 2, or 3, with different means and standard deviations and equal weights of 0.7, 0.3, and 0.1, respectively.
data("GMMartificialData")data("GMMartificialData")
Size 3000 x 3, stored in GMMartificialData$[Var1,Var2,Var3]
Classes 3, stored in GMMartificialData$Cls
data(GMMartificialData) str(GMMartificialData)data(GMMartificialData) str(GMMartificialData)