generateData—A 2D data generator

generateData is a MATLAB/Octave function for generating 2D data clusters. Data is created along straight lines, which can be more or less parallel depending on the selected input parameters. The function also allows the generated data to be fine-tuned with respect to the number of clusters, total data points, average cluster separation and several other distributional properties.


Introduction
Massive amounts of data are being produced every day, raising a number of challenges regarding its storage and processing [1][2][3]. However, there are still many instances where, for the intended purposes, data is insufficient and/or expensive, thus making synthetic data generation an appealing alternative [4][5][6][7]. One such instance is data generation for evaluating clustering algorithms [8,9].
In this paper we discuss the impact of generateData, a MATLAB [10] and GNU Octave [11] function for generating 2D data primarily aimed at testing clustering algorithms. Section 2 offers an overview of how the data generation algorithm works as well as some output examples. The impact of generateData is presented in Section 3. The limitations of this work, as well as potential improvements, are discussed in Section 4.

Description
generateData is a MATLAB/Octave function for generating 2D data clusters. The function allows several characteristics of the generated data to be fine-tuned through a number of required and optional parameters, summarized in Tables 1 and 2, respectively. In any case, data is created along straight lines. The exact angle of these lines with respect to the axis is drawn from the normal distribution. The mean and standard deviation of this distribution correspond to parameters angleMean and angleStd in Table 1. The latter influences how parallel the lines supporting the data are. A standard deviation of zero yields completely parallel lines, while higher values increasingly randomize line orientation. In turn, line length is drawn from the folded normal distribution, with mean and standard deviation given as parameters lengthMean and lengthStd, respectively.
Each point is then placed relative to its projection on the line, either in two dimensions (a 2D placement, the default), or on a second line, perpendicular to the original one, using a normal distribution (a 1D placement). In either case, the projection is used as the mean value, while the standard deviation is given by the lateralStd parameter. The type of placement, 1D or 2D, is defined by the optional pointOffset parameter. Fig. 1 shows four datasets created with generateData using the parameters given in Table 3.
Parameter descriptions:
lateralStd: Cluster "fatness", i.e., the standard deviation of the distance from each point to its projection on the line. The way this distance is obtained is controlled by the optional pointOffset parameter.
totalPoints: Total points in generated data. These will be randomly divided between clusters using the half-normal distribution with unit standard deviation.
* Corresponding author at: HEI-Lab - Digital Human-Environment Interaction Lab, Lusófona University, Lisbon, Portugal.
E-mail addresses: nuno.fachada@ulusofona.pt (N. Fachada), acrosa@laseeb.org (A.C. Rosa).
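The generation procedure described in this section can be illustrated with a minimal Python sketch. It is not the generateData implementation, only an approximation built from the description above. The helper names split_points and make_cluster are hypothetical, angles are assumed to be in radians, cluster centers are assumed to be given, projections are assumed to be spread uniformly along each line, and only the 1D placement is shown.

```python
import math
import random


def split_points(total_points, num_clusters, rng):
    """Divide total_points among clusters using half-normal weights."""
    # Half-normal draw with unit standard deviation: |N(0, 1)|.
    weights = [abs(rng.gauss(0.0, 1.0)) for _ in range(num_clusters)]
    total_weight = sum(weights)
    sizes = [int(total_points * w / total_weight) for w in weights]
    # Assumption: points lost to rounding down go to the largest cluster.
    sizes[sizes.index(max(sizes))] += total_points - sum(sizes)
    return sizes


def make_cluster(center, n_points, angle_mean, angle_std,
                 length_mean, length_std, lateral_std, rng):
    """Generate one cluster along a randomly oriented line (1D placement)."""
    # Line angle drawn from a normal distribution (radians assumed).
    angle = rng.gauss(angle_mean, angle_std)
    # Line length drawn from a folded normal distribution: |N(mean, std)|.
    length = abs(rng.gauss(length_mean, length_std))
    ux, uy = math.cos(angle), math.sin(angle)  # unit vector along the line
    px, py = -uy, ux                           # unit vector perpendicular to it
    points = []
    for _ in range(n_points):
        # Assumption: the projection is uniformly distributed along the line.
        t = rng.uniform(-length / 2, length / 2)
        # 1D placement: normal offset on the perpendicular line, with the
        # projection as the mean and lateral_std as the standard deviation.
        d = rng.gauss(0.0, lateral_std)
        points.append((center[0] + t * ux + d * px,
                       center[1] + t * uy + d * py))
    return points


# Example usage with hypothetical cluster centers and parameter values.
rng = random.Random(42)
centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0), (20.0, 20.0)]
sizes = split_points(200, len(centers), rng)
data = [make_cluster(c, n, math.pi / 4, 0.1, 10.0, 2.0, 0.5, rng)
        for c, n in zip(centers, sizes)]
```

Setting angle_std to zero in this sketch yields parallel lines, matching the behavior described for angleStd, while lateral_std plays the role of lateralStd.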

Impact
The generateData script was originally created to test the AMVIDC clustering algorithm [12]. The algorithm performs agglomerative hierarchical clustering using minimum volume increase and minimum direction change clustering criteria, and was inspired by the typical layout of spectrometric data after being processed with principal component analysis (PCA). More specifically, the PCA score plots of spectrometric data were found to exhibit distinct groups scattered along a preferential direction, forming low-volume clusters. The generateData script was designed to generate data of this kind, offsetting the lack of actual experimental data and thus making it possible to better tune and test the AMVIDC algorithm.
In reference [13], Zamberletti et al. "investigated the role of individual wetlands within a wetlandscape in sustaining an amphibian population". Wetlandscapes (sets of multiple hydrologically connected wetlands) were modeled as networks, with nodes representing individual wetlands, and connections corresponding to flows of organisms.
Zamberletti et al. used generateData to create a random network of wetlands for running a population dynamics model. The script was suited to this problem, since it allowed deterministic topological network parameters (e.g., number of clusters) to be defined, while adding some random variation via its stochastic arguments (e.g., average cluster separation).
With the goal of evaluating D3CAS, a dynamical and big data-oriented clustering algorithm for processing data streams, Molina & Hasperué [14,15] used generateData to create datasets with up to 100 000 points, an adequate size for their testing requirements.
Alabdulatif et al. [16][17][18] made use of generateData for a series of investigations on cloud and edge computing privacy-related data analytics. Datasets were generated with the purpose of evaluating distributed and privacy-preserving versions of several clustering algorithms in a number of different scenarios.
In reference [19], Hao et al. presented a video summarization approach consisting of generating a short video summary while maintaining the overall meaning of the original video. The approach worked by applying sparse subspace clustering, with an automatically estimated number of clusters, to deep features of objects in key-frames. generateData was used to produce synthetic data for testing the accuracy of the method for estimating the number of clusters.
Olukanmi et al. [20,21] used generateData to assemble scenarios with one million data points with the purpose of assessing the proposed k-means-lite and k-means-lite++ clustering algorithms, highly scalable versions of their non-lite counterparts.

Limitations and potential improvements
The main drawback of generateData is that it is limited to 2D. Nonetheless, the basic ideas of how data is generated are extendable to n-D, with several parameters such as angle mean/standard deviation and average separation by axis being given as vectors instead of scalars. Another potential limitation is that the function is only available for the MATLAB and GNU Octave environments. The former is a proprietary offering, potentially inaccessible to many researchers, while the latter is a worthwhile open-source, largely compatible alternative. However, languages such as Python, R and Julia have been gaining popularity in the scientific computing community [22,23]. As such, one of our goals is to port generateData to these languages, making it available to a wider audience.
For the datasets in Fig. 1, generated with the parameters given in Table 3, the allowEmpty parameter was always set to false.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.