OPFython: A Python-Inspired Optimum-Path Forest Classifier

Machine learning techniques have been paramount over the last years, being applied to a wide range of tasks, such as classification, object recognition, person identification, and image segmentation. Nevertheless, conventional classification algorithms, e.g., Logistic Regression, Decision Trees, and Bayesian classifiers, may lack the complexity and diversity needed to deal with real-world data. A recent graph-inspired classifier, known as the Optimum-Path Forest, has proven to be a state-of-the-art technique, comparable to Support Vector Machines and even surpassing them in some tasks. This paper proposes a Python-based Optimum-Path Forest framework, denoted OPFython, whose functions and classes are based upon the original C-language implementation. Additionally, as OPFython is a Python-based library, it provides a friendlier environment and a faster prototyping workspace than the C language.


Introduction
Artificial Intelligence has become one of the most actively fostered research areas of the last years [1]. It is common to observe an increasing trend toward automating tasks [2], which in turn fosters algorithms that require minimal human interaction, a field denoted as Machine Learning.
Machine learning research consists of developing new types of algorithms that do not need explicit instructions, relying instead on patterns and inferences [3]. Also, they are designed so that humans can be assisted in decision-making tasks or even in the automation of daily activities, such as data retrieval [4], intelligent gadgets [5], and self-driving cars [6], among others. In the past decades, most machine learning-based algorithms were developed as symbolic- and knowledge-based models due to the difficulty of dealing with probabilistic models at that time [7]. Nevertheless, with the advent of computational power, probabilistic models were put in the spotlight, as the availability of digitized information was no longer a problem [8]. Hence, most of today's algorithms rely on mathematical models and data sampling, i.e., models that are capable of learning hidden patterns in training data and predicting unseen data, and are applicable to a wide variety of tasks, such as computer vision [9] and natural language processing [10].
One can divide machine learning algorithms into two types of learning: supervised learning [11] and unsupervised learning [12], as depicted in Figure 1. Concerning supervised learning, such as classification and regression tasks, the algorithms aim to build mathematical models from labeled data, i.e., data containing the input features and the possible outputs (classes), and perform predictions on unseen data. Regarding unsupervised learning, the algorithms aim to build mathematical models capable of aggregating sets of data with common characteristics, known as clusters. In other words, unsupervised learning can discover patterns in data and group them into categories without knowing their actual labels. Furthermore, it is crucial to observe that the machine learning area is closely related to other fields, such as data mining, optimization, and statistics. Regarding data mining, while machine learning focuses on predicting information based on already-known properties of the data, data mining focuses on discovering previously unknown properties in the data and turning them into new knowledge, for further application in machine learning algorithms [13].
Concerning the optimization area, it is common to observe that most machine learning models are formulated as optimization problems, where some loss function is minimized over a set of training data. Essentially, the loss function expresses the discrepancy between the model's predictions and the actual samples, assisting the algorithm in learning the data's patterns and predicting unseen information [14]. Finally, regarding the statistics field, statistics focuses on drawing inferences from samples, while machine learning focuses on finding generalizable prediction patterns [15]. Additionally, due to their intimate relationship, some researchers have combined machine learning and statistical methods into a new field of study, known as statistical learning [16].
Recently, a new graph-based classifier proposed by Papa et al. [17], known as Optimum-Path Forest (OPF), attempts to fulfill the literature with a parameterless classifier that is effective during the learning step and efficient when performing new predictions. Several works have introduced the capacity of OPF and its state-of-the-art performance, being comparable to the well-known Support Vector Machines (SVM) [18] in supervised [19] and unsupervised learning [20] tasks. Additionally, it provides tools, such as graph-cutting and K-Nearest Neighbors (KNN) graphs [21], to reduce the training set size with negligible effects on classification accuracy. Nevertheless, a problem arises because there is only one official implementation, based on the C language, making it difficult to integrate with other well-known machine learning frameworks. Furthermore, there is a Python-based trend in the machine learning community.
This paper proposes an open-source Python Optimum-Path Forest classification library, called OPFython. Mainly, the idea is to provide a user-friendly environment to work with Optimum-Path Forest classifiers by creating high-level methods and classes, removing the burden of programming at a mathematical level from the user. The main contributions of this paper are threefold: (i) to introduce an Optimum-Path Forest classification library in the Python language, (ii) to provide an easy-to-go implementation and user-friendly framework, and (iii) to fill the lack of research regarding Optimum-Path Forest classifiers.
The remainder of this paper is organized as follows. Section 2 presents a literature review and related works concerning Optimum-Path Forest classifier frameworks. Section 3 introduces the theoretical background of the supervised and unsupervised Optimum-Path Forest classifiers. Section 4 presents an overview of the OPFython library, such as its architecture and its included packages. Section 5 provides deeper notions about the library, such as how to install it, how to navigate its documentation, some pre-included examples, and how to run unit tests. Furthermore, Section 6 presents vital knowledge about the usage of the library, i.e., how to run pre-defined examples and model a new experiment. Finally, Section 7 states conclusions and future works.

Literature Review and Related Works
Optimum-Path Forest classifiers have arisen as a new approach to tackle supervised and unsupervised problems. They offer a parameterless, graph-based implementation capable of executing an effective learning procedure while being extremely efficient when performing new predictions. It is possible to find its usage in a wide range of applications, such as feature selection [22], image segmentation [23,24], and signal classification [25,26]. For instance, Iliev et al. applied an Optimum-Path Forest classifier using glottal features for spoken emotion recognition, achieving state-of-the-art results comparable to the SVM classifier. Moreover, Ramos et al. [27] applied an OPF-based classification for detecting non-technical energy losses, achieving outstanding results comparable to state-of-the-art artificial intelligence techniques. Furthermore, Fernandes et al. [28] proposed a probabilistic-driven OPF classifier for detecting non-technical energy losses, improving the baselines obtained by the standard OPF.
Even though numerous works in the literature foster Optimum-Path Forests, there are gaps regarding frameworks and open-sourced libraries. The only official implementation, provided by Papa et al. [29] and denoted LibOPF, does not provide straightforward tools for users to design new experiments or integrate it with other frameworks. Additionally, it lacks documentation and test suites, which would help users understand the code and implement new methods and classes. Moreover, the library is implemented in the C language, making it extremely difficult to integrate with other frameworks or packages, primarily because the machine learning community is turning its attention to the Python language.
Therefore, OPFython attempts to fill these gaps concerning Optimum-Path Forest frameworks. It is implemented purely in Python and provides comprehensive documentation, test suites, and several pre-loaded examples. Furthermore, every line of code is commented, there are continuous-integration tests for every new push to its repository, a thorough readme that teaches how to get started with the library, and full-time maintenance and support.

Theoretical Foundation
Before diving into OPFython's library, we present a theoretical foundation about the Optimum-Path Forest. In the next subsections, we mathematically explain how the supervised and unsupervised classifiers work.

Supervised Optimum-Path Forest
The Optimum-Path Forest is a multi-class classifier developed by Papa et al. [17], being efficient in the training step and effective in the testing stage. Its foremost ability is to segment the feature space without requiring massive volumes of data. Essentially, the OPF classifier is a graph with two possible adjacency relations: a complete graph or a KNN graph. The difference between both methods lies in the adjacency relation, the methodology to estimate the prototypes, and the path-cost function.
The principal idea behind the supervised OPF is to construct a complete graph, where any two samples are connected.
In this case, the nodes represent the samples' feature vectors, and the edges connect all nodes. Regarding the prototypes, they are chosen through a Minimum Spanning Tree (MST) that finds the nearest samples from different classes, namely, samples located at the classes' frontiers. After the prototypes are defined, they compete to conquer adjacent nodes, trying to find the best (lowest-cost) path defined by the path-cost function, and create Optimum-Path Trees (OPT). Finally, during the testing phase, OPF inserts each new sample into the graph and finds the prototype that offers the minimum-cost path, assigning its class label.
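To make the prototype-estimation step concrete, the sketch below selects prototypes from a tiny labeled set using Prim's algorithm over the complete graph: the endpoints of MST edges joining different classes become prototypes. This is a didactic sketch, not OPFython's actual implementation, and the function name is illustrative:

```python
import math

def mst_prototypes(X, y):
    """Endpoints of MST edges that connect different classes (frontier samples)."""
    n = len(X)
    in_tree = [False] * n
    cost = [math.inf] * n
    parent = [-1] * n
    cost[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node not yet in the tree (Prim's algorithm)
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: cost[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < cost[v]:
                    cost[v], parent[v] = d, u
    # Frontier samples: both endpoints of inter-class MST edges
    return {i for u, v in edges if y[u] != y[v] for i in (u, v)}
```

With two well-separated classes, only the samples touching the inter-class MST edge are returned.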
Let Z be a dataset such that Z = Z_1 ∪ Z_2, where Z_1 and Z_2 represent the training and testing sets, respectively. Each sample s ∈ Z is represented by its feature vector v(s) ∈ ℝⁿ. The OPF complete graph is represented by G = (V, A), where A refers to the set of edges that connects all pairs of nodes and V is the set of feature vectors v(s), ∀s ∈ Z. In addition, let λ(·) be a function that assigns the real label to each sample in Z.

Training Step
Let the graph G_1 = (V_1, A) be induced from the training set, where V_1 holds all feature vectors from samples belonging to the training set. The first objective of the training phase is to obtain a set of prototypes S, where S ⊂ Z_1.
Let π_s be a path in G ending at sample s, and let f(π_s) be a function that assigns a value to this path. For a prototype to conquer adjacent samples, the purpose is to minimize f(π_s) through the path-cost function given by Equation 1:

f_max(⟨s⟩) = 0 if s ∈ S, +∞ otherwise,
f_max(π_s · ⟨s, t⟩) = max{f_max(π_s), d(s, t)},    (1)

where f_max(π_s · ⟨s, t⟩) computes the maximum distance between adjacent samples s and t along the path π_s · ⟨s, t⟩. A path π_s is referred to as optimum if f(π_s) ≤ f(τ_s) for any other path τ_s.
The minimization of f_max assigns to each sample t ∈ Z_1 an optimum path P*(t), whose minimum cost C(t) is given by Equation 2:

C(t) = min_{∀π_t ∈ (Z_1, A)} {f_max(π_t)}.    (2)
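The minimization above can be sketched as a Dijkstra-like propagation in which a path costs the maximum arc weight along it rather than the sum of weights: prototypes start with cost zero and conquer the remaining samples, propagating their labels. The following is an illustrative sketch under these assumptions, not OPFython's API:

```python
import math

def opf_train(X, y, prototypes, dist):
    """Propagate optimum-path costs from the prototypes over the complete
    graph with the f_max cost: Dijkstra-like, but a path costs the maximum
    arc weight along it instead of the sum."""
    n = len(X)
    C = [0.0 if i in prototypes else math.inf for i in range(n)]
    label = [y[i] if i in prototypes else None for i in range(n)]
    done = [False] * n
    for _ in range(n):
        # Conquer the cheapest unvisited node
        s = min((i for i in range(n) if not done[i]), key=lambda i: C[i])
        done[s] = True
        for t in range(n):
            if not done[t]:
                c = max(C[s], dist(X[s], X[t]))
                if c < C[t]:
                    C[t], label[t] = c, label[s]  # s extends a cheaper path to t
    return C, label
```

On a toy two-cluster dataset, each prototype ends up conquering its own cluster.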

Testing Step
The testing-set graph is built analogously from Z_2. Each sample t ∈ Z_2 is connected to every sample s ∈ V_1, making t part of the original graph. The objective is to find the optimum path P*(t) from S to t, labeling t with the class λ(R(t)) of its prototype R(t) ∈ S. Afterward, the sample t is removed from the graph. This path can be identified by evaluating the optimum cost value C(t), given by Equation 3:

C(t) = min_{∀s ∈ Z_1} {max{C(s), d(s, t)}}.    (3)
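Classification of a test sample then reduces to scanning the training nodes for the one offering the cheapest f_max extension. Below is a minimal sketch with a hypothetical, hand-built trained model (the costs C and labels are written out as training would produce them on this toy data):

```python
import math

def opf_classify(x, X_train, C, label):
    """Label x with the class of the training node offering the cheapest
    f_max extension: argmin over s of max(C(s), d(s, x))."""
    best = min(range(len(X_train)),
               key=lambda s: max(C[s], math.dist(x, X_train[s])))
    return label[best]

# Hypothetical trained model: two clusters around x=0 (class 0) and
# x=5 (class 1), with optimum costs C as training would assign them.
X_train = [(0, 0), (0, 1), (5, 0), (5, 1)]
C = [0.0, 1.0, 0.0, 1.0]
label = [0, 0, 1, 1]
```

A sample near either cluster is conquered by that cluster's prototype.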

Unsupervised Optimum-Path Forest
Let N be a dataset such that, for every sample s ∈ N, there is a feature vector v(s). Additionally, let d(s, t) be the distance between samples s and t in the feature space. A graph (N, A) is defined by arcs (s, t) ∈ A that connect k-nearest neighbors in the feature space. The arcs are weighted by d(s, t), and the nodes s ∈ N are weighted by a density value ρ(s), given by Equation 4:

ρ(s) = (1 / (√(2πσ²) k)) Σ_{t ∈ A(s)} exp(−d²(s, t) / (2σ²)),    (4)

where |A(s)| = k and σ = d_f / 3, in which d_f stands for the maximum arc weight in (N, A). Moreover, let a path π_t be the sequence of adjacent samples starting from a root R(t) and ending at a sample t, being π_t = ⟨t⟩ a trivial path and π_s · ⟨s, t⟩ the concatenation of π_s and arc (s, t). Among all possible paths π_t with roots on the maxima of the p.d.f., the problem lies in finding the path whose lowest density value is maximum. Mathematically speaking, Equation 5 maximizes f(π_t) for all t ∈ N, where:

f(⟨t⟩) = ρ(t) if t ∈ R, ρ(t) − δ otherwise,
f(π_s · ⟨s, t⟩) = min{f(π_s), ρ(t)},    (5)

where δ = min_{∀(s,t) ∈ A | ρ(t) ≠ ρ(s)} |ρ(t) − ρ(s)| and R is a set holding one root for each maximum of the p.d.f. One can see that higher values of δ reduce the number of maxima. Additionally, in this library, we use δ = 1.0 and ρ(t) ∈ [1, 1000].
Finally, the OPF algorithm maximizes f(π_t) such that the optimum paths compose an Optimum-Path Forest, i.e., an acyclic predecessor map P that assigns to each sample t ∉ R its predecessor P(t) in the optimum path from R, or a marker nil when t ∈ R. Each p.d.f. maximum (prototype) is the root of an OPT, commonly known as a cluster. Furthermore, the collection of all OPTs is the so-called Optimum-Path Forest.
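The density estimation of Equation 4 can be illustrated with a short sketch over a k-nearest-neighbor graph, using σ equal to the maximum arc weight divided by three as in the formulation above. This is didactic code, not OPFython's implementation:

```python
import math

def densities(X, k, dist):
    """Estimate rho(s) for each node over its k-nearest neighbors with a
    Gaussian kernel; sigma is the maximum arc weight divided by 3."""
    n = len(X)
    # k-nearest-neighbor arcs and their weights
    knn = [sorted(dist(X[s], X[t]) for t in range(n) if t != s)[:k]
           for s in range(n)]
    sigma = max(w for row in knn for w in row) / 3.0
    norm = math.sqrt(2.0 * math.pi) * sigma * k
    return [sum(math.exp(-w * w / (2.0 * sigma ** 2)) for w in row) / norm
            for row in knn]
```

As expected, samples inside a dense cluster receive higher density values than isolated outliers, which is what places the p.d.f. maxima at cluster centers.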

OPFython
OPFython is distributed among several packages, each one accountable for particular classes and methods. Figure 2 represents a summary of OPFython's architecture, while the next sections present each of its packages in more detail.

Core
The core package serves as the origin of all OPFython's classes. It serves as a building base for implementing the more specialized structures that one may require when creating an Optimum-Path Forest-based classifier. As portrayed in Figure 3, four modules compose the core package, as follows:

• Heap: The heap assists OPF in stacking nodes according to their costs and further unstacking them to build the subgraph;
• Node: When working with graph-based structures, each of their pieces is represented by a node. In OPFython, we use the node structure to store valuable information about a sample, such as its features, label, and other information that OPF might need;
• OPF: The OPF class serves as the classifier itself. It implements some basic methods that are common to its children, as well as some methods that assist users in saving and loading pre-trained models;
• Subgraph: The subgraph is one of the most fundamental structures of the OPF classifier. A graph-based classifier uses nodes and arcs to build the optimum-path costs and find the prototype nodes, which conquer the remaining samples and propagate their labels.
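Conceptually, the Heap behaves like a cost-ordered priority queue: nodes enter with their current path cost and leave cheapest-first, which dictates the conquering order during subgraph construction. The analogy can be sketched with Python's standard heapq (an analogy only, not the actual Heap class):

```python
import heapq

# Nodes enter the queue paired with their current path cost
queue = []
heapq.heappush(queue, (0.0, "prototype_a"))  # prototypes start at cost zero
heapq.heappush(queue, (2.5, "node_c"))
heapq.heappush(queue, (1.0, "node_b"))

# Unstacking yields nodes in increasing cost order, i.e., the order in
# which they would be conquered
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```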

Math
To ease the user's life, OPFython offers a mathematical package containing low-level math implementations, illustrated by Figure 4. Naturally, some functions used repeatedly throughout the library are gathered in this package, as follows:

• Distance: A distance metric is used to calculate the cost between nodes. Hence, we offer a variety of distance metrics to fulfill the needs of every task;
• General: Common-use functions that do not belong to a specific module are defined here;
• Random: Lastly, some methods might use random numbers for sampling or setting a heuristic. This module can generate uniform and Gaussian random numbers.
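As an illustration of what a metric from the distance module might look like, the sketch below shows one plausible form of a log-squashed squared Euclidean distance; the library's exact formula and scaling may differ:

```python
import math

def log_squared_euclidean(x, y):
    """One plausible log-squashed squared Euclidean distance (illustrative;
    the library's exact formula may differ)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.log1p(squared)  # log(1 + squared distance)
```

Such a squashing is monotonic in the squared distance, so it preserves neighbor orderings while compressing large distances.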

Models
Several approaches can be pursued when designing an Optimum-Path Forest classifier, such as supervised, unsupervised, and semi-supervised ones, among others. Therefore, the models package provides the classes and methods that compose these high-level abstractions and implement the classifying strategies. Currently, OPFython offers four types of classifiers, which are illustrated by Figure 5 and described as follows:

• KNNSupervisedOPF [21]: A supervised Optimum-Path Forest classifier that uses a KNN-based subgraph, providing a more effective way to build up the connectivity subgraph;
• SemiSupervisedOPF [31]: A semi-supervised Optimum-Path Forest classifier, which is extremely useful for labeling unknown samples;
• SupervisedOPF [17]: The classical supervised Optimum-Path Forest classifier, which is suitable for training on labeled datasets and performing new predictions;
• UnsupervisedOPF [32]: The standard unsupervised Optimum-Path Forest classifier, which is suitable for clustering unlabeled datasets.

Stream
The stream package deals with every pre-processing step of the input data. It is essentially responsible for loading the data, parsing it into samples and labels, and splitting it into new sets, such as training, validation, and testing. Figure 6 depicts its modules, which are briefly described as follows:

• Loader: A loading module that assists users in pre-loading datasets. Currently, it is possible to load files in the .txt, .csv, and .json formats;
• Parser: After loading the files, it is necessary to parse the pre-loaded arrays into samples and labels;
• Splitter: Finally, if necessary, one can split the loaded and parsed dataset into new sets, such as training, validation, and testing.

Subgraphs
As mentioned before, the subgraph is one of the essential structures of the classification process. Nevertheless, distinct classifiers might need distinct subgraphs. Therefore, we offer additional subgraph implementations, as portrayed by Figure 7 and described as follows:

• KNNSubgraph: When dealing with KNN-based classifiers, it is crucial to use a KNN-based subgraph, as it implements some additional functions that the classifier might need.

Figure 7: Flowchart of OPFython's subgraphs package.

Utils
The utility package implements standard tools shared across the library, as it is a better approach to implement them once and re-use them across other modules, as shown in Figure 8. This package implements the following modules:

• Constants: Constants are fixed values that do not change throughout the code. For the sake of easiness, they are implemented in the same module;
• Converter: Most OPF users are familiar with the specific file format it uses. Hence, we implement a module capable of converting .opf files into .txt, .csv, and .json;
• Decorator: Wrappers that provide common functionalities before running pieces of code;
• Exception: In order to assist users, the exception module implements common errors and exceptions that might happen when invalid arguments are used in OPFython classes and methods;
• Logging: Every method invoked in the library is logged to a log file. One can watch the log to detect potential errors, essential warnings, or even success messages throughout the classification procedure.
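To illustrate the kind of behavior the decorator module encapsulates (a generic Python sketch, not OPFython's actual decorators), a wrapper can perform a common step, such as recording the call, before executing the wrapped code:

```python
import functools

calls = []  # a shared record of which functions were invoked

def pre_log(func):
    """A wrapper in the spirit of the decorator module: it performs a common
    step (here, recording the call) before running the wrapped code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)  # common functionality before the call
        return func(*args, **kwargs)
    return wrapper

@pre_log
def fit_model(n):
    # a stand-in for any library method; the wrapper is oblivious to its body
    return n * 2
```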

Library Usage
In this section, we describe how to install the OPFython library and the first steps to start playing with it. Essentially, one can study its documentation or make use of the already-included examples. Besides, there are implemented methods that conduct unit tests and verify that everything is operating as expected.

Installation
First of all, we understand that everything has to be smooth, without being tricky or daunting. Therefore, OPFython aims to be the go-to package, from the very first installation to its further usage. Just execute the following command under your preferred Python environment (standard, conda, virtualenv):

pip install opfython
Alternatively, it is possible to use the bleeding-edge version by cloning its repository and installing it:

git clone https://github.com/gugarosa/opfython.git
pip install .
Note that there are no further requirements to use OPFython. As its single dependency is the NumPy package, it can be installed anywhere, regardless of the machine's operating system.

Documentation
One might be enthusiastic about mastering the concepts and strategies behind OPFython. Hence, we provide a fully documented reference containing everything that the library offers. From elementary classes to more complex methods, OPFython's documentation is the perfect reference for learning how the library was developed, or even for improving it with contributions.
Among the pre-included examples, there are scripts covering each package, such as convert_from_opf.py for the utils package. Each example is constituted of high-level explanations of how to use predefined classes and methods, providing a standard description of how to instantiate each class and which arguments should be employed.

Test Suites
OPFython is equipped with tests to provide a more in-depth analysis of the code. Also, the intention behind any test is to check whether everything is running as expected or not. Thus, there are two main ways to execute the tests:

• PyTest: The first method is running the single command pytest tests/, as depicted by Figure 9. It will run all the implemented tests and return an output indicating whether they succeeded or failed;
• Coverage: An interesting extension to PyTest is the coverage module. Besides producing the same outputs as PyTest, it also presents a report stating how much of the code the tests cover, as illustrated by Figure 10. Its usage is also straightforward: coverage run -m pytest tests/ and coverage report -m.

Applications
In this section, we explain how to perform a classification task with OPFython, as well as briefly describe the seven pre-loaded applications included with the library. Each example comprises the following pipeline: loading the dataset, parsing the dataset, splitting the dataset, instantiating a classifier, fitting the training data, predicting the validation/test data, and calculating the classifier's performance. Finally, after performing the classification process, it is possible to save the model to a file on disk for further inspection. Figure 11 illustrates the output logs generated by an OPFython classification.

Getting Started
The difference between the provided scripts lies in the type of classifier. While supervised classification attempts to learn a set of classes from particular samples representing them, unsupervised classification tries to aggregate samples into clusters, i.e., dense regions where samples share similar traits. As of now, we offer three supervised classification scripts, namely, the KNN-based supervised OPF, the semi-supervised OPF, and the supervised OPF, as well as one unsupervised classification script, the unsupervised OPF. Additionally, we offer three extensions of the supervised OPF, namely, supervised OPF agglomerative learning (learns from mistakes over the validation set), supervised OPF learning (learns the best classifier over a validation set), and supervised OPF pruning (prunes nodes while maintaining the accuracy).

Modeling a New Classification
In order to model a new classification, some conventional rules need to be comprehended. First of all, the data should be loaded and parsed; in this case, we will load a common dataset known as Boat:

```python
import opfython.stream.loader as l
import opfython.stream.parser as p

# Loading a .txt file to a numpy array
txt = l.load_txt('data/boat.txt')

# Parsing a pre-loaded numpy array
X, Y = p.parse_loader(txt)
```

Furthermore, if necessary, we can split the data into new sets, such as training and testing, as follows:

```python
import opfython.stream.splitter as s

# Splitting data into training and testing sets
X_train, X_test, Y_train, Y_test = s.split(X, Y, percentage=0.5, random_state=1)
```

Figure 11: Output logs generated by executing an OPFython classification.

Afterward, we can instantiate an OPF classifier:
```python
from opfython.models import SupervisedOPF

# Creates a SupervisedOPF instance
opf = SupervisedOPF(distance='log_squared_euclidean', pre_computed_distance=None)
```

Finally, we can fit the classifier and perform new predictions:

```python
# Fits training data into the classifier
opf.fit(X_train, Y_train)

# Predicts new data
preds = opf.predict(X_test)
```

After predicting new samples, it is possible to evaluate the classifier's performance:

```python
import opfython.math.general as g

# Calculating accuracy
acc = g.opf_accuracy(Y_test, preds)
```

Conclusions
This article introduces an open-source Python-inspired library for handling Optimum-Path Forest classifiers, known as OPFython. Based on an object-oriented paradigm, OPFython provides a modern yet straightforward implementation, allowing users to prototype new OPF-based classifiers swiftly.
The library implements a wide variety of Optimum-Path Forest classifiers, such as supervised, semi-supervised, and unsupervised ones, as well as auxiliary functions that assist the classifiers' workflow, i.e., distance functions, classification metrics, data processing, and error logging. Additionally, as OPFython is thoroughly inspired by the original LibOPF, it is possible to use the same loading format (the OPF file format) and the methods available in the original package. Furthermore, OPFython provides a model-saving method, which can be used to pre-train classifiers and retrieve insightful information about the classification procedure.
Regarding future works, we intend to make available more OPF-based classifiers, as well as a visualization package, which will allow users to feed their saved models and furnish charts. Furthermore, we aim to improve our implementations by distributing the calculations, i.e., employing a parallel computing concept, which will hopefully reduce our computational burden.