Octopus
kmeans_clustering_oct_m Module Reference

Functions/Subroutines

subroutine, public assign_points_to_centroids_finite_bc (mesh, centroids, ip_to_ic)
 Assign each grid point to the closest centroid. A centroid and its set of nearest grid points defines a cluster. More...
 
subroutine, public update_centroids (mesh, weight, ip_to_ic, centroids)
 Compute a new set of centroids. More...
 
subroutine compute_grid_difference (points, updated_points, tol, points_differ)
 Compute the difference in two grids as \(abs(\mathbf{g}_1 - \mathbf{g}_2)\). More...
 
subroutine report_differences_in_grids (points, updated_points, tol, points_differ)
 Report differences returned from compute_grid_difference. More...
 
subroutine, public weighted_kmeans (space, mesh, weight, centroids, n_iter, centroid_tol, discretize)
 Weighted K-means clustering. More...
 

Function/Subroutine Documentation

◆ assign_points_to_centroids_finite_bc()

subroutine, public kmeans_clustering_oct_m::assign_points_to_centroids_finite_bc ( class(mesh_t), intent(in)  mesh,
real(real64), dimension(:, :), intent(in)  centroids,
integer, dimension(:), intent(out)  ip_to_ic 
)

Assign each grid point to the closest centroid. A centroid and its set of nearest grid points defines a cluster.

This can be mathematically expressed as:

\[ C_\mu=\left\{Z\left(\mathbf{r}_i\right) \mid \operatorname{dist}\left(Z\left(\mathbf{r}_i\right), Z\left(\mathbf{r}_\mu\right)\right) \leq \operatorname{dist}\left(Z\left(\mathbf{r}_i\right), Z\left(\mathbf{r}_m\right)\right) \text { for all } m \neq \mu\right\} \]

where \( Z(\mathbf{r}_i) \) are data points and \( Z(\mathbf{r}_\mu) \) is the centroid. The distance metric, \(\operatorname{dist}\), is defined as the dot product in this implementation.

See eqs 10-13 in [Complex-valued K-means clustering of interpolative separable density fitting algorithm for large-scale hybrid functional enabled ab initio molecular dynamics simulations within plane waves](https:

Note
For any grid point that is equidistant from two or more centroids, it is assigned to the the lowest-indexed centroid. This is because the algorithm updates the assignment when the current minimum is less than the stored minimum. If this equality was less than or equal, the point would be assigned to the highest-indexed centroid. Note, some algorithms will assign equidistant points at random from the set of centroids at minimum distance. One might consider implementing this behaviour.
Because we store the map from grid points to centroid indices, and not the opposite map, it is possible to silently miss assignment a centroid. This will typically happen if two or more centroids are duplicates (or indistinguishable). As described above, the first instance of the centroid, with the lowest index, will get assigned to the grid point and any duplicates will get passed over.
Parameters
[in]meshReal-space grid (np, spacedim)
[in]centroidsCentroid positions (spacedim, Ncentroids)
[out]ip_to_icIndex array that maps grid indices to centroid indices

Definition at line 159 of file kmeans_clustering.F90.

◆ update_centroids()

subroutine, public kmeans_clustering_oct_m::update_centroids ( class(mesh_t), intent(in)  mesh,
real(real64), dimension(:), intent(in)  weight,
integer, dimension(:), intent(in)  ip_to_ic,
real(real64), dimension(:, :), intent(inout), contiguous  centroids 
)

Compute a new set of centroids.

A centroid is defined as:

\[ \mathbf{r}_\mu = \frac{\sum_{\mathbf{r}_j \in C_\mu} \mathbf{r}_j w(\mathbf{r}_j)}{\sum_{\mathbf{r}_j \in C_\mu} w(\mathbf{r}_j)} \]

where \(\mathbf{r}_j\) and \(w(\mathbf{r}_j\) are grid points and weights restricted to the cluster \( C_\mu \), respectively.

If using domain decomposition, the routine initially computes centroids using only the local set of grid points, then reduces this sum over all domains.

Parameters
[in]meshReal-space grid instance
[in]weightWeights (meshnp)
[in]ip_to_icIndex array that maps grid indices to centroid indices
[in,out]centroidsIn: centroid positions (spacedim, Ncentroids)

Definition at line 212 of file kmeans_clustering.F90.

◆ compute_grid_difference()

subroutine kmeans_clustering_oct_m::compute_grid_difference ( real(real64), dimension(:, :), intent(in)  points,
real(real64), dimension(:, :), intent(in)  updated_points,
real(real64), intent(in)  tol,
logical, dimension(:), intent(out)  points_differ 
)
private

Compute the difference in two grids as \(abs(\mathbf{g}_1 - \mathbf{g}_2)\).

If a component of the difference vector for grid point ip is greater than the tolerance, the element points_differ(ip) is set to true.

Parameters
[in]pointsReal-space grid (n_dims, N)
[in]updated_pointsReal-space grid (n_dims, N)
[in]tolTolerance
[out]points_differ|a_i - b_i|

Definition at line 270 of file kmeans_clustering.F90.

◆ report_differences_in_grids()

subroutine kmeans_clustering_oct_m::report_differences_in_grids ( real(real64), dimension(:, :), intent(in)  points,
real(real64), dimension(:, :), intent(in)  updated_points,
real(real64), intent(in)  tol,
logical, dimension(:), intent(in)  points_differ 
)
private

Report differences returned from compute_grid_difference.

Parameters
[in]pointsReal-space grids (n_dims, N)
[in]points_differIf any element of point_i > tol

Definition at line 301 of file kmeans_clustering.F90.

◆ weighted_kmeans()

subroutine, public kmeans_clustering_oct_m::weighted_kmeans ( class(space_t), intent(in)  space,
class(mesh_t), intent(in)  mesh,
real(real64), dimension(:), intent(in)  weight,
real(real64), dimension(:, :), intent(inout), contiguous  centroids,
integer, intent(in), optional  n_iter,
real(real64), intent(in), optional  centroid_tol,
logical, intent(in), optional  discretize 
)

Weighted K-means clustering.

The K-means algorithm divides a set of \(N_r\) samples (in this case, grid points) into \(N_\mu\) disjoin clusters \(C\). The mean of each cluster defines the cluster centroid. Note that centroids are not, in general, points from the discrete grid

  • they can take any continuous values that span the grid limits.

Theory

Given a grid, and some initial guess at centroids, the K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

\[ argmin \sum^{N_\mu}_{\mu=1} \sum_{\mathbf{r}_\mathbf{k} \in C_\mu} || Z(\mathbf{r}_k) - Z(\mathbf{r}_\mu)||^2 \]

This implementation is based on Algorithm 1. given in [Complex-valued K-means clustering of interpolative separable density fitting algorithm for large-scale hybrid functional enabled ab initio molecular dynamics simulations within plane waves](https: however it is equivalent to implementations found in packages such as [scikit-learn](https:

Algorithm Description

The K-means algorithm consists of looping between two steps:

  1. The first step assigns each sample to its nearest centroid. See assign_points_to_centroids_finite_bc
  2. The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. See update_centroids The difference between the old and the new centroids are computed, and the algorithm repeats these last two steps until this value is less than a threshold.

MPI Implementation

  • Routine is MPI-safe for domain decomposition.
  • It expects a local mesh grid and returns centroids that are defined on all ranks.
    Parameters
    [in]spaceSpatial dimensions and periodic dimensions
    [in]meshReal-space grid instance
    [in]weightWeights (n_points)
    [in,out]centroidsIn: Initial centroids (n_dim, n_centroid)
    [in]n_iterOptional max number of iterations
    [in]centroid_tolOptional convergence criterion
    [in]discretizeOptional Discretize centroid values to grid points

Definition at line 365 of file kmeans_clustering.F90.