Functions/Subroutines
subroutine, public	assign_points_to_centroids_finite_bc (mesh, centroids, ip_to_ic)
	Assign each grid point to the closest centroid. A centroid and its set of nearest grid points defines a cluster. More...

subroutine, public	update_centroids (mesh, weight, ip_to_ic, centroids)
	Compute a new set of centroids. More...

subroutine	compute_grid_difference (points, updated_points, tol, points_differ)
	Compute the difference in two grids as \(abs(\mathbf{g}_1 - \mathbf{g}_2)\). More...

subroutine	report_differences_in_grids (points, updated_points, tol, points_differ)
	Report differences returned from `compute_grid_difference`. More...

subroutine, public	weighted_kmeans (space, mesh, weight, centroids, n_iter, centroid_tol, discretize, inertia)
	Weighted K-means clustering. More...

subroutine, public	sample_initial_centroids (mesh, centroids, seed_value)
	Sample initial centroids from the full mesh. More...

subroutine	compute_centroid_inertia (mesh, centroids, weight, ip_to_ic, inertia)
	Compute the inertia of all centroids. More...

Function/Subroutine Documentation

◆ assign_points_to_centroids_finite_bc()

subroutine, public kmeans_clustering_oct_m::assign_points_to_centroids_finite_bc	(	class(mesh_t), intent(in)	mesh,
		real(real64), dimension(:, :), intent(in)	centroids,
		integer, dimension(:), intent(out)	ip_to_ic
	)

Assign each grid point to the closest centroid. A centroid and its set of nearest grid points defines a cluster.

This can be mathematically expressed as:

\[ C_\mu=\left\{Z\left(\mathbf{r}_i\right) \mid \operatorname{dist}\left(Z\left(\mathbf{r}_i\right), Z\left(\mathbf{r}_\mu\right)\right) \leq \operatorname{dist}\left(Z\left(\mathbf{r}_i\right), Z\left(\mathbf{r}_m\right)\right) \text { for all } m \neq \mu\right\} \]

where \( Z(\mathbf{r}_i) \) are data points and \( Z(\mathbf{r}_\mu) \) is the centroid. The distance metric, \(\operatorname{dist}\), is defined as the dot product in this implementation.

See eqs 10-13 in [Complex-valued K-means clustering of interpolative separable density fitting algorithm for large-scale hybrid functional enabled ab initio molecular dynamics simulations within plane waves](https:

Note: For any grid point that is equidistant from two or more centroids, it is assigned to the the lowest-indexed centroid. This is because the algorithm updates the assignment when the current minimum is less than the stored minimum. If this equality was less than or equal, the point would be assigned to the highest-indexed centroid. Note, some algorithms will assign equidistant points at random from the set of centroids at minimum distance. One might consider implementing this behaviour.; Because we store the map from grid points to centroid indices, and not the opposite map, it is possible to silently miss assigning a centroid. This will typically happen if two or more centroids are duplicates (or indistinguishable). As described above, the first instance of the centroid, with the lowest index, will get assigned to the grid point and any duplicates will get passed over.

Parameters

[in]	mesh	Real-space grid (np, spacedim)
[in]	centroids	Centroid positions (spacedim, Ncentroids)
[out]	ip_to_ic	Index array that maps grid indices to centroid indices

Definition at line 171 of file kmeans_clustering.F90.

◆ update_centroids()

subroutine, public kmeans_clustering_oct_m::update_centroids	(	class(mesh_t), intent(in)	mesh,
		real(real64), dimension(:), intent(in)	weight,
		integer, dimension(:), intent(in)	ip_to_ic,
		real(real64), dimension(:, :), intent(inout), contiguous	centroids
	)

Compute a new set of centroids.

A centroid is defined as:

\[ \mathbf{r}_\mu = \frac{\sum_{\mathbf{r}_j \in C_\mu} \mathbf{r}_j w(\mathbf{r}_j)}{\sum_{\mathbf{r}_j \in C_\mu} w(\mathbf{r}_j)} \]

where \(\mathbf{r}_j\) and \(w(\mathbf{r}_j\) are grid points and weights restricted to the cluster \( C_\mu \), respectively.

If using domain decomposition, the routine initially computes centroids using only the local set of grid points, then reduces this sum over all domains.

Parameters

[in]	mesh	Real-space grid instance
[in]	weight	Weights (meshnp)
[in]	ip_to_ic	Index array that maps grid indices to centroid indices
[in,out]	centroids	In: centroid positions (spacedim, Ncentroids)

Definition at line 230 of file kmeans_clustering.F90.

◆ compute_grid_difference()

subroutine kmeans_clustering_oct_m::compute_grid_difference	(	real(real64), dimension(:, :), intent(in)	points,
		real(real64), dimension(:, :), intent(in)	updated_points,
		real(real64), intent(in)	tol,
		logical, dimension(:), intent(out)	points_differ
	)

private

Compute the difference in two grids as \(abs(\mathbf{g}_1 - \mathbf{g}_2)\).

If a component of the difference vector for grid point ip is greater than the tolerance, the element points_differ(ip) is set to true.

Parameters

[in]	points	Real-space grid (n_dims, N)
[in]	updated_points	Real-space grid (n_dims, N)
[in]	tol	Tolerance
[out]	points_differ	\|a_i - b_i\|

Definition at line 288 of file kmeans_clustering.F90.

◆ report_differences_in_grids()

subroutine kmeans_clustering_oct_m::report_differences_in_grids	(	real(real64), dimension(:, :), intent(in)	points,
		real(real64), dimension(:, :), intent(in)	updated_points,
		real(real64), intent(in)	tol,
		logical, dimension(:), intent(in)	points_differ
	)

private

Report differences returned from compute_grid_difference.

Parameters

[in]	points	Real-space grids (n_dims, N)
[in]	points_differ	If any element of point_i > tol

Definition at line 319 of file kmeans_clustering.F90.

◆ weighted_kmeans()

subroutine, public kmeans_clustering_oct_m::weighted_kmeans	(	class(space_t), intent(in)	space,
		class(mesh_t), intent(in)	mesh,
		real(real64), dimension(:), intent(in)	weight,
		real(real64), dimension(:, :), intent(inout), contiguous	centroids,
		integer, intent(in), optional	n_iter,
		real(real64), intent(in), optional	centroid_tol,
		logical, intent(in), optional	discretize,
		real(real64), intent(out), optional	inertia
	)

Weighted K-means clustering.

The K-means algorithm divides a set of \(N_r\) samples (in this case, grid points) into \(N_\mu\) disjoin clusters \(C\). The mean of each cluster defines the cluster centroid. Note that centroids are not, in general, points from the discrete grid

they can take any continuous values that span the grid limits.

Theory

Given a grid, and some initial guess at centroids, the K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

\[ argmin \sum^{N_\mu}_{\mu=1} \sum_{\mathbf{r}_\mathbf{k} \in C_\mu} || Z(\mathbf{r}_k) - Z(\mathbf{r}_\mu)||^2 \]

This implementation is based on Algorithm 1. given in [Complex-valued K-means clustering of interpolative separable density fitting algorithm for large-scale hybrid functional enabled ab initio molecular dynamics simulations within plane waves](https: however it is equivalent to implementations found in packages such as [scikit-learn](https:

Algorithm Description

The K-means algorithm consists of looping between two steps:

The first step assigns each sample to its nearest centroid. See assign_points_to_centroids_finite_bc
The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. See update_centroids The difference between the old and the new centroids are computed, and the algorithm repeats these last two steps until this value is less than a threshold.

MPI Implementation

Routine is MPI-safe for domain decomposition.

It expects a local mesh grid and returns centroids that are defined on all ranks.

Parameters

[in]	space	Spatial dimensions and periodic dimensions
[in]	mesh	Real-space grid instance
[in]	weight	Weights (n_points)
[in,out]	centroids	In: Initial centroids (n_dim, n_centroid)
[in]	n_iter	Optional max number of iterations
[in]	centroid_tol	Optional convergence criterion
[in]	discretize	Optional Discretize centroid values to grid points
[out]	inertia	Optional metric quantifying the quality of the centroids

Definition at line 387 of file kmeans_clustering.F90.

◆ sample_initial_centroids()

subroutine, public kmeans_clustering_oct_m::sample_initial_centroids	(	class(mesh_t), intent(in)	mesh,
		real(real64), dimension(:, :), intent(out), contiguous	centroids,
		integer(int64), intent(inout), optional	seed_value
	)

Sample initial centroids from the full mesh.

Points are chosen at random, with no replacement (no point can be chosen twice).

Parameters

[out]	centroids	Initial centroids
[in,out]	seed_value	Initial seed value for PRNG

Definition at line 470 of file kmeans_clustering.F90.

◆ compute_centroid_inertia()

subroutine kmeans_clustering_oct_m::compute_centroid_inertia	(	class(mesh_t), intent(in)	mesh,
		real(real64), dimension(:, :), intent(in)	centroids,
		real(real64), dimension(:), intent(in)	weight,
		integer, dimension(:), intent(in)	ip_to_ic,
		real(real64), intent(out)	inertia
	)

private

Compute the inertia of all centroids.

Inertia is defined as the sum of squared distances of grid points to their closest cluster center, weighted by the grid weights:

\[ I = \sum_{\text{ip}}^{N_p} w_{\text{ip}} |\mathbf{C}_{\text{ip} \in C} - \mathbf{r}_{\text{ip}}|^2 \]

Parameters

[in]	mesh	Real-space grid instance
[in]	centroids	Centroids (n_dim, n_centroid)
[in]	weight	Weights (n_points)
[in]	ip_to_ic	Map grid point to associated centroid

Definition at line 511 of file kmeans_clustering.F90.

Functions/Subroutines

Function/Subroutine Documentation

◆ assign_points_to_centroids_finite_bc()

◆ update_centroids()

◆ compute_grid_difference()

◆ report_differences_in_grids()

◆ weighted_kmeans()

Theory

Algorithm Description

MPI Implementation

◆ sample_initial_centroids()

◆ compute_centroid_inertia()