# Description

The NCutYX package includes functions for clustering genomic data using graph theory. Each function in this package is a variation on the NCut measure used to cluster vertices in a graph. The running theme is to use data sets from different sources and types to improve the clustering results.

• The ncut function clusters the columns of a data set using the classical normalized cut measure from graph theory.
• The ancut function clusters one type of data, say gene expressions, with the help of a second type of data, like copy number aberrations.
• The muncut function clusters a three-layered graph into K different clusters of 3 different data types, say gene expression, copy number aberrations and proteins.
• The pwncut function clusters the columns of X into K clusters by giving a weight for each cluster while penalizing them to be similar to each other.
• The mlbncut function works similarly to muncut but it also clusters samples into R clusters.
• The awncut builds similarity matrices for the row of X and an assisted dataset Z. Clusters them into K groups while conducting feature selection based on the AWNCut method.

To install:

• latest development version:
1. install and load package devtools
2. install_github("Seborinos/NCutYX")

# NCut

The Normalized Cut (NCut) clusters the columns of Y into K groups using the NCut graph measure. Builds a similarity matrix for the columns of Y and clusters them into K groups based on the NCut graph measure. Correlation, Euclidean and Gaussian distances can be used to construct the similarity matrix. The NCut measure is minimized using the cross entropy method, a Monte Carlo optimization technique.

## Method

Consider a dataset with $$n$$ iid samples. For the $$i$$th sample, assume that measurements are available on $$p$$ variables, denoted as $$Y_i = (Y_{i1}, Y_{i2}, \cdots, Y_{ip})'$$. Define the weight matrix $$\mathbf{W}=(w_{jl})_{p\times p}$$, where the non-negative element $$w_{jl}$$ measures the similarity between columns $$j$$ and $$l$$ of $$\mathbf{Y}$$. We can define $$w_{jl}$$ to be equal the Gaussian kernel, the absolute value of the sample correlation or the inverse of their Euclidean distance. Denote $$A=\{A_1, \ldots, A_K\}$$ as a partition of $$\{1, \ldots, p \}$$ which leads to $$K$$ disjoint clusters. For a set $$A_k$$, denote $$A_k^c$$ as its complement. Then, the NCut measure is defined as $\text{NCut}(A)=\sum \limits_{k=1}^{K}\frac{\text{cut}(A_k,A_k^c;\mathbf{W})} {\text{cutvol}(A_k;\mathbf{W} )},$ where $\text{cut}(A_k,A_k^c;\mathbf{W})=\sum \limits_{j\in A_k,l \in A_k^c} w_{jl},$ and $\text{cutvol}(A_k; \mathbf{W})=\sum \limits_{j,l \in A_k} w_{jl}.$ With a fixed $$K$$, the optimal clustering minimizes the NCut measure. The minimization is done with the cross entropy method, a Monte Carlo optimization approach.

## Example

First, we set up the simulation parameters.

We define the covariance matrix, the true incidence function and sample the data.

Apply ncut to the data Y and calculate the estimation error of the clusters.

# ANCut

The Assisted NCut (ANcut) clusters the columns of a data set $$Y$$ into $$K$$ groups with the help of an external data set $$X$$, which is associated linearly with $$Y$$.

## Method

Consider a dataset with $$n$$ iid samples. For the $$i$$th sample, assume that measurements are available on $$p$$ gene expressions (GEs), denoted as $$Y_i = (Y_{i1}, Y_{i2}, \cdots, Y_{ip})'$$. In addition, assume that measurements are also available on $$q$$ copy number aberrations (CNAs), denoted as $$X_i = (X_{i1}, X_{i2}, \cdots, X_{iq})'$$. For another type of data, the proposed approach is directly applicable. Our strategy is to use information in $$X_i$$’s to assist the clustering of $$Y_i$$’s. We describe the $$Y$$-$$X$$ association using regression. Specifically, consider $Y_i=\mathbf{\beta}X_i+\epsilon_i,$

where $$\mathbf{\beta}$$ is the matrix of unknown regression coefficients, and $$\epsilon_i$$ is the vector of random errors. This form is especially suitable for analysis when $$Y$$ and $$X$$ have high-dimension. For estimating $$\mathbf{\beta}$$, we consider the penalized estimate $\widehat{\mathbf{\beta}}=\underset{\mathbf{\beta}}{\textrm{argmin}}\left\{ ||\mathbf{Y}-\mathbf{\beta}\mathbf{X}||_2^2+\lambda\left((1-\alpha)||\mathbf{\beta}||^2_2 + \alpha||\mathbf{\beta}||_1\right)\right\},$ where $$\mathbf{Y}$$ and $$\mathbf{X}$$ are matrices consisting of $$Y_i$$’s and $$X_i$$’s, and $$\lambda>0$$ and $$0\leq \alpha \leq 1$$ are data-dependent tuning parameters. The penalization approach is adopted to accommodate the high data dimensionality and for selection: a specific $$Y$$ is expected to be affected by only a few $$X$$’s, and a $$X$$ is expected to affect only a few $$Y$$’s. The elastic net (Enet) penalty is adopted for its simplicity and to accommodate (possibly high) correlations among $$X$$’s. This estimation is effectively realized using R package $$glmnet$$. The two tuning parameters $$\lambda$$ and $$\alpha$$ are selected using V-fold cross validation.

With the estimate $$\widehat{\mathbf{\beta}}$$, denote the predicted’’ $$Y$$ values as $$\widehat{\mathbf{Y}}=\widehat{\mathbf{\beta}} \mathbf{X}$$, which describe the component of GE directly regulated by the regulators. Accordingly, $$\mathbf{Y}^c= \mathbf{Y} \setminus \widehat{\mathbf{Y}}$$ describes the levels of GEs regulated by other regulators (that are not included in $$\mathbf{X}$$) as well as affected by other mechanisms.

For GEs, consider the weight matrix $$\mathbf{W}=(w_{jl})_{p\times p}$$, where the non-negative element $$w_{jl}$$ measures the similarity between gene $$j$$ and $$l$$. For a pair of the original GE measurements (as included in $$\mathbf{Y}$$), we define $$w_{jl}$$ equal to the inverse of their Euclidean distance. Note that there are multiple ways of defining the similarity. This definition is adopted because of its simplicity. It shares the same spirit as the popular K-means approach. Further, we define $$\widehat{\mathbf{W}}$$, which is defined in a similar way as $$\mathbf{W}$$ but using $$\widehat{\mathbf{Y}}$$, the regulated component of GEs.

Denote $$A=\{A_1, \ldots, A_K\}$$ as a partition of $$\{1, \ldots, p \}$$ which leads to $$K$$ disjoint clusters. For a set $$A_k$$, denote $$A_k^c$$ as its complement. We propose the ANCut (Assisted NCut) measure as $\text{ANCut}(A)=\sum \limits_{k=1}^{K}\frac{\text{cut}(A_k,A_k^c;\mathbf{W})} {\text{cutvol}(A_k;\widehat{\mathbf{W}} )},$ where $\text{cut}(A_k,A_k^c;\mathbf{W})=\sum \limits_{j\in A_k,l \in A_k^c} w_{jl},$ and $\text{cutvol}(A_k; \widehat{\mathbf{W}})=\sum \limits_{j,l \in A_k} \widehat{w}_{jl}.$ With a fixed $$K$$, the optimal clustering minimizes the ANCut measure. The minimization can be done with simulated annealing.

## Simulation Example

First we define some of the simulation parameters below.

The data will be simulated as Y = X*B + e where X will be normal with a convariance matrix S with 2 blocks of correlated variables. This induces the correlation among the Y’s as well. W0 is an incidence matrix that will be used to calculate the error of the procedure.

We apply the function ANCut to Y which will cluster the columns into K=2 groups. It uses the help of X. First, it creates a model of Y=XB+e using the elastic net. You can choose the number of cross-validations with ncv and the parameter alpha in the penalty of the elastic net.

If you wish to plot the results you can do:

On the left panel we see the path of the objective function as it is minimized through simulated annealing. On the right are represented the clusters. The perfect solution is a perfect checker board panel and the ANCut solution misses slightly. As n or h2 are increased, the solution will get closer to the true cluster structure of the data.

# MuNCut

This example shows how to use the muncut function. The Multilayer NCut (MuNCut) clusters the columns of data from 3 different sources. It clusters the columns of $$Z$$, $$Y$$ and $$X$$ into $$K$$ clusters by representing each data type as one network layer. It represents the $$Z$$ layer depending on $$Y$$, and the $$Y$$ layer depending on $$X$$. Elastic net can be used before the clustering procedure by using the predictions of $$Z$$ and $$Y$$ instead of the actual values to improve the cluster results. The function muncut will output $$K$$ clusters of columns of $$Z$$, $$Y$$ and $$X$$.

## Method

Denote $$Z=(Z_1, \ldots, Z_q)$$, $$Y=(Y_1, \ldots, Y_p)$$, and $$X=(X_1, \ldots, X_r)$$ as the length $$q$$, $$p$$, and $$r$$ vectors of proteins, GEs, and CNVs, respectively. With multilayer data both within- and across-layer connections need to be considered.

## NCut clustering within the same layers

First consider one data type or one layer, for example CNVs. Denote $$\mathbf{W}_{C}=(w_{jl,c})_{r \times r}$$ as the weight matrix, where the non-negative element $$w_{jl,c}$$ measures the similarity between CNVs $$j$$ and $$l$$. We set $$w_{jl,c}$$ equal to the Gaussian kernel. Denote $$A_{1,C}, \ldots, A_{K,C}$$ as a partition of $$\{1, \ldots, r \}$$ which leads to $$K$$ disjoint CNV clusters. Here in the subscript, $$C$$’’ is used to represent CNV. For $$A_{k,C}$$, denote $$A_{k,C}^c$$ as its complement set. Consider the NCut measure $\begin{eqnarray}\label{eq:ncut.cnv} \mbox{NCut}_C= \sum \limits_{k=1}^{K}\frac{\text{cut}(A_{k,C},A_{k,C}^c;\mathbf{W}_C)} {\text{cutvol}(A_{k,C}; \mathbf{W}_C )}, \end{eqnarray}$ where $$$\label{eq:cuty} \text{cut}(A_{k,C},A_{k,C}^c;\mathbf{W}_C)=\sum \limits_{j\in A_{k,C},l \in A_{k,C}^c} w_{jl,c},$$$ and $$$\label{eq:cutvoly} \text{cutvol}(A_{k,C}; \mathbf{W}_C)=\sum \limits_{j,l \in A_{k,C}} {w}_{jl,c}.$$$ In a similar way, we can define the NCut measures for GEs and proteins and denote them as $$\mbox{NCut}_G$$ and $$\mbox{NCut}_P$$, respectively. Note that each layer has its own weight matrix, namely $$\mathbf{W}_C$$, $$\mathbf{W}_G$$, and $$\mathbf{W}_P$$. Overall, define the single-layer NCut measure as $\begin{eqnarray}\label{within} \mbox{NCut}_{single}=\mbox{NCut}_C+\mbox{NCut}_G+\mbox{NCut}_P. \end{eqnarray}$ The optimal cutting is defined as the one that minimizes $$\mbox{NCut}_{single}$$. Note that $$\mbox{NCut}_{single}$$ does not take into account the regulations (interconnections) across layers, and working with this measure is equivalent to conducting the NCut clustering with each layer individually.

## NCut clustering across layers

In the above subsection, we have focused on the interconnections (similarity) for omics measurements within the same layers. Now we consider the interconnections between omics measurements belonging to different layers (for example, CNVs and GEs). Following the literature , we first adopt a regression-based approach to describe the regulations. Specifically, consider the models: $\begin{eqnarray}\label{regress.model} Y=X\mathbf{\beta_1}+\epsilon_1, ~~ Z=Y\mathbf{\beta_2}+\epsilon_2, \end{eqnarray}$ where $$\mathbf{\beta_1}$$ and $$\mathbf{\beta_2}$$ are the $$r\times p$$ and $$p\times q$$ matrices of unknown regression coefficients, and $$\epsilon_1$$ and $$\epsilon_2$$ are random errors (which may also include regulation mechanisms not measured). Assume $$n$$ iid subjects. Denote $$\mathbf{Y}$$ and $$\mathbf{X}$$ as the data matrices composed of the $$Y$$’s and $$X$$’s, respectively. For estimating the regression coefficient matrices, we consider a penalized approach, where the estimate of $$\mathbf{\beta_1}$$ is defined as $$$\label{eq:Enet1} \widehat{\mathbf{\beta_1}}=\underset{\mathbf{\beta}} {\textrm{argmin}}\left\{||\mathbf{Y}-\mathbf{X}\mathbf{\beta_1}||_2^2+ \lambda\left((1-\alpha)||\mathbf{\beta_1}||^2_2 + \alpha||\mathbf{\beta_1}||_1\right)\right\}.$$$ $$\lambda>0$$ and $$0\leq \alpha\leq 1$$ are data-dependent tuning parameters, and $$||\cdot||_{2(1)}$$ denotes the $$\ell_{2(1)}$$ norm. The estimate of $$\mathbf{\beta_2}$$ can be defined in a similar manner. With the estimates, define $$\mathbf{\widehat{Y}}=\mathbf{X}\mathbf{\widehat{\beta_1}}$$ and $$\mathbf{\widehat{Z}}=\mathbf{Y}\mathbf{\widehat{\beta_2}}$$.

For $$(\mathbf{X}, \mathbf{\widehat{Y}}, \mathbf{\widehat{Z}})$$, the length $$r+p+q$$ mega’’ vector of omics measurements, define the $$(r+p+q)\times (r+p+q)$$ weight matrix $\begin{eqnarray} \mathbf{\widetilde{W}}= \left( \begin{array}{ccc} \mathbf{0} & \mathbf{W_{\widehat{Z}:\widehat{Y}}} & \mathbf{0}\\ \mathbf{W_{\widehat{Z}:\widehat{Y}}}^T & \mathbf{0} & \mathbf{W_{\widehat{Y}:X}}\\ \mathbf{0} & \mathbf{W_{\widehat{Y}:X}}^T & \mathbf{0} \end{array} \right), \end{eqnarray}$ where $$\mathbf{0}$$ denotes a matrix with all components zero (note that in different places, it may have different dimensions), $$\mathbf{W_{\widehat{Z}:\widehat{Y}}}$$ and $$\mathbf{W_{\widehat{Y}:X}}$$ are the matrices of similarity between $$\mathbf{\widehat{Z}}$$ and $$\mathbf{\widehat{Y}}$$, and between $$\mathbf{\widehat{Y}}$$ and $$\mathbf{X}$$ respectively, and the superscript $$T$$ denotes transpose.

For a partition of $$\{1, \ldots, r+p+q \}$$ (which leads to a clustering structure), using $$\mathbf{\widetilde{W}}$$, we can compute the NCut measure $$\mbox{NCut}_{multi}$$ in the same manner as in $$\mbox{NCut}_C$$.

## MuNCut Objective Function

With a fixed $$K$$, let $$A=\{A_1, \ldots, A_K \}$$ denotes a disjoint partition of the CNVs, GEs, and proteins. Note that the cluster represented by $$A_k$$ may contain multiple types of omics measurements. For $$A_k$$ denote $$A_{k, C}$$, $$A_{k, G}$$, and $$A_{k, P}$$ as its components that are CNVs, GEs, and proteins, respectively. For $$A$$, we define its MuNCut measure as $\begin{eqnarray}\label{overall} \mbox{MuNCut}(A)=\mbox{NCut}_{multi}+\gamma \times \mbox{NCut}_{single}, \end{eqnarray}$ where $$\mbox{NCut}_{multi}$$ and $$\mbox{NCut}_{single}$$ are as defined above, and $$\gamma\geq 0$$ is a tuning parameter. The optimal clustering is defined as the one that minimizes $$\mbox{MuNCut}(A)$$.

## Toy Example

Three data types are considered: proteins in the upper layer; gene expressions in the middle layer; and CNVs in the lower layer. One dot represents one variable. Two dots are connected by a line if the corresponding variables are interconnected. Left panel: the true data structure with four clusters. Middle panel: MuNCut clustering. Right panel: K-means clustering. For K-means and MuNCut, different clusters are represented using different colors.

## Simulation Example

First, we define the simulation parameters, including the covariance matrix S of the $$X$$’s.

The matrix W0 will be used to evaluate how close the estimate is to the true cluster structure.

The code below shows how to sample the data from three layers. The multilayer network is such that $$\mathbf{X}$$ is the base layer, $$\mathbf{Y}$$ is the second layer which depends on $$\mathbf{X}$$, and $$\mathbf{Z}$$ is the third layer which depends on $$\mathbf{Y}$$.

The code below computes clusters using the MuNCut measure. With model = FALSE the raw data $$\mathbf{Y}$$ and $$\mathbf{Z}$$ are used. If model = TRUE, the predictions $$\widehat{\mathbf{Y}}$$ and $$\widehat{\mathbf{Z}}$$ are used instead of $$\mathbf{Y}$$ and $$\mathbf{Z}$$, respectively.

### References:

• Sebastian J. Teran Hidalgo and Shuangge Ma. “Clustering Multilayer Omics Data using MuNCut.” Revise and resubmit.

## PWNCut

The Penalized Weighted NCut (PWNCut) clusters the columns of $$X$$ into $$K$$ clusters by giving a weighted cluster membership while shrinking weights towards each other.

### Simulation Example

This sets up the initial parameters for the simulation.

This simulates the data:

To use PWNCut use the function pwncut.

### References:

• Sebastian J. Teran Hidalgo, Mengyun Wu and Shuangge Ma. “Penalized and weighted clustering of gene expression data using PWNCut.” Submitted.

## MLBNCut

The Multilayer Biclustering NCut (MLBNCut) clusters the columns and the rows simultaneously of data from 3 different sources. It clusters the columns of $$Z$$,$$Y$$ and $$X$$ into $$K$$ clusters and the samples into $$R$$ clusters by representing each data type as one network layer. It represents the $$Z$$ layer depending on $$Y$$, and the $$Y$$ layer depending on $$X$$. This function will output $$K$$ clusters of columns of $$Z$$, $$Y$$ and $$X$$ and $$R$$ clusters of the samples.

### Simulation Example

This sets up the initial parameters for the simulation.

X <- matrix(0,n,p)
Y <- matrix(0,n,p)
Z <- matrix(0,n,p)

Sigma=matrix(0,p,p)
Sigma[1:(p/5),1:(p/5)] <- rho
Sigma[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- rho
Sigma[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- rho
Sigma[(4*p/5+1):p,(4*p/5+1):p] <- rho
Sigma <- Sigma - diag(diag(Sigma))
Sigma <- Sigma + diag(p)

X[1:(n/2),]   <- mvrnorm(n/2,rep(mu,p),Sigma)
X[(n/2+1):n,] <- mvrnorm(n/2,rep(-mu,p),Sigma)

B11 <- matrix(0,p,p)
B12 <- matrix(0,p,p)
B21 <- matrix(0,p,p)
B22 <- matrix(0,p,p)

B11[1:(p/5),1:(p/5)]                     <- runif((p/5)^2, h/2, h)*rbinom((p/5)^2, 1, 0.5)
B11[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]     <- runif((2*p/5)^2, h/2, h)*rbinom((2*p/5)^2, 1, 0.5)
B11[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2, h/2, h)*rbinom((p/5)^2, 1, 0.5)
B11[(4*p/5+1):p,(4*p/5+1):p]             <- runif((1*p/5)^2, h/2, h)*rbinom((1*p/5)^2, 1, 0.5)

B12[1:(p/5),1:(p/5)]                     <- runif((p/5)^2, -h, -h/2)*rbinom((p/5)^2, 1, 0.5)
B12[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]     <- runif((2*p/5)^2, -h, -h/2)*rbinom((2*p/5)^2, 1, 0.5)
B12[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2, -h, -h/2)*rbinom((p/5)^2, 1, 0.5)
B12[(4*p/5+1):p,(4*p/5+1):p]             <- runif((1*p/5)^2, -h, -h/2)*rbinom((1*p/5)^2, 1, 0.5)

B21[1:(p/5),1:(p/5)]                     <- runif((p/5)^2, h/2, h)*rbinom((p/5)^2,1,0.5)
B21[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]     <- runif((2*p/5)^2, h/2, h)*rbinom((2*p/5)^2,1,0.5)
B21[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2, h/2, h)*rbinom((p/5)^2,1,0.5)
B21[(4*p/5+1):p,(4*p/5+1):p]             <- runif((1*p/5)^2, h/2, h)*rbinom((1*p/5)^2,1,0.5)

B22[1:(p/5),1:(p/5)]                     <- runif((p/5)^2, -h, -h/2)*rbinom((p/5)^2, 1, 0.5)
B22[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]     <- runif((2*p/5)^2, -h, -h/2)*rbinom((2*p/5)^2, 1, 0.5)
B22[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2, -h, -h/2)*rbinom((p/5)^2, 1, 0.5)
B22[(4*p/5+1):p,(4*p/5+1):p]             <- runif((1*p/5)^2, -h, -h/2)*rbinom((1*p/5)^2, 1, 0.5)

Y[1:(n/2), ]   <- X[1:(n/2),]%*%B11+matrix(rnorm((n/2)*p, 0, 0.25), n/2, p)
Y[(n/2+1):n, ] <- X[(n/2+1):n,]%*%B12+matrix(rnorm((n/2)*p, 0, 0.25), n/2, p)

Z[1:(n/2), ]   <- Y[1:(n/2),]%*%B21+matrix(rnorm((n/2)*p, 0, 0.25),n/2,p)
Z[(n/2+1):n, ] <- Y[(n/2+1):n,]%*%B22+matrix(rnorm((n/2)*p, 0, 0.25),n/2,p)

### References:

• Sebastian J. Teran Hidalgo and Shuangge Ma. “Multilayer Biclustering of Omics Data using MLBNCut.” Work in progress.

## AWNCut

The Assisted Weighted NCut builds the similarity matrices for the rows of $$X$$ and an assisted dataset $$Z$$. Clusters them into $$K$$ groups while conducting feature selection based on the AWNCut method.

### References:

• Li, Yang; Bie, Ruofan; Teran Hidalgo, Sebastian; Qin, Yinchen; Wu, Mengyun; Ma, Shuangge. “Assisted gene expression-based clustering with AWNCut.” Submitted.