---
title: "Time Complexity Analysis in fastcpd"
output: rmarkdown::html_vignette
description: |
  This vignette explores the computational complexity of the `fastcpd.mean`
  function and demonstrates the impact of SeDG on change point detection
  using `fastcpd.lasso()`.
vignette: >
  %\VignetteIndexEntry{Time Complexity Analysis in fastcpd}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


# Introduction

This vignette examines two aspects of the **fastcpd** package:

1. **Time Complexity of `fastcpd.mean()`:**
   We assess how the execution time scales with the length of the data.

2. **Impact of SeDG in `fastcpd.lasso()`:**
   We simulate a Lasso regression setting with multiple change points and compare the detected change points under two settings of the `vanilla_percentage` parameter. This highlights the performance improvement provided by SeDG.

# Time Complexity Simulation for `fastcpd.mean()`

In this section, we generate multivariate normal data with varying lengths and measure the execution time of the `fastcpd.mean()` function. We then create a log-log plot of the execution times and perform a linear regression on the log-transformed data to estimate the power law coefficient.


``` r
# Load necessary libraries
library(ggplot2)
library(fastcpd)

# Set a seed for reproducibility
set.seed(1)

# Function to evaluate execution time for different data lengths
evaluate_time_complexity <- function(ns, p) {
  times <- numeric(length(ns))  # Vector to store execution times

  for (i in seq_along(ns)) {
    n <- ns[i]
    # Generate n samples of p-dimensional data from a multivariate normal distribution
    data_matrix <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(1, p))
    # Record the elapsed time for detecting change points
    times[i] <- system.time(fastcpd.mean(data_matrix, r.progress = FALSE, cp_only = TRUE))[[1]]
  }

  return(times)
}

# Define a sequence of data lengths
ns <- 10^5 * seq_len(10)
p <- 5  # Dimensionality of the data

# Evaluate execution times for each data length
execution_times <- evaluate_time_complexity(ns, p)

# Prepare data for plotting
time_data <- data.frame(
  n = ns,
  time = execution_times
)

# Plot execution times on a log-log scale
ggplot(time_data, aes(x = n, y = time)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Time Complexity of fastcpd.mean",
    x = "Data Length (log10 scale)",
    y = "Execution Time (seconds, log10 scale)"
  ) +
  theme_minimal()
```

![plot of chunk time-complexity](time-complexity/time-complexity-1.png)

``` r

# Log-transform the data for linear regression
log_ns <- log10(ns)
log_times <- log10(execution_times)

# Perform linear regression to estimate the power coefficient
regression_model <- lm(log_times ~ log_ns)
summary(regression_model)
#> 
#> Call:
#> lm(formula = log_times ~ log_ns)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -0.0166987 -0.0093375 -0.0004944  0.0062624  0.0281940 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -5.57486    0.08242  -67.64 2.54e-12 ***
#> log_ns       0.97266    0.01455   66.84 2.79e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0139 on 8 degrees of freedom
#> Multiple R-squared:  0.9982,	Adjusted R-squared:  0.998 
#> F-statistic:  4468 on 1 and 8 DF,  p-value: 2.792e-12

# Extract and display the slope (power coefficient)
power_coefficient <- coef(regression_model)[2]
power_coefficient
#>    log_ns 
#> 0.9726593
```

# Impact of SeDG in `fastcpd.lasso()`

In this section, we simulate a Lasso regression model with change points. We compare the performance of the vanilla approach (without SeDG) and the SeDG-enhanced approach by varying the `vanilla_percentage` parameter. The detected change points are extracted for both settings.


``` r
# Load required libraries
library(fastcpd)

# Set seed for reproducibility
set.seed(1)

# Simulation parameters
n <- 480         # Total number of observations
p_true <- 6      # Number of true predictors with non-zero coefficients
p <- 50          # Total number of predictors

# Generate design matrix with n observations and p predictors
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))

# Create true coefficient matrix for 4 segments
theta_0 <- rbind(
  runif(p_true, -5, -2),
  runif(p_true, -3, 3),
  runif(p_true, 2, 5),
  runif(p_true, -5, 5)
)
# Pad the coefficient matrix with zeros for the remaining predictors
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))

# Simulate response variable with change points across segments
y <- c(
  x[1:80, ] %*% theta_0[1, ] + rnorm(80, 0, 2),
  x[81:200, ] %*% theta_0[2, ] + rnorm(120, 0, 2),
  x[201:320, ] %*% theta_0[3, ] + rnorm(120, 0, 2),
  x[321:n, ] %*% theta_0[4, ] + rnorm(160, 0, 2)
)

# Combine response and predictors into a data frame
lasso_data <- data.frame(y = y, x = x)

# Detect change points using fastcpd.lasso without SeDG (vanilla_percentage = 0)
system.time(result_seg_non_vanilla <- fastcpd.lasso(lasso_data, vanilla_percentage = 0, r.progress = FALSE))
#>    user  system elapsed 
#>   5.475   0.678   6.179
cat("Change points with SeDG (vanilla_percentage = 0):\n")
#> Change points with SeDG (vanilla_percentage = 0):
result_seg_non_vanilla@cp_set
#> [1]  79 203 320

# Detect change points using fastcpd.lasso with the vanilla approach (vanilla_percentage = 1)
system.time(result_seg_vanilla <- fastcpd.lasso(lasso_data, vanilla_percentage = 1, r.progress = FALSE))
#>    user  system elapsed 
#> 116.042   6.229 122.451
cat("Change points with vanilla approach (vanilla_percentage = 1):\n")
#> Change points with vanilla approach (vanilla_percentage = 1):
result_seg_vanilla@cp_set
#> [1] 200 321
```

# Notes

This document is generated by the following code:

```shell
R -e 'knitr::knit("vignettes/time-complexity.Rmd.original", output = "vignettes/time-complexity.Rmd")' && rm -rf vignettes/time-complexity && mv -f time-complexity vignettes
```

# Appendix: all code snippets


``` r
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>", eval = TRUE, warning = FALSE,
  fig.path="time-complexity/"
)
library(fastcpd)
# Load necessary libraries
library(ggplot2)
library(fastcpd)

# Set a seed for reproducibility
set.seed(1)

# Function to evaluate execution time for different data lengths
evaluate_time_complexity <- function(ns, p) {
  times <- numeric(length(ns))  # Vector to store execution times

  for (i in seq_along(ns)) {
    n <- ns[i]
    # Generate n samples of p-dimensional data from a multivariate normal distribution
    data_matrix <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(1, p))
    # Record the elapsed time for detecting change points
    times[i] <- system.time(fastcpd.mean(data_matrix, r.progress = FALSE, cp_only = TRUE))[[1]]
  }

  return(times)
}

# Define a sequence of data lengths
ns <- 10^5 * seq_len(10)
p <- 5  # Dimensionality of the data

# Evaluate execution times for each data length
execution_times <- evaluate_time_complexity(ns, p)

# Prepare data for plotting
time_data <- data.frame(
  n = ns,
  time = execution_times
)

# Plot execution times on a log-log scale
ggplot(time_data, aes(x = n, y = time)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Time Complexity of fastcpd.mean",
    x = "Data Length (log10 scale)",
    y = "Execution Time (seconds, log10 scale)"
  ) +
  theme_minimal()

# Log-transform the data for linear regression
log_ns <- log10(ns)
log_times <- log10(execution_times)

# Perform linear regression to estimate the power coefficient
regression_model <- lm(log_times ~ log_ns)
summary(regression_model)

# Extract and display the slope (power coefficient)
power_coefficient <- coef(regression_model)[2]
power_coefficient
# Load required libraries
library(fastcpd)

# Set seed for reproducibility
set.seed(1)

# Simulation parameters
n <- 480         # Total number of observations
p_true <- 6      # Number of true predictors with non-zero coefficients
p <- 50          # Total number of predictors

# Generate design matrix with n observations and p predictors
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))

# Create true coefficient matrix for 4 segments
theta_0 <- rbind(
  runif(p_true, -5, -2),
  runif(p_true, -3, 3),
  runif(p_true, 2, 5),
  runif(p_true, -5, 5)
)
# Pad the coefficient matrix with zeros for the remaining predictors
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))

# Simulate response variable with change points across segments
y <- c(
  x[1:80, ] %*% theta_0[1, ] + rnorm(80, 0, 2),
  x[81:200, ] %*% theta_0[2, ] + rnorm(120, 0, 2),
  x[201:320, ] %*% theta_0[3, ] + rnorm(120, 0, 2),
  x[321:n, ] %*% theta_0[4, ] + rnorm(160, 0, 2)
)

# Combine response and predictors into a data frame
lasso_data <- data.frame(y = y, x = x)

# Detect change points using fastcpd.lasso without SeDG (vanilla_percentage = 0)
system.time(result_seg_non_vanilla <- fastcpd.lasso(lasso_data, vanilla_percentage = 0, r.progress = FALSE))
cat("Change points with SeDG (vanilla_percentage = 0):\n")
result_seg_non_vanilla@cp_set

# Detect change points using fastcpd.lasso with the vanilla approach (vanilla_percentage = 1)
system.time(result_seg_vanilla <- fastcpd.lasso(lasso_data, vanilla_percentage = 1, r.progress = FALSE))
cat("Change points with vanilla approach (vanilla_percentage = 1):\n")
result_seg_vanilla@cp_set
```