The univariate kernel density estimation (Kde1d) API
This notebook demonstrates the usage of the Kde1d class from pyvinecopulib for various types of kernel density estimation tasks. Each section includes data generation, model fitting, evaluation, and visualization:
Unbounded continuous data: Standard kernel density estimation with automatic bandwidth selection
Bounded continuous data: Boundary-corrected estimation with custom degree parameters
Discrete data: Specialized estimation for integer-valued data
Zero-inflated data: Handling excess zeros with mixed continuous-discrete models
Weighted estimation: Incorporating observation weights in the estimation process
Simulation and inference: Generating samples and computing quantiles from fitted densities
Key Features
Automatic boundary correction for bounded data
Support for different polynomial degrees
Automatic bandwidth selection
Comprehensive statistical interface (pdf, cdf, quantile, simulate, plot)
Setup and Imports
[2]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pyvinecopulib as pv
# Set random seed for reproducibility
np.random.seed(42)
# Configure matplotlib for better plots
plt.style.use("default")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 11
print("Libraries imported successfully!")
print(f"pyvinecopulib version: {pv.__version__}")
Libraries imported successfully!
pyvinecopulib version: 0.7.4
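Before the individual sections, here is a minimal end-to-end sketch of the workflow they all share, using only calls demonstrated later in this notebook (fit, pdf, cdf, quantile, simulate); the data and seed are purely illustrative.
[ ]:
# Minimal workflow sketch (imports repeated so the cell stands alone)
import numpy as np
import pyvinecopulib as pv

x = np.random.normal(size=200)  # illustrative sample
kde = pv.Kde1d()                # default: continuous, unbounded support
kde.fit(x)                      # bandwidth is selected automatically

grid = np.linspace(-3, 3, 5)
print(kde.pdf(grid))                            # density values
print(kde.cdf(grid))                            # distribution function
print(kde.quantile(np.array([0.1, 0.5, 0.9])))  # inverse CDF
print(kde.simulate(n=5, seeds=[1]))             # draws from the fitted density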
1. Unbounded Continuous Data
Let’s start with the simplest case: unbounded continuous data using a normal distribution.
[ ]:
# Generate normal data
x_normal = np.random.normal(0, 1, 500)
# Create and fit Kde1d object
kde_normal = pv.Kde1d() # Default constructor for continuous unbounded data
kde_normal.fit(x_normal)
print(f"Kde1d fitted to normal data: {kde_normal}")
print(f"Log-likelihood: {kde_normal.loglik:.4f}")
print(f"Effective degrees of freedom: {kde_normal.edf:.2f}")
# Evaluate density at specific points
eval_points = np.array([0.0, -1.0, 1.0, -2.0, 2.0])
density_values = kde_normal.pdf(eval_points)
print("\nDensity evaluation:")
for point, density in zip(eval_points, density_values):
    true_density = stats.norm.pdf(point)  # True normal density for comparison
    print(
        f"x = {point:4.1f}: estimated = {density:.4f}, true = {true_density:.4f}, diff = {abs(density - true_density):.4f}"
    )
Kde1d fitted to normal data: <pyvinecopulib.Kde1d> Kde1d(xmin=nan, xmax=nan, type='continuous', bandwidth=0.763029, multiplier=1, degree=2)
Number of grid points: 401
Log-likelihood: -696.9225
Effective degrees of freedom: 7.00
Density evaluation:
x = 0.0: estimated = 0.4084, true = 0.3989, diff = 0.0095
x = -1.0: estimated = 0.2512, true = 0.2420, diff = 0.0092
x = 1.0: estimated = 0.2350, true = 0.2420, diff = 0.0070
x = -2.0: estimated = 0.0428, true = 0.0540, diff = 0.0112
x = 2.0: estimated = 0.0526, true = 0.0540, diff = 0.0014
[12]:
# Plot the density estimate
plt.figure(figsize=(12, 6))
# Add true density for comparison
x_range = np.linspace(-4, 4, 200)
true_density = stats.norm.pdf(x_range)
kde_density = kde_normal.pdf(x_range)
plt.plot(x_range, kde_density, "b-", label="KDE")
plt.plot(x_range, true_density, "r--", label="True")
plt.title("Unbounded Continuous Data: Normal Distribution")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
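As a quick sanity check (a sketch, not part of the original analysis), the estimated density can be summed over a fine grid; the result should be close to one.
[ ]:
# Sanity check sketch: a Riemann sum of the estimated density over a wide,
# fine grid should be approximately 1.
wide_grid = np.linspace(-6, 6, 2401)
step = wide_grid[1] - wide_grid[0]
area = np.sum(kde_normal.pdf(wide_grid)) * step
print(f"Approximate integral of the estimated density over [-6, 6]: {area:.4f}")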

2. Bounded Continuous Data
Now let’s work with bounded data using gamma and beta distributions.
[15]:
# Generate gamma and beta data
x_gamma = np.random.gamma(shape=1, scale=1, size=500)
x_beta = np.random.beta(a=2, b=5, size=500)
# Create Kde1d for bounded data (gamma distribution)
# Set xmin=0 since gamma distribution is defined on [0, ∞)
kde_gamma = pv.Kde1d(xmin=0)
kde_gamma.fit(x_gamma)
print(f"Kde1d fitted to gamma data (bounded at 0): {kde_gamma}")
print(f"Log-likelihood: {kde_gamma.loglik:.4f}")
print(f"Effective degrees of freedom: {kde_gamma.edf:.2f}")
# Create Kde1d for bounded data (beta distribution)
# Set xmin=0 and xmax=1 since beta distribution is defined on [0, 1]
kde_beta = pv.Kde1d(xmin=0, xmax=1)
kde_beta.fit(x_beta)
print(f"Kde1d fitted to beta data (bounded between 0 and 1): {kde_beta}")
print(f"Log-likelihood: {kde_beta.loglik:.4f}")
print(f"Effective degrees of freedom: {kde_beta.edf:.2f}")
Kde1d fitted to gamma data (bounded at 0): <pyvinecopulib.Kde1d> Kde1d(xmin=0, xmax=nan, type='continuous', bandwidth=0.813595, multiplier=1, degree=2)
Log-likelihood: -491.5955
Effective degrees of freedom: 7.68
Kde1d fitted to beta data (bounded between 0 and 1): <pyvinecopulib.Kde1d> Kde1d(xmin=0, xmax=1, type='continuous', bandwidth=0.36799, multiplier=1, degree=2)
Log-likelihood: 235.5237
Effective degrees of freedom: 6.12
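The introduction mentions custom degree parameters, which are not exercised above. The sketch below assumes the constructor accepts degree and multiplier keywords named after the fields shown in the repr output (degree=2, multiplier=1); consult help(pv.Kde1d) if your version exposes them differently.
[ ]:
# Hedged sketch: `degree` and `multiplier` are assumed constructor keywords,
# named after the repr fields above; verify with help(pv.Kde1d).
kde_gamma_ll = pv.Kde1d(xmin=0, degree=1)  # local linear instead of local quadratic
kde_gamma_ll.fit(x_gamma)
print(kde_gamma_ll)

kde_gamma_smooth = pv.Kde1d(xmin=0, multiplier=2)  # inflate the automatic bandwidth
kde_gamma_smooth.fit(x_gamma)
print(
    f"Default bandwidth: {kde_gamma.bandwidth:.4f}, "
    f"with multiplier=2: {kde_gamma_smooth.bandwidth:.4f}"
)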
[18]:
# Overlay true density for comparison - Manual plotting approach
eps = 1e-2 # Small epsilon to avoid boundary issues
x_range_gamma = np.linspace(0 + eps, 10, 200)
true_density_gamma = stats.gamma.pdf(x_range_gamma, a=1) # True gamma density
kde_density_gamma = kde_gamma.pdf(x_range_gamma)
x_range_beta = np.linspace(0 + eps, 1 - eps, 200)
true_density_beta = stats.beta.pdf(x_range_beta, a=2, b=5) # True beta density
kde_density_beta = kde_beta.pdf(x_range_beta)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(x_range_gamma, kde_density_gamma, "b-", linewidth=2, label="KDE")
plt.plot(
x_range_gamma, true_density_gamma, "r--", linewidth=2, label="True density"
)
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Lower Bounded KDE")
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(x_range_beta, kde_density_beta, "b-", linewidth=2, label="KDE")
plt.plot(
x_range_beta, true_density_beta, "r--", linewidth=2, label="True density"
)
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Lower and Upper Bounded KDE")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

3. Discrete Data
For discrete data, we need to specify the type as “discrete” and set appropriate bounds.
[20]:
# Generate discrete data (binomial distribution)
x_binomial = np.random.binomial(n=5, p=0.5, size=500)
# Create Kde1d for discrete data
# Set type="discrete" and specify bounds [0, 5] since binomial(n=5) has support on {0,1,2,3,4,5}
kde_discrete = pv.Kde1d(xmin=0, xmax=5, type="discrete")
kde_discrete.fit(x_binomial)
print(f"Kde1d fitted to discrete data (binomial): {kde_discrete}")
print(f"Log-likelihood: {kde_discrete.loglik:.4f}")
print(f"Effective degrees of freedom: {kde_discrete.edf:.2f}")
# Evaluate density at integer points (the support of binomial distribution)
eval_points_discrete = np.array([0, 1, 2, 3, 4, 5])
density_discrete = kde_discrete.pdf(eval_points_discrete.astype(float))
print("\nDensity evaluation for discrete data:")
print("Point | Estimated | True (Binom) | Empirical Frequency")
print("------|-----------|--------------|---------------")
frequency_counts = np.bincount(x_binomial, minlength=6)
empirical_frequencies = frequency_counts / len(x_binomial)
for point, estimated in zip(eval_points_discrete, density_discrete):
    true_density = stats.binom.pmf(point, n=5, p=0.5)
    print(
        f" {point} | {estimated:.4f} | {true_density:.4f} | {empirical_frequencies[point]:.4f}"
    )
Kde1d fitted to discrete data (binomial): <pyvinecopulib.Kde1d> Kde1d(xmin=0, xmax=5, type='discrete', bandwidth=1.11374, multiplier=1, degree=2)
Log-likelihood: -758.9803
Effective degrees of freedom: 4.02
Density evaluation for discrete data:
Point | Estimated | True (Binom) | Empirical Frequency
------|-----------|--------------|---------------
0 | 0.0375 | 0.0312 | 0.0360
1 | 0.1642 | 0.1562 | 0.1640
2 | 0.3173 | 0.3125 | 0.3160
3 | 0.3094 | 0.3125 | 0.3140
4 | 0.1451 | 0.1562 | 0.1440
5 | 0.0266 | 0.0312 | 0.0260
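Because the discrete fit exposes the same cdf() method used in Section 6, a quick comparison with the true binomial CDF is easy; this is a sketch, and the printed values depend on the fit above.
[ ]:
# Compare the estimated CDF with the true binomial CDF at the support points.
support = np.arange(6, dtype=float)
est_cdf = kde_discrete.cdf(support)
true_cdf_binom = stats.binom.cdf(support, n=5, p=0.5)
for k, e, t in zip(support, est_cdf, true_cdf_binom):
    print(f"P(X <= {k:.0f}): estimated = {e:.4f}, true = {t:.4f}")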
4. Zero-Inflated Data
Zero-inflated data requires special treatment using the “zero-inflated” type to handle the excess zeros.
[26]:
# Generate zero-inflated data (exponential with added zeros)
x_zi = np.random.exponential(scale=1, size=500)
zero_indices = np.random.choice(500, size=200, replace=False)
x_zi[zero_indices] = 0
# Create Kde1d for zero-inflated data
# Set type="zero-inflated" and xmin=0 since we know the data is non-negative
kde_zi = pv.Kde1d(xmin=0, type="zero-inflated")
kde_zi.fit(x_zi)
print(f"Kde1d fitted to zero-inflated data: {kde_zi}")
print(f"Log-likelihood: {kde_zi.loglik:.4f}")
print(f"Effective degrees of freedom: {kde_zi.edf:.2f}")
print(f"Probability at zero (prob0): {kde_zi.prob0:.4f}")
# Evaluate density at specific points including zero
eval_points_zi = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 6.0])
density_zi = kde_zi.pdf(eval_points_zi)
print("\nDensity evaluation for zero-inflated data:")
for point, density in zip(eval_points_zi, density_zi):
    if point == 0:
        print(f"x = {point:4.1f}: estimated = {density:.4f} (includes point mass)")
    else:
        # True density of the continuous part: 0.6 * exp(-x), since the
        # non-zero observations come from an Exponential(scale=1)
        true_continuous = 0.6 * stats.expon.pdf(point, scale=1)
        print(
            f"x = {point:4.1f}: estimated = {density:.4f}, true continuous = {true_continuous:.4f}"
        )
Kde1d fitted to zero-inflated data: <pyvinecopulib.Kde1d> Kde1d(xmin=0, xmax=nan, type='zero-inflated', bandwidth=0.768109, multiplier=1, degree=2)
Log-likelihood: -628.2624
Effective degrees of freedom: 8.48
Probability at zero (prob0): 0.4000
Density evaluation for zero-inflated data:
x = 0.0: estimated = 0.4000 (includes point mass)
x = 0.5: estimated = 0.4139, true continuous = 0.3639
x = 1.0: estimated = 0.2476, true continuous = 0.2207
x = 2.0: estimated = 0.0721, true continuous = 0.0812
x = 4.0: estimated = 0.0087, true continuous = 0.0110
x = 6.0: estimated = 0.0000, true continuous = 0.0015
[29]:
x_range = np.linspace(0.1, 10, 200)
kde_density_zi = kde_zi.pdf(x_range)
true_density_zi = 0.6 * stats.expon.pdf(x_range, scale=1)
# Plot the zero-inflated density estimate
plt.figure(figsize=(12, 6))
plt.plot(
x_range, kde_density_zi, "b-", label="KDE (continuous part)", linewidth=2
)
plt.plot(
x_range, true_density_zi, "r--", label="True (continuous part)", linewidth=2
)
# Add point mass at zero
plt.plot(
0,
kde_zi.prob0,
"bo",
markersize=10,
label=f"Estimated point mass at zero ({kde_zi.prob0:.2f})",
markerfacecolor="blue",
)
plt.plot(
0,
0.4,
"ro",
markersize=10,
label="True point mass at zero (0.4)",
markerfacecolor="red",
)
plt.title("Zero-Inflated Data: Exponential with 40% Zeros")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-0.5, 10)
plt.show()
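Two quick consistency checks on the zero-inflated fit, as a sketch using only cdf() and simulate(), both demonstrated elsewhere in this notebook. Whether simulated draws contain exact zeros is an assumption about the zero-inflated implementation, so treat the second check as indicative.
[ ]:
# Check 1 (sketch): P(X <= 0) should be close to the estimated point mass prob0,
# since the continuous part of the model puts no mass at or below zero.
print(f"cdf(0) = {kde_zi.cdf(np.array([0.0]))[0]:.4f}, prob0 = {kde_zi.prob0:.4f}")

# Check 2 (sketch, assumes simulated draws include exact zeros with probability prob0):
sim_zi = kde_zi.simulate(n=2000, seeds=[7])
print(f"Fraction of exact zeros among simulated draws: {np.mean(sim_zi == 0):.4f}")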

5. Weighted Density Estimation
Kde1d supports weighted estimation, which is useful for bootstrap procedures or when observations carry different importance; a bootstrap sketch follows at the end of this section.
[30]:
# Generate data for weighted estimation
x_weighted = np.random.normal(0, 1, 100)
weights = np.random.exponential(scale=1, size=100) # Random weights
# Weighted estimation - fit the same data with and without weights
kde_weighted = pv.Kde1d()
kde_weighted.fit(x_weighted, weights)
kde_unweighted = pv.Kde1d()
kde_unweighted.fit(x_weighted)
print("Weighted vs. Unweighted estimation:")
print(
f"Weighted - Bandwidth: {kde_weighted.bandwidth:.4f}, Log-likelihood: {kde_weighted.loglik:.4f}"
)
print(
f"Unweighted - Bandwidth: {kde_unweighted.bandwidth:.4f}, Log-likelihood: {kde_unweighted.loglik:.4f}"
)
# Compare density estimates at specific points
eval_points_weight = np.array([-2, -1, 0, 1, 2])
density_weighted = kde_weighted.pdf(eval_points_weight)
density_unweighted = kde_unweighted.pdf(eval_points_weight)
print("\nDensity comparison (weighted vs. unweighted):")
print("Point | Weighted | Unweighted | Difference")
print("------|----------|------------|------------")
for point, w_dens, uw_dens in zip(
    eval_points_weight, density_weighted, density_unweighted
):
    diff = abs(w_dens - uw_dens)
    print(f" {point:2.0f} | {w_dens:.4f} | {uw_dens:.4f} | {diff:.4f}")
print(
f"\nWeight statistics: min={weights.min():.3f}, max={weights.max():.3f}, mean={weights.mean():.3f}"
)
Weighted vs. Unweighted estimation:
Weighted - Bandwidth: 1.0910, Log-likelihood: -126.7015
Unweighted - Bandwidth: 0.8605, Log-likelihood: -136.2553
Density comparison (weighted vs. unweighted):
Point | Weighted | Unweighted | Difference
------|----------|------------|------------
-2 | 0.0454 | 0.0512 | 0.0058
-1 | 0.2374 | 0.2064 | 0.0310
0 | 0.4691 | 0.4371 | 0.0320
1 | 0.2244 | 0.2512 | 0.0268
2 | 0.0194 | 0.0463 | 0.0269
Weight statistics: min=0.006, max=4.020, mean=0.916
[32]:
x_range = np.linspace(-4, 4, 200)
true_density = stats.norm.pdf(x_range)
kde_density_weighted = kde_weighted.pdf(x_range)
kde_density_unweighted = kde_unweighted.pdf(x_range)
# Plot comparison of weighted vs. unweighted estimates
plt.figure(figsize=(12, 6))
plt.plot(x_range, kde_density_weighted, "b-", label="Weighted KDE", linewidth=2)
plt.plot(
x_range, kde_density_unweighted, "g--", label="Unweighted KDE", linewidth=2
)
plt.plot(x_range, true_density, "r:", label="True Density", linewidth=2)
plt.title("Weighted vs. Unweighted Density Estimation")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
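To make the bootstrap remark concrete, here is a sketch of a Bayesian-bootstrap-style variability band: each replicate refits the same data with fresh random exponential weights via fit(x, weights), exactly the call used above. The number of replicates and the band quantiles are arbitrary choices.
[ ]:
# Bootstrap sketch: refit with random exponential weights (Bayesian-bootstrap style)
# and collect the resulting density curves to form a rough pointwise band.
n_boot = 50
grid = np.linspace(-4, 4, 200)
boot_curves = np.empty((n_boot, grid.size))
for b in range(n_boot):
    w_b = np.random.exponential(scale=1, size=len(x_weighted))
    kde_b = pv.Kde1d()
    kde_b.fit(x_weighted, w_b)
    boot_curves[b] = kde_b.pdf(grid)
lower, upper = np.percentile(boot_curves, [5, 95], axis=0)

plt.figure(figsize=(12, 6))
plt.fill_between(grid, lower, upper, color="b", alpha=0.2, label="90% bootstrap band")
plt.plot(grid, kde_unweighted.pdf(grid), "b-", linewidth=2, label="Unweighted KDE")
plt.plot(grid, stats.norm.pdf(grid), "r--", linewidth=2, label="True density")
plt.title("Bootstrap Variability Band (sketch)")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()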

6. Plots, Simulation and Additional Features
The Kde1d class also supports plotting directly, simulating from the fitted density, and evaluating the CDF and quantile functions.
[ ]:
kde_normal.plot()
# Simulate from fitted density
simulated_data = kde_normal.simulate(n=1000, seeds=[123])
print("Simulation from fitted normal density:")
print(f"Original data: mean={x_normal.mean():.3f}, std={x_normal.std():.3f}")
print(
f"Simulated data: mean={simulated_data.mean():.3f}, std={simulated_data.std():.3f}"
)
# CDF and quantile functions
print("\nCDF and quantile evaluation for normal kde:")
test_points = np.array([-1, 0, 1])
cdf_values = kde_normal.cdf(test_points)
print(f"CDF at [-1, 0, 1]: {cdf_values}")
quantiles = np.array([0.25, 0.5, 0.75])
quantile_values = kde_normal.quantile(quantiles)
print(f"Quantiles at [0.25, 0.5, 0.75]: {quantile_values}")
# True values for comparison
true_cdf = stats.norm.cdf(test_points)
true_quantiles = stats.norm.ppf(quantiles)
print(f"True CDF values: {true_cdf}")
print(f"True quantiles: {true_quantiles}")
# Plot histogram of simulated data vs. original
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(x_normal, bins=20, density=True, alpha=0.7, label="Original data")
plt.hist(
simulated_data, bins=30, density=True, alpha=0.7, label="Simulated data"
)
plt.title("Histogram Comparison")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
# Q-Q plot
stats.probplot(simulated_data, dist=stats.norm, plot=plt)
plt.title("Q-Q Plot: Simulated vs. Normal")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Simulation from fitted normal density:
Original data: mean=0.007, std=0.980
Simulated data: mean=0.029, std=0.922
CDF and quantile evaluation for normal kde:
CDF at [-1, 0, 1]: [0.149304 0.5003765 0.84513907]
Quantiles at [0.25, 0.5, 0.75]: [-0.65620319 -0.00092187 0.65901036]
True CDF values: [0.15865525 0.5 0.84134475]
True quantiles: [-0.67448975 0. 0.67448975]
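As a final sketch, a probability integral transform (PIT) diagnostic: applying the fitted cdf() to the training data should give values that are roughly uniform on [0, 1] (mean near 0.5, standard deviation near 1/sqrt(12) ≈ 0.289).
[ ]:
# PIT diagnostic sketch: u = F_hat(x) should be approximately Uniform(0, 1)
# if the fitted distribution describes the data well.
u = kde_normal.cdf(x_normal)
print(f"PIT mean: {u.mean():.3f} (uniform: 0.500)")
print(f"PIT std:  {u.std():.3f} (uniform: {1 / np.sqrt(12):.3f})")
plt.figure(figsize=(6, 4))
plt.hist(u, bins=20, density=True, alpha=0.7)
plt.axhline(1.0, color="r", linestyle="--", label="Uniform(0, 1) density")
plt.title("PIT Histogram (sketch)")
plt.xlabel("u")
plt.ylabel("Density")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()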
