Convolutional Kernel Zoo
A family index of 2D convolutional kernels and the operations built on them. The discrete convolution is the fundamental operation in classical image processing, modern computer vision, and convolutional neural networks. This note enumerates the classic linear kernels, the nonlinear neighborhood operators that extend them, the morphological and multi-resolution variants, the kernels embedded in modern CNN and ViT architectures, and the canonical applications of each family.
1. Mathematical Foundation
1.1 Discrete 2D Convolution
For input image I and kernel K of size (2m+1) × (2n+1):
(I * K)(x, y) = Σ_{i=-m..m} Σ_{j=-n..n} I(x − i, y − j) · K(i, j)
Cross-correlation (often what library functions actually compute) drops the index flip:
(I ⋆ K)(x, y) = Σ I(x + i, y + j) · K(i, j)
For kernels symmetric under a 180° rotation, K(i, j) = K(−i, −j), these are identical, which is why the distinction is often glossed over.
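To make the index flip concrete, here is a minimal NumPy sketch. The helper names correlate2d_same and convolve2d_same are illustrative (not library functions), and zero padding with odd kernel sizes is assumed. It checks that the two operations agree for a box kernel and differ in sign for Sobel.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Cross-correlation with zero padding; output has the input's shape."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    h, w = image.shape
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Tap (i, j) multiplies the input shifted by (i - ph, j - pw).
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def convolve2d_same(image, kernel):
    """True convolution: cross-correlation with the kernel rotated by 180 degrees."""
    return correlate2d_same(image, kernel[::-1, ::-1])

rng = np.random.default_rng(0)
image = rng.random((6, 7))
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
box = np.full((3, 3), 1.0 / 9.0)

conv_sobel = convolve2d_same(image, sobel_x)
corr_sobel = correlate2d_same(image, sobel_x)
conv_box = convolve2d_same(image, box)
corr_box = correlate2d_same(image, box)
```

Sobel is antisymmetric under the flip, so conv_sobel equals the negative of corr_sobel; the box kernel equals its own rotation, so conv_box and corr_box coincide.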
1.2 Separable Kernels
K is separable if K = u · vᵀ, the outer product of column vectors u and v. Convolution then factors:
I * K = (I * u) * vᵀ
Cost drops from O((2m+1)²) per pixel to O(2 · (2m+1)). Gaussian, box, Sobel-along-axis, and many others are separable; bilateral and rotated Gabors generally are not.
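A small sketch of the factorization, assuming zero padding and using the Sobel kernel as the separable example. The helper names are illustrative. It uses correlation rather than flipped convolution; the factorization holds for either.

```python
import numpy as np

def correlate_rows(image, taps):
    """1D correlation along each row (horizontal direction), zero padded."""
    r = len(taps) // 2
    padded = np.pad(image, ((0, 0), (r, r)))
    return sum(t * padded[:, k:k + image.shape[1]] for k, t in enumerate(taps))

def correlate_cols(image, taps):
    """1D correlation along each column (vertical direction), zero padded."""
    return correlate_rows(image.T, taps).T

def correlate2d(image, kernel):
    """Direct 2D correlation, zero padded, for comparison."""
    kh, kw = kernel.shape
    h, w = image.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    return sum(kernel[i, j] * padded[i:i + h, j:j + w]
               for i in range(kh) for j in range(kw))

u = np.array([1.0, 2.0, 1.0])    # vertical smoothing taps
v = np.array([-1.0, 0.0, 1.0])   # horizontal central-difference taps
sobel_x = np.outer(u, v)         # rank-1 kernel u * v^T

rng = np.random.default_rng(0)
image = rng.random((8, 9))
direct = correlate2d(image, sobel_x)                      # 9 multiplies per pixel
two_pass = correlate_cols(correlate_rows(image, v), u)    # 3 + 3 multiplies per pixel
```

Both paths give the same result up to floating-point rounding.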
1.3 Stride, Padding, Dilation
- Stride s — output sampled every s pixels. Reduces resolution by factor s.
- Padding — pad input to control output size. Modes: zero, reflect, replicate, circular (periodic).
- Dilation r — gap of r−1 zeros inserted between kernel taps. Increases receptive field without parameter cost.
- Output size for input W with kernel size K, padding P, stride S, dilation D: ⌊(W + 2P − D(K−1) − 1)/S⌋ + 1.
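The output-size formula above can be written as a one-line helper. The function name and the example layer shapes are illustrative choices.

```python
def conv_output_size(w, k, p=0, s=1, d=1):
    """Output length along one axis: floor((W + 2P - D(K-1) - 1) / S) + 1."""
    return (w + 2 * p - d * (k - 1) - 1) // s + 1

same_3x3 = conv_output_size(32, 3, p=1)             # 3x3, padding 1, stride 1
stem_7x7 = conv_output_size(224, 7, p=3, s=2)       # 7x7, padding 3, stride 2
dilated_3x3 = conv_output_size(32, 3, p=2, d=2)     # 3x3 with dilation 2, padding 2
patch_16 = conv_output_size(224, 16, s=16)          # kernel size = stride = 16
```

With these inputs the helper returns 32, 112, 32, and 14 respectively.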
1.4 Even vs Odd Kernel Sizes
Odd-sized kernels have a well-defined center pixel; even-sized ones cause half-pixel shifts. Almost all practical kernels are odd: 3×3, 5×5, 7×7. Exceptions: ViT 16×16 patch kernels (stride = kernel size avoids the shift), some pooling layers.
2. Smoothing Kernels
2.1 Box Filter (Mean)
Uniform K_ij = 1/(N²) for an N×N kernel. Separable. Cheapest blur; integral images make it O(1) per pixel regardless of kernel size (summed-area tables, Crow 1984).
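A minimal sketch of the summed-area-table trick. It assumes windows are clipped at the image border and averaged over the pixels actually covered; the function name is illustrative. Each output pixel costs four table lookups regardless of the radius.

```python
import numpy as np

def box_filter_sat(image, radius):
    """Mean over a (2r+1) x (2r+1) window via a summed-area table."""
    h, w = image.shape
    # sat[y, x] holds the sum of image[:y, :x]; the extra zero row and column
    # remove special cases at the top and left borders.
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    ys, xs = np.arange(h), np.arange(w)
    y0 = np.clip(ys - radius, 0, h)[:, None]
    y1 = np.clip(ys + radius + 1, 0, h)[:, None]
    x0 = np.clip(xs - radius, 0, w)[None, :]
    x1 = np.clip(xs + radius + 1, 0, w)[None, :]
    window_sum = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
    area = (y1 - y0) * (x1 - x0)   # number of pixels actually inside each window
    return window_sum / area

image = np.arange(30, dtype=float).reshape(5, 6)
blurred = box_filter_sat(image, radius=1)
```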
2.2 Gaussian Filter
K(i, j) = (1/(2πσ²)) exp(−(i² + j²)/(2σ²)). Separable. Truncation rule of thumb: size ≈ 6σ + 1 (that is, ±3σ) captures about 99.7% of the mass along each axis. The unique low-pass filter that is rotation invariant and scale-space generating (Lindeberg’s scale-space axioms).
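A sketch of building the truncated kernel. It samples the Gaussian at integer offsets and renormalizes so the taps sum to 1, which is a common practical choice rather than something the formula itself prescribes; the function name is illustrative.

```python
import numpy as np

def gaussian_kernel_2d(sigma):
    """Sampled, truncated, renormalized Gaussian of size 2*ceil(3*sigma) + 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g1d = np.exp(-x**2 / (2 * sigma**2))
    g1d /= g1d.sum()             # renormalize so truncation does not change brightness
    return np.outer(g1d, g1d)    # separability: the 2D kernel is an outer product

kernel = gaussian_kernel_2d(1.0)   # sigma = 1 gives a 7x7 kernel (6*sigma + 1)
```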
2.3 Median Filter
Output pixel = median of neighborhood values. Nonlinear, rank-order; excellent for salt-and-pepper noise; preserves edges better than Gaussian. Per-pixel cost for an N×N window: O(N² log N) by sorting the neighborhood; O(N) with Huang's sliding histogram (1979); O(1) with the constant-time method of Perreault–Hébert (2007).
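A naive median filter in NumPy, assuming edge replication at the border (the function name is illustrative). It stacks the shifted copies of the image and takes the median across them, which is simple but not one of the fast histogram methods.

```python
import numpy as np

def median_filter(image, radius=1):
    """Naive median over a (2r+1) x (2r+1) window with edge replication."""
    k = 2 * radius + 1
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    # Each pixel's neighborhood ends up along axis 0 of the stack.
    stack = np.stack([padded[i:i + h, j:j + w] for i in range(k) for j in range(k)])
    return np.median(stack, axis=0)

noisy = np.full((7, 7), 100.0)
noisy[3, 3] = 255.0          # a single "salt" impulse
cleaned = median_filter(noisy)
```

The isolated impulse is a minority in every 3×3 window it touches, so it is removed entirely.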
2.4 Bilateral Filter
Tomasi–Manduchi (1998). Edge-preserving smoothing:
BF(p) = (1/W) Σ_q I(q) · G_σs(‖p − q‖) · G_σr(|I(p) − I(q)|)
Spatial Gaussian times range Gaussian. Brute-force O(N² · K²); fast bilateral grid (Paris–Durand 2006) is O(N²) using piecewise-linear approximation in a 3D space.
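A brute-force sketch of the bilateral formula, assuming a square window of the given radius and edge replication (both are choices made for this example). It loops over window offsets rather than pixels, so each iteration is a vectorized whole-image operation.

```python
import numpy as np

def bilateral_filter(image, radius, sigma_s, sigma_r):
    """Brute-force bilateral filter on a 2D float image."""
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    numerator = np.zeros((h, w))
    normalizer = np.zeros((h, w))   # the 1/W term in the formula
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            neighbor = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            spatial_weight = np.exp(-(dy**2 + dx**2) / (2 * sigma_s**2))
            range_weight = np.exp(-(neighbor - image)**2 / (2 * sigma_r**2))
            weight = spatial_weight * range_weight
            numerator += weight * neighbor
            normalizer += weight
    return numerator / normalizer

step = np.zeros((8, 8))
step[:, 4:] = 1.0            # an ideal vertical edge
filtered = bilateral_filter(step, radius=2, sigma_s=2.0, sigma_r=0.1)
```

Because the intensity jump is much larger than sigma_r, pixels across the edge get negligible weight and the step survives filtering.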
2.5 Guided Filter
He–Sun–Tang (ECCV 2010, TPAMI 2013). Linear-time edge-preserving filter using a guidance image. Used in dehazing, matting, joint upsampling, depth refinement. O(1) per pixel.
2.6 Non-Local Means (NLM)
Buades–Coll–Morel (CVPR 2005). Each pixel averaged with all other pixels in the image, weighted by patch similarity. O(N² · S² · P²) naive; integral-image acceleration; basis for many denoisers.
2.7 BM3D
Block-Matching and 3D Filtering (Dabov–Foi–Katkovnik–Egiazarian 2007). Combines block matching, collaborative 3D transform-domain filtering (DCT, wavelet), and Wiener filtering. State-of-the-art classical denoising for over a decade until deep learning surpassed it on PSNR.
3. Edge Detection Kernels
3.1 Sobel (1968)
Horizontal gradient:
[-1 0 1]
[-2 0 2]
[-1 0 1]
Vertical is the transpose. Separable: [1; 2; 1] * [-1, 0, 1]. Combines smoothing with central-difference; the workhorse first-derivative kernel.
3.2 Prewitt
Same structure as Sobel but uniform weights:
[-1 0 1]
[-1 0 1]
[-1 0 1]
Slightly less smoothing.
3.3 Roberts Cross (1965)
2×2 kernels:
[1 0] [ 0 1]
[0 -1] [-1 0]
Diagonal gradients; one of the earliest edge detectors. High noise sensitivity due to small size.
3.4 Scharr
[-3 0 3]
[-10 0 10]
[-3 0 3]
Scharr (2000 PhD thesis) optimized weights for Frobenius-norm rotation invariance. Better directional accuracy than Sobel, especially for sub-pixel orientation estimation.
3.5 Laplacian (Second Derivative)
Isotropic second derivative. Two common discretizations:
4-connected:
[ 0 1 0]
[ 1 -4 1]
[ 0 1 0]
8-connected (better rotation invariance):
[1 1 1]
[1 -8 1]
[1 1 1]
Zero-crossings of the Laplacian mark edge locations.
3.6 Laplacian-of-Gaussian (LoG)
Marr–Hildreth (1980). Convolve with a Gaussian first, then apply Laplacian. The LoG kernel is the analytical Laplacian of the Gaussian; zero-crossings give edges at scale σ. Theoretical basis for “primal sketch” theory of vision.
3.7 Difference-of-Gaussians (DoG)
DoG_σ = G_σ₁ − G_σ₂, approximates LoG when σ₂/σ₁ ≈ 1.6. Used in SIFT keypoint detection (Lowe 1999, 2004) where extrema in DoG scale-space yield scale-invariant feature points.
3.8 Canny Edge Detector (1986)
John Canny’s multi-stage operator, not a single kernel:
- Gaussian smoothing.
- Sobel gradients → magnitude and orientation.
- Non-maximum suppression along gradient direction.
- Double thresholding (high and low).
- Edge tracking by hysteresis.
Still the default edge detector in OpenCV (cv2.Canny) decades later.
4. Derivative and Difference Kernels
4.1 Finite Differences
- Forward — [−1, 1].
- Backward — [−1, 1] shifted, or [1, −1].
- Central — [−1, 0, 1] / 2. Second-order accurate.
- Five-point — [1, −8, 0, 8, −1] / 12. Fourth-order accurate.
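The stated orders of accuracy can be checked numerically. This sketch differentiates sin at an arbitrarily chosen point and measures how the error shrinks when the step is halved; the function names and the test point are illustrative.

```python
import numpy as np

def central_diff(f, x, h):
    """[-1, 0, 1] / 2 stencil."""
    return (f(x + h) - f(x - h)) / (2 * h)

def five_point_diff(f, x, h):
    """[1, -8, 0, 8, -1] / 12 stencil."""
    return (f(x - 2 * h) - 8 * f(x - h) + 8 * f(x + h) - f(x + 2 * h)) / (12 * h)

x0 = 0.7
exact = np.cos(x0)   # derivative of sin

def error_ratio(diff):
    """Error at step h divided by error at step h/2."""
    coarse = abs(diff(np.sin, x0, 0.1) - exact)
    fine = abs(diff(np.sin, x0, 0.05) - exact)
    return coarse / fine

central_ratio = error_ratio(central_diff)        # close to 2**2 for second order
five_point_ratio = error_ratio(five_point_diff)  # close to 2**4 for fourth order
```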
4.2 Sobel and Scharr as Derivative Approximations
Sobel and Scharr both implement smoothed central differences. Scharr is provably optimal under a rotation-invariance criterion (Scharr 2000).
5. Sharpening Kernels
5.1 Unsharp Mask
Output = I + α(I − G_σ * I) where G_σ * I is a blurred version. Equivalent to convolving with a high-pass kernel. Origins in photography (subtracting a blurred negative).
5.2 Laplacian Sharpening
Output = I − α · ∇²I. Boosts high frequencies. Classic 3×3 sharpening kernel:
[ 0 -1 0]
[-1 5 -1]
[ 0 -1 0]
Equivalent to identity minus a Laplacian.
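The kernel arithmetic can be verified directly. This sketch assumes the 4-connected Laplacian discretization; sharpening_kernel is an illustrative name.

```python
import numpy as np

identity = np.zeros((3, 3))
identity[1, 1] = 1.0
laplacian_4 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def sharpening_kernel(alpha):
    """Kernel for I - alpha * Laplacian(I)."""
    return identity - alpha * laplacian_4

classic = sharpening_kernel(1.0)
```

With alpha = 1 this reproduces the 3×3 kernel shown above. For any alpha the taps sum to 1, so mean brightness is preserved.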
5.3 High-Pass Kernels
Generally: identity minus a low-pass. Box high-pass:
[-1 -1 -1]
[-1 8 -1]
[-1 -1 -1]
Used in feature emphasis, defect detection.
6. Embossing
Directional gradient with brightness offset to produce a relief illusion:
[-2 -1 0]
[-1 1 1]
[ 0 1 2]
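Zero-sum variants (center tap 0) produce signed output, which is shifted by 128 (in 8-bit) to center around mid-gray; the kernel above sums to 1, so it keeps the image's mean brightness without an offset. Rotating the kernel gives variations for different “light directions”.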
7. Texture and Orientation Filter Banks
7.1 Gabor Filter
Daugman (1985). Product of a Gaussian envelope and a complex sinusoid:
g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²)) · exp(i(2π x′/λ + ψ))
where x′ = x cos θ + y sin θ, y′ = −x sin θ + y cos θ.
Tuned to a particular orientation θ, spatial frequency 1/λ, and bandwidth σ. Models simple-cell receptive fields in V1 (Daugman, Marčelja). Used in iris recognition (Daugman 1993), texture classification, face recognition.
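A sketch of sampling the Gabor formula on an integer grid. The default parameter values are arbitrary choices for illustration, and the kernel is left unnormalized.

```python
import numpy as np

def gabor_kernel(size, lam, theta, psi=0.0, sigma=2.0, gamma=0.5):
    """Complex Gabor kernel sampled on a size x size grid (size odd)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + gamma**2 * y_rot**2) / (2 * sigma**2))
    carrier = np.exp(1j * (2 * np.pi * x_rot / lam + psi))
    return envelope * carrier

kernel = gabor_kernel(size=15, lam=6.0, theta=np.pi / 4)
even_part = kernel.real   # cosine-phase filter
odd_part = kernel.imag    # sine-phase filter
```

With psi = 0 the real part is symmetric and the imaginary part antisymmetric under a 180° rotation, which is the usual even/odd quadrature pair.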
7.2 Leung–Malik (LM) Filter Bank
Leung–Malik (IJCV 2001). 48 filters: first and second derivatives of Gaussians at 6 orientations × 3 scales, 8 LoGs, 4 Gaussians. Benchmark texture descriptor.
7.3 Schmid Filter Bank
Schmid (2001). 13 isotropic Gabor-like filters of the form (cos(2π τ r) · G_σ(r) − offset), where r is radius. Rotation-invariant by construction.
7.4 MR8 Filter Bank
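Varma–Zisserman (IJCV 2005). 38 filters (edge and bar filters at 3 scales × 6 orientations, plus one Gaussian and one LoG) reduced to 8 responses: the maximum over orientation for each of the 6 edge/bar scale combinations, plus the 2 isotropic responses. Compact and powerful for material classification.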
7.5 Steerable Filters
Freeman–Adelson (TPAMI 1991). Given a set of basis filters, the response at any orientation can be computed as a linear combination of basis responses. For a Gaussian second-derivative basis, three filters suffice to steer to any angle. Foundation for efficient orientation analysis.
8. Morphological Operators
Nonlinear, set-theoretic operators on binary or grayscale images. Defined via a structuring element (SE) B.
8.1 Basic Operations
- Erosion — (I ⊖ B)(x) = min{I(x + b) : b ∈ B}. Shrinks bright regions.
- Dilation — (I ⊕ B)(x) = max{I(x − b) : b ∈ B}. Grows bright regions.
- Opening — (I ⊖ B) ⊕ B. Removes small bright structures.
- Closing — (I ⊕ B) ⊖ B. Fills small dark gaps.
- Morphological gradient — (I ⊕ B) − (I ⊖ B). Edge-like.
- Top-hat — I − opening(I). Highlights small bright features.
- Bottom-hat — closing(I) − I. Highlights small dark features.
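The operations above can be sketched with flat structuring elements given as boolean masks. The helper names are illustrative; padding with ±infinity is an assumption that makes out-of-image samples never win the min or max.

```python
import numpy as np

def _shifted_views(image, se, fill):
    """One shifted copy of the image per active structuring-element tap."""
    ry, rx = se.shape[0] // 2, se.shape[1] // 2
    h, w = image.shape
    padded = np.pad(image.astype(float), ((ry, ry), (rx, rx)), constant_values=fill)
    return [padded[i:i + h, j:j + w] for i, j in zip(*np.nonzero(se))]

def erode(image, se):
    # min over I(x + b) for b in B
    return np.min(_shifted_views(image, se, np.inf), axis=0)

def dilate(image, se):
    # max over I(x - b) for b in B: same shifts with the element reflected
    return np.max(_shifted_views(image, se[::-1, ::-1], -np.inf), axis=0)

def opening(image, se):
    return dilate(erode(image, se), se)

def closing(image, se):
    return erode(dilate(image, se), se)

square = np.ones((3, 3), dtype=bool)
image = np.zeros((9, 9))
image[1, 1] = 1.0          # isolated bright pixel, smaller than the element
image[4:7, 4:7] = 1.0      # 3x3 bright block, exactly fits the element

opened = opening(image, square)
top_hat = image - opened
```

Opening removes the isolated pixel and keeps the block, so the top-hat isolates exactly the small bright feature.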
8.2 Common Structuring Elements
- Square 3×3 — 8-connectivity.
- Cross / plus — 4-connectivity.
- Disk of radius r — isotropic morphology.
- Line at angle θ — directional morphology.
8.3 Foundations
Matheron (1967) and Serra (1982) developed mathematical morphology at the École des Mines de Paris. Connected-component labeling, watershed segmentation (Vincent–Soille 1991), and granulometry all build on these operators.
9. Distance Transforms
Map each pixel to its distance to the nearest “foreground” pixel.
- Chamfer distance — integer approximations of Euclidean, propagated with two passes (Borgefors 1986).
- Exact Euclidean Distance Transform — Saito–Toriwaki (1994), Maurer et al (2003). Linear-time.
- Felzenszwalb–Huttenlocher (2004) — generalized distance transform for sampled functions, O(n).
Used in skeletonization, shape matching, watershed seeding, segmentation post-processing.
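A sketch of the two-pass propagation idea in its simplest form. It uses city-block (4-neighbor, unit-cost) weights, for which two passes give the exact L1 distance; Borgefors-style chamfer masks add diagonal weights to approximate Euclidean distance more closely. The function name is illustrative.

```python
import numpy as np

def cityblock_distance_transform(foreground):
    """Distance from every pixel to the nearest True pixel, L1 metric."""
    h, w = foreground.shape
    far = h + w                      # larger than any city-block distance in the image
    dist = np.where(foreground, 0, far)
    for y in range(h):               # forward pass: top-left to bottom-right
        for x in range(w):
            if y > 0:
                dist[y, x] = min(dist[y, x], dist[y - 1, x] + 1)
            if x > 0:
                dist[y, x] = min(dist[y, x], dist[y, x - 1] + 1)
    for y in range(h - 1, -1, -1):   # backward pass: bottom-right to top-left
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dist[y, x] = min(dist[y, x], dist[y + 1, x] + 1)
            if x < w - 1:
                dist[y, x] = min(dist[y, x], dist[y, x + 1] + 1)
    return dist

mask = np.zeros((5, 7), dtype=bool)
mask[2, 3] = True
distances = cityblock_distance_transform(mask)
```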
10. Multi-Resolution / Pyramid Kernels
10.1 Gaussian Pyramid
Burt–Adelson (1983). Repeatedly blur (with a small 5×5 Gaussian-like kernel a = [1, 4, 6, 4, 1] / 16) and downsample by 2. Each level halves the resolution.
10.2 Laplacian Pyramid
L_k = G_k − upsample(G_{k+1}). Captures band-pass information at each scale; original signal reconstructible by summing levels. Used in image compositing (Burt–Adelson “A multiresolution spline” 1983), super-resolution, generative models.
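A compact sketch of building and inverting a Laplacian pyramid with the [1, 4, 6, 4, 1]/16 kernel. Nearest-neighbor upsampling and reflect padding are simplifying assumptions here; Burt and Adelson interpolate with the same 5-tap kernel. Reconstruction is exact for any upsampler as long as the same one is used in both directions.

```python
import numpy as np

TAPS = np.array([1, 4, 6, 4, 1]) / 16.0

def blur(image):
    """Separable 5-tap blur with reflect padding."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="reflect")
    rows = sum(t * padded[:, k:k + w] for k, t in enumerate(TAPS))
    return sum(t * rows[k:k + h, :] for k, t in enumerate(TAPS))

def downsample(image):
    return blur(image)[::2, ::2]

def upsample(image, shape):
    # Nearest-neighbor upsampling, cropped to the target shape (an assumption).
    up = np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def build_laplacian_pyramid(image, levels):
    pyramid, current = [], image
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller, current.shape))  # band-pass level
        current = smaller
    pyramid.append(current)   # coarsest Gaussian level
    return pyramid

def reconstruct(pyramid):
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = band + upsample(current, band.shape)
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))
pyramid = build_laplacian_pyramid(image, levels=2)
restored = reconstruct(pyramid)
```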
10.3 Wavelet Transforms
Decompose into low- and high-pass subbands at each level via filter banks.
- Haar wavelet (Haar 1909) — simplest, [1, 1]/√2 and [1, −1]/√2.
- Daubechies wavelets — compactly supported, orthogonal (Daubechies 1988).
- Biorthogonal CDF 9/7 — JPEG 2000 lossy compression standard (Cohen–Daubechies–Feauveau).
- CDF 5/3 — JPEG 2000 lossless.
- Symlets, Coiflets — variations with different symmetry/vanishing-moment properties.
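One level of the Haar filter bank on a 1D signal, as a minimal sketch. An even-length input is assumed and the function names are illustrative. The low-pass branch uses [1, 1]/√2 and the high-pass branch [1, −1]/√2, each followed by downsampling by 2.

```python
import numpy as np

def haar_forward(signal):
    """Split an even-length signal into approximation and detail halves."""
    s = np.asarray(signal, dtype=float)
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward exactly."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_forward(signal)
restored = haar_inverse(approx, detail)
```

The transform is orthogonal, so reconstruction is exact and signal energy is preserved across the two subbands.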
10.4 Steerable Pyramid
Simoncelli–Freeman (1995). Multi-orientation, multi-scale decomposition; combines pyramid with steerable filters. Used in texture synthesis (Heeger–Bergen 1995), motion estimation, image quality metrics.
11. Frequency-Domain Convolution
11.1 FFT-Based Convolution
For large kernels, direct convolution costs O(N² K²); via FFT it is O(N² log N). Convolution theorem: I * K = F⁻¹(F(I) · F(K)). Crossover where FFT wins is typically K ≳ 15–25 depending on hardware.
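A sketch of the convolution theorem in NumPy. Both inputs are zero-padded to the full output size so that the circular convolution implied by the DFT matches linear convolution; the function names are illustrative.

```python
import numpy as np

def fft_convolve_full(image, kernel):
    """Linear ("full") 2D convolution via the FFT."""
    h = image.shape[0] + kernel.shape[0] - 1
    w = image.shape[1] + kernel.shape[1] - 1
    spectrum = np.fft.rfft2(image, s=(h, w)) * np.fft.rfft2(kernel, s=(h, w))
    return np.fft.irfft2(spectrum, s=(h, w))

def direct_convolve_full(image, kernel):
    """Reference implementation: scatter a scaled copy of the image per tap."""
    h = image.shape[0] + kernel.shape[0] - 1
    w = image.shape[1] + kernel.shape[1] - 1
    out = np.zeros((h, w))
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out[i:i + image.shape[0], j:j + image.shape[1]] += kernel[i, j] * image
    return out

rng = np.random.default_rng(0)
image = rng.random((12, 10))
kernel = rng.random((5, 3))
via_fft = fft_convolve_full(image, kernel)
via_sum = direct_convolve_full(image, kernel)
```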
11.2 Discrete Cosine Transform (DCT)
8×8 block DCT underpins JPEG compression. The DCT-II is the most common variant; energy compaction concentrates most of an image patch’s energy in a few low-frequency coefficients.
11.3 Fourier Basis
Each cosine/sine kernel detects one spatial frequency; the full 2D DFT decomposes an image into N² basis kernels. Used in frequency-domain filtering (low/high/band-pass), phase-only correlation, image registration (Reddy–Chatterji 1996).
12. Color and Channel Conversions
12.1 RGB → Grayscale
Per-channel weights vary by standard:
- Rec. 601 (SDTV) — Y = 0.299 R + 0.587 G + 0.114 B.
- Rec. 709 (HDTV / sRGB) — Y = 0.2126 R + 0.7152 G + 0.0722 B.
- Rec. 2020 (UHDTV) — Y = 0.2627 R + 0.6780 G + 0.0593 B.
These are 1×1×3 convolutions over the color channel axis.
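As a sketch, the per-pixel weighted sum is a single matrix-vector product over the channel axis. An (H, W, 3) array in R, G, B order is assumed; the dictionary keys and function name are illustrative.

```python
import numpy as np

LUMA_WEIGHTS = {
    "rec601": np.array([0.299, 0.587, 0.114]),
    "rec709": np.array([0.2126, 0.7152, 0.0722]),
    "rec2020": np.array([0.2627, 0.6780, 0.0593]),
}

def to_gray(rgb, standard="rec709"):
    """rgb has shape (H, W, 3); the result has shape (H, W)."""
    return rgb @ LUMA_WEIGHTS[standard]

white = np.ones((2, 2, 3))
green = np.zeros((2, 2, 3))
green[..., 1] = 1.0
gray_white = to_gray(white)
gray_green_601 = to_gray(green, "rec601")
```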
12.2 Color Space Transforms
- HSV / HSL — hue, saturation, value/lightness. Useful for color thresholding.
- YCbCr / YUV — luma-chroma; basis for JPEG/MPEG compression.
- CIE Lab — perceptually uniform-ish; standard for color difference (ΔE).
- CIE XYZ — device-independent linear space; bridge between RGB and Lab.
Transforms are matrix products with optional nonlinearity (gamma).
13. CNN Architecture Kernels
13.1 VGG-Style 3×3 Stacks
Simonyan–Zisserman (ICLR 2015). Stack many 3×3 stride-1 convolutions to build large effective receptive fields cheaply. Two stacked 3×3 layers have the receptive field of a 5×5 with 18 vs 25 weights per input-output channel pair; three stacked 3×3 layers match a 7×7 with 27 vs 49.
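The receptive-field and weight counts follow from simple arithmetic, sketched here for stride-1 layers. Counts are per input-output channel pair and ignore biases; the function names are illustrative.

```python
def stacked_receptive_field(n_layers, k=3):
    """Receptive field of n stacked k x k stride-1 convolutions."""
    return 1 + n_layers * (k - 1)

def stacked_weights(n_layers, k=3):
    """Weights in the stack, per input-output channel pair."""
    return n_layers * k * k

def single_kernel_weights(size):
    """Weights in one size x size kernel, per input-output channel pair."""
    return size * size

rf_two = stacked_receptive_field(2)
rf_three = stacked_receptive_field(3)
```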
13.2 ResNet Bottleneck
He–Zhang–Ren–Sun (CVPR 2016). 1×1 channel reduction → 3×3 spatial → 1×1 channel expansion. Reduces compute while maintaining representational capacity. Used in all ResNet-50/101/152 variants and later.
13.3 Depthwise Separable Convolution
Howard et al MobileNet (2017), preceded by Sifre (PhD 2014) and Chollet’s Xception (CVPR 2017). Factorize K×K×Cin×Cout convolution into:
- Depthwise — K×K applied per channel.
- Pointwise — 1×1 over channels.
Cost reduces from K² · Cin · Cout to K² · Cin + Cin · Cout. Foundation of MobileNet, EfficientNet, ShuffleNet, and mobile vision generally.
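A quick arithmetic sketch of the savings, counting multiply-accumulates per output position. The layer sizes are arbitrary example values.

```python
def standard_conv_macs(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_macs(k, c_in, c_out):
    depthwise = k * k * c_in     # one k x k filter per input channel
    pointwise = c_in * c_out     # 1x1 mixing across channels
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 128
standard = standard_conv_macs(k, c_in, c_out)
separable = depthwise_separable_macs(k, c_in, c_out)
ratio = separable / standard     # equals 1/c_out + 1/k**2
```

For 3×3 kernels the ratio is dominated by the 1/K² term, so the factorized layer costs roughly a ninth of the standard one.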
13.4 Grouped Convolution
Originally a memory hack in AlexNet (Krizhevsky 2012) to split across two GPUs. Generalized by ResNeXt (Xie et al CVPR 2017): split channels into G groups, convolve each independently. Depthwise is the G = C extreme; standard conv is G = 1.
13.5 Dilated / Atrous Convolution
Yu–Koltun (ICLR 2016), DeepLab (Chen et al 2018). Insert zeros between taps to enlarge receptive field without losing resolution. Used in semantic segmentation (no pooling needed), audio (WaveNet — van den Oord et al 2016).
13.6 Deformable Convolution
Dai et al (ICCV 2017), v2 (Zhu et al CVPR 2019). Learn offsets for each tap location, sampling input with bilinear interpolation. Adapts receptive field to object geometry; improves detection and segmentation.
13.7 1×1 Convolution
Originally “Network-in-Network” (Lin–Chen–Yan 2014). A 1×1 conv is a per-pixel linear layer across channels; stacking several with nonlinearities gives a per-pixel MLP. Used for: dimensionality reduction (Inception, ResNet), feature mixing, projection heads.
14. Pooling and Downsampling
14.1 Max Pooling
2×2 stride 2 → 4× reduction. Selects the strongest activation in each window. Translation-tolerant; loses precise localization.
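A minimal NumPy sketch of 2×2 stride-2 max pooling on a single-channel map, using a reshape so each window becomes its own pair of axes. Odd trailing rows or columns are dropped, which is an assumption of this example.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]              # drop an odd trailing row/column
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
```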
14.2 Average Pooling
Mean over the window. Smoother than max. Global Average Pooling (Lin–Chen–Yan 2014) — average over the entire spatial extent; used in place of FC layers in many modern architectures (ResNet, EfficientNet).
14.3 Strided Convolution as Pooling
Many modern architectures (post-AllConvNet, Springenberg et al 2014) replace pooling with stride-2 convolutions, treating downsampling as a learnable operation.
14.4 Max-Blur-Pool / Anti-Aliasing
Zhang’s “Making CNNs Shift Invariant” (ICML 2019). Insert a low-pass blur (typically a triangular [1, 2, 1]/4 separable kernel) before strided subsampling to respect the sampling theorem. Improves shift consistency at modest extra compute; the paper also reports accuracy gains on ImageNet.
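A 1D toy example of why the blur matters. Circular padding is assumed so the numbers come out exact, and the function names are illustrative. A signal at the highest representable frequency aliases completely under plain stride-2 subsampling, so a one-sample shift flips the output; blurring first removes the dependence on the shift.

```python
import numpy as np

def subsample(x, stride=2):
    return x[::stride]

def blur_then_subsample(x, stride=2):
    # [1, 2, 1] / 4 low-pass with circular padding, then strided sampling.
    blurred = (np.roll(x, 1) + 2 * x + np.roll(x, -1)) / 4.0
    return blurred[::stride]

signal = np.tile([0.0, 1.0], 8)     # alternating 0, 1, 0, 1, ...
shifted = np.roll(signal, 1)        # the same signal shifted by one sample

plain_a, plain_b = subsample(signal), subsample(shifted)
smooth_a, smooth_b = blur_then_subsample(signal), blur_then_subsample(shifted)
```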
15. Transposed Convolution and Upsampling
15.1 Transposed (Fractionally-Strided) Convolution
“Deconvolution” in early literature, though the term is mathematically incorrect. Long–Shelhamer–Darrell FCN (CVPR 2015) used it for dense semantic segmentation. Implementation: insert zeros in the input, then convolve.
15.2 Checkerboard Artifacts
Odena–Dumoulin–Olah (Distill 2016) showed transposed convolutions produce checkerboard artifacts when kernel size isn’t divisible by stride. Mitigation: resize-then-conv, or PixelShuffle.
15.3 Pixel Shuffle (Sub-Pixel Conv)
Shi et al ESPCN (CVPR 2016). Output r² · C channels, then rearrange (reshape) them into an r-fold spatial upsampling with C channels. Used in super-resolution and many GAN/diffusion decoders.
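The rearrangement can be sketched as a reshape and transpose in NumPy. A channels-first layout (C · r², H, W) is assumed, with channel index c · r² + i · r + j mapping to sub-pixel offset (i, j); that ordering is a convention chosen for this example.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r**2, H, W) into (C, H * r, W * r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)      # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)   # merge (h, i) and (w, j)

low_res = np.arange(4).reshape(4, 1, 1)    # four channels, one pixel each
high_res = pixel_shuffle(low_res, r=2)     # one channel, 2x2 pixels
```

Each group of r² channels at one low-resolution location becomes an r×r block of output pixels.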
16. Initialization for Convolutional Kernels
- Xavier / Glorot Initialization — Glorot–Bengio (AISTATS 2010). Var = 2/(fan_in + fan_out); designed for tanh.
- He / Kaiming Initialization — He–Zhang–Ren–Sun (ICCV 2015). Var = 2/fan_in; designed for ReLU’s halving of variance.
- Orthogonal Initialization — Saxe–McClelland–Ganguli (ICLR 2014). For deep linear and recurrent networks.
- LSUV (Mishkin–Matas 2016) — Layer-Sequential Unit Variance, normalize after each layer.
For convolutional layers, fan_in = K² · C_in, fan_out = K² · C_out.
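A sketch of drawing conv weights with the Xavier and He variances above. Normal (rather than uniform) sampling and the (C_out, C_in, K, K) weight layout are assumptions of this example; the function names are illustrative.

```python
import numpy as np

def conv_fans(k, c_in, c_out):
    """fan_in and fan_out for a k x k convolution."""
    return k * k * c_in, k * k * c_out

def he_normal(k, c_in, c_out, rng):
    fan_in, _ = conv_fans(k, c_in, c_out)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(c_out, c_in, k, k))

def xavier_normal(k, c_in, c_out, rng):
    fan_in, fan_out = conv_fans(k, c_in, c_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(c_out, c_in, k, k))

rng = np.random.default_rng(0)
w_he = he_normal(3, 64, 128, rng)
w_xavier = xavier_normal(3, 64, 128, rng)
```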
17. Vision Transformer Patch Embedding
Dosovitskiy et al ViT (ICLR 2021). The “patch embedding” step is implemented as a conv with kernel size = stride = patch size (typically 16). Each 16×16 RGB patch becomes one token via a 16×16×3 → D linear projection. This is the simplest convolutional layer in modern architectures, but its design choice (no overlap, large kernel) is fundamentally different from CNNs.
Subsequent ViT variants tweak this: Swin (Liu et al 2021) uses 4×4 patches with shifted-window attention; ConvNeXt (Liu et al 2022) returns to depthwise 7×7 convs while keeping the patch-style stem.
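The ViT-style patch embedding above can be sketched without a conv routine: because kernel size equals stride, it is a reshape into non-overlapping patches followed by one matrix multiply. The channels-last image layout, the tiny embedding width, and the random weights are all assumptions of this example.

```python
import numpy as np

def patch_embed(image, weight, bias, patch=16):
    """Map an (H, W, C) image to (num_patches, D) tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, c)
               .transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, c)
               .reshape(gh * gw, patch * patch * c))   # one flattened row per patch
    return patches @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
embed_dim = 8                                  # hypothetical small D for the example
weight = rng.normal(size=(16 * 16 * 3, embed_dim))
bias = np.zeros(embed_dim)
tokens = patch_embed(image, weight, bias)      # 14 x 14 = 196 tokens
```

Token number 1 · 14 + 2 = 16 is the linear projection of the patch at rows 16 to 31 and columns 32 to 47, exactly what a stride-16 conv would compute at that output position.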
18. Application Mapping
Quick reference for “which kernel for which job”:
- Edge detection — Sobel, Scharr, Canny pipeline.
- Denoising (smooth regions) — Gaussian, NLM, BM3D.
- Denoising (edge-preserving) — bilateral, guided filter, BM3D.
- Salt-and-pepper noise — median filter.
- Sharpening — unsharp mask, Laplacian sharpening.
- Texture analysis — Gabor bank, LM, MR8.
- Blob detection — LoG, DoG (scale-space extrema).
- Keypoint detection — DoG scale-space (SIFT), Harris corner (gradient covariance).
- Background subtraction / change — high-pass, top-hat.
- Shape skeletonization — morphology + distance transform.
- Feature extraction for ML — stacked 3×3 convs (CNN), depthwise separable (mobile).
- Image-to-image translation — encoder-decoder with stride-2 conv down, pixel-shuffle up.
- Compression — DCT (JPEG), wavelet (JPEG 2000).
- Multi-scale analysis — Gaussian/Laplacian pyramid, wavelets, steerable pyramid.
- Semantic segmentation — dilated convs (DeepLab), encoder-decoder (U-Net Ronneberger 2015).
19. Implementation Notes
19.1 Boundary Handling
- Zero padding — simple but introduces artificial dark borders.
- Reflect / mirror — natural-looking, default in many image libraries.
- Replicate / clamp — extends edge values; can produce streaks.
- Wrap / circular — periodic boundary, mathematically natural for FFT convolution.
19.2 Numerical Precision
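Floating-point addition is not exactly associative, so the same convolution can give slightly different results depending on summation order (direct, separable, or FFT-based). Integer kernels with explicit normalization (e.g., dividing by sum) are still common in hardware-constrained pipelines.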
19.3 Libraries
- OpenCV — cv2.filter2D, cv2.GaussianBlur, cv2.Sobel, cv2.Canny, cv2.morphologyEx.
- scikit-image — skimage.filters.gaussian, skimage.feature.canny, full morphology in skimage.morphology.
- scipy.ndimage — convolve, gaussian_filter, median_filter, generic_filter.
- PIL / Pillow — ImageFilter.GaussianBlur, etc.
- PyTorch — F.conv2d, F.conv_transpose2d, nn.Conv2d with groups/dilation/padding modes.
- JAX — jax.scipy.signal.convolve2d, lax.conv_general_dilated.
20. Historical Perspective
Image convolution as we know it traces back to the 1960s–70s with Roberts, Sobel, and Prewitt building edge detectors for early computer vision at MIT, Stanford, and elsewhere. The 1980s introduced scale-space theory (Witkin 1983, Lindeberg) and multi-resolution methods (Burt–Adelson). The 1990s saw morphology mature (Serra) and wavelets enter the mainstream (Mallat 1989, Daubechies). From the late 1990s to about 2010 came sophisticated edge-preserving filters (bilateral 1998, NLM 2005, guided 2010) and discriminative texture banks. The 2010s made convolution as a learnable operation the dominant approach: LeNet (LeCun 1998) had established the idea, and AlexNet (2012), VGG, ResNet, and MobileNet scaled it up, culminating in vision transformers that returned to large-stride “patch” kernels. The classical operators in this note still ship in every image-processing library and remain useful preprocessing for ML pipelines, robust baselines for benchmarking, and pedagogical entry points for understanding what learned kernels are doing.
Adjacent
- probability-distribution-zoo — Gaussian as both a kernel and a distribution.
- optimization-algorithm-taxonomy — backpropagation through learned convolutions.
- fourier-analysis-foundations — frequency-domain interpretation of all linear kernels.
- wavelet-analysis — multi-resolution kernel families.
- deep-learning-architectures — CNN, ResNet, EfficientNet, ConvNeXt, ViT consumers of these kernels.
- computer-vision-tasks — detection, segmentation, recognition pipelines built on these primitives.
- image-compression — DCT and wavelet kernels in JPEG / JPEG 2000.
- signal-processing-foundations — 1D analogs, FIR filter design.
- linear-algebra-foundations — convolution as Toeplitz matrix multiplication.
- numerical-analysis-foundations — finite-difference kernels for derivatives.