Task Description

Finding or generating the stimuli that evoke the strongest response of a neuron in a visual system is in essence an optimization problem. But the optimization task at hand has several unique features that are essential to the choice of optimization algorithm, for example:

  • Black-box setting
    • There is no explicit/analytical form of the function, so the method must not rely on an analytical gradient (1st-order information).
  • Stochasticity:
    • The neuronal response (spike count or otherwise) is stochastic, with noise of some sort (the specific noise class / noise level is to be characterized from data). Thus the “energy landscape” is fluctuating. Because of this, derivative estimates are dominated by high-frequency fluctuation: we cannot estimate the derivative by naive finite differences, and any finite-difference scheme must make use of larger-scale structure to average out the noise.
  • Very high-dimensional input space, esp. the image space.
    • Dimension reduction of the input space is crucial. The current method of optimizing GAN codes instead of raw images (from millions of dimensions down to 4096) has greatly reduced the dimension of the search space.
    • This requirement is shared with the CNN black-box attack setting, so many of their input-space dimension reduction techniques should be useful in our setting.
  • Non-linear / non-convex / multi-peaked target function:
    • Most preferably, we would like to find all the preferred stimuli of a neuron (roughly corresponding to the local maxima) and obtain the respective score at each peak.
  • Purpose of the optimization:
    • Our purpose in doing “optimization” is different from that in many other practical applications.
    • The “accuracy” of the solution is not pursued in our project (there is no point spending 80% of the time updating a small fraction of pixel values). Nor do we aim at minor improvements in the response score: knowing the one single most favorable stimulus for a neuron is not very informative, but knowing a set of stimuli that evoke relatively strong responses is educational.
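As a toy illustration of the stochasticity point, the sketch below (with an invented quadratic "response" and an invented noise level) shows why a small finite-difference step is swamped by noise while a larger-scale difference survives it:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_response(x, noise_sd=1.0):
    """Toy stand-in for a neuronal response: a smooth peak plus noise."""
    return -np.sum(x**2) + noise_sd * rng.normal()

def finite_diff_grad(f, x, eps):
    """Central-difference gradient estimate, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.ones(5)
true_grad = -2 * x                      # gradient of the noiseless -sum(x^2)
small = finite_diff_grad(noisy_response, x, eps=1e-4)   # noise ~ noise_sd/eps dominates
large = finite_diff_grad(noisy_response, x, eps=1.0)    # coarser scale averages it out
print(np.linalg.norm(small - true_grad) > np.linalg.norm(large - true_grad))
```

The tiny-step estimate is off by orders of magnitude, while the coarse-step estimate stays close to the true gradient.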

Algorithm in current Experimental Code

It is a classical genetic algorithm.

So the question is whether a genetic algorithm is suitable for our problem. According to the building block hypothesis (BBH), genetic algorithms are useful in cases where each coordinate (gene), or small group of coordinates, determines fitness separately, and those coordinates work as building blocks for the whole system.

I’m not sure whether that is the case for the compressed image code space generated by a Generative Adversarial Network. It may need further work to get the answer.

CMA-ES Covariance Matrix Adaptation Evolution Strategy

Compared to the genetic algorithm, CMA-ES is an evolutionary algorithm with a more rigorous mathematical formalism, permitting a more geometrical understanding.

It models each generation/population of codes (the “gene pool”) as a parametrized statistical distribution $P(X,\phi)$. Each step, it draws a sample $\{X_i\}$ from the distribution, i.e. the generation of samples, and evaluates the function value (“fitness”) at these sample points. It then uses these sampled values to update the parameters $\phi$ of the distribution, and iterates this process until convergence.
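This sample-evaluate-update loop can be sketched in a few lines. The version below is a stripped-down, isotropic toy (the full CMA-ES also adapts $\sigma$ and $C$); the function, parameter values, and names are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_eval_update_loop(f, m0, sigma0, n_iter=200, lam=20):
    """Minimal sample-evaluate-update loop with an isotropic Gaussian.
    This is a toy evolution strategy, not the full CMA-ES."""
    m, sigma, mu = m0.astype(float), sigma0, lam // 2
    for _ in range(n_iter):
        X = m + sigma * rng.normal(size=(lam, len(m)))   # sample a generation
        fitness = np.array([f(x) for x in X])            # evaluate "responses"
        best = X[np.argsort(fitness)[::-1][:mu]]         # select the mu fittest (maximize)
        m = best.mean(axis=0)                            # update the distribution parameter
    return m

# Maximize a smooth toy response peaked at (2, -1).
peak = np.array([2.0, -1.0])
m_final = sample_eval_update_loop(lambda x: -np.sum((x - peak)**2),
                                  m0=np.zeros(2), sigma0=0.5)
print(np.round(m_final, 2))
```

The mean drifts to the peak and then fluctuates around it at a scale set by the (fixed) step size, which is exactly the behavior the $\sigma$ and $C$ adaptations below are designed to improve.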

CMA-ES Demo

Specifically, it uses a multivariate normal distribution $\bf{m}+\sigma^{(g)}\mathcal N(0,C)$ as the model distribution class, sampling \(X_k^{(g+1)} \sim \bf{m}^{(g)}+\sigma^{(g)}\mathcal N(0,C^{(g)})\)

  • The mean of the distribution $\bf{m}$ is the current estimate of the optimum.
  • The normal distribution adds random noise around the center, simulating the mutation process.
  • $\sigma^{(g)}$ controls the exploration step length; $C^{(g)}$ controls the shape of the equi-density ellipsoid in the high-dimensional space, i.e. the direction of exploration.
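Sampling from $\bf{m}+\sigma\,\mathcal N(0,C)$ amounts to transforming standard normal draws by a matrix square root of $C$. A small numpy sketch (the particular $m$, $\sigma$, and $C$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sampling X ~ m + sigma * N(0, C): draw standard normals and
# transform them by a square root of C (here a Cholesky factor).
m = np.array([1.0, -2.0])
sigma = 0.5
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])           # C shapes the equi-density ellipsoid
A = np.linalg.cholesky(C)            # A @ A.T == C
Z = rng.normal(size=(100000, 2))     # z_k ~ N(0, I)
X = m + sigma * Z @ A.T              # x_k ~ m + sigma * N(0, C)

emp_cov = np.cov(X.T) / sigma**2     # empirical covariance, step size removed
print(np.allclose(emp_cov, C, atol=0.05))
```

The empirical covariance of the samples (after dividing out $\sigma^2$) recovers $C$, confirming the sampling transform.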

The method evaluates the target function at the sampled points $\{X_k^{(g+1)}\}$ and uses these values $f(X_k^{(g+1)})$ to update the mean, step size, and covariance matrix: $\bf{m}^{(g+1)}, \sigma^{(g+1)}, C^{(g+1)}$.

Mean update The new mean is a weighted average of the $\mu$ fittest samples among the $\lambda$ in the generation.

\[\bf{m}^{(g+1)}=\sum^\mu_{i=1} w_iX_{i:\lambda}^{(g+1)}\]
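A minimal sketch of this weighted recombination. The log-rank weight scheme below is one common choice, not mandated by the formula; the sample matrix and fitness values are invented:

```python
import numpy as np

# Weighted recombination of the mu best out of lambda samples.
mu = 3
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 2.0],
              [5.0, 5.0], [6.0, 6.0], [7.0, 7.0]])   # rows = samples, lambda = 6
fitness = np.array([9.0, 8.0, 7.0, 3.0, 2.0, 1.0])   # higher = fitter
order = np.argsort(fitness)[::-1]                    # ranking i:lambda

w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # common log-rank weights
w /= w.sum()                                         # normalize: sum(w_i) = 1
m_new = w @ X[order[:mu]]                            # m = sum_i w_i X_{i:lambda}
print(m_new.shape)   # (2,)
```

Higher-ranked samples pull the new mean more strongly, since $w_1 > w_2 > \dots > w_\mu$.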

Covariance Matrix Adaptation As the name of the algorithm indicates, this is the most sophisticated and developed step. The tutorial paper introduces three different strategies for the covariance matrix update, and presents a blend of the three in the demo code. The target of all this sophistication is to improve the sampling of the next generation, i.e. to sample good points while keeping the samples informative.

  1. Naive covariance matrix estimation: estimate the new covariance matrix from the selected best $\mu$ samples.
\[C_{CMNA}=\frac1\mu \sum_{i=1}^\mu (X_{i:\lambda}^{(g+1)}-m^{(g+1)})(X_{i:\lambda}^{(g+1)}-m^{(g+1)})^T\\ m^{(g+1)}=\frac1\mu \sum_{i=1}^\mu X_{i:\lambda}^{(g+1)}\]

The drawback is that it requires a relatively large $\mu$ to get a faithful estimate of the covariance matrix.

  2. Rank-$\mu$ update: similar to strategy 1, but adds a “momentum”/memory term from the previous generation. By taking the samples of past generations into account, a smaller $\mu$ suffices to estimate $C$. The covariance information from the current generation is denoted $C_\mu^{(g+1)}$, and the update rate is governed by $c_\mu$.

Note that they specifically use the $m^{(g)}$-centered and $\sigma^{(g)}$-normalized deviations $Y_{i:\lambda}^{(g+1)} = \frac{X_{i:\lambda}^{(g+1)}-m^{(g)}}{\sigma^{(g)}}$. The purpose is to remove the effect of the overall step size and make the deviations comparable between generations.

\[C^{(g+1)}=(1-c_\mu)C^{(g)}+ c_\mu C_\mu^{(g+1)}\\ C_\mu^{(g+1)}= \frac 1{\mu\,(\sigma^{(g)})^2}\sum_{i=1}^\mu (X_{i:\lambda}^{(g+1)}-m^{(g)})(X_{i:\lambda}^{(g+1)}-m^{(g)})^T = \frac1\mu\sum_{i=1}^\mu Y_{i:\lambda}^{(g+1)} {Y_{i:\lambda}^{(g+1)}}^T\]

Note that we can also introduce weights $w_i$ to obtain a weighted estimate of the covariance matrix.
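A sketch of the rank-$\mu$ update with the memory term, using uniform weights $w_i = 1/\mu$ (all numerical values and the placeholder samples are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Rank-mu covariance update: blend the old C with the empirical
# covariance of the mu best normalized deviations Y.
n, mu, c_mu = 4, 5, 0.2
C_old = np.eye(n)
m_old, sigma = np.zeros(n), 0.5

X_best = rng.normal(size=(mu, n))            # stand-in for the selected samples
Y = (X_best - m_old) / sigma                 # center by m_old, normalize by sigma
w = np.full(mu, 1.0 / mu)                    # uniform weights (w_i may vary)
C_mu = sum(w[i] * np.outer(Y[i], Y[i]) for i in range(mu))
C_new = (1 - c_mu) * C_old + c_mu * C_mu     # memory term retains old information
print(np.allclose(C_new, C_new.T))
```

The decay factor $(1-c_\mu)$ on the old matrix is what lets a small $\mu$ work: the estimate is effectively averaged over several generations.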

  3. Rank-1 update: directly add a rank-one component $c_1pp^T$ to $C$, which increases the spread along the direction of the vector $p$, i.e. stretches the ellipsoid toward the $p$ direction.[1] As in the second strategy, a momentum coefficient $c_1$ controls the update rate of the covariance matrix.

So which direction should $p$ point in? In the paper, they accumulate the normalized steps $\frac{m^{(g+1)}-m^{(g)}}{\sigma^{(g)}}$ over the last few generations to form the vector $p_c$, with momentum coefficient $c_c$. The rationale is that if the steps keep pointing in one direction, the search should push in that direction even harder.

\[C^{(g+1)}=(1-c_1)C^{(g)} + c_1 p_c^{(g+1)}{p_c^{(g+1)}}^T\\ p_c^{(g+1)} = (1-c_c) p_c^{(g)} + \sqrt{c_c(2-c_c) \mu_{eff} } \frac{m^{(g+1)}-m^{(g)}}{\sigma^{(g)}}\]
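A toy sketch of the rank-1 mechanism: repeated mean steps in one direction build up $p_c$, and the $c_1 p_c p_c^T$ term stretches $C$ along that direction while the decay shrinks the orthogonal directions (all constants are invented for illustration):

```python
import numpy as np

# Rank-1 update: accumulate consistent mean steps into the evolution
# path p_c, then stretch C along that direction.
n, c_c, c_1, mu_eff = 3, 0.3, 0.1, 2.5
C = np.eye(n)
p_c = np.zeros(n)
step = np.array([1.0, 0.0, 0.0])   # (m_new - m_old) / sigma, always the same direction

for _ in range(10):                # repeated consistent steps build up p_c
    p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * step
    C = (1 - c_1) * C + c_1 * np.outer(p_c, p_c)

# Variance along the step direction grows; orthogonal directions decay.
print(C[0, 0] > C[1, 1])
```

If the steps instead alternated in sign they would cancel inside $p_c$ and the rank-1 term would stay small, which is exactly the intended behavior.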

Mixing the three strategies together, we get the final form of covariance matrix adaptation.

\[C^{(g+1)}=(1-c_\mu-c_1) C^{(g)}+ c_\mu \sum_{i=1}^\mu w_i Y_{i:\lambda}^{(g+1)} {Y_{i:\lambda}^{(g+1)}}^T + c_1 p_c^{(g+1)}{p_c^{(g+1)}}^T\]
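The blended update can be written down directly from this formula. The sketch below uses random placeholder inputs and only checks that the update keeps $C$ symmetric positive definite:

```python
import numpy as np

rng = np.random.default_rng(4)

def combined_cma_update(C, Y, w, p_c, c_mu, c_1):
    """Blended covariance update: decay + rank-mu term + rank-1 term."""
    rank_mu = sum(wi * np.outer(y, y) for wi, y in zip(w, Y))
    return (1 - c_mu - c_1) * C + c_mu * rank_mu + c_1 * np.outer(p_c, p_c)

n, mu = 4, 5
C = np.eye(n)
Y = rng.normal(size=(mu, n))       # placeholder normalized deviations Y_{i:lambda}
w = np.full(mu, 1.0 / mu)          # uniform weights for the sketch
p_c = rng.normal(size=n)           # placeholder evolution path
C_new = combined_cma_update(C, Y, w, p_c, c_mu=0.2, c_1=0.1)

# A convex-like combination of PSD terms stays symmetric positive definite.
print(np.all(np.linalg.eigvalsh(C_new) > 0))
```

Setting $c_1=0$ recovers the pure rank-$\mu$ update, and $c_\mu=0$ the pure rank-1 update.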

Cumulative Step Size Adaptation The update of the overall spread scale $\sigma$ is separated from the update of the directional spread $C$.

This step controls the overall step length. According to the authors, the strategy is: if the evolution steps of the center keep pointing in a similar direction, the step size should be increased to accelerate the walk; if the evolution steps are uncorrelated and cancel each other, the step size should be decreased to facilitate a more local search.

$p_s$ is the cumulated evolution path, and $c_s$ controls its update rate. The most important thing to note is the scaling matrix ${C^{(g)}}^{-1/2}$: if the selection of the next generation were random, then ${C^{(g)}}^{-1/2} \frac{m^{(g+1)}-m^{(g)}}{\sigma^{(g)}}$ would be a standard normal random vector distributed as $\mathcal N(0,I)$. Thus

\[p_s^{(g+1)} = (1-c_s) p_s^{(g)} + \sqrt{c_s(2-c_s) \mu_{eff} } {C^{(g)}}^{-1/2} \frac{m^{(g+1)}-m^{(g)}}{\sigma^{(g)}}\]

More specifically, they update $\sigma^{(g)}$ according to whether the evolution path length $\|p_s^{(g+1)}\|$ is longer or shorter than its expectation under random selection.

\[\ln \sigma^{(g+1)} = \ln \sigma^{(g)} + \frac{c_s}{d_s} \left(\frac{\| p_s \|_2}{E\|\mathcal N(0,I)\|_2}-1\right) \approx \ln \sigma^{(g)} + \frac{c_s}{d_s} \left(\frac{\| p_s \|_2 }{\sqrt{n}}-1\right)\]

In other words,

\[\sigma^{(g+1)}= \sigma^{(g)} \exp\left(\frac{c_s}{d_s} \left(\frac{\| p_s \|_2 }{\sqrt{n}}-1\right)\right)\]
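A sketch of this update rule: a path longer than its random-selection expectation $\sqrt{n}$ grows $\sigma$, a shorter one shrinks it (the constants and the two example paths are invented):

```python
import numpy as np

def update_sigma(sigma, p_s, n, c_s=0.3, d_s=1.0):
    """Log-scale step-size update: compare |p_s| to its expectation
    E|N(0,I)| ~ sqrt(n) under random selection."""
    return sigma * np.exp((c_s / d_s) * (np.linalg.norm(p_s) / np.sqrt(n) - 1))

n = 100
sigma = 0.5
long_path = np.full(n, 2.0)    # |p_s| = 2*sqrt(n) > sqrt(n): steps kept aligning
short_path = np.zeros(n)       # |p_s| = 0 < sqrt(n): steps cancelled out

print(update_sigma(sigma, long_path, n) > sigma)    # sigma grows
print(update_sigma(sigma, short_path, n) < sigma)   # sigma shrinks
```

The multiplicative (log-scale) form keeps $\sigma$ positive by construction.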

To sum up, the parameters controlling this algorithm are:

  • Population size $\lambda$
  • Weights $\{w_i\}$ and selection number $\mu$
    • $\mu_{eff}=(\sum^\mu_{i=1} w_i^2 )^{-1}$
  • $c_1$, $c_\mu$ control the fractions of the rank-1 and rank-$\mu$ updates in the CMA step
  • $c_c$, $c_s$ control the evolution path update rates of $p_c$ (CMA step) and $p_s$ ($\sigma$ adaptation step) respectively
  • $d_s$ controls the update rate of $\sigma^{(g)}$
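As a quick sanity check of these quantities in our setting, one can plug in commonly used defaults (e.g. $\lambda = 4+\lfloor 3\ln n\rfloor$ and log-rank weights, as found in the tutorial literature; treat the exact values as an assumption) and compute $\mu_{eff}$ for the 4096-dimensional code space:

```python
import numpy as np

n = 4096                                   # our GAN code dimension
lam = 4 + int(3 * np.log(n))               # a commonly cited default population size
mu = lam // 2                              # select the better half
w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # log-rank weights
w /= w.sum()                               # normalize to sum to 1
mu_eff = 1.0 / np.sum(w**2)                # variance effective selection mass
print(lam, mu, round(mu_eff, 2))
```

Note how small the default population is relative to $n=4096$: one generation gives far fewer samples than dimensions, which is why the momentum/memory terms above are essential.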

Matlab Code The core code, implemented in MATLAB, is copied from the Wikipedia page.

% Sort by fitness and compute weighted mean into xmean
[arfitness, arindex] = sort(arfitness);  % minimization
xold = xmean;
xmean = arx(:,arindex(1:mu)) * weights;  % recombination, new mean value
% sliced matrix * weight vector = weighted mean of the selected columns

% Cumulation: Update evolution paths
ps = (1-cs)*ps ... 
    + sqrt(cs*(2-cs)*mueff) * invsqrtC * (xmean-xold) / sigma; 
hsig = norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN < 1.4 + 2/(N+1); 
pc = (1-cc)*pc ...
    + hsig * sqrt(cc*(2-cc)*mueff) * (xmean-xold) / sigma;

% Adapt covariance matrix C
artmp = (1/sigma) * (arx(:,arindex(1:mu))-repmat(xold,1,mu));
C = (1-c1-cmu) * C ...                  % regard old matrix  
   + c1 * (pc*pc' ...                 % plus rank one update
        + (1-hsig) * cc*(2-cc) * C) ... % minor correction if hsig==0
   + cmu * artmp * diag(weights) * artmp'; % plus rank mu update

% Adapt step size sigma
sigma = sigma * exp((cs/damps)*(norm(ps)/chiN - 1)); 

Remarks:

  • Some say CMA-ES can be understood as fitting a quadratic function to the landscape, i.e. obtaining second-order (Hessian) information, and choosing the best direction to go!
  • Wikipedia says this method works quite well in spaces of tens to hundreds of dimensions. It is not clear whether it will work in our 4096-dimensional space.
  • By construction, the algorithm is invariant to order-preserving transformations of $f$, to scaling of $X$, and to rotations of the domain space $R^n$.

Amoeba Algorithm

The Amoeba algorithm (a.k.a. downhill simplex algorithm / Nelder–Mead algorithm, wiki) initializes a simplex in the high-dimensional space at random, and moves the simplex step by step according to the function values at its vertices.
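A minimal pure-numpy sketch of the simplex moves (reflection, expansion, and inside contraction only; the full Nelder–Mead also has a shrink step and finer acceptance rules, so this is an illustration, not a faithful implementation):

```python
import numpy as np

def nelder_mead(f, x0, step=1.0, n_iter=500):
    """Toy Nelder-Mead: keep a simplex of n+1 vertices and replace the
    worst vertex by reflection, expansion, or inside contraction."""
    n = len(x0)
    simplex = [np.asarray(x0, float)] + [x0 + step * e for e in np.eye(n)]
    for _ in range(n_iter):
        simplex.sort(key=f)                        # best first, worst last
        best, worst = simplex[0], simplex[-1]
        centroid = np.mean(simplex[:-1], axis=0)   # centroid of all but the worst
        refl = centroid + (centroid - worst)       # reflect worst through centroid
        if f(refl) < f(best):
            expd = centroid + 2 * (centroid - worst)      # try expanding further
            simplex[-1] = expd if f(expd) < f(refl) else refl
        elif f(refl) < f(worst):
            simplex[-1] = refl
        else:
            simplex[-1] = centroid + 0.5 * (worst - centroid)  # contract inward
    return min(simplex, key=f)

# Minimize squared distance to a known peak (i.e. maximize the toy response).
x_min = nelder_mead(lambda x: np.sum((x - np.array([2.0, -1.0]))**2),
                    np.zeros(2))
print(np.round(x_min, 3))
```

On this smooth 2-D toy function the simplex crawls to the minimum; in a noisy, high-dimensional setting the sorting step is exactly where response noise would corrupt the moves.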

Amoeba Algorithm Demo

Remark

  • In some sense, it is a 1st-order method (compared to CMA-ES, which estimates 2nd-order information), as it estimates “gradient” information from a simplex.
  • In an $n$-dimensional space, the initialization needs $n+1$ function evaluations, and then roughly one evaluation is needed per step.
  • The initialization is tricky, and it may be hard to control where the algorithm converges to.

(DETAILS TO BE ADDED.)

Algorithm in the ZOO-Black Box Attack paper

It combines stochastic coordinate descent with dimension reduction of the input space.
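A toy sketch of the coordinate-descent half of the idea: estimate one partial derivative at a time by symmetric finite differences and descend along that coordinate. The learning rate, step size `eps`, and the quadratic stand-in loss are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

def zoo_coordinate_step(f, x, lr=0.1, eps=1e-3):
    """One step of zeroth-order stochastic coordinate descent: pick a
    random coordinate, estimate its partial derivative from two function
    queries, and move against it."""
    i = rng.integers(len(x))
    e = np.zeros_like(x)
    e[i] = eps
    g_i = (f(x + e) - f(x - e)) / (2 * eps)   # zeroth-order partial derivative
    x = x.copy()
    x[i] -= lr * g_i
    return x

f = lambda z: np.sum(z**2)      # stand-in for the attack loss
x = np.full(10, 3.0)
for _ in range(2000):
    x = zoo_coordinate_step(f, x)
print(f(x) < 1e-3)
```

Only two function queries are needed per step regardless of dimension, which is the appeal of the approach; the dimension reduction part of the paper then shrinks the number of coordinates that need visiting at all.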

(DETAILS TO BE ADDED.)

Targets Next Step (Updated 2019.7)

  • Adapt a demo of the CMA-ES algorithm to low-dimensional problems and build intuition for the rather complicated update rules.
  • Try it in a normal CNN setting
  • Try it in a noisy CNN setting (shallower/low-dimensional ones and deeper ones)
    • Get the right noise scheme for the CNN. (Cf. Carlos: Poisson noise, mean ± 2 s.d. covers the full dynamic range of the unit.)
  • Compare with the genetic algorithm
  • Try it in the monkey experimental setting
[1] Note that for a covariance matrix $C$ of a random vector $X$, $v^TCv$ is the variance of the recombined variable $y=v^TX$, i.e. the variance of the whole distribution projected onto the line spanned by $v$.