Comparison of Major Deep Learning Frameworks (Updating)
Objective
Here I want to compare several common deep learning frameworks and make sense of their workflows.
Core Logic
Tensorflow
General Comments: TF is more like a library in which many low-level operations are defined, so programs tend to be long. In contrast, Keras, which can use tensorflow as a backend, has a level of abstraction similar to PyTorch, i.e. a higher-level deep learning package. TFLearn may also count as a higher-level wrapper.
Besides, the design logic is quite different. TF is designed around a static graph: construct the computational graph first, then put in data and actually operate in a `Session`. It normally follows lazy execution instead of eager execution.
So it is more like a compile-and-run language such as C++ or Julia than a scripting language like Python.
- General Programming Model: Graph and Execution
  - Use tensorflow operations to create a computational Graph.
  - Evaluate / run the graph in a `Session`.
    - Input can be fed into the graph, and results can be fetched, via `sess.run(fetches, feeds)`.
    - Example: `sess.run([output, intermediate], feed_dict={input1: [7.], input2: [2.]})`
  - The `Session` interacts with the C++ runtime that evaluates the graph. A minimal sketch follows below.
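Below is a minimal sketch of this build-then-run workflow in the TF 1.x API; the specific ops behind `output` and `intermediate` are my own guesses for illustration.

```python
import tensorflow as tf  # TF 1.x style API

# Build the graph: placeholders for inputs, ops for the computation
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.multiply(input1, input2)       # example op, assumed for illustration
intermediate = tf.add(input1, input2)      # example op, assumed for illustration

# Nothing has been computed yet; evaluation happens inside a Session
with tf.Session() as sess:
    out_val, mid_val = sess.run([output, intermediate],
                                feed_dict={input1: [7.], input2: [2.]})
    print(out_val, mid_val)  # [14.] [9.]
```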
- Basic TF usage
  - Data Structure
    - `Tensor`: the basic datatype in the computational graph.
      - `tensor.eval()` evaluates the tensor by evaluating all the operations along the graph.
      - `tf.convert_to_tensor` is the way to convert other objects into a `Tensor`.
      - `tf.constant` is just a type of tensor, nothing really special.
    - `Variable`: like old PyTorch, it is a wrapper over a tensor. It can keep state over several `run` calls.
      - Tensor vs Variable: variables have to be initialized before use. You can do that by executing `tf.global_variables_initializer()`.
    - `Placeholder`: a way for the user to inject data into the computational graph. You have to feed it when running the graph: `placeholder(dtype, shape=None, name=None)`. The three are contrasted in the snippet below.
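A small sketch contrasting these three data structures (TF 1.x; the toy values are arbitrary):

```python
import tensorflow as tf

c = tf.constant([1.0, 2.0, 3.0])            # just a tensor with a fixed value
v = tf.Variable([0.0, 0.0, 0.0])            # keeps state across run calls
x = tf.placeholder(tf.float32, shape=[3])   # must be fed at run time
y = c + v + x

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())       # variables must be initialized
    print(y.eval(feed_dict={x: [7.0, 8.0, 9.0]}))     # tensor.eval() inside a session
```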
- Gradient Computation
  - `tf.gradients` can explicitly compute gradients like `torch.autograd.grad`; see the official note. Example: `z = tf.subtract(tf.sin(x), tf.pow(y, 3)); grad = tf.gradients(z, [x, y])`
  - A loss tensor can be sent into an `optimizer` so that gradients can be computed towards target variables. Useful functionality:
    - `optimizer.compute_gradients(L, var_list=[v1, v2])` returns the (gradient, variable) pairs for each variable in `var_list`! Good for inspecting the gradients if you want to manipulate them.
    - `optimizer.apply_gradients(grads_and_vars)` applies the gradients to update the variables.
    - `optimizer.minimize(L)` combines the two steps and updates the variables, as demonstrated in the snippet below.
vstr = tf.Variable([1, 2, 3.0], dtype=tf.float32)
vstr2 = tf.Variable([3.0, 2, 1.0], dtype=tf.float32)
L = tf.tensordot(vstr, vstr2, axes=1)                        # dot-product loss
optimizer = tf.train.AdamOptimizer(beta1=0.9, beta2=0.8)
cG = optimizer.compute_gradients(L, var_list=[vstr2, vstr])  # (gradient, variable) pairs
Min = optimizer.apply_gradients(cG)                          # op that applies the update
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(L))                       # loss before the update
    print(sess.run(cG))                      # inspect gradients and current values
    print(sess.run([vstr, vstr2, L, Min]))   # running Min applies one Adam step
    print(sess.run([vstr, vstr2]))           # variables after the update
    print(sess.run(L))                       # loss after the update
- Neural Networks
PyTorch
The logic of PyTorch is quite intuitive. PyTorch is designed around a dynamic computational graph, which is very useful for debugging, trying things out, and for RNN training.
Generally, PyTorch is a great tool for general-purpose differentiable computing, not just deep learning.
Data Structure
- `torch.Tensor` is the basic data type, quite similar to `np.array` in `numpy`. Tensors behave much like `numpy` arrays, so many basic visualization tools still work with tensors!
  - But a `tensor` only interacts with other `tensor`s; don't add an `np.array` to a `tensor`.
  - (For ancient PyTorch versions, Tensors had to be wrapped up as `Variable`, and for a `Variable` you could set the `requires_grad` flag to `True` to enable gradient computation; now tensors support gradient computation directly.) `tmp = torch.tensor([1.0, 2, 3], requires_grad=True)`
- Just as in `numpy`, there are different datatypes in a `tensor`, like `torch.float`. `tsr.type(dtype)` returns a copy of the tensor in the given format.
- A `tensor` can live on different devices, and only tensors on the same device can operate with each other. `tsr.device` shows you the device it lives on. `tsr.cpu()` and `tsr.cuda()` return the same object if it is already on that device, and a copy if it is not! So `tsr = tsr.cuda()` will transfer the data to GPU, but `tsr.cuda()` by itself will not rebind `tsr`. See the sketch below.
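A quick sketch of these dtype and device behaviors (the variable names are mine; the GPU lines only run if CUDA is available):

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])     # float32 tensor by default
t_double = t.type(torch.double)       # copy of the tensor in another dtype
arr = t.numpy()                       # view the CPU tensor as a numpy array

print(t.device)                       # cpu
if torch.cuda.is_available():
    t_gpu = t.cuda()                  # copy on the GPU; `t` itself stays on the CPU
    t = t.cuda()                      # rebind the name so `t` now refers to the GPU copy
    print(t.device)                   # e.g. cuda:0
```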
Gradients computation
- `requires_grad` can be set for a `tensor` (and `Variable`) to enable gradient flow.
  - This flag propagates automatically when other tensors are constructed from a tensor that requires gradient, i.e. the computational graph is generated while performing the operations.
- `detach` gives you a tensor detached from the computational graph! `.detach()` gives you a copy, and `.detach_()` modifies the target tensor in place.
- Tensors can be put into the `torch.autograd` machinery or `torch.optim` optimizers to compute gradients and optimize! E.g. `optimizer = torch.optim.Adam([img_tensor], lr=0.1, weight_decay=1e-6)` or `torch.autograd.grad(loss, img_tsr)`, as sketched below.
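A minimal sketch of both routes, optimizing a raw tensor directly; `img_tensor` and the toy objective are placeholders of my own choosing:

```python
import torch

img_tensor = torch.randn(3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([img_tensor], lr=0.1, weight_decay=1e-6)

loss = (img_tensor ** 2).mean()       # any differentiable scalar objective

# Route 1: explicit gradient, analogous to tf.gradients
grad, = torch.autograd.grad(loss, img_tensor, retain_graph=True)

# Route 2: the usual backward + optimizer step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```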
Neural Networks
- The basic component of PyTorch is the `module`; it normally implements a `forward` and a `backward` function.
  - Note: since `autograd` is available, you don't always need to write a `backward` function explicitly. You can call `loss.backward()` and the gradients just flow back along the graph constructed during the `forward` pass.
- `module`s can be chained or composed together to make `model`s, which we usually call networks (see the sketch below).
  - Flags: `model.cuda()` moves all of the model's parameters onto the GPU (and returns the model); `model.eval()` sets the flag to evaluation mode instead of training mode.
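A tiny sketch of defining and composing modules; `SmallNet` and its layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(     # chain sub-modules into a model
            nn.Linear(10, 32),
            nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, x):              # no explicit backward needed: autograd handles it
        return self.body(x)

model = SmallNet()
loss = model(torch.randn(4, 10)).sum()
loss.backward()                        # gradients flow back through forward's graph
model.eval()                           # switch to evaluation mode
```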
Serialization
- Torch saves models in 2 formats: serialize the whole model object with `torch.save(model)`, or save the weights through `torch.save(model.state_dict())`. See the recommended method.
  - The former method requires that the class of the model is defined at load time, or loading will fail.
  - The latter is more general: as long as you have the class definition, you construct the model and load the weights into it. Both are sketched below.
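Both approaches in a short, self-contained sketch; the tiny model and file names are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# Recommended: save / load only the state_dict
torch.save(model.state_dict(), "model_weights.pt")
model2 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # rebuild architecture
model2.load_state_dict(torch.load("model_weights.pt"))
model2.eval()

# Alternative: pickle the whole model object (class/code must be importable at load time)
torch.save(model, "model_full.pt")
model3 = torch.load("model_full.pt")
```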
Conventions
Axis convention
`B,C,H,W`, same as Caffe, but the channels are in `RGB` order.
Starting Tutorial to Learn Pytorch
How and When to use Module, Sequential, ModuleList and ModuleDict
Caffe
- Data Structure
  - The basic data structure in Caffe is just `numpy.array`, nothing special.
  - The inputs and outputs of layers are `blobs` (see the pycaffe sketch below).
- Neural Networks
  - Net, Layer, Blobs.
  - The neural network structure can be specified in the `caffe.proto` format, which is a form of Google `protobuf`.
- Conventions
  - Axis convention: `B,C,H,W`, and also the channel direction for images is flipped, so the `[0,1,2]` channels are `BGR`.
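A hedged sketch of how these blobs look from the Python side (assumes a Caffe install plus a `deploy.prototxt` and `weights.caffemodel` that are not part of these notes):

```python
import numpy as np
import caffe

net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

# Layer inputs/outputs are blobs; .data exposes a numpy array in B,C,H,W (BGR) order
img = np.random.rand(1, 3, 224, 224).astype(np.float32)
net.blobs["data"].reshape(*img.shape)
net.blobs["data"].data[...] = img
out = net.forward()        # dict mapping output blob names to numpy arrays
```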
Matlab
- Data Structure
  - The basic data structure in the MATLAB deep learning toolbox is `dlarray`. You have to wrap your normal matrix into a `dlarray` to use it as input to models. This is like `torch.Tensor` in PyTorch: the wrapper is needed to compute gradients and to ease GPU acceleration.
    - `extractdata` is like `torch.Tensor.numpy()`: you can get your data back from the wrapper, but this also breaks the gradient trace for your `dlarray`.
    - `dlarray` has a labelled version and an unlabelled version. Labels tell the framework the meaning of each axis. Some operations require an unlabelled `dlarray`, some require a labelled `dlarray`.
  - Note that the dimension order in MATLAB differs from most Python frameworks: MATLAB uses `[H,W,C,B]`, torch uses `[B,C,H,W]`. Besides, MATLAB stores arrays column-major (first dimension varies fastest) while NumPy/PyTorch default to row-major (last dimension varies fastest), so `reshape` can behave very differently in the two (see the permute sketch below).
  - As for the weights of a conv layer, MATLAB stores them as `[FilterSize(1), FilterSize(2), NumChannels, NumFilters]`.
    - In comparison, PyTorch uses `[out_channels (NumFilters), in_channels (NumChannels), kernel_size]`.
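A small sketch of moving data between the two layouts; the array shapes are arbitrary examples:

```python
import numpy as np
import torch

# Image batch: MATLAB-style [H, W, C, B]  ->  PyTorch-style [B, C, H, W]
arr_hwcb = np.random.rand(224, 224, 3, 8).astype(np.float32)
tsr_bchw = torch.from_numpy(arr_hwcb).permute(3, 2, 0, 1)

# Conv weights: MATLAB [FilterSize(1), FilterSize(2), NumChannels, NumFilters]
#            -> PyTorch [NumFilters, NumChannels, FilterSize(1), FilterSize(2)]
w_matlab = np.random.rand(3, 3, 64, 128).astype(np.float32)
w_torch = torch.from_numpy(w_matlab).permute(3, 2, 0, 1)
```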
- Neural Networks
  - Neural networks have `Layers` and a graph. `predict` and `activations` seem to be the functions that calculate the activations, like `forward`; `calculateActivations` is the core function underlying them.
- Gradient Computation
  - In R2019b, MATLAB auto differentiation is done through `dlgradient`, which can compute first-order gradients of multi-input functions using `dlarray`. However, its current usage is not like `torch.autograd` (see the examples below): it has to be used inside a function that gets passed to `dlfeval`.
  - Currently it doesn't support higher-order derivatives in backward mode (like the Hessian), so you still need PyTorch for Hessian computation.
    - However, you can approximate the Hessian by forward finite differences of the gradient; there you only need first-order gradients.
[f,g] = dlfeval(@model,net,dlX,t);

function [f,g] = model(net,dlX,T)
    % Calculate objective using supported functions for dlarray
    y = forward(net,dlX);
    f = fcnvalue(y,T); % crossentropy or similar
    g = dlgradient(f,net.Learnables); % Automatic gradient
end

x0 = dlarray([-1,2]);
[fval,gradval] = dlfeval(@rosenbrock,x0)

function [y,dydx] = rosenbrock(x)
    % calculate the dlgradient inside the function within dlfeval
    y = 100*(x(2) - x(1).^2).^2 + (1 - x(1)).^2;
    dydx = dlgradient(y,x);
end
- Weight Initialization
  - MATLAB provides some init algorithms, like `glorot` (`Xavier` in torch) and `He` (`Kaiming` in torch).
- Learning Control
  - `WeightLearnRateFactor` can control the weight learning rate in each layer, somewhat like `requires_grad` in PyTorch.
Peripheral Functionalities
Tensorflow
- `Tensorboard`: awesome visualization for the training phase!
  - It can visualize the computational graph as well!
- `Lucid`: awesome infrastructure to visualize and interpret deep neural networks.
- The input pipeline of TensorFlow is well handled by `tf.data` (sketched below).
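A minimal `tf.data` input-pipeline sketch in the TF 1.x style used elsewhere in these notes; the toy data is random:

```python
import numpy as np
import tensorflow as tf

features = np.random.rand(100, 4).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(features)
dataset = dataset.shuffle(buffer_size=100).batch(16).repeat()
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    batch = sess.run(next_batch)   # each run fetches the next batch
    print(batch.shape)             # (16, 4)
```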
PyTorch
- `torch.nn.DataParallel` is a great tool: it gives you parallelized training at almost no cost!
- PyTorch supports TensorBoard now! (Through tensorboardX in older versions.) Both are sketched after this list.
- In parallel, `Lucent` came out in May 2020 as a PyTorch version of `Lucid`.
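A short sketch combining the two; the toy model and logged scalar are placeholders, and the GPU lines only take effect when CUDA devices are present:

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter   # native TensorBoard logging

model = nn.Linear(10, 2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)    # splits each input batch across available GPUs
if torch.cuda.is_available():
    model = model.cuda()

writer = SummaryWriter()              # writes event files to ./runs by default
x = torch.randn(32, 10)
if torch.cuda.is_available():
    x = x.cuda()
loss = model(x).pow(2).mean()
writer.add_scalar("train/loss", loss.item(), global_step=0)
writer.close()
```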