Recent developments in Deep Learning

Recently, I watched Hinton’s talk on recent developments in deep learning. The main points are below:

1. Replace the sigmoid function with the rectified linear function: it is easy to train and test with, and it is efficient.

2. Using dropout during training and testing can improve accuracy significantly, because this is basically aggregating many different, highly regularized deep learning models by a geometric mean.

This might be a standard recipe for current deep learning. Based on this recipe, several of his students have won many Kaggle challenges.
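To make the geometric-mean point concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk). It assumes a single softmax layer with dropout applied to its inputs at a keep probability of 0.5; the names `dropout_geometric_mean` and `mean_network` are hypothetical. For this simple case, the normalized geometric mean over all dropout masks should match one forward pass with the inputs halved.

```python
import itertools

import numpy as np


def relu(z):
    """Rectified linear unit: max(z, 0), applied elementwise."""
    return np.maximum(z, 0.0)


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def dropout_geometric_mean(W, b, x):
    """Normalized geometric mean of the softmax outputs over all 2^n input masks."""
    n = x.shape[0]
    log_probs = []
    for mask in itertools.product([0.0, 1.0], repeat=n):
        dropped = np.array(mask) * x  # zero out the dropped inputs
        log_probs.append(np.log(softmax(W @ dropped + b)))
    # Averaging log-probabilities is a geometric mean; softmax renormalizes it.
    return softmax(np.mean(log_probs, axis=0))


def mean_network(W, b, x, keep_prob=0.5):
    """Single test-time pass with inputs scaled by the keep probability."""
    return softmax(W @ (keep_prob * x) + b)


rng = np.random.default_rng(0)
x = relu(rng.normal(size=6))   # hidden activations after a ReLU (toy data)
W = rng.normal(size=(3, 6))    # weights of a 3-class softmax layer (toy data)
b = rng.normal(size=3)

ensemble = dropout_geometric_mean(W, b, x)
single = mean_network(W, b, x)
print(np.allclose(ensemble, single))
```

For deeper networks the scaled single pass is only an approximation of this average, so the sketch should be read as the one-layer case only.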


Conditional Gradient Descent (a.k.a. Frank-Wolfe algorithm)

Recently, I read Martin Wainwright’s old paper, “A new class of upper bounds on the log partition function,” and found that the conditional gradient method is used there to optimize the edge appearance probabilities. Sébastien Bubeck gave a nice introduction to the conditional gradient method.
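As a reminder of how the method works, here is a minimal sketch of conditional gradient on a toy problem (my own illustration, not the setup from the paper). It assumes the constraint set is the probability simplex, the objective is a simple quadratic, and the step size is the common 2/(t+2) schedule; `frank_wolfe_simplex` and `target` are hypothetical names. Each iteration solves a linear problem over the constraint set and moves toward its solution, so no projection is needed.

```python
import numpy as np


def frank_wolfe_simplex(grad, x0, num_iters=2000):
    """Conditional gradient (Frank-Wolfe) over the probability simplex.

    grad: function returning the gradient of the objective at x.
    x0:   starting point inside the simplex.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        g = grad(x)
        # Linear minimization oracle: on the simplex, the minimizer of <g, s>
        # is the vertex with the smallest gradient coordinate.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s  # convex combination stays feasible
    return x


# Toy objective: f(x) = 0.5 * ||x - target||^2, whose gradient is x - target.
target = np.array([0.2, 0.5, 0.3])
x_fw = frank_wolfe_simplex(lambda x: x - target, x0=np.array([1.0, 0.0, 0.0]))
print(x_fw)
```

Since `target` already lies in the simplex, the iterates should approach it. For other constraint sets only the linear minimization step changes.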


Connection between prediction market and stochastic mirror descent

My Ph.D. research topic is machine learning markets: applying market mechanisms to machine learning in order to drive efficient, large-scale machine learning tasks. For example, we have used a prediction market mechanism for belief aggregation. We treat each agent in the prediction market as a machine learning model (a probabilistic classifier). Each agent, with its own wealth, maximizes its expected utility to decide how much of each good in the market to purchase (the goods correspond to the classes a data point may belong to). After the betting, an agent's purchases pay out if they are on the correct outcome. The final market equilibrium price captures the consensus belief of all participating agents over the outcome space.

One interesting paper by Rafael Frongillo at NIPS 2012 details the connection between prediction markets and stochastic mirror descent (SMD). The market price update in a prediction market is actually a stochastic mirror descent step. The gradient of the objective function F(x; d) w.r.t. x is -d(C, x), where d is the demand function. The Bregman divergence part uses the conjugate dual of the cost function C as the regularization function.

  1. From the stochastic online optimisation perspective, the market price x is updated by minimizing a potential objective function F(x; d). For example, if the agent is a Kelly bettor, F(x; d) = W * KL(p || x), where W is the agent's wealth and p is its belief distribution over the outcome space. In each update, the market tries to match the price x to the agent’s belief distribution, subject to a specific regularization term (the Bregman divergence term). This regularization prevents the market price from moving all the way to the agent's belief.
  2. The interesting part to me is the relationship -d(C, x) = grad of F(x; d).
    The form of the demand function determines the form of the potential objective function F: Kelly bettors lead to a KL divergence, and isoelastic utility leads to a Rényi divergence.
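The Kelly case above can be sketched numerically. This is my own toy illustration under several assumptions, not the construction from the paper: the cost function C is taken to be the LMSR cost (so prices are a softmax of outstanding shares and the Bregman term is a KL divergence), there is a single agent rather than a random stream of agents, and `eta` is a hypothetical scaling of each trade. The demand W * p / x is the negative gradient of W * KL(p || x) with respect to x.

```python
import numpy as np


def lmsr_price(q, b=1.0):
    """Prices x = grad C(q) for the LMSR cost C(q) = b * log(sum(exp(q / b)))."""
    z = q / b
    z = z - np.max(z)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()


def kl(p, x):
    """KL(p || x) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / x)))


def kelly_demand(p, x, wealth):
    """Demand of a Kelly bettor: -grad_x [W * KL(p || x)] = W * p / x."""
    return wealth * p / x


def market_step(q, p, wealth, eta, b=1.0):
    """One mirror descent step in share space: q <- q - eta * grad F = q + eta * d."""
    x = lmsr_price(q, b)
    return q + eta * kelly_demand(p, x, wealth)


p = np.array([0.7, 0.2, 0.1])  # agent's belief over three outcomes (toy data)
q = np.zeros(3)                # outstanding shares; initial prices are uniform
kl_trace = [kl(p, lmsr_price(q))]
for _ in range(500):
    q = market_step(q, p, wealth=1.0, eta=0.05)
    kl_trace.append(kl(p, lmsr_price(q)))

x_final = lmsr_price(q)
print(kl_trace[0], kl_trace[1], x_final)
```

Each single step moves the price only part of the way toward p, which is the regularization effect described in point 1; only after many repeated trades by the same agent does the price settle at its belief.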


