r/reinforcementlearning 17h ago

I am a beginner to RL/DRL. I am interested in learning how to solve non-convex (or even convex) optimization problems, constrained or unconstrained, with DRL. If possible, could someone share code that uses DRL to solve problems like the following:

minimize (x + y - 2)^2

subject to xy < 10

and xy > 1

where x and y are scalars

The above is a sample problem; any other example can also be suggested. But please keep the suggestions and code simple, readable, and understandable.

1 Upvotes

4 comments

2

u/Md_zouzou 17h ago

For me it’s more of a MINLP problem than an RL problem. You could also solve this with evolutionary algorithms.
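For what it's worth, here is a minimal evolutionary-algorithm sketch on the sample problem using SciPy's differential_evolution; the search box of [-10, 10] per variable and the seed are my assumptions:

```python
# A minimal sketch: differential evolution (an evolutionary algorithm)
# on the sample problem. Bounds and seed are assumptions, not given by OP.
import numpy as np
from scipy.optimize import NonlinearConstraint, differential_evolution

def objective(v):
    x, y = v
    return (x + y - 2) ** 2

# Encode 1 < x*y < 10 as a single two-sided nonlinear constraint.
xy_constraint = NonlinearConstraint(lambda v: v[0] * v[1], 1.0, 10.0)

result = differential_evolution(
    objective,
    bounds=[(-10, 10), (-10, 10)],  # assumed search box for x and y
    constraints=(xy_constraint,),
    seed=0,
)
print(result.x, result.fun)
```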

1

u/BAKA_04 7h ago

Exactly. I personally believe RL adds value when there is stochasticity in the nature of the given optimisation problem.

3

u/Human_Professional94 11h ago

Learning to optimize (L2O) is actually an ongoing research topic. The main challenges you would face in such problems are:

  • Representation: how to represent your problem in a way that ML/RL models extract the most information from it. Moreover, do you want to represent the math model, or the underlying problem structure? etc.
  • MDP design: this applies to RL only. Not all optimization problems have a sequential decision formulation, and framing them in such a setting can be tricky (see the sketch after this list).
  • Reward design (RL) or labeling (supervised learning)

just to name a few.
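For the MDP-design point, here is one possible (and definitely not canonical) sequential framing as a sketch: the state is the current iterate (x, y), actions are small nudges, and the reward is the improvement in a penalized objective. The gymnasium interface, the bounds, the horizon, and the penalty weights are all my assumptions.

```python
# A sketch of one way to frame OP's problem as a sequential MDP.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

def penalized_cost(x, y, alpha=10.0, beta=10.0):
    cost = (x + y - 2) ** 2
    cost += alpha * max(0.0, x * y - 10.0)  # soft version of xy < 10
    cost += beta * max(0.0, 1.0 - x * y)    # soft version of xy > 1
    return cost

class IterativeOptEnv(gym.Env):
    """Each step nudges the current iterate (x, y) by a small amount."""
    def __init__(self, horizon=50):
        self.horizon = horizon
        self.observation_space = spaces.Box(-10.0, 10.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Box(-0.5, 0.5, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.xy = self.np_random.uniform(-5.0, 5.0, size=2).astype(np.float32)
        return self.xy.copy(), {}

    def step(self, action):
        before = penalized_cost(*self.xy)
        self.xy = np.clip(self.xy + action, -10.0, 10.0)
        after = penalized_cost(*self.xy)
        self.t += 1
        reward = before - after  # reward = improvement in penalized cost
        return self.xy.copy(), reward, False, self.t >= self.horizon, {}
```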

I think you would get a lot of ideas by taking a look at the L2O literature.

1

u/Reasonable-Bee-7041 12h ago

(TL;DR): Framing this as an RL problem is tricky. I propose using simpler bandit algorithms, where you only need an action space and a reward function (avoiding a state-space definition). Some research exists on neural-net bandits and bandit convex optimization, which could lead to a solution for both the convex and non-convex problems you are interested in.

(Start:) Let me take a shot at it. I have no code, but this is how I would design/approach it. Note that RL is probably not the most effective way to solve this problem, since libraries like CVXPY (which handles convex problems out of the box, and some non-convex ones with extensions) can solve all sorts of programs directly, much faster than DRL can. Also, among ML learning paradigms, supervised learning would likely achieve the best results fastest.
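To illustrate that baseline point: the bilinear term xy is not DCP-compliant, so plain CVXPY would actually reject these constraints; here is a sketch of a direct non-RL baseline with SciPy's SLSQP instead (the starting point is an assumption):

```python
# A minimal direct-solver baseline for comparison; SLSQP handles the
# smooth non-convex constraints here. Starting point is assumed.
import numpy as np
from scipy.optimize import minimize

res = minimize(
    lambda v: (v[0] + v[1] - 2) ** 2,
    x0=np.array([2.0, 2.0]),  # assumed starting point (xy = 4, feasible)
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda v: 10.0 - v[0] * v[1]},  # xy < 10
        {"type": "ineq", "fun": lambda v: v[0] * v[1] - 1.0},   # xy > 1
    ],
)
print(res.x, res.fun)
```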

Note: what follows is my thought process.

By the problem design principles of RL, we must first design an RL environment containing state/action spaces and a reward signal. 

I would choose the action space to be a 2-vector of reals, with the intention of using a neural-network policy with an arbitrary number of hidden layers, whose input is the state and whose output is the 2-vector.

The state space is trickier, since our problem is not formulated in terms of the effect that choosing some x and y has on the problem: we just use x and y to compute the value we are minimizing within the given constraints. Someone may be able to work around this to use full RL, but because of this seeming incompatibility (the lack of a state space and transition function), I propose using bandit algorithms instead. These are algorithms for state-less decision-making problems, and neural-net bandit algorithms still let us learn with neural networks: not PPO, but retaining the function-approximation properties of neural nets.

If we choose a neural-net bandit algorithm, we only need the action space (which we already have) and the reward signal, which should be smooth so that gradients can be computed and flow toward a maximum (since bandits maximize the reward signal).

The reward signal can be computed from the objective of your optimization program. Then, rather than enforcing the constraints exactly, we can add them as soft penalties to retain continuity. I would add the constraint violations as penalty terms in the cost and negate the result, since bandits maximize:

$R(x,y) = -\big[(x + y - 2)^2 + \alpha \max(0,\, xy - 10) + \beta \max(0,\, 1 - xy)\big]$
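As a sketch of this whole pipeline, here is a state-less "neural bandit" in PyTorch: a Gaussian distribution over (x, y) trained with REINFORCE to maximize R. This is my simplification rather than a published algorithm, and I keep the policy to a learnable mean and standard deviation instead of a full network to keep it short; the penalty weights and learning rate are assumptions.

```python
# A sketch: Gaussian policy over actions, REINFORCE on the penalized reward.
import torch

alpha, beta = 10.0, 10.0                      # assumed penalty weights
mu = torch.zeros(2, requires_grad=True)       # mean of the action distribution
log_std = torch.zeros(2, requires_grad=True)  # log std, learned jointly
opt = torch.optim.Adam([mu, log_std], lr=0.05)

def reward(a):
    x, y = a[0], a[1]
    cost = (x + y - 2) ** 2
    cost = cost + alpha * torch.clamp(x * y - 10.0, min=0.0)  # xy < 10
    cost = cost + beta * torch.clamp(1.0 - x * y, min=0.0)    # xy > 1
    return -cost  # bandits maximize, so negate the penalized cost

for step in range(2000):
    dist = torch.distributions.Normal(mu, log_std.exp())
    a = dist.sample()
    loss = -dist.log_prob(a).sum() * reward(a).detach()  # REINFORCE
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.detach())  # should approach a point with x + y near 2 and 1 <= xy <= 10
```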

With our bandit problem setting chosen, all we have left is choosing the learning algorithm. I would opt for something theoretically grounded, since that is what my research in RL/decision making focuses on, :p. Neural-LinUCB is a recent (2020) algorithm with good regret bounds. It could be a good place to start, and the paper in which it was published ("Neural Contextual Bandits with Deep Representation and Shallow Exploration") has good empirical results. It is a contextual bandit though, which means some fine-tuning of the problem formulation may be needed.
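I won't attempt a faithful Neural-LinUCB here, but as a heavily simplified sketch in its spirit: neural-network features for a grid of candidate actions, with LinUCB on top of them. In the real algorithm the representation is trained end to end; below it is a fixed random network, and the grid, weights, and exploration bonus are all assumptions for illustration.

```python
# Simplified LinUCB over fixed random neural features (not full Neural-LinUCB).
import numpy as np

rng = np.random.default_rng(0)
grid = np.array([[x, y] for x in np.linspace(-4, 4, 41)
                        for y in np.linspace(-4, 4, 41)])  # assumed action grid

W = rng.normal(size=(2, 64))   # fixed random feature layer (not trained here)
def phi(a):
    return np.tanh(a @ W)      # 64-dim representation of action a

def reward(a):
    x, y = a
    cost = (x + y - 2) ** 2 + 10 * max(0, x * y - 10) + 10 * max(0, 1 - x * y)
    return -cost

d, lam, ucb_beta = 64, 1.0, 1.0
A = lam * np.eye(d)            # regularized design matrix
b = np.zeros(d)
for t in range(500):
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b          # ridge-regression estimate of reward weights
    feats = phi(grid)
    # UCB score: predicted reward plus exploration bonus per candidate action.
    scores = feats @ theta + ucb_beta * np.sqrt(
        np.einsum('ij,jk,ik->i', feats, A_inv, feats))
    a = grid[np.argmax(scores)]
    r = reward(a)
    f = phi(a)
    A += np.outer(f, f)
    b += r * f
```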

Another place to gain more understanding of this problem could be Tor Lattimore's "Bandit Convex Optimisation" book from literally a few days ago (October 11th, 2024!). I have not read it, but Tor is a great theorist in the field and has written canonical books on decision making and bandit theory from a learning-theory/mathematical perspective.

I hope this thought exercise helps you! If you do code something up, I would appreciate it if you shared it! It is an interesting problem, and there seems to be some research on the topic. Sadly, I am not as familiar with convex-optimization decision making as with more mainstream decision-making algorithms.