

The AI part of HIRO is based on Reinforcement Learning algorithms.

Reinforcement Learning is an area of machine learning in which an agent learns how to behave in an environment by performing actions and observing the rewards or results it gets from those actions.

The algorithm (agent) evaluates the current situation (state), takes an action, and receives feedback (reward) from the environment after each action. Positive feedback is a reward (in its colloquial meaning), and negative feedback is a punishment for making a mistake. Reinforcement Learning algorithms learn how to act best through many attempts and failures.

The current implementation of Reinforcement Learning in HIRO uses so-called Monte Carlo methods.
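As a rough illustration of what Monte Carlo methods do in general: they estimate how good an action is by averaging the total reward that followed it over complete episodes. The sketch below is a generic textbook-style version (every-visit, undiscounted), not HIRO's actual implementation; the function name, the episode format, and the KI names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Step = Tuple[str, str, float]  # (state, action, reward), hypothetical format


def monte_carlo_action_values(
    episodes: List[List[Step]],
) -> Dict[Tuple[str, str], float]:
    """Average the undiscounted return that followed each (state, action) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so `ret` is the sum of rewards from this step to the end.
        for state, action, reward in reversed(episode):
            ret += reward
            totals[(state, action)] += ret
            counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Two made-up episodes with a reward of -1 per step.
episodes = [
    [("s0", "ki_a", -1.0), ("s1", "ki_b", -1.0)],
    [("s0", "ki_c", -1.0), ("s2", "ki_d", -1.0), ("s1", "ki_b", -1.0)],
]
values = monte_carlo_action_values(episodes)
print(values[("s0", "ki_a")], values[("s0", "ki_c")])
```

In this toy data, ki_a gets a higher estimated value than ki_c in state s0 because the episode that started with ki_a finished in fewer steps.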

For Reinforcement Learning to work successfully, we need to follow some basic rules when creating issues and writing KIs.

Concepts and Notations

First, let's introduce some RL-specific concepts and notations:

Reinforcement Learning Loop
  1. An Episode in HIRO is called an Issue/Task.

  2. HIRO is an Agent.

  3. To resolve an issue, HIRO interacts with an Environment.

  4. A Time step consists of choosing and executing a single KI.

  5. Therefore, at each time step HIRO takes an Action (executes a KI).

  6. At each time step HIRO observes the environment. The observation at a particular time step is called a State.

  7. Once a KI is executed, HIRO can obtain a Reward (a numerical value). A reward does not have to be received at every time step.

  8. Once the reward is obtained, HIRO may observe a New State, although this is not guaranteed. If the issue is resolved in this step, this is the Final State.

  9. HIRO takes steps until the end of the episode/issue is reached.

  10. The main goal is to maximize the total reward and to make sure we take the most promising steps (execute the best possible KIs).

A short description of this process could look like this:

`state => action => reward => nextstate => action => reward => nextstate => … => action => reward => finalstate`
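The loop described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not HIRO code: `run_episode`, `choose_action`, `step`, and the toy environment are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases, for illustration only.
State = str
Action = str


def run_episode(
    start: State,
    choose_action: Callable[[State], Action],
    step: Callable[[State, Action], Tuple[State, float, bool]],
    max_steps: int = 1000,
) -> Tuple[List[Tuple[State, Action, float]], float]:
    """Run one episode and return its trajectory and total reward."""
    trajectory = []
    total_reward = 0.0
    state = start
    for _ in range(max_steps):
        action = choose_action(state)                   # state => action
        next_state, reward, done = step(state, action)  # action => reward, next state
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:                                        # final state reached
            break
    return trajectory, total_reward


# Toy environment: walk from position 0 to position 3, reward -1 per step.
def toy_step(state: State, action: Action) -> Tuple[State, float, bool]:
    position = int(state) + 1
    return str(position), -1.0, position == 3


trajectory, total = run_episode("0", lambda s: "move_right", toy_step)
print(trajectory)
print(total)
```

In HIRO terms, `choose_action` corresponds to picking a KI and `step` to executing it; the episode ends when the issue is resolved.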


An example to illustrate this is the famous Tower of Hanoi.

Each disk sits on one particular rod, and the combination of rods used by all disks is the State. We can represent the state as a string:

Hanoi Towers start configuration

All three disks are currently using the first rod. This state can be represented as HanoiPos_1_1_1

Hanoi Towers mid configuration

Two disks are using the first rod and the third disk is using the second rod. This state can be represented as HanoiPos_1_1_2
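The string encoding used in these two examples can be reproduced with a small helper. The sketch below is only an illustration: `hanoi_state` and `move_disk` are hypothetical names, and the code does not check whether a move is legal under the Tower of Hanoi rules.

```python
from typing import List, Sequence


def hanoi_state(disk_rods: Sequence[int]) -> str:
    """Encode the rod of each disk as a state string, e.g. HanoiPos_1_1_2."""
    return "HanoiPos_" + "_".join(str(rod) for rod in disk_rods)


def move_disk(disk_rods: Sequence[int], disk: int, rod: int) -> List[int]:
    """Return a new configuration with `disk` (1-based) placed on `rod`."""
    new_rods = list(disk_rods)
    new_rods[disk - 1] = rod
    return new_rods


start = [1, 1, 1]
print(hanoi_state(start))                   # HanoiPos_1_1_1
print(hanoi_state(move_disk(start, 3, 2)))  # HanoiPos_1_1_2
```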

To train the system to finish the task as fast as possible, we assign a reward of "-1" for each time step.

The Reward is crucial for our system to work correctly. We can obtain the reward directly from the environment or from the KI. In the latter case, the KI should add this local reward to the ogit/RL/totalReward variable:

ogit/RL/totalReward = ogit/RL/totalReward + local_reward

To penalize some action we can use a negative reward:

ogit/RL/totalReward = ogit/RL/totalReward + (-local_reward)

The main idea is to encourage good actions/KIs and penalize bad actions/KIs whenever possible. This approach is critical for HIRO and its Learning Service to learn optimal behavior. If you don't provide any reward during the whole Issue/Episode, the system will not be able to learn anything useful from this episode.

Another example is the MAZE environment:

  • The Episode starts at some initial point and ends when we move to the exit.

  • The Environment is the whole gridworld/maze, where we can navigate our agent from one state/position to another.

  • The State can be represented as the current coordinates (X, Y) or just as a number 1, 2, 3, ..., N. We always store the current state internally.

  • The Reward is a numerical value, which is given to our agent via KIs. In the maze, the agent receives a reward of "-1" after each time step. By maximizing the total reward we are able to find the shortest path to the exit (e.g. a total reward of -100 means 100 steps, which is better than a total reward of -200, which means 200 steps). We store all the rewards obtained for each time step.
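To make the link between total reward and path length concrete, here is a small sketch that scores two made-up paths through a grid with a reward of -1 per move. The helper names and the example coordinates are assumptions for illustration only.

```python
from typing import List, Tuple

Position = Tuple[int, int]


def state_name(position: Position) -> str:
    """Encode (x, y) coordinates as a state string such as "22_15"."""
    x, y = position
    return f"{x}_{y}"


def episode_total_reward(path: List[Position], step_reward: float = -1.0) -> float:
    """Total reward of an episode: one `step_reward` per move between positions."""
    moves = len(path) - 1
    return moves * step_reward


# Two hypothetical paths from (0, 0) to the exit at (2, 1).
paths = {
    "direct": [(0, 0), (1, 0), (2, 0), (2, 1)],
    "detour": [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1)],
}
rewards = {name: episode_total_reward(path) for name, path in paths.items()}
best = max(rewards, key=rewards.get)  # highest total reward = fewest steps
print(rewards, best)
```

Because every move costs the same, the path with the highest total reward is also the shortest one.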


There are two main variables in the Reinforcement Learning workflow:

  • ogit/RL/totalReward

  • ogit/RL/state

Both are controlled via KIs.

a) Use ogit/RL/totalReward when you need to encourage or penalize a particular KI.

If you need to treat your KI as an agent's step, you should explicitly add a reward inside this KI, e.g. in the DO section of the KI definition:

ogit/RL/totalReward = ogit/RL/totalReward - 1

This adds a reward of -1 to this action/KI and therefore penalizes it.

For example, in the Maze or Tower of Hanoi, use ogit/RL/totalReward = ogit/RL/totalReward - 1 in each KI so that extra steps are penalized and HIRO learns the shortest path.

b) Use ogit/RL/state when you need to set or change your current state.

Defining a meaningful state can be tricky. You need to think about which combination of environment parameters can be used as a state. For example, in a Maze environment a state can simply be the X and Y coordinates (e.g. the string “22_15”, which represents x=22 and y=15). In Tower of Hanoi the initial state is HanoiPos_1_1_1, where each "1" represents a disk using the first rod.

Example of a KI using the Reinforcement Learning system:

  id: "HN3_R1"
  version: "0@rs"
  description: "Move disc 3 to position 1."
  MakeMove == "Hanoi 6.2"
  DiskPos[1] as DP1
  DiskPos[2] as DP2
  DiskPos[3] != 1
  StepCounter < 1000
  DiskPos[3] = 1
  ogit/RL/state = "HanoiPos_${DP1}_${DP2}_1"
  ogit/RL/totalReward = ogit/RL/totalReward - 1
  StepCounter = StepCounter + 1
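To show what this KI does to the two RL variables, the following Python sketch mimics it on a plain dictionary. It is not KI syntax and not HIRO code. It assumes that the lines up to `StepCounter < 1000` are conditions and the remaining lines are actions; the `ctx` dictionary and the function name are hypothetical.

```python
def apply_hn3_r1(ctx: dict) -> bool:
    """Mimic KI HN3_R1: move disk 3 to rod 1 if the conditions hold.

    `ctx` is a hypothetical dict standing in for the issue's variables.
    Returns True if the KI fired.
    """
    disk_pos = ctx["DiskPos"]  # dict: disk number -> rod number
    # Assumed conditions: the move marker matches, disk 3 is not already
    # on rod 1, and the step budget is not used up.
    if (
        ctx.get("MakeMove") != "Hanoi 6.2"
        or disk_pos[3] == 1
        or ctx["StepCounter"] >= 1000
    ):
        return False
    # Assumed actions: move the disk, publish the new state, penalize the step.
    disk_pos[3] = 1
    ctx["ogit/RL/state"] = f"HanoiPos_{disk_pos[1]}_{disk_pos[2]}_1"
    ctx["ogit/RL/totalReward"] = ctx.get("ogit/RL/totalReward", 0) - 1
    ctx["StepCounter"] += 1
    return True


ctx = {
    "MakeMove": "Hanoi 6.2",
    "DiskPos": {1: 1, 2: 2, 3: 2},
    "StepCounter": 0,
    "ogit/RL/totalReward": 0,
}
fired = apply_hn3_r1(ctx)
print(fired, ctx["ogit/RL/state"], ctx["ogit/RL/totalReward"])
```

Calling the function a second time on the same `ctx` returns False, because disk 3 is then already on rod 1.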

Single State Issue

Depending on the issue, there may be only one state. In this case the Learning Service optimizes the number of KIs executed during the issue: it tries to choose the KIs that finish the issue faster, in fewer time steps. Another approach is to choose only the KIs that produce higher rewards.

A good example of a single state issue is the multi-armed bandit problem. The state is always the same, but the actions are different (one action for each bandit). The goal is to choose the best actions and get the highest total reward from these actions.
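A minimal sketch of the bandit setting, assuming an epsilon-greedy strategy with sample-average estimates (a common textbook approach, not necessarily what the Learning Service does). The arms pay fixed, made-up rewards so that the result is easy to follow; all names and numbers here are assumptions.

```python
import random
from typing import List, Tuple


def run_bandit(
    arm_rewards: List[float],
    steps: int = 200,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], List[int], float]:
    """Epsilon-greedy on a bandit whose arms pay fixed (deterministic) rewards."""
    rng = random.Random(seed)
    n_arms = len(arm_rewards)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for t in range(steps):
        if t < n_arms:
            arm = t                                   # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)               # explore a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = arm_rewards[arm]
        counts[arm] += 1
        # Incremental sample average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = run_bandit([1.0, 5.0, 2.0])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, counts, total, best_arm)
```

There is no state variable in this code at all: the only thing learned is which action (arm) yields the highest reward, which mirrors a single state issue where each KI plays the role of one arm.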