# sample

Sample experiences from replay memory buffer

## Syntax

```
experience = sample(buffer,batchSize)
experience = sample(buffer,batchSize,Name=Value)
[experience,Mask] = sample(buffer,batchSize,Name=Value)
```

## Description


`experience = sample(buffer,batchSize)` returns a mini-batch of N experiences from the replay memory `buffer`, where N is specified using `batchSize`.

`experience = sample(buffer,batchSize,Name=Value)` specifies additional sampling options using one or more name-value arguments.

`[experience,Mask] = sample(buffer,batchSize,Name=Value)` also returns a sequence padding mask that indicates which experiences at the end of a sampled sequence are padding.

## Examples


### Sample Experiences from an Experience Buffer

Define observation specifications for the environment. For this example, assume that the environment has a single observation channel with three continuous signals in specified ranges.

```
obsInfo = rlNumericSpec([3 1],...
    LowerLimit=0,...
    UpperLimit=[1;5;10]);
```

Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

```
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);
```

Create an experience buffer with a maximum length of 20,000.

`buffer = rlReplayMemory(obsInfo,actInfo,20000);`

Append a single experience to the buffer using a structure. Each experience contains the following elements: current observation, action, next observation, reward, and an is-done signal.

For this example, create an experience with random observation, action, and reward values. Indicate that this experience is not a terminal condition by setting the `IsDone` value to 0.

```
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
```

Append the experience to the buffer.

`append(buffer,exp);`

You can also append a batch of experiences to the experience buffer using a structure array. For this example, append a sequence of 100 random experiences, with the final experience representing a terminal condition.

```
for i = 1:100
    expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
    expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    expBatch(i).Reward = 10*rand(1);
    expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;

append(buffer,expBatch);
```

After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

`miniBatch = sample(buffer,50);`
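The returned mini-batch is a structure with batched fields. As a quick sketch of inspecting it (dimensions follow the Output Arguments section below, assuming the default sequence length of 1):

```
% Observation is a cell array with one element per observation
% channel; the single channel here yields a 3-by-50 array
% (D_O-by-batchSize).
obsBatch = miniBatch.Observation{1};
size(obsBatch)
```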

You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive experiences with a discount factor of 0.95.

```
horizonSample = sample(buffer,1,...
    NStepHorizon=10,...
    DiscountFactor=0.95);
```

The returned sample includes the following information.

• `Observation` and `Action` are the observation and action from the first experience in the horizon.

• `NextObservation` and `IsDone` are the next observation and termination signal from the final experience in the horizon.

• `Reward` is the cumulative reward across the horizon using the specified discount factor.
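For instance, you can read the cumulative reward and the final termination signal directly from the returned structure (a brief sketch; the field layout is described in Output Arguments below):

```
% Cumulative discounted reward over the (up to) 10-step horizon,
% and the termination signal from its final experience.
horizonReturn = horizonSample.Reward
horizonIsDone = horizonSample.IsDone
```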

You can also sample a sequence of consecutive experiences. In this case, the structure fields contain arrays with values for all sampled experiences.

```
sequenceSample = sample(buffer,1,...
    SequenceLength=20);
```
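When sampling sequences, you can also request the padding mask described in Output Arguments to separate real experiences from padding. A minimal sketch using the two-output syntax:

```
% Sample one sequence of up to 20 consecutive experiences and keep
% only the rewards that belong to real (non-padded) steps.
[seqSample,mask] = sample(buffer,1,SequenceLength=20);
rewards = squeeze(seqSample.Reward);   % SequenceLength-by-1 vector
realRewards = rewards(mask(:));
```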

### Sample from a Buffer with Multiple Observation Channels

Define observation specifications for the environment. For this example, assume that the environment has two observation channels: one channel with two continuous observations and one channel with a three-valued discrete observation.

```
obsContinuous = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
```

Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

```
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);
```

Create an experience buffer with a maximum length of 5,000.

`buffer = rlReplayMemory(obsInfo,actInfo,5000);`

Append a sequence of 50 random experiences to the buffer.

```
for i = 1:50
    exp(i).Observation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    exp(i).NextObservation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    exp(i).Reward = 10*rand(1);
    exp(i).IsDone = 0;
end
append(buffer,exp);
```

After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

`miniBatch = sample(buffer,10);`
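Because this buffer was created with two observation channels, the sampled `Observation` field mirrors that layout. A brief sketch of unpacking the channels (dimensions per Output Arguments, assuming the default sequence length of 1):

```
% One cell per observation channel: 2-by-10 continuous values and
% 1-by-10 discrete values (D_O-by-batchSize).
contObs = miniBatch.Observation{1};
discObs = miniBatch.Observation{2};
```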

## Input Arguments


`buffer` — Experience buffer, specified as an `rlReplayMemory` or `rlPrioritizedReplayMemory` object.

`batchSize` — Batch size of experiences to sample, specified as a positive integer.

If `batchSize` is greater than the current length of the buffer, then `sample` returns no experiences.
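To avoid an empty result early in training, you can check how full the buffer is before sampling. This sketch assumes the current number of stored experiences is available through the buffer's `Length` property:

```
% Sample only once the buffer holds at least one full mini-batch.
batchSize = 50;
if buffer.Length >= batchSize
    miniBatch = sample(buffer,batchSize);
end
```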

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: `DiscountFactor=0.95`

`SequenceLength` — Sequence length, specified as a positive integer. For each batch element, sample up to `SequenceLength` consecutive experiences. If a sampled experience has a nonzero `IsDone` value, stop the sequence at that experience.

`NStepHorizon` — N-step horizon length, specified as a positive integer. For each batch element, sample up to `NStepHorizon` consecutive experiences. If a sampled experience has a nonzero `IsDone` value, stop the horizon at that experience. `sample` returns the following experience information based on the sampled horizon.

• `Observation` and `Action` are the observation and action from the first experience in the horizon.

• `NextObservation` and `IsDone` are the next observation and termination signal from the final experience in the horizon.

• `Reward` is the cumulative reward across the horizon using the specified discount factor.

Sampling an n-step horizon is not supported when sampling sequences. Therefore, if `SequenceLength` > `1`, then `NStepHorizon` must be `1`.

`DiscountFactor` — Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a horizon of experiences (`NStepHorizon` > `1`), `sample` returns the cumulative reward R computed as follows.

`$R=\sum_{i=1}^{N}{\gamma}^{i-1}{R}_{i}$`

Here:

• γ is the discount factor.

• N is the sampled horizon length, which can be less than `NStepHorizon`.

• R_i is the reward for the ith horizon step.

`DiscountFactor` applies only when `NStepHorizon` is greater than one.
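As a concrete check of the formula, with γ = 0.9 and horizon rewards 1, 2, and 3, the cumulative reward is 1 + 0.9·2 + 0.81·3 = 5.23:

```
% Reproduce the n-step cumulative reward by hand.
gamma = 0.9;
r = [1 2 3];                         % per-step horizon rewards
R = sum(gamma.^(0:numel(r)-1).*r)    % returns 5.2300
```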

`DataSourceID` — Data source index, specified as one of the following:

• `-1` — Sample from the experiences of all data sources.

• Nonnegative integer — Sample from the experiences of only the data source specified by `DataSourceID`.
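For example, to draw a mini-batch only from experiences recorded by one particular data source (the source index `0` here is illustrative):

```
% Sample 32 experiences appended by data source 0 only.
srcBatch = sample(buffer,32,DataSourceID=0);
```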

## Output Arguments


`experience` — Experiences sampled from the buffer, returned as a structure with the following fields.

`Observation` — Observations, returned as a cell array with length equal to the number of observation specifications defined when creating the buffer. Each element of `Observation` contains a D_O-by-`batchSize`-by-`SequenceLength` array, where D_O is the dimension of the corresponding observation specification.

`Action` — Agent action, returned as a cell array with length equal to the number of action specifications defined when creating the buffer. Each element of `Action` contains a D_A-by-`batchSize`-by-`SequenceLength` array, where D_A is the dimension of the corresponding action specification.

`Reward` — Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-`SequenceLength` array.

`NextObservation` — Next observation reached by taking the specified action from the observation, returned as a cell array with the same format as `Observation`.

`IsDone` — Termination signal, returned as a 1-by-1-by-`SequenceLength` array of integers. Each element of `IsDone` has one of the following values.

• `0` — This experience is not the end of an episode.

• `1` — The episode terminated because the environment generated a termination signal.

• `2` — The episode terminated by reaching the maximum episode length.

`Mask` — Sequence padding mask, returned as a logical array with length equal to `SequenceLength`. When the sampled sequence length is less than `SequenceLength`, the data returned in `experience` is padded. Each element of `Mask` is `true` for a real experience and `false` for a padded experience.

You can ignore `Mask` when `SequenceLength` is 1.

## Version History

Introduced in R2022a