# Deep reinforcement learning for data-driven adaptive scanning in ptychography

Table of Contents

### Image formation in ptychography

The approach described in this paper is compatible with multisclice ptychography, but in light of the application to a 2D material we constrain ourselves to single-slice ptychography. Here, ptychography can be expressed by a multiplicative approximation that describes the interaction of a wavefunction \(\psi ^in_p(\textbf r)\) of an incoming beam with the transmission function \(t(\textbf r)\) of a specimen. For each measurement *p*, the beam is shifted by \(\textbf R_p\) and a diffraction pattern is acquired with the intensity \(I_p\) that is expressed by

$$\beginaligned \beginaligned I_p = |\Psi ^ex_p(\textbf k)|^2 = |\mathscr F \left[ \psi ^in_p(\textbf r-\textbf R_p) t(\textbf r) \right] |^2, \endaligned \endaligned$$

(1)

where \(\mathscr F\) is the Fourier propagator, \(\textbf r\) the real space coordinate, \(\textbf k\) the reciprocal space coordinate and \(\Psi ^ex_p(\textbf k)\) the exit wavefunction at the detector. The transmission function can be defined as \(t(\textbf r) = e^i \sigma V(\textbf r)\), with the interaction constant \(\sigma\) and the complex potential \(V(\textbf r)\). Throughout this treatment, \(\sigma\) is absorbed into \(V(\textbf r)\). X-ray and optical ptychography is mathematically described similarly with the only difference that the transmission function \(t(\textbf r)\) is related to the complex refractive index of the specimen. Figure 1a illustrates the experimental configuration of conventional ptychography. The potential of the specimen is recovered from data of experimentally acquired diffraction patterns \(J_p\) using a reconstruction algorithm. Here, we apply a gradient based algorithm^{17} with a gradient decent optimization and the potential is retrieved by iteratively minimizing the loss function

$$\beginaligned \mathscr L(V) = \frac1P\sum ^P_p=1 \Vert I_p(V) – J_p\Vert ^2_2 . \endaligned$$

(2)

### Generation of scan sequences

We use a recurrent neural network (RNN)^{36,37,38} for the generation of scan sequences. Its network architecture is designed to model temporal sequences with recurring input information. Memory cells combine the current input information \(X_t\) with the hidden state \(H_t\) and map it to the next hidden state \(H_t+1\). These hidden states represent the memory gathered from all the previous time steps. Gated recurrent units (GRU)s^{39}, which allow a computationally fast mapping with a high performance, are used in this work. At every time step *t*, an output is generated on the basis of the current hidden state. In the implementation shown here, the output corresponds to multiple scan positions, i.e. a sub-sequence of scan positions, given by a vector of 2D coordinates \(\vec \textbf R_P_t\). In principle, the output can be reduced to a single scan position \(\textbf R_p_t\), but we do not do so for practical reasons that involve a reduced training performance of the network and also a greatly increased acquisition time due to, e.g., more frequent data transfer and pre-processing of these intermediate data chunks. The sub-sequence is predicted via a fully connected layer (FC) that is parameterized by the layer weights \(\theta _H\):

$$\beginaligned \vec \textbf R_P_t = \text FC_\theta _H(H_t). \endaligned$$

(3)

At the predicted scan positions \(\vec \textbf R_P_t\), diffraction patterns \(\varvecJ_P_t\) are acquired by the microscope and from these diffraction patterns a potential \(V_t(\textbf r)\) is reconstructed minimizing Eq. (2). The intermediate reconstruction \(V_t(\textbf r)\) combined with its corresponding sub-sequence of scan positions \(\vec \textbf R_P_t\) can then be used for the input information \(X_t\) of the RNN. However, the bandwidth of the information given in \(V_t(\textbf r)\) and \(\vec \textbf R_P_t\) differs strongly and thus pre-processing is required before the two components can be concatenated and mapped to \(X_t\). For the processed location information \(\textbf L_t\) based on the sub-sequence \(\vec \textbf R_P_t\), a FC that is parameterized by the weights \(\theta _R\) is used:

$$\beginaligned \textbf L_t = \text FC_\theta _R(\vec \textbf R_P_t). \endaligned$$

(4)

For the processed structure information \(\textbf C_t\) based on the reconstructed potential \(V_t(\textbf r)\), a compressed representation \(z_t\) is generated by using the encoder part of a convolutional autoencoder^{40}. This processing step is shown in Fig. 1b and the training of the convolutional autoencoder is described in the Supplementary Information. The compressed representation \(z_t\) is then fed into a FC that is parameterized by the weights \(\theta _z\):

$$\beginaligned \textbf C_t = \text FC_\theta _z(z_t). \endaligned$$

(5)

The processed location information \(\textbf L_t\) is subsequently concatenated with the processed structure information \(\textbf C_t\) and mapped to the input information \(X_t\) with a FC that is parameterized by the weights \(\theta _LC\). The whole process of predicting sub-sequences of scan positions and acquiring the corresponding diffraction patterns is repeated until a ptychographic data set of desired size is reached. Finally, backpropagation through time (BPTT) is used to generate the required gradients to update the network weights \(\theta = \ \theta _H,\theta _\text GRU,\theta _LC,\theta _R, \theta _z \\) of the RNN. Figure 1c shows the prediction process modeled by the RNN in full detail.

### Training through reinforcement learning

A RNN can be combined with RL to provide a formalism for modelling behaviour to solve decision making problems. In the case of adaptive scanning in ptychography, where we seek to predict multiple scan positions at each time step, RL can be formalized by a partially observable stochastic game (POSG) that is described by a 8-tuple, \(\langle M, \mathscr S,\\mathscr A^m\_m\in M,\rho ,\r^m\_m\in M, \\mathscr O^m\_m\in M, \omega , \gamma \rangle\), with multiple agents *M*. At each time-step *t* an agent *m* selects an action \(a^m_t \in \mathscr A^m\) and makes an observation \(o^m_t \in \mathscr O^m\) given the state \(s_t\in \mathscr S\). Thus, joint actions \(\varveca_t = \langle a^1_t,\ldots , a^m_t \rangle\) from the joint action space \(\varvec\mathscr A = \mathscr A_1 \times \cdots \times \mathscr A_M\) are executed and joint observations \(\varveco_t = \langle o^1_t, \ldots , o^m_t \rangle\) from the joint observation space \(\varvec\mathscr O = \mathscr O_1 \times \cdots \times \mathscr O_M\) are received from the environment at every time step. The next state \(s_t+1\) is generated according to a transition function \(\rho : \mathscr S \times \varvec\mathscr A \times \mathscr S \rightarrow [0,1]\), the observations \(\varveco_t+1\), containing incomplete information about the state \(s_t+1\), are generated according to an observation function \(\omega : \varvec\mathscr A \times \mathscr S \times \varvec\mathscr O \rightarrow [0,1]\) and each agent receives its immediate reward defined by the reward function \(r^m: \mathscr S \times \varvec\mathscr A \rightarrow \mathbb R\). This reward \(r^m\) contributes to the total reward computed at the end of the sequence, \(G^m=\sum ^T_t=0 \gamma ^t r^m(\varveca_t,s_t)\), also known as the return. The discount factor \(\gamma \in [0,1]\) controls the emphasis of long-term rewards versus short-term rewards. The entire history of observations and actions up to the current time \(\varvech_t= \ \varveco_1, \varveca_1, \ldots , \varveco_t-1,\varveca_t-1,\varveco_t \\) is used as basis for optimal or near-optimal decision making. A stochastic policy \(\pi _\theta ^m(a^m_t|\varvech_t)\) maps the history of past interactions \(\varvech_t\) to action probabilities. Given a continuous action space, the policy can be represented by a two-dimensional Gaussian probability distribution:

$$\beginaligned \pi _\theta ^m(a^m_t|\varvech_t) = \mathscr N(\varvec\mu _\theta ^m(\varvech_t), \Sigma ), \endaligned$$

(6)

with its mean vector \(\varvec\mu _\theta ^m(\varvech_t)\) corresponding to \(\textbf R_p_t\), where the history \(\varvech_t\) is summarized in the hidden state \(H_t\) of the RNN and the covariance matrix \(\Sigma\) with fixed variances \(\sigma _x^2 \in [0,1]\) and \(\sigma _y^2 \in [0,1]\). The joint policy of all agents is then defined as \(\varvec\pi _\varvec\theta (\varveca_t| \varvech_t) = \prod _m=1^M \pi _\theta ^m(a^m_t|\varvech_t)\), with \(\varvec\theta =\ \theta ^m \_m \in M\). The goal of RL is now to learn a joint policy that maximizes the expected total reward for each agent *m* with respect to its parameters \(\theta ^m\):

$$\beginaligned \beginaligned \mathscr J^m(\varvec\theta ) = \mathbb E_\varvec\pi _\varvec\theta (\varvec\tau ) \left[ G^m \right] \approx \frac1N \sum ^N_n=1 G^m_n, \endaligned \endaligned$$

(7)

where the expected total reward can be approximated by Monte Carlo sampling with *N* samples. In this paper, improvement of the policy is achieved by updating the policy parameters \(\theta ^m = \ \theta ^m_H,\theta _\text GRU,\theta _LC,\theta _R, \theta _z \\) with ’REINFORCE’^{41}, a policy gradient method:

$$\beginaligned \nabla _\theta ^m \mathscr J^m(\varvec\theta )= & \mathbb E_\varvec\pi _\varvec\theta (\varvec\tau ) \left[ \nabla _\theta ^m \text log \varvec\pi _\varvec\theta (\varvec\tau ) G^m \right] \nonumber \\ & \approx \frac1N \sum ^N_n=1 \sum ^T_t=0 \nabla _\theta ^m \text log \pi _\theta ^m(a^m_n,t|\varvech_n,t) G^m_n. \endaligned$$

(8)

The derivation of \(\nabla _\theta ^m \mathscr J^m(\varvec\theta )\) is given in the Supplementary Information.

### Learning to adaptively scan in ptychography

While policy gradient methods are the preferred choice to solve reinforcement learning problems in which the action spaces are continuous^{42}, they come with significant problems. Like any gradient based method, policy gradient solutions mainly converge to local, not global, optima^{43}. In this paper, we reduce the effect of this problem during training by splitting the training of the RNN into supervised learning and RL. While training in RL can be performed with a policy whose parameters are arbitrarily initialized, this is not ideal. Having an adequate initial guess of the policy and using RL subsequently to only fine-tune the policy is a much easier problem to solve. A sparse grid scan sequence is a reasonable initialization as it follows the current scanning convention used in a microscope. Pre-training of the parameterized policy for the RL model can then be performed by supervised learning applied on the RNN such that the discrepancy between the predicted scan positions \(\vec \textbf R_P_t = \vec \varvec\mu _\varvec\theta (\varvech_t)\) and the scan positions of the initialization sequence \(\vec \textbf R^\text init_P_t\) is minimized:

$$\beginaligned \mathscr K(\varvec\theta ) = \sum ^T_t \Vert \vec \textbf R_P_t – \vec \textbf R^\text init_P_t \Vert ^2_2. \endaligned$$

(9)

Figure 5 illustrates the scan positions during the fine-tuning of the policy through RL for the first 10,000 iterations when either (a) a policy that has not been initialized via supervised learning or (b) an initialized policy is used. While the scan positions in both cases converge to the atomic structure, the positions predicted by the non-initialized policy are distributed only within a small region of the field of view during the entire training. Note that once the predicted scan sequence mimics the sparse grid scan sequence as a result of the supervised learning based initialization, all further improvements in performance are the result of the subsequent training through RL.

A high variance of gradient estimates is another problem that particularly strongly affects the Monte Carlo policy gradient method^{42,44,45}. Due to this, the sampling efficiency is relatively low, which causes a slow convergence to a solution. This makes deep RL applied to ptychography challenging as the image reconstruction itself requires iterative processing. The high variance can be in part attributed to the difficulty of assigning credit from the overall performance to an individual agent’s action. Here, we introduce a way to estimate the reward function in order to tackle the credit assignment problem for adaptive scanning in ptychography. The reward function should naturally correspond to the quality of the ptychographic reconstruction. We have found empirically that a high reconstruction quality correlates positively with a high dynamic range in the phase. Therefore, the reward function could intuitively be formalized by \(r^m(\varveca_t|s_t)=P^-1 \sum _\textbf r\in \text FOV V(\textbf r)\), where *P* is the total number of scan positions. This formulation, however, does not solve the credit assignment problem and results in an insufficient training performance, as shown in Fig. 6a. To estimate the reward for the actions of each individual agent, we use a tessellation method that partitions the atomic potential into small segments. A Voronoi diagram^{46}, where each position corresponds to a seed for one Voronoi cell, enables assignment of only a part of the total phase to each position. More precisely, the Voronoi diagram formed by the predicted scan positions is overlaid with the corresponding ptychographic reconstruction at the end of the prediction process and the summed phase within each Voronoi cell is the reward for that cell’s seed position. The reward function can be expressed by \(r^m(\varveca_t|s_t)= P^-1 \sum _\textbf r\in \text Cell^m V(\textbf r)\). Figure 6b shows a Voronoi diagram generated by predicted scan positions.

### Settings

For the experimental investigation, we acquired multiple ptychographic data sets from a monolayer molybdenum disulfide (MoS\(_2\)) specimen with a Nion HERMES microscope. The microscope was operated with a 60 kV acceleration voltage, a convergence angle of 33 mrad and diffraction patterns with a pixel size of 0.84 mrad were acquired using a Dectris ELA direct-electron detector mounted at the electron energy-loss spectroscopy (EELS) camera port. Distortions induced by the EEL spectrometer were corrected with in-house developed software. For the ptychographic data set acquisition, a conventional grid scan with a scanning step size of 0.02 nm was used. From the experimentally acquired data sets we created 200 smaller data sets, each with 10,000 diffraction patterns and located at different regions of the sample. 175 of these small data sets were dedicated to the training of the network model, while the remaining 25 data sets were used to test the trained model and to generate the results shown here. The diffraction patterns were binned by a factor of 2 to \(64 \times 64\) pixels. The adaptive scanning model was trained on the small data sets with the goal of predicting optimal scan sequences of 250 to 500 probe positions, out of the possible 10,000, which corresponds to a dose reduction by a factor of 40 to 20. Each sub-sequence contains 50 to 100 positions, where the initially given first sub-sequence follows a sparse grid scanning sequence.

We conducted a second investigation of the model’s performance on simulated data sets of a DWCNT and compared it to the performance of alternative low-dose data acquisition methods in ptychography. The data sets were generated using the simulation tool *ab*TEM^{47} where we set the acceleration voltage to 60 kV, the convergence angle to 40 mrad and the scanning step size to 0.02 nm. The size of the diffraction patterns is \(86 \times 86\) pixels with a pixel size of 0.91 mrad. 962 data sets have been used to train the network model and from the 13,225 diffraction patterns in each data set only 840 were chosen within the prediction process of the adaptive scanning workflow. 25 data sets were used to compare the different scanning procedures and analyse their performance using the \(Q_\text SSIM\) metric. The simulated DWCNT consisted of an inner and an outer nanotube with a diameter of 9.78 Å and 16.44 Å and a chiral angle of \(44^\circ \) and \(60^\circ \), respectively. For each data set, a unique rotation between the two nanotubes and a translation of the entire DWCNT within the field of view was applied.

The ptychographic reconstructions were performed with an optimized version of ROP^{17,28} that allows simultaneous reconstruction from a batch of different data sets and therefore the parallel hardware architecture of a NVIDIA V100 GPU could be optimally used to efficiently train the model. For a batch size of 24, reconstructions were retrieved in about 35 s. A gradient descent step size \(\alpha _\text ROP\) of 525 was chosen and the potential was retrieved at iteration 5. In the experimental investigation, the reconstructed potential was \(200\times 200\) pixels with a pixel size of 0.0154 nm, for a field of view of \(2 \times 2\) nm, while for the simulation, the reconstructed potential was \(200\times 200\) pixels with a pixel size of 0.0140 nm, for a field of view of \(2.3 \times 2.3\) nm. For the generation of the reward function, Voronoi diagrams were generated with the Jump Flooding Algorithm^{48} and for the implementation of the network models, PyTorch^{49} was used. For the compression of structure information, we used a convolutional autoencoder consisting of 6 convolutional layers with kernels of dimension 3, a stride of 1 and channels that ranged from 16 to 512 for the encoder and decoder part, respectively. The input of the autoencoder had a dimension of 512 with a pixel size of 0.0064 nm and thus a scaling and an interpolation was required before the potential generated by ROP could be compressed. In addition, the value of the potential \(V_i\) at each pixel *i* was transformed to zero mean and unit variance. For the prediction of the scan sequences, pre-training and fine-tuning was performed with a RNN model composed of 2 stacked GRU layers with hidden states \(H_t\) of size 2048, the Adam optimizer^{50} with a learning rate \(\alpha _\text RNN\) of \(1\textsc e-5\) and a batch size of 24. For the fine-tuning, a policy with variances of \(\sigma _x^2 = \sigma _y^2 = 0.0125^2\) was chosen and a myopic behavior was enforced by setting the discount factor for the return, *G*, to \(\gamma = 0\). In the case of the experimental investigation, training of the autoencoder involved 100,000 iterations, while the training of the RNN with supervised learning and RL has been performed with 800 and 20,000 iterations, respectively. In the second investigation with the simulated data, the training of the autoencoder has been done with 30,000 iterations and for the training of the RNN with supervised learning and RL, 200 and 2800 iterations were used, respectively.