Toward the Next Generation of Retinal Neuroprosthesis: Visual Computation with S

Date: 2024-07-28

Zhaofei Yu, Jian K. Liu*, Shanshan Jia, Yichen Zhang, Yajing Zheng, Yonghong Tian, Tiejun Huang

a National Engineering Laboratory for Video Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

b Peng Cheng Laboratory, Shenzhen 518055, China

c Center for Systems Neuroscience, Department of Neuroscience, Psychology and Behaviour, University of Leicester, Leicester LE1 7RH, UK

Keywords: Visual coding; Retina; Neuroprosthesis; Brain-machine interface; Artificial intelligence; Deep learning; Spiking neural network; Probabilistic graphical model

Abstract

A neuroprosthesis is a type of precision medical device that is intended to manipulate the neuronal signals of the brain in a closed-loop fashion, while simultaneously receiving stimuli from the environment and controlling some part of a human brain or body. Incoming visual information can be processed by the brain in millisecond intervals. The retina computes visual scenes and sends its output to the cortex in the form of neuronal spikes for further computation. Thus, the neuronal signal of interest for a retinal neuroprosthesis is the neuronal spike. Closed-loop computation in a neuroprosthesis includes two stages: encoding a stimulus as a neuronal signal, and decoding it back into a stimulus. In this paper, we review some of the recent progress that has been achieved in visual computation models that use spikes to analyze natural scenes, including static images and dynamic videos. We hypothesize that in order to obtain a better understanding of the computational principles in the retina, a hypercircuit view of the retina is necessary, in which the different functional network motifs that have been revealed in cortical neuronal networks are also taken into consideration for the retina. The different building blocks of the retina, which include a diversity of cell types and synaptic connections (both chemical synapses and electrical synapses, or gap junctions), make the retina an ideal neuronal network for adapting the computational techniques that have been developed in artificial intelligence to model the encoding and decoding of visual scenes. An overall systems approach to visual computation with neuronal spikes is necessary in order to advance the next generation of retinal neuroprostheses as artificial visual systems.

1. Introduction

The concept of precision medicine has been under development for a few years. This term is usually used to refer to the customization of healthcare to individual patients. Current advancements in artificial intelligence techniques, including hardware, software, and algorithms, are making the process of healthcare increasingly precise for each individual patient, as the communication between healthcare devices or services and patients is specifically designed and adjusted.

A neuroprosthesis is a precision medical device that provides a means of therapy aside from traditional pharmacological treatment. A neuroprosthesis usually interacts directly with the neuronal activity, and with the neuronal spikes in particular, of an individual brain [1-9]. It consists of a series of devices that can substitute for a part of the human body and/or brain, such as a motor, sensory, or cognitive modality, that has been damaged. As the brain is the central hub that controls and exchanges the information used by human motor, sensory, and cognitive behavior, improving the performance of a neuroprosthesis requires better analysis of the neuronal signal that it uses. Therefore, in addition to the development of neuroprosthesis hardware, better algorithms are a core feature in enabling better performance of neuroprostheses [6,10,11].

Motor neuroprostheses have a long history of intensive study; in particular, recent techniques have been able to record cortical neuronal spikes well and use them to control neuroprostheses [6]. Cochlear implants are the most widely used sensory neuroprosthesis, and have demonstrated fairly good performance in addressing hearing loss, although many questions remain regarding how to improve their performance in a noisy environment and their effect on the neuronal activity of the downstream auditory cortex [11,12]. However, in contrast to the intensive computational modeling of cochlear implants that has been done [11], retinal neuroprostheses are much less well studied and perform much worse at restoring eyesight, although a few types of retinal neuroprostheses are being used in clinical trials [13,14].

The retina consists of three layers of neurons with photoreceptors, bipolar cells, and ganglion cells, surrounded by inhibitory horizontal and amacrine cells. Photoreceptors receive incoming light signals that encode the natural environment and transform them into electrical activity that is modulated by the horizontal cells. The electrical activity is then sent to bipolar cells and amacrine cells for further processing. In the end, all of the signals go to the output side of the retina, where retinal ganglion cells, which are the only output neurons, produce a sequence of action potentials or spikes that are transmitted via the optic nerve to various downstream brain regions. In essence, all of the visual information about the body's environment, both in space and in time, is encoded by these spatiotemporal patterns of spikes from ganglion cells.

Many types of eye disease are caused by neuronal degeneration of the photoreceptors, while the outputs of the retina, the ganglion cells, remain healthy. One type of therapy for such diseases would be to develop an advanced retinal prosthesis to directly stimulate the ganglion cells with an array of electrodes. Retinal neuroprostheses have a relatively long history of research [15]. However, much of the effort has been dedicated to the material design of retinal neuroprosthesis hardware [13-18]. Recently, it has been suggested that employing better neural coding algorithms would improve the performance of retinal neuroprostheses [10]. It was shown that the reconstruction of visual scenes can be significantly improved by adding an encoder that converts the input images into the spiking codes used by retinal ganglion cells; these codes are then used to drive transducers such as electrodes, optogenetic stimulators, or other components for vision restoration.

Therefore, better computational models are needed to advance the performance of retinal neuroprostheses. Compared with other neuroprostheses, whose stimulus signals are relatively simple, retinal neuroprostheses deal with dynamic visual scenes in space and time with higher-order correlations. Their low performance is mainly due to the lack of a clear understanding of how ganglion cells encode rich visual scenes. Much of our knowledge has been documented through experiments with simple artificial stimuli, such as white noise images, bars, and gratings. It remains unclear how the retina processes complex natural images with its neuronal structure. In recent years, remarkable progress has been made in using artificial intelligence to analyze complex visual scenes, including natural images and videos. Thus, it is now possible to develop novel functional artificial intelligence models to study the encoding and decoding of natural scenes by analyzing the spiking responses of retinal ganglion cells.

In this paper, we review some of the recent progress that has been achieved in this field. Most studies on visual coding can be roughly classified into two streams. The first, more traditional stream is the feature-based modeling approach, in which visual features or filters are aligned with the biophysical properties of the retinal neurons, such as the receptive field (RF). The second, relatively new stream is the sampling-based modeling approach, in which the statistics of visual scenes, such as pixels, are formulated using probabilistic models. It should be noted that these two approaches are not completely separate; in fact, there are increasingly close interactions between them due to advances in recent computational techniques for both hardware and algorithms. Here, we review some of the core ideas that have emerged from both approaches regarding the analysis of visual scenes using neural spikes, in order to promote the next generation of retinal neuroprostheses, in which computational modeling plays an essential role.

This review is organized as follows: Section 2 provides an introduction to the biological structure of the retina, with a focus on its inner neuronal circuit. We emphasize that the retinal circuit carries out rich computations that are beyond the dynamics of the single cells of the retina.

In Section 3, in contrast to the view that the retina is a simple neural network, we hypothesize that the retina is highly complex and is comparable to some aspects of the cortex, with different network motifs for specialized computations in order to extract visual features. In particular, we outline three views that present the retinal neuronal circuit as feedforward, recurrent, and winner-take-all (WTA) network structures. For each of these three viewpoints, we provide some evidence and recent results that fit into the proposed framework.

In Section 4, a feature-based modeling approach is discussed, and the models of encoding and decoding visual scenes based on feature extraction by the retina are reviewed. For encoding, we first summarize the biophysical models that directly analyze and fit neuronal spikes in order to determine some neuronal properties, such as the RF of the neuron. We then review some encoding models based on artificial neural networks (ANNs) that use recent state-of-the-art machine learning techniques to address complex natural scenes. For decoding, however, it is necessary to rely on statistical and machine learning models that aim to reconstruct visual scenes from neuronal spikes. We review some of these decoders with an emphasis on how they can be used to give retinal neuroprostheses a better performance for both static images and dynamic videos.

In Section 5, a sampling-based modeling approach is discussed. We give an overview of the retinal circuitry in which visual computation can be implemented by means of spiking neural networks (SNNs) and probabilistic graphical models (PGMs), such that different functional networks can conduct the visual computations observed in the retina. We first introduce the basis of neural computation with spikes. Modeling frameworks of neuronal spikes and SNNs are discussed from a sampling perspective. We then propose that the study of retinal computation should go beyond the classical description of the dynamics of neurons and neural networks by taking into account probabilistic inference. We review some recent results on implementing probabilistic inference with SNNs. These approaches are traditionally applied to theoretical studies of the visual cortex. Here, we demonstrate how similar computational approaches can be used for retinal computation. Finally, in the last section, we conclude the paper with a discussion of possible research directions in the future.

2. Visual computation in the neuronal circuit of the retina

Fig. 1 shows a typical setup of the retinal neuronal circuit. Roughly speaking, there are three layers of networks consisting of a few types of neurons. Following the information flow of visual scenes, photoreceptors convert light with a wide spectrum of intensities (from dim to bright) and colors (ranging from red, to green, to blue), into electrical signals that are then modulated by inhibitory horizontal cells. Next, these signals are transferred to excitatory bipolar cells that carry out complex computations. The outputs of bipolar cells are mostly viewed as graded signals; however, recent evidence suggests that bipolar cells can generate fast spiking events [19]. Inhibitory amacrine cells then modulate these outputs in different ways in order to make the computations more efficient, specific, and diverse [20]. At the final stage of the retina, the signals pass on to the ganglion cells for final processing. In the end, the ganglion cells send their spikes to the thalamus and cortex for higher cognition.

Fig. 1. Illustration of the retinal neuronal circuit. Visual scenes are converted by photoreceptors in the first layer, where rods encode the dim light and cones encode color. After being modulated by horizontal cells, the signals are sent to bipolar cells in the second layer. The outputs are sent to the third layer, which consists of amacrine cells and ganglion cells, for further processing. The final signals of the retina are the spikes from the ganglion cells, which are transferred to the cortex. In addition to chemical synapses between cells, massive gap junctions exist between different and the same types of cells (e.g., ganglion-ganglion cells).

Each type of neuron in the retina has a large variation in morphology; for example, it has been suggested that in the mouse retina, there are about 14 types of bipolar cells [21,22], 40 types of amacrine cells [23], and 30 types of ganglion cells [24]. In addition to neurons, a unique feature of any neuronal circuitry is the connections between neurons. Connections between neurons in the retina are typically formed by various types of chemical synapses. In addition, there are a massive number of electrical synaptic connections, or gap junctions, between different types of cells and between the same type of cells [25-28]. The functional role of these gap junctions remains unclear [25]. We hypothesize that gap junctions have the functional role of creating recurrent connections in order to enhance visual computation in the retina; this concept will be discussed in later sections.

In the field of retinal research, most studies are based on the traditional view that neurons in the retina have static RFs that act as spatiotemporal filters to extract local features from visual scenes. We also know that the retina has many levels of complexity in its information processing, from photoreceptors, to bipolar cells, to ganglion cells. In addition, the functional role of the modulation of inhibitory horizontal and amacrine cells is still unclear [20,29]. It is possible that the only relatively well-understood example is the computation of direction selectivity in the retina [30-33].

The retinal ganglion cells are the only output of the retina; however, their activities are tightly coupled and highly interactive with the rest of the retina. These interactions not only make the retinal circuitry complicated in its structure, but also make the underlying computation for visual processing much richer. Therefore, the retina should be considered to be "smarter" than what scientists have believed [34]. These observations lead us to rethink the functional and structural properties of the retina. Given such a complexity of neurons and neuronal circuits in the retina, we propose that the computations of visual scenes that are carried out by the retina should be perceived in a way that goes beyond the view that the retina is similar to a feedforward network that simply passes information through. Like the cerebral cortex, the retina has lateral inhibition and recurrent connections (e.g., gap junctions), which cause the retina to inherit various motifs of neural networks for the specific computations involved in extracting different features of visual scenes, just as visual processing occurs in the visual cortex [35-37].

It should be noted that in comparison with the visual cortex, a detailed understanding of the computation and function of the retina for visual processing has just emerged in recent decades. Today, the retinal computation of visual scenes by means of the retina's neurons and neuronal circuits is seen as being refined at many different levels; for more detail, see recent reviews on neuroscience advancements on the retina [20,21,25-29,34].

3. Computational framework for the retina

It seems to be difficult to unify the different pieces of neuroscientific experimental evidence from the retinal circuit from a biology perspective [38]. Instead, we hypothesize that it is necessary to study the computation carried out by the retinal circuit using a combination of diverse neural network structure motifs. Such an as-yet-to-emerge computational framework could benefit our understanding of visual computation by utilizing the machine learning techniques that have emerged in recent years [39]. When looking at a complete overview of the retinal neuronal circuitry, as shown in Fig. 1, it seems rather complicated. After extracting some of the features of network structures, however, simple network motifs emerge. Here, we focus on only three types of network structures, namely the feedforward, recurrent, and WTA networks illustrated in Fig. 2, and hypothesize that these structures play different functional roles in the visual computation of the retina. However, the retina is more than a hybrid of these three network motifs; rather, it consists of multiple types of networks that form a hypercircuit [38], from which more computational features can be extracted with advancements in experimental and computational techniques. Such a hypercircuit view provides a biological basis for a potentially unified framework of retinal computation, although how these different networks work together more efficiently for visual computation remains an open question.

Fig. 2. Illustration of different computational network motifs. (a) Parts of the retinal circuitry show different network motifs such as feedforward, recurrent, and WTA subnetworks. (b) Abstract representation of different types of neural networks used in modeling, where the stimulus is first represented by the activities of afferent neurons, and is then fed into a network of excitatory and/or inhibitory neurons for computation. Shadowed networks indicate the same motifs. (c) Abstract computation specifically used by certain typical ANNs, such as convolutional neural networks (CNNs), Markov random fields (MRFs), and hidden Markov models (HMMs). Note that ANNs can use one or mixed computational network motifs, as shown in (b). In the MRF, x_i are variables represented by a WTA circuit. In the HMM, x_i are observation variables represented by afferent neurons in a WTA circuit, and y_i are hidden variables represented by excitatory neurons in a WTA circuit.

3.1. Feedforward network

The feedforward network is the most classical view of the direction of visual information flow in the retina, as shown in Figs. 2(a,b). The feedforward information flow of the light passes through the retina by means of three major types of cells: photoreceptors, bipolar cells, and ganglion cells. The other two types of inhibitory cells play a modulation role, which has been ignored for simplicity in this viewpoint. The biological basis of this view can be seen in the fovea, where excitatory cells play a major role, with few inhibitions [40]. In the fovea, there is direct cascade processing from photoreceptors, to bipolar cells, and then to ganglion cells as the outputs.

The advantage of the feedforward network has been demonstrated by the advancement of ANNs in recent years. In particular, breakthroughs have been made in the framework of deep convolutional neural networks (CNNs) [39]. A simple CNN with three layers, as in the retina, is shown in Fig. 2(c), where a convolutional filter plays the role of the RF of the retinal cell. Cascade processing of visual inputs is computed by the RF of each individual neuron in the retina. The pooling of the computation from the previous layer passes to a neuron in the next layer. Recent studies highlight the similarity between the structure of CNNs and retinal neural circuitry [41,42], which will be discussed in later sections.
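The correspondence between a convolutional filter and an RF can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the difference-of-Gaussians kernel stands in for a center-surround RF, and the function names (dog_kernel, conv2d_valid, max_pool), the kernel size, and the widths are hypothetical choices rather than values taken from the studies cited above.

```python
import numpy as np

def dog_kernel(size, sigma_center, sigma_surround):
    """Difference-of-Gaussians kernel: excitatory center minus inhibitory surround."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_center ** 2)) / (2 * np.pi * sigma_center ** 2)
    surround = np.exp(-r2 / (2 * sigma_surround ** 2)) / (2 * np.pi * sigma_surround ** 2)
    return center - surround

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    This is a cross-correlation, which equals a convolution for the
    symmetric kernel used here.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def max_pool(x, size):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy stimulus: a small bright square on a dark background.
image = np.zeros((32, 32))
image[14:18, 14:18] = 1.0

kernel = dog_kernel(size=9, sigma_center=1.0, sigma_surround=2.5)
feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # rectification
pooled = max_pool(feature_map, size=2)                      # input to the next layer
```

In a trained CNN the kernel would be learned from data rather than fixed; the fixed kernel here only makes the filter-as-RF analogy concrete.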

3.2. Recurrent network

The dynamics of a recurrent network [43-45], together with the diversity of synaptic dynamics and plasticities [46,47], are important for understanding the brain's function. Here, we hypothesize that recurrent connections are also important for the retina. Recurrent connections in the retina are mainly produced by a massive number of gap junctions, as shown in Fig. 2(a). Unlike chemical synapses, gap junctions are bidirectional or symmetric. Occurring within and between all types of cells in the retina, gap junctions are used to form short connections between neighboring cells. However, the functional role of these gap junctions remains unclear [25].

From the computational viewpoint, recurrent connections formed by gap junctions make the retinal circuit similar to a PGM of an undirected Markov random field (MRF), as shown in Figs. 2(b,c). A PGM provides a powerful formalism for multivariate statistical modeling by combining graph theory with probability theory [48]. PGMs have been widely used in computer vision and computational neuroscience. In contrast to the MRF, there is another type of PGM that is mainly referred to as the Bayesian network, in which the connections have a direction between nodes. Fig. 2(c) shows one type of Bayesian network, termed the hidden Markov model (HMM). In recent years, much effort has been dedicated to implementing these PGMs by SNNs, setting up an insightful connection between artificial machine computation by PGMs and the neural computation observed in the brain, as well as the visual computation in the retina.
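For reference, the two model classes can be written in their standard textbook forms. The notation below is an assumption introduced for illustration: E denotes the edge set of the undirected graph, \phi_i and \psi_{ij} are non-negative potential functions, Z is the normalizing constant, and the HMM uses x_t for observations and y_t for hidden variables, following the labeling in the caption of Fig. 2.

```latex
% Pairwise MRF on an undirected graph with edge set E
p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)

% HMM (a directed Bayesian network) with hidden y_t and observed x_t
p(x_{1:T}, y_{1:T}) = p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
```

The symmetry of \psi_{ij} in the first expression mirrors the bidirectional nature of gap junctions, whereas the conditional factors in the second expression carry a direction.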

3.3. WTA network

Finally, we hypothesize that the retinal circuit contains the WTA motif as a computational network unit. In the cerebral cortex, the WTA circuit has been suggested to be a powerful computational network motif that implements normalization [49], visual attention [50], classification [51], and more [52].

Two types of inhibitory neurons sit in the first two layers of the retina. Horizontal cells target photoreceptors and relay light information to bipolar cells, while amacrine cells modulate the signals between bipolar cell terminals and ganglion cell dendrites. Both types of cell have specific subtypes that are wide-field or polyaxonal, such that they spread action potentials over a long distance (greater than 1 mm) [38]. From the computational viewpoint, this hypercircuit feature of the retina plays a functional role that is similar to that of a WTA network motif. A recent study has shown that an MRF can be implemented by the network of a WTA circuit, which suggests that the WTA could be the minimal unit of probabilistic inference for visual computation [53].
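A common abstraction of a WTA circuit, assumed here purely for illustration, is a softmax competition: shared inhibition normalizes the exponentiated inputs so that exactly one unit fires at a time, with the most strongly driven unit winning most often. The sketch below is not the specific circuit of Ref. [53]; the function names and input values are hypothetical.

```python
import numpy as np

def wta_probabilities(inputs):
    """Probability that each unit wins, given its input drive (softmax)."""
    inputs = np.asarray(inputs, dtype=float)
    expd = np.exp(inputs - inputs.max())  # subtract the max for numerical stability
    return expd / expd.sum()              # division plays the role of shared inhibition

def wta_sample(inputs, rng):
    """Draw one winner; returns a one-hot vector (a single unit fires)."""
    p = wta_probabilities(inputs)
    winner = rng.choice(len(p), p=p)
    out = np.zeros(len(p), dtype=int)
    out[winner] = 1
    return out

rng = np.random.default_rng(0)
inputs = [0.5, 2.0, 1.0]                  # hypothetical drives to three units
probs = wta_probabilities(inputs)
counts = sum(wta_sample(inputs, rng) for _ in range(2000))
```

Because each draw is a sample from a categorical distribution, repeated draws approximate that distribution, which is the sense in which a WTA unit can serve as a building block for sampling-based inference.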

3.4. Rich computation with network motifs

In the discussion above, we briefly reviewed retinal circuitry and identified three basic neural network motifs that act as units for the complex computations conducted in the retina. However, more types of network motifs have been suggested for cortical microcircuits [37], and it has been suggested that these motifs are also involved in the retinal computation as part of the retinal hypercircuit [38]. The hypercircuit view of the retina transfers most of the methods that have been developed for studying visual processing in the cortex to the investigation of the retinal computation, thereby introducing rich dynamics that are beyond the traditional view of the retina [34]. In particular, quite a few visual functions have been found to be implemented by certain types of network mechanisms in the retina; see Ref. [34] for detailed discussions.

Recent computational advancements in the field of ANNs have led to many breakthroughs in computational vision. For example, deep CNNs can perform hierarchical network modeling of visual computation passing from the retina to the inferior temporal part of the cortex [54]. These feature-based models take advantage of the RF to capture visual features. However, CNN models have a few disadvantages for visual computation; for example, CNN architecture largely lacks design principles, so it may be enhanced by the knowledge of biological neural network design in the brain, including the retina [55].

On the other hand, it has been suggested that a hierarchical Bayesian inference framework is necessary in order to understand visual computation [56]. When using such a sampling-based modeling approach, statistical computation of visual scenes can be formulated by various types of probabilistic models, where different types of network motifs can implement certain computations [57]. These computational techniques in Bayesian models are suitable for the visual processing of the visual cortex and the retina [56].

However, these two approaches are not completely separate; in fact, there are close interactions between them [55]. We will explain these ideas by using the retina as a model system in the sections below: The feature-based approach will be discussed in Section 4, and the sampling-based approach will be discussed in Section 5.

4. Encoding and decoding models of the retina

The usability of neural coding is one of the central questions in systems neuroscience [58-60]. In particular, for visual coding, it is necessary to understand first how visual scenes are represented by neuronal spiking activities, and then how to decode neuronal spiking activities to represent the given visual information. The retina serves as a useful system to study these questions.

4.1. Biophysical encoding model

In order to understand the encoding principles of the retina, several models have been developed based on the biophysical properties of the neurons and neuronal circuits in the retina, and have recently been reviewed [61]. Here, we briefly review these approaches.

The starting point for examining retinal neuronal computation was to find the RFs of neurons. The classical approach to mapping the neuronal RF is to patch a single cell and then vary the size of a light spot in order to obtain the RF structure as a difference-of-Gaussian filter with central excitation and surrounding inhibition. Later on, a systematic experimental method was developed using a multi-electrode array to record a population of retinal ganglion cells; using this method, it is possible to manipulate light stimulation with various types of optical images, including simple bars, spots, gratings, white noise, and complex well-controlled images and videos. In particular, it is possible to analyze the spike trains of individual neurons when simultaneously recording a large population with white noise stimulus. A simple reverse correlation technique termed the spike-triggered average (STA) [62] can be used to obtain the RF of every recorded ganglion cell. An extension of the STA to covariance analysis, which is known as spike-triggered covariance, serves as a powerful tool for analyzing the second-order dynamics of the retinal neurons [63,64].
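The STA computation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the stimulus is Gaussian white noise binned at the same resolution as the spike counts, and the recording is replaced by a synthetic neuron (an exponential nonlinearity with Poisson spiking) so that the recovered filter can be compared with a known one. The function name and all parameter values are hypothetical.

```python
import numpy as np

def spike_triggered_average(stimulus, spike_counts, n_lags):
    """Average the stimulus history preceding each spike.

    stimulus:     (n_frames, n_pixels) array
    spike_counts: (n_frames,) spikes per frame
    returns:      (n_lags, n_pixels) array; the last row is the frame
                  shown in the same bin as the spike
    """
    stimulus = np.asarray(stimulus, dtype=float)
    spike_counts = np.asarray(spike_counts, dtype=float)
    n_frames = stimulus.shape[0]
    valid = spike_counts[n_lags - 1:]  # spikes that have a full history window
    total = valid.sum()
    if total == 0:
        raise ValueError("no spikes with a full stimulus history")
    sta = np.empty((n_lags, stimulus.shape[1]))
    for k in range(n_lags):
        # Row k collects the frame shown (n_lags - 1 - k) bins before each spike.
        sta[k] = valid @ stimulus[k:n_frames - n_lags + 1 + k]
    return sta / total

# Synthetic check: simulate a neuron with a known filter, then recover it.
rng = np.random.default_rng(0)
n_frames, n_pixels, n_lags = 20000, 5, 8
stimulus = rng.standard_normal((n_frames, n_pixels))  # white noise stimulus
true_filter = rng.standard_normal((n_lags, n_pixels))
true_filter /= np.linalg.norm(true_filter)

drive = np.zeros(n_frames)
for k in range(n_lags):
    drive[n_lags - 1:] += stimulus[k:n_frames - n_lags + 1 + k] @ true_filter[k]
spikes = rng.poisson(0.1 * np.exp(drive))             # assumed spiking model

sta = spike_triggered_average(stimulus, spikes, n_lags)
similarity = np.corrcoef(sta.ravel(), true_filter.ravel())[0, 1]
```

With white noise input, the STA is proportional to the underlying linear filter, which is why the correlation with the true filter is high in this synthetic setting; for correlated stimuli such as natural images, this simple average is biased.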

With the RF mapped from each neuron, a simple and useful analysis is based on a linear-nonlinear (LN) model to simulate the cascade processing of light information. There are two stages in the LN model [65,66]. The first stage is a linear spatiotemporal filter that represents the sensitive area of the cell, that is, the characteristic of the RF. The second stage is a nonlinear transformation to convert the output of the linear filter into a firing rate. Both components of the LN model can be easily estimated from the spikes with white noise stimulus [64]. Otherwise, when dealing with complicated stimulus signals rather than white noise, it is necessary to use other methods, such as maximum likelihood estimation [65] and maximally informative dimensions [67], to estimate the model components when there is enough data.
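The two stages of the LN model can be written down directly. The following is a minimal sketch, not a fitted model: the softplus nonlinearity, the function names, and the toy filter are assumptions chosen for illustration.

```python
import numpy as np

def softplus(x):
    """Smooth rectifier, log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)

def ln_model_rate(stimulus, rf_filter, nonlinearity=softplus):
    """Firing rate of an LN model.

    stimulus:  (n_frames, n_pixels) array
    rf_filter: (n_lags, n_pixels) spatiotemporal filter; the last row weights
               the most recent frame
    returns:   (n_frames - n_lags + 1,) rate, one value per frame that has a
               full stimulus history
    """
    stimulus = np.asarray(stimulus, dtype=float)
    rf_filter = np.asarray(rf_filter, dtype=float)
    n_frames, n_lags = stimulus.shape[0], rf_filter.shape[0]
    # Stage 1: linear spatiotemporal filtering.
    drive = np.zeros(n_frames - n_lags + 1)
    for k in range(n_lags):
        drive += stimulus[k:n_frames - n_lags + 1 + k] @ rf_filter[k]
    # Stage 2: static nonlinearity mapping the filter output to a rate.
    return nonlinearity(drive)

# Toy usage with a constant stimulus and an all-ones filter.
rate_dark = ln_model_rate(np.zeros((20, 3)), np.ones((4, 3)))
rate_bright = ln_model_rate(np.ones((20, 3)), np.ones((4, 3)))
```

In practice the filter would come from an estimate such as the STA, and the nonlinearity would be fitted to the measured relation between filter output and spike rate.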

To date, several models have been developed to refine the building blocks of the LN model in order to make the model more powerful. These models include: the LN Poisson model [63], in which after nonlinear operation, a Poisson process is used to determine whether a spike will be generated; and the generalized linear model [68], in which several additional components are included, such as a spike history filter for adaptation and a coupling filter to address the influence of nearby neurons. Recently, there has been an emphasis on models with a few subunit components to mimic upstream nonlinear components; examples include the nonlinear input model [69], in which a few upstream nonlinear filters are included with the assumption that the inputs of the neuron are correlated; the spike-triggered covariance model [64,70,71], in which the covariance of the spike-triggered ensemble is analyzed by means of eigenvector analysis in order to obtain a sequence of filters as a combination of some parts of the RF; the two-layer LN network model [72], in which a cascade process is implemented by two-layer LN models; and the spike-triggered non-negative matrix factorization (STNMF) model [73], in which the orthogonality constraint used in spike-triggered covariance is relaxed to obtain a set of non-orthogonal subunits that is experimentally verified as the bipolar cells in the retina. It has been further demonstrated that STNMF can recover various biophysical properties of upstream bipolar cells, including spatial RFs, temporal filters, transferring nonlinearities, and synaptic connection weights from bipolar cells to ganglion cells. In addition, a subset of spikes contributed by each bipolar cell can be teased apart from the whole spike train of one ganglion cell [74].
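To make the role of the spike history filter concrete, the following sketch simulates a single-neuron generalized linear model with an exponential nonlinearity and Poisson spiking. It is an illustration under stated assumptions rather than the model of Ref. [68]: the coupling filter is omitted, the stimulus drive is taken as already filtered, and the function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_glm(stim_drive, history_filter, bias, dt, rng):
    """Simulate spikes from a GLM with spike-history feedback.

    stim_drive:     (n_bins,) output of the linear stimulus filter
    history_filter: (n_hist,) weights; entry 0 applies to the previous bin
    bias:           log of the baseline rate (spikes per second)
    dt:             bin width in seconds
    returns:        (spikes, rate) with rate in spikes per second
    """
    stim_drive = np.asarray(stim_drive, dtype=float)
    history_filter = np.asarray(history_filter, dtype=float)
    n_bins, n_hist = len(stim_drive), len(history_filter)
    spikes = np.zeros(n_bins, dtype=int)
    rate = np.zeros(n_bins)
    for t in range(n_bins):
        # Recent spikes, most recent first: spikes[t-1], spikes[t-2], ...
        past = spikes[max(0, t - n_hist):t][::-1]
        feedback = np.dot(history_filter[:len(past)], past)
        rate[t] = np.exp(bias + stim_drive[t] + feedback)
        spikes[t] = rng.poisson(rate[t] * dt)  # Poisson spike count in this bin
    return spikes, rate

rng = np.random.default_rng(1)
drive = np.zeros(1000)                         # no stimulus modulation
base_rate = 50.0                               # assumed baseline, spikes/s

# Negative history weights suppress firing right after a spike (refractoriness).
spikes, rate = simulate_glm(drive, [-5.0, -3.0, -1.0], np.log(base_rate), 0.001, rng)

# With an all-zero history filter the model reduces to an LN Poisson model.
spikes_lnp, rate_lnp = simulate_glm(drive, [0.0], np.log(base_rate), 0.001, rng)
```

Setting the history weights to zero recovers the LN Poisson case, which makes explicit that the generalized linear model extends it by adding feedback terms inside the nonlinearity.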

4.2. ANN-based encoding model

In recent years, breakthroughs have been made in using ANNs, such as deep CNNs and PGMs, for numerous practical tasks related to the system identification of visual information [39]. For example, given a large set of visual images that are collected and well-labeled with specific tags, ANNs can exceed human-level performance in object recognition and classification [39]. Various techniques have been developed to visualize the features of images learned by CNNs. However, the way in which a CNN conducts the end-to-end learning of complex natural images makes it difficult to use this method to interpret underlying network structure components [75,76].

Inspired by experimental observation in neuroscience [55,77], a typical deep CNN has a hierarchical architecture with many layers [78]. Of these layers, some have a bank of convolutional filters, such that each convolutional filter serves as a feature detector to extract the important properties of the images [79,80]. Therefore, after training with a large set of images, these convolutional filters can play the same functional role as the neurons in our retina and in other visual systems to encode complex statistical properties of natural images [59]. The shapes of these filters are sparse and localized, and are similar to the RFs of visual neurons.

Therefore, it is reasonable to use a similar ANN-based approach to investigate the central question of neuronal coding in neuroscience [54,81]. In particular, for visual coding, it has been widely accepted that the ventral visual pathway in the brain is a path that starts from the retina and then passes through the lateral geniculate nucleus and the layered visual cortex to reach the inferior temporal part of the cortex. This visual pathway has been suggested as the "what" pathway for the recognition and identification of visual objects. When a CNN is used to model experimental neuroscience data recorded in the neurons of the inferior temporal cortex in monkeys, the neuronal response can be predicted very well [54,82-84]. Therefore, it is possible to relate the biological structure of visual processing in the brain with the network structure components used in CNNs. However, interpreting this relationship is not a straightforward process, since the pathway from the retina to the inferior temporal cortex is complicated [54]. One potentially easier way is to use CNNs to model the early visual system of the brain, and the retina in particular, as discussed above, in which the neuronal organization is relatively simple.

Indeed, a few studies have taken this approach, using CNNs and their variations to model earlier visual systems in the brain, such as the retina [41,42,85-87] and the visual cortical areas V1 [88-92] and V2 [93]. Most of these studies are driven by the assumption that neural responses can be predicted more accurately by using feedforward neural networks, recurrent neural networks, or both. These new approaches increase the level of complexity of system identification, compared with conventional LN models [71]. Some of these studies also examine network components in detail to determine whether and how such components are comparable to the biological structure of neuronal networks [41,42,92].

Fig. 3 [41,74,85] shows a typical setup of a CNN modeling approach for the retina. To understand the fine structure of the RF in the retinal circuit, it is important to understand the filters learned by the CNNs. In contrast to studies in which a population of retinal ganglion cells is used [42,92,94], the model can be simplified from a complicated retinal circuit to a simple network model, as shown in Fig. 3(a); this makes it easier to refine the model of the structural components at the single-cell level of the retina. Indeed, it has been found that CNNs can learn to adjust their internal structural components to match the biological neurons of the retina [42,85], as illustrated in Fig. 3(d).
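The simplified single-cell model of Fig. 3(a), in which several subunits each apply a linear filter and a nonlinearity before their outputs are pooled, rectified, and sampled as spikes, can be sketched in a few lines. This is only an illustrative sketch and not the implementation used in Refs. [41,74,85]: the random filters, the use of a rectifier for both nonlinearities, the equal pooling weights, and the Poisson spike sampling are all assumptions made here for concreteness, as are the function and variable names.

```python
import numpy as np

def relu(x):
    """Rectifying nonlinearity (an assumed choice for this sketch)."""
    return np.maximum(x, 0.0)

def subunit_model_rate(stimulus, subunit_filters, pooling_weights, gain=1.0):
    """Firing rate of one model ganglion cell driven by several subunits.

    stimulus:        (n_frames, n_pixels) array of flattened images
    subunit_filters: (n_subunits, n_pixels) array, one linear RF per subunit
    pooling_weights: (n_subunits,) array of weights used to pool the subunits
    """
    subunit_drive = stimulus @ subunit_filters.T   # linear filtering per subunit
    subunit_out = relu(subunit_drive)              # subunit nonlinearity
    pooled = subunit_out @ pooling_weights         # pooling across subunits
    return gain * relu(pooled)                     # output nonlinearity

def sample_spikes(rate, rng):
    """Draw spike counts per frame from the rate (Poisson sampling is assumed)."""
    return rng.poisson(rate)

rng = np.random.default_rng(0)
n_pixels, n_subunits, n_frames = 64, 5, 200

# Hypothetical filters: random vectors stand in for bipolar-cell RFs.
subunit_filters = rng.normal(size=(n_subunits, n_pixels)) / np.sqrt(n_pixels)
pooling_weights = np.ones(n_subunits) / n_subunits
stimulus = rng.normal(size=(n_frames, n_pixels))

rate = subunit_model_rate(stimulus, subunit_filters, pooling_weights)
spikes = sample_spikes(rate, rng)
```

Pairs of `stimulus` and `spikes` generated in this way are the kind of input-output data on which a CNN such as the one in Fig. 3(b) could be trained, so that its learned filters can be compared with the known subunit filters.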

Given that the retina has a relatively clear and simple circuit, and that the eyes have (almost) no feedback connections from the cortex, it is reasonable to model this system as a feedforward neural network, similar to the principle of the CNN. However, inhibitory neurons, such as horizontal cells and amacrine cells, certainly play a role in the functioning of the retina. In this sense, neural networks with lateral inhibition and/or recurrent units are desirable [86,94].

4.3. Decoding visual scenes from retinal spikes

For a retinal neuroprosthesis, an ideal encoder model is able to deliver precise stimulation to electrodes for given visual scenes. To achieve this, it is necessary to find an ideal decoder model that can read out and reconstruct the stimuli of visual scenes from neuronal responses.

The reconstruction of visual scenes by means of algorithms has been studied for many years. Neuronal signals of interest include human brain activity measured by functional magnetic resonance imaging (fMRI) [95-98], neuronal spikes in the retina [99-102] and lateral geniculate nucleus [103], and neuronal calcium imaging data in V1 [104]. However, the decoding performance of current methods is rather low for natural scenes, whether static natural images or dynamic videos. A particularly interesting example of videos reconstructed from fMRI data can be found in Ref. [97].

For a retinal neuroprosthesis, one would expect to decode visual scenes by using the spiking responses of a population of ganglion cells. The decoding of visual scenes is possible when there are enough retinal ganglion cells available, as shown in a recent study with simulated retinal ganglion cells [100]. However, it is unclear whether it is possible to use experimental data to achieve this aim. This decoding approach can be described as a spike-image decoder that performs an end-to-end training process from neuronal spikes to visual scenes.

We recently developed such a decoder based on a deep neural network. Our decoder achieves much better resolution than previous studies in reconstructing real-life visual scenes, including both static images and dynamic videos, from the spike trains of a population of retinal ganglion cells recorded simultaneously [105].

The workflow of the spike-image decoder is illustrated in Fig. 4 [105,106]. With a multi-electrode array setup, a large population of retinal ganglion cells can be recorded simultaneously, and their spikes can be extracted. Next, a spike-image converter is used to map the spikes of every ganglion cell to images at the pixel level. After that, an autoencoder deep learning neural network is applied to transform the spike-based images into the original stimulus images. In essence, this approach has two stages: spike-image conversion and image-image autoencoding. Most previous studies have focused on the first stage, and involve optimizing a traditional decoder by means of statistical models and/or ANN-based models in either a linear or nonlinear fashion [95-103]. A recent study trained a separate CNN autoencoder as the second stage in order to enhance the quality of the images [100]. In contrast, we found that better quality can be achieved by means of an end-to-end training process that includes both the spike-to-image conversion stage and the image-to-image autoencoding stage. Nevertheless, the detailed architecture of the networks used in these two stages could be further optimized with other deep learning neural networks to achieve even better quality.
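To make the two-stage idea concrete, the following toy sketch trains a spike-image converter and an image-image autoencoder jointly, with a single reconstruction loss whose gradient flows through both stages. It is a hedged illustration and not the architecture of Ref. [105]: the dense layers (rather than convolutional ones), the layer sizes, the synthetic data, the mean squared error loss, and plain gradient descent are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cells, n_pixels, n_hidden = 200, 30, 16, 8

# Toy data (assumption): images are random vectors, and each cell's spike
# count is Poisson with a rate given by a rectified random projection.
images = rng.normal(size=(n_samples, n_pixels))
rf = rng.normal(size=(n_pixels, n_cells)) / np.sqrt(n_pixels)
spikes = rng.poisson(np.maximum(images @ rf, 0.0) * 5.0).astype(float)
spikes = (spikes - spikes.mean(axis=0)) / (spikes.std(axis=0) + 1e-8)

# Stage 1: spike-image converter, a dense map from cells to pixels.
W_conv = rng.normal(size=(n_cells, n_pixels)) * 0.2
# Stage 2: image-image autoencoder, a dense encoder and a dense decoder.
W_enc = rng.normal(size=(n_pixels, n_hidden)) * 0.2
W_dec = rng.normal(size=(n_hidden, n_pixels)) * 0.2

def forward(s):
    """Run both stages and return the intermediate and final images."""
    prelim = s @ W_conv                  # pixel-level preliminary image
    hidden = np.tanh(prelim @ W_enc)     # bottleneck code of the autoencoder
    recon = hidden @ W_dec               # reconstructed stimulus image
    return prelim, hidden, recon

learning_rate = 0.02
losses = []
for step in range(500):
    prelim, hidden, recon = forward(spikes)
    err = recon - images
    losses.append(np.mean(err ** 2))     # one loss for the whole pipeline

    # Backpropagate the single loss through stage 2 and then stage 1.
    g_recon = 2.0 * err / err.size
    g_dec = hidden.T @ g_recon
    g_pre = (g_recon @ W_dec.T) * (1.0 - hidden ** 2)
    g_enc = prelim.T @ g_pre
    g_conv = spikes.T @ (g_pre @ W_enc.T)

    W_dec -= learning_rate * g_dec
    W_enc -= learning_rate * g_enc
    W_conv -= learning_rate * g_conv

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Training the second stage separately would correspond to freezing `W_conv` after fitting it on its own; in the end-to-end variant shown here, the preliminary image is free to take whatever form best serves the final reconstruction.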

Fig. 3. Encoding visual scenes by means of a simplified biophysical model with the CNN approach. (a) Simplification of retinal circuitry to a biophysical model: (top) A feedforward network is represented as part of the retinal circuitry that receives incoming visual scenes and sends out spike trains from ganglion cells; (middle) a minimal network with one ganglion cell and five bipolar cells; (bottom) a biophysical model with five subunits representing five bipolar cells, where each has a linear filter as the RF, and a nonlinearity. The outputs of the five subunits are pooled and rectified by another output nonlinearity. The final output can be sampled to give a spike train. (b) A representative CNN model trained with images as input and spikes as output. Here, there are two convolutional layers and one dense layer. (c) After training, the CNN model shows the same RF as the biophysical model of the ganglion cell. (d) The convolutional filters after training resemble the RFs used by the biophysical model of the bipolar cells in part (a). (a) is reproduced from Ref. [74], and (b-d) are reproduced from Refs. [41,85].

Fig. 4. Decoding visual scenes from neuronal spikes. (Top) Workflow of decoding visual scenes. Here, a video of a salamander swimming was presented to a salamander retina in order to elicit sequences of spikes from a population of ganglion cells. The population of spike trains is used to train a spike-image decoder to reconstruct the same video. RFs of ganglion cells are mapped onto the image. Each colored circle is an outline of an RF. (Bottom) A spike-image decoder is an end-to-end decoder with two stages: spike-image conversion, which is used to map a population of spikes to a pixel-level preliminary image; and image-image autoencoding, which maps every pixel to the target pixels in the desired images. Note that the spike-image decoder has no unique architecture, and a state-of-the-art model could be adopted and optimized. The exact preliminary images depend on the loss functions used for training. Details of the decoding process can be found in Ref. [105] and online. The data presented in this figure are publicly available online [106]. RGC: retinal ganglion cell.

5. Modeling the retina with SNNs and PGMs

SNNs are viewed as the third generation of ANN models; they use neuronal spikes for computation, as the brain does [107]. In addition to neuronal and synaptic states, the importance of spike timing is considered in SNNs. It has been demonstrated that SNNs are computationally more powerful than other ANNs with the same number of neurons [107]. In recent years, SNNs have been widely studied in a number of research areas [108-110]. In particular, recent studies have shown that SNNs can be combined with a deep architecture of multiple layers in order to obtain performance similar to or better than that of ANNs [111-115]. The spiking feature of SNNs is particularly important for the next generation of neuromorphic computer chips [116,117].

The computational capability of a single neuron is limited. However, when a population of neurons is connected to form a network, the computational ability of the connected neurons can be greatly expanded. In the language of graphs [118], an SNN can be denoted as a graph G = (V, E), where V represents the set of neurons and E ⊂ V × V represents the set of synapses. Given this equivalence between graphs and neural networks, a different approach known as PGMs has been intensively studied in recent years. Both ANNs and SNNs are traditionally modeled as deterministic dynamical systems, as exemplified by the classical Hodgkin-Huxley model [119]. However, the computational principles used in the brain seem to go beyond this viewpoint [57], leading to the use of PGMs.
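The graph description G = (V, E) maps directly onto a simulation. The sketch below builds a weight matrix from a small set of directed synapses and runs a deterministic, discrete-time leaky integrate-and-fire (LIF) network on it. The LIF neuron is used here only as a simple stand-in for more detailed models such as the Hodgkin-Huxley model; the example graph, the weights, the time constants, and the function name `simulate_lif` are all illustrative assumptions.

```python
import numpy as np

# Graph G = (V, E): V is the set of neurons and E the set of directed synapses.
V = [0, 1, 2, 3]
E = {(0, 1): 1.2, (1, 2): 1.2}   # (pre, post): synaptic weight (hypothetical)

n_neurons = len(V)
W = np.zeros((n_neurons, n_neurons))
for (pre, post), weight in E.items():
    W[post, pre] = weight        # row = postsynaptic, column = presynaptic

def simulate_lif(W, external_input, n_steps, dt=1.0, tau=10.0,
                 v_th=1.0, v_reset=0.0):
    """Deterministic discrete-time LIF network; returns a spike raster."""
    n = W.shape[0]
    v = np.zeros(n)
    prev_spikes = np.zeros(n)
    raster = np.zeros((n_steps, n), dtype=int)
    for t in range(n_steps):
        # Leaky integration of external input plus synaptic kicks from
        # the spikes emitted in the previous time step.
        v = v + (dt / tau) * (-v + external_input) + W @ prev_spikes
        fired = v >= v_th
        v[fired] = v_reset       # reset the neurons that crossed threshold
        prev_spikes = fired.astype(float)
        raster[t] = fired
    return raster

# Only neuron 0 receives external drive; neuron 3 has no incoming synapse.
external_input = np.array([1.5, 0.0, 0.0, 0.0])
raster = simulate_lif(W, external_input, n_steps=100)
print(raster.sum(axis=0))        # spike counts per neuron
```

In this toy graph, activity propagates along the chain 0 → 1 → 2 with a delay of one time step per synapse, while the unconnected neuron 3 stays silent; identical inputs always produce identical rasters, which is the deterministic viewpoint that the probabilistic models discussed next move beyond.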

An increasing volume of neuroscience evidence indicates that humans and monkeys (and other animals as well) can represent probabilities and implement probabilistic computation [120-122]; thus, the perspective of the probabilistic brain is increasingly recognized [123]. Therefore, it is reasonable to employ a network of spiking neurons to implement probabilistic inference at the neural circuit level [123]. Increasing research interest has focused on the combination of SNNs and probabilistic computation in order to both understand the principles of brain computation and solve practical problems with these brain-inspired principles.

Probabilistic inference studied in the framework of PGMs traditionally combines probability theory and graph theory. The core idea of PGMs is to take advantage of a graph to represent the joint distribution among a set of variables, where each node corresponds to a variable and each edge corresponds to a direct probabilistic interaction between two variables. With the benefit of a graph structure, a complex distribution over a high-dimensional space can be factorized into a product of low-dimensional local potential functions. PGMs can be divided into directed graphical models, such as Bayesian networks, and undirected graphical models, such as MRFs. Bayesian networks can represent causality between variables, so they are often used to model the processes of cognition and perception, while MRFs can represent a joint distribution by a product of local potential functions.
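For reference, the two factorizations mentioned above can be written in their standard textbook form. The notation is an assumption introduced here and does not appear in the original text: x_1, ..., x_n are the variables, pa(x_i) denotes the parents of x_i in the directed graph, C is the set of cliques of the undirected graph, ψ_c is the local potential function on clique c, and Z is the normalizing constant.

```latex
% Bayesian network (directed): product of local conditional distributions
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

% MRF (undirected): normalized product of local potential functions
p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c),
\qquad
Z = \sum_{x_1, \dots, x_n} \prod_{c \in C} \psi_c(x_c)
```

In both cases, each factor depends only on a small subset of the variables, which is what makes a high-dimensional joint distribution tractable to represent.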

The aim of implementing PGMs with SNNs is to explain how neuronal spikes can implement probabilistic inference. Inference in SNNs involves two main questions, related to probabilistic coding and to probabilistic inference, respectively: ① How do the neural activities of a single cell or a population of cells (such as the membrane potential and spikes) encode a probability distribution? and ② How do the dynamics of a network of spiking neurons approximate inference with probabilistic coding?

It is clear that probabilistic coding is the precondition of probabilistic inference. Depending on how probability is expressed, probabilistic codes can be divided into three basic types: ① those that encode the probability of each variable in each state, such as the probability code [124], log-probability code [125,126], and log-likelihood ratio code [127,128]; ② those that encode the parameters of a distribution, such as a probabilistic population code that takes advantage of neural variability [129-131] (i.e., neural activities in response to a constant stimulus have a large variability, which suggests that the population activities of neurons can encode distributions automatically); and ③ those that consider neural activities to be samples from a distribution [132,133], which has been suggested by numerous experiments [134-137].

According to these coding principles, there are different ways to implement inference with a network of neurons: ① Inference can be implemented with neural dynamics using equations that are similar to the inference equations of some PGMs over the time course [125,126,128,138-140]. This approach is mainly suitable for small-scale SNNs. ② Inference can be implemented with neural variational approximations; this is suitable for describing the dynamics of a large-scale SNN directly [53,56,141-148]. ③ Inference can be implemented with probabilistic population coding and some neurally plausible operations, including summation, multiplication, linear combination, and normalization [149-153]. ④ Inference can be implemented with neural sampling over time, where noise, such as the stochastic neural response found in experimental observations [154,155], is the key to neural sampling and inference [156-160]. Similarly, it is possible to perform sampling by using a large number of neurons to sample from a distribution at the same time [153,161-163], as it has been found that the states of neurons in some areas of the brain follow special distributions [164,165].
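The sampling-based view in ④ can be illustrated with a small example in which noisy binary units are updated one at a time, and the fraction of time each unit spends in the active state approximates its marginal probability. The sketch uses textbook Gibbs sampling on a three-unit Boltzmann distribution as a stand-in; it is not the spiking dynamics of the cited studies, and the weights `W`, biases `b`, and function names are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical model: p(z) is proportional to exp(0.5 * z^T W z + b^T z),
# with binary z, a symmetric weight matrix W, and a zero diagonal.
W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
b = np.array([0.2, -0.3, 0.1])

def exact_marginals(W, b):
    """Marginals p(z_k = 1) obtained by enumerating all binary states."""
    states = np.array(list(itertools.product([0, 1], repeat=len(b))), dtype=float)
    log_p = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ b
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return p @ states

def gibbs_marginals(W, b, n_sweeps, burn_in, rng):
    """Estimate the same marginals from the activity of noisy binary units."""
    z = np.zeros(len(b))
    counts = np.zeros(len(b))
    for sweep in range(n_sweeps + burn_in):
        for k in range(len(b)):
            u = b[k] + W[k] @ z                    # input to unit k (W[k, k] = 0)
            z[k] = float(rng.random() < sigmoid(u))  # stochastic on/off decision
        if sweep >= burn_in:
            counts += z
    return counts / n_sweeps

rng = np.random.default_rng(0)
exact = exact_marginals(W, b)
estimate = gibbs_marginals(W, b, n_sweeps=20000, burn_in=1000, rng=rng)
print("exact:   ", np.round(exact, 3))
print("sampled: ", np.round(estimate, 3))
```

The randomness in each unit's decision is what lets the network visit states in proportion to their probability; removing it would turn the procedure into a deterministic update that settles into a single state.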

The studies described above were mostly conducted in an abstract way in order to model the neural computation of the cortex, including the visual cortex. We suggest that these computational techniques can be transferred to study retinal computation. Fig. 5 [53,166,167] shows some examples in the retina where there is a similarity at the network level between a network of photoreceptors connected by gap junctions (Fig. 5(a)), an MRF model (Fig. 5(b)), and the implementation of an MRF by a network of spiking neurons consisting of clusters of WTA microcircuits (Fig. 5(b)). As illustrated in Fig. 2, massive gap junctions play a functional role as recurrent connections between retinal neurons. A recent study shows that a network of rod photoreceptors with gap junctions can denoise images, and that the result can be further enhanced by an additional CNN, as shown in Fig. 5(c). It was found that this CNN, which included photoreceptors, in contrast to traditional CNNs, could achieve state-of-the-art performance in denoising [166]. Similarly, PGMs have been used to denoise images [168]. Recently, it was shown that PGMs can be implemented by SNNs for various types of computations [53,163,169-172]; thus, a similar denoising performance can be achieved by using SNNs [167], as illustrated in Fig. 5(d).
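As a point of reference for MRF-based denoising, the sketch below restores a noisy binary image with a pairwise MRF in which neighboring pixels prefer to agree and each pixel prefers to match its observation. It uses iterated conditional modes (ICM), a classical deterministic algorithm, as a simple stand-in; it is not the spiking implementation of Ref. [167] or the model of Ref. [168], and the test image, noise level, and parameters `beta` and `eta` are assumptions.

```python
import numpy as np

def icm_denoise(noisy, beta=1.0, eta=1.5, n_sweeps=5):
    """Denoise a {-1, +1} image with a pairwise MRF using ICM.

    Energy (assumed form): -beta * sum over neighbors of x_i * x_j
                           - eta * sum over pixels of x_i * y_i,
    where y is the noisy observation and x is the restored image.
    """
    x = noisy.copy()
    height, width = x.shape
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                neighbor_sum = 0
                if i > 0:
                    neighbor_sum += x[i - 1, j]
                if i < height - 1:
                    neighbor_sum += x[i + 1, j]
                if j > 0:
                    neighbor_sum += x[i, j - 1]
                if j < width - 1:
                    neighbor_sum += x[i, j + 1]
                # Choose the state of pixel (i, j) that lowers the energy.
                local_field = beta * neighbor_sum + eta * noisy[i, j]
                x[i, j] = 1 if local_field >= 0 else -1
    return x

rng = np.random.default_rng(0)

# Toy image (assumption): left half dark (-1), right half bright (+1).
clean = -np.ones((32, 32), dtype=int)
clean[:, 16:] = 1

# Flip each pixel independently with probability 0.1.
flip = rng.random(clean.shape) < 0.1
noisy = np.where(flip, -clean, clean)

denoised = icm_denoise(noisy)
noisy_errors = int(np.sum(noisy != clean))
denoised_errors = int(np.sum(denoised != clean))
print(f"wrong pixels: {noisy_errors} before, {denoised_errors} after")
```

The neighbor coupling plays a role loosely analogous to the lateral connections in Fig. 5(a) and (b): each unit's state is pulled toward that of its neighbors, which suppresses isolated noise while preserving large uniform regions.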

PGMs have been intensively studied and used for visual coding, but are mostly used to model cortical processes [56]. The results discussed in this article suggest that it is possible to study visual computation in the retina by combining several approaches into a systematic framework, including classical PGMs, nontrivial retinal circuit structures (gap junctions in particular), and recent efforts regarding the implementation of PGMs by SNNs. Future work is needed to study this framework with more inspiration from the rich network structure of the retina, including the recurrent neural network, WTA circuit, and feedforward neural network, along with other ubiquitous motifs of cortical microcircuits [37].

6. Discussion

Fig. 5. Implementation of noise reduction computation with retinal photoreceptors, a PGM, and a spiking neural network. (a) A network of rod photoreceptors connected by gap junctions. (b) A graph of an MRF represented by a network of spiking neurons with subnetworks as WTA circuits. Each variable of the MRF is represented by one WTA neural network. (c) Noisy images can be denoised by means of a photoreceptor network, and then enhanced by a CNN. (d) Noisy images can be denoised by an MRF implemented by a recurrent spiking neural network without enhancement. In the MRF models shown in (b) and (d), x_i are variables, each represented by a WTA circuit. (a) and (c) are reproduced from Ref. [166], (b) is reproduced from Ref. [53], and (d) is reproduced from Ref. [167].

Neuroprostheses are promising medical devices within the framework of precision medicine. Because they interact directly with the brain of each individual patient, advancing neuroprostheses requires better computational algorithms for neuronal signals in addition to better hardware designs. The major difficulty in developing the computational capability of the retinal neuroprosthesis is the need to track the complexity of spatiotemporal visual scenes.

In contrast to other neuroprostheses, for which the incoming signals are in a low-dimensional space (such as the moving trajectory of the body's arms or legs in three-dimensional space, or an auditory signal in a one-dimensional frequency space), visual scenes are extremely complex and contain information in a spatiotemporal fashion. Recent advancements in computer vision have resulted in breakthroughs in the analysis of these complex natural scenes, which have raised artificial intelligence to a higher level than ever before.

Given the experimental advancements that have been made in neuroscience, it is now possible to record a large population of neurons simultaneously. In particular, in the retina, a population of spike trains from hundreds of retinal ganglion cells can be recorded as a result of exposing the retina to well-controlled visual scenes, such as images and videos [173]. The newest technique can record several thousand neurons simultaneously [174-176]. This technology opens up a way to study the encoding and decoding of visual scenes by using enough spikes to achieve superb resolution.

Implants with electrodes are the most common of the current approaches for retinal neuroprostheses, and have been used in clinical trials. However, very few computational models are embedded in such retinal prostheses [10,13,177]. With an encoder embedded in the retinal prosthesis, it is possible to process incoming visual scenes in order to better trigger ganglion cells [10,13]. The benefit of using decoding models is to verify the spiking patterns produced by the targeted downstream neurons. Ideally, electrical stimulation should be able to achieve a result that is close to the desired patterns of retinal neural activity in a prosthesis. The traditional approach for comparing the similarity between spiking patterns focuses on how to compute the distance between two spike trains, both in general [178,179] and in the context of the retinal prosthesis [180]. Another way of doing this is to use decoding models to achieve better performance from the neuroprosthesis [10,100,181]. Like other neuroprostheses, in which a closed-loop device can be employed to decode the neuronal signal in order to control the stimulus, the signal delivered by a retinal prosthesis should ideally be able to reconstruct the original stimuli, that is, the dynamic visual scenes that are projected onto the retina. Thus, it is possible to use a decoding model to reconstruct such visual scenes from the spiking patterns produced by the retinal ganglion cells [10,100]. Direct measurement of the precision of spiking patterns with the given decoding model could play the functional role of controlling the electrical stimulation patterns generated by the retinal neuroprosthesis, which is the goal of a better and adjustable neuroprosthesis.
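As an illustration of what a distance between two spike trains can look like, the sketch below implements one widely used measure, the van Rossum distance, in which each spike train is smoothed with a causal exponential kernel and the squared difference of the smoothed traces is integrated over time. This particular measure is chosen here only as an example and is not necessarily the one used in Refs. [178-180]; the time constant `tau`, the time step `dt`, the duration `t_max`, and the example spike times are all assumptions.

```python
import numpy as np

def van_rossum_distance(spikes_a, spikes_b, tau=10.0, dt=0.1, t_max=300.0):
    """Van Rossum distance between two spike trains (times in ms).

    Each train is filtered with a causal exponential kernel exp(-t / tau),
    and the distance is the square root of the integral of the squared
    difference between the filtered traces, divided by tau.
    """
    t = np.arange(0.0, t_max, dt)

    def filtered(spike_times):
        trace = np.zeros_like(t)
        for s in spike_times:
            after = t >= s
            trace[after] += np.exp(-(t[after] - s) / tau)
        return trace

    diff = filtered(spikes_a) - filtered(spikes_b)
    return float(np.sqrt(np.sum(diff ** 2) * dt / tau))

# Hypothetical example: a desired pattern and two evoked patterns.
desired = [50.0, 100.0, 150.0]
evoked_close = [52.0, 101.0, 149.0]   # small timing jitter
evoked_far = [20.0, 120.0, 200.0]     # large timing errors

d_close = van_rossum_distance(desired, evoked_close)
d_far = van_rossum_distance(desired, evoked_far)
print(f"close: {d_close:.3f}, far: {d_far:.3f}")
```

The time constant `tau` sets the timing precision that the measure cares about: small values penalize even slight jitter, whereas large values mostly compare overall spike counts.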

In this article, we focused on the issue of computational modeling for just one type of retinal neuroprosthesis, namely, that with embedded electrodes. Of course, for retinal neuroprostheses as engineering systems, many parallel and difficult issues remain, such as the need for advanced materials, power design, communication efficiency, and other related hardware issues; these issues have been covered in many well-written reviews [13,15,16,18]. It should be noted that there are different types of visual implants, including those based on light stimulation of the retina, such as optogenetics and chemical photoswitches, as well as implants in other parts of the brain beyond the retina. The computational issues raised in this paper are also relevant to general visual prostheses. In addition to artificial visual implants, another line of research focuses on retinal repair by means of the biological manipulation of stem cells, such as induced pluripotent stem cells [182-184]; in this context, understanding the computational mechanisms of the biological neurons and neuronal circuits is more relevant for encoding visual scenes. For these applications, more effort may be needed to include the biological principles found in the retina in potential decoding models [34].

Given these advancements in neuroscience experiments and prosthesis engineering, it is now time to advance our understanding of visual coding by using retinal spiking data and ANN-based models to obtain better computational algorithms to improve the performance of retinal neuroprostheses. In this article, we reviewed some of the recent progress that has been made in developing novel functional artificial intelligence models for visual computation. Feature-based modeling approaches, such as deep CNNs, have made significant progress in analyzing complex visual scenes. For some particular visual tasks, these models can outperform humans [39]. However, the levels of efficiency, generalization ability, and adaptation or transfer learning between different tasks in well-trained models are still far from a human level of performance [55]. Sampling-based modeling with neuronal spikes has emerged as a new approach that takes advantage of many factors of the neuronal system of the brain [57], such as noise at the level of single neurons and synapses [52,157,160]. With the generic benefit of pixel representation of visual scenes, sampling models can be easily used for various types of visual computation [168]. However, the efficiency of the learning algorithms used in sampling models is still far from the flexibility of the brain's neuronal system [185]. Nevertheless, these two approaches could be combined to exploit the advantages of both feature-based and sampling-based modeling for visual computation. To achieve this, it is necessary to consider the retina as a neuronal network in which visual computation can be performed by different functional network structures.

In the future, more work is needed to combine various network motifs into a hybrid network, in which different visual information can be extracted, processed, and computed. Such hybrid or hypercircuit networks have only been explored very recently; in particular, the WTA network motif has been shown to be a functional module within a more complex hypercircuit network model for various types of computations [52,53,110,186]. We expect that more studies will align with this research direction in the future.

The modeling framework described in this paper is not limited to application to the retina; it could also be applied to other visual systems in the brain, and to other artificial visual systems. The main feature of these algorithms is that they make use of neural spikes. Recent advancements in artificial intelligence computing align with the development of the next generation of neuromorphic chips and devices, in which the new data format is processed as spikes or events [187-191]. Therefore, these methods can be applied to neuromorphic visual cameras with spike or event signals as well. These computational retinal models can be used to simulate a population of spikes for the encoding and decoding of any given visual scene, including static natural images, dynamic videos, and even real-time videos captured by a standard frame-based camera [105]. By combining neuromorphic hardware with event/spiking computing algorithms, the next generation of computational vision will develop a better system for artificial vision that extends beyond the purpose of retinal neuroprostheses. Therefore, we believe that rich interactions between artificial intelligence, computer vision, neuromorphic computing, neuroscience, bioengineering, and medicine will be important in advancing our understanding of the brain and developing the next generation of retinal neuroprosthesis for an artificial vision system. The algorithm part of the artificial eye, including the models for encoding and decoding real-life visual scenes, will be particularly crucial for such a systems-level approach.

Acknowledgements

This work is supported by the National Basic Research Program of China (2015CB351806); the National Natural Science Foundation of China (61806011, 61825101, 61425025, and U1611461); the National Postdoctoral Program for Innovative Talents (BX20180005); the China Postdoctoral Science Foundation (2018M630036); the International Talent Exchange Program of Beijing Municipal Commission of Science and Technology (Z181100001018026); the Zhejiang Lab (2019KC0AB03 and 2019KC0AD02); and the Royal Society Newton Advanced Fellowship (NAF-R1-191082).

Compliance with ethics guidelines

Zhaofei Yu, Jian K. Liu, Shanshan Jia, Yichen Zhang, Yajing Zheng, Yonghong Tian, and Tiejun Huang declare that they have no conflict of interest or financial conflicts to disclose.