# Statistical analysis of solar radiation datasets for P50, P90 and P99 conditions

Year-to-year variation in the output of a solar power system is an important input to a project’s financial analysis. Of greatest financial interest is the exceedance probability: the probability that the facility will meet or exceed a specified level of generation in a given year. For example, the generation level that a system has a 90% probability of exceeding is denoted P90, and its value is determined from the probability density of the predicted power generation. The predicted power generation is directly related to the interannual variation in incident solar radiation. It is therefore important to have a long-term solar irradiance dataset that covers the range of annual incident solar radiation at the location of the facility and from which the probability of occurrence of different solar radiation levels can be estimated.
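As a minimal sketch of the exceedance idea, the snippet below reads P50, P90 and P99 directly off a sorted list of annual values: the value exceeded with probability p is the (1 - p) quantile. The function name `exceedance_value` and the annual energy figures are illustrative assumptions, not values from any real project, and linear interpolation between sorted values is only one of several common quantile conventions.

```python
def exceedance_value(annual_values, probability):
    """Return the value met or exceeded with the given probability (0 to 1).

    P90 corresponds to probability=0.90, which is the 10th percentile of
    the annual values. Linear interpolation is used between sorted values.
    """
    if not annual_values:
        raise ValueError("need at least one annual value")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    ordered = sorted(annual_values)
    # Position of the (1 - probability) quantile within the sorted list.
    rank = (1.0 - probability) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


# Hypothetical annual generation in MWh, invented for illustration only.
annual_energy_mwh = [1510, 1475, 1532, 1498, 1389, 1521, 1540, 1466,
                     1503, 1517, 1482, 1529, 1495, 1508, 1471]

for label, p in (("P50", 0.50), ("P90", 0.90), ("P99", 0.99)):
    print(label, round(exceedance_value(annual_energy_mwh, p), 1))
```

With only 15 values the P99 estimate sits almost on the single worst year, which is why short records are usually extended before the tail values are read off.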

1. Purpose of P50, P90 and P99

Figure 1 is a histogram of 43 years of historical global horizontal irradiance (GHI) data for Phoenix, Arizona, 1961-2008. The dataset combines ground-based modeled data from the NSRDB (“1991-2005 NSRDB Database (Updated Edition) User Manual”, published in April 2007) and satellite-modeled data from the SolarAnywhere dataset (source: https://www.solaranywhere.com/Public/About.aspx).

Figure 1 – Histogram of the historical interannual variation of GHI in Phoenix, Arizona, with 43 years of observational data and different distribution functions fitted to the data

For probability statistics, even 43 years of data are of limited value. When only a smaller dataset is available, it must be expanded using sound mathematical methods before the required estimates can be calculated. The small dataset serves as the basis for a much larger one through a process called bootstrapping, and the P50 and P90 values can then be calculated from the larger dataset. Bootstrapping generates thousands of values from a small dataset, assuming that the data follow a prescribed distribution, which makes statistical analysis of small datasets feasible (Efron and Tibshirani 1993).

The bootstrapping method requires not only an initial dataset from which multiple datasets are generated, but also basic information on the type of distribution the data follow. This is especially true for solar radiation, which does not follow a Gaussian (normal) distribution but a more skewed one. Bootstrapping models use this information to generate datasets large enough to derive statistically reliable estimates of P50, P90, and P99. The larger the initial dataset, the better the bootstrap model. Because the initial dataset should cover years in which extreme events occurred, it is critical that it include at least one year in which volcanic aerosols affected the solar resource. If no extreme events are included, the resulting bootstrapped dataset can only be considered to represent a subset of possible events. This has little effect on the P50 value, but it can cause the P90 or P99 value to overestimate production in the less productive years that may occur during the life of the system.
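The resampling idea can be sketched as follows. This is a simplified illustration, not the procedure used for Figure 1: it uses plain resampling with replacement (the non-parametric bootstrap) rather than the distribution-based generation described above, and the function names, the seed, and the annual GHI totals are all assumptions made up for the example. One deliberately low year is included to stand in for a volcanic-aerosol year.

```python
import random
import statistics


def percentile(values, fraction):
    """Linearly interpolated quantile of values, with fraction in [0, 1]."""
    ordered = sorted(values)
    rank = fraction * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def bootstrap_exceedance(annual_values, probability, n_resamples=5000, seed=0):
    """Estimate an exceedance value (e.g. P90) and its spread by resampling.

    Each resample draws len(annual_values) years with replacement and
    records the (1 - probability) quantile of that resample. Returns the
    mean and standard deviation of those estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(annual_values)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(annual_values, k=n)
        estimates.append(percentile(resample, 1.0 - probability))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical annual GHI totals in kWh/m^2, invented for illustration.
# The value 1985 plays the role of a year affected by volcanic aerosols.
annual_ghi = [2105, 2088, 2121, 2096, 1985, 2110, 2130, 2079,
              2101, 2115, 2092, 2124, 2099, 2108, 2083]

for label, p in (("P50", 0.50), ("P90", 0.90), ("P99", 0.99)):
    mean, spread = bootstrap_exceedance(annual_ghi, p)
    print(f"{label}: {mean:.0f} +/- {spread:.0f} kWh/m^2")
```

Removing the low year from `annual_ghi` and rerunning shows the effect discussed above: the P50 estimate barely moves, while the P90 and P99 estimates rise.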

Figure 1 shows the best-fit probability distribution functions (pdfs) for the historical GHI dataset, generated using the normal distribution, the Wakeby distribution (Rao and Hamed 1999), and kernel density estimation (KDE) (Sheather et al. 1991). The Wakeby distribution can be selected using statistical techniques that determine how well different distribution types fit the underlying data.
Because KDE can quickly define a probability distribution function for an arbitrary dataset, whether or not it corresponds to a common statistical distribution, it is a convenient way to characterize variation in the solar resource.
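To illustrate how a KDE turns a handful of annual values into a smooth distribution, here is a minimal Gaussian-kernel sketch using only the standard library. It is an assumption-laden example: the bandwidth comes from Silverman's rule of thumb rather than the selection method of the cited reference, the function names are hypothetical, and the GHI totals are invented placeholders.

```python
import math
import statistics


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth for a Gaussian kernel (Silverman's rule)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def kde_pdf(x, samples, bandwidth):
    """Density estimate at x: average of Gaussian bumps centred on samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)


def kde_cdf(x, samples, bandwidth):
    """Cumulative probability at x: average of the Gaussian kernel CDFs."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((x - s) / scale))
               for s in samples) / len(samples)


def kde_exceedance_value(samples, probability, bandwidth=None, tol=1e-6):
    """Value exceeded with the given probability under the KDE (bisection)."""
    h = bandwidth if bandwidth is not None else silverman_bandwidth(samples)
    target = 1.0 - probability  # P90 is where the CDF reaches 0.10
    lo = min(samples) - 10.0 * h
    hi = max(samples) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical annual GHI totals in kWh/m^2, invented for illustration.
annual_ghi = [2105, 2088, 2121, 2096, 1985, 2110, 2130, 2079,
              2101, 2115, 2092, 2124, 2099, 2108, 2083]

for label, p in (("P50", 0.50), ("P90", 0.90), ("P99", 0.99)):
    print(label, round(kde_exceedance_value(annual_ghi, p), 1))
```

Because every sample contributes its own kernel, an isolated low year shows up as a small secondary bump in the density instead of being averaged away, which is one reason KDE copes with data that no single standard distribution fits well.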

It is important to note that no distribution function exactly matches the distribution of the data. The reason is that the data are incomplete or include non-standard events, such as years affected by volcanic eruptions. A combination of two different distributions may better represent data that include non-standard years.

In practice, bootstrapping should be applied to the predicted annual energy performance rather than to the solar radiation data. This reduces the number of simulation runs required to evaluate system performance, and it accounts for the fact that the relationship between the energy incident on solar collectors or PV panels and the predicted production is not exactly linear.

2. Long-term data requirements

The dataset needs to span enough years to yield statistically sound and reliable information. It should cover the full range of possible events, and it must be long enough to adequately represent the distribution of the data. Climatological normals are computed from 30 years of data, while extremes and variability are derived from the complete data record. For solar radiation data, 30 years are needed to establish a normal with a high confidence level. Sometimes, however, only shorter datasets of 8 or 15 years are available.

In conclusion, three elements are required to estimate the P50, P90, and P99 exceedance values in a statistically sound way:

- 10-15 years of data, which is enough for the bootstrapping model to generate statistically reliable and reasonable results. The longer the dataset, the better.
- Data for at least one year affected by an extreme event, such as reduced solar irradiance due to a volcanic eruption.
- Knowledge of the typical distribution of annual solar irradiance.