## Portfolio Selection

The portfolio selection problem consists in selecting a set of assets, and the share invested in each asset, so that the investor obtains a minimum required return at minimum risk. One of the main contributions to this problem is the seminal work by Markowitz [Markowitz, 1952], who introduced the so-called mean-variance model, which takes the variance of the portfolio returns as the measure of the investor's risk. According to Markowitz, the portfolio selection problem can be formulated as an optimization problem over real-valued variables with a quadratic objective function and linear constraints.

## Portfolio Selection Instances from Yahoo Finance

On this page we collect some benchmark instances for the Portfolio Selection problem taken from real-world stock market data. The instances were built using the Yahoo Finance website as a data source for historical stock values, which were then processed by some PHP scripts.

We first introduce the precise problem statement, then describe the file format of the problem instances, and finally provide the links to download the data files.

## Problem statement

Following [Markowitz, 1952], we are given a set of n assets, A = {a1, ... , an}. Each asset ai has an associated real-valued expected return (per period) ri, and each pair of assets (ai, aj) has a real-valued return covariance σij. The matrix σ is symmetric and each diagonal element σii represents the return variance of asset ai. A positive value R represents the minimum required (expected) return. The values ri and σij are usually estimated from past data and are relative to one fixed period of time.

### Basic problem

A portfolio is a vector of real values X = (x1, ..., xn) such that each xi represents the fraction invested in the asset ai. The value Σi = 1..n Σj = 1..n σij xi xj represents the variance of the portfolio, and is taken as the measure of the risk associated with the portfolio. Consequently, the problem is to minimize the overall variance while still ensuring the minimum required return R. The formulation of the basic unconstrained problem is thus the following.

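$$
\begin{aligned}
\min\ & F_R(X) = \sum_{i=1}^{n} \sum_{j=1}^{n} \sigma_{ij} x_i x_j \\
\text{subject to}\ & \sum_{i=1}^{n} r_i x_i \ge R \\
& \sum_{i=1}^{n} x_i = 1 \\
& x_i \ge 0 \quad \forall i = 1..n
\end{aligned}
$$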
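
As a minimal sketch of how the objective and the three constraints of the basic formulation can be evaluated for a candidate portfolio, the following Python functions use plain lists. The function names, the tolerance `tol` and the two-asset data are illustrative assumptions and are not part of the benchmark.

```python
def portfolio_variance(x, sigma):
    """Risk of portfolio x: the double sum of sigma[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(sigma[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def portfolio_return(x, r):
    """Expected return of portfolio x: the sum of r[i] * x[i]."""
    return sum(ri * xi for ri, xi in zip(r, x))


def is_feasible(x, r, R, tol=1e-9):
    """Check the return, budget and non-negativity constraints."""
    return (
        portfolio_return(x, r) >= R - tol      # minimum required return
        and abs(sum(x) - 1.0) <= tol           # shares sum to one
        and all(xi >= -tol for xi in x)        # no short positions
    )


# Hypothetical two-asset data, for illustration only.
r = [0.010, 0.004]
sigma = [[0.0040, 0.0010],
         [0.0010, 0.0020]]
x = [0.5, 0.5]

risk = portfolio_variance(x, sigma)
feasible = is_feasible(x, r, R=0.006)
```

A solver would search over all feasible `x` for the one with the smallest `portfolio_variance`; the sketch only evaluates a given candidate.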

Since the minimum required return R can be considered as a parameter of the problem, by solving the problem as a function of R, ranging over a finite and discretized domain, we obtain the so-called unconstrained (Pareto) efficient frontier (UEF), that gives for each return the minimum associated risk.
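
The discretization of R can be sketched as follows. The sketch assumes evenly spaced return levels listed from the maximum down to the minimum, which is consistent with the sample frontier file shown in the File format section; the function name `return_grid` is hypothetical. Each level would then be passed as R to a quadratic programming solver, which is not shown here.

```python
def return_grid(r_min, r_max, samples):
    """Evenly spaced required-return levels, from r_max down to r_min."""
    if samples < 2:
        raise ValueError("at least two samples are needed")
    step = (r_max - r_min) / (samples - 1)
    return [r_max - k * step for k in range(samples)]


# Minimum and maximum return as reported in the comments of the sample
# frontier file for the MIBTEL instance.
grid = return_grid(0.002118880183120193, 0.03175851315349978, 100)
```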

### Short selling

The problem formulation requires all the asset shares to be non-negative. Even though this requirement is a common assumption behind theoretical approaches, it is not enforced in real markets, where short positions (i.e., assets with negative shares, corresponding to speculation on falling prices) are closely intertwined with long positions (i.e., assets with positive shares).

In order to allow short selling we modify the model by removing the inequality xi ≥ 0 and adding an additional asset n + 1 that represents a risk-free investment (e.g., US T-bills/T-bonds). Furthermore, we have to add the following constraints, which depend on market regulations.

$$
\begin{aligned}
x_{n+1} &\ge -\gamma \sum_{i=1}^{n} \min\{0, x_i\} \\
\sum_{i=1}^{n+1} |x_i| &\le 2
\end{aligned}
$$

The first constraint requires a collateral risk-free investment, which covers the investor's position in case the price of the sold asset rises instead of falling. The investment in the risk-free asset must be no less than a proportion γ of the overall sum of the short positions.

The second constraint is imposed by law (US Regulation T) to limit the amount invested in short positions.
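
A minimal sketch of how these two constraints could be checked for a candidate portfolio is given below. It assumes that the portfolio is a list of n + 1 shares whose last entry is the risk-free asset; the function name and the tolerance are illustrative, and the budget and return constraints are not checked here.

```python
def short_selling_feasible(x, gamma, tol=1e-9):
    """Check the collateral and Regulation T constraints.

    x holds n + 1 shares; the last one is the risk-free (collateral) asset.
    """
    *risky, cash = x
    # Total size of the short positions, as a non-negative number.
    short_total = -sum(min(0.0, xi) for xi in risky)
    collateral_ok = cash >= gamma * short_total - tol
    regulation_t_ok = sum(abs(xi) for xi in x) <= 2.0 + tol
    return collateral_ok and regulation_t_ok


# Hypothetical portfolios with two risky assets plus the risk-free one.
ok = short_selling_feasible([0.9, -0.3, 0.4], gamma=1.0)
too_leveraged = short_selling_feasible([1.8, -0.8, 0.0], gamma=0.0)
under_collateralized = short_selling_feasible([1.2, -0.4, 0.2], gamma=1.0)
```

In the first portfolio the short position of 0.3 is covered by 0.4 in the risk-free asset and the absolute shares sum to 1.6. The second violates the limit of 2 on the absolute shares, and the third holds less collateral than its short position requires when γ = 1.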

To determine the UEF in the case of short sales we set the parameter γ to 0, so that no collateral is strictly required (the most extreme situation). As for the return of the collateral asset, we fixed it to the return of the T-bond at the beginning of the period, with the same maturity as the stock-price data taken into account.

Notice that, with these modifications and the introduction of the additional asset n + 1, the constraints of the original formulation now read:

$$
\begin{aligned}
\sum_{i=1}^{n+1} r_i x_i &\ge R \\
\sum_{i=1}^{n+1} x_i &= 1 \\
x_{n+1} &\ge 0
\end{aligned}
$$

Notice also that the risk-free asset can be employed in the UEF of the original problem formulation as well. In this case the share of the risk-free asset corresponds to the decision to keep part of the wealth uninvested.

## File format

The data needed to fully describe an instance of the problem are:

• n, the overall number of assets;
• ri, the expected return of each asset, including the expected return rn + 1 of the risk-free (collateral) asset;
• σij, the covariance matrix.

The file format contains exactly these data, as shown in the following example:

```
# Instance created from Yahoo Finance on 02-08-2007
# Country: Italy
# Index: MIBTEL
# Index_file: ./indexes/it_mibtel.index
# Period: 2001-2006
Cash_return: 0.002112
Number_of_assets: 167
Asset: 1 0.0088812826376279 ACE.MI
Asset: 2 -0.0147325872017824 ACO.MI
Asset: 3 0.0065871978321356 ACS.MI
Asset: 4 0.0109624773142050 AE.MI
Asset: 5 0.0086946666486191 AFI.MI
...
Covariance: 1 1 0.0075622080207523
Covariance: 1 2 0.0051116935557743
Covariance: 1 3 0.0038950532965118
Covariance: 1 4 0.0025220967764908
Covariance: 1 5 0.0032403279431379
...
Covariance: 166 167 0.0018085655370221
Covariance: 167 167 0.0034447930439915
```
The format is as follows:
• The first lines, starting with the character `#`, are comments and can be ignored. They are present only to give a human-readable description of the instance (e.g., the index employed and the time period used to build the instance).
• The line starting with the string `Cash_return:` contains the value of the T-bond/T-bill return with the same maturity as the investment horizon.
• The line starting with the string `Number_of_assets:` contains the value of n, i.e., the overall number of assets (167 in the present case).
• There are exactly n asset lines starting with the string `Asset:`. Each asset line consists of the asset index i, the asset return ri and the Yahoo Finance symbol of that asset (which can be ignored).
• There are n (n + 1) / 2 covariance lines starting with the string `Covariance:`, one for each pair i ≤ j (the diagonal entries σii are included, as in the example above). Each covariance line consists of the two asset indexes i and j and the covariance value σij. Since the matrix σ is symmetric, the line also gives the value of σji.
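
The rules above can be turned into a small reader. The following is a minimal sketch using only the Python standard library; the names `Instance` and `parse_instance` are hypothetical, no validation is performed, and any line that does not start with a known keyword is silently skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    cash_return: float = 0.0
    n: int = 0
    returns: dict = field(default_factory=dict)     # asset index -> return
    symbols: dict = field(default_factory=dict)     # asset index -> symbol
    covariance: dict = field(default_factory=dict)  # (i, j) -> covariance


def parse_instance(text):
    """Parse the textual content of an instance file."""
    inst = Instance()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key == "Cash_return":
            inst.cash_return = float(fields[0])
        elif key == "Number_of_assets":
            inst.n = int(fields[0])
        elif key == "Asset":
            i = int(fields[0])
            inst.returns[i] = float(fields[1])
            inst.symbols[i] = fields[2] if len(fields) > 2 else ""
        elif key == "Covariance":
            i, j = int(fields[0]), int(fields[1])
            value = float(fields[2])
            inst.covariance[(i, j)] = value  # store both orders,
            inst.covariance[(j, i)] = value  # since sigma is symmetric
    return inst


# Toy two-asset excerpt; the "2 2" covariance value is invented.
SAMPLE = """\
# Toy excerpt for illustration
Cash_return: 0.002112
Number_of_assets: 2
Asset: 1 0.0088812826376279 ACE.MI
Asset: 2 -0.0147325872017824 ACO.MI
Covariance: 1 1 0.0075622080207523
Covariance: 1 2 0.0051116935557743
Covariance: 2 2 0.0040000000000000
"""
instance = parse_instance(SAMPLE)
```

The asset indexes are kept 1-based, as in the file, so `instance.covariance[(2, 1)]` returns the value read from the `Covariance: 1 2` line.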

The file format slightly resembles that of the five OR-Library instances by Beasley (available from the author's website), which were the only publicly available instances for the problem at the time this repository was created. The main difference between our format and the OR-Library one is that Beasley's instances provide the variance of each asset and the correlation matrix, which give an indirect way of computing the covariance values. We decided to provide the covariance matrix directly instead.
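
The indirect computation mentioned above relies on the standard identity σij = ρij √σii √σjj, where ρij is the correlation between assets i and j. A minimal sketch follows; the function name and the numbers are illustrative, and if a file stores standard deviations rather than variances the square root step should be dropped.

```python
import math


def covariance_from_correlation(variances, corr):
    """Build the covariance matrix from variances and correlations."""
    n = len(variances)
    std = [math.sqrt(v) for v in variances]
    return [[corr[i][j] * std[i] * std[j] for j in range(n)]
            for i in range(n)]


# Hypothetical two-asset data, for illustration only.
variances = [0.04, 0.09]
corr = [[1.0, 0.5],
        [0.5, 1.0]]
cov = covariance_from_correlation(variances, corr)
```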

To make it possible to compare results against a lower bound, we also provide a discretization of the Efficient Frontier for the unconstrained problem (UEF). In this case the file format is simply a sequence of expected return and variance values, one point per line, as in the following example:

```
# Efficient frontier computed for file it_mibtel-2001-2006-m.sd
# Number of samples: 100
# Maximum return: 0.03175851315349978 (for risk 0.02542564323870561)
# Minimum return: 0.002118880183120193 (for risk 8.682121560859373e-10)
# Return	Risk	Number_of_assets	Running_time	Status
0.03175851315349978 0.02542564323870561 1 0.1622598171234131 Optimal
0.03145912292147575 0.01837203123138026 3 0.2409031391143799 Optimal
0.03115973268945171 0.01330801327560206 3 0.2350549697875977 Optimal
0.03086034245742768 0.009768678805943964 3 0.217911958694458 Optimal
0.03056095222540364 0.00775237509067065 4 0.2399439811706543 Optimal
```
The lines starting with the character `#` are comments that can be ignored. Each point of the frontier is reported on a single line. The values are to be interpreted as follows: return, variance, number of assets included in the portfolio with a share greater than 10⁻⁷, CPU time for computing the solution, and status of the solution (as reported by CPLEX).
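
A frontier file can be read along the same lines. The sketch below assumes whitespace-separated fields in the order just described; the names `FrontierPoint` and `parse_frontier` are hypothetical.

```python
from typing import NamedTuple


class FrontierPoint(NamedTuple):
    ret: float       # expected return
    risk: float      # variance of the portfolio
    assets: int      # number of assets held in the portfolio
    cpu_time: float  # time spent computing the point
    status: str      # solver status


def parse_frontier(text):
    """Parse the textual content of an efficient frontier file."""
    points = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        f = line.split()
        points.append(FrontierPoint(float(f[0]), float(f[1]), int(f[2]),
                                    float(f[3]), " ".join(f[4:])))
    return points


# First two points of the sample frontier file shown above.
SAMPLE = """\
# Return	Risk	Number_of_assets	Running_time	Status
0.03175851315349978 0.02542564323870561 1 0.1622598171234131 Optimal
0.03145912292147575 0.01837203123138026 3 0.2409031391143799 Optimal
"""
frontier = parse_frontier(SAMPLE)
```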

## Problem instances

### New instances

The following problem instances have been built using the Yahoo Finance website as a data source for historical stock values. Each file corresponds to the assets that were part of the given stock index on August 1st, 2007. The data collected were monthly prices in the reference period, and stocks with missing values were removed. Since the composition of the indexes could differ throughout the years, we include only the stock data for instances with more than 30 stable assets. The unconstrained efficient frontiers for the basic problem and for the short-selling one, sampled at 100 points, have been computed using IBM ILOG CPLEX 12.2.

| Instance file | Country | Index | Period | n | Efficient frontier file (basic formulation) | Efficient frontier file (short-selling formulation) |
|---|---|---|---|---|---|---|
| au_all_ordinaries-2001-2006-m.sd | Australia | All ordinaries | 2001-2006 | 264 | au_all_ordinaries-2001-2006-m.efn | au_all_ordinaries-2001-2006-m.ssefn |
| it_mibtel-2001-2006-m.sd | Italy | MIBTEL | 2001-2006 | 167 | it_mibtel-2001-2006-m.efn | it_mibtel-2001-2006-m.ssefn |
| kr_kospi_composite-2001-2006-m.sd | Korea | KOSPI Composite | 2001-2006 | 562 | kr_kospi_composite-2001-2006-m.efn | kr_kospi_composite-2001-2006-m.ssefn |
| uk_ftse_act_250-2001-2006-m.sd | UK | FTSE ACT250 | 2001-2006 | 128 | uk_ftse_act_250-2001-2006-m.efn | uk_ftse_act_250-2001-2006-m.ssefn |
| us_amex_composite-2001-2006-m.sd | USA | AMEX Composite | 2001-2006 | 1893 | us_amex_composite-2001-2006-m.efn | us_amex_composite-2001-2006-m.ssefn |
| us_nasdaq_bank-2001-2006-m.sd | USA | NASDAQ Bank | 2001-2006 | 380 | us_nasdaq_bank-2001-2006-m.efn | us_nasdaq_bank-2001-2006-m.ssefn |
| us_nasdaq_biotech-2001-2006-m.sd | USA | NASDAQ Biotech | 2001-2006 | 130 | us_nasdaq_biotech-2001-2006-m.efn | us_nasdaq_biotech-2001-2006-m.ssefn |
| us_nasdaq_composite-2001-2006-m.sd | USA | NASDAQ Composite | 2001-2006 | 2235 | us_nasdaq_composite-2001-2006-m.efn | us_nasdaq_composite-2001-2006-m.ssefn |
| us_nasdaq_computer-2001-2006-m.sd | USA | NASDAQ Computer | 2001-2006 | 417 | us_nasdaq_computer-2001-2006-m.efn | us_nasdaq_computer-2001-2006-m.ssefn |
| us_nasdaq_financial100-2001-2006-m.sd | USA | NASDAQ Financial100 | 2001-2006 | 91 | us_nasdaq_financial100-2001-2006-m.efn | us_nasdaq_financial100-2001-2006-m.ssefn |
| us_nasdaq_industrial-2001-2006-m.sd | USA | NASDAQ Industrial | 2001-2006 | 808 | us_nasdaq_industrial-2001-2006-m.efn | us_nasdaq_industrial-2001-2006-m.ssefn |
| us_nasdaq_telecom-2001-2006-m.sd | USA | NASDAQ Telecom | 2001-2006 | 139 | us_nasdaq_telecom-2001-2006-m.efn | us_nasdaq_telecom-2001-2006-m.ssefn |
| us_nyse_us100-2001-2006-m.sd | USA | NYSE US100 | 2001-2006 | 94 | us_nyse_us100-2001-2006-m.efn | us_nyse_us100-2001-2006-m.ssefn |
| us_sp500-2001-2006-m.sd | USA | S&P 500 | 2001-2006 | 469 | us_sp500-2001-2006-m.efn | us_sp500-2001-2006-m.ssefn |

### ORLib instances

These instances have the same content as the original OR-Library files, converted to the new format and augmented with the return of the risk-free collateral asset. The unconstrained efficient frontiers for the two formulations have been recomputed, as for the previous instances, in order to include the risk-free asset.

| Instance file | Country | Index | Period | n | Efficient frontier file (basic formulation) | Efficient frontier file (short-selling formulation) |
|---|---|---|---|---|---|---|
| port1.sd | Hong Kong | Hang Seng | 1992-1997 | 31 | port1.efn | port1.ssefn |
| port2.sd | Germany | DAX 100 | 1992-1997 | 85 | port2.efn | port2.ssefn |
| port3.sd | UK | FTSE 100 | 1992-1997 | 89 | port3.efn | port3.ssefn |
| port4.sd | USA | S&P 100 | 1992-1997 | 98 | port4.efn | port4.ssefn |
| port5.sd | Japan | Nikkei 225 | 1992-1997 | 225 | port5.efn | port5.ssefn |