Data cleaning

Data cleaning of high-frequency data is a necessary step in all finance and financial econometrics applications. The reason is that most data providers like tick data offer raw instead of preprocessed data. The upside is that you can follow the cleaning process yourself instead of relying on another party. In this vignette, we will address a common workflow that makes use of both trades and quotes data on a microsecond resolution.

Cleaning tick data on trades

We start by loading the highfrequency package and printing the head of one example data set that contains trade data for one stock on two days.

library(highfrequency)
head(sampleTDataRaw)
#>                     DT EX SYMBOL COND SIZE  PRICE CORR
#> 1: 2018-01-02 05:01:21  P    XXX  FTI    2 157.80    0
#> 2: 2018-01-02 05:23:50  P    XXX  FTI    3 157.80    0
#> 3: 2018-01-02 05:23:50  P    XXX  FTI    1 157.80    0
#> 4: 2018-01-02 07:11:54  P    XXX    T  130 158.00    0
#> 5: 2018-01-02 07:23:12  P    XXX   TI   30 158.10    0
#> 6: 2018-01-02 07:28:44  P    XXX   TI   30 158.28    0

A quick look via summary into the data let’s us see that there are trades listed in the data set out of the bounds of the regular trading hours that are 9:30 to 16:00 on regular work days.

summary(sampleTDataRaw[, c("DT", "SIZE", "PRICE")])
#>        DT                           SIZE              PRICE      
#>  Min.   :2018-01-02 05:01:21   Min.   :     1.0   Min.   :155.4  
#>  1st Qu.:2018-01-02 12:50:07   1st Qu.:    29.0   1st Qu.:156.4  
#>  Median :2018-01-02 15:59:06   Median :   100.0   Median :156.7  
#>  Mean   :2018-01-03 00:33:21   Mean   :   132.7   Mean   :156.8  
#>  3rd Qu.:2018-01-03 12:21:06   3rd Qu.:   100.0   3rd Qu.:157.1  
#>  Max.   :2018-01-03 19:55:37   Max.   :443901.0   Max.   :159.4

Next, we will use one of three main cleaning functions called tradesCleanup. It implements the steps laid out in the seminal paper of Barndorff-Nielsen et al. (2009). It processes the given data by

tDataCleaned <- tradesCleanup(tDataRaw = sampleTDataRaw, exchange = "N")

When using tradesCleanup, the return value is a list of two. It contains a report about the number of observations after each cleaning step and the cleaned data itself.

tDataCleaned$report
#>          initialObservations        removedFromZeroTrades 
#>                        77263                            0 
#>  removedOutsideExchangeHours removedFromSelectingExchange 
#>                          451                        65625 
#>       removedFromCorrections    removedFromSalesCondition 
#>                            0                            2 
#>    removedFromMergeTimestamp            finalObservations 
#>                         4017                         7168

summary(tDataCleaned$tData[, c("DT", "SIZE", "PRICE")])
#>        DT                           SIZE            PRICE      
#>  Min.   :2018-01-02 09:30:00   Min.   :   1.0   Min.   :155.4  
#>  1st Qu.:2018-01-02 12:20:59   1st Qu.:  99.0   1st Qu.:156.4  
#>  Median :2018-01-02 15:59:22   Median : 100.0   Median :156.7  
#>  Mean   :2018-01-03 00:18:15   Mean   : 164.9   Mean   :156.9  
#>  3rd Qu.:2018-01-03 12:07:29   3rd Qu.: 200.0   3rd Qu.:157.1  
#>  Max.   :2018-01-03 15:59:59   Max.   :6658.0   Max.   :159.4

As just one example, one can see that all trades that had time stamps outside the opening hours of the NYSE. The report of the cleaning procedure shows us that most of the observations are removed either due to originating from another exchange or having irregular sales conditions.

As a last step for cleaning tick data on trades it is advised to check whether compatible quotes have been made before the recorded transaction. For doing so, we first need to implement a similar cleaning procedure for quotes data.

Cleaning tick data on quotes

The highfrequency package also brings a similarly-named function for cleaning quotes data, quotesCleanup. Here, the cleaning steps are

qDataCleaned <- quotesCleanup(qDataRaw = sampleQDataRaw, exchange = "N")

quotesCleanup also provides a list with two elements, a report and the cleaned data.

qDataCleaned$report
#>          initialObservations        removedFromZeroQuotes 
#>                       131401                           96 
#>  removedOutsideExchangeHours removedFromSelectingExchange 
#>                          776                        36109 
#>    removedFromNegativeSpread       removedFromLargeSpread 
#>                            0                            0 
#>    removedFromMergeTimestamp              removedOutliers 
#>                        47856                            0 
#>            finalObservations 
#>                        46564

summary(qDataCleaned$qData[, c("DT", "OFR", "OFRSIZ", "BID", "BIDSIZ", "MIDQUOTE")])
#>        DT                           OFR            OFRSIZ       
#>  Min.   :2018-01-02 09:30:00   Min.   :155.4   Min.   :  1.000  
#>  1st Qu.:2018-01-02 12:21:55   1st Qu.:156.5   1st Qu.:  1.000  
#>  Median :2018-01-02 15:54:41   Median :156.7   Median :  2.000  
#>  Mean   :2018-01-03 00:03:51   Mean   :156.9   Mean   :  4.211  
#>  3rd Qu.:2018-01-03 12:12:14   3rd Qu.:157.2   3rd Qu.:  4.000  
#>  Max.   :2018-01-03 15:59:59   Max.   :159.4   Max.   :417.000  
#>       BID            BIDSIZ           MIDQUOTE    
#>  Min.   :155.4   Min.   :  1.000   Min.   :155.4  
#>  1st Qu.:156.4   1st Qu.:  1.000   1st Qu.:156.4  
#>  Median :156.7   Median :  2.000   Median :156.7  
#>  Mean   :156.9   Mean   :  4.304   Mean   :156.9  
#>  3rd Qu.:157.1   3rd Qu.:  4.000   3rd Qu.:157.1  
#>  Max.   :159.4   Max.   :426.000   Max.   :159.4

Cleaning transaction data using cleaned quotes

Given our freshly cleaned trades and quotes data, we now want to filter for trades that are proceeded by a compatible set of quotes. Currently, the tradesCleanupUsingQuotes function only supports submitting trades and quotes data for one day only, Hence, we filter our data set to one of the days.

tqDataCleaned <- tradesCleanupUsingQuotes(tData = tDataCleaned$tData[as.Date(DT) == "2018-01-02"], 
                                           qData = qDataCleaned$qData[as.Date(DT) == "2018-01-02"])
tqDataCleaned
#>                        DT SYMBOL    BID    OFR OFRSIZ BIDSIZ QUOTEEX MIDQUOTE
#>    1: 2018-01-02 09:30:00    XXX 158.39 158.50     18      1       N  158.445
#>    2: 2018-01-02 09:30:00    XXX 158.39 158.58      1      1       N  158.485
#>    3: 2018-01-02 09:30:00    XXX 158.39 158.58      1      1       N  158.485
#>    4: 2018-01-02 09:30:00    XXX 158.39 158.58      1      1       N  158.485
#>    5: 2018-01-02 09:30:00    XXX 158.39 158.58      1      1       N  158.485
#>   ---                                                                        
#> 3687: 2018-01-02 15:59:59    XXX 157.02 157.04    103     12       N  157.030
#> 3688: 2018-01-02 15:59:59    XXX 157.02 157.03     18      4       N  157.025
#> 3689: 2018-01-02 15:59:59    XXX 157.02 157.03     17      4       N  157.025
#> 3690: 2018-01-02 15:59:59    XXX 157.02 157.03     34      8       N  157.025
#> 3691: 2018-01-02 15:59:59    XXX 157.02 157.03     86      6       N  157.025
#>         PRICE NUMTRADES SIZE EX COND CORR
#>    1: 158.500         1   50  N    I    0
#>    2: 158.500         1 1805  N         0
#>    3: 158.485         1    4  N    I    0
#>    4: 158.485         1    1  N    I    0
#>    5: 158.485         1   72  N    I    0
#>   ---                                    
#> 3687: 157.020         2  500  N    F    0
#> 3688: 157.020         1  400  N    F    0
#> 3689: 157.030         1    5  N  F I    0
#> 3690: 157.030         2   30  N  F I    0
#> 3691: 157.020         1   62  N    I    0

Here, we see that once more half of the trades are filtered out due to non-plausibility.

References

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.