Mock Data - Data Testing Tool - Pandas TimeSeries

Time Series can be useful to analyze financial data in different frequencies based on data distribution.
We can reload/repeat the data analysis process in Python and export the result to Excel or CSV on daily, weekly, monthly, quarterly or yearly basis.

Goals: Generate Mock Data

Overview Data Time Series Basics

- Example is Amazon stock daily closing prices from 1997 - 2021 in CSV format
- View date type: convert string to time sereies
- Set date as index : retrieve range of data 
- Resample function : for plotting
- Assign date to the stock price without date column : with date frequency options

Mock Data

- Generate date range 
- Use Numpy to geneerate random number
- Set index to match date and number     
In [2]:
## Read the file from CSV format
df = pd.read_csv('data/Amazon.csv')
df.tail()
Out[2]:
Date Open High Low Close Adj Close Volume
6145 10/15/2021 3311.419922 3410.419922 3304.000000 3409.020020 3409.020020 5175100
6146 10/18/2021 3388.360107 3449.169922 3385.100098 3446.739990 3446.739990 3174100
6147 10/19/2021 3434.290039 3454.689941 3422.000000 3444.149902 3444.149902 2386100
6148 10/20/2021 3452.659912 3462.860107 3400.370117 3415.060059 3415.060059 2139800
6149 10/21/2021 3414.250000 3440.280029 3403.000000 3435.010010 3435.010010 1881400


Find out Date type
- The Date type here is a string, not time series type

In [3]:
type(df.Date[0])
Out[3]:
str


Convert string into time series type
- By adding Parse to read the CSV file

In [4]:
df = pd.read_csv('data/Amazon.csv', parse_dates=["Date"])
df.tail()
Out[4]:
Date Open High Low Close Adj Close Volume
6145 2021-10-15 3311.419922 3410.419922 3304.000000 3409.020020 3409.020020 5175100
6146 2021-10-18 3388.360107 3449.169922 3385.100098 3446.739990 3446.739990 3174100
6147 2021-10-19 3434.290039 3454.689941 3422.000000 3444.149902 3444.149902 2386100
6148 2021-10-20 3452.659912 3462.860107 3400.370117 3415.060059 3415.060059 2139800
6149 2021-10-21 3414.250000 3440.280029 3403.000000 3435.010010 3435.010010 1881400
In [5]:
## Find out Date type again: the Date type is Timestamp which is one of the time series
type(df.Date[0])
Out[5]:
pandas._libs.tslib.Timestamp


Set index to Date by adding Index column (index_col) to Date
-The benefit of setting index is we can retrieve specific date ranges of Amazon stock prices

In [6]:
df = pd.read_csv('data/Amazon.csv', parse_dates=["Date"],index_col="Date")
df.index
Out[6]:
DatetimeIndex(['1997-05-16', '1997-05-19', '1997-05-20', '1997-05-21',
               '1997-05-22', '1997-05-23', '1997-05-27', '1997-05-28',
               '1997-05-29', '1997-05-30',
               ...
               '2021-10-08', '2021-10-11', '2021-10-12', '2021-10-13',
               '2021-10-14', '2021-10-15', '2021-10-18', '2021-10-19',
               '2021-10-20', '2021-10-21'],
              dtype='datetime64[ns]', name='Date', length=6150, freq=None)
In [7]:
## Retrieve September 2021 stock price
df["2021-09"].head()
Out[7]:
Open High Low Close Adj Close Volume
Date
2021-09-01 3496.399902 3527.000000 3475.239990 3479.000000 3479.000000 3629900
2021-09-02 3494.760010 3511.959961 3455.000000 3463.120117 3463.120117 2923700
2021-09-03 3452.000000 3482.669922 3436.439941 3478.050049 3478.050049 2575700
2021-09-07 3478.000000 3528.090088 3476.939941 3509.290039 3509.290039 2737900
2021-09-08 3511.649902 3545.629883 3495.669922 3525.500000 3525.500000 3053400


Time Series enable us to do more financial analysis by few coding steps

- Get the average closing price of Amazon stock in March 2018
- Retrieve the specific date price
- Weekly, Monthly or Yearly average of the stock price
- Plot the stock price
In [8]:
## Get the average closing price in March 2018
df["2021-09"].Close.mean()
Out[8]:
3427.1828613333337
In [9]:
## Retrieving one specific day - can use loc slice function, but it's not necessary 
df.loc["2021-09-15":"2021-09-15"]
Out[9]:
Open High Low Close Adj Close Volume
Date
2021-09-15 3442.52002 3485.419922 3402.01001 3475.790039 3475.790039 2957500
In [10]:
## Retrieving with a date range
df["2021-09-10":"2021-09-14"] 
Out[10]:
Open High Low Close Adj Close Volume
Date
2021-09-10 3501.830078 3508.449951 3462.909912 3469.149902 3469.149902 2393300
2021-09-13 3482.800049 3497.959961 3438.000000 3457.169922 3457.169922 2569000
2021-09-14 3475.550049 3486.810059 3437.709961 3450.000000 3450.000000 1936900
In [11]:
## Yearly average of the stock price
df.Close.resample('Y').mean().head()
Out[11]:
Date
1997-12-31     3.133746
1998-12-31    15.661112
1999-12-31    69.062314
2000-12-31    46.533668
2001-12-31    12.200426
Freq: A-DEC, Name: Close, dtype: float64


Resample data function

- Shift+tab to view all the possibilities of the Resample data function
- Without Resample, the chart is more grannuily
In [12]:
## Plot with resample - average monthly price 
df.Close.resample('M').mean().plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x27d4c9bebe0>
In [13]:
## Plot without Resample
df.Close.plot()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x27d4cb20400>


Assigning date to a dataset with missing date

- The following example is the August 2021 Amazon stock prices but without any date
- Read the stock closing price csv file with the missing date column
- Assign the date to match the stock price
In [14]:
## Read August 2021 Amazon stock price csv file without date column
df1 = pd.read_csv('data\Amazon_No_Date.csv')
df1.tail()
Out[14]:
Open High Low Close Adj Close Volume
17 3309.870117 3321.000000 3286.149902 3299.179932 3299.179932 1680300
18 3299.000000 3332.000000 3296.000000 3316.000000 3316.000000 2098800
19 3333.229980 3352.320068 3313.750000 3349.629883 3349.629883 2391300
20 3357.429932 3445.000000 3355.219971 3421.570068 3421.570068 3192200
21 3424.800049 3472.580078 3395.590088 3470.790039 3470.790039 4356400


Assign the date to match the data

- Use "rng" function to assign date to matach stock price
- Apply "B" (Business Day) as the frequency
- Shift+ tab twice on "B" to follow the link for more day function
- The Business day function does not consider holidays
In [15]:
rng = pd.date_range(start="8/1/2021", end = "8/31/2021", freq='B')
rng
Out[15]:
DatetimeIndex(['2021-08-02', '2021-08-03', '2021-08-04', '2021-08-05',
               '2021-08-06', '2021-08-09', '2021-08-10', '2021-08-11',
               '2021-08-12', '2021-08-13', '2021-08-16', '2021-08-17',
               '2021-08-18', '2021-08-19', '2021-08-20', '2021-08-23',
               '2021-08-24', '2021-08-25', '2021-08-26', '2021-08-27',
               '2021-08-30', '2021-08-31'],
              dtype='datetime64[ns]', freq='B')


Set the date column as the index
- You must specify "True" to modify the original dataframe

In [16]:
df1.set_index(rng,inplace=True)
df1.head(5)
Out[16]:
Open High Low Close Adj Close Volume
2021-08-02 3353.100098 3358.919922 3317.000000 3331.479980 3331.479980 3353900
2021-08-03 3340.719971 3391.000000 3299.770020 3366.239990 3366.239990 4157300
2021-08-04 3379.350098 3388.879883 3345.560059 3354.719971 3354.719971 2183900
2021-08-05 3356.219971 3389.000000 3340.919922 3375.989990 3375.989990 2433500
2021-08-06 3375.000000 3375.000000 3329.040039 3344.939941 3344.939941 2635300
In [18]:
## Average Amazon price between 8/1/21 - 8/18/21
df1["2021-08-01":"2021-08-18"].Close.mean() 
Out[18]:
3312.897686230769


Add weekends to the list
- The price would be the same as the previous business day's closing price

In [19]:
df1.asfreq('D', method='pad').head()
Out[19]:
Open High Low Close Adj Close Volume
2021-08-02 3353.100098 3358.919922 3317.000000 3331.479980 3331.479980 3353900
2021-08-03 3340.719971 3391.000000 3299.770020 3366.239990 3366.239990 4157300
2021-08-04 3379.350098 3388.879883 3345.560059 3354.719971 3354.719971 2183900
2021-08-05 3356.219971 3389.000000 3340.919922 3375.989990 3375.989990 2433500
2021-08-06 3375.000000 3375.000000 3329.040039 3344.939941 3344.939941 2635300


View stock data by weekly
- 8/1/21 is Sunday, the weekly business date starts at 8/2/21

In [20]:
df1.asfreq('W', method='pad')
Out[20]:
Open High Low Close Adj Close Volume
2021-08-08 3375.000000 3375.000000 3329.040039 3344.939941 3344.939941 2635300
2021-08-15 3305.669922 3306.070068 3283.000000 3293.969971 3293.969971 2052800
2021-08-22 3203.870117 3207.810059 3175.760010 3199.949951 3199.949951 3341200
2021-08-29 3333.229980 3352.320068 3313.750000 3349.629883 3349.629883 2391300


Mock Data - Generating Dates and Random Data for Testing

- Set the date frequency to 'B' for business days
- Import numpy to generate random numbers
- Set the index to match the random numbers


Generating business days
- Generate 30 business days starting from 9/10/2021

In [21]:
rng = pd.date_range(start='9/10/2021', periods=30, freq='B')
rng
Out[21]:
DatetimeIndex(['2021-09-10', '2021-09-13', '2021-09-14', '2021-09-15',
               '2021-09-16', '2021-09-17', '2021-09-20', '2021-09-21',
               '2021-09-22', '2021-09-23', '2021-09-24', '2021-09-27',
               '2021-09-28', '2021-09-29', '2021-09-30', '2021-10-01',
               '2021-10-04', '2021-10-05', '2021-10-06', '2021-10-07',
               '2021-10-08', '2021-10-11', '2021-10-12', '2021-10-13',
               '2021-10-14', '2021-10-15', '2021-10-18', '2021-10-19',
               '2021-10-20', '2021-10-21'],
              dtype='datetime64[ns]', freq='B')


Import numpy to generate random numbers
- Generate random numbers between 1 to 20 for the length of 30 days

In [22]:
import numpy as np
np.random.randint(1,20,len(rng))
Out[22]:
array([13, 13, 19,  3,  6,  8, 14, 15, 12,  6,  1, 11,  4,  1, 18, 16, 12,
       16, 18,  8,  1, 14, 19, 10,  7, 16,  3,  9, 10,  7])