-
Notifications
You must be signed in to change notification settings - Fork 1
/
databaseball.Rmd
224 lines (136 loc) · 4.09 KB
/
databaseball.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
---
title: "DataBaseball"
output:
html_notebook: default
html_document:
df_print: paged
pdf_document: default
---
# On deck: Import the packages we'll need
"There's a package for that!"
```{r}
library(Lahman) # Our source data
library(tidyverse) # "Metapackage," don't leave home without it!
library(psych) # Descriptive stats
library(skimr) # Data profiling
library(writexl) # Write to Excel -- data dump
library(openxlsx) # Write to Excel -- plus
```
# 1st base: Import the data
We can pull this data from the `Teams` and `Salary` tables of `Lahman`
```{r}
head(Teams)
head(Salaries)
```
Drop some of these Teams columns
```{r}
teams <- Teams %>%
select(yearID, teamID, name, attendance, W)
head(teams)
```
We would like to pull the payroll data from `Salaries` and merge it to the data in `Teams`.
`Salaries` is at the *player* level so we will *aggregate* it, then merge. _Think `VLOOKUP()`!_
```{r}
# Get the total annual salaries by team
payroll <- Salaries %>%
group_by(yearID, teamID) %>%
summarise(payroll = sum(salary))
head(payroll)
```
Now let's merge this into our `Teams` table:
```{r}
min(Teams$yearID)
min(payroll$yearID)
# Drop the years that don't have payroll data
teams_merged <- teams %>%
inner_join(payroll)
head(teams_merged)
```
Sanity check -- does the total payroll equal sum of salaries?
```{r}
sum(teams_merged$payroll) == sum(Salaries$salary)
```
# 2nd base: Profile the data
Lots of ways to profile the data...
```{r}
# Shape
dim(teams_merged)
# Summary statistics
summary(teams_merged)
```
Many packages exist to aid in the exploration:
```{r}
skim(teams_merged)
```
```{r}
describe(teams_merged)
```
## Shortstop: Data visualization?
This is an easy step to overlook, but it is _crucial_! It leads very well into modeling.
```{r}
# Distribution of payrolls
ggplot(data = teams_merged, aes(x = payroll)) +
geom_histogram()
# What could be causing this?
```
```{r}
# Distribution of wins
ggplot(data = teams_merged, aes(x = W)) +
geom_histogram()
```
# 3rd base: Analyze the data
Depending how statistically informed you want to get, R can "probably" handle it! Let's see what our payroll report might have in store...
```{r}
# Relationship between payroll and attendance
ggplot(data = teams_merged, aes(x = payroll, y = attendance)) +
geom_point()
```
```{r}
# Relationship between payroll and wins
ggplot(data = teams_merged, aes(x = payroll, y = W)) +
geom_point()
```
```{r}
# What if we take the log of payrolls?
teams_merged$log_payroll <- log(teams_merged$payroll)
head(teams_merged)
```
```{r}
# Relationship between payroll and attendance
ggplot(data = teams_merged, aes(x = log_payroll, y = attendance)) +
geom_point() +
geom_smooth(method = 'lm')
```
You'd want to back this up with more diagnostics, but you get the idea...
# Home: Export to Excel
There's definitely more that could be done with our payroll/attendnance analysis, but let's take our victory lap now :)
If you want a pure data dump to Excel, `writexl` is a good option:
```{r}
# What do you want to write, and to where?
write_xlsx(teams_merged, 'output/writexl.xlsx')
```
```{r}
# What can this function do and not?
# Awesome Easter Egg
?write_xlsx
```
If you would like to fully automate the production from R, `openxlsx` is a good option:
```{r}
# Add workbook
wb <- createWorkbook()
# Add worksheet of raw data
data_sheet <- addWorksheet(wb, sheetName = 'data')
# Write the dataset as a Table to Excel
writeDataTable(wb, sheet = data_sheet, x = teams_merged, rowNames = FALSE)
# Send plot to the workbook
ggplot(data = teams_merged, aes(x = payroll, y = attendance)) +
geom_point() +
ggtitle('Payroll vs attendance')
# Add the image to the worksheet, set the placement
insertPlot(wb, sheet = data_sheet, startRow = 1, startCol = "I", width = 6, height = 4)
file = 'output/openxlsx.xlsx'
# save workbook to working directory
saveWorkbook(wb = wb, file = file, overwrite = TRUE)
```
We could even add cell formats, conditional formatting and so forth to Excel from R using this package. **Home run!**
Back to the slides for a conclusion...