Zachary Ryan

Filtering Data with RStudio

In this post, we will go over a very useful tool for analyzing data. This tutorial shows how to filter and sort data within the Lahman database, which is available in R through the Lahman package and can be used from RStudio. The Lahman database is a massive data set that includes baseball data from 1871 to 2019. To start off, let's make sure all the packages needed to sort the data are installed on your computer (see below).


install.packages("Lahman")

install.packages("tidyverse")

After installing both packages, load them with the library() function, which makes the functions and data sets you need to filter and sort through the Lahman database available in your R session.


library(Lahman)
library(tidyverse)
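If you would rather not reinstall packages you already have, the install step can be made conditional. This is only a minimal sketch: the helper name missing_packages is made up for this illustration, and the interactive() check is an added assumption so that nothing gets installed when the code runs outside a live session.

```r
# Packages this tutorial relies on
needed <- c("Lahman", "tidyverse")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Only install when working interactively (for example, in the RStudio console)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, library(Lahman) and library(tidyverse) can be called as usual.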


Looking at Data/Variables Within Lahman:


The next step uses two functions, str() and head(). The str() function lists every variable in a table, how many observations the table contains, and the type of each variable (chr, int, Factor, etc.). It is especially handy for finding the exact variable names you will need later. The head() function shows the first six rows of the same table in table form, so the two functions complement each other.


str(Batting)


## 'data.frame':    107429 obs. of  22 variables:
##  $ playerID: chr  "abercda01" "addybo01" "allisar01" "allisdo01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ G       : int  1 25 29 27 25 12 1 31 1 18 ...
##  $ AB      : int  4 118 137 133 120 49 4 157 5 86 ...
##  $ R       : int  0 30 28 28 29 9 0 66 1 13 ...
##  $ H       : int  0 32 40 44 39 11 1 63 1 13 ...
##  $ X2B     : int  0 6 4 10 11 2 0 10 1 2 ...
##  $ X3B     : int  0 0 5 2 3 1 0 9 0 1 ...
##  $ HR      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ RBI     : int  0 13 19 27 16 5 2 34 1 11 ...
##  $ SB      : int  0 8 3 1 6 0 0 11 0 1 ...
##  $ CS      : int  0 1 1 1 2 1 0 6 0 0 ...
##  $ BB      : int  0 4 2 0 2 0 1 13 0 0 ...
##  $ SO      : int  0 0 5 2 1 1 0 1 0 0 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  0 0 1 0 0 0 0 1 0 0 ...


head(Batting)


##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA    0
## 2  0  NA  NA NA NA    0
## 3  5  NA  NA NA NA    1
## 4  2  NA  NA NA NA    0
## 5  1  NA  NA NA NA    0
## 6  1  NA  NA NA NA    0

# You can also substitute "Pitching" if you plan to work with pitching data.
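When you only need the variable names or their types, a few base R helpers give a more compact view than str(). The sketch below uses a small stand-in table (the name toy_batting is made up for illustration) built from a few columns of the six rows in the head(Batting) output above, so it runs without the Lahman package installed.

```r
# Toy stand-in for Batting: six rows and a few columns from the head() output
toy_batting <- data.frame(
  playerID = c("abercda01", "addybo01", "allisar01",
               "allisdo01", "ansonca01", "armstbo01"),
  yearID   = c(1871L, 1871L, 1871L, 1871L, 1871L, 1871L),
  AB       = c(4L, 118L, 137L, 133L, 120L, 49L),
  H        = c(0L, 32L, 40L, 44L, 39L, 11L),
  HR       = c(0L, 0L, 0L, 2L, 0L, 0L),
  stringsAsFactors = FALSE
)

column_names <- names(toy_batting)          # just the variable names
column_types <- sapply(toy_batting, class)  # one type per column
first_three  <- head(toy_batting, n = 3)    # head() accepts a row count

print(column_names)
print(column_types)
print(first_three)
```

The same three calls should work on the real Batting table once the Lahman package is loaded.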


Filtering the Data:


The next step is to narrow our Batting table to a specific time frame and league using the filter() function. filter() keeps only the rows that meet the conditions you give it; in this case we want the AL batting data from the years 2010 to 2019. You may ask what "%>%" means, which is a great question. This operator is known as the pipe: it takes the result on its left and passes it to the function on its right, which lets you chain several steps together. I often read it as the word "and" to help understand what we are looking at. For this example, we are taking the Batting data AND keeping the years 2010 onward AND keeping only the American League. Because the data used here ends in 2019, yearID >= 2010 covers 2010 through 2019. If you have filtered correctly, running data_2019_2010 will print the new data set you have just created, or you can simply click on the data set in the Environment pane in RStudio.


data_2019_2010 <- Batting %>%  # Choosing the Batting table
  # AND
  filter(yearID >= 2010) %>%
  # AND
  filter(lgID == "AL")

# View the new data set
data_2019_2010


##     playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB
## 1  aardsda01   2010     1    SEA   AL  53   0  0   0   0   0  0   0  0
## 2  abreubo01   2010     1    LAA   AL 154 573 88 146  41   1 20  78 24
## 3  accarje01   2010     1    TOR   AL   5   0  0   0   0   0  0   0  0
## 4  aceveal01   2010     1    NYA   AL  10   0  0   0   0   0  0   0  0
## 5  albaljo01   2010     1    NYA   AL  10   0  0   0   0   0  0   0  0
## 6  alberma01   2010     1    BAL   AL  62   0  0   0   0   0  0   0  0
## 7  aldrico01   2010     1    LAA   AL   5  13  0   1   0   1  0   1  0
## 8  alfonel01   2010     1    SEA   AL  13  41  4   9   1   0  1   4  0
## 9  ambrihe01   2010     1    CLE   AL  34   0  0   0   0   0  0   0  0
## 10 anderbr04   2010     1    OAK   AL  19   0  0   0   0   0  0   0  0
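To see what the pipe is doing, it can help to write the same filter three ways. The sketch below uses a tiny made-up table (toy, with invented player IDs and numbers, not real Lahman records) and assumes the dplyr package, which is loaded as part of the tidyverse. It also adds an explicit upper bound on yearID, which is worth considering if your installed copy of Lahman includes seasons after 2019.

```r
library(dplyr)

# Made-up rows for illustration only; not real Lahman records
toy <- data.frame(
  playerID = c("playerA", "playerB", "playerC", "playerD"),
  yearID   = c(2009L, 2010L, 2015L, 2019L),
  lgID     = c("AL", "AL", "NL", "AL"),
  HR       = c(30L, 20L, 41L, 12L)
)

# 1. Piped, as in the post: each step feeds the next
piped <- toy %>%
  filter(yearID >= 2010) %>%
  filter(lgID == "AL")

# 2. The same steps written as nested calls, without the pipe
nested <- filter(filter(toy, yearID >= 2010), lgID == "AL")

# 3. One filter() call; conditions separated by commas must all hold
combined <- toy %>%
  filter(yearID >= 2010, yearID <= 2019, lgID == "AL")

print(piped)
print(nested)
print(combined)
```

All three versions keep the same two rows (playerB and playerD), so which one to use is mostly a matter of readability.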

Looking at Leaders (Within our Time Frame):


Next, let's look at a few statistics and find the highest value of each within the years 2010-2019. For this example, we will use the max() function, which returns the largest number in each column we look at (HR, SO, RBI).


max(data_2019_2010$HR)

## [1] 54


max(data_2019_2010$SO)

## [1] 222


max(data_2019_2010$RBI)

## [1] 139
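max() returns only the number, not the row it came from, and it returns NA if the column contains missing values (as IBB, HBP, SH and SF do for 1871 in the str() output earlier). The sketch below shows one way to handle both points on a small made-up table; the player IDs and values are invented for illustration, and dplyr is assumed to be loaded through the tidyverse.

```r
library(dplyr)

# Made-up rows for illustration only; one HR value is deliberately missing
toy <- data.frame(
  playerID = c("playerA", "playerB", "playerC", "playerD"),
  yearID   = c(2010L, 2013L, 2016L, 2019L),
  HR       = c(30L, 41L, NA, 12L)
)

# With a missing value present, max() returns NA
naive_max <- max(toy$HR)

# na.rm = TRUE tells max() to ignore missing values
top_hr <- max(toy$HR, na.rm = TRUE)

# Keep the row(s) whose HR equals that maximum to see who holds it
hr_leader <- toy %>%
  filter(HR == top_hr)

print(naive_max)
print(top_hr)
print(hr_leader)
```

The same pattern should carry over to data_2019_2010 by swapping in that table and the column of interest.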


Conclusion:


All in all, the basic R functions you have learned in this post will allow you to filter your data sets to find exactly what you are looking for. In the next section, I will teach you how to add the players' names to the table you just created, which will make analyzing the data that much easier!

---------------------------------------------------------------------------------------------------------------------------

-ZR
