Manipulating Data in R

Published on:

16 Jun 2017, 11:03 pm

R is one of the best languages for data analysis. It has over 10,837 add-on packages with more than 98,996 members on LinkedIn's R Group. The best thing about R is that it is open source, very powerful and can perform complex data analysis. In this document, I will introduce approaches to manipulate and transform data in R.

The packages required for the manipulation include:

library(dplyr)
library(plyr)

These packages make data manipulation a fun in R. So, let's go ahead and explore their functions.

Filtering Data: With dplyr

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package.

> Ozone=c(41,36,12,18,NA,28)
> Solar.R=c(190,118,149,313,NA,NA)
> Wind=c(7.4,8.0,12.6,11.5,14.3,14.9)
> Temp=c(67,72,74,62,56,66)
> Month=c(5,5,5,5,5,5)
> Day=c(1,2,3,4,5,6)

> airquality=data.frame(Ozone, Solar.R, Wind, Temp, Month, Day)

> airquality

Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

> filter(airquality, Temp > 70)

Ozone Solar.R Wind Temp Month Day
1 36 118 8.0 72 5 2
2 12 149 12.6 74 5 3

> filter(airquality, Wind>11)

Ozone Solar.R Wind Temp Month Day
1 12 149 12.6 74 5 3
2 18 313 11.5 62 5 4
3 NA NA 14.3 56 5 5
4 28 NA 14.9 66 5 6

> filter(airquality,Temp>70 & Month==5)

Ozone Solar.R Wind Temp Month Day
1 36 118 8.0 72 5 2
2 12 149 12.6 74 5 3

> filteredData <- filter(airquality, Day != 5)

The above excludes Day with 5.

> filteredData

Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 28 NA 14.9 66 5 6

Adding a new column

Mutate is used to add new variables to the data. For example, let's adds a new column that displays the temperature in Celsius

> mutate(airquality, TempInC = (Temp – 32) * 5 / 9)

Ozone Solar.R Wind Temp Month Day TempInC
1 41 190 7.4 67 5 1 19.44444
2 36 118 8.0 72 5 2 22.22222
3 12 149 12.6 74 5 3 23.33333
4 18 313 11.5 62 5 4 16.66667
5 NA NA 14.3 56 5 5 13.33333
6 28 NA 14.9 66 5 6 18.88889

Alternatively, we can also use the transform function which helps in adding several columns thereby avoiding the use of $ nomenclature.

> newcolumns<-transform(airquality, TempInC = (Temp – 32) * 5 / 9, excess=Temp-Wind)

> newcolumns

Ozone Solar.R Wind Temp Month Day TempInC excess
1 41 190 7.4 67 1 19.44444 59.6
2 36 118 8.0 72 2 22.22222 64.0
3 12 149 12.6 74 3 23.33333 61.4
4 18 313 11.5 62 4 16.66667 50.5
5 NA NA 14.3 56 5 13.33333 41.7
6 28 NA 14.9 66 6 18.88889 51.1

Removing a column

If we want to remove any column from the data set, we use the below code.

> newcolumns$excess<-NULL

> newcolumns

Ozone Solar.R Wind Temp Month Day TempInC
1 41 190 7.4 67 1 19.44444
2 36 118 8.0 72 2 22.22222
3 12 149 12.6 74 3 23.33333
4 18 313 11.5 62 4 16.66667
5 NA NA 14.3 56 5 13.33333
6 28 NA 14.9 66 6 18.88889

Sample Function

The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset and the second line of code randomly selects 1 row (10% of the original 5 rows) from the dataset.

> sample_n(airquality, size = 2)

Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
5 NA NA 14.3 56 5 5

> sample_frac(airquality, size = 0.1)
Ozone Solar.R Wind Temp Month Day
3 12 149 12.6 74 5 3

Arrange Function

The arrange function is used to arrange rows by variables. We can use the arrange function to arrange the rows in the descending order of Temp, and then in the ascending order of Day.

> arrange(airquality, desc(Temp), Day)

Ozone Solar.R Wind Temp Month Day
1 12 149 12.6 74 5 3
2 36 118 8.0 72 5 2
3 41 190 7.4 67 5 1
4 28 NA 14.9 66 5 6
5 18 313 11.5 62 5 4
6 NA NA 14.3 56 5 5

Pipe

The pipe operator in R, represented by %>% can be used to chain code together. It is very useful when you are performing several operations on data, and don't want to save the output at each intermediate step.

For example, let's say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:
filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))

With piping, the above code can be rewritten as:

airquality %>%

filter(Month != 5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm = TRUE))

Subsetting Data

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

Using the same data frame above for airquality. This is similar to the filter function used in the dply package.

> Ozone=c(41,36,12,18,NA,28)
> Solar.R=c(190,118,149,313,NA,NA)
> Wind=c(7.4,8.0,12.6,11.5,14.3,14.9)
> Temp=c(67,72,74,62,56,66)
> Month=c(5,4,7,5,6,5)
> Day=c(1,2,3,4,5,6)

> airquality=data.frame(Ozone, Solar.R, Wind, Temp, Month, Day)

> subset(airquality, Temp > 70, select = c(Ozone, Temp))

Ozone Temp
2 36 72
3 12 74

> subset(airquality, Day == 1, select = -Temp)

Ozone Solar.R Wind Month Day
1 41 190 7.4 5 1

> subset(airquality, select = Ozone:Wind)

Ozone Solar.R Wind
1 41 190 7.4
2 36 118 8.0
3 12 149 12.6
4 18 313 11.5
5 NA NA 14.3
6 28 NA 14.9

> with(airquality, subset(Ozone, Temp > 70))

[1] 36 12

Using "which" function

Which function in R is also used to subset the data on specific conditions. For instance, in the airquality data frame, if we need to subset the data with Temp greater than 70 and having Ozone greater than 10, the following code could be written.

> Tempmore<-airquality[which(airquality$Temp>70 & airquality$Ozone>10)]

> Tempmore

Solar.R Wind
1 190 7.4
2 118 8.0
3 149 12.6
4 313 11.5
5 NA 14.3
6 NA 14.9

Changing the levels of the factorial variables

By default the levels of a factor are ordered alphabetically. We can change the order simply by providing levels= to factor().

dss$location <- factor(dss$location, levels=rev(levels(dss$location)))

All these functions will give you a basic understanding of the data transformation for easy analysis. However, this is just a beginning and a lot of information is available on the web for extensive data analysis. Keep learning and all the best!!

Join our WhatsApp Channel to get the latest news, exclusives and videos on WhatsApp

_____________

Disclaimer: Analytics Insight does not provide financial advice or guidance. Also note that the cryptocurrencies mentioned/listed on the website could potentially be scams, i.e. designed to induce you to invest financial resources that may be lost forever and not be recoverable once investments are made. You are responsible for conducting your own research (DYOR) before making any investments. Read more here.

Data Science

Education

Tutorials

Manipulating Data in R

Related Stories