Saturday 21 November 2015

Learn R in a day - Part 1 The Basics

I have been using R for the last 7 years and teaching it for more than 2.
Over that time, I have seen many colleagues and students who wanted to learn R but could not find a suitable self-learning resource.

However, I believe that learning a new language is difficult without knowing how and where to apply it.
Moreover, the 80-20 principle applies to learning a new language too: a small part of the language covers most day-to-day work.

This post is an attempt to help the reader learn R through real-world examples. It is intended only for those who have not worked with R before and want to understand the basics.

1. Download & Install R:

The R language can be downloaded from CRAN (the Comprehensive R Archive Network).

Once the installer is downloaded, run it to install R.


R Layout

2. R GUI:

Once R is installed, the GUI looks as shown in the picture.










3. Functionalities in R:

R can be used for any of the following:

A) Perform Basic Math
B) Work on top of data sets
C) Create data sets by writing SQL queries
D) Perform statistical modeling on top of data sets
E) Create executive level reports


A) Perform Basic Math

R's functions range from the most basic to the most complex. In this section, we'll go through some of the basic ones.

To perform basic math, type code such as the following into the R console:

2 + 3                  # addition
3^2                    # square
exp(10)                # e raised to the power 10
log(2, base = exp(1))  # natural logarithm (base = exp(1))



Note that # in the code above starts a comment: R ignores everything from # to the end of the line. R has no multi-line comment syntax (the /* ... */ form used in some other languages is not valid R), so a comment that spans several lines needs a # at the start of each line.

Basic math with multiple lines of code looks like the one in the picture.

Note that R is interpreted, so one can simply copy and paste code into the R console and get the output immediately, with no separate compile step.
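As a rough sketch of what such a multi-line block might look like, the lines below can be pasted into the console in one go. The variable names (total, square and so on) are made up for this illustration.

```r
# Everything after a # on a line is ignored by R.
# A longer note is written as several lines,
# each starting with its own # character.
total = 2 + 3         # addition
square = 3^2          # 3 raised to the power 2
ratio = 10 / 4        # division
remainder = 10 %% 4   # remainder after dividing 10 by 4
root = sqrt(16)       # square root

# Show all five results together
print(c(total, square, ratio, remainder, root))
```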

B) Work on Top of data sets:

Before working on top of data sets, let us go through the major data types in R

1. Numeric
2. Character
3. Factor

The type of any variable or value can be obtained with the "class" function.
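A small sketch of "class" applied to each of the three types listed above. The variable names x, y and z and their values are only illustrative.

```r
x = 3.14                             # a number
y = "setosa"                         # text, written inside quotes
z = factor(c("low", "high", "low"))  # categorical values with a fixed set of levels

class(x)   # "numeric"
class(y)   # "character"
class(z)   # "factor"
levels(z)  # the distinct categories stored in the factor
```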

Variable Vs Vector

Any value can be stored in a variable by assigning the value to a variable, as follows:

a=2;
b=a;
c=b*b





In the above lines of code, the variable "a" is assigned the value 2. The variable "b" is then assigned the current value of "a" (2), and "c" is assigned b*b, which is 4.

Just as a single value can be assigned to a variable, a collection of values can be assigned to a vector.
Put simply, a vector is an ordered collection of values of the same type, as shown in the code below:


a=c(1,2,3,4,5);
b=a[1];

In the code above, the vector a (a collection of values) is created with the concatenation function c(). Each value within the vector can then be referenced by its position in square brackets. Positions start at 1, so a[1] is the first value and b is assigned 1.
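A few more ways of working with the same vector, as a minimal sketch using only base R:

```r
a = c(1, 2, 3, 4, 5)

a[1]       # first element: 1
a[2:4]     # elements 2 to 4: 2 3 4
a[-1]      # everything except the first element: 2 3 4 5
length(a)  # number of elements: 5
a * 2      # arithmetic applies to every element: 2 4 6 8 10
mean(a)    # average of the values: 3
```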

Vector Vs Data Frame:

Typically, we are interested in working with tabular data in Excel style (variables in columns and cases in rows).

We saw that a vector is a collection of values. A data frame is a similar collection of vectors: each vector becomes a column, all columns have the same length, and so a data frame consists of rows and columns.
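To make the idea concrete, here is a minimal sketch that builds a data frame from two vectors with data.frame(). The students, their names and their scores are invented for the example.

```r
# Hypothetical example data: three students and their scores
name  = c("Asha", "Ben", "Chen")
score = c(82, 75, 91)

# Each vector becomes one column of the data frame
students = data.frame(name, score)

students        # prints a table with 3 rows and 2 columns
dim(students)   # number of rows, then number of columns: 3 2
students$score  # the score column as a vector: 82 75 91
```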

A simple illustration of a data frame is 'iris', a built-in dataset that ships with R. The 'iris' dataset can be loaded and inspected as shown in the picture:

The first line in the picture, "data(iris)", loads the built-in data set "iris".

The second line, "iris[1:3,]", should be read as "Provide rows 1 to 3 along with all the columns" (a data frame returns all columns by default if the columns of interest are not specified).

The 3rd line, "dim(iris)", provides the dimensions of the data frame "iris" (number of rows, then number of columns).

The 4th line, "summary(iris)", provides a summary of each column in the data frame.

"colnames(iris)" provides the names of all columns in the data frame "iris".

"str(iris)" provides the structure of the data frame.

"iris[1,1]" directly refers to the value in the first row and first column of the data frame.

Another way of referencing an element in a data frame is to specify the nth row of the column of interest, where a column is selected with [data frame]$[column name].
A row in that column is then selected with [data frame]$[column name][nth row]; in our case, iris$Sepal.Length[1].
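The commands described in this section can be typed out as follows. This is a sketch of the kind of session the walkthrough refers to, using only the built-in iris data set; the last two lines are small additions for illustration.

```r
data(iris)            # load the built-in iris data set

iris[1:3, ]           # rows 1 to 3, all columns
dim(iris)             # 150 rows and 5 columns
summary(iris)         # summary of each column
colnames(iris)        # names of the 5 columns
str(iris)             # type and first few values of each column

iris[1, 1]            # row 1, column 1: 5.1
iris$Sepal.Length[1]  # the same value, reached through the column name
iris[1:3, "Species"]  # a column can also be picked by name inside [ ]
class(iris$Species)   # "factor"
```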


Reading data into R:

So far, we have seen how to work with data once it is in the current R working environment. However, in real-world use cases one has to import data sets into R. The following are the ways in which one can import data into R:

Reading files:

A .csv file can be imported into a data frame named "t" with the code:
t=read.csv("File location",sep=",",header=TRUE)

Similarly, a tab-separated .txt file can be imported with:
read.table("File location",sep="\t",header=TRUE)

However, note that base R has no built-in function for reading a .xlsx file; an add-on package such as readxl is needed, or the sheet can first be saved as a .csv file.

Once the data is read into the current R working environment, it can be referenced in the same way as the iris data set in the previous section.
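Here is a self-contained sketch of both import calls. Since no real file location is given in this post, the example first writes a few rows of iris to temporary files; csv_path, txt_path, from_csv and from_txt are made-up names, and in practice the path would be the location of your own file.

```r
# Create two small temporary files so the example runs anywhere.
# csv_path and txt_path are placeholders for a real file location.
csv_path = tempfile(fileext = ".csv")
txt_path = tempfile(fileext = ".txt")
write.csv(iris[1:5, ], csv_path, row.names = FALSE)
write.table(iris[1:5, ], txt_path, sep = "\t", row.names = FALSE)

# Read them back into data frames
from_csv = read.csv(csv_path, header = TRUE)                # comma-separated file
from_txt = read.table(txt_path, sep = "\t", header = TRUE)  # tab-separated file

dim(from_csv)             # 5 rows and 5 columns
colnames(from_txt)        # same column names as iris
from_csv$Sepal.Length[1]  # referenced just like iris: 5.1
```

Note that the tab separator is written "\t" (backslash), not "/t".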


This concludes the first part of the series. I'll try to respond to any questions or feedback you leave in the comments.

In the next part of the series, we'll go through the various data frame manipulation techniques in R.
