02. R data structure and plots
| Dimensions | Single type | Multiple types |
|---|---|---|
| 1 | vector, c() | list, list() |
| 2 | matrix, matrix() | data frame, data.frame() |
vector
- 1 dimension(x-axis only), defined for one data type
- If the elements have different types, they are coerced to a common type. R is dynamically typed and converts implicitly.
x <- c(1,2,3) # num [1:3] 1 2 3
# In R, this is called a vector, not an array.
x # [1] 1 2 3
# One vector output result, 1 2 3
typeof(x) # double
class(x) # numeric
y <- c(1, "2", TRUE) # different types, all coerced to character
y # [1] "1" "2" "TRUE"
typeof(y) # character
class(y) # character
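The coercion rule above can be checked directly. This is a minimal sketch; the variable name mixed is only illustrative and does not appear in the original notes.

```r
# Mixed types are coerced upward: logical < integer < double < character
typeof(c(TRUE, 2L))             # "integer"
typeof(c(TRUE, 2L, 3.5))        # "double"
typeof(c(TRUE, 2L, 3.5, "a"))   # "character"

# Logical values become 1 and 0 when combined with numbers
mixed <- c(TRUE, FALSE, 2)
mixed                           # [1] 1 0 2
```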
- The following example uses a vector to set the bin boundaries of a histogram.
hist(ChickWeight$weight, breaks = fivenum(ChickWeight$weight)) # plotting in one line
# Without the breaks argument, hist chooses default break points.
# Find out what the function fivenum returns.
five.values <- fivenum(ChickWeight$weight)
# In the Environment window, five.values shows as num [1:5]. It is a vector.
# The values are the minimum, lower hinge (about the 1st quartile), median, upper hinge (about the 3rd quartile), and maximum.
# Five break points give 4 bins.
hist vs boxplot
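- The hist call above draws 4 adjacent bins that touch one another, one between each pair of break points.
- boxplot uses the same kind of summary differently: it draws a box between the hinges, a line at the median, and whiskers, rather than bins.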
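To compare the two without drawing anything, both functions can be called with plot = FALSE and their return values inspected. This is a sketch under that assumption; the names w, h and b are illustrative.

```r
w <- ChickWeight$weight
five.values <- fivenum(w)

# hist returns the break points and the count of observations in each bin
h <- hist(w, breaks = five.values, plot = FALSE)
h$breaks         # the five values used as bin boundaries
h$counts         # one count per bin, 4 bins in total

# boxplot returns the statistics it would draw
b <- boxplot(w, plot = FALSE)
b$stats          # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out            # points beyond the whiskers, drawn individually
```

The hinges and median in b$stats match the middle three values of fivenum, while the whiskers stop at the most extreme points within 1.5 times the box length.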
list
- 1 dimension(x-axis only), multiple data types
- list(...) creates a list.
- For the list l created below, the Environment window shows 3 elements when the list is expanded:
- chr "Lucky"
- num 32
- logi TRUE
- str(l) prints the same information.
- str(...) displays the structure of an object.
l <- list("Lucky", 32, TRUE)
class(l) # list
typeof(l) # list
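Elements of a list are read with double brackets, while single brackets return a shorter list. The sketch below also shows a named list; the names named, name, age and happy are illustrative and not part of the original notes.

```r
l <- list("Lucky", 32, TRUE)
str(l)            # one line per element: chr, num, logi

l[[1]]            # the element itself: "Lucky"
l[1]              # a list of length 1 that contains "Lucky"

# Elements can be named and then read with $
named <- list(name = "Lucky", age = 32, happy = TRUE)
named$age         # 32
```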
matrix
- 2 dimensions(x-axis and y-axis), one data type
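A minimal sketch of creating and indexing a matrix, using made-up values. The names m and mc are illustrative.

```r
# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)            # 2 3
m[2, 3]           # row 2, column 3: 6
typeof(m)         # "integer"

# As with vectors, mixed types are coerced to a single type
mc <- matrix(c(1, "a", TRUE, 2), nrow = 2)
typeof(mc)        # "character"
```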
data frame
- 2 dimensions(x-axis and y-axis), multiple data types
- like an Excel worksheet or an SQL result set
- A data frame is usually loaded from a CSV file or an SQL result set.
- The following code uses the built-in iris data frame.
?datasets      # help for the base R package datasets
?iris          # help for the data frame iris
head(iris)     # the first six rows
# There are five columns, each with a name.
# No row names are defined, so the defaults 1, 2, 3, ... are used.
class(iris)    # data.frame
# class describes how R treats the object as a whole.
typeof(iris)   # list
# typeof gives the internal storage type: a data frame is stored as a list of columns.
str(iris)      # structure of the data frame: 150 obs. of 5 variables
# As a list, iris has 5 elements, one per column.
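Because a data frame is stored as a list of columns, the usual list tools work on it. A short sketch using only base R functions:

```r
is.list(iris)          # TRUE
length(iris)           # 5, the number of columns
names(iris)            # the five column names
dim(iris)              # 150 5

# Each column has its own type; the last one is a factor
sapply(iris, class)
iris[[5]][1:3]         # first three values of the fifth column
```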
- Another example is the data set diamonds in package ggplot2.
library(ggplot2)            # attach ggplot2 so that diamonds is available
data(package = 'ggplot2')   # list the data sets in ggplot2
?diamonds                   # help for the data set diamonds
View(diamonds)              # open the data viewer
summary(diamonds)           # min, quartiles, median, mean, max for numeric variables; counts for factors
s <- subset(diamonds, cut %in% 'Fair' & price < 1000)$price # filter the rows, then select price
mean(s)                     # mean price of the subset
data frame example 1
- ChickWeight is a data frame.
- Two variables are involved: Time and weight.
- plot(ChickWeight$Time, ChickWeight$weight)
- This draws a scatter plot with Time on the x-axis and weight on the y-axis.
- Both variables are numeric (num).
data frame example 2
- ChickWeight is a data frame.
- Two variables are involved: weight and Diet.
- A formula, weight ~ Diet, describes the plot: weight grouped by Diet.
- boxplot accepts the formula as its first argument, as shown below.
- The output is four weight distributions, one per Diet.
- The formula only names the two variables; the values come from the data frame passed as data.
boxplot(weight ~ Diet, data = ChickWeight)
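The grouping that the formula weight ~ Diet asks for can be reproduced by hand with split and tapply. This is a sketch; the names groups, medians and b are illustrative.

```r
# Split the weights into one vector per Diet level
groups <- split(ChickWeight$weight, ChickWeight$Diet)
length(groups)        # 4, one group per Diet

# Median weight per Diet
medians <- tapply(ChickWeight$weight, ChickWeight$Diet, median)
medians

# The same medians are the third row of the statistics boxplot computes
b <- boxplot(weight ~ Diet, data = ChickWeight, plot = FALSE)
b$stats[3, ]
```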
data frame example 3
- Two variables are involved: carat and price.
- The values of carat give the x-coordinate.
- The values of price give the y-coordinate.
library(ggplot2)
g <- ggplot(diamonds, aes(x = carat, y = price))
g <- g + geom_point(aes(color=clarity))
g
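# see the trend with a smoothed curve
library(mgcv) # for a data set as large as diamonds, geom_smooth fits a GAM from mgcv by default
g.smooth <- g + geom_smooth(color = 'yellow')
g.smooth
# see the trend with a linear model
g.lm <- g + geom_smooth(method = 'lm', color = 'red')
g.lm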
data frame index and combining data frame
# data frame index
# Indexing in an R data frame is 1-based.
# create a data frame
name <- c('happy', 'lucky', 'joy')
age <- c(1, 3, 5)
my.df <- data.frame(name, age) # cbind(name, age) would give a character matrix instead
# access the data frame
my.df
# output    name age
#        1 happy   1
#        2 lucky   3
#        3   joy   5
my.df[,1]  # "happy" "lucky" "joy"
my.df[2,2] # 3
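Besides numeric positions, columns of a data frame can be reached by name, and rows can be filtered with a logical condition. A minimal sketch on a data frame built with data.frame(); the names pets and older are illustrative.

```r
name <- c('happy', 'lucky', 'joy')
age <- c(1, 3, 5)
pets <- data.frame(name, age)

pets$name             # column by name, as a vector
pets[["age"]]         # same idea with double brackets
pets[2, "age"]        # row 2 of column age: 3

# Keep only the rows where age is greater than 1
older <- pets[pets$age > 1, ]
nrow(older)           # 2
```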
# merge performs inner, full, left, and right joins on the shared column share.keys
df1 <- data.frame(LETTERS, share.keys = 1:26) # 26 rows
df2 <- data.frame(letters, share.keys = c(1:9, 11, 12, 13, 14, 22:34)) # 26 rows
merge(df1, df2)               # inner join: 18 rows, keys present in both
merge(df1, df2, all = TRUE)   # full join: 34 rows, NA where there is no match
merge(df1, df2, all.x = TRUE) # left join: 26 rows, all of the left plus matched right
merge(df1, df2, all.y = TRUE) # right join: 26 rows, all of the right plus matched left
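When the key columns have different names in the two data frames, merge takes by.x and by.y. The sketch below uses small made-up data frames; left, right, id, key, upper and lower are illustrative names.

```r
left  <- data.frame(id  = c(1, 2, 3), upper = c("A", "B", "C"))
right <- data.frame(key = c(2, 3, 4), lower = c("b", "c", "d"))

# Inner join: only keys 2 and 3 appear in both
inner <- merge(left, right, by.x = "id", by.y = "key")
nrow(inner)           # 2

# Full join: keys 1 to 4, with NA where one side has no match
full <- merge(left, right, by.x = "id", by.y = "key", all = TRUE)
nrow(full)            # 4
full
```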
#combine the rows from two data frames
name <- c('John', 'Mary', "Mike")
age <- c(20, 30, 40)
df1 <- data.frame(name, age)
df1
name <- c('Wiwi', 'Tairo', "Emi")
age <- c(5, 6, 7)
df2 <- data.frame(name, age)
df2
two <- rbind(df1, df2)
two
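For data frames, rbind matches columns by name rather than by position, so the column order may differ between the two inputs. A short sketch with made-up rows; the names first, second and stacked are illustrative.

```r
first  <- data.frame(name = c("John", "Mary"), age = c(20, 30))
second <- data.frame(age = c(5, 6), name = c("Wiwi", "Tairo")) # columns in a different order

stacked <- rbind(first, second)
names(stacked)        # column order follows the first data frame
stacked$age           # 20 30 5 6
nrow(stacked)         # 4
```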
table function
- The data set ChickWeight has 578 observations of 4 variables.
- The four variables are weight, Time, Chick, and Diet.
- The function table builds a table of counts, as shown below.
- The output has class table.
- The output holds the counts for each level of a factor (a categorical variable), such as Diet.
- Some plot functions, such as barplot, expect a table as input.
data(ChickWeight)
t <- table(ChickWeight$Diet)
class(t)
t
----- result ----------------------
table
1 2 3 4
220 120 120 118
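A table of counts can be indexed by level name, turned into proportions, and passed to barplot. This is a minimal sketch; the names diet.counts and diet.share and the axis labels are illustrative.

```r
diet.counts <- table(ChickWeight$Diet)

diet.counts["1"]                 # count for Diet 1
sum(diet.counts)                 # total number of observations

# Share of observations per Diet
diet.share <- prop.table(diet.counts)
diet.share

# One bar per Diet, with the count as bar height
barplot(diet.counts, xlab = "Diet", ylab = "Number of observations")
```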