## 28. More on R - data structures, functions, R files, debugging

### 01. R simple data

• A simple variable is a single vector, not a collection of variables.
• ChickWeight is a data frame.
• weight is one of its variables.
• The plotting function expects a single variable as its argument, not the whole data frame.
• The output is the distribution of weight across all rows.
```
boxplot(ChickWeight$weight)
```

### 02. R data structure and plots

| Dimensions | Single type | Multiple types |
|---|---|---|
| 1 | vector: c() | list: list() |
| 2 | matrix: matrix() | data frame: data.frame() |

vector
• 1 dimension (a single axis), holding one data type
• If the elements have different types, they are coerced to a common type. R is weakly typed in this respect.
```
x <- c(1, 2, 3)        # env window shows: num [1:3] 1 2 3
                       # in R this is called a vector, not an array
x                      # 1 2 3 (one vector as the output)
typeof(x)              # double
class(x)               # numeric

y <- c(1, "2", TRUE)   # different types
y                      # "1" "2" "TRUE"
typeof(y)              # character
class(y)               # character
```
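The coercion rule can be checked directly. The following is a minimal sketch using only base R; the variable names are chosen for illustration.

```r
# Coercion moves toward the more general type:
# logical < integer < double < character.
a <- c(TRUE, 2L)     # logical + integer   -> integer
b <- c(2L, 3.5)      # integer + double    -> double
d <- c(3.5, "x")     # double + character  -> character

typeof(a)   # "integer"
typeof(b)   # "double"
typeof(d)   # "character"
```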
• The following example uses a vector when plotting a histogram.
```
# Plot the histogram in one line.
# Without the breaks argument, default breaks are computed.
hist(ChickWeight$weight, breaks = fivenum(ChickWeight$weight))

# Find out what function fivenum returns.
five.values <- fivenum(ChickWeight$weight)
# In the env window, five.values is num [1:5]. It is a vector.
# The values are min, lower hinge, median, upper hinge, max
# (the hinges approximate the 1st and 3rd quartiles).
# Five break points give 4 bins.
```
hist vs boxplot
• The hist call above draws 4 adjacent bins that touch one another.
• boxplot does not draw bins; it draws a box-and-whisker summary of the same values.
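To compare what the two functions compute without drawing anything, the objects behind each plot can be inspected. This is a sketch in base R; the variable names w, h, and b are illustrative.

```r
w <- ChickWeight$weight

# plot = FALSE returns the histogram object without drawing it.
h <- hist(w, breaks = fivenum(w), plot = FALSE)
h$breaks           # the five bin edges
h$counts           # number of observations in each bin
length(h$counts)   # 4

# boxplot.stats() returns the five values the box and whiskers are drawn from:
# lower whisker end, lower hinge, median, upper hinge, upper whisker end.
b <- boxplot.stats(w)
b$stats
```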

list
• 1 dimension (a single axis), multiple data types
• list(...) creates a list.
• In the env window, expand the list; there are 3 elements:
• chr "Lucky"
• num 32
• logi TRUE
• You can get the same result with str(l).
• str(...) shows the structure of the data.
```
l <- list("Lucky", 32, TRUE)
class(l)    # list
typeof(l)   # list
```
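Elements of a list are usually read back by position or by name. The sketch below assumes hypothetical element names (name, age, adopted) that are not part of the example above.

```r
pet <- list(name = "Lucky", age = 32, adopted = TRUE)  # hypothetical names

str(pet)            # compact view of each element and its type
pet[["name"]]       # "Lucky"; [[ ]] extracts the element itself
pet$age             # 32; $ is shorthand for [[ ]] with a name
class(pet["age"])   # "list"; [ ] returns a sub-list
```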
matrix
• 2 dimensions (rows and columns), one data type
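A short sketch of a matrix in base R follows; the values are arbitrary and only illustrate the layout and the single-type rule.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m
dim(m)       # 2 3
m[2, 3]      # 6 (row 2, column 3)
typeof(m)    # "integer"

# Mixing types coerces the whole matrix, as with vectors.
m2 <- matrix(c(1, "a", TRUE, 4), nrow = 2)
typeof(m2)   # "character"
```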
data frame
• 2 dimensions (rows and columns), multiple data types
• Similar to Excel worksheets or SQL results
• Usually, a data frame is built from a CSV file or an SQL result set.
• The following code uses the iris data frame.
```
?datasets       # help for the package of base R datasets
head(iris)      # the contents of the first six rows
                # There are five column names.
                # No row names are defined, so the defaults 1, 2, 3... are used.
class(iris)     # data.frame
                # The class describes the data as a whole.
typeof(iris)    # list
str(iris)       # structure of the data frame
                # The iris data frame has 5 elements. Each column is one element.
```
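Because typeof(iris) is list, list functions work on a data frame column by column. The sketch below uses base R only; the small pets data frame is made up for illustration.

```r
sapply(iris, class)   # class of each column: four numeric, one factor
length(iris)          # 5, the number of columns
dim(iris)             # 150 5
is.list(iris)         # TRUE

# Building a small data frame by hand (hypothetical values).
pets <- data.frame(name = c("happy", "lucky"), age = c(1, 3))
str(pets)
```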
• Another example is the diamonds data set in package ggplot2.
```
library(ggplot2)             # makes diamonds and its help page available
data(package = 'ggplot2')    # list the data sets in 'ggplot2'
?diamonds                    # get to know one data set, diamonds
View(diamonds)               # opens in the script window
summary(diamonds)            # min, mean, 3rd quartile, ... for all variables
# subset the rows, then select the price column
s <- subset(diamonds, cut %in% 'Fair' & price < 1000)$price
mean(s)                      # one statistic of the selected data
```
data frame example 1
• ChickWeight is a data frame.
• Two variables are involved: Time and weight.
• plot(ChickWeight$Time, ChickWeight$weight)
• This is a scatter plot with Time on the x-coordinate and weight on the y-coordinate.
• Both variables are num.
data frame example 2
• ChickWeight is a data frame.
• Two variables are involved.
• A formula is involved.
• boxplot accepts a formula and a data argument, as below.
• The output is four distributions of weight, one for each Diet.
• The formula only names the two variables; the data itself is supplied through the data argument.
```
boxplot(weight ~ Diet, data = ChickWeight)
```
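The same formula notation works outside plotting. As a sketch, aggregate and tapply from base R can compute the per-diet median that each box is centred on; the variable name med is illustrative.

```r
# Median weight per diet: the centre line of each box.
med <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
med

# The same grouping with tapply, returning a named vector.
tapply(ChickWeight$weight, ChickWeight$Diet, median)
```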
data frame example 3
• Two variables are involved.
• The data of one variable is used for the x-coordinate.
• The data of the other variable is used for the y-coordinate.
```
library(ggplot2)
g <- ggplot(diamonds, aes(x = carat, y = price))
g <- g + geom_point(aes(color = clarity))
g

# see the trend
library(mgcv)
g.smooth <- g + geom_smooth(color = 'yellow')
g.smooth

# see the trend in a linear model
g.lm <- g + geom_smooth(method = 'lm', color = 'red')
g.lm
```
data frame index and combining data frames
```
# data frame index
# Indexing in an R data frame is 1-based.
# create a data frame
name <- c('happy', 'lucky', 'joy')
age <- c(1, 3, 5)
my.df <- data.frame(name, age)   # cbind(name, age) would build a character matrix instead

# access the data frame
my.df          # output:
               #    name age
               # 1 happy   1
               # 2 lucky   3
               # 3   joy   5
my.df[, 1]     # happy, lucky, joy
my.df[2, 2]    # 3

# merge for inner join, full join, left join, right join
df1 <- data.frame(LETTERS, share.keys = 1:26)                            # 26 rows
df2 <- data.frame(letters, share.keys = c(1:9, 11, 12, 13, 14, 22:34))   # 26 rows
merge(df1, df2)                 # inner join, 18 rows
merge(df1, df2, all = TRUE)     # full join, 34 rows, <NA> for mismatches
merge(df1, df2, all.x = TRUE)   # left join, 26 rows, all the left + matched right
merge(df1, df2, all.y = TRUE)   # right join, 26 rows, all the right + matched left

# combine the rows from two data frames
name <- c('John', 'Mary', 'Mike')
age <- c(20, 30, 40)
df1 <- data.frame(name, age)
df1
name <- c('Wiwi', 'Tairo', 'Emi')
age <- c(5, 6, 7)
df2 <- data.frame(name, age)
df2
two <- rbind(df1, df2)
two
```
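When the key column has a different name in each data frame, merge takes by.x and by.y. The sketch below uses two small hypothetical tables (owners and pets) that do not appear in the example above.

```r
# Hypothetical tables: the shared key is owner.id on the left and id on the right.
owners <- data.frame(owner.id = c(1, 2, 3), owner = c("John", "Mary", "Mike"))
pets   <- data.frame(id = c(2, 3, 4), pet = c("Wiwi", "Tairo", "Emi"))

inner <- merge(owners, pets, by.x = "owner.id", by.y = "id")
left  <- merge(owners, pets, by.x = "owner.id", by.y = "id", all.x = TRUE)

nrow(inner)   # 2 (ids 2 and 3 appear in both tables)
nrow(left)    # 3 (John is kept, with NA for pet)
```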
table function
• The ChickWeight data has 578 rows (observations) and 4 variables.
• The four variables are weight, Time, Chick, and Diet.
• Function table creates tabular data as below.
• The function output has class table.
• The output holds the counts for each level of a factor (a categorical variable), such as the diet type.
• Some plot functions require a table as their input argument.
```
data(ChickWeight)
diet.counts <- table(ChickWeight$Diet)
class(diet.counts)
diet.counts
# ----- result -----
# [1] "table"
#
#   1   2   3   4
# 220 120 120 118
```
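As one example of a plot function that takes a table, barplot from base R can draw the counts directly. This is a sketch; the axis labels are my own choice.

```r
diet.counts <- table(ChickWeight$Diet)

names(diet.counts)                  # "1" "2" "3" "4"
as.vector(diet.counts)              # 220 120 120 118
round(prop.table(diet.counts), 2)   # share of rows per diet

# barplot() accepts the table and draws one bar per diet.
barplot(diet.counts, xlab = "Diet", ylab = "Number of rows")
```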

### 03. function

• The following code is an example of an R function.
• It resembles a JavaScript function, apart from the assignment operator.
• return is itself a function, so the returned value goes in parentheses: return(value).
• A function is the main way to encapsulate a block of code.
• Unlike typical object-oriented languages, R code usually relies on functions rather than classes for encapsulation.
```
# define a function
my.function <- function(a, b) {
  sum = a + b
  double.sum = sum * 2
  return(double.sum)
}

# set the function arguments, and call the function
p1 <- 2
p2 <- 3
result <- my.function(p1, p2)

# output the result
print(result)
```
notes:
• If you select the function definition and run it, the function is loaded into memory.
• Its class is function; its typeof is closure.
• A function must be loaded before it is used.
• If the function comes from a package, loading the package takes care of this.
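The notes above can be verified in the console. The sketch below defines a hypothetical variant named double.sum, with a default argument and an implicit return, and then inspects it.

```r
# Hypothetical variant of the function above with a default argument.
double.sum <- function(a, b = 0) {
  (a + b) * 2          # the last evaluated expression is returned implicitly
}

double.sum(2, 3)       # 10
double.sum(2)          # 4, because b falls back to its default of 0

class(double.sum)      # "function"
typeof(double.sum)     # "closure"
exists("double.sum")   # TRUE once the definition has been run
```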

### 04. debug

• You can select one R statement and run it.
• See the result.
• Then select the next statement, and so on.
• Step through all the statements this way.
• For a plain script like that, no debugger is needed.
• If you have functions of your own, you need a debugging process.
• If you run into code problems, a debugging process is needed.
• If you want to learn how an open-source package works, a debugging process is helpful.
how to debug
• Use the code from the function section above.
• Add browser() before the statement sum = a + b.
• Add browser() after the statement sum = a + b.
• browser is an R function.
• It acts as a break point when the code executes.
• You can inspect the data at that location.
• At the top of the script window, click the Source icon, not Run, to execute in debug mode.
• The first browser() is highlighted.
• In the console, the prompt changes to Browse.
• Enter a; you can also see the value in the env window. a is 2.
• Click Next twice to reach the second break point.
• In the console, the prompt is still Browse.
• Enter sum; you can also see the value in the env window. sum is 5.
• Click the Stop icon to leave debug mode.
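The steps above can be sketched as the following function. The interactive() guard is my addition, not part of the steps: it lets the same file run unattended (for example with Rscript) without stopping at the break points.

```r
my.function <- function(a, b) {
  if (interactive()) browser()   # first break point: a and b are visible
  sum = a + b
  if (interactive()) browser()   # second break point: sum is now defined
  double.sum = sum * 2
  return(double.sum)
}

result <- my.function(2, 3)
print(result)   # 10
```

At the Browse prompt, typing n runs the next statement, c continues to the next break point, and Q quits the debugger. Base R also provides debugonce(my.function), which steps through a single call without editing the function body.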

### 05. R files

05-1 separation of R files
• The following example demonstrates how to split R code across multiple files.
• In the same folder, create two files: peter_main.R and peter_add.R.
• With peter_main.R open, click Run to see the result.
• In peter_main.R, function source loads peter_add.R; the next line uses its function.
• The result is 12.
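The contents of the two files are not reproduced here, so the following is only a self-contained sketch of the same idea. The file name helper_add.R and the function add.numbers are hypothetical, and the helper file is written to a temporary folder so the example runs anywhere.

```r
# Write a helper file to a temporary folder (hypothetical name and contents).
helper.path <- file.path(tempdir(), "helper_add.R")
writeLines("add.numbers <- function(a, b) { a + b }", helper.path)

# source() runs the file, which defines add.numbers in the current session.
source(helper.path)
add.numbers(2, 3)   # 5
```

In a real project the helper file would sit next to the main script, and the main script would call source with that file's path.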