25.    RStudio and R

November 21, 2018


01. R and RStudio

02. Four window panes in RStudio

02x. Data in R

console result desc & clean up
?diamonds in the help window, provides a description of the data set and its variables.
data(package = "ggplot2") in the top-left window, lists all the data sets in the package. The data set diamonds is in the list. After viewing the info, you can close it.
head(diamonds) in the console, you can see the first six rows. Clean the console using the brush.
x <- diamonds in the env window
under Data
x 53940 obs, 10 variables
y <- c(1, 2, 3) in the env window
under Values
y num[1:3] 1 2 3
c(1,2,3) is a vector. R vector is 1-based.
clean the env window using the brush
View(diamonds) in the top-left window, you see all the rows in a spreadsheet-like format.
click filter icon
    column carat
  • data like 0.23, 0.21, 0.29, 0.71...
  • That's a continuous quantitative value
  • With filter icon clicked, adjust the range to filter.
    column cut
  • data like Ideal, Premium, Good, Very Good...
  • That contains categorical data.
  • With filter icon clicked, select one category, such as 'Good'.
After this lab, clean up.
summary(diamonds) console window, you get the summary report.
    column carat
  • Aggregate data: Min, 1st Quartile, Median, Mean, 3rd Quartile, Max
  • Median is the middle value.
  • With all the values listed in sorted order, the 1st Qu. is the value midway (by position) between the min and the median....
    column cut
  • 5 values - Fair, Good, Very Good, Premium, Ideal.
  • Each category has its frequency number.
s1 = subset(diamonds, cut %in% 'Fair') creates a subset where cut is 'Fair'.
s2 = subset(diamonds, color %in% 'E' & price > 10000)$price color in category E, price greater than 10000, 1 variable only. The subset has 590 obs
Using the subset function is convenient.
It is not as flexible as using package sqldf, which requires SQL knowledge.
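
The subset calls above can also be written with plain logical indexing. The sketch below is only an illustration: it uses the built-in mtcars data frame as a stand-in for diamonds (an assumption, so that it runs without ggplot2 installed), and the names s_subset and s_index are made up for this example.

```r
# mtcars ships with base R; it stands in for diamonds here (assumption)
s_subset <- subset(mtcars, cyl %in% 4 & mpg > 30)$mpg
s_index <- mtcars$mpg[mtcars$cyl == 4 & mtcars$mpg > 30]

# Both approaches select the same values
identical(s_subset, s_index)

# R vectors are 1-based: the first element has index 1
y <- c(1, 2, 3)
y[1]
```

subset is shorter to type in the console, while logical indexing makes the row selection explicit.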

03. Import CSV

04. Plot


05. Data from SQLite

        sqlite> create table dogs(name text, age int);
        sqlite> insert into dogs(name, age) values('Tairo', 8);
        sqlite> insert into dogs(name, age) values('Emi', 7);
        sqlite> select * from dogs;
        sqlite> .quit
        Ingres:peter_sqlite peterkao$ ls

Now, the sqlite database is ready.
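
One possible way to read such a table into R is through packages DBI and RSQLite. This is a minimal sketch, assuming both packages are installed; it rebuilds the dogs table in an in-memory database rather than opening the file created above, because the file name is not given here.

```r
library(DBI)

# Assumption: an in-memory database instead of the file created above
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "create table dogs(name text, age int)")
dbExecute(con, "insert into dogs(name, age) values('Tairo', 8)")
dbExecute(con, "insert into dogs(name, age) values('Emi', 7)")

# dbGetQuery returns the query result as a data frame
dogs <- dbGetQuery(con, "select name, age from dogs order by age")
print(dogs)

dbDisconnect(con)
```

To read a database file instead, pass its path to dbConnect in place of ":memory:".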

06. Processing data frame with function sqldf

        # -----    demo 1  ------------------
        # execute data() to list all the built-in datasets.
        # select one, diamonds in package ggplot2
        # execute diamonds to view the content
        # make sure packages ggplot2 and sqldf are installed, then load them
        library(ggplot2)
        library(sqldf)
        # execute the following code to get a subset.
        result <- sqldf("select carat, clarity, price      
                         from diamonds
                         where price < 350
                         order by carat")
        # -----    demo 2  -----------------
        #  Package datasets,    dataset  ChickWeight
        #  make sure packages ggplot2, sqldf are installed and loaded
        sqldf("select Chick, median(weight)
               from ChickWeight
               group by Chick
               order by Chick")
        # execute the code in the console, the sort order is not correct
        # after examining the type of Chick: it is a factor, not type int, so it sorts as text
        # cast it to int in the order by clause as below, then ok
        sqldf("select Chick, median(weight)
               from ChickWeight
               group by Chick
               order by cast(Chick as int)")
        # The output has aggregate data. You can use the steps in section 7 for distribution.
        # Populate the result to a data frame, and convert it to a csv file.
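
The same median-per-chick summary can be sketched in base R without SQL. This is an illustration only, not a replacement for the sqldf demo; the names med and chick_no are made up for this example.

```r
# Median weight per chick; ChickWeight ships with base R
med <- aggregate(weight ~ Chick, data = ChickWeight, FUN = median)

# Chick is an ordered factor, so convert via character to get the chick number
chick_no <- as.integer(as.character(med$Chick))
med <- med[order(chick_no), ]

head(med)
```

Converting through as.character first matters: calling as.integer directly on a factor returns the internal level codes, not the chick numbers.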

07. Export CSV

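        my.df <- diamonds
        # write the data frame to a comma-separated file in the working directory
        write.table(my.df,
                    file = "diamonds.csv",
                    sep = ",",
                    row.names = FALSE,    # keeps the header aligned with the data columns
                    col.names = colnames(diamonds),
                    qmethod = "escape")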
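
A quick way to check an export is to read the file back and compare it with the original. The sketch below assumes the built-in iris data frame and a temporary file, so that it does not overwrite anything in the working directory.

```r
# Write a built-in data frame to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
write.csv(iris, file = path, row.names = FALSE)

back <- read.csv(path, stringsAsFactors = TRUE)

# Same shape and same column names as the original
dim(back)
names(back)

unlink(path)  # remove the temporary file
```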

08. Import a csv file from web

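        library(RCurl)    # provides getURL and curlOptions
        url <- 'http://www.mj-go-test.com/peter091218.csv?accessType=DOWNLOAD'
        # download the file content as one string, following redirects
        result  <- getURL(url, .opts = curlOptions(followlocation = TRUE))
        out <- read.csv(textConnection(result))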

09. Create HTML page from RStudio

        ....
        <!--begin.rcode
        summary(cars)
        end.rcode-->
        ....
        <!--begin.rcode fig.width=7, fig.height=6
        plot(cars)
        end.rcode-->


10. package ggplot2

10.0 What is ggplot2 and gg

10.1 histogram

        ggplot(mtcars, aes(mpg))+
        geom_histogram(binwidth = 5)+
        ggtitle("    vehicle count vs miles per gallon")+ 
        xlab("miles per gallon") + ylab("number")
        - ?mtcars to see the data description.
        - mpg is a variable, miles per gallon.
        - View(mtcars) to view its contents.
        - rowname is car model, like Merc 280c.
        - typeof(mtcars$mpg) returns "double": continuous, not discrete

        - ggplot is a function.
        - aes(mpg) maps mpg to the x-coordinate.
        - one variable only.
        - y-coordinate is the count of vehicles for each range of mpg
        geom_histogram, ggtitle, xlab, ylab are all functions.
        - for the 3rd bin as example
        - in x-coordinate, range: about 17.5-22.5, the bin width is 5
        - in y-coordinate, 11 count
        - geom_histogram function presents histogram plot.
        - Other three functions present some descriptive info.
        - The best way to explain when to use a histogram is by scenario.
        - For example, in a specific city in a specific year, how many women give birth between ages 15-16, 17-18,...36-38...
        geom_density function shows a curve instead.
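
The bin counts behind such a histogram can be checked without drawing anything. The sketch below uses base function hist with plot = FALSE; the break points are an assumption, chosen so that each bin is 5 wide and centered on a multiple of 5, and may not match every plotting default.

```r
# Assumption: breaks chosen so each bin is 5 wide and centered on a multiple of 5
breaks <- seq(7.5, 37.5, by = 5)
h <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)

# Vehicle count per bin, labeled by bin center
counts <- setNames(h$counts, h$mids)
counts
```

The third entry is the count for the bin centered on 20.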

10.2 Scatter Plot

scatter plot - 3 variables
        # note: The above image is for 3 variables.

        # scatter plot,  x-y correlation
        # two variables
        ggplot(mtcars,aes(x=wt, y=mpg))+
        geom_point()+
        ggtitle("correlation for wt and mpg")

        # three variables
        ggplot(mtcars,aes(x=wt, y=mpg, color=as.character(cyl)))+
        geom_point()+
        ggtitle("correlation for wt and mpg")
        typeof(mtcars$cyl)          #double

        #When color is mapped to a double, ggplot2 draws a continuous color gradient.
        #The third variable is for grouping, and a double is not treated as a group.
        #Casting from double to character makes the data discrete, so it can group.
        # in these cases, you can see that wt and mpg are clearly correlated.
        # Lighter cars consume less gas.
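
The relationship seen in the scatter plot can also be expressed as a number. This is a small sketch using base function cor; the names r and groups are made up for this example.

```r
# Pearson correlation between weight and miles per gallon
r <- cor(mtcars$wt, mtcars$mpg)
r  # negative: heavier cars tend to have lower mpg

# Converting cyl to character gives a small set of discrete groups
groups <- sort(unique(as.character(mtcars$cyl)))
groups
```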

10.3 box plot

        #box plot

        ggplot(mtcars,aes(x=as.character(cyl),y=mpg)) +
        geom_boxplot()

        #cyl is the grouping variable. It is type double, which ggplot2 treats as continuous.
        #   it must be cast to type character, which is discrete.
        #Present with function geom_boxplot.
        #in x-coordinate, there are 3 groups for cyl - 4,6,8
        #in y-coordinate, there are statistical data for mpg.
        #    the dark thick line stands for the median, the middle value.
        #    the top edge of the box is the upper quartile; 25% of the values lie above it.
        #    the bottom edge of the box is the lower quartile; 25% of the values lie below it.
        #    If you zoom the graphics, you can also see the outliers.
        #The purpose is to compare the distribution (for example the median, not the average)
        # of a variable, like mpg, across different groups, like cyl.
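
The numbers that a box plot summarizes can be computed directly. The sketch below uses base function fivenum, which returns hinges; these can differ slightly from the quartiles a plotting package draws, so treat it as an approximation of the boxes rather than an exact reproduction.

```r
# Five-number summary of mpg for each cyl group
# rows: minimum, lower hinge, median, upper hinge, maximum
stats <- sapply(split(mtcars$mpg, mtcars$cyl), fivenum)
stats

# Median mpg per cyl group, the thick line in each box
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
medians
```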

11. package ggplot2 - case study, Titanic

Study Resource


Data and pre-process

        titanic <- read.csv("titanic.csv", stringsAsFactors = FALSE)
        View(titanic)

        # Set up factors.
        titanic$Pclass <- as.factor(titanic$Pclass)
        titanic$Survived <- as.factor(titanic$Survived)
        titanic$Sex <- as.factor(titanic$Sex)

Analysis 1 - What was the survival rate?

            ggplot(titanic, aes(x = Survived)) + 
            geom_bar()

Analysis 2 - What was the survival rate by gender?

                ggplot(titanic, aes(x = Sex, fill = Survived)) + 
                theme_bw() +
                geom_bar() +
                labs(y = "Passenger Count",
                       title = "Titanic Survival Rates by Sex")

Analysis 3 - What was the survival rate by ticket class?

                ggplot(titanic, aes(x = Pclass, fill = Survived)) + 
                theme_bw() +
                geom_bar() +
                labs(y = "Passenger Count",
                       title = "Titanic Survival Rates by Pclass")
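
The bar charts show counts; the rates themselves can be computed with prop.table. The sketch below is an illustration on a different data source: it assumes the built-in Titanic table that ships with R (2201 people, crew included), not the titanic.csv file used in this case study, so the numbers will not match the plots.

```r
# Assumption: the built-in Titanic table, not the titanic.csv file used above
by_sex <- margin.table(Titanic, c(2, 4))  # Sex x Survived counts
by_sex

# Row-wise proportions turn the counts into survival rates per sex
rates <- prop.table(by_sex, margin = 1)
round(rates, 2)
```

With the CSV data, the same idea would be table(titanic$Sex, titanic$Survived) followed by prop.table with margin = 1.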


Analysis 4 - What was the survival rate by ticket class and gender?

            ggplot(titanic, aes(x = Sex, fill = Survived)) + 
            theme_bw() +
            facet_wrap(~ Pclass) +
            geom_bar() +
            labs(y = "Passenger Count",
                   title = "Titanic Survival Rates by Pclass and Sex")


Analysis 5 - What is the distribution of passenger ages?

            ggplot(titanic, aes(x = Age)) +
            theme_bw() +
            geom_histogram(binwidth = 5) +
            labs(y = "Passenger Count",
                 x = "Age (binwidth = 5)",
             title = "Titanic Age Distribution")


Analysis 6 - What are the survival rates by age?

        ggplot(titanic, aes(x = Age, fill = Survived)) +
        theme_bw() +
        geom_histogram(binwidth = 5) +
        labs(y = "Passenger Count",
             x = "Age (binwidth = 5)",
         title = "Titanic Survival Rates by Age")

        ggplot(titanic, aes(x = Survived, y = Age)) +
        theme_bw() +
        geom_boxplot() +
        labs(y = "Age",
            x = "Survived",
            title = "Titanic Survival Rates by Age")



12. Pick a Plot Sample and Clone it

13. formula

--------------- example from package lattice -------------------------------


        #formula  y ~ x 
        xyplot(Sepal.Width ~ Sepal.Length, iris)

        #plot multivariate data
        #display the relationship between variables Y and X separately 
        #for every level of factor A.
        #formula  y ~ x | a
        xyplot(Sepal.Width ~ Sepal.Length | Species, iris)
        # for the following three, the console output is "formula" in each case.
        class(~ Sepal.Length)
        class(~ Sepal.Length | Species)     
        class(Sepal.Width ~ Sepal.Length | Species) 

--------------- example from package ggplot2 -------------------------------


        ggplot(iris, aes(x=Sepal.Length, 
            y=Sepal.Width)) + geom_point() + facet_wrap(~Species, nrow = 2)
        class(~Species)    # in console, output is formula
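
A formula can be stored and inspected like any other R object, which is how both packages receive the variable names. A minimal sketch in base R; the name f is made up for this example.

```r
# A formula is an unevaluated object; the variables need not exist yet
f <- Sepal.Width ~ Sepal.Length | Species

class(f)     # "formula"
all.vars(f)  # variable names used in the formula
f[[2]]       # left-hand side: Sepal.Width
```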


14. Base Graphics and Package lattice

                #Base graphics is the default
                # plot(x, y)
                plot(iris$Sepal.Length, iris$Sepal.Width, cex=0.4)
                #package lattice
                # xyplot(y ~ x)        xyplot(y ~ x | a)
                xyplot(Sepal.Width ~ Sepal.Length, iris)
                xyplot(Sepal.Width ~ Sepal.Length | Species, iris)

15. if-then-else, for-loop, apply

        #if-then-else
        x <- 50
        if (x == 50) {
          print("the value is 50")
        } else {
          print("the value is not 50")
        }

        #for-loop
        x <- c(2,5,3,9,11,6)
        count <- 0
        for (val in x) {
          if (val %% 2 == 0) count = count + 1
        }
        print(count)

        #function apply ---------------------------
        WorldPhones    #matrix, total phone counts, row for years, column for regions
        apply(WorldPhones, 1, mean)
        apply(WorldPhones, 2, mean)
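
The loop and the apply calls each have a shorter vectorized form. The sketch below shows one possible rewrite, assuming the same vector x and the built-in WorldPhones matrix; the names by_year and by_region are made up for this example.

```r
x <- c(2, 5, 3, 9, 11, 6)

# Vectorized count of even values, replacing the explicit for-loop
count <- sum(x %% 2 == 0)
count

# rowMeans and colMeans match apply(..., mean) on a numeric matrix
by_year <- rowMeans(WorldPhones)
by_region <- colMeans(WorldPhones)
all.equal(by_year, apply(WorldPhones, 1, mean))
```

The explicit loop is still useful when each step depends on the previous one; for simple counting and averaging, the vectorized forms are usually easier to read.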