description

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures.
Both matplotlib.pyplot and seaborn provide functions for single plots and multi-plot figures.
I find seaborn easier for both single plots and multi-plots.
I use the functions in matplotlib.pyplot only when needed.
There are two ways to create multi-facet plots.
The first way is to use a constructor, such as FacetGrid, to create a lattice.
Then use its map method to draw each subplot in a loop.
The second way is to pass the col and row parameters to a seaborn figure-level function.
I find the first way easier. I use Visual Studio Code for my labs.
My learning materials for seaborn single plots are from http://seaborn.pydata.org/tutorial.html
My learning materials for seaborn multi-plots are from http://seaborn.pydata.org/api.html, including Multi-plot grids, Facet grids, Pair grids, and Joint grids.

overview

When you do a statistical analysis, first define what you want to see, then get and process the data, and finally choose a plot type. Once you have the plot, you can start the analysis. This overview helps with choosing the plot type.

plotting without grid lattice - mainly for a single plot
- relationship, relplot
  - scatter, the mainstay
  - line, for continuity with time-series data; usage examples include stock prices and fMRI labs
- distribution, distplot
- regression - get the scatter plot first, then select a model and estimate it from the data, for prediction purposes
  - regplot
    - simpler, easier
    - axes-level interface
    - for linear regression model
  - lmplot
    - figure-level interface
    - more options, for example for non-linear estimation
plotting with grid lattice - create a grid structure first, then draw all the subplots.
- FacetGrid - relationship
- PairGrid - relationship
- JointGrid - relationship and distribution

03. CSV for Seaborn data

    import pandas as pd

    tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
    
    print(tips.head(2))
    print('type of tips = ' + str(type(tips)))

    '''
        total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    '''

data source

CSV is the main data format.
The data must take the form that Hadley Wickham calls “tidy”. It is long-form data.
It is unlike other data formats, such as JSON, which has a hierarchical key-value structure.
Each column is a variable. Each row is an observation.
The type of the loaded data is pandas.core.frame.DataFrame.
The pd.read_csv function is used to read any CSV file.
If the CSV file is one of the seaborn demo datasets, you can save some typing as below:
import seaborn as sns ... tips = sns.load_dataset('tips')
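
The difference between a hierarchical key-value structure and tidy rows can be sketched as follows. The nested dictionary and its values are hypothetical, made up only for this illustration.

```python
import pandas as pd

# Hypothetical hierarchical (JSON-like) records: one key per day, nested bills.
nested = {
    "Sun": [{"total_bill": 20.0, "tip": 3.0}, {"total_bill": 10.0, "tip": 1.5}],
    "Sat": [{"total_bill": 30.0, "tip": 4.0}],
}

# Flatten to tidy rows: each row is one observation, each key becomes one column.
rows = [{"day": day, **bill} for day, bills in nested.items() for bill in bills]
tidy = pd.DataFrame(rows)

print(tidy)
```

After flattening, day is an ordinary column, so it can be passed to seaborn by name like any other variable.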

variable types

The type of a variable is inferred from its values:
total_bill, 16.99, is a float
time, Dinner, is a str
size, 2, is an int
This is dynamic typing, as in JavaScript. No variable type definitions are required, unlike SQL or Java.
total_bill and tip are of type float.
sex, smoker, day, and time are of type str.
size is of type int.
Variables of type str, and int variables with few distinct values such as size, can be treated as categorical variables.
The sns.catplot function makes plotting with categorical variables easier.

Besides x and y, you can add more dimensions to a plot by mapping categorical variables
to semantic parameters such as hue and style.
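
The type inference can be checked directly. This is a minimal sketch that reads the two rows printed in section 03 from an in-memory string instead of a file; the variable names are my own.

```python
import io
import pandas as pd

# The header and the first two rows of tips.csv, as printed in section 03.
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
"""

tips_head = pd.read_csv(io.StringIO(csv_text))
print(tips_head.dtypes)  # one inferred type per column

# Check the inferred types without depending on the exact dtype names.
is_float = pd.api.types.is_float_dtype(tips_head["total_bill"])
is_int = pd.api.types.is_integer_dtype(tips_head["size"])
is_text = pd.api.types.is_string_dtype(tips_head["sex"])
```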

04. figure-level functions vs axes-level functions

4.1 figure-level functions for one plot

        import matplotlib.pyplot as plt
        import seaborn as sns

        tips = sns.load_dataset("tips")
        g = sns.relplot(x="total_bill", y="tip", data=tips)

        print(' type of tips = ' + str(type(tips)))
        print(' type of the method returns ' + str(type(g)))

        plt.show()

review

tips is a pandas.core.frame.DataFrame.
The function name relplot is generic; the kind parameter selects the plot type.
kind: the default is scatter.
The function returns a seaborn.axisgrid.FacetGrid with one facet, as it is a figure-level function.

changing figure size and adding title for plotting from figure-level function

Change the figure size by adding a parameter such as height=10, which keeps the default aspect ratio.
Add a title by adding the following two lines:
fig = g.fig
fig.suptitle("my title", fontsize=12)
These do NOT apply to matplotlib...AxesSubplot.
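
Put together, the size and title changes for a figure-level plot look roughly like this. It is a sketch on a hand-made DataFrame named sample (hypothetical), with height=3 instead of 10 to keep the figure small.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import pandas as pd
import seaborn as sns

# Hand-made sample with tips-like column names (assumption, not the real dataset).
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.5, 3.0, 5.0],
})

# height sets the height of each facet in inches; the aspect ratio stays at its default.
g = sns.relplot(x="total_bill", y="tip", data=sample, height=3)

# The title goes on the figure that the FacetGrid owns.
fig = g.fig
fig.suptitle("my title", fontsize=12)
```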

4.2 axes-level functions for one plot

        import seaborn as sns
        import matplotlib.pyplot as plt
        tips = sns.load_dataset("tips")
        sns.set()
        ax = sns.scatterplot(x="total_bill", y="tip", data=tips)

review

The type returned from the function is matplotlib.axes._subplots.AxesSubplot

changing figure size and adding title for plotting from axes-level function

Before sns.scatterplot, add one line of code: plt.figure(figsize=(15,8))
After sns.scatterplot, add one line of code: ax.set_title('my scatterplot test')
These do NOT apply to figure-level plotting.
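
For an axes-level plot, the order of the calls matters: the figure is sized before the plot is drawn, and the title is set on the returned axes afterwards. A minimal sketch, assuming a hand-made DataFrame named sample (hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made sample with tips-like column names (assumption, not the real dataset).
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.5, 3.0, 5.0],
})

plt.figure(figsize=(15, 8))           # create and size the figure first
ax = sns.scatterplot(x="total_bill", y="tip", data=sample)
ax.set_title("my scatterplot test")   # ax only exists after the plotting call
```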

4.3 figure-level functions for many subplots

cookie-cutter style
All the subplots have the same configuration.
See topics 21-27.

4.4 axes-level functions for many subplots

    ...
    f, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4)) # one row, 2 columns

    sns.boxplot(x="day", y="tip", data=tips, ax=axes[0])                         #plot 1
    sns.scatterplot(x="total_bill", y="tip", hue="day", data=tips, ax=axes[1])   #plot 2
    
    plt.legend()                         
    plt.show()

This is not cookie-cutter style.
The subplots can have different configurations.

11a. x-y relationship, relplot, scatter

relplot shows the relationship between two variables in x-y coordinates.
The default kind is scatter.
The function returns a seaborn.axisgrid.FacetGrid.

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    tips = sns.load_dataset("tips")
    sns.set(style="ticks")
            
    g = sns.relplot(x="total_bill", y="tip", hue="day", data=tips, height=4)
    print('type of g = ' + str(type(g)))  
    
    # relplot draws its own legend for hue, so plt.legend() is not needed
    plt.show()

11b. x-y relationship, continuity, relplot, line

Many x-y relationships involve time and continuity.
A daily stock price and a monthly real estate price index are examples.
fMRI labs use timepoint (the measurement unit) as the x-coordinate and signal as the y-coordinate.
- There are many signals for one timepoint.
- The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean.
- The generated plot looks like a wave. The data is a time series.
method: relplot, kind="line"

    fmri = sns.load_dataset("fmri")
    sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)
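
The aggregation can be reproduced by hand to see what the line and the band stand for. The sketch below is a simplified percentile bootstrap on hand-made numbers; it is not seaborn's own code, and the helper name mean_with_ci is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hand-made measurements: three signals per timepoint (not the real fmri data).
data = pd.DataFrame({
    "timepoint": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

def mean_with_ci(values, n_boot=1000, ci=95):
    """Return the mean and a bootstrap percentile interval around it."""
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

# One (mean, low, high) triple per x value: the line and the shaded band.
summary = {t: mean_with_ci(group["signal"]) for t, group in data.groupby("timepoint")}
```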

12. distribution, distplot, histogram

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    n = np.random.normal(size=100)
    ax = sns.distplot(n)
    plt.show()

description

Create 100 random numbers from a normal distribution.
The data is univariate and goes on the x-coordinate.
The function distplot draws a histogram by default
and fits a kernel density estimate (KDE).
The y-coordinate is the estimated probability density at each x value, not a count of occurrences.
The area under the KDE curve is 1.
distplot is an axes-level function. In seaborn 0.11 and later it is deprecated in favor of histplot and displot.
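
The statement that the area is 1 can be checked numerically. This is a sketch with numpy only; the Gaussian KDE is written out by hand with a Scott's-rule bandwidth, which is my assumption and not necessarily what seaborn uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(size=100)

# Histogram normalized to a density: bar height times bar width sums to 1.
density, edges = np.histogram(n, bins=10, density=True)
hist_area = float(np.sum(density * np.diff(edges)))

# Gaussian KDE by hand: one small bell curve per data point, then averaged.
bandwidth = n.std(ddof=1) * n.size ** (-1 / 5)  # Scott's rule (assumption)
grid = np.linspace(n.min() - 5 * bandwidth, n.max() + 5 * bandwidth, 2000)
z = (grid[:, None] - n[None, :]) / bandwidth
kde = np.exp(-0.5 * z ** 2).sum(axis=1) / (n.size * bandwidth * np.sqrt(2 * np.pi))

# Trapezoidal rule: the area under the KDE curve is approximately 1.
kde_area = float(np.sum((kde[1:] + kde[:-1]) / 2 * np.diff(grid)))
```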

13a. linear regression model

description

Regression attempts to determine the relationship between one dependent variable (usually denoted by Y) and
a series of other variables (known as independent variables, usually denoted by X).
The tips dataset provides a good example: the larger the group size, the bigger the tip tends to be.
Two main functions are used to visualize a linear relationship
as determined through regression - regplot() and lmplot().
Both functions draw a scatterplot of two variables, x and y,
and then fit the regression model y ~ x and
plot the resulting regression line and a 95% confidence interval for that regression.
Bad data means outliers. They are not included. The regression line is an estimate, not pure fact.
This is the difference between the data and estimations.
Examine the scatter plot of the data to decide whether to use a linear or a non-linear model.
The following code uses the linear model for both functions.
lmplot is a figure-level function.
lmplot() has data as a required parameter
and the x and y variables must be specified as strings.
This data format is called “long-form” or “tidy” data.
lmplot handles both linear and nonlinear regression models.
regplot is an axes-level function.
regplot() accepts the x and y variables in a variety of formats.

    # --- Using the linear regression model, way 1: lmplot (figure-level)
    g = sns.lmplot(x="total_bill", y="tip", data=tips)

    # --- Using the linear regression model, way 2: regplot (axes-level)
    ax = sns.regplot(x="total_bill", y="tip", data=tips)
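
What fitting y ~ x means can be shown with a plain least-squares fit. This is a sketch with numpy.polyfit on hand-made points; it illustrates the model, not how seaborn computes the line internally.

```python
import numpy as np

# Hand-made points that lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Degree 1 means a straight line: y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted line is then used for prediction at a new x value.
predicted = slope * 5.0 + intercept
```

With real data the points scatter around the line, which is why the plot also shows a confidence interval.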

13b. non-linear regression model

If the scatter plot shows that the data follows a non-linear pattern,
use a non-linear regression model.
A code sample is below:

    anscombe = sns.load_dataset("anscombe")
    ...
    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           order=2,             # nonlinear (second-order polynomial)
           #order=1,            # linear
           #order=3,            # nonlinear
           #order=5,            # nonlinear
           ci=None)
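
The effect of the order parameter can be sketched with numpy.polyfit on hand-made curved data: a straight line leaves a large error, a second-order polynomial fits exactly. The helper name residual is hypothetical, and this illustrates the idea rather than seaborn's internals.

```python
import numpy as np

# Hand-made curved data: y = x squared on the points -3, -2, ..., 3.
x = np.arange(-3.0, 4.0)
y = x ** 2

def residual(order):
    """Sum of squared errors of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, deg=order)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

linear_error = residual(1)     # a straight line cannot follow the curve
quadratic_error = residual(2)  # a second-order polynomial matches it
```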

14a. categorical-data, catplot, scatter

description

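The data is the tips dataset.
The variable day is categorical and goes on the x-coordinate.
The variable total_bill is numeric and goes on the y-coordinate.
The plot shows the relationship between a categorical and a numeric variable.
The default kind is strip, a categorical scatter plot.
The function returns an object of class seaborn.axisgrid.FacetGrid.
catplot is a figure-level function and easy to use.
The dots align vertically for each day.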

    g = sns.catplot(x="day", y="total_bill", data=tips)

21. sns.FacetGrid and plt.hist

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    tips = sns.load_dataset('tips')       
    
    sns.set()
    g = sns.FacetGrid(tips, col="time", row="smoker")
    g = g.map(plt.hist, "total_bill")  
    
    plt.show()

data

The dataset is the tips.csv file.
After being loaded, it is a pandas DataFrame.
It must take the form that Hadley Wickham calls “tidy” data.
Each column is a variable.
Each row is an observation.

description

seaborn.FacetGrid creates the 2 * 2 lattice for 4 plots based on the values of the column and row variables.
matplotlib.pyplot.hist draws a histogram in each plot.
- For the x-coordinate, the variable is total_bill.
- total_bill is a quantitative variable.
- Many settings are chosen automatically, including the bins and scales.
- To construct a histogram, the first step is to "bin" (or "bucket") the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval for the y-coordinate.
- A histogram is a representation of the distribution of numerical data.
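
The binning step can be written out in a few lines of plain Python. This is a sketch with equal-width bins on made-up values; the function name bin_counts is hypothetical, and it assumes the values are not all equal.

```python
def bin_counts(values, n_bins):
    """Split the value range into n_bins equal intervals and count per interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

edges, counts = bin_counts([1, 2, 2, 3, 5], n_bins=2)
print(edges, counts)
```

The counts are what the y-coordinate of the histogram shows; plt.hist chooses the number of bins itself when none is given.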

My analysis of the g.map mechanism

In FacetGrid's constructor, a matrix is created based on the values of the column and row variables.
In the seaborn.FacetGrid.map method, there is a loop over the matrix elements.
For each element, the loop does the following:
- Prepare the subset of the data that matches the element's column and row values.
- Take the first argument, which is the plotting function.
- Python supports functions as first-class values, so a function argument can itself be a function.
- Call the function with its data.
- Build the plot for that element.
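
The loop described above can be imitated without any plotting. The sketch below is my own simplification, not seaborn's source: the hypothetical function map_facets takes a function as an argument, calls it on each row/column subset, and collects the results.

```python
import pandas as pd

def map_facets(data, row, col, func, *var_names):
    """Call func on the named columns of every (row value, col value) subset."""
    results = {}
    for row_value in data[row].unique():
        for col_value in data[col].unique():
            # Prepare the subset of the data for this lattice element.
            subset = data[(data[row] == row_value) & (data[col] == col_value)]
            args = [subset[name] for name in var_names]
            # func is passed in like any other value and called here.
            results[(row_value, col_value)] = func(*args)
    return results

# Hand-made sample with tips-like column names (assumption, not the real dataset).
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "smoker": ["No", "No", "Yes", "Yes"],
    "time": ["Dinner", "Lunch", "Dinner", "Lunch"],
})

# len stands in for a plotting function such as plt.hist.
counts = map_facets(sample, "smoker", "time", len, "total_bill")
```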

22. sns.FacetGrid and plt.scatter

    ...
    g = sns.FacetGrid(tips, col="time", row="smoker")
    g.map(plt.scatter, "total_bill", "tip")
    ...

Description

seaborn.FacetGrid creates the 2 * 2 lattice for 4 plots based on the values of the column and row variables.
matplotlib.pyplot.scatter draws a scatter plot in each plot.
- Two variables: total_bill for x, tip for y.
- Relational plotting

plt.scatter usage differences with and without seaborn.FacetGrid

---- using FacetGrid -----------
seaborn.FacetGrid.map method is needed
to apply a plotting function to each facet’s subset of the data.

----- not using FacetGrid ------
The code is as below:
plt.scatter(tips['total_bill'], tips['tip'])

23. sns.FacetGrid and parameter hue

Description

In the sns.FacetGrid constructor
- The variable time is still used for the lattice. Two plots: Lunch and Dinner.
- The variable smoker is no longer used for the lattice. It is now used as hue, giving two different colors for the dots.
- Code for the legend is needed.

    ...
    g = sns.FacetGrid(tips, col="time",  hue="smoker")    
    g.map(plt.scatter, "total_bill", "tip")
    g.add_legend()                                    
    
    # g.add_legend() already draws the legend, so plt.legend() is not needed
    plt.show()

25. sns.FacetGrid and col_wrap

        ...        
        att = sns.load_dataset('attention') 
        sns.set()  
        
        g = sns.FacetGrid(att, col="subject", col_wrap=5, height=1.5)
        g = g.map(plt.plot, "solutions", "score", marker=".")
        ...

Description

Select one variable (subject in this example) as the column variable.
col_wrap=5 wraps the subplots around into rows of 5.
Each subplot shows the relationship between two numeric variables.
The plt.plot function is used. It draws a line by default.
solutions is the x-coordinate; score is the y-coordinate.
The dataset is called attention, and it is properly sorted.
Within a subject, solutions is in ascending order.

26. sns.PairGrid and plt.scatter

  
        ...  
        iris = sns.load_dataset('iris') 
        sns.set()  
        g = sns.PairGrid(iris)
        g.map(plt.scatter)
        ...

Description

Each row in the Iris flower dataset contains 4 numeric variables giving the length or width of a flower part.
The sns.PairGrid constructor creates a matrix for the four variables.
It returns an object of class seaborn.axisgrid.PairGrid.
matplotlib.pyplot.scatter draws a scatter plot in each plot.
- Each variable is plotted against the other three and itself in x-y coordinates.
- Relational plotting
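
The size of the matrix follows from the number of numeric columns. A minimal sketch, assuming a hand-made DataFrame named flowers with three numeric columns (made-up values, not the real iris data), so the grid is 3 * 3:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made sample with iris-like column names (assumption, not the real dataset).
flowers = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0, 5.5, 6.5],
    "sepal_width": [3.0, 2.5, 3.5, 3.2, 2.8],
    "petal_length": [1.5, 4.0, 6.0, 1.4, 5.0],
})

g = sns.PairGrid(flowers, height=1.5)  # one row and one column per numeric variable
g.map(plt.scatter)                     # the same function is drawn in every cell
```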

27. sns.JointGrid and sns.regplot, sns.distplot

  
    g = sns.JointGrid(x="total_bill", y="tip", data=tips)
    g.plot(sns.regplot, sns.distplot)

sns.JointGrid

If you comment out the g.plot call and run the code, the plot shows only the empty lattice.
An object of class seaborn.axisgrid.JointGrid is created.
The g.plot method draws the joint plot.
The constructor creates the lattice for the g.plot method.
There are two parts.
- The main part.
- The second part is on the top margin and on the right margin.

g.plot

Based on the definition of seaborn.JointGrid.plot:
Parameters: joint_func, marginal_func: callables
Both are needed.
They are the functions that draw the bivariate and univariate plots.
Uncomment g.plot and the full plot is shown.
The joint_func
- sns.regplot, chosen for being adaptable
- total_bill in x, tip in y.
- scatter plot first, then the regression line
The marginal_func
- sns.distplot, chosen for being adaptable
- The one on the top margin is for total_bill.
- The one on the right margin is for tip.
- They are distribution plots.

34. Seaborn

March 11, 2019

Contents

description

overview

03. CSV for Seaborn data

04. figure-level functions vs axes-level functions

4.1 figure-level functions for one plot

4.2 axes-level functions for one plot

4.3 figure-level functions for many subplots

4.4 axes-level functions for many subplots

11a. x-y relationship, relplot, scatter

11b. x-y relationship, continuity, relplot, line

12. distribution, distplot, histogram

13a. linear regression model

13b. non-linear regression model

14a. categorical-data, catplot, scatter

14b. categorical-data, catplot, box

21. sns.FacetGrid and plt.hist

22. sns.FacetGrid and plt.scatter

23. sns.FacetGrid and parameter hue

24. sns.FacetGrid and plotting size

25. sns.FacetGrid and col_wrap

26. sns.PairGrid and plt.scatter

27. sns.JointGrid and sns.regplot, sns.distplot