02. database content to DATA FRAME directly

description

This topic is about getting data from a database, using MySql database.
Other databases like Oracle, mssql,etc, or document databases for clouds, have the same way like MySql.
In my previous topic 29, 05y, there is a lab for this.
- I did not use an IDE on my mac, I used website Repl.it to test.
- The data returned from the database server were many lines, not a pandas dataframe.
In this lab,
- In my mac, Anacode3 was installed for python, pandas...
- Code visual studio was install for IDE.
The steps to install the database driver with python for mac
- from Google, getting the package name for pip install. mysql-connector-python
- getting the package name for my python code mysql.connector
- open the terminal window,
- pip --version to make sure pip is installed.
- pip list to make sure mysql is not installed
- pip install mysql-connector-python
- note: the package is for all the operating system.
- pip list to see it in the list.
- To verify the success of the installation, using one line of python code import mysql.connector
- To uninstall, enter pip uninstall mysql-connector-python

        print('----------   mysql3.py --------')
        import pandas as pd
        import mysql.connector

        mydb = mysql.connector.connect(
          host="XXXX.winhost.com",
          user="xxxxxxxx",
          password="xxx2xxx",
          database="mysql_56314_peter"
        )

        df = pd.read_sql("SELECT * FROM dogs", con=mydb)

        print(df)

        '''
          dog_name dog_color
        0    Tairo     brown
        1      Emi     white
        2      Joy     black
        3    Lucky    yellow
        '''

        print('---------- end    test 5 --------')

comments:

If a sql query is used for creating a dataframe from a database server.
The query can be used to define the data.
No need to use some pandas functions, like sort, merge.. for that purposes.
Also, it is more direct.

03. creating dataframes for reporting

03.1 prepare statistical data

In addition to plotting, a dataframe can be used for reporting.
One example is to prepare statistical information.
Then, send it out as report

        print('-------  dataframe ==> csv file  for reporting ---')

        print('--- 1. prepare a dataframe -------')
        import pandas as pd

        df =pd.DataFrame({                       # python dictionary
            'name': ['Happy', 'Lucky', 'Wiwi'],  # key-value pairs
            'weight' : [35.50, 65.33, 42.50],
            'length' : [2.5,  4.3,  3.2],
            'color': ['brown', 'white', 'black']
        })
        print(df)
        # The values for color are not measurable,
        # The statistical data will not be generated for it.

        print('--- 2. create a report -------')
        info = df.describe()
        print(info)
        print('type of info = ' + str(type(info)))   # dataframe
        '''
                  weight    length
        count   3.000000  3.000000
        mean   47.776667  3.333333
        std    15.599347  0.907377
        min    35.500000  2.500000
        25%    39.000000  2.850000
        50%    42.500000  3.200000
        75%    53.915000  3.750000
        max    65.330000  4.300000

        note-1: The index are created using method describe.
        note-2: Method describe is defined in a standard way.
        '''

        print('--- 3. save the df to csv file ----')
        info.to_csv('dogs.csv', header = True, index = True)
        #the file will be in the parent of the current folder for my mac.

        print('---------  end of test  ---------')

04. Storing python objects

description

Python objects like dataframe, list, dictionary can be stored in files with file extension pickle.
Pickling means to store for later uses,
like a cucumber that is preserved in brine, vinegar, or the like.
They can be passed around in a unprotected manners.
The pickle file format is in binary.
If the python file is under the working folder, the pickle file will be saved under the same folder.

        import pickle        # built-in module

        print('--- 1.  pickle     write  ------------')

        my_dict = {1: 'Lucky', 2: 'Happy', 3: 'Joy'}
        pickle_out = open('my_dict.pickle', 'wb')   #write output file, byte                        
        pickle.dump(my_dict, pickle_out)            #from, to
        pickle_out.close()

        print('--- 2.  pickle     read  ------------')
        pickle_in = open('my_dict.pickle', 'rb') 
        dict2 = pickle.load(pickle_in)
        pickle_in.close()
        print(dict2)   #{1: 'Lucky', 2: 'Happy', 3: 'Joy'}
        print('type of dict2 = ' + str(type(dict2)))   #dict

        print('---      end     test  -----')

06. Numpy, Series

types in a dataframe

Many data types in a dataframe are converted into types defined in numpy.
The purposes are for efficiency, better interfaces with other languages like C, Java.

        print('--- 1, create df  from a python dictionary ----')
        import pandas as pd
        dogs = {
                  'name':  ['Lucky', 'Happy', 'Joy'],        # column names
                  'age':   [8, 6, 3],                        # column 1
                  'weight':[41.5, 32.7, 64.4],               # column 2
               }                        
        df = pd.DataFrame(dogs)
        print(df)

        c01 = df['age'][1]
        print('c01 = ' + str(c01))               # 6
        print('c01 type is ' + str(type(c01)))   # class 'numpy.int64'

        c11 = df['weight'][1]                    
        print(c11)                               # 32.7
        print('c11 type is ' + str(type(c11)))   # class 'numpy.float64'

numpy objects

A numpy object is an array.
It can be multi-dimensional.
It is a classic array.
You can create it or slice from a dataframe
The following code to use it for create a dataframe.

        import numpy as np
        import pandas as pd

        data = np.array([['', 'Col1', 'Col2','name'],        #python lists
                         [0, '11npa','12npa', 'A'],          #python lists
                         [1, '21npa','22npa', 'B']])         #python lists


        df = pd.DataFrame(data = data[1:,1:],
                          index=data[1:,0],
                          columns= data[0,1:])
        print(df)

        print('df.index = ' + str(df.index))
        print('df.columns = ' +str(df.columns))
        print('df.values = ' + str(df.values))

Get a column or a row in SERIES from a dataframe

The return data type is Series.
It is defined in pandas.
A series object has columns and rows.
It looks like a one-column dataframe.

        # continue to use the dataframe for dogs data
        print('--- 2. get a column ---------------')
        c1 = df['weight']
        print(' c1 = ' + str(c1))
        print('c1 type = ' + str(type(c1)))  # pandas.core.series.Series
        '''
        
        0    41.5
        1    32.7
        2    64.4
        
        '''
        print('--- 3. get a row ---------------')
        r1 = df.iloc[1]
        print(r1)
        print(' r1 type ' + str(type(r1)))   # pandas.core.series.Series
        '''   note: column names are on the left
        name      Happy
        age           6
        weight     32.7
        '''

Create Series Objects

First, create two Series objects from python lists.
Then, create a dictionary by using them.
Finally, prepare a dataframe from it.

        import pandas as pd
        c1 = pd.Series([22,55,33])     #  Series for one column values
        c2 = pd.Series([99,77,11])     #  Series for one column values
        print('c1 type is ' + str(type(c1))) # class pandas.core.series.Series

        dict = {'col1': c1,'col2': c2}  # two column names
        print('dict type is ' + str(type(dict)))    #class 'dict'

        df = pd.DataFrame(dict) # class dataframe
        print(df)

08.Columns first or Rows first

08.1 Columns First

        print('------- dataframe column test , Jan 16, 2019 -----')
        import pandas as pd

        print('--- 1. create df ----')
        import pandas as pd

        dogs = {
                  'name':  ['Lucky', 'Happy', 'Joy'],        
                  'age':   [8, 6, 3],                        
                  'weight':[41.5, 32.7, 64.4],              
               }                        
        df = pd.DataFrame(dogs)
        print(df)
        '''
            name  age  weight
        0  Lucky    8    41.5
        1  Happy    6    32.7
        2    Joy    3    64.4
        '''

        print('--- 2.  rearrange columns ----')
        
        cols2 = ['age', 'weight', 'name']          #python list
        df2 = df[cols2]
        
        print(df2)

        print('--- 3.  drop a column -------')
        df3 = df.drop('weight', 1)  # axis number, 1 for column
        print(df3)

        print('--- 4. add a column ---------')
        
        dogs_color = ['white', 'brown', 'black']    #python list, no need for Series
        df['color'] = dogs_color                    # color is the new column name
        
        print(df)

        print('--- 5. sort ---------')
        
        df5 = df.sort_values(by=['age'])             # no need from index for row
        
        print(df5)

        print('--- 6. access  one column ---------------------')
        column_weight = df['weight']
        print('column weight = ' + str(column_weight))
        '''
        0    41.5
        1    32.7
        2    64.4
        '''

        print('--- 7. access a cell --------------------------')
        weight4Joy = df['weight'][2]                # index is not set for row
        print('weight4Joy = ' + str(weight4Joy))    #64.4

        '''
        
        notes:
        1. There are TWO ways to access a cell in a dataframe.
        2. This is one way to get all the column first. Then select a row.
        
        

        '''

        print('---------- end of test 5 ------------------------')

08.2 Rows FIRST

numeric indices

In the first demo is to use numeric indices to access a dataframe.
Method iloc is used. [...] is for method arguments, not (...)
In this demo, they are automatically generated. The index type is RangeIndex.
Using this way, I'll not treat them as explicit and use loc to access them.

        import pandas as pd
        import pandas as pd
        dogs = {
                  'name':  ['Lucky', 'Happy', 'Joy'],        
                  'age':   [8, 6, 3],                        
                  'weight':[41.5, 32.7, 64.4],              
               }                        
        df = pd.DataFrame(dogs)
        print(df)
        print(df.index)   #RangeIndex(start=0, stop=3, step=1)
        '''
            name  age  weight
        0  Lucky    8    41.5
        1  Happy    6    32.7
        2    Joy    3    64.4
        '''
        d1 = df.iloc[1]                         # 1 row
        print('d1 = ' + str(d1))
        print('d1 type is ' + str(type(d1)))    #class 'pandas.core.series.Series'
        '''
        name      Happy
        age           6
        weight     32.7
        '''
        d01 = df.iloc[0:2]                       # 0, 1 rows, row 2 is excluded.
        print('d01 = ' + str(d01))
        print('d01 type is ' + str(type(d01)))   #dataframe 

        '''
        d01 =     name  age  weight
        0  Lucky    8    41.5
        1  Happy    6    32.7
        '''

explicit indices

In the first demo is to use explicit indices to access a dataframe.
Method loc is used.

        import pandas as pd
        dogs = {
                  'name':  ['Lucky', 'Happy', 'Joy'],        
                  'age':   [8, 6, 3],                        
                  'weight':[41.5, 32.7, 64.4],              
               }                        
        df = pd.DataFrame(dogs)
        
        df.set_index('name', inplace=True)
        df.sort_values(by='name', ascending=True, inplace=True)
        
        print(df)

        '''
               age  weight
        name
        Happy    6    32.7
        Joy      3    64.4
        Lucky    8    41.5
        '''

        d1 = df.loc['Joy']                       # 1 row
        print('d1 = ' + str(d1))
        print('d1 type is ' + str(type(d1)))     # Series
        '''
         age        3.0
         weight    64.4
         Name: Joy, dtype: float64
        '''

        d01 = df.loc['Happy':'Joy']                 # 0, 1 rows, row 1 is included.
        print('d01 = ' + str(d01))
        print('d01 type is ' + str(type(d01)))      # dataframe
        '''
                age  weight
        name
        Happy    6    32.7
        Joy      3    64.4
        '''

access a cell

This is the second way to access a cell
Using loc to get a row first.
Then, set the column.

    Joy_weight = df.loc['Joy']['weight']
    print('Joy_weight = ' + str(Joy_weight))  # 64.5

09. Multilevel indices and grouping

my comments, Jan, 2019

Good for reporting

1-2, data for this demo, as categorical data

data: dog weight measurements
based on dog name, month
one dog, one month, one weight measurement.

3. create mutilevel indices, grouping

df2.set_index(['name', 'month'])
The result dataframe is a report.

4. Grouping

When you create multilevel indices, data are grouped into two level
- level-1: name
- level-2: month
There is no need under the context of containments - like counties-states, years-months.
You use statistical methods to get related statistical data in a report dataframe.

5. making the 2nd index to a column

using method unstack
The result dataframe is also a report.

        # --- 1. prepare a csv file  ----
        name,month,weight
        Joy, 1, 45.9
        Happy, 3, 12.3
        Happy, 1, 11.1
        Joy, 3, 42.2
        Happy, 2, 10.5
        Joy, 2, 43.1

        #  ---2. create a df, sort
        import pandas as pd
        import pandas as pd
        df = pd.read_csv('dogs_weight.csv') 
        df2 = df.sort_values(by=['name','month'])
        print(df2)
        '''
            name  month  weight
        2  Happy      1    11.1
        4  Happy      2    10.5
        1  Happy      3    12.3
        0    Joy      1    45.9
        5    Joy      2    43.1
        3    Joy      3    42.2

        '''
        #----  3. create mutilevel indices
        df3 = df2.set_index(['name', 'month'])
        print(df3)
        '''
                     weight
        name  month
        Happy 1        11.1
              2        10.5
              3        12.3
        Joy   1        45.9
              2        43.1
              3        42.2
        '''
        #---  4. grouping -------------------------
        df4 = df3.groupby('name') 
        s_report = df4.mean()   
        print(s_report)

        '''    weight
         name
        Happy   11.30
        Joy     43.73
        '''

        #---  5. having the 2nd index to be a column
        df5 =df3.unstack()
        print(df5)
        '''
           month     1     2     3
         name
           Happy   11.1  10.5  12.3
           Joy     45.9  43.1  42.2  
        '''

10. pivoting-drill down

This demo is about resturants tip data.
Two restaurants - Pandas,Amber
For lunch and dinner of 2 days are recorded

        -------------  csv   data  ------------------------
        name,store,day,time,tip
        Joe,Pandas,1,lunch,250.50
        Joe,Pandas,1,dinner,420.50
        Joe,Pandas,2,lunch,110.50
        Joe,Pandas,2,dinner,120.50
        Nancy,Pandas,1,lunch,150.50
        Nancy,Pandas,1,dinner,320.50
        Nancy,Pandas,2,lunch,410.50
        Nancy,Pandas,2,dinner,220.50
        Michael,Amber,1,lunch,310.00             note: one data for Muchael is not there
        Michael,Amber,2,lunch,110.50
        Michael,Amber,2,dinner,315.00
        Carey,Amber,1,lunch,110.00
        Carey,Amber,1,dinner,315.00
        Carey,Amber,2,lunch,150.50
        Carey,Amber,2,dinner,     215.00          note: space is ok

        ----------------    code -----------------------------------
        print('------ pivot table, drilldown  -----')
        import pandas as pd

        print('--- 1. prepare df --------')
        df = pd.read_csv('tips.csv')
        print(df.head())
        '''
            name   store  day    time    tip
        0    Joe  Pandas    1   lunch  250.5
        1    Joe  Pandas    1  dinner  420.5
        2    Joe  Pandas    2   lunch  110.5
        3    Joe  Pandas    2  dinner  120.5
        4  Nancy  Pandas    1   lunch  150.5
        '''

        print('---2. create pivot for drill down   ---')

        df2 = pd.pivot_table(df,'tip',['store','name'],['day','time'])
              
        print(df2)
        '''
                  day        1               2
                  time  dinner  lunch  dinner  lunch
        store  name
        Amber  Carey    315.0  110.0    215.0  150.5
               Michael    NaN  310.0    315.0  110.5
        Pandas Joe      420.5  250.5    120.5  110.5
               Nancy    320.5  150.5    220.5  410.5
        '''
        

        print('----     end of test  -------')

11. Combining dataframes

11.1 Vertically

append, for loop
concat, for known elements

        import pandas as pd
        df1 = pd.DataFrame([[11,12],[21,22]], columns = ['A', 'B'])
        df2 = pd.DataFrame([[31,32],[41,42]], columns = ['A', 'B'])

        df12 = df1.append(df2, ignore_index = True)
        print('df12 = ' + str(df12))
        '''
             A   B
        0  11  12
        1  21  22
        2  31  32
        3  41  42
        '''

        df12x =pd.concat([df1, df2], ignore_index = True)
        print(' df12x = ' + str(df12x))    # the same

11.2 Horizontally

like sql inner join

                print('------------------- test merge -----------------------')
                import pandas as pd
                
                print('------  workers ------------------------------')
                workers = pd.DataFrame([{'name': 'Joe','emp_id': 12,'rate': 15.50},
                                        {'name': 'Emi','emp_id': 35,'rate': 20.10}])
                print(workers)
                '''
                   emp_id name  rate
                0      12  Joe  15.5
                1      35  Emi  20.1
                '''
                
                print('-----  works ------------------')
                works = pd.DataFrame([{'id': 12,'hours': 24},
                                      {'id': 35,'hours': 41}])
                                                   
                print(works)
                '''
                   hours  id
                0     24  12
                1     41  35
                '''
                
                print('-----   merge ---------------------')
                m = pd.merge(workers, works, left_on='emp_id', right_on='id').drop('id',axis=1) 
                
                print(m)
                '''
                   emp_id name  rate  hours
                0      12  Joe  15.5     24
                1      35  Emi  20.1     41
                '''
                print('------------   end -- test 3----------------------------')

12. Data Cleanup

Handling missing data is based on the scenarios
Two are presented as below.

    # ---  senario-1 --------------------------------------  
    # --- mydogs.csv-----
    name,month,weight
    Joy, 1, 45.9
    Happy, 3
    Happy, 1, 11.1
    Joy, 3
    Happy, 2, 10.5
    Joy, 2, 43.1

    # --- python code ----
    import pandas as pd 
    print('---   data with missing values ----')
    dogs= pd.read_csv('mydogs.csv')
    print(dogs)
    '''
    name  month  weight
    0    Joy      1    45.9
    1  Happy      3     NaN
    2  Happy      1    11.1
    3    Joy      3     NaN
    4  Happy      2    10.5
    5    Joy      2    43.1
    ''''
    print('----- solution: fill 0  -----------')  
    dogs['weight'].fillna(0, inplace=True)
    print(dogs)
    '''
    name  month  weight
    0    Joy      1    45.9
    1  Happy      3     0.0
    2  Happy      1    11.1
    3    Joy      3     0.0
    4  Happy      2    10.5
    5    Joy      2    43.1
    '''

    # ---  senario-2 ---------------------------------------------  
    # --- my_expenses.csv-----
    item,          amount       Weekday
    misc,           50.10        Mon
    foods,         120.99        Wed
    haircut,        20.55        Thurs
    dinner,         99.99        Sun
    ,               15.00        Sun
    misc,          100.50        Sat

    # ---python code -----
    import pandas as pd 

    print('---   data with missing values -------------')
    expenses= pd.read_csv('my_expenses.csv')
    print(expenses)
    '''
        item             amount       Weekday
    0     misc                 50.10        Mon
    1    foods                120.99        Wed
    2  haircut               20.55        Thurs
    3   dinner                 99.99        Sun
    4      NaN                 15.00        Sun
    5     misc                100.50        Sat
    '''

    print('----- solution:   fill them with reasonable values -----------')

    expenses['item'].fillna(value='misc', inplace= True)
    print(expenses)
    '''
        item             amount       Weekday
    0     misc                 50.10        Mon
    1    foods                120.99        Wed
    2  haircut               20.55        Thurs
    3   dinner                 99.99        Sun
    4     misc                 15.00        Sun
    5     misc                100.50        Sat
    '''
    
    print('------- end of test  -------------')

13. List Comprehension

A list comprehension is a syntactical construct in python language.
It creates a new list from the existing list.

        print('---  1. prepare a dataframe')
        import pandas as pd
        
        data = {'product': ['A', 'B', 'C', 'D'],
                'id': [1, 2, 3, 4],
                'hourly_pay': [35.50, 16.50, 17.50, 120.50]}
        df = pd.DataFrame(data)
        print(df)
        '''
            product  id  hourly_pay
        0       A   1        35.5
        1       B   2        16.5
        2       C   3        17.5
        3       D   4       120.5
        '''
                        
        print('--- 2. Using list comprehension and a function to convert a list ---')
        def myfunc(p):
            if p < 100:
                return p * 1.1
            else:
                return p

        new_hourly_pay = [myfunc(p)  for p in df['hourly_pay']] 
        
        print('---  3. add a new column  ------------')
        df['new_hourly_pay'] = new_hourly_pay
        print(df)
        '''
            product  id  hourly_pay  new_hourly_pay
        0       A   1        35.5           39.05
        1       B   2        16.5           18.15
        2       C   3        17.5           19.25
        3       D   4       120.5          120.50
        '''
        print('--- or    4.  update ------------------')
        #df['hourly_pay'] = new_hourly_pay
        #print(df)

15. code visual studio for python

In this topic, some features are discussed.
set breakpoints and run
- In the editor, a python file is open.
- On the left end of a code, click it.
- A red dot is seen, a break point is set.
- Click the debug button.
- The codes will be executed before the breakpoint.
- You do not have examine any before that. Focus the current spots.
examine data
- With the debug window open on the left.
- In the variables window pane,
- click local window pane
- Hover the mouse over a dataframe content, you can see it.
- You do not have to use print method to see a dataframe.
examine collection data type
- First, set the break points, and run.
- On the right lower window, click ...
- then, click debug window
- The window with one line is open,
- You can enter print(type(my_new_data)) to examine the type.
- It is more comfortable to know what you get - dataframe, Series, Numpy array...

32. Python Package - pandas

January 28, 2019

Contents

01. Streamline Data Analysis using DataFrame in package pandas