A Peek into the Basics of NumPy and Pandas

This post was written as a Jupyter notebook, so you can experiment and learn by editing the notebook.
Click here for notebook.

Just change the input and check the output.

Learning by experiment and hands-on exercises is always better.

The purpose of this notebook is simply to revise the basics of NumPy and Pandas in Python.

Let's get started.

1. NUMPY BASICS

NumPy is a Python library for numerical computing, built around fast multidimensional arrays and linear algebra routines.

NumPy brings the best of two worlds:

  • the computational efficiency of C/Fortran,
  • the easy syntax of Python.
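
To make the two points above concrete, here is a minimal sketch that computes a sum of squares twice: once with a plain Python loop and once with a single NumPy expression. The variable names (`values`, `loop_total`, `vector_total`) are made up for this illustration.

```python
import numpy as np

values = list(range(1, 1001))  # illustrative data: the integers 1..1000

# Pure Python: every multiplication and addition runs in the interpreter
loop_total = 0
for v in values:
    loop_total += v * v

# NumPy: the same computation as one expression; the loop runs in compiled code
arr = np.array(values, dtype=np.int64)
vector_total = int((arr * arr).sum())

print(loop_total, vector_total)  # both totals are identical
```

Both versions give the same answer. On large arrays the NumPy version is typically much faster, but try timing it yourself rather than taking that on trust.
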
import numpy as np 

# Let's define a one-dimensional Python list
my_list = [10, 20, 30, 40, 50, 60, 70, 80]
my_list
[10, 20, 30, 40, 50, 60, 70, 80]

Let's create a numpy array from the list "my_list"

x = np.array(my_list)
x
array([10, 20, 30, 40, 50, 60, 70, 80])

Get shape

x.shape
(8,)

Let's create a multi-dimensional NumPy array from a nested list


matrix = np.array([[5, 8], [9, 13]])
matrix
array([[ 5,  8],
       [ 9, 13]])
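
A few attributes are worth checking on any new array. The sketch below rebuilds the same `matrix` so that it runs on its own and prints its shape, number of dimensions, size, and one element.

```python
import numpy as np

matrix = np.array([[5, 8], [9, 13]])

print(matrix.shape)  # (rows, columns)
print(matrix.ndim)   # number of dimensions
print(matrix.size)   # total number of elements
print(matrix[1, 0])  # element in row 1, column 0
```
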
# "rand()" draws from a uniform distribution over [0, 1)
xy = np.random.rand(7)
xy
array([0.40408966, 0.12527144, 0.04465052, 0.39450693, 0.93339664,
       0.14009694, 0.94461679])

You can also create a matrix of random numbers with random.rand


xy = np.random.rand(2, 2)
xy
array([[0.86152202, 0.22526627],
       [0.41562272, 0.33467273]])
# "randn()" draws from the standard normal distribution (mean 0, standard deviation 1)
xy = np.random.randn(7)
xy
array([-1.27678101,  1.20667812,  0.7945132 ,  0.62421099, -0.44447512,
       -0.57038096,  2.19949273])
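
Random outputs change on every run, which makes experiments hard to repeat. One way around that, sketched here, is a seeded generator from `np.random.default_rng`. The seed value 42 and the sample size are arbitrary choices for this illustration, and this generator interface is an alternative to the `rand`/`randn` calls used in this notebook, not a replacement the notebook depends on.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value 42 is arbitrary

uniform_samples = rng.random(100_000)          # uniform on [0, 1)
normal_samples = rng.standard_normal(100_000)  # mean 0, standard deviation 1

# The uniform samples never leave the half-open interval [0, 1)
print(uniform_samples.min() >= 0.0, uniform_samples.max() < 1.0)

# With many samples the mean is close to 0 and the standard deviation close to 1
print(normal_samples.mean(), normal_samples.std())
```
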

"randint" generates random integers from the lower bound (inclusive) up to the upper bound (exclusive), so the call below returns a value from 1 to 9


xy = np.random.randint(1, 10)
xy
9
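
To see the bounds in action, here is a small sketch that draws many integers at once using the `size` argument and checks the smallest and largest values. The sample size of 10,000 is arbitrary.

```python
import numpy as np

# low=1 is included, high=10 is excluded
draws = np.random.randint(1, 10, size=10_000)

print(draws.shape)
print(draws.min(), draws.max())  # values stay within 1..9
```
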

Create evenly spaced values from 1 up to (but not including) 50 with a step of 7

xy = np.arange(1, 50, 7)
xy
array([ 1,  8, 15, 22, 29, 36, 43])
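
If you know how many points you want rather than the step, `np.linspace` is a common alternative. The sketch below produces the same seven values both ways; the names `by_step` and `by_count` are made up for this illustration.

```python
import numpy as np

by_step = np.arange(1, 50, 7)     # fix the step; the stop value 50 is excluded
by_count = np.linspace(1, 43, 7)  # fix the number of points; the endpoint 43 is included

print(by_step)
print(by_count)  # same values, but stored as floats
```
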
# Array of ones
xy = np.ones(7)
xy
array([1., 1., 1., 1., 1., 1., 1.])
# Matrices of ones
xy = np.ones((2, 2))
xy
array([[1., 1.],
       [1., 1.]])
# Array of zeros
xy = np.zeros(5)
xy
array([0., 0., 0., 0., 0.])

Reshape 1D array into a matrix

z = x.reshape(2,4)
print(x)
print(z)
[10 20 30 40 50 60 70 80]
[[10 20 30 40]
 [50 60 70 80]]
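
Two details about `reshape` are easy to miss, so here is a short self-contained sketch. First, you can pass -1 for one dimension and let NumPy work it out. Second, for a simple contiguous array like this one, the reshaped array is a view that shares memory with the original.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])

z = x.reshape(2, -1)  # -1 asks NumPy to infer the second dimension (4 here)
print(z.shape)

# For a contiguous array like this one, reshape returns a view, not a copy
print(np.shares_memory(x, z))

z[0, 0] = 99          # writing through the view ...
print(x[0])           # ... also changes the original array
```

Use `x.reshape(2, -1).copy()` if you need an independent array.
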

Obtain the maximum element (value)

x.max()
80

Obtain the minimum element (value)

x.min()
10

Obtain the location of the max element

x.argmax()
7
# Obtain the location of the min element
x.argmin()
0
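
On a matrix, `argmax` without arguments returns a position in the flattened array. The sketch below, using a 2 x 4 array with the same values as above, shows how to turn that into a (row, column) pair and how the `axis` argument changes the result.

```python
import numpy as np

z = np.array([[10, 20, 30, 40],
              [50, 60, 70, 80]])

flat_index = z.argmax()                           # index into the flattened array
row, col = np.unravel_index(flat_index, z.shape)  # convert to (row, column)
print(flat_index, row, col)

print(z.argmax(axis=0))  # row index of the max in each column
print(z.argmax(axis=1))  # column index of the max in each row
```
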
# Access specific index from the numpy array
x[0]
10
# Slice from index 0 up to, but NOT including, index 3
x[0:3]
array([10, 20, 30])
# Broadcasting, altering several values in a numpy array at once
x[0:2] = 10
x
array([10, 10, 30, 40, 50, 60, 70, 80])
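
One thing to watch when assigning to slices: a slice is a view onto the same data, so earlier slices see later changes. Here is a minimal sketch, with the made-up names `view` and `safe`, that also shows broadcasting in plain arithmetic.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])

view = x[0:2]         # a slice is a view onto the same data
safe = x[0:2].copy()  # an explicit copy is independent

x[0:2] = 10           # the scalar 10 is broadcast across the slice
print(view)           # the view sees the change
print(safe)           # the copy still holds the original values

print(x + 5)          # broadcasting also applies to arithmetic
```
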

2. PANDAS BASICS

Pandas is a data manipulation and analysis library built on top of NumPy.

Pandas uses a data structure known as a DataFrame (think of it as a Microsoft Excel spreadsheet in Python).

DataFrames let programmers store and manipulate data in a tabular fashion (rows and columns).

Series vs. DataFrame? A Series can be thought of as a single column of a DataFrame.
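
The Series/DataFrame relationship is easy to check for yourself. The tiny table below is hypothetical (the column names `ticker` and `price` and their values are made up); the point is only what type comes back from each kind of selection.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration
df = pd.DataFrame({'ticker': ['AAA', 'BBB'], 'price': [10.5, 20.0]})

column = df['price']       # single brackets: one column, returned as a Series
sub_frame = df[['price']]  # double brackets: still a DataFrame

print(type(column).__name__)
print(type(sub_frame).__name__)
```
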

import pandas as pd
# Let's define two lists as shown below:
stock_list = ['Reliance', 'AMZN', 'facebook']
stock_list
['Reliance', 'AMZN', 'facebook']
label   = ['stock#1', 'stock#2', 'stock#3']
label
['stock#1', 'stock#2', 'stock#3']

Let's create a one-dimensional Pandas Series

Note that a Series is made up of data and associated labels (its index)


x_series = pd.Series(data = stock_list, index = label)
# Let's view the series
x_series
stock#1    Reliance
stock#2        AMZN
stock#3    facebook
dtype: object
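
Once the labels are in place you can use them to look values up. The sketch below rebuilds the same Series so it runs on its own, then shows label-based access, position-based access, and a vectorised string method.

```python
import pandas as pd

stock_list = ['Reliance', 'AMZN', 'facebook']
label = ['stock#1', 'stock#2', 'stock#3']
x_series = pd.Series(data=stock_list, index=label)

print(x_series['stock#2'])      # look up by label
print(x_series.loc['stock#2'])  # explicit label-based access
print(x_series.iloc[0])         # explicit position-based access
print(x_series.str.upper())     # string method applied to every element
```
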

Let's check the type of the object

type(x_series)
pandas.core.series.Series

Let's define a two-dimensional Pandas DataFrame

Note that you can create a Pandas DataFrame from a Python dictionary


bank_client_df = pd.DataFrame({'Bank client ID':[1111, 2222, 3333, 4444], 
                               'Bank Client Name':['Kiran', 'Chaitanya', 'dheeraj', 'shreyas'], 
                               'Net worth [$]':[3500, 29000, 10000, 2000], 
                               'Years with bank':[3, 4, 9, 5]})
bank_client_df
   Bank client ID  Bank Client Name  Net worth [$]  Years with bank
0            1111             Kiran           3500                3
1            2222         Chaitanya          29000                4
2            3333           dheeraj          10000                9
3            4444           shreyas           2000                5
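
With the data in a DataFrame, filtering and summarising take one line each. The sketch below rebuilds the same table so that it is self-contained; the 5000 threshold and the name `high_value` are arbitrary choices for this illustration.

```python
import pandas as pd

bank_client_df = pd.DataFrame({'Bank client ID': [1111, 2222, 3333, 4444],
                               'Bank Client Name': ['Kiran', 'Chaitanya', 'dheeraj', 'shreyas'],
                               'Net worth [$]': [3500, 29000, 10000, 2000],
                               'Years with bank': [3, 4, 9, 5]})

# Boolean mask: keep clients with a net worth of at least 5000 (arbitrary threshold)
high_value = bank_client_df[bank_client_df['Net worth [$]'] >= 5000]
print(high_value)

# Column-wise summaries
print(bank_client_df['Net worth [$]'].sum())
print(bank_client_df['Years with bank'].mean())
```
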

Let's check the type of the object


type(bank_client_df)
pandas.core.frame.DataFrame

You can view just the first few rows using .head()

bank_client_df.head(2)
   Bank client ID  Bank Client Name  Net worth [$]  Years with bank
0            1111             Kiran           3500                3
1            2222         Chaitanya          29000                4

You can view just the last few rows using .tail()

bank_client_df.tail(1)
   Bank client ID  Bank Client Name  Net worth [$]  Years with bank
3            4444           shreyas           2000                5

Pandas can read a CSV file and store its data in a DataFrame

bank_df = pd.read_csv('sample.csv')

Write the DataFrame to a CSV file without the index column

bank_df.to_csv('sample_output.csv', index = False)
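
If you don't have a `sample.csv` to hand, you can still try the round trip in memory. This sketch builds a small stand-in DataFrame (its contents are made up for the illustration), writes it to a CSV string, and reads it back through `io.StringIO`.

```python
import io
import pandas as pd

# Small stand-in for the data that would come from a real CSV file
bank_df = pd.DataFrame({'Bank client ID': [1111, 2222],
                        'Net worth [$]': [3500, 29000]})

csv_text = bank_df.to_csv(index=False)  # with no path, to_csv returns the CSV as a string
print(csv_text)

restored = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts file-like objects
print(restored.equals(bank_df))
```

Try removing `index=False` and reading the text back: the old index shows up as an extra unnamed column.
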

CONCATENATING AND MERGING WITH PANDAS

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
df3
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
pd.concat([df1, df2, df3])
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
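
The concatenation above stacks rows and keeps each frame's own index labels. Two follow-ups are sketched here: renumbering the rows with `ignore_index=True`, and merging, which joins two tables on a shared key column instead of stacking them. All the small frames below (`left`, `right`, `clients`, `balances`) are made up for this illustration.

```python
import pandas as pd

# Two small frames with overlapping index labels
left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# ignore_index=True discards the old labels and renumbers the rows 0..3
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# Merging: join two tables on a shared key column (illustrative data)
clients = pd.DataFrame({'Bank client ID': [1111, 2222, 3333],
                        'Bank Client Name': ['Kiran', 'Chaitanya', 'dheeraj']})
balances = pd.DataFrame({'Bank client ID': [2222, 3333, 4444],
                         'Net worth [$]': [29000, 10000, 2000]})

# how='inner' keeps only the IDs present in both tables
merged = pd.merge(clients, balances, on='Bank client ID', how='inner')
print(merged)
```

Change `how='inner'` to `'left'`, `'right'`, or `'outer'` and watch which rows survive and where missing values appear.
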