# Data Science Libraries in Python

# toolboxes/libraries

# NumPy

  • introduces objects for multidimensional arrays, vectors and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
  • provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance (see the sketch below)
  • many other Python libraries are built on NumPy
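To make the vectorization point concrete, here is a minimal sketch (the array size and operation are illustrative) comparing a Python loop with the equivalent vectorized expression:

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
# loop version: one interpreted Python operation per element
out_loop = np.empty_like(a)
for i in range(a.size):
    out_loop[i] = a[i] * 2.0 + 1.0
# vectorized version: a single expression, with the loop running in compiled code
out_vec = a * 2.0 + 1.0
print(np.array_equal(out_loop, out_vec))  # True, but the vectorized version is far faster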

# Pandas

  • adds data structures (the DataFrame) and tools designed to work with table-like data
  • provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation, etc.
  • allows handling of missing data

# SciPy

  • a collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
  • part of the SciPy Stack
  • built on NumPy
  • SciPy and NumPy are usually used for matrix-based operations, such as matrix factorization (see the SVD sketch below)
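As a small illustration of the matrix-factorization point, a sketch (the toy matrix is made up) that factorizes a matrix with scipy.linalg.svd and reconstructs it from the factors:

import numpy as np
from scipy import linalg

M = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
U, s, Vt = linalg.svd(M, full_matrices=False)  # singular value decomposition M = U S V^T
M_rec = U @ np.diag(s) @ Vt                    # rebuild M from its factors
print(np.allclose(M, M_rec))                   # True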

# SciKit-Learn

  • provides machine learning algorithms: classification, regression, clustering, model validation, etc. (a minimal workflow sketch follows this list)
  • built on NumPy, SciPy and matplotlib
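A minimal sketch of the typical scikit-learn workflow (load data, split, fit, evaluate); the bundled iris dataset and the k-NN classifier are just illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))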

# Visualization libraries

# matplotlib

  • a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
  • offers a set of functionalities similar to those of MATLAB
  • line plots, scatter plots, bar charts, histograms, pie charts, etc. (a short example follows this list)
  • relatively low-level; some effort is needed to create advanced visualizations
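A short sketch of the basic plot types listed above (the data is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label='sin(x)')                      # line plot
plt.scatter(x[::10], np.cos(x[::10]), label='cos samples')  # scatter plot
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('example.png')  # save in a hardcopy format (PNG, PDF, SVG, ...)
plt.show()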

# Seaborn

  • based on matplotlib
  • provides a high-level interface for drawing attractive statistical graphics (a brief example follows)
  • similar in style to the popular ggplot2 library in R
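A brief sketch of the high-level interface, using one of the example datasets bundled with seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')  # small demo dataset shipped with seaborn
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')
plt.show()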

# Pandas

# Introduction

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.

DataFrames are essentially multidimensional arrays with attached row and column labels, often holding heterogeneous types and/or missing data.

# Basic data structures

# Series

  • Series is a one-dimensional array of indexed data.
  • The index-value pair is similar to the key-value pair in a Python dict.
  • We can create a Series from a dict or from an array/list.
print('Data Structure: Series')
import numpy as np
import pandas as pd

# create a Series from an array with automatic integer indices
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
print(data.values)
print(data.index)
print(data[1:4])

# create a Series from an array with explicit indices
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)
print(data.values)
print(data.index)
print(data[1:4])   # positional slicing still works
print(data['c'])   # label-based access

# indices do not need to be contiguous integers
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data)
print(data.values)
print(data.index)

# create a Series from a dict
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
data = pd.Series(population_dict)
print(data)
print(data.values)
print(data.index)

# arithmetic aligns on the index; unmatched indices produce NaN
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A)
print(B)
print(A + B)
# fill_value=0 avoids NaN for unmatched indices
C = A.add(B, fill_value=0)
print(C)

# DataFrame

  • DataFrame is a generalized two-dimensional array, or a specialization of a dict (of Series).
  • It can be viewed as a table which stores data of different data types.
print('\nData Structure: DataFrame')
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
direction_dict = {'California': 'West', 'Texas': 'Middle', 'New York': 'East', 'Florida': 'East', 'Illinois': 'Middle'}
direction = pd.Series(direction_dict)
# create a DataFrame by fusing three Series (or dicts)
states = pd.DataFrame({'population': population, 'area': area, 'direction': direction})
print(states)
print(states.index)
print(states['area'])
print(states[1:3])
# create a DataFrame in different ways: single column, multiple columns, given index and columns
df = pd.DataFrame(population, columns=['population'])
print(df)
df = pd.DataFrame({'population': population, 'area': area, 'direction': direction})
print(df)
df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
print(df)
# create a DataFrame from a list of dicts; missing entries become NaN
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df)

# Index

  • Index is the object associated with Series and DataFrame
  • It can be viewed as an immutable array (i.e., it cannot be modified) or as an ordered set
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
# set operations can be applied to indices
# (the &, |, ^ operators are deprecated for Index set operations in newer pandas)
ind = indA.intersection(indB)
print(ind)
ind = indA.union(indB)
print(ind)
ind = indA.symmetric_difference(indB)
print(ind)

# Slicing data in dataFrame

  • loc and iloc slice rows by default.
  • To slice columns, we can use df.iloc[:, [1, 2]] (by position) or df.loc[:, ['col1', 'col2']] (by label).
  • We can use index -1 in iloc, but cannot use it in loc unless -1 is an actual label (see the short demo after the code below).
  • iloc takes only integer positions, while loc takes labels (which may themselves be integers).
  • The ix indexer was deprecated in pandas 0.20 and should not be used; prefer loc or iloc.
# select data by columns
print(states['area'])
print(states[['area', 'direction']])
# select data by rows
print(states[1:3])  # positional slice: rows 1 and 2
print(states['Florida':'Illinois'])  # label slice: inclusive of both endpoints
# select data by rows and columns
print(states.iloc[1:3, 1:3])  # numerical positions in .iloc
print(states.loc[['Texas', 'New York'], ['area', 'direction']])  # labels in .loc
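A quick demonstration of the -1 point from the list above, on the same states frame:

print(states.iloc[-1])   # last row by position: negative positions work in iloc
# states.loc[-1]         # would raise KeyError, since -1 is not a row label here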

# NaN, None and missing values in Pandas

  • None is Python's general missing-data object
  • NaN is a special floating-point value that Pandas uses to represent missing numerical data
s = pd.Series([1, np.nan, 2, None])
print(s)
# check whether there are missing values
print(s.isnull())
# drop missing values
print(s.dropna())
# in a DataFrame, dropna() drops rows containing null values
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
print(df)
# drop rows with null values (the default: axis=0, i.e. axis='index')
print(df.dropna())
# drop columns with null values instead (axis=1, i.e. axis='columns')
print(df.dropna(axis='columns'))
# fill in missing values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data)
print(data.fillna(0))

# File operations with Pandas

import pandas as pd
# load csv into dataFrame
df1=pd.read_csv('Data_Students.csv')
print(type(df1))
df2=pd.read_table('Data_Students.csv', sep=',')
print(type(df2))
# show head and tail data
print(df1.head(2))
print(df1.tail(3))
# print(df2.head(2))
# get column names
print(df1.columns)
print(type(df1.columns)) # column names are an Index object by default
# convert column names to list
col_list = df1.columns.tolist()
print(col_list)
print(type(col_list)) # it is list now
# select data by rows
print('\nselect data by rows')
print(df1[1:3]) # get rows with index [1, 3)
# select data by columns
print(df1.columns)
print('\nselect data by columns')
cols = ['Age','Gender','Grade']
print(df1[cols])
# get mean value in a column
print(df1['Age'].mean())

# Data Preprocessing

# Introduction

Data preprocessing may include the following operations:

  • file loading
  • dealing with missing values
  • slicing data
  • data normalization
  • data smoothing
  • data transformation, numerical to categorical
  • data transformation, categorical to numerical
  • feature selection
  • feature reduction
  • some special preprocessing, such as the operations in text mining, e.g., stopword removal, tokenization, TF-IDF weighting

The following operations use Data_Students.csv as the data set.

# Dealing with missing values

# import Python libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
from IPython.display import display, HTML
df=pd.read_csv('Data_Students.csv')
# get header
cols=df.columns
# get dimensions
print(df.shape)
# print column names and data types, plus a boolean flagging missing values
print(df.dtypes)
print('ColumnName, DataType, MissingValues')
for i in cols:
    print(i, ',', df[i].dtype,',',df[i].isnull().any())
# print and display the dataframe as an HTML table
display(HTML(df.head(10).to_html()))
# use GradeLetter as the label, and visualize the data
sns.set()
sns.pairplot(df, hue='GradeLetter', height=2)
# calculate mean values, ignoring missing values (skipna=True)
mean_age=df['Age'].mean(skipna=True)
mean_hr_assignment=df['Hours on Assignments'].mean(skipna=True)
mean_hr_game=df['Hours on Games'].mean(skipna=True)
mean_exam=df['Exam'].mean(skipna=True)
mean_grade=df['Grade'].mean(skipna=True)
# replace missing values in numerical variables with the mean value
df["Age"] = df["Age"].fillna(mean_age)
df["Hours on Assignments"] = df["Hours on Assignments"].fillna(mean_hr_assignment)
df["Hours on Games"] = df["Hours on Games"].fillna(mean_hr_game)
df["Exam"] = df["Exam"].fillna(mean_exam)
df["Grade"] = df["Grade"].fillna(mean_grade)
# check again whether there are missing values
print('ColumnName, DataType, MissingValues')
for i in cols:
    print(i, ',', df[i].dtype,',',df[i].isnull().any())
 
# print and display the dataframe as an HTML table
display(HTML(df.head(10).to_html()))

# Normalization

# find the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# get the numeric column names
cols_numeric = df.select_dtypes(include=numerics).columns.tolist()
# get the numeric column indices
cols_numeric_index = [df.columns.get_loc(col) for col in cols_numeric]
print('Numerical column names:\n', cols_numeric)
print('Numerical column indices:\n', cols_numeric_index)
for i in cols:
    print(i, ',', df[i].dtype, ',', df[i].isnull().any())
    
# create a deep copy first, so normalization does not affect the original dataframe
df_norm = df.copy(deep=True)
# Normalization method 1: z-score normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[cols_numeric] = scaler.fit_transform(df[cols_numeric])
display(HTML(df.head(10).to_html()))
# Normalization method 2: min-max normalization to [0, 1]
for col in cols_numeric:
    df_norm[col] = (df_norm[col] - df_norm[col].min()) / (df_norm[col].max() - df_norm[col].min())
  
# drop the ID column since it is not useful in data science tasks
# axis=1 refers to columns; with axis=0, pandas would look for 'ID' in the row index
df_norm = df_norm.drop('ID', axis=1)
df_norm.head(10)
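The z-score formula used by method 1 is z = (x - mean) / std, and the min-max formula used by method 2 is x' = (x - min) / (max - min). A minimal sketch (the toy data is an assumption) verifying that sklearn's StandardScaler implements the same z-score formula:

from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)  # z-score with population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(x)    # StandardScaler uses the same convention
print(np.allclose(z_manual, z_sklearn))          # True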

# Data transformation

df_transform = df_norm.copy(deep=True)
# print and display the dataframe as an HTML table
display(HTML(df_transform.head(5).to_html()))
# convert numerical to categorical data, e.g., Age
# (uncomment to split Age into 8 equal-width bins with pd.cut)
# df_transform['Age'] = pd.cut(df_transform['Age'], 8)
# display(HTML(df_transform.head(5).to_html()))
# convert categorical data to numerical data, e.g., Degree
print(df_transform['Degree'].dtype)
print(df_transform['Nationality'].dtype)
df_dummies_degree=pd.get_dummies(df_transform['Degree'])
print(df_dummies_degree.head(5))
df_dummies_nation=pd.get_dummies(df_transform['Nationality'])
# add the binary (dummy) variables to the dataframe
df_transform=df_transform.join(df_dummies_degree)
df_transform=df_transform.join(df_dummies_nation)
# remove the original categorical variables
df_transform = df_transform.drop('Degree', axis=1)
df_transform = df_transform.drop('Nationality', axis=1)
display(HTML(df_transform.head(5).to_html()))
# N-1 binary variables are enough to encode N categories, so drop one from each group
# (the leading spaces in ' PHD' and ' China' come from the column names in the source CSV)
df_transform = df_transform.drop(' PHD', axis=1)
df_transform = df_transform.drop(' China', axis=1)
display(HTML(df_transform.head(5).to_html()))
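As a side note, the join/drop sequence above can be collapsed: pandas' get_dummies can encode several columns and drop the first level of each in a single call. A brief sketch, assuming the pre-encoding df_norm from the previous step:

df_alt = pd.get_dummies(df_norm, columns=['Degree', 'Nationality'], drop_first=True)
display(HTML(df_alt.head(5).to_html()))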

# Feature selection

import matplotlib.pyplot as plt
# print and display the dataframe as an HTML table
display(HTML(df_transform.head(10).to_html()))
# set features and labels
x = df_transform.drop('GradeLetter', axis=1)
y = df_transform['GradeLetter']
# Feature selection using the Filter model,
# with the Pearson correlation coefficient as the selection criterion.
# Pearson correlation can only be applied to numerical variables;
# in this data, GradeLetter is highly correlated with the numerical variable Grade.
# calculate correlations and show them in a heatmap
plt.figure(figsize=(12,10))
cor = df_transform.corr(numeric_only=True)  # consider only numerical columns
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
# correlation with the output variable
cor_target = abs(cor["Grade"])
# select highly correlated features (absolute correlation > 0.5)
relevant_features = cor_target[cor_target>0.5]
print('\nSelected features by Filter model:\n',relevant_features)
# Feature selection using the Wrapper model.
# A machine learning task is involved in the Wrapper model:
# the performance of that task is used to select influential features.
# In this example, we use backward elimination in a linear regression that predicts Grade.
# # Backward elimination (commented out; requires statsmodels)
# import statsmodels.api as sm
# cols = list(df_transform.columns)
# cols.remove('GradeLetter') # drop the nominal variable
# print('\n x variables: ',cols)
# y=list(df_transform['Grade']) # using Grade as y variable in linear regression
# pmax = 1
# while (len(cols)>0):
#     p= []
#     X_1 = df_transform[cols]
#     X_1 = sm.add_constant(X_1)
#     model = sm.OLS(y,X_1).fit()
#     p = pd.Series(model.pvalues.values[1:],index = cols)      
#     pmax = max(p)
#     feature_with_p_max = p.idxmax()
#     if(pmax>0.05):
#         cols.remove(feature_with_p_max)
#     else:
#         break
# selected_features_BE = cols
# print('\nSelected features by Wrapper model (regression):\n',selected_features_BE)
# Feature selection using the Wrapper model, second variant:
# use the impurity-based feature importances of a tree ensemble to select important features
from sklearn.ensemble import ExtraTreesClassifier
y = df_transform['GradeLetter']
x = df_transform.drop('GradeLetter', axis=1)
display(HTML(x.head(10).to_html()))
model = ExtraTreesClassifier()
model.fit(x, y)
values=model.feature_importances_.tolist()
keys=x.columns.tolist()
d = dict(zip(keys, values))
# sort the (feature, importance) pairs by importance, descending
s = [(k, d[k]) for k in sorted(d, key=d.get, reverse=True)]
print('\nSelected features by Wrapper model (classification):\n')
for k, v in s:
    print(k,'\t',v)
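scikit-learn also ships ready-made filter-style selectors. A brief sketch with SelectKBest and the ANOVA F-score (the choice of k=5 is an arbitrary assumption), reusing x and y from above:

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 highest-scoring features
selector.fit(x, y)
print('\nSelected features by SelectKBest:\n', x.columns[selector.get_support()].tolist())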

# Feature reduction

# Example of PCA
from sklearn.decomposition import PCA
display(HTML(x.head(10).to_html()))
# all categorical variables (Degree, Nationality) were already converted to
# dummy variables above, so every feature is numerical, as PCA requires
x = df_transform.drop('GradeLetter', axis=1)
y = df_transform['GradeLetter']
display(HTML(x.head(10).to_html()))
# feature extraction
# API, https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
pca = PCA(n_components=10)
fit = pca.fit(x)
# summarize the components
print('Explained variance: ', fit.explained_variance_ratio_)
print('\nPCAs:\n', fit.components_)
# select the top principal components and output them as new features
# for example, keep the top-3 components
PCAs = pca.fit_transform(x)
PCAs_selected = PCAs[:,:3]
df_PCAs = pd.DataFrame(data=PCAs_selected, columns=['PC1','PC2','PC3'])
df_PCAs['GradeLetter'] = y
display(HTML(df_PCAs.head(10).to_html()))
# write the new data to an external file
df_PCAs.to_csv('Data_Students_PCA.csv', sep=',')
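Rather than hard-coding three components, a common heuristic is to keep enough components to explain a target share of the variance; a small sketch (the 95% threshold is an assumption):

cum_var = np.cumsum(fit.explained_variance_ratio_)  # cumulative explained variance
n_keep = int(np.argmax(cum_var >= 0.95)) + 1        # smallest component count reaching 95%
print('Components needed for 95% variance:', n_keep)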

# Data Splits: Examples

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns
print(df.columns)
# assume the last column is the label and the other columns are features
X = df.loc[:, df.columns!='GradeLetter']
y = df.loc[:,'GradeLetter']
print(X.columns)
print(type(X))
print(type(y))
# hold-out data split 
# API: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# N-fold data split
# API: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
kf = KFold(n_splits=5, shuffle=True)
data_5folds = []
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # get the actual data by index
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # save data into fold
    fold = [x_train, x_test, y_train, y_test]
    # add each fold's data to the list of five folds
    data_5folds.append(fold)
# N-fold cross validation
# API: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
# acc=cross_val_score(clf, x, y, cv=5, scoring='accuracy').mean()
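To make the commented line above concrete, here is a brief sketch that plugs in an actual estimator (a decision tree is just an example) and reuses the fully numerical df_transform from the feature-selection step, since tree estimators need numeric inputs:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
x_num = df_transform.drop('GradeLetter', axis=1)  # numerical features only
y_num = df_transform['GradeLetter']
acc = cross_val_score(clf, x_num, y_num, cv=5, scoring='accuracy').mean()
print('5-fold cross-validation accuracy:', acc)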