# # Data Science Libraries in Python

## # toolboxes/libraries

### # NumPy

• introduces objects for multidimensional arrays, vectors and matrices, together with functions that make it easy to perform advanced mathematical and statistical operations on them
• vectorizes mathematical operations on arrays and matrices, which significantly improves performance over element-by-element Python loops
• many other Python libraries are built on NumPy
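
To make the vectorization point concrete, here is a minimal sketch that computes the same result with an explicit Python loop and with a vectorized NumPy expression. The data and the function names `scaled_sum_loop` and `scaled_sum_vectorized` are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: 1000 random values in [0, 1).
rng = np.random.default_rng(seed=0)
values = rng.random(1000)


def scaled_sum_loop(arr, factor):
    """Multiply every element by `factor` and add them up, one element at a time."""
    total = 0.0
    for x in arr:
        total += x * factor
    return total


def scaled_sum_vectorized(arr, factor):
    """Same computation, but the multiply and the sum run inside NumPy."""
    return float((arr * factor).sum())


loop_result = scaled_sum_loop(values, 2.0)
vec_result = scaled_sum_vectorized(values, 2.0)

# Statistical operations work along an axis of a multidimensional array.
matrix = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
column_means = matrix.mean(axis=0)    # mean of each column
```

Both functions return the same number up to floating-point rounding; the vectorized one avoids the per-element Python overhead, which is where the speed-up on large arrays comes from.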

### # Pandas

• adds data structures (notably the DataFrame) and tools designed to work with table-like data
• provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation, etc.
• supports handling of missing data
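
As a rough illustration of these points, the sketch below builds a small table, fills a missing value, sorts it and aggregates it. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table-like data with one missing score.
scores = pd.DataFrame({
    "student": ["Ann", "Bob", "Cid", "Dee"],
    "group": ["A", "B", "A", "B"],
    "score": [90.0, np.nan, 70.0, 50.0],
})

# Handle missing data: fill the gap with the mean of the observed scores.
filled = scores.fillna({"score": scores["score"].mean()})

# Sort by score, highest first.
ranked = filled.sort_values("score", ascending=False)

# Aggregate: mean score per group.
group_means = filled.groupby("group")["score"].mean()
```

Filling with the mean is only one possible policy; dropping the row with `dropna()` is another common choice.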

### # SciPy

• a collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
• part of the SciPy Stack
• built on NumPy
• SciPy and NumPy are commonly used for matrix-based operations, such as matrix factorization
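
A small sketch of two of these areas, matrix factorization and numerical integration, might look as follows. The matrix `A` and the vector `b` are arbitrary example values.

```python
import numpy as np
from scipy import integrate, linalg

# Arbitrary example system: 4x + 3y = 10 and 6x + 3y = 12.
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

# LU factorization: A is split into a permutation matrix P,
# a lower-triangular L and an upper-triangular U with A = P @ L @ U.
P, L, U = linalg.lu(A)
reconstructed = P @ L @ U

# Solve the linear system A x = b.
x = linalg.solve(A, b)

# Numerical integration of sin(x) over [0, pi]; the exact value is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)
```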

### # SciKit-Learn

• provides machine learning algorithms and tools: classification, regression, clustering, model validation, etc.
• built on NumPy, SciPy and matplotlib
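
The sketch below combines two of the listed capabilities, classification and model validation. The choice of the bundled iris data set, logistic regression and 5-fold cross-validation is an assumption made for illustration, not something prescribed by these notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small example data set shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Classification model; max_iter is raised so the solver converges.
clf = LogisticRegression(max_iter=1000)

# Model validation: accuracy on each of 5 cross-validation folds.
cv_scores = cross_val_score(clf, X, y, cv=5)
mean_accuracy = cv_scores.mean()
```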

## # Visualization libraries

### # matplotlib

• Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats
• offers a set of functionalities similar to those of MATLAB
• line plots, scatter plots, bar charts, histograms, pie charts, etc.
• relatively low-level; some effort is needed to create advanced visualizations
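
As a minimal sketch, the code below draws a line plot and a histogram side by side and renders the figure as PNG. The `Agg` backend and the in-memory buffer are assumptions chosen so the example also runs without a display; calling `fig.savefig("figure.pdf")` with a file name of your choice would write a file instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# One figure with two panels: a line plot and a histogram.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, y, label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")
ax_line.legend()

ax_hist.hist(y, bins=10)
ax_hist.set_xlabel("sin(x)")
ax_hist.set_ylabel("count")

fig.tight_layout()

# Render the figure as PNG into an in-memory buffer.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=150)
plt.close(fig)
```

Even this simple figure needs explicit calls for labels, legend and layout, which is what "relatively low-level" means in practice.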

### # Seaborn

• based on matplotlib
• provides a high-level interface for drawing attractive statistical graphics
• similar (in style) to the popular `ggplot2` library in R

# # Pandas

## # Introductions

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.

DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.

## # Basic data structure

### # Series

• A Series is a one-dimensional array of indexed data.
• The index-value pairing is similar to the key-value pairing of a `dict` in Python 3.
• We can create a Series from a dict or from an array/list.
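
The two ways of creating a Series can be sketched as follows; the labels and values are made up.

```python
import pandas as pd

# From a dict: the keys become the index, the values become the data.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a list with an explicit index.
from_list = pd.Series([10, 20, 30], index=["x", "y", "z"])

# From a list without an index: pandas assigns the positions 0, 1, 2.
default_index = pd.Series([10, 20, 30])

# Lookup by index label works like lookup by key in a dict.
value_b = from_dict["b"]
value_y = from_list["y"]
```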

### # DataFrame

• A DataFrame can be viewed as a generalized array or as a specialization of a dict.
• It can also be viewed as a table whose columns may hold different data types.
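
Both views can be seen in a short sketch. The `students` table and its columns are hypothetical.

```python
import pandas as pd

# A small table whose columns hold different data types.
students = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": [20, 22],
    "gpa": [3.5, 3.9],
})

# Dict-like view: a column label maps to a Series.
ages = students["age"]

# Array-like view: the table has a two-dimensional shape (rows, columns).
n_rows, n_cols = students.shape

# Each column keeps its own data type.
column_types = students.dtypes
```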

### # Index

• An Index is the object that holds the labels of a Series or DataFrame.
• It can be viewed as an immutable array (i.e., it cannot be modified in place) or as an ordered set.
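
A brief sketch of both views of an Index, using arbitrary integer labels:

```python
import pandas as pd

idx_a = pd.Index([1, 3, 5, 7])
idx_b = pd.Index([3, 4, 5, 6])

# Array-like: elements can be read by position.
second = idx_a[1]

# Immutable: assigning to an element raises a TypeError.
try:
    idx_a[0] = 99
    was_modified = True
except TypeError:
    was_modified = False

# Set-like: intersection and union are available as methods.
common = idx_a.intersection(idx_b)
combined = idx_a.union(idx_b)
```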

## # Slicing data in dataFrame

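• `loc` and `iloc` slice rows by default.
• To slice columns, pass a second indexer: `df.iloc[:, [1, 2, 3]]` selects columns by position, while `df.loc[:, [1, 2, 3]]` works only when the column labels themselves are 1, 2 and 3.
• We can use the index `-1` in `iloc` to mean "the last position"; in `loc`, `-1` is treated as a label and fails unless that label exists.
• `iloc` accepts only integer positions; `loc` accepts labels, and an integer works in `loc` only when it is itself a label (as with the default integer index).
• `ix` has been deprecated since pandas v0.20 and was removed in v1.0; use `loc` or `iloc` instead.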
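
The difference between label-based and position-based slicing is easiest to see on a DataFrame whose row labels are not integers. The frame below is a made-up example.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Rows: a label slice includes both end points, a position slice excludes the stop.
rows_by_label = df.loc["x":"y"]
rows_by_position = df.iloc[0:2]

# Negative positions work in iloc.
last_row = df.iloc[-1]

# Columns: pass a second indexer after the row indexer.
cols_by_label = df.loc[:, ["a", "c"]]
cols_by_position = df.iloc[:, [0, 2]]

# In loc, -1 is looked up as a label, and no row here carries that label.
try:
    df.loc[-1]
    label_found = True
except (KeyError, TypeError):  # the exact exception type depends on the pandas version
    label_found = False
```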

## # `NaN`, `None` and missing values in Pandas

• `None` is the general-purpose Python object for missing data; it can appear in data of any type.
• `NaN` can be interpreted as missing numerical data; it is a floating-point value.
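
A short sketch of how the two markers behave in practice, with made-up values:

```python
import numpy as np
import pandas as pd

# In numeric data, None is converted to NaN and the Series becomes float64.
numeric = pd.Series([1, None, 3])

# In non-numeric data, None is still recognised as a missing value.
labels = pd.Series(["a", None, "c"])

# isna() detects both kinds of missing value.
numeric_missing = numeric.isna()
labels_missing = labels.isna()

# NaN never compares equal to anything, not even to itself,
# so use isna() rather than == to test for it.
nan_equals_itself = np.nan == np.nan

# Aggregations skip NaN by default.
total = numeric.sum()
```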

# # Data Preprocessing

## # Introductions

Data preprocessing may include the following operations:

• file loading
• dealing with missing values
• slicing data
• data normalization
• data smoothing
• data transformation, numerical to categorical
• data transformation, categorical to numerical
• feature selection
• feature deduction
• special preprocessing, such as the operations used in text mining, e.g., stop-word removal, tokenization, TF-IDF weighting
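
A few of these operations can be sketched on a tiny hand-made table. The columns `age` and `major`, the bin edges and the band labels are hypothetical and are not taken from any real data set; median filling and min-max scaling are just one possible choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing age.
raw = pd.DataFrame({
    "age": [18.0, 20.0, np.nan, 22.0],
    "major": ["math", "cs", "cs", "math"],
})

# Deal with missing values: fill with the median of the observed ages.
clean = raw.fillna({"age": raw["age"].median()})

# Data normalization: min-max scaling to the range [0, 1].
age_min, age_max = clean["age"].min(), clean["age"].max()
clean["age_scaled"] = (clean["age"] - age_min) / (age_max - age_min)

# Numerical to categorical: cut the ages into labelled bins.
clean["age_band"] = pd.cut(
    clean["age"], bins=[0, 19, 21, 120], labels=["low", "mid", "high"]
)

# Categorical to numerical: one indicator column per category.
major_dummies = pd.get_dummies(clean["major"], prefix="major")
```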

The following operations use `Data_Students.csv` as the data set.