Contents
  1. Basic Syntax
    1.1. Converting bytes to a string
    1.2. Array initialization
      1.2.1. One-dimensional array initialization
      1.2.2. Two-dimensional array initialization
    1.3. Safe access to list elements
    1.4. Iterating over each character of a string
    1.5. Conditional (ternary) expression
    1.6. Using zip to combine the elements of several lists one by one
    1.7. Sequence unpacking
  2. Data Structures
    2.1. Stack
    2.2. Queue
      2.2.1. Thread-safe Queue
      2.2.2. collections.deque
    2.3. Map (dictionary)
    2.4. LinkedHashMap
    2.5. PriorityQueue (Heap)
  3. Libraries
    3.1. NumPy
      3.1.1. ndarray (N-dimensional array object)
        3.1.1.1. Creating arrays
        3.1.1.2. Building structured arrays
      3.1.2. ufunc (universal function object)
        3.1.2.1. Creating evenly spaced arrays
        3.1.2.2. Array arithmetic
        3.1.2.3. Statistical functions
          3.1.2.3.1. Maximum (amax) and minimum (amin) of arrays and matrices
          3.1.2.3.2. Difference between maximum and minimum (ptp)
          3.1.2.3.3. Percentiles (percentile)
          3.1.2.3.4. Weighted average (average)
          3.1.2.3.5. Standard deviation (std) and variance (var)
      3.1.3. Sorting (sort)
    3.2. Pandas
      3.2.1. Data structures: Series and DataFrame
        3.2.1.1. Series: a fixed-length dictionary
        3.2.1.2. DataFrame
      3.2.2. Importing and exporting data
      3.2.3. Data cleaning
        3.2.3.1. Common data problems
          3.2.3.1.1. Dropping unnecessary rows and columns from a DataFrame
          3.2.3.1.2. Renaming columns
          3.2.3.1.3. Removing duplicate rows
          3.2.3.1.4. Data format problems
      3.2.4. Cleaning data with the apply function
      3.2.5. Descriptive statistics
      3.2.6. Merging tables
      3.2.7. Querying Pandas with SQL
  4. References

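Basic Syntax

Converting bytes to a string

# decode interprets the raw bytes using the given encoding
b"hello world".decode("utf-8")

Or:

# Python 3 only: str accepts an encoding as its second argument
str(b"hello world", 'utf-8')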

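Array initialization

One-dimensional array initialization

# base_list is an existing list whose length we want to match
zeros = [0] * len(base_list)

Two-dimensional array initialization

# Caution: every row refers to the same inner list, so writing to one row changes them all
array_2d_shared_element = [[0] * column_size] * row_size

# Each row is a separate list (on Python 2, xrange can replace range)
array_2d_distinct_element = [[0 for i in range(column_size)] for j in range(row_size)]

import numpy as np

# A row_size x column_size array filled with floating-point zeros
array_e = np.zeros((row_size, column_size))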
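
The difference between the shared and the distinct form is easiest to see by writing to a single cell. The following is a minimal sketch; the sizes 2 and 3 are illustrative values, not taken from the original.

```python
row_size, column_size = 2, 3  # illustrative sizes

# Multiplying the outer list copies a reference to one inner list
shared = [[0] * column_size] * row_size
shared[0][0] = 1

# A comprehension builds a fresh inner list for every row
distinct = [[0] * column_size for _ in range(row_size)]
distinct[0][0] = 1

print(shared)    # [[1, 0, 0], [1, 0, 0]]
print(distinct)  # [[1, 0, 0], [0, 0, 0]]
print(shared[0] is shared[1])  # True: both rows are the same object
```

For nested lists that will be modified, the comprehension form is usually the one you want.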

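Safe access to list elements

# Method of a class that keeps its items in the list self.queue
def get(self, index):
    try:
        return self.queue[index]
    except IndexError:
        # Out-of-range index: return None instead of raising
        return None

Or let the caller supply a default return value:

def get(self, index, default):
    try:
        return self.queue[index]
    except IndexError:
        return default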
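
The same idea also works as a standalone function, which may be easier to try out than a method. This is a sketch only: the name `safe_get` and the sample list are assumptions made for illustration.

```python
def safe_get(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


letters = ["a", "b", "c"]
print(safe_get(letters, 1))        # b
print(safe_get(letters, 10))       # None
print(safe_get(letters, 10, "?"))  # ?
print(safe_get(letters, -1))       # c (negative indexes still count from the end)
```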

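Iterating over each character of a string

# text is any string; loop over its indexes
for i in range(len(text)):
    print(text[i])

# enumerate yields (index, character) pairs
for i, v in enumerate(text):
    print(v)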

Conditional (ternary) expression

# C-style equivalent: condition ? rt_value : rt_value_else
rt_value if condition else rt_value_else
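
A concrete instance of the pattern may help. In this sketch the function `grade` and the pass mark of 60 are hypothetical, chosen only to show the expression in use.

```python
def grade(score):
    # Reads as: "pass" when score >= 60, otherwise "fail"
    return "pass" if score >= 60 else "fail"


print(grade(95))  # pass
print(grade(30))  # fail

# The expression can appear anywhere a value is expected
scores = [66, 95, 30]
labels = ["pass" if s >= 60 else "fail" for s in scores]
print(labels)  # ['pass', 'pass', 'fail']
```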

Using zip to combine the elements of several lists one by one

for item in zip([1, 2, 3], [4, 5, 6]):
    print(item)
# (1, 4)
# (2, 5)
# (3, 6)

Applying zip a second time, to the unpacked result, restores the original grouping:

for item in zip(*zip([1, 2, 3], [4, 5, 6])):
    print(item)
# (1, 2, 3)
# (4, 5, 6)

In Python 3, zip returns an iterator. To get a list, pass the iterator to list():

list_zipped = list(zip(*zip([1, 2, 3], [4, 5, 6])))

# [(1, 2, 3), (4, 5, 6)]

In Python 2, zip directly returns a list of tuples.

Sequence unpacking

Basic form:

first, second, third = 1, 2, 3

Common uses in practice:

Swapping the values of two variables

a, b = b, a

Sequence unpacking also supports starred (*) expressions (Python 3 only):

first, second, *rest = 1, 2, 3, 4

# rest = [3, 4]

first, *mid, last = 1, 2, 3, 4

# mid = [2, 3]

Example:

zipped_twice = zip(*zip([1, 2, 3], [4, 5, 6]))

The innermost zip([1, 2, 3], [4, 5, 6]) yields (1, 4), (2, 5), (3, 6).
To pass those elements to the outer zip as separate arguments, unpack them with the * operator.
In other words, zip(*zip([1, 2, 3], [4, 5, 6])) is equivalent to zip((1, 4), (2, 5), (3, 6)).
For more examples, see "Unpacking Argument Lists" in the official Python documentation.
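
The equivalence described above can be checked directly. This is a small sketch; the function `describe` and the sample record are made up for illustration.

```python
def describe(name, chinese, english):
    return "{0}: chinese={1}, english={2}".format(name, chinese, english)


record = ("guanyu", 95, 85)
# * spreads the tuple into three positional arguments
print(describe(*record))  # guanyu: chinese=95, english=85

pairs = list(zip([1, 2, 3], [4, 5, 6]))
# Both calls below receive exactly the same three arguments
unpacked = list(zip(*pairs))
explicit = list(zip((1, 4), (2, 5), (3, 6)))
print(unpacked == explicit)  # True
print(unpacked)              # [(1, 2, 3), (4, 5, 6)]
```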

Data Structures

Stack

from collections import deque

stack = deque()

# stack push
stack.append(1)
stack.append(2)

# stack pop
stack.pop()
# output 2

Queue

Thread-safe Queue

try:
    from Queue import Queue  # Python 2
except ImportError:
    from queue import Queue  # Python 3

queue = Queue()
# enqueue
queue.put(1)
queue.put(2)

# dequeue
queue.get()
# output 1
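
Thread safety matters once a second thread is involved. The sketch below (Python 3) hands work to a single worker thread; the `None` sentinel and the doubling step are assumptions made for the example, not part of the Queue API.

```python
import threading
from queue import Queue

tasks = Queue()
results = []


def worker():
    while True:
        item = tasks.get()       # blocks until an item is available
        if item is None:         # sentinel value: stop the worker
            tasks.task_done()
            break
        results.append(item * 2)
        tasks.task_done()


thread = threading.Thread(target=worker)
thread.start()

for n in [1, 2, 3]:
    tasks.put(n)
tasks.put(None)

tasks.join()    # wait until every queued item has been marked done
thread.join()
print(results)  # [2, 4, 6]
```

With a single worker the items are processed in FIFO order, so the result order is deterministic.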

collections.deque

from collections import deque

queue = deque()

# enqueue at the tail
queue.append(1)
queue.append(2)

# dequeue from the head
queue.popleft()
# output 1

# enqueue at the head
queue.appendleft(3)

# dequeue from the tail
queue.pop()
# output 2

The left end of the deque is the head of the queue; the right end is the tail.
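
To make the head and tail convention concrete, the trace below repeats the same operations and records the contents of the deque along the way. It is only an illustrative sketch; the variable names are assumptions.

```python
from collections import deque

queue = deque()
queue.append(1)           # contents: [1]
queue.append(2)           # contents: [1, 2]

head = queue.popleft()    # takes 1 from the left (head)
queue.appendleft(3)       # contents: [3, 2]
snapshot = list(queue)

tail = queue.pop()        # takes 2 from the right (tail)

print(head, snapshot, tail, list(queue))  # 1 [3, 2] 2 [3]
```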

Map (dictionary)

  • Get / Set

local_map = dict()
# get returns default_value when key is missing
local_map.get(key, default_value)
local_map[key] = value

  • foreach loop

for key, value in local_map.items():
    print("key : {0}, value: {1}".format(key, value))
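
A typical use of get with a default value is counting. The word list in this sketch is invented for illustration.

```python
words = ["stack", "queue", "stack", "heap", "stack"]

counts = dict()
for word in words:
    # get supplies 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1

for key, value in counts.items():
    print("key : {0}, value: {1}".format(key, value))

print(counts.get("map"))     # None: missing key, no default given
print(counts.get("map", 0))  # 0
```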

LinkedHashMap

from collections import OrderedDict

linkedHashMap = OrderedDict()
linkedHashMap[key] = value
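
Like Java's LinkedHashMap, OrderedDict remembers insertion order. The sketch below (Python 3.2 or later, because of move_to_end) shows the ordering operations; the name `cache` and its contents are assumptions for the example. Note that since Python 3.7 a plain dict also preserves insertion order, but it has no move_to_end method.

```python
from collections import OrderedDict

cache = OrderedDict()
cache["a"] = 1
cache["b"] = 2
cache["c"] = 3
print(list(cache.keys()))  # ['a', 'b', 'c']

# Re-assigning an existing key keeps its original position
cache["a"] = 10
print(list(cache.keys()))  # ['a', 'b', 'c']

# move_to_end pushes "a" to the most recent position
cache.move_to_end("a")
print(list(cache.keys()))  # ['b', 'c', 'a']

# popitem(last=False) removes the oldest entry
oldest = cache.popitem(last=False)
print(oldest)  # ('b', 2)
```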

PriorityQueue (Heap)

try:
    from Queue import PriorityQueue  # Python 2
except ImportError:
    from queue import PriorityQueue  # Python 3

pq = PriorityQueue()

Enqueue:

pq.put((weight, value))

Dequeue in order of weight (smallest first):

pq.get()
  • Why doesn't Python's PriorityQueue support a remove operation?
  • How do you peek at the head of Python's PriorityQueue?

A hands-on solution:

import heapq

try:
    from Queue import PriorityQueue  # Python 2
except ImportError:
    from queue import PriorityQueue  # Python 3


# Note: both methods touch self.queue without taking the queue's lock
class PriorityQueueX(PriorityQueue):
    def peek(self):
        # self.queue is the underlying heap list; index 0 holds the smallest item
        try:
            return self.queue[0]
        except IndexError:
            return None

    def remove(self, element):
        # list.remove breaks the heap invariant, so rebuild the heap afterwards
        self.queue.remove(element)
        heapq.heapify(self.queue)
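
The subclass can be exercised as follows. This is a self-contained sketch for Python 3 that repeats the class so it runs on its own; the (weight, value) pairs are illustrative data.

```python
import heapq
from queue import PriorityQueue


class PriorityQueueX(PriorityQueue):
    def peek(self):
        # Smallest item without removing it, or None when empty
        try:
            return self.queue[0]
        except IndexError:
            return None

    def remove(self, element):
        # Remove an arbitrary element, then restore the heap invariant
        self.queue.remove(element)
        heapq.heapify(self.queue)


pq = PriorityQueueX()
for item in [(3, "c"), (1, "a"), (2, "b")]:
    pq.put(item)

print(pq.peek())     # (1, 'a')
pq.remove((2, "b"))
print(pq.get())      # (1, 'a')
print(pq.get())      # (3, 'c')
print(pq.peek())     # None
```

Because remove scans the list and re-heapifies, it costs O(n) per call, which is acceptable for small queues but not for hot paths.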

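Libraries

NumPy

ndarray (N-dimensional array object)

Why use NumPy arrays instead of Python's built-in list?
A built-in list holds references to objects that are scattered across memory, whereas a NumPy array stores its elements in one contiguous block of memory with a uniform element type.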
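
The contiguous, uniformly typed layout is visible through a few ndarray attributes. The sketch below assumes NumPy is installed; the five-element array is illustrative.

```python
import numpy as np

values = list(range(5))
arr = np.array(values, dtype=np.int64)

print(arr.dtype)                  # int64: every element has the same type
print(arr.itemsize)               # 8 bytes per element
print(arr.nbytes)                 # 40 bytes in one block
print(arr.flags['C_CONTIGUOUS'])  # True

# Operations apply to the whole block without a Python-level loop
doubled = arr * 2
print(doubled.tolist())           # [0, 2, 4, 6, 8]
```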

Creating arrays

import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

b[1][0] = 6

print(a.shape)
print(b.shape)

print(a.dtype)
print(b.dtype)

print(b)
Building structured arrays

import numpy as np

# Record type: a 32-byte string name followed by three integer scores
person_type = np.dtype({
    'names': ['name', 'chinese', 'english', 'mathematics'],
    'formats': ['S32', 'i', 'i', 'i']
})
peoples = np.array(
    [("zhangfei", 66, 65, 30),
     ("guanyu", 95, 85, 98),
     ("zhaoyun", 93, 92, 96),
     ("huangzhong", 90, 88, 77),
     ("dianwei", 80, 90, 90)
     ],
    dtype=person_type
)

# Select one field across all records
chinese = peoples[:]['chinese']
english = peoples[:]['english']
math = peoples[:]['mathematics']

print(np.mean(chinese))
print(np.mean(english))
print(np.mean(math))

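ufunc (universal function object)

Creating evenly spaced arrays

import numpy as np

# arange(start, stop, step): stop is excluded, giving [1 3 5 7 9]
array_range = np.arange(1, 11, 2)
# linspace(start, stop, num): stop is included, giving [1. 3. 5. 7. 9.]
array_linear = np.linspace(1, 9, 5)

print(array_range)
print(array_linear)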
Array arithmetic

import numpy as np

array_range = np.arange(1, 11, 2)
array_linear = np.linspace(1, 9, 5)

# Each ufunc works element by element
print(np.add(array_range, array_linear))
print(np.subtract(array_range, array_linear))
print(np.multiply(array_range, array_linear))
print(np.divide(array_range, array_linear))
print(np.power(array_range, array_linear))
# remainder and mod both compute array_range % array_linear
print(np.remainder(array_range, array_linear))
print(np.mod(array_range, array_linear))
Statistical functions

Maximum (amax) and minimum (amin) of arrays and matrices

import numpy as np


matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.amin(matrix))
# axis 0: minimum of each column
print(np.amin(matrix, 0))
# axis 1: minimum of each row
print(np.amin(matrix, 1))

print(np.amax(matrix))
print(np.amax(matrix, 0))
print(np.amax(matrix, 1))
Difference between maximum and minimum (ptp)

import numpy as np

matrix = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
print(np.ptp(matrix))
print(np.ptp(matrix, axis=0))
print(np.ptp(matrix, axis=1))
Percentiles (percentile)

import numpy as np

matrix = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])

print(np.percentile(matrix, 50))
print(np.percentile(matrix, 50, axis=0))
print(np.percentile(matrix, 50, axis=1))

In particular, the 50th percentile is the median, which can be computed with median; for the arithmetic mean, use mean.

import numpy as np


matrix = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])

print(np.median(matrix))
print(np.median(matrix, axis=0))
print(np.median(matrix, axis=1))
print(np.mean(matrix))
print(np.mean(matrix, axis=0))
print(np.mean(matrix, axis=1))
Weighted average (average)

import numpy as np


array = np.array([1, 2, 3, 4])
weight_array = np.array([1, 2, 3, 4])
print(np.average(array))
print(np.average(array, weights=weight_array))
Standard deviation (std) and variance (var)

import numpy as np


array = np.array([1, 2, 3, 4])
print(np.std(array))
print(np.var(array))

Sorting (sort)

import numpy as np


array_2d = np.array([[4, 3, 2],[2, 4, 1]])
print(np.sort(array_2d))
print("*" * 20)
print(np.sort(array_2d, axis=None))
print("*" * 20)
print(np.sort(array_2d, axis=0))
print("*" * 20)
print(np.sort(array_2d, axis=1))

For the Jupyter Notebook, see GitHub - dataAnalysis - basic_numpy_usage.ipynb

Pandas

Data structures: Series and DataFrame

Series: a fixed-length dictionary

Its biggest difference from a dictionary is that its length is fixed.

from pandas import Series


series_default_index = Series([1, 2, 3, 4])
series_customized_index = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series_dictionary = Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

print(series_default_index)
print(series_customized_index)
print(series_dictionary)
DataFrame

from pandas import DataFrame


data = {'Chinese': [66, 95, 93, 90,80],'English': [65, 85, 92, 88, 90],'Math': [30, 98, 96, 77, 90]}
df1= DataFrame(data)
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])
print(df1)
print(df2)

Importing and exporting data

import pandas as pd
from pandas import DataFrame


data = DataFrame(pd.read_csv('data/active_user_data_province.csv'))
data.to_csv('data/active_user_data_province-copy.csv')
print(data)

Data cleaning

Common data problems

Dropping unnecessary rows and columns from a DataFrame

import pandas as pd
from pandas import Series, DataFrame


data = {'Chinese': [66, 95, 93, 90,80],'English': [65, 85, 92, 88, 90],'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])


df2 = df2.drop(columns=['Chinese'])
df2 = df2.drop(index=['ZhangFei'])
Renaming columns

from pandas import DataFrame


data = {'Chinese': [66, 95, 93, 90,80],'English': [65, 85, 92, 88, 90],'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])

df2.rename(columns={'Chinese': 'Mandarins', 'Math': 'Mathematics'}, inplace=True)
print(df2)
Removing duplicate rows

from pandas import DataFrame

data = {'Chinese': [66, 95, 93, 90, 80, 80],'English': [65, 85, 92, 88, 90, 90],'Math': [30, 98, 96, 77, 90, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'], columns=['English', 'Math', 'Chinese'])
df2 = df2.drop_duplicates()  # compares column values and keeps the first occurrence
Data format problems

  • Data types

import numpy as np
from pandas import DataFrame


data = {'Chinese': [66, 95, 93, 90, 80], 'English': [65, 85, 92, 88, 90], 'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])

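# astype returns a new Series, so assign the result back to the column
df2['Chinese'] = df2['Chinese'].astype('str')
df2['Chinese'] = df2['Chinese'].astype(np.int64)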
  • Whitespace in data

from pandas import DataFrame


data = {'Chinese': ['66 ', ' 95', ' 93 ', '90$', '$80$'], 'English': [65, 85, 92, 88, 90], 'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])

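# Strip whitespace from both ends
df2['Chinese'] = df2['Chinese'].map(str.strip)
# Strip whitespace from the left
df2['Chinese'] = df2['Chinese'].map(str.lstrip)
# Strip whitespace from the right
df2['Chinese'] = df2['Chinese'].map(str.rstrip)
# Strip dollar signs from both ends
df2['Chinese'] = df2['Chinese'].str.strip('$')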

print(df2)
  • Case conversion

from pandas import DataFrame


data = {'Chinese': ['66 ', ' 95', ' 93 ', '90$', '$80$'], 'English': [65, 85, 92, 88, 90], 'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'], columns=['English', 'Math', 'Chinese'])

# All uppercase
# df2.columns = df2.columns.str.upper()
# All lowercase
# df2.columns = df2.columns.str.lower()
# Title case (capitalize the first letter)
df2.columns = df2.columns.str.title()

print(df2)
  • Finding missing values

import pandas as pd


data = {'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['English', 'Math', 'Chinese'])

# df.isnull() returns a boolean mask for every cell
# any() then reports, per column, whether at least one value is missing
df.isnull().any()

Cleaning data with the apply function

  • Uppercase the name column

import pandas as pd


data = {'name': ['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'], 'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['name', 'English', 'Math', 'Chinese'])

# print(df)
# apply calls str.upper on every value in the column
df['name'] = df['name'].apply(str.upper)
print(df)
  • Multiply the Math column by 2

import pandas as pd


data = {'name': ['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'], 'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['name', 'English', 'Math', 'Chinese'])


def double_value(value):
    return value * 2


df['Math'] = df['Math'].apply(double_value)
print(df)
  • Compute two new columns from the Chinese and English columns

import pandas as pd


data = {'name': ['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'], 'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['name', 'English', 'Math', 'Chinese'])


def plus(row, n, m):
    # With axis=1, apply passes one row (a Series) at a time
    row['newColumns1'] = (row['Chinese'] + row['English']) * m
    row['newColumns2'] = (row['Chinese'] + row['English']) * n
    return row


# args supplies the extra positional arguments: n=2, m=3
df = df.apply(plus, axis=1, args=(2, 3,))

print(df)

Descriptive statistics

import pandas as pd


data = {'name': ['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei', 'DianWei'], 'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['name', 'English', 'Math', 'Chinese'])

# Summary of the numeric columns: count, mean, std, min, quartiles, max
df.describe()

Merging tables

  • Merge on a column name

import pandas as pd
from pandas import DataFrame


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

df3 = pd.merge(df1, df2, on='name')
  • inner join
import pandas as pd
from pandas import DataFrame


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

df3 = pd.merge(df1, df2, how='inner')
print(df3)
  • left join
import pandas as pd
from pandas import DataFrame


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

df3 = pd.merge(df1, df2, how='left')
print(df3)
  • right join
import pandas as pd
from pandas import DataFrame


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

df3 = pd.merge(df1, df2, how='right')
print(df3)
  • outer join
import pandas as pd
from pandas import DataFrame


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

df3 = pd.merge(df1, df2, how='outer')
print(df3)
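
The four join types differ in which rows survive. A quick way to see this is to compare row counts on the same two frames; the sketch below assumes pandas is installed and passes on='name' explicitly, although the two frames share only that column anyway.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = pd.DataFrame({'name': ['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2': range(5)})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='name', how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'left': 5, 'right': 5, 'outer': 8}

# Rows without a match are kept by left/right/outer joins and filled with NaN
left = pd.merge(df1, df2, on='name', how='left')
print(int(left['data2'].isnull().sum()))  # 3
```

Only 'ZhangFei' and 'GuanYu' appear in both frames, which is why the inner join has two rows and the outer join has 2 + 3 + 3 = 8.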

Querying Pandas with SQL

from pandas import DataFrame
from pandasql import sqldf


df1 = DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1': range(5)})
pysqldf = lambda sql: sqldf(sql, globals())
sql = "select * from df1 where name ='ZhangFei'"
print(pysqldf(sql))

For the code above, see the Jupyter Notebook: GitHub - dataAnalysis - basic pandas usage.ipynb

References
