Pandas merge合并操作的实现
Pandas 提供的 merge() 函数能够进行高效的合并操作,这与 SQL 关系型数据库的 join用法非常相似。从字面意思上不难理解,merge 翻译为“合并”,指的是将两个 DataFrame 数据表按照指定的规则进行连接,最后拼接成一个新的 DataFrame 数据表。
merge() 函数的法格式如下:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True)
参数说明,如下表所示:
参数名称 | 说明 |
---|---|
left/right | 两个不同的 DataFrame 对象。 |
on | 指定用于连接的键(即列标签的名字),该键必须同时存在于左右两个 DataFrame 中,如果没有指定,并且其他参数也未指定, 那么将会以两个 DataFrame 的列名交集做为连接键。 |
left_on | 指定左侧 DataFrame 中作连接键的列名。该参数在左、右列标签名不相同,但表达的含义相同时非常有用。 |
right_on | 指定左侧 DataFrame 中作连接键的列名。 |
left_index | 布尔参数,默认为 False。如果为 True 则使用左侧 DataFrame 的行索引作为连接键,若 DataFrame 具有多层 索引(MultiIndex),则层的数量必须与连接键的数量相等。 |
right_index | 布尔参数,默认为 False。如果为 True 则使用左侧 DataFrame 的行索引作为连接键。 |
how | 要执行的合并类型,从 {‘left’, ‘right’, ‘outer’, ‘inner’} 中取值,默认为“inner”内连接。 |
sort | 布尔值参数,默认为True,它会将合并后的数据进行排序;若设置为 False,则按照 how 给定的参数值进行排序。 |
suffixes | 字符串组成的元组。当左右 DataFrame 存在相同列名时,通过该参数可以在相同的列名后附加后缀名,默认为(‘_x’,‘_y’)。 |
copy | 默认为 True,表示对数据进行复制。 |
注意:Pandas 库的 merge() 支持各种内外连接,与其相似的还有 join() 函数(默认为左连接)。
准备两个不同的DataFrame,这里从文件中读取:
文件empdata.csv:
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10
文件deptdata.csv:
DEPTNO,DNAME,LOC
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
下面创建两个不同的 DataFrame,然后对它们进行合并操作:
import pandas as pd import numpy as np df_emp = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') print("员工表中的df数据:\n",df_emp) print("部门表中的df数据:\n",df_dept)
输出如下:
员工表中的df数据:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
部门表中的df数据:
DEPTNO DNAME LOC
0 10 ACCOUNTING NEW YORK
1 20 RESEARCH DALLAS
2 30 SALES CHICAGO
3 40 OPERATIONS BOSTON
1) 在单个键上进行合并操作
通过 on 参数指定一个连接键,然后对上述 DataFrame 进行合并操作:
import pandas as pd import numpy as np df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 通过on指定合并的主键 print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO"))
输出结果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO[14 rows x 10 columns]
2) 在多个键上进行合并操作
下面示例,指定多个键来合并上述两个 DataFrame 对象:(这里使用自定义数据了)
import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4], 'Name': ['Smith', 'Maiki', 'Hunter', 'Hilen'], 'subject_id':['sub1','sub2','sub4','sub6']}) right = pd.DataFrame({ 'id':[1,2,3,4], 'Name': ['Bill', 'Lucy', 'Jack', 'Mike'], 'subject_id':['sub2','sub4','sub3','sub6']}) print(pd.merge(left,right,on=['id','subject_id']))
输出结果:
id Name_x subject_id Name_y
0 4 Hilen sub6 Mike
使用how参数合并
通过how
参数可以确定 DataFrame 中要包含哪些键,如果在左表、右表都不存的键,那么合并后该键对应的值为 NaN。为了便于大家学习,我们将 how 参数和与其等价的 SQL 语句做了总结:
Merge方法 | 等效 SQL | 描述 |
---|---|---|
left | LEFT OUTER JOIN | 使用左侧对象的key |
right | RIGHT OUTER JOIN | 使用右侧对象的key |
outer | FULL OUTER JOIN | 使用左右两侧所有key的并集 |
inner | INNER JOIN | 使用左右两侧key的交集 |
1) left join
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 为了演示,这里将雇员7369的部门删除了,运行如下: print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="left"))
输出结果:能发现7369的DEPTNO,DNAME,LOC都为NaN
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
4 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
5 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
6 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
9 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
10 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
11 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
12 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
13 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK[14 rows x 10 columns]
2) right join
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部门40因为没有员工,所以对应员工的信息全部为NaN print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="right"))
输出结果:(部门40因为没有员工,所以对应员工的信息全部为NaN)
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7782.0 CLARK MANAGER 7839.0 ... NaN 10 ACCOUNTING NEW YORK
1 7839.0 KING PRESIDENT NaN ... NaN 10 ACCOUNTING NEW YORK
2 7934.0 MILLER CLERK 7782.0 ... NaN 10 ACCOUNTING NEW YORK
3 7369.0 SMITH CLERK 7902.0 ... NaN 20 RESEARCH DALLAS
4 7566.0 JONES MANAGER 7839.0 ... NaN 20 RESEARCH DALLAS
5 7788.0 SCOTT ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
6 7876.0 ADAMS CLERK 7788.0 ... NaN 20 RESEARCH DALLAS
7 7902.0 FORD ANALYST 7566.0 ... NaN 20 RESEARCH DALLAS
8 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30 SALES CHICAGO
9 7521.0 WARD SALESMAN 7698.0 ... 500.0 30 SALES CHICAGO
10 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30 SALES CHICAGO
11 7698.0 BLAKE MANAGER 7839.0 ... NaN 30 SALES CHICAGO
12 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30 SALES CHICAGO
13 7900.0 JAMES CLERK 7698.0 ... NaN 30 SALES CHICAGO
14 NaN NaN NaN NaN ... NaN 40 OPERATIONS BOSTON[15 rows x 10 columns]
3) outer join(并集)
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部门40因为没有员工,所以对应员工的信息全部为NaN,又因为7369没对应的部门,所以部门信息全部为Nan print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="outer"))
输出结果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7369.0 SMITH CLERK 7902.0 ... NaN NaN NaN NaN
1 7499.0 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
2 7521.0 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
3 7654.0 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
4 7698.0 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
5 7844.0 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
6 7900.0 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
7 7566.0 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
8 7788.0 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
9 7876.0 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
10 7902.0 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
11 7782.0 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
12 7839.0 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
13 7934.0 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK
14 NaN NaN NaN NaN ... NaN 40.0 OPERATIONS BOSTON[15 rows x 10 columns]
4) inner join(交集)
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') # 部门40因为没有员工,,又因为7369没对应的部门,所以部门40和员工7369没显示 print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="inner"))
输出结果:
EMPNO ENAME JOB MGR ... COMM DEPTNO DNAME LOC
0 7499 ALLEN SALESMAN 7698.0 ... 300.0 30.0 SALES CHICAGO
1 7521 WARD SALESMAN 7698.0 ... 500.0 30.0 SALES CHICAGO
2 7654 MARTIN SALESMAN 7698.0 ... 1400.0 30.0 SALES CHICAGO
3 7698 BLAKE MANAGER 7839.0 ... NaN 30.0 SALES CHICAGO
4 7844 TURNER SALESMAN 7698.0 ... 0.0 30.0 SALES CHICAGO
5 7900 JAMES CLERK 7698.0 ... NaN 30.0 SALES CHICAGO
6 7566 JONES MANAGER 7839.0 ... NaN 20.0 RESEARCH DALLAS
7 7788 SCOTT ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
8 7876 ADAMS CLERK 7788.0 ... NaN 20.0 RESEARCH DALLAS
9 7902 FORD ANALYST 7566.0 ... NaN 20.0 RESEARCH DALLAS
10 7782 CLARK MANAGER 7839.0 ... NaN 10.0 ACCOUNTING NEW YORK
11 7839 KING PRESIDENT NaN ... NaN 10.0 ACCOUNTING NEW YORK
12 7934 MILLER CLERK 7782.0 ... NaN 10.0 ACCOUNTING NEW YORK
注意:当 a 与 b 进行内连操作时 a.join(b) 不等于 b.join(a)。
5)使用join函数
如果左右两个表的主键名相同,可以使用join函数。
import pandas as pd df_emp_left = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv') df_dept_right = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv') print(df_emp_left.join(df_dept_right,how="outer",rsuffix='1'))
运行结果:
EMPNO ENAME JOB MGR ... DEPTNO DEPTNO1 DNAME LOC
0 7369 SMITH CLERK 7902.0 ... NaN 10.0 ACCOUNTING NEW YORK
1 7499 ALLEN SALESMAN 7698.0 ... 30.0 20.0 RESEARCH DALLAS
2 7521 WARD SALESMAN 7698.0 ... 30.0 30.0 SALES CHICAGO
3 7566 JONES MANAGER 7839.0 ... 20.0 40.0 OPERATIONS BOSTON
4 7654 MARTIN SALESMAN 7698.0 ... 30.0 NaN NaN NaN
5 7698 BLAKE MANAGER 7839.0 ... 30.0 NaN NaN NaN
6 7782 CLARK MANAGER 7839.0 ... 10.0 NaN NaN NaN
7 7788 SCOTT ANALYST 7566.0 ... 20.0 NaN NaN NaN
8 7839 KING PRESIDENT NaN ... 10.0 NaN NaN NaN
9 7844 TURNER SALESMAN 7698.0 ... 30.0 NaN NaN NaN
10 7876 ADAMS CLERK 7788.0 ... 20.0 NaN NaN NaN
11 7900 JAMES CLERK 7698.0 ... 30.0 NaN NaN NaN
12 7902 FORD ANALYST 7566.0 ... 20.0 NaN NaN NaN
13 7934 MILLER CLERK 7782.0 ... 10.0 NaN NaN NaN
到此这篇关于Pandas merge合并操作的实现的文章就介绍到这了,更多相关Pandas merge合并内容请搜索脚本之家以前的文章或继续浏览下面的相关文章希望大家以后多多支持脚本之家!
相关文章
python中的elasticsearch_dsl查询语句转换成es查询语句详解
这篇文章主要介绍了python中的elasticsearch_dsl查询语句转换成es查询语句详解,ElasticSearch在实际生产里通常和LogStash,Kibana,FileBeat一起构成Elastic Stack来使用,它是这些组件里面最核心的一个,需要的朋友可以参考下2023-07-07解决tensorboard多个events文件显示紊乱的问题
今天小编就为大家分享一篇解决tensorboard多个events文件显示紊乱的问题,具有很好的参考价值,希望对大家有所帮助。一起跟随小编过来看看吧2020-02-02Python使用Rich type和TinyDB构建联系人通讯录
这篇文章主要为大家介绍了Python使用Rich type和TinyDB构建联系人通讯录应用程序,有需要的朋友可以借鉴参考下,希望能够有所帮助,祝大家多多进步,早日升职加薪2022-08-08
最新评论