
Are Diamonds Valuable? Analyzing Recent Diamond Price Trends with Python (with Charts)

Date: 2019-02-25 17:46:37

Life is short, I use Python


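I. Data Description

The dataset examined in this article covers the attributes and prices of diamonds. It contains 53,943 diamonds and 10 features (carat, cut, color, clarity, depth, table, price, x, y, z).

Dataset: DiamondsPrices.csv

1. Variable types in the dataset

There are 10 variables in total: 3 of object type [cut, color, clarity], 1 of integer (int64) type [price], and 6 of floating-point (float64) type [carat, depth, table, x, y, z].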

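By default pandas does not distinguish between str and object: both are reported as dtype('O'), and forcing the type to dtype('S') does not change that.

NumPy does distinguish them: dtype('O') holds arbitrary Python objects, while dtype('S') holds fixed-width byte strings (in Python 3, text str values map to dtype('U')).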
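The NumPy side of this claim can be checked with a few toy arrays. This is a minimal sketch assuming Python 3 and NumPy; the array names and values are made up for illustration and are not read from the dataset.

```python
import numpy as np

# Toy values standing in for the `cut` column (not read from the CSV).
text_arr = np.array(["Ideal", "Premium", "Good"])           # Python str values
bytes_arr = np.array([b"Ideal", b"Premium", b"Good"])       # byte strings
object_arr = np.array(["Ideal", 326, 0.23], dtype=object)   # mixed Python objects

# dtype.kind is a one-letter code: 'U' = Unicode str, 'S' = bytes, 'O' = object.
print(text_arr.dtype.kind)
print(bytes_arr.dtype.kind)
print(object_arr.dtype.kind)
```

The three arrays report different kinds, which is the distinction a default pandas object column does not show.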

2. Meaning of the variables

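II. Questions

Which categories are most common among the diamonds?

How strongly does each attribute correlate with price?

How is price distributed within each category?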


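III. Data Preprocessing

1. Why preprocess the data

Raw data can have the following problems:

Inconsistency: the meaning of the data is not consistent

Duplication

Incompleteness: attributes of interest have no value

Noise: the data contains errors or outliers (values that deviate from what is expected)

High dimensionality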
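Three of the problems listed above (duplication, incompleteness, noise) can be detected with a few pandas calls. The sketch below uses a made-up five-row frame, not the real dataset; treating a zero length in `x` as noise is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up): one duplicate row, one missing price, one zero length.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.21, 0.29, 0.31],
    "price": [326.0, 326.0, 326.0, None, 335.0],
    "x":     [3.95, 3.89, 3.89, 4.20, 0.0],
})

n_missing = int(toy["price"].isnull().sum())   # incomplete: no value recorded
n_duplicates = int(toy.duplicated().sum())     # repeated rows
n_zero_size = int((toy["x"] == 0).sum())       # noise: a physically impossible length

print(n_missing, n_duplicates, n_zero_size)

# One possible cleanup: drop duplicates, rows with missing values, and zero lengths.
clean = toy.drop_duplicates().dropna()
clean = clean[clean["x"] > 0]
print(len(clean))
```

Whether to drop or repair such rows depends on the analysis; dropping is just the simplest option to show.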

2. Preprocessing methods

3. Preprocessing the data

Here we find no missing values and no duplicate rows, so the raw data can be used as is.

import pandas as pd

# Raw string so the backslashes in the Windows-style path are taken literally
df = pd.read_csv(r'.\data\DiamondsPrices.csv')
print(df.head())
df.describe().to_excel(r'.\result\data1.xlsx')

print("-------------Missing-value counts: the dataset has no missing values-------------")
print(df.isnull().sum())

print("-------------Data types-------------")
# df.info() prints its report and returns None, so this line also prints "None"
print(df.info())

print("-------------Duplicate rows: none are found-------------")
# Note: the 'Unnamed: 0' row-number column is still part of this comparison
print(df[df.duplicated()])

Output:

   Unnamed: 0  carat      cut color clarity  ...  table  price     x     y     z
0           1   0.23    Ideal     E     SI2  ...   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium     E     SI1  ...   61.0    326  3.89  3.84  2.31
2           3   0.23     Good     E     VS1  ...   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium     I     VS2  ...   58.0    334  4.20  4.23  2.63
4           5   0.31     Good     J     SI2  ...   58.0    335  4.34  4.35  2.75

[5 rows x 11 columns]
-------------Missing-value counts: the dataset has no missing values-------------
Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64
-------------Data types-------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  53943 non-null  int64
 1   carat       53943 non-null  float64
 2   cut         53943 non-null  object
 3   color       53943 non-null  object
 4   clarity     53943 non-null  object
 5   depth       53943 non-null  float64
 6   table       53943 non-null  float64
 7   price       53943 non-null  int64
 8   x           53943 non-null  float64
 9   y           53943 non-null  float64
 10  z           53943 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
None
-------------Duplicate rows: none are found-------------
Empty DataFrame
Columns: [Unnamed: 0, carat, cut, color, clarity, depth, table, price, x, y, z]
Index: []


IV. Data Visualization

1. Importing modules and data

About color_palette()

seaborn ships six default palettes: deep, muted, pastel, bright, dark, colorblind.

seaborn.color_palette(palette=None, n_colors=None, desat=None)
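To see what the three parameters in that signature do, here is a small sketch assuming seaborn is installed. The variable names are made up; the palette name 'pastel' is the one the article uses.

```python
import seaborn as sns

# Ask for the first five colors of the built-in 'pastel' palette.
palette = sns.color_palette("pastel", n_colors=5)

print(len(palette))   # number of colors returned
print(palette[0])     # one color as an (r, g, b) tuple of floats in [0, 1]

# desat below 1 reduces saturation, moving each color toward gray.
washed_out = sns.color_palette("pastel", n_colors=5, desat=0.5)
print(washed_out[0])
```

Passing n_colors=5 is an alternative to slicing the returned list with [0:5].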

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# These two lines avoid the font warning raised when Chinese text is used in a plot:
# RuntimeWarning: Glyph 20363 missing from current font.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

colors = sns.color_palette('pastel')[0:5]
diamonds = pd.read_csv('./data/DiamondsPrices.csv')
# Drop the row-number column that came from the CSV
diamonds = diamonds.drop(['Unnamed: 0'], axis=1)
print(diamonds.head())

2. Separating the ordered categorical columns from the numeric columns

diamonds_cat = ['cut', 'color', 'clarity']
diamonds_num = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']

3. Most common categories among the diamonds

for c in diamonds_cat:
    print('----', c, '----')
    # Count how many diamonds fall into each value of this attribute
    counts = diamonds[c].value_counts()
    print(counts)
    counts.plot(kind='bar', title=f'Counting diamonds per {c.title()}.')
    plt.savefig(r'.\result\Counting_diamonds per_' + f'{c.title()}.png')
    plt.show()

Conclusion:

The most common value of each attribute is the Ideal cut (21,551 diamonds), color G (11,292) and clarity SI1 (13,067).

---- cut ----
Ideal        21551
Premium      13793
Very Good    12083
Good          4906
Fair          1610
Name: cut, dtype: int64
---- color ----
G    11292
E     9799
F     9543
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64
---- clarity ----
SI1     13067
VS2     12259
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741

4. Attributes of the most expensive diamond

print(diamonds[diamonds.price == diamonds.price.max()])

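5. Counting attribute values among diamonds priced at or above Q3

About Q3

The first quartile (Q1), also called the lower quartile, is the value at the 25% position when all values in the sample are sorted from smallest to largest.

The second quartile (Q2), also called the median, is the value at the 50% position.

The third quartile (Q3), also called the upper quartile, is the value at the 75% position.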
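The three definitions can be tried out on a short list using only the standard library. The prices below are made up. The "inclusive" method is chosen on the assumption that it follows the same linear-interpolation rule as the pandas default, which holds for this toy input.

```python
from statistics import quantiles

# Made-up prices, already sorted from smallest to largest.
prices = [326, 327, 334, 335, 400, 554, 2757, 5000, 9000]

# method="inclusive" interpolates linearly between data points, treating the
# sample minimum and maximum as the 0% and 100% points.
q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
print(q1, q2, q3)

# Keep only the prices at or above the third quartile.
top_quartile = [p for p in prices if p >= q3]
print(top_quartile)
```

With nine values the 25%, 50% and 75% positions land exactly on the 3rd, 5th and 7th sorted values, so no interpolation is needed here.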

for c in diamonds_cat:
    # Keep diamonds priced at or above the third quartile, then count each category
    dlv = diamonds.loc[diamonds.price >= diamonds.price.quantile(q=.75)][c].value_counts()
    print(c, '--\n', dlv)
    dlv.plot(kind='bar').set_title(f'Counting Diamonds for kind of {c.title()}.')
    plt.show()

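6. Data for diamonds with the most common attributes

About ascending

ascending sets the sort order: True sorts in ascending order and is the default, so it can be omitted; False sorts in descending order.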
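A quick illustration of both settings on a made-up three-row frame (the values are for demonstration only):

```python
import pandas as pd

# Made-up rows; only the sort order matters here.
toy = pd.DataFrame({"carat": [0.31, 0.23, 0.29], "price": [335, 326, 334]})

cheapest_first = toy.sort_values("price")                    # ascending=True is the default
priciest_first = toy.sort_values("price", ascending=False)   # descending order

print(cheapest_first["price"].tolist())
print(priciest_first["price"].tolist())
```

sort_values returns a new frame and leaves `toy` unchanged, so both orderings can be kept side by side.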

# Ideal cut + color G + clarity SI1, most expensive first
IGS_ByPriDesc = diamonds[
    (diamonds.cut == 'Ideal') & (diamonds.color == 'G') & (diamonds.clarity == 'SI1')
].sort_values('price', ascending=False)
IGS_ByPriDesc.to_excel(r'.\result\IGS_ByPriDesc.xlsx')

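7. Share of one specific combination of features

# Most common value of each attribute: Ideal cut (21551), color G (11292), clarity SI1 (13067)
a = diamonds[(diamonds.cut == 'Ideal') & (diamonds.color == 'G') & (diamonds.clarity == 'SI1')].shape[0]
b = diamonds.shape[0]
# Pie wedges must be disjoint parts of the whole, so plot the remainder (b - a), not the total b
plt.pie([a, b - a], labels=['Ideal+G+SI1', 'All other diamonds'], colors=colors, autopct='%.6f%%')
plt.title('Share of Ideal+G+SI1 diamonds')
plt.savefig(r'.\result\Ideal_G_SI1_pie.png')
plt.show()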

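8. Correlation between each attribute and price

About corr()

corr() computes the correlation between two variables (the Pearson coefficient by default). The result lies between -1 and 1: the closer its absolute value is to 1, the stronger the linear relationship, and the sign gives the direction.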
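To make the coefficient concrete, here is the Pearson formula written out with the standard library. This is a sketch for illustration, not how pandas implements it. The `pearson` helper, the numbers, and the `discount` column are all made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: price rises with carat and falls with the hypothetical discount.
carat = [0.2, 0.3, 0.5, 0.7, 1.0]
price = [300, 450, 1200, 2500, 5000]
discount = [50, 40, 30, 20, 10]

r_carat = pearson(carat, price)
r_discount = pearson(discount, price)
print(round(r_carat, 3), round(r_discount, 3))
```

The first value is close to +1 and the second close to -1: both relationships are strong, and only the direction differs.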

print(diamonds['carat'].corr(diamonds['price']))
print(diamonds['depth'].corr(diamonds['price']))
print(diamonds['table'].corr(diamonds['price']))

9. Correlation heatmap of Carat, Table, Depth and Price

plt.figure(figsize=(16, 6))
sns.heatmap(
    diamonds.loc[:, ['carat', 'table', 'depth', 'price']].corr(),
    vmin=-1, vmax=1, annot=True,
).set_title('Correlation heatmap of Carat, Table, Depth and Price', fontdict={'fontsize': 12}, pad=12)
plt.show()

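10. Price distribution by each categorical variable

A KDE plot shows a Kernel Density Estimation, an estimate of the probability density. It can be thought of as a windowed smoothing of a histogram. KDE plots make it easy to inspect and compare how a variable is distributed across groups, for example a feature in a training set versus a test set.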
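The "smoothed histogram" idea can be sketched by hand: place a small Gaussian bump on every sample and average them. This is only an illustration with made-up prices and a hand-picked bandwidth; seaborn chooses its bandwidth automatically, and `gaussian_kde` here is a hypothetical helper, not a library function.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian bumps centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Made-up prices and a hand-picked bandwidth (both are assumptions).
prices = [326, 334, 335, 554, 2757, 2760, 5000]
density = gaussian_kde(prices, bandwidth=400.0)

# Riemann sum over a wide grid: a valid density should integrate to about 1.
step = 10
area = sum(density(x) * step for x in range(-5000, 12000, step))
print(round(area, 3))

# The estimate is higher where samples cluster than in the gaps between them.
print(density(330) > density(1500))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual samples more closely.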

for c in ['cut', 'color', 'clarity']:
    # One KDE curve per category value, all drawn over the price axis
    sns.displot(data=diamonds, x="price", hue=c, kind='kde')
    plt.title(f'Price distribution by {c.title()}')
    plt.subplots_adjust(top=0.95)
    plt.savefig(fr'.\result\price_distribution_by_{c.title()}.png')
