Reading notes on "Python for Data Analysis" -- Chapter 7, Data Wrangling: Clean, Transform, Merge, Reshape (Part 3)...

2022-05-05

http://www.cnblogs.com/batteryhp/p/5046433.html

 

5. Example: USDA Food Database

Below is a concrete example; the examples are the most important part of the book.

#-*- encoding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import re
import json

#load the 30 MB+ data file
db = json.load(open('E:\\foods-2011-10-03.json'))
#print len(db)
#print type(db)
#db is a list; each entry is a dict holding all the data for one food
#print db[0] #this entry is very long
#print db[0].keys()
#'nutrients' is one of the keys; its value is a (long) list of dicts describing the food's nutrients
#print db[0]['nutrients'][0]
#turn the nutrients into a DataFrame
nutrients = DataFrame(db[0]['nutrients']) #a list of dicts converts directly to a DataFrame
#print nutrients.head()
#print type(db[0]['nutrients'])
info_keys = ['description','group','id','manufacturer']
info = DataFrame(db,columns = info_keys)
#print info
#look at the distribution of food groups
#print pd.value_counts(info.group)
#to analyze all the nutrient data, the nutrients of every food are assembled into one big table, in several steps
nutrients = []
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id'] #the scalar id is broadcast to every row
    nutrients.append(fnuts)
nutrients = pd.concat(nutrients,ignore_index = True) #concatenate the list of frames row-wise, like rbind in R
#remove duplicates, an important step in data processing
print nutrients.duplicated().sum()
nutrients = nutrients.drop_duplicates()
#nutrients and info share column names, so the columns of info are renamed first
#note how the mapping is written
col_mapping = {'description':'food', 'group':'fgroup'}
#rename returns a new DataFrame; copy = False avoids copying the underlying data
info = info.rename(columns = col_mapping,copy = False)
#print info.columns
#rename the nutrients columns as well
col_mapping = {'description':'nutrient','group':'nutgroup'}
nutrients = nutrients.rename(columns = col_mapping,copy = False)
#print nutrients.columns
#with that done, the two DataFrames can be merged
print nutrients.ix[:10,:]
#print info.id
ndata = pd.merge(nutrients,info,on = 'id',how = 'outer')
print ndata
print ndata.ix[3000]
#a neat step: the median value for each nutrient within each food group
result = ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
print result
result['Zinc, Zn'].order().plot(kind = 'barh')
plt.show()
#with a little thought (as the author says more than once...), you can find which food is richest in each nutrient
by_nutrient = ndata.groupby(['nutgroup','nutrient'])
print by_nutrient.head()
#note how the row with the maximum value is picked out
get_maximum = lambda x:x.xs(x.value.idxmax())
get_minimum = lambda x:x.xs(x.value.idxmin())
max_foods = by_nutrient.apply(get_maximum)[['value','food']]
#truncate the food names to 50 characters
max_foods.food = max_foods.food.str[:50]
print max_foods.head()
print max_foods.ix['Amino Acids']['food']

>>>
14179
                       nutrient     nutgroup units    value    id
0                       Protein  Composition     g    25.18  1008
1             Total lipid (fat)  Composition     g    29.20  1008
2   Carbohydrate, by difference  Composition     g     3.06  1008
3                           Ash        Other     g     3.28  1008
4                        Energy       Energy  kcal   376.00  1008
5                         Water  Composition     g    39.28  1008
6                        Energy       Energy    kJ  1573.00  1008
7          Fiber, total dietary  Composition     g     0.00  1008
8                   Calcium, Ca     Elements    mg   673.00  1008
9                      Iron, Fe     Elements    mg     0.64  1008
10                Magnesium, Mg     Elements    mg    22.00  1008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns:
nutrient        375176  non-null values
nutgroup        375176  non-null values
units           375176  non-null values
value           375176  non-null values
id              375176  non-null values
food            375176  non-null values
fgroup          375176  non-null values
manufacturer    293054  non-null values
dtypes: float64(1), int64(1), object(6)
nutrient                 Glycine
nutgroup             Amino Acids
units                          g
value                      0.073
id                          1077
food            Spearmint, fresh
fgroup          Spices and Herbs
manufacturer
Name: 3000
nutrient          fgroup
Adjusted Protein  Sweets                               12.900
                  Vegetables and Vegetable Products     2.180
Alanine           Baby Foods                            0.085
                  Baked Products                        0.248
                  Beef Products                         1.550
                  Beverages                             0.003
                  Breakfast Cereals                     0.311
                  Cereal Grains and Pasta               0.373
                  Dairy and Egg Products                0.271
                  Ethnic Foods                          1.290
                  Fast Foods                            0.514
                  Fats and Oils                         0.000
                  Finfish and Shellfish Products        1.218
                  Fruits and Fruit Juices               0.027
                  Lamb, Veal, and Game Products         1.408
...
Zinc, Zn  Finfish and Shellfish Products       0.67
          Fruits and Fruit Juices              0.10
          Lamb, Veal, and Game Products        3.94
          Legumes and Legume Products          1.14
          Meals, Entrees, and Sidedishes       0.63
          Nut and Seed Products                3.29
          Pork Products                        2.32
          Poultry Products                     2.50
          Restaurant Foods                     0.80
          Sausages and Luncheon Meats          2.13
          Snacks                               1.47
          Soups, Sauces, and Gravies           0.20
          Spices and Herbs                     2.75
          Sweets                               0.36
          Vegetables and Vegetable Products    0.33
Length: 2246
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 467 entries, (u'Amino Acids', u'Alanine', 48) to (u'Vitamins', u'Vitamin K (phylloquinone)', 395)
Data columns:
nutrient        467  non-null values
nutgroup        467  non-null values
units           467  non-null values
value           467  non-null values
id              467  non-null values
food            467  non-null values
fgroup          467  non-null values
manufacturer    444  non-null values
dtypes: float64(1), int64(1), object(6)
                            value                                          food
nutgroup    nutrient
Amino Acids Alanine         8.009             Gelatins, dry powder, unsweetened
            Arginine        7.436                  Seeds, sesame flour, low-fat
            Aspartic acid  10.203                           Soy protein isolate
            Cystine         1.307  Seeds, cottonseed flour, low fat (glandless)
            Glutamic acid  17.452                           Soy protein isolate
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food
[Finished in 14.1s]
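The listing above is written for Python 2 and an early pandas release: it uses `print` statements, `.ix`, and `Series.order()`, all of which have since been removed. As a rough sketch of how the same steps might look in Python 3 with a current pandas, the block below runs the pipeline on three tiny records. The records in `db` are hypothetical and their values are invented for illustration; they only imitate the shape of the entries in `foods-2011-10-03.json`. Using `idxmax` with `.loc` in place of the `apply`/`xs` lambda is a substitution made here, not the book's approach.

```python
import pandas as pd

# Hypothetical records shaped like the JSON entries; values are invented.
db = [
    {"id": 1, "description": "Food A", "group": "Dairy", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 25.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 3.0},
         # deliberate duplicate row, to exercise drop_duplicates()
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 3.0},
     ]},
    {"id": 2, "description": "Food B", "group": "Dairy", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 5.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 1.0},
     ]},
    {"id": 3, "description": "Food C", "group": "Snacks", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 30.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 0.5},
     ]},
]

# One row per food, keeping only the descriptive keys.
info = pd.DataFrame(db, columns=["description", "group", "id", "manufacturer"])

# One frame per food, tagged with the food id, then stacked row-wise.
pieces = []
for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]  # scalar is broadcast to every row
    pieces.append(fnuts)
nutrients = pd.concat(pieces, ignore_index=True)

n_dupes = int(nutrients.duplicated().sum())
nutrients = nutrients.drop_duplicates()

# Both frames have 'description' and 'group', so rename before merging.
info = info.rename(columns={"description": "food", "group": "fgroup"})
nutrients = nutrients.rename(columns={"description": "nutrient", "group": "nutgroup"})

ndata = pd.merge(nutrients, info, on="id", how="outer")

# Median value per nutrient and food group (same as quantile(0.5)).
result = ndata.groupby(["nutrient", "fgroup"])["value"].median()
zinc = result["Zinc, Zn"].sort_values()  # sort_values() replaces Series.order()

# Row label of the maximum value within each (nutgroup, nutrient) group.
idx = ndata.groupby(["nutgroup", "nutrient"])["value"].idxmax()
max_foods = (
    ndata.loc[idx, ["nutgroup", "nutrient", "value", "food"]]
    .set_index(["nutgroup", "nutrient"])
)
max_foods["food"] = max_foods["food"].str[:50]  # truncate long food names

print(n_dupes)
print(zinc)
print(max_foods)
```

To reproduce the bar chart from the original listing, `zinc.plot(kind="barh")` followed by `plt.show()` should work once `matplotlib.pyplot` is imported as `plt`. On the real data set, `db` would come from `json.load` as in the listing above.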


Reposted from: https://www.cnblogs.com/virusolf/p/6231748.html

