http://www.cnblogs.com/batteryhp/p/5046433.html
5. Example: the USDA food database
Below is a concrete example; the examples are the most important part of the book.
#-*- encoding: utf-8 -*-
# Note: written for Python 2 and an early pandas release (print statements, .ix, Series.order())
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import re
import json

# Load the data file (30 MB+)
db = json.load(open('E:\\foods-2011-10-03.json'))
#print len(db)
#print type(db)
# db is a list; each entry is a dict holding all the data for one food
#print db[0]   # this record is very long
#print db[0].keys()
# 'nutrients' is one of the keys; its value is a (long) list of dicts describing the food's nutrients
#print db[0]['nutrients'][0]

# Turn the nutrients of one food into a DataFrame
nutrients = DataFrame(db[0]['nutrients'])   # a list of dicts converts directly to a DataFrame
#print nutrients.head()
#print type(db[0]['nutrients'])
info_keys = ['description','group','id','manufacturer']
info = DataFrame(db,columns = info_keys)
#print info
# Look at the distribution of food groups
#print pd.value_counts(info.group)

# To analyze all the nutrient data, gather every food's nutrients into one big table, in a few steps
nutrients = []
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']   # the scalar id is broadcast to every row
    nutrients.append(fnuts)
nutrients = pd.concat(nutrients,ignore_index = True)   # concatenate the list row-wise, like rbind in R

# Remove duplicates; this is an important data-cleaning step
print nutrients.duplicated().sum()
nutrients = nutrients.drop_duplicates()

# nutrients and info share column names, so rename the columns of info
# Note the dict-based renaming below
col_mapping = {'description':'food', 'group':'fgroup'}
# rename returns a new object; copy = False avoids copying the underlying data
info = info.rename(columns = col_mapping,copy = False)
#print info.columns   # check the column names
col_mapping = {'description':'nutrient','group':'nutgroup'}
nutrients = nutrients.rename(columns = col_mapping,copy = False)
#print nutrients.columns

# With that done, merge the two DataFrames
print nutrients.ix[:10,:]
#print info.id
ndata = pd.merge(nutrients,info,on = 'id',how = 'outer')
print ndata
print ndata.ix[3000]

# The approach below is neat: median value per nutrient and food group
result = ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
print result
result['Zinc, Zn'].order().plot(kind = 'barh')
plt.show()

# With a little thought (the author says this more than once...), you can find the food richest in each nutrient
by_nutrient = ndata.groupby(['nutgroup','nutrient'])
print by_nutrient.head()
# Note how the row holding the maximum value is extracted
get_maximum = lambda x:x.xs(x.value.idxmax())
get_minimum = lambda x:x.xs(x.value.idxmin())
max_foods = by_nutrient.apply(get_maximum)[['value','food']]
# Truncate the food names a little
max_foods.food = max_foods.food.str[:50]
print max_foods.head()
print max_foods.ix['Amino Acids']['food']

>>>
14179
                       nutrient     nutgroup units    value    id
0                       Protein  Composition     g    25.18  1008
1             Total lipid (fat)  Composition     g    29.20  1008
2   Carbohydrate, by difference  Composition     g     3.06  1008
3                           Ash        Other     g     3.28  1008
4                        Energy       Energy  kcal   376.00  1008
5                         Water  Composition     g    39.28  1008
6                        Energy       Energy    kJ  1573.00  1008
7          Fiber, total dietary  Composition     g     0.00  1008
8                   Calcium, Ca     Elements    mg   673.00  1008
9                      Iron, Fe     Elements    mg     0.64  1008
10                Magnesium, Mg     Elements    mg    22.00  1008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns:
nutrient        375176  non-null values
nutgroup        375176  non-null values
units           375176  non-null values
value           375176  non-null values
id              375176  non-null values
food            375176  non-null values
fgroup          375176  non-null values
manufacturer    293054  non-null values
dtypes: float64(1), int64(1), object(6)
nutrient                 Glycine
nutgroup             Amino Acids
units                          g
value                      0.073
id                          1077
food            Spearmint, fresh
fgroup          Spices and Herbs
manufacturer
Name: 3000
nutrient          fgroup
Adjusted Protein  Sweets                               12.900
                  Vegetables and Vegetable Products     2.180
Alanine           Baby Foods                            0.085
                  Baked Products                        0.248
                  Beef Products                         1.550
                  Beverages                             0.003
                  Breakfast Cereals                     0.311
                  Cereal Grains and Pasta               0.373
                  Dairy and Egg Products                0.271
                  Ethnic Foods                          1.290
                  Fast Foods                            0.514
                  Fats and Oils                         0.000
                  Finfish and Shellfish Products        1.218
                  Fruits and Fruit Juices               0.027
                  Lamb, Veal, and Game Products         1.408
...
Zinc, Zn          Finfish and Shellfish Products        0.67
                  Fruits and Fruit Juices               0.10
                  Lamb, Veal, and Game Products         3.94
                  Legumes and Legume Products           1.14
                  Meals, Entrees, and Sidedishes        0.63
                  Nut and Seed Products                 3.29
                  Pork Products                         2.32
                  Poultry Products                      2.50
                  Restaurant Foods                      0.80
                  Sausages and Luncheon Meats           2.13
                  Snacks                                1.47
                  Soups, Sauces, and Gravies            0.20
                  Spices and Herbs                      2.75
                  Sweets                                0.36
                  Vegetables and Vegetable Products     0.33
Length: 2246
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 467 entries, (u'Amino Acids', u'Alanine', 48) to (u'Vitamins', u'Vitamin K (phylloquinone)', 395)
Data columns:
nutrient        467  non-null values
nutgroup        467  non-null values
units           467  non-null values
value           467  non-null values
id              467  non-null values
food            467  non-null values
fgroup          467  non-null values
manufacturer    444  non-null values
dtypes: float64(1), int64(1), object(6)
                            value                                          food
nutgroup    nutrient
Amino Acids Alanine         8.009             Gelatins, dry powder, unsweetened
            Arginine        7.436                  Seeds, sesame flour, low-fat
            Aspartic acid  10.203                           Soy protein isolate
            Cystine         1.307  Seeds, cottonseed flour, low fat (glandless)
            Glutamic acid  17.452                           Soy protein isolate
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food
[Finished in 14.1s]

Reposted from: https://www.cnblogs.com/virusolf/p/6231748.html
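The listing above targets Python 2 and an early pandas release. As a rough sketch only, the same pipeline (concatenate, de-duplicate, rename, merge, group) might look like this on a current pandas version. The three-food `db` list is a made-up stand-in for the real JSON file, and the names `pieces`, `n_dupes`, `median_by_group` and `zinc` are hypothetical. `.loc` replaces the removed `.ix`, `sort_values()` replaces the removed `Series.order()`, and `median()` is used in place of `quantile(0.5)`.

```python
import pandas as pd

# Hypothetical miniature dataset with the same shape as the USDA JSON records
# (values are invented for illustration, not taken from the real file).
def nut(name, group, units, value):
    return {"description": name, "group": group, "units": units, "value": value}

db = [
    {"id": 1, "description": "Food A", "group": "Dairy", "manufacturer": "",
     "nutrients": [nut("Protein", "Composition", "g", 25.0),
                   nut("Zinc, Zn", "Elements", "mg", 3.0),
                   nut("Zinc, Zn", "Elements", "mg", 3.0)]},  # deliberate duplicate
    {"id": 2, "description": "Food B", "group": "Dairy", "manufacturer": "",
     "nutrients": [nut("Protein", "Composition", "g", 10.0),
                   nut("Zinc, Zn", "Elements", "mg", 4.0)]},
    {"id": 3, "description": "Food C", "group": "Snacks", "manufacturer": "",
     "nutrients": [nut("Protein", "Composition", "g", 5.0),
                   nut("Zinc, Zn", "Elements", "mg", 0.5)]},
]

# Per-food metadata, renamed so it does not clash with the nutrient columns.
info = pd.DataFrame(db, columns=["description", "group", "id", "manufacturer"])
info = info.rename(columns={"description": "food", "group": "fgroup"})

# One DataFrame per food, tagged with the food id, then stacked row-wise.
pieces = []
for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]  # scalar is broadcast to every row
    pieces.append(fnuts)
nutrients = pd.concat(pieces, ignore_index=True)

# Count and drop exact duplicate rows.
n_dupes = int(nutrients.duplicated().sum())
nutrients = nutrients.drop_duplicates()
nutrients = nutrients.rename(columns={"description": "nutrient", "group": "nutgroup"})

# Join nutrient rows with food metadata on the shared id column.
ndata = pd.merge(nutrients, info, on="id", how="outer")

# Median value per (nutrient, food group); sort_values() replaces order().
median_by_group = ndata.groupby(["nutrient", "fgroup"])["value"].median()
zinc = median_by_group["Zinc, Zn"].sort_values()

# Food with the largest value for each nutrient: idxmax gives the row labels,
# and .loc (instead of .ix) selects those rows.
idx = ndata.groupby(["nutgroup", "nutrient"])["value"].idxmax()
max_foods = (ndata.loc[idx, ["nutgroup", "nutrient", "value", "food"]]
             .set_index(["nutgroup", "nutrient"]))

print(zinc)
print(max_foods)
```

Using `idxmax` followed by `.loc` avoids the per-group `apply` with a lambda; it is one possible alternative, not the approach taken in the original listing.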
