Pandas notes
From Simson Garfinkel
Latest revision as of 06:47, 1 July 2018
Simple Manipulation
Get the data:
import pandas as pd
df = pd.read_csv(INFILE)  # read_csv accepts a path directly, so open() is not needed
Rows:
df_rows_0_to_99 = df[0:100]
Columns:
df = pd.read_csv(INFILE)
df_just_year = df['Year']                  # a single label returns a Series
df_year_and_count = df[['Year', 'Count']]  # one list of labels returns a DataFrame
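A small sketch of the difference between single and double brackets when selecting columns. The sample frame and its "Other" column are made up for illustration; the real data would come from INFILE.

```python
import pandas as pd

# Hypothetical sample data standing in for the contents of INFILE.
df = pd.DataFrame({"Year": [1999, 2000, 2001],
                   "Count": [3, 5, 8],
                   "Other": ["a", "b", "c"]})

just_year = df["Year"]                    # single label -> Series
year_frame = df[["Year"]]                 # one-element list -> DataFrame
year_and_count = df[["Year", "Count"]]    # list of labels -> DataFrame

print(type(just_year).__name__)
print(type(year_frame).__name__)
print(list(year_and_count.columns))
```

The list form is what matters for selecting several columns: the labels go inside one list, not in separate bracketed lists.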
Filtering
df.loc[df['Year'] > 1999]
This works because df['Year'] > 1999 returns a boolean Series, with True for each row that matches and False for each row that doesn't. df.loc then returns a new DataFrame containing only the rows where that Series is True.
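A minimal sketch of the mask idea, using a small made-up frame (the Year and Count values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; the real frame comes from INFILE.
df = pd.DataFrame({"Year": [1998, 1999, 2000, 2001],
                   "Count": [10, 20, 30, 40]})

# The comparison yields a boolean Series aligned with df's index.
mask = df["Year"] > 1999
print(mask.tolist())

# df.loc keeps only the rows where the mask is True.
recent = df.loc[mask]
print(recent)
```

Keeping the mask in its own variable makes it easy to inspect or to combine with other conditions using & and |.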
Print the number of rows for each year:
df.loc[df['Year'] >1000].groupby(['Year']).agg(['count'])
df.loc[df['Year']>1000].groupby(df.Year)['Year'].count()
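The two forms above return different shapes. A sketch on made-up data, which also shows groupby(...).size() as a third option that is not in the original notes:

```python
import pandas as pd

# Hypothetical sample data with repeated years.
df = pd.DataFrame({"Year": [1999, 2000, 2000, 2002, 2002, 2002],
                   "Count": [5, 7, 1, 3, 9, 2]})

filtered = df.loc[df["Year"] > 1000]

# agg(['count']) returns a DataFrame with one (column, 'count') pair
# for every non-grouping column; it counts non-null values.
per_year_frame = filtered.groupby(["Year"]).agg(["count"])

# Selecting a single column first returns a Series indexed by year.
per_year_series = filtered.groupby("Year")["Year"].count()

# size() counts rows per group regardless of missing values.
per_year_size = filtered.groupby("Year").size()

print(per_year_frame)
print(per_year_series)
print(per_year_size)
```

Note that count() skips missing values while size() does not, so the two can differ on columns that contain NaN.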
Fill in the missing years:
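The notes leave this step blank. One possible approach, offered as an assumption and not necessarily what the author had in mind, is to reindex the per-year counts over the full range of years and fill the gaps with zero. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample data with no rows for 2001.
df = pd.DataFrame({"Year": [1999, 2000, 2000, 2002, 2002, 2002]})

# Rows per year; 2001 is absent from the index.
counts = df.groupby("Year").size()

# Build every year from the first to the last, inclusive.
first_year = int(counts.index.min())
last_year = int(counts.index.max())
all_years = range(first_year, last_year + 1)

# reindex adds the missing years; fill_value=0 gives them a count of zero.
filled = counts.reindex(all_years, fill_value=0)
print(filled)
```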
Printing
pd.set_option('display.width',174)
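pd.set_option changes the setting for the rest of the session. If the wider display is only needed for one block of output, pd.option_context applies it temporarily. A short sketch:

```python
import pandas as pd

# Remember whatever width is currently configured.
width_before = pd.get_option("display.width")

# The option is changed only inside the with block.
with pd.option_context("display.width", 174):
    width_inside = pd.get_option("display.width")

# On exit the previous value is restored.
width_after = pd.get_option("display.width")
print(width_before, width_inside, width_after)
```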
Options:
- https://pandas.pydata.org/pandas-docs/stable/options.html
Memory Ideas
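Print the data frame types:
df.dtypes
Print whether the data frame columns are dense or sparse (df.ftypes was deprecated in pandas 0.25 and removed in 1.0, so this only works on older versions):
df.ftypes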
Other ideas:
df.info()
df.info(memory_usage='deep')
df.memory_usage(deep=True)
sys.getsizeof(df)  # requires: import sys
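A sketch of why deep=True matters, on a made-up frame. The Name column is forced to object dtype here as an assumption, since that is the case where the shallow and deep figures differ.

```python
import sys
import pandas as pd

# Hypothetical sample data; Name is stored as Python string objects.
df = pd.DataFrame({
    "Year": [1999, 2000, 2001],
    "Name": pd.Series(["alpha", "beta", "gamma"], dtype=object),
})

# Shallow: object columns are counted as pointers only.
shallow = df.memory_usage()

# Deep: the Python string objects themselves are measured too.
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)
print(sys.getsizeof(df))
```

Numeric columns report the same size either way; only object columns grow under deep=True.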
Convert the record_id field from an integer to a float:
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
surveys_df['record_id'].dtype
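A sketch of the conversion on a made-up surveys_df, extended with a float32 cast to show how dtype choice affects memory. The float32 step is an addition for illustration and loses precision for large values.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df.
surveys_df = pd.DataFrame({"record_id": [1, 2, 3, 4]})

# Integer to 64-bit float, as in the note above.
surveys_df["record_id"] = surveys_df["record_id"].astype("float64")
print(surveys_df["record_id"].dtype)

# A narrower float uses half the bytes per value.
as_float32 = surveys_df["record_id"].astype("float32")
bytes_float64 = surveys_df["record_id"].memory_usage(index=False)
bytes_float32 = as_float32.memory_usage(index=False)
print(bytes_float64, bytes_float32)
```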
Missing values:
any_missing_values = df.isnull().values.any()
missing_values_per_column = df.isnull().sum()
total_missing_values = df.isnull().sum().sum()
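A short sketch of these checks on made-up data. Note that a single .sum() gives one count per column, and a second .sum() is needed for the overall total.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with three missing cells.
df = pd.DataFrame({"Year": [1999, np.nan, 2001],
                   "Count": [1.0, np.nan, np.nan]})

# True if at least one cell anywhere in the frame is missing.
any_missing = bool(df.isnull().values.any())

# Missing cells per column (a Series indexed by column name).
per_column = df.isnull().sum()

# Missing cells in the whole frame.
total_missing = int(per_column.sum())

print(any_missing)
print(per_column)
print(total_missing)
```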
References:
- https://stackoverflow.com/questions/22470690/get-list-of-pandas-dataframe-columns-based-on-data-type
- http://chris.friedline.net/2015-12-15-rutgers/lessons/python2/03-data-types-and-format.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html
- https://www.dataquest.io/blog/pandas-big-data/
- https://medium.com/@jeru92/reducing-data-capacity-for-quicker-predictions-8d1210ed9536