Difference between revisions of "Pandas notes"

Latest revision as of 07:47, 1 July 2018

Simple Manipulation

Get the data:

   import pandas as pd
   df = pd.read_csv(INFILE)  # INFILE is the path to the CSV file

Rows:

   df_rows_0_to_99 = df[0:100]

Columns:

   df = pd.read_csv(INFILE)
   df_just_year = df['Year']                  # a single label returns a Series
   df_year_and_count = df[['Year', 'Count']]  # a list of labels returns a DataFrame
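
A minimal sketch of the two selection forms, using a small made-up frame in place of INFILE (the values and the extra Name column are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV file.
df = pd.DataFrame({
    "Year": [1998, 1999, 2000],
    "Count": [3, 5, 8],
    "Name": ["a", "b", "c"],
})

# A single label returns a Series.
df_just_year = df["Year"]

# A list of labels (note the double brackets) returns a DataFrame.
df_year_and_count = df[["Year", "Count"]]

print(type(df_just_year).__name__)      # Series
print(list(df_year_and_count.columns))  # ['Year', 'Count']
```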

Filtering

   df.loc[df['Year'] > 1999]

This works because df['Year'] > 1999 returns a boolean Series: True for the rows that match and False for those that don't. df.loc then returns a new DataFrame containing only the rows where that Series is True.
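
The two steps can be seen separately in a small sketch (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Year": [1995, 2001, 1999, 2010], "Count": [1, 2, 3, 4]})

# Step 1: the comparison produces a boolean Series, one value per row.
mask = df["Year"] > 1999
print(mask.tolist())            # [False, True, False, True]

# Step 2: df.loc keeps only the rows where the mask is True.
recent = df.loc[mask]
print(recent["Year"].tolist())  # [2001, 2010]
```

The original row labels are kept, so recent has index values 1 and 3 rather than 0 and 1.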

Print the number for each year:

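   # count of non-null values in every other column, per year
   df.loc[df['Year'] > 1000].groupby(['Year']).agg(['count'])

   # a single Series: the number of rows for each year
   df.loc[df['Year'] > 1000].groupby(df.Year)['Year'].count()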
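
The two forms do not return the same thing. A small sketch on made-up data (the Count column with a missing value is an assumption chosen to show the difference):

```python
import pandas as pd

# Hypothetical sample data; one Count value is missing.
df = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2003],
    "Count": [1.0, None, 3.0, 4.0],
})

# agg(['count']) counts non-null values in every remaining column,
# so the missing Count in 2000 is not counted.
per_column = df.groupby(["Year"]).agg(["count"])
print(per_column[("Count", "count")].tolist())  # [1, 1, 1]

# Counting the Year column itself gives the number of rows per year.
per_year = df.groupby(df.Year)["Year"].count()
print(per_year.tolist())                        # [2, 1, 1]

# size() is another way to get rows per group, regardless of missing values.
rows_per_year = df.groupby("Year").size()
print(rows_per_year.tolist())                   # [2, 1, 1]
```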

Fill in the missing years:
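
The notes leave this step blank. One possible approach (an assumption, not necessarily what was intended) is to reindex the per-year counts over the full range of years, filling the gaps with 0. The sample data is made up:

```python
import pandas as pd

# Hypothetical sample data: no rows for 2002.
df = pd.DataFrame({"Year": [2000, 2000, 2001, 2003]})

per_year = df.groupby("Year").size()

# Every year from the first to the last, inclusive.
all_years = list(range(int(per_year.index.min()), int(per_year.index.max()) + 1))

# Years with no rows get a count of 0.
filled = per_year.reindex(all_years, fill_value=0)
print(filled.tolist())  # [2, 1, 0, 1]
```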

Printing

   pd.set_option('display.width',174)

Options:

https://pandas.pydata.org/pandas-docs/stable/options.html
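
A short sketch of reading, setting, and temporarily overriding display options. The choice of display.max_rows for the temporary override is only an example:

```python
import pandas as pd

# Set an option globally and read it back.
pd.set_option("display.width", 174)
width_now = pd.get_option("display.width")
print(width_now)  # 174

# option_context changes an option only inside the with block.
before = pd.get_option("display.max_rows")
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")

# Restore the default width.
pd.reset_option("display.width")
```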

Memory Ideas

Print the dtype of each column in the data frame:

   df.dtypes

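Print whether the data frame columns are dense or sparse (df.ftypes was deprecated in pandas 0.25 and removed in 1.0; on later versions df.dtypes reports sparse dtypes):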

   df.ftypes

Other ideas:

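   import sys
   df.info()                     # dtypes, non-null counts, approximate memory
   df.info(memory_usage='deep')  # also measures the contents of object columns
   df.memory_usage(deep=True)    # bytes used per column
   sys.getsizeof(df)             # size of the whole frame in bytes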
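
A sketch of why deep=True matters, on a made-up frame. The Name column is forced to object dtype so the example behaves the same across pandas versions, and the conversion to category is shown as one possible way to reduce memory, not something the notes prescribe:

```python
import pandas as pd

# Hypothetical data: 1000 rows, with a repetitive string column.
df = pd.DataFrame({
    "Year": [2000, 2001] * 500,
    "Name": pd.Series(["alpha", "beta"] * 500, dtype=object),
})

# Without deep=True, object columns are measured as pointers only.
shallow = df.memory_usage()
# With deep=True, the Python string objects are measured too.
deep = df.memory_usage(deep=True)
print(shallow["Name"], deep["Name"])

# Low-cardinality strings usually take less space as a category.
as_category = df["Name"].astype("category")
print(as_category.memory_usage(deep=True))
```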

Convert the record_id field from an integer to a float:

   surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
   surveys_df['record_id'].dtype

Missing values:

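   any_missing_values = df.isnull().values.any()   # True if any cell is missing
   missing_per_column = df.isnull().sum()          # missing values in each column
   total_missing_values = df.isnull().sum().sum()  # missing values in the whole frame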
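
A small sketch of the three checks on made-up data, showing that a single sum() gives per-column counts and a second sum() gives the overall total:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with three missing cells.
df = pd.DataFrame({
    "Year": [2000, np.nan, 2002],
    "Count": [1.0, np.nan, np.nan],
})

any_missing = bool(df.isnull().values.any())
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(any_missing)                   # True
print(missing_per_column.to_dict())  # {'Year': 1, 'Count': 2}
print(total_missing)                 # 3
```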

References:
