### 1. Creating Data Structures

- **`pd.Series(data=None, index=None, dtype=None, name=None, copy=False)`**: Creates a one-dimensional labeled array.
  - `data`: array-like, dict, or scalar value.
  - `index`: list of axis labels.
  - `dtype`: data type.
  - `name`: name for the Series.
  - `copy`: copy input data.

```python
s = pd.Series([1, 3, 5, np.nan, 6, 8],
              index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
                                    '2023-01-04', '2023-01-05', '2023-01-06']))
print(s)
```

- **`pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`**: Creates a two-dimensional labeled data structure with columns of potentially different types.
  - `data`: dict of Series, dict of arrays/lists, or 2-D numpy array.
  - `index`: list of row labels.
  - `columns`: list of column labels.
  - `dtype`: data type.
  - `copy`: copy input data.

```python
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data, index=['row1', 'row2'])
print(df)
```

- **`pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, dtype=None, parse_dates=False, encoding=None, compression='infer', chunksize=None, skiprows=None, nrows=None, na_values=None)`**: Reads a comma-separated values (CSV) file into a DataFrame.
  - `filepath_or_buffer`: Path to the file or URL.
  - `sep`: Delimiter to use.
  - `header`: Row number(s) to use as the column names.
  - `names`: List of column names to use.
  - `index_col`: Column(s) to use as the row labels.
  - `dtype`: Dictionary of column names to data types.
  - `parse_dates`: Columns to parse as dates.
  - `encoding`: Encoding to use when reading the file (e.g., 'utf-8').
  - `compression`: For on-the-fly decompression of on-disk data.
  - `chunksize`: Return a `TextFileReader` object for iteration (see the chunked-reading sketch below).
  - `skiprows`: Line numbers to skip.
  - `nrows`: Number of rows to read.
  - `na_values`: Additional strings to recognize as NA/NaN.

```python
df_csv = pd.read_csv('data.csv')
```

- **`pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, parse_dates=False)`**: Reads an Excel file into a DataFrame.
  - `io`: Path to the Excel file or ExcelFile object.
  - `sheet_name`: Strings are sheet names, integers are sheet positions.
  - `usecols`: Columns to parse.

```python
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
```

- **`pd.read_json(path_or_buf, orient=None, typ='frame', dtype=None, convert_dates=True)`**: Reads a JSON file into a DataFrame.
  - `orient`: Expected JSON string format ('split', 'records', 'index', 'columns', 'values').
  - `typ`: Type of object to recover ('series', 'frame').

```python
df_json = pd.read_json('data.json')
```

- **`pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None)`**: Reads a SQL query or database table into a DataFrame.
  - `sql`: SQL query string or table name.
  - `con`: Connection object (e.g., SQLAlchemy engine).

```python
# Requires SQLAlchemy and a database connection
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///my_database.db')
# df_sql = pd.read_sql('SELECT * FROM my_table', engine)
```

- **`pd.read_html(io, match='.+', flavor=None, header=None, index_col=None, attrs=None, parse_dates=False)`**: Reads HTML tables into a list of DataFrame objects.
  - `io`: URL, file-like object, or raw string containing HTML.
  - `match`: The set of tables containing this regex will be parsed.

```python
# df_html = pd.read_html('http://www.w3schools.com/html/html_tables.asp')
```
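A minimal sketch of chunked reading with the `chunksize` parameter described in the `pd.read_csv` entry above. The file name `large_data.csv` and the `value` column are hypothetical placeholders, not files assumed to exist in this guide.

```python
import pandas as pd

# chunksize returns an iterator of DataFrames instead of one large frame
total = 0
for chunk in pd.read_csv('large_data.csv', chunksize=10_000):
    total += chunk['value'].sum()   # each chunk is an ordinary DataFrame
print(total)
```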
- **`pd.read_parquet(path, engine='auto', columns=None)`**: Reads a parquet object from the file path, returning a DataFrame.

```python
# df_parquet = pd.read_parquet('data.parquet')
```

- **`pd.read_pickle(filepath_or_buffer, compression='infer')`**: Reads a pickled pandas object (DataFrame or Series).

```python
# df_pickle = pd.read_pickle('data.pkl')
```

- **`pd.read_clipboard(sep='\s+', header='infer', index_col=None)`**: Reads text from the clipboard and passes it to `read_csv`.

```python
# df_clipboard = pd.read_clipboard()
```

- **`pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None)`**: Generates a fixed-frequency DatetimeIndex.
  - `start`, `end`: Start/end dates.
  - `periods`: Number of periods.
  - `freq`: Frequency string (e.g., 'D', 'M', 'H').

```python
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
print(dates)
```

- **`pd.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None)`**: Generates a fixed-frequency TimedeltaIndex.

```python
time_deltas = pd.timedelta_range(start='0 days', periods=3, freq='H')
print(time_deltas)
```

- **`pd.Categorical(values, categories=None, ordered=None, dtype=None)`**: Represents categorical data.
  - `values`: The values to be categorized.
  - `categories`: The unique categories.
  - `ordered`: Whether the categories have a logical order.

```python
cat_series = pd.Categorical(['A', 'B', 'A', 'C'], categories=['A', 'B', 'C'], ordered=True)
print(cat_series)
```

- **`pd.Index(data=None, dtype=None, name=None, tupleize_cols=True)`**: Immutable one-dimensional array representing axis labels.

```python
idx = pd.Index([1, 2, 3])
print(idx)
```

- **`pd.MultiIndex.from_arrays(arrays, sortorder=None, names=None)`**, **`pd.MultiIndex.from_tuples(tuples, sortorder=None, names=None)`**: Represent multiple levels of labels on a single axis.
  - `arrays`: List of arrays to be combined into a MultiIndex.
  - `tuples`: List of tuples to be combined (see the `from_tuples` sketch below).

```python
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
multi_idx = pd.MultiIndex.from_arrays(arrays, names=('level1', 'level2'))
print(multi_idx)
```
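A companion sketch for `pd.MultiIndex.from_tuples`, mirroring the `from_arrays` example above; the tuple data is invented for illustration.

```python
import pandas as pd

tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
multi_idx_t = pd.MultiIndex.from_tuples(tuples, names=('level1', 'level2'))
print(multi_idx_t)
```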
### 2. Viewing & Inspecting Data

- **`.head(n=5)`**: Returns the first `n` rows.
  - `n`: Number of rows to select.

```python
df.head(2)
```

- **`.tail(n=5)`**: Returns the last `n` rows.
  - `n`: Number of rows to select.

```python
df.tail(1)
```

- **`.shape`**: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).

```python
df.shape
```

- **`.size`**: Returns the number of elements in the DataFrame.

```python
df.size
```

- **`.ndim`**: Returns the number of dimensions (always 2 for a DataFrame, 1 for a Series).

```python
df.ndim
```

- **`.columns`**: Returns the column labels of the DataFrame.

```python
df.columns
```

- **`.index`**: Returns the row labels (index) of the DataFrame.

```python
df.index
```

- **`.dtypes`**: Returns a Series with the data type of each column.

```python
df.dtypes
```

- **`.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)`**: Prints a concise summary of a DataFrame.
  - `verbose`: Whether to print the full summary.
  - `memory_usage`: Specifies whether to print memory usage.
  - `show_counts`: Whether to show the non-null counts.

```python
df.info()
```

- **`.describe(percentiles=None, include=None, exclude=None)`**: Generates descriptive statistics of DataFrame columns.
  - `percentiles`: Percentiles to include in the output.
  - `include`, `exclude`: Data types to include/exclude.

```python
df.describe()
```

- **`.memory_usage(index=True, deep=False)`**: Returns the memory usage of each column in bytes.
  - `index`: Specifies whether to include the index memory usage.
  - `deep`: If True, introspect the data deeply to get an accurate size.

```python
df.memory_usage()
```

- **`.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)`**: Returns a random sample of items from an axis of the object.
  - `n`: Number of items to return.
  - `frac`: Fraction of axis items to return.
  - `replace`: Sample with or without replacement.
  - `random_state`: Seed for the random number generator.

```python
df.sample(n=1)
```

- **`.values`**: Returns a NumPy representation of the DataFrame (deprecated in favor of `.to_numpy()`).

```python
df.values
```

- **`.to_numpy(dtype=None, copy=False, na_value=no_default)`**: Converts the DataFrame to a NumPy array.
  - `dtype`: The data type for the output array.
  - `copy`: Whether to ensure a copy of the values is returned.

```python
df.to_numpy()
```

- **`.to_dict(orient='dict', into=dict)`**: Converts the DataFrame to a dictionary.
  - `orient`: Format of the dictionary ('dict', 'list', 'series', 'split', 'records', 'index').

```python
df.to_dict()
```

- **`.T`**: Transposes the DataFrame (swaps rows and columns).

```python
df.T
```

- **`.nunique(axis=0, dropna=True)`**: Returns the number of unique elements in the Series/DataFrame.
  - `axis`: 0 for row-wise, 1 for column-wise.
  - `dropna`: Don't include NaN in the count.

```python
df['col1'].nunique()
```

- **`.count(axis=0, level=None, numeric_only=False)`**: Counts non-NA cells for each column or row.

```python
df.count()
```

- **`.unique()`**: Returns unique values in a Series.

```python
s = pd.Series(['a', 'b', 'a', 'c'])
s.unique()
```

### 3. Selection & Indexing

- **`.loc[row_indexer, column_indexer]`**: Label-based indexing to select by label or boolean array.
  - `row_indexer`: Single label, list of labels, slice with labels, or boolean array.
  - `column_indexer`: Single label, list of labels, slice with labels, or boolean array.

```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
df.loc['y', 'A']                 # Scalar selection
df.loc[['x', 'z'], ['A', 'B']]   # List of labels
df.loc['x':'y', 'A':'B']         # Slice with labels
df.loc[df['A'] > 1, 'B']         # Boolean indexing for rows
```

- **`.iloc[row_indexer, column_indexer]`**: Position-based indexing to select by integer position.
  - `row_indexer`: Integer, list of integers, or slice with integers.
  - `column_indexer`: Integer, list of integers, or slice with integers.

```python
df.iloc[1, 0]             # Scalar selection
df.iloc[[0, 2], [0, 1]]   # List of positions
df.iloc[0:2, 0:2]         # Slice with positions
```

- **`.at[row_label, column_label]`**: Fast single-element access by label. Returns a single value.

```python
df.at['y', 'B']
```

- **`.iat[row_position, column_position]`**: Fast single-element access by integer position. Returns a single value.

```python
df.iat[1, 1]
```

- **Boolean indexing**: Selects rows/columns where a boolean condition is True (see the sketch below for combining conditions).

```python
df[df['A'] > 1]   # Select rows where column 'A' is greater than 1
```

- **`.isin(values)`**: Returns a boolean Series indicating whether each element in the DataFrame/Series is contained in `values`.
  - `values`: A set or list of values.

```python
df[df['A'].isin([1, 3])]
```
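Boolean masks can be combined with `&` (and), `|` (or), and `~` (not). A short sketch follows; it rebuilds the small `df` from the `.loc` example above so it runs on its own, and each condition needs its own parentheses because of operator precedence.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
df[(df['A'] > 1) & (df['B'] < 6)]   # rows where both conditions hold
df[~df.index.isin(['x'])]           # negate a mask with ~
```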
- **`.between(left, right, inclusive='both')`**: Returns a boolean Series equivalent to `left <= series <= right`.

```python
df[df['A'].between(1, 3)]
```

- **`.query(expr, inplace=False, **kwargs)`**: Filters rows using a boolean expression written as a string.

```python
df.query('A > 1 and B < 6')
```

### 4. Adding, Removing & Renaming

- **`df['new_col'] = value`**: Adds a new column or modifies an existing one.
  - `value`: Scalar, Series, or array.

```python
df['C'] = [7, 8, 9]     # Add new column 'C'
df['A'] = df['A'] * 10  # Modify column 'A'
```

- **`.assign(**kwargs)`**: Assign new columns to a DataFrame, returning a new object with all original columns in addition to the new ones.
  - `**kwargs`: Keyword arguments where the key is the new column name and the value is the column data.

```python
df_assigned = df.assign(D=[10, 11, 12], E=lambda x: x['A'] + x['C'])
print(df_assigned)
```

- **`.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')`**: Drop specified labels from rows or columns.
  - `labels`: Single label or list of labels to drop.
  - `axis`: 0 for rows, 1 for columns.
  - `index`, `columns`: Alternative to `labels` and `axis`.
  - `inplace`: Whether to modify the DataFrame in place.

```python
df_dropped_row = df.drop(index='x')
df_dropped_col = df.drop(columns='C')
```

- **`.pop(item)`**: Return an item and drop it from the frame.
  - `item`: Column label to pop.

```python
col_c = df.pop('C')
print(col_c)
print(df)
```

- **`.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')`**: Alter axes labels.
  - `mapper`: Dictionary-like or function for mapping labels.
  - `index`, `columns`: Alternative to `mapper` and `axis` for a specific axis.
  - `inplace`: Whether to modify the DataFrame in place.

```python
df_renamed = df.rename(columns={'A': 'Col_A', 'B': 'Col_B'}, index={'y': 'row_y'})
print(df_renamed)
```

- **`.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)`**: Set the DataFrame index using existing columns.
  - `keys`: Column label or list of column labels to set as the index.
  - `drop`: Delete the columns used to set the new index.
  - `append`: Whether to append columns to the existing index.

```python
df_indexed = df.set_index('A')
print(df_indexed)
```

- **`.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')`**: Reset the index of the DataFrame and use the default integer index.
  - `level`: Only remove the specified levels from the index.
  - `drop`: Do not insert the index as a column.

```python
df_reset = df_indexed.reset_index()
print(df_reset)
```

### 5. Sorting & Ranking

- **`.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)`**: Sort by the values along either axis.
  - `by`: Column or list of columns to sort by (a multi-column sketch follows below).
  - `axis`: 0 for rows, 1 for columns.
  - `ascending`: Sort ascending (True) or descending (False).
  - `na_position`: 'first' or 'last' for NaN values.

```python
df_sorted_val = df.sort_values(by='A', ascending=False)
print(df_sorted_val)
```

- **`.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)`**: Sort object by labels (along an axis).
  - `axis`: 0 for rows, 1 for columns.
  - `level`: If MultiIndex, sort only on the indicated level.

```python
df_sorted_idx = df.sort_index(axis=0, ascending=True)
print(df_sorted_idx)
```
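A sketch of sorting by several columns at once with per-column sort directions; `df_multi_sort` is a throwaway frame created just for this example.

```python
import pandas as pd

df_multi_sort = pd.DataFrame({'grp': ['b', 'a', 'a', 'b'], 'val': [3, 1, 2, 4]})
# 'grp' ascending, then 'val' descending within each group
df_multi_sort.sort_values(by=['grp', 'val'], ascending=[True, False])
```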
- **`.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)`**: Compute numerical data ranks (1 through n).
  - `method`: How to handle ties ('average', 'min', 'max', 'first', 'dense').
  - `na_option`: How to handle NaN values ('keep', 'top', 'bottom').
  - `pct`: Whether to return percentile ranks.

```python
df['rank_A'] = df['A'].rank(method='min')
print(df)
```

- **`.nlargest(n, columns, keep='first')`**: Return the first `n` rows ordered by `columns` in descending order.
  - `n`: Number of rows to return.
  - `columns`: Column(s) to order by.

```python
df.nlargest(2, 'A')
```

- **`.nsmallest(n, columns, keep='first')`**: Return the first `n` rows ordered by `columns` in ascending order.
  - `n`: Number of rows to return.
  - `columns`: Column(s) to order by.

```python
df.nsmallest(1, 'A')
```

### 6. Handling Missing Data

- **`.isna()` / `.isnull()`**: Returns a boolean same-sized object indicating if the values are NA.

```python
df_nan = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
df_nan.isna()
```

- **`.notna()` / `.notnull()`**: Returns a boolean same-sized object indicating if the values are not NA.

```python
df_nan.notna()
```

- **`.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`**: Removes NA values.
  - `axis`: 0 for rows, 1 for columns.
  - `how`: 'any' if any NA present, 'all' if all NA.
  - `thresh`: Require that many non-NA values.
  - `subset`: Labels along the other axis to consider.

```python
df_nan.dropna(how='any')          # Drop rows with any NaN
df_nan.dropna(axis=1, how='all')  # Drop columns where all values are NaN
```

- **`.fillna(value=None, method=None, axis=None, inplace=False, limit=None)`**: Fill NA/NaN values using the specified method.
  - `value`: Scalar or dict to use for filling.
  - `method`: 'pad'/'ffill' (forward fill), 'backfill'/'bfill' (backward fill).
  - `limit`: Maximum number of consecutive NA values to fill.

```python
df_nan.fillna(0)                        # Fill with a scalar
df_nan.fillna(method='ffill')           # Forward fill
df_nan['A'].fillna(df_nan['A'].mean())  # Fill with column mean
```

- **`.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None)`**: Interpolate missing values.
  - `method`: Interpolation method ('linear', 'time', 'slinear', 'quadratic', 'cubic', 'nearest', 'zero', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima', 'cubicspline').
  - `limit_direction`: 'forward', 'backward', 'both'.

```python
s_nan = pd.Series([1, np.nan, 3, np.nan, 5])
s_nan.interpolate()
```

- **`.replace(to_replace=None, value=no_default, inplace=False, limit=None, regex=False, method='pad')`**: Replace values given in `to_replace` with `value`.
  - `to_replace`: Value(s) to be replaced.
  - `value`: Value to replace with.
  - `regex`: Whether to interpret `to_replace` as a regular expression.

```python
df_rep = pd.DataFrame({'A': [1, 2, 1], 'B': ['x', 'y', 'x']})
df_rep.replace(1, 100)
df_rep.replace({'A': 1, 'B': 'x'}, {'A': 100, 'B': 'z'})
```

### 7. Operations & Computations

- **Arithmetic operations**: `+`, `-`, `*`, `/`, `//` (floor division), `%` (modulo), `**` (exponentiation).
  - These operations are element-wise and align by index (see the alignment sketch below).

```python
df_ops = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_ops + 10
df_ops['A'] * df_ops['B']
```
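A short sketch of the index alignment mentioned above: labels present in only one operand produce NaN, unless the method form (`.add`, `.sub`, and so on) is used with a `fill_value`. The Series here are invented for the example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)                   # 'a' and 'd' have no counterpart -> NaN
print(s1.add(s2, fill_value=0))  # treat missing labels as 0 instead
```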
- **Aggregation functions**:
  - **`.sum(axis=None, skipna=True, level=None, numeric_only=False, min_count=0)`**: Sum of values.
  - **`.mean(axis=None, skipna=True, level=None, numeric_only=False)`**: Mean of values.
  - **`.median(axis=None, skipna=True, level=None, numeric_only=False)`**: Median of values.
  - **`.mode(axis=0, numeric_only=False, dropna=True)`**: Mode(s) of values.
  - **`.std(axis=None, skipna=True, level=None, ddof=1, numeric_only=False)`**: Standard deviation.
  - **`.var(axis=None, skipna=True, level=None, ddof=1, numeric_only=False)`**: Variance.
  - **`.min(axis=None, skipna=True, level=None, numeric_only=False)`**: Minimum value.
  - **`.max(axis=None, skipna=True, level=None, numeric_only=False)`**: Maximum value.
  - **`.prod(axis=None, skipna=True, level=None, numeric_only=False, min_count=0)`**: Product of values.
  - **`.cumsum(axis=None, skipna=True)`**: Cumulative sum.
  - **`.cumprod(axis=None, skipna=True)`**: Cumulative product.

```python
df_ops.sum()
df_ops.mean(axis=1)
df_ops['A'].cumsum()
```

- **Numeric transforms**:
  - **`.abs()`**: Absolute value.
  - **`.round(decimals=0)`**: Round to the nearest integer or specified number of decimals.
  - **`.clip(lower=None, upper=None, axis=None, inplace=False)`**: Trim values at input threshold(s).

```python
s_num = pd.Series([-1.5, 2.3, -0.7])
s_num.abs()
s_num.round(0)
s_num.clip(lower=0)
```

- **`.corr(method='pearson', min_periods=1, numeric_only=False)`**: Compute pairwise correlation of columns.
  - `method`: 'pearson', 'kendall', 'spearman'.

```python
df_corr = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df_corr.corr()
```

- **`.cov(min_periods=None, ddof=1, numeric_only=False)`**: Compute pairwise covariance of columns.

```python
df_corr.cov()
```

- **`.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)`**: Apply a function along an axis of the DataFrame.
  - `func`: Function to apply.
  - `axis`: 0 for rows, 1 for columns.
  - `result_type`: 'expand', 'reduce', 'broadcast'.

```python
df_ops.apply(lambda x: x.max() - x.min())  # Apply to each column
df_ops.apply(np.sum, axis=1)               # Apply to each row
```

- **`.map(arg, na_action=None)`**: Element-wise mapping for a Series.
  - `arg`: Function, dict, or Series.
  - `na_action`: 'ignore' propagates NaN values without passing them to the mapping.

```python
s_map = pd.Series([1, 2, 3])
s_map.map({1: 'one', 2: 'two'})
```

- **`.applymap(func, na_action=None)`**: Element-wise application function for a DataFrame (deprecated in favor of `DataFrame.map()`).

```python
# df_ops.applymap(lambda x: x * 2)  # Apply to every element
df_ops.map(lambda x: x * 2)         # Recommended alternative
```

- **`.pipe(func, *args, **kwargs)`**: Apply `func(self, *args, **kwargs)`. Useful for chaining custom functions.

```python
def add_one_and_multiply(df, multiplier):
    return (df + 1) * multiplier

df_ops.pipe(add_one_and_multiply, multiplier=2)
```

### 8. String Operations (`.str`)

- **Case conversion**:
  - **`.str.lower()`**: Convert strings in Series/Index to lowercase.
  - **`.str.upper()`**: Convert strings in Series/Index to uppercase.
  - **`.str.title()`**: Convert strings in Series/Index to titlecase.
  - **`.str.capitalize()`**: Capitalize the first letter of each string.

```python
s_str = pd.Series(['apple', 'Banana', 'ORANGE'])
s_str.str.upper()
```

- **Whitespace removal**:
  - **`.str.strip(to_strip=None)`**: Remove leading and trailing whitespace.
  - **`.str.lstrip(to_strip=None)`**: Remove leading whitespace.
  - **`.str.rstrip(to_strip=None)`**: Remove trailing whitespace.

```python
s_ws = pd.Series([' hello ', 'world '])
s_ws.str.strip()
```
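String methods chain naturally because each `.str` call returns a new Series. A small cleaning sketch combining the case-conversion and whitespace methods above follows; the city names are made up for the example.

```python
import pandas as pd

s_city = pd.Series(['  New York ', 'los angeles', ' CHICAGO'])
cleaned = s_city.str.strip().str.title().str.replace(' ', '_')
print(cleaned)   # New_York, Los_Angeles, Chicago
```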
- **Replacement & search**:
  - **`.str.replace(pat, repl, n=-1, regex=False)`**: Replace occurrences of a pattern/regex.
  - **`.str.contains(pat, case=True, flags=0, na=nan, regex=True)`**: Test if a pattern or regex is contained in the string.
  - **`.str.startswith(pat, na=nan)`**: Test if the string starts with a pattern.
  - **`.str.endswith(pat, na=nan)`**: Test if the string ends with a pattern.

```python
s_str.str.replace('a', 'X')
s_str.str.contains('an')
```

- **Splitting & joining**:
  - **`.str.split(pat=None, n=-1, expand=False)`**: Split strings around the given pattern.
  - **`.str.join(sep)`**: Join strings in the Series/Index with a separator.
  - **`.str.cat(others=None, sep=None, na_rep=None, join='left')`**: Concatenate strings.

```python
s_split = pd.Series(['apple,banana', 'orange,grape'])
s_split.str.split(',')
pd.Series(['a', 'b', 'c']).str.cat(sep='-')
```

- **Length & counts**:
  - **`.str.len()`**: Compute string length.
  - **`.str.count(pat, flags=0)`**: Count occurrences of a pattern/regex.
  - **`.str.extract(pat, flags=0, expand=True)`**: Extract groups from a regular expression match.
  - **`.str.findall(pat, flags=0)`**: Find all occurrences of a pattern/regex.

```python
s_str.str.len()
s_str.str.count('a')
```

- **Type checks**:
  - **`.str.isnumeric()`**: Check if all characters in the string are numeric.
  - **`.str.isalpha()`**: Check if all characters in the string are alphabetic.
  - **`.str.isdigit()`**: Check if all characters in the string are digits.

```python
pd.Series(['123', 'abc', '1a']).str.isdigit()
```

### 9. Date & Time (`.dt`)

(Access the `.dt` accessor after converting to datetime dtype using `pd.to_datetime()`.)

- **Date components**:
  - **`.dt.year`**: Year of the datetime.
  - **`.dt.month`**: Month of the datetime.
  - **`.dt.day`**: Day of the datetime.
  - **`.dt.hour`**: Hour of the datetime.
  - **`.dt.minute`**: Minute of the datetime.
  - **`.dt.second`**: Second of the datetime.

```python
dates_s = pd.Series(pd.to_datetime(['2023-01-01 10:30', '2023-02-15 11:45']))
dates_s.dt.year
dates_s.dt.hour
```

- **Week & day components**:
  - **`.dt.week`** (deprecated, use **`.dt.isocalendar().week`**): Week number of the year.
  - **`.dt.weekday`**: Day of the week (Monday=0, Sunday=6).
  - **`.dt.dayofweek`**: Same as `weekday`.
  - **`.dt.dayofyear`**: Day of the year.

```python
dates_s.dt.dayofweek
dates_s.dt.dayofyear
```

- **Period & flags**:
  - **`.dt.weekofyear`** (deprecated, use **`.dt.isocalendar().week`**): Week number of the year.
  - **`.dt.quarter`**: Quarter of the year.
  - **`.dt.is_month_end`**: Boolean indicating if the date is the last day of the month.
  - **`.dt.is_month_start`**: Boolean indicating if the date is the first day of the month.

```python
dates_s.dt.quarter
dates_s.dt.is_month_start
```

- **Formatting & conversion**:
  - **`.dt.strftime(format)`**: Format datetime to string using `strftime` format codes.
  - **`.dt.to_period(freq=None)`**: Convert Series to PeriodIndex.
  - **`.dt.to_pydatetime()`**: Return Series of Python datetime objects.

```python
dates_s.dt.strftime('%Y-%m-%d')
dates_s.dt.to_period('M')
```

- **`pd.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None)`**: Convert argument to datetime.
  - `arg`: String, list, Series, or DataFrame to convert.
  - `format`: Explicit format codes (e.g., '%Y-%m-%d').
  - `errors`: 'ignore', 'raise', 'coerce' (see the parsing sketch at the end of this section).

```python
pd.to_datetime('2023-01-01')
pd.to_datetime(['2023-01-01', '2023-01-02'])
```

- **`pd.to_timedelta(arg, unit='ns', errors='raise')`**: Convert argument to timedelta.
  - `arg`: String, integer, float, list, Series, or DataFrame.
  - `unit`: Unit of the input.

```python
pd.to_timedelta('1 day')
pd.to_timedelta([1, 2], unit='D')
```
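A sketch combining `format=` and `errors='coerce'` from the `pd.to_datetime` entry above: unparseable values become `NaT` instead of raising. The input strings are invented for illustration.

```python
import pandas as pd

raw = pd.Series(['2023-01-15', '2023-02-30', 'not a date'])
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')  # bad rows -> NaT
print(parsed)
```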
### 10. Grouping & Aggregation

- **`.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, dropna=True)`**: Group a DataFrame using a mapper or by a Series of columns.
  - `by`: Mapping, function, label, or list of labels for grouping.
  - `axis`: 0 for rows, 1 for columns.
  - `level`: If MultiIndex, group by a specific level.
  - `as_index`: Whether to return grouped keys as the index.

```python
df_grp = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                       'data1': [1, 2, 3, 4],
                       'data2': [5, 6, 7, 8]})
grouped = df_grp.groupby('key')
grouped.sum()
```

- **Aggregation methods after `groupby()`**:
  - **`.agg(func, *args, **kwargs)`**: Apply one or more aggregation functions.
    - `func`: Function, string function name, or list/dict of functions.
    - Can pass a dictionary to apply different functions to different columns (a named-aggregation sketch follows at the end of this section).

```python
grouped.agg({'data1': 'sum', 'data2': ['mean', 'min']})
```

  - **`.transform(func, *args, **kwargs)`**: Apply a function to each group and return a Series/DataFrame with the same index as the original.
    - `func`: Function or string function name.

```python
df_grp['data1_mean_per_key'] = grouped['data1'].transform('mean')
print(df_grp)
```

  - **`.filter(func, dropna=True, *args, **kwargs)`**: Filter data based on group properties.
    - `func`: Function that takes a DataFrame and returns a boolean.

```python
grouped.filter(lambda x: x['data1'].sum() > 4)
```

- **Common aggregation functions** (can be used directly on a Series/DataFrame or after `groupby()`):
  - **`.size()`**: Number of items in each group (after `groupby()`).
  - **`.count()`**: Count of non-NA values.
  - **`.mean()`**: Mean of values.
  - **`.sum()`**: Sum of values.
  - **`.median()`**: Median of values.
  - **`.first()`**: First non-NA value.
  - **`.last()`**: Last non-NA value.

```python
grouped.size()
grouped['data1'].mean()
```

- **`.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)`**: Create a spreadsheet-style pivot table as a DataFrame.
  - `values`: Column to aggregate.
  - `index`: Column(s) to group by on the rows.
  - `columns`: Column(s) to group by on the columns.
  - `aggfunc`: Function or list of functions for aggregation.

```python
df_pt = pd.DataFrame({'A': ['one', 'one', 'two', 'three'],
                      'B': ['x', 'y', 'x', 'y'],
                      'C': [1, 2, 3, 4]})
pd.pivot_table(df_pt, values='C', index='A', columns='B', aggfunc='sum')
```

- **`pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)`**: Compute a frequency table of two (or more) factors.
  - `index`, `columns`: Array-like, factors to group by.
  - `values`: Array-like, optional, values to aggregate.
  - `aggfunc`: Function to aggregate.

```python
pd.crosstab(df_pt['A'], df_pt['B'])
```
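Besides the dict form shown in the `.agg` entry above, `.agg` also accepts named aggregation, where output column names go on the left and `(input column, function)` tuples on the right. A sketch reusing the same `df_grp` follows.

```python
import pandas as pd

df_grp = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                       'data1': [1, 2, 3, 4],
                       'data2': [5, 6, 7, 8]})
# Output column names on the left, (input column, aggregation) on the right
df_grp.groupby('key').agg(data1_total=('data1', 'sum'),
                          data2_avg=('data2', 'mean'))
```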
### 11. Combining DataFrames

- **`pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)`**: Concatenate pandas objects along a particular axis.
  - `objs`: A sequence or mapping of Series or DataFrame objects.
  - `axis`: 0 to stack row-wise, 1 to stack column-wise.
  - `join`: 'outer' for the union of indexes, 'inner' for the intersection.
  - `ignore_index`: If True, do not use the index values along the concatenation axis.
  - `keys`: Construct a hierarchical index for the concatenated frame.

```python
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[2, 3])
pd.concat([df1, df2])           # Stack rows

df3 = pd.DataFrame({'B': [5, 6]}, index=[0, 1])
pd.concat([df1, df3], axis=1)   # Stack columns
```

- **`.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), indicator=False, validate=None)`**: Merge DataFrame or named Series objects with a database-style join.
  - `right`: DataFrame to merge with.
  - `how`: Type of merge ('left', 'right', 'outer', 'inner', 'cross').
  - `on`: Column or list of column names to join on.
  - `left_on`, `right_on`: Column(s) in the left/right DataFrame to join on.
  - `left_index`, `right_index`: Use the index from the left/right DataFrame as the join key.

```python
df_left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
df_right = pd.DataFrame({'key': ['K0', 'K3'], 'B': [3, 4]})
df_left.merge(df_right, on='key', how='inner')
df_left.merge(df_right, on='key', how='outer')
```

- **`.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)`**: Join columns with another DataFrame or Series on the index. Similar to merge, but defaults to joining on the index.
  - `other`: DataFrame or Series to join with.
  - `on`: Column name or list of column names in the caller to look for in `other`'s index.
  - `how`: Type of join.
  - `lsuffix`, `rsuffix`: Suffixes to apply to overlapping column names.

```python
df_left.set_index('key').join(df_right.set_index('key'), how='inner')
```

- **`.append(other, ignore_index=False, verify_integrity=False, sort=False)`**: Append rows of `other` to the end of the caller, returning a new object. (Deprecated since v1.4.0; use `pd.concat()`.)

```python
# df1.append(df2, ignore_index=True)  # Use pd.concat([df1, df2], ignore_index=True) instead
```

- **`.combine_first(other)`**: Combine two DataFrame objects by filling null values in one with matching values from the other.
  - `other`: DataFrame or Series.

```python
df_cf1 = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 2]})
df_cf2 = pd.DataFrame({'A': [10, 11], 'B': [20, 21]})
df_cf1.combine_first(df_cf2)
```

### 12. Reshaping & Pivoting

- **`.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)`**: Unpivot a DataFrame from wide to long format.
  - `id_vars`: Column(s) to use as identifier variables.
  - `value_vars`: Column(s) to unpivot. If None, uses all columns not in `id_vars`.
  - `var_name`: Name for the new 'variable' column.
  - `value_name`: Name for the new 'value' column.

```python
df_wide = pd.DataFrame({'A': {0: 'a', 1: 'b'}, 'B': {0: 1, 1: 3}, 'C': {0: 2, 1: 4}})
pd.melt(df_wide, id_vars=['A'], value_vars=['B', 'C'])
```

- **`.pivot(index=None, columns=None, values=None)`**: Reshape data based on column values.
  - `index`: Column to make the new DataFrame's index.
  - `columns`: Column to make the new DataFrame's columns.
  - `values`: Column(s) to fill with values.

```python
df_long = pd.DataFrame({'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
                        'item': ['A', 'B', 'A'],
                        'value': [10, 20, 15]})
df_long.pivot(index='date', columns='item', values='value')
```

- **`.pivot_table()`**: (See Grouping & Aggregation.) Can also be used for reshaping (see the sketch below).
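A sketch of the practical difference between the two: `.pivot()` raises a ValueError on duplicate index/column pairs, while `pivot_table` aggregates them. The frame `df_dup` is invented for the example.

```python
import pandas as pd

df_dup = pd.DataFrame({'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
                       'item': ['A', 'A', 'B'],
                       'value': [10, 20, 15]})
# df_dup.pivot(index='date', columns='item', values='value') would raise a ValueError
# because ('2023-01-01', 'A') appears twice; pivot_table aggregates the duplicates instead.
pd.pivot_table(df_dup, values='value', index='date', columns='item', aggfunc='sum')
```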
- **`.stack(level=-1, dropna=True, future_stack=False)`**: Stack the prescribed level(s) from the columns to the index.
  - `level`: Level(s) to stack.
  - `dropna`: Whether to drop rows with all NA values.

```python
df_stack = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['col1', 'col2'])
df_stack.stack()
```

- **`.unstack(level=-1, fill_value=None)`**: Pivot a level of the (row) index to obtain a (wider) DataFrame.
  - `level`: Level(s) to unstack.
  - `fill_value`: Value to replace NaN with.

```python
stacked = df_stack.stack()
stacked.unstack()
```

- **`.explode(column, ignore_index=False)`**: Transform each element of a list-like entry to a row, replicating index values.
  - `column`: Column to explode.

```python
df_explode = pd.DataFrame({'A': [1, 2], 'B': [[10, 20], [30, 40, 50]]})
df_explode.explode('B')
```

- **`.transpose(*args, copy=False)` or `.T`**: Transpose index and columns.

```python
df_stack.T
```

### 13. Window & Rolling Operations

- **`.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')`**: Provides rolling window calculations.
  - `window`: Size of the moving window.
  - `min_periods`: Minimum number of observations in the window required to have a value.
  - `center`: Set the labels at the center of the window.
  - `on`: For a DataFrame, the column on which to calculate the rolling window, rather than the index (a time-based example follows at the end of this section).

```python
s_roll = pd.Series([1, 2, 3, 4, 5])
s_roll.rolling(window=2).mean()
```

- **`.expanding(min_periods=1, center=False, axis=0, method='single')`**: Provides expanding window calculations (cumulative).
  - `min_periods`: Minimum number of observations in the window required to have a value.

```python
s_roll.expanding(min_periods=1).sum()
```

- **`.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None)`**: Provides exponential weighted moving (EWM) functions.
  - `com`, `span`, `halflife`, `alpha`: Parameters that define the decay.

```python
s_roll.ewm(span=2).mean()
```

- **`.shift(periods=1, freq=None, axis=0, fill_value=no_default)`**: Shift the index by the desired number of periods.
  - `periods`: Number of periods to shift.
  - `freq`: Offset to use from the `DateOffset` class.

```python
s_roll.shift(1)
```

- **`.diff(periods=1, axis=0)`**: First discrete difference of elements.
  - `periods`: Periods to shift for calculating the difference.

```python
s_roll.diff(1)
```

- **`.rank()`**: (See Sorting & Ranking.) Can be used in window operations as well.

```python
# s_roll.rolling(window=3).apply(lambda x: pd.Series(x).rank().iloc[-1])  # Ranking within a window
```
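When `on` points at a monotonic datetime column, the `window` argument can be an offset string, giving a time-based rather than count-based window. A sketch with an assumed `df_ts` frame follows.

```python
import pandas as pd

df_ts = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=5, freq='D'),
                      'value': [1, 2, 3, 4, 5]})
# '2D' = a two-calendar-day window anchored on the 'date' column
df_ts.rolling('2D', on='date').sum()
```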
### 14. Visualization (Built-in)

- **`.plot(x=None, y=None, kind='line', ax=None, subplots=False, sharex=None, sharey=False, layout=None, figsize=None, use_index=True, title=None, grid=None, legend=True, style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, colormap=None, position=0.5, table=False, yerr=None, xerr=None, secondary_y=False, mark_right=True, backend=None, **kwargs)`**: Make plots of a Series or DataFrame.
  - `kind`: 'line', 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area', 'pie', 'scatter', 'hexbin'.
  - `x`, `y`: Labels or positions for plotting.
  - `title`: Title for the plot.
  - `figsize`: Tuple (width, height) in inches.

```python
df_plot = pd.DataFrame(np.random.rand(10, 2), columns=['A', 'B'])
df_plot.plot(kind='line', title='Line Plot')
df_plot.plot(kind='bar', figsize=(8, 4))
```

- **`.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)`**: Draw a histogram of the DataFrame's series.
  - `column`: Column(s) to plot.
  - `bins`: Number of histogram bins.

```python
df_plot['A'].hist(bins=5)
```

- **`.boxplot(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type='axes', backend=None, **kwargs)`**: Make a box plot from DataFrame columns.

```python
df_plot.boxplot()
```

- **`.plot.kde(bw_method=None, ind=None, **kwargs)`**: Generate a Kernel Density Estimate plot using Gaussian kernels (requires SciPy).

```python
df_plot['A'].plot.kde()
```

### 15. Performance & Advanced

- **`.astype(dtype, copy=True, errors='raise')`**: Cast a pandas object to a specified dtype.
  - `dtype`: Data type to cast to.
  - `errors`: 'raise' or 'ignore'.

```python
df_types = pd.DataFrame({'A': [1, 2], 'B': ['3', '4']})
df_types['B'] = df_types['B'].astype(int)
print(df_types.dtypes)
```

- **`.copy(deep=True)`**: Make a copy of this object's indices and data.
  - `deep`: If True, a new object and new data are created.

```python
df_copy = df.copy(deep=True)
```

- **`.memory_usage(index=True, deep=False)`**: (See Viewing & Inspecting Data.) Used for memory optimization (a memory-optimization sketch follows at the end of this section).

- **`.info(memory_usage='deep')`**: Used to get detailed memory usage, especially for object dtypes.
  - `memory_usage='deep'`: Introspect the data deeply to get an accurate size.

```python
df.info(memory_usage='deep')
```

- **`.query(expr, inplace=False, **kwargs)`**: (See Selection & Indexing.) Efficient filtering, especially for large DataFrames.

- **`.eval(expr, inplace=False, **kwargs)`**: Evaluate a string describing operations on DataFrame columns. Faster than standard Python evaluation for complex expressions.
  - `expr`: The expression to evaluate.

```python
df_eval = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_eval.eval('C = A + B')
```

- **`pd.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)`**: Encode the object as an enumerated type or categorical variable.
  - `values`: Sequence.
  - `sort`: Sort `uniques` and factorize values according to the sorted `uniques`.

```python
labels, uniques = pd.factorize(['a', 'b', 'a', 'c'])
print(labels, uniques)
```

- **`pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)`**: Convert a categorical variable into dummy/indicator variables (one-hot encoding).
  - `data`: Series or DataFrame.
  - `prefix`: String to append to column names.
  - `dummy_na`: Add a column for NaN values.
  - `drop_first`: Whether to drop the first category in each column.

```python
s_ohe = pd.Series(list('abcabc'))
pd.get_dummies(s_ohe, prefix='cat')
```
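A memory-optimization sketch tying together `.astype` and `.memory_usage(deep=True)` from this section: converting a low-cardinality object (string) column to the 'category' dtype typically shrinks it considerably. `df_mem` is a synthetic frame built for the comparison.

```python
import pandas as pd

df_mem = pd.DataFrame({'city': ['New York', 'Los Angeles'] * 5000})
print(df_mem.memory_usage(deep=True))  # object strings are expensive
df_mem['city'] = df_mem['city'].astype('category')
print(df_mem.memory_usage(deep=True))  # category codes are much smaller
```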