## 1. Python Basics for Data

**Variables & Types:**
- Numbers: `int`, `float`
- Text: `str`
- Boolean: `bool` (`True`, `False`)

**Data Structures:**
- List: ordered, mutable. `my_list = [1, 2, 'a']`
- Tuple: ordered, immutable. `my_tuple = (1, 2, 'a')`
- Dictionary: mutable key-value pairs; preserves insertion order (Python 3.7+). `my_dict = {'a': 1, 'b': 2}`
- Set: unordered, mutable, unique elements. `my_set = {1, 2, 2}` results in `{1, 2}`

**Control Flow:**
- `if`/`elif`/`else`
- `for` loops: `for item in collection:`
- `while` loops: `while condition:`

**Functions:**
- Define: `def my_func(arg1, arg2): return arg1 + arg2`
- Lambda: `add = lambda x, y: x + y`

## 2. NumPy (Numerical Python)

Import: `import numpy as np`

**NumPy Arrays:** efficient multi-dimensional arrays.
- Create: `arr = np.array([1, 2, 3])`
- Shape: `arr.shape`; data type: `arr.dtype`
- Zeros/ones: `np.zeros((2, 3))`, `np.ones((2, 3))`
- Range: `np.arange(0, 10, 2)`
- Reshape: `arr.reshape((rows, cols))`

**Array Operations:** element-wise by default.
- Addition: `arr1 + arr2`; multiplication: `arr1 * arr2`
- Dot product: `np.dot(arr1, arr2)` or `arr1 @ arr2`
- Broadcasting: the smaller array "stretches" to match the larger one's shape.

**Indexing & Slicing:**
- `arr[0]`, `arr[0, 1]` (for 2D)
- Slicing: `arr[start:end:step]`
- Boolean indexing: `arr[arr > 5]`

**Aggregations:**
- `np.sum(arr)`, `arr.sum(axis=0)` (column sums)
- `np.mean(arr)`, `np.std(arr)`, `np.min(arr)`, `np.max(arr)`

## 3. Pandas (Data Manipulation)

Import: `import pandas as pd`

**Series:** 1D labeled array.
- Create: `s = pd.Series([1, 3, 5, np.nan, 6, 8])`
- Index/values: `s.index`, `s.values`

**DataFrame:** 2D labeled table (rows & columns).
- Create from dict: `df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})`
- From NumPy: `df = pd.DataFrame(np.random.randn(6, 4))`
- Read CSV: `df = pd.read_csv('data.csv')`
- Write CSV: `df.to_csv('output.csv', index=False)`

**Viewing Data:**
- `df.head(n)`, `df.tail(n)`
- `df.info()`, `df.describe()` (summary statistics)
- `df.columns`, `df.index`, `df.values`

**Selection & Indexing:**
- Column: `df['col_name']` or `df.col_name`
- Multiple columns: `df[['col1', 'col2']]`
- Row by label: `df.loc[label]`, `df.loc[row_label, col_label]`
- Row by position: `df.iloc[idx]`, `df.iloc[row_idx, col_idx]`
- Boolean indexing: `df[df['col'] > 5]`

**Missing Data:**
- Check: `df.isnull()`, `df.notnull()`
- Count: `df.isnull().sum()`
- Drop: `df.dropna()` (rows), `df.dropna(axis=1)` (columns)
- Fill: `df.fillna(value)`; forward fill: `df.ffill()` (the older `df.fillna(method='ffill')` is deprecated)

**Operations:**
- Apply function: `df['col'].apply(lambda x: x * 2)`
- Unique values: `df['col'].unique()`, `df['col'].nunique()`
- Value counts: `df['col'].value_counts()`
- Sorting: `df.sort_values(by='col', ascending=False)`

**Grouping:**
- `df.groupby('col_name').agg({'col2': 'mean', 'col3': 'sum'})`
- Aggregation functions: `mean()`, `sum()`, `count()`, `min()`, `max()`, `std()`

**Merging/Joining:**
- `pd.merge(df1, df2, on='key_col', how='inner')`
- `how` options: `'inner'`, `'outer'`, `'left'`, `'right'`
- Concatenation: `pd.concat([df1, df2], axis=0)` (rows), `axis=1` (columns)
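Several of the NumPy and pandas operations above can be combined into one short, runnable sketch. The column names (`team`, `score`) and values here are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# NumPy: element-wise ops, boolean indexing, aggregation
arr = np.arange(0, 10, 2)       # array([0, 2, 4, 6, 8])
doubled = arr * 2               # element-wise: array([0, 4, 8, 12, 16])
big = arr[arr > 3]              # boolean indexing: array([4, 6, 8])
total = arr.sum()               # 20

# pandas: DataFrame creation, missing data, groupby, merge
df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b'],
    'score': [1.0, np.nan, 3.0, 5.0],
})
# Fill the NaN with the column mean (mean() skips NaN: (1+3+5)/3 = 3.0)
df['score'] = df['score'].fillna(df['score'].mean())
means = df.groupby('team')['score'].mean()   # per-team average

# Left join a lookup table onto the scores
labels = pd.DataFrame({'team': ['a', 'b'], 'name': ['Alpha', 'Beta']})
merged = pd.merge(df, labels, on='team', how='left')
```

After the merge, `merged` has the original four rows plus a `name` column; swapping `how='left'` for `'inner'`, `'outer'`, or `'right'` changes which keys survive the join.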
## 4. Matplotlib & Seaborn (Visualization)

Import: `import matplotlib.pyplot as plt`, `import seaborn as sns`

**Basic Plotting (Matplotlib):**
- Line plot: `plt.plot(x, y)`
- Scatter plot: `plt.scatter(x, y)`
- Histogram: `plt.hist(data, bins=10)`
- Bar plot: `plt.bar(categories, values)`
- Labels & title: `plt.xlabel('X-axis')`, `plt.ylabel('Y-axis')`, `plt.title('Plot Title')`
- Show plot: `plt.show()`
- Save plot: `plt.savefig('plot.png')`

**Seaborn for Statistical Plots:**
- Distribution plot: `sns.histplot(data=df, x='col')`, `sns.kdeplot(data=df, x='col')`
- Scatter plot: `sns.scatterplot(data=df, x='col1', y='col2', hue='category_col')`
- Box plot: `sns.boxplot(data=df, x='category', y='value')`
- Violin plot: `sns.violinplot(data=df, x='category', y='value')`
- Heatmap: `sns.heatmap(df.corr(), annot=True)`
- Pair plot: `sns.pairplot(df)` (scatter plots for all column pairs)

## 5. SciPy (Scientific Python)

Import: `import scipy as sp`

**Statistics (`scipy.stats`):**
- T-test: `sp.stats.ttest_ind(data1, data2)`
- ANOVA: `sp.stats.f_oneway(group1, group2, group3)`
- Distributions: `sp.stats.norm.pdf(x, loc=mean, scale=std)`

**Optimization (`scipy.optimize`):**
- Minimization: `sp.optimize.minimize(func, x0)`
## 6. Scikit-learn (Machine Learning)

Import: `from sklearn.module import Class`

**Preprocessing:**
- Scaling: `StandardScaler().fit_transform(X)`, `MinMaxScaler()`
- Encoding: `OneHotEncoder()` (features), `LabelEncoder()` (target labels)
- Splitting data: `train_test_split(X, y, test_size=0.2)`

**Model Training Workflow:**

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Common Models:**
- Regression: `LinearRegression`, `Ridge`, `Lasso`, `SVR`
- Classification: `LogisticRegression`, `SVC`, `DecisionTreeClassifier`, `RandomForestClassifier`
- Clustering: `KMeans`, `DBSCAN`

**Model Evaluation (Classification):**
- Accuracy: `accuracy_score(y_true, y_pred)`
- Precision: `precision_score(y_true, y_pred)`
- Recall: `recall_score(y_true, y_pred)`
- F1-score: `f1_score(y_true, y_pred)`
- Confusion matrix: `confusion_matrix(y_true, y_pred)`
- ROC AUC: `roc_auc_score(y_true, y_pred_proba)`

**Model Evaluation (Regression):**
- Mean absolute error (MAE): `mean_absolute_error(y_true, y_pred)`
- Mean squared error (MSE): `mean_squared_error(y_true, y_pred)`
- R-squared ($R^2$): `r2_score(y_true, y_pred)`
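The split → scale → fit → predict → evaluate workflow above can be run end-to-end on a small synthetic dataset (the data is fabricated for illustration: the label is simply whether the feature sum is positive, which a linear classifier should learn easily):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score

# Synthetic binary-classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(int)

# 1. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Scale features (fit the scaler on train only, to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Fit a model and predict on held-out data
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 4. Evaluate
acc = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
```

Because the labels are a linear function of the features, accuracy on the 40-sample test set lands near 1.0; on real data you would compare several of the models and metrics listed above rather than trust a single score.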