API Reference¶
Note
All of the caproj modules outlined below were developed for use in an interactive environment, such as a Jupyter notebook.
Please note that no automated testing is in place for any of the modules outlined here. Therefore, if you use functions or classes from any of these modules, your mileage may vary.
Modules
caproj.datagen¶
This module contains functions for generating the interval metrics used in modeling for each unique capital project
Module variables:
endstate_columns: List of column names containing info for each project’s end-state
endstate_column_rename_dict: Dictionary for mapping members of endstate_columns to new column names
info_columns: List of column names containing descriptive info for each project
info_column_rename_dict: Dictionary for mapping members of info_columns to new column names
Module functions:
print_record_project_count: Prints summary of records and unique projects in dataframe
generate_interval_data: Generates a project analysis dataset for the specified interval
print_interval_dict: Prints summary of data dictionary for the generate_interval_data output
caproj.datagen.add_change_features(df)¶
Calculates interval change metrics for each PID and appends the dataset
- Parameters
df – pd.DataFrame containing joined project interval data output from the join_data_endstate() function
- Returns
Copy of input pd.DataFrame with the new metrics appended as additional columns
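As a sketch of what appending such interval change metrics can look like in pandas (the metric column names below are illustrative assumptions, not the module's actual output schema):

```python
import pandas as pd

# Hypothetical joined interval data; real inputs come from join_data_endstate().
df = pd.DataFrame({
    "PID": [101, 102],
    "Budget_Start": [100.0, 250.0],
    "Budget_End": [120.0, 225.0],
})

# Plausible change metrics: absolute and relative budget change per project.
df["Budget_Change"] = df["Budget_End"] - df["Budget_Start"]
df["Budget_Change_Ratio"] = df["Budget_Change"] / df["Budget_Start"]
```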
caproj.datagen.endstate_column_rename_dict = {'Budget_Forecast': 'Budget_End', 'Change_Years': 'Final_Change_Years', 'Current_Phase': 'Phase_End', 'Date_Reported_As_Of': 'Final_Change_Date', 'Forecast_Completion': 'Schedule_End', 'PID_Index': 'Number_Changes'}¶
Dictionary for mapping members of endstate_columns to new column names
caproj.datagen.endstate_columns = ['Date_Reported_As_Of', 'Change_Years', 'PID', 'Current_Phase', 'Budget_Forecast', 'Forecast_Completion', 'PID_Index']¶
List of column names containing info for each project’s end-state
caproj.datagen.ensure_datetime_and_sort(df)¶
Ensures datetime columns are formatted correctly and changes are sorted
- Parameters
df – pd.DataFrame of the cleaned capital projects change records data
- Returns
Original pd.DataFrame with datetime columns formatted and records sorted
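A minimal sketch of the coerce-then-sort pattern this implies (column names assumed here for illustration):

```python
import pandas as pd

# Hypothetical change records, out of order and with string dates.
df = pd.DataFrame({
    "PID": [2, 1, 1],
    "Date_Reported_As_Of": ["2019-06-30", "2018-01-15", "2017-03-01"],
})

# Coerce the date column to datetime, then sort chronologically within each PID.
df["Date_Reported_As_Of"] = pd.to_datetime(df["Date_Reported_As_Of"])
df = df.sort_values(["PID", "Date_Reported_As_Of"]).reset_index(drop=True)
```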
caproj.datagen.extract_project_details(df, copy_columns=['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Current_Phase', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Original_Budget', 'Original_Schedule'], column_rename_dict={'Current_Phase': 'Phase_Start', 'Original_Budget': 'Budget_Start', 'Original_Schedule': 'Schedule_Start'}, use_record=0, record_index='PID_Index')¶
Generates a dataframe with project details for each unique PID
- Parameters
df (pd.DataFrame) – The cleaned capital projects change records data
copy_columns (list, optional) – list of the names of columns that should be copied containing primary information about each project, defaults to info_columns
column_rename_dict (dict, optional) – dict of column name mappings to rename copied columns, defaults to info_column_rename_dict
use_record (int, optional) – integer record_index value to use as the basis for the resulting project info, defaults to 0 (indicating that the first chronological record for each project will be used)
record_index (str, optional) – indicates the column name to use for the record_index referenced by use_record, defaults to 'PID_Index'
- Returns
dataframe containing the primary project details for each unique PID, and the PID is set as the index
- Return type
pd.DataFrame
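The use_record=0 behavior can be roughly sketched with plain pandas (toy columns, not the function's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "PID": [1, 1, 2],
    "PID_Index": [0, 1, 0],
    "Current_Phase": ["Design", "Construction", "Design"],
})

# Keep only each project's record with PID_Index == 0 (its first chronological
# record), apply the *_Start renaming, and index the result by PID.
details = (
    df[df["PID_Index"] == 0]
    .rename(columns={"Current_Phase": "Phase_Start"})
    .set_index("PID")[["Phase_Start"]]
)
```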
caproj.datagen.find_max_record_indices(df, record_index='PID_Index')¶
Creates a list of Record_ID values of the max record ID for each PID
- Parameters
df – pd.DataFrame containing the cleaned capital project change records
record_index – string name of column containing PID ordinal indices (default record_index='PID_Index')
- Returns
list of max Record_ID values for each PID
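A pandas sketch of this max-record lookup, on toy data with the columns the docstring describes:

```python
import pandas as pd

df = pd.DataFrame({
    "Record_ID": [10, 11, 12, 20],
    "PID": [1, 1, 1, 2],
    "PID_Index": [0, 1, 2, 0],
})

# For each PID, take the Record_ID of the row holding its max PID_Index.
max_ids = df.loc[df.groupby("PID")["PID_Index"].idxmax(), "Record_ID"].tolist()
```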
caproj.datagen.generate_interval_data(data, change_year_interval=None, inclusive_stop=True, to_csv=False, save_dir='../data/interim/', custom_filename=None, verbose=1, return_df=True)¶
Generates a project analysis dataset for the specified interval
Note
If you specify to_csv=True, the default behavior is to save the resulting dataframe as:
../data/interim/NYC_capital_projects_{change_year_interval}yr.csv
or, if change_year_interval=None:
../data/interim/NYC_capital_projects_all.csv
The save_dir and custom_filename arguments allow you to change this to_csv behavior; however, using them is not recommended, for the sake of file-naming consistency in this project.
- Parameters
data – pd.DataFrame of the cleaned capital projects change records data
change_year_interval – integer or None representing the maximum year from which to include changes for each project, if None, then all years’ worth of changes included (default change_year_interval=None)
inclusive_stop – boolean, indicating whether projects included in the subset dataframe must be older than the change_year_interval year or may be equal-to-or-older-than it. If True, >= is used for subsetting; if False, > is used (default inclusive_stop=True)
to_csv – boolean, indicating whether or not the resulting dataframe should be saved to disk (default to_csv=False)
save_dir – string, indicating the directory to which the resulting dataframe is saved as .csv when to_csv=True (default save_dir='../data/interim/')
custom_filename – string or None, indicating whether to name the resulting .csv file something other than the default 'NYC_capital_projects_{interval}yr.csv' (default custom_filename=None)
verbose – integer, default verbose=1 prints the number of projects remaining in the resulting dataframe, otherwise that information is not printed
return_df – boolean, determines whether the resulting pd.DataFrame object is returned (default return_df=True)
- Returns
pd.DataFrame containing the summary change data for each unique project matching the specified change_year_interval
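The default naming behavior described in the note above can be sketched as a small helper (interval_filename is a hypothetical function written for illustration; it is not part of caproj):

```python
def interval_filename(change_year_interval=None, save_dir="../data/interim/"):
    """Hypothetical helper reproducing the documented default to_csv naming."""
    if change_year_interval is None:
        return save_dir + "NYC_capital_projects_all.csv"
    return save_dir + f"NYC_capital_projects_{change_year_interval}yr.csv"
```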
caproj.datagen.info_column_rename_dict = {'Current_Phase': 'Phase_Start', 'Original_Budget': 'Budget_Start', 'Original_Schedule': 'Schedule_Start'}¶
Dictionary for mapping members of info_columns to new column names
caproj.datagen.info_columns = ['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Current_Phase', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Original_Budget', 'Original_Schedule']¶
List of column names containing descriptive info for each project
caproj.datagen.join_data_endstate(df_details, df_endstate, how='inner')¶
Creates dataframe joining the df_details and df_endstate dataframes by PID
- Parameters
df_details – pd.DataFrame output from the extract_project_details() function
df_endstate – pd.DataFrame output from the project_interval_endstate() function
how – string passed to the pd.merge method indicating the type of join to perform (default how=’inner’)
- Returns
pd.DataFrame containing the join results, the index is reset
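The join itself presumably reduces to a pd.merge on PID; a toy sketch of the how='inner' default with the index reset:

```python
import pandas as pd

df_details = pd.DataFrame({"PID": [1, 2], "Phase_Start": ["Design", "Design"]})
df_endstate = pd.DataFrame({"PID": [1, 2], "Phase_End": ["Construction", "Complete"]})

# Inner join on PID (the how='inner' default), then reset the index.
joined = pd.merge(df_details, df_endstate, on="PID", how="inner").reset_index(drop=True)
```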
caproj.datagen.print_interval_dict(datadict_dir='../references/data_dicts/', datadict_filename='data_dict_interval.csv')¶
Prints summary of data dictionary for the generate_interval_data output
- Parameters
datadict_dir – optional string indicating directory location of target data dictionary (default '../references/data_dicts/')
datadict_filename – optional string indicating filename of target data dict (default 'data_dict_interval.csv')
- Returns
No objects are returned, printed output only
caproj.datagen.print_record_project_count(dataframe, dataset='full')¶
Prints summary of records and unique projects in dataframe
- Parameters
dataframe – pd.DataFrame object for the version of the NYC capital projects data you wish to summarize
dataset – string, accepts ‘full’, ‘all’, ‘training’, or ‘test’ (default ‘full’)
- Returns
prints to standard output, no objects returned
caproj.datagen.project_interval_endstate(df, keep_columns=['Date_Reported_As_Of', 'Change_Years', 'PID', 'Current_Phase', 'Budget_Forecast', 'Forecast_Completion', 'PID_Index'], column_rename_dict={'Budget_Forecast': 'Budget_End', 'Change_Years': 'Final_Change_Years', 'Current_Phase': 'Phase_End', 'Date_Reported_As_Of': 'Final_Change_Date', 'Forecast_Completion': 'Schedule_End', 'PID_Index': 'Number_Changes'}, change_year_interval=None, record_index='PID_Index', change_col='Change_Year', project_age_col='Current_Project_Year', inclusive_stop=True)¶
Generates a dataframe of endstate data for each unique PID given the specified analysis interval
- Parameters
df – pd.DataFrame of the cleaned capital projects change records data
keep_columns – list of column names for columns that should be kept as part of the resulting dataframe (default keep_columns=endstate_columns module variable)
column_rename_dict – dict mapping existing column names to the new names to which they should be named (default column_rename_dict=endstate_column_rename_dict module variable)
change_year_interval – integer or None representing the maximum year from which to include changes for each project, if None, then all years’ worth of changes included (default change_year_interval=None)
record_index – string name of column containing PID ordinal indices (default record_index='PID_Index')
change_col – string, name of column containing change year indicators (default change_col=’Change_Year’)
project_age_col – string, name of column containing current age of each project at the time the dataset was compiled (default project_age_col=’Current_Project_Year’)
inclusive_stop – boolean, indicating whether projects included in the subset dataframe must be older than the change_year_interval year or may be equal-to-or-older-than it. If True, >= is used for subsetting; if False, > is used (default inclusive_stop=True)
- Returns
pd.DataFrame containing endstate data for each unique project, the index is set to the PID
caproj.datagen.subset_project_changes(df, change_year_interval=3, change_col='Change_Year', project_age_col='Current_Project_Year', inclusive_stop=True)¶
Generates a subsetted dataframe with only the change records that occur in or before the specified max interval year
- Parameters
df – pd.DataFrame of the cleaned capital projects change records data
change_year_interval – integer representing the maximum year from which to include changes for each project (default change_year_interval=3)
change_col – string, name of column containing change year indicators (default change_col=’Change_Year’)
project_age_col – string, name of column containing current age of each project at the time the dataset was compiled (default project_age_col=’Current_Project_Year’)
inclusive_stop – boolean, indicating whether projects included in the subset dataframe must be older than the change_year_interval year or may be equal-to-or-older-than it. If True, >= is used for subsetting; if False, > is used (default inclusive_stop=True)
- Returns
pd.DataFrame of the subsetted data, the index is set to each record’s ‘Record_ID’ value
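The inclusive_stop distinction (>= vs >) can be illustrated with a toy frame using the default column names (the exact subsetting logic inside the function is assumed here):

```python
import pandas as pd

df = pd.DataFrame({
    "Change_Year": [1, 2, 3],
    "Current_Project_Year": [5, 3, 2],
})
interval = 3

# inclusive_stop=True keeps projects at least as old as the interval (>=);
# inclusive_stop=False keeps only projects strictly older than it (>).
inclusive = df[(df["Current_Project_Year"] >= interval) & (df["Change_Year"] <= interval)]
strict = df[(df["Current_Project_Year"] > interval) & (df["Change_Year"] <= interval)]
```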
caproj.scale¶
This module contains functions for scaling features of an X features design matrix and for encoding categorical variables
Module functions:
encode_categories: Encodes categorical variable column and appends values to dataframe
scale_features: Scales val_df features based on train_df and returns scaled dataframe
sigmoid: Efficient numpy sigmoid transformation of dataframe, array, or matrix
log_plus_one: Adds 1 to input data and then applies Log transformation to those values
caproj.scale.encode_categories(data, colname, one_hot=True, drop_cat=None, cat_list=None, drop_original_col=False, append_colname=None)¶
Encodes categorical variable column and appends values to dataframe
This function offers the option to either one-hot-encode (0, 1) or label-encode (as consecutive integers 0 to n) categorical values by setting one_hot to either True or False.
- Parameters
data – The pd.DataFrame object containing the column you wish to encode
colname – string indicating name of column you wish to encode
one_hot – boolean indicating whether you wish to one-hot-encode the categories. If False, the values are simply encoded to a set of consecutive integers (default one_hot=True)
drop_cat – None or category value you wish to drop from your one-hot-encoded variable columns. If None and one_hot=True, no variable columns are dropped. If one_hot=False, any category value passed to drop_cat will ensure that value is sorted to the last place position in the resulting encoded integer values (default drop_cat=None)
cat_list – None or list specifying the full set of category values contained in your target column. The benefit of providing your own list is that it allows you to provide a custom ordering of categories to the encoder. If None, the categories will default to alphabetical order. (default cat_list=None)
drop_original_col – boolean indicating whether the original category column specified by colname will be dropped from the resulting dataframe (default drop_original_col=False)
append_colname – None or string, indicating what should be appended to one-hot-encoded column names. This is useful in instances where multiple columns have identical category names within them, or a category name matches an existing column. None will result in no string being added (default append_colname=None)
- Returns
pd.DataFrame of the original input dataframe with the additional encoded category column(s) appended to it.
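The two encoding paths can be approximated with plain pandas (this is an equivalent sketch, not the function's implementation; alphabetical category order is the default, as documented):

```python
import pandas as pd

df = pd.DataFrame({"Borough": ["Bronx", "Queens", "Bronx"]})

# one_hot=True analogue: one 0/1 indicator column per category value.
one_hot = pd.get_dummies(df["Borough"], prefix="Borough")

# one_hot=False analogue: consecutive integer codes, alphabetical by default.
codes = df["Borough"].astype("category").cat.codes
```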
caproj.scale.log_plus_one(x)¶
Adds 1 to input data and then applies Log transformation to those values
- Parameters
x – data to undergo transformation (datatypes accepted include pandas DataFrames and Series, numpy matrices and arrays, or single int or float values)
- Returns
The transformed dataframe, series, array, or value depending on the type of original input x object
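A one-line numpy sketch of this transform (numpy also ships it directly as np.log1p):

```python
import numpy as np

def log_plus_one(x):
    # Add 1, then apply the natural log; vectorized over arrays and Series.
    return np.log(x + 1)
```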
caproj.scale.scale_features(train_df, val_df, exclude_scale_cols=[], scaler=<class 'sklearn.preprocessing._data.RobustScaler'>, scale_before_func=None, scale_after_func=None, reapply_scaler=False, **kwargs)¶
Scales val_df features based on train_df and returns scaled dataframe
Accepts various sklearn scalers and allows you to specify features you do not want affected by scaling by using the exclude_scale_cols parameter.
Note
Be certain to reset the index of your accompanying y_train and y_test dataframes, or you will risk running into potential indexing errors while working with your scaled X dataframes
- Parameters
train_df – The training data
val_df – Your test/validation data
exclude_scale_cols – Optional list containing names of columns we do not wish to scale, default=[]
scaler – The sklearn scaler class used to fit the data (e.g. StandardScaler, MinMaxScaler, RobustScaler, etc.), default=RobustScaler
scale_before_func – Optional function (e.g. np.log, this module's sigmoid, or a custom function) applied to the train and val dfs prior to the scaler fitting and scaling val_df, default=None
scale_after_func – Optional function (e.g. np.log, this module's sigmoid, or a custom function) applied to val_df after the scaler has scaled the dataframe, default=None
reapply_scaler – Boolean, if set to True, the scaler is fitted a second time after the scale_after_func is applied (useful if using MinMaxScaler and you wish to maintain a 0 to 1 scale after applying a secondary transformation to the data), default is reapply_scaler=False
kwargs – Any additional arguments are passed as parameters to the selected scaler (for instance feature_range=(-1,1) would be an appropriate argument if scaler is set to MinMaxScaler)
- Returns
a feature-scaled version of the val_df dataframe, and a list of fitted sklearn scaler objects that were used to scale values (for later use in case original values need to be restored), list will either be of length 1 or 2 depending on whether reapply_scaler was set to True
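The fit-on-train, transform-on-validation pattern can be sketched with a hand-rolled robust scaling (median and IQR, the same statistics sklearn's RobustScaler uses); this is an illustration of the pattern, not the function's implementation:

```python
import pandas as pd

train = pd.DataFrame({"Budget_Start": [10.0, 20.0, 30.0, 1000.0]})
val = pd.DataFrame({"Budget_Start": [20.0, 40.0]})

# Fit scaling statistics on the training data only...
med = train["Budget_Start"].median()
iqr = train["Budget_Start"].quantile(0.75) - train["Budget_Start"].quantile(0.25)

# ...then apply them to the validation data, so no information leaks from val.
val_scaled = (val["Budget_Start"] - med) / iqr
```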
caproj.scale.sigmoid(x)¶
Efficient numpy sigmoid transformation of dataframe, array, or matrix
- Parameters
x – data to undergo transformation (datatypes accepted include pandas DataFrames and Series, numpy matrices and arrays, or single int or float values)
- Returns
The transformed dataframe, series, array, or value depending on the type of original input x object
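The standard vectorized numpy form of this transform, for reference:

```python
import numpy as np

def sigmoid(x):
    # Logistic transform; vectorized over scalars, arrays, Series, DataFrames.
    return 1 / (1 + np.exp(-x))
```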
caproj.model¶
This module contains functions for generating fitted models and summarizing the results
Module functions:
generate_model_dict: Fits the specified model type and generates a dictionary of results
print_model_results: Prints a model results summary from the model dictionary generated using the generate_model_dict() function
caproj.model.generate_model_dict(model, model_descr, X_train, X_test, y_train, y_test, multioutput=True, verbose=False, predictions=True, scores=True, model_api='sklearn', sm_formulas=None, y_stored=True, **kwargs)¶
Fits the specified model type and generates a dictionary of results
This function works for fitting and generating predictions for sklearn, keras, and statsmodels models. PyGam models typically also work by specifying the ‘sklearn’ model_api. For statsmodels models, only those that depend on the statsmodels.formula.api work.
The returned output dictionary follows this structure:
{
    'description': model_descr string,
    'model': fitted model object,
    'y_variables': [y1_varname_string, y2_varname_string],
    'formulas': [y1_formula_string, y2_formula_string],  # empty list if statsmodels api is not used
    'y_values': {
        'train': y_train array,
        'test': y_test array,
    },
    'predictions': {
        'train': train_predictions array,
        'test': test_predictions array,
    },
    'score': {
        'train': training r2_score array,
        'test': test r2_score array,
    },
}
- Parameters
model – the uninitialized sklearn, pygam, or statsmodels regression model object, or a previously compiled keras model
model_descr – a brief string describing the model (cannot exceed 80 characters)
X_train, X_test, y_train, y_test – the datasets on which to fit and evaluate the model
multioutput – boolean, if True and the 'sklearn' model_api is used, will attempt fitting a single multioutput model; if False or the 'statsmodels' model_api is used, fits separate models for each output
verbose – if True, prints resulting fitted model object (default=False)
predictions – if True, the dict stores model.predict() predictions for both the X_train and X_test input dataframes
scores – if True, metrics scores are calculated and stored in the resulting dict for both the train and test predictions
model_api – specifies the api-type required for the input model, options include ‘sklearn’, ‘keras’, or ‘statsmodels’ (default=’sklearn’)
sm_formulas – list of statsmodels formulas defining the model for each output y (include only the exogenous variables, such as x1 + x2 + x3 instead of y ~ x1 + x2 + x3), default is None
y_stored – boolean, determines whether the true y values are stored in the resulting dictionary. It is convenient to keep these stored alongside the predictions for easier evaluation later (default is y_stored=True)
kwargs – optional arguments that pass directly to the model object at time of initialization, or in the case of the 'keras' model_api, they pass to the keras.model.fit() method
- Returns
returns a dictionary object containing the resulting fitted model object, resulting predictions, and train and test scores (if specified as True)
caproj.model.print_model_results(model_dict, score='both')¶
Prints a model results summary from the model dictionary generated using the generate_model_dict() function
- Parameters
model_dict – dict, output dictionary from the generate_model_dict() function
score – None, 'both', 'test', or 'train' accepted; identifies which results to print for this particular metric (default score='both')
- Returns
nothing is returned, this function just prints summary output
caproj.visualize¶
This module contains functions for visualizing data and model results
Module functions:
plot_value_counts: Generates barplot from pandas value_counts series
plot_barplot: Generates a horizontal barplot from a pandas value_counts series
plot_hist_comps: Plots side-by-side histograms for comparison with log yscale option
plot_line: Generates line plot given input x, y values
plot_2d_embed_scatter: Plots 2D scatterplot of dimension-reduced embeddings for train and test
plot_true_pred: Plots model prediction results directly from model_dict or input arrays
plot_bdgt_sched_scaled: Plots original vs scaled versions of budget and schedule input data
plot_change_trend: Plots 4 subplots showing project budget and duration forecast change trend
plot_gam_by_predictor: Calculates and plots the partial dependence and 95% CIs for a GAM model
plot_coefficients: Plots coefficients from statsmodels linear regression model
load_img_to_numpy: Loads an image from file, converts it to np.array and returns the array
plot_jpg: Plots a jpeg image from file
caproj.visualize.load_img_to_numpy(filepath)¶
Loads an image from file, converts it to np.array and returns the array
- Parameters
filepath (str) – path to image file
- Returns
numpy representation of image
- Return type
array
caproj.visualize.plot_2d_embed_scatter(data1, data2, title, xlabel, ylabel, data1_name='training obs', data2_name='TEST obs', height=5, point_size=None)¶
Plots 2D scatterplot of dimension-reduced embeddings for train and test
2D matplotlib scatterplot, no objects are returned.
NOTE: This function assumes the data inputs are 2D np.array objects of shape (n, 2), and that two separate sets of encoded embeddings are going to be plotted together (i.e. the train and the test observations). 2D pd.DataFrame objects can be passed, and are converted to np.array within the plotting function.
- Parameters
data1 – np.array 2D containing 2 encoded dimensions
data2 – a second np.array 2D containing 2 encoded dimensions
title – str, text used for plot title
xlabel – string representing the label for the x axis
ylabel – string representing the label for the y axis
data1_name – string representing the name of the first dataset, this will be the label given to those points in the plot’s legend (default ‘training obs’)
data2_name – string representing the name of the second dataset, this will be the label given to those points in the plot’s legend (default 'TEST obs')
height – integer that determines the height of the plot (default is 5)
point_size – integer or None, default of None will revert to matplotlib scatter default, integer entered will override the default marker size
caproj.visualize.plot_barplot(value_counts, title, height=6, varname=None, color='k', label_space=0.01)¶
Generates a horizontal barplot from a pandas value_counts series
- Parameters
value_counts – pd.Series object generated by pandas value_counts() method
title – string, the printed title of the plot
height – integer, the desired height of the plot (default is 6)
varname – string or None, text to print for plot’s y-axis title
color – string, the matplotlib color name for the color you would like for the plotted bars (default is ‘k’ or black)
label_space – float, a coefficient used to space the count label an appropriate distance from the plotted bar (default is 0.01)
- Returns
a matplotlib plot. No objects are returned
caproj.visualize.plot_bdgt_sched_scaled(X, X_scaled, scale_descr, X_test=None, X_test_scaled=None, bdgt_col='Budget_Start', sched_col='Duration_Start')¶
Plots original vs scaled versions of budget and schedule input data
Generates 1x2 subplotted scatterplots, no objects returned
- Parameters
X – Dataframe or 2D array with original budget and schedule train data
X_scaled – Dataframe or 2D array with scaled budget and schedule train data
scale_descr – Short string description of scaling transformation used to title scaled data plot (e.g. ‘Sigmoid Standardized’)
X_test – Optional, Dataframe or 2D array with original test data, which will plot test data as overlay with training data (default is X_test=None, which does not plot any overlay)
X_test_scaled – Optional, Dataframe or 2D array with original test data, which plots overlay similar to X_test (default is X_test_scaled=None)
bdgt_col – string name of budget values column for input dataframes (default bdgt_col=’Budget_Start’)
sched_col – string name of schedule values column for input dataframes (default sched_col='Duration_Start')
caproj.visualize.plot_change_trend(trend_data, pid_data, pid, interval=None)¶
Plots 4 subplots showing project budget and duration forecast change trend
Generates image of 4 subplots, no objects are returned.
- Parameters
trend_data – pd.DataFrame, the cleaned dataset of all project change records (i.e. ‘Capital_Projects_clean.csv’ dataframe)
pid_data – pd.DataFrame, the prediction_interval dataframe produced using this project’s data generator function (i.e. ‘NYC_Capital_Projects_3yr.csv’ dataframe)
pid – integer, the PID for the project you wish to plot
interval – integer or None, indicating the max Change_Year you wish to plot, if None all change records are plotted for the specified pid (default, interval=None)
caproj.visualize.plot_coefficients(model_dict, subplots=(1, 2), fig_height=8, suptitle_spacing=1)¶
Plots coefficients from statsmodels linear regression model
Generates a plotted series of subplots illustrating estimated coefficients and 95% CIs. No objects are returned
- Parameters
model_dict (dict) – model dictionary object from generate model dict function, containing fitted Statsmodels linear regression model objects (NOTE: this function is compatible with statsmodels models only)
subplots (tuple) – to plot each of the 2 predicted y variables, provides the dimensions of subplots for the figure (NOTE: currently this function is only configured to plot 2 columns of subplots, therefore no value other than two is accepted for the subplots width dimension), defaults to (1, 2)
fig_height (int or float) – this value is passed directly to the figsize parameter of plt.subplots() and determines the overall height of your plot, defaults to 8
suptitle_spacing (float) – this value is passed to the 'y' parameter of plt.suptitle(), defaults to 1.10
caproj.visualize.plot_gam_by_predictor(model_dict, model_index, X_data, y_data, dataset='train', suptitle_y=1)¶
Calculates and plots the partial dependence and 95% CIs for a GAM model
Plots a set of subplots for each predictor contained in your X data. No objects are returned.
- Parameters
model_dict – model dictionary containing the fitted PyGAM models you wish to plot
model_index – integer indicating the index of the model stored in your model_dict that you wish to plot
X_data – pd.DataFrame containing the matching predictor set you wish to plot beneath your predictor contribution lines
y_data – pd.DataFrame containing the matching outcome set you wish to plot beneath your predictor contribution lines
dataset – string, 'train' or 'test' indicating the type of X and y data you have entered for the X_data and y_data arguments (default='train')
suptitle_y – float >= 1.00 indicating the spacing required to prevent your plot from overlapping your title text (default suptitle_y=1)
caproj.visualize.plot_hist_comps(df, metric_1, metric_2, y_log=False, bins=20)¶
Plots side-by-side histograms for comparison with log yscale option
Plots 2 subplots, no objects are returned
- Parameters
df – pd.DataFrame object containing the data you wish to plot
metric_1 – string, name of column containing data for the first plot
metric_2 – string, name of column containing data for second plot
y_log – boolean, indicating whether the y-axis should be plotted with a log scale (default False)
bins – integer, the number of bins to use for the histogram (default 20)
caproj.visualize.plot_jpg(filepath, title, figsize=(16, 12))¶
Plots a jpeg image from file
- Parameters
filepath (str) – path to file for plotting
title (str) – plot title text
figsize (tuple) – dimensions of resulting plot, defaults to (16, 12)
caproj.visualize.plot_line(x_vals, y_vals, title, x_label, y_label, height=3.5)¶
Generates line plot given input x, y values
caproj.visualize.plot_true_pred(model_dict=None, dataset='train', y_true=None, y_pred=None, model_descr=None, y1_label=None, y2_label=None)¶
Plots model prediction results directly from model_dict or input arrays
Generates 5 subplots: (1) true values with predicted values overlay, each y variable on its own axis, (2) output variable 1 true vs. predicted on each axis, (3) output variable 2 true vs. predicted on each axis, (4) output variable 1 true vs. residuals, and (5) output variable 2 true vs. residuals (no objects are returned)
This plotting function only requires that a model_dict from the generate_model_dict() function be used as input. However, through use of the y_true, y_pred, model_descr, and y1/y2 label parameters, predictions stored in a shape (n, 2) array can be plotted directly without the use of a model_dict.
NOTE: This plotting function requires y to consist of 2 output variables. Therefore, it will not work with y data not of shape=(n, 2).
- Parameters
model_dict – dictionary or None, if model results from the generate_model_dict func is used, function defaults to data from that dict for plot, if None plot expects y_true, y_pred, model_descr, and y1/y2 label inputs for plotting
dataset – string, ‘train’ or ‘test’, indicates whether to plot training or test results if using model_dict as data source, and labels plots accordingly if y_pred and y_true inputs are used (default is ‘train’)
y_true, y_pred – None or pd.DataFrame and np.array shape=(n, 2) data sources accepted and used for plotting if model_dict=None (default for both is None)
model_descr – None or string of max length 80 used to describe the model in the title. If None, model_descr defaults to the description in model_dict; if a string is entered, that string overrides the description in model_dict. If using y_true/y_pred as the data source, model_descr must be specified as a string (default is None)
y1_label, y2_label – None or string of max length 40 used to describe the 2 output y variables being plotted. These values appear along the plot axes and in the titles of subplots. If None, the y_variables names from the model_dict are used. If strings are entered, those strings override the model_dict values. If using y_true/y_pred as the data source, these values must be specified (default is None for both labels)
caproj.visualize.plot_value_counts(value_counts, figsize=(9, 3), color='tab:blue')¶
Generates barplot from pandas value_counts series
- Parameters
value_counts (DataFrame) – pandas DataFrame generated using the pandas value_counts method
figsize (tuple, optional) – dimensions of resulting plot, defaults to (9, 3)
color (str, optional) – color of resulting plotted bars, defaults to "tab:blue"
caproj.cluster¶
This module contains classes and functions for clustering data and visualizing the cluster results
Module classes:
UMAP_embedder: Class methods for generating UMAP embedding and HDBSCAN clusters
Module functions:
- Generates silhouette subplot of kmeans clusters alongside PCA n=2
- display_gapstat_with_errbars: Generates plots of gap stats with error bars for each number of clusters
- fit_neighbors: Fits n nearest neighbors based on min samples and returns distances
- Plots epsilon by index sorted by increasing distance
- silscore_dbscan: Generates sil score omitting observations not assigned to any cluster by dbscan
- fit_dbscan: Fits dbscan and returns dictionary of results including model, labels, indices
- Prints summary results of fitted DBSCAN results dictionary
- Plots a dendrogram given a set of input hierarchy linkage data
- Requires melted dataframe as input and plots histograms by cluster
- Plots scatterplot with color scale
- Plots scatterplot with category colors
class caproj.cluster.UMAP_embedder(scaler, final_cols, mapper_dict, clusterer, bert_embedding)¶
Class methods for generating UMAP embedding and HDBSCAN clusters
get_clustering(attributes_2D_mapping)¶
Returns HDBSCAN cluster labels
get_full_df(df, dimensions='all')¶
Returns UMAP full dataframe
get_mapping_attributes(df, return_extra=False, dimensions='all')¶
If return_extra=True, returns 3 objects: the mapping, the columns needed to be added to harmonize with the entire data, and the dummified df before adding those columns
get_mapping_description(df, dimensions='all')¶
Returns UMAP final dataframe
caproj.cluster.display_gapstat_with_errbars(gap_df, height=4)¶
Generates plots of gap stats with error bars for each number of clusters
- Parameters
gap_df (DataFrame) – dataframe attribute of a fitted
gap_statistic.OptimalK
object for plotting (i.e.OptimalK.gap_df
)height (int, optional) – height of the resulting plot, defaults to 4
-
caproj.cluster.
fit_dbscan
(data, min_samples, eps)¶ Fits dbscan and returns dictionary of results including model, labels, indices
- Parameters
data (array-like) – original data to be fitted using
sklearn.cluster.DBSCAN
min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
eps (int or float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- Returns
dictionary of results and important characteristics of the fitted DBSCAN algorithm (see NOTE below)
- Return type
dict
NOTE: The dictionary returned includes the following items:
{
    "model": DBSCAN(eps=eps, min_samples=min_samples).fit(data),
    "n_clusters": sum([i != -1 for i in set(model.labels_)]),
    "labels": model.labels_,
    "core_sample_indices": model.core_sample_indices_,
    "clustered_bool": [i != -1 for i in labels],
    "cluster_counts": pd.Series(labels).value_counts(),
    "sil_score": silscore_dbscan(data, labels, clustered_bool),
}
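The items in that dictionary can be reproduced with plain scikit-learn and pandas calls; a minimal sketch on made-up data (two tight blobs plus one far-away noise point), with sklearn's silhouette_score standing in for silscore_dbscan:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic data: two well-separated blobs plus one distant noise point
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0, 0.1, size=(20, 2)),
    rng.normal(5, 0.1, size=(20, 2)),
    [[50.0, 50.0]],
])

model = DBSCAN(eps=0.5, min_samples=5).fit(data)
labels = model.labels_
clustered_bool = [i != -1 for i in labels]  # True for points assigned to a cluster

results = {
    "model": model,
    "n_clusters": sum(i != -1 for i in set(labels)),
    "labels": labels,
    "core_sample_indices": model.core_sample_indices_,
    "clustered_bool": clustered_bool,
    "cluster_counts": pd.Series(labels).value_counts(),
    # silhouette computed only on the clustered observations
    "sil_score": silhouette_score(data[clustered_bool], labels[clustered_bool]),
}
```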
-
caproj.cluster.
fit_neighbors
(data, min_samples)¶ Fits n nearest neighbors based on min samples and returns distances
This is a simple implementation of the
sklearn.neighbors.NearestNeighbors
and returns the distance results from that object’s
fitted_neighbors
method
- Parameters
data (dataframe or array) – data on which to perform nearest neighbors algorithm
min_samples (int) – number of neighbors to use by default for kneighbors queries
- Returns
array representing the lengths to points
- Return type
array
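A sketch of what fit_neighbors presumably does under the hood, using sklearn.neighbors.NearestNeighbors on synthetic data; sorting each point's distance to its k-th neighbor yields the increasing-distance curve that plot_epsilon visualizes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 2))
min_samples = 5

# Fit k nearest neighbors and take each point's distance to its k-th
# neighbor; the sorted distances form the "elbow" curve commonly used
# to choose DBSCAN's eps
nbrs = NearestNeighbors(n_neighbors=min_samples).fit(data)
distances, _ = nbrs.kneighbors(data)
kth_distances = np.sort(distances[:, -1])
```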
-
caproj.cluster.
make_spider
(mean_peaks_per_cluster, row, name, color)¶ Generate spider plot showing attributes of a single cluster
-
caproj.cluster.
plot_category_scatter
(data, x_col, y_col, cat_col, title, colormap='Paired', xlabel='1st dimension', ylabel='2nd dimension')¶ Plots scatterplot with category colors
-
caproj.cluster.
plot_cluster_hist
(data, title, metric, cluster_col='cluster', val_col='Standardized Metric Value', metric_col='Metric', cmap='Paired', bins=6)¶ Requires melted dataframe as input and plots histograms by cluster
-
caproj.cluster.
plot_dendrogram
(linkage_data, method_name, yticks=16, ytick_interval=1, height=4.5)¶ Plots a dendrogram given a set of input hierarchy linkage data
- Parameters
linkage_data – np.array output from scipy.cluster.hierarchy, which should have been applied to a distance matrix to convert it to linkage data
method_name – string describing the linkage method used, should be fewer than 30 characters
yticks – integer, the number of desired y tick labels for the resulting plot
ytick_interval – integer, the desired interval for the resulting y ticks
height – float, the desired height of the resulting plot
- Returns
plots the dendrogram; no objects are returned
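A hedged example of preparing linkage data of the kind plot_dendrogram expects, using scipy directly (the random points and average-linkage choice are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 2))

# Condensed distance matrix -> linkage data, the input plot_dendrogram expects
linkage_data = linkage(pdist(points), method="average")

# Rough equivalent of plot_dendrogram(linkage_data, "average linkage")
fig, ax = plt.subplots(figsize=(6, 4.5))
dendrogram(linkage_data, ax=ax)
ax.set_title("average linkage")
plt.close(fig)
```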
-
caproj.cluster.
plot_epsilon
(distances, min_samples, height=5)¶ Plot epsilon by index sorted by increasing distance
Generates a line plot of epsilon with observations sorted by increasing distances
- Parameters
distances (array) – distances generated by
fit_neighbors()
min_samples (int) – number of neighbors used to generate distances
height (int, optional) – height of plot, defaults to 5
-
caproj.cluster.
plot_spider_clusters
(title, mean_peaks_per_cluster)¶ Generate spider plot subplots for all input clusters
-
caproj.cluster.
plot_umap_scatter
(x, y, color, title, scale_var, colormap='Reds', xlabel='1st dimension', ylabel='2nd dimension')¶ plots scatterplot with color scale
-
caproj.cluster.
print_dbscan_results
(dbscan_dict)¶ Prints summary results of fitted DBSCAN results dictionary
Provides printed summary and plotted value counts by cluster
- Parameters
dbscan_dict (dict) – returned output dictionary from
fit_dbscan()
function
-
caproj.cluster.
silplot
(X, cluster_labels, clusterer, pointlabels=None, height=6)¶ Generates silhouette subplot of kmeans clusters alongside PCA n=2
Two side-by-side subplots are generated showing (1) the silhouette plot of the clusterer’s results and (2) the PCA 2-dimensional reduction of the input data, color-coded by cluster.
- Source: The majority of the code from this function was provided as a
helper function from the CS109b staff in HW2
The original code authored by the cs109b teaching staff is modified from: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
- Parameters
X (pandas.DataFrame) – original multi-dimensional data values against which you are plotting
cluster_labels (list or array) – list of labels for each observation in
X
clusterer (sklearn.cluster.KMeans object) – fitted sklearn kmeans clustering object
pointlabels (list or None, optional) – list of labels for each point, defaults to None
height (int, optional) – height of resulting subplots, defaults to 6
-
caproj.cluster.
silscore_dbscan
(data, labels, clustered_bool)¶ Generates sil score omitting observations not assigned to any cluster by dbscan
- Parameters
data (array or dataframe) – original data used for dbscan clustering
labels (array) – cluster label for each observation in
data
clustered_bool (list or 1-d array) – boolean value for each observation indicating whether it had been clustered by dbscan
- Returns
silhouette score
- Return type
float
caproj.autoencoder¶
This module contains functions for building dense autoencoder networks and visualizing model training results
Module variables:
Import and set seed for reproducible results |
Module functions:
|
Builds and compiles a tensorflow.keras dense autoencoder network |
|
Plot training and validation loss using keras history object |
-
caproj.autoencoder.
build_dense_ae_architecture
(input_dim, encoding_dim, droprate, learning_rate, name)¶ Builds and compiles a tensorflow.keras dense autoencoder network
Note
This network architecture was designed for the specific purpose of encoding a dataset of 1D embeddings. Therefore, the input dimension must be 1D with a length that equals the number of values in any single observation’s embedding
- Parameters
input_dim – integer, the length of each embedding (must all be of the same length)
encoding_dim – integer, the desired bottleneck dimension for the encoder network
droprate – float between 0 and 1, passed to the rate argument of the dropout layers between each dense layer
learning_rate – float, the desired learning rate for the Adam optimizer used while compiling the model
name – string, the desired name of the resulting network
- Returns
tuple of 3 tf.keras model objects: [0] full autoencoder model, [1] encoder model, [2] decoder model
-
caproj.autoencoder.
plot_history
(history, title, val_name='validation', loss_type='MSE')¶ Plot training and validation loss using keras history object
- Parameters
history – keras training history object or dict. If a dict is used, it must have two keys named ‘loss’ and ‘val_loss’ for which the corresponding values must be lists or arrays with float values
title – string, the title of the resulting plot
val_name – string, the name for the val_loss line in the plot legend (default ‘validation’)
loss_type – string, the loss type name to be printed as the y axis label (default ‘MSE’)
- Returns
a line plot illustrating model training history; no objects are returned
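Since plot_history also accepts a dict with 'loss' and 'val_loss' keys, a minimal stand-in (the loss values are made up) might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Dict stand-in for a keras history object, with the two required keys
history = {
    "loss": [0.9, 0.5, 0.3, 0.25],
    "val_loss": [1.0, 0.6, 0.45, 0.4],
}

# Rough equivalent of plot_history(history, "autoencoder training")
fig, ax = plt.subplots()
ax.plot(history["loss"], label="training")
ax.plot(history["val_loss"], label="validation")
ax.set_xlabel("epoch")
ax.set_ylabel("MSE")
ax.legend()
plt.close(fig)
```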
-
caproj.autoencoder.
random_seed
= 109¶ Import and set seed for reproducible results
-
caproj.autoencoder.
seed
(self, seed=None)¶ Reseed a legacy MT19937 BitGenerator
Notes
This is a convenience, legacy function.
The best practice is to not reseed a BitGenerator, rather to recreate a new one. This method is here for legacy reasons. This example demonstrates best practice.
>>> from numpy.random import MT19937
>>> from numpy.random import RandomState, SeedSequence
>>> rs = RandomState(MT19937(SeedSequence(123456789)))
# Later, you want to restart the stream
>>> rs = RandomState(MT19937(SeedSequence(987654321)))
caproj.utils¶
This module contains utility functions for performing HDBSCAN and UMAP analyses
Note
Documentation is currently incomplete for each function in this module.
Module functions:
|
Run data through each classifier in ensemble list to get predicted probabilities |
|
Adjust class predictions based on the prediction threshold (t) |
|
Print a comprehensive classification report on both validation and training set |
|
Generate plot of UMAP algorithm results based on specified arguments |
|
Generate plot of HDBSCAN algorithm results based on specified arguments |
-
caproj.utils.
adjusted_classes
(y_scores, t)¶ Adjust class predictions based on the prediction threshold (t)
Will only work for binary classification problems.
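A sketch of the thresholding logic adjusted_classes likely implements (the helper name and scores below are illustrative, not the package's actual code):

```python
def adjusted_classes_sketch(y_scores, t):
    """Assign class 1 when the predicted probability meets threshold t."""
    return [1 if score >= t else 0 for score in y_scores]

scores = [0.1, 0.4, 0.55, 0.9]

# Default 0.5 threshold vs. a stricter 0.8 threshold
default_preds = adjusted_classes_sketch(scores, 0.5)
strict_preds = adjusted_classes_sketch(scores, 0.8)
```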
-
caproj.utils.
cluster_hdbscan
(clusterable_embedding, min_cluster_size, viz_embedding_list)¶ Generate plot of HDBSCAN algorithm results based on specified arguments
-
caproj.utils.
draw_umap
(data, n_neighbors=15, min_dist=0.1, c=None, n_components=2, metric='euclidean', title='', plot=True, cmap=None, use_plotly=False, **kwargs)¶ Generate plot of UMAP algorithm results based on specified arguments
-
caproj.utils.
predict_ensemble
(ensemble, X)¶ Run data through each classifier in ensemble list to get predicted probabilities
Those are then averaged out across all classifiers.
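A sketch of that averaging step, using hypothetical stub classifiers with a sklearn-style predict_proba in place of a real fitted ensemble:

```python
import numpy as np

class StubClassifier:
    """Hypothetical classifier exposing a sklearn-style predict_proba."""
    def __init__(self, probs):
        self._probs = np.asarray(probs)
    def predict_proba(self, X):
        return self._probs

ensemble = [
    StubClassifier([[0.8, 0.2], [0.4, 0.6]]),
    StubClassifier([[0.6, 0.4], [0.2, 0.8]]),
]
X = np.zeros((2, 3))  # placeholder feature matrix

# Collect each classifier's positive-class probabilities, then average
all_probs = np.stack([clf.predict_proba(X)[:, 1] for clf in ensemble])
mean_probs = all_probs.mean(axis=0)
```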
-
caproj.utils.
print_report
(m, X_valid, y_valid, t=0.5, X_train=None, y_train=None, show_output=True)¶ Print a comprehensive classification report on both validation and training set
The metrics returned are AUC, F1, Precision, Recall and Confusion Matrix.
It accepts both single classifiers and ensembles.
Results are dependent on probability threshold applied to individual predictions.
caproj.trees¶
This module contains functions for generating and analyzing tree and tree-ensemble models and for visualizing the model results
Module variables:
sets default depths for comparison in cross validation |
|
sets cross-validation kfold parameter |
Module functions:
|
Generates adaboost staged scores in order to find ideal number of iterations |
|
Plots the adaboost staged scores for each y variable’s predictions and iteration |
|
Fits and generates tree classifier results, iterated for each input depth |
|
Fits and generates tree regressor results, iterated for each input depth |
|
plot the best depth finder for decision tree model |
|
Calculate decision tree results using a particular set of X features |
|
Iterate over all combinations of attributes to return lists of resulting models |
-
caproj.trees.
calc_meanstd_logistic
(X_tr, y_tr, X_te, y_te, depths: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], cv: int = 5)¶ Fits and generates tree classifier results, iterated for each input depth
- Parameters
X_tr (array-like) – Training data X values
y_tr (array-like) – Training data y values
X_te (array-like) – Test data X values
y_te (array-like) – Test data y values
depths (list, optional) – List of depths for each iterated decision tree classifier, defaults to depths
cv (int, optional) – Number of k-folds used for cross-validation, defaults to cv
- Returns
Five arrays are returned (1) mean cross-validation scores for each iteration, (2) standard deviation of each cross-validation score, (3) each training observation’s ROC AUC score, (4) each test observation’s ROC AUC score, (5) each fitted classifier’s model object
- Return type
tuple
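A hedged sketch of the per-depth cross-validation loop this function presumably performs, on synthetic data with scikit-learn (the dataset and depth range are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
depths = [1, 2, 3, 4, 5]

# For each candidate depth: cross-validate, record mean/std, keep the model
cv_means, cv_stds, models = [], [], []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    cv_means.append(scores.mean())
    cv_stds.append(scores.std())
    models.append(clf.fit(X, y))
```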
-
caproj.trees.
calc_meanstd_regression
(X_tr, y_tr, X_te, y_te, depths: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], cv: int = 5)¶ Fits and generates tree regressor results, iterated for each input depth
- Parameters
X_tr (array-like) – Training data X values
y_tr (array-like) – Training data y values
X_te (array-like) – Test data X values
y_te (array-like) – Test data y values
depths (list, optional) – List of depths for each iterated decision tree regressor, defaults to depths
cv (int, optional) – Number of k-folds used for cross-validation, defaults to cv
- Returns
Five arrays are returned (1) mean cross-validation scores for each iteration, (2) standard deviation of each cross-validation score, (3) each training observation’s \(R^2\) score, (4) each test observation’s \(R^2\) score, (5) each fitted regressor’s model object
- Return type
tuple
-
caproj.trees.
calc_models
(data_train, data_test, categories, nondescr_attrbutes, descr_attributes, responses_list, logistic=True)¶ Iterate over all combinations of attributes to return lists of resulting models
- Parameters
data_train (array-like) – Training dataset
data_test (array-like) – Test dataset
categories (list) – List of project categories as they appear in the data
nondescr_attrbutes (list) – Column names of all features excluding those engineered from project descriptions
descr_attributes (list) – Column names of features engineered using project descriptions
responses_list (list) – Column names of model responses (i.e. each different y variable)
logistic (bool, optional) – Indicates whether to use decision tree classifier (i.e.
logistic=True
) or regressor (i.e. logistic=False
), defaults to True
- Returns
Two list objects containing (1) lists of dictionaries of model results and (2) lists of fitted model dictionaries for each iterated model
- Return type
tuple
-
caproj.trees.
calculate
(data_train, data_test, categories, attributes: list, responses_list: list, logistic=True)¶ Calculate decision tree results using a particular set of X features
- Parameters
data_train (array-like) – Training dataset
data_test (array-like) – Test dataset
categories (list) – List of project categories as they appear in the data
attributes (list) – Column names of feature columns (i.e. each different X variable under consideration)
responses_list (list) – Column names of model responses (i.e. each different y variable)
logistic (bool, optional) – Indicates whether to use decision tree classifier (i.e.
logistic=True
) or regressor (i.e. logistic=False
), defaults to True
- Returns
Two lists containing (1) dictionaries of model results and (2) fitted model dictionaries, one dictionary for each response variable
- Return type
tuple
-
caproj.trees.
cv
= 5¶ sets cross-validation kfold parameter
-
caproj.trees.
depths
= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]¶ sets default depths for comparison in cross validation
-
caproj.trees.
generate_adaboost_staged_scores
(model_dict, X_train, X_test, y_train, y_test)¶ Generates adaboost staged scores in order to find ideal number of iterations
- Parameters
model_dict (dict) – Output fitted model dictionary generated using
caproj.model.generate_model_dict()
X_train (array-like) – Training data X values
X_test (array-like) – Test data X values
y_train (array-like) – Training data y values
y_test (array-like) – Test data y values
- Returns
tuple of 2D numpy arrays for adaboost staged scores at each iteration and each response variable, one array for training scores and one for test
- Return type
tuple
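The staged scores map directly onto scikit-learn's AdaBoost staged_score generator; a minimal single-response sketch on synthetic regression data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

# staged_score yields the R^2 score after each boosting iteration; the
# resulting curves help pick the ideal number of iterations
train_scores = np.array(list(model.staged_score(X_train, y_train)))
test_scores = np.array(list(model.staged_score(X_test, y_test)))
```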
-
caproj.trees.
plot_adaboost_staged_scores
(model_dict, X_train, X_test, y_train, y_test, height=4)¶ Plots the adaboost staged scores for each y variable’s predictions and iteration
- Parameters
model_dict (dict) – Output fitted model dictionary generated using
caproj.model.generate_model_dict()
X_train (array-like) – Training data X values
X_test (array-like) – Test data X values
y_train (array-like) – Training data y values
y_test (array-like) – Test data y values
height (int, optional) – Height dimension of resulting plot, defaults to 4
-
caproj.trees.
plot_me
(result)¶ Plots the best depth finder for a decision tree model
- Parameters
result (dict) – Dictionary returned from the
calculate()
function