import numpy as np


def get_multi_false_predictions(clf, X_data, y_data, n):
    """
    The get_multi_false_predictions function takes a classifier, the training data and labels,
    and an integer n. It refits the classifier on the full training data n times. After each fit,
    it predicts on the same data and finds which predictions are false positives and which are
    false negatives. The indices found in each iteration are appended to two running lists, one
    for false positives and one for false negatives. The lists are not deduplicated, so an index
    that is misclassified in several iterations appears once per iteration.

    :param clf: Unfitted classifier with fit and predict methods; it is refitted in every iteration.
    :param X_data: Feature data used for both fitting and prediction.
    :param y_data: True binary labels (0 or 1) of the data; cast to np.int32 for the comparison.
    :param n: Number of times to refit the classifier and collect false predictions.
    :return: Two lists of the indices of the false positives and false negatives. May contain duplicates.
    :doc-author: Julian M. Kleber
    """
    false_pos_indices = []
    false_neg_indices = []
    for _ in range(n):
        clf = clf.fit(X_data, y_data)
        false_pos, false_neg = get_false_predictions(
            fitted_clf=clf, X_data=X_data, y_true=y_data.astype(np.int32)
        )
        false_pos_indices += false_pos
        false_neg_indices += false_neg
    # False positives first, the same order as get_false_predictions.
    return false_pos_indices, false_neg_indices


def get_data_impurities(fitted_clf, X_data, y_true, n=6) -> list:
    """
    The get_data_impurities function takes a fitted classifier, the data and the true labels. It runs the
    prediction n times, collects the indices of the false positives and false negatives from every run,
    and returns the unique indices of each kind. The optional argument n defaults to 6 and sets the number
    of prediction runs, not the number of indices returned.

    :param fitted_clf: Fitted classifier that is used to make predictions.
    :param X_data: Feature data the classifier predicts on.
    :param y_true: True binary labels (0 or 1) of the data.
    :param n: Number of prediction runs. Defaults to 6.
    :return: A list of two lists: the unique false positive indices and the unique false negative indices.
    :doc-author: Julian M. Kleber
    """
    false_pos_col = []
    false_neg_col = []
    for _ in range(n):
        # get_false_predictions returns (false_pos, false_neg), in that order.
        false_pos, false_neg = get_false_predictions(
            fitted_clf=fitted_clf, X_data=X_data, y_true=y_true
        )
        # extend keeps both collections flat, so np.unique also works when runs differ in length.
        false_pos_col.extend(false_pos)
        false_neg_col.extend(false_neg)
    false_pos = np.unique(false_pos_col).tolist()
    false_neg = np.unique(false_neg_col).tolist()
    return [false_pos, false_neg]


def get_false_predictions(fitted_clf, X_data, y_true) -> tuple:
    """
    The get_false_predictions function returns a tuple of two lists. The first list contains the indices
    of the false positive predictions, and the second list contains the indices of the false negative
    predictions. Labels are expected to be binary, with 0 as the negative class and 1 as the positive class.

    :param fitted_clf: Fitted classifier with a predict method, for example a trained sklearn classifier.
    :param X_data: Numpy array containing the feature data for each example in the dataset.
    :param y_true: Numpy array containing the true label for each example in the dataset.
    :return: A tuple containing two lists: false_pos and false_neg.
    :doc-author: Julian M. Kleber
    """
    y_pred = fitted_clf.predict(X_data)
    # False positive: the true label is 0 but the prediction is 1.
    false_pos = np.where((y_true == 0) & (y_pred == 1))[0]
    # False negative: the true label is 1 but the prediction is 0.
    false_neg = np.where((y_true == 1) & (y_pred == 0))[0]
    return false_pos.tolist(), false_neg.tolist()
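

# Usage sketch (not part of the module): a minimal, hedged illustration of the
# mask logic in get_false_predictions on a toy label vector. The names
# FixedPredictionClassifier and false_prediction_indices are hypothetical and
# exist only for this example; the helper repeats the mask logic so the sketch
# runs on its own, and the stub stands in for any fitted classifier that has a
# predict method.

```python
import numpy as np


class FixedPredictionClassifier:
    """Hypothetical stand-in for a fitted classifier; returns preset labels."""

    def __init__(self, predictions):
        self.predictions = np.asarray(predictions)

    def predict(self, X_data):
        # Ignore the features and return the stored predictions.
        return self.predictions


def false_prediction_indices(fitted_clf, X_data, y_true):
    """Hypothetical helper repeating the mask logic of get_false_predictions."""
    y_true = np.asarray(y_true)
    y_pred = fitted_clf.predict(X_data)
    # True label 0, predicted 1: false positive.
    false_pos = np.where((y_true == 0) & (y_pred == 1))[0]
    # True label 1, predicted 0: false negative.
    false_neg = np.where((y_true == 1) & (y_pred == 0))[0]
    return false_pos.tolist(), false_neg.tolist()


y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1])
X_data = np.zeros((5, 2))  # placeholder features, unused by the stub

clf = FixedPredictionClassifier(y_pred)
false_pos, false_neg = false_prediction_indices(clf, X_data, y_true)
print(false_pos)  # [1, 4]
print(false_neg)  # [3]
```

# Samples 1 and 4 have true label 0 but are predicted as 1, so they are false
# positives; sample 3 has true label 1 but is predicted as 0, so it is the only
# false negative.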