chembee.actions package

Submodules

chembee.actions.applicability module

chembee.actions.applicability.get_applicability_domain(clf, X_train: ndarray, X_test: ndarray, y_train: ndarray, y_test: ndarray, interval=(0, 0.65, 0.05), similarity_metric='tanimoto', metric_evaluation='auc') dict[source]

The get_applicability_domain function takes a classifier together with training and test data and returns the applicability domain of the classifier. It also accepts optional arguments for the similarity-threshold interval (default (0, 0.65, 0.05), i.e. lower bound, upper bound, and step size), the similarity metric (default 'tanimoto'), and the evaluation metric (default 'auc'). The function returns a dictionary with the keys ‘threshold’, ‘similarity’, and ‘metric’. It depends on the pyADAqsar package, and further documentation can be found there (https://github.com/jeffrichardchemistry/pyADA).

Parameters
  • clf – The classifier whose applicability domain is evaluated.

  • X_train – The training data used to fit the model.

  • X_test – The test set used to construct the applicability domain.

  • y_train – The training labels the model is fitted on.

  • y_test – The test labels used to evaluate the performance of the classifier.

  • interval=(0, 0.65, 0.05) – Lower bound, upper bound, and step size of the similarity threshold.

  • similarity_metric="tanimoto" – The similarity metric used to compute the applicability domain.

  • metric_evaluation="auc" – The metric used to evaluate the applicability domain.

Returns

A dictionary with the keys ‘threshold’, ‘similarity’, and ‘metric’.

Doc-author

Julian M. Kleber
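
Example (a minimal sketch; the toy fingerprint data and the RandomForestClassifier are illustrative assumptions, not part of the chembee API):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from chembee.actions.applicability import get_applicability_domain

   # toy binary fingerprints; substitute real molecular fingerprints
   X_train = np.random.randint(0, 2, (80, 64))
   y_train = np.random.randint(0, 2, 80)
   X_test = np.random.randint(0, 2, (20, 64))
   y_test = np.random.randint(0, 2, 20)

   result = get_applicability_domain(
       RandomForestClassifier(),
       X_train, X_test, y_train, y_test,
       interval=(0, 0.65, 0.05),
       similarity_metric="tanimoto",
       metric_evaluation="auc",
   )
   print(result["threshold"], result["similarity"], result["metric"])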

chembee.actions.benchmark_algorithms module

chembee.actions.benchmark_algorithms.benchmark_algorithm(algorithms: list, X: ~numpy.ndarray, y: ~numpy.ndarray, plot_function: object, file_name: str = 'benchmark', prefix: str = 'benchmarks/', feature_names: list = ['Feature 1', 'Feature 2'], response_method: str = 'predict', to_fit=True) dict[source]

This function is not used in production at the moment; it serves as a draft for abstracting the functions above in order to refactor the code and make it more maintainable. The benchmark_algorithm function benchmarks the performance of the given algorithms. It takes a list of algorithms and returns a dictionary containing the metrics for each. The function is designed to be called within a loop, so that it can be used to compare multiple algorithms at once.

Parameters
  • algorithms:list – The algorithms to benchmark.

  • X:np.ndarray – The data used for training and testing.

  • y:np.ndarray – The target variable.

  • plot_function:object – A function that plots the results.

  • file_name:str="benchmark" – Name of the plot file.

  • prefix:str="benchmarks/" – Path under which the generated plots are stored.

  • feature_names:list=["Feature 1", "Feature 2"] – Names of the features in the dataset, used to label the plots.

  • response_method:str="predict" – Whether the models should predict the response or transform it.

  • to_fit=True – Whether to fit the model to the data before evaluation.

Returns

A dictionary of metrics for each algorithm.

Doc-author

Julian M. Kleber

chembee.actions.benchmark_algorithms.benchmark_algorithm_standard(algorithm: list, X: ~numpy.ndarray, y: ~numpy.ndarray, plot_function: object, file_name: str = 'benchmark', prefix: str = 'benchmarks/', feature_names: list = ['Feature 1', 'Feature 2'], response_method: str = 'predict', to_fit=True) dict[source]

The benchmark_algorithm_standard function takes a list of algorithms and a dataset (X and y) and returns a dictionary of metrics. The function is meant for standard estimators, not for cross-validation objects; for those, use benchmark_cv_algorithms. The function also plots the results.

Parameters
  • algorithm:list – The algorithms to benchmark.

  • X:np.ndarray – The data used for training and testing.

  • y:np.ndarray – The target variable.

  • plot_function:object – A function that plots the results.

  • file_name:str="benchmark" – Name of the file that will be saved.

  • prefix:str="benchmarks/" – Path where the plots will be saved.

  • feature_names:list=["Feature 1", "Feature 2"] – Names of the features, used to label the plots.

  • response_method:str="predict" – The method used to generate a response from the model.

  • to_fit=True – Whether to fit the model before predicting.

Returns

A dictionary of metrics.

Doc-author

Julian M. Kleber
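
Example (a sketch; the plain sklearn estimator and the no-op stub standing in for plot_function are assumptions, since the real chembee plotting functions are not shown in this section):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from chembee.actions.benchmark_algorithms import benchmark_algorithm_standard

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)

   def no_plot(*args, **kwargs):
       # stub; substitute one of chembee's plotting functions here
       pass

   metrics = benchmark_algorithm_standard(
       [RandomForestClassifier()], X, y, no_plot,
       feature_names=["Feature 1", "Feature 2"],
       to_fit=True,
   )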

chembee.actions.benchmark_algorithms.benchmark_cv_algorithms(algorithms: list, names: list, X: ~numpy.ndarray, y: ~numpy.ndarray, plot_function=<function plot_comparison_1>, file_name: str = 'after_cv_benchmark', prefix: str = 'plots/benchmarks/', feature_names: list = ['Feature 1', 'Feature 2'], response_method: str = 'predict', to_fit=True)[source]

The benchmark_cv_algorithms function takes a list of algorithms and fits them to the data. It then plots the results with the supplied plot_function (plot_comparison_1 by default).

Parameters
  • algorithms:list – The algorithms to use in the benchmark.

  • names:list – The names of the algorithms.

  • X:np.ndarray – The data used for training and testing.

  • y:np.ndarray – The target variable.

  • plot_function=plot_comparison_1 – A function that plots the results.

  • file_name:str="after_cv_benchmark" – Name of the output file.

  • prefix:str="plots/benchmarks/" – Path to the directory where all benchmark plots are saved.

  • feature_names:list=["Feature 1", "Feature 2"] – Names of the features, used to label the plots.

  • response_method:str="predict" – The method used to obtain the response from the model.

  • to_fit=True – Whether the model should be fitted before predicting or transforming.

Returns

A list of metrics for each model.

Doc-author

Julian M. Kleber
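
Example (a sketch; whether plain sklearn estimators can stand in for the chembee classifier wrappers is an assumption):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.neighbors import KNeighborsClassifier
   from chembee.actions.benchmark_algorithms import benchmark_cv_algorithms

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)

   # plot_function defaults to plot_comparison_1
   benchmark_cv_algorithms(
       algorithms=[RandomForestClassifier(), KNeighborsClassifier()],
       names=["RandomForest", "KNN"],
       X=X, y=y,
   )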

chembee.actions.benchmark_algorithms.benchmark_standard(X, y, feature_names=['Feature 1', 'Feature 2'], file_name='benchmark', prefix='plots/benchmarks', algorithms=[<class 'chembee.config.benchmark.svc.SVClassifier'>, <class 'chembee.config.benchmark.spectral_clustering.SpectralClusteringClassifier'>, <class 'chembee.config.benchmark.random_forest.RandomForestClassifier'>, <class 'chembee.config.benchmark.naive_bayes.NaiveBayesClassifier'>, <class 'chembee.config.benchmark.logistic_regression.LogisticRegressionClassifier'>, <class 'chembee.config.benchmark.linear_regression.LinearRegressionClassifier'>, <class 'chembee.config.benchmark.kmeans.KMeansClassifier'>, <class 'chembee.config.benchmark.knn.KNNClassifier'>, <class 'chembee.config.benchmark.mlp_classifier.NeuralNetworkClassifier'>, <class 'chembee.config.benchmark.restricted_bm.RBMClassifier'>], to_fit=True)[source]

chembee.actions.calibration module

chembee.actions.calibration.get_calibration_displays(X_train, y_train, X_test, y_test, colors, clf_list: list, ax_calibration_curve) dict[source]

The get_calibration_displays function fits each classifier in clf_list to the training data and plots a calibration curve for each one on ax_calibration_curve. It returns a dictionary of calibration displays.

Returns

A dictionary of calibration displays.

Doc-author

Julian M. Kleber

chembee.actions.calibration.screen_calibration(X_train, X_test, y_train, y_test, clf_list=[LogisticRegressionClassifierAlgorithm(multi_class='multinomial'), GaussianNBAlgorithm(), NaivelyCalibratedSVC(gamma=0.7), NaivelyCalibratedSVC(degree=5, kernel='poly'), RandomForestClassifierAlgorithm(), KNeighborsClassifierAlgorithm(), MLPClassifierAlgorithm(hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000), MLPClassifierAlgorithm(activation='tanh', hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000)], grid=(6, 2), file_name='calibration', prefix='plots/benchmarks')[source]

The screen_calibration function takes in a training and test set, as well as the name of the output file. It then plots calibration curves for each model in the list of models. The function returns nothing.

Parameters
  • X_train – The training data for the calibration curves.

  • X_test – The test data to plot.

  • y_train – The labels of the training data.

  • y_test – The true values of the test set.

  • clf_list – The classifiers to calibrate (defaults to the list shown in the signature).

  • grid=(6, 2) – Grid layout of the calibration plots.

  • file_name="calibration" – Name of the output file.

  • prefix="plots/benchmarks" – Location where all plots will be saved.

Doc-author

Julian M. Kleber
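
Example (a sketch using the default clf_list; the synthetic data is an illustrative assumption):

   import numpy as np
   from sklearn.model_selection import train_test_split
   from chembee.actions.calibration import screen_calibration

   X = np.random.rand(200, 2)
   y = np.random.randint(0, 2, 200)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

   # writes the calibration plots to plots/benchmarks by default
   screen_calibration(X_train, X_test, y_train, y_test, file_name="calibration")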

chembee.actions.classifier_fit module

chembee.actions.classifier_fit.clf_fit(clf, X_train, y_train, name)[source]

The clf_fit function fits the classifier to the training data. It serves the sole purpose of readability and maintainability.

Parameters
  • clf – The classifier to fit.

  • X_train – The training data passed to the classifier.

  • y_train – The training labels the model is fitted on.

  • name – Name of the classifier.

Returns

The fitted model.

Doc-author

Julian M. Kleber
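
Example (a minimal sketch; the SVC classifier and toy data are illustrative assumptions):

   import numpy as np
   from sklearn.svm import SVC
   from chembee.actions.classifier_fit import clf_fit

   X_train = np.random.rand(50, 2)
   y_train = np.random.randint(0, 2, 50)

   fitted = clf_fit(SVC(), X_train, y_train, name="svc")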

chembee.actions.clf_list module

chembee.actions.cross_validation module

chembee.actions.cross_validation.cross_validation_grid_search(scores: list, clf, X_train, X_test, y_train, y_test, refit)[source]

The cross_validation_grid_search function takes a list of scores, the classifier, the training data and labels, and the test data and labels. It performs a grid search over the hyperparameters for each score in scores, using cross-validation on the training set. It returns a dictionary with one key per score name; each value is another dictionary with the keys ‘best_params’ (a list of best parameters), ‘best_score’ (the best average cross-validation score), ‘train_accuracy’ (accuracy on the entire training set), and ‘test_accuracy’ (accuracy on the test set). The function also prints all results.

Parameters
  • scores:list – The metrics used to evaluate the model.

  • clf – The classifier used in the cross-validation.

  • X_train – The data used to train the algorithm.

  • X_test – The data used to test the model on unseen data.

  • y_train – The labels used to fit the model.

  • y_test – The labels used to calculate the accuracy of the model.

  • refit – The score used to choose the best model from the grid search.

Returns

The best parameters and the best estimator.

Doc-author

Trelent
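
Example (a sketch; the argument order follows the parameter list above, and the scikit-learn-style score names and refit value are assumptions):

   import numpy as np
   from sklearn.svm import SVC
   from chembee.actions.cross_validation import cross_validation_grid_search

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)
   X_train, X_test = X[:70], X[70:]
   y_train, y_test = y[:70], y[70:]

   results = cross_validation_grid_search(
       ["precision", "recall"], SVC(), X_train, X_test, y_train, y_test,
       refit="precision",  # score used to pick the best model
   )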

chembee.actions.cross_validation.stratified_n_fold(clf, X_data, y_data, n=5, cut_off_filter=None) dict[source]

The stratified_n_fold function takes a classifier, data, and labels, and returns the train and test accuracies of the classifier on each fold. The optional cut_off_filter parameter filters the folds by accuracy (see stratified_n_fold_filter).

Parameters
  • clf – Used to pass in the classifier that will be used.

  • X_data – Used to pass the data to be used for training and testing.

  • y_data – Used to specify the labels of the data.

  • n=5 – Used to specify the number of folds.

  • cut_off_filter=None – Used to filter the data.

Returns

A dictionary with two lists: accuracies_train and accuracies_test.

Doc-author

Julian M. Kleber
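
Example (a sketch; the key names follow the return description above):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from chembee.actions.cross_validation import stratified_n_fold

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)

   result = stratified_n_fold(RandomForestClassifier(), X, y, n=5)
   print(result["accuracies_train"], result["accuracies_test"])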

chembee.actions.cross_validation.stratified_n_fold_filter(clf, X_data, y_data, n=5, cut_off_filter=None) dict[source]

The stratified_n_fold_filter function takes in a classifier, training data and labels, and returns the accuracy of the classifier on each fold. Additionally, it returns a list of indices that were filtered out because their fold accuracy fell below the cut_off_filter value. Note: to stay SOLID and avoid convoluted code as well as performance issues, stratified_n_fold is essentially implemented twice, once with and once without a filter.

Parameters
  • clf – The classifier to use.

  • X_data – The data used for training and testing.

  • y_data – The labels of the data.

  • n=5 – The number of folds for the stratified k-fold split.

  • cut_off_filter=None – Filters out the indices of test folds whose accuracy is lower than cut_off_filter.

Returns

A dictionary with the per-fold accuracies and the filtered indices.

Doc-author

Julian M. Kleber

chembee.actions.evaluation module

chembee.actions.evaluation.calculate_metric(y_test, y_pred, metric)[source]

The calculate_metric function takes as input the true labels and predicted labels and returns the specified metric. The available metrics are: accuracy, precision, recall, fscore.

Parameters
  • y_test – Used to pass the actual values of y.

  • y_pred – Used to calculate the metric.

  • metric – Used to specify the function that should be used to calculate the metric.

Returns

The result of the metric function.

Doc-author

Julian M. Kleber
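
Example (a sketch assuming metric is any callable with the signature metric(y_true, y_pred); an sklearn metric function is used here for illustration):

   from sklearn.metrics import accuracy_score
   from chembee.actions.evaluation import calculate_metric

   y_test = [0, 1, 1, 0]
   y_pred = [0, 1, 0, 0]
   acc = calculate_metric(y_test, y_pred, metric=accuracy_score)  # 0.75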

chembee.actions.evaluation.calculate_metrics_classifier(clf, X_test, y_test, y_pred)[source]

The calculate_metrics_classifier function calculates the accuracy, average precision, precision, and ROC AUC of a classifier. It takes the fitted classifier, the test data X_test, and the label arrays y_test and y_pred, and returns a dictionary with the keys ‘accuracy’, ‘average_precision’, ‘precision’, and ‘roc_auc’.

Parameters
  • y_test – The true labels of the test data.

  • y_pred – The predicted labels used to calculate the metrics.

Returns

A dictionary with the following keys: accuracy, average_precision, precision, and roc_auc.

Doc-author

Julian M. Kleber

chembee.actions.evaluation.calculate_roc_curve(classifier, X_test, y_test)[source]

The calculate_roc_curve function calculates the false positive rate and true positive rate for a given classifier. It returns these values as numpy arrays.

Parameters
  • classifier – Used to pass the classifier object.

  • X_test – Used to test the model.

  • y_test – Used to calculate the true positive rate and false positive rate.

Returns

The false positive rate, true positive rate and thresholds.

Doc-author

Julian M. Kleber

chembee.actions.evaluation.f_score(precision, recall)[source]

The f_score function computes the harmonic mean of precision and recall.

Parameters
  • precision – The number of true positives divided by all positive predictions.

  • recall – The number of true positives divided by the number of positive values in the dataset.

Returns

The f-score of the precision and recall, a floating point value between 0 and 1.

Doc-author

Julian M. Kleber
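
For reference, the value computed is F = 2 * precision * recall / (precision + recall). A quick sketch:

   from chembee.actions.evaluation import f_score

   f = f_score(precision=0.8, recall=0.6)
   # 2 * 0.8 * 0.6 / (0.8 + 0.6) ≈ 0.686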

chembee.actions.evaluation.init_metrics_collection() dict[source]

The init_metrics_collection function initializes the metrics_collection dictionary. The metrics_collection dictionary is a collection of dictionaries, each of which contains a set of keys and values. The first level key is “scalar”, “array”, or “matrix”. The value for this key is another dictionary that contains the metric names as keys and their corresponding values as values.

Returns

A dictionary with three keys: ‘scalar’, ‘array’, ‘matrix’.

Doc-author

Julian M. Kleber

chembee.actions.evaluation.make_result(**kwargs)[source]

The make_result function takes in the accuracy, average precision, precision and roc_auc score of a model. It then returns a dictionary with these four values.

Parameters
  • acc – The accuracy of the model.

  • avg_prec – The average precision score of the model.

  • prec – The precision for each class.

  • rac – The ROC AUC (area under the curve) score.

Returns

A dictionary with four keys: accuracy, average_precision, precision, and roc_auc.

Doc-author

Trelent

chembee.actions.evaluation.parse_multi_output(metrics_collection)[source]

chembee.actions.evaluation.precision(true_pos, false_pos)[source]

The precision function takes two parameters: true_pos and false_pos. It returns the precision of the classifier, which is defined as the ratio of true positives (true_pos) to the sum of true positives and false positives (true_pos + false_pos).

Parameters
  • true_pos – The number of true positives.

  • false_pos – The number of false positives.

Returns

A value between 0 and 1.

Doc-author

Trelent

chembee.actions.evaluation.recall(true_pos, false_neg)[source]

The recall function takes two inputs, true positives and false negatives, and returns the recall of those values.

Parameters
  • true_pos – The number of true positives.

  • false_neg – The number of false negatives.

Returns

The ratio of true positives to the sum of true positives and false negatives.

Doc-author

Julian M. Kleber

chembee.actions.evaluation.screen_classifier_for_metrics(X_train, y_train, X_test, y_test, file_name='evaluation', prefix='plots/evaluation', clf_list: list = [LogisticRegressionClassifierAlgorithm(multi_class='multinomial'), GaussianNBAlgorithm(), NaivelyCalibratedSVC(gamma=0.7), NaivelyCalibratedSVC(degree=5, kernel='poly'), RandomForestClassifierAlgorithm(), KNeighborsClassifierAlgorithm(), MLPClassifierAlgorithm(hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000), MLPClassifierAlgorithm(activation='tanh', hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000)], to_fit=True) dict[source]

The screen_classifier_for_metrics function takes a list of classifiers and fits them to the training data. It then predicts the test data with each classifier, calculates metrics for each prediction, and stores all of this information in a dictionary, which it returns. Intended for use in a web application.

Parameters
  • X_train – The training data set.

  • y_train – The labels used to train the classifiers.

  • X_test – The data the classifiers are tested on.

  • y_test – The labels used to calculate the metrics.

  • file_name="evaluation" – Name of the file where all metrics will be saved.

  • prefix="plots/evaluation" – Path where the plots will be saved.

  • clf_list – The classifiers to evaluate (defaults to the list shown in the signature).

  • to_fit=True – Whether to fit the classifiers before prediction.

Returns

A dictionary of metrics for each classifier.

Doc-author

Julian M. Kleber
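
Example (a sketch with synthetic data, using the default clf_list):

   import numpy as np
   from chembee.actions.evaluation import screen_classifier_for_metrics

   X = np.random.rand(200, 2)
   y = np.random.randint(0, 2, 200)
   X_train, X_test = X[:140], X[140:]
   y_train, y_test = y[:140], y[140:]

   # returns one metrics entry per classifier in clf_list
   metrics = screen_classifier_for_metrics(X_train, y_train, X_test, y_test)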

chembee.actions.evaluation.screen_classifier_for_metrics_stratified(X_data, y_true, n, file_name='evaluation', prefix='plots/evaluation', clf_list: list = [LogisticRegressionClassifierAlgorithm(multi_class='multinomial'), GaussianNBAlgorithm(), NaivelyCalibratedSVC(gamma=0.7), NaivelyCalibratedSVC(degree=5, kernel='poly'), RandomForestClassifierAlgorithm(), KNeighborsClassifierAlgorithm(), MLPClassifierAlgorithm(hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000), MLPClassifierAlgorithm(activation='tanh', hidden_layer_sizes=(100, 20, 20, 100), max_iter=10000)]) dict[source]

The screen_classifier_for_metrics_stratified function takes a list of classifiers and fits them to the training data using stratified n-fold cross-validation. The goal is to obtain statistically more meaningful evaluations of a classifier and, at the same time, some sense of the distribution and purity of the dataset. The function predicts the test data with each classifier, calculates metrics for each prediction, and stores all of this information in a dictionary, which it returns. Intended for use in a web application.

Parameters
  • X_data – The full data set used for the stratified folds.

  • y_true – The true labels of the data.

  • n – The number of folds.

  • file_name="evaluation" – Name of the file where all metrics will be saved.

  • prefix="plots/evaluation" – Path where the plots will be saved.

  • clf_list – The classifiers to evaluate (defaults to the list shown in the signature).

Returns

A dictionary of metrics for each classifier.

Doc-author

Julian M. Kleber

chembee.actions.evaluation.specificity(true_neg, false_pos)[source]

The specificity function takes in the number of true negatives and false positives, and returns the specificity score for a given classifier.

Parameters
  • true_neg – The number of true negatives.

  • false_pos – The number of false positives.

Returns

The proportion of negatives that are correctly identified as such (the true negative rate).

Doc-author

Trelent
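
Taken together, the scalar helpers in this module compute the standard confusion-matrix ratios. A worked sketch:

   from chembee.actions.evaluation import precision, recall, specificity, f_score

   tp, fp, tn, fn = 40, 10, 45, 5
   p = precision(tp, fp)    # 40 / (40 + 10) = 0.8
   r = recall(tp, fn)       # 40 / (40 + 5)  ≈ 0.889
   s = specificity(tn, fp)  # 45 / (45 + 10) ≈ 0.818
   f = f_score(p, r)        # harmonic mean  ≈ 0.842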

chembee.actions.feature_extraction module

chembee.actions.feature_extraction.filter_importance_by_std(result_json, cut_off=0.01)[source]

The filter_importance_by_std function takes a JSON object containing feature names, indices, importances and standard deviations and filters out features with low importance. The cut_off parameter is used to determine the minimum importance level for a feature to be included in the output. For example, if cut_off = 0.01 then only features with an importance greater than 1% will be included in the output.

Parameters
  • result_json – The result dictionary produced by get_feature_importances.

  • cut_off=0.01 – Used to filter out the features with low importance.

Returns

A dictionary with the filtered feature names, indices, importances, and standard deviations.

Doc-author

Julian M. Kleber

chembee.actions.feature_extraction.get_feature_importances(X_data: ndarray, y_data: ndarray, feature_names: list) dict[source]
The get_feature_importances function accepts three arguments:
  1. X_data - A numpy array of the features in the dataset

  2. y_data - A numpy array of the labels in the dataset

  3. feature_names - An optional list containing names for each feature

The function returns a dictionary with four keys:

  1. “feature_names” which contains all of the feature names passed to this function, and

  2. “importances”, which is another dictionary where each key is a column name from the X data, and its value is that column’s importance as determined by sklearn’s RandomForestClassifier algorithm.

  3. “std”, which is the standard deviation of the feature importance for each feature,

  4. “feature_indices”, which is a list of the indices of the respective feature names

Parameters
  • X_data:np.ndarray – Used to pass the data.

  • y_data:np.ndarray – Used to pass the target variable.

  • feature_names:list – Used to get the names of the features.

Returns

A dictionary with the keys described above.

Doc-author

Julian M. Kleber
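
Example (a sketch chaining get_feature_importances with filter_importance_by_std from above; the synthetic data is an illustrative assumption):

   import numpy as np
   from chembee.actions.feature_extraction import (
       filter_importance_by_std,
       get_feature_importances,
   )

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)

   result = get_feature_importances(X, y, feature_names=["Feature 1", "Feature 2"])
   filtered = filter_importance_by_std(result, cut_off=0.01)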

chembee.actions.get_false_predictions module

chembee.actions.get_false_predictions.get_data_impurities(fitted_clf, X_data, y_true, n=6) list[source]

The get_data_impurities function takes a fitted classifier, the data, and the true labels. It returns two lists of column names: the first list contains the false positive columns, and the second the false negative columns. The function also takes an optional argument n, which defaults to 6; if there are more than n impurities in either category (false positives or negatives), only the first n will be returned.

Parameters
  • fitted_clf – The fitted classifier used to make predictions.

  • X_data – The data used to train and test the model.

  • y_true – The true labels of the data.

  • n=6 – The number of false positive and false negative values to be returned.

Returns

A list of two lists.

Doc-author

Julian M. Kleber

chembee.actions.get_false_predictions.get_false_predictions(fitted_clf, X_data, y_true) list[source]

The get_false_predictions function returns a tuple of two lists. The first list contains the indices of false positive predictions, and the second list contains the indices of false negative predictions.

Parameters
  • fitted_clf – A trained classifier object.

  • X_data – A numpy array containing the feature data for each example in the dataset.

  • y_true – A numpy array containing the true labels for each example in the dataset.

Returns

A tuple containing two lists: false_pos and false_neg.

Doc-author

Julian M. Kleber
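
Example (a sketch; the classifier must already be fitted, and the sklearn estimator is an illustrative assumption):

   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from chembee.actions.get_false_predictions import get_false_predictions

   X = np.random.rand(100, 2)
   y = np.random.randint(0, 2, 100)
   clf = RandomForestClassifier().fit(X, y)

   false_pos, false_neg = get_false_predictions(clf, X, y)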

chembee.actions.get_false_predictions.get_multi_false_predictions(clf, X_data, y_data, n)[source]

The get_multi_false_predictions function takes in a classifier, the training data and labels, and an integer n. It runs the classification on the training data n times; for each iteration it records which predictions were false positives and which were false negatives, appending their indices to two lists. Because several iterations may flag the same index as misclassified, the returned lists may contain duplicates.

Parameters
  • clf – The classifier to fit.

  • X_data – The feature data passed to the function.

  • y_data – The true labels of the data.

  • n – The number of classification runs.

Returns

Two lists of the indices of the false positives and false negatives. May contain duplicates.

Doc-author

Julian M. Kleber

chembee.actions.get_undersampled_data module

chembee.actions.save_model module

chembee.actions.save_model.save_model(clf, file_name: str, prefix: str, ending='.bee') dict[source]

chembee.actions.search module

chembee.actions.search.get_similar_compounds(compounds_of_interest: ndarray, compounds: ndarray, distance='tanimoto', return_sim_matrix=True) dict[source]

The get_similar_compounds function takes a set of query compounds and returns the most similar compounds from a reference set. The similarity is calculated with a user-defined distance metric between two compound vectors. The function works on arbitrary data and features; it could also take vectorized fingerprints. Note that the function may return compounds identical to the search compounds; you have to exclude them manually. The function is experimental, and if you feel that handling the exclusion of identical compounds inside this function is necessary, please file an issue or a feature request.

Parameters
  • compounds_of_interest – The compounds for which similar compounds should be found.

  • compounds – The pool of compounds to search.

  • distance="tanimoto" – The distance metric used to compute the similarity.

  • return_sim_matrix=True – Whether to also return the complete similarity matrix.

Returns

A dictionary containing the jsonified result, ready to use in web technologies like MongoDB, Flask, React, etc., and another dictionary containing the complete similarity matrix.

Doc-author

Julian M. Kleber
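
Example (a sketch with toy binary fingerprints; unpacking the result into two dictionaries follows the return description above):

   import numpy as np
   from chembee.actions.search import get_similar_compounds

   queries = np.random.randint(0, 2, (3, 64))   # compounds of interest
   library = np.random.randint(0, 2, (50, 64))  # pool to search

   result, sim_matrix = get_similar_compounds(
       queries, library, distance="tanimoto", return_sim_matrix=True
   )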

chembee.actions.search.get_similar_compounds_structure(compounds_of_interest, compounds, n=10, distance='tanimoto') dict[source]

The get_similar_compounds_structure function takes a list of compounds and returns the most similar compounds based on structure. The function takes four arguments:

  1. A list of compound names (compounds_of_interest),

  2. A dataframe containing all the other compounds (compounds),

  3. An integer n that specifies how many similar compounds to return (default 10), and

  4. A distance metric (distance), "tanimoto" by default.

The function returns a dictionary with three keys: “InChi”, “number” and “similarity”. Number corresponds to the index number for each compound, while similarity contains their similarity score.

Parameters
  • compounds_of_interest – The compounds for which similar compounds are to be found.

  • compounds – The structures of all compounds in the database.

  • n=10 – How many similar compounds to return.

  • distance="tanimoto" – The similarity metric to be used.

Returns

A dictionary with the keys “InChi”, “number”, and “similarity”.

Doc-author

Julian M. Kleber

chembee.actions.search.get_top_n_similar_compounds(similarities: list, n: int = 10, labels=None)[source]

The get_top_n_similar_compounds function takes a list of similarities, sorts it, and returns the top n most similar compounds.

Parameters
  • similarities:list – The list of similarities for each compound.

  • n:int=10 – The number of similar compounds to return.

Returns

The top n similar compounds for a given compound, for each similarity list

Doc-author

Julian M. Kleber

chembee.actions.search.screen_fingerprints_against_data(to_screen, base) dict[source]

The screen_fingerprints_against_data function takes two arguments: a list of fingerprints to screen, and a list of fingerprints against which the first argument will be screened. It returns a dictionary with one entry per fingerprint in the first argument, holding that fingerprint and its similarity scores against each molecule in the second argument. For example:

>>> screen_fingerprints_against_data([FP11, FP12], [FP21, FP22])

{1: {similarityScores: [0.24, 0.6], fingerPrint: FP11}, 2: {similarityScores: [0.1, 0.42], fingerPrint: FP12}}

Parameters
  • to_screen – Used to specify the fingerprints to be screened against the base.

  • base – Used to define the set of molecules to compare against.

Returns

A dictionary with the similarity scores of the fingerprints in to_screen against all of the fingerprints in base.

Doc-author

Julian M. Kleber

Module contents