chembee.datasets package

Submodules

chembee.datasets.BioDegDataSet module

class chembee.datasets.BioDegDataSet.BioDegDataSet(data_set_path, target, split_ratio=0.7)[source]

Bases: DataSet

clean_data(data)[source]

The clean_data function takes in a dataframe and cleans it by removing the SMILES, CASRN, ID and Dataset columns. It also converts all of the dtypes to float64 or int64. It returns a cleaned dataframe.

Parameters

self – Used to reference the class itself.
data – Used to pass the data that is to be cleaned.

Returns

A dataframe with the columns “smiles”, “dataset”, “casrn” and “id” dropped, and all other columns converted to numeric type.

Doc-author

Trelent
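
The cleaning step described above can be approximated with plain pandas. The following is a minimal sketch, not chembee's actual implementation: the helper name clean_data_sketch, the constant NON_FEATURE_COLUMNS and the sample rows are hypothetical, and the column spellings are assumed from the description.

```python
import pandas as pd

# Column names assumed from the description above; the real data set may
# spell or capitalize them differently.
NON_FEATURE_COLUMNS = ["SMILES", "CASRN", "ID", "Dataset"]


def clean_data_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier columns and convert the remaining ones to numbers."""
    # errors="ignore" skips any listed column that is not present.
    cleaned = data.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")
    # pd.to_numeric raises if a value cannot be parsed as a number.
    return cleaned.apply(pd.to_numeric)


# Made-up example rows, for illustration only.
raw = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1"],
        "CASRN": ["64-17-5", "71-43-2"],
        "ID": [1, 2],
        "Dataset": ["train", "train"],
        "logP": ["-0.31", "2.13"],
        "Class": ["1", "0"],
    }
)
cleaned = clean_data_sketch(raw)
```

Only the logP and Class columns remain in cleaned, and both hold numeric values instead of strings.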

get_feature_names(data, target)[source]

The get_feature_names function takes in a dataframe and returns the feature names. It does this by iterating through the columns of the dataframe and appending them to a list. The function then returns that list.

Parameters

self – Used to reference the class itself.
data – Used to pass the dataframe from which the feature names are taken.
target – Used to specify the target column, which is dropped from the dataframe before the feature names are collected.

Returns

The names of the features.

Doc-author

Trelent
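
As an illustration of the behaviour described above, the sketch below collects every column name except the target. It assumes that target is the name of the label column, which the documentation does not state explicitly; the helper feature_names_sketch and the sample frame are hypothetical.

```python
import pandas as pd


def feature_names_sketch(data: pd.DataFrame, target: str) -> list:
    """Return all column names except the target column."""
    # Assumption: `target` is the name of the label column.
    return [column for column in data.columns if column != target]


# Made-up example frame, for illustration only.
frame = pd.DataFrame({"logP": [0.1], "MolWt": [46.07], "Class": [1]})
names = feature_names_sketch(frame, target="Class")
```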

load_data_set(file_name: str)[source]

The load_data_set function loads the data from an SDF file and returns it as a pandas DataFrame.

Parameters: file_name:str – Used to specify the location of the SDF file.
Returns: A pandas DataFrame.
Doc-author: Trelent

load_data_set_from_csv(file_name: str) → DataFrame[source]

The load_data_set_from_csv function loads a csv file into a pandas dataframe. Its only purpose is to improve readability. The function takes one argument, the name of the csv file to be loaded, and returns a pandas dataframe containing all of the information in that file.

Parameters

self – Used to access variables that belong to the class.
file_name:str – Used to specify the name of the csv file to load data from.

Returns

A pandas data frame.

Doc-author

Julian M. Kleber

make_train_test_split(data, split_ratio: float, y_col: str, shuffle=True)[source]

The make_train_test_split function splits the data into a training set and test set. The split_ratio parameter determines how much of the data is used for training, and how much is used for testing. The y_col parameter specifies which column in the dataset contains labels (y). The shuffle parameter allows you to specify whether or not to shuffle your data before splitting it into train/test sets.

Parameters

self – Used to reference the class itself.
data – Used to specify the dataframe that is split.
split_ratio:float – Used to determine the ratio of data that will be used for training.
y_col:str – Used to specify the column name of the dependent variable.
shuffle=True – Used to shuffle the data before splitting it into training and testing sets.

Returns

The X_train, X_test, y_train and y_test dataframes.

Doc-author

Julian M. Kleber
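
The split described above can be sketched with pandas alone. This is an illustration under stated assumptions, not the library's code: the helper name train_test_split_sketch and the seed parameter are hypothetical, and split_ratio is taken to be the training fraction, as the parameter description says.

```python
import pandas as pd


def train_test_split_sketch(data, split_ratio, y_col, shuffle=True, seed=0):
    """Split a dataframe into X_train, X_test, y_train and y_test."""
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must lie strictly between 0 and 1")
    if shuffle:
        # sample(frac=1.0) returns all rows in random order; `seed` is a
        # hypothetical extra argument that makes the shuffle reproducible.
        data = data.sample(frac=1.0, random_state=seed)
    n_train = round(len(data) * split_ratio)
    features = data.drop(columns=[y_col])
    labels = data[y_col]
    return (
        features.iloc[:n_train],
        features.iloc[n_train:],
        labels.iloc[:n_train],
        labels.iloc[n_train:],
    )


# Made-up example frame with ten rows, for illustration only.
frame = pd.DataFrame({"x1": range(10), "x2": range(10, 20), "y": [0, 1] * 5})
X_train, X_test, y_train, y_test = train_test_split_sketch(frame, 0.7, "y")
```

With ten rows and a split_ratio of 0.7, seven rows go to the training set and three to the test set. Features and labels are sliced from the same shuffled frame, so their row indices stay aligned.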

name = 'biodeg'

save_data_csv(data: DataFrame, file_name)[source]

The save_data_csv function saves the data of the instance to a csv file.

Parameters

self – Used to access the attributes and methods of the class.
data:pd.DataFrame – Used to pass the data that is saved.
file_name – Used to specify the name of the file to be saved.
prefix – Used to add a prefix to the file name.

Returns

The name of the file where the data was saved.

Doc-author

Julian M. Kleber
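
A CSV round trip in the spirit of save_data_csv and load_data_set_from_csv might look like the sketch below. The helpers save_csv_sketch and load_csv_sketch are hypothetical stand-ins built on pandas, and writing without the index column is an assumption; the library's own methods may handle paths and prefixes differently.

```python
import os
import tempfile

import pandas as pd


def save_csv_sketch(data: pd.DataFrame, file_name: str) -> str:
    """Write the dataframe to a csv file and return the file name."""
    # Assumption: the row index is not written to the file.
    data.to_csv(file_name, index=False)
    return file_name


def load_csv_sketch(file_name: str) -> pd.DataFrame:
    """Read a csv file into a pandas dataframe."""
    return pd.read_csv(file_name)


# Made-up example frame, for illustration only.
frame = pd.DataFrame({"logP": [0.5, 2.25], "Class": [1, 0]})
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_csv_sketch(frame, os.path.join(tmp_dir, "biodeg.csv"))
    restored = load_csv_sketch(path)
```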

save_data_sdf(data, file_name, prefix, molColName='Molecule')[source]

The save_data_sdf function saves the data in a pandas DataFrame to an sdf file.

Parameters

self – Used to access the attributes and methods of the class in a method.
data – Used to specify the dataframe that is to be saved.
file_name – Used to specify the name of the file to which data is saved.
prefix – Used to add a prefix to the file name.
molColName="Molecule" – Used to specify the name of the molecule column in the sdf file.

Returns

A dataframe with the same number of rows as the input data and one column named “romol” containing a mol object.

Doc-author

Julian M. Kleber

chembee.datasets.BreastCancer module

class chembee.datasets.BreastCancer.BreastCancerDataset(split_ratio)[source]

Bases: DataSet

load_data_set()[source]

The load_data_set function loads the data from the csv file and creates a list of lists. The function also removes any rows with missing values, as well as any columns that have all zeros. The function returns a tuple containing two elements: (data_set, target)

Parameters: self – Used to reference the class object.
Returns: The cancer_data dataframe.
Doc-author: Trelent

make_train_test_split(data, split_ratio, shuffle=True)[source]

The make_train_test_split function splits the data into training and testing sets.

Parameters

self – Used to reference the class instance.
data (object) – Used to pass the data set to be split.
split_ratio (float) – Used to determine the ratio of training samples to the total number of samples in the data set.
shuffle (bool, optional) – Used to shuffle the data before splitting it into train and test sets. Defaults to True.

Returns

The following:.

Doc-author

Trelent

name = 'breast-cancer'

save_data_npy(data, file_name, prefix=None)[source]

chembee.datasets.ChemicalDataSet module

chembee.datasets.DataSet module

class chembee.datasets.DataSet.DataSet[source]

Bases: object

data = None

get_split()[source]

load_data_set()[source]

make_train_test_split(data_set, split_ratio)[source]

name = None

save_data_csv(file_name)[source]

chembee.datasets.IrisDataSet module

class chembee.datasets.IrisDataSet.IrisDataSet(split_ratio)[source]

Bases: BreastCancerDataset

load_data_set()[source]

The load_data_set function loads the data from the csv file and creates a list of lists. The function also removes any rows with missing values, as well as any columns that have all zeros. The function returns a tuple containing two elements: (data_set, target)

Parameters: self – Used to reference the class object.
Returns: The cancer_data dataframe.
Doc-author: Trelent

name = 'iris-dataset'
