sktutor package

Submodules

sktutor.preprocessing module

class sktutor.preprocessing.BitwiseOperator(operator, mapper)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Apply a bitwise operator & or | to a list of columns.

Parameters:
  • mapper (dict) – A mapping from new columns which will be defined by applying the bitwise operator to a list of old columns
  • operator (str) – the name of the bitwise operator to apply. ‘and’, ‘or’ are acceptable inputs

mapper takes the form:

{'new_column1': ['old_column1', 'old_column2', 'old_column3'],
 'new_column2': ['old_column2', 'old_column4', 'old_column5']
 }
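As a rough illustration of what the mapper describes, the sketch below combines 0/1 flag columns with `&` or `|`, using plain Python lists in place of DataFrame columns. The helper `apply_bitwise` and the sample data are hypothetical and are not part of sktutor.

```python
from functools import reduce
import operator

# Hypothetical stand-in for a DataFrame: column name -> list of 0/1 flags.
data = {
    "old_column1": [1, 0, 1],
    "old_column2": [1, 1, 0],
    "old_column3": [1, 0, 0],
}


def apply_bitwise(columns, mapper, op_name):
    """Return a copy of columns with one new column per mapper entry."""
    op = {"and": operator.and_, "or": operator.or_}[op_name]
    out = dict(columns)
    for new_col, old_cols in mapper.items():
        # Walk the old columns row by row and fold each row with the operator.
        rows = zip(*(columns[c] for c in old_cols))
        out[new_col] = [reduce(op, row) for row in rows]
    return out


mapper = {"new_column1": ["old_column1", "old_column2", "old_column3"]}
anded = apply_bitwise(data, mapper, "and")
ored = apply_bitwise(data, mapper, "or")
```

Following the signature above, the transformer itself would be constructed as `BitwiseOperator('and', mapper)`.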
fit(X, y=None, **fit_params)[source]

Fit the transformer on X. Checks that all columns are in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Apply the bitwise operator to the columns listed in mapper and add the resulting new columns to X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with the new columns added.
class sktutor.preprocessing.BoxCoxTransformer(adder=0)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Create BoxCox Transformations on all columns.

Parameters:adder (numeric) – the amount to add to each column before the BoxCox transformation
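The role of adder can be sketched with the Box-Cox formula itself: the value is added to the data first, so that zeros or small negatives become strictly positive before the power transform. The function below is a hypothetical helper that takes lambda as an explicit argument. How sktutor chooses lambda is not stated here, so treat that part as an assumption.

```python
import math


def boxcox_fixed_lambda(values, lmbda, adder=0):
    """Box-Cox transform with a caller-supplied lambda (assumption)."""
    shifted = [v + adder for v in values]
    if any(s <= 0 for s in shifted):
        raise ValueError("Box-Cox needs strictly positive data after shifting")
    if lmbda == 0:
        # The lambda = 0 case is defined as the natural logarithm.
        return [math.log(s) for s in shifted]
    return [(s ** lmbda - 1) / lmbda for s in shifted]


# The raw data contains a zero, so it is shifted by adder=1 first.
shifted_result = boxcox_fixed_lambda([0, 1, 3], lmbda=0.5, adder=1)
log_result = boxcox_fixed_lambda([0, 1, 3], lmbda=0, adder=1)
```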
fit(X, y=None, **fit_params)[source]

Fit the transformer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
fit_transform(X, y=None, **fit_params)[source]

Fit the transformer on X, then transform X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with transformed columns.
transform(X, **transform_params)[source]

Apply the BoxCox transformation to each column of X after adding adder.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with transformed columns.
class sktutor.preprocessing.ColumnDropper(col)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop a list of columns from a DataFrame.

Parameters:col (list of strings) – A list of columns to drop from the DataFrame
fit(X, y=None, **fit_params)[source]

Fit the dropper on X. Checks that all columns are in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Drop the specified columns in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame without specified columns.
class sktutor.preprocessing.ColumnExtractor(col)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Extract a list of columns from a DataFrame.

Parameters:col (list of strings) – A list of columns to extract from the DataFrame
fit(X, y=None, **fit_params)[source]

Fit the extractor on X. Checks that all columns are in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Extract the specified columns in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with specified columns.
class sktutor.preprocessing.ColumnNameCleaner[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replaces spaces and formula symbols in column names that conflict with patsy formula interpretation

fit(X, y=None, **fit_params)[source]

Fit the transformer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Transform X with clean column names for patsy

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with cleaned column names.
class sktutor.preprocessing.ColumnValidator[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Ensure that the transformed dataset has the same columns and order as the original fit dataset. Could be useful to check at the beginning and end of pipelines.
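The check-and-reorder idea can be sketched in a few lines of plain Python, with a dict standing in for a row of a DataFrame. The class name `ColumnOrderCheck` is hypothetical, and raising `ValueError` on a mismatch is an assumption about how such a check might fail, not a statement about sktutor.

```python
class ColumnOrderCheck:
    """Hypothetical pure-Python analogue; a row is a dict keyed by column."""

    def fit(self, columns):
        # Remember the column names and their order from the fitting data.
        self.columns_ = list(columns)
        return self

    def transform(self, row):
        if set(row) != set(self.columns_):
            # Assumption: a column mismatch is treated as an error.
            raise ValueError("columns differ from the fitting data")
        # Rebuild the row in the order seen during fit.
        return {c: row[c] for c in self.columns_}


check = ColumnOrderCheck().fit(["a", "b", "c"])
reordered = check.transform({"c": 3, "a": 1, "b": 2})
```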

fit(X, y=None, **fit_params)[source]

Fit the validator on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Checks whether a dataset to transform has the same columns as the fitting dataset, and returns X with columns in the same order as the dataset in fit.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with specified columns.
class sktutor.preprocessing.ContinuousFeatureBinner(field, bins, right_inclusive=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates bins for continuous features

Parameters:
  • field (string) – the continuous field for which to create bins
  • bins (array-like) – The criteria to bin by.
  • right_inclusive (bool) – interval should be right-inclusive or not
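The effect of right_inclusive is easiest to see on values that fall exactly on a bin edge. The sketch below uses the standard library's `bisect` module on a sorted list of edges. The helper `bin_label`, the label format, and returning `None` for out-of-range values are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def bin_label(value, bins, right_inclusive=True):
    """Label the interval of sorted edges `bins` that contains value."""
    # bisect_left puts an edge value in the interval ending at that edge,
    # bisect_right puts it in the interval starting at that edge.
    i = bisect_left(bins, value) if right_inclusive else bisect_right(bins, value)
    if i == 0 or i == len(bins):
        return None  # value lies outside every interval
    lo, hi = bins[i - 1], bins[i]
    return f"({lo}, {hi}]" if right_inclusive else f"[{lo}, {hi})"


ages = [0, 5, 10, 20, 25]
bins = [0, 10, 20]
right = [bin_label(a, bins) for a in ages]
left = [bin_label(a, bins, right_inclusive=False) for a in ages]
```

With these edges, the value 10 lands in `(0, 10]` when intervals are right-inclusive and in `[10, 20)` otherwise.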
fit(X, y=None)[source]

Fit the ContinuousFeatureBinner on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Transform X on field, adding a new column with _GRP appended.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with the new _GRP column added.
class sktutor.preprocessing.DummyCreator(**kwargs)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Create dummy variables from categorical variables.

Parameters:
  • dummy_na (boolean) – Add a column to indicate NaNs, if False NaNs are ignored.
  • drop_first (boolean) – Whether to get k-1 dummies out of k categorical levels by removing the first level.
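A minimal sketch of how the two options interact, using plain lists with `None` standing in for NaN. The helpers `fit_levels` and `make_dummies` and the `column_level` naming are hypothetical. The point is that the set of dummy columns is fixed from the fitting data, so a level first seen later produces no new column.

```python
def fit_levels(values, dummy_na=False, drop_first=False):
    """Record the levels seen in the fitting data."""
    levels = sorted({v for v in values if v is not None})
    if dummy_na:
        levels.append(None)  # extra indicator for missing values
    return levels[1:] if drop_first else levels


def make_dummies(name, values, levels):
    """Build one 0/1 column per recorded level."""
    return {f"{name}_{lvl}": [int(v == lvl) for v in values] for lvl in levels}


train = ["red", "blue", None, "red"]
levels = fit_levels(train, dummy_na=True, drop_first=True)

# "green" was not seen during fitting, so it is all zeros in every column.
dummies = make_dummies("color", ["blue", "green", None], levels)
```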
fit(X, y=None, **fit_params)[source]

Fit the dummy creator on X. Retains a record of columns produced with the fitting data.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
fit_transform(X, y=None, **fit_params)[source]

Fit the dummy creator on X, then transform X. Same as calling self.fit().transform(), but more convenient and efficient.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with dummy variables.
transform(X, **transform_params)[source]

Create dummies for the columns in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with dummy variables.
class sktutor.preprocessing.FactorLimiter(factors_per_column=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

For each named column, limits factors to a list of acceptable values. Non-conforming factors, including missing values, are replaced by a default value.

Parameters:factors_per_column (dictionary) – dictionary mapping column name keys to a dictionary with a list of acceptable factor values and a default factor value for non-conforming values

factors_per_column takes the form:

{'column_name': {'factors': ['value1', 'value2', 'value3'],
                 'default': 'value1'}
 }
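The replacement rule for a single column can be sketched as follows, with `None` standing in for a missing value. The helper `limit_factors` is hypothetical. It only illustrates the behavior described above.

```python
def limit_factors(values, factors, default):
    """Replace anything outside `factors` (including None) with default."""
    allowed = set(factors)
    return [v if v in allowed else default for v in values]


spec = {"factors": ["value1", "value2", "value3"], "default": "value1"}
limited = limit_factors(["value2", "other", None, "value3"], **spec)
```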
fit(X, y=None)[source]

Fit the factor limiter on X. Checks that all columns in factors_per_column are present in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Limit the factors in X with the values in factors_per_column.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with factors limited to the specifications.
class sktutor.preprocessing.GenericTransformer(function, params=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Generic transformer that applies a user-defined function within the pipeline framework. The callable should only make transformations and should not store any fit() parameters. Lambda functions are not supported because they cannot be pickled.

Parameters:
  • function (callable) – arbitrary function to use as a transformer
  • params (dict) – dict with function parameter name as key and parameter value as value
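The sketch below shows the general shape of such a wrapper and why lambdas are a problem. `FunctionStep` is a hypothetical stand-in, and calling the function as `function(X, **params)` is an assumption about how params is used. The pickling check relies only on the standard `pickle` module.

```python
import pickle


class FunctionStep:
    """Hypothetical stand-in: a stateless step that wraps a callable."""

    def __init__(self, function, params=None):
        self.function = function
        self.params = params or {}

    def fit(self, X, y=None):
        return self  # nothing is learned from the data

    def transform(self, X):
        # Assumption: params are passed to the function as keyword arguments.
        return self.function(X, **self.params)


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


step = FunctionStep(sorted, params={"reverse": True})
result = step.transform([2, 3, 1])

named_ok = is_picklable(sorted)          # importable by name
lambda_ok = is_picklable(lambda X: X)    # anonymous, cannot be looked up
```

A named, importable function can be serialized by reference, while a lambda has no importable name, which is why a fitted pipeline containing one cannot be saved with pickle.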
fit(X, y=None, **fit_params)[source]
transform(X, **transform_params)[source]
class sktutor.preprocessing.GroupByImputer(impute_type, group=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Imputes missing values by group with a specified function. If a group parameter is given, impute_type can be the name of any function that can be passed to the agg function of a pandas GroupBy object. If a group parameter is not given, then only ‘mean’, ‘median’, and ‘most_frequent’ can be used.

Parameters:
  • impute_type (string) – The type of imputation to be performed.
  • group (string or list of strings) – The column name or a list of column names to group the pandas DataFrame.
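Group-wise mean imputation can be sketched in plain Python, with rows as dicts and `None` standing in for a missing value. The helper `impute_group_mean` is hypothetical. A missing value in a group with no observed values stays missing, which is one reading of what "eligible" means in the method descriptions.

```python
from collections import defaultdict
from statistics import mean


def impute_group_mean(rows, group, field):
    """Fill missing `field` values with the mean of their group."""
    observed = defaultdict(list)
    for row in rows:
        if row[field] is not None:
            observed[row[group]].append(row[field])
    fill = {g: mean(vals) for g, vals in observed.items()}
    return [
        {**row, field: fill.get(row[group])} if row[field] is None else dict(row)
        for row in rows
    ]


rows = [
    {"team": "a", "score": 1.0},
    {"team": "a", "score": 3.0},
    {"team": "a", "score": None},
    {"team": "b", "score": 10.0},
    {"team": "b", "score": None},
    {"team": "c", "score": None},  # no observed values in this group
]
imputed = impute_group_mean(rows, "team", "score")
scores = [row["score"] for row in imputed]
```

Following the signature above, the equivalent transformer would be constructed as `GroupByImputer('mean', group='team')`.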
fit(X, y=None)[source]

Fit the imputer on X

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Impute the eligible missing values in X

Parameters:X (pandas DataFrame) – The input data with missing values to be imputed.
Return type:A DataFrame with eligible missing values imputed.
class sktutor.preprocessing.InteractionCreator(columns1, columns2)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates interactions across columns of a DataFrame

Parameters:
  • columns1 (list of strings) – first list of columns to create interactions with each of the second list of columns
  • columns2 (list of strings) – second list of columns to create interactions with each of the first list of columns
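The pairing of the two lists can be sketched with `itertools.product`: every column in the first list is multiplied element-wise with every column in the second. The helper `add_interactions` and the `a:b` naming of the new columns are assumptions for illustration.

```python
from itertools import product


def add_interactions(columns, columns1, columns2):
    """Return a copy of columns with one product column per (a, b) pair."""
    out = dict(columns)
    for a, b in product(columns1, columns2):
        out[f"{a}:{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return out


data = {"x1": [1, 2], "x2": [3, 4], "z": [10, 20]}
result = add_interactions(data, ["x1", "x2"], ["z"])
```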
fit(X, y=None, **fit_params)[source]

Fit the transformer on X. Checks that all columns are in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Add specified interactions to X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with the interaction columns added.
class sktutor.preprocessing.MissingColumnsReplacer(cols, value)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Fill in missing columns to a DataFrame.

Parameters:
  • cols – The expected list of columns.
  • value – The value to fill the new columns with by default

fit(X, y=None)[source]

Fit the replacer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.

transform(X)[source]

Add any expected columns that are missing from X, filled with value.

Parameters:X (pandas DataFrame) – The input data, possibly lacking some expected columns.
Return type:A DataFrame with the missing columns added.

class sktutor.preprocessing.MissingValueFiller(value)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Fill missing values with a specified value. Should only be used with columns of similar dtypes.

Parameters:value – The value to impute for missing factors.
fit(X, y=None)[source]

Fit the imputer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Impute the eligible missing values in X.

Parameters:X (pandas DataFrame) – The input data with missing values to be filled.
Return type:A DataFrame with eligible missing values filled.
class sktutor.preprocessing.OverMissingThresholdDropper(threshold)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop columns with more missing data than a given threshold.

Parameters:threshold (float) – Maximum portion of missing data that is acceptable. Must be within the interval [0,1]
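The threshold rule can be sketched as follows, with `None` standing in for a missing value. The helper `drop_over_missing` is hypothetical. It keeps a column when its share of missing values is at most the threshold.

```python
def drop_over_missing(columns, threshold):
    """Keep only columns whose share of missing values is <= threshold."""
    kept = {}
    for name, values in columns.items():
        share = sum(v is None for v in values) / len(values)
        if share <= threshold:
            kept[name] = values
    return kept


data = {
    "a": [1, None, 3, 4],           # 25% missing
    "b": [None, None, None, 4],     # 75% missing
    "c": [1, 2, 3, 4],              # nothing missing
}
kept = drop_over_missing(data, threshold=0.5)
```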
fit(X, y=None)[source]

Fit the dropper on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Drop the columns in X whose share of missing values exceeds the threshold.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with columns dropped.
class sktutor.preprocessing.PolynomialFeatures(degree=2, interaction_only=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates polynomial features from inputs.

Parameters:
  • degree – The degree of the polynomial
  • interaction_only – if true, only interaction features are produced: features that are products of at most degree distinct input features.
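The difference that interaction_only makes can be sketched by listing the terms that would be generated for two inputs. The helper `term_names` and the `a*b` naming are hypothetical, and the sketch leaves out any constant term.

```python
from itertools import combinations, combinations_with_replacement


def term_names(features, degree=2, interaction_only=False):
    """List polynomial terms up to `degree` as product strings."""
    # combinations never repeats a feature, so powers such as a*a are skipped.
    pick = combinations if interaction_only else combinations_with_replacement
    terms = []
    for d in range(1, degree + 1):
        terms.extend("*".join(combo) for combo in pick(features, d))
    return terms


full = term_names(["a", "b"])
inter = term_names(["a", "b"], interaction_only=True)
```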

fit(X, y=None, **fit_params)[source]

Fit the transformer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X, **transform_params)[source]

Transform X into polynomial features.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with polynomial features.
class sktutor.preprocessing.SingleValueAboveThresholdDropper(threshold=1, dropna=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Removes columns with a single value representing a higher percentage of values than a given threshold

Parameters:
  • threshold (float) – percentage of single value in a column to be removed
  • dropna (boolean) – If True, do not consider NaN as a value
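One possible reading of the rule is sketched below: compute the share of the most common value in each column and drop the column when that share is above the threshold. The helper names are hypothetical. Two details are assumptions: the strict comparison, and excluding missing values (`None` here) from both the counts and the denominator when dropna is true.

```python
from collections import Counter


def top_value_share(values, dropna=True):
    """Share of the single most common value in a column."""
    if dropna:
        values = [v for v in values if v is not None]
    if not values:
        return 0.0
    return Counter(values).most_common(1)[0][1] / len(values)


def drop_dominated(columns, threshold, dropna=True):
    """Keep columns whose most common value does not exceed threshold."""
    return {
        name: values
        for name, values in columns.items()
        if top_value_share(values, dropna) <= threshold
    }


data = {"flag": ["y", "y", "y", "n"], "id": [1, 2, 3, 4]}
kept = drop_dominated(data, threshold=0.7)
```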
fit(X, y=None)[source]

Fit the dropper on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Drop the columns in X with single values that exceed the threshold.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with columns dropped to the specifications.
class sktutor.preprocessing.SingleValueDropper(dropna=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop columns with only one unique value

Parameters:dropna (boolean) – If True, do not consider NaN as a value
fit(X, y=None)[source]

Fit the dropper on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Drop the columns in X with single non-missing values.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with columns dropped.
class sktutor.preprocessing.SklearnPandasWrapper(transformer)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Wrap a scikit-learn Transformer with a pandas-friendly version that keeps columns and row indices in place. Will only work for Transformers that do not add or change the order of columns.

Parameters:transformer (sklearn Transformer) – The scikit-learn compatible Transformer object.

fit(X, y=None)[source]

Fit the wrapped transformer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.

transform(X)[source]

Transform values in X.

Parameters:X (pandas DataFrame) – The input data to be transformed.
Return type:A transformed DataFrame.

class sktutor.preprocessing.StandardScaler(columns=None, **kwargs)[source]

Bases: sklearn.preprocessing._data.StandardScaler

Standardize features by removing mean and scaling to unit variance
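The arithmetic behind standardization and its inverse is sketched below for a single column. The helper names are hypothetical. The population standard deviation is used because that is what scikit-learn's StandardScaler, the base class here, computes.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of a column."""
    return mean(values), pstdev(values)


def scale(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def unscale(values, mu, sigma):
    # Exact inverse of scale(): multiply back and re-add the mean.
    return [v * sigma + mu for v in values]


column = [2.0, 4.0, 6.0]
mu, sigma = fit_scaler(column)
scaled = scale(column, mu, sigma)
restored = unscale(scaled, mu, sigma)
```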

fit(X, y=None, **fit_params)[source]

Fit the transformer on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
fit_transform(X, y=None, **fit_params)[source]

Fit and transform the StandardScaler on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with scaled columns.
inverse_transform(X, partial_cols=None, **transform_params)[source]

Inverse transform X with the standard scaling

Parameters:
  • X (list) – The input data.
  • partial_cols – when specified, only return these columns
Return type:

A DataFrame with specified columns.

transform(X, partial_cols=None, **transform_params)[source]

Transform X with the standard scaling

Parameters:
  • X (list) – The input data.
  • partial_cols – when specified, only return these columns
Return type:

A DataFrame with specified columns.

class sktutor.preprocessing.TextContainsDummyExtractor(mapper)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Extract one or more dummy variables based on whether one or more text columns contains one or more strings.

Parameters:mapper (dict) – a mapping of new columns to criteria to populate it as True

mapper takes the form:

{'old_column1':
 {'new_column1':
  [{'pattern': 'string1', 'kwargs': {'case': False}},
   {'pattern': 'string2', 'kwargs': {'case': False}}
   ],
  'new_column2':
  [{'pattern': 'string3', 'kwargs': {'case': False}},
   {'pattern': 'string4', 'kwargs': {'case': False}}
   ],
  },
 'old_column2':
 {'new_column3':
  [{'pattern': 'string5', 'kwargs': {'case': False}},
   {'pattern': 'string6', 'kwargs': {'case': False}}
   ],
  'new_column4':
  [{'pattern': 'string7', 'kwargs': {'case': False}},
   {'pattern': 'string8', 'kwargs': {'case': False}}
   ]
  }
 }
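How one new column could be derived from the criteria list is sketched below with the standard `re` module. The helper `contains_any` is hypothetical. Two things are assumptions: a row is flagged when any one of the patterns matches, and `{'case': False}` means a case-insensitive match, as it does for pandas' `Series.str.contains`.

```python
import re


def contains_any(text, criteria):
    """True if text matches at least one pattern in criteria."""
    if text is None:
        return False
    for item in criteria:
        # Assumption: case=False requests a case-insensitive match.
        case = item.get("kwargs", {}).get("case", True)
        flags = 0 if case else re.IGNORECASE
        if re.search(item["pattern"], text, flags):
            return True
    return False


criteria = [
    {"pattern": "string1", "kwargs": {"case": False}},
    {"pattern": "string2", "kwargs": {"case": False}},
]
texts = ["has STRING1 here", "nothing", "string2!", None]
new_column1 = [contains_any(t, criteria) for t in texts]
```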
fit(X, y=None)[source]

Fit the extractor on X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Create the dummy columns defined in mapper from the text columns in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with the new dummy columns added.
class sktutor.preprocessing.TypeExtractor(type)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Returns dataframe with only specified field type

Parameters:type (string) – desired type; either ‘numeric’ or ‘categorical’
fit(df, **fit_params)[source]

Fit the TypeExtractor on df.

Parameters:df (pandas DataFrame) – The input data.
Return type:Returns self.
transform(df, **transform_params)[source]

Extract all columns of type.

Parameters:df (pandas DataFrame) – The input data.
Return type:A DataFrame with extracted columns.
class sktutor.preprocessing.ValueReplacer(mapper=None, inverse_mapper=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replaces values in each column according to a nested dictionary. inverse_mapper is probably more intuitive when one value replaces many values. Only one of inverse_mapper or mapper can be used.

Parameters:
  • mapper (dictionary) – Nested dictionary with columns mapping to dictionaries that map old values to new values.
  • inverse_mapper (dictionary) – Nested dictionary with columns mapping to dictionaries that map new values to a list of old values

mapper takes the form:

{'column_name': {'old_value1': 'new_value1',
                 'old_value2': 'new_value1',
                 'old_value3': 'new_value2'}
 }

while inverse_mapper takes the form:

{'column_name': {'new_value1': ['old_value1', 'old_value2'],
                 'new_value2': ['old_value3']}
 }
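The relationship between the two forms can be sketched by converting an inverse_mapper into the equivalent mapper and applying it to a list of values. The helper `invert` is hypothetical. Values that do not appear in the mapper pass through unchanged, as described for transform.

```python
def invert(inverse_mapper):
    """Turn {column: {new: [old, ...]}} into {column: {old: new}}."""
    return {
        column: {old: new for new, olds in new_to_old.items() for old in olds}
        for column, new_to_old in inverse_mapper.items()
    }


inverse_mapper = {
    "column_name": {
        "new_value1": ["old_value1", "old_value2"],
        "new_value2": ["old_value3"],
    }
}
mapper = invert(inverse_mapper)

values = ["old_value1", "old_value3", "untouched"]
replaced = [mapper["column_name"].get(v, v) for v in values]
```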
fit(X, y=None)[source]

Fit the value replacer on X. Checks that all columns in mapper are present in X.

Parameters:X (pandas DataFrame) – The input data.
Return type:Returns self.
transform(X)[source]

Replace the values in X with the values in the mapper. Values not accounted for in the mapper will be left untransformed.

Parameters:X (pandas DataFrame) – The input data.
Return type:A DataFrame with old values mapped to new values.
sktutor.preprocessing.mode(x)[source]

Return the most frequent occurrence. If two or more values are tied for the most occurrences, then return the lowest value.

Parameters:x (pandas Series) – A data vector.
Return type:The most frequent value in x.
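The tie-breaking rule can be sketched with `collections.Counter` on a plain list. The function `mode_lowest_tie` is a hypothetical re-statement of the described behavior, not sktutor's code.

```python
from collections import Counter


def mode_lowest_tie(values):
    """Most frequent value; ties are resolved by taking the lowest."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, n in counts.items() if n == top)


# 1 and 3 both occur twice, so the lower of the two is returned.
tied = mode_lowest_tie([3, 1, 3, 1, 2])
clear = mode_lowest_tie(["b", "a", "b"])
```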

sktutor.pipeline module

class sktutor.pipeline.FeatureUnion(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]

Bases: sklearn.pipeline.FeatureUnion

Perform a list of transformations in parallel and concat the results

Parameters:
  • transformer_list – list of (string, transformer) tuples
  • n_jobs – Number of jobs to run in parallel (default 1).
fit_transform(X, y=None, **fit_params)[source]

Transform X separately by each transformer, concatenate results.

Parameters:X (iterable or array-like, depending on transformers) – Input data to be transformed.
Return type:DataFrame with concatenated results of transformers.
transform(X)[source]

Transform X separately by each transformer, concatenate results.

Parameters:X (iterable or array-like, depending on transformers) – Input data to be transformed.
Return type:DataFrame with concatenated results of transformers.
sktutor.pipeline.make_union(*transformers, **kwargs)[source]

Construct a FeatureUnion from the given transformers. This is a shorthand for the FeatureUnion constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting.

Parameters:
  • transformers – list of estimators
  • n_jobs – Number of jobs to run in parallel (default 1).
Return type:

FeatureUnion
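Automatic naming by type can be sketched as follows. The helper `auto_names` and the two toy classes are hypothetical. The lowercase class name with a numeric suffix for duplicates mirrors scikit-learn's convention for make_union, and it is an assumption that sktutor follows the same scheme.

```python
from collections import Counter


class Scaler:
    """Toy transformer used only for naming."""


class Binner:
    """Toy transformer used only for naming."""


def auto_names(transformers):
    """Name each transformer after its class, numbering duplicates."""
    base = [type(t).__name__.lower() for t in transformers]
    totals = Counter(base)
    seen = Counter()
    names = []
    for name in base:
        if totals[name] > 1:
            seen[name] += 1
            names.append(f"{name}-{seen[name]}")
        else:
            names.append(name)
    return names


names = auto_names([Scaler(), Binner(), Scaler()])
```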

Module contents