sktutor package¶

Submodules¶

sktutor.preprocessing module¶

class sktutor.preprocessing.BitwiseOperator(operator, mapper)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Apply a bitwise operator & or | to a list of columns.

Parameters:	operator (str) – The name of the bitwise operator to apply. ‘and’ and ‘or’ are acceptable inputs.
	mapper (dict) – A mapping from each new column to the list of old columns that the bitwise operator is applied to.

mapper takes the form:

{'new_column1': ['old_column1', 'old_column2', 'old_column3'],
 'new_column2': ['old_column2', 'old_column4', 'old_column5']
 }
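As a rough illustration of what such a mapper describes, the sketch below combines boolean columns row by row in plain Python. A dict of lists stands in for a DataFrame; the helper name `apply_bitwise` and the sample data are hypothetical, and this is not the sktutor implementation.

```python
from functools import reduce
import operator


def apply_bitwise(columns, mapper, op_name):
    """Build each new column by and/or-ing the listed old columns row by row."""
    ops = {"and": operator.and_, "or": operator.or_}
    op = ops[op_name]
    new_columns = {}
    for new_name, old_names in mapper.items():
        # zip(*...) walks the selected columns one row at a time.
        rows = zip(*(columns[name] for name in old_names))
        new_columns[new_name] = [reduce(op, row) for row in rows]
    return new_columns


data = {
    "old_column1": [True, False, False],
    "old_column2": [True, True, False],
    "old_column3": [True, False, False],
}
mapper = {"new_column1": ["old_column1", "old_column2", "old_column3"]}

any_flag = apply_bitwise(data, mapper, "or")    # True where any old column is True
all_flag = apply_bitwise(data, mapper, "and")   # True where every old column is True
```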

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X. Checks that all columns are in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Apply the bitwise operator to the columns listed in mapper and add the resulting new columns to X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with the new columns defined in mapper.

class sktutor.preprocessing.BoxCoxTransformer(adder=0)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Create BoxCox Transformations on all columns.

Parameters:	adder (numeric) – the amount to add to each column before the BoxCox transformation
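The adder matters because the Box-Cox transformation is only defined for strictly positive values. The sketch below shows the transformation for a fixed lambda in plain Python. The helper `box_cox` and the choice of lambda are illustrative assumptions; this page does not state how sktutor selects lambda.

```python
import math


def box_cox(values, lmbda, adder=0):
    """Box-Cox transform of values after shifting them by adder."""
    shifted = [v + adder for v in values]
    if any(v <= 0 for v in shifted):
        raise ValueError("Box-Cox needs strictly positive inputs; increase adder")
    if lmbda == 0:
        # The lambda = 0 case is defined as the natural logarithm.
        return [math.log(v) for v in shifted]
    return [(v ** lmbda - 1) / lmbda for v in shifted]


# A column containing zero cannot be transformed as is, so shift it by 1 first.
column = [0.0, 1.0, 3.0]
log_like = box_cox(column, lmbda=0, adder=1)
sqrt_like = box_cox(column, lmbda=0.5, adder=1)
```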

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

fit_transform(X, y=None, **fit_params)[source]¶

Fit the transformer on X, then transform X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with Box-Cox transformed columns.

transform(X, **transform_params)[source]¶

Apply the Box-Cox transformation to each column of X after adding adder.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with Box-Cox transformed columns.

class sktutor.preprocessing.ColumnDropper(col)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop a list of columns from a DataFrame.

Parameters:	col (list of strings) – A list of columns to drop from the `DataFrame`

fit(X, y=None, **fit_params)[source]¶

Fit the dropper on X. Checks that all columns are in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Drop the specified columns in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` without specified columns.

class sktutor.preprocessing.ColumnExtractor(col)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Extract a list of columns from a DataFrame.

Parameters:	col (list of strings) – A list of columns to extract from the `DataFrame`

fit(X, y=None, **fit_params)[source]¶

Fit the extractor on X. Checks that all columns are in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Extract the specified columns in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with specified columns.

class sktutor.preprocessing.ColumnNameCleaner[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replaces spaces and formula symbols in column names that conflict with patsy formula interpretation.

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Transform X so that its column names are clean for patsy.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with cleaned column names.

class sktutor.preprocessing.ColumnValidator[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Ensure that the transformed dataset has the same columns and order as the original fit dataset. Could be useful to check at the beginning and end of pipelines.

fit(X, y=None, **fit_params)[source]¶

Fit the validator on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Checks whether a dataset to transform has the same columns as the fitting dataset, and returns X with columns in the same order as the dataset in fit.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with specified columns.

class sktutor.preprocessing.ContinuousFeatureBinner(field, bins, right_inclusive=True)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates bins for continuous features.

Parameters:	field (string) – The continuous field for which to create bins.
	bins (array-like) – The criteria to bin by.
	right_inclusive (bool) – Whether the intervals should be right-inclusive.

fit(X, y=None)[source]¶

Fit the ContinuousFeatureBinner on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Bin the values of field in X, adding a new column named after field with _GRP appended.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with the new binned column added.
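The effect of right_inclusive is easiest to see for a value that sits exactly on a bin edge. The plain-Python sketch below finds the pair of edges that contains a value under each setting. The helper `interval_for` is hypothetical, and the labels sktutor writes into the _GRP column may look different.

```python
from bisect import bisect_left, bisect_right


def interval_for(value, bins, right_inclusive=True):
    """Return the (low, high) bin edges containing value, or None if outside."""
    if right_inclusive:
        # Intervals are (low, high]: a value on an edge joins the bin to its left.
        idx = bisect_left(bins, value)
    else:
        # Intervals are [low, high): a value on an edge joins the bin to its right.
        idx = bisect_right(bins, value)
    if idx == 0 or idx == len(bins):
        return None
    return (bins[idx - 1], bins[idx])


bins = [0, 10, 20]
ages = [5, 10, 20]
right_closed = [interval_for(a, bins) for a in ages]
left_closed = [interval_for(a, bins, right_inclusive=False) for a in ages]
```

With these edges, the value 10 falls in (0, 10] when right_inclusive is true and in [10, 20) otherwise, while 20 only has a bin in the right-inclusive case.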

class sktutor.preprocessing.DummyCreator(**kwargs)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Create dummy variables from categorical variables.

Parameters:	dummy_na (boolean) – Add a column to indicate NaNs; if False, NaNs are ignored.
	drop_first (boolean) – Whether to get k-1 dummies out of k categorical levels by removing the first level.

fit(X, y=None, **fit_params)[source]¶

Fit the dummy creator on X. Retains a record of columns produced with the fitting data.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

fit_transform(X, y=None, **fit_params)[source]¶

Fit the dummy creator on X, then transform X. Same as calling self.fit().transform(), but more convenient and efficient.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with dummy variables.

transform(X, **transform_params)[source]¶

Create dummies for the columns in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with dummy variables.
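A minimal plain-Python sketch of dummy encoding with drop_first, and of why a record of the fitted columns is useful. The helper `make_dummies`, the column naming, and the alignment step are assumptions made for illustration; they are not taken from the sktutor source.

```python
def make_dummies(values, prefix, drop_first=False):
    """One 0/1 column per distinct level, optionally dropping the first level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return {f"{prefix}_{level}": [int(v == level) for v in values] for level in levels}


colors = ["red", "blue", "red", "green"]
full = make_dummies(colors, "color")                      # k columns
reduced = make_dummies(colors, "color", drop_first=True)  # k-1 columns

# New data may contain unseen levels or lack known ones. Aligning to the
# columns seen at fit time keeps the output shape stable (assumed behavior).
fitted_columns = list(full)
new_data = make_dummies(["red", "purple"], "color")
aligned = {col: new_data.get(col, [0, 0]) for col in fitted_columns}
```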

class sktutor.preprocessing.FactorLimiter(factors_per_column=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

For each named column, it limits factors to a list of acceptable values. Non-conforming factors, including missing values, are replaced by a default value.

Parameters:	factors_per_column (dictionary) – dictionary mapping column name keys to a dictionary with a list of acceptable factor values and a default factor value for non-conforming values

factors_per_column takes the form:

{'column_name': {'factors': ['value1', 'value2', 'value3'],
                 'default': 'value1'}
 }

fit(X, y=None)[source]¶

Fit the factor limiter on X. Checks that all columns in factors_per_column are present in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Limit the factors in X with the values in factors_per_column.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with factors limited to the specifications.
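The replacement rule can be sketched in plain Python as follows, with a dict of lists standing in for a DataFrame and None standing in for a missing value. The helper `limit_factors` and the sample data are hypothetical.

```python
def limit_factors(columns, factors_per_column):
    """Replace values outside each column's allowed factors with its default."""
    limited = dict(columns)
    for name, spec in factors_per_column.items():
        allowed = set(spec["factors"])
        # None is not an allowed factor, so missing values get the default too.
        limited[name] = [v if v in allowed else spec["default"] for v in columns[name]]
    return limited


data = {"size": ["S", "M", "XL", None], "price": [1, 2, 3, 4]}
spec = {"size": {"factors": ["S", "M", "L"], "default": "M"}}
result = limit_factors(data, spec)
```

Columns not named in the specification, such as price here, pass through unchanged.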

class sktutor.preprocessing.GenericTransformer(function, params=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Generic transformer that applies a user-defined function within the pipeline framework. The callable should only transform the data; it does not store any fit() parameters. Lambda functions are not supported because they cannot be pickled.

Parameters:	function (callable) – Arbitrary function to use as a transformer.
	params (dict) – Dict with function parameter names as keys and parameter values as values.

fit(X, y=None, **fit_params)[source]¶

transform(X, **transform_params)[source]¶
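A sketch of how a function and its params could fit together, assuming params is forwarded to the function as keyword arguments. The helpers `clip_values` and `run_generic` are hypothetical and use a dict of lists in place of a DataFrame. Note that the function is a named, module-level function rather than a lambda.

```python
def clip_values(columns, lower, upper):
    """Clamp every value in every column to the range [lower, upper]."""
    return {
        name: [min(max(v, lower), upper) for v in values]
        for name, values in columns.items()
    }


def run_generic(function, data, params=None):
    """Call function on data, passing params as keyword arguments."""
    return function(data, **(params or {}))


data = {"a": [-5, 3, 12]}
clipped = run_generic(clip_values, data, params={"lower": 0, "upper": 10})
```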

class sktutor.preprocessing.GroupByImputer(impute_type, group=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Imputes missing values by group with a specified function. If a group parameter is given, impute_type can be the name of any function which can be passed to the agg function of a pandas GroupBy object. If a group parameter is not given, then only ‘mean’, ‘median’, and ‘most_frequent’ can be used.

Parameters:	impute_type (string) – The type of imputation to be performed.
	group (string or list of strings) – The column name or a list of column names to group the `pandas DataFrame` by.

fit(X, y=None)[source]¶

Fit the imputer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Impute the eligible missing values in X.

Parameters:	X (pandas DataFrame) – The input data with missing values to be imputed.
Return type:	A `DataFrame` with eligible missing values imputed.
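Group-wise mean imputation can be sketched in plain Python like this, with None marking a missing value. The helper `impute_by_group` is hypothetical, and leaving a value missing when its group has no observed values is an assumption about what "eligible" means here.

```python
from collections import defaultdict
from statistics import mean


def impute_by_group(values, groups):
    """Replace None with the mean of the observed values in the same group."""
    observed = defaultdict(list)
    for value, group in zip(values, groups):
        if value is not None:
            observed[group].append(value)
    group_means = {g: mean(vals) for g, vals in observed.items()}
    # A group with no observed values has nothing to impute from, so None stays.
    return [group_means.get(g) if v is None else v for v, g in zip(values, groups)]


ages = [20, None, 40, None, None]
teams = ["a", "a", "b", "b", "c"]
imputed = impute_by_group(ages, teams)
```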

class sktutor.preprocessing.InteractionCreator(columns1, columns2)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates interactions across columns of a DataFrame.

Parameters:	columns1 (list of strings) – First list of columns, each of which is interacted with each column in the second list.
	columns2 (list of strings) – Second list of columns, each of which is interacted with each column in the first list.

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X. Checks that all columns are in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Add specified interactions to X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with the specified interaction columns added.
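The pairing of the two lists amounts to a Cartesian product, which the plain-Python sketch below makes explicit for numeric columns. The helper `add_interactions` and the "left:right" column naming are assumptions for illustration only.

```python
from itertools import product


def add_interactions(columns, columns1, columns2):
    """Add one product column for every pair drawn from columns1 x columns2."""
    out = dict(columns)
    for left, right in product(columns1, columns2):
        # Hypothetical naming scheme; the real output names may differ.
        out[f"{left}:{right}"] = [a * b for a, b in zip(columns[left], columns[right])]
    return out


data = {"x1": [1, 2], "x2": [3, 4], "z": [10, 100]}
result = add_interactions(data, ["x1", "x2"], ["z"])
```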

class sktutor.preprocessing.MissingColumnsReplacer(cols, value)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Fill in missing columns to a DataFrame.

Parameters:	cols – The expected list of columns.
	value – The value to fill the new columns with by default.

fit(X, y=None)[source]¶

Fit the transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Add any expected columns that are missing from X, filled with value.

Parameters:	X (pandas DataFrame) – The input data, possibly lacking some expected columns.
Return type:	A `DataFrame` with the missing columns filled in.

class sktutor.preprocessing.MissingValueFiller(value)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Fill missing values with a specified value. Should only be used with columns of similar dtypes.

Parameters:	value – The value to impute for missing factors.

fit(X, y=None)[source]¶

Fit the imputer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Impute the eligible missing values in X.

Parameters:	X (pandas DataFrame) – The input data with missing values to be filled.
Return type:	A `DataFrame` with eligible missing values filled.

class sktutor.preprocessing.OverMissingThresholdDropper(threshold)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop columns with more missing data than a given threshold.

Parameters:	threshold (float) – Maximum portion of missing data that is acceptable. Must be within the interval [0,1].

fit(X, y=None)[source]¶

Fit the dropper on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Drop the columns in X whose portion of missing data exceeds the threshold.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with columns dropped.
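A plain-Python sketch of the threshold rule, with None marking a missing value. The helper `drop_over_missing` is hypothetical; in the transformer itself, the set of columns to drop is presumably decided during fit and reused in transform.

```python
def drop_over_missing(columns, threshold):
    """Keep only columns whose share of missing values is at most threshold."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= threshold:
            kept[name] = values
    return kept


data = {
    "mostly_missing": [1, None, None, None],  # 75% missing
    "mostly_present": [1, 2, None, 4],        # 25% missing
}
result = drop_over_missing(data, threshold=0.5)
```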

class sktutor.preprocessing.PolynomialFeatures(degree=2, interaction_only=False)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates polynomial features from inputs.

Parameters:	degree – The degree of the polynomial.
	interaction_only – If true, only interaction features are produced: features that are products of at most degree distinct input features.
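The difference that interaction_only makes can be seen by listing which products of input features are generated. The plain-Python sketch below enumerates them; the helper `polynomial_terms` is hypothetical, and whether the real output also includes a constant column is not stated on this page.

```python
from itertools import combinations, combinations_with_replacement


def polynomial_terms(features, degree=2, interaction_only=False):
    """List the feature tuples whose products make up the polynomial terms."""
    # combinations never repeats a feature, so powers such as a*a are excluded.
    pick = combinations if interaction_only else combinations_with_replacement
    terms = []
    for d in range(1, degree + 1):
        terms.extend(pick(features, d))
    return terms


all_terms = polynomial_terms(["a", "b"], degree=2)
interactions = polynomial_terms(["a", "b"], degree=2, interaction_only=True)
```

For two inputs a and b at degree 2, the full set contains a, b, a*a, a*b, and b*b, while the interaction-only set contains a, b, and a*b.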

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X, **transform_params)[source]¶

Transform X into polynomial features.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with polynomial features.

class sktutor.preprocessing.SingleValueAboveThresholdDropper(threshold=1, dropna=True)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Removes columns in which a single value accounts for a higher share of values than a given threshold.

Parameters:	threshold (float) – The share of a column that a single value must exceed for the column to be removed.
	dropna (boolean) – If True, do not consider NaN as a value.

fit(X, y=None)[source]¶

Fit the dropper on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Drop the columns in X with single values that exceed the threshold.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with columns dropped to the specifications.

class sktutor.preprocessing.SingleValueDropper(dropna=True)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Drop columns with only one unique value.

Parameters:	dropna (boolean) – If True, do not consider NaN as a value

fit(X, y=None)[source]¶

Fit the dropper on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Drop the columns in X with single non-missing values.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with columns dropped.

class sktutor.preprocessing.SklearnPandasWrapper(transformer)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Wrap a scikit-learn Transformer with a pandas-friendly version that keeps columns and row indices in place. Will only work for Transformers that do not add or change the order of columns.

Parameters:	transformer (sklearn Transformer) – The scikit-learn compatible Transformer object.

fit(X, y=None)[source]¶

Fit the wrapped transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Transform values in X.

Parameters:	X (pandas DataFrame) – The input data to be transformed.
Return type:	A transformed `DataFrame`.

class sktutor.preprocessing.StandardScaler(columns=None, **kwargs)[source]¶

Bases: sklearn.preprocessing._data.StandardScaler

Standardize features by removing mean and scaling to unit variance

fit(X, y=None, **fit_params)[source]¶

Fit the transformer on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

fit_transform(X, y=None, **fit_params)[source]¶

Fit and transform the StandardScaler on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with scaled columns.

inverse_transform(X, partial_cols=None, **transform_params)[source]¶

Inverse transform X with the standard scaling

Parameters:	X (list) – The input data.
	partial_cols – When specified, only return these columns.
Return type:	A `DataFrame` with specified columns.

transform(X, partial_cols=None, **transform_params)[source]¶

Transform X with the standard scaling

Parameters:	X (list) – The input data.
	partial_cols – When specified, only return these columns.
Return type:	A `DataFrame` with specified columns.

class sktutor.preprocessing.TextContainsDummyExtractor(mapper)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Extract one or more dummy variables based on whether one or more text columns contains one or more strings.

Parameters:	mapper (dict) – a mapping of new columns to criteria to populate it as True

mapper takes the form:

{'old_column1':
 {'new_column1':
  [{'pattern': 'string1', 'kwargs': {'case': False}},
   {'pattern': 'string2', 'kwargs': {'case': False}}
   ],
  'new_column2':
  [{'pattern': 'string3', 'kwargs': {'case': False}},
   {'pattern': 'string4', 'kwargs': {'case': False}}
   ],
  },
 'old_column2':
 {'new_column3':
  [{'pattern': 'string5', 'kwargs': {'case': False}},
   {'pattern': 'string6', 'kwargs': {'case': False}}
   ],
  'new_column4':
  [{'pattern': 'string7', 'kwargs': {'case': False}},
   {'pattern': 'string8', 'kwargs': {'case': False}}
   ]
  }
 }
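To make the nesting concrete, the plain-Python sketch below evaluates a mapper of this shape against a text column. It assumes that each pattern is a regular expression, that the 'case' entry in kwargs controls case sensitivity, and that a match on any listed pattern sets the new column to True. These are assumptions, and the helpers `contains_any` and `extract_dummies` are hypothetical.

```python
import re


def contains_any(text, criteria):
    """True if text matches at least one pattern in criteria."""
    for criterion in criteria:
        case_sensitive = criterion.get("kwargs", {}).get("case", True)
        flags = 0 if case_sensitive else re.IGNORECASE
        if re.search(criterion["pattern"], text, flags):
            return True
    return False


def extract_dummies(columns, mapper):
    """Add one boolean column per new column named in mapper."""
    out = dict(columns)
    for old_name, new_columns in mapper.items():
        for new_name, criteria in new_columns.items():
            out[new_name] = [contains_any(text, criteria) for text in columns[old_name]]
    return out


data = {"title": ["Senior Engineer", "sales LEAD", "Intern"]}
mapper = {
    "title": {
        "is_senior": [
            {"pattern": "senior", "kwargs": {"case": False}},
            {"pattern": "lead", "kwargs": {"case": False}},
        ]
    }
}
result = extract_dummies(data, mapper)
```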

fit(X, y=None)[source]¶

Fit the extractor on X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Create the dummy columns defined in mapper from the text columns in X.

Parameters:	X (pandas DataFrame) – The input data containing the text columns.
Return type:	A `DataFrame` with the new dummy columns.

class sktutor.preprocessing.TypeExtractor(type)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Returns a DataFrame containing only the columns of the specified type.

Parameters:	type (string) – desired type; either ‘numeric’ or ‘categorical’

fit(df, **fit_params)[source]¶

Fit the TypeExtractor on df.

Parameters:	df (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(df, **transform_params)[source]¶

Extract all columns of type.

Parameters:	df (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with extracted columns.

class sktutor.preprocessing.ValueReplacer(mapper=None, inverse_mapper=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replaces values in each column according to a nested dictionary. inverse_mapper is probably more intuitive when one value replaces many values. Only one of inverse_mapper or mapper can be used.

Parameters:	mapper (dictionary) – Nested dictionary with columns mapping to dictionaries that map old values to new values.
	inverse_mapper (dictionary) – Nested dictionary with columns mapping to dictionaries that map new values to a list of old values.

mapper takes the form:

{'column_name': {'old_value1': 'new_value1',
                 'old_value2': 'new_value1',
                 'old_value3': 'new_value2'}
 }

while inverse_mapper takes the form:

{'column_name': {'new_value1': ['old_value1', 'old_value2'],
                 'new_value2': ['old_value3']}
 }
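The two forms carry the same information. The plain-Python sketch below converts an inverse_mapper into the equivalent mapper and applies it, leaving unmapped values untouched. The helpers `invert_mapper` and `replace_values` are hypothetical and use a dict of lists in place of a DataFrame.

```python
def invert_mapper(inverse_mapper):
    """Turn {column: {new: [old, ...]}} into {column: {old: new}}."""
    mapper = {}
    for column, new_to_old in inverse_mapper.items():
        mapper[column] = {}
        for new_value, old_values in new_to_old.items():
            for old_value in old_values:
                mapper[column][old_value] = new_value
    return mapper


def replace_values(columns, mapper):
    """Map old values to new ones; values absent from mapper stay as they are."""
    out = dict(columns)
    for column, old_to_new in mapper.items():
        out[column] = [old_to_new.get(v, v) for v in columns[column]]
    return out


inverse_mapper = {
    "column_name": {
        "new_value1": ["old_value1", "old_value2"],
        "new_value2": ["old_value3"],
    }
}
mapper = invert_mapper(inverse_mapper)
data = {"column_name": ["old_value1", "old_value3", "other"]}
result = replace_values(data, mapper)
```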

fit(X, y=None)[source]¶

Fit the value replacer on X. Checks that all columns in mapper are present in X.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	Returns self.

transform(X)[source]¶

Replace the values in X with the values in the mapper. Values not accounted for in the mapper will be left untransformed.

Parameters:	X (pandas DataFrame) – The input data.
Return type:	A `DataFrame` with old values mapped to new values.

sktutor.preprocessing.mode(x)[source]¶

Return the most frequent occurrence. If two or more values are tied with the most occurrences, then return the lowest value.

Parameters:	x (pandas Series) – A data vector.
Return type:	The most frequent value in x.
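The tie-breaking rule can be sketched in plain Python on a list instead of a pandas Series. The helper name `mode_lowest` is hypothetical.

```python
from collections import Counter


def mode_lowest(values):
    """Most frequent value; ties are broken by taking the lowest value."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


tied = mode_lowest([3, 1, 3, 1, 2])    # 1 and 3 both occur twice
clear = mode_lowest(["b", "a", "b"])   # "b" occurs most often
```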

sktutor.pipeline module¶

class sktutor.pipeline.FeatureUnion(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]¶

Bases: sklearn.pipeline.FeatureUnion

Perform a list of transformations in parallel and concat the results

Parameters:	transformer_list – List of (string, transformer) tuples.
	n_jobs – Number of jobs to run in parallel (default None).

fit_transform(X, y=None, **fit_params)[source]¶

Fit all transformers, transform X separately by each transformer, and concatenate results.

Parameters:	X (iterable or array-like, depending on transformers) – Input data to be transformed.
Return type:	DataFrame with concatenated results of transformers.

transform(X)[source]¶

Transform X separately by each transformer, concatenate results.

Parameters:	X (iterable or array-like, depending on transformers) – Input data to be transformed.
Return type:	DataFrame with concatenated results of transformers.

sktutor.pipeline.make_union(*transformers, **kwargs)[source]¶

Construct a FeatureUnion from the given transformers. This is a shorthand for the FeatureUnion constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting.

Parameters:	transformers – List of estimators.
	n_jobs – Number of jobs to run in parallel (default 1).
Return type:	FeatureUnion
