ArrayAssembler#
- class pyspark.ml.connect.feature.ArrayAssembler(*, inputCols=None, outputCol=None, featureSizes=None, handleInvalid='error')[source]#
- A feature transformer that merges multiple input columns into a single array-type column.
- Set the param `inputCols` to specify the input column names, and set the param
- `featureSizes` to specify the feature size of each input column: for a scalar
- input column the feature size must be 1; for an array input column it must equal
- the length of the array.
- The output column is of type `array<double>` and contains the array of assembled features.
- All elements in the input feature columns must be convertible to double type.
- Set the `handleInvalid` param to specify how to handle invalid input values
- (None or NaN): if it is set to 'error', an error is thrown for an invalid input value;
- if it is set to 'keep', the corresponding number of NaN values is kept in the output.
- .. versionadded:: 4.0.0
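- As a hedged sketch of the 'error' mode described above (the `spark` session is assumed to exist, as in the Examples below, and the column names are hypothetical), transform is expected to fail when an input value is NaN:

>>> import numpy as np  # hedged sketch, not from the original docs
>>> df = spark.createDataFrame([(float("nan"), 1.0)], schema=["a", "b"])
>>> strict = ArrayAssembler(
...     inputCols=["a", "b"], outputCol="out",
...     featureSizes=[1, 1], handleInvalid="error",
... )
>>> strict.transform(df).collect()  # expected to raise: column "a" contains NaN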
 
- Examples

>>> from pyspark.ml.connect.feature import ArrayAssembler
>>> import numpy as np
>>>
>>> spark_df = spark.createDataFrame(
...     [
...         ([2.0, 3.5, 1.5], 3.0, True, 1),
...         ([-3.0, np.nan, -2.5], 4.0, False, 2),
...     ],
...     schema=["f1", "f2", "f3", "f4"],
... )
>>> assembler = ArrayAssembler(
...     inputCols=["f1", "f2", "f3", "f4"],
...     outputCol="out",
...     featureSizes=[3, 1, 1, 1],
...     handleInvalid="keep",
... )
>>> assembler.transform(spark_df).select("out").show(truncate=False)

- Methods

- clear(param) - Clears a param from the param map if it has been explicitly set.
- copy([extra]) - Creates a copy of this instance with the same uid and some extra params.
- explainParam(param) - Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() - Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap([extra]) - Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- getFeatureSizes() - Gets the value of featureSizes or its default value.
- getHandleInvalid() - Gets the value of handleInvalid or its default value.
- getInputCols() - Gets the value of inputCols or its default value.
- getOrDefault(param) - Gets the value of a param in the user-supplied param map or its default value.
- getOutputCol() - Gets the value of outputCol or its default value.
- getParam(paramName) - Gets a param by its name.
- hasDefault(param) - Checks whether a param has a default value.
- hasParam(paramName) - Tests whether this instance contains a param with a given (string) name.
- isDefined(param) - Checks whether a param is explicitly set by user or has a default value.
- isSet(param) - Checks whether a param is explicitly set by user.
- load(path) - Load Estimator / Transformer / Model / Evaluator from provided cloud storage path.
- loadFromLocal(path) - Load Estimator / Transformer / Model / Evaluator from provided local path.
- save(path, *[, overwrite]) - Save Estimator / Transformer / Model / Evaluator to provided cloud storage path.
- saveToLocal(path, *[, overwrite]) - Save Estimator / Transformer / Model / Evaluator to provided local path.
- set(param, value) - Sets a parameter in the embedded param map.
- transform(dataset[, params]) - Transforms the input dataset.

- Attributes

- featureSizes - Input feature size list for the input columns.
- handleInvalid - How to handle invalid entries.
- inputCols - Input column names.
- outputCol - Output column name.
- params - Returns all params ordered by name.
- uid - A unique id for the object.

- Methods Documentation

- clear(param)#
- Clears a param from the param map if it has been explicitly set. 
 - copy(extra=None)#
- Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
- Parameters
- extra : dict, optional
- Extra parameters to copy to the new instance

- Returns
- Params
- Copy of this instance
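- A hedged illustration of copying with an extra param map (instance names are hypothetical):

>>> assembler = ArrayAssembler(inputCols=["a"], outputCol="out", featureSizes=[1])
>>> relaxed = assembler.copy({assembler.handleInvalid: "keep"})
>>> relaxed.getHandleInvalid()  # the copy carries the extra value
'keep'
>>> assembler.getHandleInvalid()  # the original keeps its default
'error'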
 
 
 - explainParam(param)#
- Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string. 
 - explainParams()#
- Returns the documentation of all params with their optionally default values and user-supplied values. 
 - extractParamMap(extra=None)#
- Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
- extra : dict, optional
- extra param values

- Returns
- dict
- merged param map
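- A hedged sketch of the merge ordering described above (default param values < user-supplied values < extra):

>>> assembler = ArrayAssembler(inputCols=["a"], outputCol="out", featureSizes=[1])
>>> pmap = assembler.extractParamMap({assembler.outputCol: "override"})
>>> pmap[assembler.outputCol]  # extra wins over the user-supplied "out"
'override'
>>> pmap[assembler.handleInvalid]  # default survives, since it was never user-set
'error'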
 
 
 - getFeatureSizes()#
- Gets the value of featureSizes or its default value. 
 - getHandleInvalid()#
- Gets the value of handleInvalid or its default value. 
 - getInputCols()#
- Gets the value of inputCols or its default value. 
 - getOrDefault(param)#
- Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set. 
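- For example (a hedged sketch; per the attribute documentation below, handleInvalid defaults to 'error'):

>>> assembler = ArrayAssembler(inputCols=["a"], outputCol="out", featureSizes=[1])
>>> assembler.getOrDefault(assembler.handleInvalid)
'error'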
 - getOutputCol()#
- Gets the value of outputCol or its default value. 
 - getParam(paramName)#
- Gets a param by its name. 
 - hasDefault(param)#
- Checks whether a param has a default value. 
 - hasParam(paramName)#
- Tests whether this instance contains a param with a given (string) name. 
 - isDefined(param)#
- Checks whether a param is explicitly set by user or has a default value. 
 - isSet(param)#
- Checks whether a param is explicitly set by user. 
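- To make the distinction between isSet, hasDefault, and isDefined concrete, a hedged sketch:

>>> assembler = ArrayAssembler(inputCols=["a"], outputCol="out", featureSizes=[1])
>>> assembler.isSet("inputCols")  # explicitly passed to the constructor
True
>>> assembler.isSet("handleInvalid")  # never set by the user...
False
>>> assembler.hasDefault("handleInvalid")  # ...but it has a default
True
>>> assembler.isDefined("handleInvalid")  # set by user OR has a default
True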
 - classmethod load(path)#
- Load Estimator / Transformer / Model / Evaluator from provided cloud storage path.
- New in version 3.5.0.
 - classmethod loadFromLocal(path)#
- Load Estimator / Transformer / Model / Evaluator from provided local path.
- New in version 3.5.0.
 - save(path, *, overwrite=False)#
- Save Estimator / Transformer / Model / Evaluator to provided cloud storage path.
- New in version 3.5.0.
 - saveToLocal(path, *, overwrite=False)#
- Save Estimator / Transformer / Model / Evaluator to provided local path.
- New in version 3.5.0.
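- A minimal save/load round trip using the local variants (the path is hypothetical):

>>> assembler = ArrayAssembler(
...     inputCols=["a", "b"], outputCol="out", featureSizes=[1, 1]
... )
>>> assembler.saveToLocal("/tmp/array_assembler", overwrite=True)
>>> loaded = ArrayAssembler.loadFromLocal("/tmp/array_assembler")
>>> loaded.getInputCols()
['a', 'b']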
 - set(param, value)#
- Sets a parameter in the embedded param map. 
 - transform(dataset, params=None)#
- Transforms the input dataset. The dataset can be either a pandas DataFrame or a Spark DataFrame. If it is a Spark DataFrame, the result of the transformation is a new Spark DataFrame that contains all existing columns plus the output column. If it is a pandas DataFrame, the result is a shallow copy of the input pandas DataFrame with the output column appended.
- Note: Transformers do not allow an output column to have the same name as an existing column.
- Parameters
- dataset : pyspark.sql.DataFrame or pandas.DataFrame
- input dataset.
- params : dict, optional
- an optional param map that overrides embedded params.

- Returns
- pyspark.sql.DataFrame or pandas.DataFrame
- transformed dataset; the type of the output dataframe is consistent with the input dataframe.
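- A hedged sketch of the pandas path described above (column names are hypothetical):

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame({"a": [1.0, 2.0], "b": [0.5, np.nan]})
>>> assembler = ArrayAssembler(
...     inputCols=["a", "b"], outputCol="out",
...     featureSizes=[1, 1], handleInvalid="keep",
... )
>>> result = assembler.transform(pdf)  # shallow copy of pdf plus the "out" column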
 
 
- Attributes Documentation

- featureSizes = Param(parent='undefined', name='featureSizes', doc='input feature size list for input columns of vector assembler')#
 - handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are 'error' (throw an error), or 'keep' (return relevant number of NaN in the output). Default value is 'error'")#
 - inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')#
 - outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')#
 - params#
- Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
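- For instance, a hedged sketch listing this transformer's params by name:

>>> [p.name for p in ArrayAssembler().params]
['featureSizes', 'handleInvalid', 'inputCols', 'outputCol']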
 - uid#
- A unique id for the object.