OmniSciDB
a5dc49c757
Classes | |
struct | KeyToOneHotColBytemap |
A struct that creates a bytemap to map each key to its corresponding one-hot column index. More... | |
struct | OneHotEncodingInfo |
struct | OneHotEncodedCol |
Functions | |
NEVER_INLINE HOST std::pair < std::vector< int32_t >, bool > | get_top_k_keys (const Column< TextEncodingDict > &text_col, const int32_t top_k, const double min_perc_col_total_per_key) |
Returns up to top_k of the most frequent keys (categories) in the provided column, keeping only keys that account for at least the given minimum percentage of the column total. The keys are returned together with a boolean value indicating whether there are other keys beyond the returned top-k keys. More... | |
template<typename F > | |
NEVER_INLINE HOST std::vector < std::vector< F > > | allocate_one_hot_cols (const int64_t num_one_hot_cols, const int64_t col_size) |
Allocates memory for the one-hot encoded columns and initializes them to zero. It takes the number of one-hot columns and the column size as input and returns a vector of one-hot encoded columns. More... | |
std::pair< int32_t, int32_t > | get_min_max_keys (const std::vector< int32_t > &top_k_keys) |
Finds the minimum and maximum keys in a given vector of keys and returns them as a pair. More... | |
template<typename F > | |
NEVER_INLINE HOST OneHotEncodedCol< F > | one_hot_encode (const Column< TextEncodingDict > &text_col, const TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo &one_hot_encoding_info) |
One-hot encodes a dictionary-encoded text column according to the supplied one-hot encoding information. Returns an object containing the one-hot encoded columns and their corresponding categorical features. More... | |
template<typename F > | |
NEVER_INLINE HOST std::vector < OneHotEncodedCol< F > > | one_hot_encode (const ColumnList< TextEncodingDict > &text_cols, const std::vector< TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo > &one_hot_encoding_infos) |
One-hot encode multiple columns of text-encoded data in a column list, given a vector of one-hot encoding information for each column. More... | |
Variables | |
constexpr int16_t | INVALID_COL_IDX {-1} |
template NEVER_INLINE HOST std::vector< std::vector< double > > TableFunctions_Namespace::OneHotEncoder_Namespace::allocate_one_hot_cols | ( | const int64_t | num_one_hot_cols, |
const int64_t | col_size | ||
) |
Allocates memory for the one-hot encoded columns and initializes them to zero. It takes the number of one-hot columns and the column size as input and returns a vector of one-hot encoded columns.
F |
num_one_hot_cols | - number of one-hot encoded columns to allocate |
col_size | - size of each column, in number of values (rows) |
Definition at line 120 of file OneHotEncoder.cpp.
References threading_serial::async(), CHECK_GE, ThreadInfo::num_elems_per_thread, and ThreadInfo::num_threads.
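As an illustration of the contract described above (a zero-initialized buffer per one-hot column), here is a minimal serial sketch. The name allocate_one_hot_cols_sketch is hypothetical, and the threading helpers listed in the references are deliberately left out, so this is not the implementation in OneHotEncoder.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical serial stand-in for allocate_one_hot_cols. The documented
// function references threading helpers; this sketch omits them.
template <typename F>
std::vector<std::vector<F>> allocate_one_hot_cols_sketch(const int64_t num_one_hot_cols,
                                                         const int64_t col_size) {
  // One buffer per one-hot column, each holding col_size zero values.
  return std::vector<std::vector<F>>(
      static_cast<std::size_t>(num_one_hot_cols),
      std::vector<F>(static_cast<std::size_t>(col_size), F(0)));
}
```

Calling allocate_one_hot_cols_sketch<double>(3, 4) yields three columns of four zeros each, matching the double instantiation shown in the signature above.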
std::pair<int32_t, int32_t> TableFunctions_Namespace::OneHotEncoder_Namespace::get_min_max_keys | ( | const std::vector< int32_t > & | top_k_keys | ) |
Finds the minimum and maximum keys in a given vector of keys and returns them as a pair.
top_k_keys | - The top-k keys for a column |
Definition at line 164 of file OneHotEncoder.cpp.
References StringDictionary::INVALID_STR_ID.
Referenced by one_hot_encode().
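A minimal sketch of the min/max scan follows. The reference to StringDictionary::INVALID_STR_ID suggests that a sentinel key may be skipped during the scan; that behavior, the sentinel value -1, and the names get_min_max_keys_sketch and kInvalidKeySketch are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sentinel standing in for StringDictionary::INVALID_STR_ID.
constexpr int32_t kInvalidKeySketch = -1;

// Returns {min, max} over the keys, ignoring the assumed sentinel value.
std::pair<int32_t, int32_t> get_min_max_keys_sketch(
    const std::vector<int32_t>& top_k_keys) {
  int32_t min_key = std::numeric_limits<int32_t>::max();
  int32_t max_key = std::numeric_limits<int32_t>::lowest();
  for (const int32_t key : top_k_keys) {
    if (key == kInvalidKeySketch) {
      continue;  // assumption: invalid keys do not affect the range
    }
    min_key = std::min(min_key, key);
    max_key = std::max(max_key, key);
  }
  return {min_key, max_key};
}
```

Knowing the key range up front lets a caller size a dense lookup table over [min, max] rather than hashing each key.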
NEVER_INLINE HOST std::pair<std::vector<int32_t>, bool> TableFunctions_Namespace::OneHotEncoder_Namespace::get_top_k_keys | ( | const Column< TextEncodingDict > & | text_col, |
const int32_t | top_k, | ||
const double | min_perc_col_total_per_key | ||
) |
Returns up to top_k of the most frequent keys (categories) in the provided column, keeping only keys that account for at least the given minimum percentage of the column total. The keys are returned together with a boolean value indicating whether there are other keys beyond the returned top-k keys.
text_col | - dictionary-encoded text column to extract top-k keys from |
top_k | - maximum number of most-frequent keys to return |
min_perc_col_total_per_key | - Enforces that any top-k key must represent at least this minimum percentage of the total number of values in the column |
Definition at line 43 of file OneHotEncoder.cpp.
References anonymous_namespace{Utm.h}::a, DEBUG_TIMER, get_column_min_max(), Column< TextEncodingDict >::isNull(), threading_serial::parallel_for(), and Column< TextEncodingDict >::size().
Referenced by one_hot_encode().
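The selection described above can be sketched on a plain vector of dictionary keys. Several details here are assumptions rather than facts from this page: the threshold is treated as a fraction in [0, 1], null keys are excluded from the column total, ties are broken by smaller key, and the name get_top_k_keys_sketch and the null_key parameter are hypothetical. The real function operates on Column< TextEncodingDict > and runs in parallel.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical serial sketch: returns up to top_k keys ordered by descending
// frequency, plus a flag telling whether other non-null keys were left out.
std::pair<std::vector<int32_t>, bool> get_top_k_keys_sketch(
    const std::vector<int32_t>& keys,
    const int32_t top_k,
    const double min_fraction,  // assumption: fraction of non-null rows
    const int32_t null_key) {   // assumption: caller supplies the null sentinel
  std::unordered_map<int32_t, int64_t> counts;
  int64_t total = 0;
  for (const int32_t key : keys) {
    if (key == null_key) {
      continue;  // nulls are not counted toward any key or the total
    }
    ++counts[key];
    ++total;
  }

  // Rank keys by count (descending), breaking ties by smaller key.
  std::vector<std::pair<int32_t, int64_t>> ranked(counts.begin(), counts.end());
  std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
    return a.second != b.second ? a.second > b.second : a.first < b.first;
  });

  std::vector<int32_t> top_keys;
  for (const auto& entry : ranked) {
    if (static_cast<int32_t>(top_keys.size()) >= top_k) {
      break;  // already have top_k keys
    }
    if (static_cast<double>(entry.second) < min_fraction * static_cast<double>(total)) {
      break;  // this key and all less frequent ones fall below the threshold
    }
    top_keys.push_back(entry.first);
  }
  const bool has_other_keys = top_keys.size() < ranked.size();
  return {top_keys, has_other_keys};
}
```

Because the keys are ranked by descending count, the first key that misses the threshold ends the scan: every later key is at most as frequent.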
NEVER_INLINE HOST OneHotEncodedCol< F > TableFunctions_Namespace::OneHotEncoder_Namespace::one_hot_encode | ( | const Column< TextEncodingDict > & | text_col, |
const TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo & | one_hot_encoding_info | ||
) |
One-hot encodes a dictionary-encoded text column according to the supplied one-hot encoding information. Returns an object containing the one-hot encoded columns and their corresponding categorical features.
F |
text_col | - input TextEncodingDict column |
one_hot_encoding_info | - struct of parameters specifying how to encode the column |
Definition at line 238 of file OneHotEncoder.cpp.
References TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo::cat_features, TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodedCol< F >::cat_features, CHECK, CHECK_GE, CHECK_GT, DEBUG_TIMER, TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodedCol< F >::encoded_buffers, TableFunctions_Namespace::OneHotEncoder_Namespace::KeyToOneHotColBytemap::get_col_idx_for_key(), get_min_max_keys(), get_top_k_keys(), StringDictionaryProxy::getIdOfString(), StringDictionaryProxy::getString(), TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo::include_others_attr, INVALID_COL_IDX, TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo::is_one_hot_encoded, TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo::min_attr_proportion, threading_serial::parallel_for(), Column< TextEncodingDict >::size(), Column< TextEncodingDict >::string_dict_proxy_, to_string(), and TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo::top_k_attrs.
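To make the encoding step concrete, the sketch below maps each key to a column index through a dense lookup table over the key range, in the spirit of the KeyToOneHotColBytemap struct and the INVALID_COL_IDX sentinel listed above, then writes 1.0 into the matching column for each row. It works on plain vectors and takes the selected keys as an input. The function name, the include_others parameter semantics (an extra trailing column for unmatched keys), and the treatment of nulls as unmatched keys are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the documented INVALID_COL_IDX {-1}; the name is hypothetical.
constexpr int16_t kInvalidColIdxSketch = -1;

// Hypothetical sketch: one output column per selected key, plus an optional
// trailing "other" column for rows whose key was not selected.
std::vector<std::vector<double>> one_hot_encode_sketch(
    const std::vector<int32_t>& keys,
    const std::vector<int32_t>& top_k_keys,
    const bool include_others) {
  if (top_k_keys.empty()) {
    return {};  // simplification: nothing to encode without selected keys
  }
  const int32_t min_key = *std::min_element(top_k_keys.begin(), top_k_keys.end());
  const int32_t max_key = *std::max_element(top_k_keys.begin(), top_k_keys.end());

  // Dense lookup table: (key - min_key) -> column index, or the invalid marker.
  std::vector<int16_t> key_to_col(static_cast<std::size_t>(max_key - min_key + 1),
                                  kInvalidColIdxSketch);
  for (std::size_t col = 0; col < top_k_keys.size(); ++col) {
    key_to_col[static_cast<std::size_t>(top_k_keys[col] - min_key)] =
        static_cast<int16_t>(col);
  }

  const std::size_t num_cols = top_k_keys.size() + (include_others ? 1 : 0);
  std::vector<std::vector<double>> encoded(num_cols,
                                           std::vector<double>(keys.size(), 0.0));
  for (std::size_t row = 0; row < keys.size(); ++row) {
    const int32_t key = keys[row];
    int16_t col = kInvalidColIdxSketch;
    if (key >= min_key && key <= max_key) {
      col = key_to_col[static_cast<std::size_t>(key - min_key)];
    }
    if (col != kInvalidColIdxSketch) {
      encoded[static_cast<std::size_t>(col)][row] = 1.0;
    } else if (include_others) {
      encoded.back()[row] = 1.0;  // unmatched keys fall into the "other" column
    }
  }
  return encoded;
}
```

The dense table trades memory proportional to the key range for a constant-time, branch-light lookup per row, which is a plausible reason for computing the minimum and maximum keys first.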
NEVER_INLINE HOST std::vector< OneHotEncodedCol< F > > TableFunctions_Namespace::OneHotEncoder_Namespace::one_hot_encode | ( | const ColumnList< TextEncodingDict > & | text_cols, |
const std::vector< TableFunctions_Namespace::OneHotEncoder_Namespace::OneHotEncodingInfo > & | one_hot_encoding_infos | ||
) |
One-hot encode multiple columns of text-encoded data in a column list, given a vector of one-hot encoding information for each column.
F |
text_cols | - list of input TextEncodingDict columns |
one_hot_encoding_infos | - structs of parameters for each column specifying how to encode the column |
Definition at line 320 of file OneHotEncoder.cpp.
References ColumnList< TextEncodingDict >::num_rows_, ColumnList< TextEncodingDict >::numCols(), ColumnList< TextEncodingDict >::ptrs_, and ColumnList< TextEncodingDict >::string_dict_proxies_.
constexpr int16_t TableFunctions_Namespace::OneHotEncoder_Namespace::INVALID_COL_IDX {-1} |