Feature columns are used to specify how Tensors received from the
input function should be combined and transformed before entering the
model. A feature column can be a plain mapping to some input column
(e.g. column_numeric()
for a column of numerical data), or
a transformation of other feature columns
(e.g. column_crossed()
to define a new column as the cross
of two other feature columns).
The following feature columns are available:
Feature Column | Description |
---|---|
column_categorical_with_vocabulary_list() |
Construct a Categorical Column with In-Memory Vocabulary. |
column_categorical_with_vocabulary_file() |
Construct a Categorical Column with a Vocabulary File. |
column_categorical_with_identity() |
Construct a Categorical Column that Returns Identity Values. |
column_categorical_with_hash_bucket() |
Represents Sparse Feature where IDs are set by Hashing. |
column_categorical_weighted() |
Construct a Weighted Categorical Column. |
column_indicator() |
Represents Multi-Hot Representation of Given Categorical Column. |
column_numeric() |
Construct a Real-Valued Column. |
column_embedding() |
Construct a Dense Column. |
column_crossed() |
Construct a Crossed Column. |
column_bucketized() |
Construct a Bucketized Column. |
Some typical mappings of R data types to feature column are:
Data Type | Feature Column |
---|---|
Numeric | column_numeric() |
Factor | column_categorical_with_identity() |
Character | column_categorical_with_hash_bucket() |
We’ll use the flights dataset from the nycflights13 package to explore how feature columns can be constructed. The flights dataset records airline on-time data for all flights departing NYC in 2013.
> print(flights)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
For example, we can define numeric columns based on the
dep_time
and dep_delay
variables:
You can also define multiple feature columns at once.
Often, you will find that you want to generate a number of feature
column definitions based on some pattern existing in the names of your
data set. tfestimators uses the tidyselect package to
make it easy to define feature columns, similar to what you might be
familiar with in the dplyr
package. You can use the
names =
argument of feature_columns()
function
to define a context from which variable names will be selected.
For example, we can use the ends_with()
helper to assert
that all columns ending with "time"
are numeric columns as
follows:
The names
parameter can either be a character vector
with the names as-is, or any named R object.
If the code you are using to compose columns is more complicated, or
if you need to save references to columns for use in column embeddings
you can also establish a scope for given set of column names using the
with_columns()
function:
You can also use an alternate syntax of the form
(pattern) ~ (column)
, which can add clarity when longer
pattern rules are used, as it separates the matching rule from the
column definition:
Available pattern matching operators include:
Operator | Description |
---|---|
starts_with() |
Starts with a prefix |
ends_with() |
Ends with a suffix |
contains() |
Contains a literal string |
matches() |
Matches a regular expression |
one_of() |
Included in character vector |
everything() |
All columns |
See help("select_helpers", package = "tidyselect")
for
full information on the set of helpers made available by the
tidyselect package.