--- title: "Feature Columns" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Feature Columns} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} type: docs repo: https://github.com/rstudio/tfestimators menu: main: name: "Feature Columns" identifier: "tfestimators-feature-columns" parent: "tfestimators-using-tfestimators" weight: 40 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE) ``` ## Overview Feature columns are used to specify how Tensors received from the input function should be combined and transformed before entering the model. A feature column can be a plain mapping to some input column (e.g. `column_numeric()` for a column of numerical data), or a transformation of other feature columns (e.g. `column_crossed()` to define a new column as the cross of two other feature columns). The following feature columns are available: | Feature Column | Description | |---------------------------------------|----------------------------------------------------------------| | `column_categorical_with_vocabulary_list()` | Construct a Categorical Column with In-Memory Vocabulary. | | `column_categorical_with_vocabulary_file()` | Construct a Categorical Column with a Vocabulary File. | | `column_categorical_with_identity()` | Construct a Categorical Column that Returns Identity Values. | | `column_categorical_with_hash_bucket()` | Represents Sparse Feature where IDs are set by Hashing. | | `column_categorical_weighted()` | Construct a Weighted Categorical Column. | | `column_indicator()` | Represents Multi-Hot Representation of Given Categorical Column. | | `column_numeric()` | Construct a Real-Valued Column. | | `column_embedding()` | Construct a Dense Column. | | `column_crossed()` | Construct a Crossed Column. | | `column_bucketized()` | Construct a Bucketized Column. | Some typical mappings of R data types to feature column are: | Data Type | Feature Column | |------------|-----------------------------------------| | Numeric | `column_numeric()` | | Factor | `column_categorical_with_identity()` | | Character | `column_categorical_with_hash_bucket()` | We'll use the *flights* dataset from the [nycflights13](https://cran.r-project.org/package=nycflights13) package to explore how feature columns can be constructed. The *flights* dataset records airline on-time data for all flights departing NYC in 2013. ```{r} library(nycflights13) print(flights) ``` ``` > print(flights) # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 # ... with 336,766 more rows, and 4 more variables: distance , hour , minute , time_hour ``` For example, we can define numeric columns based on the `dep_time` and `dep_delay` variables: ```{r} cols <- feature_columns( column_numeric("dep_time"), column_numeric("dep_delay") ) ``` You can also define multiple feature columns at once. ```{r} cols <- feature_columns( column_numeric("dep_time", "dep_delay") ) ``` ## Pattern Matching Often, you will find that you want to generate a number of feature column definitions based on some pattern existing in the names of your data set. **tfestimators** uses the [tidyselect](https://github.com/r-lib/tidyselect) package to make it easy to define feature columns, similar to what you might be familiar with in the `dplyr` package. You can use the `names = ` argument of `feature_columns()` function to define a context from which variable names will be selected. For example, we can use the `ends_with()` helper to assert that all columns ending with `"time"` are numeric columns as follows: ```{r} library(nycflights13) cols <- feature_columns(names = flights, column_numeric(ends_with("time")) ) ``` The `names` parameter can either be a character vector with the names as-is, or any named R object. If the code you are using to compose columns is more complicated, or if you need to save references to columns for use in column embeddings you can also establish a scope for given set of column names using the `with_columns()` function: ```{r} cols <- with_columns(flights, { feature_columns( column_numeric(ends_with("time")) ) }) ``` You can also use an alternate syntax of the form `(pattern) ~ (column)`, which can add clarity when longer pattern rules are used, as it separates the matching rule from the column definition: ```{r} cols <- with_columns(flights, { feature_columns( ends_with("time") ~ column_numeric(), ) }) ``` Available pattern matching operators include: | Operator | Description | |---------------------------------------|----------------------------------------------------------------| | `starts_with()` | Starts with a prefix | | `ends_with()` | Ends with a suffix | | `contains()` | Contains a literal string | | `matches()` | Matches a regular expression | | `one_of()` | Included in character vector | | `everything()` | All columns | See `help("select_helpers", package = "tidyselect")` for full information on the set of helpers made available by the **tidyselect** package.