47  Manual Text Features

When we talk about manual text features, we mean hand-crafted metrics or counts based on the text. You will be able to find some off-the-shelf features that fit into this category, but generally this is where you can use your domain knowledge to extract useful information.

Typical off-the-shelf features are generally counts: the number of words, sentences, line breaks, commas, hashtags, emojis, and punctuation marks. Many of these are proxies for text length, so some kind of normalization is useful. Normalizing in this setting is typically done by dividing by the text length, which changes the interpretation: instead of the “number of words”, we now get roughly “the inverse of average word length”.
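As a minimal sketch of the normalization described above, the following Python snippet counts a few surface features and divides each by the text length in characters. The function name `count_features`, the whitespace-based definition of a word, and the example string are assumptions made for illustration only.

```python
def count_features(text: str) -> dict:
    """Raw counts plus length-normalized versions (illustrative sketch)."""
    n_chars = len(text)
    counts = {
        "n_words": len(text.split()),    # assumption: words are whitespace-separated
        "n_commas": text.count(","),
        "n_hashtags": text.count("#"),   # crude: counts '#' characters
        "n_linebreaks": text.count("\n"),
    }
    features = dict(counts)
    for name, value in counts.items():
        # Guard against empty strings to avoid dividing by zero.
        features[name + "_per_char"] = value / n_chars if n_chars else 0.0
    return features


example = "Oil paint on canvas, signed, dated #1889"
features = count_features(example)
print(features["n_words"], features["n_words_per_char"])
```

Here the example has 7 words in 40 characters, so `n_words_per_char` is 0.175. Because spaces and punctuation are included in the length, this is only approximately the inverse of the average word length.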

The above features are easy to calculate and are therefore not hard to include in your model. Going beyond them, however, is where creativity and domain knowledge shine!

TODO

find a good reference for “What is a word?”

One skill you might need when working with these hand-crafted features is a working knowledge of regular expressions.
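To make the role of regular expressions concrete, here is a small Python sketch using the standard-library `re` module. The patterns and the feature names (`n_years`, `n_all_caps_words`, `has_dimension`) are hypothetical examples of what a domain-driven feature could look like, not a recommended set.

```python
import re

# Assumption: these patterns are illustrative; real ones depend on your domain.
YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")                  # four-digit years 1500-2099
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")                       # words written entirely in capitals
DIMENSION = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mm|cm|m)\b")    # e.g. "30 cm", "2.5m"


def regex_features(text: str) -> dict:
    """Turn regex matches into numeric features for a single text."""
    return {
        "n_years": len(YEAR.findall(text)),
        "n_all_caps_words": len(ALL_CAPS.findall(text)),
        "has_dimension": int(DIMENSION.search(text) is not None),
    }


print(regex_features("Bronze on wooden base, 30 cm, cast 1962, SIGNED AND DATED"))
```

Each pattern encodes one piece of domain knowledge, which keeps the resulting features easy to interpret and easy to adjust when the pattern turns out to be too strict or too loose.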

47.2 Pros and Cons

47.2.1 Pros

  • Clear and actionable features
  • High interpretability

47.2.2 Cons

  • Can be time-consuming to create
  • Computational speed depends on the feature
  • Will likely need to

47.3 R Examples

TODO

find a better data set

The textfeatures package is one R package that provides a collection of general-purpose text features, which may or may not be useful for your problem.

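library(textfeatures)
library(modeldata)

# word_dims = 0 turns off the word-embedding dimensions, and
# verbose = FALSE suppresses progress messages
textfeatures(modeldata::tate_text$medium, word_dims = 0,
             verbose = FALSE) |>
  dplyr::glimpse()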
Rows: 4,284
Columns: 34
$ n_urls           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_urls        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_hashtags       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_hashtags    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_mentions       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_mentions    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_chars          <dbl> 1.39951429, -0.91391731, -0.91391731, -0.91391731, -0…
$ n_uq_chars       <dbl> 1.3463720, -0.2656168, -0.2656168, -0.2656168, -0.585…
$ n_commas         <dbl> 1.2867430, -0.6470182, -0.6470182, -0.6470182, -0.647…
$ n_digits         <dbl> -0.2800874, -0.2800874, -0.2800874, -0.2800874, -0.28…
$ n_exclaims       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_extraspaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_lowers         <dbl> 1.348546506, -0.912127069, -0.912127069, -0.912127069…
$ n_lowersp        <dbl> -1.0518721, 0.1014681, 0.1014681, 0.1014681, 0.352584…
$ n_periods        <dbl> -0.04324894, -0.04324894, -0.04324894, -0.04324894, -…
$ n_words          <dbl> 1.3593949, -0.7937823, -0.7937823, -0.7937823, -0.162…
$ n_uq_words       <dbl> 1.4230658, -0.7920930, -0.7920930, -0.7920930, -0.142…
$ n_caps           <dbl> -0.04050572, -0.04050572, -0.04050572, -0.04050572, -…
$ n_nonasciis      <dbl> -0.02646899, -0.02646899, -0.02646899, -0.02646899, -…
$ n_puncts         <dbl> 5.6233563, -0.2031327, -0.2031327, -0.2031327, -0.203…
$ n_capsp          <dbl> -1.2508524, 0.8890397, 0.8890397, 0.8890397, 0.538919…
$ n_charsperword   <dbl> 1.09976675, -0.87544061, -0.87544061, -0.87544061, -1…
$ sent_afinn       <dbl> 0.01511448, 0.01511448, 0.01511448, 0.01511448, 0.015…
$ sent_bing        <dbl> -0.07864915, -0.07864915, -0.07864915, -0.07864915, -…
$ sent_syuzhet     <dbl> -0.1334035, -0.1334035, -0.1334035, -0.1334035, -0.13…
$ sent_vader       <dbl> -0.06711618, -0.06711618, -0.06711618, -0.06711618, -…
$ n_polite         <dbl> 0.05597655, 0.05597655, 0.05597655, 0.05597655, 0.055…
$ n_first_person   <dbl> -0.01527831, -0.01527831, -0.01527831, -0.01527831, -…
$ n_first_personp  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_second_person  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_second_personp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_third_person   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_tobe           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_prepositions   <dbl> -2.3324219, 0.3482094, 0.3482094, 0.3482094, 0.348209…
TODO

Come up with domain-specific examples

47.4 Python Examples
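
As a minimal sketch using only the Python standard library, the snippet below computes a few counts for a small list of texts and then centers and scales each feature column, similar in spirit to the values in the R output above, which appear to be centered and scaled. The texts, the function names `raw_features` and `standardize`, and the choice of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Assumption: a tiny made-up corpus standing in for a real text column.
texts = [
    "Oil paint on canvas",
    "Graphite, watercolour and ink on paper",
    "Bronze",
    "Screenprint on paper, 2 parts",
]


def raw_features(text: str) -> dict:
    """Simple counts for one text."""
    words = text.split()  # assumption: whitespace tokenization
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_commas": text.count(","),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_caps": sum(ch.isupper() for ch in text),
    }


def standardize(rows: list[dict]) -> list[dict]:
    """Center and scale each feature; constant columns become 0."""
    scaled = [dict() for _ in rows]
    for name in rows[0]:
        column = [row[name] for row in rows]
        mu, sigma = mean(column), pstdev(column)  # population standard deviation
        for out, value in zip(scaled, column):
            out[name] = (value - mu) / sigma if sigma > 0 else 0.0
    return scaled


rows = [raw_features(t) for t in texts]
scaled = standardize(rows)
for text, row in zip(texts, scaled):
    print(text, {k: round(v, 2) for k, v in row.items()})
```

Columns that never vary, such as `n_caps` in this toy corpus, carry no information and come out as all zeros, which mirrors the many all-zero columns in the R output.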