Tuesday, September 10, 2024

State-of-the-art NLP models from R



Introduction

The Transformers repository from “Hugging Face” contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow & Keras.

For this purpose, users usually need to get:

  • The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
  • The tokenizer object
  • The weights of the model

In this post, we will work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and ELECTRA.

However, readers should know that one can work with transformers on a variety of downstream tasks, such as:

  1. feature extraction
  2. sentiment analysis
  3. text classification
  4. question answering
  5. summarization
  6. translation and many more.

Prerequisites

Our first job is to install the transformers package via reticulate.

reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load standard ‘Keras’, ‘TensorFlow’ >= 2.0 and some classic libraries from R.
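
The loading step might look like the sketch below. The package list is an assumption, inferred from the functions used later in this post (`%>%`, `tensor_slices_dataset()`, `layer_input()`); the object name `transformer` is taken from the calls that follow.

```r
# Assumed package set, inferred from the functions used later in the post
library(keras)       # layer_input(), layer_dense(), keras_model(), fit()
library(tensorflow)  # provides the `tf` object
library(tfdatasets)  # tensor_slices_dataset(), dataset_batch(), ...
library(dplyr)       # the %>% pipe

# Bind the Python 'transformers' module to the R object used throughout
transformer = reticulate::import('transformers')
```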

Note that if running TensorFlow on GPU, one could specify the following parameters in order to avoid memory issues.

physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)

tf$keras$backend$set_floatx('float32')

Template

We already mentioned that to train a specific model on our data, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:

# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let’s load the dataset and take a sample for fast model training.
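
A minimal sketch of this step, assuming the `movie_review` dataset that ships with text2vec. The column names `comment_text` and `target` are chosen to match the code further down; the sample size of 2000 and the seed are assumptions made for illustration.

```r
library(dplyr)

# movie_review ships with text2vec and has the columns id, sentiment, review
data("movie_review", package = "text2vec")

set.seed(42)  # assumed seed, only to make the sample reproducible

# Rename to the column names used later in the post and take a small sample
df = movie_review %>%
  rename(target = sentiment, comment_text = review) %>%
  sample_n(2000)  # assumed sample size
```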

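Split our data into 2 parts:

# sample 80% of the row indices for training
idx_train = sample.int(nrow(df), size = floor(nrow(df) * 0.8))

train = df[idx_train,]
# negative indexing keeps the remaining 20% for testing
test = df[-idx_train,]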
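
As a quick sanity check of the split logic, the sketch below uses a small hypothetical data frame `toy` (not part of the original post). It shows that negative indexing returns exactly the rows left out of the training sample, while logical negation of a positive integer index selects nothing.

```r
# Hypothetical toy data, used only to illustrate the indexing
toy = data.frame(id = 1:10)

set.seed(1)  # assumed seed
idx = sample.int(nrow(toy), size = floor(nrow(toy) * 0.8))

toy_train = toy[idx, , drop = FALSE]
toy_test  = toy[-idx, , drop = FALSE]  # rows not drawn for training

# The two parts are disjoint and together cover every row
stopifnot(nrow(toy_train) == 8, nrow(toy_test) == 2)
stopifnot(setequal(c(toy_train$id, toy_test$id), toy$id))

# `!idx` turns every non-zero index into FALSE, so no rows are selected
stopifnot(nrow(toy[!idx, , drop = FALSE]) == 0)
```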

Data input for Keras

Until now, we’ve just covered data import and train-test split. To feed input to the network we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.

However, we want to train our data for 3 models: GPT-2, RoBERTa, and Electra. We need to write a loop for that.

Note: one model in general requires 500-700 MB

# list of 3 models
ai_m = list(
  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
  c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
  c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {
  
  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>% 
    rlang::parse_expr() %>% eval()
  
  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
    rlang::parse_expr() %>% eval()
  # inputs
  text = list()
  # outputs
  label = list()
  
  # tokenize each comment and collect token ids and labels as matrices
  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      
      txt = tokenizer$encode(data[['comment_text']][i],max_length = max_len, 
                             truncation=TRUE) %>% 
        t() %>% 
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()
      
      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix,text), do.call(plyr::rbind.fill.matrix,label))
  }
  
  train_ = data_prep(train)
  test_ = data_prep(test)
  
  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]],train_[[2]])) %>% 
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
    dataset_prefetch(tf$data$experimental$AUTOTUNE)
  
  tf_test = tensor_slices_dataset(list(test_[[1]],test_[[2]])) %>% 
    dataset_batch(batch_size = batch_size)
  
  # create an input layer
  input = layer_input(shape=c(max_len), dtype='int32')
  # average the transformer's token-level output, then add a small dense layer
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis=1L) %>% 
    layer_dense(64,activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units=1, activation='sigmoid')
  model = keras_model(inputs=input, outputs = output)
  
  # compile with AUC score
  model %>% compile(optimizer= tf$keras$optimizers$Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits=FALSE),
                    metrics = tf$metrics$AUC())
  
  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs=epochs, #steps_per_epoch=len/batch_size,
                validation_data=tf_test)
  # store the training history under the model's name
  gather_history[[i]] = history
  names(gather_history)[i] = ai_m[[i]][1]
}



Extract results to see the benchmarks:
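
One way to do this is sketched below. It is not necessarily the code used to produce the original benchmarks: the helper name `history_to_table` is hypothetical, and the sketch assumes that each element of `gather_history` exposes a `metrics` list holding one numeric vector per metric, with one value per epoch.

```r
# Hypothetical helper: flatten a named list of training histories
# into one long data frame (one row per model, metric and epoch).
history_to_table = function(gather_history) {
  rows = lapply(names(gather_history), function(model_name) {
    # assumed structure: a named list of numeric vectors, one value per epoch
    metrics = gather_history[[model_name]][["metrics"]]
    data.frame(
      model  = model_name,
      metric = rep(names(metrics), times = lengths(metrics)),
      epoch  = unlist(lapply(metrics, seq_along), use.names = FALSE),
      value  = unlist(metrics, use.names = FALSE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

Calling `history_to_table(gather_history)` after the loop would then give one table covering all three models. Metric names are whatever Keras reports; when several models are built in one session, the AUC metric may appear as `auc`, `auc_1`, and so on.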

Both the RoBERTa and Electra models show some additional improvements after 2 epochs of training, which cannot be said of GPT-2. In this case, it is clear that it can be enough to train a state-of-the-art model even for a single epoch.

Conclusion

In this post, we showed how to use state-of-the-art NLP models from R.
To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial.

We encourage readers to try out these models and share their results below in the comments section!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/henry090/transformers, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Abdullayev (2020, July 30). Posit AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/

BibTeX citation

@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {Posit AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}