You must run setup_word2vec() at the beginning of every session; otherwise you will encounter errors and be prompted to do so.
library(word2vec.r)
# setup word2vec Julia dependency
setup_word2vec()
#> Julia version 1.1.1 at location /home/jp/Downloads/julia-1.1.1-linux-x86_64/julia-1.1.1/bin will be used.
#> Loading setup script for JuliaCall...
#> Finish loading setup script for JuliaCall.
The package comes with a dataset: Macbeth by Shakespeare. Being a corpus of 17,319 words, it is not lazily loaded and needs to be imported manually with the data
function. Note that the dataset is mildly preprocessed: all words are lowercase and punctuation has been removed.
data("macbeth", package = "word2vec.r")
With the data loaded we can train a model and extract the word vectors.
model_path <- word2vec(macbeth) # train model
model <- word_vectors(model_path) # get word vectors
There is then a multitude of functions one can use on the model; a few that are not demonstrated later are sketched after the list below.
get_vector
vocabulary
in_vocabulary
size
index
cosine
cosine_similar_words
similarity
analogy
analogy_words
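Of these, size, vocabulary, index, cosine, and cosine_similar_words are demonstrated below. The others follow the same pattern; a minimal sketch, assuming the arguments mirror the underlying Julia Word2Vec API (single-word lookups, and vectors of positive and negative words plus a count for analogies):
# check whether a word was seen during training (assumed signature)
in_vocabulary(model, "macbeth")
# cosine similarity between two words (assumed signature)
similarity(model, "king", "queen")
# classic analogy, king - man + woman (assumed signature)
analogy_words(model, c("king", "woman"), c("man"), 5L)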
All are well documented and come with examples; visit their respective man pages, e.g.: ?get_vector
. Note that all the functions listed above require the output of word_vectors
(the model
object in our case) as their first argument; a convenient reference class therefore also exists.
# words similar to king
cosine_similar_words(model, "king", 5L)
#> [1] "king" "yet" "rosse" "and" "from"
# size of model
size(model)
#> # A tibble: 1 x 2
#> length words
#> <int> <int>
#> 1 100 511
# get vocabulary
vocab <- vocabulary(model)
head(vocab)
#> [1] "</s>" "the" "and" "to" "i" "of"
# index of word macbeth
idx <- index(model, "macbeth")
vocab[idx]
#> [1] "macbeth"
Because everything depends on the word vectors (the model
object in our case), the package provides a reference class that avoids having to repeatedly pass said model as the first argument to every function.
wv <- WordVectors$new(model)
wv$get_vector("macbeth")
#> [1] 0.060261002 -0.099035711 0.089729781 -0.131802295 0.054377451
#> [6] -0.037143736 0.034620966 0.013240862 -0.046540684 0.216212103
#> [11] 0.062457392 0.184753333 -0.145580993 -0.073077572 0.003581891
#> [16] 0.196514459 0.120471438 0.053800082 0.140348820 0.136948506
#> [21] -0.031166868 0.050116140 0.124824226 -0.164362478 -0.005901479
#> [26] -0.092760047 0.007204235 0.018347540 0.167392284 0.069425808
#> [31] -0.069325596 -0.015448745 -0.162950776 0.053471405 -0.045633260
#> [36] -0.001947240 0.099237974 -0.089840106 0.039227503 -0.065056930
#> [41] -0.008485846 0.145537322 -0.139205575 -0.139850518 0.056980666
#> [46] 0.111228944 -0.021939085 -0.019801994 -0.056312279 -0.061632252
#> [51] -0.020469921 -0.082446020 0.063011317 -0.054398137 -0.014775302
#> [56] -0.034342855 0.161132708 -0.078148394 -0.020963626 -0.218176811
#> [61] 0.083799342 -0.134379308 -0.029002656 -0.061294381 -0.073210881
#> [66] 0.175620705 0.110194186 -0.070679837 0.158414571 0.084456696
#> [71] 0.005258834 -0.020719532 0.093608631 -0.102572553 -0.169927925
#> [76] -0.009947655 -0.173633013 0.014916886 -0.194387940 -0.147128763
#> [81] 0.028671680 0.025153226 -0.046986122 -0.079825796 0.074098080
#> [86] 0.187334483 0.049447294 -0.055605278 0.227896461 -0.056259874
#> [91] -0.039628351 0.077658366 0.016895844 0.038136662 0.047202176
#> [96] 0.088280384 0.065880691 0.166309718 0.009127571 -0.059123733
wv$cosine("rosse")
#> # A tibble: 10 x 2
#> index cosine
#> <int> <dbl>
#> 1 67 1
#> 2 106 1.000
#> 3 91 1.000
#> 4 51 1.000
#> 5 54 1.000
#> 6 56 1.000
#> 7 115 1.000
#> 8 13 1.000
#> 9 3 1.000
#> 10 10 1.000
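Since cosine returns row indices into the vocabulary rather than the words themselves, the matches can be mapped back to words using the vocabulary obtained earlier:
# map the indices returned by cosine back to the corresponding words
sims <- wv$cosine("rosse")
vocab[sims$index]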