JudiLing can call the Python package pyndl internally to compute NDL models. pyndl computes the mapping matrices from event files, which have to be generated manually or with pyndl in Python (see the pyndl documentation). The advantage of calling pyndl from JudiLing is that the resulting weights and the cue and semantic matrices can be translated directly into JudiLing format, so further processing can be done in JudiLing.
For pyndl to be available in JudiLing, PyCall has to be imported before JudiLing:
using PyCall
using JudiLing
Calling pyndl from JudiLing
JudiLing.Pyndl_Weight_Struct — Type
Pyndl_Weight_Struct
    cues::Vector{String}
    outcomes::Vector{String}
    weight::Matrix{Float64}
cues::Vector{String}: Vector of cues, in the order that they appear in the weight matrix.
outcomes::Vector{String}: Vector of outcomes, in the order that they appear in the weight matrix.
weight::Matrix{Float64}: Weight matrix.
JudiLing.pyndl — Method
pyndl(
data_path::String;
alpha::Float64 = 0.1,
betas::Tuple{Float64,Float64} = (0.1, 0.1),
method::String = "openmp"
)
Compute weights using pyndl. See the documentation of pyndl for more information: https://pyndl.readthedocs.io/en/latest/
Obligatory arguments
data_path::String: Path to an events file as generated by pyndl's preprocess.create_event_file.
Optional arguments
alpha::Float64 = 0.1: α learning rate.
betas::Tuple{Float64,Float64} = (0.1, 0.1): β1 and β2 learning rates.
method::String = "openmp": One of {"openmp", "threading"}. "openmp" only works on Linux.
Example
weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
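For readers who generate the events file manually, the following is a minimal sketch of how the event lines could be assembled in Julia. It assumes the pyndl event layout of a `cues<TAB>outcomes` header followed by one event per line, with cues and outcomes each joined by underscores. The helper names (`ngram_cues`, `event_line`, `write_events`) and the toy words and features are hypothetical and are not part of JudiLing or pyndl.

```julia
# Sketch only: hypothetical helpers for writing pyndl-style event lines by hand.

# Split a word into overlapping n-grams, padded with a boundary marker.
function ngram_cues(word; grams = 3, boundary = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+grams-1]) for i in 1:(length(chars) - grams + 1)]
end

# One learning event: cues and outcomes are each joined with "_",
# and the two fields are separated by a tab.
function event_line(cues, outcomes)
    return join(cues, "_") * "\t" * join(outcomes, "_")
end

# Write the header plus one event per (word, features) pair to `io`.
function write_events(io::IO, words, features; grams = 3)
    println(io, "cues\toutcomes")
    for (word, feats) in zip(words, features)
        println(io, event_line(ngram_cues(word; grams = grams), feats))
    end
end

# Toy input, made up for illustration.
words = ["vocat", "vocas"]
features = [["vocare", "p3", "sg"], ["vocare", "p2", "sg"]]

buf = IOBuffer()
write_events(buf, words, features)
events_text = String(take!(buf))
print(events_text)
```

The sketch writes plain text to a buffer. The example path above ends in `.tab.gz`, so the text would still have to be written to disk and gzip-compressed before being passed to `JudiLing.pyndl`. Because the underscore acts as the separator, cues or outcomes that themselves contain an underscore would need a different encoding.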
Translating output of pyndl to cue and semantic matrices in JudiLing
With the weights in hand, the cue and semantic matrices can be computed:
JudiLing.make_cue_matrix — Method
make_cue_matrix(
data::DataFrame,
pyndl_weights::Pyndl_Weight_Struct;
grams = 3,
target_col = "Words",
tokenized = false,
sep_token = nothing,
keep_sep = false,
start_end_token = "#",
verbose = false,
)
Make the cue matrix based on a dataframe and weights computed with pyndl. Practically this means that the cues are extracted from the weights object and translated to the JudiLing format.
Obligatory arguments
data::DataFrame: Dataset with all the word types on which the weights were trained.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl.
Optional arguments
grams = 3: N-gram size (has to match the n-gram granularity of the cues on which the weights were trained).
target_col = "Words": Column with target words.
tokenized = false: Whether the target words are already tokenized.
sep_token = nothing: The string separating the tokens (only used if tokenized=true).
keep_sep = false: Whether the sep_token should be retained in the cues.
start_end_token = "#": The string with which to mark word boundaries.
verbose = false: Verbose mode.
Example
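using CSV, DataFrames
# make_cue_matrix expects a DataFrame, so load the dataset first
data = DataFrame(CSV.File("latin_train.csv"))
weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
cue_obj = JudiLing.make_cue_matrix(data, weights,
    grams = 3,
    target_col = "Word")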
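To illustrate what it means to translate the cues from the weights object into a cue matrix, here is a minimal sketch that builds a binary word-by-cue matrix whose columns follow a given cue order. It is not JudiLing's implementation: `cue_order` stands in for the `cues` field of a `Pyndl_Weight_Struct`, the layout (rows are words, columns are cues) is an assumption, and the helper names and toy words are hypothetical.

```julia
# Sketch only: a binary word-by-cue matrix with columns in a fixed cue order.

# Split a word into overlapping n-grams, padded with a boundary marker.
function ngram_cues(word; grams = 3, boundary = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+grams-1]) for i in 1:(length(chars) - grams + 1)]
end

function binary_cue_matrix(words, cue_order; grams = 3, boundary = "#")
    # Map each cue to its column index, following the given order.
    col = Dict(cue => j for (j, cue) in enumerate(cue_order))
    C = zeros(Int, length(words), length(cue_order))
    for (i, word) in enumerate(words)
        for cue in ngram_cues(word; grams = grams, boundary = boundary)
            # Cues without a column (not seen in training) are skipped here.
            haskey(col, cue) && (C[i, col[cue]] = 1)
        end
    end
    return C
end

# Toy cue order, made up for illustration.
cue_order = ["#vo", "voc", "oca", "cat", "at#", "cas", "as#"]
C = binary_cue_matrix(["vocat", "vocas"], cue_order)
```

Calling the same helper with `grams = 2` on this cue order yields a row of zeros, because none of the bigrams has a column. This is why the n-gram size has to match the granularity of the cues the weights were trained on.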
JudiLing.make_S_matrix — Method
make_S_matrix(
data::DataFrame,
pyndl_weights::Pyndl_Weight_Struct,
n_features_columns::Vector;
tokenized::Bool=false,
sep_token::String="_"
)
Create semantic matrix based on a dataframe and weights computed with pyndl. Practically this means that the semantic features are extracted from the weights object and translated to the JudiLing format.
Obligatory arguments
data::DataFrame: The dataset with word types.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl.
n_features_columns::Vector: Vector of columns with the features in the dataset.
Optional arguments
tokenized=false: Whether the features in the n_features_columns columns are already tokenized (e.g. "feature1_feature2_feature3").
sep_token="_": The string with which the features are separated (only used if tokenized=true).
Example
# data: DataFrame with the word types
weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
S = JudiLing.make_S_matrix(data,
    weights,
    ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
    tokenized=false)
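The roles of `tokenized` and `sep_token` can be sketched with a small stand-alone example. It builds a binary word-by-feature matrix whose columns follow a given outcome order, which is one plausible reading of translating the semantic features into matrix form and not JudiLing's implementation. `outcome_order` stands in for the `outcomes` field of a `Pyndl_Weight_Struct`, the helper name and toy features are hypothetical, and the sketch assumes that `sep_token` is applied only when `tokenized` is true.

```julia
# Sketch only: a binary word-by-feature matrix with columns in a fixed outcome order.

function binary_feature_matrix(feature_rows, outcome_order;
                               tokenized = false, sep_token = "_")
    # Map each feature to its column index, following the given order.
    col = Dict(f => j for (j, f) in enumerate(outcome_order))
    S = zeros(Int, length(feature_rows), length(outcome_order))
    for (i, row) in enumerate(feature_rows)
        for entry in row
            # A tokenized entry packs several features into one string.
            feats = tokenized ? String.(split(entry, sep_token)) : [entry]
            for f in feats
                # Features without a column are skipped here.
                haskey(col, f) && (S[i, col[f]] = 1)
            end
        end
    end
    return S
end

# Toy outcome order and feature rows, made up for illustration.
outcome_order = ["vocare", "p3", "p2", "sg"]
rows_plain = [["vocare", "p3", "sg"], ["vocare", "p2", "sg"]]
rows_tokenized = [["vocare", "p3_sg"], ["vocare", "p2_sg"]]

S_plain = binary_feature_matrix(rows_plain, outcome_order)
S_tokenized = binary_feature_matrix(rows_tokenized, outcome_order; tokenized = true)
```

Both calls produce the same matrix: with one feature per column entry no splitting is needed, while packed entries such as "p3_sg" are split on `sep_token` first.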
JudiLing.make_S_matrix — Method
make_S_matrix(
data_train::DataFrame,
data_val::DataFrame,
pyndl_weights::Pyndl_Weight_Struct,
n_features_columns::Vector;
tokenized::Bool=false,
sep_token::String="_"
)
Create semantic matrix based on a training and validation dataframe and weights computed with pyndl. Practically this means that the semantic features are extracted from the weights object and translated to the JudiLing format.
Obligatory arguments
data_train::DataFrame: The training dataset.
data_val::DataFrame: The validation dataset.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl.
n_features_columns::Vector: Vector of columns with the features in the training and validation datasets.
Optional arguments
tokenized=false: Whether the features in the n_features_columns columns are already tokenized (e.g. "feature1_feature2_feature3").
sep_token="_": The string with which the features are separated (only used if tokenized=true).
Example
# train, val: DataFrames with the training and validation data
weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
S_train, S_val = JudiLing.make_S_matrix(train,
    val,
    weights,
    ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
    tokenized=false)