Pyndl · JudiLing.jl

JudiLing can call the Python package pyndl internally to compute naive discriminative learning (NDL) models. pyndl computes the mapping matrices from event files, which have to be generated manually or with pyndl in Python (see the pyndl documentation: https://pyndl.readthedocs.io/en/latest/). The advantage of calling pyndl from JudiLing is that the resulting weights, cue matrices and semantic matrices can be translated directly into JudiLing format, so that all further processing can be done in JudiLing.
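
For readers who generate event files manually, the following is a minimal sketch of what such a file might contain. It assumes the layout described in the pyndl documentation (gzipped, tab-separated, a header line, items within a column joined by underscores); the toy events, the outcome labels and the helper name `write_events` are hypothetical, and the CodecZlib package is assumed to be installed.

```julia
using CodecZlib  # provides GzipCompressorStream and GzipDecompressorStream

# Toy events (hypothetical): trigram cues paired with semantic-feature outcomes.
events = [
    (["#vo", "voc", "oco", "co#"], ["vocare", "p1", "sg"]),
    (["#vo", "voc", "oca", "cas", "as#"], ["vocare", "p2", "sg"]),
]

# Write a gzipped, tab-separated file with a header line. Items within a
# column are joined by underscores (assumed pyndl event-file layout).
function write_events(path, events)
    open(GzipCompressorStream, path, "w") do io
        println(io, "cues\toutcomes")
        for (cues, outcomes) in events
            println(io, join(cues, "_"), "\t", join(outcomes, "_"))
        end
    end
    return path
end

path = write_events(joinpath(mktempdir(), "toy_events.tab.gz"), events)

# Read the file back to inspect what was written.
lines = open(GzipDecompressorStream, path) do io
    readlines(io)
end
```

For real datasets, pyndl's own preprocessing functions are likely the safer route, since they handle corpus-specific details that this sketch ignores.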

Note

For pyndl to be available in JudiLing, PyCall has to be imported before JudiLing:

using PyCall
using JudiLing

Calling pyndl from JudiLing

JudiLing.Pyndl_Weight_Struct — Type

Pyndl_Weight_Struct
    cues::Vector{String}
    outcomes::Vector{String}
    weight::Matrix{Float64}

cues::Vector{String}: Vector of cues, in the order that they appear in the weight matrix.
outcomes::Vector{String}: Vector of outcomes, in the order that they appear in the weight matrix.
weight::Matrix{Float64}: Weight matrix.
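
Because the cue and outcome vectors follow the order of the weight matrix, they can serve as an index into it. The sketch below uses a stand-in type (`WeightsSketch`, hypothetical) with the same three fields so that it runs without JudiLing or pyndl. The toy values and the outcomes × cues orientation of the matrix are assumptions, so checking `size(weight)` on a real object before indexing is advisable.

```julia
# Stand-in with the same fields as Pyndl_Weight_Struct (hypothetical type name),
# so that the sketch runs without JudiLing or pyndl installed.
struct WeightsSketch
    cues::Vector{String}
    outcomes::Vector{String}
    weight::Matrix{Float64}
end

# Toy values; the outcomes × cues orientation is an assumption.
w = WeightsSketch(
    ["#vo", "voc", "oco"],
    ["vocare", "sg"],
    [0.2 0.1 0.3;
     0.0 0.4 0.1],
)

# Positions in the name vectors double as indices into the matrix.
cue_index = Dict(c => i for (i, c) in enumerate(w.cues))
outcome_index = Dict(o => i for (i, o) in enumerate(w.outcomes))

# Confirm the assumed orientation before indexing.
@assert size(w.weight) == (length(w.outcomes), length(w.cues))

# Association strength between the cue "voc" and the outcome "sg".
association = w.weight[outcome_index["sg"], cue_index["voc"]]
```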


JudiLing.pyndl — Method

pyndl(
    data_path::String;
    alpha::Float64 = 0.1,
    betas::Tuple{Float64,Float64} = (0.1, 0.1),
    method::String = "openmp"
)

Compute weights using pyndl. See the documentation of pyndl for more information: https://pyndl.readthedocs.io/en/latest/

Obligatory arguments

data_path::String: Path to an events file as generated by pyndl's preprocess.create_event_file.

Optional arguments

alpha::Float64 = 0.1: α learning rate.
betas::Tuple{Float64,Float64} = (0.1, 0.1): β1 and β2 learning rates.
method::String = "openmp": One of {"openmp", "threading"}. "openmp" only works on Linux.

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
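
The optional arguments can also be passed explicitly. The sketch below selects the method by platform, since "openmp" is documented as Linux-only. The variable name `ndl_method` is introduced here for illustration, and the call assumes that PyCall, pyndl and the events file are available.

```julia
using PyCall   # has to be loaded before JudiLing for pyndl support
using JudiLing

# "openmp" is documented as Linux-only, so fall back to "threading" elsewhere.
ndl_method = Sys.islinux() ? "openmp" : "threading"

weights = JudiLing.pyndl(
    "data/latin_train_events.tab.gz";
    alpha = 0.1,
    betas = (0.1, 0.1),
    method = ndl_method,
)

# The returned struct carries the cue and outcome names alongside the matrix.
println(length(weights.cues), " cues, ", length(weights.outcomes), " outcomes")
```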


Translating output of pyndl to cue and semantic matrices in JudiLing

With the weights in hand, the cue and semantic matrices can be computed:

JudiLing.make_cue_matrix — Method

make_cue_matrix(
    data::DataFrame,
    pyndl_weights::Pyndl_Weight_Struct;
    grams = 3,
    target_col = "Words",
    tokenized = false,
    sep_token = nothing,
    keep_sep = false,
    start_end_token = "#",
    verbose = false,
)

Make the cue matrix based on a dataframe and weights computed with pyndl. Practically this means that the cues are extracted from the weights object and translated to the JudiLing format.

Obligatory arguments

data::DataFrame: Dataset with all the word types on which the weights were trained.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl

Optional arguments

grams = 3: N-gram size (has to match the n-gram granularity of the cues on which the weights were trained).
target_col = "Words": Column with target words.
tokenized = false: Whether the target words are already tokenized.
sep_token = nothing: The string separating the tokens (only used if tokenized=true).
keep_sep = false: Whether the sep_token should be retained in the cues.
start_end_token = "#": The string with which to mark word boundaries.
verbose = false: Verbose mode.
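
To make the roles of grams and start_end_token concrete, here is a rough sketch of how a word could be split into n-gram cues. The helper `ngram_cues` is illustrative and not part of JudiLing, whose own tokenization may differ in details.

```julia
# Illustrative helper (not part of JudiLing): pad the word with the boundary
# marker on both sides, then slide a window of size `grams` over it.
function ngram_cues(word::AbstractString; grams::Int = 3,
                    start_end_token::AbstractString = "#")
    chars = collect(start_end_token * word * start_end_token)
    return [String(chars[i:i+grams-1]) for i in 1:(length(chars) - grams + 1)]
end

cues = ngram_cues("vocat")
# ["#vo", "voc", "oca", "cat", "at#"]
```

If the events file was built from bigrams instead, grams = 2 would be needed here as well, because the cue strings have to match those stored in the weights.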

Example

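using CSV, DataFrames
data = DataFrame(CSV.File("latin_train.csv"))  # make_cue_matrix expects a DataFrame, not a file path
weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
cue_obj = JudiLing.make_cue_matrix(data, weights,
                                    grams = 3,
                                    target_col = "Word")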


JudiLing.make_S_matrix — Method

make_S_matrix(
    data::DataFrame,
    pyndl_weights::Pyndl_Weight_Struct,
    n_features_columns::Vector;
    tokenized::Bool=false,
    sep_token::String="_"
)

Create semantic matrix based on a dataframe and weights computed with pyndl. Practically this means that the semantic features are extracted from the weights object and translated to the JudiLing format.

Obligatory arguments

data::DataFrame: The dataset with word types.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl.
n_features_columns::Vector: Vector of columns with the features in the dataset.

Optional arguments

tokenized=false: Whether the features in n_features_columns columns are already tokenized (e.g. "feature1_feature2_feature3")
sep_token="_": The string with which the features are separated (only used if tokenized=true).
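
As a small illustration of the tokenized layout mentioned above, a cell such as "feature1_feature2_feature3" decomposes into individual features on sep_token. The variable names are illustrative, and how JudiLing handles such cells internally is not shown here.

```julia
sep_token = "_"
tokenized_cell = "feature1_feature2_feature3"

# Split the cell on the separator to recover the individual features.
features = String.(split(tokenized_cell, sep_token))
# ["feature1", "feature2", "feature3"]
```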

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
S = JudiLing.make_S_matrix(data,
                            weights,
                            ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
                            tokenized=false)


JudiLing.make_S_matrix — Method

make_S_matrix(
    data_train::DataFrame,
    data_val::DataFrame,
    pyndl_weights::Pyndl_Weight_Struct,
    n_features_columns::Vector;
    tokenized::Bool=false,
    sep_token::String="_"
)

Create semantic matrix based on a training and validation dataframe and weights computed with pyndl. Practically this means that the semantic features are extracted from the weights object and translated to the JudiLing format.

Obligatory arguments

data_train::DataFrame: The training dataset.
data_val::DataFrame: The validation dataset.
pyndl_weights::Pyndl_Weight_Struct: Weights trained with JudiLing.pyndl.
n_features_columns::Vector: Vector of columns with the features in the training and validation datasets.

Optional arguments

tokenized=false: Whether the features in n_features_columns columns are already tokenized (e.g. "feature1_feature2_feature3")
sep_token="_": The string with which the features are separated (only used if tokenized=true).

Example

weights = JudiLing.pyndl("data/latin_train_events.tab.gz")
S_train, S_val = JudiLing.make_S_matrix(train,
                            val,
                            weights,
                            ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
                            tokenized=false)
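
Putting the calls on this page together, a single-dataset workflow might look like the sketch below. The helper `ndl_matrices` and its argument names are hypothetical. It assumes that CSV and DataFrames are installed, that pyndl is reachable through PyCall, and that the CSV file and the events file describe the same word types.

```julia
using PyCall      # load before JudiLing so that pyndl support is available
using JudiLing
using CSV, DataFrames

# Hypothetical helper bundling the three documented calls.
function ndl_matrices(csv_path, events_path, feature_cols;
                      target_col = "Word", grams = 3)
    data = DataFrame(CSV.File(csv_path))
    weights = JudiLing.pyndl(events_path)
    cue_obj = JudiLing.make_cue_matrix(data, weights;
                                       grams = grams, target_col = target_col)
    S = JudiLing.make_S_matrix(data, weights, feature_cols; tokenized = false)
    return (weights = weights, cue_obj = cue_obj, S = S)
end

# Possible usage with the file names from the examples above:
# result = ndl_matrices("latin_train.csv", "data/latin_train_events.tab.gz",
#                       ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"])
```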
