JudiLingMeasures.jl

JudiLingMeasures enables easy calculation of measures in Discriminative Lexicon Models developed with JudiLing (Luo, Heitmeier, Chuang and Baayen, 2024).

Most measures are based on the R implementations in WpmWithLdl (Baayen et al., 2018) and LDLConvFunctions (Schmitz, 2021) and the Python implementation in pyldl (Saito, 2022) (but all errors are my own). The conceptual work behind this package is therefore very much an effort of many people (see Bibliography). I have tried to acknowledge where each measure is used/introduced, but if I have missed anything, or you find any errors, please let me know: maria dot heitmeier at uni dot tuebingen dot de.

Installation

using Pkg
Pkg.add(url="https://github.com/quantling/JudiLingMeasures.jl")

Requires JudiLing 0.5.5. Update your JudiLing version by running

using Pkg
Pkg.update("JudiLing")

If this step does not work, i.e. the version of JudiLing is still not 0.5.5, refer to this forum post for a workaround.
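
You can check which version of JudiLing is currently installed with

using Pkg
Pkg.status("JudiLing")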

How to use

For a demo of this package, please see notebooks/measures_demo.ipynb.

Calculating measures in this package

The following gives an overview of all measures available in this package. For a more detailed description of the parameters, please refer to Measures. All measures come with examples. In order to run them, first run the following piece of code, taken from the Readme of the JudiLing package. For a detailed explanation of this code, please refer to the JudiLing Readme and documentation.

using JudiLing
using CSV # read csv files into dataframes
using DataFrames # parse data into dataframes
using JudiLingMeasures

# if you haven't downloaded this file already, get it here:
download("https://osf.io/2ejfu/download", "latin.csv")

latin =
    DataFrame(CSV.File(joinpath(@__DIR__, "latin.csv")));

cue_obj = JudiLing.make_cue_matrix(
    latin,
    grams = 3,
    target_col = :Word,
    tokenized = false,
    keep_sep = false
);

n_features = size(cue_obj.C, 2);
S = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol = n_features
);

G = JudiLing.make_transform_matrix(S, cue_obj.C);
F = JudiLing.make_transform_matrix(cue_obj.C, S);

Chat = S * G;
Shat = cue_obj.C * F;

A = cue_obj.A;
max_t = JudiLing.cal_max_timestep(latin, :Word);

Make sure that you set check_gold_path=true.

res_learn, gpi_learn, rpi_learn = JudiLing.learn_paths_rpi(
    latin,
    latin,
    cue_obj.C,
    S,
    F,
    Chat,
    A,
    cue_obj.i2f,
    cue_obj.f2i, # api changed in 0.3.1
    gold_ind = cue_obj.gold_ind,
    Shat_val = Shat,
    check_gold_path = true,
    max_t = max_t,
    max_can = 10,
    grams = 3,
    threshold = 0.05,
    tokenized = false,
    sep_token = "_",
    keep_sep = false,
    target_col = :Word,
    issparse = :dense,
    verbose = false,
);

Almost all available measures can simply be computed with

all_measures = JudiLingMeasures.compute_all_measures_train(latin, # the data of interest
                                                     cue_obj, # the cue_obj of the training data
                                                     Chat, # the Chat of the data of interest
                                                     S, # the S matrix of the data of interest
                                                     Shat, # the Shat matrix of the data of interest
                                                     F, # the F matrix
                                                     G, # the G matrix
                                                     res_learn_train=res_learn, # the output of learn_paths for the data of interest
                                                     gpi_learn_train=gpi_learn, # the gpi_learn object of the data of interest
                                                     rpi_learn_train=rpi_learn); # the rpi_learn object of the data of interest
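
Assuming the returned all_measures object is a DataFrame with one row per wordform and the measures as additional columns (as in the demo notebook), it can be inspected with the usual DataFrames tools:

first(all_measures, 5) # look at the first five rows
names(all_measures) # list all columns, i.e. all computed measures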

It is also possible to skip the measures based on the learn_paths algorithm, by calling the function without the learn_paths outputs:

all_measures = JudiLingMeasures.compute_all_measures_train(latin, # the data of interest
                                                     cue_obj, # the cue_obj of the training data
                                                     Chat, # the Chat of the training data
                                                     S, # the S matrix of the training data
                                                     Shat, # the Shat matrix of the training data
                                                     F, # the F matrix
                                                     G); # the G matrix

If low_cost_measures_only is set to true, only measures which are computationally relatively lean are computed.
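
For example (a minimal sketch, assuming low_cost_measures_only is passed as a keyword argument to compute_all_measures_train):

all_measures = JudiLingMeasures.compute_all_measures_train(latin,
                                                     cue_obj,
                                                     Chat,
                                                     S,
                                                     Shat,
                                                     F,
                                                     G,
                                                     low_cost_measures_only=true);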

The only measures not computed by JudiLingMeasures.compute_all_measures_train are those which return multiple values for each wordform. These are:

  • "Functional Load"
  • "Semantic Support for Form" with sum_supports=false

It is also possible to compute measures for validation data, please see the measures_demo.ipynb notebook for details.
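
As a rough sketch, the validation workflow mirrors the training one, with both the training and validation objects passed in. The function name, argument order and the *_val objects below are assumptions based on the training variant; the notebook is authoritative:

# latin_val, cue_obj_val, Chat_val, S_val and Shat_val are hypothetical
# validation-data counterparts of the training objects created above
all_measures_val = JudiLingMeasures.compute_all_measures_val(latin_val, # the validation data
                                                       cue_obj, # the cue_obj of the training data
                                                       cue_obj_val, # the cue_obj of the validation data
                                                       Chat_val, # the Chat of the validation data
                                                       S_val, # the S matrix of the validation data
                                                       Shat_val, # the Shat matrix of the validation data
                                                       F, # the F matrix
                                                       G); # the G matrix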

Overview over all available measures

Measures capturing comprehension (processing on the semantic side of the network)

Measures of semantic vector length/uncertainty/activation

  • L1Norm

    Computes the L1-Norm (city-block distance) of the predicted semantic vectors $\hat{S}$, i.e. for each predicted vector $\hat{s}$ the sum of the absolute values of its components, $\|\hat{s}\|_1 = \sum_i |\hat{s}_i|$.

    Example:

    JudiLingMeasures.L1Norm(Shat)

    Used in Schmitz et al. (2021), Stein and Plag (2021) (called Semantic Vector length in their paper), Saito (2022) (called VecLen)

  • L2Norm

    Computes the L2-Norm (Euclidean length) of the predicted semantic vectors $\hat{S}$, i.e. for each predicted vector $\hat{s}$ the quantity $\|\hat{s}\|_2 = \sqrt{\sum_i \hat{s}_i^2}$.

    Example:

    JudiLingMeasures.L2Norm(Shat)

    Used in Schmitz et al. (2021)

Measures of semantic neighbourhood

  • Density

    Computes the average correlation/cosine similarity of each predicted semantic vector in $\hat{S}$ with the $n$ most correlated/closest semantic vectors in $S$:

    Example:

    _, cor_s = JudiLing.eval_SC(Shat, S, R=true)
    correlation_density = JudiLingMeasures.density(cor_s, 10)
    
    cosine_sims = JudiLingMeasures.cosine_similarity(Shat, S)
    cosine_density = JudiLingMeasures.density(cosine_sims, 10)

    Used in Heitmeier et al. (2022) (called Semantic Density, based on Cosine Similarity), Schmitz et al. (2021), Stein and Plag (2021) (called Semantic Density, based on correlation)

  • ALC

    Average Lexical Correlation. Computes the average correlation between each predicted semantic vector and all semantic vectors in $S$.

    Example:

    _, cor_s = JudiLing.eval_SC(Shat, S, R=true)
    JudiLingMeasures.ALC(cor_s)

    Used in Schmitz et al. (2021), Chuang et al. (2020)

  • EDNN

    Euclidean Distance Nearest Neighbour. Computes the Euclidean distance between each predicted semantic vector and all semantic vectors in $S$ and returns, for each predicted semantic vector, the distance to its closest neighbour.

    Example:

    JudiLingMeasures.EDNN(Shat, S)

    Used in Schmitz et al. (2021), Chuang et al. (2020)

  • NNC

    Nearest Neighbour Correlation. Computes the correlation between each predicted semantic vector and all semantic vectors in $S$ and returns, for each predicted semantic vector, the correlation with its closest neighbour.

    Example:

    _, cor_s = JudiLing.eval_SC(Shat, S, R=true)
    JudiLingMeasures.NNC(cor_s)

    Used in Schmitz et al. (2021), Chuang et al. (2020)

  • Total Distance (F)

    Summed Euclidean distances between predicted semantic vectors of trigrams in the target form. Code by Yu-Ying Chuang.

    Example:

    JudiLingMeasures.total_distance(cue_obj, F, :F)

    Used in Chuang et al. (to appear)

Measures of comprehension accuracy/uncertainty

  • TargetCorrelation

    Correlation between each predicted semantic vector and its target semantic vector in $S$.

    Example:

    _, cor_s = JudiLing.eval_SC(Shat, S, R=true)
    JudiLingMeasures.TargetCorrelation(cor_s)

    Used in Stein and Plag (2021) and Saito (2022) (but called PredAcc there)

  • Rank

    Rank of the correlation with the target semantics among the correlations between the predicted semantic vector and all semantic vectors in $S$.

    Example:

    _, cor_s = JudiLing.eval_SC(Shat, S, R=true)
    JudiLingMeasures.rank(cor_s)

  • Recognition

    Whether a word form was correctly comprehended. NOT YET IMPLEMENTED.

  • Comprehension Uncertainty

    Sum of the products of the correlation/MSE/cosine similarity of each predicted vector $\hat{s}$ with all vectors in $S$ and the ranks of these correlations/MSEs/cosine similarities (see the schematic formula at the end of this list).

    Note: the current version of Comprehension Uncertainty is not completely tested against its original implementation in pyldl.

    Example:

    JudiLingMeasures.uncertainty(S, Shat, method="corr") # default
    JudiLingMeasures.uncertainty(S, Shat, method="mse")
    JudiLingMeasures.uncertainty(S, Shat, method="cosine")

    Used in Saito (2022).

  • Functional Load

    Correlation/MSE between the rows of $F$ corresponding to the triphones in a word $w$ and the target semantic vector of $w$.

    Note: the current version of Functional Load is not completely tested against its original implementation in pyldl.

    Example:

    JudiLingMeasures.functional_load(F, Shat, cue_obj, method="corr")
    JudiLingMeasures.functional_load(F, Shat, cue_obj, method="mse")

    Instead of returning the functional load for each cue in each word, a list of cues can also be specified. In this case, the cues are assumed to be given in the same order as the words in F and Shat with which they are compared.

    JudiLingMeasures.functional_load(F[:,1:6], Shat[1:6,:], cue_obj, cue_list = ["#vo", "#vo", "#vo","#vo","#vo","#vo"])
    JudiLingMeasures.functional_load(F[:,1:6], Shat[1:6,:], cue_obj, cue_list = ["#vo", "#vo", "#vo","#vo","#vo","#vo"], method="mse")

    Used in Saito (2022).
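
Schematically, the uncertainty measures (Comprehension Uncertainty above and Production Uncertainty below) can be read as computing, for a predicted vector $\hat{v}$ and the rows $v_i$ of the corresponding target matrix (notation mine, following the descriptions above):

$\text{uncertainty}(\hat{v}) = \sum_{i=1}^{n} \text{sim}(\hat{v}, v_i) \cdot \text{rank}\left(\text{sim}(\hat{v}, v_i)\right)$

where $\text{sim}$ is the correlation, MSE or cosine similarity, depending on the method argument.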

Measures capturing production (processing on the form side of the network)

Measures of production accuracy/support/uncertainty for the predicted form

  • SCPP

    The correlation between the predicted semantics of the word form produced by the path algorithm and the target semantics.

    Example:

    df = JudiLingMeasures.get_res_learn_df(res_learn, latin, cue_obj, cue_obj)
    JudiLingMeasures.SCPP(df, latin)

    Used in Chuang et al. (2020) (based on WpmWithLDL)

  • PathSum

    The summed path supports for the highest supported predicted form, produced by the path algorithm. Path supports are taken from the $\hat{Y}$ matrices.

    Example:

    pred_df = JudiLing.write2df(rpi_learn)
    JudiLingMeasures.path_sum(pred_df)

    Used in Schmitz et al. (2021) (but based on WpmWithLDL)

  • TargetPathSum

    The summed path supports for the target word form, produced by the path algorithm. Path supports are taken from the $\hat{Y}$ matrices.

    Example:

    JudiLingMeasures.target_path_sum(gpi_learn)

    Used in Chuang et al. (2022) (but called Triphone Support)

  • PathSumChat

    The summed path supports for the highest supported predicted form, produced by the path algorithm. Path supports are taken from the $\hat{C}$ matrix.

    Example:

    JudiLingMeasures.path_sum_chat(res_learn, Chat)

  • C-Precision

    Correlation between the predicted form vector and the target form vector.

    Example:

    JudiLingMeasures.c_precision(Chat, cue_obj.C)

    Used in Heitmeier et al. (2022), Gahl and Baayen (2022) (called Semantics to Form Mapping Precision)

  • L1Chat

    L1-Norm of the predicted $\hat{c}$ vectors.

    Example:

    JudiLingMeasures.L1Norm(Chat)

    Used in Heitmeier et al. (2022)

  • Semantic Support for Form

    Sum of activation of ngrams in the target wordform.

    Example:

    JudiLingMeasures.semantic_support_for_form(cue_obj, Chat)

    Instead of summing the activations, the function can also return the activation for each ngram:

    JudiLingMeasures.semantic_support_for_form(cue_obj, Chat, sum_supports=false)

    Used in Gahl and Baayen (2022) (it is unclear which package this was based on). The activation of individual ngrams was used in Saito (2022).

Measures of production accuracy/support/uncertainty for the target form

  • Production Uncertainty

    Sum of the products of the correlation/MSE/cosine similarity of each predicted vector $\hat{c}$ with all vectors in $C$ and the ranks of these correlations/MSEs/cosine similarities (see the schematic formula at the end of the comprehension measures above).

    Note: the current version of Production Uncertainty is not completely tested against its original implementation in pyldl.

    Example:

    JudiLingMeasures.uncertainty(cue_obj.C, Chat, method="corr") # default
    JudiLingMeasures.uncertainty(cue_obj.C, Chat, method="mse")
    JudiLingMeasures.uncertainty(cue_obj.C, Chat, method="cosine")

    Used in Saito (2022)

  • Total Distance (G)

    Summed Euclidean distances between predicted form vectors of trigrams in the target form. Code by Yu-Ying Chuang.

    Example:

    JudiLingMeasures.total_distance(cue_obj, G, :G)

    Used in Chuang et al. (to appear)

Measures of support for the predicted path, focusing on the path transitions and components of the path

  • LastSupport

    The support for the last trigram of each target word in the Chat matrix.

    Example:

    JudiLingMeasures.last_support(cue_obj, Chat)

    Used in Schmitz et al. (2021) (called Support in their paper).

  • WithinPathEntropies

    The entropy over path supports for the highest supported predicted form, produced by the path algorithm. Path supports are taken from the $\hat{Y}$ matrices.

    Example:

    pred_df = JudiLing.write2df(rpi_learn)
    JudiLingMeasures.within_path_entropies(pred_df)

  • MeanWordSupport

    Summed path support divided by each word form's length. Path supports are taken from the $\hat{Y}$ matrices.

    Example:

    pred_df = JudiLing.write2df(rpi_learn)
    JudiLingMeasures.mean_word_support(res_learn, pred_df)

  • MeanWordSupportChat

    Summed path support divided by each word form's length. Path supports are taken from the $\hat{C}$ matrix.

    Example:

    JudiLingMeasures.mean_word_support_chat(res_learn, Chat)

    Used in Stein and Plag (2021) (but based on WpmWithLDL)

  • lwlr

    The ratio between the predicted form's length and its weakest support from the production algorithm. Supports are taken from the $\hat{Y}$ matrices.

    Example:

    pred_df = JudiLing.write2df(rpi_learn)
    JudiLingMeasures.lwlr(res_learn, pred_df)

  • lwlrChat

    The ratio between the predicted form's length and its weakest support. Supports are taken from the $\hat{C}$ matrix.

    Example:

    JudiLingMeasures.lwlr_chat(res_learn, Chat)

Measures of support for competing forms

  • PathCounts

    The number of candidates predicted by the path algorithm.

    Example:

    df = JudiLingMeasures.get_res_learn_df(res_learn, latin, cue_obj, cue_obj)
    JudiLingMeasures.PathCounts(df)

    Used in Schmitz et al. (2021) (but based on WpmWithLDL)

  • PathEntropiesChat

    The entropy over the summed path supports for the candidate forms produced by the path algorithm. Path supports are taken from the $\hat{C}$ matrix.

    Example:

    JudiLingMeasures.path_entropies_chat(res_learn, Chat)

    Used in Schmitz et al. (2021) (but based on WpmWithLDL), Stein and Plag (2021) (but based on WpmWithLDL)

  • PathEntropiesSCP

    The entropy over the semantic supports for the candidate forms produced by the path algorithm.

    Example:

    df = JudiLingMeasures.get_res_learn_df(res_learn, latin, cue_obj, cue_obj)
    JudiLingMeasures.path_entropies_scp(df)

  • ALDC

    Average Levenshtein Distance of Candidates. The average Levenshtein distance between each predicted word form candidate and the target word form.

    Example:

    df = JudiLingMeasures.get_res_learn_df(res_learn, latin, cue_obj, cue_obj)
    JudiLingMeasures.ALDC(df)

    Used in Schmitz et al. (2021), Chuang et al. (2020) (both based on WpmWithLDL)

Bibliography

Baayen, R. H., Chuang, Y.-Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270.

Chuang, Y.-Y., Kang, M., Luo, X. F. and Baayen, R. H. (to appear). Vector Space Morphology with Linear Discriminative Learning. In Crepaldi, D. (Ed.) Linguistic morphology in the mind and brain.

Chuang, Y.-Y., Vollmer, M.-L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., and Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51.

Gahl, S., and Baayen, R. H. (2022). Time and thyme again: Connecting spoken word duration to models of the mental lexicon. OSF, January 22, 1-41.

Heitmeier, M., Chuang, Y.-Y., and Baayen, R. H. (2022). How trial-to-trial learning shapes mappings in the mental lexicon: Modelling Lexical Decision with Linear Discriminative Learning. ArXiv, July 1, 1-38.

Saito, M. (2022). pyldl: Linear Discriminative Learning in Python. URL: https://github.com/msaito8623/pyldl

Schmitz, D. (2021). LDLConvFunctions: Functions for measure computation, extraction, and other handy stuff. R package version 1.2.0.1. URL: https://github.com/dosc91/LDLConvFunctions

Schmitz, D., Plag, I., Baer-Henney, D., and Stein, S. D. (2021). Durational differences of word-final /s/ emerge from the lexicon: Modelling morpho-phonetic effects in pseudowords with linear discriminative learning. Frontiers in Psychology, 12.

Stein, S. D., and Plag, I. (2021). Morpho-phonetic effects in speech production: Modeling the acoustic duration of English derived words with linear discriminative learning. Frontiers in Psychology, 12.