Make Cue Matrix

JudiLing.Cue_Matrix_Struct - Type

A structure that stores the information created by make_cue_matrix:

  • C: the cue matrix
  • f2i: a dictionary mapping features to indices
  • i2f: a dictionary mapping indices to features
  • gold_ind: a list of indices of gold paths
  • A: the adjacency matrix
  • grams: the number of grams for cues
  • target_col: the column name for target strings
  • tokenized: whether the dataset target is tokenized
  • sep_token: the separator
  • keep_sep: whether to keep separators in cues
  • start_end_token: the start and end token in boundary cues
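
To make the roles of C, f2i, i2f, and the gold paths concrete, here is a minimal plain-Julia sketch on two toy words. It does not call JudiLing: the helper `toy_trigrams` and the dense binary matrix are assumptions made for illustration, and the library's own storage and values may differ.

```julia
# Toy illustration of the kind of data the struct holds (not JudiLing's implementation).
words = ["vocoo", "vocaas"]

# Hypothetical helper: character trigrams with "#" as the boundary token.
function toy_trigrams(word; boundary = "#")
    chars = collect(boundary * word * boundary)
    return [join(chars[i:i+2]) for i in 1:length(chars)-2]
end

cues = [toy_trigrams(w) for w in words]

# f2i maps each feature (n-gram) to a column index; i2f is the inverse mapping.
features = unique(vcat(cues...))
f2i = Dict(f => i for (i, f) in enumerate(features))
i2f = Dict(i => f for (f, i) in f2i)

# C has one row per word and one column per feature; 1 marks a cue that occurs.
C = zeros(Int, length(words), length(features))
for (row, word_cues) in enumerate(cues), f in word_cues
    C[row, f2i[f]] = 1
end

# Gold path of each word: its sequence of feature indices.
gold_ind = [[f2i[f] for f in word_cues] for word_cues in cues]
```

In this sketch, `i2f[gold_ind[1][1]]` recovers the first trigram of the first word, `"#vo"`.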

JudiLing.make_cue_matrix - Method
make_cue_matrix(data::DataFrame)

Make the cue matrix for a training dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given the dataset as a DataFrame.

Obligatory Arguments

  • data::DataFrame: the dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    latin_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    start_end_token="#",
    keep_sep=false,
    verbose=false
    )

# make cue matrix with tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    french_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    start_end_token="#",
    keep_sep=true,
    verbose=false
    )

JudiLing.make_cue_matrix - Method
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)

Make the cue matrix for a validation dataset, together with the corresponding indices, the adjacency matrix, and the gold paths, given the dataset as a DataFrame and the cue object of the training data.

Obligatory Arguments

  • data::DataFrame: the dataset
  • cue_obj::Cue_Matrix_Struct: training cue object

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    latin_val,
    cue_obj_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    keep_sep=false,
    start_end_token="#",
    verbose=false
    )

# make cue matrix with tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    french_val,
    cue_obj_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )

JudiLing.make_cue_matrix - Method
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)

Make the cue matrices for the training and validation datasets at the same time.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
    )

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )

JudiLing.make_combined_cue_matrix - Method
make_combined_cue_matrix(data_train, data_val)

Make the cue matrix for training and validation datasets at the same time, where the features and adjacencies are combined.

Obligatory Arguments

  • data_train::DataFrame: the training dataset
  • data_val::DataFrame: the validation dataset

Optional Arguments

  • grams::Int64=3: the number of grams for cues
  • target_col::Union{String, Symbol}=:Words: the column name for target strings
  • tokenized::Bool=false: if true, the dataset target is assumed to be tokenized
  • sep_token::Union{Nothing, String, Char}=nothing: separator
  • keep_sep::Bool=false: if true, keep separators in cues
  • start_end_token::Union{String, Char}="#": start and end token in boundary cues
  • verbose::Bool=false: if true, more information is printed

Examples

# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
    )

# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
    )
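
The idea of combining features across both datasets can be sketched in plain Julia as follows. This is an illustration under stated assumptions, not JudiLing's implementation: the cue lists are made-up data, `toy_cue_matrix` is a hypothetical helper, and the adjacency matrix is left out.

```julia
# Toy feature lists for two datasets (hypothetical data, not JudiLing output).
train_cues = [["#ab", "abc", "bc#"], ["#ab", "abd", "bd#"]]
val_cues = [["#ab", "abe", "be#"]]

# Build one shared feature index from the features of both datasets.
all_features = unique(vcat(train_cues..., val_cues...))
f2i = Dict(f => i for (i, f) in enumerate(all_features))

# Hypothetical helper: binary cue matrix over the shared index.
function toy_cue_matrix(cues, f2i)
    C = zeros(Int, length(cues), length(f2i))
    for (row, fs) in enumerate(cues), f in fs
        C[row, f2i[f]] = 1
    end
    return C
end

C_train = toy_cue_matrix(train_cues, f2i)
C_val = toy_cue_matrix(val_cues, f2i)
```

Because both matrices are built over the same index, they have the same number of columns, and validation-only features such as `"abe"` still receive a column.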

JudiLing.make_cue_matrix_from_CFBS - Method
make_cue_matrix_from_CFBS(features::Vector{Vector{T}};
                          pad_val::T = 0.,
                          ncol::Union{Missing,Int}=missing) where {T}

Create a cue matrix from a vector of feature vectors (usually CFBS vectors). The vectors are expected, though not required, to vary in length; shorter vectors are padded on the right with the provided pad_val.

Obligatory arguments

  • features::Vector{Vector{T}}: vector of vectors containing C-FBS features

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrix. If not set, it defaults to the maximum number of features in any feature vector

Examples

C = JudiLing.make_cue_matrix_from_CFBS(features)
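
The right-padding described above can be sketched in plain Julia. `toy_pad_matrix` is a hypothetical helper that mirrors the documented keyword names for illustration only; it is not the library's implementation, and it assumes any ncol passed in is at least as large as the longest vector.

```julia
# Hypothetical helper illustrating right-padding; not JudiLing's implementation.
function toy_pad_matrix(features::Vector{Vector{T}}; pad_val::T = zero(T),
                        ncol::Union{Missing,Int} = missing) where {T}
    # Default width: the length of the longest feature vector.
    width = ismissing(ncol) ? maximum(length.(features)) : ncol
    C = fill(pad_val, length(features), width)
    for (row, v) in enumerate(features)
        C[row, 1:length(v)] = v   # copy the values; the rest of the row keeps pad_val
    end
    return C
end

features = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
C = toy_pad_matrix(features)              # 3×3 matrix, padded with 0.0
C_wide = toy_pad_matrix(features; ncol = 5)   # 3×5 matrix
```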

JudiLing.make_combined_cue_matrix_from_CFBS - Method
make_combined_cue_matrix_from_CFBS(features_train::Vector{Vector{T}},
                                   features_test::Vector{Vector{T}};
                                   pad_val::T = 0.,
                                   ncol::Union{Missing,Int}=missing) where {T}

Create cue matrices from two vectors of feature vectors (usually CFBS vectors). The vectors are expected, though not required, to vary in length; shorter vectors are padded on the right with the provided pad_val. Both cue matrices are given the same number of columns: the maximum number of feature values found in any vector of features_train or features_test.

Obligatory arguments

  • features_train::Vector{Vector{T}}: vector of vectors containing C-FBS features for the training data
  • features_test::Vector{Vector{T}}: vector of vectors containing C-FBS features for the test data

Optional arguments

  • pad_val::T = 0.: Value with which the feature vectors will be padded
  • ncol::Union{Missing,Int}=missing: Number of columns of the C matrices. If not set, it defaults to the maximum number of features in any feature vector of features_train and features_test

Examples

C_train, C_test = JudiLing.make_combined_cue_matrix_from_CFBS(features_train, features_test)
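
How a shared column count arises can be sketched in a few lines of plain Julia. The data and the helper `toy_pad_rows` are hypothetical and only illustrate the idea; they are not how JudiLing computes its matrices.

```julia
# Made-up feature vectors of varying length.
features_train = [[1.0, 2.0, 3.0], [4.0]]
features_test = [[5.0, 6.0, 7.0, 8.0]]

# Shared width: the longest vector across both sets.
ncol = max(maximum(length.(features_train)), maximum(length.(features_test)))

# Hypothetical helper: one row per vector, right-padded with pad_val up to n columns.
toy_pad_rows(vs, n; pad_val = 0.0) =
    [j <= length(v) ? v[j] : pad_val for v in vs, j in 1:n]

C_train = toy_pad_rows(features_train, ncol)
C_test = toy_pad_rows(features_test, ncol)
```

Here the longest vector is in the test set, so the training matrix is also padded out to four columns.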

JudiLing.make_ngrams - Method
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)

Given a list of string tokens, return a list of all n-grams for these tokens.
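
As a rough illustration of n-gram extraction over a token list, the sketch below uses a hypothetical function `toy_ngrams`. It assumes that boundary tokens are added at both ends and that, when keep_sep is true, the tokens within an n-gram are joined with sep_token; JudiLing's own make_ngrams may differ in detail.

```julia
# Hypothetical re-implementation for illustration; not JudiLing's make_ngrams.
function toy_ngrams(tokens::Vector{String}, grams::Int;
                    keep_sep::Bool = false, sep_token::String = "",
                    start_end_token::String = "#")
    # Assumption: boundary tokens are added at both ends before windowing.
    padded = [start_end_token, tokens..., start_end_token]
    # Assumption: with keep_sep, tokens inside an n-gram are joined by sep_token.
    joiner = keep_sep ? sep_token : ""
    return [join(padded[i:i+grams-1], joiner) for i in 1:length(padded)-grams+1]
end

bigrams = toy_ngrams(["a", "b", "c"], 2)
# bigrams == ["#a", "ab", "bc", "c#"]

syllable_bigrams = toy_ngrams(["ka", "ta"], 2; keep_sep = true, sep_token = "-")
# syllable_bigrams == ["#-ka", "ka-ta", "ta-#"]
```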
