Make Cue Matrix
JudiLing.Cue_Matrix_Struct
— Type
A structure that stores the information created by make_cue_matrix: C is the cue matrix; f2i is a dictionary mapping features to their indices; i2f is a dictionary mapping indices to their features; gold_ind is a list of indices of gold paths; A is the adjacency matrix; grams is the number of grams for cues; target_col is the column name for target strings; tokenized indicates whether the dataset target is tokenized; sep_token is the separator; keep_sep indicates whether separators are kept in cues; start_end_token is the start and end token in boundary cues.
JudiLing.make_cue_matrix
— Function
Construct the cue matrix.
JudiLing.make_combined_cue_matrix
— Function
Construct cue matrices for the training and validation datasets, with features and adjacencies combined across both.
JudiLing.make_ngrams
— Function
Given a list of string tokens, extract their n-grams.
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data::DataFrame)
Make the cue matrix for a training dataset, along with the corresponding indices, the adjacency matrix, and the gold paths, given the dataset as a DataFrame.
Obligatory Arguments
data::DataFrame
: the dataset
Optional Arguments
grams::Int64=3
: the number of grams for cues
target_col::Union{String, Symbol}=:Words
: the column name for target strings
tokenized::Bool=false
: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing
: separator
keep_sep::Bool=false
: if true, keep separators in cues
start_end_token::Union{String, Char}="#"
: start and end token in boundary cues
verbose::Bool=false
: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    latin_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    start_end_token="#",
    keep_sep=false,
    verbose=false
)
# make cue matrix with tokenization
cue_obj_train = JudiLing.make_cue_matrix(
    french_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    start_end_token="#",
    keep_sep=true,
    verbose=false
)
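To make the relationship between the stored fields more concrete, the following minimal sketch builds toy versions of C, f2i, i2f, gold_ind and A by hand for two words. It does not call JudiLing. The helper ngrams_of and all variable names are hypothetical, and it assumes that the cue matrix marks presence with 1, that adjacency is taken from n-gram transitions observed in the data, and that dense matrices are acceptable for illustration. The package itself may differ on each of these points.

```julia
# Illustrative sketch only; names below are hypothetical, not JudiLing's internals.
words = ["vox", "vocat"]
grams = 3

# Split a word into n-grams, using "#" as the boundary token.
function ngrams_of(word::String, n::Int)
    chars = collect("#" * word * "#")
    return [join(chars[i:i+n-1]) for i in 1:(length(chars) - n + 1)]
end

word_ngrams = [ngrams_of(w, grams) for w in words]

# Feature index in order of first appearance, plus its inverse.
features = unique(reduce(vcat, word_ngrams))
f2i = Dict(f => i for (i, f) in enumerate(features))
i2f = Dict(i => f for (i, f) in enumerate(features))

# Cue matrix: one row per word, one column per n-gram (1 = present).
C = zeros(Int, length(words), length(features))
for (row, ngs) in enumerate(word_ngrams), ng in ngs
    C[row, f2i[ng]] = 1
end

# Gold paths: the sequence of feature indices that spells each word.
gold_ind = [[f2i[ng] for ng in ngs] for ngs in word_ngrams]

# Adjacency matrix: A[i, j] = 1 if n-gram j directly follows n-gram i in some word.
A = zeros(Int, length(features), length(features))
for path in gold_ind, k in 1:(length(path) - 1)
    A[path[k], path[k+1]] = 1
end
```

In this toy case the two words share the initial trigram "#vo", so both gold paths start at feature index 1 and then diverge.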
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data::DataFrame, cue_obj::Cue_Matrix_Struct)
Make the cue matrix for a validation dataset, along with the corresponding indices, the adjacency matrix, and the gold paths, given the dataset as a DataFrame and the cue object of the training data.
Obligatory Arguments
data::DataFrame
: the dataset
cue_obj::Cue_Matrix_Struct
: training cue object
Optional Arguments
grams::Int64=3
: the number of grams for cues
target_col::Union{String, Symbol}=:Words
: the column name for target strings
tokenized::Bool=false
: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing
: separator
keep_sep::Bool=false
: if true, keep separators in cues
start_end_token::Union{String, Char}="#"
: start and end token in boundary cues
verbose::Bool=false
: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    latin_val,
    cue_obj_train,
    grams=3,
    target_col=:Word,
    tokenized=false,
    sep_token="-",
    keep_sep=false,
    start_end_token="#",
    verbose=false
)
# make cue matrix with tokenization
cue_obj_val = JudiLing.make_cue_matrix(
    french_val,
    cue_obj_train,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
JudiLing.make_cue_matrix
— Method
make_cue_matrix(data_train::DataFrame, data_val::DataFrame)
Make the cue matrices for the training and validation datasets at the same time.
Obligatory Arguments
data_train::DataFrame
: the training dataset
data_val::DataFrame
: the validation dataset
Optional Arguments
grams::Int64=3
: the number of grams for cues
target_col::Union{String, Symbol}=:Words
: the column name for target strings
tokenized::Bool=false
: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing
: separator
keep_sep::Bool=false
: if true, keep separators in cues
start_end_token::Union{String, Char}="#"
: start and end token in boundary cues
verbose::Bool=false
: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
)
# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
JudiLing.make_combined_cue_matrix
— Method
make_combined_cue_matrix(data_train, data_val)
Make the cue matrices for the training and validation datasets at the same time, with the features and adjacencies of both datasets combined.
Obligatory Arguments
data_train::DataFrame
: the training dataset
data_val::DataFrame
: the validation dataset
Optional Arguments
grams::Int64=3
: the number of grams for cues
target_col::Union{String, Symbol}=:Words
: the column name for target strings
tokenized::Bool=false
: if true, the dataset target is assumed to be tokenized
sep_token::Union{Nothing, String, Char}=nothing
: separator
keep_sep::Bool=false
: if true, keep separators in cues
start_end_token::Union{String, Char}="#"
: start and end token in boundary cues
verbose::Bool=false
: if true, more information is printed
Examples
# make cue matrix without tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    latin_train,
    latin_val,
    grams=3,
    target_col=:Word,
    tokenized=false,
    keep_sep=false
)
# make cue matrix with tokenization
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(
    french_train,
    french_val,
    grams=3,
    target_col=:Syllables,
    tokenized=true,
    sep_token="-",
    keep_sep=true,
    start_end_token="#",
    verbose=false
)
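The idea of combining features can be illustrated with a small sketch that does not use JudiLing at all. The helpers trigrams and cue_rows are hypothetical, and the sketch assumes that combining means indexing the n-grams of both datasets in one shared dictionary, so that n-grams occurring only in the validation data still get a column of their own.

```julia
# Illustrative sketch only; helper names are hypothetical, not JudiLing's API.
train_words = ["vox"]
val_words = ["vocat"]

# Trigrams of a word with "#" as boundary token.
trigrams(w::String) = [join(collect("#" * w * "#")[i:i+2]) for i in 1:length(w)]

# One shared feature index over the n-grams of BOTH datasets.
features = unique(reduce(vcat, [trigrams(w) for w in vcat(train_words, val_words)]))
f2i = Dict(f => i for (i, f) in enumerate(features))

# Build a cue matrix (1 = n-gram present) against the shared index.
function cue_rows(ws::Vector{String})
    C = zeros(Int, length(ws), length(f2i))
    for (row, w) in enumerate(ws), ng in trigrams(w)
        C[row, f2i[ng]] = 1
    end
    return C
end

C_train = cue_rows(train_words)
C_val = cue_rows(val_words)
```

Both matrices end up with the same seven columns, even though the training word on its own contributes only three trigrams.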
JudiLing.make_cue_matrix_from_CFBS
— Method
make_cue_matrix_from_CFBS(features::Vector{Vector{T}};
    pad_val::T = 0.,
    ncol::Union{Missing,Int}=missing) where {T}
Create a cue matrix from a vector of feature vectors (usually CFBS vectors). The vectors are expected, though not required, to have varying lengths, so they are padded on the right with the provided pad_val.
Obligatory arguments
features::Vector{Vector{T}}
: vector of vectors containing C-FBS features
Optional arguments
pad_val::T = 0.
: Value with which the feature vectors will be padded
ncol::Union{Missing,Int}=missing
: Number of columns of the C matrix. If not set, will be set to the maximum number of features
Examples
C = JudiLing.make_cue_matrix_from_CFBS(features)
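The right-padding described above can be sketched in plain Julia. The function sketch_pad_matrix is a hypothetical stand-in written for illustration, not the package's implementation, and it assumes that ncol, when given, is at least as large as the longest feature vector.

```julia
# Hypothetical stand-in for illustration; not JudiLing's implementation.
function sketch_pad_matrix(features::Vector{Vector{T}};
                           pad_val::T = zero(T),
                           ncol::Union{Missing,Int} = missing) where {T}
    # Default width: the length of the longest feature vector.
    width = ismissing(ncol) ? maximum(length.(features)) : ncol
    # Start from a matrix filled with the padding value ...
    C = fill(pad_val, length(features), width)
    # ... and copy each feature vector into the left part of its row.
    # Assumes width >= length(f); otherwise this throws a BoundsError.
    for (i, f) in enumerate(features)
        C[i, 1:length(f)] = f
    end
    return C
end

features = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
C = sketch_pad_matrix(features)            # 3×3, padded with 0.0
C_wide = sketch_pad_matrix(features; ncol = 5)  # 3×5
```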
JudiLing.make_combined_cue_matrix_from_CFBS
— Method
make_combined_cue_matrix_from_CFBS(features_train::Vector{Vector{T}},
    features_test::Vector{Vector{T}};
    pad_val::T = 0.,
    ncol::Union{Missing,Int}=missing) where {T}
Create cue matrices from two vectors of feature vectors (usually CFBS vectors). The vectors are expected, though not required, to have varying lengths, so they are padded on the right with the provided pad_val. The cue matrices are sized to the maximum number of feature values in features_train and features_test.
Obligatory arguments
features_train::Vector{Vector{T}}
: vector of vectors containing C-FBS features
features_test::Vector{Vector{T}}
: vector of vectors containing C-FBS features
Optional arguments
pad_val::T = 0.
: Value with which the feature vectors will be padded
ncol::Union{Missing,Int}=missing
: Number of columns of the C matrices. If not set, will be set to the maximum number of features in features_train and features_test
Examples
C_train, C_test = JudiLing.make_combined_cue_matrix_from_CFBS(features_train, features_test)
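The point of the combined variant is that both matrices share one width. A minimal sketch of that idea, with hypothetical helpers pad_right and sketch_combined_pad that are not part of JudiLing, and assuming the default behaviour where no ncol is given:

```julia
# Hypothetical helpers for illustration; not JudiLing's implementation.

# Pad a single vector on the right up to a fixed width.
pad_right(v::Vector{T}, width::Int, pad_val::T) where {T} =
    vcat(v, fill(pad_val, width - length(v)))

function sketch_combined_pad(features_train::Vector{Vector{T}},
                             features_test::Vector{Vector{T}};
                             pad_val::T = zero(T)) where {T}
    # Shared width: the longest vector across BOTH sets.
    width = max(maximum(length.(features_train)), maximum(length.(features_test)))
    # Stack the padded vectors as rows of a matrix.
    to_matrix(fs) = permutedims(reduce(hcat, [pad_right(f, width, pad_val) for f in fs]))
    return to_matrix(features_train), to_matrix(features_test)
end

features_train = [[1.0], [2.0, 3.0]]
features_test = [[4.0, 5.0, 6.0]]
C_train, C_test = sketch_combined_pad(features_train, features_test)
```

Here the longest training vector has two values, but C_train still gets three columns because the test set contains a vector of length three.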
JudiLing.make_ngrams
— Method
make_ngrams(tokens, grams, keep_sep, sep_token, start_end_token)
Given a list of string tokens, return a list of all n-grams for these tokens.
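As a rough illustration of what n-gram extraction with boundary tokens looks like, the sketch below uses a hypothetical function sketch_ngrams that mirrors the documented argument order. It assumes that one start_end_token is added on each side of the token list and that sep_token is used to join tokens only when keep_sep is true. The package's own function may behave differently in detail.

```julia
# Hypothetical helper mirroring the documented argument order; not JudiLing's code.
function sketch_ngrams(tokens::Vector{String}, grams::Int, keep_sep::Bool,
                       sep_token::String, start_end_token::String)
    # Add one boundary token on each side without mutating the input.
    padded = String[start_end_token, tokens..., start_end_token]
    # Join with the separator only when it should be kept in the cues.
    glue = keep_sep ? sep_token : ""
    # Slide a window of width `grams` over the padded tokens.
    return [join(padded[i:i+grams-1], glue) for i in 1:(length(padded) - grams + 1)]
end

# Letter trigrams of "vox": ["#vo", "vox", "ox#"]
letter_trigrams = sketch_ngrams(["v", "o", "x"], 3, false, "", "#")

# Syllable bigrams with the separator kept: ["#-vo", "vo-cat", "cat-#"]
syllable_bigrams = sketch_ngrams(["vo", "cat"], 2, true, "-", "#")
```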