Find Paths
Structures
JudiLing.Result_Path_Info_Struct
— Type
Store paths' information built by learn_paths or build_paths.
JudiLing.Gold_Path_Info_Struct
— Type
Store gold paths' information, including the indices, the support for each index, and the total support. It can be used to evaluate how low the threshold needs to be set to find most of the correct paths or, if set very low, all of the correct paths.
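As a rough illustration of that use, the sketch below works on plain vectors of support values rather than on the struct itself. The variable names and numbers are hypothetical and are not JudiLing fields; the point is only that a gold path survives a threshold when its weakest n-gram does.

```julia
# Hypothetical support values for the n-grams on each word's gold path.
# Plain vectors are used here instead of JudiLing structs.
gold_supports = [
    [0.92, 0.85, 0.40],
    [0.75, 0.08, 0.66],
    [0.55, 0.61, 0.58],
]

# A gold path is only recoverable if every n-gram on it exceeds the
# threshold, so the weakest n-gram decides.
min_support = [minimum(s) for s in gold_supports]

# Share of words whose complete gold path lies above `threshold`.
coverage(threshold) = count(>(threshold), min_support) / length(min_support)

coverage(0.1)    # two of the three gold paths survive
coverage(0.05)   # all three survive
```

Lowering the threshold until the coverage is acceptable gives an estimate of how permissive the search has to be.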
JudiLing.Threshold_Stat_Struct
— Type
Store the threshold and tolerance proportions for each timestep.
Build paths
JudiLing.build_paths
— Function
The build_paths function constructs paths by only considering those n-grams that are close to the target. It first takes the predicted c-hat vector and finds the closest n neighbors in the C matrix. Then it selects all n-grams of these neighbors, and constructs all valid paths with those n-grams. The path producing the best correlation with the target semantic vector (through synthesis by analysis) is selected.
JudiLing.build_paths
— Method
build_paths(
data_val,
C_train,
S_val,
F_train,
Chat_val,
A,
i2f,
C_train_ind;
rC = nothing,
max_t = 15,
max_can = 10,
n_neighbors = 10,
grams = 3,
tokenized = false,
sep_token = nothing,
target_col = :Words,
start_end_token = "#",
if_pca = false,
pca_eval_M = nothing,
ignore_nan = true,
verbose = false,
)
The build_paths function constructs paths by only considering those n-grams that are close to the target. It first takes the predicted c-hat vector and finds the closest n neighbors in the C matrix. Then it selects all n-grams of these neighbors, and constructs all valid paths with those n-grams. The path producing the best correlation with the target semantic vector (through synthesis by analysis) is selected.
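The path construction step can be pictured with a small stand-alone sketch. It does not call JudiLing: the toy trigram list, the adjacency rule, and the helper names valid_paths and to_word are assumptions made for illustration, and the final selection by correlation is left out.

```julia
# Toy trigram inventory; "#" marks word boundaries.
ngrams = ["#ab", "aba", "bab", "ab#", "ba#"]

# A[i, j] is true when n-gram j can follow n-gram i, i.e. when the last
# two characters of i equal the first two characters of j.
A = [ngrams[i][2:end] == ngrams[j][1:end-1]
     for i in eachindex(ngrams), j in eachindex(ngrams)]

# Enumerate all paths through the candidate n-grams that start with a
# word-initial n-gram, end with a word-final one, and have at most
# `max_t` steps.
function valid_paths(A, ngrams, candidates; max_t = 4)
    complete = Vector{Vector{Int}}()
    stack = [[i] for i in candidates if startswith(ngrams[i], "#")]
    while !isempty(stack)
        path = pop!(stack)
        if endswith(ngrams[path[end]], "#")
            push!(complete, path)            # reached a word-final n-gram
        elseif length(path) < max_t
            for j in candidates
                A[path[end], j] && push!(stack, [path; j])
            end
        end
    end
    return complete
end

# Turn a path of trigram indices back into a string.
to_word(path) = ngrams[path[1]] * join(ngrams[i][end] for i in path[2:end])

paths = valid_paths(A, ngrams, 1:5; max_t = 4)
words = sort(to_word.(paths))   # "#ab#", "#aba#", "#abab#"
```

Restricting `candidates` to the n-grams of the nearest neighbors keeps the number of enumerated paths small, which is the motivation given in the description above.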
Obligatory Arguments
data::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
C_train::SparseMatrixCSC: the C matrix for the training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
F_train::Union{SparseMatrixCSC, Matrix}: the F matrix for the training dataset
Chat_val::Matrix: the Chat matrix for the validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
C_train_ind::Array: the gold paths' indices for the training dataset
Optional Arguments
rC::Union{Nothing, Matrix}=nothing: correlation matrix of C and Chat; specify it to save computing time
max_t::Int64=15: maximum number of timesteps
max_can::Int64=10: maximum number of candidates to consider
n_neighbors::Int64=10: the top n form neighbors to be considered
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String, Char}=nothing: separator
target_col::Union{String, Symbol}=:Words: the column name for target strings
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
verbose::Bool=false: if true, more information will be printed
Examples
# training dataset
JudiLing.build_paths(
latin_train,
cue_obj_train.C,
S_train,
F_train,
Chat_train,
A,
cue_obj_train.i2f,
cue_obj_train.gold_ind,
max_t=max_t,
n_neighbors=10,
verbose=false
)
# validation dataset
JudiLing.build_paths(
latin_val,
cue_obj_train.C,
S_val,
F_train,
Chat_val,
A,
cue_obj_train.i2f,
cue_obj_train.gold_ind,
max_t=max_t,
n_neighbors=10,
verbose=false
)
# pca mode
res_build = JudiLing.build_paths(
korean,
Array(Cpcat),
S,
F,
ChatPCA,
A,
cue_obj.i2f,
cue_obj.gold_ind,
max_t=max_t,
if_pca=true,
pca_eval_M=Fo,
n_neighbors=3,
verbose=true
)
Learn paths
JudiLing.learn_paths
— Function
A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.
JudiLing.learn_paths
— Method
learn_paths(
data::DataFrame,
cue_obj::Cue_Matrix_Struct,
S_val::Union{SparseMatrixCSC, Matrix},
F_train,
Chat_val::Union{SparseMatrixCSC, Matrix};
Shat_val::Union{Nothing, Matrix} = nothing,
check_gold_path::Bool = false,
threshold::Float64 = 0.1,
is_tolerant::Bool = false,
tolerance::Float64 = (-1000.0),
max_tolerance::Int = 3,
activation::Union{Nothing, Function} = nothing,
ignore_nan::Bool = true,
verbose::Bool = true)
A high-level wrapper function for learn_paths with much less control. It is aimed at users who are new to JudiLing and the learn_paths function.
Obligatory Arguments
data::DataFrame: the training dataset
cue_obj::Cue_Matrix_Struct: the cue object containing the C matrix and all information associated with it
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: either the F matrix for the training dataset, or a deep learning comprehension model trained on the training set
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for the validation dataset
Optional Arguments
Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as the second output value
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration if its support is higher than this value
is_tolerant::Bool=false: if true, allow a limited number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and threshold and the max_tolerance number has not been reached, the n-gram may be added to the path
max_tolerance::Int64=3: maximum number of n-grams with support below threshold allowed in a path
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
verbose::Bool=true: if true, more information is printed
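The interplay of threshold, is_tolerant, tolerance and max_tolerance can be summarised in a few lines. The helper admit below is hypothetical and written only from the argument descriptions above; in particular, the use of strict comparisons is an assumption, not something taken from the JudiLing source.

```julia
# Decide whether an n-gram with the given support may extend a path.
# Returns (accepted, updated number of tolerated n-grams in the path).
function admit(support, n_tolerated; threshold = 0.1, is_tolerant = false,
               tolerance = -1000.0, max_tolerance = 3)
    if support > threshold
        return (true, n_tolerated)            # regular case
    elseif is_tolerant && support > tolerance && n_tolerated < max_tolerance
        return (true, n_tolerated + 1)        # tolerated, uses up one slot
    else
        return (false, n_tolerated)
    end
end

admit(0.25, 0)                                         # (true, 0)
admit(0.02, 0)                                         # (false, 0)
admit(0.02, 0; is_tolerant = true, tolerance = -0.1)   # (true, 1)
admit(0.02, 3; is_tolerant = true, tolerance = -0.1)   # (false, 3)
```

With the default tolerance of -1000.0, practically any n-gram qualifies for a tolerance slot once is_tolerant is switched on, so max_tolerance is what limits the search in that case.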
Examples
res = learn_paths(latin, cue_obj, S, F, Chat)
JudiLing.learn_paths
— Method
learn_paths(
data_train::DataFrame,
data_val::DataFrame,
C_train::Union{Matrix, SparseMatrixCSC},
S_val::Union{Matrix, SparseMatrixCSC},
F_train,
Chat_val::Union{Matrix, SparseMatrixCSC},
A::SparseMatrixCSC,
i2f::Dict,
f2i::Dict;
gold_ind::Union{Nothing, Vector} = nothing,
Shat_val::Union{Nothing, Matrix} = nothing,
check_gold_path::Bool = false,
max_t::Int = 15,
max_can::Int = 10,
threshold::Float64 = 0.1,
is_tolerant::Bool = false,
tolerance::Float64 = (-1000.0),
max_tolerance::Int = 3,
grams::Int = 3,
tokenized::Bool = false,
sep_token::Union{Nothing, String} = nothing,
keep_sep::Bool = false,
target_col::Union{Symbol, String} = "Words",
start_end_token::String = "#",
issparse::Union{Symbol, Bool} = :auto,
sparse_ratio::Float64 = 0.05,
if_pca::Bool = false,
pca_eval_M::Union{Nothing, Matrix} = nothing,
activation::Union{Nothing, Function} = nothing,
ignore_nan::Bool = true,
check_threshold_stat::Bool = false,
verbose::Bool = false
)
A sequence finding algorithm using discrimination learning to predict, for a given word, which n-grams are best supported for a given position in the sequence of n-grams.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for the training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for the training dataset, or a deep learning comprehension model trained on the training data
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for the validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
f2i::Dict: the dictionary returning indices given features
Optional Arguments
gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as the second output value
max_t::Int64=15: maximum number of timesteps
max_can::Int64=10: maximum number of candidates to consider
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration if its support is higher than this value
is_tolerant::Bool=false: if true, allow a limited number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and threshold and the max_tolerance number has not been reached, the n-gram may be added to the path
max_tolerance::Int64=3: maximum number of n-grams with support below threshold allowed in a path
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String}=nothing: separator token
keep_sep::Bool=false: if true, keep separators in cues
target_col::Union{Symbol, String}="Words": the column name for target strings
start_end_token::String="#": start and end token in boundary cues
issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
sparse_ratio::Float64=0.05: the ratio used to decide whether a matrix is sparse
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
verbose::Bool=false: if true, more information is printed
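To make the roles of grams, tokenized, sep_token, keep_sep and start_end_token concrete, the following stand-alone sketch cuts a target string into n-grams. The helper make_ngrams is hypothetical and only mirrors the argument descriptions above; the exact cue strings JudiLing produces, especially with keep_sep, may differ.

```julia
# Split a target string into units, add boundary tokens, and slide a
# window of `grams` units over the result.
function make_ngrams(target; grams = 3, tokenized = false, sep_token = nothing,
                     keep_sep = false, start_end_token = "#")
    # Untokenized targets are split into characters, tokenized ones at sep_token.
    units = tokenized ? String.(split(target, sep_token)) : string.(collect(target))
    units = [start_end_token; units; start_end_token]
    # Re-insert the separator between units only if it should be kept.
    joiner = (tokenized && keep_sep) ? sep_token : ""
    return [join(units[i:i+grams-1], joiner) for i in 1:length(units)-grams+1]
end

make_ngrams("vocat")
# character trigrams: "#vo", "voc", "oca", "cat", "at#"

make_ngrams("vo-cat"; grams = 2, tokenized = true, sep_token = "-", keep_sep = true)
# syllable bigrams: "#-vo", "vo-cat", "cat-#"
```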
Examples
# basic usage without tokenization
res = JudiLing.learn_paths(
latin,
latin,
cue_obj.C,
S,
F,
Chat,
A,
cue_obj.i2f,
cue_obj.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=false,
keep_sep=false,
target_col=:Word,
verbose=true)
# basic usage with tokenization
res = JudiLing.learn_paths(
french,
french,
cue_obj.C,
S,
F,
Chat,
A,
cue_obj.i2f,
cue_obj.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=true,
sep_token="-",
keep_sep=true,
target_col=:Syllables,
verbose=true)
# basic usage for validation data
res_val = JudiLing.learn_paths(
latin_train,
latin_val,
cue_obj_train.C,
S_val,
F_train,
Chat_val,
A,
cue_obj_train.i2f,
cue_obj_train.f2i,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=false,
keep_sep=false,
target_col=:Word,
verbose=true)
# turn on tolerance mode
res_val = JudiLing.learn_paths(
...
threshold=0.1,
is_tolerant=true,
tolerance=-0.1,
max_tolerance=4,
...)
# turn on check gold paths mode
res_train, gpi_train = JudiLing.learn_paths(
...
gold_ind=cue_obj_train.gold_ind,
Shat_val=Shat_train,
check_gold_path=true,
...)
res_val, gpi_val = JudiLing.learn_paths(
...
gold_ind=cue_obj_val.gold_ind,
Shat_val=Shat_val,
check_gold_path=true,
...)
# control over sparsity
res_val = JudiLing.learn_paths(
...
issparse=:auto,
sparse_ratio=0.05,
...)
# pca mode
res_learn = JudiLing.learn_paths(
korean,
korean,
Array(Cpcat),
S,
F,
ChatPCA,
A,
cue_obj.i2f,
cue_obj.f2i,
check_gold_path=false,
gold_ind=cue_obj.gold_ind,
Shat_val=Shat,
max_t=max_t,
max_can=10,
grams=3,
threshold=0.1,
tokenized=true,
sep_token="_",
keep_sep=true,
target_col=:Verb_syll,
if_pca=true,
pca_eval_M=Fo,
verbose=true);
JudiLing.learn_paths_rpi
— Method
learn_paths_rpi(
data_train::DataFrame,
data_val::DataFrame,
C_train::Union{Matrix, SparseMatrixCSC},
S_val::Union{Matrix, SparseMatrixCSC},
F_train,
Chat_val::Union{Matrix, SparseMatrixCSC},
A::SparseMatrixCSC,
i2f::Dict,
f2i::Dict;
gold_ind::Union{Nothing, Vector} = nothing,
Shat_val::Union{Nothing, Matrix} = nothing,
check_gold_path::Bool = false,
max_t::Int = 15,
max_can::Int = 10,
threshold::Float64 = 0.1,
is_tolerant::Bool = false,
tolerance::Float64 = (-1000.0),
max_tolerance::Int = 3,
grams::Int = 3,
tokenized::Bool = false,
sep_token::Union{Nothing, String} = nothing,
keep_sep::Bool = false,
target_col::Union{Symbol, String} = "Words",
start_end_token::String = "#",
issparse::Union{Symbol, Bool} = :auto,
sparse_ratio::Float64 = 0.05,
if_pca::Bool = false,
pca_eval_M::Union{Nothing, Matrix} = nothing,
activation::Union{Nothing, Function} = nothing,
ignore_nan::Bool = true,
check_threshold_stat::Bool = false,
verbose::Bool = false
)
Run learn_paths and additionally return the supports for the indices of the resulting paths.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
C_train::Union{SparseMatrixCSC, Matrix}: the C matrix for the training dataset
S_val::Union{SparseMatrixCSC, Matrix}: the S matrix for the validation dataset
F_train::Union{SparseMatrixCSC, Matrix, Chain}: the F matrix for the training dataset, or a deep learning comprehension model trained on the training data
Chat_val::Union{SparseMatrixCSC, Matrix}: the Chat matrix for the validation dataset
A::SparseMatrixCSC: the adjacency matrix
i2f::Dict: the dictionary returning features given indices
f2i::Dict: the dictionary returning indices given features
Optional Arguments
gold_ind::Union{Nothing, Vector}=nothing: gold paths' indices
Shat_val::Union{Nothing, Matrix}=nothing: the Shat matrix for the validation dataset
check_gold_path::Bool=false: if true, return a list of support values for the gold path; this information is returned as the second output value
max_t::Int64=15: maximum number of timesteps
max_can::Int64=10: maximum number of candidates to consider
threshold::Float64=0.1: the support threshold; an n-gram is taken into consideration if its support is higher than this value
is_tolerant::Bool=false: if true, allow a limited number (given by max_tolerance) of n-grams whose supports are below threshold but above a second tolerance threshold to be added to the path
tolerance::Float64=(-1000.0): the second threshold (in tolerant mode); if the support for an n-gram lies between this value and threshold and the max_tolerance number has not been reached, the n-gram may be added to the path
max_tolerance::Int64=3: maximum number of n-grams with support below threshold allowed in a path
grams::Int64=3: the number n of grams that make up an n-gram
tokenized::Bool=false: if true, the dataset target is tokenized
sep_token::Union{Nothing, String}=nothing: separator token
keep_sep::Bool=false: if true, keep separators in cues
target_col::Union{Symbol, String}="Words": the column name for target strings
start_end_token::String="#": start and end token in boundary cues
issparse::Union{Symbol, Bool}=:auto: controls whether the Mt matrix is output as a dense or a sparse matrix
sparse_ratio::Float64=0.05: the ratio used to decide whether a matrix is sparse
if_pca::Bool=false: turn on to enable pca mode
pca_eval_M::Union{Nothing, Matrix}=nothing: pass original F for pca mode
activation::Union{Nothing, Function}=nothing: the activation function you want to pass
ignore_nan::Bool=true: whether to ignore NaN when comparing correlations; otherwise NaN will be selected as the maximum correlation value
check_threshold_stat::Bool=false: if true, return a threshold and tolerance proportion for each timestep
verbose::Bool=false: if true, more information is printed
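One plausible reading of issparse and sparse_ratio is sketched below with the SparseArrays standard library. The helper maybe_sparse is hypothetical, and the rule it applies (store the matrix sparsely when the share of nonzero entries is below sparse_ratio) is an assumption based on the argument descriptions, not a copy of JudiLing's internal check.

```julia
using SparseArrays

# Return M as a sparse or dense matrix depending on `issparse`; with :auto,
# the share of nonzero entries is compared against `sparse_ratio`.
function maybe_sparse(M; issparse = :auto, sparse_ratio = 0.05)
    if issparse == :auto
        density = count(!iszero, M) / length(M)
        issparse = density < sparse_ratio
    end
    return issparse ? sparse(M) : Matrix(M)
end

M = zeros(10, 10)
M[1, 1] = 1.0                       # 1% of the entries are nonzero
low = maybe_sparse(M)               # stored as SparseMatrixCSC

M[:, 1:5] .= 1.0                    # now 50% of the entries are nonzero
high = maybe_sparse(M)              # stored as a dense Matrix

forced = maybe_sparse(M; issparse = true)   # explicit setting overrides the ratio
```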
Utility functions
JudiLing.eval_can
— Method
eval_can(candidates, S, F::Union{Matrix,SparseMatrixCSC, Chain}, i2f, max_can, if_pca, pca_eval_M)
Calculate for each candidate path the correlation between predicted semantic vector and the gold standard semantic vector, and select as target for production the path with the highest correlation.
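The selection step can be sketched without JudiLing as follows. The toy matrix F, the gold vector, and the candidate paths are made up, and replacing NaN by -Inf is just one way to mimic what the ignore_nan option is described as doing.

```julia
using Statistics

# Toy comprehension matrix F (n-grams × semantic dimensions).
F = [1.0 0.0 0.5;
     0.0 1.0 0.5;
     1.0 1.0 0.0;
     0.2 0.1 0.9]

# Gold standard semantic vector of the target word.
s_gold = [1.0, 0.2, 1.4]

# Candidate paths as vectors of n-gram indices.
candidates = [[1, 4], [2, 3], [1, 2]]

# Predicted semantic vector of a path: the sum of the F rows of its n-grams.
shat(path) = vec(sum(F[path, :], dims = 1))

cors = [cor(shat(p), s_gold) for p in candidates]
# The third candidate yields a constant vector, so its correlation is NaN.

# Julia's argmax treats NaN as the largest value, so NaN scores are
# replaced before picking the winner.
scores = [isnan(c) ? -Inf : c for c in cors]
best = candidates[argmax(scores)]   # the path [1, 4]
```

Without the replacement, `argmax(cors)` would point at the NaN entry, which is the behaviour the ignore_nan option is meant to avoid.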
JudiLing.find_top_feature_indices
— Method
find_top_feature_indices(rC, C_train_ind)
Find all indices for the n-grams of the top n closest neighbors of a given target.
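For a single target, the idea can be sketched in plain Julia. The correlation values and index lists are invented, n_neighbors is taken from the build_paths arguments, and the code is an illustration of the description above rather than the function's actual body.

```julia
# Correlations between one predicted form vector and four training words
# (one row of a hypothetical rC matrix).
r = [0.10, 0.85, 0.40, 0.72]

# Gold n-gram indices of the four training words (hypothetical).
C_train_ind = [[1, 2, 3], [1, 4, 5], [6, 7], [4, 5, 8]]

n_neighbors = 2

# Indices of the training words with the highest correlations.
neighbors = partialsortperm(r, 1:n_neighbors, rev = true)

# Pool the n-gram indices of those neighbors, without duplicates.
features = unique(reduce(vcat, C_train_ind[neighbors]))
```

Here the two closest neighbors are words 2 and 4, so the pooled n-gram indices are 1, 4, 5 and 8.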
JudiLing.make_ngrams_ind
— Method
make_ngrams_ind(res, n)
Construct n-gram indices.
JudiLing.predict_shat
— Method
predict_shat(F::Union{Matrix, SparseMatrixCSC},
ci::Vector{Int})
Predicts semantic vector shat given a comprehension matrix F and a list of indices of ngrams ci.
Obligatory arguments
F::Union{Matrix, SparseMatrixCSC}: comprehension matrix F
ci::Vector{Int}: vector of indices of n-grams in the c vector. Essentially, this vector indicates which n-grams in a c vector are present; all other n-grams are absent.
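The relation between the index vector ci and a binary c vector can be checked with a small example. This is a sketch of the underlying arithmetic with a made-up F; the shape of the value that predict_shat itself returns is not specified here.

```julia
# Toy comprehension matrix F (3 n-grams × 2 semantic dimensions).
F = [1.0 0.0;
     0.0 2.0;
     0.5 0.5]

ci = [1, 3]                     # n-grams 1 and 3 are present

# Binary cue vector: 1 where an n-gram is present, 0 elsewhere.
c = zeros(size(F, 1))
c[ci] .= 1.0

shat_from_c = F' * c                             # full vector-matrix product
shat_from_rows = vec(sum(F[ci, :], dims = 1))    # same result by summing rows
```

Both routes give the vector [1.5, 0.5]; working from the indices avoids building the mostly empty c vector.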