Make Semantic Matrix
Make binary semantic vectors
JudiLing.PS_Matrix_Struct — Type
A structure that stores the discrete semantic vectors: pS is the discrete semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.
JudiLing.make_pS_matrix — Function
Make discrete semantic matrix.
JudiLing.make_pS_matrix — Method
make_pS_matrix(data)
Create a discrete semantic matrix given a dataframe.
Obligatory Arguments
data::DataFrame: the dataset
Optional Arguments
features_col::Symbol=:CommunicativeIntention: the column name for target
sep_token::String="_": separator
Examples
s_obj_train = JudiLing.make_pS_matrix(
utterance,
features_col=:CommunicativeIntention,
sep_token="_")
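To make the idea of a discrete semantic matrix concrete, here is a minimal sketch in plain Julia. It does not call JudiLing and is not the library's implementation; the data, the sorted feature order and the dense `Int` matrix are all assumptions chosen for illustration. Only the names pS, f2i and i2f follow the structure described above.

```julia
# Hypothetical stand-in for the features column of a dataframe.
intentions = ["greet_polite", "greet_casual", "thank_polite"]

# Split each entry on the separator token to obtain its features.
split_features = [String.(split(x, "_")) for x in intentions]

# f2i maps feature => column index; i2f is the inverse mapping.
features = sort(unique(vcat(split_features...)))
f2i = Dict(f => i for (i, f) in enumerate(features))
i2f = Dict(i => f for (f, i) in f2i)

# One row per entry, one column per feature; 1 marks that the feature is present.
pS = zeros(Int, length(intentions), length(features))
for (row, feats) in enumerate(split_features)
    for f in feats
        pS[row, f2i[f]] = 1
    end
end
```

With the sorted order used here, the columns correspond to "casual", "greet", "polite" and "thank".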
JudiLing.make_pS_matrix — Method
make_pS_matrix(data_val, pS_obj)
Construct the discrete semantic matrix for the validation dataset, given the validation dataframe and the pS object built from the training dataset.
Obligatory Arguments
data_val::DataFrame: the validation dataset
pS_obj::PS_Matrix_Struct: the training PS object
Optional Arguments
features_col::Symbol=:CommunicativeIntention: the column name for target
sep_token::String="_": separator
Examples
s_obj_val = JudiLing.make_pS_matrix(
data_val,
s_obj_train,
features_col=:CommunicativeIntention,
sep_token="_")
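The reason for passing the training object is that validation rows must use the same feature-to-column mapping as the training matrix. The sketch below illustrates this in plain Julia with hypothetical data. How JudiLing treats features that never occurred in training is not stated here, so skipping them is purely an assumption of this sketch.

```julia
# Hypothetical feature-to-index mapping obtained from the training data.
f2i_train = Dict("casual" => 1, "greet" => 2, "polite" => 3, "thank" => 4)

# Hypothetical validation entries; "formal" never occurred in training.
intentions_val = ["thank_casual", "greet_formal"]

# Same number of columns as the training matrix, so both matrices align.
pS_val = zeros(Int, length(intentions_val), length(f2i_train))
for (row, entry) in enumerate(intentions_val)
    for f in split(entry, "_")
        # Assumption of this sketch: features unseen in training are skipped.
        col = get(f2i_train, f, nothing)
        col === nothing || (pS_val[row, col] = 1)
    end
end
```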
JudiLing.make_combined_pS_matrix — Method
make_combined_pS_matrix(
    data_train,
    data_val;
    features_col = :CommunicativeIntention,
    sep_token = "_",
)
Create discrete semantic matrices for a train and validation dataframe.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
Optional Arguments
features_col::Symbol=:CommunicativeIntention: the column name for target
sep_token::String="_": separator
Examples
s_obj_train, s_obj_val = JudiLing.make_combined_pS_matrix(
data_train,
data_val,
features_col=:CommunicativeIntention,
sep_token="_")
Simulate semantic vectors
JudiLing.L_Matrix_Struct — Type
A structure that stores Lexome semantic vectors: L is the Lexome semantic matrix; f2i is a dictionary returning the indices for features; i2f is a dictionary returning the features for indices.
JudiLing.make_S_matrix — Function
Make simulated semantic matrix.
JudiLing.make_L_matrix — Function
Make simulated lexome matrix.
JudiLing.make_combined_S_matrix — Function
Make combined simulated S matrices, where features are combined from both the training and validation datasets.
JudiLing.make_combined_L_matrix — Function
Make combined simulated Lexome matrix, where features are combined from both the training and validation datasets.
JudiLing.make_S_matrix — Method
make_S_matrix(data::DataFrame, base::Vector, inflections::Vector)
Create a simulated semantic matrix for the training dataset, given the input data, a vector specifying the context lexemes and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.
Obligatory Arguments
data::DataFrame: the dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train = JudiLing.make_S_matrix(
french,
["Lexeme"],
["Tense","Aspect","Person","Number","Gender","Class","Mood"],
ncol=200)
# deep mode
S_train = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_inflection_mean=1,
isdeep=true,
...)
# non-deep mode
S_train = JudiLing.make_S_matrix(
...
isdeep=false,
...)
# add additional Gaussian noise
S_train = JudiLing.make_S_matrix(
...
add_noise=true,
sd_noise=1,
...)
# further control of means and standard deviations
S_train = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_inflection_mean=1,
sd_base=4,
sd_inflection=4,
sd_noise=1,
...)
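The construction described above (a word's semantic vector is the sum of its lexome vectors, optionally with added Gaussian noise) can be sketched without JudiLing as follows. The word list, the small dimension and the way the random vectors are drawn are assumptions for illustration only; the library's own sampling scheme may differ.

```julia
using Random

rng = MersenneTwister(314)   # fixed seed, so the sketch is reproducible
ncol = 5                     # small dimension for illustration

# Hypothetical word forms: one base lexeme and two inflectional features each.
words = [
    (base = "vocare", infl = ["present", "singular"]),
    (base = "vocare", infl = ["past", "singular"]),
    (base = "clamare", infl = ["present", "plural"]),
]

# One random vector per lexome, drawn with sd 4 (assumed sampling scheme).
lexomes = unique(vcat([[w.base; w.infl] for w in words]...))
L = Dict(lex => 4 .* randn(rng, ncol) for lex in lexomes)

# Semantic vector of a word form = sum of the vectors of its lexomes.
S_clean = vcat([permutedims(sum(L[lex] for lex in [w.base; w.infl])) for w in words]...)

# Additional Gaussian noise with sd 1.
S = S_clean .+ 1 .* randn(rng, size(S_clean))
```

Because the first two word forms share their base and number, the difference between their noise-free rows equals the difference between the "present" and "past" vectors.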
JudiLing.make_S_matrix — Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)
Create simulated semantic matrices for the training and validation datasets, given the input data, a vector specifying the context lexemes and a vector specifying the grammatical lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content and grammatical lexemes.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_S_matrix(
french,
french_val,
["Lexeme"],
["Tense","Aspect","Person","Number","Gender","Class","Mood"],
ncol=200)
# deep mode
S_train, S_val = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_inflection_mean=1,
isdeep=true,
...)
# non-deep mode
S_train, S_val = JudiLing.make_S_matrix(
...
isdeep=false,
...)
# add additional Gaussian noise
S_train, S_val = JudiLing.make_S_matrix(
...
add_noise=true,
sd_noise=1,
...)
# further control of means and standard deviations
S_train, S_val = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_inflection_mean=1,
sd_base=4,
sd_inflection=4,
sd_noise=1,
...)
JudiLing.make_S_matrix — Method
make_S_matrix(data::DataFrame, base::Vector)
Create a simulated semantic matrix for the training dataset with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.
Obligatory Arguments
data::DataFrame: the dataset
base::Vector: context lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_base::Int64=4: the sd of base features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train = JudiLing.make_S_matrix(
french,
["Lexeme"],
ncol=200)
# deep mode
S_train = JudiLing.make_S_matrix(
...
sd_base_mean=1,
isdeep=true,
...)
# non-deep mode
S_train = JudiLing.make_S_matrix(
...
isdeep=false,
...)
# add additional Gaussian noise
S_train = JudiLing.make_S_matrix(
...
add_noise=true,
sd_noise=1,
...)
# further control of means and standard deviations
S_train = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_base=4,
sd_noise=1,
...)
JudiLing.make_S_matrix — Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)
Create simulated semantic matrices for the training and validation datasets with only base features, given the input data and a vector specifying the context lexemes. The semantic vector of a word form is constructed by summing the semantic vectors of its content lexemes.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_base::Int64=4: the sd of base features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_S_matrix(
french,
french_val,
["Lexeme"],
ncol=200)
# deep mode
S_train, S_val = JudiLing.make_S_matrix(
...
sd_base_mean=1,
isdeep=true,
...)
# non-deep mode
S_train, S_val = JudiLing.make_S_matrix(
...
isdeep=false,
...)
# add additional Gaussian noise
S_train, S_val = JudiLing.make_S_matrix(
...
add_noise=true,
sd_noise=1,
...)
# further control of means and standard deviations
S_train, S_val = JudiLing.make_S_matrix(
...
sd_base_mean=1,
sd_base=4,
sd_noise=1,
...)
JudiLing.make_S_matrix — Method
make_S_matrix(data_train::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)
Create a simulated semantic matrix where a lexome matrix is available.
Obligatory Arguments
data::DataFrame: the dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
L::L_Matrix_Struct: the lexome matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S1 = JudiLing.make_S_matrix(
latin,
["Lexeme"],
["Person","Number","Tense","Voice","Mood"],
L1,
add_noise=true,
sd_noise=1,
normalized=false
)
JudiLing.make_S_matrix — Method
make_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)
Create simulated semantic matrices for the training and validation datasets where a lexome matrix is available.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
L::L_Matrix_Struct: the lexome matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S1, S2 = JudiLing.make_S_matrix(
latin,
latin_val,
["Lexeme"],
L1,
add_noise=true,
sd_noise=1,
normalized=false
)
JudiLing.make_S_matrix — Method
make_S_matrix(data::DataFrame, base::Vector, L::L_Matrix_Struct)
Create a simulated semantic matrix where a lexome matrix is available.
Obligatory Arguments
data::DataFrame: the dataset
base::Vector: context lexemes
L::L_Matrix_Struct: the lexome matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S1 = JudiLing.make_S_matrix(
latin,
["Lexeme"],
L1,
add_noise=true,
sd_noise=1,
normalized=false
)
JudiLing.make_S_matrix — Method
make_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)
Create simulated semantic matrices for the training and validation datasets where a lexome matrix is available.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
L::L_Matrix_Struct: the lexome matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S1, S2 = JudiLing.make_S_matrix(
latin,
latin_val,
["Lexeme"],
["Person","Number","Tense","Voice","Mood"],
L1,
add_noise=true,
sd_noise=1,
normalized=false
)
JudiLing.make_L_matrix — Method
make_L_matrix(data::DataFrame, base::Vector)
Create Lexome Matrix with simulated semantic vectors where there are only base features.
Obligatory Arguments
data::DataFrame: the dataset
base::Vector: context lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_base::Int64=4: the sd of base features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
Examples
# basic usage
L = JudiLing.make_L_matrix(
latin,
["Lexeme"],
ncol=200)
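One way to read the isdeep, sd_base and sd_base_mean options is sketched below in plain Julia. This is an interpretation of the parameter descriptions, not JudiLing's code: the helper `lexome_vector` is hypothetical, and the use of a zero mean in non-deep mode is an assumption.

```julia
using Random

# Hypothetical helper: draw one lexome vector of length `ncol`.
# In "deep" mode the mean is itself drawn from a normal distribution with
# sd `sd_mean`; otherwise the mean is fixed at zero (assumption of this sketch).
function lexome_vector(rng, ncol; sd = 4, sd_mean = 1, isdeep = true)
    mu = isdeep ? sd_mean * randn(rng) : 0.0
    return mu .+ sd .* randn(rng, ncol)
end

rng = MersenneTwister(314)
v_deep = lexome_vector(rng, 200)                  # randomized mean
v_flat = lexome_vector(rng, 200; isdeep = false)  # zero mean
```

Under this reading, sd_base_mean only matters when isdeep is true, which matches the optional arguments listed for the deep and non-deep constructors of L_Matrix_Struct.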
JudiLing.make_combined_S_matrix — Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector, L::L_Matrix_Struct)
Create simulated semantic matrices for the training and validation datasets with an existing Lexome matrix, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
L::L_Matrix_Struct: the Lexome Matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
latin_train,
latin_val,
["Lexeme"],
["Person","Number","Tense","Voice","Mood"],
L)
JudiLing.make_combined_S_matrix — Method
make_combined_S_matrix(data_train::DataFrame, data_val::Union{DataFrame, Nothing}, base::Vector, L::L_Matrix_Struct)
Create simulated semantic matrices for the training and validation datasets with an existing Lexome matrix, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
L::L_Matrix_Struct: the Lexome Matrix
Optional Arguments
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
latin_train,
latin_val,
["Lexeme"],
L)
JudiLing.make_combined_S_matrix — Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)
Create simulated semantic matrices for the training and validation datasets, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
latin_train,
latin_val,
["Lexeme"],
["Person","Number","Tense","Voice","Mood"],
ncol=n_features)
JudiLing.make_combined_S_matrix — Method
make_combined_S_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)
Create simulated semantic matrices for the training and validation datasets, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
add_noise::Bool=true: if true, add additional Gaussian noise
sd_noise::Int64=1: the sd of the Gaussian noise
normalized::Bool=false: if true, most values range between -1 and 1; some may slightly exceed this range depending on the sd
Examples
# basic usage
S_train, S_val = JudiLing.make_combined_S_matrix(
latin_train,
latin_val,
["Lexeme"],
ncol=n_features)
JudiLing.make_combined_L_matrix — Method
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector, inflections::Vector)
Create Lexome Matrix with simulated semantic vectors, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
inflections::Vector: grammatical lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
Examples
# basic usage
L = JudiLing.make_combined_L_matrix(
latin_train,
latin_val,
["Lexeme"],
["Person","Number","Tense","Voice","Mood"],
ncol=n_features)
JudiLing.make_combined_L_matrix — Method
make_combined_L_matrix(data_train::DataFrame, data_val::DataFrame, base::Vector)
Create Lexome Matrix with simulated semantic vectors, where features are combined from both the training and validation datasets.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
base::Vector: context lexemes
Optional Arguments
ncol::Int64=200: dimension of semantic vectors, usually the same as that of cue vectors
sd_base_mean::Int64=1: the sd of the mean of base features
sd_inflection_mean::Int64=1: the sd of the mean of inflectional features
sd_base::Int64=4: the sd of base features
sd_inflection::Int64=4: the sd of inflectional features
seed::Int64=314: the random seed
isdeep::Bool=true: if true, mean of each feature is also randomized
Examples
# basic usage
L = JudiLing.make_combined_L_matrix(
latin_train,
latin_val,
["Lexeme"],
ncol=n_features)
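The point of the combined variants is that lexomes occurring only in the validation data still receive a vector. A minimal plain-Julia sketch of that idea, with hypothetical lexeme lists and without any JudiLing calls:

```julia
# Hypothetical base lexemes found in each dataset.
base_train = ["vocare", "clamare"]
base_val = ["clamare", "currere"]

# Index built from the training data only: "currere" has no entry.
f2i_train = Dict(f => i for (i, f) in enumerate(unique(base_train)))

# Index built from both datasets: every lexome gets a row in the Lexome matrix.
features = unique(vcat(base_train, base_val))
f2i_combined = Dict(f => i for (i, f) in enumerate(features))
```

Here `features` keeps first-occurrence order, so the training lexomes come first and the validation-only lexome is appended.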
JudiLing.L_Matrix_Struct — Method
L_Matrix_Struct(L, sd_base, sd_base_mean, sd_inflection, sd_inflection_mean, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)
Construct L_Matrix_Struct with deep mode.
JudiLing.L_Matrix_Struct — Method
L_Matrix_Struct(L, sd_base, sd_inflection, base_f, infl_f, base_f2i, infl_f2i, n_base_f, n_infl_f, ncol)
Construct L_Matrix_Struct without deep mode.
Load from word2vec, fasttext or similar
JudiLing.load_S_matrix_from_fasttext — Method
load_S_matrix_from_fasttext(data::DataFrame,
    language::Symbol;
    target_col=:Word,
    default_file::Int=1)
Load semantic matrix from fasttext, loaded using the Embeddings.jl package. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which a semantic vector is available. Returns the subsetted data and the semantic matrix.
The last parameter, default_file, specifies which vectors are loaded. To learn about all available vectors, use the following commands:
using Embeddings
language_files(FastText_Text{:nl})
replacing the language code (here :nl) with the language you are interested in. In general, for all languages other than English, these files are available:
default_file=1 loads from https://fasttext.cc/docs/en/crawl-vectors.html. Paper: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
default_file=2 loads from https://fasttext.cc/docs/en/pretrained-vectors.html. Paper: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
Obligatory Arguments
data::DataFrame: the dataset
language::Symbol: the language of the words in the dataset, officially ISO 639-2 (see https://github.com/JuliaText/Embeddings.jl/issues/34#issuecomment-782604523), though in practice the codes look more like ISO 639-1, with ISO 639-2 used only if no ISO 639-1 code is available (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
default_file::Int=1: source of vectors, for more information see above and here: https://github.com/JuliaText/Embeddings.jl#loading-different-embeddings
Examples
# basic usage
latin_small, S = JudiLing.load_S_matrix_from_fasttext(latin, :la, target_col=:Word)
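The two-way subsetting described for this function (keep only vectors for words in the data, and keep only data rows that have a vector) can be illustrated with a small plain-Julia sketch. The dictionary of vectors and the word list are hypothetical stand-ins; no embeddings are downloaded and no JudiLing or Embeddings.jl functions are called.

```julia
# Hypothetical pretrained vectors: word => embedding.
embeddings = Dict(
    "vocat" => [0.1, 0.3],
    "clamat" => [0.2, -0.4],
    "currit" => [0.5, 0.0],
)

# Hypothetical contents of the target column; "xyzzy" has no vector.
words = ["vocat", "clamat", "xyzzy"]

# Keep only the rows whose word has a vector available.
keep = [haskey(embeddings, w) for w in words]
words_small = words[keep]

# Stack the vectors of the remaining words row-wise, in data order.
S = permutedims(hcat([embeddings[w] for w in words_small]...))
```

The vector for "currit" is dropped because the word does not occur in the data, and the row for "xyzzy" is dropped because it has no vector.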
JudiLing.load_S_matrix_from_fasttext — Method
load_S_matrix_from_fasttext(data_train::DataFrame,
    data_val::DataFrame,
    language::Symbol;
    target_col=:Word,
    default_file::Int=1)
Load semantic matrix from fasttext, loaded using the Embeddings.jl package. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which a semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.
The last parameter, default_file, specifies which vectors are loaded. To learn about all available vectors, use the following commands:
using Embeddings
language_files(FastText_Text{:nl})
replacing the language code (here :nl) with the language you are interested in. In general, for all languages other than English, these files are available:
default_file=1 loads from https://fasttext.cc/docs/en/crawl-vectors.html. Paper: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
default_file=2 loads from https://fasttext.cc/docs/en/pretrained-vectors.html. Paper: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information. License: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
language::Symbol: the language of the words in the dataset, officially ISO 639-2 (see https://github.com/JuliaText/Embeddings.jl/issues/34#issuecomment-782604523), though in practice the codes look more like ISO 639-1, with ISO 639-2 used only if no ISO 639-1 code is available (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
default_file::Int=1: source of vectors, for more information see above and here: https://github.com/JuliaText/Embeddings.jl#loading-different-embeddings
Examples
# basic usage
latin_small_train, latin_small_val, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(latin_train,
latin_val,
:la,
target_col=:Word)
JudiLing.load_S_matrix_from_word2vec_file — Method
load_S_matrix_from_word2vec_file(data::DataFrame,
    filepath::String;
    target_col=:Word)
Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data, and subset data to only include words in target_col for which a semantic vector is available. Returns subsetted data and semantic matrix.
Obligatory Arguments
data::DataFrame: the dataset
filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
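For orientation, the sketch below parses the commonly used uncompressed word2vec text layout: a header line with the vocabulary size and the vector dimension, followed by one line per word. It reads from an in-memory buffer with made-up vectors rather than a real file, and it is not the parser JudiLing uses; whether a given file has the header line is an assumption you should check.

```julia
# A tiny in-memory stand-in for an uncompressed word2vec .txt file.
io = IOBuffer("""
2 3
vocat 0.1 0.2 0.3
clamat -0.4 0.5 0.6
""")

# Header line: number of words and vector dimension.
n_words, dim = parse.(Int, split(readline(io)))

# Remaining lines: the word followed by its vector components.
vectors = Dict{String,Vector{Float64}}()
for line in eachline(io)
    fields = split(line)
    isempty(fields) && continue   # skip blank lines
    vectors[fields[1]] = parse.(Float64, fields[2:end])
end
```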
JudiLing.load_S_matrix_from_word2vec_file — Method
load_S_matrix_from_word2vec_file(data_train::DataFrame,
    data_val::DataFrame,
    filepath::String;
    target_col=:Word)
Load semantic matrix from word2vec filepath. Subset word2vec vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which a semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
filepath::String: path to file with word2vec vectors in .txt (not compressed in any way)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
JudiLing.load_S_matrix_from_fasttext_file — Method
load_S_matrix_from_fasttext_file(data::DataFrame,
    filepath::String;
    target_col=:Word)
Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data, and subset data to only include words in target_col for which a semantic vector is available. Returns subsetted data and semantic matrix.
Obligatory Arguments
data::DataFrame: the dataset
filepath::String: path to file with fasttext vectors in .txt or .vec (not compressed in any way)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
JudiLing.load_S_matrix_from_fasttext_file — Method
load_S_matrix_from_fasttext_file(data_train::DataFrame,
    data_val::DataFrame,
    filepath::String;
    target_col=:Word)
Load semantic matrix from fasttext filepath. Subset fasttext vectors to include only words in target_col of data_train and data_val, and subset data to only include words in target_col for which a semantic vector is available. Returns subsetted train and val data and train and val semantic matrices.
Obligatory Arguments
data_train::DataFrame: the training dataset
data_val::DataFrame: the validation dataset
filepath::String: path to file with fasttext vectors in .txt (not compressed in any way)
Optional Arguments
target_col=:Word: column with orthographic representation of words in data
Utility functions
JudiLing.process_features — Method
process_features(data, feature_cols)
Collect all features given datasets and feature column names.
JudiLing.comp_f_M! — Method
comp_f_M!(L, sd, sd_mean, n_f, ncol, n_b)
Compose feature Matrix with deep mode.
JudiLing.comp_f_M! — Method
comp_f_M!(L, sd, n_f, ncol, n_b)
Compose feature Matrix without deep mode.
JudiLing.merge_f2i — Method
merge_f2i(base_f2i, infl_f2i, n_base_f, n_infl_f)
Merge base f2i dictionary and inflectional f2i dictionary.
JudiLing.lexome_sum — Method
lexome_sum(L, features)
Sum up semantic vector, given lexome vector.
JudiLing.make_St — Method
make_St(L, n, data, base, inflections)
Make S transpose matrix with inflections.
JudiLing.make_St — Method
make_St(L, n, data, base)
Make S transpose matrix without inflections.
JudiLing.add_St_noise! — Method
add_St_noise!(St, sd_noise)
Add noise.
JudiLing.normalize_St! — Method
normalize_St!(St, n_base, n_infl)
Normalize S transpose with inflections.
JudiLing.normalize_St! — Method
normalize_St!(St, n_base)
Normalize S transpose without inflections.