Loading data
JudiLing.load_dataset — Method

```julia
load_dataset(filepath::String;
             delim::String=",",
             kargs...)
```
Load a dataset from file, usually comma- or tab-separated. Returns a DataFrame.
Obligatory arguments

- `filepath::String`: Path to the file to be loaded.
Optional arguments

- `delim::String=","`: Delimiter in the file (usually either `","` or `"\t"`).
- `kargs...`: Further keyword arguments are passed to `CSV.File()`.
Example

```julia
latin = JudiLing.load_dataset("latin.csv")
first(latin, 10)
```
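For tab-separated files, the delimiter can be set explicitly; a minimal sketch, assuming a tab-separated file `latin.tsv` exists:

```julia
# The delim keyword is forwarded to CSV.File(); "\t" reads TSV files.
latin = JudiLing.load_dataset("latin.tsv", delim="\t")
```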
JudiLing.loading_data_randomly_split — Method

```julia
loading_data_randomly_split(
    data_path::String,
    output_dir_path::String,
    data_prefix::String;
    val_sample_size::Int = 0,
    val_ratio::Float64 = 0.0,
    random_seed::Int = 314)
```
Reads in a dataframe and randomly splits it into a training and a validation dataset. Both datasets are also written to `output_dir_path`.

Note that the order of `data_prefix` and `output_dir_path` is exactly reversed compared to `loading_data_careful_split`.
Obligatory arguments

- `data_path::String`: Path to where the dataset is stored.
- `output_dir_path::String`: Path to where the new dataframes should be stored.
- `data_prefix::String`: Prefix of the two new files, which will be called `data_prefix_train.csv` and `data_prefix_val.csv`.
Optional arguments
val_sample_size::Int = 0
: Size of the validation dataset (onlyval_sample_size
orval_ratio
may be used).val_ratio::Float64 = 0.0
: Fraction of the data that should be in the validation dataset (onlyval_sample_size
orval_ratio
may be used).random_seed::Int = 314
: Random seed for controlling random split.
Example

```julia
# Split off 10% of the data as validation data.
data_train, data_val = JudiLing.loading_data_randomly_split(
    "latin.csv",
    "careful",
    "latin",
    val_ratio = 0.1
)
```
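Alternatively, an absolute validation size can be requested via `val_sample_size`; a sketch, assuming the same `latin.csv` as above (only one of `val_sample_size` and `val_ratio` may be set):

```julia
# Put exactly 100 rows into the validation dataset.
data_train, data_val = JudiLing.loading_data_randomly_split(
    "latin.csv",
    "careful",
    "latin",
    val_sample_size = 100
)
```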
JudiLing.loading_data_careful_split — Method

```julia
loading_data_careful_split(
    data_path::String,
    data_prefix::String,
    output_dir_path::String,
    n_features_columns::Union{Vector{Symbol},Vector{String}};
    train_sample_size::Int = 0,
    val_sample_size::Int = 0,
    val_ratio::Float64 = 0.0,
    n_grams_target_col::Union{Symbol, String} = :Word,
    n_grams_tokenized::Bool = false,
    n_grams_sep_token::Union{Nothing, String} = nothing,
    grams::Int = 3,
    n_grams_keep_sep::Bool = false,
    start_end_token::String = "#",
    random_seed::Int = 314,
    verbose::Bool = false)
```
Reads in a dataframe and splits it into a training and a validation dataset. The split is done such that all features in the columns specified in `n_features_columns` occur in both the training and the validation data. It is also ensured that the unique grams obtained by splitting the strings in column `n_grams_target_col` into `grams`-grams occur in both datasets. Both datasets are also written to `output_dir_path`.

Note that the order of `data_prefix` and `output_dir_path` is exactly reversed compared to `loading_data_randomly_split`.
Obligatory arguments

- `data_path::String`: Path to where the dataset is stored.
- `data_prefix::String`: Prefix of the two new files, which will be called `data_prefix_train.csv` and `data_prefix_val.csv`.
- `output_dir_path::String`: Path to where the new dataframes should be stored.
- `n_features_columns::Union{Vector{Symbol},Vector{String}}`: Vector of columns whose features have to occur in both the training and the validation data.
Optional arguments

- `train_sample_size::Int = 0`: Size of the training dataset.
- `val_sample_size::Int = 0`: Size of the validation dataset (only one of `val_sample_size` and `val_ratio` may be used).
- `val_ratio::Float64 = 0.0`: Fraction of the data that should go into the validation dataset (only one of `val_sample_size` and `val_ratio` may be used).
- `n_grams_target_col::Union{Symbol, String} = :Word`: Column with the target words.
- `n_grams_tokenized::Bool = false`: Whether the words in `n_grams_target_col` are already tokenized.
- `n_grams_sep_token::Union{Nothing, String} = nothing`: String with which tokens in `n_grams_target_col` are separated (only used if `n_grams_tokenized=true`).
- `grams::Int = 3`: Granularity of the n-grams.
- `n_grams_keep_sep::Bool = false`: Whether the token separators should be kept in the n-grams (useful e.g. when working with syllables).
- `start_end_token::String = "#"`: Token with which the start and end of words are marked.
- `random_seed::Int = 314`: Random seed controlling the random split.
- `verbose::Bool = false`: Whether progress information should be printed.
Example

```julia
data_train, data_val = JudiLing.loading_data_careful_split(
    "latin.csv",
    "latin",
    "careful",
    ["Lexeme","Person","Number","Tense","Voice","Mood"]
)
```
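The files written by either split function can later be read back with `JudiLing.load_dataset`; a sketch, assuming the example above has been run with output directory `"careful"` and prefix `"latin"`:

```julia
# File names follow the documented data_prefix_train.csv / data_prefix_val.csv scheme.
data_train = JudiLing.load_dataset(joinpath("careful", "latin_train.csv"))
data_val = JudiLing.load_dataset(joinpath("careful", "latin_val.csv"))
```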