PEACEWORD
Prototype for Extracting And Considering the Explainability of WORD embeddings.
This simple Git project contains proof-of-concept implementations of two classic heuristics, in order to assess their suitability for word embeddings.
This project was developed by the ORKAD research team of the CRIStAL laboratory, University of Lille. 🌐 ORKAD team web site
Required Elements
- Python interpreter (version 3.12 or higher)
- Git
Quick installation
Default installation can be summarized as follows:
git clone https://gitlab.cristal.univ-lille.fr/orkad-public/peaceword.git
cd peaceword
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
Folder organisation
- models: the folder containing the downloaded datasets. Note that the 'text8' dataset (based on Wikipedia text) is constantly evolving, so the resulting cosine similarities may vary depending on when the dataset was downloaded.
- methods: Python package containing the two approaches (hillclimbing and greedy).
- project root (.): contains the main programs (described below).
Programs
This section describes the Python programs included in this Git project.
Downloading datasets
There are two programs: the first (load_model.py) downloads a model from the gensim library; the second (load_glove_model.py) is specific to the glove-XXX datasets.
Both are used the same way: run the script with the name of the dataset as its argument, and the loaded model is stored in the models directory.
Here's an example for the 'text8' dataset.
python3 load_model.py text8
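For orientation, a download step of this kind can be sketched with gensim's downloader API. This is only a sketch under assumptions: the helper names `model_path` and `download_vectors` are hypothetical, and the actual load_model.py and load_glove_model.py may work differently. It assumes the requested name is a pretrained vector set (such as glove-wiki-gigaword-100), for which `gensim.downloader.load` returns a `KeyedVectors` object.

```python
import os


def model_path(name, models_dir="models"):
    """Return the (hypothetical) storage location for a dataset name."""
    return os.path.join(models_dir, name)


def download_vectors(name, models_dir="models"):
    """Download pretrained vectors with gensim and save them locally.

    Assumption: `name` is a pretrained model known to gensim-data
    (e.g. "glove-wiki-gigaword-100"), so the loaded object is a KeyedVectors.
    """
    # Imported here so that the path helper works without gensim installed.
    import gensim.downloader as api

    os.makedirs(models_dir, exist_ok=True)
    vectors = api.load(name)  # downloads on first use, then served from cache
    path = model_path(name, models_dir)
    vectors.save(path)  # can be reloaded later with KeyedVectors.load(path)
    return path
```

Calling `download_vectors("glove-wiki-gigaword-100")` would then leave the vectors under `models/glove-wiki-gigaword-100`, the location used in the examples further down.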
The greedy method
The main program run_greedy.py requires several parameters:
- dataset: the dataset location.
- only_pos: 'yes' if the search is limited to positive words, 'no' otherwise.
- min_d: the minimum distance between two dimension values for them to be considered close (double value).
- min_p: the minimum percentage (integer value) of close dimensions for selecting a word.
- threshold: the minimum absolute value (double) for which a dimension value is considered relevant.
- target: the target word.
Here is an example:
python3 run_greedy.py ./models/text8 yes 0.0279 5 0.2233 yes queen
The results are stored in a CSV file.
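To make the roles of min_d, min_p and threshold more concrete, here is one possible reading of the selection criterion they describe. It is a sketch, not the code of the methods package: the function names are hypothetical, the toy vectors are invented, and two points are assumptions. First, min_d is treated as the largest gap at which two dimension values still count as close. Second, the percentage is computed over the relevant dimensions of the target only.

```python
def close_dimension_percentage(target, candidate, min_d, threshold):
    """Percentage of relevant target dimensions on which candidate is close.

    Assumed reading: a dimension is relevant when |target[i]| >= threshold,
    and close when |target[i] - candidate[i]| <= min_d.
    """
    relevant = [i for i, t in enumerate(target) if abs(t) >= threshold]
    if not relevant:
        return 0.0
    close = sum(1 for i in relevant if abs(target[i] - candidate[i]) <= min_d)
    return 100.0 * close / len(relevant)


def select_words(target, vocabulary, min_d, min_p, threshold):
    """Keep the words whose percentage of close dimensions reaches min_p."""
    return [
        word
        for word, vector in vocabulary.items()
        if close_dimension_percentage(target, vector, min_d, threshold) >= min_p
    ]


# Toy 4-dimensional vectors, invented for illustration only.
target = [0.50, -0.30, 0.01, 0.25]
vocabulary = {
    "near": [0.51, -0.29, 0.90, 0.24],
    "far": [-0.50, 0.30, 0.01, 0.90],
}
print(select_words(target, vocabulary, min_d=0.02, min_p=50, threshold=0.2))
```

In this toy case the third dimension of the target falls below the threshold and is ignored, so a candidate can differ widely there and still be selected.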
The HillClimbing method
The main program run_hillclimbing.py requires several parameters:
- dataset: the dataset location.
- only_pos: 'yes' if the search is limited to positive words, 'no' otherwise.
- seed: a seed number (the hill-climbing method is not deterministic).
- target: the target word.
Here is an example:
python3 run_hillclimbing.py ./models/text8 no 16 brother
The results are stored in a CSV file.
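The role of the seed can be illustrated with a generic hill-climbing loop. The sketch below is not the implementation found in the methods package. It assumes, for illustration, that a solution is a fixed-size set of words whose summed vectors should be as close as possible (by cosine similarity) to the target word. The function names, the `steps` parameter and the toy vectors are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(words, vectors):
    """Sum the vectors of the chosen words, dimension by dimension."""
    return [sum(dims) for dims in zip(*(vectors[w] for w in words))]


def hill_climb(target, vectors, size, seed, steps=200):
    """Randomly swap one word at a time and keep only improving swaps."""
    rng = random.Random(seed)  # all randomness comes from this seeded generator
    candidates = sorted(w for w in vectors if w != target)
    current = rng.sample(candidates, size)
    best = cosine(combine(current, vectors), vectors[target])
    for _ in range(steps):
        neighbour = list(current)
        neighbour[rng.randrange(size)] = rng.choice(candidates)
        if len(set(neighbour)) < size:
            continue  # skip neighbours that repeat a word
        score = cosine(combine(neighbour, vectors), vectors[target])
        if score > best:
            current, best = neighbour, score
    return sorted(current), best


# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "brother": [1.0, 1.0, 0.0],
    "man": [1.0, 0.0, 0.0],
    "sibling": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
    "road": [0.1, 0.0, 0.9],
}
print(hill_climb("brother", vectors, size=2, seed=16))
```

Because every random choice goes through `random.Random(seed)`, two runs with the same seed follow the same path, while different seeds may start from different word sets.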
The exampleAnalogy.py program
This simple program computes and displays the cosine similarity between fixed solutions and a target word.
Its only parameter is the dataset location. Here is an example:
python3 exampleAnalogy.py ./models/glove-wiki-gigaword-100
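The computation involved can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the content of exampleAnalogy.py: the `solve` helper, the choice of the king/man/woman/queen analogy and the 3-dimensional toy vectors are all invented here, whereas the real program reads its vectors from the dataset given as argument.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve(positive, negative, vectors):
    """Add the vectors of the positive words and subtract the negative ones."""
    size = len(next(iter(vectors.values())))
    result = [0.0] * size
    for word in positive:
        result = [r + x for r, x in zip(result, vectors[word])]
    for word in negative:
        result = [r - x for r, x in zip(result, vectors[word])]
    return result


# Toy vectors (dimensions: royalty, male, female), invented for illustration.
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
}

solution = solve(["king", "woman"], ["man"], toy_vectors)
similarity = cosine_similarity(solution, toy_vectors["queen"])
print(f"king - man + woman vs queen: {similarity:.4f}")
```

With real embeddings the similarity is rarely this clean, which is why the program reports the value for each fixed solution rather than a yes/no answer.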
Information
Authors
See Authors
License
PEACEWORD is licensed under the following license:
- GNU General Public License version 3 (GPLv3). GPL refers to the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.