Commit 964c2400 authored by Caron Olivier
methods for word embedding

parent 987f095e
.gitignore 0 → 100644
venv
__pycache__
.idea
Authors 0 → 100644
:: PEACEWORD ::
Authors:
Olivier Caron
Alexander Bassett
Julie Jacques
Julien Baste
LICENSE 0 → 100644
# PEACEWORD
Prototype for Extracting And Considering the Explainability of WORD embeddings.
This simple Git project contains two classic heuristics (proofs of concept) for assessing
the explainability of word embeddings.
This project is a work of the ORKAD research team of the CRIStAL laboratory, University of Lille.
[:globe_with_meridians: ORKAD team web site](https://orkad.univ-lille.fr)
## Required Elements
* `Python` interpreter (version 3.12 or higher)
* `Git`
## Quick installation
Default installation can be summarized as follows:
```bash
git clone https://gitlab.cristal.univ-lille.fr/orkad-public/peaceword.git
cd peaceword
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
```
## Folder organisation
- **models**: contains the downloaded datasets. Note that the 'text8' dataset
(based on Wikipedia text) is constantly evolving; the resulting cosine similarity may therefore
vary depending on when the dataset was downloaded.
- **methods**: Python package containing the two approaches (hillclimbing and greedy)
- **project root** (**.**): contains the different main programs (described below)
## Programs
This section describes the various Python programs included in this Git project.
### Downloading datasets
There are two programs: the first (_load_model.py_)
allows you to download a model from the gensim library;
the second (_load_glove_model.py_) is specific to glove-XXX datasets.
They are easy to use: launch the Python program with the name of the dataset as argument, and the loaded model is stored in the *models* directory.
Here is an example for the 'text8' dataset:
```bash
python3 load_model.py text8
```
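Once downloaded, models are reloaded by the main programs via gensim. A minimal sketch of that reloading step (the file names are examples; _load_model.py_ saves in gensim's native format, _load_glove_model.py_ in binary word2vec format):
```python
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

# model trained and saved by load_model.py (gensim native format)
text8_vectors = Word2Vec.load("./models/text8").wv

# model saved by load_glove_model.py (binary word2vec format)
glove_vectors = KeyedVectors.load_word2vec_format(
    "./models/glove-wiki-gigaword-100", binary=True)
```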
### The greedy method
The main program **run_greedy.py** requires several parameters:
- `dataset` : the dataset location
- `pos_only` : 'yes' if the search is limited to positive words, 'no' otherwise
- `min_d` : the distance threshold (float) below which two dimension values are considered
close
- `min_p` : the minimum percentage (integer value) of close dimensions for selecting a word
- `threshold` : the minimum absolute value (float) above which a dimension value is considered relevant
- `test_improve` : 'yes' if a newly found word must improve the current solution to be kept, 'no' otherwise
- `target` : the target word
Here is an example:
```bash
python3 run_greedy.py ./models/text8_article yes 0.0279 5 0.2233 yes queen
```
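To make the `min_d`/`min_p` semantics concrete, here is a minimal sketch (with made-up vectors) of the dimension-closeness test implemented by `coverage_similarity` and `test_filter` in the `methods` package:
```python
import numpy as np

min_d, min_p = 0.0279, 5                      # distance threshold, minimum percentage
word = np.array([0.30, -0.10, 0.52, 0.01])    # hypothetical word vector
target = np.array([0.31, -0.40, 0.50, 0.02])  # hypothetical target vector

close = np.abs(word - target) <= min_d        # dimensions considered close
percent = 100 * close.sum() / word.size       # here: 3 of 4 dimensions -> 75.0
print(percent >= min_p)                       # True: the word passes the filter
```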
## Information
### Authors
See [Authors](./Authors)
### License
PEACEWORD is licensed under the following license:
* [GNU General Public License version 3 (GPLv3)](./LICENSE). GPL refers to the GNU General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import sys
import gensim.downloader as api
def main():
"""
    This simple program loads a glove dataset from the gensim library and stores it in the "models" directory
    Example : python3 load_glove_model.py glove-wiki-gigaword-100
"""
if len(sys.argv) < 2:
print('Usage: load_glove_model.py gensimModelName')
sys.exit()
model = api.load(sys.argv[1])
model.save_word2vec_format(f"./models/{sys.argv[1]}", binary=True) # save in binary format
print(f"INFO :: model Trained {sys.argv[1]} saved in ./models")
if __name__ == '__main__':
main()
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from gensim.models.word2vec import Word2Vec
import sys
import gensim.downloader as api
def main():
"""
    This simple program downloads a gensim corpus, trains a Word2Vec model on it
    and stores the trained model in the "models" directory
    Example : python3 load_model.py text8
"""
if len(sys.argv) < 2:
        print('Usage: load_model.py gensimModelName')
sys.exit()
corpus = api.load(sys.argv[1])
model = Word2Vec(corpus)
model.save(f"./models/{sys.argv[1]}")
print(f"INFO :: model Trained {sys.argv[1]} saved in ./models")
if __name__ == '__main__':
main()
\ No newline at end of file
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from numpy import dot
from numpy.linalg import norm
import numpy as np
def normalize_model(model):
"""
normalization of gensim KeyedVectors
Args:
model: the no normalized downloaded model
Returns: the corresponding map where all dimensions of the vector words are normalized
and the elapsed time of the normalization
"""
# Retrieve keys and corresponding vectors
keys = list(model.key_to_index.keys())
vectors = np.stack([model[key] for key in keys])
normalized_model = {}
for i_dim in range(model.vector_size):
vectors.T[i_dim] = vectors.T[i_dim] / max(abs(vectors.T[i_dim]))
nb = 0
for word in keys:
normalized_model[word] = vectors[nb]
nb = nb + 1
return normalized_model
def cosine_similarity(vector_a, vector_b) -> float:
"""
compute the cosine similarity between two vectors
Arguments:
vector_a - the first vector
vector_b - the second vector
"""
a = vector_a
b = vector_b
if len(vector_a) == 0:
return 0.0
norm_a = norm(a)
norm_b = norm(b)
if norm_a == 0.0 or norm_b == 0.0:
return 0.0
cos_sim = dot(a, b) / (norm_a * norm_b)
return cos_sim
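# Worked illustration (assumed values, not part of the module API):
#   a = [1, 0], b = [1, 1]
#   dot(a, b) = 1, norm(a) = 1, norm(b) = sqrt(2)
#   cosine_similarity(a, b) = 1 / sqrt(2) ≈ 0.7071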
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from methods.solution import Solution
from methods.common import cosine_similarity
def coverage_similarity(vector_a, vector_b, min_d) -> int:
"""
compute the number of dimensions that are equivalent (according to given min_d)
Arguments:
vector_a - the first word vector
vector_b - the second word vector
min_d - the minimum distance to consider dimension values as equivalent
"""
counter = 0
for value1, value2 in zip(vector_a, vector_b):
if abs(value1-value2)<=min_d:
counter = counter + 1
return counter
def sub_vector(v, coverage):
"""
returns the sub vector of v according to a coverage
Arguments:
v - the given vector
coverage - a vector that contains all vector indexes to take into account
"""
return [v[i] for i in coverage]
def cosine_closest_match(vocab, vector_words, target_word_vector, coverage, banned_words):
"""
    returns the word from the model base that is the closest match according to cosine_similarity and the given coverage
    Arguments:
    vocab - array containing the words including target_word
    vector_words - dictionary of word vectors
    target_word_vector - the given target word vector
    coverage - a vector that contains all vector indexes to take into account
    banned_words - list of words not to take into account
"""
best = None
best_cosine = None
for name in vocab:
if name not in banned_words:
cosine = cosine_similarity(sub_vector(vector_words[name],coverage), sub_vector(target_word_vector,coverage))
if best is None:
best = name
best_cosine = cosine
else:
if cosine > best_cosine:
best = name
best_cosine = cosine
return best
def coverage_closest_match(vocab, vector_words, target_word_vector, coverage, banned_words, min_d):
"""
    returns the word from the model base that is the closest match according to coverage_similarity and the given coverage
    Arguments:
    vocab - array containing the words including target_word
    vector_words - dictionary of word vectors
    target_word_vector - the given target word vector
    coverage - a vector that contains all vector indexes to take into account
    banned_words - list of words not to take into account
    min_d - the minimum distance to consider dimension values as equivalent
"""
best=None
best_match = None
for name in vocab:
if name not in banned_words:
nb = coverage_similarity(sub_vector(vector_words[name],coverage), sub_vector(target_word_vector,coverage),min_d)
if best is None:
best = name
best_match = nb
else:
if nb > best_match:
best = name
best_match = nb
return best
def test_filter(vector_a, vector_b, vector_size, min_d, min_p):
"""
    this function returns true if at least min_p percent of the dimensions of vector_a and vector_b are equivalent,
    returns false otherwise
    Args:
        vector_a: first given word vector
        vector_b: second given word vector
        vector_size : the size of the two given word vectors
        min_d: the distance threshold used for comparing dimensions
        min_p : the required percentage of equivalent dimensions to produce a true result
    Returns: true if vectors are considered as equivalent
"""
if vector_size == 0:
return False
counter = coverage_similarity(vector_a, vector_b, min_d) # coincidence test of word
return ((counter * 100) / vector_size) >= min_p
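# Example (hypothetical numbers): with vector_size = 100, min_p = 5 and
# min_d = 0.0279, test_filter accepts a word as soon as at least 5 of its 100
# dimensions lie within 0.0279 of the target's corresponding dimensions.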
def greedy_prepare_data(norm_model, pos_only, min_d, min_p, threshold, target_word):
"""
    Step 1 of the greedy algorithm, returns the vector of words and the pertinent coverage
    :param norm_model: the dataset
    :param pos_only: only manage positive words if equal to True
    :param min_d: the minimum distance to compare for each dimension of the word vector
    :param min_p: minimum percentage of dimensions (step 1);
        if equal to zero, consider all words
    :param threshold: restricts the coverage to the dimensions of the target vector with abs(value) > threshold;
        if threshold equals 0, the resulting coverage contains all dimensions
    :param target_word: the name of the target
    :return: the map of word vectors, the pertinent coverage.
"""
print("before step 1")
print("initial number of words:", len(norm_model.keys()))
print("pos_only:",pos_only)
print("min_d:",min_d, "min_p:",min_p)
print("threshold:",threshold)
# step 1 : complete the base with negative words if pos_only is equal to False
print("Step 1: ")
if not pos_only:
print("complete base with negative words")
vector_words = {}
if target_word[0] == "-":
target_word_vector = -1 * norm_model[target_word[1:]]
else:
target_word_vector = norm_model[target_word]
wv_size = len(target_word_vector)
print("size word:",wv_size)
if threshold == 0:
coverage = list(range(wv_size)) # init coverage
else:
coverage = []
for i in range(wv_size):
if abs(target_word_vector[i]) > threshold:
coverage.append(i)
for word in norm_model.keys():
current_wv = norm_model[word]
if (word == target_word or (min_p == 0) or
(test_filter(sub_vector(current_wv, coverage), sub_vector(target_word_vector, coverage),
len(coverage), min_d, min_p))):
vector_words[word] = current_wv
if not pos_only:
inverse_word = f"-{word}"
inverse_wv = -1 * current_wv # works with np.ndarray
# inverse_wv = [x * -1 for x in current_wv]
if (inverse_word == target_word or (min_p == 0) or
(test_filter(sub_vector(inverse_wv, coverage), sub_vector(target_word_vector, coverage),
len(coverage), min_d, min_p))):
vector_words[inverse_word] = inverse_wv
print("after step 1")
print("number of words:", len(vector_words))
print("size of coverage:",len(coverage))
return vector_words, coverage
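# Example (hypothetical numbers): with threshold = 0.2233, a 100-dimensional
# target vector whose absolute values exceed 0.2233 only at indexes 3, 17 and 42
# yields the initial coverage [3, 17, 42]; with threshold = 0 the coverage keeps
# all 100 dimensions.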
def algo_greedy(vector_words, target_word, min_d, max_number_words, test_improve, coverage):
"""
    Greedy algorithm for finding semantically related words
    Arguments:
    vector_words - the map of word vectors produced by greedy_prepare_data
    target_word - the given target word
    min_d - the initial minimum distance to compare for each dimension of the word vector
    max_number_words - max size of the result vector
    test_improve - if True, the found word (step 2) must improve the solution to be taken into account
    coverage - a vector of indexes of dimensions to take into account
    returns the resulting solution and the coverage size at the end of the process
"""
solution = Solution([], [])
target_word_vector = vector_words[target_word]
wv_size = len(target_word_vector)
vocab = list(vector_words.keys())
iteration = 1
    delta_min_d = (min_d / max_number_words) / 2  # min_d shrinks by this amount after each iteration
best_cs = -1 # init
banned_words = [ target_word ]
while iteration <= max_number_words and len(coverage)>0:
# Step 2 : find the closest word
print(f"Step 2 (find the closest word by cosine similarity), iteration number :", iteration)
print("coverage size:",len(coverage))
new_word = cosine_closest_match(vocab, vector_words, target_word_vector, coverage,banned_words)
if new_word is not None:
improve = True
banned_words.append(new_word)
if test_improve: # test if new_word improves the solution
temp_sol = solution.word_vector(vector_words,wv_size) + vector_words[new_word]
current_cs = cosine_similarity(temp_sol, target_word_vector)
if current_cs < best_cs:
improve = False
else:
best_cs = current_cs
if improve:
solution.add(new_word)
print("After Step 2 : found word:",new_word)
# Step 3 :
print("Step 3 - update coverage")
coverage = update_coverage(vector_words[new_word], target_word_vector, min_d, coverage)
print("After Step 3 : coverage size : ", len(coverage))
else:
print("After Step 2 : the found word:", new_word, "does not improve the solution")
iteration = iteration + 1
min_d = min_d - delta_min_d
return solution, len(coverage)
def update_coverage(word_vector, target_word_vector, min_d, coverage):
"""
    compute a new coverage by removing the dimensions that are equivalent (according to min_d)
    between a word vector and the target word vector
    Arguments:
    word_vector - the word vector to compare with the target one
    target_word_vector - the target word vector
    min_d - the minimum delta for each dimension
    coverage - the current coverage
"""
for i,(v1,v2) in enumerate(zip(word_vector,target_word_vector)):
if i in coverage and abs(v1-v2)<=min_d:
coverage.remove(i)
return coverage
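# Example (hypothetical numbers): with coverage = [0, 2, 5], if dimension 2 of
# the found word lies within min_d of the target's dimension 2, the updated
# coverage becomes [0, 5]; later iterations thus focus on dimensions that the
# current solution does not yet explain.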
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import copy
class HillClimbing:
"""
    class that implements the hillclimbing local search algorithm
"""
def __init__(self, word_embedding, solution, target_word, banned_words=None) -> None:
"""
class constructor, initialization of instance variables
:param word_embedding: the class that contains the dataset
:param solution: an initial solution
:param target_word: the target word
:param banned_words: list of words to be banned
"""
if banned_words is None:
banned_words = []
self.model = word_embedding
self.initial_sol = solution
self.current_sol = solution
self.target_word = target_word
self.vocab = copy.deepcopy(self.model.get_vocab())
        for word in banned_words:
            self.vocab.remove(word)
self.iterations = 0
self.evals = 0
self.eval_method = self.model.evaluate
def step(self):
"""
perform one step of the algorithm
        :return: the score of a better solution, or False if no better solution is found
"""
score = self.evaluate(self.current_sol, self.target_word)
for neighbor in self.model.neighbor_solutions(self.current_sol, self.vocab, self.target_word):
next_score = self.evaluate(neighbor, self.target_word)
            if next_score - score > 0.05:  # accept only clear improvements (margin of 0.05)
score = next_score
self.set_current_solution(neighbor)
return score
return False
def evaluate(self, sol, predict_word):
"""
computes an evaluation of a given solution
:param sol: the current solution
:param predict_word: the target word
:return:
"""
self.evals += 1
return self.eval_method(sol, predict_word)
def set_current_solution(self, new_sol):
"""
set the current solution
:param new_sol: the selected solution
:return:
"""
self.current_sol = new_sol
self.iterations +=1
def get_current_solution(self):
"""
provides the current solution
:return:
"""
return self.current_sol
def get_iterations(self):
"""
provides the number of iterations
"""
return self.iterations
def __str__(self) -> str:
return f"HillClimbing (model={self.model}, init={self.initial_sol})"
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import copy
import random
from methods.solution import Solution
from methods.common import cosine_similarity
class WordEmbedding:
"""
class that contains the dataset, provides getter methods and a method for calculating a new neighborhood
"""
def __init__(self, vectors, pos_only, seed) -> None:
"""
class constructor, initialize the instance variables
:param vectors: KeyedVectors object from gensim library
:param pos_only: boolean value, restricted to positive words if set to True
:param seed: the random seed
"""
self.wv = vectors
self.pos_only = pos_only
self.list_keys = list(self.wv.keys())
random.seed(seed)
def get_vocab_size(self):
""" provides the dataset size """
return len(self.list_keys)
def get_vocab(self):
""" provides the list of words managed by the dataset"""
return self.list_keys
def get_random_word(self):
""" provides a random word from the data set """
return random.choice(self.list_keys)
def random_solution(self, nb_words, eval_word):
""" provides a random solution containing nb_words
Arguments:
nb_words : the solution size to generate
eval_word : the target word, this word must not be in the solution
"""
positives = []
negatives = []
couple = (positives, negatives)
for i in range(nb_words):
new_word = self.get_random_word()
while new_word in (positives+negatives) or new_word == eval_word:
new_word = self.get_random_word()
if self.pos_only:
positives.append(new_word)
else:
random.choice(couple).append(new_word)
return Solution(positives, negatives)
def evaluate(self, solution, predict_word):
"""
provides the cosine_similarity of the solution and the target word
:param solution: a solution
:param predict_word: the target word
:return: the cosine similarity
"""
result = 0
if len(solution.positive) == 0 and len(solution.negative) == 0:
print("WARNING : EMPTY SOLUTION PASSED")
return 0
for word in solution.positive:
result = result + self.wv[word]
for word in solution.negative:
result = result - self.wv[word]
return round(cosine_similarity(result, self.wv[predict_word]), 6)
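    # Example (illustrative words): for Solution(positive=["king", "woman"],
    # negative=["man"]), evaluate builds wv["king"] + wv["woman"] - wv["man"] and
    # returns its cosine similarity with wv[predict_word] (e.g. "queen").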
def neighbor_solutions(self, solution, vocab, eval_word):
"""
calculates a list of neighboring solutions from a given solution;
        neighboring solutions differ from the initial solution by a single word
:param solution: the current solution
:param vocab: vocabulary
:param eval_word: the target word
"""
neighbors = []
#part 1 : removing a word
for word in solution.positive:
s_copy = copy.deepcopy(solution)
s_copy.positive.remove(word)
neighbors.append(s_copy)
if not self.pos_only:
for word in solution.negative:
s_copy = copy.deepcopy(solution)
s_copy.negative.remove(word)
neighbors.append(s_copy)
#part 2 : changing a word
for word in solution.positive:
s_copy = copy.deepcopy(solution)
s_copy.positive.remove(word)
for v in vocab:
if v not in (solution.positive+solution.negative) and v != eval_word and v != word:
replaced = copy.deepcopy(s_copy)
replaced.positive.append(v)
neighbors.append(replaced)
if not self.pos_only:
for word in solution.negative:
s_copy = copy.deepcopy(solution)
s_copy.negative.remove(word)
for v in vocab:
if v not in (solution.positive + solution.negative) and v != eval_word and v != word:
replaced = copy.deepcopy(s_copy)
replaced.negative.append(v)
neighbors.append(replaced)
# part 3 : add a word
for word in vocab:
if word not in (solution.positive + solution.negative) and word != eval_word:
s_copy = copy.deepcopy(solution)
s_copy.positive.append(word)
neighbors.append(s_copy)
if not self.pos_only and word not in (solution.positive + solution.negative) and word != eval_word:
s_copy = copy.deepcopy(solution)
s_copy.negative.append(word)
neighbors.append(s_copy)
        # remove empty solutions; a list comprehension avoids skipping items,
        # which happens when removing from a list while iterating over it
        neighbors = [n for n in neighbors if not n.is_empty()]
        return neighbors
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import numpy as np
class Solution:
def __init__(self, positives=None, negatives=None) -> None:
""" initialisation of a solution with positive and negative set of words"""
if positives is None:
self.positive = []
else:
self.positive = positives
if negatives is None:
self.negative = []
else:
self.negative = negatives
def is_empty(self):
"""
:return: true if the solution contains no word, false otherwise
"""
return len(self.positive) == 0 and len(self.negative) == 0
def word_vector(self, map_word_vector, vector_size):
""" computes the corresponding word vector of the solution
Args:
map_word_vector : the map corresponding to the dataset
vector_size: the vector size
Returns:
result: the corresponding word vector of the solution
"""
        result = np.zeros(vector_size)
        for word in self.positive:
            result = result + map_word_vector[word]
        for word in self.negative:
            result = result + map_word_vector["-" + word]
        return result
def add(self,word):
""" add a new word in the solution
Args:
word: the new word, if word starts with the character '-', it is a negative word
"""
if word[0] == '-':
self.negative.append(word[1:])
else:
self.positive.append(word)
def __str__(self) -> str:
return f"Solution (positive={self.positive}, negative={self.negative})"
\ No newline at end of file
This folder will contain the downloaded datasets.
\ No newline at end of file
numpy==1.26.4
gensim==4.3.3
pandas==2.2.3
#!/bin/bash
words="queen berlin brother euro athens"
dataset=$1
pos_only=$2
for word in $words
do
for ((seed=1; seed<= 50; seed++))
do
sbatch we_slurm.sl $dataset $pos_only $seed $word
done
done
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import sys
import csv
import re
import os
import time
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
from methods.greedy.greedy import algo_greedy, greedy_prepare_data
from methods.common import cosine_similarity, normalize_model
def main():
if len(sys.argv) != 8:
        print('Usage: python3 run_greedy.py dataset_location pos_only (yes or no) '
              'min_d (real) min_p (0-100) '
              'threshold (0 or limit value) test_improve (yes or no) eval_word')
sys.exit()
# parameters :
max_number_words = 6
dataset_location = sys.argv[1]
param_pos_only = sys.argv[2]
param_min_d = sys.argv[3]
param_min_p = sys.argv[4]
param_threshold = sys.argv[5]
param_test_improve = sys.argv[6]
target_word=sys.argv[7]
    regex = r'^[+-]?[0-9]+(\.[0-9]+)?$'  # anchored so the whole argument must be a number
if param_pos_only not in ['yes','no']:
print(f'bad pos_only parameter : {param_pos_only} (yes or no are only supported)')
sys.exit()
if not re.search(regex, param_threshold):
print(f'bad threshold parameter : {param_threshold} must be a number')
sys.exit()
if not re.search(regex, param_min_d):
print(f'bad min_d parameter : {param_min_d} must be a number')
sys.exit()
min_d = float(param_min_d)
if not re.search(regex, param_min_p):
print(f'bad min_p parameter : {param_min_p} must be a number')
sys.exit()
if param_test_improve not in ['yes','no']:
print(f'bad test_improve parameter : {param_test_improve} (yes or no are only supported)')
sys.exit()
path, filename = os.path.split(dataset_location)
if "glove" in dataset_location:
model = KeyedVectors.load_word2vec_format(dataset_location, binary=True)
else:
model = Word2Vec.load(dataset_location)
model = model.wv
norm_model = normalize_model(model)
start_time = time.time()
    vector_words, coverage = greedy_prepare_data(norm_model,
                                                 (param_pos_only == 'yes'),
                                                 min_d, float(param_min_p),
                                                 float(param_threshold), target_word)
word_size = len(vector_words[target_word])
nb_unused_dimensions = word_size - len(coverage)
print(f"unused_dimensions: {nb_unused_dimensions}")
solution, last_coverage_size = algo_greedy(vector_words, target_word, min_d,
max_number_words, (param_test_improve=='yes'), coverage)
end_time = time.time()
print(f"found solution for target {target_word} :", solution)
run_time=end_time - start_time
print(f"Total execution time: {run_time} seconds")
print("solution:", solution)
cs = cosine_similarity(solution.word_vector(vector_words,word_size), vector_words[target_word])
print(f"cosine similarity of solution :{cs}")
print(f"last coverage size:{last_coverage_size}")
with open(f"greedy_{filename}_{target_word}.csv", "a", newline='', encoding='utf-8') as csvfile:
fwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE, escapechar='@')
fwriter.writerow([filename, target_word, last_coverage_size, str(run_time), cs,
param_pos_only, param_min_p, param_min_d,
param_threshold, param_test_improve,solution,nb_unused_dimensions])
csvfile.close()
print(filename, target_word, last_coverage_size
, str(run_time), cs,
param_pos_only, param_min_p,param_min_d,
param_threshold, param_test_improve, solution, nb_unused_dimensions)
if __name__ == '__main__':
main()
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
import time
import sys
import csv
import os
from methods.hillclimbing.utils import WordEmbedding
from methods.hillclimbing.hillclimbing import HillClimbing
from methods.common import normalize_model
def main():
if len(sys.argv) != 5:
print('Usage: python3 run_hillclimbing.py dataset_location pos_only seed eval_word')
sys.exit()
dataset_location = sys.argv[1]
path, dataset_name = os.path.split(dataset_location)
param_pos_only = sys.argv[2]
seed = sys.argv[3]
eval_word = sys.argv[4]
if param_pos_only not in ['yes','no']:
print(f'bad pos_only parameter : {param_pos_only} (yes or no are only supported)')
sys.exit()
else:
pos_only = (param_pos_only == 'yes')
if "glove" in dataset_location:
model = KeyedVectors.load_word2vec_format(dataset_location, binary=True)
else:
model = Word2Vec.load(dataset_location)
model = model.wv
norm_model = normalize_model(model)
print("INFO :: model Trained")
print("run No", seed)
word_embedding = WordEmbedding(norm_model, pos_only, seed)
start_time = time.time()
start_sol = word_embedding.random_solution(6, eval_word)
hc = HillClimbing(word_embedding, start_sol, eval_word, None)
step = True
print("starting hc")
score = 0
    while step:
        step = hc.step()
        if step is not False:
            score = step
            print(f"{str(hc.get_current_solution())} : score {step}")
end_time = time.time()
print("--------------")
print(f"{str(hc.get_current_solution())} : score {score}")
print(f"iterations : {hc.iterations}")
run_time = end_time - start_time
print(f"Total execution time: {run_time} seconds")
with open(f"hillclimbing_{dataset_name}_{param_pos_only}_{eval_word}.csv", "a", newline='', encoding='utf-8') as csvfile:
fwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE, escapechar='@')
fwriter.writerow(
[seed, str(start_sol), eval_word, str(score), hc.iterations, str(run_time), str(hc.get_current_solution())])
csvfile.close()
print(seed, start_sol, eval_word, score, hc.iterations, run_time, hc.get_current_solution())
if __name__ == '__main__':
main()
#!/bin/bash
# #SBATCH directives must precede the first executable line, otherwise Slurm
# ignores them; note that shell variables do not expand inside #SBATCH lines.
#SBATCH --job-name=HC_peaceword
#SBATCH --partition=debug
#SBATCH --nodes=1
DATASET=$1
POS_ONLY=$2
SEED=$3
WORD=$4
RUNPATH=/media/softs_orkad/olivier/peaceword
cd $RUNPATH
source $RUNPATH/venv/bin/activate
python3 run_hillclimbing.py $DATASET $POS_ONLY $SEED $WORD