Commit 964c2400 authored by Caron Olivier
methods for word embedding

parent 987f095e
.gitignore 0 → 100644
venv
__pycache__
.idea
Authors 0 → 100644
:: PEACEWORD ::
Authors:
Olivier Caron
Alexander Bassett
Julie Jacques
Julien Baste
LICENSE 0 → 100644
# PEACEWORD
Prototype for Extracting And Considering the Explainability of WORD embeddings.
This simple Git project contains two classic heuristics (proofs of concept) for assessing
the explainability of word embeddings.
This project is a work of the ORKAD research team of the CRIStAL laboratory, University of Lille.
[:globe_with_meridians: ORKAD team web site](https://orkad.univ-lille.fr)
## Required Elements
* `Python` interpreter (version 3.12 or higher)
* `Git`
## Quick installation
Default installation can be summarized as follows:
```bash
git clone https://gitlab.cristal.univ-lille.fr/orkad-public/peaceword.git
cd peaceword
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
```
## Folder organisation
- **models**: contains the downloaded datasets. Note that the 'text8' dataset
(based on Wikipedia text) is constantly evolving; the resulting cosine similarity may therefore
vary depending on when the dataset was downloaded.
- **methods**: Python package containing the two approaches (hillclimbing and greedy)
- **project root** (**.**): contains the different main programs (described below)
## Programs
This section describes the various Python programs included in this Git project.
### Downloading datasets
There are two programs: the first (_load_model.py_)
allows you to download a model from the gensim library;
the second (_load_glove_model.py_) is specific to glove-XXX datasets.
They are easy to use: launch the Python program with the name of the dataset as argument, and the loaded model is stored in the *models* directory.
Here is an example for the 'text8' dataset:
```bash
python3 load_model.py text8
```
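Once downloaded, models are reloaded by the main programs via gensim. A minimal sketch of that reloading step (the file names are examples; _load_model.py_ saves in gensim's native format, _load_glove_model.py_ in binary word2vec format):
```python
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

# model trained and saved by load_model.py (gensim native format)
text8_vectors = Word2Vec.load("./models/text8").wv

# model saved by load_glove_model.py (binary word2vec format)
glove_vectors = KeyedVectors.load_word2vec_format(
    "./models/glove-wiki-gigaword-100", binary=True)
```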
### The greedy method
The main program **run_greedy.py** requires several parameters:
- `dataset` : the dataset location
- `pos_only` : 'yes' if the search is limited to positive words, 'no' otherwise
- `min_d` : the distance threshold (float) below which two dimension values are considered
close
- `min_p` : the minimum percentage (integer value) of close dimensions for selecting a word
- `threshold` : the minimum absolute value (float) above which a dimension value is considered relevant
- `test_improve` : 'yes' if a newly found word must improve the current solution to be kept, 'no' otherwise
- `target` : the target word
Here is an example:
```bash
python3 run_greedy.py ./models/text8_article yes 0.0279 5 0.2233 yes queen
```
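To make the `min_d`/`min_p` semantics concrete, here is a minimal sketch (with made-up vectors) of the dimension-closeness test implemented by `coverage_similarity` and `test_filter` in the `methods` package:
```python
import numpy as np

min_d, min_p = 0.0279, 5                      # distance threshold, minimum percentage
word = np.array([0.30, -0.10, 0.52, 0.01])    # hypothetical word vector
target = np.array([0.31, -0.40, 0.50, 0.02])  # hypothetical target vector

close = np.abs(word - target) <= min_d        # dimensions considered close
percent = 100 * close.sum() / word.size       # here: 3 of 4 dimensions -> 75.0
print(percent >= min_p)                       # True: the word passes the filter
```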
## Information
### Authors
See [Authors](./Authors)
### License
PEACEWORD is licensed under the following license:
* [GNU General Public License version 3 (GPLv3)](./LICENSE). GPL refers to the GNU General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import sys
import gensim.downloader as api
def main():
"""
    This simple program loads a glove dataset from the gensim library and stores it in the "models" directory
    Example : python3 load_glove_model.py glove-wiki-gigaword-100
"""
if len(sys.argv) < 2:
print('Usage: load_glove_model.py gensimModelName')
sys.exit()
model = api.load(sys.argv[1])
model.save_word2vec_format(f"./models/{sys.argv[1]}", binary=True) # save in binary format
print(f"INFO :: model Trained {sys.argv[1]} saved in ./models")
if __name__ == '__main__':
main()
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from gensim.models.word2vec import Word2Vec
import sys
import gensim.downloader as api
def main():
"""
    This simple program downloads a gensim corpus, trains a Word2Vec model on it
    and stores the trained model in the "models" directory
    Example : python3 load_model.py text8
"""
if len(sys.argv) < 2:
        print('Usage: load_model.py gensimModelName')
sys.exit()
corpus = api.load(sys.argv[1])
model = Word2Vec(corpus)
model.save(f"./models/{sys.argv[1]}")
print(f"INFO :: model Trained {sys.argv[1]} saved in ./models")
if __name__ == '__main__':
main()
\ No newline at end of file
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from numpy import dot
from numpy.linalg import norm
import numpy as np
def normalize_model(model):
"""
normalization of gensim KeyedVectors
Args:
model: the no normalized downloaded model
Returns: the corresponding map where all dimensions of the vector words are normalized
and the elapsed time of the normalization
"""
# Retrieve keys and corresponding vectors
keys = list(model.key_to_index.keys())
vectors = np.stack([model[key] for key in keys])
normalized_model = {}
for i_dim in range(model.vector_size):
vectors.T[i_dim] = vectors.T[i_dim] / max(abs(vectors.T[i_dim]))
nb = 0
for word in keys:
normalized_model[word] = vectors[nb]
nb = nb + 1
return normalized_model
def cosine_similarity(vector_a, vector_b) -> float:
"""
compute the cosine similarity between two vectors
Arguments:
vector_a - the first vector
vector_b - the second vector
"""
a = vector_a
b = vector_b
if len(vector_a) == 0:
return 0.0
norm_a = norm(a)
norm_b = norm(b)
if norm_a == 0.0 or norm_b == 0.0:
return 0.0
cos_sim = dot(a, b) / (norm_a * norm_b)
return cos_sim
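# Worked illustration (assumed values, not part of the module API):
#   a = [1, 0], b = [1, 1]
#   dot(a, b) = 1, norm(a) = 1, norm(b) = sqrt(2)
#   cosine_similarity(a, b) = 1 / sqrt(2) ≈ 0.7071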
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from methods.solution import Solution
from methods.common import cosine_similarity
def coverage_similarity(vector_a, vector_b, min_d) -> int:
"""
compute the number of dimensions that are equivalent (according to given min_d)
Arguments:
vector_a - the first word vector
vector_b - the second word vector
min_d - the minimum distance to consider dimension values as equivalent
"""
counter = 0
for value1, value2 in zip(vector_a, vector_b):
if abs(value1-value2)<=min_d:
counter = counter + 1
return counter
def sub_vector(v, coverage):
"""
returns the sub vector of v according to a coverage
Arguments:
v - the given vector
coverage - a vector that contains all vector indexes to take into account
"""
return [v[i] for i in coverage]
def cosine_closest_match(vocab, vector_words, target_word_vector, coverage, banned_words):
"""
    returns the word from the model base that is the closest match according to cosine_similarity and the given coverage
    Arguments:
    vocab - array containing the words including target_word
    vector_words - dictionary of word vectors
    target_word_vector - the given target word vector
    coverage - a vector that contains all vector indexes to take into account
    banned_words - list of words not to take into account
"""
best = None
best_cosine = None
for name in vocab:
if name not in banned_words:
cosine = cosine_similarity(sub_vector(vector_words[name],coverage), sub_vector(target_word_vector,coverage))
if best is None:
best = name
best_cosine = cosine
else:
if cosine > best_cosine:
best = name
best_cosine = cosine
return best
def coverage_closest_match(vocab, vector_words, target_word_vector, coverage, banned_words, min_d):
"""
    returns the word from the model base that is the closest match according to coverage_similarity and the given coverage
    Arguments:
    vocab - array containing the words including target_word
    vector_words - dictionary of word vectors
    target_word_vector - the given target word vector
    coverage - a vector that contains all vector indexes to take into account
    banned_words - list of words not to take into account
    min_d - the minimum distance to consider dimension values as equivalent
"""
best=None
best_match = None
for name in vocab:
if name not in banned_words:
nb = coverage_similarity(sub_vector(vector_words[name],coverage), sub_vector(target_word_vector,coverage),min_d)
if best is None:
best = name
best_match = nb
else:
if nb > best_match:
best = name
best_match = nb
return best
def test_filter(vector_a, vector_b, vector_size, min_d, min_p):
"""
    this function returns true if at least min_p percent of the dimensions of vector_a and vector_b are equivalent,
    returns false otherwise
    Args:
        vector_a: first given word vector
        vector_b: second given word vector
        vector_size : the size of the two given word vectors
        min_d: the distance threshold used for comparing dimensions
        min_p : the required percentage of equivalent dimensions to produce a true result
    Returns: true if vectors are considered as equivalent
"""
if vector_size == 0:
return False
counter = coverage_similarity(vector_a, vector_b, min_d) # coincidence test of word
return ((counter * 100) / vector_size) >= min_p
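# Example (hypothetical numbers): with vector_size = 100, min_p = 5 and
# min_d = 0.0279, test_filter accepts a word as soon as at least 5 of its 100
# dimensions lie within 0.0279 of the target's corresponding dimensions.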
def greedy_prepare_data(norm_model, pos_only, min_d, min_p, threshold, target_word):
"""
    Step 1 of the greedy algorithm, returns the vector of words and the pertinent coverage
    :param norm_model: the dataset
    :param pos_only: only manage positive words if equal to True
    :param min_d: the minimum distance to compare for each dimension of the word vector
    :param min_p: minimum percentage of dimensions (step 1);
        if equal to zero, consider all words
    :param threshold: restricts the coverage to the dimensions of the target vector with abs(value) > threshold;
        if threshold equals 0, the resulting coverage contains all dimensions
    :param target_word: the name of the target
    :return: the map of word vectors, the pertinent coverage.
"""
print("before step 1")
print("initial number of words:", len(norm_model.keys()))
print("pos_only:",pos_only)
print("min_d:",min_d, "min_p:",min_p)
print("threshold:",threshold)
# step 1 : complete the base with negative words if pos_only is equal to False
print("Step 1: ")
if not pos_only:
print("complete base with negative words")
vector_words = {}
if target_word[0] == "-":
target_word_vector = -1 * norm_model[target_word[1:]]
else:
target_word_vector = norm_model[target_word]
wv_size = len(target_word_vector)
print("size word:",wv_size)
if threshold == 0:
coverage = list(range(wv_size)) # init coverage
else:
coverage = []
for i in range(wv_size):
if abs(target_word_vector[i]) > threshold:
coverage.append(i)
for word in norm_model.keys():
current_wv = norm_model[word]
if (word == target_word or (min_p == 0) or
(test_filter(sub_vector(current_wv, coverage), sub_vector(target_word_vector, coverage),
len(coverage), min_d, min_p))):
vector_words[word] = current_wv
if not pos_only:
inverse_word = f"-{word}"
inverse_wv = -1 * current_wv # works with np.ndarray
# inverse_wv = [x * -1 for x in current_wv]
if (inverse_word == target_word or (min_p == 0) or
(test_filter(sub_vector(inverse_wv, coverage), sub_vector(target_word_vector, coverage),
len(coverage), min_d, min_p))):
vector_words[inverse_word] = inverse_wv
print("after step 1")
print("number of words:", len(vector_words))
print("size of coverage:",len(coverage))
return vector_words, coverage
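# Example (hypothetical numbers): with threshold = 0.2233, a 100-dimensional
# target vector whose absolute values exceed 0.2233 only at indexes 3, 17 and 42
# yields the initial coverage [3, 17, 42]; with threshold = 0 the coverage keeps
# all 100 dimensions.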
def algo_greedy(vector_words, target_word, min_d, max_number_words, test_improve, coverage):
"""
    Greedy algorithm for finding semantically related words
    Arguments:
    vector_words - the map of word vectors produced by greedy_prepare_data
    target_word - the given target word
    min_d - the initial minimum distance to compare for each dimension of the word vector
    max_number_words - max size of the result vector
    test_improve - if True, the found word (step 2) must improve the solution to be taken into account
    coverage - a vector of indexes of dimensions to take into account
    returns the resulting solution and the coverage size at the end of the process
"""
solution = Solution([], [])
target_word_vector = vector_words[target_word]
wv_size = len(target_word_vector)
vocab = list(vector_words.keys())
iteration = 1
    delta_min_d = (min_d / max_number_words) / 2  # min_d shrinks by this amount after each iteration
best_cs = -1 # init
banned_words = [ target_word ]
while iteration <= max_number_words and len(coverage)>0:
# Step 2 : find the closest word
print(f"Step 2 (find the closest word by cosine similarity), iteration number :", iteration)
print("coverage size:",len(coverage))
new_word = cosine_closest_match(vocab, vector_words, target_word_vector, coverage,banned_words)
if new_word is not None:
improve = True
banned_words.append(new_word)
if test_improve: # test if new_word improves the solution
temp_sol = solution.word_vector(vector_words,wv_size) + vector_words[new_word]
current_cs = cosine_similarity(temp_sol, target_word_vector)
if current_cs < best_cs:
improve = False
else:
best_cs = current_cs
if improve:
solution.add(new_word)
print("After Step 2 : found word:",new_word)
# Step 3 :
print("Step 3 - update coverage")
coverage = update_coverage(vector_words[new_word], target_word_vector, min_d, coverage)
print("After Step 3 : coverage size : ", len(coverage))
else:
print("After Step 2 : the found word:", new_word, "does not improve the solution")
iteration = iteration + 1
min_d = min_d - delta_min_d
return solution, len(coverage)
def update_coverage(word_vector, target_word_vector, min_d, coverage):
"""
    compute a new coverage by removing the dimensions that are equivalent (according to min_d)
    between a word vector and the target word vector
    Arguments:
    word_vector - the word vector to compare with the target one
    target_word_vector - the target word vector
    min_d - the minimum delta for each dimension
    coverage - the current coverage
"""
for i,(v1,v2) in enumerate(zip(word_vector,target_word_vector)):
if i in coverage and abs(v1-v2)<=min_d:
coverage.remove(i)
return coverage
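# Example (hypothetical numbers): with coverage = [0, 2, 5], if dimension 2 of
# the found word lies within min_d of the target's dimension 2, the updated
# coverage becomes [0, 5]; later iterations thus focus on dimensions that the
# current solution does not yet explain.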
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import copy
class HillClimbing:
"""
    class that implements the hillclimbing local search algorithm
"""
def __init__(self, word_embedding, solution, target_word, banned_words=None) -> None:
"""
class constructor, initialization of instance variables
:param word_embedding: the class that contains the dataset
:param solution: an initial solution
:param target_word: the target word
:param banned_words: list of words to be banned
"""
if banned_words is None:
banned_words = []
self.model = word_embedding
self.initial_sol = solution
self.current_sol = solution
self.target_word = target_word
self.vocab = copy.deepcopy(self.model.get_vocab())
        for word in banned_words:
            self.vocab.remove(word)
self.iterations = 0
self.evals = 0
self.eval_method = self.model.evaluate
def step(self):
"""
perform one step of the algorithm
        :return: the score of a better solution, or False if no better solution is found
"""
score = self.evaluate(self.current_sol, self.target_word)
for neighbor in self.model.neighbor_solutions(self.current_sol, self.vocab, self.target_word):
next_score = self.evaluate(neighbor, self.target_word)
            if next_score - score > 0.05:  # accept only clear improvements (margin of 0.05)
score = next_score
self.set_current_solution(neighbor)
return score
return False
def evaluate(self, sol, predict_word):
"""
computes an evaluation of a given solution
:param sol: the current solution
:param predict_word: the target word
:return:
"""
self.evals += 1
return self.eval_method(sol, predict_word)
def set_current_solution(self, new_sol):
"""
set the current solution
:param new_sol: the selected solution
:return:
"""
self.current_sol = new_sol
self.iterations +=1
def get_current_solution(self):
"""
provides the current solution
:return:
"""
return self.current_sol
def get_iterations(self):
"""
provides the number of iterations
"""
return self.iterations
def __str__(self) -> str:
return f"HillClimbing (model={self.model}, init={self.initial_sol})"
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import copy
import random
from methods.solution import Solution
from methods.common import cosine_similarity
class WordEmbedding:
"""
class that contains the dataset, provides getter methods and a method for calculating a new neighborhood
"""
def __init__(self, vectors, pos_only, seed) -> None:
"""
class constructor, initialize the instance variables
:param vectors: KeyedVectors object from gensim library
:param pos_only: boolean value, restricted to positive words if set to True
:param seed: the random seed
"""
self.wv = vectors
self.pos_only = pos_only
self.list_keys = list(self.wv.keys())
random.seed(seed)
def get_vocab_size(self):
""" provides the dataset size """
return len(self.list_keys)
def get_vocab(self):
""" provides the list of words managed by the dataset"""
return self.list_keys
def get_random_word(self):
""" provides a random word from the data set """
return random.choice(self.list_keys)
def random_solution(self, nb_words, eval_word):
""" provides a random solution containing nb_words
Arguments:
nb_words : the solution size to generate
eval_word : the target word, this word must not be in the solution
"""
positives = []
negatives = []
couple = (positives, negatives)
for i in range(nb_words):
new_word = self.get_random_word()
while new_word in (positives+negatives) or new_word == eval_word:
new_word = self.get_random_word()
if self.pos_only:
positives.append(new_word)
else:
random.choice(couple).append(new_word)
return Solution(positives, negatives)
def evaluate(self, solution, predict_word):
"""
provides the cosine_similarity of the solution and the target word
:param solution: a solution
:param predict_word: the target word
:return: the cosine similarity
"""
result = 0
if len(solution.positive) == 0 and len(solution.negative) == 0:
print("WARNING : EMPTY SOLUTION PASSED")
return 0
for word in solution.positive:
result = result + self.wv[word]
for word in solution.negative:
result = result - self.wv[word]
return round(cosine_similarity(result, self.wv[predict_word]), 6)
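    # Example (illustrative words): for Solution(positive=["king", "woman"],
    # negative=["man"]), evaluate builds wv["king"] + wv["woman"] - wv["man"] and
    # returns its cosine similarity with wv[predict_word] (e.g. "queen").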
def neighbor_solutions(self, solution, vocab, eval_word):
"""
calculates a list of neighboring solutions from a given solution;
        neighboring solutions differ from the initial solution by a single word
:param solution: the current solution
:param vocab: vocabulary
:param eval_word: the target word
"""
neighbors = []
#part 1 : removing a word
for word in solution.positive:
s_copy = copy.deepcopy(solution)
s_copy.positive.remove(word)
neighbors.append(s_copy)
if not self.pos_only:
for word in solution.negative:
s_copy = copy.deepcopy(solution)
s_copy.negative.remove(word)
neighbors.append(s_copy)
#part 2 : changing a word
for word in solution.positive:
s_copy = copy.deepcopy(solution)
s_copy.positive.remove(word)
for v in vocab:
if v not in (solution.positive+solution.negative) and v != eval_word and v != word:
replaced = copy.deepcopy(s_copy)
replaced.positive.append(v)
neighbors.append(replaced)
if not self.pos_only:
for word in solution.negative:
s_copy = copy.deepcopy(solution)
s_copy.negative.remove(word)
for v in vocab:
if v not in (solution.positive + solution.negative) and v != eval_word and v != word:
replaced = copy.deepcopy(s_copy)
replaced.negative.append(v)
neighbors.append(replaced)
# part 3 : add a word
for word in vocab:
if word not in (solution.positive + solution.negative) and word != eval_word:
s_copy = copy.deepcopy(solution)
s_copy.positive.append(word)
neighbors.append(s_copy)
if not self.pos_only and word not in (solution.positive + solution.negative) and word != eval_word:
s_copy = copy.deepcopy(solution)
s_copy.negative.append(word)
neighbors.append(s_copy)
        # remove empty solutions; a list comprehension avoids skipping items,
        # which happens when removing from a list while iterating over it
        neighbors = [n for n in neighbors if not n.is_empty()]
        return neighbors
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import numpy as np
class Solution:
def __init__(self, positives=None, negatives=None) -> None:
""" initialisation of a solution with positive and negative set of words"""
if positives is None:
self.positive = []
else:
self.positive = positives
if negatives is None:
self.negative = []
else:
self.negative = negatives
def is_empty(self):
"""
:return: true if the solution contains no word, false otherwise
"""
return len(self.positive) == 0 and len(self.negative) == 0
def word_vector(self, map_word_vector, vector_size):
""" computes the corresponding word vector of the solution
Args:
map_word_vector : the map corresponding to the dataset
vector_size: the vector size
Returns:
result: the corresponding word vector of the solution
"""
        result = np.zeros(vector_size)
        for word in self.positive:
            result = result + map_word_vector[word]
        for word in self.negative:
            result = result + map_word_vector["-" + word]
        return result
def add(self,word):
""" add a new word in the solution
Args:
word: the new word, if word starts with the character '-', it is a negative word
"""
if word[0] == '-':
self.negative.append(word[1:])
else:
self.positive.append(word)
def __str__(self) -> str:
return f"Solution (positive={self.positive}, negative={self.negative})"
\ No newline at end of file
This folder will contain the downloaded datasets.
\ No newline at end of file
numpy==1.26.4
gensim==4.3.3
pandas==2.2.3
#!/bin/bash
words="queen berlin brother euro athens"
dataset=$1
pos_only=$2
for word in $words
do
for ((seed=1; seed<= 50; seed++))
do
sbatch we_slurm.sl $dataset $pos_only $seed $word
done
done
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
import sys
import csv
import re
import os
import time
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
from methods.greedy.greedy import algo_greedy, greedy_prepare_data
from methods.common import cosine_similarity, normalize_model
def main():
if len(sys.argv) != 8:
        print('Usage: python3 run_greedy.py dataset_location pos_only (yes or no) '
              'min_d (real) min_p (0-100) '
              'threshold (0 or limit value) test_improve (yes or no) eval_word')
sys.exit()
# parameters :
max_number_words = 6
dataset_location = sys.argv[1]
param_pos_only = sys.argv[2]
param_min_d = sys.argv[3]
param_min_p = sys.argv[4]
param_threshold = sys.argv[5]
param_test_improve = sys.argv[6]
target_word=sys.argv[7]
    regex = r'^[+-]?[0-9]+(\.[0-9]+)?$'  # anchored so the whole argument must be a number
if param_pos_only not in ['yes','no']:
print(f'bad pos_only parameter : {param_pos_only} (yes or no are only supported)')
sys.exit()
if not re.search(regex, param_threshold):
print(f'bad threshold parameter : {param_threshold} must be a number')
sys.exit()
if not re.search(regex, param_min_d):
print(f'bad min_d parameter : {param_min_d} must be a number')
sys.exit()
min_d = float(param_min_d)
if not re.search(regex, param_min_p):
print(f'bad min_p parameter : {param_min_p} must be a number')
sys.exit()
if param_test_improve not in ['yes','no']:
print(f'bad test_improve parameter : {param_test_improve} (yes or no are only supported)')
sys.exit()
path, filename = os.path.split(dataset_location)
if "glove" in dataset_location:
model = KeyedVectors.load_word2vec_format(dataset_location, binary=True)
else:
model = Word2Vec.load(dataset_location)
model = model.wv
norm_model = normalize_model(model)
start_time = time.time()
    vector_words, coverage = greedy_prepare_data(norm_model,
                                                 (param_pos_only == 'yes'),
                                                 min_d, float(param_min_p),
                                                 float(param_threshold), target_word)
word_size = len(vector_words[target_word])
nb_unused_dimensions = word_size - len(coverage)
print(f"unused_dimensions: {nb_unused_dimensions}")
solution, last_coverage_size = algo_greedy(vector_words, target_word, min_d,
max_number_words, (param_test_improve=='yes'), coverage)
end_time = time.time()
print(f"found solution for target {target_word} :", solution)
run_time=end_time - start_time
print(f"Total execution time: {run_time} seconds")
print("solution:", solution)
cs = cosine_similarity(solution.word_vector(vector_words,word_size), vector_words[target_word])
print(f"cosine similarity of solution :{cs}")
print(f"last coverage size:{last_coverage_size}")
with open(f"greedy_{filename}_{target_word}.csv", "a", newline='', encoding='utf-8') as csvfile:
fwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE, escapechar='@')
fwriter.writerow([filename, target_word, last_coverage_size, str(run_time), cs,
param_pos_only, param_min_p, param_min_d,
param_threshold, param_test_improve,solution,nb_unused_dimensions])
csvfile.close()
print(filename, target_word, last_coverage_size
, str(run_time), cs,
param_pos_only, param_min_p,param_min_d,
param_threshold, param_test_improve, solution, nb_unused_dimensions)
if __name__ == '__main__':
main()
\ No newline at end of file
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
PEACEWORD, Prototype for Extracting And Considering
the Explainability of WORD embeddings.
(c) 2025 University of Lille, CNRS
copyright: Peaceword developers (see Authors file),
GPL v3 License (see LICENSE file)
"""
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
import time
import sys
import csv
import os
from methods.hillclimbing.utils import WordEmbedding
from methods.hillclimbing.hillclimbing import HillClimbing
from methods.common import normalize_model
def main():
if len(sys.argv) != 5:
print('Usage: python3 run_hillclimbing.py dataset_location pos_only seed eval_word')
sys.exit()
dataset_location = sys.argv[1]
path, dataset_name = os.path.split(dataset_location)
param_pos_only = sys.argv[2]
seed = sys.argv[3]
eval_word = sys.argv[4]
if param_pos_only not in ['yes','no']:
print(f'bad pos_only parameter : {param_pos_only} (yes or no are only supported)')
sys.exit()
else:
pos_only = (param_pos_only == 'yes')
if "glove" in dataset_location:
model = KeyedVectors.load_word2vec_format(dataset_location, binary=True)
else:
model = Word2Vec.load(dataset_location)
model = model.wv
norm_model = normalize_model(model)
print("INFO :: model Trained")
print("run No", seed)
word_embedding = WordEmbedding(norm_model, pos_only, seed)
start_time = time.time()
start_sol = word_embedding.random_solution(6, eval_word)
hc = HillClimbing(word_embedding, start_sol, eval_word, None)
step = True
print("starting hc")
score = 0
    while step:
        step = hc.step()
        if step is not False:
            score = step
            print(f"{str(hc.get_current_solution())} : score {step}")
end_time = time.time()
print("--------------")
print(f"{str(hc.get_current_solution())} : score {score}")
print(f"iterations : {hc.iterations}")
run_time = end_time - start_time
print(f"Total execution time: {run_time} seconds")
with open(f"hillclimbing_{dataset_name}_{param_pos_only}_{eval_word}.csv", "a", newline='', encoding='utf-8') as csvfile:
fwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE, escapechar='@')
fwriter.writerow(
[seed, str(start_sol), eval_word, str(score), hc.iterations, str(run_time), str(hc.get_current_solution())])
csvfile.close()
print(seed, start_sol, eval_word, score, hc.iterations, run_time, hc.get_current_solution())
if __name__ == '__main__':
main()
#!/bin/bash
# #SBATCH directives must precede the first executable line, otherwise Slurm
# ignores them; note that shell variables do not expand inside #SBATCH lines.
#SBATCH --job-name=HC_peaceword
#SBATCH --partition=debug
#SBATCH --nodes=1
DATASET=$1
POS_ONLY=$2
SEED=$3
WORD=$4
RUNPATH=/media/softs_orkad/olivier/peaceword
cd $RUNPATH
source $RUNPATH/venv/bin/activate
python3 run_hillclimbing.py $DATASET $POS_ONLY $SEED $WORD