Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Word Embeddings Class Added #2198

Open
wants to merge 12 commits into
base: master
from

Conversation

@sprintyaf
Copy link

sprintyaf commented Jul 13, 2020

Describe your change:
#2187
I added Word Embeddings Class described in the above-mentioned issue.

  • Add an algorithm?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms have a URL in its comments that points to Wikipedia or other similar explanation.
  • If this pull request resolves one or more open issues then the commit message contains Fixes: #{$ISSUE_NO}.
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: ceec83a0-c500-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 66eb3070-c501-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 187097e0-c502-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: b7c021d0-c502-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: e5e2b610-c505-11ea-af19-3b271fec1f42
Copy link
Member

cclauss left a comment

Please read CONTRIBUTING.md.

@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Hey @sprintyaf,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: c19c4b70-c507-11ea-af19-3b271fec1f42
@cclauss
Copy link
Member

cclauss commented Jul 13, 2020

You might run psf/black on this code because Travis CI says:

$ flake8 --ignore=E203,W503 --max-complexity=25 --max-line-length=88 --statistics --count .

./natural_language_processing/word_embeddings/usage.py:36:43: E261 at least two spaces before inline comment
./natural_language_processing/word_embeddings/word_embeddings.py:28:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:33:89: E501 line too long (125 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:38:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:44:89: E501 line too long (92 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:49:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:54:89: E501 line too long (100 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:59:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:69:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:91:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:98:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:106:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:115:89: E501 line too long (94 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:119:5: E303 too many blank lines (2)
./natural_language_processing/word_embeddings/word_embeddings.py:122:68: W291 trailing whitespace
./natural_language_processing/word_embeddings/word_embeddings.py:136:72: E251 unexpected spaces around keyword / parameter equals
./natural_language_processing/word_embeddings/word_embeddings.py:136:74: E251 unexpected spaces around keyword / parameter equals
./natural_language_processing/word_embeddings/word_embeddings.py:136:89: E501 line too long (93 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:144:89: E501 line too long (118 > 88 characters)
./natural_language_processing/word_embeddings/word_embeddings.py:152:5: E303 too many blank lines (2)
2     E251 unexpected spaces around keyword / parameter equals
1     E261 at least two spaces before inline comment
10    E303 too many blank lines (2)
6     E501 line too long (125 > 88 characters)
1     W291 trailing whitespace
20
@cclauss
Copy link
Member

cclauss commented Jul 13, 2020

Please avoid backslash line termination as discussed in PEP8 because a whitespace character to the right of the backslash can break the script on a change that is invisibile to the user.

@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Travis tests have failed

Hey @sprintyaf,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: c152e230-c50e-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Travis tests have failed

Hey @sprintyaf,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: d8000510-c510-11ea-af19-3b271fec1f42
@TravisBuddy
Copy link

TravisBuddy commented Jul 13, 2020

Travis tests have failed

Hey @sprintyaf,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: fc302bc0-c512-11ea-af19-3b271fec1f42
@sprintyaf
Copy link
Author

sprintyaf commented Jul 13, 2020

Hi @cclauss,
First of all thanks for your support.
This algorithm is not deterministic and depends on the data. That's why I can't predict what answer would it return. I don't think that doctests are possible in this case.

@cclauss
Copy link
Member

cclauss commented Jul 13, 2020

assert 70 <= my_function() <= 100, "my_function() should return a result > 70"

@sprintyaf
Copy link
Author

sprintyaf commented Jul 13, 2020

assert 70 <= my_function() <= 100, "my_function() should return a result > 70"

Agree, but my functions return words. It can return any word. You can't test them like this.

@cclauss
Copy link
Member

cclauss commented Jul 13, 2020

closest_words() can return different results with the same input data??

@cclauss
Copy link
Member

cclauss commented Jul 13, 2020

Please add tests to the deterministic functions and place a comment explaining why a doctest is impossible on the non deterministic functions.

@@ -33,8 +37,7 @@ def analogy(self, x1: str, x2: str, y1: str) -> str:
x1, x2, y1 = x1.lower(), x2.lower(), y1.lower()
error_msg = 'every word must be in the vocabulary'
assert all(self.is_in_vocab(x) for x in (x1, x2, y1)), error_msg
result = self._wv.most_similar(positive=[y1, x2], negative=[x1])[0][0]
return result
return self._wv.most_similar(positive=[y1, x2], negative=[x1])[0][0]

def n_similarity(self, list1: list, list2: list) -> float:

This comment has been minimized.

@ruppysuppy

ruppysuppy Jul 15, 2020

Contributor

Add proper type hints in the functions (see here)

This comment has been minimized.

@cclauss

cclauss Jul 15, 2020

Member

Please add proper type hints in the functions...

@@ -54,8 +56,7 @@ def similarity(self, w1: str, w2: str) -> float:
w1, w2 = w1.lower(), w2.lower()
error_msg = 'both words must be in the vocabulary'
assert self.is_in_vocab(w1) and self.is_in_vocab(w2), error_msg

This comment has been minimized.

@ruppysuppy

ruppysuppy Jul 15, 2020

Contributor

If the assertions are for tests, use the doctest format (see here)

This comment has been minimized.

@cclauss

cclauss Jul 15, 2020

Member

These assertions are for runtime checks... If the user gives me garbage then I should complain. These garbage values should also be tested in the doctests.

@TravisBuddy
Copy link

TravisBuddy commented Jul 16, 2020

Travis tests have failed

Hey @sprintyaf,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: 8bfb7cf0-c77f-11ea-999c-bf5bae978323
@sprintyaf
Copy link
Author

sprintyaf commented Jul 16, 2020

@cclauss @ruppysuppy
I added type hints and doctests. For doctests I needed to upload a pretrained model. But it looks like it's not visible during testing. How could I fix this?

error

parsed = WordVectors._parse_document(text)
documents.append(parsed)
except IOError:
continue

This comment has been minimized.

@cclauss

cclauss Aug 14, 2020

Member

Please put lines 166 to 173 in a separate function and then do
documents = [get_document_from_file(filename) for filename in glob.glob(full_path)]

@@ -17,3 +17,5 @@ sklearn
sympy
tensorflow
xgboost

This comment has been minimized.

@cclauss

cclauss Aug 14, 2020

Member

Please insert requirements in alphabetical order to reduce the likelihood of duplicate entries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.