Need help with visualizing data from a Python Jupyter Notebook
September 28, 2018 9:30 AM
Seeking suggestions on how to fix a visualization in a Python Jupyter notebook that uses TSNE, or ideas on how an amateur might seek help with such a problem.
I'm a quasi-academic in the humanities who can't quite resist using tools for data analysis that I don't quite understand. Right now, I'm trying to make use of some code posted at this blog that uses TSNE to plot some points from a 3-dimensional Word2Vec model in two dimensions. However, the plots this notebook produces for me are messed up (points are spaced equidistantly from one another), as they appear in this image I posted in a blog comment asking for help.
I wonder if anyone might have suggestions as to what might produce this kind of problem? Or, alternately, do you have suggestions as to where I can ask for help about this? I was thinking of trying Codementor (I'd be willing to pay, modestly, for some assistance), but I think that site is more geared toward professional developers.
I'm not quite sure whether this is a simple problem or a complex one, and would appreciate any suggestions on how to approach this problem.
What's your Python environment like? Conda or venv or anything? Different module versions would be my first guess. sklearn just had a major version update, and TSNE has been updated since the gist was posted.
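A quick way to answer the version question is to print what is actually installed in the notebook's environment. This is only a minimal sketch: the package list is my assumption about what the notebook imports (scikit-learn for TSNE, gensim for the Word2Vec model), and importlib.metadata needs Python 3.8 or later. On an older Python, printing sklearn.__version__ after importing it gives the same information.

```python
# Report the installed versions of the packages the notebook likely depends on.
# The package list below is an assumption, not taken from the original gist.
import sys
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["scikit-learn", "gensim", "numpy", "matplotlib"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("python", sys.version.split()[0])
for name, ver in installed_versions(PACKAGES).items():
    print(name, ver if ver is not None else "(not installed)")
```

Pasting that output into a question makes it much easier for someone else to spot a version mismatch.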
Do the intermediate sanity-check prints match, e.g. model.similar_by_word('computer')?
Also TSNE isn't deterministic. But being equally spaced makes me think that the iteration process stopped early.
posted by supercres at 10:07 AM on September 28, 2018
I'm now getting these evenly spaced results even when using this code exactly as I find it on the author's GitHub Gist, and using the rather large pre-trained model (made from GoogleNews items) that the author used (available here).
I was trying to adapt the notebook for other models, but now I find that it doesn't work even as posted with the model the author used, at least in my environment. I'm on a Win10 (64-bit) machine, using Jupyter Notebooks in an Anaconda (Python 3.6 or 3.7, I think) environment.
One thing I notice is that the scale next to the graph also changes dramatically between the author's posted example, where the Y-axis runs from 0 to 0.00025, and the results I produce, where the Y-axis runs from -200 to +150.
It's interesting that TSNE has been updated. I wonder if this might require changes in this code to produce correct results.
posted by washburn at 10:25 AM on September 28, 2018
Also, with respect to the intermediate sanity check: looking at the results of "model.similar_by_word('computer')" in the notebook as I run it, I see that these do remain consistent with the results provided in the notebook as posted to the author's blog post and gist.
posted by washburn at 10:37 AM on September 28, 2018
I think the perplexity is too high. I'm trying to reproduce it with a lower value, but give that a shot if you're already up and running.
posted by supercres at 11:25 AM on September 28, 2018
Yeah, this at least produces something less equally spaced:
tsne = TSNE(n_components=2, random_state=0, n_iter=100000, method='exact', init='pca', perplexity=5)
If you want to try to reproduce her results exactly, I suggest decrementing sklearn versions. My guess is that it's in the 0.17 range.
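To see what perplexity alone does to the layout without loading the big GoogleNews model, here is a rough sketch on made-up data. Everything in it is an assumption for illustration: the random "vectors" stand in for word vectors, and embed and spacing_variation are hypothetical helper names. It uses 36 points so that both perplexity values stay below the number of samples, which recent scikit-learn releases insist on as far as I know. The iteration count is left at its default because that parameter has been renamed across releases.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for word vectors: 3 tight groups of 12 points in 50 dimensions.
centers = rng.normal(scale=5.0, size=(3, 50))
vectors = np.vstack([c + rng.normal(scale=0.5, size=(12, 50)) for c in centers])


def embed(data, perplexity):
    """Project data to 2-D with TSNE at the given perplexity."""
    tsne = TSNE(n_components=2, random_state=0, init="pca", perplexity=perplexity)
    return tsne.fit_transform(data)


def spacing_variation(points):
    """Coefficient of variation of nearest-neighbour distances.

    A value near zero means the points are almost evenly spaced.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    nearest = dists.min(axis=1)
    return float(nearest.std() / nearest.mean())


results = {}
for perplexity in (5, 30):
    layout = embed(vectors, perplexity)
    results[perplexity] = spacing_variation(layout)
    print(f"perplexity={perplexity}: spacing variation {results[perplexity]:.3f}")
```

The exact numbers will depend on your sklearn version, so treat the comparison between the two runs as the interesting part, not the values themselves.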
posted by supercres at 11:27 AM on September 28, 2018
The author of the original blog post seems interested in reproducible and even open research, and maybe communities of practice and all that good stuff -- I would do my level best to document my efforts to reproduce their results (whether or not down-versioning sklearn works), and then write in with a succinct "bug report".
I would also be looking for Known-Input-Known-Result simple-to-complex cases and trying to reproduce them. If you're lucky, NIST publishes them, though this may not be in NIST's business yet.
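One simple known-input case could look like the sketch below: build two groups of points that are obviously far apart, run TSNE, and check that the 2-D layout still keeps each point next to a member of its own group. The data, the perplexity value, and the neighbour_agreement helper are all made up for illustration and are not from the original notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Known input: two tight, far-apart groups of 20 points in 10 dimensions.
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 10))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 10))
data = np.vstack([group_a, group_b])
labels = np.array([0] * 20 + [1] * 20)

layout = TSNE(
    n_components=2, random_state=0, init="pca", perplexity=5
).fit_transform(data)


def neighbour_agreement(points, labels):
    """Fraction of points whose nearest neighbour in the layout shares their label."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    nearest = dists.argmin(axis=1)
    return float((labels[nearest] == labels).mean())


agreement = neighbour_agreement(layout, labels)
print(f"same-group nearest neighbours: {agreement:.2f}")
```

If a case this easy comes out badly in your environment, that points at the installation or the parameters, not at the word vectors.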
posted by clew at 12:46 PM on September 28, 2018
Thanks to everyone who replied, and especially for the insights and solution offered by supercres, which indeed seems to resolve this issue. I've left a link to this discussion as a comment in the author's weblog.
AskMeFi saves the day, yet again!
posted by washburn at 7:26 PM on September 28, 2018
posted by typify at 9:37 AM on September 28, 2018