Need help with visualizing data from a Python Jupyter Notebook
September 28, 2018 9:30 AM

Seeking suggestions on how to fix a visualization in a Python Jupyter notebook that uses TSNE, or ideas on how an amateur might seek help with such a problem.

I'm a quasi-academic in the humanities who can't quite resist using tools for data analysis that I don't quite understand. Right now, I'm trying to make use of some code posted at this blog that uses TSNE to plot some points from a 3-dimensional Word2Vec model in two dimensions. However, the plots this notebook produces for me are messed up (the points are spaced equidistantly from one another), as they appear in this image I posted in a blog comment asking for help.

I wonder if anyone might have suggestions as to what might produce this kind of problem? Or, alternatively, do you have suggestions as to where I can ask for help about this? I was thinking of trying Codementor (I'd be willing to pay, modestly, for some assistance), but I think that site is more geared toward professional developers.

I'm not quite sure whether this is a simple problem or a complex one, and would appreciate any suggestions on how to approach this problem.
posted by washburn to Computers & Internet (8 answers total) 3 users marked this as a favorite
 
Could you post the code (pastebin works) and the model?
posted by typify at 9:37 AM on September 28, 2018


Best answer: What's your Python environment like? Conda or venv or anything? Different module versions would be my first guess. sklearn just had a major version update, and TSNE has been updated since the gist was posted.
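For comparing environments, here is a minimal sketch that lists the installed versions of the relevant modules. The package names in the list are assumptions about what the notebook imports, and importlib.metadata needs Python 3.8 or later; on an older interpreter, printing sklearn.__version__ from inside the notebook gives the same information for that one module.

```python
from importlib import metadata


def installed_versions(names):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; adjust to whatever the notebook actually imports.
print(installed_versions(["scikit-learn", "gensim", "numpy"]))
```

Pasting that output alongside a bug report makes it much easier for someone else to tell whether a version difference is in play.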

Do the intermediate sanity-check prints match, e.g. model.similar_by_word('computer')?

Also TSNE isn't deterministic. But being equally spaced makes me think that the iteration process stopped early.
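One rough way to put a number on "equally spaced", rather than eyeballing the plot, is to look at how much each point's nearest-neighbour distance varies. This is only a sketch of one possible check, not something from the original notebook; the function name and the toy point sets are made up for illustration. A value near 0 means the points sit on something close to a regular lattice, while clustered output gives a noticeably larger value.

```python
import math
import statistics


def nearest_neighbor_cv(points):
    """Coefficient of variation of each point's nearest-neighbour distance."""
    nearest = []
    for i, p in enumerate(points):
        # Distance from p to the closest other point.
        nearest.append(
            min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        )
    return statistics.pstdev(nearest) / statistics.mean(nearest)


# Toy data: a perfectly regular grid versus a few tight clumps.
grid = [(x, y) for x in range(5) for y in range(5)]
clumped = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (9, 0), (9, 3)]

print(nearest_neighbor_cv(grid))     # regular spacing
print(nearest_neighbor_cv(clumped))  # uneven spacing
```

The same function can be applied to the 2-D array that TSNE returns by passing it as a list of (x, y) pairs.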
posted by supercres at 10:07 AM on September 28, 2018 [1 favorite]


Response by poster: I'm now getting these spaced results even when using this code exactly as I find it on the author's GitHub Gist, and using the rather large pre-trained model (made from GoogleNews items) that the author used (available here).

I was trying to adapt the notebook for other models, but now I find that it doesn't work even as posted with the model the author used, at least in my environment. I'm on a Win10 (64-bit) machine, using Jupyter Notebooks in an Anaconda (Python 3.6 or 3.7, I think) environment.

One thing I notice is that the scale next to the graph also changes dramatically between the author's posted example, where Y-axis runs from 0-0.00025, and the results I produce, where the Y-axis runs from -200 to +150.

It's interesting that TSNE has been updated. I wonder if this might require changes in this code to produce correct results.
posted by washburn at 10:25 AM on September 28, 2018


Response by poster: Also, with respect to the intermediate sanity check: looking at the results of "model.similar_by_word('computer')" in the notebook as I run it, I see that these do remain consistent with the results provided in the notebook as posted to the author's blog post and gist.
posted by washburn at 10:37 AM on September 28, 2018


I think the perplexity is too high. I'm trying to reproduce it with a lower value, but give that a shot if you're already up and running.
posted by supercres at 11:25 AM on September 28, 2018 [1 favorite]


Best answer: Yeah, this at least produces something less equally spaced:

tsne = TSNE(n_components=2, random_state=0, n_iter=100000, method='exact', init='pca', perplexity=5)

If you want to try to reproduce her results exactly, I suggest downgrading sklearn. My guess is that the original was run on something in the 0.17 range.
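To see the effect of perplexity without loading the large GoogleNews model, here is a small sketch that runs the same kind of TSNE call on synthetic data. Everything about the data is an assumption made for illustration (three random clusters of 20 points in 50 dimensions standing in for word vectors), and embed is a hypothetical helper name. The n_iter argument is left out because, as far as I know, newer scikit-learn releases renamed it to max_iter, so omitting it keeps the call working across versions.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed(vectors, perplexity):
    """Project vectors to 2-D with t-SNE at the given perplexity."""
    tsne = TSNE(n_components=2, random_state=0, method='exact',
                init='pca', perplexity=perplexity)
    return tsne.fit_transform(vectors)


# Synthetic stand-in for word vectors: 3 clusters of 20 points in 50 dimensions.
rng = np.random.RandomState(0)
centers = rng.normal(scale=10.0, size=(3, 50))
vectors = np.vstack([c + rng.normal(size=(20, 50)) for c in centers])

# Compare a low perplexity with a higher one on the same data.
for perplexity in (5, 30):
    points = embed(vectors, perplexity)
    print(perplexity, points.shape, points.min(axis=0), points.max(axis=0))
```

Perplexity roughly sets how many neighbours each point pays attention to, so it needs to stay well below the number of points being plotted; with only a few dozen words on the chart, a small value such as 5 is a reasonable place to start.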
posted by supercres at 11:27 AM on September 28, 2018 [2 favorites]


The author of the original blog post seems interested in reproducible and even open research, and maybe communities of practice and all that good stuff -- I would do my level best to document my efforts to reproduce their results (whether or not down-versioning sklearn works), and then write in with a succinct "bug report".

I would also be looking for Known-Input-Known-Result simple-to-complex cases and trying to reproduce them. If you're lucky, NIST publishes them, though this may not be in NIST's business yet.
posted by clew at 12:46 PM on September 28, 2018


Response by poster: Thanks to everyone who replied, and especially for the insights and solution offered by supercres, which indeed seems to resolve this issue. I've left a link to this discussion as a comment in the author's weblog.

AskMeFi saves the day, yet again!
posted by washburn at 7:26 PM on September 28, 2018 [1 favorite]

