Stream: jupyter

Topic: jupyterhub trouble?


view this post on Zulip Deepak Cherian (Mar 30 2021 at 19:12):

is anyone else having trouble spinning up a server on casper using jupyterhub.ucar.edu? I don't see any requests in squeue...

view this post on Zulip Max Grover (Mar 30 2021 at 19:14):

I was able to get in a few minutes ago... but I only requested a single node

view this post on Zulip Anderson Banihirwe (Mar 30 2021 at 19:16):

@Deepak Cherian, I am able to get in as well. Can you confirm that your gladequota is okay?

view this post on Zulip Isla Simpson (Mar 30 2021 at 19:16):

I had trouble yesterday. Eventually it worked in the afternoon and I am not having trouble today.

view this post on Zulip Deepak Cherian (Mar 30 2021 at 19:17):

yup gladequota looks great. is there a log somewhere? The website says "Spawning server..."

view this post on Zulip Deepak Cherian (Mar 30 2021 at 19:18):

now it's failed with a timeout

view this post on Zulip Anderson Banihirwe (Mar 30 2021 at 19:19):

> looks great. is there a log somewhere?

The logs reside in /glade/scratch/$USER/.jupyter_logs

view this post on Zulip Anderson Banihirwe (Mar 30 2021 at 19:20):

However, I don't think they show up until the job is up and running
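
For anyone hunting for those logs, here is a minimal Python sketch that lists the most recently modified files in the log directory mentioned above. The helper name `newest_logs` is made up for illustration. As noted, the directory may stay empty (or not exist) until the job is running, in which case the function returns an empty list.

```python
import os
from pathlib import Path


def newest_logs(log_dir, limit=5):
    """Return up to `limit` regular files in `log_dir`, newest first."""
    directory = Path(log_dir)
    if not directory.is_dir():
        # Nothing has been written yet, or the path is wrong.
        return []
    files = [p for p in directory.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    user = os.environ.get("USER", "")
    for path in newest_logs(f"/glade/scratch/{user}/.jupyter_logs"):
        print(path)
```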

view this post on Zulip Anderson Banihirwe (Mar 30 2021 at 19:26):

> now it's failed with a timeout

Are you using the default settings?

view this post on Zulip Deepak Cherian (Mar 30 2021 at 19:27):

ya just tried that.

view this post on Zulip Matt Long (Mar 30 2021 at 21:01):

I wonder if the queue is simply choked up? Many of the Casper nodes are now on PBS—I presume this means they are unavailable via Slurm.

view this post on Zulip Max Grover (Mar 30 2021 at 21:08):

Do all the requests use Slurm by default?

view this post on Zulip Matt Long (Mar 30 2021 at 21:16):

CISL is in the process of transitioning Casper from all Slurm to all PBS. Perhaps @Brian Vanderwende or @mickc have some insight here.

view this post on Zulip Brian Vanderwende (Mar 30 2021 at 21:25):

Approximately half of the nodes have been moved from Slurm to PBS, so indeed you will see slower dispatch times than normal during busy hours until things are totally migrated to PBS. I expect that this will get better as the week goes along as folks migrate their traditional jobs.

view this post on Zulip Brian Vanderwende (Mar 30 2021 at 21:26):

The production JupyterHub still uses Slurm (I forgot to mention that explicitly!)

view this post on Zulip Matt Long (Apr 07 2021 at 14:37):

@all, in case you missed it, jupyterhub.ucar.edu now redirects to jupyterhub.hpc.ucar.edu; the Hub has been updated and there is a new interface for spawning a server (or multiple servers via the control panel).

view this post on Zulip Cecile Hannay (Apr 12 2021 at 17:04):

I am having issues creating a Jupyter Notebook on Cheyenne.

When I try to create a new notebook, I am getting the error:
"Unexpected error while saving file: Untitled.ipynb attempt to write a readonly database"

I am attaching screenshots of the error message I am getting.
Screen-Shot-2021-04-12-at-10.18.07-AM.png
Screen-Shot-2021-04-12-at-10.18.13-AM.png .

I am getting this error with the JupyterHub and with interactive sessions on cheyenne. I have no issues on my laptop or on the cgd machines. This only happens on the CISL machines. I must have a permission problem somewhere but I haven't been able to figure it out. I looked on stackoverflow, but I haven't found a way to solve this problem.

view this post on Zulip Kevin Paul (Apr 12 2021 at 17:15):

There are a lot of things that can look like a permission error, too. Such as not having storage space. But I'm not sure what the problem is.

view this post on Zulip Matt Long (Apr 12 2021 at 18:04):

@Cecile Hannay, you might want to check your quota: Kevin is right, it could simply be no disk space available.

view this post on Zulip Cecile Hannay (Apr 12 2021 at 19:27):

@Kevin Paul and @Matt Long: It was a good thought but I haven't reached my quota and it is not a space issue.

view this post on Zulip Max Grover (Apr 12 2021 at 19:35):

Are you using the new JupyterHub? (jupyterhub.hpc.ucar.edu) And if so, are you on the Cheyenne Login option?

view this post on Zulip Kevin Paul (Apr 12 2021 at 19:38):

I think we need clarification. The first error message you get is the Unexpected error while saving... error, right? And then you get the second error after you click "Dismiss" on the first error message dialog box?

view this post on Zulip Kevin Paul (Apr 12 2021 at 19:39):

@Max Grover: @Cecile Hannay is seeing this on both the JHub and via self-launched (i.e., SSH tunnels) Jupyter sessions.

view this post on Zulip Cecile Hannay (Apr 12 2021 at 19:47):

@Kevin Paul: This is correct. I get the first error when I try to create a notebook and the second error after clicking "dismiss" on the dialog box.

view this post on Zulip Max Grover (Apr 12 2021 at 19:50):

From the command line, can you try running jupyter notebook --NotebookNotary.db_file=':memory:'? There is a Google Groups thread on this (https://groups.google.com/a/continuum.io/g/anaconda/c/dGcZoFIci1k) along with a GitHub issue thread (https://github.com/jupyter/notebook/issues/5321), but I am not sure if you have tried this. The thread says it could be that you do not have write permissions to your home directory for some reason.
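
One way to test the write-permission theory directly is sketched below in Python. It tries to create and write a small temporary file in a directory and reports whether that worked. The helper name `can_write` is hypothetical, and `~/.local/share/jupyter` is only the assumed default data directory; running `jupyter --data-dir` prints the one actually in use.

```python
import tempfile
from pathlib import Path


def can_write(directory):
    """Return True if a small temporary file can be written in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as fh:
            fh.write(b"x")
            fh.flush()  # surface quota or read-only errors here
        return True
    except OSError:
        # Covers a missing directory, bad permissions, and a full quota.
        return False


if __name__ == "__main__":
    home = Path.home()
    # Assumed default location of Jupyter's data directory.
    for candidate in (home, home / ".local" / "share" / "jupyter"):
        status = "writable" if can_write(candidate) else "NOT writable"
        print(candidate, status)
```

A "writable" result does not rule out every cause of the error; it only shows that plain file creation works in that directory.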

view this post on Zulip Cecile Hannay (Apr 12 2021 at 21:41):

@Max Grover I am not sure at which stage I should try that command.
I either use the JupyterHub (jupyterhub.hpc.ucar.edu) or start-jupyter.

view this post on Zulip Max Grover (Apr 12 2021 at 22:49):

Can you open up a terminal once you are on the Jupyterhub? Then type it in there?

view this post on Zulip Cecile Hannay (Apr 12 2021 at 23:23):

@Max Grover

Thanks for your reply. I tried that but I am not sure it is doing anything. I still get the same error.

I open a terminal and type the command:
Screen-Shot-2021-04-12-at-5.19.23-PM.png

Here is what happens on the screen but I cannot click on these links.
Screen-Shot-2021-04-12-at-5.19.39-PM.png (https://zulip2.cloud.ucar.edu/user_uploads/2/ef/bamQGFCIII15CuIfm7Su_C3J/Screen-Shot-2021-04-12-at-5.19.39-PM.png)

When I try to open a Jupyter Notebook, I get the same error.
Screen-Shot-2021-04-12-at-5.20.12-PM.png (https://zulip2.cloud.ucar.edu/user_uploads/2/6d/8XGKw78NoEQc-tZUdD7hhyy3/Screen-Shot-2021-04-12-at-5.20.12-PM.png)

view this post on Zulip Matt Long (Apr 13 2021 at 15:02):

@Cecile Hannay, I think you should send this info to CISL help. I cannot reproduce your problem on my end.

You can look in
/glade/scratch/${USER}/.jupyter_logs

Perhaps there is something useful there? (I tried to look for you, but don't have permission.)

cc @Brian Vanderwende

view this post on Zulip Cecile Hannay (Apr 13 2021 at 17:19):

I have worked with @Max Grover.
From the command: jupyter notebook --NotebookNotary.db_file=':memory:'
it looks like I don't have write permissions to my home directory for jupyter Notebook.

I contacted cislhelp yesterday but I will update with this new piece of information.

view this post on Zulip Cecile Hannay (Apr 13 2021 at 17:31):

We also looked at:
/glade/scratch/${USER}/.jupyter_logs
None of my recent attempts created any jupyter_logs.

view this post on Zulip Matt Long (Apr 13 2021 at 17:33):

weird. I don't have any idea what could be going wrong. Is the behavior consistent on Casper PBS Batch, Login, etc.?

view this post on Zulip Cecile Hannay (Apr 13 2021 at 17:34):

Here is the same behavior on casper.

I will try to clean up to get my quota under 90% as Brian suggested.

view this post on Zulip Matt Long (Apr 13 2021 at 17:36):

my home directory is at 94.52%

view this post on Zulip Matt Long (Apr 13 2021 at 17:36):

so I wouldn't guess that that's the trouble...but it's beyond me what might be going wrong.

view this post on Zulip Cecile Hannay (Apr 13 2021 at 18:13):

@Brian Vanderwende
I brought my quota to:
/glade/u/home/hannay 43.09 GB 50.00 GB 86.18 % 147355
I am still getting the same error.

view this post on Zulip Brian Vanderwende (Apr 13 2021 at 18:58):

@Cecile Hannay Thanks for letting me know. Which instance did you use in your most recent attempts? A JupyterHub session or a tunnel? If the Hub, which system and was it batch or login? This information will help me narrow down which logs to have the admins look at.

view this post on Zulip Cecile Hannay (Apr 13 2021 at 18:59):

@Brian Vanderwende In my last attempt after reducing the quota, I tried on the JupyterHub on cheyenne.

view this post on Zulip Brian Vanderwende (Apr 13 2021 at 19:47):

Thanks Cecile. We are taking a look.

view this post on Zulip John Clyne (Apr 13 2021 at 19:49):

@Brian Vanderwende on a related topic, the new JupyterHub interface http://jupyterhub.hpc.ucar.edu/ is really slick. A couple of documentation items that might make it much more accessible:

  1. Provide guidance on which resource to select. The tradeoffs between casper-batch vs cheyenne-batch vs login aren't readily obvious.
  2. More importantly, having comprehensive documentation on how to configure a Dask environment for single-node-multiple-threads and distributed-node use would be really helpful. Most of that information seems to be scattered around various zulip postings and in one of @Anderson Banihirwe 's GitHub repos (https://github.com/NCAR/ncar-jobqueue). It would be great to have documentation and examples that are maintained by CISL. Perhaps they already exist and I simply couldn't find them! Just my 2 cents worth :-)

view this post on Zulip Cecile Hannay (Apr 13 2021 at 21:29):

The problem has been solved with cisl. I am posting the fix here in case someone runs into the same issue.

The problem was that the auto-created file:
~/.local/share/jupyter/nbsignatures.db
got corrupted.

Because this database couldn't be accessed properly, I couldn't access/create notebooks. Erasing the file solved the problem.

Last thing: before deleting the file, you need to make sure you don't have any Jupyter sessions that have that file locked / opened.
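
For anyone who hits the same error, here is a hedged Python sketch of that fix. It assumes `nbsignatures.db` is an SQLite file, which is consistent with the "attempt to write a readonly database" message (an SQLite error string). The script runs SQLite's own `PRAGMA integrity_check` against the file, and `set_aside` renames the file with a timestamp suffix rather than erasing it outright. The function names and the rename-instead-of-delete choice are illustrative, not what CISL did.

```python
import sqlite3
import time
from pathlib import Path


def db_is_healthy(db_path):
    """Run SQLite's built-in integrity check; return False on any failure."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"  # open read-only
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for unreadable files and for files that are not valid SQLite.
        return False
    return row is not None and row[0] == "ok"


def set_aside(db_path):
    """Rename the file with a timestamp suffix instead of deleting it."""
    source = Path(db_path)
    backup = source.with_name(f"{source.name}.corrupt-{int(time.time())}")
    source.rename(backup)
    return backup


if __name__ == "__main__":
    db = Path.home() / ".local" / "share" / "jupyter" / "nbsignatures.db"
    if not db.exists():
        print("no signature database at", db)
    elif db_is_healthy(db):
        print("integrity check passed:", db)
    else:
        # Report only; call set_aside(db) yourself once all sessions are closed.
        print("integrity check FAILED:", db)
```

The script only reports. Call `set_aside(db)` by hand after shutting down every Jupyter session that might hold the file open. A passing check does not guarantee the file is fine, since a permissions problem can produce the same error on a healthy database.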

view this post on Zulip Matt Long (Apr 13 2021 at 22:44):

Great to hear!

view this post on Zulip Kevin Paul (Apr 13 2021 at 22:45):

Thanks, @Cecile Hannay!

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 16:17):

Hello, I'm having trouble accessing jupyterhub this morning. I can type my username and password, but then it brings me to a page that says "This page isn't working" . See screenshot. Would anyone be able to help me?
Screen-Shot-2021-04-15-at-10.16.23-AM.png

view this post on Zulip Cecile Hannay (Apr 15 2021 at 16:18):

I had the same error two days ago. It was temporary and it worked after trying again later.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 16:19):

Oh ok, thanks Cecile! I'll try again in a little while

view this post on Zulip Matt Long (Apr 15 2021 at 16:49):

@Jared Baker is aware of these issues.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 16:50):

Ok, thanks... it's still not working.

view this post on Zulip Jared Baker (Apr 15 2021 at 17:02):

What about now?

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 17:04):

still not working..

view this post on Zulip Jared Baker (Apr 15 2021 at 17:14):

Neato. Okay, well I'm going to go see if I can prune your entries in the state database since the API is not doing what it says it is.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 17:21):

ok thanks! Just let me know when I should try again...

view this post on Zulip Jared Baker (Apr 15 2021 at 18:16):

@Kristen Krumhardt I imagine the web page will ask you to log in again if you refresh.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:18):

no it just keeps saying "This page isn't working"

view this post on Zulip Anderson Banihirwe (Apr 15 2021 at 18:21):

> no it just keeps saying "This page isn't working"

Do you get the same error message when accessing the page from a private browser window or a different browser ?

view this post on Zulip Matt Long (Apr 15 2021 at 18:22):

I was just able to spawn a server...

view this post on Zulip Jared Baker (Apr 15 2021 at 18:23):

Made another change Kristen. what about now?

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:24):

yes, now it's asking me to sign in again

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:25):

oh now it looks like it might work! says 'my server is starting up'

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:25):

yes! it's working! thank you!

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:26):

@Anderson Banihirwe I tried with different browsers before and it just brought me to a blank white page

view this post on Zulip Jared Baker (Apr 15 2021 at 18:26):

Good deal, what a doozy.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 18:26):

but now problem solved:))

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 19:26):

well, I just got kicked off jupyterhub and it's been stuck on this page for a couple min. Is anyone else having this issue? Screen-Shot-2021-04-15-at-1.23.36-PM.png

view this post on Zulip Max Grover (Apr 15 2021 at 19:30):

I tried to sign in today and was unable to do so

view this post on Zulip Frank Bryan (Apr 15 2021 at 19:31):

Yes. I was trying to figure this out just now.

When I first go to jupyterhub.hpc.ucar.edu I get an unfamiliar login page

Screen-Shot-2021-04-15-at-1.25.45-PM.png

Then I get the "Your server is stopping " page as above.

I have tried rebooting, clearing browser cache etc.

I am able to connect from a shell with jupyter lab, but my Dask Dashboard hangs with the same infinite waiting page.

view this post on Zulip Matt Long (Apr 15 2021 at 19:48):

The hub remains unstable for me as well.

view this post on Zulip Jared Baker (Apr 15 2021 at 20:01):

There are some runaway things now. trying to keep it alive.

view this post on Zulip Sheri Mickelson (Apr 15 2021 at 20:04):

Yeah, it looks like there's issues on casper-login1. The load is very high on it and my ssh login is hanging on it.

view this post on Zulip Jared Baker (Apr 15 2021 at 21:29):

I was able to get the node wrangled back and hopefully didn't interrupt too much running through the hub. Apologies, but what a perfect storm.

view this post on Zulip Matt Long (Apr 15 2021 at 21:45):

thanks @Jared Baker!

view this post on Zulip Jared Baker (Apr 15 2021 at 21:46):

Let me know if it's not working for you I suppose.

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 21:47):

I'm still getting that "Your server is stopping" page...

view this post on Zulip Jared Baker (Apr 15 2021 at 21:52):

were you running a casper-batch job?

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 21:59):

yes

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 22:00):

but then I got kicked off.. and then I tried to restart the server with another casper-batch job but it just keeps landing on this page

view this post on Zulip Jared Baker (Apr 15 2021 at 22:04):

Okay, I have a theory on what happened here then. I wonder if your server was in the process of spawning when the hub became overwhelmed, so the hub never got the update. Do you mind if I try something interesting?

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 22:10):

I don't mind! try anything:)

view this post on Zulip Frank Bryan (Apr 15 2021 at 22:11):

I am also seeing the same thing I did earlier today ("Your server is stopping")

view this post on Zulip Kristen Krumhardt (Apr 15 2021 at 22:14):

glad I'm not the only one!

view this post on Zulip Abby Jaye (Apr 15 2021 at 22:15):

I have the same issue! I've been having issues since around noon

view this post on Zulip Frank Bryan (Apr 15 2021 at 22:16):

FYI I was getting some very flaky behavior late yesterday (I thought it might be a glade issue) and just shut down for the day. Perhaps something is still running from that instance?

view this post on Zulip Matt Long (Apr 15 2021 at 22:17):

This feels pretty systemic to me—it's never really been stable since the PBS switch—but today has been particularly bad. I've switched to using SSH tunnels so I can get work done.

@Jared Baker, please let us know what's most helpful for you regarding testing, complaining, etc.

view this post on Zulip Jared Baker (Apr 15 2021 at 23:29):

It was absolutely systemic. I'm not sure I can really blame PBS here. JupyterHub with the new login spawners had at one point 22k open file handles. The system security limits were preventing proper response times, then things started stacking up on the Hub's polling eventually leading up to the "hang" on casper-login1 today.
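
A rough way to see how close a process is to its open-file limit, sketched in Python. It assumes the limit in question is the per-process `RLIMIT_NOFILE`; the message above only says "system security limits", so that is a guess. The `/proc` lookup works on Linux only, and the helper names are illustrative.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors via /proc (Linux only); None if unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        # No /proc, no such process, or no permission to inspect it.
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open-file limit: soft={soft} hard={hard}")
    print("descriptors open in this process:", open_fd_count())
```

Passing another process ID to `open_fd_count` inspects that process instead, provided you have permission to read its `/proc` entry.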

view this post on Zulip Jared Baker (Apr 15 2021 at 23:31):

@Kristen Krumhardt I've attempted to insert a proxy route and I think I was successful on that, but I'm not sure if it gave you the ability to access your instance again. I think it's still trying to stop.

view this post on Zulip Matt Long (Apr 16 2021 at 00:00):

Thanks @Jared Baker! I'll try again tomorrow!

view this post on Zulip Matt Long (Apr 16 2021 at 12:08):

things were going smoothly for awhile this morning...but then I just lost my kernel. I am not getting any error messages, the interface has simply stopped responding.

view this post on Zulip Matt Long (Apr 16 2021 at 12:09):

...and now it's back. It must have choked on something for a bit

view this post on Zulip Matt Long (Apr 16 2021 at 12:11):

and now it's unresponsive again. I am on crhtc53. load average: 5.11, 5.59, 5.33...doesn't seem terrible.

view this post on Zulip Kristen Krumhardt (Apr 16 2021 at 12:18):

It let me sign in this morning and then it went to "This page isn't working" , like it was yesterday morning.

view this post on Zulip Jared Baker (Apr 16 2021 at 14:27):

Kristen, your instance seemed to have a mismatched route. I've removed it. I'm hoping :fingers_crossed: that will give you options to spawn a server again.

view this post on Zulip Kristen Krumhardt (Apr 16 2021 at 14:37):

Yes! it worked this time!

view this post on Zulip Jared Baker (Apr 16 2021 at 15:21):

@Matt Long I think your issue is hub agnostic. I'm not sure what might be causing the log messages "kernel interrupted" that are in the job logs (/glade/scratch/$USER/.jupyter_logs/). I'm going to have to look at those messages more closely. I honestly have no idea why a kernel may see interruptions like that; presently at least.

view this post on Zulip Matt Long (Apr 16 2021 at 15:26):

I can believe that. hasn't happened again. With Cheyenne and the old Hub, I found that the share queue was hard to use because of intermittent unresponsiveness. Could just be a load issue, I guess.

view this post on Zulip Matt Long (Apr 16 2021 at 21:28):

@Jared Baker, the hub seems pretty stable today. Thanks for all your work on it!

view this post on Zulip Jared Baker (Apr 16 2021 at 21:30):

That's good. Made a couple background changes. Although this has given me a reason to write some tools to inspect the health of the hub as well. Enjoy your weekend!

view this post on Zulip Matt Long (Apr 16 2021 at 21:31):

you too! Thanks!

view this post on Zulip Stephen Yeager (Apr 17 2021 at 14:45):

Seems like yesterday's jupyterhub issues are still not resolved. I clicked on "Production" from the main page and got a CIT login prompt, but then landed on the screen below. Now I get this screen when I start over and click on "Production".
Screen-Shot-2021-04-17-at-8.38.23-AM.png

view this post on Zulip Matt Long (Apr 17 2021 at 15:17):

I was able to spawn a session this morning, a few hours ago, and it's still running.

I was getting that same behavior earlier this week though...tried different browser and such...didn't help

view this post on Zulip Jared Baker (Apr 17 2021 at 18:07):

@Stephen Yeager you can try again whenever. You'll need to log back in.

view this post on Zulip Matt Long (Apr 19 2021 at 18:45):

the hub has become unresponsive for me:
when I click "production" on https://jupyterhub.hpc.ucar.edu/, it seemingly starts to load a new page, but seems to be waiting indefinitely.

view this post on Zulip Jared Baker (Apr 19 2021 at 18:53):

Mine seemingly loaded just fine. I'll go check state, then it'll be to the logs.

view this post on Zulip Britt Stephens (Apr 21 2021 at 18:03):

Hub is not working for me - clicking Production at https://jupyterhub.hpc.ucar.edu/ just spins going on an hour now - tried 2 different machines and 3 different browsers. Matt says it works for him. Any suggestions, or ideas why the experience is user dependent? Thanks!

view this post on Zulip Kevin Paul (Apr 21 2021 at 18:36):

My guess is that the Hub is either bogged down with users and can't respond to more requests, or there is something in your browser cache that needs to be cleared out. But that's just a guess.

view this post on Zulip Britt Stephens (Apr 21 2021 at 18:49):

Thanks Kevin - CISL says "We have had to do some work on the JupyterHub login system since yesterday evening. I believe this is impacting your login ability. We will be continuing to work on it tomorrow and a notice will be sent out soon detailing the downtime while work is being done." Odd that it only affects some users.

view this post on Zulip Jared Baker (Apr 21 2021 at 19:05):

It only affects some users because the hub DoS'ed one of the login nodes and to get it restored, we had to block things at a network level rather than a host-level. I'm cleaning it up, but unfortunately it's a slow process. I have some potential workarounds to restore access if you'd like to pursue that?

view this post on Zulip Britt Stephens (Apr 21 2021 at 19:14):

a workaround would be great, thanks, and thanks for the explanation too

view this post on Zulip Jared Baker (Apr 21 2021 at 22:11):

@Britt Stephens might be back now.

view this post on Zulip Britt Stephens (Apr 21 2021 at 22:19):

thanks - I can log in, but can't open any notebooks (existing or new) - you likely already know, but I have a help ticket open that Daniel Howard has been iterating with me on

view this post on Zulip Jared Baker (Apr 21 2021 at 22:28):

If you refresh the page, it will ask that you re-login at this point, but I think it'll be back to normal

view this post on Zulip Britt Stephens (Apr 21 2021 at 22:36):

thanks Jared - unfortunately same behavior - trying to open an existing notebook gives "File Load Error for cmip6-sno-compute.ipynb Unhandled error" and trying to start a new notebook gives "Launcher Error Cannot read property 'path' of undefined"

view this post on Zulip Jared Baker (Apr 21 2021 at 22:38):

That's a new one for me.

view this post on Zulip Jared Baker (Apr 21 2021 at 22:43):

So I think the error is a red herring. Your $HOME is at 100%.

view this post on Zulip Jared Baker (Apr 21 2021 at 22:46):

Can we move something to the scratch filesystem to check?

view this post on Zulip Jared Baker (Apr 21 2021 at 22:49):

maybe the SOCO2_210309.tar file?

view this post on Zulip Britt Stephens (Apr 21 2021 at 22:58):

ah, great catch - that worked - sorry I didn't notice that before - thanks a lot for the help!
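
When $HOME is full, a quick way to find candidates to move to scratch is to list the largest files under it. The Python sketch below does that with a plain directory walk. `largest_files` is a made-up helper name, and it reports apparent file sizes, which may not match exactly how the quota is counted.

```python
import os


def largest_files(root, limit=10):
    """Return [(size_in_bytes, path), ...] for the biggest files under `root`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so symlinks are sized as links, not as their targets.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is unreadable
    sizes.sort(reverse=True)
    return sizes[:limit]
```

Calling `largest_files(os.path.expanduser("~"))` returns the ten biggest files in the home directory, largest first. On a large home directory the walk can take a while.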


Last updated: Jan 30 2022 at 12:01 UTC