Stream: jupyter
Topic: jupyterhub trouble?
Deepak Cherian (Mar 30 2021 at 19:12):
is anyone else having trouble spinning up a server on casper using jupyterhub.ucar.edu? I don't see any requests in squeue...
Max Grover (Mar 30 2021 at 19:14):
I was able to get in a few minutes ago... but I only requested a single node
Anderson Banihirwe (Mar 30 2021 at 19:16):
@Deepak Cherian, I am able to get in as well. Can you confirm that your gladequota is okay?
Isla Simpson (Mar 30 2021 at 19:16):
I had trouble yesterday. Eventually it worked in the afternoon and I am not having trouble today.
Deepak Cherian (Mar 30 2021 at 19:17):
yup gladequota looks great. is there a log somewhere? The website says "Spawning server..."
Deepak Cherian (Mar 30 2021 at 19:18):
now it's failed with a timeout
Anderson Banihirwe (Mar 30 2021 at 19:19):
looks great. is there a log somewhere?
The logs reside in /glade/scratch/$USER/.jupyter_logs
Anderson Banihirwe (Mar 30 2021 at 19:20):
However, I don't think they show up until the job is up and running
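As a sketch (not from the thread): once a job is running, the newest files in the log directory are usually the ones worth reading first. The helper below lists the most recently modified files in a directory. The function names are hypothetical, and the path in `default_log_dir` is simply the one quoted above, with `$USER` read from the environment.

```python
import os
from pathlib import Path


def newest_logs(log_dir, count=5):
    """Return up to `count` files in `log_dir`, most recently modified first."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        # No directory yet: the job may not have started writing logs.
        return []
    files = [p for p in log_dir.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:count]


def default_log_dir():
    """Log location quoted in this thread (an assumption outside that system)."""
    user = os.environ.get("USER", "")
    return Path("/glade/scratch") / user / ".jupyter_logs"
```

Calling `newest_logs(default_log_dir())` returns an empty list when the directory does not exist, which matches the case where the job has not started yet.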
Anderson Banihirwe (Mar 30 2021 at 19:26):
now it's failed with a timeout
Are you using the default settings?
Deepak Cherian (Mar 30 2021 at 19:27):
ya just tried that.
Matt Long (Mar 30 2021 at 21:01):
I wonder if the queue is simply choked up? Many of the Casper nodes are now on PBS—I presume this means they are unavailable via Slurm.
Max Grover (Mar 30 2021 at 21:08):
Do all the requests use Slurm by default?
Matt Long (Mar 30 2021 at 21:16):
CISL is in the process of transitioning Casper from all Slurm to all PBS. Perhaps @Brian Vanderwende or @mickc have some insight here.
Brian Vanderwende (Mar 30 2021 at 21:25):
Approximately half of the nodes have been moved from Slurm to PBS, so indeed you will see slower dispatch times than normal during busy hours until things are totally migrated to PBS. I expect that this will get better as the week goes along as folks migrate their traditional jobs.
Brian Vanderwende (Mar 30 2021 at 21:26):
The production JupyterHub still uses Slurm (I forgot to mention that explicitly!)
Matt Long (Apr 07 2021 at 14:37):
@all, in case you missed it, jupyterhub.ucar.edu now redirects to jupyterhub.hpc.ucar.edu; the Hub has been updated and there is a new interface for spawning a server (or multiple servers via the control panel).
Cecile Hannay (Apr 12 2021 at 17:04):
I am having issues creating a Jupyter Notebook on Cheyenne.
When I try to create a new notebook, I am getting the error:
"Unexpected error while saving file: Untitled.ipynb attempt to write a readonly database"
I am attaching screenshots of the error message I am getting.
Screen-Shot-2021-04-12-at-10.18.07-AM.png
Screen-Shot-2021-04-12-at-10.18.13-AM.png
I am getting this error with the JupyterHub and with interactive sessions on cheyenne. I have no issues on my laptop or on the cgd machines. This only happens on the CISL machines. I must have a permission problem somewhere but I haven't been able to figure it out. I looked on stackoverflow, but I haven't found a way to solve this problem.
Kevin Paul (Apr 12 2021 at 17:15):
There are a lot of things that can look like a permission error, too. Such as not having storage space. But I'm not sure what the problem is.
Matt Long (Apr 12 2021 at 18:04):
@Cecile Hannay, you might want to check your quota: Kevin is right, it could simply be no disk space available.
Cecile Hannay (Apr 12 2021 at 19:27):
@Kevin Paul and @Matt Long: It was a good thought but I haven't reached my quota and it is not a space issue.
Max Grover (Apr 12 2021 at 19:35):
Are you using the new JupyterHub? (jupyterhub.hpc.ucar.edu) And if so, are you on the Cheyenne Login option?
Kevin Paul (Apr 12 2021 at 19:38):
I think we need clarification. The first error message you get is the "Unexpected error while saving..." error, right? And then you get the second error after you click "Dismiss" on the first error message dialog box?
Kevin Paul (Apr 12 2021 at 19:39):
@Max Grover: @Cecile Hannay is seeing this on both the JHub and via self-launched (i.e., SSH tunnels) Jupyter sessions.
Cecile Hannay (Apr 12 2021 at 19:47):
@Kevin Paul: This is correct. I get the first error when I try to create a notebook and the second error after clicking "Dismiss" on the dialog box.
Max Grover (Apr 12 2021 at 19:50):
From the command line, can you try running jupyter notebook --NotebookNotary.db_file=':memory:'
there is a thread from google groups https://groups.google.com/a/continuum.io/g/anaconda/c/dGcZoFIci1k on this, but I am not sure if you have tried this. It says that it could be that you do not have write permissions to your home directory for some reason. Along with a github issue thread here https://github.com/jupyter/notebook/issues/5321
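As a sketch (not from the thread): the wording "attempt to write a readonly database" is SQLite's own error text, and Jupyter keeps notebook signatures in a SQLite file, which is why pointing `NotebookNotary.db_file` at `':memory:'` is a useful test. The example below reproduces the message with Python's `sqlite3` module by opening a scratch database read-only, then shows that an in-memory database accepts the same write. The helper names and the temporary file are illustrative only, and the read-only mode here is forced on purpose; it does not show what caused the error in this case.

```python
import os
import sqlite3
import tempfile


def try_write(conn):
    """Attempt a write; return None on success or the SQLite error text."""
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)


def demo():
    """Compare a database file opened read-only with an in-memory database."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # Create a valid database file first.
    rw = sqlite3.connect(path)
    rw.execute("CREATE TABLE t (x INTEGER)")
    rw.commit()
    rw.close()
    # Reopen it read-only through a URI; writes are refused from here on.
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    readonly_error = try_write(ro)
    ro.close()
    # ':memory:' never touches the filesystem, so file state cannot interfere.
    mem = sqlite3.connect(":memory:")
    memory_error = try_write(mem)
    mem.close()
    return readonly_error, memory_error
```

If the notebook server works with the in-memory setting but not without it, the signature database file (or the directory holding it) is the likely culprit rather than the notebook itself.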
Cecile Hannay (Apr 12 2021 at 21:41):
@Max Grover I am not sure at which stage I should try that command.
I either use the JupyterHub (jupyterhub.hpc.ucar.edu) or start-jupyter.
Max Grover (Apr 12 2021 at 22:49):
Can you open up a terminal once you are on the Jupyterhub? Then type it in there?
Cecile Hannay (Apr 12 2021 at 23:23):
@Max Grover
Thanks for your reply. I tried that but I am not sure it is doing anything. I still get the same error.
I open a terminal and type the command:
Screen-Shot-2021-04-12-at-5.19.23-PM.png
Here is what happens on the screen but I cannot click on these links.
Screen-Shot-2021-04-12-at-5.19.39-PM.png (https://zulip2.cloud.ucar.edu/user_uploads/2/ef/bamQGFCIII15CuIfm7Su_C3J/Screen-Shot-2021-04-12-at-5.19.39-PM.png)
When I try to open a Jupyter Notebook, I get the same error.
Screen-Shot-2021-04-12-at-5.20.12-PM.png (https://zulip2.cloud.ucar.edu/user_uploads/2/6d/8XGKw78NoEQc-tZUdD7hhyy3/Screen-Shot-2021-04-12-at-5.20.12-PM.png)
Matt Long (Apr 13 2021 at 15:02):
@Cecile Hannay, I think you should send this info to CISL help. I cannot reproduce your problem on my end.
You can look in
/glade/scratch/${USER}/.jupyter_logs
Perhaps there is something useful there? (I tried to look for you, but don't have permission.)
cc @Brian Vanderwende
Cecile Hannay (Apr 13 2021 at 17:19):
I have worked with @Max Grover.
From the command: jupyter notebook --NotebookNotary.db_file=':memory:'
it looks like I don't have write permissions to my home directory for jupyter Notebook.
I contacted cislhelp yesterday but I will update with this new piece of information.
Cecile Hannay (Apr 13 2021 at 17:31):
We also looked at:
/glade/scratch/${USER}/.jupyter_logs
All my recent attempts didn't create any jupyter_logs.
Matt Long (Apr 13 2021 at 17:33):
weird. I don't have any idea what could be going wrong. Is the behavior consistent on Casper PBS Batch, Login, etc.?
Cecile Hannay (Apr 13 2021 at 17:34):
Here is the same behavior on casper.
I will try to clean up to get my quota under 90% as Brian suggested.
Matt Long (Apr 13 2021 at 17:36):
my home directory is at 94.52%
Matt Long (Apr 13 2021 at 17:36):
so I wouldn't guess that that's the trouble...but it's beyond me what might be going wrong.
Cecile Hannay (Apr 13 2021 at 18:13):
@Brian Vanderwende
I brought my quota to:
/glade/u/home/hannay 43.09 GB 50.00 GB 86.18 % 147355
I am still getting the same error.
Brian Vanderwende (Apr 13 2021 at 18:58):
@Cecile Hannay Thanks for letting me know. Which instance did you use in your most recent attempts? A JupyterHub session or a tunnel? If the Hub, which system and was it batch or login? This information will help me narrow down which logs to have the admins look at.
Cecile Hannay (Apr 13 2021 at 18:59):
@Brian Vanderwende In my last attempt after reducing the quota, I tried on the JupyterHub on cheyenne.
Brian Vanderwende (Apr 13 2021 at 19:47):
Thanks Cecile. We are taking a look.
John Clyne (Apr 13 2021 at 19:49):
@Brian Vanderwende on a related topic, the new JupyterHub interface http://jupyterhub.hpc.ucar.edu/ is really slick. A couple of documentation items that might make it much more accessible:
- Provide guidance on which resource to select. The tradeoffs between casper-batch vs cheyenne-batch vs login aren't readily obvious.
- More importantly, having comprehensive documentation on how to configure a Dask environment for single-node-multiple-threads and distributed-node use would be really helpful. Most of that information seems to be scattered around various zulip postings and in one of @Anderson Banihirwe 's GitHub repos (https://github.com/NCAR/ncar-jobqueue). It would be great to have documentation and examples that are maintained by CISL. Perhaps they already exist and I simply couldn't find them! Just my 2 cents worth :-)
Cecile Hannay (Apr 13 2021 at 21:29):
The problem has been solved with cisl. I am posting the fix here in case someone runs into the same issue.
The problem was that the auto-created file:
~/.local/share/jupyter/nbsignatures.db
got corrupted.
Because this database couldn't be accessed properly, I couldn't access/create notebooks. Erasing the file solved the problem.
Last thing: before deleting the file, you need to make sure you don't have any Jupyter sessions that have that file locked / opened.
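As a sketch of the fix described above (not from the thread, and not CISL's procedure): the signature file is a SQLite database, so `PRAGMA integrity_check` can indicate whether it is readable before anything is removed. The helpers below are hypothetical. `move_aside` renames the file with a timestamp suffix rather than deleting it, which is a more cautious variant of the same fix. All Jupyter sessions should be closed first, as noted above.

```python
import shutil
import sqlite3
import time
from pathlib import Path


def signature_db_is_healthy(db_path):
    """Return True if the file is absent or passes SQLite's integrity check."""
    if not Path(db_path).is_file():
        # Nothing to check; the file is auto-created when needed.
        return True
    try:
        conn = sqlite3.connect(str(db_path))
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised when the file is not a usable SQLite database.
        return False
    return rows == [("ok",)]


def move_aside(db_path):
    """Rename the file with a timestamp suffix instead of deleting it."""
    db_path = Path(db_path)
    backup = db_path.with_name(f"{db_path.name}.bad-{int(time.time())}")
    shutil.move(str(db_path), str(backup))
    return backup
```

A possible use is `db = Path.home() / ".local/share/jupyter/nbsignatures.db"`, followed by `move_aside(db)` only when `signature_db_is_healthy(db)` returns False.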
Matt Long (Apr 13 2021 at 22:44):
Great to hear!
Kevin Paul (Apr 13 2021 at 22:45):
Thanks, @Cecile Hannay!
Kristen Krumhardt (Apr 15 2021 at 16:17):
Hello, I'm having trouble accessing jupyterhub this morning. I can type my username and password, but then it brings me to a page that says "This page isn't working" . See screenshot. Would anyone be able to help me?
Screen-Shot-2021-04-15-at-10.16.23-AM.png
Cecile Hannay (Apr 15 2021 at 16:18):
I had the same error two days ago. It was temporary and it worked after trying again later.
Kristen Krumhardt (Apr 15 2021 at 16:19):
Oh ok, thanks Cecile! I'll try again in a little while
Matt Long (Apr 15 2021 at 16:49):
@Jared Baker is aware of these issues.
Kristen Krumhardt (Apr 15 2021 at 16:50):
Ok, thanks... it's still not working.
Jared Baker (Apr 15 2021 at 17:02):
What about now?
Kristen Krumhardt (Apr 15 2021 at 17:04):
still not working..
Jared Baker (Apr 15 2021 at 17:14):
Neato. Okay, well I'm going to go see if I can prune your entries in the state database since the API is not doing what it says it is.
Kristen Krumhardt (Apr 15 2021 at 17:21):
ok thanks! Just let me know when I should try again...
Jared Baker (Apr 15 2021 at 18:16):
@Kristen Krumhardt I imagine the web page if you refresh will ask you to log in again.
Kristen Krumhardt (Apr 15 2021 at 18:18):
no it just keeps saying "This page isn't working"
Anderson Banihirwe (Apr 15 2021 at 18:21):
no it just keeps saying "This page isn't working"
Do you get the same error message when accessing the page from a private browser window or a different browser ?
Matt Long (Apr 15 2021 at 18:22):
I was just able to spawn a server...
Jared Baker (Apr 15 2021 at 18:23):
Made another change Kristen. what about now?
Kristen Krumhardt (Apr 15 2021 at 18:24):
yes, now it's asking me to sign in again
Kristen Krumhardt (Apr 15 2021 at 18:25):
oh now it looks like it might work! says 'my server is starting up'
Kristen Krumhardt (Apr 15 2021 at 18:25):
yes! it's working! thank you!
Kristen Krumhardt (Apr 15 2021 at 18:26):
@Anderson Banihirwe I tried with different browsers before and it just brought me to a blank white page
Jared Baker (Apr 15 2021 at 18:26):
Good deal, what a doozy.
Kristen Krumhardt (Apr 15 2021 at 18:26):
but now problem solved:))
Kristen Krumhardt (Apr 15 2021 at 19:26):
well, I just got kicked off jupyterhub and it's been stuck on this page for a couple min. Is anyone else having this issue? Screen-Shot-2021-04-15-at-1.23.36-PM.png
Max Grover (Apr 15 2021 at 19:30):
I tried to sign in today and was unable to do so
Frank Bryan (Apr 15 2021 at 19:31):
Yes. I was trying to figure this out just now.
When I first go to jupyterhub.hpc.ucar.edu I get an unfamiliar login page
Screen-Shot-2021-04-15-at-1.25.45-PM.png
Then I get the "Your server is stopping" page as above.
I have tried rebooting, clearing browser cache etc.
I am able to connect from a shell with jupyter lab, but my Dask Dashboard hangs with the same infinite waiting page.
Matt Long (Apr 15 2021 at 19:48):
The hub remains unstable for me as well.
Jared Baker (Apr 15 2021 at 20:01):
There are some runaway things now. trying to keep it alive.
Sheri Mickelson (Apr 15 2021 at 20:04):
Yeah, it looks like there's issues on casper-login1. The load is very high on it and my ssh login is hanging on it.
Jared Baker (Apr 15 2021 at 21:29):
I was able to get the node wrangled back and hopefully didn't interrupt too much running through the hub. Apologies, but what a perfect storm.
Matt Long (Apr 15 2021 at 21:45):
thanks @Jared Baker!
Jared Baker (Apr 15 2021 at 21:46):
Let me know if it's not working for you I suppose.
Kristen Krumhardt (Apr 15 2021 at 21:47):
I'm still getting that "Your server is stopping" page...
Jared Baker (Apr 15 2021 at 21:52):
were you running a casper-batch job?
Kristen Krumhardt (Apr 15 2021 at 21:59):
yes
Kristen Krumhardt (Apr 15 2021 at 22:00):
but then I got kicked off.. and then I tried to restart the server with another casper-batch job but it just keeps landing on this page
Jared Baker (Apr 15 2021 at 22:04):
Okay, I have a theory on what happened here then. I'm curious if it was in the process of spawning then the hub became overwhelmed and never got the update. Do you mind if I try something interesting?
Kristen Krumhardt (Apr 15 2021 at 22:10):
I don't mind! try anything:)
Frank Bryan (Apr 15 2021 at 22:11):
I am also seeing the same thing I did earlier today ("Your server is stopping")
Kristen Krumhardt (Apr 15 2021 at 22:14):
glad I'm not the only one!
Abby Jaye (Apr 15 2021 at 22:15):
I have the same issue! I've been having issues since around noon
Frank Bryan (Apr 15 2021 at 22:16):
FYI I was getting some very flaky behavior late yesterday (I thought it might be a glade issue) and just shut down for the day. Perhaps something is still running from that instance?
Matt Long (Apr 15 2021 at 22:17):
This feels pretty systemic to me—it's never really been stable since the PBS switch—but today has been particularly bad. I've switched to using SSH tunnels so I can get work done.
@Jared Baker, please let us know what's most helpful for you regarding testing, complaining, etc.
Jared Baker (Apr 15 2021 at 23:29):
It was absolutely systemic. I'm not sure I can really blame PBS here. JupyterHub with the new login spawners had at one point 22k open file handles. The system security limits were preventing proper response times, then things started stacking up on the Hub's polling eventually leading up to the "hang" on casper-login1 today.
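As a sketch (not from the thread, and not the tooling used on the Hub): a process can compare its open file descriptors against its per-process limit with the standard `resource` module. The function names are hypothetical, the `/proc` lookup assumes Linux, and the numbers describe only the process that runs the code, not the whole system.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count open file descriptors for a process via /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        # /proc is unavailable, or the process belongs to another user.
        return None


def fd_headroom():
    """Return (open_count, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft, hard
```

An open count that sits close to the soft limit is a hint that the process may soon fail to open new files or sockets.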
Jared Baker (Apr 15 2021 at 23:31):
@Kristen Krumhardt I've attempted to insert a proxy route and I think I was successful on that, but I'm not sure if it gave you the ability to access your instance again. I think it's still trying to stop.
Matt Long (Apr 16 2021 at 00:00):
Thanks @Jared Baker! I'll try again tomorrow!
Matt Long (Apr 16 2021 at 12:08):
things were going smoothly for awhile this morning...but then I just lost my kernel. I am not getting any error messages, the interface has simply stopped responding.
Matt Long (Apr 16 2021 at 12:09):
...and now it's back. It must have choked on something for a bit
Matt Long (Apr 16 2021 at 12:11):
and now it's unresponsive again. I am on crhtc53. load average: 5.11, 5.59, 5.33...doesn't seem terrible.
Kristen Krumhardt (Apr 16 2021 at 12:18):
It let me sign in this morning and then it went to "This page isn't working" , like it was yesterday morning.
Jared Baker (Apr 16 2021 at 14:27):
Kristen, your instance seemed to have a mismatched route. I've removed it. I'm hoping :fingers_crossed: that will give you options to spawn a server again.
Kristen Krumhardt (Apr 16 2021 at 14:37):
Yes! it worked this time!
Jared Baker (Apr 16 2021 at 15:21):
@Matt Long I think your issue is hub agnostic. I'm not sure what might be causing the log messages "kernel interrupted" that are in the job logs (/glade/scratch/$USER/.jupyter_logs/). I'm going to have to look at those messages more closely. I honestly have no idea why a kernel may see interruptions like that; presently at least.
Matt Long (Apr 16 2021 at 15:26):
I can believe that. hasn't happened again. With Cheyenne and the old Hub, I found that the share queue was hard to use because of intermittent unresponsiveness. Could just be a load issue, I guess.
Matt Long (Apr 16 2021 at 21:28):
@Jared Baker, the hub seems pretty stable today. Thanks for all your work on it!
Jared Baker (Apr 16 2021 at 21:30):
That's good. Made a couple background changes. Although this has given me a reason to write some tools to inspect the health of the hub as well. Enjoy your weekend!
Matt Long (Apr 16 2021 at 21:31):
you too! Thanks!
Stephen Yeager (Apr 17 2021 at 14:45):
Seems like yesterday's jupyterhub issues are still not resolved. I clicked on "Production" from the main page and got a CIT login prompt, but then landed on the screen below. Now I get this screen when I start over and click on "Production".
Screen-Shot-2021-04-17-at-8.38.23-AM.png
Matt Long (Apr 17 2021 at 15:17):
I was able to spawn a session this morning, a few hours ago, and it's still running.
I was getting that same behavior earlier this week though...tried different browser and such...didn't help
Jared Baker (Apr 17 2021 at 18:07):
@Stephen Yeager you can try again whenever. You'll need to log back in.
Matt Long (Apr 19 2021 at 18:45):
the hub has become unresponsive for me:
when I click "production" on https://jupyterhub.hpc.ucar.edu/, it seemingly starts to load a new page, but seems to be waiting indefinitely.
Jared Baker (Apr 19 2021 at 18:53):
Mine seemingly loaded just fine. I'll go check state, then it'll be to the logs.
Britt Stephens (Apr 21 2021 at 18:03):
Hub is not working for me - clicking Production at https://jupyterhub.hpc.ucar.edu/ just spins going on an hour now - tried 2 different machines and 3 different browsers. Matt says it works for him. Any suggestions, or ideas why the experience is user dependent? Thanks!
Kevin Paul (Apr 21 2021 at 18:36):
My guess is that the Hub is either bogged down with users and can't respond to more requests, or there is something in your browser cache that needs to be cleared out. But that's just a guess.
Britt Stephens (Apr 21 2021 at 18:49):
Thanks Kevin - CISL says "We have had to do some work on the JupyterHub login system since yesterday evening. I believe this is impacting your login ability. We will be continuing to work on it tomorrow and a notice will be sent out soon detailing the downtime while work is being done." Odd that it only affects some users.
Jared Baker (Apr 21 2021 at 19:05):
It only affects some users because the hub DoS'ed one of the login nodes and to get it restored, we had to block things at a network level rather than a host-level. I'm cleaning it up, but unfortunately it's a slow process. I have some potential workarounds to restore access if you'd like to pursue that?
Britt Stephens (Apr 21 2021 at 19:14):
a workaround would be great, thanks, and thanks for the explanation too
Jared Baker (Apr 21 2021 at 22:11):
@Britt Stephens might be back now.
Britt Stephens (Apr 21 2021 at 22:19):
thanks - I can log in, but can't open any notebooks (existing or new) - you likely already know, but I have a help ticket open that Daniel Howard has been iterating with me on
Jared Baker (Apr 21 2021 at 22:28):
If you refresh the page, it will ask that you re-login at this point, but I think it'll be back to normal
Britt Stephens (Apr 21 2021 at 22:36):
thanks Jared - unfortunately same behavior - trying to open an existing notebook gives "File Load Error for cmip6-sno-compute.ipynb Unhandled error" and trying to start a new notebook gives "Launcher Error Cannot read property 'path' of undefined"
Jared Baker (Apr 21 2021 at 22:38):
That's a new one for me.
Jared Baker (Apr 21 2021 at 22:43):
So I think the error is a red herring. Your $HOME is at 100%.
Jared Baker (Apr 21 2021 at 22:46):
Can we move something to the scratch filesystem to check?
Jared Baker (Apr 21 2021 at 22:49):
maybe the SOCO2_210309.tar file?
Britt Stephens (Apr 21 2021 at 22:58):
ah, great catch - that worked - sorry I didn't notice that before - thanks a lot for the help!
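As a sketch (not from the thread): when a home directory is at its quota, listing the largest files is a quick way to find candidates to move to scratch. The helper below is hypothetical and uses only the standard library. It reports file sizes, not quota accounting, so gladequota remains the place to confirm the result.

```python
import os


def largest_files(root, count=10):
    """Return the `count` largest regular files under `root` as (bytes, path)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks; their targets are stored elsewhere
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable
    sizes.sort(reverse=True)
    return sizes[:count]
```

For example, `largest_files(os.path.expanduser("~"))` returns the ten largest files under the home directory, biggest first.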
Last updated: Jan 30 2022 at 12:01 UTC