Commit Graph

55 Commits

Author SHA1 Message Date
Brad Fitzpatrick cc370314e7 health: don't show login error details with context cancelations
Fixes #12991

Change-Id: I2a5e109395761b720ecf1069d0167cf0caf72876
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-08-01 08:29:27 -07:00
Andrea Gottardo b7c3cfe049
health: support delayed Warnable visibility (#12783)
Updates tailscale/tailscale#4136

To reduce the likelihood of presenting spurious warnings, add the ability to delay the visibility of certain Warnables, based on a TimeToVisible time.Duration field on each Warnable. The default is zero, meaning that a Warnable is immediately visible to the user when it enters an unhealthy state.

Signed-off-by: Andrea Gottardo <andrea@gottardo.me>
2024-07-11 18:51:47 +00:00
Andrea Gottardo 732af2f6e0
health: reduce severity of some warnings, improve update messages (#12689)
Updates tailscale/tailscale#4136

High severity health warning = a system notification will appear, which can be quite disruptive to the user and cause unnecessary concern in the event of a temporary network issue.

Per design decision (@sonovawolf), the severity of all warnings but "network is down" should be tuned down to medium/low. ImpactsConnectivity should be set, to change the icon to an exclamation mark in some cases, but without a notification bubble.

I also tweaked the messaging for update-available, to reflect how each platform gets updates in different ways.

Signed-off-by: Andrea Gottardo <andrea@gottardo.me>
2024-07-02 23:11:28 -07:00
Andrew Lytvynov 2064dc20d4
health,ipn/ipnlocal: hide update warning when auto-updates are enabled (#12631)
When auto-udpates are enabled, we don't need to nag users to update
after a new release, before we release auto-updates.

Updates https://github.com/tailscale/corp/issues/20081

Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
2024-06-27 09:36:29 -07:00
Andrea Gottardo 6e55d8f6a1
health: add warming-up warnable (#12553) 2024-06-25 22:02:38 -07:00
Andrea Gottardo d7619d273b health: fix nil DERPMap dereference panic
Looks like a DERPmap might not be available when we try to get the
name associated with a region ID, and that was causing an intermittent
panic in CI.

Fixes #12534

Change-Id: I4ace53681bf004df46c728cff830b27339254243
Signed-off-by: Andrea Gottardo <andrea@gottardo.me>
2024-06-19 12:20:44 -07:00
Andrea Gottardo d6a8fb20e7
health: include DERP region name in bad derp notifications (#12530)
Fixes tailscale/corp#20971

We added some Warnables for DERP failure situations, but their Text currently spits out the DERP region ID ("10") in the UI, which is super ugly. It would be better to provide the RegionName of the DERP region that is failing. We can do so by storing a reference to the last-known DERP map in the health package whenever we fetch one, and using it when generating the notification text.

This way, the following message...

> Tailscale could not connect to the relay server '10'. The server might be temporarily unavailable, or your Internet connection might be down.

becomes:

> Tailscale could not connect to the 'Seattle' relay server. The server might be temporarily unavailable, or your Internet connection might be down.

which is a lot more user-friendly.

Signed-off-by: Andrea Gottardo <andrea@gottardo.me>
2024-06-18 16:03:17 -07:00
Brad Fitzpatrick 7bc9d453c2 health: fix data race in new warnable code
Fixes #12479

Change-Id: Ice84d5eb12d835eeddf6fc8cc337ea6b4dddcf6c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-06-14 21:44:23 -07:00
Andrea Gottardo a8ee83e2c5
health: begin work to use structured health warnings instead of strings, pipe changes into ipn.Notify (#12406)
Updates tailscale/tailscale#4136

This PR is the first round of work to move from encoding health warnings as strings and use structured data instead. The current health package revolves around the idea of Subsystems. Each subsystem can have (or not have) a Go error associated with it. The overall health of the backend is given by the concatenation of all these errors.

This PR polishes the concept of Warnable introduced by @bradfitz a few weeks ago. Each Warnable is a component of the backend (for instance, things like 'dns' or 'magicsock' are Warnables). Each Warnable has a unique identifying code. A Warnable is an entity we can warn the user about, by setting (or unsetting) a WarningState for it. Warnables have:

- an identifying Code, so that the GUI can track them as their WarningStates come and go
- a Title, which the GUIs can use to tell the user what component of the backend is broken
- a Text, which is a function that is called with a set of Args to generate a more detailed error message to explain the unhappy state

Additionally, this PR also begins to send Warnables and their WarningStates through LocalAPI to the clients, using ipn.Notify messages. An ipn.Notify is only issued when a warning is added or removed from the Tracker.

In a next PR, we'll get rid of subsystems entirely, and we'll start using structured warnings for all errors affecting the backend functionality.

Signed-off-by: Andrea Gottardo <andrea@gottardo.me>
2024-06-14 11:53:56 -07:00
Brad Fitzpatrick 96712e10a7 health, ipn/ipnlocal: move more health warning code into health.Tracker
In prep for making health warnings rich objects with metadata rather
than a bunch of strings, start moving it all into the same place.

We'll still ultimately need the stringified form for the CLI and
LocalAPI for compatibility but we'll next convert all these warnings
into Warnables that have severity levels and such, and legacy
stringification will just be something each Warnable thing can do.

Updates #4136

Change-Id: I83e189435daae3664135ed53c98627c66e9e53da
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-05-01 15:03:21 -07:00
Brad Fitzpatrick 7f587d0321 health, wgengine/magicsock: remove last of health package globals
Fixes #11874
Updates #4136

Change-Id: Ib70e6831d4c19c32509fe3d7eee4aa0e9f233564
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-26 17:36:19 -07:00
Brad Fitzpatrick 745931415c health, all: remove health.Global, finish plumbing health.Tracker
Updates #11874
Updates #4136

Change-Id: I414470f71d90be9889d44c3afd53956d9f26cd61
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-26 12:03:11 -07:00
Brad Fitzpatrick 6d69fc137f ipn/{ipnlocal,localapi},wgengine{,/magicsock}: plumb health.Tracker
Down to 25 health.Global users. After this remains controlclient &
net/dns & wgengine/router.

Updates #11874
Updates #4136

Change-Id: I6dd1856e3d9bf523bdd44b60fb3b8f7501d5dc0d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-26 09:43:28 -07:00
Brad Fitzpatrick 723c775dbb tsd, ipnlocal, etc: add tsd.System.HealthTracker, start some plumbing
This adds a health.Tracker to tsd.System, accessible via
a new tsd.System.HealthTracker method.

In the future, that new method will return a tsd.System-specific
HealthTracker, so multiple tsnet.Servers in the same process are
isolated. For now, though, it just always returns the temporary
health.Global value. That permits incremental plumbing over a number
of changes. When the second to last health.Global reference is gone,
then the tsd.System.HealthTracker implementation can return a private
Tracker.

The primary plumbing this does is adding it to LocalBackend and its
dozen and change health calls. A few misc other callers are also
plumbed. Subsequent changes will flesh out other parts of the tree
(magicsock, controlclient, etc).

Updates #11874
Updates #4136

Change-Id: Id51e73cfc8a39110425b6dc19d18b3975eac75ce
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-25 22:13:04 -07:00
Brad Fitzpatrick cb66952a0d health: permit Tracker method calls on nil receiver
In prep for tsd.System Tracker plumbing throughout tailscaled,
defensively permit all methods on Tracker to accept a nil receiver
without crashing, lest I screw something up later. (A health tracking
system that itself causes crashes would be no good.) Methods on nil
receivers should not be called, so a future change will also collect
their stacks (and panic during dev/test), but we should at least not
crash in prod.

This also locks that in with a test using reflect to automatically
call all methods on a nil receiver and check they don't crash.

Updates #11874
Updates #4136

Change-Id: I8e955046ebf370ec8af0c1fb63e5123e6282a9d3
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-25 20:45:57 -07:00
Brad Fitzpatrick 5b32264033 health: break Warnable into a global and per-Tracker value halves
Previously it was both metadata about the class of warnable item as
well as the value.

Now it's only metadata and the value is per-Tracker.

Updates #11874
Updates #4136

Change-Id: Ia1ed1b6c95d34bc5aae36cffdb04279e6ba77015
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-25 14:40:11 -07:00
Brad Fitzpatrick ebc552d2e0 health: add Tracker type, in prep for removing global variables
This moves most of the health package global variables to a new
`health.Tracker` type.

But then rather than plumbing the Tracker in tsd.System everywhere,
this only goes halfway and makes one new global Tracker
(`health.Global`) that all the existing callers now use.

A future change will eliminate that global.

Updates #11874
Updates #4136

Change-Id: I6ee27e0b2e35f68cb38fecdb3b2dc4c3f2e09d68
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2024-04-25 13:46:22 -07:00
Anton Tolchanov 8cc5c51888 health: warn about reverse path filtering and exit nodes
When reverse path filtering is in strict mode on Linux, using an exit
node blocks all network connectivity. This change adds a warning about
this to `tailscale status` and the logs.

Example in `tailscale status`:

```
- not connected to home DERP region 22
- The following issues on your machine will likely make usage of exit nodes impossible: [interface "eth0" has strict reverse-path filtering enabled], please set rp_filter=2 instead of rp_filter=1; see https://github.com/tailscale/tailscale/issues/3310
```

Example in the logs:
```
2024/02/21 21:17:07 health("overall"): error: multiple errors:
	not in map poll
	The following issues on your machine will likely make usage of exit nodes impossible: [interface "eth0" has strict reverse-path filtering enabled], please set rp_filter=2 instead of rp_filter=1; see https://github.com/tailscale/tailscale/issues/3310
```

Updates #3310

Signed-off-by: Anton Tolchanov <anton@tailscale.com>
2024-02-27 00:43:01 +00:00
Andrew Dunham 727acf96a6 net/netcheck: use DERP frames as a signal for home region liveness
This uses the fact that we've received a frame from a given DERP region
within a certain time as a signal that the region is stil present (and
thus can still be a node's PreferredDERP / home region) even if we don't
get a STUN response from that region during a netcheck.

This should help avoid DERP flaps that occur due to losing STUN probes
while still having a valid and active TCP connection to the DERP server.

RELNOTE=Reduce home DERP flapping when there's still an active connection

Updates #8603

Signed-off-by: Andrew Dunham <andrew@du.nham.ca>
Change-Id: If7da6312581e1d434d5c0811697319c621e187a0
2023-12-13 16:33:46 -05:00
Brad Fitzpatrick 4d196c12d9 health: don't report a warning in DERP homeless mode
Updates #3363
Updates tailscale/corp#396

Change-Id: Ibfb0496821cb58a78399feb88d4206d81e95ca0f
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2023-11-16 14:08:47 -08:00
Brad Fitzpatrick dc7aa98b76 all: use set.Set consistently instead of map[T]struct{}
I didn't clean up the more idiomatic map[T]bool with true values, at
least yet.  I just converted the relatively awkward struct{}-valued
maps.

Updates #cleanup

Change-Id: I758abebd2bb1f64bc7a9d0f25c32298f4679c14f
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2023-09-09 10:59:19 -07:00
Brad Fitzpatrick 14320290c3 control/controlclient: merge, simplify two health check calls
I'm trying to remove some stuff from the netmap update path.

Updates #1909

Change-Id: Iad2c728dda160cd52f33ef9cf0b75b4940e0ce64
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2023-08-30 09:14:01 -07:00
Andrew Dunham 2755f3843c health, net/tlsdial: add healthcheck for self-signed cert
When we make a connection to a server, we previously would verify with
the system roots, and then fall back to verifying with our baked-in
Let's Encrypt root if the system root cert verification failed.

We now explicitly check for, and log a health error on, self-signed
certificates. Additionally, we now always verify against our baked-in
Let's Encrypt root certificate and log an error if that isn't
successful. We don't consider this a health failure, since if we ever
change our server certificate issuer in the future older non-updated
versions of Tailscale will no longer be healthy despite being able to
connect.

Updates #3198

Change-Id: I00be5ceb8afee544ee795e3c7a2815476abc4abf
Signed-off-by: Andrew Dunham <andrew@du.nham.ca>
2023-02-01 23:17:41 -05:00
Will Norris 71029cea2d all: update copyright and license headers
This updates all source files to use a new standard header for copyright
and license declaration.  Notably, copyright no longer includes a date,
and we now use the standard SPDX-License-Identifier header.

This commit was done almost entirely mechanically with perl, and then
some minimal manual fixes.

Updates #6865

Signed-off-by: Will Norris <will@tailscale.com>
2023-01-27 15:36:29 -08:00
Tom DNetto 0088c5ddc0 health,ipn/ipnlocal: report the node being locked out as a health issue
Signed-off-by: Tom DNetto <tom@tailscale.com>
2023-01-04 16:20:47 -08:00
Aaron Klotz 659e7837c6 health, ipn/ipnlocal: when -no-logs-no-support is enabled, deny access to tailnets that have network logging enabled
We want users to have the freedom to start tailscaled with `-no-logs-no-support`,
but that is obviously in direct conflict with tailnets that have network logging
enabled.

When we detect that condition, we record the issue in health, notify the client,
set WantRunning=false, and bail.

We clear the item in health when a profile switch occurs, since it is a
per-tailnet condition that should not propagate across profiles.

Signed-off-by: Aaron Klotz <aaron@tailscale.com>
2022-11-29 11:42:20 -06:00
Brad Fitzpatrick ea25ef8236 util/set: add new set package for SetHandle type
We use this pattern in a number of places (in this repo and elsewhere)
and I was about to add a fourth to this repo which was crossing the line.
Add this type instead so they're all the same.

Also, we have another Set type (SliceSet, which tracks its keys in
order) in another repo we can move to this package later.

Change-Id: Ibbdcdba5443fae9b6956f63990bdb9e9443cefa9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-11-28 10:44:17 -08:00
Brad Fitzpatrick 3f8e185003 health: add Warnable, move ownership of warnable items to callers
The health package was turning into a rando dumping ground. Make a new
Warnable type instead that callers can request an instance of, and
then Set it locally in their code without the health package being
aware of all the things that are warnable. (For plenty of things the
health package will want to know details of how Tailscale works so it
can better prioritize/suppress errors, but lots of the warnings are
pretty leaf-y and unrelated)

This just moves two of the health warnings. Can probably move more
later.

Change-Id: I51e50e46eb633f4e96ced503d3b18a1891de1452
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-11-13 08:00:27 -08:00
Brad Fitzpatrick e55ae53169 tailcfg: add Node.UnsignedPeerAPIOnly to let server mark node as peerapi-only
capver 48

Change-Id: I20b2fa81d61ef8cc8a84e5f2afeefb68832bd904
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-11-02 21:55:04 -07:00
Brad Fitzpatrick 3562b5bdfa envknob, health: support Synology, show parse errors in status
Updates #5114

Change-Id: I8ac7a22a511f5a7d0dcb8cac470d4a403aa8c817
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-09-17 08:42:41 -07:00
Brad Fitzpatrick 74674b110d envknob: support changing envknobs post-init
Updates #5114

Change-Id: Ia423fc7486e1b3f3180a26308278be0086fae49b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-09-15 15:04:02 -07:00
Jordan Whited 43f9c25fd2
cmd/tailscale: surface authentication errors in status.Health (#4748)
Fixes #3713

Signed-off-by: Jordan Whited <jordan@tailscale.com>
2022-06-03 10:52:07 -07:00
Brad Fitzpatrick 2ff481ff10 net/dns: add health check for particular broken-ish Linux DNS config
Updates #3937 (need to write docs before closing)

Change-Id: I1df7244cfbb0303481e2621ee750d21358bd67c6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-02-16 10:40:04 -08:00
Brad Fitzpatrick 41fd4eab5c envknob: add new package for all the strconv.ParseBool(os.Getenv(..))
A new package can also later record/report which knobs are checked and
set. It also makes the code cleaner & easier to grep for env knobs.

Change-Id: Id8a123ab7539f1fadbd27e0cbeac79c2e4f09751
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2022-01-24 11:51:23 -08:00
Brad Fitzpatrick 0aa4c6f147 net/dns/resolver: add debug HTML handler to see what DNS traffic was forwarded
Change-Id: I6b790e92dcc608515ac8b178f2271adc9fd98f78
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-12-21 14:32:36 -08:00
Brad Fitzpatrick 8f43ddf1a2 ipn/ipnlocal, health: populate self node's Online bit in tailscale status
One option was to just hide "offline" in the text output, but that
doesn't fix the JSON output.

The next option was to lie and say it's online in the JSON (which then
fixes the "offline" in the text output).

But instead, this sets the self node's "Online" to whether we're in an
active map poll.

Fixes #3564

Change-Id: I9b379989bd14655198959e37eec39bb570fb814a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-12-16 10:14:08 -08:00
David Anderson 6c82cebe57 health: add a health state for net/dns.OSConfigurator.
Lets the systemd-resolved OSConfigurator report health changes
for out of band config resyncs.

Updates #3327

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-11-19 11:09:32 -08:00
Josh Bleecher Snyder 3fd5f4380f util/multierr: new package
github.com/go-multierror/multierror served us well.
But we need a few feature from it (implement Is),
and it's not worth maintaining a fork of such a small module.

Instead, I did a clean room implementation inspired by its API.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-11-02 17:50:15 -07:00
Brad Fitzpatrick 09e692e318 health: don't look for UDP goroutines in js/wasm health check
Updates #3157

Change-Id: I43d97e6876eeb2d1936fc567835134568bb8615c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-10-22 09:12:00 -07:00
Brad Fitzpatrick aae622314e tailcfg, health: add way for control plane to add problems to health check
So if the control plane knows that something's broken about the node, it can
include problem(s) in MapResponse and "tailscale status" will show it.
(and GUIs in the future, as it's in ipnstate.Status/JSON)

This also bumps the MapRequest.Version, though it's not strictly
required. Doesn't hurt.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-09-19 17:55:49 -07:00
Brad Fitzpatrick 4549d3151c cmd/tailscale: make status show health check problems
Fixes #2775

RELNOTE=tailscale status now shows health check problems

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-09-16 11:24:59 -07:00
Brad Fitzpatrick 5bacbf3744 wgengine/magicsock, health, ipn/ipnstate: track DERP-advertised health
And add health check errors to ipnstate.Status (tailscale status --json).

Updates #2746
Updates #2775

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-09-02 10:20:25 -07:00
Josh Bleecher Snyder 9d542e08e2 wgengine/magicsock: always run ReceiveIPv6
One of the consequences of the 	bind refactoring in 6f23087175
is that attempting to bind an IPv6 socket will always
result in c.pconn6.pconn being non-nil.
If the bind fails, it'll be set to a placeholder packet conn
that blocks forever.

As a result, we can always run ReceiveIPv6 and health check it.
This removes IPv4/IPv6 asymmetry and also will allow health checks
to detect any IPv6 receive func failures.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-28 11:07:14 -07:00
Josh Bleecher Snyder fe50ded95c health: track whether we have a functional udp4 bind
Suggested-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-28 11:07:14 -07:00
Josh Bleecher Snyder 744de615f1 health, wgenegine: fix receive func health checks for the fourth time
The old implementation knew too much about how wireguard-go worked.
As a result, it missed genuine problems that occurred due to unrelated bugs.

This fourth attempt to fix the health checks takes a black box approach.
A receive func is healthy if one (or both) of these conditions holds:

* It is currently running and blocked.
* It has been executed recently.

The second condition is required because receive functions
are not continuously executing. wireguard-go calls them and then
processes their results before calling them again.

There is a theoretical false positive if wireguard-go go takes
longer than one minute to process the results of a receive func execution.
If that happens, we have other problems.

Updates #1790

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-26 17:35:49 -07:00
Josh Bleecher Snyder 0d4c8cb2e1 health: delete ReceiveFunc health checks
They were not doing their job.
They need yet another conceptual re-think.
Start by clearing the decks.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-26 17:35:49 -07:00
Josh Bleecher Snyder 8d7f7fc7ce health, wgenegine: fix receive func health checks yet again
The existing implementation was completely, embarrassingly conceptually broken.

We aren't able to see whether wireguard-go's receive function goroutines
are running or not. All we can do is model that based on what we have done.
This commit fixes that model.

Fixes #1781

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-23 08:42:04 -07:00
Josh Bleecher Snyder 5835a3f553 health, wgengine/magicsock: avoid receive function false positives
Avery reported a sub-ms health transition from "receiveIPv4 not running" to "ok".

To avoid these transient false-positives, be more precise about
the expected lifetime of receive funcs. The problematic case is one in which
they were started but exited prior to a call to connBind.Close.
Explicitly represent started vs running state, taking care with the order of updates.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-22 12:48:10 -07:00
Josh Bleecher Snyder f845aae761 health: track whether magicsock receive functions are running
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-22 08:57:36 -07:00
David Anderson f007a9dd6b health: add DNS subsystem and plumb errors in.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 10:55:35 -07:00