Issue 27657: urlparse fails if the path is numeric - Python tracker

classification

Title:	urlparse fails if the path is numeric
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.8, Python 3.7, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	orsenthil	Nosy List:	Björn.Lindqvist, Chris Dent, Tim.Graham, benjamin.peterson, lukasz.langa, martin.panter, miss-islington, ned.deily, orsenthil, r.david.murray, roguelazer
Priority:	release blocker	Keywords:	3.7regression, 3.8regression, patch

Created on 2016-07-30 19:57 by Björn.Lindqvist, last changed 2020-02-16 21:47 by orsenthil.

Pull Requests
URL	Status	Linked
PR 661	merged	Tim.Graham, 2017-03-13 17:39
PR 16837	merged	miss-islington, 2019-10-18 13:07
PR 16839	merged	orsenthil, 2019-10-18 13:51
PR 18525	merged	orsenthil, 2020-02-16 18:17
PR 18526	merged	orsenthil, 2020-02-16 18:19

Messages (18)
msg271702 - (view)	Author: Björn Lindqvist (Björn.Lindqvist)	Date: 2016-07-30 19:57
This affects both Python 2 and 3. This is as expected:

>>> urlparse('abc:123.html')
ParseResult(scheme='abc', netloc='', path='123.html', params='', query='', fragment='')
>>> urlparse('123.html:abc')
ParseResult(scheme='123.html', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123/')
ParseResult(scheme='abc', netloc='', path='123/', params='', query='', fragment='')

This is NOT:

>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')

Expected is path='123' and scheme='abc'. At least according to my reading of the RFC (https://tools.ietf.org/html/rfc1808.html), that is what should happen.
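For anyone who wants to check their own interpreter, here is a minimal reproduction sketch using three of the inputs from the report. The helper name `scheme_and_path` is made up for this example. The first two inputs should split off the scheme on any version discussed here; the result for 'abc:123' depends on which Python version (and which side of the fix and its later reverts) is running, so no expected output is given for it.

```python
from urllib.parse import urlparse

# Inputs taken from the report above.
CASES = ["abc:123.html", "abc:123/", "abc:123"]


def scheme_and_path(url):
    """Return just the (scheme, path) pair that urlparse() produces."""
    parts = urlparse(url)
    return parts.scheme, parts.path


for url in CASES:
    # The last case prints ('abc', '123') where the fix is present and
    # ('', 'abc:123') where the older numeric-path check is still in place.
    print(f"{url!r:16} -> {scheme_and_path(url)}")
```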
msg271703 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-07-30 21:12
See issue 14072. It may be time to look at this again, but we may still be constrained by backward compatibility.
msg271719 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-07-31 02:37
The main backward compatibility consideration would be Issue 754016, but I don’t agree with the changes made, and would support reverting them. The original bug reporter wanted urlparse("1.2.3.4:80", "http") to be treated as the URL http://1.2.3.4:80, but the IP address was being parsed as a scheme, so the default “http” scheme was ignored.

The original fix (r83701) affected any URL that had a digit 0–9 immediately after the “scheme:” prefix. In such URLs, the scheme component was no longer parsed. A test case for “path:80” was added, and a demonstration of not parsing any scheme from www.cwi.nl:80/%7Eguido/Python.html was added in the documentation.

Later, the logic was altered to test if the URL looked like an integer (revision 495d12196487, Issue 11467). This restored proper parsing of clsid:85bbd92o-42a0-1o69-a2e4-08002b30309d and mailto:1337@example.org, although another URL given, javascript:123, remains misparsed. The documentation was subsequently adjusted in Issue 16932 to just demonstrate www.cwi.nl/%7Eguido/Python.html being parsed as a path.

The logic was watered down to its current form by revision 9f6b7576c08c, Issue 14072. Now it tests for a non-digit anywhere after the scheme, so that tel:+31641044153 is again parsed properly. But it was pointed out that tel:1234 remains misparsed.

What’s the next step in the watering-down process? All the attempts so far break valid URLs in favour of special-casing inputs that are not valid URLs.
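The three heuristics described in this message can be compared side by side with a rough sketch. These functions are hypothetical re-creations written from the prose description only; they are not the code from the revisions cited, and the names are invented for this example. Each takes the text after the first colon and returns True if that rule would still split off a scheme.

```python
DIGITS = "0123456789"


def rule_leading_digit(rest):
    """First fix, as described: a digit right after 'scheme:' blocks the scheme."""
    return not (rest and rest[0] in DIGITS)


def rule_looks_like_int(rest):
    """Second form, as described: block the scheme only if the rest parses as an integer."""
    try:
        int(rest)
    except ValueError:
        return True
    return False


def rule_any_non_digit(rest):
    """Third form, as described: keep the scheme if any non-digit follows the colon."""
    return not rest or any(c not in DIGITS for c in rest)


RULES = (rule_leading_digit, rule_looks_like_int, rule_any_non_digit)


def scheme_kept(url):
    """Map each rule name to True if that rule would still split off a scheme."""
    _, _, rest = url.partition(":")
    return {rule.__name__: rule(rest) for rule in RULES}


for url in ("mailto:1337@example.org", "tel:+31641044153", "tel:1234", "javascript:123", "path:80"):
    print(url, scheme_kept(url))
```

Under these assumptions, mailto:1337@example.org is rescued by the second and third rules, tel:+31641044153 is broken only by the second, and tel:1234, javascript:123 and path:80 lose their scheme under all three, which matches the cases listed above.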
msg271738 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-07-31 14:02
I hate to say it, but this may require a python-dev discussion. We probably ought to be parsing valid urls correctly as our top priority, but if that breaks our parsing of "reasonable" non-valid URLs (that existing code is depending on), it's going to be a backward compatibility problem.
msg271739 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-07-31 14:04
On second thought, what are the chances that special casing something that looks like an IP address in the scheme position would maintain backward compatibility?
msg271823 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-08-02 13:55
Depends on how you define “looks like an IP address”. Does the www.cwi.nl:80 case look like an IP address? What about “path:80” or “localhost:80”? If there is any code relying on the bug, it may just as easily involve a host name as a numeric IP address.
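To make the point concrete, here is a small sketch that applies a strict reading of “looks like an IP address” to the inputs mentioned in this thread, using the standard ipaddress module. The helper `is_ip_literal` is hypothetical, and treating everything before the first colon as the host is an assumption made for illustration only.

```python
import ipaddress


def is_ip_literal(host):
    """True only if `host` is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


for text in ("1.2.3.4:80", "www.cwi.nl:80", "path:80", "localhost:80"):
    # Assumption: the part before the first colon is the would-be host.
    host = text.partition(":")[0]
    print(f"{text:16} host={host!r:14} ip_literal={is_ip_literal(host)}")
```

Only 1.2.3.4 passes such a check, so a special case limited to IP literals would leave the host-name inputs parsed differently from it.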
msg271824 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-08-02 14:07
Ah, good point, I misread the scope of the problem.
msg289557 - (view)	Author: Tim Graham (Tim.Graham) *	Date: 2017-03-14 01:34
Based on discussion in issue 16932, I agree that reverting the parsing decisions from issue 754016 (as Martin suggested in msg271719) seems appropriate. I created a pull request that does that.
msg354889 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master': bpo-27657: Fix urlparse() with numeric paths (#661) https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354894 - (view)	Author: miss-islington (miss-islington)	Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7': bpo-27657: Fix urlparse() with numeric paths (GH-661) https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354903 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8': [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839) https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
msg355320 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-24 10:31
This issue has been fixed, so I am closing it.
msg359273 - (view)	Author: James Brown (roguelazer)	Date: 2020-01-04 02:37
This is a surprising change to put in a minor release. It completely changes the semantics of parsing scheme-less URLs with ports in them and ended up breaking a significant amount of my software. It turns out that URLs like `example.com:80` are more common than one might hope, and a lot of software has always assumed that `example.com:80` would get parsed as the netloc and the software can guess the scheme based on the port...
msg359277 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2020-01-04 05:26
@James - Originally the issue was considered a revert and the versions were set for the merge, but I certainly recognize the problem when parsing can fail for simple URLs like `localhost:8000`, which are very common. Another developer had raised concerns about the change in this PR: https://github.com/python/cpython/pull/16839#issuecomment-570758153. I am reopening this issue and will re-read the arguments to understand the problem and propose the next steps.
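For code that only ever receives host-and-port style input, one way to avoid depending on how a bare `example.com:80` is split is to introduce the network location explicitly with `//`, which urlsplit() recognizes as the start of a netloc. This is a sketch of that idea, not a recommendation from this thread; the helper `split_host_port` is hypothetical, and it assumes the input is never an opaque URL such as `mailto:...`, which it would mangle.

```python
from urllib.parse import urlsplit


def split_host_port(text):
    """Hypothetical helper: parse 'host:port' input that may lack a scheme."""
    # Assumption: anything without '://' or a leading '//' is a bare host[:port].
    if "://" not in text and not text.startswith("//"):
        text = "//" + text
    parts = urlsplit(text)
    return parts.hostname, parts.port


print(split_host_port("example.com:80"))
print(split_host_port("localhost:8000"))
print(split_host_port("http://example.com:8080/path"))
```

Callers that want to guess a scheme from the port can then do so from the returned port number.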
msg360196 - (view)	Author: Chris Dent (Chris Dent)	Date: 2020-01-17 15:21
Just to add to the list of places where this is causing a regression: it has broken the target host determination routines in gabbi: https://github.com/cdent/gabbi/issues/277 While the original fix may have been strictly correct in some ways, it results in a terrible UX and, as several others have noted, violates backwards compatibility.
msg361815 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2020-02-11 13:20
Hi Łukasz / Ned: I would like to revert the backports done in 3.8 and 3.7, preferably in 3.8.2 and 3.7.7, so that this undesirable behavior exists only for a single release. I have set this as a release blocker to catch your attention.
msg362103 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2020-02-16 21:07
New changeset 505b6015a1579fc50d9697e4a285ecc64976397a by Senthil Kumaran in branch '3.7': Revert "bpo-27657: Fix urlparse() with numeric paths (GH-661)" (#18526) https://github.com/python/cpython/commit/505b6015a1579fc50d9697e4a285ecc64976397a
msg362107 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2020-02-16 21:47
New changeset ea316fd21527dec53e704a5b04833ac462ce3863 by Senthil Kumaran in branch '3.8': Revert "[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-16839)" (GH-18525) https://github.com/python/cpython/commit/ea316fd21527dec53e704a5b04833ac462ce3863

History
Date	User	Action	Args
2020-02-16 21:47:25	orsenthil	set	messages: + msg362107
2020-02-16 21:07:29	orsenthil	set	messages: + msg362103
2020-02-16 18:19:45	orsenthil	set	pull_requests: + pull_request17903
2020-02-16 18:17:09	orsenthil	set	keywords: + patch stage: commit review -> patch review pull_requests: + pull_request17902
2020-02-11 13:20:49	orsenthil	set	priority: deferred blocker -> release blocker nosy: + lukasz.langa, benjamin.peterson, ned.deily messages: + msg361815
2020-01-17 16:14:32	vstinner	set	nosy: - vstinner
2020-01-17 15:21:32	Chris Dent	set	nosy: + Chris Dent messages: + msg360196
2020-01-04 17:49:08	ned.deily	set	keywords: + 3.7regression, 3.8regression, - patch priority: normal -> deferred blocker
2020-01-04 05:26:14	orsenthil	set	status: closed -> open messages: + msg359277 assignee: orsenthil resolution: fixed -> stage: resolved -> commit review
2020-01-04 02:37:16	roguelazer	set	nosy: + roguelazer messages: + msg359273
2019-10-24 10:31:31	vstinner	set	status: open -> closed nosy: + vstinner messages: + msg355320 resolution: fixed stage: patch review -> resolved
2019-10-18 15:23:21	orsenthil	set	messages: + msg354903
2019-10-18 13:51:49	orsenthil	set	pull_requests: + pull_request16388
2019-10-18 13:24:31	miss-islington	set	nosy: + miss-islington messages: + msg354894
2019-10-18 13:07:37	miss-islington	set	keywords: + patch pull_requests: + pull_request16382
2019-10-18 13:07:36	orsenthil	set	messages: + msg354889
2018-03-15 18:57:46	cheryl.sabella	set	stage: patch review versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2017-03-14 01:34:28	Tim.Graham	set	nosy: + Tim.Graham messages: + msg289557
2017-03-13 17:39:32	Tim.Graham	set	pull_requests: + pull_request543
2016-08-02 14:07:03	r.david.murray	set	messages: + msg271824
2016-08-02 13:55:05	martin.panter	set	messages: + msg271823
2016-07-31 14:04:36	r.david.murray	set	messages: + msg271739
2016-07-31 14:02:56	r.david.murray	set	messages: + msg271738
2016-07-31 02:37:12	martin.panter	set	nosy: + martin.panter, orsenthil messages: + msg271719 versions: + Python 2.7, Python 3.5, Python 3.6
2016-07-30 23:52:19	martin.panter	link	issue22891 dependencies
2016-07-30 21:12:06	r.david.murray	set	nosy: + r.david.murray messages: + msg271703
2016-07-30 19:57:17	Björn.Lindqvist	create