|
msg271702 - (view) |
Author: Björn Lindqvist (Björn.Lindqvist) |
Date: 2016-07-30 19:57 |
This affects both Python 2 and 3. This is as expected:
>>> urlparse('abc:123.html')
ParseResult(scheme='abc', netloc='', path='123.html', params='', query='', fragment='')
>>> urlparse('123.html:abc')
ParseResult(scheme='123.html', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123/')
ParseResult(scheme='abc', netloc='', path='123/', params='', query='', fragment='')
This is NOT:
>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')
Expected is path='123' and scheme='abc', at least according to my reading of the RFC (https://tools.ietf.org/html/rfc1808.html).
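For illustration, here is a minimal sketch of the scheme rule as RFC 1808 (section 2.4) appears to describe it: a colon after the first character, preceded only by alphanumerics, "+", "." or "-", ends the scheme. The names `split_scheme` and `SCHEME_CHARS` are hypothetical and this is not the urllib implementation.

```python
import string

# Characters RFC 1808 allows in a scheme name: alphanumerics plus "+", "." and "-".
SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+.-")


def split_scheme(url):
    """Return (scheme, rest); scheme is '' when no valid scheme prefix exists.

    Hypothetical helper for illustration only.
    """
    colon = url.find(":")
    # The colon must come after the first character, and everything before
    # it must be a legal scheme character.
    if colon > 0 and all(ch in SCHEME_CHARS for ch in url[:colon]):
        return url[:colon].lower(), url[colon + 1:]
    return "", url


print(split_scheme("abc:123"))       # ('abc', '123')
print(split_scheme("123.html:abc"))  # ('123.html', 'abc')
```

Under this reading, 'abc:123' is handled the same way as the other three inputs, which is the behaviour requested here.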
|
|
msg271703 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-07-30 21:12 |
See issue 14072. It may be time to look at this again, but we may still be constrained by backward compatibility.
|
|
msg271719 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-07-31 02:37 |
The main backward compatibility consideration would be Issue 754016, but I don’t agree with the changes made there, and would support reverting them. The original bug reporter wanted urlparse("1.2.3.4:80", "http") to be treated as the URL http://1.2.3.4:80, but the IP address was being parsed as a scheme, so the default “http” scheme was ignored.
The original fix (r83701) affected any URL that had a digit 0–9 immediately after the “scheme:” prefix. In such URLs, the scheme component was no longer parsed. A test case for “path:80” was added, and a demonstration of not parsing any scheme from www.cwi.nl:80/%7Eguido/Python.html was added in the documentation.
Later, the logic was altered to test if the URL looked like an integer (revision 495d12196487, Issue 11467). This restored proper parsing of clsid:85bbd92o-42a0-1o69-a2e4-08002b30309d and mailto:1337@example.org, although another URL given, javascript:123, remains misparsed. The documentation was subsequently adjusted in Issue 16932 to just demonstrate www.cwi.nl/%7Eguido/Python.html being parsed as a path.
The logic was watered down to its current form by revision 9f6b7576c08c, Issue 14072. Now it tests for a non-digit anywhere after the scheme, so that tel:+31641044153 is again parsed properly. But it was pointed out that tel:1234 remains misparsed.
What’s the next step in the watering-down process? All the attempts so far break valid URLs in favour of special-casing inputs that are not valid URLs.
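The three heuristics described above can be sketched roughly as follows. This is a reconstruction from the descriptions in this message only, not the code from those revisions; the function names are hypothetical, `rest` stands for the text after the first colon, and keeping the scheme for an empty `rest` in the third variant is an assumption.

```python
DIGITS = "0123456789"


def keeps_scheme_v1(rest):
    """r83701 as described: drop the scheme when a digit follows the colon."""
    return not (rest and rest[0] in DIGITS)


def keeps_scheme_v2(rest):
    """Issue 11467 as described: drop the scheme when the rest looks like an integer."""
    try:
        int(rest)
    except ValueError:
        return True
    return False


def keeps_scheme_v3(rest):
    """Issue 14072 as described: keep the scheme if any non-digit follows it."""
    # Treating an empty remainder as "keep the scheme" is an assumption.
    return not rest or any(ch not in DIGITS for ch in rest)


# Print which variants would still parse a scheme for each example URL.
for url in ("mailto:1337@example.org", "javascript:123",
            "tel:+31641044153", "tel:1234"):
    rest = url.split(":", 1)[1]
    print(url, keeps_scheme_v1(rest), keeps_scheme_v2(rest), keeps_scheme_v3(rest))
```

Under these assumptions, javascript:123 and tel:1234 lose their scheme in all three variants, which matches the misparsed cases listed above.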
|
|
msg271738 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-07-31 14:02 |
I hate to say it, but this may require a python-dev discussion. We probably ought to be parsing valid urls correctly as our top priority, but if that breaks our parsing of "reasonable" non-valid URLs (that existing code is depending on), it's going to be a backward compatibility problem.
|
|
msg271739 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-07-31 14:04 |
On second thought, what are the chances that special casing something that looks like an IP address in the scheme position would maintain backward compatibility?
|
|
msg271823 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-08-02 13:55 |
Depends on how you define “looks like an IP address”. Does the www.cwi.nl:80 case look like an IP address? What about “path:80” or “localhost:80”? If there is any code relying on the bug, it may just as easily involve a host name as a numeric IP address.
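To make the question concrete, here is a small sketch of one possible reading of “looks like an IP address”: the text before the first colon must parse as an IPv4 address. The helper name `prefix_is_ip_address` is hypothetical, and restricting the check to IPv4 is an assumption.

```python
import ipaddress


def prefix_is_ip_address(url):
    """Return True if the text before the first colon parses as an IPv4 address.

    Hypothetical helper; only one of several possible definitions.
    """
    prefix = url.split(":", 1)[0]
    try:
        ipaddress.IPv4Address(prefix)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


for url in ("1.2.3.4:80", "www.cwi.nl:80/%7Eguido/Python.html",
            "path:80", "localhost:80"):
    print(url, prefix_is_ip_address(url))
```

With this definition only 1.2.3.4:80 qualifies, so the host name cases would not be covered by such a special case.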
|
|
msg271824 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-08-02 14:07 |
Ah, good point, I misread the scope of the problem.
|
|
msg289557 - (view) |
Author: Tim Graham (Tim.Graham) * |
Date: 2017-03-14 01:34 |
Based on discussion in issue 16932, I agree that reverting the parsing decisions from issue 754016 (as Martin suggested in msg271719) seems appropriate. I created a pull request that does that.
|
|
msg354889 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2019-10-18 13:07 |
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
|
|
msg354894 - (view) |
Author: miss-islington (miss-islington) |
Date: 2019-10-18 13:24 |
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
|
|
msg354903 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2019-10-18 15:23 |
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
|
|
msg355320 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2019-10-24 10:31 |
This issue has been fixed, so I am closing it.
|
|
msg359273 - (view) |
Author: James Brown (roguelazer) |
Date: 2020-01-04 02:37 |
This is a surprising change to put in a minor release. This change totally changes the semantics of parsing scheme-less URLs with ports in them and ended up breaking a significant amount of my software. It turns out that urls like `example.com:80` are more common than one might hope, and a lot of software has always assumed that `example.com:80` would get parsed as the netloc and the software can guess the scheme based on the port...
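For code in this situation, one approach that should not depend on how a bare `host:port` string is classified is to mark the input as a network location explicitly by prefixing `//` before parsing. A minimal sketch, assuming the input is either a full URL or a bare `host:port`; `host_and_port`, `guess_scheme` and the port mapping are hypothetical.

```python
from urllib.parse import urlsplit


def host_and_port(url):
    """Return (hostname, port) for inputs such as 'example.com:80'.

    Hypothetical helper: prefixing '//' marks the text as a network location,
    so the parser does not have to guess whether 'example.com' is a scheme.
    """
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    return parts.hostname, parts.port


def guess_scheme(port):
    """Illustrative port-to-scheme guess; the mapping is an assumption."""
    return {80: "http", 443: "https"}.get(port, "")


host, port = host_and_port("example.com:80")
print(host, port, guess_scheme(port))  # example.com 80 http
```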
|
|
msg359277 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2020-01-04 05:26 |
@James - Originally the issue was considered a revert and the versions were set for the merge, but I certainly recognize the problem when parsing can fail for simple URLs like `localhost:8000` which is very common.
Another developer had raised the concerns with the change in this PR: https://github.com/python/cpython/pull/16839#issuecomment-570758153
I am reopening this issue and will re-read the arguments to understand them and propose the next steps.
|
|
msg360196 - (view) |
Author: Chris Dent (Chris Dent) |
Date: 2020-01-17 15:21 |
Just to add to the list of places this is causing a regression. This has broken the target host determination routines in gabbi: https://github.com/cdent/gabbi/issues/277
While the original fix may have been strictly correct in some ways, it results in a terrible UX and, as several others have noted, violates backwards compatibility.
|
|
msg361815 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2020-02-11 13:20 |
Hi Lukaz / Ned:
I would like to revert the backports done in 3.8 and 3.7.
Preferably in 3.8.2 and 3.7.7, so that this undesirable behavior exists only for a single release.
I have set this as a release blocker to catch your attention.
|
|
msg362103 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2020-02-16 21:07 |
New changeset 505b6015a1579fc50d9697e4a285ecc64976397a by Senthil Kumaran in branch '3.7':
Revert "bpo-27657: Fix urlparse() with numeric paths (GH-661)" (#18526)
https://github.com/python/cpython/commit/505b6015a1579fc50d9697e4a285ecc64976397a
|
|
msg362107 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2020-02-16 21:47 |
New changeset ea316fd21527dec53e704a5b04833ac462ce3863 by Senthil Kumaran in branch '3.8':
Revert "[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-16839)" (GH-18525)
https://github.com/python/cpython/commit/ea316fd21527dec53e704a5b04833ac462ce3863
|