html - Scraping a page with awkwardly formatted JSON in Python
I am scraping a webpage for data, and it has the format:

    <!-- Web header here -->
    [{"Foo": "Bar", "Foo 2": "Bar 2"}, {"Foo 3": ["Hello", "World"], "Foo 4": "Bar 4"}, ... ]
    <!-- Web footer here -->

The issue is that the JSON appears on the page alongside other content. In the page source, the JSON list is wrapped in quotes inside a 'pre' tag, and there are other HTML tags embedded inside the JSON. So:

    <pre>"[{"Foo": "Bar", <p>"Foo 2": "Bar 2"</p>}, ...]"</pre>

Is there a way to handle this unstable formatting and get from a string of a list of JSON objects to an actual list of JSON objects, preferably getting rid of the embedded tags in the process?
Edit: I have now installed and started learning BeautifulSoup 4, as recommended by Mauricio, but I am still coming up short. The 'soup.pre' operator gives me access to:

    <pre>[{... (well formatted JSON, but still inside the tag) ...}]</pre>

HTML (the code also has some headers at the top and bottom):

    <pre>[{"title": “blah”, "refs": [“a”, “a”], "description": [“a”, “a”, “a”], "a": [{“a”: “a”}]}, {"title": “a”, "refs": [“a”, “a”], "description": [“a”, “a”, “a”], "a": [{“a”: “a”}]}]</pre>
To get rid of the extra quotes and spaces, use strip(). Then you can load the JSON string using json.loads:

    import json
    from bs4 import BeautifulSoup

    data = """<div><pre>"[{"Foo": "Bar", <p>"Foo 2": "Bar 2"</p>}]"</pre></div>"""

    soup = BeautifulSoup(data, "html.parser")
    # .text drops the embedded <p> tags; strip('"') removes the outer quotes
    json_data = soup.pre.text.strip('"')
    print(json.loads(json_data))

Prints:

    [{'Foo': 'Bar', 'Foo 2': 'Bar 2'}]
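If BeautifulSoup is not available, the same steps (take the text inside the pre tag, drop the nested tags, strip the outer quotes, parse) can be sketched with only the standard library. This is a minimal sketch, not a robust scraper: the names PreTextExtractor and json_from_pre are made up for illustration, and it assumes the JSON lives in the first pre element of the page.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text inside the first <pre> element, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while we are inside the <pre> element
        self._done = False   # set once the first <pre> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        # Only text reaches this method, so tags such as <p> are dropped.
        if self._depth:
            self.chunks.append(data)


def json_from_pre(html):
    """Hypothetical helper: parse the JSON held in the first <pre> of html."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    # Remove surrounding whitespace, then the quotes wrapped around the list.
    text = "".join(parser.chunks).strip().strip('"')
    return json.loads(text)


page = '<div><pre>"[{"Foo": "Bar", <p>"Foo 2": "Bar 2"</p>}]"</pre></div>'
records = json_from_pre(page)
print(records)
```

Stripping the outer quotes this way is only safe when the top-level JSON value is a list or an object, as it is in the question. A bare top-level string would lose its own delimiters.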
The quotes inside the pre are not normal quotes, so you should replace them:

    # -*- coding: utf-8 -*-
    import json
    from bs4 import BeautifulSoup

    data = """<div><pre>[{"title": “blah”, "refs": [“a”, “a”], "description": [“a”, “a”, “a”], "a": [{“a”: “a”}]}, {"title": “a”, "refs": [“a”, “a”], "description": [“a”, “a”, “a”], "a": [{“a”: “a”}]}]</pre></div>"""

    soup = BeautifulSoup(data, "html.parser")
    # Replace the curly quotes with the straight quotes that JSON expects
    json_data = soup.pre.text.strip('"').replace('“', '"').replace('”', '"')
    print(json.loads(json_data))

Prints:

    [{'title': 'blah', 'refs': ['a', 'a'], 'description': ['a', 'a', 'a'], 'a': [{'a': 'a'}]}, {'title': 'a', 'refs': ['a', 'a'], 'description': ['a', 'a', 'a'], 'a': [{'a': 'a'}]}]
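The quote replacement can also be done in a single pass with a translation table, which may be easier to extend if other typographic characters turn up. This is a rough sketch under one loud assumption: the curly quotes are used only as string delimiters and never appear inside a value. The helper name loads_with_curly_quotes and the sample input are invented for the example.

```python
import json

# Map typographic double quotes to the ASCII quote that JSON requires.
# Assumption: curly quotes occur only as string delimiters, never inside values.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def loads_with_curly_quotes(text):
    """Hypothetical helper: parse JSON whose strings use curly double quotes."""
    return json.loads(text.translate(CURLY_TO_STRAIGHT))


# Made-up input in the same shape as the page: straight-quoted keys,
# curly-quoted values (\u201c and \u201d are the left and right curly quotes).
raw = '[{"title": \u201cblah\u201d, "refs": [\u201ca\u201d, \u201cb\u201d]}]'
parsed = loads_with_curly_quotes(raw)
print(parsed[0]["title"])
```

If a value can itself contain quotation marks, a blind replacement like this will produce invalid JSON, and the page would need a more careful parser.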