I'm working on a script that scrapes thousands of different web pages. Since these pages usually come from different sites, I use multithreading to speed up the scraping.
EDIT: SIMPLE SHORT EXPLANATION
-------
I'm loading 300 URLs (HTML documents) in one pool of 300 workers. Since the HTML size varies, the sum of the sizes is sometimes too big and Python starts printing: internal buffer error : Memory allocation failed : growing buffer. I want to somehow check whether this is about to happen and, if so, make the workers wait until the buffer is no longer full.
-------
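Something like the sketch below is what I have in mind, but I don't know whether it's a sane approach: a shared byte budget that every worker has to reserve before keeping a page in memory, and waits on otherwise. All the names (MEMORY_BUDGET, reserve, release) are made up, and it assumes the 300 workers are threads (e.g. multiprocessing.dummy.Pool):

import threading

MEMORY_BUDGET = 512 * 1024 * 1024  # made-up limit: hold at most ~512 MB of HTML at once
_budget_cond = threading.Condition()
_in_flight_bytes = 0

def reserve(nbytes):
    """Block until nbytes fit into the shared budget, then claim them."""
    global _in_flight_bytes
    with _budget_cond:
        while _in_flight_bytes + nbytes > MEMORY_BUDGET:
            _budget_cond.wait()
        _in_flight_bytes += nbytes

def release(nbytes):
    """Give nbytes back to the budget and wake up workers waiting in reserve()."""
    global _in_flight_bytes
    with _budget_cond:
        _in_flight_bytes -= nbytes
        _budget_cond.notify_all()

The idea would be to call reserve(len(html)) right after the download and release(len(html)) once the parsed root is built (or on error). A single page bigger than MEMORY_BUDGET would block forever, so that case would need a separate check.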
This approach works, but sometimes Python starts printing the following to the console:
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
I suppose this is because of the size of the HTML I keep in memory, which can be 300 * (for example 1 MB) = 300 MB.
EDIT:
I know that I can decrease the number of workers, and I will. But that is not a solution; it would only lower the chance of hitting the error. I want to avoid this error entirely...
I started to log the HTML sizes:
ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
And the result is (part):
2017-03-05 13:02:04,914 DEBUG SIZE: 243940
2017-03-05 13:02:05,023 DEBUG SIZE: 138384
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
2017-03-05 13:02:05,213 DEBUG SIZE: 291415
2017-03-05 13:02:05,213 DEBUG SIZE: 287030
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
2017-03-05 13:02:05,234 DEBUG SIZE: 359193
2017-03-05 13:02:05,247 DEBUG SIZE: 23703
2017-03-05 13:02:05,252 DEBUG SIZE: 24606
2017-03-05 13:02:05,275 DEBUG SIZE: 302388
2017-03-05 13:02:05,329 DEBUG SIZE: 334925
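To confirm that total memory really is the culprit, I was also thinking of logging the peak memory of the process itself, roughly like this (the resource module is Unix-only, on Linux ru_maxrss is reported in kilobytes, and if the workers are separate processes this only covers the current one):

import resource

def log_peak_memory():
    # peak resident set size of this process so far
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    ram_logger.debug('PEAK RSS: {} KB'.format(peak_kb))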
This is my simplified scraping approach:
def scrape_chunk(chunk):
    pool = Pool(300)
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item):
    root_result = _load_root(item.get('url'))
    # parse using xpath and return
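A variant of the same function I've been considering consumes the results one by one instead of collecting them all in the results list, so each parsed page could be persisted and dropped as soon as it is ready (handle_result is just a placeholder I made up):

def scrape_chunk_streaming(chunk):
    pool = Pool(300)
    try:
        # imap_unordered yields results as workers finish, so nothing
        # has to sit in memory waiting for the whole chunk to complete
        for result in pool.imap_unordered(scrape_chunk_item, chunk):
            handle_result(result)  # placeholder: write it out, then let it be garbage collected
    finally:
        pool.close()
        pool.join()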
And the function that loads the HTML:
def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            r = requests.get(url, headers=headers,
                             timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i),
                             verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break
    r.encoding = 'utf-8'
    html = r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}
    return {'success': True, 'root': root}
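Another direction I'm considering is capping the size of each response while it is being downloaded, so that a single huge page cannot blow up the buffer on its own. A rough sketch using requests' streaming mode (MAX_HTML_BYTES and _download_capped are made-up names):

MAX_HTML_BYTES = 5 * 1024 * 1024  # made-up cap per page

def _download_capped(url, headers):
    # stream the body and abort once it exceeds the cap, instead of
    # letting requests buffer an arbitrarily large page
    r = requests.get(url, headers=headers, stream=True,
                     timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT, 10),
                     verify=False)
    r.raise_for_status()
    chunks = []
    total = 0
    for chunk in r.iter_content(chunk_size=64 * 1024):
        total += len(chunk)
        if total > MAX_HTML_BYTES:
            raise ValueError('response larger than {} bytes'.format(MAX_HTML_BYTES))
        chunks.append(chunk)
    return b''.join(chunks)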
Do you know how to make this safe? Something that makes the workers wait if a buffer overflow problem is about to occur?