I'm working on a script that scrapes thousands of different web pages. Since these pages usually come from different sites, I use multithreading to speed up the scraping.
EDIT: SIMPLE SHORT EXPLANATION
-------
I'm loading 300 URLs (HTML documents) in one pool of 300 workers. Since the HTML size varies, the sum of the sizes is sometimes too big and Python starts printing: internal buffer error : Memory allocation failed : growing buffer. I want to somehow check whether this is about to happen and, if so, make the workers wait until the buffer is no longer full.
-------
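Something like the sketch below is what I have in mind, but I don't know whether it's a sane approach: a shared byte budget that every worker has to reserve before keeping a page in memory, and waits on otherwise. All the names (MEMORY_BUDGET, reserve, release) are made up, and it assumes the 300 workers are threads (e.g. multiprocessing.dummy.Pool):

import threading

MEMORY_BUDGET = 512 * 1024 * 1024  # made-up limit: hold at most ~512 MB of HTML at once
_budget_cond = threading.Condition()
_in_flight_bytes = 0

def reserve(nbytes):
    """Block until nbytes fit into the shared budget, then claim them."""
    global _in_flight_bytes
    with _budget_cond:
        while _in_flight_bytes + nbytes > MEMORY_BUDGET:
            _budget_cond.wait()
        _in_flight_bytes += nbytes

def release(nbytes):
    """Give nbytes back to the budget and wake up workers waiting in reserve()."""
    global _in_flight_bytes
    with _budget_cond:
        _in_flight_bytes -= nbytes
        _budget_cond.notify_all()

The idea would be to call reserve(len(html)) right after the download and release(len(html)) once the parsed root is built (or on error). A single page bigger than MEMORY_BUDGET would block forever, so that case would need a separate check.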
This approach works, but sometimes Python starts printing the following to the console:
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
I suppose this is because of the size of the HTML I keep in memory, which can be 300 * (for example 1 MB) = 300 MB.
EDIT:
I know that I can decrease the number of workers, and I will. But that is not a solution; it would only lower the chance of hitting the error. I want to avoid this error entirely...
I started to log the HTML sizes:
ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
And the result is (part):
2017-03-05 13:02:04,914 DEBUG SIZE: 243940
2017-03-05 13:02:05,023 DEBUG SIZE: 138384
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
2017-03-05 13:02:05,213 DEBUG SIZE: 291415
2017-03-05 13:02:05,213 DEBUG SIZE: 287030
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
2017-03-05 13:02:05,234 DEBUG SIZE: 359193
2017-03-05 13:02:05,247 DEBUG SIZE: 23703
2017-03-05 13:02:05,252 DEBUG SIZE: 24606
2017-03-05 13:02:05,275 DEBUG SIZE: 302388
2017-03-05 13:02:05,329 DEBUG SIZE: 334925
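To confirm that total memory really is the culprit, I was also thinking of logging the peak memory of the process itself, roughly like this (the resource module is Unix-only, on Linux ru_maxrss is reported in kilobytes, and if the workers are separate processes this only covers the current one):

import resource

def log_peak_memory():
    # peak resident set size of this process so far
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    ram_logger.debug('PEAK RSS: {} KB'.format(peak_kb))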
This is my simplified scraping approach:
def scrape_chunk(chunk):
    pool = Pool(300)
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item):
    root_result = _load_root(item.get('url'))
    # parse using xpath and return
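A variant of the same function I've been considering consumes the results one by one instead of collecting them all in the results list, so each parsed page could be persisted and dropped as soon as it is ready (handle_result is just a placeholder I made up):

def scrape_chunk_streaming(chunk):
    pool = Pool(300)
    try:
        # imap_unordered yields results as workers finish, so nothing
        # has to sit in memory waiting for the whole chunk to complete
        for result in pool.imap_unordered(scrape_chunk_item, chunk):
            handle_result(result)  # placeholder: write it out, then let it be garbage collected
    finally:
        pool.close()
        pool.join()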
And the function that loads the HTML:
def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            r = requests.get(url, headers=headers,
                             timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i),
                             verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break
    r.encoding = 'utf-8'
    html = r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}
    return {'success': True, 'root': root}
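Another direction I'm considering is capping the size of each response while it is being downloaded, so that a single huge page cannot blow up the buffer on its own. A rough sketch using requests' streaming mode (MAX_HTML_BYTES and _download_capped are made-up names):

MAX_HTML_BYTES = 5 * 1024 * 1024  # made-up cap per page

def _download_capped(url, headers):
    # stream the body and abort once it exceeds the cap, instead of
    # letting requests buffer an arbitrarily large page
    r = requests.get(url, headers=headers, stream=True,
                     timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT, 10),
                     verify=False)
    r.raise_for_status()
    chunks = []
    total = 0
    for chunk in r.iter_content(chunk_size=64 * 1024):
        total += len(chunk)
        if total > MAX_HTML_BYTES:
            raise ValueError('response larger than {} bytes'.format(MAX_HTML_BYTES))
        chunks.append(chunk)
    return b''.join(chunks)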
Do you know how to make this safe? Something that makes the workers wait if a buffer overflow problem is about to occur?