我正在尝试从PDF中提取文本(https://www.sec.gov/litigation/admin/2015/34-76574.pdf)使用PyPDF2,我得到的唯一结果是以下字符串:
b''
这是我的代码:
import PyPDF2
import urllib.request
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urllib.request.urlopen(url).read()
memory_file = io.BytesIO(remote_file)
read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)
page_content = page.extractText()
print(page_content.encode('utf-8'))
这段代码在我正在使用的一些PDF上正常工作(例如。https://www.sec.gov/litigation/admin/2016/34-76837-proposed-amended-distribution-plan.pdf),但上面的其他文件不起作用。知道怎么了吗?
我认为可能有一个问题,你是如何提取页面尝试做一个循环,并调用每个页面分别像这样
for i in range(0 , number_of_pages ):
pageObj = pdfReader.getPage(i)
page = pageObj.extractText()
我不知道为什么pypdf2不能从PDF中提取信息,但包pdftotext可以:
import pdftotext
from six.moves.urllib.request import urlopen
import io
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urlopen(url).read()
memory_file = io.BytesIO(remote_file)
pdf = pdftotext.PDF(memory_file)
# Iterate over all the pages
for page in pdf:
print(page)
UNITED STATES OF AMERICA
Before the
SECURITIES AND EXCHANGE COMMISSION
SECURITIES EXCHANGE ACT OF 1934
Release No. 76574 / December 7, 2015
ADMINISTRATIVE PROCEEDING
File No. 3-16987
ORDER INSTITUTING CEASE-AND-DESIST
In the Matter of PROCEEDINGS, PURSUANT TO SECTION
21C OF THE SECURITIES EXCHANGE ACT
KEFEI WANG OF 1934, MAKING FINDINGS, AND
IMPOSING REMEDIAL SANCTIONS AND A
Respondent. CEASE-AND-DESIST ORDER
I.
The Securities and Exchange Commission (“Commission”) deems it appropriate and in the
public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of
the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”).
II.
In anticipation of the institution of these proceedings, Respondent has submitted an Offer
of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the
purpose of these proceedings and any other proceedings brought by or on behalf of the
Commission, or to which the Commission is a party, and without admitting or denying the findings
herein, except as to the Commission’s jurisdiction over him and the subject matter of these
proceedings, which are admitted, and except as provided herein in Section V, Respondent consents
to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the
Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a
Cease-and-Desist Order (“Order”), as set forth below.
III.
On the basis of this Order and Respondent’s Offer, the Commission finds1 that:
Summary
1. Respondent violated Section 15(a)(1) of the Exchange Act by acting as an
unregistered broker-dealer in connection with his representation of clients who were seeking U.S.
residency through the Immigrant Investor Program. Respondent helped effect certain individuals’
securities purchases in an EB-5 Regional Center. Respondent received a commission from that
Regional Center for each investment he facilitated.
Respondent
2. Kefei Wang, age 39, is a resident of China. During the relevant time period, he was
a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based
in Fremont, California.
Background
3. The United States Congress created the Immigrant Investor Program, also known as
“EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by
foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new
commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S.
workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A
certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional
Center is defined as “any economic unit, public or private, which is involved with the promotion of
economic growth, including increased export sales, improved regional productivity, job creation,
and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015).
4. Typical Regional Center investment vehicles are offered as limited partnership
interests. The partnership interests are securities, usually offered pursuant to one or more
exemptions from the registration requirements of the U.S. securities laws. The Regional Centers
are often managed by a person or entity which acts as a general partner of the limited partnership.
The Regional Centers, the investment vehicles, and the managers are collectively referred to herein
as “EB-5 Investment Offerers.”
5. Various EB-5 Investment Offerers paid commissions to anyone who successfully
sold limited partnership interests to new investors.
1
The findings herein are made pursuant to Respondent’s Offer of Settlement and are not
binding on any other person or entity in this or any other proceeding.
2
Respondent Received Commissions for His Clients’ EB-5 Investments
6. From at least January 2010 through May 2014, Respondent received a portion of
commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted
his portion of the commissions that were paid pursuant to a written Agency Agreement between
Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the
commission was paid to a foreign bank account identified by the Respondent despite the fact that
the Respondent was U.S.-based during the relevant time period.
7. Respondent performed activities necessary to effectuate the transaction, including
recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients;
acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the
transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent
received his portion of transaction-based commissions due to Nautilus Global Capital for its
services from that EB-5 Investment Offerer.
8. As a result of the conduct described above, Respondent violated Section 15(a)(1) of
the Exchange Act which makes it unlawful for any broker or dealer which is either a person other
than a natural person or a natural person not associated with a broker or dealer to make use of the
mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to
induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is
registered in accordance with Section 15(b) of the Exchange Act.
IV.
In view of the foregoing, the Commission deems it appropriate to impose the sanctions
agreed to in Respondent Kefei Wang’s Offer.
Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that:
A. Respondent shall cease and desist from committing or causing any violations and
any future violations of Section 15(a)(1) of the Exchange Act.
B. Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement
of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities
and Exchange Commission for transfer to the general fund of the United States Treasury in
accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and
prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice
600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional
interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following
ways:
(1) Respondent may transmit payment electronically to the Commission, which will
provide detailed ACH transfer/Fedwire instructions upon request;
3
(2) Respondent may make direct payment from a bank account via Pay.gov through the
SEC website at http://www.sec.gov/about/offices/ofm.htm; or
(3) Respondent may pay by certified check, bank cashier’s check, or United States
postal money order, made payable to the Securities and Exchange Commission and
hand-delivered or mailed to:
Enterprise Services Center
Accounts Receivable Branch
HQ Bldg., Room 181, AMZ-341
6500 South MacArthur Boulevard
Oklahoma City, OK 73169
Payments by check or money order must be accompanied by a cover letter identifying
Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a
copy of the cover letter and check or money order must be sent to Stephen L. Cohen, Associate
Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE,
Washington, DC 20549-5553.
V.
It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section
523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by
Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other
amounts due by Respondent under this Order or any other judgment, order, consent order, decree
or settlement agreement entered in connection with this proceeding, is a debt for the violation by
Respondent of the federal securities laws or any regulation or order issued under such laws, as set
forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19).
By the Commission.
Brent J. Fields
Secretary
4
[Finished in 0.5s]
我正在尝试使用表单识别器-Azure认知服务从pdf文件中提取文本。我使用的是定制模型,我用我的模型训练这项服务,然后尝试提取数据。 我的PDF通常有超过1页。但是我对从第一页提取文本感兴趣。Rest所有页面没有任何重要性。 那么,有没有什么方法可以训练我的系统通过给出页码从选定的页面中提取文本? 祝好 玛杜
我正在使用PyPDF2库通过其函数从PDF文件中提取文本,对于大多数PDF来说,它工作得很好! 但是,一些PDF生成的文本如下所示: \n!"#$% 根据文件,这是可以预期的: 这适用于某些PDF文件,但对其他文件效果不佳,具体取决于使用的生成器。 不幸的是,函数在输出上述文本时不会引发任何异常。 所以,我的问题是,有没有一种方法可以以编程方式检测函数何时返回胡言乱语?
假设我的用户去了他们办公室的扫描仪。扫描仪能够生成扫描文档的PDF。这基本上就是我拥有的文件类型。 我想做的是从这个PDF中提取文本。这不是“第一代”pdf,因为文本没有嵌入到pdf中。文本嵌入在PDF中的图像中。 PDFBox的iText中是否有允许检索此数据的功能?如果可能的话,我正在尝试避免对图像进行OCR。我希望IText或PDFBox中有一些内置的东西可以做到这一点。 请注意,我不是在谈
问题内容: 我正在尝试使用提取此 PDF文件中包含的文本。 我正在使用PyPDF2模块,并具有以下脚本: 运行代码时,得到以下输出,该输出与PDF文档中包含的输出不同: 如何提取PDF文档中的文本? 问题答案: 要从PDF提取文本,请使用以下代码
问题内容: 如何 使用PHP 从PDF文档中提取文本? (我不能使用其他工具,我没有root用户访问权限) 我发现一些函数可用于纯文本,但是它们不能很好地处理Unicode字符: http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-pdf- data-extraction-437.html 问题答案: 下载 c
问题内容: 我想知道是否可以仅使用Javascript将文本包含在PDF文件中?如果是,谁能告诉我如何? 我知道有一些服务器端的Java,C#等库,但我宁愿不使用服务器。谢谢 问题答案: 这是一个古老的问题,但是由于pdf.js多年来一直在发展,所以我想给出一个新的答案。也就是说,它可以在本地完成,而无需涉及任何服务器或外部服务。新的pdf.js具有一个函数:page.getTextContent