Question:

GCP and TPU: experimental_connect_to_cluster gives no response

龙兴学

2023-03-14

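I am trying to use a TPU on GCP with TensorFlow 2.1 and the Keras API. Unfortunately, I am stuck right after creating the TPU node: my VM seems to "see" the TPU but cannot connect to it.

The code I am using: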

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

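The code hangs at line 3 (the experimental_connect_to_cluster call). I get a few messages and then nothing, so I cannot tell what the problem is. I suspect some kind of connection issue between the VM and the TPU.

Messages: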

2020-04-22 15:46:25.383775: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-04-22 15:46:25.992977: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-04-22 15:46:26.042269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5636e446610 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-22 15:46:26.042403: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-22 15:46:26.080879: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
E0422 15:46:26.263937297    2263 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1587570386.263923266","description":"SO_REUSEPORT unavailable on compiling system","file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":166}
2020-04-22 15:46:26.269134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -

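Also, I am using the "Deep Learning" image from GCP, so I should not need to install anything, right?

Has anyone had the same problem with TF 2.1? P.S. The same code runs fine on Kaggle and Colab.

2 answers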

荣曾笑

2023-03-14

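I created my VM and TPU with ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-2 --tf-version=2.2 --tpu-size v3-8 --name cola-tpu

But I still could not access the TPU; it hung exactly as the OP described.

I opened a Google issue and got this answer there: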

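This is a known issue that occurs occasionally, and the product team is currently working on a fix.

In the meantime, let me suggest some troubleshooting steps:

1- Disable and then re-enable the TPU API

If that does not work:

2.1- Go to VPC network

2.2- Check whether cp-to-tp-peeringdefault[some numbers] is in the inactive state.

2.3- If it is, delete it and create a TPU node again

Please let us know whether this works for you, so that we can close this ticket (if it does) or continue providing support (if it does not).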

For me, deleting cp-to-tp-peeringdefault and recreating the VM and TPU worked.
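Since the hang points at the network path between the VM and the TPU, a quick reachability check can help tell a broken peering apart from a TensorFlow problem. The following is a minimal sketch using only the Python standard library; the helper names parse_grpc_target and can_reach are hypothetical, and the example address is the one printed in the other answer's log. A successful TCP connection does not prove that the TPU is healthy, only that the port can be reached from the VM.

```python
import socket
from urllib.parse import urlparse


def parse_grpc_target(target):
    """Split a target such as 'grpc://10.240.1.2:8470' into (host, port)."""
    parsed = urlparse(target)
    if parsed.scheme != "grpc" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected grpc://host:port, got {target!r}")
    return parsed.hostname, parsed.port


def can_reach(target, timeout=5.0):
    """Return True if a TCP connection to the target opens within `timeout` seconds."""
    host, port = parse_grpc_target(target)
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: replace this with the value printed by resolver.master().
    target = "grpc://10.240.1.2:8470"
    print(target, "reachable" if can_reach(target) else "NOT reachable")
```

If this reports that the endpoint is not reachable, checking the peering state as described above is a reasonable next step before calling experimental_connect_to_cluster again.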

闻人宇定

2023-03-14

To reproduce, I used ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-8 --tf-version=2.1 to create the VM and TPU. Then I ran your code, and it succeeded.

taylanbil@taylanbil:~$ python3 run.py 
Running on TPU  grpc://10.240.1.2:8470
2020-04-28 19:18:32.597556: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-28 19:18:32.627669: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000189999 Hz
2020-04-28 19:18:32.630719: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x471b980 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-28 19:18:32.630759: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-28 19:18:32.665388: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.665439: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.683216: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.683268: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.690405: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:33355
taylanbil@taylanbil:~$ cat run.py 
import tensorflow as tf
TPU_name='taylanbil'
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

How did you create the TPU resources? Can you double-check that there is no version mismatch?
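One way to check for a mismatch is to compare the TensorFlow version installed on the VM with the version the TPU node was created with. The sketch below is only an illustration: the helper names major_minor and versions_match are hypothetical, and treating "same major.minor" as the matching criterion is an assumption rather than a documented rule. The client string would typically come from tf.__version__ and the TPU string from the --tf-version value passed to ctpu.

```python
def major_minor(version):
    """Return (major, minor) as integers from a version string such as '2.1.0'."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    return int(parts[0]), int(parts[1])


def versions_match(client_version, tpu_version):
    """Assumption: the versions are compatible when major and minor are equal."""
    return major_minor(client_version) == major_minor(tpu_version)


if __name__ == "__main__":
    # Assumption: on the VM, client_version would be tf.__version__.
    client_version = "2.1.0"
    tpu_version = "2.2"  # value given to --tf-version when creating the TPU
    if not versions_match(client_version, tpu_version):
        print(f"Possible mismatch: client {client_version}, TPU {tpu_version}")
```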
