

Grey Beards on Systems
Ray Lucchesi and others
Analyst defined systems

Feb 14, 2025 • 42min
169: GreyBeards talk AgenticAI with Luke Norris, CEO&Co-founder, Kamiwaza AI
Luke Norris (@COentrepreneur), CEO and Co-Founder of Kamiwaza AI, is a serial entrepreneur based in Silverthorne, CO, where the company is headquartered. Kamiwaza presented at AIFD6 a couple of weeks back and the GreyBeards thought it would be interesting to learn more about what they were doing, especially since we are broadening the scope of the podcast to become GreyBeards on Systems.
Describing Kamiwaza AI is a bit of a challenge. They settled on “AI orchestration” for the enterprise, but it’s much more than that. One of their key capabilities is an inference mesh, which supports accessing data in locations throughout an enterprise, across various data centers, to do inferencing, then gathering the replies/responses together and aggregating them into one combined response. All this without violating HIPAA, GDPR or other data compliance regulations.
Kamiwaza AI offers an opinionated AI stack, consisting of 155 components today and growing, that supplies a single API to access any of their AI services. They support multi-node clusters and multiple clusters, located in different data centers as well as the cloud. For instance, they are in the Azure marketplace and plan to be in AWS and GCP soon.
Most software vendors provide a proof of concept; Kamiwaza offers a pathway from PoC to production. Companies pre-pay to install the solution and can then apply those funds when they purchase a license.
And then there’s their (meta-)data catalogue. It resides in local databases (possibly replicated) throughout the clusters and is used to record metadata and location information about any data in the enterprise that’s been ingested into their system.
Data can be ingested for enterprise RAG databases and other services. As this is done, location affinity and metadata about that data are registered in the data catalogue. That way Kamiwaza knows where all of an organization’s data is located, which RAG or other database it’s been ingested into, and enough about the data to understand whether it might be pertinent to answering a customer or service query.
Maybe the easiest way to understand what Kamiwaza is, is to walk through a prompt.
A customer issues a prompt to a Kamiwaza endpoint which triggers,
A search through their data catalog to identify what data can be used to answer that prompt.
If all the data resides in one data center, the prompt can be handed off to the GenAI model and RAG services at that data center.
But if the prompt requires information from multiple data centers,
Separate prompts are then distributed to each data center where RAG information germane to that prompt is located.
As each of these generates a reply, the responses are sent back to an initiating/coordinating cluster.
Then all these responses are combined into a single reply to the customer’s prompt or service query.
But the key point is that the data used to answer the prompt is NOT moved out of the data center where it resides. All prompting is done locally, at the data center holding the data. Only prompt replies/responses are sent to other data centers and then combined into one comprehensive answer, as sketched below.
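To make the flow concrete, here’s a minimal Python sketch of that fan-out/aggregate pattern. The catalog contents, site names and helper functions are hypothetical illustrations of the idea, not Kamiwaza’s actual API.

```python
# Minimal sketch of the distributed-prompt pattern described above.
# The catalog contents, site names and query/summarize functions are
# hypothetical illustrations, not Kamiwaza's actual API.
import asyncio

# Hypothetical data catalog: which sites hold data relevant to which topics.
CATALOG = {
    "gene-frequency": ["us-east-dc", "eu-frankfurt-dc", "apac-singapore-dc"],
    "sales-history": ["us-east-dc"],
}

async def query_site(site: str, prompt: str) -> str:
    """Stand-in for running RAG + inference locally at one data center.
    Only the generated reply leaves the site; the underlying data never moves."""
    await asyncio.sleep(0.1)  # simulate remote inference latency
    return f"[{site}] partial answer to: {prompt}"

def summarize(replies: list[str]) -> str:
    """Stand-in for the coordinating cluster combining partial replies."""
    return "\n".join(replies)

async def answer(topic: str, prompt: str) -> str:
    sites = CATALOG.get(topic, [])
    if len(sites) == 1:                      # all data in one place: hand off directly
        return await query_site(sites[0], prompt)
    partials = await asyncio.gather(*(query_site(s, prompt) for s in sites))
    return summarize(list(partials))         # aggregate into one combined response

if __name__ == "__main__":
    print(asyncio.run(answer("gene-frequency",
                             "How often does sequence X occur?")))
```

The point of the sketch is simply that only the per-site replies cross data-center boundaries; the underlying regulated data stays put.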
Luke mentioned a BioPharma company that had genome sequences located in various data regimes, some under GDPR, some under APAC equivalents, others under USA HIPAA requirements. They wanted to know how frequently a particular gene sequence occurred. They were able to issue this as a prompt at a single location, which spun up separate, distributed prompts for each data center that held appropriate information. All those replies were then transmitted back to the originating prompt location and combined/summarized.
Kamiwaza AI also has an AIaaS offering. Any paying customer is offered one (AI agentic) outcome per month per cluster license. An outcome can effectively be any AI application the customer would like performed.
One outcome he mentioned included:
A weather-risk researcher had tons of old weather data in a multitude of formats, over many locations, that had been recorded over time.
They wanted access to all this data so they could tell when extreme weather events had occurred in the past.
Kamiwaza AI assigned one of their partner AI experts to work with the researcher to have an AI agent comb through these archives and transform and clean all the old weather data into HTML data more amenable to analysis.
But that was just the start. They really wanted to understand the risk of damage due to the extreme weather events. So the AI application/system was then directed to gather, from news and insurance archives, any information that identified the extent of the damage from those weather events.
He said that today’s Agentic AI can perform a screen mouse click and any function that an application or a human could do on a screen. Agentic AI can also import an API and infer where an API call might be better to use than a screen GUI interaction.
He mentioned that Kamiwaza can be used to generate and replace a lot of what enterprises do today with Robotic Process Automation (RPA). Luke feels that anything an enterprise was doing with RPA can be done better with Kamiwaza AI agents.
SaaS solution tasks are also something Agentic AI can easily displace. Luke said one customer went from using SAP APIs to feed information into SAP, to using APIs to extract information from SAP, to completely replacing the use of SAP for this task at the enterprise.
How much of this is fiction and how much is real is the subject of some debate in the industry. But Kamiwaza AI is pushing the envelope on what can and can’t be done. And with their AIaaS offering, customers are making use of AI like they never thought possible before.
Kamiwaza AI has a community edition, a free but functionally restricted download that provides a desktop experience of Kamiwaza AI’s stack. Luke sees this as something a developer could use to develop against Kamiwaza APIs and test functionality before loading it on the enterprise cluster.
We asked where they were finding the most success. Luke mentioned any organization that’s heavily regulated, where data movement and access are strictly constrained. And they are focused on large, multi-data-center enterprises.
Luke mentioned that Kamiwaza AI has been doing a number of hackathons with AI Tinkerers around the world. He suggested prospects take a look at what they have done with them and perhaps join them in the next hackathon in their area.
Luke Norris, CEO & Co-Founder, Kamiwaza AI
Luke Norris is the co-founder of Kamiwaza.AI, driving enterprise AI innovation with a focus on secure, scalable GenAI deployments. He has extensive experience raising over $100M in venture capital and leading global AI/ML deployments for Fortune 500 companies.
Luke is passionate about enabling enterprises to unlock the full potential of AI with unmatched flexibility and efficiency.

Dec 30, 2024 • 41min
168: GreyBeards Year End 2024 podcast
It’s time once again for our annual year-end GBoS podcast. This year we have Howard back making a guest appearance, with our usual cast of Jason and Keith in attendance. And the topic du jour seemed to be AI rolling out to the enterprise and everywhere else in the IT world.
We led off with the same topic as last year, AI (again), but then it was all about new announcements, new capabilities and new functionality. This year it’s all about starting to take AI tools and functionality and make them available to help optimize how organizations operate.
We talked some about RAGs and Chatbots but these seemed almost old school.
Agentic AI
Keith mentioned Agentic AI which purports to improve businesses by removing/optimizing intermediate steps in business processes. If one can improve human and business productivity by 10%, the impact on the US and world’s economies would be staggering.
And we’re not just talking about knowledge summarization, curation, or discussion, agentic AI takes actions that would have been previously done by a human, if done at all.
Manufacturers could use AI agents to forecast sales, allowing the business to optimize inventory positioning to better address customer needs.
Most, if not all, businesses have elaborate procedures which require a certain amount of human hand holding. Reducing human hand holding, even a little bit, with AI agents that never sleep and can occasionally be trained to do better, could seriously help the bottom and top lines of any organization.
We can see evidence of Agentic AI proliferating in SaaS solutions, e.g., Salesforce, SAP, Oracle and others are all spinning out Agentic AI services.
I think it was Jason that mentioned GEICO, a US insurance company, is re-factoring, re-designing and re-implementing all their applications to take advantage of Agentic AI and other AI options.
AI’s impact on HW & SW infrastructure
The AI rollout is having dramatic impacts on both software and hardware infrastructure. For example, customers are building their own OpenStack clouds to support AI training and inferencing.
Keith mentioned that AWS just introduced S3 Tables, a fully managed service meant to store and analyze massive amounts of tabular data for analytics. Howard mentioned that AWS’s S3 Tables had to make a number of tradeoffs to use immutable S3 object storage. VAST’s Parquet database provides the service without using immutable objects.
Software impacts are immense as AI becomes embedded in more and more applications and system infrastructure. But AI’s hardware impacts may be even more serious.
Howard made mention of the power zero-sum game, meaning that most data centers have a limited amount of power they can supply. Any power saved from other IT activities is immediately put to use supplying more power to AI training and inferencing.
Most IT racks today support equipment that consumes 10-20kW of power. AI servers will require much more.
Jason mentioned one 6U server with 8 GPUs that costs on the order of 1 Ferrari ($250K US) and draws 10kW of power, with each GPU having two 400GbE links, not to mention the server itself having two 400GbE links. So a single 6U (GPU) server has 18 400GbE links, or could need 7.2Tbps of bandwidth.
Unclear how many of these one could put in a rack, but my guess is it’s not going to be fully populated. 6 of these servers would need >42Tbps of bandwidth and over 60kW of power, and that’s not counting the networking and other infrastructure required to support all that bandwidth.
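For the curious, the back-of-the-envelope math works out like this (a quick sketch using the figures quoted above; the six-servers-per-rack population is my assumption):

```python
# Back-of-the-envelope rack math from the numbers discussed above.
GPU_LINKS = 8 * 2          # 8 GPUs, two 400GbE links each
SERVER_LINKS = 2           # plus two 400GbE links for the server itself
LINK_GBPS = 400
SERVER_KW = 10

links_per_server = GPU_LINKS + SERVER_LINKS            # 18 links
tbps_per_server = links_per_server * LINK_GBPS / 1000  # 7.2 Tbps

servers_per_rack = 6                                    # assumed partial rack population
print(f"bandwidth/rack: {servers_per_rack * tbps_per_server:.1f} Tbps")  # 43.2 Tbps
print(f"power/rack:     {servers_per_rack * SERVER_KW} kW")              # 60 kW
```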
Speaking of other infrastructure, cooling is the other side of this power problem. It’s just thermodynamics: power use generates heat, and that heat needs to be disposed of. And with 10kW servers we are talking a lot of heat. Jason mentioned that at this year’s SC24 conference, the whole floor was showing off liquid cooling. Liquid cooling was also prominent at OCP.
At the OCP summit this year Microsoft was talking about deploying 150kW racks near term and 1MW racks down the line. AI’s power needs are why organizations around the world are building out new data centers in out-of-the-way places that just so happen to have power and cooling nearby.
Organizations have an insatiable appetite for AI training data. And good (training) data is getting harder to find. Solidigm’s latest 122TB SSD may be coming along just when the data needs for AI are starting to take off.
SCI is pivoting
We could have gone on for hours on AI’s impact on IT infrastructure, but I had an announcement to make.
Silverton Consulting will be pivoting away from storage to a new opportunity that is based in space. I discuss this on SCI’s website but the opportunities for LEO and beyond services are just exploding these days and we want to be a part of that.
What that means for GBoS is TBD. But we may be transitioning to something broader than just storage. But heck, we have been doing that for years.
Stay tuned, it’s going to be one hell of a ride.
Jason Collier, Principal Member Of Technical Staff at AMD, Data Center and Embedded Solutions Business Group
Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.
He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.
He’s on LinkedIn. He’s currently working with AMD on new technology and he has been a GreyBeards on Storage co-host since the beginning of 2022.
Howard Marks, Technologist Extraordinary and Plenipotentiary at VAST Data
Howard Marks is Technologist Extraordinary and Plenipotentiary at VAST Data, where he explains engineering to customers and customer requirements to engineers.
Before joining VAST, Howard was an independent consultant, analyst, and journalist, writing three books and over 200 articles on network and storage topics since 1987 and, most significantly, a founding co-host of the Greybeards on Storage podcast.
Keith Townsend, President of The CTO Advisor, a Futurum Group Company
Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations. Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

Nov 4, 2024 • 52min
167: GreyBeards talk Distributed S3 storage with Enrico Signoretti, VP Product & Partnerships, Cubbit
Long time friend Enrico Signoretti (LinkedIn), VP Product and Partnerships, Cubbit, used to be a common participant at Storage Field Day (SFD) events and I’ve known him since we first met there. Since then, he’s worked for a startup and a prominent analyst firm. But he’s back at another startup and this one looks like it’s got legs.
Cubbit offers distributed, S3-compatible object storage that provides geo-distribution and geo-fencing for object data, in which the organization owns the hardware and Cubbit supplies the software. There’s a management component, the Coordinator, which can run on your hardware or as a SaaS service they provide, but other than that, IT controls the rest of the system hardware. Listen to the podcast to learn more.
Cubbit comes in 3 components:
One or more Storage nodes, which include their agent software running on top of a Linux system with direct attached storage.
One or more Gateway nodes, which provide S3 protocol access to the objects stored on storage nodes. A typical S3 access point, e.g. https://s3.company_name.com/…, points to either a load balancer front end or one or more Gateway nodes. Gateway nodes provide the mapping between the bucket name/object identifier and where the data currently resides or will reside.
One Coordinator node which provides the metadata to locate the data for objects, manage the storage nodes, gateways and monitor the service. The Coordinator node can be a SaaS service supplied by Cubbit or a VM/bare metal node running Cubbit Coordinator software. Metadata is protected internally within the Coordinator node.
With these three components one can stand up a complete, geo-distributed/geo-fenced, S3 object storage system which the organization controls.
Cubbit encrypts data as it’s ingested at the gateway and decrypts data when accessed. Sign-on to the system uses standard security offerings. Security keys can be managed by Cubbit or by standard key management systems.
All data for an object is protected by nested erasure codes. That is, 1) erasure coding within a data center/location, over its storage drives, and 2) erasure coding across geographical locations/data centers.
With erasure coding across locations, a customer with, say, 10 data center locations can have their data stored in such a fashion that as long as at least 8 data centers are online, they still have access to their data; that is, the Cubbit storage system can still provide data availability.
Similarly, for erasure coding within the data center/location, across storage drives, say with 12 drives per stripe, one could configure, let’s say, 9+3 erasure coding, where as long as 9 of the drives still operate, data will be available.
Please note the customer decides the number of locations to stripe across for erasure coding, and ditto for the number of storage drives.
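A quick sketch of that nested erasure-coding arithmetic, using the example figures above (the data/parity splits are the examples from our discussion, not Cubbit defaults; customers pick their own):

```python
# Sketch of the nested erasure-coding arithmetic described above.
# The (data, parity) values are the examples from the discussion,
# not Cubbit defaults; customers choose their own.
def tolerance(data: int, parity: int) -> str:
    total = data + parity
    return (f"{data}+{parity}: {total} pieces, survives loss of "
            f"up to {parity}, overhead {total / data:.2f}x")

# Across locations: 10 data centers, any 8 must remain online (8+2).
print("geo  ", tolerance(8, 2))
# Within a location: 12-drive stripes, any 9 must survive (9+3).
print("local", tolerance(9, 3))
```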
The customer supplies all the storage node hardware. Some customers start with re-purposed servers/drives for their original configuration and then upgrade to higher performing storage-servers-networking as performance needs change. Storage nodes can be on prem, in the cloud or at the edge.
For adequate performance gateways and storage nodes (and coordinator nodes) should be located close to one another. Although Coordinator nodes are not in the data path they are critical to initial object access.
Gateways can provide a cache for faster local data access. Cubbit has recommendations for Gateway server hardware. And similar to storage nodes, Gateways can operate at the edge, in the cloud or on prem.
Use cases for the Distributed S3 storage include:
As a backup target for data elsewhere
As a geographically distributed/fenced object store.
As a locally controlled object storage to feed AI training/inferencing activity.
Most backup solutions support S3 object storage as a target for backups.
Geographically distributed S3 storage means that customers control where object data is located. This could be split across a number of physical locations, the cloud or at the edge.
Geographically fenced S3 storage means that the customer controls which of its many locations to store an object. For GDPR countries with multi-nation data center locations this could provide the compliance requirements to keep customer data within country.
Cubbit’s distributed S3 object storage is strongly consistent, in that an object loaded into the system at any location is immediately available to any user accessing it through any other gateway. Access times vary, but the data will be the same regardless of where you access it from.
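Because the service speaks standard S3, a read-after-write check across two gateways is easy to demonstrate with the ordinary AWS SDK. The endpoint URLs, bucket name and credentials below are placeholders for your own deployment, not real Cubbit endpoints:

```python
# Quick read-after-write check against two different gateways, using the
# standard AWS SDK (boto3). Endpoint URLs, bucket and credentials are
# hypothetical placeholders for your own deployment.
import boto3

creds = dict(aws_access_key_id="ACCESS_KEY", aws_secret_access_key="SECRET_KEY")
gw_milan = boto3.client("s3", endpoint_url="https://s3-milan.example.com", **creds)
gw_paris = boto3.client("s3", endpoint_url="https://s3-paris.example.com", **creds)

# Write through one gateway...
gw_milan.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"written in Milan")

# ...and read it back immediately through another. With strong consistency
# the object is visible right away, regardless of which gateway serves it.
body = gw_paris.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read()
assert body == b"written in Milan"
```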
The system starts up through an Ansible playbook which asks a bunch of questions and loads and sets up the agent software for storage nodes, gateway nodes and where applicable, the coordinator node.
At any time, customers can add more gateways or storage nodes or retire them. The system doesn’t perform automatic load balancing for new nodes, but customers can migrate data off storage nodes and onto other ones through API calls/UI requests to the Coordinator.
Cubbit storage supports multi-tenancy, so MSPs can offer their customers isolated access.
Cubbit charges for their service based on data storage under management. Note there are no egress charges, and you don’t pay for redundancy. But you do supply all the hardware used by the system. They offer a discount for media & entertainment (M&E) customers, as their metadata-to-data ratio is much smaller (lots of large files) than most other S3 object stores (a mix of small and large files).
Cubbit is presently available only in Europe but will be coming to the USA next year. So, if you are interested in geo-distributed/geo-fenced S3 object storage that you control and that can be had for much cheaper than hyperscaler object storage, check it out.
Enrico Signoretti, VP Products & Partnerships
Enrico Signoretti has over 30 years of experience in the IT industry, having held various roles including IT manager, consultant, head of product strategy, IT analyst, and advisor.
He is an internationally renowned visionary author, blogger, and speaker on next-generation technologies. Over the past four years, Enrico has kept his finger on the pulse of the evolving storage industry as the Head of Research Product Strategy at GigaOm. He has worked closely and built relationships with top visionaries, CTOs, and IT decision makers worldwide.
Enrico has also contributed to leading global online sites (with over 40 million readers) for enterprise technology news.

Sep 25, 2024 • 41min
166: Greybeard talks MLperf Storage benchmark with Michael Kade, Sr. Solutions Architect, Hammerspace
Sponsored By:
This is the first time we have talked with Hammerspace and Michael Kade (Hammerspace on X), Senior Solutions Architect. We have known about Hammerspace for years now and over the last couple of years, as large AI clusters have come into use, Hammerspace’s popularity has gone through the roof.
Mike’s been benchmarking storage for decades now and recently submitted results for MLperf Storage v1.0, an AI benchmark that focuses on storage activity for AI training and inferencing work. We have written previously on v0.5 of the benchmark, (see: AI benchmark for storage, MLperf Storage). Listen to the podcast to learn more.
Some of the changes between v0.5 and v1.0 of MLperf’s Storage benchmark include:
Workload changes: they dropped BERT NLP, kept U-net3D (3D volumetric object detection) and added ResNet-50 and CosmoFlow. ResNet-50 is a (2D) image object detection model and CosmoFlow uses a “3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe.” Both ResNet-50 and CosmoFlow are TensorFlow batch inferencing activities. U-net3D is a PyTorch training activity.
Accelerator (GPU simulation) changes: they dropped V100 and added A100 and H100 emulation to the benchmarks.
MLperf Storage benchmarks have to be run 5 times in a row and results reported are the average of the 5 runs. Metrics include samples/second (~files processed/second), overall storage bandwidth (MB/sec) and number of accelerators kept busy during the run (90% busy for U-net3D & ResNet-50 and 70% for CosmoFlow).
Hammerspace submitted 8 benchmarks: 2 workloads (U-net3D & ResNet-50) X 2 accelerators (A100 & H100 GPUs) X 2 client configurations (1 & 5 clients). Clients are workstations that perform training or inferencing work for the models and can be any size. GPUs or accelerators are not physically used during the benchmark; they are simulated as compute (dead) time that depends on the workload and GPU type (note this doesn’t depend on client size).
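As a rough illustration of the scoring rules described above (my simplified reading, with made-up numbers, not actual MLperf tooling or Hammerspace results):

```python
# Sketch of MLPerf Storage scoring as described above: each configuration
# is run 5 times, results are averaged, and a result only counts if the
# simulated accelerators are kept busy enough (90% for U-Net3D and
# ResNet-50, 70% for CosmoFlow). All numbers below are made up.
from statistics import mean

UTIL_THRESHOLD = {"unet3d": 0.90, "resnet50": 0.90, "cosmoflow": 0.70}

def score(workload: str, runs: list[dict]) -> dict:
    assert len(runs) == 5, "benchmark requires 5 consecutive runs"
    return {
        "samples_per_sec": mean(r["samples_per_sec"] for r in runs),
        "bandwidth_MBps": mean(r["bandwidth_MBps"] for r in runs),
        "valid": mean(r["accel_util"] for r in runs) >= UTIL_THRESHOLD[workload],
    }

# Made-up example: 5 ResNet-50 runs on simulated accelerators.
runs = [{"samples_per_sec": 52_000, "bandwidth_MBps": 7_400, "accel_util": 0.93}] * 5
print(score("resnet50", runs))
```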
Hammerspace also ran their benchmarks with 5 and 22 DSX storage servers. Storage configurations matter for MLperf storage benchmarks and for v0.5, storage configurations weren’t well documented. V1.0 was intended to fix this but it seems there’s more work to get this right.
For ResNet-50 inferencing, Hammerspace drove 370 simulated A100s and 135 simulated H100s and for U-net3D training, Hammerspace drove 35 simulated A100s and 10 simulated H100s. Storage activity for training demands a lot more data than inferencing.
It turns out that training IO also includes checkpointing (which occasionally writes out the model to save it in case of run failure). But the rest of the training IO is essentially sequential. Inferencing has much more randomized IO activity to it.
Hammerspace is a parallel file system (PFS) which uses NFSv4.2. NFSv4.2 is available natively in the Linux kernel. The main advantages of a PFS are that IO activity can be parallelized by spreading it across many independent storage servers and that data can move around without operational impact.
Mike ran their benchmarks in AWS. I asked about cloud noisy neighbors and networking congestion and he said that, if you ask for a big enough (EC2) instance, high speed networks come with it, and noisy neighbors and networking congestion are not a problem.
Michael Kade, Senior Solutions Architect, Hammerspace
Michael Kade has over 45 years of history in the computer industry and over 35 years of experience working with storage vendors. He has held various positions with EMC, NetApp, Isilon, Qumulo, and Hammerspace.
He specializes in writing software that bridges different vendors and allows their software to work harmoniously together. He also enjoys benchmarking and discovering new ways to improve performance through the correct use of software tuning.
In his free time, Michael has been a helicopter flight instructor for over 25 years for EMS.

Sep 9, 2024 • 46min
165: GreyBeard talks VMware Explore’24 wrap-up with Gina Rosenthal, Founder&CEO Digital Sunshine Solutions
I’ve known Gina Rosenthal (@gminks@mas.to), Founder & CEO, Digital Sunshine Solutions, for what seems like forever, and she’s been on the very short list for being a GBoS co-host, but she’s got her own Tech Aunties Podcast now. We were both at VMware Explore last week in Vegas. Gina was working in the community hub and I was in their analyst program.
VMware (World) Explore has changed a lot since last year. I found the presentations/sessions to be just as insightful and full of users as last year’s, but it seems like there may have been fewer of them. Gina found the community hub sessions to be just as busy and the Code groups were also very well attended. On the other hand, the Expo was smaller than last year and there were a lot fewer participants (and [maybe] analysts) at the show. Listen to the podcast to learn more.
The really big news was VCF 9.0. Both a new number (for VCF) and an indicator of a major change in direction for how VMware functionality will be released in the future. As one executive told me, VCF has now become the main (release) delivery vehicle for all functionality.
In the past, VCF would generally come out with some VMware functionality downlevel to what was generally available in the market. With VCF 9, that’s going to change. From now on, all individual features/functions of VCF 9.0 will be at the current VMware functionality levels. Gina mentioned this is a major change in how VMware releases functionality, and it signals much better product integration than was available in the past.
Much of VMware’s distinct functionality has been integrated into VCF 9, including SDDC, Aria and other packages. They did, however, create a new class of “advanced services” that run on top of VCF 9. We believe these are individually charged for, and some of these advanced services include:
Private AI Foundation – VMware VCF, with their Partner NVIDIA, using NVIDIA certified servers, can now run NVIDIA Enterprise AI suite of offerings which includes just about anything an enterprise needs to run GenAI in house or any other NVIDIA AI service for that matter. The key here is that all enterprise data stays within the enterprise AND the GenAI runs on enterprise (VCF) infrastructure. So all data remains private.
Container Operations – this is a bundling of all the Spring Cloud and other Tanzu container services. It’s important to note that TKG (Tanzu Kubernetes Grid) is still part of the base vSphere release, which allows any VVF (VMware vSphere Foundation) or VCF user to run K8S standalone, but with minimal VMware support services.
Advanced Security – includes vDefend firewall/gateway, WAF, malware prevention, etc.
There were others, but we didn’t discuss them on the podcast.
I would have to say that Private AI was of most interest to me and many other analysts at the show. In fact, I heard that it’s VMware’s (and supposedly NVIDIA’s) intent to reach functional parity with GCP Vertex and others with Private AI. This could come as soon as VCF 9.0 is released. I pressed them on this point and they held firm to that release number.
My only doubt is that neither VMware nor NVIDIA has their own LLM. Yes, they can use Meta’s Llama 3.1, OpenAI or any other LLM on the market. But running them in-house on enterprise VCF servers is another question.
The lack of an “owned” LLM should present some challenges with reaching functional parity with organizations that have one. On the other hand, Chris Walsh mentioned that they (we believe VMware internal AI services) have been able to change their LLM 3 times over the last year using Private AI Foundation.
Chris repeated more than once that VMware’s long history with DRS and HA makes VCF 9 Private AI Foundation an ideal solution for enterprises to run AI workloads. He specifically mentioned GPU HA that can take GPUs from data scientists when enterprise inferencing activities suffer GPU failures. It’s unclear whether any other MLops environment, cloud or otherwise, can do the same.
From a purely storage perspective, I heard a lot about vVols 2.0. This is less a functional enhancement than a new certification to make sure primary storage vendors offer full vVol support in their storage.
Gina mentioned, and it came up in the analyst sessions, that Broadcom has stopped offering discounts for charities and non-profits. This is going to hurt most of those organizations, which are now forced to make a choice: pay full subscription costs or move off VMware.
The other thing of interest was that Broadcom spent some time trying to smooth over the bad feelings of VMware’s partners. There was a special session on “Doing business with Broadcom VMware for partners” but we both missed it, so we can’t report any details.
Finally, Gina and I, given our (lengthy) history in the IT industry and Gina’s recent attendance at IBM Share started hypothesizing on a potential linkup between Broadcom’s CA and VMware offerings.
I mentioned multiple times there wasn’t even a hint of the word “mainframe” during the analyst program. We probably spent more time discussing this than we should have, but it’s hard to take the mainframe out of IT (as most large enterprises no doubt lament).
Gina Rosenthal, Founder & CEO, Digital Sunshine Solutions
As the Founder and CEO of Digital Sunshine Solutions, Gina brings over a decade of expertise in providing marketing services to B2B technology vendors. Her strong technical background in cloud computing, SaaS, and virtualization enables her to offer specialized insights and strategies tailored to the tech industry.
She excels in communication, collaboration, and building communities. These skills help her create product positioning, messaging, and content that educates customers and supports sales teams. Gina breaks down complex technical concepts and turns them into simple, relatable terms that connect with business goals.
She is the co-host of The Tech Aunties podcast, where she shares thoughts on the latest trends in IT, especially the buzz around AI. Her goal is to help organizations tackle the communication and organizational challenges associated with modern datacenter transitions.

Aug 16, 2024 • 43min
164: GreyBeards talk FMS24 Wrap-up with Jim Handy, General Dir., Objective Analysis
Jim Handy, General Director, Objective Analysis, is our long-time go-to guy on SSD and memory technologies, and we were both at the FMS (Future of Memory and Storage – new name/broader focus) 2024 conference last week in Santa Clara, CA. Lots of new SSD technology both on and off the show floor as well as new memory offerings and more.
Jim helps Jason and me understand what’s happening with NAND and other storage/memory technologies that matter to today’s IT infrastructure. Listen to the podcast to learn more.
First off, I heard at the show that the race for more (3D NAND) layers is over. According to Jim, companies are finding it’s more expensive to add layers than it is just to do a lateral (2D, planar) shrink (adding more capacity per layer).
One vendor mentioned that CapEx efficiencies were degrading as they added more layers. Nonetheless, I saw more than one slide at the show with a “3xx” layers column.
Kioxia and WDC introduced a 218 layer, BICS8 NAND technology with 1Tb TLC and up to 2Tb QLC NAND per chip. Micron announced a 233 layer Gen 9 NAND chip.
Some vendor showed a 128TB (QLC) SSD drive. The challenge with PCIe Gen 5 is that it’s limited to 4GB/sec per lane and for 16 lanes, that’s 64GB/s of bandwidth and Gen 4 is half that. Jim called using Gen 4/Gen 5 interfaces for a 128TB SSD like using a soda straw to get to data.
The latest Kioxia 2Tb QLC chip is capable of 3.6Gbps (source: Kioxia America) and with (4*128 or) 512 of these 2Tb chips needed to create a 128TB drive that’s ~230GB/s of bandwidth coming off the chips being funneled down to 16X PCIe Gen5 64GB/s of bandwidth, wasting ~3/4ths of chip bandwidth.
Of course they need (~1.3x?) more than 512 chips to make a durable/functioning 128TB drive, which would only make this problem worse. And I saw one slide that showed a 240TB SSD!
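Working through Jim’s soda-straw math with the figures above (over-provisioning ignored for simplicity):

```python
# The 128TB-drive bandwidth mismatch, worked out from the figures above.
CHIP_Tb, CHIP_Gbps = 2, 3.6          # Kioxia QLC chip: capacity and interface speed
DRIVE_TB = 128
PCIE_GEN5_X16_GBps = 16 * 4          # 64 GB/s

chips = DRIVE_TB * 8 // CHIP_Tb                    # 512 chips (ignoring over-provisioning)
chip_GBps = chips * CHIP_Gbps / 8                  # ~230 GB/s coming off the NAND
print(f"{chips} chips -> {chip_GBps:.0f} GB/s vs {PCIE_GEN5_X16_GBps} GB/s host interface")
print(f"fraction of chip bandwidth usable: {PCIE_GEN5_X16_GBps / chip_GBps:.0%}")  # ~28%
```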
Enough on bandwidth, let’s talk data growth. Jason’s been doing some research and had current numbers on data growth. According to his research, the world’s data (maybe data transmitted over the internet) in 2010 was 2ZB (ZB, zettabytes = 10^21 bytes), in 2023 it was 120ZB, and by 2025 it should be 180ZB. For 2023, that’s over 328 million TB/day, or 328EB/day (EB, exabytes = 10^18 bytes).
Jason said ~54% of this is video. He attributes the major data growth spurt since 2010 mainly to social media videos.
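A quick sanity check of those figures (unit conversions only, using the numbers Jason quoted):

```python
# Quick check of the daily-data figure quoted above.
ZB_2023 = 120                      # zettabytes generated/transmitted in 2023
EB_per_ZB = 1_000                  # 10^21 / 10^18
TB_per_EB = 1_000_000              # 10^18 / 10^12

eb_per_day = ZB_2023 * EB_per_ZB / 365
print(f"{eb_per_day:.0f} EB/day")                            # ~329 EB/day
print(f"{eb_per_day * TB_per_EB / 1e6:.0f} million TB/day")  # ~329 million TB/day
print(f"of which video (~54%): {0.54 * eb_per_day:.0f} EB/day")
```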
Jason also mentioned that the USA currently (2023?) had 5,388 data centers, Germany 522, UK 517, and China 448. That last number seems way low to all of us but they could just be very, very big data centers.
No mention of average data center size (meters^2, # servers, # GPUs, storage size, etc.). But we know, because of AI, they are getting bigger and more power hungry.
There were more FMS 2024 topics discussed, like the continuing interest in TLC SSDs, new memory offerings, computational storage/memory, etc.
Jim Handy, General Director, Objective Analysis
Jim Handy of Objective Analysis has over 35 years in the electronics industry, including 20 years as a leading semiconductor and SSD industry analyst. Early in his career he held marketing and design positions at leading semiconductor suppliers including Intel, National Semiconductor, and Infineon.
A frequent presenter at trade shows, Mr. Handy is known for his technical depth, accurate forecasts, widespread industry presence and volume of publication.
He has written hundreds of market reports, articles for trade journals, and white papers, and is frequently interviewed and quoted in the electronics trade press and other media.
He posts blogs at www.TheMemoryGuy.com, and www.TheSSDguy.com

Apr 2, 2024 • 48min
163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD
Dr J Metz, (@drjmetz, blog) has been on our podcast before mostly in his role as SNIA spokesperson and BoD Chair, but this time he’s here discussing some of his latest work on the Ultra Ethernet Consortium (UEC) (LinkedIN: @ultraethernet, X: @ultraethernet)
The UEC is a full stack re-think of what Ethernet could do for large single application environments. UEC was originally focused on HPC, with 400-800 Gbps networks and single applications like simulating a hypersonic missile or airplane. But with the emergence of GenAI and LLMs, UEC could also be very effective for large AI model training with massive clusters doing a single LLM training job over months. Listen to the podcast to learn more.
The UEC is outside the realm of normal enterprise environments. But as AI training becomes more ubiquitous, who knows, UEC may yet find a place in the enterprise. However, it’s not intended for mixed network environments with multiple applications. It’s a single-application network.
One wouldn’t think HPC was a big user of Ethernet for its main network. But Dr. J pointed out that the top 3 systems of the HPC Top500 all use Ethernet, and more are looking to use it in the future.
UEC is essentially an optimized software stack and hardware for networking used by single application environments. These types of workloads are constantly pushing the networking envelope. And by taking advantage of the “special networking personalities” of these workloads, UEC can significantly reduce networking overheads, boosting bandwidth and workload execution.
The scale of these networks is extreme. The UEC is targeting up to a million endpoints, over >100K servers, with each network link >100Gbps and more likely 400-800Gbps. With the new (AMD and other) networking cards coming out that support 4 400/800Gbps network ports, having a pair of these on each server in a 100K-server cluster gives one 800K endpoints. A million is not that far away when you think of it at that scale.
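The endpoint arithmetic behind that statement is simple enough to spell out (server and port counts as quoted above):

```python
# The endpoint arithmetic behind the "a million is not far away" comment.
servers = 100_000
nics_per_server = 2        # a pair of the new 4-port NICs per server
ports_per_nic = 4          # each port at 400-800 Gbps

endpoints = servers * nics_per_server * ports_per_nic
print(f"{endpoints:,} network endpoints")        # 800,000
```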
Moreover, LLM training and HPC work are starting to look more alike these days. Yes, there are differences, but the scale of their clusters is similar, and the way work is sometimes fed to them is similar, which leads to similar networking requirements.
UEC is attempting to handle a 5% problem. That is 95% of the users will not have 1M endpoints in their LAN, but maybe 5% will and for these 5%, a more mixed networking workload is unnecessary. In fact, a mixed network becomes a burden slowing down packet transmission.
UEC is finding that with a few select networking parameters, almost like workload fingerprints, network stacks can be much more optimized than current Ethernet and thereby support reduced packet overheads, and more bandwidth.
AI and HPC networks share a very limited set of characteristics which can be used as fingerprints. These characteristics are things like reliable or unreliable transport, ordered or unordered delivery, multi-path packet spraying or not, etc. With a set of these parameters, selected for an environment, UEC can optimize a network stack to better support a million networking endpoints.
We asked where CXL fits in with UEC. Dr. J said it could potentially be an entity on the network, but he sees CXL more as a within-server or tight (limited) cluster-of-servers solution, rather than something on a UEC network.
Just 12 months ago the UEC had 10 members or so and this past week they were up to 60. UEC seems to have struck a chord.
The UEC plans to release a 1.0 specification, near the end of this year. UEC 1.0 is intended to operate on current (>100Gbps) networking equipment with firmware/software changes.
Considering the UEC was just founded in 2023, putting out their 1.0 technical spec within 1.5 years is astonishing. It also speaks volumes about the interest in the technology.
The UEC has a blog post which talks more about UEC 1.0 specification and the technology behind it.
Dr J Metz, Chair of UEC Steering Committee, Chair of SNIA BoD, Technical Director of Systems Design, AMD
J works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.
J has previously held roles in both startups and Fortune 100 companies as a Field CTO, R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.
J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

Feb 21, 2024 • 42min
162: GreyBeards talk cold storage with Steffen Hellmold, Dir. Cerabyte Inc.
Steffen Hellmold, Director, Cerabyte Inc. is extremely knowledgeable about the storage device business. He has worked for WDC in storage technology and possesses an in-depth understanding of tape and disk storage technology trends.
Cerabyte, a German startup, is developing cold storage. Steffen likened Cerabyte storage to the punch cards that dominated IT and pre-IT over much of the last century, only rendered in ceramic. Once cards were punched, they created near-WORM storage that could be obliterated or shredded but was very hard to modify. Listen to the podcast to learn more.
Cerabyte uses a unique combination of semiconductor (lithographic) technology, ceramic coated glass, LTO tape (form factor) cartridge and LTO automation in their solution. So, for the most part, their critical technologies all come from somewhere else.
Their main technology uses a laser-lithographic process to imprint onto a sheet (ceramic coated glass) a data page (block?). There are multiple sheets in each cartridge.
Their intent is to offer a robotic system (based on LTO technology) to retrieve and replace their multi-sheet cartridges and mount them in their read-write drive.
As mentioned above, the write operation is akin to a lithographic data encoded mask that is laser imprinted on the glass. Once written, the data cannot be erased. But it can be obliterated, by something akin to writing all ones or it can be shredded and recycled as glass.
The read operation uses a microscope and camera to take scans of the sheet’s imprint and convert that into data.
Cerabyte’s solution is cold or ultra-cold (frozen) storage. If LTO robotics are any indication, a Cerabyte cartridge with multiple sheets can be presented to a read-write drive in a matter of seconds. However, extracting the appropriate sheet from a cartridge and mounting it in a read-write drive will take more time. But this may be similar in time to an LTO tape leader being threaded through a tape drive, again a matter of seconds.
Steffen didn’t supply any specifications on how much data could be stored per sheet other than to say it’s on the order of many GB. He did say that both sides of a Cerabyte sheet could be recording surfaces.
With their current prototype, an LTO form factor cartridge holds fewer than 5 sheets of media, but they are hoping they can get this to 100 or more in time.
We talked about the history of disk and tape storage technology. Steffen is convinced (as are many in the industry) that disk-tape capacity increases have slowed over time and that this is unlikely to change. I happen to believe that storage density increases tend to happen in spurts, as new technology is adopted and then trails off as that technology is built up. We agreed to disagree on this point.
Steffen predicted that Cerabyte will be able to cross over disk cost/capacity this decade and LTO cost/capacity sometime in the next decade.
We discussed the market for cold and frozen storage. Steffen mentioned that the Office of the Director of National Intelligence (ODNI) has tasked the National Academies of Sciences, Engineering, and Medicine to conduct a rapid expert consultation on large-scale cold storage archives. And that most hyperscalers have use for cold and frozen storage in their environments and some even sell this (Glacier storage) to their customers.
The Library of Congress and similar entities in other nations are also interested in digital preservation that cold and frozen technology could provide. He also thinks that medical is a prime market that is required to retain information for the life of a patient. IBM, Cerabyte, and Fujifilm co-sponsored a report on sustainable digital preservation.
And of course, the media libraries of some entertainment companies represent a significant asset that, if on tape, has to be re-hosted every 5 years or so. Steffen and much of the industry are convinced that a sizeable market for cold and frozen storage exists.
I mentioned that long archives suffer from data format drift (data formats are no longer supported). Steffen mentioned there’s also software version drift (software that processed that data is no longer available/runnable on current OSs). And of course the current problem with tape is media drift (LTO media formats can be read only 2 versions back).
Steffen seemed to think format and software drift are industry-wide problems and are being worked on. Cerabyte seems to have a great solution for media drift, as its media can be read with a microscope. And the (ceramic coated glass) media has a predicted life of 100 years or more.
I mentioned the “new technology R&D” problem. Historically, as new storage technologies have emerged, they have always ended up being left behind (in capacity) because disk-tape-NAND R&D ($Bs each) outspends them. Steffen said it’s certainly NOT $Bs of R&D for tape and disk.
Steffen countered by saying that all storage technology R&D spending pales in comparison to semiconductor R&D spending focused on reducing feature size. And as Cerabyte uses semiconductor technologies to write data, sheet capacity is directly a function of semiconductor technology. So, Cerabyte’s R&D technology budget should not be a problem. And in fact they have been able to develop their prototype, with just $7M in funding.
Steffen mentioned there is an upcoming Storage Technology Showcase conference in early March which Cerabyte will attend.
Steffen Hellmold, Director, Cerabyte Inc.
Steffen has more than 25 years of industry experience in product, technology, business & corporate development as well as strategy roles in semiconductor, memory, data storage and life sciences.
He served as Senior Vice President, Business Development, Data Storage at Twist Bioscience and held executive management positions at Western Digital, Everspin, SandForce, Seagate Technology, Lexar Media/Micron, Samsung Semiconductor, SMART Modular and Fujitsu.
He has been deeply engaged in various industry trade associations and standards organizations including co-founding the DNA Data Storage Alliance in 2020 as well as the USB Flash Drive Alliance, serving as their president from 2003 to 2007.
He holds an economic electrical engineering degree (EEE) from the Technical University of Darmstadt, Germany.

Jan 19, 2024 • 48min
161: Greybeards talk AWS S3 storage with Andy Warfield, VP Distinguished Engineer, Amazon
In this episode, Andy Warfield, VP Distinguished Engineer at Amazon and expert in data storage, discusses the evolution and advancements of AWS S3. He sheds light on S3 Express and One Zone storage, which promise lower response times. Andy dives into the role of S3 in supporting generative AI and the complexities of file versus object storage. With insights from his teaching background, he explains the importance of durability in data storage and highlights innovations that enhance operator experience and efficiency in various industries.

Jan 4, 2024 • 23min
160: GreyBeard talks data security with Jonathan Halstuch, Co-Founder & CTO, RackTop Systems
Sponsored By:
This is the last in this year’s GreyBeards-RackTop Systems podcast series and once again we are talking with Jonathan Halstuch (@JAHGT), Co-Founder and CTO, RackTop Systems. This time we discuss why traditional security practices can’t cut it alone anymore. Listen to the podcast to learn more.
It turns out traditional security practices focus on keeping the bad guys out, supplying perimeter security with networking equivalents. But the problem is that sometimes the bad guy is internal, and at other times the bad guys pretend to be good guys with good credentials. Neither of these is something networking or perimeter security can catch.
As a result, the enterprise needs both traditional security practices as well as something else. Something that operates inside the network, in a more centralized place, that can be used to detect bad behavior in real time.
Jonathan talked about a typical attack:
A phishing email link is clicked on ==> attacker now owns the laptop/desktop user’s credentials
Attacker scans the laptop/desktop for admin credentials or one time pass codes which can be just as good, in some cases ==> the attacker attempts to escalate privileges above the user and starts scanning customer data for anything worthwhile to steal, e.g. crypto wallets, passwords, client data, IP, etc.
Attacker copies data of interest and continues to scan for more data and to escalate privileges ==> by now if not later, your data is compromised, either it’s in the hands of others that may want to harm you or extract money from you or it’s been copied by a competitor, or worse a nation state.
At some point the attacker has scanned and copied any data of interest ==> at this point, depending on the attacker, they could install malware which can be easily detected to signal the IT organization it’s been compromised.
By the time security systems detect the malware, the attacker has been in your systems and all over your network for months, and it’s way too late to stop them from doing anything they want with your data.
In the past, detection like this was provided by 3rd party tools that scanned backups for malware, or by storage systems copying logs to be assessed on a periodic basis.
The problem with such tools is that they always lag behind the time when the theft/corruption has occurred.
The need to detect in real time, at something like the storage system, is self-evident. The storage is the central point of access to data. If you could detect illegal or bad behavior there, and stop it before it could cause more harm that would be ideal.
In the past, storage system processors were extremely busy just doing IO. But with today’s modern, multi-core, NUMA CPUs, this is no longer the case.
Along with high performing IO, RackTop Systems supports user and admin behavioral analysis and activity assessors. These processes run continuously, monitoring user and admin IO and command activity, looking for known, bad or suspect behaviors.
When such behavior is detected, the storage system can prevent further access automatically, if so configured, or at a minimum, warn the security operations center (SOC) that suspicious behavior is happening and inform SOC of who is doing what. In this case, with a click of a link in the warning message, SOC admins can immediately stop the activity.
If it turns out the suspicious behavior was illegal, having the detection at the storage system can also provide SOC a list of files that have been accessed/changed/deleted by the user/admin. With these lists, SOC has a rapid assessment of what’s at risk or been lost.
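To illustrate the idea of an in-line behavioral assessor, here’s a toy sketch. This is emphatically NOT RackTop’s implementation, just an illustration of flagging and blocking a suspicious burst of file reads in real time; the thresholds and helper names are invented:

```python
# Toy illustration of the kind of in-line behavioral assessor described
# above; NOT RackTop's implementation. It watches per-user file activity
# in real time and flags/blocks what looks like mass scanning or copying.
import time
from collections import defaultdict, deque

READ_BURST_LIMIT = 500          # file reads per minute before we flag (invented threshold)
WINDOW_SECS = 60

recent_reads = defaultdict(deque)   # user -> timestamps of recent file reads
blocked = set()

def on_file_read(user: str, path: str, now: float | None = None) -> bool:
    """Return True if the IO is allowed, False if the user has been blocked."""
    if user in blocked:
        return False
    now = now or time.time()
    q = recent_reads[user]
    q.append(now)
    while q and now - q[0] > WINDOW_SECS:       # keep a sliding one-minute window
        q.popleft()
    if len(q) > READ_BURST_LIMIT:               # looks like a bulk scan/copy
        blocked.add(user)
        alert_soc(user, len(q))
        return False
    return True

def alert_soc(user: str, count: int) -> None:
    # Stand-in for notifying the security operations center with who did what.
    print(f"ALERT: {user} read {count} files in the last minute - access suspended")
```

A real product would obviously track far richer signals (privilege escalation, unusual admin commands, entropy of written data, etc.) and keep the file lists needed for post-incident assessment, as described above.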
Jonathan and I talked about RackTop Systems deployment options, which span physical appliances, SAN gateways and virtual appliances. Jonathan mentioned that RackTop Systems has a free trial offer using their virtual appliance that any customer can download to try them out.
Jonathan Halstuch, Co-Founder & CTO, Racktop Systems
Jonathan Halstuch is the Chief Technology Officer and Co-Founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.
With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.