Depending on your resource and timing constraints, you can use a cascade of adders, where you repeatedly add the bits to each other one at a time, starting with bit 7. This is slow because the critical path runs through the Cin -> Cout carry chain of the adders.
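As a rough behavioral sketch of the adder cascade (not a hardware description), the Python below assumes the goal is to sum the individual bits of an 8-bit value. The names `full_adder`, `ripple_add` and `sum_bits_cascade`, and the 4-bit accumulator width, are made up for illustration. The point it tries to show is that every addition has to wait for the carry to ripple from Cin to Cout through each stage.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x, y, width=4):
    """Add two width-bit values with chained full adders.

    The carry out of stage i feeds the carry in of stage i + 1,
    which is the Cin -> Cout path that limits the speed.
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result  # final carry is dropped; 4 bits are enough for 0..8


def sum_bits_cascade(value, nbits=8):
    """Add the bits one at a time, starting with bit 7 (assumed 8-bit input)."""
    acc = 0
    for i in range(nbits - 1, -1, -1):
        acc = ripple_add(acc, (value >> i) & 1)
    return acc


print(sum_bits_cascade(0b10110010))
```

In real hardware each `ripple_add` call would be a separate adder stage, so the delays add up along the cascade.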
To improve on this, you can go further and use a 7-bit decoder (or any arrangement of decoders, e.g. a column mux plus a decoder) as a lookup. This is essentially an SRAM array design that stores the value, so the next time you need this computation you can read it out directly. In effect you are caching the computation result. This costs extra circuitry, but it means you only have to compute the sums once.
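A minimal software analogy of the lookup idea, assuming again that the stored value is the sum of the input bits and that the 7-bit decoder addresses 2^7 = 128 rows. `ADDR_BITS`, `slow_sum_bits`, `LUT` and `lookup_sum_bits` are hypothetical names; the list stands in for the SRAM array and the index operation for the decoder.

```python
ADDR_BITS = 7  # assumed to match the 7-bit decoder: 2**7 = 128 rows


def slow_sum_bits(value, nbits=ADDR_BITS):
    """Reference computation: add the bits one by one, highest bit first."""
    total = 0
    for i in range(nbits - 1, -1, -1):
        total += (value >> i) & 1
    return total


# Fill the table once; this is the only time the sums are computed.
LUT = [slow_sum_bits(addr) for addr in range(1 << ADDR_BITS)]


def lookup_sum_bits(addr):
    """Read the stored result; the mask keeps the address within 7 bits."""
    return LUT[addr & ((1 << ADDR_BITS) - 1)]


print(len(LUT), lookup_sum_bits(0b1011001))
```

The trade-off is the same as in the circuit: the table costs storage (128 entries here), but each later access is a single read instead of a chain of additions.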